Saturday, 26 March 2022

Google BigQuery vs BigTable

 

BigQuery

  • BigQuery is Google Cloud’s fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time.
  • You can use the bq command-line tool or Google Cloud Console to interact with BigQuery.
  • You can access BigQuery by using the Cloud Console, the bq command-line tool, or the BigQuery REST API via a variety of client libraries such as Java, .NET, or Python (see the sketch after this list).
  • A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views.
  • You specify a location for storing your BigQuery data when you create a dataset. After you create the dataset, the location cannot be changed, but you can copy the dataset to a different location, or manually move (recreate) the dataset in a different location.
  • You can control access to datasets in BigQuery at the table or view level, at the column level, or by using IAM.
  • There are several ways to ingest data into BigQuery:
    • Batch load a set of data records.
    • Stream individual records or batches of records.
    • Use queries to generate new data and append or overwrite the results to a table.
    • Use a third-party application or service.
  • Data loaded in BigQuery can be exported in several formats. BigQuery can export up to 1 GB of data to a single file. If you are exporting more than 1 GB of data, you must export your data to multiple files. When you export your data to multiple files, the size of the files will vary.
  • Jobs are actions that BigQuery runs on your behalf to load data, export data, query data, or copy data.
  • An external data source (also known as a federated data source) is a data source that you can query directly even though the data is not stored in BigQuery. Instead of loading or streaming the data, you create a table that references the external data source.
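
To make the access patterns above concrete, here is a minimal sketch using the Python client library (google-cloud-bigquery). The public dataset queried is real, but the project setup and credentials are assumed:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # uses Application Default Credentials

# Run a standard SQL query against a real public dataset and print the rows.
sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""
query_job = client.query(sql)    # starts a query job
for row in query_job.result():   # waits for the job to finish
    print(row["name"], row["total"])
```
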
BigTable

  • A fully managed, scalable NoSQL database service for large analytical and operational workloads.
  • You can use the cbt command-line tool or Google Cloud Console to interact with BigTable.
  • Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns, enabling you to store terabytes or even petabytes of data. A single value in each row is indexed; this value is known as the row key.
  • Cloud Bigtable is ideal for storing very large amounts of single-keyed data with very low latency. It supports high read and write throughput at low latency, and it is an ideal data source for MapReduce operations.
  • Cloud Bigtable stores data in massively scalable tables, each of which is a sorted key/value map. The table is composed of rows, each of which typically describes a single entity, and columns, which contain individual values for each row. Each row is indexed by a single row key, and columns that are related to one another are typically grouped together into a column family. Each column is identified by a combination of the column family and a column qualifier, which is a unique name within the column family. (A minimal write/read sketch follows this list.)
  • To use Cloud Bigtable, you create instances, which contain up to 4 clusters that your applications can connect to. Each cluster contains nodes, the compute units that manage your data and perform maintenance tasks.
  • A Cloud Bigtable instance is a container for your data. Instances have one or more clusters, located in different zones. Each cluster has at least 1 node.
  • Cloud Bigtable backups let you save a copy of a table’s schema and data, then restore from the backup to a new table at a later time.
  • Dataflow templates allow you to export data from Cloud Bigtable in a variety of data formats and then import the data back into Cloud Bigtable.
  • Replication for Cloud Bigtable enables you to increase the availability and durability of your data by copying it across multiple regions or multiple zones within the same region. You can also isolate workloads by routing different types of requests to different clusters.
  • You can use Dataproc to create one or more Compute Engine instances that can connect to a Cloud Bigtable instance and run Hadoop jobs.
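
As a rough illustration of the row key and column family model described above, here is a minimal write/read sketch with the google-cloud-bigtable Python client. The project, instance, table, and column family names are hypothetical and assumed to already exist:

```python
from google.cloud import bigtable  # pip install google-cloud-bigtable

# Hypothetical IDs; assumes a table with a "stats" column family already exists.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("user-events")

# Write one cell: row key -> column family:qualifier -> value.
row = table.direct_row(b"user#1234#20220326")
row.set_cell("stats", "clicks", b"42")
row.commit()

# Read the row back by its key (single-keyed lookup).
result = table.read_row(b"user#1234#20220326")
print(result.cells["stats"][b"clicks"][0].value)
```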

Google Cloud Functions vs App Engine vs Cloud Run vs GKE

Serverless compute platforms like Cloud Functions, App Engine, and Cloud Run let you build, develop, and deploy applications while simplifying the developer experience by eliminating all infrastructure management.

On the other hand, Google Kubernetes Engine (GKE) runs Certified Kubernetes that helps you facilitate the orchestration of containers via declarative configuration and automation.

Both Google serverless platforms and GKE allow you to scale your applications based on your infrastructure requirements. Here’s a comparison to help you identify when to use each of these services.

Cloud Functions

  • Cloud Functions is a fully managed, serverless platform for creating stand-alone functions that respond to real-time events without the need to manage servers, configure software, update frameworks, and patch operating systems.
  • With Cloud Functions, you write simple, single-purpose functions that are attached to events produced from your cloud infrastructure and services.
  • Cloud Functions can be written using JavaScript, Python 3, Go, or Java runtimes, which makes both portability and local testing more familiar (see the sketch after this list).
  • Functions are stateless. The execution environment is often initialized from scratch, which is called a cold start; cold starts can take significant time to complete.
  • It is a serverless execution environment that can be used for building and connecting your cloud services. It can serve IoT workloads, ETL, webhooks, Kafka messages, analytics, and event-driven services.
  • Cloud Functions are great for building serverless backends, doing real-time data processing, and creating intelligent apps.
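
A minimal sketch of such a single-purpose function, using the Python Functions Framework; the function name, runtime, and deploy command in the comments are illustrative:

```python
# main.py — pip install functions-framework
# Deployable with, for example:
#   gcloud functions deploy hello_http --runtime python39 --trigger-http
import functions_framework


@functions_framework.http
def hello_http(request):
    """Single-purpose HTTP function: echo a name from the request."""
    name = request.args.get("name", "World")
    return f"Hello, {name}!"
```
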
App Engine

  • App Engine is a fully managed, serverless platform for hosting and developing highly scalable web applications. It lets you focus on your code while App Engine manages infrastructure concerns.
  • You can scale your applications from zero to planet-scale without having to worry about managing infrastructure.
  • You can build your application in Node.js, Java, Ruby, C#, Go, Python, or PHP runtimes. Moreover, you can also bring any library and framework to App Engine by supplying a Docker container. (A minimal standard-environment sketch follows this list.)
  • Each Cloud project can contain only a single App Engine application. Once the App Engine application is created in a project, you cannot change its location.
  • App Engine can seamlessly host different versions of your application, and help you effortlessly create development, test, staging, and production environments.
  • With App Engine, you can route incoming traffic to different versions of your application, A/B test it, and perform incremental feature rollouts by using traffic splitting.
  • App Engine easily integrates with Cloud Monitoring and Cloud Logging to monitor your app’s health and performance. It also works with Cloud Debugger and Error Reporting to help you diagnose and fix bugs quickly.
  • You can run your applications in App Engine using the standard or flexible environments. You are allowed to simultaneously use both environments for your application to take advantage of each environment’s individual benefits.
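
For a sense of what “focus on your code” means, here is a minimal sketch of a Python standard environment service; the Flask app and the app.yaml mentioned in the comments are one assumed setup, not the only one:

```python
# main.py — pip install flask
# An accompanying app.yaml containing just "runtime: python39" is enough
# to deploy this with "gcloud app deploy".
from flask import Flask

app = Flask(__name__)


@app.route("/")
def index():
    return "Hello from App Engine!"


if __name__ == "__main__":
    # Local testing only; in production App Engine serves the app for you.
    app.run(host="127.0.0.1", port=8080, debug=True)
```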

Cloud Run

  • Cloud Run is a managed serverless compute platform that helps you run highly scalable containerized applications that can be invoked via web requests or Pub/Sub events (see the sketch after this list).
  • It is built on Knative, an open standard that enables the portability of your applications.
  • You can pick the programming language of your choice, any operating system libraries, or even bring your own binaries.
  • You can leverage container workflows since Cloud Run integrates well with services in the container ecosystem such as Cloud Build, Artifact Registry, and Docker.
  • Your container instances run in a secure sandbox environment isolated from other resources.
  • With Cloud Run, you can automatically scale up or down from zero to N depending on traffic. 
  • Cloud Run services are regional and are automatically replicated across multiple zones.
  • Cloud Run provides an out-of-the-box integration with Cloud Monitoring, Cloud Logging, Cloud Trace, and Error Reporting to monitor the health performance of an application.
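
A minimal sketch of the container contract Cloud Run expects: the app must listen on the port given in the PORT environment variable. The Flask app itself is illustrative:

```python
# app.py — pip install flask; packaged into a container for Cloud Run.
# Cloud Run injects the PORT environment variable; the container must listen on it.
import os

from flask import Flask

app = Flask(__name__)


@app.route("/")
def handler():
    return "Hello from Cloud Run!"


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=int(os.environ.get("PORT", 8080)))
```
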
Google Kubernetes Engine (GKE)

  • Google Kubernetes Engine (GKE) is a managed Kubernetes service that facilitates the orchestration of containers via declarative configuration and automation.
  • It integrates with Identity and Access Management (IAM) to control access in the cluster with your Google accounts and the role permissions you set.
  • GKE runs Certified Kubernetes. This enables portability to other Kubernetes platforms across cloud and on-premises workloads.
  • You can eliminate operational overhead expenses by enabling auto-repair, auto-upgrade, and release channels.
  • GKE lets you reserve a CIDR range for your cluster, allowing your cluster IPs to coexist with private network IPs via Google Cloud VPN.
  • With GKE, you can choose clusters designed to the availability, version stability, isolation, and pod traffic requirements of your mission-critical workloads.
  • You can automatically scale your application deployment up and down based on CPU and memory utilization.
  • By default, your cluster nodes are automatically updated with the latest release version of Kubernetes. Kubernetes release updates are quickly made available within GKE.
  • Google Kubernetes Engine integrates well with Cloud Logging and Cloud Monitoring via Cloud Console, making it easy to gain insight into your application.

Google Cloud Storage vs Persistent Disks vs Local SSD vs Cloud Filestore

 

Google Cloud Storage

  • Cloud Storage is a service for storing your objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets. (A minimal upload/download sketch follows this list.)
  • You specify a location for storing your object data when you create a bucket. You can select a region, dual-region, or multi-region as the location. Objects stored in a multi-region or dual-region are geo-redundant.
  • Cloud Storage offers different storage classes for various storage requirements: Standard, Nearline, Coldline, and Archive.
  • GCS offers unlimited storage with no minimum object size.
  • Cloud Storage offers two systems for granting users permission to access your buckets and objects: IAM and Access Control Lists (ACLs). These systems act in parallel: in order for a user to access a Cloud Storage resource, only one of the systems needs to grant the user permission.
  • Cloud Storage always encrypts your data by default on the server-side before it is written to disk, at no additional charge. You also have an option to do your own encryption before uploading it to Cloud Storage.
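
A minimal upload/download sketch with the google-cloud-storage Python client; the bucket, object, and file names are hypothetical:

```python
from google.cloud import storage  # pip install google-cloud-storage

client = storage.Client()

# Bucket and object names are illustrative and assumed to exist.
bucket = client.bucket("my-sample-bucket")
blob = bucket.blob("reports/2022-03-26.csv")

# Upload a local file, then read the object back as bytes.
blob.upload_from_filename("local-report.csv")
print(blob.download_as_bytes()[:80])
```
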
Persistent Disks

  • Block storage service, fully integrated with Google Cloud products like Compute Engine and GKE.
  • Persistent disks can be attached to virtual machine (VM) instances running in Compute Engine or Google Kubernetes Engine.
  • They can be transparently resized, quickly backed up, and support simultaneous readers.
  • Persistent disks ensure data integrity by storing data redundantly in zones or regions and are designed for high durability.
  • They are located independently from your virtual machine instances. This means you can detach or move your disks to retain your data even after deleting your instances.
  • You can create snapshots to back up data from your zonal or regional persistent disks.
  • Snapshots are geo-replicated and available for restore in all regions by default. Creating a snapshot of a block device takes minutes rather than hours.
  • You can resize your existing persistent disks to scale based on performance and storage space requirements.
  • Persistent disks are automatically encrypted to protect your data, in transit and at rest. You can supply your own encryption key, or Google will automatically generate one for you.
Local SSD

  • Ephemeral, locally attached block storage for virtual machines and containers.
  • Local SSDs have higher throughput and lower latency than standard persistent disks or SSD persistent disks.
  • The data that you store on a local SSD persists only until the instance is stopped or deleted. 
  • Local SSDs are designed to offer very high IOPS and low latency.
  • Compute Engine automatically encrypts your data when it is written to local SSD storage space. You can’t use customer-supplied encryption keys with local SSDs.
  • You can create an instance with 16 or 24 local SSD partitions for 6 TB or 9 TB of local SSD space, respectively.
  • Instances with shared-core machine types can’t attach any local SSD partitions.
Cloud Filestore

  • Fully managed service for file migration and storage. Easily mount file shares on Compute Engine VMs.
  • Filestore instances are fully managed NFS file servers on Google Cloud for use with applications running on Compute Engine virtual machine (VM) instances or Google Kubernetes Engine clusters.
  • A Filestore share can be accessed from a Compute Engine instance within the same VPC or from remote clients.
  • File shares can also be accessed from Google Kubernetes Engine clusters. The cluster must be in the same Google Cloud project and VPC network as the Filestore instance unless the Filestore instance is on a shared VPC network. Currently, Filestore instances can only be created on a shared VPC network from the host project.

Google Compute Engine vs App Engine

 

Google Compute Engine

  • Compute Engine delivers configurable virtual machines running in Google’s data centers with access to high-performance networking infrastructure and block storage solutions.
  • Delivered as Infrastructure-as-a-Service (IaaS).
  • Supported languages: any.
  • A machine type is a set of virtualized hardware resources available to a virtual machine (VM) instance, including the system memory size, virtual CPU (vCPU) count, and persistent disk limits. In Compute Engine, machine types are grouped and curated by families for different workloads. You can choose from general-purpose, memory-optimized, and compute-optimized families.
  • You can create a collection of virtual instances and manage them as a single entity by creating instance groups. Instance groups can be managed instance groups (MIGs) or unmanaged instance groups.
  • Compute Engine offers autoscaling to automatically add or remove VM instances from a managed instance group based on increases or decreases in load. Autoscaling lets your apps gracefully handle increases in traffic, and it reduces cost when the need for resources is lower.
  • Typical use cases: general workloads, VM migration to Compute Engine, genomics data processing, and BYOL or license-included images.

Google App Engine

  • App Engine is a fully managed, serverless platform for developing and hosting web applications at scale.
  • Delivered as Platform-as-a-Service (PaaS).
  • Supported languages: Go, Python, Java, Node.js, PHP, and Ruby (.NET and custom runtimes are available in the flexible environment).
  • You can run your applications in App Engine using the flexible environment or the standard environment. You can also choose to simultaneously use both environments for your application and allow your services to take advantage of each environment’s benefits.
  • Instances are the basic building blocks of App Engine, providing all the resources needed to successfully host your application. App Engine can automatically create and shut down instances as traffic fluctuates, or you can specify the number of instances to run regardless of the amount of traffic.
  • You can choose among three scaling types: basic scaling, automatic scaling, and manual scaling.
  • App Engine can scale down to 0 instances when no one is using your application.
  • Typical use cases: modern web applications and scalable mobile back ends.

Google Cloud Build

 

  • Build, test, and deploy on Google Cloud Platform’s serverless CI/CD platform.

Features

  • Cloud Build is a fully serverless platform that helps you build custom development workflows for building, testing, and deploying.
  • Cloud Build can import source code from:
    • Cloud Storage
    • Cloud Source Repositories
    • GitHub
    • Bitbucket
  • Supports Native Docker.
    • You can import your existing Docker file.
    • Push images directly to Docker image storage repositories such as Docker Hub and Container Registry.
  • You can also automate deployments to Google Kubernetes Engine (GKE) or Cloud Run for continuous delivery.
  • Automatically performs package vulnerability scanning for vulnerable images based on policies set by DevSecOps.
  • You can package source into containers, or create non-container artifacts with build tools such as Maven, Gradle, Go, or Bazel.

Pricing

  • The first 120 build-minutes per day are free.
  • Build time beyond that is charged.

Google Container Registry

 

  • Container Registry is a container image repository to manage Docker images, perform vulnerability analysis, and define fine-grained access control.

Features

  • Automatically build and push images to a private registry when you commit code to Cloud Source Repositories, GitHub, or Bitbucket.
  • You can push and pull Docker images to your private Container Registry utilizing the standard Docker command-line interface.
  • The system creates a Cloud Storage bucket to store all of your images the first time you push an image to Container Registry.
  • You have the ability to maintain control over who can access, view, or download images.

Pricing

  • Container Registry charges for the following:
    • Storing images on Cloud Storage
    • Network egress for containers stored in the registry.
  • Network ingress is free.
  • If the Container Scanning API is enabled in either Container Registry or Artifact Registry, vulnerability scanning is turned on and billed for both products.

Google Cloud Source Repositories

 

  • A fully managed git repository where you can securely manage your code.

Features

  • You can extend your Git workflow with Cloud Source Repositories. Set up a repository as a Git remote, then push, pull, clone, log, and perform other Git operations as required by your workflow.
  • You can create multiple repositories for a single Google Cloud project. This allows you to organize the code associated with your cloud project in the best way.
  • View repository files from within the Cloud Source Repositories using Source Browser. You can filter your view to focus on a specific branch, tag, or commit.
  • Private repositories are free.
  • Repositories can be automatically synced with GitHub and Bitbucket repositories.
  • Integrates with Cloud Build to automatically build and test an image when changes are pushed to Cloud Source Repositories.
  • You can get insights on actions performed on your repository with Cloud Audit Logs.

Pricing

  • Cloud Source Repositories charges based on:
    • Per user
    • Storage
    • Egress network

Google Cloud Deployment Manager

 

  • Google Cloud Deployment Manager is an infrastructure deployment service that automates the creation and management of Google Cloud resources.

Features

  • You can write template and configuration files and utilize them to create deployments that have a variety of Google Cloud services working together, such as:
    • Cloud Storage
    • Compute Engine
    • Cloud SQL
  • A configuration defines the structure of your deployment. You must specify a configuration in a YAML file to create a deployment. It contains the following:
    • type and properties of the resources that are part of the deployment
    • any templates the configuration should use
    • additional subfiles that can be executed to create your final configuration.
  • It is recommended that you break your configuration into templates to simplify your deployment and make it easier to replicate and troubleshoot. A template is a separate file that defines a set of resources. You can reuse templates across different deployments, to help you manage complex deployments consistently. (A minimal Python template sketch follows this list.)
  • Creating a deployment creates the resources that you defined in a configuration.
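
As a rough illustration, here is what a small Deployment Manager template written in Python might look like. The resource values (zone, machine type, image) are assumptions for the sketch; the template would be referenced from a YAML configuration via its imports section:

```python
# vm_template.py — a Deployment Manager template written in Python.
# Imported from a YAML configuration and instantiated as a resource type.
# Zone, machine type, and image below are illustrative choices.

def GenerateConfig(context):
    """Return the resources this template creates."""
    resources = [{
        'name': context.env['name'],
        'type': 'compute.v1.instance',
        'properties': {
            'zone': 'us-central1-f',
            'machineType': 'zones/us-central1-f/machineTypes/f1-micro',
            'disks': [{
                'boot': True,
                'autoDelete': True,
                'initializeParams': {
                    'sourceImage':
                        'projects/debian-cloud/global/images/family/debian-11'
                },
            }],
            'networkInterfaces': [{'network': 'global/networks/default'}],
        },
    }]
    return {'resources': resources}
```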

Deployment Management Roles

  • Deployment Manager Editor
    • Provides the permissions to create and manage deployments.
  • Deployment Manager Type Editor
    • Provides read and write access to all Type Registry resources.
  • Deployment Manager Type Viewer
    • Provides read-only access to all Type Registry resources.
  • Deployment Manager Viewer
    • Provides read-only access to all Deployment Manager-related resources.

Pricing

  • You only pay for the resources that you provision. Deployment Manager has no additional charge to Google Cloud Platform customers.

Google Cloud Monitoring

 

  • Cloud Monitoring collects metrics, events, and metadata from hosted uptime probes and application instrumentation to give you visibility into the performance, availability, and health of your applications and infrastructure.

Features

  • Collect metrics from multicloud and hybrid infrastructure in real time.
  • Metrics, events, and metadata are displayed with a rich query language that helps identify issues and uncover significant patterns.
  • Reduces the time spent navigating between systems by offering one integrated service for metrics, uptime monitoring, dashboards, and alerts (see the sketch after this list).
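
A minimal sketch of reading monitoring metadata with the google-cloud-monitoring Python client; the project ID is hypothetical:

```python
from google.cloud import monitoring_v3  # pip install google-cloud-monitoring

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"  # hypothetical project ID

# List a handful of the metric descriptors visible to this project,
# e.g. compute.googleapis.com/instance/cpu/utilization.
for i, descriptor in enumerate(client.list_metric_descriptors(name=project_name)):
    print(descriptor.type)
    if i == 9:
        break
```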

Workspaces

  • Cloud Monitoring utilizes workspaces to organize and manage its information.
  • A Workspace can manage the monitoring data for a single Google Cloud project, or it can manage the data for multiple Google Cloud projects and AWS accounts.
  • However, a Google Cloud project or an AWS account can only be associated with one Workspace at a time.
  • To create a Workspace, you must have at least one of the following IAM roles on the Google Cloud project:
    • Monitoring Editor
    • Monitoring Admin
    • Project Owner

Cloud Monitoring Agent

  • The Cloud Monitoring agent is a collectd-based daemon that collects application and system metrics from virtual machine (VM) instances.
  • The Monitoring agent collects disk, network, CPU, and process metrics by default.
  • You can configure the Monitoring agent to monitor third-party applications.

Pricing

  • Monitoring charges only for the volume of ingested metric data and Cloud Monitoring API read calls that exceed the free monthly allotment.
  • Non-chargeable metrics and Cloud Monitoring API write calls don’t count towards the allotment limit.

Google Cloud Logging

 

  • An exabyte-scale, fully managed service for real-time log management.
  • Helps you to securely store, search, analyze, and alert on all of your log data and events.

Features

  • Write any custom log, from any source, into Cloud Logging using the public write APIs (see the sketch after this list).
  • You can search, sort, and query logs through query statements, along with rich histogram visualizations, simple field explorers, and the ability to save the queries.
  • Integrates with Cloud Monitoring to set alerts on the log events and logs-based metrics you have defined.
  • You can export data in real-time to BigQuery to perform advanced analytics and SQL-like query tasks.
  • Cloud Logging helps you see the problems with your mountain of data using Error Reporting. It helps you automatically analyze your logs for exceptions and intelligently aggregate them into meaningful error groups.
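
A minimal sketch of the write and read paths with the google-cloud-logging Python client; the log name and payloads are hypothetical:

```python
from google.cloud import logging  # pip install google-cloud-logging

client = logging.Client()
logger = client.logger("my-app-log")  # log name is hypothetical

# Write a plain-text entry and a structured entry through the write API.
logger.log_text("Order pipeline started")
logger.log_struct({"order_id": "A-1001", "status": "shipped"}, severity="INFO")

# Read back the most recent entries of this log.
for entry in logger.list_entries(max_results=5):
    print(entry.timestamp, entry.payload)
```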

Cloud Audit Logs

Cloud Audit Logs maintains audit logs for each Cloud project, folder, and organization. There are four types of logs you can use:

1. Admin Activity audit logs

  • Contains log entries for API calls or other administrative actions that modify the configuration or metadata of resources.
  • You must have the IAM role Logging/Logs Viewer or Project/Viewer to view these logs.
  • Admin Activity audit logs are always written and you can’t configure or disable them in any way.

2. Data Access audit logs

  • Contains API calls that read the configuration or metadata of resources, including user-driven API calls that create, modify, or read user-provided resource data.
  • You must have the IAM roles Logging/Private Logs Viewer or Project/Owner to view these logs.
  • You must explicitly enable Data Access audit logs to be written. They are disabled by default because they are large.

3. System Event audit logs

  • Contains log entries for administrative actions taken by Google Cloud that modify the configuration of resources.
  • You must have the IAM role Logging/Logs Viewer or Project/Viewer to view these logs.
  • System Event audit logs are always written so you can’t configure or disable them.
  • There is no additional charge for your System Event audit logs.

4. Policy Denied audit logs

  • Contains logs written when a Google Cloud service denies access to a user or service account because of a security policy violation.
  • You must have the IAM role Logging/Logs Viewer or Project/Viewer to view these logs.
  • Policy Denied audit logs are generated by default. Your cloud project is charged for the logs storage.

Exporting Audit Logs

  • Log entries received by Logging can be exported to Cloud Storage buckets, BigQuery datasets, and Pub/Sub topics.
  • To export audit log entries outside of Logging:
    • Create a logs sink.
    • Give the sink a query that specifies the audit log types you want to export.
  • If you want to export audit log entries for a Google Cloud organization, folder, or billing account, review Aggregated sinks.

Pricing

  • All features of Cloud Logging are free to use, and the charge is only applicable for ingested log volume over the free allotment. Free usage allotments do not come with upfront fees or commitments.

Google Cloud Billing

 

  • You can configure billing on Google Cloud in a variety of ways to meet different needs.
  • To use Google Cloud services, you must have a valid Cloud Billing account.

Features

  • If you have a project that is not linked to a Cloud Billing account, you will have limited use of products and services available for your project.

Cloud Billing Account & Payments Profile

  • Cloud Billing Account
    • It is set up in Google Cloud and is used to define who pays for a given set of Google Cloud resources and Google Maps Platform APIs.
    • Access control to a Cloud Billing account is established by IAM roles.
    • A Cloud Billing account is connected to a Google payments profile.
  • Google Payments Profile
    • Stores your payment instruments, such as credit cards and debit cards, to which costs are charged.
    • Stores information about who is responsible for the profile.
    • This serves as a document center where you can view invoices and payment history.

Cloud Billing Reports

  • The Cloud Billing Reports page allows you to view your Google Cloud usage costs at a glance and discover and analyze trends.
  • It shows a chart that plots usage costs for all projects linked to a Cloud Billing account.
  • You can select a date range, specify a time range, configure the chart filters, and group by project, service, SKU, or location to filter how you view your report.
  • Moreover, you can also forecast future costs using the Cloud Billing Reports to check out how much you are projected to spend, up to 12 months in the future.

Cloud Billing Budgets

  • You can define the scope that the budget applies to:
    • Entire Cloud Billing account
    • One or more projects
    • One or more products
    • Other budget filters applicable to your Cloud Billing account.
  • You can specify the budget amount to your requirement, or base the budget amount on the previous month’s spend.
  • Moreover, you can also specify email alerts and declare the recipients in the following ways:
    • Using the role-based option (default), where you can send email alerts to billing admins and users on the Cloud Billing account.
    • Using Cloud Monitoring, where you can enlist other people in your organization (for example, project managers) to receive budget alert emails.
    • You can also use Pub/Sub for a more programmatic notification approach.

Overview of Cloud Billing roles in IAM

The following predefined Cloud Billing IAM roles are designed to allow you to use access control to enforce separation of duties in managing your billing:

  • Billing Account Creator (roles/billing.creator)
    • Create new self-serve (online) billing accounts.
    • Assigned at organization Level
    • Use this role for initial billing setup or to allow the creation of additional billing accounts. Users must have this role to sign up for Google Cloud with a credit card using their corporate identity.
  • Billing Account Administrator (roles/billing.admin)
    • Manage billing accounts (but not create them).
    • Can be assigned at the organization level or billing account.
    • This role is an owner role for a billing account. Use it to manage payment instruments, configure billing exports, view cost information, link and unlink projects, and manage other user roles on the billing account.
  • Billing Account User (roles/billing.user)
    • Link projects to billing accounts.
    • Can be assigned at the organization level or billing account.
    • This role has very restricted permissions, so you can grant it broadly, typically in combination with Project Creator. These two roles allow a user to create new projects linked to the billing account on which the role is granted.
  • Billing Account Viewer (roles/billing.viewer)
    • View billing account cost information and transactions.
    • Can be assigned at the organization level or billing account.
    • Billing Account Viewer access would usually be granted to finance teams. It provides access to spend information but does not confer the right to link or unlink projects or otherwise manage the properties of the billing account.
  • Project Billing Manager (roles/billing.projectManager)
    • Link/unlink the project to/from a billing account.
    • Can be assigned at the organization level or billing account.
    • This role allows a user to attach the project to the billing account, but does not grant any rights over resources. Project Owners can use this role to allow someone else to manage the billing for the project without granting them resource access.

Google Cloud Console

  • Google Cloud Console is a web admin interface to manage your Google Cloud infrastructure.

Features

  • You can create projects on Google Cloud Console.
  • With Cloud Console, you can quickly find and check the health of all your cloud resources in one place, including virtual machines, network settings, and data storage.
  • Logging
    • Manage and audit user access to project resources.
    • Track down production issues quickly by viewing logs.
  • You can explore the Google Cloud Marketplace and launch cloud solutions with just a few clicks.
  • Billing
    • View a detailed billing breakdown of your bills.
    • Set spending budgets to avoid unexpected surprises.
  • Cloud Console enables you to connect to your virtual machines via Cloud Shell. You can quickly handle admin tasks using this instant-on Linux machine equipped with your favorite tools including Google Cloud SDK preconfigured and authenticated.

Pricing

  • Cloud Console is available at no cost to Google Cloud Platform customers.

 

Google Cloud Dataproc

 

  • Build fully managed Apache Spark, Apache Hadoop, Presto, and other OSS clusters on the Google Cloud Platform using Cloud Dataproc.

Features

  • You can spin up resizable clusters quickly with various virtual machine types, disk sizes, numbers of nodes, and networking options on Cloud Dataproc (see the sketch after this list).
  • Dataproc provides autoscaling features to help you automatically manage the addition and removal of cluster workers.
  • Cloud Dataproc has built-in integration with the following Google Cloud services for a more complete and robust platform.
    • Cloud Storage
    • BigQuery
    • Cloud Bigtable
    • Cloud Logging
    • Cloud Monitoring
    • AI Hub
  • It is capable of image versioning, which allows you to switch between different versions of the tools you want to use.
  • To avoid charges for inactive clusters, you can utilize Dataproc’s scheduled deletion.
  • You can manage your clusters via:
    • Cloud Console Web UI
    • Cloud SDK
    • RESTful APIs
    • SSH access.
  • Dataproc can be provisioned with custom images according to your needs.
  • Workflow templates provide a flexible and simple mechanism for managing and executing workflows.
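
A minimal cluster-creation sketch with the google-cloud-dataproc Python client; the project, region, cluster name, and machine shapes are hypothetical:

```python
from google.cloud import dataproc_v1  # pip install google-cloud-dataproc

project_id, region = "my-project", "us-central1"  # hypothetical values

# Cluster operations go through a regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "demo-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)  # blocks until the cluster is ready
```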

Pricing

  • You only pay for the resources you use, lowering the total cost of ownership of OSS.
  • Dataproc pricing is based on the number of vCPUs and the duration that they run.

Google Cloud Dataflow

 

  • Cloud Dataflow is a fully managed data processing service for executing a wide variety of data processing patterns.

Features

  • Dataflow templates allow you to easily share your pipelines with team members and across your organization.
  • You can also take advantage of Google-provided templates to implement useful but simple data processing tasks.
  • Autoscaling lets Dataflow automatically choose the appropriate number of worker instances required to run your job.
  • You can build a batch or streaming pipeline protected with customer-managed encryption key (CMEK) or access CMEK-protected data in sources and sinks.
  • Dataflow is integrated with VPC Service Controls to provide additional security for data processing environments by improving the ability to mitigate the risk of data exfiltration. (A minimal pipeline sketch follows this list.)
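
Dataflow pipelines are written with Apache Beam. Here is a minimal pipeline sketch in Python; it runs locally by default, and the flags noted in the comments would send it to Dataflow:

```python
# pip install apache-beam[gcp]
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Runs locally with the DirectRunner by default; pass --runner=DataflowRunner
# (plus --project, --region, and --temp_location) to execute on Dataflow.
with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create(["alpha", "beta", "gamma"])
        | "Upper" >> beam.Map(str.upper)
        | "Print" >> beam.Map(print)
    )
```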

Pricing

  • Dataflow jobs are billed per second, based on the actual use of Dataflow batch or streaming workers. Additional resources, such as Cloud Storage or Pub/Sub, are each billed per that service’s pricing.

Google Cloud Dataprep

 

  • Cloud Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning.

Features

  • You can transform structured or unstructured datasets of any size — megabytes to petabytes — with equal ease and simplicity.
  • Cloud Dataprep can transform datasets stored in CSV, JSON, or relational table formats.
  • You can process data stored in Cloud Storage, BigQuery, or from your desktop, then export the refined data to BigQuery or Cloud Storage for storage, analysis, visualization, or machine learning.
  • Uses a proprietary algorithm that interprets the data transformation intent of a user’s data selection.
  • You can leverage hundreds of transformation functions readily available to turn your data into the asset you want.
  • Cloud Dataprep enables users to collaborate on similar flow objects in real-time or to create copies for other team members to use for independent tasks.
  • Explore your data through interactive visual distributions to assist in your discovery, cleansing, and transformation process.
  • Cloud Dataprep automatically generates one or more samples of the data for display and manipulation in the client application to achieve performance optimization.

Pricing

  • Pricing is split across two variables:
    • Design – priced on a per-project basis for an unlimited number of users.
    • Execution – consists of the Dataflow usage for running jobs in Dataprep.

Google BigQuery


  • A fully managed data warehouse where you can feed petabyte-scale data sets and run SQL-like queries.

Features

  • Cloud BigQuery is a serverless data warehousing technology.
  • It provides integration with the Apache big data ecosystem allowing Hadoop/Spark and Beam workloads to read or write data directly from BigQuery using Storage API.
  • BigQuery supports a standard SQL dialect that is ANSI:2011 compliant, which reduces the need for code rewrites.
  • Automatically replicates data and keeps a seven-day history of changes which facilitates restoration and data comparison from different times.

Loading data into BigQuery

You must first load your data into BigQuery before you can run queries. To do this, you can:

  • Load a set of data records from Cloud Storage or from a local file. The records can be in Avro, CSV, JSON (newline delimited only), ORC, or Parquet format. (A minimal load sketch follows this list.)
  • Export data from Datastore or Firestore and load the exported data into BigQuery.
  • Load data from other Google services, such as
    • Google Ad Manager
    • Google Ads
    • Google Play
    • Cloud Storage
    • YouTube Channel Reports
    • YouTube Content Owner Reports
  • Stream data one record at a time using streaming inserts.
  • Write data from a Dataflow pipeline to BigQuery.
  • Use DML statements to perform bulk inserts. Note that BigQuery charges for DML queries. See Data Manipulation Language pricing.
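
A minimal batch-load sketch for the first option above, using the google-cloud-bigquery Python client; the bucket URI and table ID are hypothetical:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Batch load a CSV from Cloud Storage; URI and table ID are hypothetical.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the file
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/sales.csv",
    "my-project.sales_dataset.daily_sales",
    job_config=job_config,
)
load_job.result()  # wait for the load job to complete
print(client.get_table("my-project.sales_dataset.daily_sales").num_rows, "rows")
```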

Querying from external data sources

  • BigQuery offers support for querying data directly from:
    • Cloud Bigtable
    • Cloud Storage
    • Cloud SQL
  • Supported formats are:
    • Avro
    • CSV
    • JSON (newline delimited only)
    • ORC
    • Parquet
  • To query data from external sources, you have to create an external table definition file that contains the schema definition and metadata (see the sketch after this list).
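
A minimal sketch of defining such an external table over Cloud Storage with the Python client; the URIs and table IDs are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Define an external table over CSV files in Cloud Storage (names hypothetical).
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/raw/*.csv"]
external_config.autodetect = True

table = bigquery.Table("my-project.my_dataset.raw_events_ext")
table.external_data_configuration = external_config
client.create_table(table)  # no data is loaded; queries read Cloud Storage directly
```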

Monitoring

  • BigQuery creates log entries for actions such as creating or deleting a table, purchasing slots, or running a load job.

Pricing

  • On-demand pricing lets you pay only for the storage and compute that you use.
  • Flat-rate pricing with reservations enables high-volume users to choose a predictable price for their workloads.
  • To estimate query costs, it is best practice to obtain the estimated bytes read by using the query validator in the Cloud Console or by submitting a query job using the API with the dryRun parameter (see the sketch after this list). Use this information in the Pricing Calculator to calculate the query cost.
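
A minimal dry-run sketch with the Python client; the query and table are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Dry-run the query to estimate cost before actually running it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query_job = client.query(
    "SELECT name FROM `my-project.my_dataset.my_table`",  # hypothetical table
    job_config=job_config,
)
# No bytes are billed for a dry run; this is what a real run would read.
print(f"Estimated bytes processed: {query_job.total_bytes_processed}")
```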

Google Cloud Pub/Sub

 

  • Cloud Pub/Sub is a fully managed, real-time messaging service for event-driven systems that allows you to send and receive messages between independent applications.

Features

  • Capable of global message routing to simplify multi-region systems.
  • Synchronous, cross-zone message replication and per-message receipt tracking ensure at-least-once delivery at any scale. Pub/Sub delivers each message at least once, so the Pub/Sub service might redeliver messages.
  • You can declare independent quota and billing for publishers and subscribers.
  • Cloud Pub/Sub doesn’t have shards or partitions. You just need to set your quota, publish, and consume.

Key Concepts

  • Topic
    • It is a named resource to which publishers send messages.
  • Subscription
    • A named resource representing the stream of messages from a specific topic, to be sent to the subscribing application.
  • Message
    • The combination of data and attributes that a publisher sends to a topic and is eventually sent to subscribers.
  • Message attribute
    • A key-value pair that a publisher can define for a message.

Publisher-subscriber relationships

  • A publisher application creates and sends messages to a topic.
  • Subscriber applications then create a subscription to a topic to receive messages from it (see the sketch after this list).
  • Communication can be
    • one-to-many
    • many-to-one
    • many-to-many
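
A minimal publish/subscribe sketch with the google-cloud-pubsub Python client; the project, topic, and subscription are hypothetical and assumed to already exist:

```python
# pip install google-cloud-pubsub
from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

project_id = "my-project"  # IDs below are hypothetical

# Publish a message with an attribute to an existing topic.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "orders")
future = publisher.publish(topic_path, b"order created", origin="web")
print("Published message", future.result())  # server-assigned message ID

# Pull messages from an existing subscription and acknowledge them.
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "orders-sub")

def callback(message):
    print("Received:", message.data, dict(message.attributes))
    message.ack()  # unacknowledged messages are redelivered (at-least-once)

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)  # listen for 30 seconds
except TimeoutError:
    streaming_pull.cancel()
```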

Pricing

  • Pub/Sub pricing is calculated based upon monthly data volumes:
    • Message ingestion and delivery
    • Snapshots and retained acknowledged messages
  • The first 10 GB of data per month is offered free of charge.

Google Cloud Secret Manager

 

  • Secret Manager is a secure and convenient method to store API keys, passwords, certificates, and other sensitive data.
  • It provides a central place as the source of truth to manage, access, and audit secrets across Google Cloud.

Features

  • Secret names are project-global resources, but secret data is stored in regions.
  • You can choose specific regions in which to store your secrets.
  • Secret data is immutable, and most operations take place on secret versions (see the sketch after this list).
  • Secret Manager integrates with IAM.
  • Every interaction with Secret Manager generates an audit entry with Cloud Logging enabled to help you detect system anomalies.
  • You can enable context-aware access to Secret Manager from hybrid environments using VPC Service Controls.
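
A minimal sketch of accessing a secret version with the google-cloud-secret-manager Python client; the project and secret IDs are hypothetical:

```python
from google.cloud import secretmanager  # pip install google-cloud-secret-manager

client = secretmanager.SecretManagerServiceClient()

# Access the latest enabled version of a secret (IDs are hypothetical).
name = "projects/my-project/secrets/db-password/versions/latest"
response = client.access_secret_version(request={"name": name})
print(response.payload.data.decode("UTF-8"))
```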

Pricing

  • Secret Manager charges for operations and active secret versions.
  • A version is considered active if it is in the ENABLED or DISABLED state.