Hands-on Lab

Comparing Google Cloud Big Data Services

Difficulty: Intermediate
Duration: Up to 45 minutes
Students: 926
Rating: 4/5

Description

In this lab, you will experience first-hand the differences between some of Google Cloud's Big Data solutions: BigQuery, Dataproc, and Dataflow. Specifically, you will create a BigQuery dataset and table, and then load data into BigQuery in three ways: directly, using Dataproc, and using Dataflow.
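
As a preview of the direct-load approach, the following sketch uses the google-cloud-bigquery Python client to create a dataset and load a CSV file from Cloud Storage. The project, bucket, dataset, and table names are placeholders, not the ones used in the lab.

    from google.cloud import bigquery

    # Placeholder project, dataset, table, and bucket names -- substitute your own.
    client = bigquery.Client(project="my-project")
    client.create_dataset("lab_dataset", exists_ok=True)

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the CSV header row
        autodetect=True,       # infer the table schema from the file
    )
    load_job = client.load_table_from_uri(
        "gs://my-bucket/data.csv",
        "my-project.lab_dataset.lab_table",
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to complete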

The following provides a brief high-level comparison between the services to review before starting the lab.

Comparing BigQuery, Dataproc, and Dataflow

BigQuery is a serverless data warehouse that can ingest, store, and query petabyte-scale data. BigQuery provides you with SQL capabilities for querying the data.
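
For example, once a table is loaded, it can be queried with standard SQL through the same Python client. The dataset, table, and column names below are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client()
    # Placeholder table and column; replace with the dataset and table created in the lab.
    query = """
        SELECT name, COUNT(*) AS row_count
        FROM `my-project.lab_dataset.lab_table`
        GROUP BY name
        ORDER BY row_count DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row["name"], row["row_count"])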

Cloud Dataproc is a managed Hadoop platform that includes Apache Spark, Flink, Hive, and other open-source tools and frameworks. Dataproc provides you with full programming language capabilities, in contrast to BigQuery's SQL-only query interface. If you need complex scripts to transform your data, then Dataproc may be a good solution.
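
As an illustration (not the lab's exact job), a minimal PySpark script run on a Dataproc cluster might read a CSV from Cloud Storage and write it to BigQuery. It assumes the spark-bigquery connector is available on the cluster, and the bucket, dataset, and table names are placeholders.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-bigquery").getOrCreate()

    # Read the source CSV from Cloud Storage (placeholder bucket and path).
    df = spark.read.option("header", True).csv("gs://my-bucket/data.csv")

    # Write to BigQuery through the spark-bigquery connector; a temporary
    # GCS bucket is used to stage the data before loading.
    (df.write.format("bigquery")
        .option("table", "lab_dataset.lab_table")
        .option("temporaryGcsBucket", "my-temp-bucket")
        .mode("append")
        .save())

A script like this would typically be submitted to the cluster with gcloud dataproc jobs submit pyspark.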

Cloud Dataflow is a serverless data processing service, fully managed by GCP, that provides a platform for running Apache Beam pipelines without having to manage the underlying cluster, including load balancing and auto-scaling the number of workers for a job. In contrast to Dataproc, where your code is tightly coupled to the job runner (the underlying platform), Cloud Dataflow allows you to focus on your business logic rather than on how the underlying layer works. Cloud Dataflow also offers various ready-made templates to choose from when creating a job, making the process even easier.
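
For comparison, a minimal Apache Beam pipeline (a sketch, not the lab's exact code) that reads a CSV from Cloud Storage and writes rows to BigQuery could look like the following. The file path, table name, and two-column schema are assumptions; passing --runner=DataflowRunner along with a project, region, and temp location runs it on Dataflow rather than locally.

    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    def parse_line(line):
        # Assumes a two-column CSV of name,score; adjust to the real schema.
        name, score = line.split(",")
        return {"name": name, "score": int(score)}

    # Pass --runner=DataflowRunner --project=... --region=... --temp_location=gs://...
    # on the command line to execute the pipeline on Cloud Dataflow.
    options = PipelineOptions()

    with beam.Pipeline(options=options) as pipeline:
        (pipeline
         | "Read CSV" >> beam.io.ReadFromText("gs://my-bucket/data.csv",
                                              skip_header_lines=1)
         | "Parse rows" >> beam.Map(parse_line)
         | "Write to BigQuery" >> beam.io.WriteToBigQuery(
               "my-project:lab_dataset.lab_table",
               schema="name:STRING,score:INTEGER",
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))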

The pricing models for each service also differ. BigQuery pricing is based on the amount of data stored, ingested, and processed while executing queries. Dataproc pricing is based on the number of virtual CPUs in the cluster. Dataflow pricing is based on the CPU, memory, and persistent disk resources used while executing jobs.

Learning Objectives

Upon completion of this lab, you will be able to:

  • Choose among the different Big Data services available in GCP
  • Create ETL pipelines to load data from GCS into BigQuery
  • Query the data available in BigQuery

Intended Audience

This lab is intended for:

  • Candidates for the Associate Cloud Engineer Certification Exam
  • ETL Developers
  • Data Engineers
  • Cloud Architects

Prerequisites

The following prerequisite is beneficial but not required for completing this lab:

  • A basic understanding of SQL and Python

Updates

June 11th, 2024 - Resolved Apache Beam installation issue

March 30th, 2024 - Resolved Dataproc cluster creation issue

August 10th, 2023 - Addressed user ban issue and promptly added a warning

July 25th, 2023 - Updated Python source code to resolve an issue preventing the PySpark Dataproc job from finishing

March 9th, 2023 - Updated lab to use Apache Beam 2.45.0

 

Environment before

Environment after

Lab steps

Signing In to the Google Cloud Console
Creating a BigQuery Dataset and Uploading CSV File
Creating a Dataproc Cluster and Submitting a Job
Creating a Dataflow Job to Fetch Data and Store it in BigQuery