Running Spark on Azure Databricks

Difficulty: Intermediate
Duration: 49 seconds
Students: 7,742
Rating: 4.8/5

Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s biggest advantages is that it can keep intermediate data in memory between processing steps, which can be orders of magnitude faster than the disk-based processing that MapReduce uses.
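To make the in-memory versus disk-based distinction concrete, here is a toy sketch in plain Python. It is not Spark or MapReduce code, and the function names are invented for illustration. It only shows the shape of the difference: one version saves the output of its first stage to disk before the second stage reads it back, while the other hands that output straight to the next stage.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def tokenize(lines):
    """Stage 1: split each line into lowercase words."""
    return [word.lower() for line in lines for word in line.split()]


def count_words_via_disk(lines, work_dir):
    """MapReduce-style: write the stage 1 output to disk, then read it back."""
    intermediate = Path(work_dir) / "stage1.json"  # hypothetical file name
    intermediate.write_text(json.dumps(tokenize(lines)))
    words = json.loads(intermediate.read_text())
    return Counter(words)


def count_words_in_memory(lines):
    """Spark-style: pass the stage 1 output to stage 2 without touching disk."""
    return Counter(tokenize(lines))


lines = ["Spark keeps data in memory", "MapReduce writes data to disk"]
with tempfile.TemporaryDirectory() as work_dir:
    disk_result = count_words_via_disk(lines, work_dir)
memory_result = count_words_in_memory(lines)
```

Both functions return the same counts. The only difference is the extra write and read in the disk-based version, and that cost grows with every additional stage in a real pipeline.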

In 2013, the creators of Spark started a company called Databricks. Its product, also called Databricks, is a cloud-based implementation of Spark with a user-friendly interface for running code on clusters interactively.

Microsoft has partnered with Databricks to bring its product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning.

In this lesson, we’ll start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of using a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule.
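As a rough preview of the scheduling step, the sketch below builds the kind of request body you might send to the Databricks Jobs API to run a notebook on a schedule. The field names are given as recalled from version 2.1 of that API and should be checked against the current documentation. The job name, notebook path, runtime version, node type, and cron expression are all illustrative assumptions, not values taken from this lesson.

```python
import json


def build_scheduled_job(notebook_path, cron, timezone="UTC", workers=2):
    """Build a request body that schedules a notebook as a job.

    Field names are assumed to match the Databricks Jobs API 2.1;
    verify them against the official reference before relying on them.
    """
    return {
        "name": "nightly-notebook-run",  # hypothetical job name
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # example runtime
                    "node_type_id": "Standard_DS3_v2",    # example Azure VM size
                    "num_workers": workers,
                },
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,  # Quartz syntax, seconds field first
            "timezone_id": timezone,
        },
    }


# Hypothetical notebook path; the cron expression means every day at 06:00.
payload = build_scheduled_job(
    "/Users/someone@example.com/demo-notebook", "0 0 6 * * ?"
)
request_body = json.dumps(payload, indent=2)
```

The sketch stops at building the JSON body. Sending it would also require your workspace URL and an access token, which are left out here. The same schedule can be configured through the Databricks user interface.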

Learning Objectives

  • Create a Databricks workspace, cluster, and notebook
  • Run code in a Databricks notebook either interactively or as a job

Intended Audience

  • People who want to use Azure Databricks to run Apache Spark for analytics

Prerequisites

  • Prior experience with Azure and at least one programming language

Additional Resources

The GitHub repository for this lesson is at https://github.com/cloudacademy/azure-databricks.