Running Spark on Azure Databricks
Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it processes data in memory, which can be orders of magnitude faster than MapReduce’s disk-based processing.
In 2013, the creators of Spark founded a company called Databricks. Their product, also named Databricks, is a cloud-based implementation of Spark with a user-friendly interface for running code interactively on clusters.
Microsoft has partnered with Databricks to bring its product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.
In this lesson, we’ll start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of using a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule. As a preview, the sketch below shows the kind of interactive query you’ll write.
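Here is a minimal PySpark sketch of the sort of code you might run in a notebook cell. It’s an illustrative example, not taken from the lesson’s repository: the file path and the "region" column are hypothetical placeholders, and in a Databricks notebook a SparkSession named `spark` is already created for you, so no setup code is needed.

```python
# In a Databricks notebook, a SparkSession named `spark` is predefined.
# The CSV path and the "region" column are hypothetical placeholders.
df = spark.read.csv("/mnt/data/sales.csv", header=True, inferSchema=True)

df.printSchema()  # inspect the inferred column names and types

# Count rows per region and show the ten largest groups.
df.groupBy("region").count().orderBy("count", ascending=False).show(10)
```

Each cell runs against the attached cluster and displays its results inline, so you can iterate on queries like this one interactively.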
Learning Objectives
- Create a Databricks workspace, cluster, and notebook
- Run code in a Databricks notebook either interactively or as a job
Intended Audience
- People who want to use Azure Databricks to run Apache Spark for analytics
Prerequisites
- Prior experience with Azure and at least one programming language
Additional Resources
The GitHub repository for this lesson is at https://github.com/cloudacademy/azure-databricks.