Transforming Data With Apache Spark and Amazon EMR
Description
Amazon EMR (formerly known as Amazon Elastic MapReduce) is a big data platform that supports many popular open-source data processing frameworks, including Apache Spark. Amazon EMR simplifies the configuration, provisioning, and scaling of clusters for data analysis and processing workloads.
Learning to use Amazon EMR is valuable for anyone who wants to understand how large-scale data processing is carried out in production environments.
In this hands-on lab, you will tour an Amazon EMR cluster, place data and a script in a location accessible to Amazon EMR, submit a workload to an Amazon EMR cluster, and examine the results.
Note: an Amazon EMR cluster takes approximately ten minutes to create and become usable, so make sure you have enough time available before starting the lab.
Learning objectives
Upon completion of this beginner-level lab, you will be able to:
- Understand the configuration of an Amazon EMR cluster
- Upload a script and data file to an Amazon S3 bucket
- Submit work to a cluster by adding a step
- Inspect the results of an Amazon EMR step
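As a rough illustration of the "submit work by adding a step" objective, the sketch below builds a step definition and submits it with the boto3 EMR client. It is a minimal sketch under stated assumptions, not the lab's own procedure (the lab may use the AWS console instead): the bucket name, object keys, and cluster ID are hypothetical placeholders, and it assumes the PySpark script takes an input URI and an output URI as positional arguments.

```python
from typing import Any, Dict


def build_spark_step(script_uri: str, input_uri: str, output_uri: str) -> Dict[str, Any]:
    """Return an EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": "Transform data with Spark",
        # Keep the cluster running if the step fails, so the logs can be inspected.
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,
                # Assumption: the script reads these two positional arguments.
                input_uri,
                output_uri,
            ],
        },
    }


def submit_step(cluster_id: str, step: Dict[str, Any]) -> str:
    """Add the step to a running cluster and return the new step ID."""
    import boto3  # imported here so the builder above works without AWS access

    emr = boto3.client("emr")
    response = emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


if __name__ == "__main__":
    # Hypothetical S3 locations; replace with the bucket used in your lab.
    step = build_spark_step(
        "s3://example-lab-bucket/scripts/transform.py",
        "s3://example-lab-bucket/input/data.json",
        "s3://example-lab-bucket/output/",
    )
    print(step["HadoopJarStep"]["Args"])
```

Calling submit_step with a real cluster ID requires AWS credentials and a running cluster; the step's status and output can then be inspected in the cluster's Steps view or in the output location in Amazon S3.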
Intended audience
- Candidates for the AWS Certified Data Engineer - Associate certification
- Cloud Architects
- Data Engineers
- DevOps Engineers
- Machine Learning Engineers
Prerequisites
Familiarity with the following will be beneficial but is not required:
- Amazon EMR
- Amazon Simple Storage Service (S3)
- The Python scripting language
- The JavaScript Object Notation (JSON) data format
The following content can be used to fulfill the prerequisites: