Amazon Elastic MapReduce (Amazon EMR) makes it easy to process vast amounts of data in a variety of applications, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. Amazon EMR uses Hadoop, an open-source framework, to distribute raw data and processing across a resizable cluster of Amazon EC2 instances.
Hadoop uses a distributed processing architecture called MapReduce in which a task is mapped to a set of servers for processing. The results of the computation performed by those servers are then reduced to a single output data set.
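To make the map and reduce steps concrete, here is a minimal single-process sketch of the pattern in plain Python. It is not Hadoop code and is not part of the lab's sample applications; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the word-count task are illustrative assumptions. In a real cluster, each chunk would be mapped on a different server.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key can be reduced on its own."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks):
    """Run map over every chunk, then shuffle and reduce to one output data set."""
    mapped = []
    for chunk in chunks:  # on a cluster, each chunk would go to a different server
        mapped.extend(map_phase(chunk))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample_chunks = ["the quick brown fox", "the lazy dog", "the fox"]
    print(word_count(sample_chunks))
```

Each chunk is processed independently in the map step, which is what allows the work to be spread across servers; only the reduce step needs to see all values for a given key.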
A high-level view of the EMR workflow is as follows:
The focus of this lab is configuring and launching an EMR cluster. You will be provided with sample input data sets and sample applications to process them. Treating the application and data set as a "black box" removes unneeded complexity and frees you to concentrate on configuration.
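For a sense of what "configuring" a cluster involves, the sketch below builds the kind of request that the boto3 `run_job_flow` call accepts. It is a hedged illustration only, and the lab's own instructions may use a different interface. The helper name `build_cluster_request`, the release label, the instance types, the cluster name, and the bucket in the log URI are all assumptions chosen for illustration, and the default role names exist only if they have been created in your account.

```python
def build_cluster_request(name, log_uri, core_count=2):
    """Return keyword arguments for boto3's EMR run_job_flow call.

    Every concrete value here (release label, instance types, roles)
    is an illustrative assumption, not a setting taken from the lab.
    """
    return {
        "Name": name,
        "LogUri": log_uri,                    # S3 location for cluster logs
        "ReleaseLabel": "emr-6.15.0",         # assumed EMR release
        "Applications": [{"Name": "Hadoop"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "Primary",
                    "InstanceRole": "MASTER",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": 1,
                },
                {
                    "Name": "Core",
                    "InstanceRole": "CORE",
                    "InstanceType": "m5.xlarge",  # assumed instance type
                    "InstanceCount": core_count,
                },
            ],
            "KeepJobFlowAliveWhenNoSteps": True,  # keep the cluster up for later steps
            "TerminationProtected": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default EC2 instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
    }


if __name__ == "__main__":
    request = build_cluster_request("lab-cluster", "s3://example-bucket/logs/")
    print(request["Name"], request["ReleaseLabel"])
    # With AWS credentials configured, the request could be submitted like this:
    #   import boto3
    #   response = boto3.client("emr").run_job_flow(**request)
    #   print(response["JobFlowId"])
```

Building the request as a plain dictionary keeps the configuration inspectable before anything is launched, which is useful because a running cluster incurs cost.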
Note that this lab involves creating a new Amazon EMR cluster, which typically takes about ten minutes. Ensure you have enough time available before starting the lab.
Upon completion of this lab, you will be able to:
Familiarity with the following will be beneficial but is not required:
The following content can be used to fulfill the prerequisites:
After completing the lab instructions the environment should look similar to:
October 3rd, 2024 - Resolved EMR cluster creation issue
November 29th, 2023 - Updated screenshots to reflect the latest user interface and updated the lab structure for clarity
July 26th, 2023 - Addressed user ban issue and added warning
March 30th, 2023 - Updated the instructions and screenshots to reflect the latest UI
December 27th, 2022 - Updated the instructions and screenshots to reflect the latest UI
September 13th, 2022 - Updated the instructions and screenshots to reflect the latest UI
December 13th, 2021 - Adjusted the allowed bandwidth for the lab to account for increased network usage by EMR
January 10th, 2019 - Added a validation Lab Step to check the work you perform in the Lab