Building a Data Pipeline in DC/OS
DC/OS was declared end of life on October 31, 2021, and this content is no longer maintained.
Description
Notice: DC/OS has been declared end-of-life. The lab instructions have been updated for the end-of-life release. Due to limitations in DC/OS, the final lab step now simulates the real-time analysis of tweets.
It is relatively simple to create powerful data pipelines in DC/OS. In this Lab, you will learn how to perform streaming data analytics by building a data pipeline in DC/OS that combines multiple services and a Twitter-like application. You will review many of the fundamental concepts in using DC/OS along the way, including installing packages, using Marathon-LB to load balance traffic, and working with virtual IPs.
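In a pipeline like this, the Twitter-like application typically publishes tweet events into Kafka for the downstream services to consume. As a rough illustration of that first hop, the following is a minimal Python sketch of a producer pushing simulated tweets into a Kafka topic. It assumes the kafka-python client library; the broker address (a DC/OS-style named VIP) and the `tweets` topic name are placeholders rather than the Lab's actual values.

```python
import json
import time

from kafka import KafkaProducer  # kafka-python client (assumption: installed in the environment)

# In DC/OS, Kafka brokers are typically reached through a named VIP.
# This address and the topic name below are illustrative, not the Lab's exact endpoints.
producer = KafkaProducer(
    bootstrap_servers="broker.kafka.l4lb.thisdcos.directory:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A couple of hand-written records standing in for the Twitter-like application's output.
sample_tweets = [
    {"user": "alice", "text": "Loving #DCOS data pipelines"},
    {"user": "bob", "text": "Streaming with #Kafka and #Spark"},
]

for tweet in sample_tweets:
    producer.send("tweets", value=tweet)  # hypothetical "tweets" topic
    time.sleep(1)

producer.flush()
```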
Lab Objectives
Upon completion of this Lab you will be able to:
- Install DC/OS packages with custom options using the DC/OS CLI
- Deploy a data pipeline using Kafka, Cassandra, and a social networking app
- Use the Zeppelin package and DC/OS Spark to perform basic streaming analytics on the data pipeline (an illustrative sketch follows this list)
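To give a feel for the kind of streaming analysis the Zeppelin and Spark objective refers to, the sketch below uses PySpark Structured Streaming to count hashtags in a stream of simulated tweets read from Kafka. This is an illustrative example rather than the Lab's notebook code: the broker address, topic name, and console sink are assumptions, and the Lab's Zeppelin notebooks may use a different API or write results elsewhere (for example, to Cassandra).

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, explode, split, lower

spark = SparkSession.builder.appName("tweet-hashtag-counts").getOrCreate()

# Read the raw tweet stream from Kafka; each record's value holds the tweet text.
# The broker VIP and topic name are placeholders.
tweets = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker.kafka.l4lb.thisdcos.directory:9092")
    .option("subscribe", "tweets")
    .load()
    .selectExpr("CAST(value AS STRING) AS text")
)

# Split each tweet into words, keep only hashtags, and maintain a running count per hashtag.
hashtag_counts = (
    tweets
    .select(explode(split(col("text"), "\\s+")).alias("word"))
    .filter(col("word").startswith("#"))
    .groupBy(lower(col("word")).alias("hashtag"))
    .count()
)

# Write the running counts to the console; in a full pipeline this would more likely
# feed Cassandra or be visualized in a Zeppelin notebook.
query = (
    hashtag_counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination()
```

Because the aggregation is a running total, the query uses complete output mode; a pipeline writing to Cassandra would typically use windowed counts and the Spark Cassandra connector instead of the console sink.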
Lab Prerequisites
You should be familiar with:
- Basic and intermediate DC/OS concepts, including virtual IPs and Marathon-LB
- Working at the command-line in Linux
- AWS services (optional), to understand the architecture of the pre-created DC/OS cluster
Lab Environment
Before you complete the Lab instructions, the environment looks as follows:
After you complete the Lab instructions, the environment should look similar to the following:
Updates
January 19th, 2022 - Updated lab instructions to reflect the latest (end of life) DC/OS experience
August 1st, 2021 - Resolved an issue preventing the DC/OS cluster from provisioning
October 2nd, 2020 - Replaced CoreOS virtual machines (no longer available in AWS) with CentOS
January 10th, 2019 - Added a validation Lab Step to check the work you perform in the Lab