hands-on lab

Implementing an ETL Pipeline with AWS SDK for Pandas

Difficulty: Beginner
Duration: Up to 1 hour
Students: 2
Get guided in a real environment: Practice with a step-by-step scenario in a real, provisioned environment.
Learn and validate: Use validations to check your solutions every step of the way.
See results: Track your knowledge and monitor your progress.

Description

AWS SDK for Pandas (formerly known as AWS Data Wrangler) is an open-source Python library from AWS that simplifies reading, writing, and transforming data stored in AWS services. Built on the popular Pandas library, it connects DataFrames to services such as Amazon S3 and Amazon Athena, and it is designed to be used at scale.

Learning how to use AWS SDK for Pandas will benefit anyone who wants to apply data science on AWS.

In this hands-on lab, you will explore how to access different data stores with the library, and you will implement a Lambda function that uses it to process transaction data in real time.
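As a rough illustration of what such an extract function might look like, here is a minimal sketch. It is not the lab's actual solution: the event shape (SQS-style records with a JSON body), the field names, the bucket name example-bucket, and the helpers parse_transactions and build_s3_path are all assumptions made for this example. Only wr.s3.to_parquet is a real AWS SDK for Pandas call.

```python
import json
from datetime import datetime, timezone


def parse_transactions(event):
    """Flatten a hypothetical event into a list of transaction dicts.

    Assumes an SQS-style shape: {"Records": [{"body": "<json string>"}, ...]}.
    The lab's real event format may differ.
    """
    rows = []
    for record in event.get("Records", []):
        body = json.loads(record["body"])
        rows.append(
            {
                "transaction_id": str(body["transaction_id"]),
                "amount": float(body["amount"]),
                "currency": body.get("currency", "USD"),
            }
        )
    return rows


def build_s3_path(bucket, prefix, when):
    """Build a date-partitioned S3 prefix for the extracted data."""
    return f"s3://{bucket}/{prefix}/dt={when:%Y-%m-%d}/"


def lambda_handler(event, context):
    # Imported lazily so the helpers above can be exercised without the
    # AWS dependencies installed; in Lambda the library would typically
    # be provided by a layer.
    import awswrangler as wr
    import pandas as pd

    rows = parse_transactions(event)
    if not rows:
        return {"written": 0}

    df = pd.DataFrame(rows)
    # "example-bucket" and "transactions" are placeholder names.
    path = build_s3_path("example-bucket", "transactions", datetime.now(timezone.utc))
    # dataset=True writes Parquet files under the given S3 prefix.
    wr.s3.to_parquet(df=df, path=path, dataset=True)
    return {"written": len(df)}
```

Keeping the parsing and path-building logic in plain functions means they can be checked locally, for example in a notebook, before the handler is deployed.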

Learning objectives

Upon completion of this beginner-level lab, you will be able to:

  • Use a JupyterLab Notebook
  • Install and use the AWS SDK for Pandas library
  • Update an AWS Lambda function using the AWS CLI
  • Query data using Amazon Athena
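
To give a feel for the last objective, the following sketch shows one way a query against Amazon Athena could be issued through the library. The table layout (columns currency and amount, with a string partition column dt) and the helper names build_daily_totals_query and query_daily_totals are hypothetical and chosen only for illustration. wr.athena.read_sql_query is the library call that runs the query and returns a DataFrame.

```python
import re
from datetime import date

# Conservative pattern for table names, used to avoid building unsafe SQL.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_daily_totals_query(table, day):
    """Return SQL that totals transaction amounts per currency for one day."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsafe table name: {table!r}")
    return (
        f'SELECT currency, SUM(amount) AS total FROM "{table}" '
        f"WHERE dt = '{day.isoformat()}' GROUP BY currency"
    )


def query_daily_totals(database, table, day):
    # Requires `pip install awswrangler` and AWS credentials; imported here
    # so the query builder can be used without them.
    import awswrangler as wr

    sql = build_daily_totals_query(table, day)
    # Runs the query in Athena and returns the result as a pandas DataFrame.
    return wr.athena.read_sql_query(sql=sql, database=database)
```

In a notebook, a call such as query_daily_totals("my_database", "transactions", date(2024, 1, 2)) would return the per-currency totals, assuming a Glue database and table with those names exist.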

Intended audience

  • Candidates for the AWS Certified Data Engineer Associate certification
  • Data Engineers
  • DevOps Engineers
  • Machine Learning Engineers
  • Software Engineers

Prerequisites

Familiarity with the following will be beneficial but is not required:

  • The Python scripting language
  • AWS Lambda
  • Amazon S3

Lab steps

Exploring the AWS SDK for Pandas Library
Developing an Extract AWS Lambda Function
Logging In to the Amazon Web Services Console
Examining the Pipeline and Extracted Data