Hands-on Lab

Text Analysis and LLMs with Python - Module 9

Difficulty: Intermediate
Duration: Up to 1 hour and 30 minutes
Students: 2
Get guided in a real environment: Practice with a step-by-step scenario in a real, provisioned environment.
Learn and validate: Use validations to check your solutions every step of the way.
See results: Track your knowledge and monitor your progress.

Description

Techniques for Latency Reduction and Model Optimization

In this lab, you will learn how to optimize large language models for latency and efficiency in real-world use cases. You’ll explore the relationship between model size, performance, and inference speed, calculate VRAM requirements, and evaluate trade-offs for deployment in latency-sensitive environments.

Learning objectives

Upon completion of this lab, you will be able to:

  • Identify the strengths, limitations, and challenges of LLMs in real-world use cases.
  • Discuss the relationship between model size, performance, and inference speed.
  • Calculate approximate VRAM requirements for running LLMs using model parameters and quantization levels.
  • Describe techniques for reducing model size.
  • List methods for improving inference speed.
  • Evaluate considerations for deploying LLMs in latency-sensitive environments.

Intended audience

This lab is designed for:

  • Data Scientists
  • Software Developers
  • Machine Learning Engineers
  • AI Engineers
  • DevOps Engineers

Prerequisites

Completion of previous modules is highly recommended before attempting this lab.

Lab structure

Demo: LLM Latency & Model Optimization (Toy Lab)
In this demo, you will:
- Explore trade-offs between model size, memory, and inference speed.
- Calculate approximate VRAM requirements for different LLMs under various quantization levels.
- Run a small latency experiment with OpenAI API calls to observe how response times vary with input and output size (a sketch of this experiment appears at the end of this demo description).
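The VRAM estimate used in this demo follows the back-of-envelope rule VRAM ≈ parameter count × bytes per parameter, plus some overhead for activations and the KV cache. A minimal sketch of that calculation is shown below; the parameter counts, quantization levels, and 20% overhead factor are illustrative assumptions rather than values taken from the lab notebook.

```python
# Back-of-envelope VRAM estimate: parameters x bytes per parameter, plus overhead.
# Model sizes, precisions, and the 20% overhead factor are illustrative assumptions.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(num_params: float, precision: str, overhead: float = 0.20) -> float:
    """Approximate VRAM (in GiB) needed to hold the weights at a given precision."""
    weight_bytes = num_params * BYTES_PER_PARAM[precision]
    return weight_bytes * (1 + overhead) / 1024**3

if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        for precision in ("fp16", "int8", "int4"):
            print(f"{name} @ {precision}: ~{estimate_vram_gb(params, precision):.1f} GB")
```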

Intended learning outcomes:
- Identify how model size and precision affect VRAM requirements.
- Calculate approximate VRAM requirements for different parameter counts and quantization levels.
- Observe how latency changes with prompt length and response size.
- Reflect on trade-offs between memory footprint, latency, and accuracy in deployment scenarios.
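The latency experiment can be sketched roughly as follows using the official openai Python client. The model name, prompts, and token limits are placeholder assumptions; the actual notebook may structure the experiment differently.

```python
import time
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY in the environment

client = OpenAI()

# Placeholder prompts of increasing input length and output caps; adjust as needed.
EXPERIMENTS = [
    ("short prompt / short answer", "Define latency in one sentence.", 30),
    ("short prompt / long answer", "Explain LLM quantization in detail.", 300),
    ("long prompt / short answer", "Summarize in one sentence: " + "low latency matters. " * 200, 30),
]

for label, prompt, max_tokens in EXPERIMENTS:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whichever model the lab provides
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    usage = response.usage
    print(f"{label}: {elapsed:.2f}s "
          f"(prompt tokens={usage.prompt_tokens}, completion tokens={usage.completion_tokens})")
```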

Activity: Scenario — The ER Chatbot
In this activity, you will act as a consultant tasked with deploying an AI triage assistant in a hospital Emergency Room. The system must respond accurately and safely within a 1-second latency budget on a single GPU with 24 GB of VRAM. You will estimate VRAM needs, evaluate optimization strategies, and decide between local and hosted deployment under strict performance and privacy constraints.
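As a rough, illustrative example of the kind of reasoning involved: a 13B-parameter model at FP16 needs about 13 × 10⁹ × 2 bytes ≈ 26 GB for weights alone and would not fit in 24 GB, while the same model at 4-bit quantization needs roughly 6.5 GB plus overhead and fits comfortably; a 70B model would need around 35 GB even at 4-bit and could not run on a single 24 GB GPU. These are back-of-envelope weight-only estimates and ignore KV-cache growth with context length.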

Intended learning outcomes:
- Calculate VRAM requirements for LLMs of different sizes and quantization levels.
- Compare trade-offs between accuracy, memory, and inference speed.
- Propose optimization techniques to achieve latency requirements.
- Make deployment recommendations for latency-sensitive, high-stakes environments.
- Reflect on operational risks (accuracy drop, bias, privacy) and monitoring strategies.


Lab steps

  1. Starting the Notebooks