In this lab, you will learn how to optimize large language models for latency and efficiency in real-world use cases. You’ll explore the relationship between model size, performance, and inference speed, calculate VRAM requirements, and evaluate trade-offs for deployment in latency-sensitive environments.
Upon completion of this lab, you will be able to meet the intended learning outcomes listed under each part below.
This course is designed for:
Completion of previous modules is highly recommended before attempting this lab.
Demo: LLM Latency & Model Optimization (Toy Lab)
In this demo, you will:
- Explore trade-offs between model size, memory, and inference speed.
- Calculate approximate VRAM requirements for different LLMs under various quantization levels (see the estimation sketch after this list).
- Run a small latency experiment with OpenAI API calls to observe how response times vary with input and output size.
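As a rough rule of thumb, the weights alone occupy roughly parameter count × bytes per parameter, with some extra headroom for activations and the KV cache. The sketch below applies that rule; the 20% overhead factor and the model sizes are illustrative assumptions, not measured values.

```python
# Rough VRAM estimate: weights = parameter count x bytes per parameter,
# plus an assumed fixed overhead for activations and the KV cache.
BYTES_PER_PARAM = {
    "fp32": 4.0,
    "fp16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}

def estimate_vram_gb(params_billion: float, precision: str, overhead: float = 0.20) -> float:
    """Approximate VRAM in GB to hold the weights plus a fixed overhead fraction."""
    weight_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return weight_bytes * (1 + overhead) / 1e9

if __name__ == "__main__":
    for size in (7, 13, 70):
        row = ", ".join(f"{p}: {estimate_vram_gb(size, p):.1f} GB" for p in BYTES_PER_PARAM)
        print(f"{size}B params -> {row}")
```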
Intended learning outcomes:
- Identify how model size and precision affect VRAM requirements.
- Calculate approximate VRAM requirements for different parameter counts and quantization levels.
- Observe how latency changes with prompt length and response size (a timing sketch follows this list).
- Reflect on trade-offs between memory footprint, latency, and accuracy in deployment scenarios.
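A minimal sketch of the latency experiment follows, assuming the `openai` Python client (v1.x) with an `OPENAI_API_KEY` set in the environment; the model name, prompts, and token limits are placeholders to vary during the demo.

```python
# Time chat completions for different prompt lengths and output caps.
# Assumes the openai v1.x client and an OPENAI_API_KEY environment variable;
# the model name below is a placeholder.
import time
from openai import OpenAI

client = OpenAI()

def timed_completion(prompt: str, max_tokens: int) -> float:
    """Return wall-clock seconds for a single chat completion call."""
    start = time.perf_counter()
    client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return time.perf_counter() - start

if __name__ == "__main__":
    short_prompt = "Summarize: the patient reports mild chest pain."
    long_prompt = short_prompt + " History: " + "stable vitals, no prior events. " * 100
    for label, prompt in (("short", short_prompt), ("long", long_prompt)):
        for cap in (32, 256):
            print(f"{label} prompt, max_tokens={cap}: {timed_completion(prompt, cap):.2f} s")
```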
Activity: Scenario — The ER Chatbot
In this activity, you will act as a consultant tasked with deploying an AI triage assistant in a hospital Emergency Room. The system must answer accurately and safely within a 1-second latency budget on a single GPU with 24 GB of VRAM. You will estimate VRAM needs, evaluate optimization strategies, and decide between local and hosted deployment under strict performance and privacy constraints.
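To make the constraint concrete: a 13B-parameter model in FP16 needs about 13 × 2 = 26 GB for the weights alone, which already exceeds the 24 GB budget, while 8-bit (~13 GB) or 4-bit (~6.5 GB) quantization leaves room for the KV cache. The sketch below runs that check; the candidate model sizes and the 4 GB runtime reserve are illustrative assumptions.

```python
# Feasibility check against the ER scenario's 24 GB VRAM budget.
# Model sizes and the 4 GB reserve for KV cache / activations are assumptions.
GPU_VRAM_GB = 24
RESERVE_GB = 4
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

for params_b in (7, 13, 70):
    for precision, bpp in BYTES_PER_PARAM.items():
        weights_gb = params_b * bpp  # billions of params x bytes/param = GB of weights
        verdict = "fits" if weights_gb + RESERVE_GB <= GPU_VRAM_GB else "does not fit"
        print(f"{params_b}B @ {precision}: {weights_gb:5.1f} GB weights -> {verdict}")
```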
Intended learning outcomes:
- Calculate VRAM requirements for LLMs of different sizes and quantization levels.
- Compare trade-offs between accuracy, memory, and inference speed.
- Propose optimization techniques to meet the latency requirement (see the latency-budget sketch after this list).
- Make deployment recommendations for latency-sensitive, high-stakes environments.
- Reflect on operational risks (accuracy drop, bias, privacy) and monitoring strategies.
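One way to reason about the 1-second target is a simple two-term latency model: total latency ≈ prompt processing (prefill) time + generated tokens ÷ decode throughput. The sketch below uses that model; the throughput figures are illustrative assumptions, not benchmarks of any particular GPU or model.

```python
# Back-of-the-envelope latency budget for the 1-second ER target.
# Prefill and decode throughputs are illustrative assumptions, not benchmarks.
def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_tok_per_s: float = 2000.0,
                       decode_tok_per_s: float = 60.0) -> float:
    """Latency ~= prompt processing (prefill) time + sequential decode time."""
    return prompt_tokens / prefill_tok_per_s + output_tokens / decode_tok_per_s

if __name__ == "__main__":
    for prompt_toks, out_toks in ((200, 30), (200, 120), (1000, 30)):
        t = estimate_latency_s(prompt_toks, out_toks)
        verdict = "within" if t <= 1.0 else "over"
        print(f"prompt={prompt_toks} tok, output={out_toks} tok: ~{t:.2f} s ({verdict} the 1 s budget)")
```

Under these assumptions, decode time dominates for longer answers, which is why capping response length is often one of the most effective latency levers in this scenario.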