In this lab, you will learn how to evaluate language models using different metrics and approaches, how to deploy them, and how to track their behaviour over time. You’ll work with both quantitative and qualitative evaluation techniques, including LLM-as-a-judge, and reflect on the challenges of assessing generative outputs.
Upon completion of this lab, you will be able to meet the intended learning outcomes listed under the demo and activity sections below.
This lab is designed for learners who have worked through the previous modules of the course; completing them first is highly recommended before attempting this lab.
Demo: Evaluating, Deploying, and Observing LLMs
In this demo, you will:
- Evaluate a grounded QA model that must answer only from a provided context.
- Generate answers and compute BLEU, ROUGE, BERTScore, and Perplexity (see the metric sketch after this list).
- Apply LLM-as-a-judge to assess qualitative criteria such as Faithfulness, Truthfulness, Completeness, Relevance, Fluency, and Bias.
- Assemble a compact leaderboard and discuss how metrics complement or contradict each other.
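The sketch below shows one way the automatic metrics from this list might be computed with the Hugging Face `evaluate` library. The example answer, the gold reference, and the `gpt2` checkpoint used for perplexity are placeholders, not part of the lab materials.

```python
import evaluate

# Placeholder data: one generated answer and its gold reference.
predictions = ["The treaty was signed in 1648."]
references = [["The treaty was signed in 1648."]]   # one list of references per prediction

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions,
                                       references=[r[0] for r in references])
bertscore = evaluate.load("bertscore").compute(predictions=predictions,
                                               references=[r[0] for r in references],
                                               lang="en")
# Perplexity scores the fluency of the generated text under a small language model.
ppl = evaluate.load("perplexity", module_type="metric").compute(predictions=predictions,
                                                                model_id="gpt2")

print("BLEU:        ", bleu["bleu"])
print("ROUGE-L:     ", rouge["rougeL"])
print("BERTScore F1:", sum(bertscore["f1"]) / len(bertscore["f1"]))
print("Perplexity:  ", ppl["mean_perplexity"])
```

Note that BLEU and ROUGE compare surface n-grams, BERTScore compares contextual embeddings, and perplexity ignores the reference entirely, which is why they can disagree on the same answer.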
Intended learning outcomes:
- Explain why evaluation depends on the task (here: grounded QA).
- Compute BLEU, ROUGE, BERTScore, and Perplexity for generated answers.
- Apply LLM-as-a-judge to evaluate multiple qualitative dimensions of answers (a rubric sketch follows this list).
- Compare metrics and identify when they align or diverge.
- Reflect on trade-offs between automatic metrics and qualitative judgments for deployment.
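One possible shape for the LLM-as-a-judge step is sketched below. The rubric prompt, the 1-5 scale, and the `call_judge_model` helper are assumptions for illustration; wire the helper to whichever chat model you use as the judge.

```python
import json

CRITERIA = ["Faithfulness", "Truthfulness", "Completeness", "Relevance", "Fluency", "Bias"]

JUDGE_TEMPLATE = """You are grading an answer to a grounded QA task.

Context:
{context}

Question:
{question}

Answer:
{answer}

Score each criterion from 1 (poor) to 5 (excellent) and return ONLY a JSON
object mapping criterion name to score. Criteria: {criteria}
"""

def judge_answer(context: str, question: str, answer: str) -> dict:
    prompt = JUDGE_TEMPLATE.format(context=context, question=question,
                                   answer=answer, criteria=", ".join(CRITERIA))
    raw = call_judge_model(prompt)   # hypothetical wrapper around your chosen judge LLM API
    scores = json.loads(raw)         # expects e.g. {"Faithfulness": 5, "Bias": 4, ...}
    return {c: scores.get(c) for c in CRITERIA}
```

Asking for a JSON object keeps the judge's output machine-readable, so per-criterion scores can be averaged across examples and placed next to the automatic metrics on the leaderboard.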
Activity: Summarization Showdown — Evaluate & Improve Summaries
In this activity, you will:
- Build and evaluate a single-sentence summarizer for short passages.
- Compare system outputs to gold summaries using BLEU, ROUGE, and BERTScore.
- Measure Perplexity as a proxy for fluency.
- Use LLM-as-a-judge to grade Faithfulness, Truthfulness, Completeness, Relevance, Fluency, and Bias.
- Improve your summaries with prompt constraints and optional self-check loops (a self-check sketch follows this list).
- Track progress on a mini leaderboard and reflect on metric disagreements.
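A minimal sketch of a constrained prompt with a self-check loop follows, assuming a hypothetical `generate(prompt)` helper that returns one completion from your summarization model; the one-sentence, 30-word constraint is illustrative.

```python
import re

# Illustrative constraint: exactly one sentence, at most 30 words.
BASE_PROMPT = ("Summarize the passage below in exactly one sentence of at most 30 words. "
               "Do not add information that is not in the passage.\n\n"
               "Passage:\n{passage}\n\nSummary:")

def meets_constraint(summary: str, max_words: int = 30) -> bool:
    sentences = [s for s in re.split(r"[.!?]+", summary.strip()) if s.strip()]
    return len(sentences) == 1 and len(summary.split()) <= max_words

def summarize_with_self_check(passage: str, max_retries: int = 2) -> str:
    summary = generate(BASE_PROMPT.format(passage=passage))   # hypothetical model call
    for _ in range(max_retries):
        if meets_constraint(summary):
            break
        # Feed the violation back to the model and ask for a corrected summary.
        retry_prompt = (BASE_PROMPT.format(passage=passage)
                        + f"\n\nYour previous attempt was:\n{summary}\n"
                          "It broke the one-sentence, 30-word limit. Try again:")
        summary = generate(retry_prompt)
    return summary
```

The same pattern extends to other constraints, such as banning numbers not present in the passage, by swapping in a different check function.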
Intended learning outcomes:
- Apply BLEU, ROUGE, BERTScore, and Perplexity to measure summaries.
- Use LLM-as-a-judge to evaluate multiple qualitative dimensions.
- Use prompt engineering (constraints and style guides) to improve scores.
- Compare baseline vs. improved outputs using a structured leaderboard (see the leaderboard sketch after this list).
- Reflect on evaluation challenges and propose monitoring strategies for deployment.
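One way to keep the mini leaderboard structured is a small pandas table, sketched below; the system names and scores are illustrative placeholders, not results from the lab.

```python
import pandas as pd

# Illustrative placeholder scores, not real results.
rows = [
    {"system": "baseline",           "ROUGE-L": 0.31, "BERTScore F1": 0.87,
     "Perplexity": 42.0, "Judge: Faithfulness": 3.2},
    {"system": "constrained prompt", "ROUGE-L": 0.36, "BERTScore F1": 0.89,
     "Perplexity": 38.5, "Judge: Faithfulness": 4.1},
]

leaderboard = pd.DataFrame(rows).set_index("system")
# Sort by whichever criterion matters most for deployment; here, judged faithfulness.
print(leaderboard.sort_values("Judge: Faithfulness", ascending=False))
```

Keeping every run as a row makes it easy to spot metric disagreements, for example a summary whose ROUGE improves while its judged faithfulness drops.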