Hands-on Lab

Text Analysis and LLMs with Python - Module 8

Difficulty: Intermediate
Duration: Up to 1 hour and 30 minutes

Description

Evaluating, Deploying, and Observing Models

In this lab, you will learn how to evaluate language models using different metrics and approaches, how to deploy them, and how to track their behaviour over time. You’ll work with both quantitative and qualitative evaluation techniques, including LLM-as-a-judge, and reflect on the challenges of assessing generative outputs.

Learning objectives

Upon completion of this lab, you will be able to:

  • Differentiate evaluation approaches based on the type of task.
  • Select appropriate metrics for regression, classification, clustering, information retrieval, and generative outputs.
  • Discuss challenges in evaluating generative model outputs and identify strategies to address them.
  • Apply word-level metrics to assess LLM performance.
  • Explain the role of benchmarks in evaluating LLMs and recognize their limitations.
  • Identify dimensions for human evaluation and apply them.
  • Describe LLM-as-a-judge techniques.
  • Outline tracking metrics for LLMs and explain their importance.
  • Compare deployment approaches for LLMs, including use cases, pros, and cons.

Intended audience

This lab is designed for:

  • Data Scientists
  • Software Developers
  • Machine Learning Engineers
  • AI Engineers
  • DevOps Engineers

Prerequisites

Completion of previous modules is highly recommended before attempting this lab.

Lab structure

Demo: Evaluating, Deploying, and Observing LLMs
In this demo, you will:
- Evaluate a grounded QA model that must answer only from a provided context.
- Generate answers and compute BLEU, ROUGE, BERTScore, and Perplexity (a code sketch follows this list).
- Apply LLM-as-a-judge to assess qualitative criteria such as Faithfulness, Truthfulness, Completeness, Relevance, Fluency, and Bias.
- Assemble a compact leaderboard and discuss how metrics complement or contradict each other.
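
As a rough illustration of the metric step, the sketch below computes BLEU, ROUGE, BERTScore, and Perplexity for a single prediction/reference pair. It assumes the `sacrebleu`, `rouge-score`, `bert-score`, and `transformers` packages and uses GPT-2 for perplexity; the lab notebook may use different libraries, models, and data.

```python
# Illustrative only: the lab notebook may use different libraries or models.
import math

import sacrebleu                                   # pip install sacrebleu
from rouge_score import rouge_scorer               # pip install rouge-score
from bert_score import score as bert_score         # pip install bert-score
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

predictions = ["The contract ends on 31 December 2024."]
references  = ["The agreement terminates on December 31, 2024."]

# BLEU (corpus-level n-gram overlap)
bleu = sacrebleu.corpus_bleu(predictions, [references])
print("BLEU:", round(bleu.score, 2))

# ROUGE-1 / ROUGE-L (recall-oriented overlap, common for summaries and QA)
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(references[0], predictions[0])
print("ROUGE-L F1:", round(rouge["rougeL"].fmeasure, 3))

# BERTScore (semantic similarity from contextual embeddings)
P, R, F1 = bert_score(predictions, references, lang="en")
print("BERTScore F1:", round(F1.mean().item(), 3))

# Perplexity of the generated answer under a small language model (GPT-2 here)
tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
enc = tok(predictions[0], return_tensors="pt")
with torch.no_grad():
    loss = lm(**enc, labels=enc["input_ids"]).loss
print("Perplexity:", round(math.exp(loss.item()), 2))
```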

Intended learning outcomes:
- Explain why evaluation depends on the task (here: grounded QA).
- Compute BLEU, ROUGE, BERTScore, and Perplexity for generated answers.
- Apply LLM-as-a-judge to evaluate multiple qualitative dimensions of answers (see the sketch after this list).
- Compare metrics and identify when they align or diverge.
- Reflect on trade-offs between automatic metrics and qualitative judgments for deployment.
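
One way to structure the LLM-as-a-judge step is to prompt a judge model for a JSON scorecard over the qualitative criteria. The sketch below is illustrative: `call_judge_model` is a hypothetical placeholder for whatever chat client the lab provides, and the prompt wording and 1-5 scale are assumptions.

```python
# Illustrative LLM-as-a-judge sketch. `call_judge_model` is a hypothetical
# placeholder for the chat-completion client used in the lab notebook.
import json

JUDGE_PROMPT = """You are an impartial evaluator of a grounded QA system.
Given the CONTEXT, QUESTION, and ANSWER, rate the answer from 1 (worst) to 5 (best)
on each criterion and reply with JSON only:
{{"faithfulness": _, "truthfulness": _, "completeness": _,
  "relevance": _, "fluency": _, "bias": _}}

CONTEXT: {context}
QUESTION: {question}
ANSWER: {answer}
"""

def call_judge_model(prompt: str) -> str:
    """Placeholder: send the prompt to whichever judge LLM the lab provides
    and return its raw text reply."""
    raise NotImplementedError

def judge_answer(context: str, question: str, answer: str) -> dict:
    reply = call_judge_model(JUDGE_PROMPT.format(
        context=context, question=question, answer=answer))
    return json.loads(reply)   # e.g. {"faithfulness": 5, "truthfulness": 4, ...}
```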

Activity: Summarization Showdown — Evaluate & Improve Summaries
In this activity, you will:
- Build and evaluate a single-sentence summarizer for short passages.
- Compare system outputs to gold summaries using BLEU, ROUGE, and BERTScore.
- Measure Perplexity for fluency.
- Use LLM-as-a-judge to grade Faithfulness, Truthfulness, Completeness, Relevance, Fluency, and Bias.
- Improve your summaries with prompt constraints and optional self-check loops (sketched after this list).
- Track progress on a mini leaderboard and reflect on metric disagreements.
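
The sketch below shows one possible shape for the constrained summarizer with a self-check loop. The prompt wording, the 25-word budget, and the `generate` helper are assumptions standing in for whatever the lab notebook provides.

```python
# Illustrative sketch of prompt constraints plus an optional self-check loop.
# `generate` is a placeholder for the lab's text-generation call.

SUMMARY_PROMPT = (
    "Summarize the passage in exactly one sentence of at most 25 words. "
    "Use only facts stated in the passage; do not add opinions or new details.\n\n"
    "Passage:\n{passage}\n\nSummary:"
)

def generate(prompt: str) -> str:
    """Placeholder for the lab's text-generation helper."""
    raise NotImplementedError

def summarize_with_self_check(passage: str, max_retries: int = 2) -> str:
    summary = generate(SUMMARY_PROMPT.format(passage=passage)).strip()
    for _ in range(max_retries):
        # Cheap structural check: a single sentence within the word budget.
        if summary.count(".") <= 1 and len(summary.split()) <= 25:
            break
        summary = generate(
            SUMMARY_PROMPT.format(passage=passage)
            + f"\n\nYour previous attempt violated the constraints:\n{summary}\n"
            "Rewrite it as a single sentence of at most 25 words."
        ).strip()
    return summary
```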

Intended learning outcomes:
- Apply BLEU, ROUGE, BERTScore, and Perplexity to measure summaries.
- Use LLM-as-a-judge to evaluate multiple qualitative dimensions.
- Prompt-engineer constraints and style guides to improve scores.
- Compare baseline vs. improved outputs using a structured leaderboard (see the sketch after this list).
- Reflect on evaluation challenges and propose monitoring strategies for deployment.
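
A minimal sketch of such a leaderboard using pandas is shown below. The system names and scores are placeholder values for illustration, not results from the lab.

```python
# Illustrative mini leaderboard. The scores below are placeholder values,
# not measurements from the lab.
import pandas as pd

rows = [
    # system,          BLEU, ROUGE-L, BERTScore, judge_avg  (placeholders)
    ("baseline",        18.2,   0.31,      0.86,       3.4),
    ("constrained",     21.7,   0.38,      0.88,       4.1),
    ("self-check loop", 22.4,   0.40,      0.89,       4.3),
]
leaderboard = pd.DataFrame(
    rows, columns=["system", "bleu", "rougeL_f1", "bertscore_f1", "judge_avg"]
)

# Sort by the judge's average rating, with BERTScore as a tie-breaker.
print(leaderboard.sort_values(["judge_avg", "bertscore_f1"], ascending=False))
```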

Hands-on Lab UUID

Lab steps

  1. Starting the Notebooks