Hands-on Lab

Text Analysis and LLMs with Python - Module 7

Difficulty: Intermediate
Duration: Up to 2 hours and 30 minutes
Students: 2
Get guided in a real environment: practice with a step-by-step scenario in a real, provisioned environment.
Learn and validate: use validations to check your solutions every step of the way.
See results: track your knowledge and monitor your progress.

Description

Training and Fine-Tuning Language Models

In this lab, you will learn why training or fine-tuning a language model may be necessary for specific use cases, how embedding models are trained, and how different fine-tuning methods align models with desired behaviours. You’ll explore semantic search, re-ranking, and practical tuning choices through demos and a hands-on activity.

Learning objectives

Upon completion of this lab, you will be able to:

  • Describe why training or fine-tuning a language model may be necessary for specific use cases.
  • Describe what embedding models do and how they create meaningful representations in vector space.
  • Outline the process of training embedding models, including data preparation, architecture choice, and contrastive learning approaches.
  • Distinguish between the three stages of LLM training: pre-training, supervised fine-tuning, and preference tuning.
  • Investigate continued pre-training and when it is useful for domain adaptation.
  • Describe masked language modeling and its role in LLM training.
  • Compare supervised fine-tuning and preference tuning, and explain how each aligns models with desired behaviour.

Intended audience

This lab is designed for:

  • Data Scientists
  • Software Developers
  • Machine Learning Engineers
  • AI Engineers
  • DevOps Engineers

Prerequisites

Completion of previous modules is highly recommended before attempting this lab.

Lab structure

Demo: Train & Fine-Tune with OpenAI — Embeddings, Re-Ranking, and Tuning Paths
In this demo, you will:
- Explore bi-encoder style semantic similarity with OpenAI embeddings.
- Use in-batch negatives and inspect similarity matrices to build intuition (see the sketch after this list).
- Apply cross-encoder re-ranking with a chat model for precision.
- Prepare a small JSONL dataset for supervised fine-tuning (SFT) and launch a fine-tune job.
- Position continued pre-training (MLM), SFT, and preference tuning in context.
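
Below is a minimal sketch of the bi-encoder step, assuming the `openai` Python SDK (v1+) is installed and an `OPENAI_API_KEY` is set in the environment; the embedding model name and example texts are placeholders, and the lab notebook may use different helpers.

```python
# Sketch: embed a few queries and documents, then inspect the full
# similarity matrix -- the diagonal holds the matching (positive) pairs,
# while the off-diagonal entries act as in-batch negatives.
import numpy as np
from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment

client = OpenAI()

queries = ["How do I reset my password?", "What is your refund policy?"]
documents = [
    "Visit the account settings page to reset your password.",
    "Refunds are issued within 14 days of purchase.",
]

def embed(texts, model="text-embedding-3-small"):  # model name is an assumption
    response = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in response.data])

q_vecs, d_vecs = embed(queries), embed(documents)

# Cosine similarity: L2-normalise the rows, then take the dot product.
q_norm = q_vecs / np.linalg.norm(q_vecs, axis=1, keepdims=True)
d_norm = d_vecs / np.linalg.norm(d_vecs, axis=1, keepdims=True)
similarity_matrix = q_norm @ d_norm.T

print(np.round(similarity_matrix, 3))  # the diagonal should dominate each row
```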

Intended learning outcomes:
- Explain Siamese/Twin (bi-encoder) embeddings and measure cosine similarity.
- Describe in-batch negatives and analyse a similarity matrix.
- Build a cross-encoder re-ranker by asking a chat model to score pair similarity (see the sketch after this list).
- Prepare a small dataset and understand the pipeline to launch a fine-tune job.
- Differentiate continued pre-training, SFT, and preference tuning.
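
As a rough illustration of the re-ranking step, the sketch below asks a chat model to score each query-document pair jointly; the model name, prompt wording, and 0-10 scale are assumptions rather than the lab's exact setup.

```python
# Sketch: cross-encoder style re-ranking -- score each (query, candidate)
# pair with a chat model, then sort the candidates by that score.
from openai import OpenAI

client = OpenAI()

def score_pair(query, candidate, model="gpt-4o-mini"):  # model name is an assumption
    prompt = (
        "Rate how relevant the document is to the query on a scale of 0 to 10. "
        "Reply with a single number only.\n\n"
        f"Query: {query}\nDocument: {candidate}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    try:
        return float(response.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # fall back if the model does not return a bare number

query = "How do I reset my password?"
candidates = [
    "Refunds are issued within 14 days of purchase.",
    "Visit the account settings page to reset your password.",
]
reranked = sorted(candidates, key=lambda c: score_pair(query, c), reverse=True)
print(reranked[0])  # the password-reset document should rank first
```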

Activity: Semantic Search + Re-Ranking + Tuning Choices
In this activity, you will:
- Build a mini FAQ semantic search system using bi-encoder embeddings for retrieval.
- Apply cross-encoder re-ranking with a chat model to improve results.
- Prepare a small supervised fine-tuning dataset to enforce strict output formatting (a dataset sketch follows this list).
- Reflect on when to use continued pre-training, SFT, or preference tuning in real-world adaptation.
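
The sketch below shows one way to prepare such a dataset, assuming OpenAI's chat-format JSONL for supervised fine-tuning; the file name, toy examples, and base model are placeholders, a real job needs a much larger dataset, and the lab may stop before actually launching a fine-tune.

```python
# Sketch: write a tiny chat-format JSONL file that enforces a strict JSON
# answer format, then upload it and launch a supervised fine-tuning job.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM = 'Answer only with JSON of the form {"answer": ...}.'
examples = [
    {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": '{"answer": "Paris"}'},
    ]},
    {"messages": [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": "What is 2 + 2?"},
        {"role": "assistant", "content": '{"answer": "4"}'},
    ]},
]

with open("sft_dataset.jsonl", "w") as f:  # file name is a placeholder
    for example in examples:
        f.write(json.dumps(example) + "\n")

# Upload the file and start the job (OpenAI requires more examples than this
# toy set, so treat these calls as an outline of the pipeline).
uploaded = client.files.create(file=open("sft_dataset.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=uploaded.id,
    model="gpt-4o-mini-2024-07-18",  # base model is an assumption
)
print(job.id, job.status)
```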

Intended learning outcomes:
- Implement bi-encoder retrieval with embeddings.
- Explain in-batch negatives with a similarity matrix.
- Apply cross-encoder re-ranking for precision.
- Prepare a small dataset for supervised fine-tuning.
- Reflect on the role of continued pre-training, SFT, and preference tuning.
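
To ground the masked language modelling (MLM) objective that underpins continued pre-training, here is a minimal fill-mask illustration; it assumes the Hugging Face `transformers` library and a BERT checkpoint, which are not necessarily part of this lab's environment.

```python
# Sketch: masked language modelling -- the training objective behind
# continued pre-training of encoder models such as BERT.
from transformers import pipeline  # assumes transformers and a backend (e.g. PyTorch)

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("Fine-tuning adapts a pretrained [MASK] to a new task."):
    print(round(prediction["score"], 3), prediction["token_str"])
```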

Lab steps

  1. Starting the Notebooks