Hands-on Lab

Text Analysis and LLMs with Python - Module 4

Difficulty: Intermediate
Duration: Up to 1 hour and 30 minutes

Description

Using Pre-Trained Language Models

In this lab, you will explore what pre-trained models are, how they are built, and how to apply them to classification and clustering tasks. You’ll compare representational and generative workflows and evaluate their strengths, limitations, and trade-offs.

Learning objectives

Upon completion of this lab, you will be able to:

  • Define what a pre-trained model is, and differentiate between general pre-trained models and pre-trained language models.
  • Explain the distinction between representational and generative pre-trained language models, with examples of each.
  • Describe the key stages in building a pre-trained language model.
  • Identify important factors to consider when selecting or using a pre-trained language model for a given task.
  • Apply pre-trained language models to classification tasks using both representation-based and generation-based workflows.
  • Apply pre-trained language models to clustering tasks such as topic modelling, comparing representation- and generation-based approaches.

Intended audience

This lab is designed for:

  • Data Scientists
  • Software Developers
  • Machine Learning Engineers
  • AI Engineers
  • DevOps Engineers

Prerequisites

Completion of previous modules is highly recommended before attempting this lab.

Lab structure

Demo: Representational vs Generative Classification
In this demo, you will compare two different ways of classifying text:
- A representational approach, using OpenAI embeddings (text-embedding-3-small) with a simple nearest-centroid classifier.
- A generative approach, using gpt-4o-mini with a strict label list and zero temperature.

You’ll run both methods on a slightly noisy dataset, evaluate them with accuracy and confusion matrices, and discuss the trade-offs between scalability, speed, and flexibility.
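
As a concrete reference, here is a minimal sketch of the representational workflow, assuming the openai Python SDK with OPENAI_API_KEY set in the environment and a tiny made-up two-label dataset (the lab's own dataset and labels differ):

    import numpy as np
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Tiny illustrative training set; the lab supplies its own noisy data.
    train = [("the keeper saved a late penalty", "sports"),
             ("shares fell after the earnings call", "finance"),
             ("the striker signed a new contract", "sports"),
             ("the central bank raised rates again", "finance")]

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    X = embed([text for text, _ in train])
    labels = sorted({label for _, label in train})
    # Nearest-centroid: one mean embedding per label.
    centroids = {y: X[[i for i, (_, l) in enumerate(train) if l == y]].mean(axis=0)
                 for y in labels}

    def classify(text):
        v = embed([text])[0]
        # Cosine similarity against each centroid; the closest label wins.
        sims = {y: float(v @ c) / (np.linalg.norm(v) * np.linalg.norm(c))
                for y, c in centroids.items()}
        return max(sims, key=sims.get)

    print(classify("midfielder scores twice in the derby"))  # expected: sports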

Intended learning outcomes:
- Build and evaluate a classifier using OpenAI embeddings and centroids.
- Write constrained prompts for consistent generative classification.
- Compare performance between representational and generative methods.
- Explain trade-offs like accuracy vs cost, stability vs flexibility, and when to choose each approach.
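
For the generative side, a hedged sketch of the constrained-prompt pattern the demo describes: gpt-4o-mini at temperature 0, with the reply restricted to a fixed label list (the labels here are illustrative):

    from openai import OpenAI

    client = OpenAI()
    LABELS = ["sports", "finance"]  # illustrative; use the lab's label set

    def classify_generative(text):
        prompt = (f"Classify the text into exactly one of {LABELS}. "
                  "Reply with the label only, nothing else.\n\n"
                  f"Text: {text}")
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            temperature=0,  # minimizes sampling variance for stable labels
            messages=[{"role": "user", "content": prompt}],
        )
        answer = resp.choices[0].message.content.strip().lower()
        # Reject anything outside the allowed list rather than guessing.
        return answer if answer in LABELS else None

    print(classify_generative("midfielder scores twice in the derby"))

Both methods can then be scored the same way, for example with accuracy_score and confusion_matrix from scikit-learn's sklearn.metrics.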

Activity: Representational vs Generative Clustering
In this activity, you will explore two different approaches to clustering an unlabeled text corpus:
- A representational approach, using OpenAI embeddings + k-means (k=5).
- A generative approach, using gpt-4o-mini to group and name clusters directly.

You’ll then compare the results with metrics such as the silhouette score, normalized mutual information (NMI), and the adjusted Rand index (ARI), and reflect on the strengths and weaknesses of each method.
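
A minimal sketch of the representational pipeline, assuming scikit-learn for k-means and the metrics, and a placeholder corpus (the lab supplies its own documents):

    import numpy as np
    from openai import OpenAI
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    client = OpenAI()

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    # Placeholder corpus; the lab's unlabeled dataset replaces this.
    docs = ["stocks rallied on strong earnings", "bonds slipped as yields rose",
            "the cup final went to penalties", "transfer window rumours swirl",
            "new phone launches next month", "chip shortage hits carmakers",
            "election results due tonight", "parliament passes the budget",
            "storm warning issued for the coast", "heatwave to continue all week"]

    X = embed(docs)
    km = KMeans(n_clusters=5, n_init=10, random_state=42)
    rep_labels = km.fit_predict(X)

    # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
    print("silhouette:", silhouette_score(X, rep_labels))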

Intended learning outcomes:
- Transform raw text into embeddings, run k-means, and interpret clustering quality with silhouette scores and PCA.
- Prompt an LLM to perform clustering and return consistent assignments with names.
- Quantitatively compare representational vs generative clustering, and explain when to prefer one approach over the other.
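
For the quantitative comparison, NMI and ARI both measure agreement between two labelings while ignoring how the cluster IDs happen to be numbered, which makes them suitable for comparing k-means output against LLM-assigned groups. A small sketch with made-up assignments:

    from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

    # Illustrative assignments: k-means IDs vs LLM cluster names mapped to ints.
    rep_labels = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
    gen_labels = [1, 1, 0, 0, 2, 2, 4, 4, 3, 3]

    # Both scores are 1.0 here because the labelings agree up to renaming.
    print("NMI:", normalized_mutual_info_score(rep_labels, gen_labels))
    print("ARI:", adjusted_rand_score(rep_labels, gen_labels))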

Lab steps

  1. Starting the Notebooks