Hands-on Lab

Text Analysis and LLMs with Python - Module 3

Difficulty: Intermediate
Duration: Up to 1 hour and 30 minutes
On average, students complete this lab in 10 minutes.

Description

Tokens and Embeddings

In this lab, you will learn how raw text is transformed into tokens and embeddings for large language models. You’ll explore different tokenization strategies, generate embeddings, and see how contextual embeddings capture meaning based on surrounding words.

Learning objectives

Upon completion of this lab, you will be able to:

  • Define tokens and explain the role of tokenization in preparing text for large language models.
  • Compare the four main types of tokenization — word, character, subword, and byte — including their pros and cons.
  • Identify scenarios where each tokenization strategy might be most appropriate.
  • Describe what embeddings are and how they represent the meaning of tokens in vector space.
  • Distinguish between static and contextual embeddings and explain their evolution.
  • Discuss how embeddings help models capture semantic and syntactic relationships between words and sentences.

Intended audience

This lab is designed for:

  • Data Scientists
  • Software Developers
  • Machine Learning Engineers
  • AI Engineers
  • DevOps Engineers

Prerequisites

Completion of previous modules is highly recommended before attempting this lab.

Lab structure

Demo: From Text → Tokens → Embeddings
In this demo, you will:
- Compare tokenization across word, character, subword (WordPiece/SentencePiece), and byte-level BPE (a comparison sketch follows this list).
- Generate sentence embeddings using OpenAI models and explore semantic similarity.
- Inspect contextualized token embeddings with BERT to see how meaning depends on context.
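The following is a minimal sketch of that comparison, assuming the Hugging Face `transformers` package is installed; the model names (`bert-base-uncased` for WordPiece, `gpt2` for byte-level BPE) and the sample sentence are illustrative, not the lab's exact materials:

```python
# Tokenization comparison sketch. Assumes Hugging Face `transformers` is
# installed; model names and the sample sentence are illustrative.
from transformers import AutoTokenizer

text = "Tokenizers often struggle with words like 'antidisestablishmentarianism'."

# Word-level: naive whitespace split (punctuation stays attached)
word_tokens = text.split()

# Character-level: one token per character (tiny vocabulary, long sequences)
char_tokens = list(text)

# Subword: WordPiece, as used by BERT
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")

# Byte-level BPE, as used by GPT-2 (never produces an unknown token)
byte_bpe = AutoTokenizer.from_pretrained("gpt2")

for name, tokens in [
    ("word", word_tokens),
    ("character", char_tokens),
    ("WordPiece", wordpiece.tokenize(text)),
    ("byte-level BPE", byte_bpe.tokenize(text)),
]:
    print(f"{name:>15} | {len(tokens):3d} tokens | {tokens[:10]}")
```

Note how the subword and byte-level tokenizers split the rare word into smaller, recoverable pieces, while the naive word split leaves punctuation attached and the character split produces very long sequences.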

Intended learning outcomes:
- Explain what tokens are and why tokenizers differ in practice.
- Compare word/character/subword/byte tokenization on tricky inputs.
- Compute sentence embeddings and use cosine similarity for retrieval (sketched after this list).
- Show that a word has different vectors in different contexts.
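As a rough illustration of the retrieval outcome above, here is a tiny semantic-search sketch. It assumes the `openai` package (v1+) and an `OPENAI_API_KEY` environment variable; the corpus, query, and model name are illustrative choices, not the lab's exact setup:

```python
# Tiny semantic-search sketch. Assumes the `openai` package and an
# OPENAI_API_KEY environment variable; corpus, query, and model name
# are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

corpus = [
    "The bank approved my mortgage application.",
    "We had a picnic on the river bank.",
    "Interest rates on home loans rose this quarter.",
]
query = "getting a home loan approved"

# Embed the corpus and the query in one request
resp = client.embeddings.create(model="text-embedding-3-small", input=corpus + [query])
vectors = np.array([item.embedding for item in resp.data])
docs, q = vectors[:-1], vectors[-1]

# Cosine similarity: dot product of L2-normalized vectors
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
q = q / np.linalg.norm(q)
scores = docs @ q

# Rank corpus texts by similarity to the query
for i in np.argsort(scores)[::-1]:
    print(f"{scores[i]:.3f}  {corpus[i]}")
```

The mortgage and interest-rate sentences should rank above the picnic sentence, even though none of them shares exact wording with the query.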

Activity: Tokenize, Embed, and Decide
In this activity, you will:
- Compare tokenizers on diverse text samples.
- Build a tiny semantic search with OpenAI embeddings.
- Demonstrate contextualized embeddings with BERT (see the sketch after this list).
- Choose a tokenizer for a use case and defend your choice.
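One possible sketch of the BERT demonstration, assuming `transformers` and `torch` are installed; the sentences and the target word "bank" are illustrative:

```python
# Contextual-embedding sketch with BERT. Assumes `transformers` and `torch`;
# the sentences and target word are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def vector_for(sentence: str, word: str) -> torch.Tensor:
    """Return the contextual vector of `word`'s first occurrence in `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index(word)]

v_finance = vector_for("i deposited the money at the bank", "bank")
v_river = vector_for("we sat on the bank of the river", "bank")

# Same surface word, different contexts -> noticeably different vectors
sim = torch.cosine_similarity(v_finance, v_river, dim=0)
print(f"cosine similarity between the two 'bank' vectors: {sim.item():.3f}")
```

A similarity noticeably below 1.0 shows that BERT assigns the same surface word different vectors depending on context, which a static embedding table cannot do.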

Intended learning outcomes:
- Diagnose tokenizer behaviour on domain-specific text (a diagnostic sketch follows this list).
- Retrieve semantically similar texts using embeddings + cosine similarity.
- Show that context matters for token embeddings.
- Make and justify a tokenizer choice using evidence.
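Here is a sketch of the kind of evidence you might gather for that decision, assuming `transformers`; the candidate tokenizers and domain samples are illustrative stand-ins:

```python
# Tokenizer-diagnosis sketch. Assumes `transformers`; the candidate
# tokenizers and domain samples are illustrative stand-ins.
from transformers import AutoTokenizer

domain_samples = [
    "df.groupby('user_id')['revenue'].agg(['sum', 'mean'])",       # code
    "The patient presented with hyperlipidemia and tachycardia.",  # clinical
    "Dziękuję bardzo za szybką odpowiedź!",                        # non-English
]

candidates = ["bert-base-uncased", "gpt2", "xlm-roberta-base"]

for name in candidates:
    tok = AutoTokenizer.from_pretrained(name)
    # Fewer tokens per sample generally means cheaper inference and more
    # effective context window for the same amount of text.
    counts = [len(tok.tokenize(s)) for s in domain_samples]
    print(f"{name:>18}: {counts}")
```

Token counts like these give concrete evidence for the trade-off: a tokenizer that fragments your domain's vocabulary into many pieces will cost more per request and leave less room in the context window.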

Lab steps

  1. Starting the Notebooks