In this lab, you will learn how raw text is transformed into tokens and embeddings for large language models. You’ll explore different tokenization strategies, generate embeddings, and see how contextual embeddings capture meaning based on surrounding words.
Upon completion of this lab, you will be able to meet the intended learning outcomes listed under each part below. Completion of the previous modules is highly recommended before attempting this lab.
Demo: From Text → Tokens → Embeddings
In this demo, you will:
- Compare tokenization across word-level, character-level, subword (WordPiece/SentencePiece), and byte-level BPE schemes (see the sketch after this list).
- Generate sentence embeddings using OpenAI models and explore semantic similarity.
- Inspect contextualized token embeddings with BERT to see how meaning depends on context.
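Below is a minimal sketch of the kind of comparison the demo walks through. It assumes the Hugging Face `transformers` library is installed; the checkpoints `bert-base-uncased` (WordPiece) and `gpt2` (byte-level BPE) are illustrative choices, and the sample sentence is made up.

```python
# Minimal sketch: compare word, character, subword, and byte-level tokenization.
# Assumes `transformers` is installed; model names are illustrative choices.
from transformers import AutoTokenizer

text = "Tokenization of rare words like 'electroencephalography' isn't trivial 🙂"

# Word-level: naive whitespace split (rare words become out-of-vocabulary).
word_tokens = text.split()

# Character-level: every character is a token (tiny vocab, very long sequences).
char_tokens = list(text)

# Subword (WordPiece, used by BERT) and byte-level BPE (used by GPT-2).
wordpiece = AutoTokenizer.from_pretrained("bert-base-uncased")
byte_bpe = AutoTokenizer.from_pretrained("gpt2")

print("word      :", word_tokens)
print("character :", char_tokens[:20], "...")
print("WordPiece :", wordpiece.tokenize(text))
print("byte BPE  :", byte_bpe.tokenize(text))

# A SentencePiece tokenizer (e.g. "xlnet-base-cased") can be loaded the same way
# if the `sentencepiece` package is installed.
```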
Intended learning outcomes:
- Explain what tokens are and why tokenizers differ in practice.
- Compare word/character/subword/byte tokenization on tricky inputs.
- Compute sentence embeddings and use cosine similarity for retrieval.
- Show that a word has different vectors in different contexts.
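One way to see the last outcome in code: the sketch below (assuming `transformers` and `torch` are installed, with `bert-base-uncased` as an illustrative checkpoint) extracts BERT's hidden state for the word "bank" in two different sentences and compares the two vectors.

```python
# Minimal sketch: the same word gets different contextual vectors from BERT.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

sentences = [
    "She deposited cash at the bank.",
    "They had a picnic on the river bank.",
]

def bank_vector(sentence):
    # Encode the sentence and pull out the hidden state for the token "bank".
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # shape: (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v1, v2 = bank_vector(sentences[0]), bank_vector(sentences[1])
cosine = torch.nn.functional.cosine_similarity(v1, v2, dim=0).item()
# The similarity is well below 1.0, showing the vectors depend on context.
print(f"cosine similarity between the two 'bank' vectors: {cosine:.3f}")
```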
Activity: Tokenize, Embed, and Decide
In this activity, you will:
- Compare tokenizers on diverse text samples.
- Build a tiny semantic search with OpenAI embeddings (see the sketch after this list).
- Demonstrate contextualized embeddings with BERT.
- Choose a tokenizer for a use case and defend your choice.
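A minimal sketch of the semantic-search step, assuming the OpenAI Python SDK (v1+) with an `OPENAI_API_KEY` set in the environment; `text-embedding-3-small` is one available embedding model, and the documents and query are made-up examples.

```python
# Minimal sketch: tiny semantic search with OpenAI embeddings + cosine similarity.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

docs = [
    "How to reset your password",
    "Quarterly revenue grew by 12 percent",
    "Steps for recovering a locked account",
]
query = "I forgot my login credentials"

def embed(texts):
    # Returns one embedding vector per input string.
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

doc_vecs = embed(docs)
query_vec = embed([query])[0]

# Rank documents by similarity to the query; the password/account docs should rank highest.
ranked = sorted(zip(docs, doc_vecs), key=lambda p: cosine(query_vec, p[1]), reverse=True)
for doc, vec in ranked:
    print(f"{cosine(query_vec, vec):.3f}  {doc}")
```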
Intended learning outcomes:
- Diagnose tokenizer behaviour on domain-specific text.
- Retrieve semantically similar texts using embeddings + cosine similarity.
- Show that context matters for token embeddings.
- Make and justify a tokenizer choice using evidence.
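As one form of evidence for the last two outcomes, the sketch below (again assuming `transformers`, with illustrative checkpoints and made-up domain snippets) compares tokens-per-word across tokenizers; a lower ratio on your domain text usually indicates better vocabulary coverage and shorter sequences.

```python
# Minimal sketch: gather quantitative evidence for a tokenizer choice
# by measuring tokens-per-word on domain-specific samples.
from transformers import AutoTokenizer

samples = {
    "clinical": "Patient presented with tachycardia and hyperkalemia post-op.",
    "code":     "def cosine_sim(a, b): return dot(a, b) / (norm(a) * norm(b))",
}

tokenizers = {
    "WordPiece (bert-base-uncased)": AutoTokenizer.from_pretrained("bert-base-uncased"),
    "byte BPE  (gpt2)":              AutoTokenizer.from_pretrained("gpt2"),
}

for domain, text in samples.items():
    words = len(text.split())
    print(f"\n{domain}: {words} whitespace words")
    for name, tok in tokenizers.items():
        tokens = tok.tokenize(text)
        print(f"  {name}: {len(tokens)} tokens ({len(tokens) / words:.2f} tokens/word)")
```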