In this lab, you will explore the transformer architecture, understand its core components, and investigate how it overcame the limitations of earlier sequence models. You will examine attention, self-attention, encoder/decoder structures, and positional encoding, as well as the strengths and limitations of transformers in real-world use cases.
Upon completion of this lab, you will be able to meet the intended learning outcomes listed under the demo and the activity below.
This lab is designed for learners working through the course in order; completion of the previous modules is highly recommended before attempting it.
Demo: Attention Under the Hood — Encoder, Decoder & Encoder–Decoder
In this demo, you will see attention in action by visualising:
- Encoder self-attention with DistilBERT (heatmaps per layer/head and the mean across heads; see the sketch after this list)
- Decoder masked self-attention with GPT-2 (causal/triangular masking)
- Encoder→decoder cross-attention with T5-small (which source tokens the decoder looks at)
- KV-caching speedups during generation to link mechanics to performance
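As a rough illustration of the encoder bullet above, the following sketch pulls per-layer, per-head self-attention weights out of DistilBERT using the Hugging Face transformers library; the example sentence and the chosen layer/head are placeholders, not the lab's prescribed inputs.

```python
# Minimal sketch: inspecting DistilBERT encoder self-attention.
# Assumes `transformers`, `torch`, and `matplotlib` are installed; the sentence
# and the chosen layer/head are illustrative only.
import torch
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased", output_attentions=True)
model.eval()

text = "The animal didn't cross the street because it was too tired."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, heads, seq_len, seq_len)
layer, head = 2, 0
attn = outputs.attentions[layer][0]              # (heads, seq_len, seq_len)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

panels = [(attn[head], f"layer {layer}, head {head}"),
          (attn.mean(dim=0), f"layer {layer}, mean over heads")]
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, (weights, title) in zip(axes, panels):
    ax.imshow(weights.numpy(), cmap="viridis")   # rows: query tokens, columns: attended tokens
    ax.set_xticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=90)
    ax.set_yticks(range(len(tokens)))
    ax.set_yticklabels(tokens)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```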
Intended learning outcomes:
- Explain queries, keys, values, and how attention weights are interpreted on heatmaps (see the scaled dot-product sketch after this list).
- Distinguish encoder, decoder (masked), and encoder–decoder (cross-attention) architectures.
- Describe causal masking and why it enforces left-to-right generation.
- Interpret how attention shifts across layers/heads (syntax vs. semantics).
- Explain what KV caching is and why it speeds up decoding.
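To ground the queries/keys/values vocabulary and the causal mask, here is a minimal, self-contained sketch of scaled dot-product attention in plain PyTorch; the tensor sizes and random inputs are illustrative, and real transformer layers add multiple heads, learned projections, dropout, and residual connections.

```python
# Minimal sketch of scaled dot-product attention with an optional causal mask.
import math
import torch

def scaled_dot_product_attention(q, k, v, causal=False):
    # q, k, v: (seq_len, d_k) for a single head and a single sequence
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (seq_len, seq_len)
    if causal:
        # Mask out upper-triangular (future) positions so every position can
        # only attend to itself and earlier positions.
        seq_len = scores.size(-1)
        future = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(future, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # these are the heatmap values
    return weights @ v, weights

torch.manual_seed(0)
seq_len, d_k = 5, 8
q = torch.randn(seq_len, d_k)   # queries: what each position is looking for
k = torch.randn(seq_len, d_k)   # keys:    what each position offers to match against
v = torch.randn(seq_len, d_k)   # values:  the content that gets mixed together

torch.set_printoptions(precision=2, sci_mode=False)
_, enc_weights = scaled_dot_product_attention(q, k, v, causal=False)
_, dec_weights = scaled_dot_product_attention(q, k, v, causal=True)
print(enc_weights)   # full square matrix (encoder-style self-attention)
print(dec_weights)   # lower-triangular matrix (decoder-style masked self-attention)
```

Each row is a probability distribution over the positions a token may attend to, which is why the rows of the masked matrix still sum to 1 after the future positions are excluded.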
Activity: Be an Attention Detective
In this activity, you will investigate real attention patterns to build intuition:
- Edit inputs with pronoun ambiguity and compare self-attention across layers/heads.
- Explore masked self-attention by changing the last tokens of a prefix.
- Inspect cross-attention over two decoding steps to see how focus shifts (see the T5 sketch below).
- Benchmark generation with/without cache for different lengths.
You will document your observations with notes and timing tables.
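For the cross-attention step of the investigation, a sketch along the following lines can reveal which source tokens the T5-small decoder attends to at each generation step; the translation prompt and the choice of the last decoder layer are assumptions for illustration, not prescribed by the activity.

```python
# Minimal sketch: which source tokens does the T5-small decoder attend to
# while generating? Assumes `transformers` and `torch` (the T5 tokenizer may
# also need `sentencepiece`); the prompt and layer choice are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

text = "translate English to German: The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")
source_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=8,
        output_attentions=True,
        return_dict_in_generate=True,
    )

# out.cross_attentions: one entry per generated token; each entry is a tuple
# with one tensor per decoder layer of shape (batch, heads, query_len, src_len).
for step in (0, 1):                                   # compare the first two decoding steps
    last_layer = out.cross_attentions[step][-1]       # last decoder layer
    weights = last_layer[0].mean(dim=0)[-1]           # mean over heads, current query position
    top = torch.topk(weights, k=3)
    attended = [source_tokens[i] for i in top.indices.tolist()]
    print(f"step {step}: most-attended source tokens -> {attended}")

print(tokenizer.decode(out.sequences[0], skip_special_tokens=True))
```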
Intended learning outcomes:
- Analyse and articulate which tokens a model attends to and why (for encoder, decoder, and cross-attention).
- Identify and explain causal masking patterns.
- Describe how cross-attention moves across steps as tokens are generated.
- Quantify the impact of KV caching and reason about throughput/latency (see the timing sketch after this list).
- Reflect on failure modes (ambiguity, long-range dependencies, head noise) and tie them to transformer limitations.
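As a starting point for the KV-caching outcome above, the timing sketch below compares GPT-2 generation with and without the key/value cache; the prompt, model, and generation lengths are placeholders, and the measured speedups will vary with hardware.

```python
# Minimal sketch: timing GPT-2 generation with and without the KV cache.
# Assumes `transformers` and `torch`; the prompt, model, and lengths are
# placeholders, and the measured speedups depend on hardware.
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "In a shocking finding, scientists discovered"
inputs = tokenizer(prompt, return_tensors="pt")

def time_generation(use_cache, max_new_tokens):
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,                       # greedy decoding for comparable runs
            use_cache=use_cache,
            pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no pad token by default
        )
    return time.perf_counter() - start

print(f"{'new tokens':>10} {'cache (s)':>10} {'no cache (s)':>12} {'speedup':>8}")
for n in (32, 64, 128):
    cached = time_generation(True, n)
    uncached = time_generation(False, n)
    print(f"{n:>10} {cached:>10.2f} {uncached:>12.2f} {uncached / cached:>7.1f}x")
```

Without the cache, every decoding step recomputes keys and values for the whole prefix, so the gap grows with sequence length; with the cache, each step only computes them for the newest token.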