Module 2: The Transformer Architecture
Master the Transformer: self-attention, multi-head projections, positional encodings such as RoPE, Pre-LN and Post-LN layer normalization, and model families such as BERT, GPT, and T5.
2.1 Attention Mechanism
In 2017, researchers at Google published the paper "Attention Is All You Need", introducing the Transformer model. Its core idea is to build the whole model around the self-attention mechanism instead of recurrence, which lets the model process a sequence in parallel and relate any two tokens directly, regardless of the distance between them.
The Problem with RNNs
Earlier sequence models (such as RNNs and LSTMs) process text sequentially, one token at a time. Information from distant tokens has to survive many recurrent steps, so long-range context is easily lost, and training cannot be parallelized across the positions of a sequence. Transformers avoid both problems by attending to all tokens at once and computing attention weights between every pair of tokens.
A. Query, Key, and Value Vectors (Q, K, V)
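To understand attention, think of a database lookup. When you submit a search query ($Q$) to a database, it compares the query against the stored keys ($K$) and returns the value ($V$) attached to the best match. Attention is a soft version of this lookup: instead of returning a single value, it returns a weighted blend of all values, where each weight reflects how well the corresponding key matches the query.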
In self-attention, each input token's embedding vector is projected into three distinct vectors using trained weight matrices:
- Query Vector ($Q$): Represents the current token as it looks for context.
- Key Vector ($K$): Represents what each token offers for matching. A query is compared against the keys of all tokens in the sequence, including its own.
- Value Vector ($V$): Represents the content a token contributes to the output once it is attended to.
Mathematically, for an input embedding matrix $X$, we multiply by the trained weight matrices $W_Q$, $W_K$, and $W_V$:
$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$
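As a minimal sketch of this projection step, the NumPy snippet below builds a small embedding matrix and multiplies it by three weight matrices. The sizes (`seq_len`, `d_model`, `d_k`) and the random weights are illustrative assumptions standing in for trained parameters, and NumPy is used only to keep the example lightweight.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 3  # illustrative sizes, not taken from any real model

# X holds one embedding row per token.
X = rng.normal(size=(seq_len, d_model))

# Random stand-ins for the trained projection matrices W_Q, W_K, W_V.
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Every token row is projected three times, once per matrix.
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

print(Q.shape, K.shape, V.shape)  # (4, 3) (4, 3) (4, 3)
```

Each row of `Q`, `K`, and `V` belongs to the same token as the corresponding row of `X`.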
B. Scaled Dot-Product Attention
Once we have the $Q$, $K$, and $V$ matrices, we compute the attention output using the scaled dot-product formula:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$$
Let's break this equation down step-by-step:
- Calculate Dot-Product ($Q K^T$): Computes similarity scores between every token query and all keys. For a sequence of length $L$, this outputs an $L \times L$ attention grid.
- Scale by Key Dimension ($\sqrt{d_k}$): If the vector dimension ($d_k$) is large, the dot products can grow large in magnitude: for query and key components with unit variance, the variance of $q \cdot k$ is $d_k$. Large scores push the softmax into a saturated region where its gradients are extremely small. Dividing by $\sqrt{d_k}$ brings the scores back to roughly unit variance.
- Apply Softmax: Converts each row of raw similarity scores into weights between 0.0 and 1.0 that sum to 1.0, so every query token gets a probability distribution over the keys.
- Multiply by Values ($V$): Computes, for each token, a weighted sum of the value vectors using the softmax weights, focusing the token's new representation on the most relevant context.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: (..., seq_len, d_k); mask must broadcast to (..., seq_len, seq_len)
    d_k = Q.size(-1)
    # Similarity of every query with every key, scaled by sqrt(d_k)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # Positions where mask == 0 get a large negative score, so softmax gives them ~0 weight
        scores = scores.masked_fill(mask == 0, -1e9)
    attention_weights = F.softmax(scores, dim=-1)  # each row sums to 1
    output = torch.matmul(attention_weights, V)
    return output, attention_weights

# Simulate inputs: batch_size=1, seq_len=4, dim=64
Q = torch.randn(1, 4, 64)
K = torch.randn(1, 4, 64)
V = torch.randn(1, 4, 64)
out, weights = scaled_dot_product_attention(Q, K, V)
print("Attention weights shape:", weights.shape)  # torch.Size([1, 4, 4])
print("Contextual output shape:", out.shape)      # torch.Size([1, 4, 64])
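To see why the $\sqrt{d_k}$ scaling matters, the following pure-Python sketch estimates the spread of dot products between random vectors. It assumes query and key components drawn independently from a standard normal distribution, which is an idealization rather than a property of trained models; `dot_product_std` is a helper name introduced for this illustration.

```python
import math
import random

random.seed(0)

def dot_product_std(d_k, trials=2000):
    """Empirical standard deviation of q . k for unit-variance random vectors."""
    dots = []
    for _ in range(trials):
        q = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [random.gauss(0.0, 1.0) for _ in range(d_k)]
        dots.append(sum(a * b for a, b in zip(q, k)))
    mean = sum(dots) / trials
    return math.sqrt(sum((x - mean) ** 2 for x in dots) / trials)

for d_k in (4, 64, 256):
    raw = dot_product_std(d_k)
    scaled = raw / math.sqrt(d_k)
    print(f"d_k={d_k:4d}  std(q.k)={raw:6.2f}  std(q.k / sqrt(d_k))={scaled:.2f}")
```

Under these assumptions the unscaled spread grows like $\sqrt{d_k}$, while the scaled spread stays near 1 for every dimension.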
C. Self-Attention vs. Cross-Attention
There are two primary configurations of attention layers in deep architectures:
- Self-Attention: Queries, Keys, and Values are all computed from the same input sequence. It captures contextual links within a single sequence.
- Cross-Attention: Queries come from the target sequence (decoder), while Keys and Values come from the source sequence (encoder). This connects the inputs and outputs, which is crucial in machine translation or image caption generation.
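The difference between the two configurations is only where $Q$, $K$, and $V$ come from, which a short NumPy sketch can make concrete. The `attention` helper, the sequence lengths, and the random `source` and `target` arrays are assumptions for illustration, and the learned projections are omitted to keep the focus on shapes.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention for 2D arrays of shape (length, d_k)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d_k = 16
source = rng.normal(size=(6, d_k))  # e.g. encoder states for 6 source tokens
target = rng.normal(size=(3, d_k))  # e.g. decoder states for 3 target tokens

# Self-attention: Q, K, and V all come from the same sequence.
self_out, self_w = attention(source, source, source)

# Cross-attention: Q comes from the target, K and V come from the source.
cross_out, cross_w = attention(target, source, source)

print(self_w.shape)   # (6, 6)
print(cross_w.shape)  # (3, 6)
```

The cross-attention weight matrix has one row per target token and one column per source token, so each target position receives a blend of source values.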
Complexity Constraint
Because self-attention compares every token with every other token, computing $Q K^T$ takes $O(N^2)$ time and, if the score matrix is stored, $O(N^2)$ memory (where $N$ is the sequence length). This is why training LLMs on very long contexts (100k+ tokens) is computationally demanding, and it motivated memory-efficient implementations such as FlashAttention.
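As a rough worked example of the quadratic growth, the sketch below computes the memory needed to store a single $N \times N$ score matrix. It assumes 2 bytes per score (fp16) and counts one matrix only, for one head in one layer; `attention_matrix_gib` is a hypothetical helper name.

```python
def attention_matrix_gib(seq_len, bytes_per_score=2):
    """Memory for one seq_len x seq_len score matrix, in GiB (fp16 by default)."""
    return seq_len * seq_len * bytes_per_score / 2**30

for n in (1_000, 10_000, 100_000):
    print(f"N={n:>7,}  scores={n * n:>15,}  memory={attention_matrix_gib(n):8.3f} GiB")
```

Multiplying the sequence length by 10 multiplies the number of scores by 100, which is why a single matrix at 100,000 tokens already needs roughly 18.6 GiB under these assumptions.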
2.2 Multi-Head Attention
Rather than computing attention once with a single set of Query, Key, and Value projections, multi-head attention projects the input into $h$ lower-dimensional sets of queries, keys, and values (the heads) and runs attention on each head in parallel. Typically each head has dimension $d_k = d_{model} / h$, so the total cost is similar to that of a single full-dimensional head.
Why multiple heads? Each head can attend to a different kind of relationship at the same time. For example, one head might track grammatical relationships (such as linking a noun to its verb), while another resolves references to entities mentioned earlier.
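Mathematically, we project the $d_{model}$-dimensional queries, keys, and values $h$ times, each time with its own learned projection weights:
$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V), \quad i = 1, \dots, h$$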
We then concatenate all heads and multiply by an output projection matrix $W^O$:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\, W^O$$
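The per-head projection, attention, concatenation, and output projection described above can be sketched in NumPy as follows. This is one common way to organize the computation, assumed here for illustration: the $h$ per-head projection matrices are packed side by side into single `d_model x d_model` matrices and separated by reshaping. The function names, sizes, and random weights are illustrative.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads):
    """Multi-head self-attention for one sequence X of shape (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_Q), split_heads(X @ W_K), split_heads(X @ W_V)

    # One (seq_len, seq_len) score matrix per head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores)
    heads = weights @ V  # (num_heads, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model), then apply W_O.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_O, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
X = rng.normal(size=(seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, num_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

Each of the 4 heads produces its own 5 x 5 attention pattern, while the output keeps the input shape so that blocks can be stacked.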
2.3 Transformer Block Deep Dive
A complete Transformer block contains several components beyond attention:
A. Positional Encodings
Since self-attention processes all tokens simultaneously, it has no native concept of word order. Without positional encodings, the sentences "Dog bites man" and "Man bites dog" would give each word exactly the same representation in both sentences, so the model could not tell them apart.
- Sinusoidal Encodings: Add fixed sine and cosine values of different frequencies, determined by each token's absolute position, to the input embeddings.
- RoPE (Rotary Position Embedding): Rotates each pair of Query and Key dimensions, viewed as a point in the 2D plane, by an angle proportional to the token's position. The dot product between a rotated query and key then depends only on their relative distance. It is used in modern LLMs (LLaMA, Gemma).
- ALiBi (Attention with Linear Biases): Subtracts a penalty proportional to the distance between query and key, with a fixed slope per head, directly from the attention scores before the softmax.
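The three schemes can be sketched in a few lines of NumPy. The function names are introduced for this example, `base=10000` is the value commonly used for both sinusoidal encodings and RoPE, the RoPE helper pairs adjacent dimensions (implementations differ on how they pair them), and the ALiBi slope is an arbitrary illustrative value rather than the per-head schedule used in practice.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model, base=10000.0):
    """Fixed sin/cos table of shape (seq_len, d_model), added to the embeddings."""
    pos = np.arange(seq_len)[:, None]
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freqs)
    pe[:, 1::2] = np.cos(pos * freqs)
    return pe

def rope(x, position, base=10000.0):
    """Rotate adjacent pairs of a 1D vector x by angles proportional to position."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)  # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

def alibi_bias(seq_len, slope):
    """Bias -slope * |i - j|, added to the attention scores before the softmax."""
    pos = np.arange(seq_len)
    return -slope * np.abs(pos[:, None] - pos[None, :])

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# RoPE: the same offset (3) at different absolute positions gives the same score.
score_a = rope(q, 5) @ rope(k, 2)
score_b = rope(q, 105) @ rope(k, 102)
print(np.isclose(score_a, score_b))  # True

print(sinusoidal_encoding(6, 8).shape)  # (6, 8)
print(alibi_bias(4, slope=0.5))
```

The RoPE check shows the relative-distance property directly: shifting both positions by 100 leaves the query-key score unchanged.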
B. Layer Normalization: Pre-LN vs. Post-LN
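Post-LN: Normalization occurs after the residual addition:
$$x_{l+1} = \text{LayerNorm}(x_l + \text{Sublayer}(x_l))$$
Used in the original 2017 paper. At initialization, gradients near the output layer can grow very large, so training typically requires learning-rate warmup.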
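Pre-LN: Normalization occurs before each sub-layer, inside the residual branch:
$$x_{l+1} = x_l + \text{Sublayer}(\text{LayerNorm}(x_l))$$
This lets gradients flow directly through the residual shortcuts, which stabilizes training in deep models. Modern models such as GPT-3 and LLaMA use Pre-LN (LLaMA with RMSNorm in place of LayerNorm).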
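A minimal NumPy sketch of the two orderings, assuming a layer norm without learned scale and shift and a hypothetical linear `sublayer` standing in for attention or the FFN:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_step(x, sublayer):
    # Post-LN: add the residual first, then normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_ln_step(x, sublayer):
    # Pre-LN: normalize only the sub-layer input; the residual path is untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1

def sublayer(h):
    # Hypothetical stand-in for an attention or feed-forward sub-layer.
    return h @ W

x = rng.normal(size=(4, 8)) * 5.0
post = post_ln_step(x, sublayer)
pre = pre_ln_step(x, sublayer)

print(post.std(axis=-1))  # every row is rescaled to a spread close to 1
print(np.abs(pre - x).max())  # Pre-LN output is x plus the sub-layer update
```

In the Pre-LN variant the input `x` reaches the output through a plain addition, which is the direct path for gradients that the text describes.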
C. Feed-Forward Networks & Residual Shortcuts
Each block contains a Position-wise Feed-Forward Network (FFN), consisting of two linear layers with a nonlinear activation in between (ReLU in the original paper), applied to every token independently. Residual shortcuts are added around both the attention and FFN sub-layers, which helps mitigate vanishing gradients in deep stacks.
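As a sketch under illustrative assumptions (ReLU activation, random weights, and `d_ff = 4 * d_model`, the ratio used in the original Transformer), the FFN and its residual shortcut might look like this in NumPy:

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: the same two linear layers are applied to every token row."""
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU activation
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32

W1, b1 = rng.normal(size=(d_model, d_ff)) * 0.1, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * 0.1, np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))
y = x + feed_forward(x, W1, b1, W2, b2)  # residual shortcut around the FFN

# "Position-wise": running a single token through the FFN alone gives the
# same row as running it together with the rest of the sequence.
alone = feed_forward(x[:1], W1, b1, W2, b2)
together = feed_forward(x, W1, b1, W2, b2)[:1]
print(np.allclose(alone, together))  # True
print(y.shape)  # (4, 8)
```

Unlike attention, the FFN never mixes information between tokens; mixing across positions happens only in the attention sub-layer.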
2.4 Transformer Variants
The three classic configurations are:
- Encoder-Only (BERT): Uses bidirectional attention (can look both left and right). Ideal for classification and search.
- Decoder-Only (GPT series): Uses causal masking (each token can attend only to itself and earlier tokens). Ideal for text generation.
- Encoder-Decoder (T5, BART): Combines both. The encoder processes the input, and the decoder generates the output while attending to the encoder's states through cross-attention. Ideal for summarization and translation.
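In code, the difference between bidirectional and causal attention comes down to the mask applied before the softmax. The NumPy sketch below uses all-zero scores so that the effect of each mask is easy to read; `masked_softmax` is a helper defined for this example.

```python
import numpy as np

def masked_softmax(scores, mask):
    """Softmax over the last axis, ignoring positions where mask is False."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

seq_len = 4
scores = np.zeros((seq_len, seq_len))  # uniform raw scores, to make the masks visible

# Encoder-style: every token may attend to every other token.
bidirectional = np.ones((seq_len, seq_len), dtype=bool)

# Decoder-style: token i may attend only to positions j <= i.
causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))

bidirectional_weights = masked_softmax(scores, bidirectional)
causal_weights = masked_softmax(scores, causal)

print(bidirectional_weights)
print(causal_weights)
```

With the causal mask, row `i` spreads its weight evenly over positions `0..i`, and every entry above the diagonal is zero.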
A. Modern Optimizations: FlashAttention
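Standard attention writes the full $N \times N$ attention matrix to GPU High-Bandwidth Memory (HBM) and reads it back, and HBM is much slower than on-chip SRAM. **FlashAttention** computes the same exact result block by block (tiling), rescaling partial softmax results as it goes so that the working set stays in fast SRAM and the full matrix is never materialized in HBM. The original paper reports end-to-end training speedups of up to about 3x (on GPT-2).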
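The numerical trick that makes tiling possible is an online softmax: partial results are rescaled whenever a new block raises the running maximum. The NumPy sketch below illustrates only that idea for a single query vector. It is not the FlashAttention kernel, which also tiles the queries and runs as a fused GPU kernel; the function names and block size are assumptions for this example.

```python
import numpy as np

def attention_reference(q, K, V):
    """Standard attention for one query: materializes all N scores at once."""
    s = K @ q / np.sqrt(q.shape[0])
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

def attention_tiled(q, K, V, block_size=4):
    """Same result, but K and V are visited one block at a time."""
    d = q.shape[0]
    running_max = -np.inf        # largest score seen so far
    running_sum = 0.0            # softmax denominator, relative to running_max
    acc = np.zeros(V.shape[1])   # unnormalized weighted sum of value rows
    for start in range(0, K.shape[0], block_size):
        K_blk = K[start:start + block_size]
        V_blk = V[start:start + block_size]
        s = K_blk @ q / np.sqrt(d)
        new_max = max(running_max, s.max())
        rescale = np.exp(running_max - new_max)  # shrink old totals to the new max
        p = np.exp(s - new_max)
        running_sum = running_sum * rescale + p.sum()
        acc = acc * rescale + p @ V_blk
        running_max = new_max
    return acc / running_sum

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(10, 8))
V = rng.normal(size=(10, 5))

full = attention_reference(q, K, V)
tiled = attention_tiled(q, K, V, block_size=4)
print(np.allclose(full, tiled))  # True
```

Only one block of scores exists at any time in the tiled version, which is what allows a real kernel to keep the working set in fast on-chip memory.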
Next Steps
Proceed to Module 3: Large Language Models to see how the Transformer block is scaled to billions of parameters.
