Module 3: Large Language Models (LLMs)
Master LLM pretraining, Chinchilla scaling laws, tokenization algorithms (BPE), decoding strategies (top-k, top-p), and KV Cache optimization.
3.1 Pretraining
Pretraining is the first and most expensive stage of training an LLM. The goal is for the model to acquire general syntax, reasoning ability, and world knowledge by training on massive text corpora (crawled web pages, books, code repositories).
- Next-Token Prediction (Causal Language Modeling): Given a sequence of tokens $(t_1, \dots, t_{k-1})$, the model must predict the probability distribution for the next token $P(t_k | t_1, \dots, t_{k-1})$. This is the training paradigm for GPT models.
- Masked Language Modeling: Randomly masks a fraction of the tokens in a sentence (15% in BERT) and trains the model to fill in the blanks using bidirectional context. This is the training paradigm for BERT models.
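As a minimal sketch of the next-token objective, the snippet below computes the average cross-entropy of the correct next token at each position. The three-token vocabulary, the probability rows, and the function name `next_token_loss` are made up for illustration; a real model would produce these distributions from its logits.

```python
import math

def next_token_loss(probs_per_step, target_ids):
    """Average cross-entropy (in nats) of the true next token at each position."""
    losses = [-math.log(probs[t]) for probs, t in zip(probs_per_step, target_ids)]
    return sum(losses) / len(losses)

# Hypothetical 3-token vocabulary. Each row is the predicted distribution
# P(t_k | t_1, ..., t_{k-1}) at one position of the sequence.
predicted = [
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.25, 0.25, 0.50],
]
targets = [0, 1, 2]  # the tokens that actually came next

loss = next_token_loss(predicted, targets)
perplexity = math.exp(loss)  # exponentiated loss, a common way to report it
print(f"loss={loss:.4f} nats, perplexity={perplexity:.3f}")
```

Pretraining lowers this quantity across the corpus: the more probability the model places on the token that really follows, the smaller the loss.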
A. Scaling Laws
How does LLM performance scale as we increase parameters ($N$), training tokens ($D$), and training compute ($C$)?
Kaplan et al. (OpenAI): Found that test loss falls as a power law in each of $N$, $D$, and $C$, and suggested that for a fixed compute budget, model parameter count ($N$) should grow faster than training dataset size ($D$).
Chinchilla Scaling Laws (Hoffmann et al., DeepMind): Showed empirically that models sized according to Kaplan's recommendations were under-trained. For compute-optimal training, **parameters ($N$) and dataset size ($D$) should scale in equal proportion**: doubling the model size calls for doubling the training tokens, which works out to roughly 20 tokens per parameter. A model with 70 billion parameters should therefore be trained on about 1.4 trillion tokens to be compute-optimal. Modern open models (like LLaMA-3-8B) deliberately train on far more data (e.g. 15 trillion tokens), trading extra training compute for a smaller model that is cheaper to serve at inference time.
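The arithmetic behind the 70B / 1.4T example can be sketched in a few lines. This assumes two rules of thumb that are not stated in the text above: training compute is approximated as $C \approx 6ND$ FLOPs, and the compute-optimal ratio is taken as about 20 tokens per parameter. The function names are illustrative only.

```python
def training_flops(n_params, n_tokens):
    """Rough training cost, assuming the common approximation C ~ 6 * N * D."""
    return 6.0 * n_params * n_tokens

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into (parameters, tokens), assuming D ~ 20 * N."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

budget = training_flops(70e9, 1.4e12)  # about 5.9e23 FLOPs
n, d = chinchilla_optimal(budget)
print(f"{budget:.2e} FLOPs -> {n / 1e9:.0f}B params, {d / 1e12:.1f}T tokens")

# For comparison: 15T tokens on an 8B model is far beyond 20 tokens per parameter.
llama3_ratio = 15e12 / 8e9
print(f"LLaMA-3-8B: about {llama3_ratio:.0f} tokens per parameter")
```

Under these assumptions the budget of the 70B example maps back to the same 70B / 1.4T split, while the 8B example sits at nearly 1,900 tokens per parameter.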
3.2 Notable LLM Architectures
Modern LLMs, including most open-weights models, have converged around specific decoder-only designs:
- GPT Series (OpenAI): Popularized decoder-only architectures, scaling from GPT-1 (117M parameters) to GPT-4, whose size is undisclosed (unconfirmed reports describe a mixture-of-experts model with roughly 1.7T parameters).
- LLaMA Series (Meta): Set the standard for open research. Features RMSNorm (Root Mean Square Normalization) for stable training and SwiGLU activation functions instead of GELU.
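- Mistral & Mixtral (Mistral AI): Mixtral popularized **Sparse Mixture of Experts (SMoE)** among open-weights models. A router sends each token to a small subset of feed-forward networks (experts) at each layer, two of eight in Mixtral 8x7B, so only a fraction of the parameters is active per token and inference is faster than in a dense model of the same total size.
- Gemma (Google): High-performance lightweight models using RoPE embeddings and, in the smallest (2B) variant, Multi-Query Attention.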
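To make the sparse routing idea concrete, here is a minimal sketch of top-2 expert routing for a single token. The four toy experts, the router scores, and the names `moe_layer` and `make_expert` are hypothetical; in a real model the experts are feed-forward networks and the router scores come from a learned linear layer.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def make_expert(scale):
    """Toy stand-in for a feed-forward expert: scales the input vector."""
    return lambda x: [scale * v for v in x]

def moe_layer(token, router_logits, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    # Indices of the top_k highest router scores (ties keep the lower index first).
    chosen = sorted(range(len(experts)), key=lambda i: router_logits[i], reverse=True)[:top_k]
    # Gate weights are a softmax over the chosen experts' scores only.
    gates = softmax([router_logits[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, idx in zip(gates, chosen):
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + gate * e for o, e in zip(out, expert_out)]
    return out, chosen

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
token = [1.0, -1.0]
router_logits = [0.1, 2.0, 0.3, 2.0]  # assumed router output for this token

output, used = moe_layer(token, router_logits, experts)
print(used, output)
```

Only two of the four experts run for this token, which is where the inference saving comes from: compute scales with the active experts, not with the total parameter count.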
3.3 Tokenization Deep Dive
LLMs cannot process raw text directly; the text is first split into tokens (sub-word fragments), and each token is mapped to an integer ID.
Byte-Pair Encoding (BPE) Algorithm:
- Start with a vocabulary containing all individual characters.
- Represent all words as sequences of characters.
- Count how often each pair of adjacent symbols occurs in the corpus.
- Merge the most frequent pair into a single new symbol and add it to the vocabulary.
- Repeat the counting and merging steps until the desired vocabulary size (e.g., 32,000 for LLaMA-2 or roughly 128,000 for LLaMA-3) is reached.
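```python
import collections
import re

# Sample corpus: words split into characters, each ending with the
# end-of-word marker </w>, mapped to their frequency.
corpus = {
    'l o w </w>': 5,
    'l o w e r </w>': 2,
    'n e w e s t </w>': 6,
    'w i d e s t </w>': 3,
}


def get_stats(corpus):
    """Count each adjacent symbol pair, weighted by word frequency."""
    pairs = collections.defaultdict(int)
    for word, freq in corpus.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs


def merge_vocab(pair, v_in):
    """Merge every occurrence of `pair` into a single symbol."""
    v_out = {}
    # Match the pair only at symbol boundaries, so ('e', 's') cannot match
    # the tail of one multi-character symbol and the head of the next.
    bigram = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    replacement = ''.join(pair)
    for word, freq in v_in.items():
        v_out[bigram.sub(replacement, word)] = freq
    return v_out


# Run 10 merges
for i in range(10):
    pairs = get_stats(corpus)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    corpus = merge_vocab(best, corpus)
    print(f"Merge {i + 1}: {best} -> {''.join(best)}")
```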
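Once merges have been learned, tokenizing a new word amounts to replaying them in order. The sketch below assumes a short, hand-written merge list and an end-of-word marker `</w>`; both are illustrative choices, as are the `apply_merges` helper and the toy ID mapping.

```python
def apply_merges(word, merges, end_marker="</w>"):
    """Tokenize one word by replaying merges in the order they were learned."""
    symbols = list(word) + [end_marker]
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Fuse the pair wherever it appears as two adjacent symbols.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge list, in learned order.
merges = [("e", "s"), ("es", "t"), ("est", "</w>"), ("l", "o"), ("lo", "w")]

tokens = apply_merges("lowest", merges)
print(tokens)  # ['low', 'est</w>']

# Toy mapping from tokens to integer IDs, which is what the model consumes.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
ids = [vocab[tok] for tok in tokens]
```

Note that "lowest" is split into two familiar sub-words here even though the word itself is not in the merge list, which is the main appeal of sub-word tokenization.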
3.4 Inference & Decoding Strategies
At inference, the model outputs a vector of unnormalized scores (logits), one per vocabulary token, for the next position. A decoding strategy turns these logits into generated text:
A. Temperature Scaling
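Temperature rescales the logits before they are passed to the softmax layer:
$$P(t_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
where $z_i$ is the logit of token $i$ and $T > 0$ is the temperature. At $T = 1$ the distribution is unchanged.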
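- Low Temperature (e.g. 0.2): Sharpens the distribution around the highest logit, giving focused, near-deterministic outputs. As $T \to 0$ this approaches greedy decoding.
- High Temperature (e.g. 0.9 and above): Leaves more probability on lower-ranked tokens, producing more creative and diverse outputs. Values above 1 flatten the distribution beyond what the model itself predicts.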
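The effect of temperature is easy to see numerically. The following sketch divides each logit by $T$ before a standard softmax; the three logits and the function name `softmax_with_temperature` are arbitrary values chosen for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical scores for three tokens
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in probs]}")
```

At $T = 0.2$ nearly all of the probability lands on the first token, while at $T = 2.0$ the three tokens are much closer together.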
B. Top-K and Top-P Sampling
- Top-K Sampling: Restricts sampling to the $K$ tokens with the highest probabilities and renormalizes their probabilities so they sum to 1.
- Top-P (Nucleus) Sampling: Sorts tokens from most to least probable and accumulates their probabilities. Once the cumulative sum reaches the threshold $P$ (e.g., 0.90), all remaining tokens are discarded and the kept probabilities are renormalized. Unlike Top-K, the number of candidates adapts to the model's confidence, which avoids sampling highly improbable tokens while maintaining diversity.
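Both filters can be sketched over a plain list of probabilities. The five-token distribution, the thresholds, and the helper names below are assumptions made for the example; production libraries apply the same idea to logits over the full vocabulary.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable token IDs and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose mass reaches p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def sample(filtered, rng):
    """Draw one token ID from a filtered {id: probability} mapping."""
    ids = list(filtered)
    return rng.choices(ids, weights=[filtered[i] for i in ids], k=1)[0]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # hypothetical next-token distribution
top_k = top_k_filter(probs, k=2)      # keeps tokens 0 and 1
top_p = top_p_filter(probs, p=0.85)   # keeps tokens 0, 1 and 2
next_id = sample(top_p, random.Random(0))
```

With a peaked distribution Top-P would keep only one or two tokens, whereas Top-K always keeps exactly $K$ regardless of how confident the model is.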
3.5 Context Window & Long-Context Handling
Expanding the context window is critical for processing documents, books, and full code repositories.
A. KV Cache
During generation, tokens are predicted one at a time. In a standard Transformer block, computing attention for token $n$ means comparing its Query vector against the Key vectors of tokens $1, \dots, n$ and using the resulting weights to average their Value vectors.
Since the Keys and Values of earlier tokens do not change, recomputing them at every step is redundant. **KV Caching** stores previously computed Key and Value vectors in memory, so each step only projects the newest token ($O(1)$ projections instead of $O(n)$) and attends over the cached entries, cutting the per-step attention cost from $O(n^2)$ to $O(n)$. The price is GPU memory that grows linearly with sequence length.
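The saving can be illustrated with a toy single-head attention step in plain Python. The 2-dimensional projection matrices, the token vectors, and the `KVCache` class are hypothetical; the point is only that the cached and uncached paths give the same output while doing different amounts of work.

```python
import math

def project(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(len(matrix[0]))]

def attend(query, keys, values):
    """Scaled dot-product attention for one query over all given positions."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * val[j] for w, val in zip(weights, values))
            for j in range(len(values[0]))]

# Hypothetical projection weights for a single attention head.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.7]]
W_V = [[1.0, 0.3], [0.4, 1.0]]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were computed

    def step(self, x):
        # Only the newest token is projected; earlier keys/values are reused.
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        self.projections += 1
        return attend(project(x, W_Q), self.keys, self.values)

def step_without_cache(prefix):
    """Recompute keys and values for the whole prefix at every step."""
    keys = [project(t, W_K) for t in prefix]
    values = [project(t, W_V) for t in prefix]
    return attend(project(prefix[-1], W_Q), keys, values), len(prefix)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
cache, recomputed = KVCache(), 0
for n in range(1, len(tokens) + 1):
    cached_out = cache.step(tokens[n - 1])
    plain_out, cost = step_without_cache(tokens[:n])
    recomputed += cost
    assert all(abs(a - b) < 1e-12 for a, b in zip(cached_out, plain_out))

print(cache.projections, recomputed)  # 4 projections with the cache, 10 without
```

Over $n$ generated tokens the uncached path performs $1 + 2 + \dots + n$ key/value projections, while the cached path performs $n$, at the cost of keeping every key and value in memory.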
B. RoPE Scaling (YaRN, LongRoPE)
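If a model is trained on a 4k context window, it has never seen the rotation angles that its Rotary Positional Embeddings (RoPE) produce for positions beyond 4k, so quality degrades sharply past that length. **RoPE scaling** rescales positions or rotation frequencies (uniformly in linear interpolation, per frequency in methods such as YaRN and LongRoPE) so that longer sequences map onto angles the model already handles, extending context (e.g. to 128k) with little additional training.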
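As a rough illustration, the sketch below implements plain linear position interpolation, the simplest form of RoPE scaling, rather than YaRN or LongRoPE themselves (which rescale each frequency differently). The base of 10000, the head dimension of 8, and the exact lengths 4096 and 131072 are assumed values.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Rotation angles for one position; scale > 1 compresses positions."""
    return [(position / scale) * base ** (-2.0 * i / head_dim)
            for i in range(head_dim // 2)]

def apply_rope(vec, position, base=10000.0, scale=1.0):
    """Rotate consecutive pairs (x0, x1), (x2, x3), ... of a query or key vector."""
    out = []
    for i, angle in enumerate(rope_angles(position, len(vec), base, scale)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(angle) - y * math.sin(angle),
                x * math.sin(angle) + y * math.cos(angle)]
    return out

trained_len, target_len = 4096, 131072  # assumed 4k -> 128k extension
scale = target_len / trained_len        # 32x

# With interpolation, the last position of the long context produces angles
# no larger than those the model saw at the end of its training window.
scaled = rope_angles(target_len - 1, head_dim=8, scale=scale)
trained_max = rope_angles(trained_len, head_dim=8)
print(all(s <= t for s, t in zip(scaled, trained_max)))  # True
```

The trade-off of uniform interpolation is that neighboring positions become harder to tell apart, which is part of the motivation for the per-frequency schemes named in the heading.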
Next Steps
Proceed to Module 4: Training & Fine-Tuning LLMs to learn how the Transformer block is scaled to billions of parameters.
