Module 4: Training & Fine-Tuning LLMs
Master Supervised Fine-Tuning, PEFT/LoRA math, QLoRA, DPO vs PPO alignment, model quantization (GGUF, GPTQ), and distributed DeepSpeed training.
4.1 Supervised Fine-Tuning (SFT)
Pretrained models behave as next-token text autocomplete engines. To transform them into useful interactive AI assistants, we perform **Supervised Fine-Tuning (SFT)** (also known as Instruction Tuning).
SFT trains the model on curated datasets containing instructions and correct responses:
{
  "instruction": "Explain gravity in one sentence.",
  "response": "Gravity is an attractive force that pulls objects with mass toward each other."
}
Chat Templating: Chat systems rely on structured formats (e.g. ChatML or LLaMA-3 formats) to help the model distinguish between instructions from the system, user questions, and its own past answers:
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
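To make the template concrete, here is a minimal sketch that renders a list of messages into the LLaMA-3 layout shown above. The function name `render_llama3_chat` and the `add_generation_prompt` flag are illustrative choices, and the blank line after each header follows the published LLaMA-3 format; this is a hand-rolled approximation, not the official implementation.

```python
def render_llama3_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts in a LLaMA-3 style layout."""
    parts = ["<|begin_of_text|>"]
    for message in messages:
        # Each turn is a role header, a blank line, the content, then an end-of-turn token.
        parts.append(f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n")
        parts.append(f"{message['content']}<|eot_id|>")
    if add_generation_prompt:
        # An open assistant header cues the model to write the next reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
prompt = render_llama3_chat(messages)
print(prompt)
```

In practice you would normally let the tokenizer do this, for example with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` in Hugging Face Transformers, so that the template always matches the one the model was trained with.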
Catastrophic Forgetting
If you fine-tune an LLM too heavily on a very narrow task (like translating medical documents), it may lose its broader reasoning and language generation capabilities. Mitigate this by mixing general-purpose instruction datasets into your training data.
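One way to apply this mitigation is to blend a fixed share of general-purpose examples into the narrow task data before training. The sketch below is only an illustration: the helper `mix_datasets`, the 20% share, and the toy examples are all assumptions, and a good ratio has to be found empirically.

```python
import random


def mix_datasets(task_examples, general_examples, general_fraction=0.2, seed=0):
    """Return a shuffled set in which about `general_fraction` of examples are general-purpose."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


# Hypothetical toy data: 80 narrow-task examples and a pool of 500 general ones.
task = [{"instruction": f"Translate medical note {i}", "response": "..."} for i in range(80)]
general = [{"instruction": f"General question {i}", "response": "..."} for i in range(500)]
train_set = mix_datasets(task, general, general_fraction=0.2)
print(len(train_set))
```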
4.2 Parameter-Efficient Fine-Tuning (PEFT) & LoRA
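Fine-tuning pretrained LLMs on domain-specific datasets is how they are adapted to specialized tasks. However, updating all parameters is expensive: with mixed-precision Adam, the weights, gradients, and optimizer states take roughly 16 bytes per parameter, or about 112 GB for a 7-billion-parameter model before counting activations.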
This led to the creation of **Parameter-Efficient Fine-Tuning (PEFT)** techniques, which freeze the base model parameters and only train a small subset of additional parameters, reducing memory costs. The most widely used PEFT method is **LoRA (Low-Rank Adaptation)**.
A. The Math Behind LoRA
During full model training, we adjust a model's weight matrix $W_0$ (of dimension $d \times k$) by a parameter update matrix $\Delta W$. The final trained weight matrix is:
$$W = W_0 + \Delta W$$
LoRA factorizes $\Delta W$ into the product of two much smaller, low-rank matrices $B$ and $A$:
$$\Delta W = BA, \qquad W = W_0 + BA$$
Where:
- $W_0$ has dimensions $d \times k$ (frozen base weight).
- $B$ has dimensions $d \times r$ (trainable adapter parameter).
- $A$ has dimensions $r \times k$ (trainable adapter parameter).
- $r$ is the **Rank** hyperparameter (typically $r \ll \min(d, k)$, such as 8 or 16).
If $d = 4096$ and $k = 4096$, then $W_0$ contains $4096 \times 4096 = 16{,}777{,}216$ parameters. With LoRA at rank $r = 8$, matrices $B$ and $A$ together contain only $4096 \times 8 + 8 \times 4096 = 65{,}536$ trainable parameters, about 0.39% of the original matrix.
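The arithmetic and the forward pass can be checked with a short, dependency-free sketch. The helper names, the tiny matrix sizes, and the random values are illustrative. Two details are taken from the LoRA paper rather than from the text above: $B$ starts at zero, and the update is scaled by $\alpha / r$.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def lora_param_counts(d, k, r):
    """Return (full, adapter) parameter counts for a d x k weight with rank-r LoRA."""
    return d * k, d * r + r * k


full, adapter = lora_param_counts(d=4096, k=4096, r=8)
print(full, adapter, f"{100 * adapter / full:.2f}%")

# Tiny forward pass: h = W0 x + (alpha / r) * B (A x)
d, k, r, alpha = 4, 3, 2, 4
rng = random.Random(0)
W0 = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]    # frozen, d x k
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # trainable, r x k
B = [[0.0] * r for _ in range(d)]                               # trainable, d x r, starts at zero
x = [[1.0], [2.0], [3.0]]                                       # input column vector, k x 1

base = matmul(W0, x)              # output of the frozen weight
update = matmul(B, matmul(A, x))  # low-rank path; B A is never materialized
h = [[b[0] + (alpha / r) * u[0]] for b, u in zip(base, update)]
print(h == base)  # the adapter is a no-op until B is trained away from zero
```

Computing $B(Ax)$ instead of $(BA)x$ keeps the extra cost proportional to $r(d + k)$ rather than $dk$.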
B. What is QLoRA?
QLoRA (Quantized Low-Rank Adaptation) takes efficiency a step further. It quantizes the frozen base model weights ($W_0$) to a **4-bit NormalFloat (NF4)** representation, a data type designed for normally distributed weights.
During training, the 4-bit weights are dequantized to 16-bit bfloat16 (BF16) on the fly for the forward and backward passes. Gradients are computed only for the trainable 16-bit LoRA adapter weights ($A$ and $B$); the quantized base weights stay frozen. This allows developers to fine-tune a 7B-parameter LLM on a single 24 GB consumer GPU (like an Nvidia RTX 3090 or RTX 4090), and the QLoRA authors report task performance close to that of 16-bit fine-tuning.
# Example: attaching LoRA adapters with Hugging Face PEFT.
# This is plain LoRA; QLoRA would additionally load the base model in 4-bit.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# 1. Load the tokenizer and base model
model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2. Define the LoRA config
lora_config = LoraConfig(
    r=16,                                 # Rank r of the adapter matrices
    lora_alpha=32,                        # Scaling factor (update is scaled by lora_alpha / r)
    target_modules=["q_proj", "v_proj"],  # Attention query and value projections
    lora_dropout=0.05,                    # Dropout on the adapter path
    bias="none",                          # Do not train bias terms
    task_type=TaskType.CAUSAL_LM,
)

# 3. Wrap the model with PEFT adapter layers
peft_model = get_peft_model(model, lora_config)

# 4. Print trainable vs. total parameter counts
peft_model.print_trainable_parameters()
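For the QLoRA variant, the base model is loaded in 4-bit before the adapters are attached. The sketch below shows one plausible setup with `BitsAndBytesConfig` from Transformers and `prepare_model_for_kbit_training` from PEFT. It assumes the `bitsandbytes` package and a CUDA GPU are available. The helper name `load_qlora_model` is hypothetical, and double quantization is an optional QLoRA setting not discussed above.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, kept as a plain dict
# so they can be inspected without loading any model.
NF4_QUANT_KWARGS = {
    "load_in_4bit": True,               # store the frozen base weights in 4 bits
    "bnb_4bit_quant_type": "nf4",       # use the 4-bit NormalFloat data type
    "bnb_4bit_use_double_quant": True,  # optional: also quantize the quantization constants
}


def load_qlora_model(model_id, lora_config):
    """Load `model_id` with NF4 base weights and attach LoRA adapters (sketch)."""
    # Heavy imports stay local so this file can be imported without a GPU stack.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import get_peft_model, prepare_model_for_kbit_training

    bnb_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to BF16 for the matmuls
        **NF4_QUANT_KWARGS,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    # PEFT helper that prepares a quantized model for adapter training.
    model = prepare_model_for_kbit_training(model)
    return get_peft_model(model, lora_config)
```

Calling `load_qlora_model("meta-llama/Meta-Llama-3-8B", lora_config)` with the `lora_config` defined earlier would return a PEFT model whose base weights are stored in NF4 and whose adapters are trained in 16-bit.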
4.3 Alignment & RLHF
After SFT, models can still produce toxic, biased, or unhelpful answers. We apply **Alignment** techniques to make them Helpful, Harmless, and Honest (HHH).
- Reinforcement Learning from Human Feedback (RLHF):
  - Generate multiple responses for a prompt and have humans rank them.
  - Train a secondary **Reward Model** to output a score matching human preferences.
  - Fine-tune the SFT model (the policy) using **Proximal Policy Optimization (PPO)**, maximizing the reward model's score while applying a KL divergence penalty that keeps the policy's outputs from drifting too far from those of the original SFT model.
- Direct Preference Optimization (DPO):
  RLHF with PPO is notoriously unstable and requires keeping several models in memory at once (the policy, the reward model, the frozen reference model, and usually a value model). **DPO** removes the separate reward model and the reinforcement learning loop. It expresses human preferences directly in the loss function of the policy model, allowing you to align models with a simple binary cross-entropy loss over pairwise preference datasets (e.g. chosen vs. rejected completions), using only the policy and a frozen reference model.
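The objectives described above can be written down in a few lines. The following sketch works on plain Python floats and is not a training loop. It assumes the common pairwise (Bradley-Terry style) reward-model loss, a per-sample log-ratio as the KL penalty term, and the standard DPO formulation. The function names, the coefficients `kl_coef` and `beta`, and the example numbers are illustrative.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def reward_model_loss(score_chosen, score_rejected):
    """Pairwise ranking loss: push the chosen score above the rejected one."""
    return -log_sigmoid(score_chosen - score_rejected)


def kl_penalized_reward(rm_score, logp_policy, logp_sft, kl_coef=0.1):
    """Reward for the PPO step: reward-model score minus a KL-style penalty."""
    return rm_score - kl_coef * (logp_policy - logp_sft)


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities."""
    chosen_margin = logp_chosen - ref_logp_chosen        # how much the policy upweights the chosen answer
    rejected_margin = logp_rejected - ref_logp_rejected  # how much it upweights the rejected answer
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


print(reward_model_loss(2.0, 0.5))
print(kl_penalized_reward(1.0, logp_policy=-12.0, logp_sft=-15.0))
print(dpo_loss(-10.0, -14.0, ref_logp_chosen=-12.0, ref_logp_rejected=-12.0))
```

When the policy equals the reference model, both margins are zero and the DPO loss is $\log 2$. It falls below that as the policy shifts probability from the rejected completion to the chosen one.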
4.4 Quantization
Pretrained weights are typically saved as 16-bit floating-point numbers (FP16 or BF16), requiring 2 bytes of memory per parameter. Quantization compresses these weights into smaller representations (like 8-bit or 4-bit integers), cutting the memory needed for the weights by roughly 2x (8-bit) to 4x (4-bit) at the cost of some precision.
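- Post-Training Quantization (PTQ): Compresses weights after training is complete. Standard methods include **GPTQ** and **AWQ**, which use a small calibration dataset to minimize quantization error: GPTQ uses approximate second-order information about each layer's outputs, while AWQ uses activation statistics to protect the most important weights.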
- GGUF (GPT-Generated Unified Format): A binary file format designed for running LLMs locally on CPU/GPU hardware using frameworks like llama.cpp. Supports 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit quantized weights.
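To give a feel for the numbers, the sketch below estimates weight memory at different bit widths and round-trips a small block of weights through a basic 4-bit quantizer. It uses plain symmetric absmax rounding, which is a deliberate simplification and not the GPTQ, AWQ, or GGUF algorithms named above. The function names and sample weights are illustrative.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_absmax(block, bits=4):
    """Symmetric round-to-nearest quantization of one block of weights."""
    qmax = 2 ** (bits - 1) - 1                        # 7 for 4-bit signed integers
    scale = max(abs(w) for w in block) / qmax or 1.0  # fall back to 1.0 for an all-zero block
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


for bits in (16, 8, 4):
    print(f"7B parameters at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")

weights = [0.12, -0.53, 0.07, 0.91, -0.33, 0.02, -0.78, 0.45]
codes, scale = quantize_absmax(weights, bits=4)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error = {max_error:.3f}")
```

Each block stores its integer codes plus one scale, so smaller blocks lower the error but add overhead. Calibration-based methods such as GPTQ and AWQ improve on plain rounding by taking the data into account.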
4.5 Distributed Training
When training models with tens of billions of parameters, the model weights, gradients, and optimizer states cannot fit in a single GPU's memory. We must shard the workload across multiple GPUs:
- Data Parallelism: Copies the model to all GPUs. Each GPU processes a different slice of the data batch and synchronizes gradients before updating.
- Tensor Parallelism: Shards individual weight matrices (e.g. splitting the Self-Attention projection weights) across multiple GPUs, performing parallel matrix multiplications.
- Pipeline Parallelism: Shards different layers of the model sequentially across a chain of GPUs (e.g., layers 1-8 on GPU 0, layers 9-16 on GPU 1).
- ZeRO (Zero Redundancy Optimizer): Developed by Microsoft (DeepSpeed). It removes the memory redundancy of data parallelism by sharding optimizer states (stage 1), gradients (stage 2), and model parameters (stage 3) across GPUs instead of replicating them.
- FSDP (Fully Sharded Data Parallel): PyTorch's native implementation of ZeRO-style sharding (comparable to ZeRO stage 3 when fully sharded), allowing models to scale to hundreds of billions of parameters across clusters of GPUs.
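The memory effect of sharding can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam with 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer states, the accounting used in the ZeRO paper. It ignores activations and temporary buffers. The function name and the 7.5B-parameter, 64-GPU example are illustrative.

```python
def zero_memory_per_gpu_gb(n_params, n_gpus, stage):
    """Approximate model-state memory per GPU in GB (stage 0 = plain data parallelism)."""
    params, grads, optim = 2 * n_params, 2 * n_params, 12 * n_params  # bytes
    if stage >= 1:
        optim /= n_gpus   # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus   # stage 2: also shard gradients
    if stage >= 3:
        params /= n_gpus  # stage 3 (and fully sharded FSDP): also shard parameters
    return (params + grads + optim) / 1e9


for stage in range(4):
    label = "plain data parallel" if stage == 0 else f"ZeRO stage {stage}"
    print(f"{label}: {zero_memory_per_gpu_gb(7.5e9, 64, stage):.1f} GB per GPU")
```

Under these assumptions, plain data parallelism needs the full 16 bytes per parameter on every GPU, while stage 3 divides that cost by the number of GPUs, at the price of extra communication to gather parameters when they are needed.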
Next Steps
Proceed to Module 5: Prompting & Prompt Engineering to learn how to interact with aligned models.
