Module 4: Training & Fine-Tuning LLMs

Master Supervised Fine-Tuning, PEFT/LoRA math, QLoRA, DPO vs PPO alignment, model quantization (GGUF, GPTQ), and distributed DeepSpeed training.

⏱ 24 Min Read • Author: GenAIWallah Team • Updated: May 2026

4.1 Supervised Fine-Tuning (SFT)

Pretrained models behave as next-token text autocomplete engines. To transform them into useful interactive AI assistants, we perform **Supervised Fine-Tuning (SFT)** (also known as Instruction Tuning).

SFT trains the model on curated datasets containing instructions and correct responses:

Instruction Dataset Sample (JSON)

{
  "instruction": "Explain gravity in one sentence.",
  "response": "Gravity is an attractive force that pulls objects with mass toward each other."
}

Chat Templating: Chat models rely on structured formats (e.g. ChatML or the Llama 3 format shown below) to help the model distinguish system instructions, user messages, and its own earlier answers:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>
Hi!<|eot_id|><|start_header_id|>assistant<|end_header_id|>
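
The layout above can be reproduced with a few lines of string formatting. The sketch below is illustrative only: `render_llama3_style` is a hypothetical helper, not part of any library, and it mirrors the whitespace exactly as printed in this section, which may differ from the template shipped with a given model.

```python
def render_llama3_style(messages, add_generation_prompt=True):
    """Render {"role", "content"} dicts in the layout shown in this section.

    Hypothetical helper for illustration; real templates ship with the tokenizer.
    """
    parts = ["<|begin_of_text|>"]
    for message in messages:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n"
            f"{message['content']}<|eot_id|>"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model writes the next reply.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi!"},
]
prompt = render_llama3_style(messages)
print(prompt)
```

In practice it is safer to let the tokenizer apply its own template, for example with `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` in Hugging Face Transformers, so that special tokens and whitespace match what the model saw during training.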

⚠️

Catastrophic Forgetting

If you fine-tune an LLM too heavily on a very narrow task (like translating medical documents), it may lose its broader reasoning and language generation capabilities. Mitigate this by mixing general-purpose instruction datasets into your training data.
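
One simple way to mix datasets is to add a fixed share of general-purpose examples to the narrow task data before shuffling. This is a minimal sketch under stated assumptions: `mix_datasets` is a hypothetical helper, the 20% share is an arbitrary starting point rather than a recommended value, and the example records are placeholders.

```python
import random


def mix_datasets(task_examples, general_examples, general_fraction=0.2, seed=0):
    """Return the task examples plus enough general examples to make up
    `general_fraction` of the mixed set, in shuffled order."""
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_general / (n_task + n_general) = general_fraction for n_general.
    n_general = round(len(task_examples) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general_examples))
    mixed = list(task_examples) + rng.sample(list(general_examples), n_general)
    rng.shuffle(mixed)
    return mixed


# Placeholder records standing in for real instruction data.
task = [{"instruction": f"Translate medical note {i}", "source": "task"} for i in range(80)]
general = [{"instruction": f"General question {i}", "source": "general"} for i in range(100)]

mixed = mix_datasets(task, general, general_fraction=0.2)
print(len(mixed), sum(1 for ex in mixed if ex["source"] == "general"))
```

The right ratio depends on the task and the model, so it is worth checking general benchmarks before and after fine-tuning rather than trusting a fixed number.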

4.2 Parameter-Efficient Fine-Tuning (PEFT) & LoRA

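Fine-tuning pretrained LLMs on domain-specific datasets is often necessary to reach good performance on specialized tasks. However, updating all parameters of a 7-billion-parameter model is expensive: with the Adam optimizer in mixed precision, weights, gradients, and optimizer states together take roughly 16 bytes per parameter, or about 112 GB before activations are counted.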

This led to the creation of **Parameter-Efficient Fine-Tuning (PEFT)** techniques, which freeze the base model parameters and only train a small subset of additional parameters, reducing memory costs. The most widely used PEFT method is **LoRA (Low-Rank Adaptation)**.

A. The Math Behind LoRA

During full model training, we adjust a model's weight matrix $W_0$ (of dimension $d \times k$) by a parameter update matrix $\Delta W$. The final trained weight matrix is:

$$W = W_0 + \Delta W$$

LoRA factorizes $\Delta W$ into the product of two much smaller, low-rank matrices $B$ and $A$:

$$\Delta W = B \cdot A$$

Where:

$W_0$ has dimensions $d \times k$ (frozen base weight).
$B$ has dimensions $d \times r$ (trainable adapter parameter).
$A$ has dimensions $r \times k$ (trainable adapter parameter).
$r$ is the **Rank** hyperparameter (typically $r \ll \min(d, k)$, such as 8 or 16).

If $d = 4096$ and $k = 4096$, then $W_0$ contains $4096 \times 4096$ = 16,777,216 parameters. With LoRA at rank $r = 8$, $B$ has $4096 \times 8$ = 32,768 parameters and $A$ has $8 \times 4096$ = 32,768, so together they contain only 65,536 trainable parameters, about 0.39% of the original matrix.
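
The arithmetic and the decomposition can be checked numerically. The NumPy sketch below is a toy illustration, not the PEFT implementation: it uses tiny matrices, omits the scaling factor that libraries usually apply to the adapter output, and assumes the common initialization where $B$ starts at zero so the adapter has no effect before training.

```python
import numpy as np

# Parameter counts for the example in the text.
d, k, r = 4096, 4096, 8
full_params = d * k
lora_params = d * r + r * k
print(full_params, lora_params, f"{100 * lora_params / full_params:.2f}%")

# Forward-pass check on small shapes so it runs instantly.
rng = np.random.default_rng(0)
d_s, k_s, r_s = 6, 5, 2
W0 = rng.normal(size=(d_s, k_s))   # frozen base weight
B = np.zeros((d_s, r_s))           # assumed zero initialization
A = rng.normal(size=(r_s, k_s))
x = rng.normal(size=(k_s,))

# With B = 0 the adapted layer reproduces the base layer exactly.
y_base = W0 @ x
y_init = W0 @ x + B @ (A @ x)

# After "training" (random values stand in for learned ones), applying the
# adapter separately is equivalent to merging it into the weight matrix.
B = rng.normal(size=(d_s, r_s))
y_adapter = W0 @ x + B @ (A @ x)
y_merged = (W0 + B @ A) @ x
print(np.allclose(y_base, y_init), np.allclose(y_adapter, y_merged))
```

The second check is why LoRA adapters can be merged into the base weights after training: the merged matrix gives the same outputs as running the adapter alongside the frozen layer.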

B. What is QLoRA?

QLoRA (Quantized Low-Rank Adaptation) takes efficiency a step further. It stores the frozen base model weights ($W_0$) in a compact **4-bit NormalFloat (NF4)** representation, cutting their memory footprint to roughly a quarter of the 16-bit size.

During training, the 4-bit weights are dequantized to 16-bit brain-floats (BF16) on the fly for the forward and backward passes. Gradients are computed only for the trainable 16-bit LoRA adapter weights ($A$ and $B$); the quantized base weights are never updated. This allows developers to fine-tune a 7B parameter LLM on a single 24 GB consumer GPU (like an Nvidia RTX 3090 or RTX 4090), typically with little loss in quality compared with 16-bit LoRA fine-tuning.
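
With Hugging Face Transformers, the 4-bit NF4 setup is usually expressed through a `BitsAndBytesConfig`. The fragment below is a hedged sketch: it assumes the `bitsandbytes` package and a CUDA GPU are available, and `load_quantized_base` is a hypothetical wrapper name, not a library function.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Store weights as 4-bit NF4 and dequantize to BF16 for computation.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)


def load_quantized_base(model_id):
    """Load a causal LM with 4-bit base weights (hypothetical helper).

    Requires a CUDA GPU and the bitsandbytes package at call time.
    """
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    # Prepares the quantized model for training.
    return prepare_model_for_kbit_training(model)
```

The returned model can then be wrapped with `get_peft_model` and a `LoraConfig` in the same way as an unquantized model.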

Python (Hugging Face PEFT / LoRA)

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

# 1. Load the tokenizer and base model
model_id = "meta-llama/Meta-Llama-3-8B"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 2. Define the LoRA Config
lora_config = LoraConfig(
    r=16,                       # Rank value
    lora_alpha=32,              # Scaling factor
    target_modules=["q_proj", "v_proj"], # Target projection weights
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

# 3. Wrap model with PEFT adapter layers
peft_model = get_peft_model(model, lora_config)

# 4. Print parameter statistics
peft_model.print_trainable_parameters()
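
As a sanity check on what the statistics should look like, the adapter size for this configuration can be estimated by hand. The layer shapes below are assumptions taken from the published Llama 3 8B configuration (32 layers, hidden size 4096, 32 attention heads, 8 key/value heads); verify them against the model's `config.json` before relying on the result.

```python
def lora_param_count(in_features, out_features, rank):
    """Parameters in one LoRA adapter: A is rank x in, B is out x rank."""
    return rank * (in_features + out_features)


# Assumed architecture values (check the model's config.json).
hidden_size = 4096
num_layers = 32
num_heads = 32
num_kv_heads = 8
head_dim = hidden_size // num_heads   # 128
rank = 16                             # matches r=16 in the LoraConfig

q_out = num_heads * head_dim          # q_proj output width: 4096
v_out = num_kv_heads * head_dim       # v_proj output width: 1024 (grouped-query attention)

per_layer = (
    lora_param_count(hidden_size, q_out, rank)
    + lora_param_count(hidden_size, v_out, rank)
)
total = per_layer * num_layers
print(f"{total:,}")  # 6,815,744
```

If the assumed shapes are right, `print_trainable_parameters()` should report a trainable count close to this, which is well under 0.1% of an 8-billion-parameter model.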

4.3 Alignment & RLHF

After SFT, models can still produce toxic, biased, or unhelpful answers. We apply **Alignment** techniques to make them Helpful, Harmless, and Honest (HHH).

Reinforcement Learning from Human Feedback (RLHF):
1. Generate multiple responses for a prompt and have humans rank them.
2. Train a secondary **Reward Model** to output a score matching human preferences.
3. Fine-tune the SFT model (the policy) using **Proximal Policy Optimization (PPO)**, maximizing the reward model's score while applying a KL divergence penalty that keeps the policy's output distribution from drifting too far from that of the original SFT model.
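
Steps 2 and 3 each rest on a small formula, sketched here in plain Python on scalar values. This is an illustrative sketch under assumptions: the pairwise loss shown is the commonly used Bradley-Terry style objective, the function names are hypothetical, and `beta=0.1` is an arbitrary placeholder. The PPO update itself is not shown.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))


def reward_model_loss(score_preferred, score_other):
    """Pairwise loss for step 2: small when the human-preferred response
    receives the higher reward score."""
    return -log_sigmoid(score_preferred - score_other)


def kl_penalized_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward used in step 3: the reward model score minus a penalty based on
    the log-probability ratio between the policy and the frozen SFT model
    (a single-sample estimate of the KL term)."""
    return reward - beta * (logp_policy - logp_sft)


print(reward_model_loss(2.0, 0.0) < reward_model_loss(0.0, 2.0))
print(kl_penalized_reward(1.0, logp_policy=-4.0, logp_sft=-5.0))
```

When the policy assigns a response more probability than the SFT model did, the log-ratio is positive and the effective reward shrinks, which is what discourages large departures from the SFT behavior.
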

Direct Preference Optimization (DPO):
RLHF with PPO is often unstable to train and requires keeping several models in memory at once: the policy being trained, the reward model, a frozen reference model, and usually a value (critic) model. **DPO** removes the separate reward model and the reinforcement learning loop. It folds the preference objective directly into the policy's loss function, so you can align a model with a simple binary cross-entropy (logistic) loss over pairwise preference data (e.g. chosen vs. rejected completions), using only the policy and a frozen reference model.
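
The DPO loss for a single preference pair can be written in a few lines. The sketch below works on scalar sequence log-probabilities; the function name and the default `beta=0.1` are illustrative assumptions, and in real training these values come from summing token log-probabilities under the policy and a frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of completions."""
    # How much more (or less) likely each completion is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: the loss equals log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has shifted probability toward the chosen completion: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The loss falls as the policy raises the probability of the chosen completion relative to the rejected one, measured against the reference model, so no separate reward model is needed.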

4.4 Quantization

Pretrained weights are typically saved as 16-bit floating-point numbers (FP16 or BF16), requiring 2 bytes of memory per parameter. Quantization compresses these weights into smaller representations (like 8-bit or 4-bit integers), dramatically reducing GPU RAM requirements.
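
The savings follow directly from the bytes-per-parameter arithmetic. This small sketch counts weight storage only; it ignores the extra scale factors that quantized formats store, as well as activations and the KV cache, so real usage will be somewhat higher. The function name is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# 16-bit: 14.0 GB
# 8-bit: 7.0 GB
# 4-bit: 3.5 GB
```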

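Post-Training Quantization (PTQ): Compresses weights after training is complete. Widely used methods include **GPTQ** and **AWQ**, which run a small calibration dataset through the model to decide how to quantize each layer with minimal error: GPTQ minimizes each layer's output reconstruction error using approximate second-order information, while AWQ uses activation statistics to identify and protect the most important weight channels.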
GGUF (GPT-Generated Unified Format): A binary file format designed for running LLMs locally on CPU/GPU hardware using frameworks like llama.cpp. Supports 2-bit, 3-bit, 4-bit, 5-bit, 6-bit, and 8-bit quantized weights.
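
To make the idea of mapping weights to a few bits concrete, here is a deliberately naive example: symmetric round-to-nearest quantization with one scale per block. It is not GPTQ, AWQ, or any GGUF quantization type, all of which use more careful schemes; the function names and the sample values are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Symmetric round-to-nearest quantization with one absmax scale per block."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes, scale):
    """Recover approximate weights from integer codes and the block scale."""
    return [c * scale for c in codes]


block = [0.12, -0.50, 0.33, 0.07, -0.21, 0.49, -0.02, 0.00]
codes, scale = quantize_block(block)
restored = dequantize_block(codes, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"max error {max_error:.4f}, half a step is {scale / 2:.4f}")
```

Each weight is stored as a small integer plus a shared scale, and the rounding error is at most half a quantization step. Calibration-based methods aim to choose the quantized values so that this error matters less for the layer's outputs.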

4.5 Distributed Training

When training models with tens of billions of parameters, the model weights, gradients, and optimizer states cannot fit in a single GPU's memory. We must split the workload across multiple GPUs:

Data Parallelism: Copies the model to all GPUs. Each GPU processes a different slice of the data batch and synchronizes gradients before updating.
Tensor Parallelism: Shards individual weight matrices (e.g. splitting the Self-Attention projection weights) across multiple GPUs, performing parallel matrix multiplications.
Pipeline Parallelism: Shards different layers of the model sequentially across a chain of GPUs (e.g., layers 1-8 on GPU 0, layers 9-16 on GPU 1).
ZeRO (Zero Redundancy Optimizer): Developed by Microsoft as part of DeepSpeed. It removes the memory redundancy of data parallelism by sharding optimizer states (Stage 1), gradients (Stage 2), and model parameters (Stage 3) across GPUs instead of replicating them.
FSDP (Fully Sharded Data Parallel): PyTorch's native counterpart to ZeRO Stage 3. It shards parameters, gradients, and optimizer states across data-parallel workers, allowing models to scale to hundreds of billions of parameters across clusters of GPUs.
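
The effect of sharding on per-GPU memory can be estimated with simple arithmetic. This sketch assumes mixed-precision Adam training with 2 bytes per parameter for weights, 2 for gradients, and 12 for optimizer states; it ignores activations, buffers, and fragmentation, and `zero_memory_gb` is a hypothetical helper name.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory for model states, in GB (1e9 bytes).

    stage 0 = plain data parallelism, stages 1-3 = ZeRO sharding levels.
    """
    # Assumed bytes per parameter under mixed-precision Adam.
    params_b, grads_b, optim_b = 2.0, 2.0, 12.0
    if stage >= 1:
        optim_b /= num_gpus     # shard optimizer states
    if stage >= 2:
        grads_b /= num_gpus     # also shard gradients
    if stage >= 3:
        params_b /= num_gpus    # also shard parameters
    return num_params * (params_b + grads_b + optim_b) / 1e9


for stage in range(4):
    print(f"stage {stage}: {zero_memory_gb(7e9, num_gpus=8, stage=stage):.2f} GB per GPU")
```

Under these assumptions a 7B model on 8 GPUs drops from 112 GB per GPU with plain data parallelism to 14 GB with full sharding, at the cost of extra communication to gather parameters when they are needed.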

Figure: Parameter-Efficient Fine-Tuning: LoRA Weight Decomposition

💡

Next Steps

Proceed to Module 5: Prompting & Prompt Engineering to learn how to interact with aligned models.