Module 8: Diffusion Models & Image Generation

Master generative vision, diffusion theory, DDPM equations, Latent Diffusion, Stable Diffusion, CLIP text encoders, ControlNet, and DreamBooth.

⏱ 25 Min Read • Author: GenAIWallah Team • Updated: May 2026

8.1 Generative Model Taxonomy

Generative computer vision aims to model the data distribution of realistic images to generate novel samples.

Generative Adversarial Networks (GANs): Train two models simultaneously: a Generator that creates fake images and a Discriminator that classifies images as real or fake. They produce sharp outputs, but training is unstable and prone to mode collapse.
Variational Autoencoders (VAEs): Compress images into a continuous latent bottleneck vector and reconstruct them. Training is stable, but outputs tend to be blurry.
Normalizing Flows: Apply a sequence of invertible mappings to transform a simple probability distribution into a complex one. They give exact likelihoods, but the invertibility requirement constrains the architecture and makes them computationally heavy.
Diffusion Models: Formulate image generation as a sequential denoising process. They offer stable training, diverse samples, and state-of-the-art image quality.

8.2 Diffusion Model Theory

Diffusion models are inspired by non-equilibrium thermodynamics. They consist of two processes:

Forward Diffusion Process (Noising): A Markov chain that gradually adds Gaussian noise to an image $$x_0$$ over $$T$$ time steps, transforming the image into pure isotropic noise $$x_T$$ according to a variance schedule $$\beta_t$$.
Reverse Diffusion Process (Denoising): The model (typically a U-Net or Vision Transformer) learns to predict and subtract the added noise at each step, mapping the pure noise back to a clean image.
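
For reference, the forward process described above is usually written as follows. This is a sketch in the standard DDPM notation; the symbols $$\alpha_t$$, $$\bar{\alpha}_t$$, and $$\epsilon$$ are shorthand introduced here and are not defined elsewhere in this module.

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$$

$$\alpha_t = 1-\beta_t, \qquad \bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

The last line is what makes training practical: $$x_t$$ can be sampled directly from $$x_0$$ for any $$t$$ without simulating every intermediate step.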

DDPM (Denoising Diffusion Probabilistic Models): The classic formulation (Ho et al., 2020), which generates images sequentially over many steps (e.g., $$T=1000$$), making inference slow.

DDIM (Denoising Diffusion Implicit Models): A deterministic sampler that allows skipping generation steps, accelerating inference times by 10x-50x (e.g., generating high-quality samples in 20-50 steps).
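
The step skipping that DDIM allows can be sketched in a few lines of NumPy. This is a minimal illustration of the deterministic update, not a production sampler. The names `ddim_sample`, `eps_model`, and `oracle_eps` are hypothetical, and the "model" is an oracle that already knows the clean image, standing in for a trained noise-prediction network.

```python
import numpy as np

def ddim_sample(x_T, alpha_bar, eps_model, num_steps):
    """Deterministic DDIM sampling over a strided subset of the training timesteps."""
    T = len(alpha_bar)
    # Evenly spaced timesteps from T-1 down to 0 (e.g. 50 of the 1000 training steps)
    timesteps = np.linspace(T - 1, 0, num_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(timesteps):
        eps = eps_model(x, t)
        # Clean image implied by the current noise prediction
        x0_pred = (x - np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
        # alpha_bar at the next, less noisy timestep; 1.0 means no noise is left
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < num_steps else 1.0
        # Jump directly to that timestep without adding fresh noise
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x

# Assumed setup: linear beta schedule with T = 1000
betas = np.linspace(0.0001, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = np.full((8, 8), 0.5)
true_noise = rng.standard_normal(x0.shape)
x_T = np.sqrt(alpha_bar[-1]) * x0 + np.sqrt(1.0 - alpha_bar[-1]) * true_noise

def oracle_eps(x, t):
    # Stand-in for a trained network: recovers the exact noise because it knows x0
    return (x - np.sqrt(alpha_bar[t]) * x0) / np.sqrt(1.0 - alpha_bar[t])

sample = ddim_sample(x_T, alpha_bar, oracle_eps, num_steps=50)
print("Max reconstruction error:", np.abs(sample - x0).max())
```

With a perfect noise predictor the 50-step trajectory lands back on the clean image. A real network only approximates the noise, so sample quality depends on how many steps are kept.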

Python (Gaussian Noise Addition Simulation)

import numpy as np

def add_noise(x_0, t, beta_schedule, rng=None):
    """Sample x_t from the forward process in closed form; returns (x_t, noise)."""
    rng = np.random.default_rng() if rng is None else rng

    # alpha_bar_t is the cumulative product of (1 - beta_s) for s = 0..t
    alpha_bar = np.prod(1.0 - beta_schedule[: t + 1])

    # Draw standard Gaussian noise with the same shape as the image
    noise = rng.standard_normal(x_0.shape)

    # Formula: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * noise
    x_t = np.sqrt(alpha_bar) * x_0 + np.sqrt(1.0 - alpha_bar) * noise
    return x_t, noise

# Simulate a flat gray 64x64 pixel image
img = np.full((64, 64), 0.5)
# Linear variance schedule over T = 1000 steps
schedule = np.linspace(0.0001, 0.02, 1000)

x_t, noise = add_noise(img, t=500, beta_schedule=schedule)
print("Noised Image Mean:", np.mean(x_t))

8.3 Latent Diffusion Models (LDM)

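Running diffusion directly in high-resolution pixel space (e.g., 1024x1024 images) is computationally prohibitive: the number of pixels grows quadratically with the image side length, and the U-Net's self-attention layers in turn scale quadratically with the number of spatial positions.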

**Latent Diffusion** (popularized by **Stable Diffusion**) resolves this:

An encoder (**VAE**) compresses a high-resolution pixel image $$x$$ into a lower-dimensional latent space representation $$z$$ (e.g. compressing a 512x512x3 image to a 64x64x4 latent grid).
The forward and reverse diffusion processes run entirely inside this **compressed latent space**, which in this example has 64x fewer spatial positions (and 48x fewer values overall) than the pixel image.
A decoder (**VAE**) takes the final generated latent and projects it back to high-resolution pixels.
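
The size reduction in the example above can be checked with simple arithmetic. The sketch below only counts array elements for the 512x512x3 image and the 64x64x4 latent; it does not measure actual U-Net compute, which also depends on the architecture.

```python
import math

pixel_shape = (512, 512, 3)   # height, width, RGB channels
latent_shape = (64, 64, 4)    # height, width, latent channels

pixel_values = math.prod(pixel_shape)
latent_values = math.prod(latent_shape)

# Ratio of spatial positions (height * width), ignoring channels
spatial_ratio = (pixel_shape[0] * pixel_shape[1]) / (latent_shape[0] * latent_shape[1])
# Ratio of total stored values, including channels
value_ratio = pixel_values / latent_values

print("Pixel values: ", pixel_values)
print("Latent values:", latent_values)
print("Spatial positions reduced by:", spatial_ratio)
print("Total values reduced by:     ", value_ratio)
```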

8.4 Text-to-Image Systems

To generate images from text instructions (prompts), we inject text conditioning vectors into the U-Net using cross-attention mechanisms.

CLIP Text Encoder: Encodes prompts into text embeddings. CLIP is trained on pairs of images and text captions, aligning text representations with visual semantics. Modern systems also combine CLIP with large language models (like Google's T5-XXL) to parse complex spatial descriptions.
ControlNet: Conditions image generation on spatial structural guides (like Canny edges, human pose skeletons, or depth maps), giving precise control over layout and shape.
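
The cross-attention conditioning mentioned at the start of this section can be sketched as follows. This is a single-head illustration in NumPy, assuming random projection matrices and made-up dimensions (16 latent tokens, 77 text tokens); the function names and sizes are hypothetical and do not reflect any specific model's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, text_tokens, W_q, W_k, W_v):
    """Queries come from image latents; keys and values come from the text embeddings."""
    Q = latent_tokens @ W_q                    # (n_latent, d)
    K = text_tokens @ W_k                      # (n_text, d)
    V = text_tokens @ W_v                      # (n_text, d)
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (n_latent, n_text)
    weights = softmax(scores, axis=-1)         # each latent token attends over the prompt
    return weights @ V, weights

rng = np.random.default_rng(0)
latent_tokens = rng.standard_normal((16, 32))  # flattened spatial positions of a feature map
text_tokens = rng.standard_normal((77, 64))    # prompt embeddings from a text encoder

W_q = rng.standard_normal((32, 24))
W_k = rng.standard_normal((64, 24))
W_v = rng.standard_normal((64, 24))

out, weights = cross_attention(latent_tokens, text_tokens, W_q, W_k, W_v)
print("Output shape:", out.shape)
print("Attention weights shape:", weights.shape)
```

Each spatial position ends up as a weighted mix of text-token values, which is how the prompt influences every region of the image.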

8.5 Fine-Tuning Image Models

To personalize diffusion models (e.g. generating images of a specific person or custom art style), we use lightweight adaptation techniques:

Textual Inversion: Learns the embedding vector of a new pseudo-word token (e.g., ``) that represents the user's concept, leaving the model's weights untouched.
DreamBooth: Fine-tunes the full U-Net weights on a few images (3-5) of a subject, using a class-specific prior preservation loss so the model does not forget the parent category (for example, what other dogs look like).
LoRA for Diffusion: Inserts low-rank weight updates into the cross-attention and projection layers of the U-Net, producing small adapter files that are easy to share (e.g., 100 MB).
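
To make the low-rank idea behind LoRA concrete, here is a minimal NumPy sketch of a linear layer with a frozen weight matrix and a trainable low-rank update. The class name `LoRALinear`, the rank, the scaling, and the 320x320 layer size are illustrative assumptions, and no training loop is shown.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update scale * (B @ A)."""

    def __init__(self, W, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                        # frozen pretrained weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01 # trainable, small random init
        self.B = np.zeros((d_out, rank))                  # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Base projection plus the low-rank correction
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

    def num_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.standard_normal((320, 320))
layer = LoRALinear(W, rank=4)
x = rng.standard_normal((2, 320))

print("Full layer parameters:", W.size)
print("LoRA parameters:      ", layer.num_trainable())
print("Matches base layer before training:", np.allclose(layer(x), x @ W.T))
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the original, and only the two small matrices need to be stored and shared.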

8.6 Image Editing

Diffusion models can also edit existing images:

Inpainting: Masking a specific region of an image and running reverse diffusion only on that masked region, blending newly generated elements (like adding a hat onto a person) into the background.
Outpainting: Expanding an image beyond its original borders by predicting plausible surrounding content.
InstructPix2Pix: Modifying images via natural language instructions (e.g., "Change the background to winter") using a model trained on a dataset of input images, edit instructions, and edited images.
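
The masking step behind inpainting can be sketched as a simple blend. This shows one common approach, assuming the blend is applied at each reverse step between the model's output and a suitably noised copy of the original; the function name `blend_known_region` and the toy arrays are hypothetical.

```python
import numpy as np

def blend_known_region(x_generated, x_original_noised, mask):
    """Keep generated content where mask == 1 and original content where mask == 0."""
    return mask * x_generated + (1.0 - mask) * x_original_noised

# Toy 4x4 example: the original is all zeros, the generated content is all ones
original = np.zeros((4, 4))
generated = np.ones((4, 4))

# Regenerate only the central 2x2 patch
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0

blended = blend_known_region(generated, original, mask)
print(blended)
```

Only the masked patch takes values from the generated array, so the rest of the image is preserved.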

Figure: Latent Diffusion: Noise Insertion and Image Generation

Next Steps

Proceed to Module 9: Multimodal Models to study visual language models and video architectures.