Module 9 · Multimodal Models

Master Multimodal AI, Vision Transformers, LLaVA projection, Whisper ASR, Sora video diffusion, and unified any-to-any systems.

⏱ 22 Min Read • Author: GenAIWallah Team • Updated: May 2026

9.1 Vision-Language Models (VLMs)

Humans perceive the world using multiple senses simultaneously. To build models that can analyze images and diagrams alongside text, we construct **Vision-Language Models (VLMs)**.

A. Patch Embeddings & Vision Transformers (ViT)

A Transformer operates on a sequence of token embeddings, so an image must first be turned into such a sequence:

The image is divided into a grid of non-overlapping square patches (e.g. $16 \times 16$ pixels per patch).
Each patch is flattened into a single vector (for an RGB patch, $16 \times 16 \times 3 = 768$ values).
A learned linear projection maps each flattened patch to an embedding vector, and a position embedding is added so the model knows where the patch sat in the grid. From this point on, image patches are handled much like word tokens.
This sequence is passed through a standard Transformer encoder, which outputs a sequence of visual representation tokens.
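
The patching and flattening steps above can be traced on a toy input. This is a minimal sketch in plain Python lists rather than a tensor library; `patch_grid_stats` and `patchify` are illustrative names, not functions from any ViT implementation, and the learned projection and position embeddings are left out.

```python
def patch_grid_stats(image_size, patch_size, channels=3):
    """Return (num_patches, values_per_patch) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side * per_side, patch_size * patch_size * channels


def patchify(image, patch_size):
    """Split a 2D grid (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order: left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# A 224x224 RGB image with 16x16 patches: a 14x14 grid of 768-value vectors.
num_patches, patch_dim = patch_grid_stats(224, 16)
print(num_patches, patch_dim)  # 196 768

# Toy 4x4 single-channel "image" split into 2x2 patches.
toy_image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
toy_patches = patchify(toy_image, 2)
print(toy_patches[0])  # [0, 1, 4, 5]
```

The 196 patches computed here are the same sequence length used in the projection example in section B.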

B. Projection Mappings (LLaVA)

How do we align the output tokens of a visual encoder (like CLIP ViT) with a language model's embedding space (like LLaMA)?

In **LLaVA (Large Language and Vision Assistant)**, the visual token embeddings are passed through a trainable **projection connector** (a single linear layer in the original LLaVA, a small MLP in LLaVA-1.5). The connector maps each visual feature to a vector with the same dimension as the LLM's word embeddings, so the LLM can process visual tokens and text tokens as one combined sequence.

PyTorch (Visual-Text Embedding Projection Simulation)

import torch
import torch.nn as nn

# Simulate visual encoder outputs: 196 patches, 1024-dim features
visual_features = torch.randn(1, 196, 1024)

# Linear projection layer aligning to LLM embedding dimension (4096)
projection_connector = nn.Linear(1024, 4096)

# Projected visual tokens
visual_tokens = projection_connector(visual_features)
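print("Aligned Visual Token Shape:", visual_tokens.shape)  # torch.Size([1, 196, 4096])

# Concatenate with text embeddings [1, seq_len, 4096] along the sequence axis.
# seq_len = 12 is an arbitrary value chosen for illustration.
text_embeddings = torch.randn(1, 12, 4096)
llm_inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
print("Combined Sequence Shape:", llm_inputs.shape)  # torch.Size([1, 208, 4096])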
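
For a sense of scale, the two connector options mentioned above (a single linear layer or a small MLP) can be compared by parameter count. This is a rough sketch: the 1024 and 4096 dimensions come from the example, while the MLP shape (two linear layers with a hidden width of 4096 and a parameter-free activation between them) is an assumption made for illustration.

```python
def linear_params(in_features, out_features, bias=True):
    """Parameter count of one fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)


# Single linear connector: 1024 -> 4096.
linear_connector = linear_params(1024, 4096)

# Assumed two-layer MLP connector: 1024 -> 4096 -> 4096.
# The activation between the layers adds no parameters.
mlp_connector = linear_params(1024, 4096) + linear_params(4096, 4096)

print(linear_connector)  # 4198400
print(mlp_connector)     # 20979712
```

Either way the connector is tiny next to a language model with billions of parameters, which is why it is cheap to train.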

9.2 Frontier Multimodal Models

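Early multimodal systems were stitched together from separate pre-trained modules (e.g. a standalone Whisper ASR model feeding LLaMA, whose text output is then fed to a text-to-speech synthesizer). Each hand-off passes only text, so information such as tone, emphasis, and background sound is lost, and running the stages one after another adds latency.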

Native Multimodality: Models such as GPT-4o and Gemini 1.5 Pro are instead trained end-to-end as a single network on text, vision, and audio data together.

GPT-4o: Uses a single neural network to process text, vision, and audio. OpenAI reports that it can respond to audio input in as little as 232 ms (about 320 ms on average), which is close to human response time in conversation.
Gemini 1.5 Pro: Features a native **million-token context window**, which Google describes as enough for roughly an hour of video, about 11 hours of audio, or a codebase of more than 30,000 lines in a single query.

9.3 Audio & Speech

Audio models first convert a continuous waveform into a sequence a Transformer can consume: either frames of spectrogram features or discrete audio tokens.

Whisper ASR: An encoder-decoder Transformer developed by OpenAI. Audio is resampled to 16 kHz, split into 30-second windows, and converted into log-mel spectrogram features. The encoder processes these features and the decoder generates the text transcript, supporting multilingual speech recognition and speech translation into English.
Neural Audio Codec Tokenizers: Native audio models can instead use a neural codec (like SoundStream or EnCodec) to turn the continuous signal into discrete audio tokens. Sound is then represented as token sequences that a Transformer can model in the same way as text tokens.
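
Both routes can be sketched in a few lines. The first function reproduces the framing arithmetic for one Whisper window (16 kHz audio, 30 seconds, a 10 ms hop of 160 samples). The second shows residual vector quantization, the technique SoundStream and EnCodec use to produce discrete tokens. The codebooks and the input vector below are made-up toy values; real codecs learn their codebooks and quantize encoder outputs, not hand-picked vectors.

```python
def whisper_frame_counts(seconds=30, sample_rate=16_000, hop_length=160):
    """Return (samples, spectrogram_frames) for one audio window."""
    samples = seconds * sample_rate
    return samples, samples // hop_length


def nearest_code(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    def sq_dist(entry):
        return sum((v - e) ** 2 for v, e in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_vector_quantize(vector, codebooks):
    """Quantize `vector` with a stack of codebooks, one per residual stage.

    Each stage quantizes what the previous stages failed to capture.
    Returns the chosen index per stage and the reconstructed vector.
    """
    residual = list(vector)
    reconstruction = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        chosen = codebook[idx]
        indices.append(idx)
        reconstruction = [r + c for r, c in zip(reconstruction, chosen)]
        residual = [r - c for r, c in zip(residual, chosen)]
    return indices, reconstruction


print(whisper_frame_counts())  # (480000, 3000)

# Toy codebooks (hypothetical values): a coarse stage and a fine stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]
indices, reconstruction = residual_vector_quantize([1.2, 0.8], [coarse, fine])
print(indices)         # [1, 1]
print(reconstruction)  # [1.25, 0.75]
```

The list of indices is the discrete representation: one small integer per stage for each audio frame, which is what a language-model-style Transformer would consume.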

9.4 Video Generation

Generating video requires modeling temporal continuity across frames.

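Diffusion Transformers (DiT): Replace the convolutional U-Net that usually serves as the denoising network in diffusion models with a Transformer operating on latent patches. For video, these become 3D space-time patches (blocks of the latent grid spanning several consecutive frames), and the architecture scales well as model size and compute grow.
OpenAI Sora: A text-to-video model trained on space-time patches. It generates clips of up to 60 seconds with complex camera motion and consistent character details. OpenAI presents such models as a step toward world simulators, but Sora learns regularities from video data and does not run an explicit physics engine, and OpenAI notes that it can still get physical interactions wrong.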
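
To make the idea of space-time patches concrete, the sketch below counts how many tokens a latent video produces. All sizes are hypothetical: Sora's actual latent shapes and patch sizes have not been published, and `spacetime_token_count` is an illustrative helper, not part of any library.

```python
def spacetime_token_count(frames, height, width, patch_t, patch_h, patch_w):
    """Number of space-time patches in a (frames, height, width) latent grid."""
    for size, patch in ((frames, patch_t), (height, patch_h), (width, patch_w)):
        if size % patch != 0:
            raise ValueError("each grid side must be divisible by its patch size")
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)


# Hypothetical latent video: 16 frames of 32x32 latents, 2x2x2 patches.
tokens = spacetime_token_count(16, 32, 32, 2, 2, 2)
print(tokens)  # 2048

# Doubling the clip length doubles the tokens, and the number of
# token pairs that full self-attention compares grows fourfold.
longer = spacetime_token_count(32, 32, 32, 2, 2, 2)
print(longer, (longer ** 2) // (tokens ** 2))  # 4096 4
```

The quadratic growth in attention pairs is one reason long, high-resolution clips are expensive to generate.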

9.5 Any-to-Any Models

A prominent research direction is unified **Any-to-Any Models**. Rather than building a separate model for vision, speech, or robot control, a single architecture takes arbitrary inputs (text, coordinates, sound, images) and produces arbitrary outputs (robot motor torques, generated images, or voice replies) by representing every modality as tokens in one shared sequence-to-sequence format.
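
One simple way to picture a unified token representation is a shared vocabulary in which each modality owns a contiguous range of ids. The sketch below is an assumption-laden illustration: the vocabulary sizes and helper names are hypothetical, and real systems may organize their vocabularies differently.

```python
# Hypothetical vocabulary sizes per modality; real systems choose their own.
MODALITY_VOCAB_SIZES = {"text": 32_000, "image": 8_192, "audio": 1_024}


def build_offsets(vocab_sizes):
    """Assign each modality a contiguous, non-overlapping id range."""
    offsets, start = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = start
        start += size
    return offsets, start  # start is now the total vocabulary size


def to_unified(segments, vocab_sizes, offsets):
    """Flatten [(modality, local_ids), ...] into one sequence of global ids."""
    unified = []
    for modality, local_ids in segments:
        for local_id in local_ids:
            if not 0 <= local_id < vocab_sizes[modality]:
                raise ValueError(f"id {local_id} out of range for {modality}")
            unified.append(offsets[modality] + local_id)
    return unified


def from_unified(global_id, vocab_sizes, offsets):
    """Recover (modality, local_id) from a global id."""
    for modality, start in offsets.items():
        if start <= global_id < start + vocab_sizes[modality]:
            return modality, global_id - start
    raise ValueError(f"id {global_id} outside the unified vocabulary")


OFFSETS, TOTAL_VOCAB = build_offsets(MODALITY_VOCAB_SIZES)
sequence = to_unified(
    [("text", [5, 17]), ("image", [0, 4095]), ("audio", [1023])],
    MODALITY_VOCAB_SIZES,
    OFFSETS,
)
print(TOTAL_VOCAB)  # 41216
print(sequence)     # [5, 17, 32000, 36095, 41215]
print(from_unified(36095, MODALITY_VOCAB_SIZES, OFFSETS))  # ('image', 4095)
```

Because every id maps back to exactly one modality, the same next-token predictor can emit text, image, or audio tokens, and a decoder for each modality turns its ids back into output.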

[Interactive figure: Multimodal Alignment: CLIP Contrastive Training Space]

Next Steps

Proceed to Module 10: Evaluation & Benchmarking to learn how to measure AI capabilities systematically.