Module 6: Retrieval-Augmented Generation (RAG)

Master RAG pipelines, text chunking, embeddings, vector databases, hybrid search, re-ranking, and Graph RAG.

⏱ 23 Min Read • Author: GenAIWallah Team • Updated: May 2026

6.1 Why RAG?

LLMs have two core limitations:

Knowledge Cutoffs: A model only knows facts up to its pretraining date.
Hallucinations: When asked about niche, obscure, or internal enterprise data, LLMs tend to generate false facts confidently.

**Retrieval-Augmented Generation (RAG)** addresses both limitations by fetching relevant documents from an external data store at query time, inserting them into the LLM's context window, and instructing the model to answer the query *only* from the retrieved context.
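The "insert into the prompt and instruct the model" step can be sketched as a small prompt-assembly function. This is a minimal sketch, not a prescribed template: the function name `build_rag_prompt`, the instruction wording, and the `[n]` chunk numbering are all assumptions made for illustration.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a grounded prompt from a question and retrieved chunks.

    Hypothetical helper: the wording of the instruction is illustrative.
    """
    # Number each chunk so the model (and a human reviewer) can refer to it.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using ONLY the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "What do vector databases store?",
    [
        "Vector databases store high-dimensional embeddings.",
        "Prompt engineering includes few-shot prompting and CoT reasoning.",
    ],
)
print(prompt)
```

The resulting string is what would be sent to the LLM. The explicit "say you do not know" escape hatch is one common way to discourage answers that go beyond the retrieved context.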

Parametric vs. Non-Parametric Memory: Parametric memory is the knowledge stored in the frozen weights of the LLM. Non-parametric memory is the external database. RAG separates these two, utilizing the database for fact storage and the LLM strictly as an extraction and reasoning engine.

6.2 RAG Pipeline Components

A production-grade RAG pipeline consists of the following modules:

A. Document Loading & Chunking

Before text can be stored, long documents must be split into smaller, coherent fragments (chunks).

Recursive Character Chunking: Splits text on an ordered list of separators (paragraph breaks `\n\n`, line breaks `\n`, spaces ` `), falling back to the next separator only when a piece is still too large, so that chunks follow natural boundaries where possible.
Chunk Overlap (e.g. 10-20%): Repeats a portion of text across adjacent chunks, reducing the chance that a chunk boundary cuts an important sentence off from its context.
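The two ideas above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the implementation used by any particular library: the helper names (`recursive_split`, `merge_pieces`, `add_overlap`) are hypothetical, sizes are measured in characters, and the overlap is taken as a raw character tail, so it may start mid-word.

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing only on oversized parts."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces: list[str] = []
    for part in text.split(sep):
        pieces.extend(recursive_split(part, max_len, rest))
    return pieces


def merge_pieces(pieces: list[str], max_len: int) -> list[str]:
    """Greedily pack consecutive pieces into chunks of at most max_len characters."""
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_len:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def add_overlap(chunks: list[str], overlap: int) -> list[str]:
    """Prefix each chunk with the last `overlap` characters of the previous one.

    Note: overlapped chunks can exceed max_len by up to overlap + 1 characters.
    """
    if overlap <= 0:
        return list(chunks)
    out = chunks[:1]
    for prev, cur in zip(chunks, chunks[1:]):
        out.append(f"{prev[-overlap:]} {cur}")
    return out


text = (
    "RAG retrieves documents at query time.\n\n"
    "Chunks are embedded and stored in a vector database.\n"
    "The query is embedded the same way."
)
pieces = recursive_split(text, max_len=60)
chunks = add_overlap(merge_pieces(pieces, max_len=60), overlap=10)
for chunk in chunks:
    print(repr(chunk))
```

In this example the paragraph break is tried first; only the second paragraph, which is still longer than 60 characters, is split further on the line break. A production splitter would typically measure size in tokens and align the overlap to word or sentence boundaries.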

B. Embeddings & Vector Databases

Chunks are converted into high-dimensional vector representations using an **Embedding Model** (e.g. OpenAI `text-embedding-3-small`, Nomic Embed).

These vectors are stored in a **Vector Database** (such as Chroma, Pinecone, or pgvector in PostgreSQL). When a user submits a query, the query is embedded with the same model, and the database returns its nearest neighbors according to a similarity metric such as:

Cosine Similarity: Measures the cosine of the angle between two vectors, so it depends on their direction and not on their magnitude: $\text{Cosine Similarity}(u, v) = \frac{u \cdot v}{\|u\| \|v\|}$

Python (Similarity Retrieval Simulation)

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Simulating vector store embeddings
chunks = [
    "Vector databases store high-dimensional embeddings.",
    "Calculus gradients are computed using the backpropagation chain rule.",
    "Prompt engineering includes few-shot prompting and CoT reasoning."
]

embeddings = [
    np.array([0.9, 0.1, 0.0]),
    np.array([0.1, 0.8, 0.1]),
    np.array([0.0, 0.2, 0.9])
]

# Query: "Explain gradients and derivatives"
query_emb = np.array([0.15, 0.75, 0.1])

similarities = [cosine_similarity(query_emb, emb) for emb in embeddings]
best_idx = np.argmax(similarities)

print("Best Match Chunk:", chunks[best_idx])
print("Similarity Score:", similarities[best_idx])
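In practice a retriever returns the top k chunks, not just the single best match. The sketch below reuses the toy embeddings from the simulation above and scores all of them in one matrix operation. The function name `top_k` is an assumption for illustration, and a real vector database would typically use an approximate nearest-neighbor index instead of scoring every row.

```python
import numpy as np


def top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int = 2):
    """Return (row_index, cosine_score) pairs for the k most similar rows."""
    # Normalize rows and the query so a dot product equals cosine similarity.
    rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query_vec / np.linalg.norm(query_vec)
    scores = rows @ q
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Same toy vectors as the simulation above, stacked into one matrix.
matrix = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
])
query = np.array([0.15, 0.75, 0.1])

results = top_k(query, matrix, k=2)
for idx, score in results:
    print(idx, round(score, 3))
```

Pre-normalizing the stored vectors once, at indexing time, avoids recomputing the row norms on every query.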

6.3 Advanced RAG

Basic vector search often fails to retrieve the right context. Advanced pipelines improve retrieval in several ways:

Hybrid Search: Combines keyword-based matches (**BM25/TF-IDF**) with semantic vector searches, leveraging both literal word matches and conceptual meaning.
Cross-Encoder Re-ranking: Initial vector searches are fast but loose. We retrieve a larger candidate pool (e.g. top 50), then pass them through a heavy **Cross-Encoder Model** to evaluate precise query-document relationships, re-ranking them to feed only the top 5 most relevant documents to the LLM.
HyDE (Hypothetical Document Embeddings): Given a query, we first use the LLM to generate a hypothetical answer. We then embed this hypothetical answer and use it, in place of the raw query, to search the vector database. Because a hypothetical answer usually resembles the stored passages more closely than the question does, this can improve retrieval accuracy.
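The first two techniques can be sketched together as a fuse-then-re-rank pipeline. Two assumptions are worth flagging: the text does not say how keyword and vector results are combined, so this sketch uses reciprocal rank fusion (RRF) with the commonly used constant `k=60` as one possible choice; and the `word_overlap` scorer is only a toy stand-in for a real cross-encoder model. The document IDs and rankings are made up for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


def rerank(query: str, candidates: list[str], score_fn, top_n: int) -> list[str]:
    """Re-score candidates with a slower, more precise scorer and keep top_n."""
    ranked = sorted(candidates, key=lambda doc_id: score_fn(query, doc_id),
                    reverse=True)
    return ranked[:top_n]


docs = {
    "d1": "Vector databases store high-dimensional embeddings.",
    "d2": "Prompt engineering includes few-shot prompting and CoT reasoning.",
    "d3": "Calculus gradients are computed using the backpropagation chain rule.",
}


def word_overlap(query: str, doc_id: str) -> int:
    """Toy scorer (stand-in for a cross-encoder): count shared lowercase words."""
    return len(set(query.lower().split()) & set(docs[doc_id].lower().split()))


# Hypothetical first-stage results from a keyword index and a vector index.
keyword_ranking = ["d3", "d1", "d2"]
vector_ranking = ["d1", "d2", "d3"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
final = rerank("how are gradients computed", fused, word_overlap, top_n=1)
print(fused, final)
```

RRF uses only rank positions, so it sidesteps the problem that BM25 scores and cosine similarities live on different scales. The re-ranker then only has to score the small fused pool, which is what makes a slow but precise model affordable at this stage.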

6.4 Evaluation of RAG

To evaluate a RAG setup, we can use the **RAGAS** framework, whose core metrics cover four dimensions:

Context Precision: Are the retrieved documents relevant to the query?
Context Recall: Did the retriever fetch all necessary information to answer the query?
Faithfulness (Groundedness): Is the LLM's final response supported *only* by the retrieved context, or did it hallucinate?
Answer Relevance: Does the model's generated answer directly address the user's initial question?
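To make the two retrieval-side metrics concrete, here is a simplified, set-based approximation. It assumes a hand-labeled set of relevant chunk IDs per question and is not the RAGAS API; RAGAS's own metrics are typically computed with LLM-assisted judgments. The function names and chunk IDs below are hypothetical.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunk IDs that are labeled relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for chunk_id in retrieved if chunk_id in relevant)
    return hits / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of labeled-relevant chunk IDs that were retrieved."""
    if not relevant:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical labels for one evaluation question.
retrieved_ids = ["c1", "c4", "c7", "c9"]
relevant_ids = {"c1", "c7", "c8"}

precision = context_precision(retrieved_ids, relevant_ids)
recall = context_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four retrieved chunks are relevant (precision 0.5) and two of the three relevant chunks were found (recall about 0.67). Low precision points to a noisy retriever, while low recall means the answer may be missing from the context entirely.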

6.5 Graph RAG

Traditional vector RAG is excellent at finding specific localized facts but struggles with global queries (e.g. "What are the main themes across all company documents?").

**Graph RAG** uses an LLM to extract entities (people, products, dates) and their relationships from documents, structuring them into a **Knowledge Graph** (stored in a database such as Neo4j). It then builds hierarchical summaries of graph clusters. When queried, it traverses the relationships between entities to compile answers that span multiple, otherwise unconnected files.
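The traversal step can be illustrated with a toy in-memory graph. This is a sketch under heavy assumptions: the triples, entity names, and file names are invented for illustration, the LLM extraction step is skipped (the triples are written by hand), edges are followed in one direction only, and the cluster-summary stage is not shown. A real system would query a graph database, not a Python dict.

```python
from collections import deque

# Hypothetical (subject, relation, object, source_file) triples, as an LLM
# extraction step might produce them.
triples = [
    ("Alice", "leads", "Project Atlas", "org_chart.md"),
    ("Project Atlas", "uses", "pgvector", "design_doc.md"),
    ("pgvector", "runs_on", "PostgreSQL", "infra_notes.md"),
    ("Bob", "leads", "Project Borealis", "org_chart.md"),
]


def build_graph(triples):
    """Build an adjacency map: entity -> list of (relation, object, source)."""
    graph = {}
    for subj, rel, obj, src in triples:
        graph.setdefault(subj, []).append((rel, obj, src))
        graph.setdefault(obj, [])  # make sure every entity is a node
    return graph


def collect_facts(graph, start, max_hops=2):
    """Breadth-first walk from `start`, returning facts within max_hops edges."""
    facts, seen, queue = [], {start}, deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for rel, obj, src in graph.get(node, []):
            facts.append((node, rel, obj, src))
            if obj not in seen:
                seen.add(obj)
                queue.append((obj, depth + 1))
    return facts


graph = build_graph(triples)
facts = collect_facts(graph, "Alice", max_hops=2)
for subj, rel, obj, src in facts:
    print(f"{subj} --{rel}--> {obj}  ({src})")
```

Starting from "Alice", two hops already connect a fact from `org_chart.md` to one from `design_doc.md`, which is the kind of cross-file link that a pure nearest-neighbor lookup over isolated chunks tends to miss.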

Figure: RAG Integration Flow (Retrieval and Prompt Merging)

Next Steps

Proceed to Module 7: Agents & Agentic Systems to see how RAG is automated into autonomous workflows.