LangChain & GenAI · Module 5

Module 5: Document Loaders & Text Splitting

Learn Document Loaders and Text Splitting in LangChain — TextLoader, PyPDFLoader, CSVLoader, WebBaseLoader, DirectoryLoader, RecursiveCharacterTextSplitter, SemanticChunker and more.

⏱ 40 Min Read • Module 5 of 8 • Updated: May 2026

Before an LLM can answer questions about your documents, those documents must be loaded, parsed, and split into manageable chunks. This module covers the most common Document Loaders in LangChain and the main Text Splitter strategies, from simple character splitting to embedding-based semantic chunking. Loading and splitting together form the first stage of a RAG pipeline.

Day 9

Document Loaders (PDF, Web, CSV)

Why this matters

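Document loaders read PDFs, web pages, CSVs, and other files or URLs and return LangChain Document objects for RAG. Each object has the form Document(page_content=..., metadata=...): page_content holds the extracted text, and metadata holds provenance details such as the source path or page number.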
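To make the shape of a loader's output concrete, here is a plain-Python sketch that does not import LangChain. `SimpleDocument` and `load_csv_rows` are hypothetical stand-ins, not library classes; they only approximate the one-document-per-row pattern that CSV loaders typically follow, with the text in page_content and provenance in metadata.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(csv_text: str, source: str) -> List[SimpleDocument]:
    """Turn each CSV row into one document and record where it came from."""
    reader = csv.DictReader(io.StringIO(csv_text))
    docs = []
    for row_index, row in enumerate(reader):
        # Render the row as "column: value" lines so the text is self-describing.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        docs.append(SimpleDocument(content, {"source": source, "row": row_index}))
    return docs


sample = "name,role\nAda,engineer\nLin,analyst\n"
docs = load_csv_rows(sample, source="team.csv")
print(docs[0].page_content)
print(docs[1].metadata)
```

Keeping provenance next to the text is what later lets a RAG answer point back to a specific file and row.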

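# PyPDFLoader needs the pypdf package installed alongside langchain-community.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF; PyPDFLoader returns one Document per page.
docs = PyPDFLoader("report.pdf").load()

# Split the pages into chunks of up to 1000 characters with 100 characters of overlap.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)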

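PyPDFLoader, CSVLoader, and WebBaseLoader cover the most common sources: PDF files, CSV files, and web pages.
Use lazy_load() instead of load() for large corpora: it yields Documents one at a time rather than building the whole list in memory.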
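The memory argument for lazy loading can be illustrated with an ordinary generator. This is a plain-Python sketch of the idea, not LangChain's implementation; `iter_text_documents` and the file names are hypothetical, and documents are represented as plain dicts for brevity.

```python
import tempfile
from pathlib import Path
from typing import Dict, Iterator


def iter_text_documents(folder: Path) -> Iterator[Dict]:
    """Yield one document dict per .txt file instead of building a full list."""
    for path in sorted(folder.glob("*.txt")):
        # Only the file currently being yielded is held in memory.
        yield {
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": path.name},
        }


# Build a tiny throwaway corpus so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for i in range(3):
        (folder / f"note_{i}.txt").write_text(f"Contents of note {i}", encoding="utf-8")

    total_chars = 0
    sources = []
    for doc in iter_text_documents(folder):  # documents arrive one at a time
        total_chars += len(doc["page_content"])
        sources.append(doc["metadata"]["source"])

print(sources, total_chars)
```

Because each document is processed and then discarded, peak memory depends on the largest single file rather than on the size of the whole corpus.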

Common mistakes

Hard-coding API keys in source instead of environment variables.
Passing raw strings where ChatPromptTemplate expects message tuples.
Skipping text splitting before embedding large PDFs, so the input exceeds the embedding model's limit (context overflow).
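The first mistake above is usually avoided by reading the key from the environment at startup and failing early if it is missing. A minimal sketch, assuming a hypothetical variable name `EXAMPLE_API_KEY` and a hypothetical helper `require_env`; neither comes from LangChain.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return value


# For demonstration only: real code would rely on the shell or a secrets manager
# to set this value, never on an assignment in source.
os.environ.setdefault("EXAMPLE_API_KEY", "placeholder-for-demo")

api_key = require_env("EXAMPLE_API_KEY")
print(f"Loaded a key of length {len(api_key)}")
```

Failing at startup with a named variable is easier to debug than an authentication error raised deep inside a later call.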

Interview checkpoints

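Q: Explain document loaders in LangChain. A: Give a one-sentence definition plus one API name, for example: loaders turn files and URLs into Document objects, and PyPDFLoader does this for PDFs.
Q: What is a common bug? A: Hard-coded keys, the wrong message format, or a missing split or embed step.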

Practice

Basic: Sketch a minimal document loader snippet.
Intermediate: Run a notebook cell that loads a file with a document loader.
Advanced: Break a document loader intentionally (for example, point it at a missing file) and interpret the error.

Recap

You can explain document loaders clearly.
You know one mistake to avoid.
You see how this connects to the next lesson.

Next: Text Splitters

Day 10

Text Splitting & Chunking Strategies

Why this matters

Splitters break long documents into chunks small enough for the embedding model's input limit and the LLM's context window, which also keeps retrieval focused on the relevant passages.
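The simplest possible splitter cuts text into fixed-size windows, letting neighbouring windows share a few characters so that a sentence cut at a boundary still appears whole in one of them. The sketch below is plain Python under that assumption; `split_with_overlap` is a hypothetical helper, not a LangChain function.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each window starts chunk_size minus chunk_overlap characters after the previous one, so a larger overlap means more chunks and more duplicated text to embed.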

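RecursiveCharacterTextSplitter: tries a list of separators in order (by default paragraph breaks, then line breaks, then spaces, then single characters), moving to a finer separator only when a piece is still too large.
Tune chunk_size and chunk_overlap (often 10–20% overlap); both are measured in characters unless you supply a different length function.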
Preserve metadata (source, page) for citation in RAG answers.
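The recursive idea and the metadata point can be sketched together in plain Python. This is a simplified approximation, not the library's algorithm: it ignores overlap, drops separators at chunk boundaries, and uses an illustrative separator list (paragraphs, sentences, words, characters). `recursive_split` and `split_document` are hypothetical names.

```python
from typing import Dict, List, Sequence

SEPARATORS = ("\n\n", ". ", " ", "")  # paragraphs, sentences, words, characters


def recursive_split(text: str, chunk_size: int,
                    separators: Sequence[str] = SEPARATORS) -> List[str]:
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


def split_document(doc: Dict, chunk_size: int) -> List[Dict]:
    """Split one document and copy its metadata onto every chunk."""
    return [
        {"page_content": chunk, "metadata": {**doc["metadata"], "chunk": i}}
        for i, chunk in enumerate(recursive_split(doc["page_content"], chunk_size))
    ]


doc = {
    "page_content": "Loaders read files. Splitters cut text.\n\nChunks keep their source.",
    "metadata": {"source": "notes.txt", "page": 1},
}
pieces = split_document(doc, chunk_size=30)
for piece in pieces:
    print(piece)
```

Here the first paragraph is too long for one chunk, so only that paragraph is split further at the sentence level, while the second paragraph stays intact. Every chunk carries the original source and page, which is what a RAG answer needs in order to cite where a passage came from.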

Common mistakes

Skipping text splitting before embedding large PDFs, so the input exceeds the embedding model's limit (context overflow).

Interview checkpoints

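Q: Explain text splitters in LangChain. A: Give a one-sentence definition plus one API name, for example: splitters cut documents into chunks that fit the embedding model, and RecursiveCharacterTextSplitter is a common choice.
Q: What is a common bug? A: Hard-coded keys, the wrong message format, or a missing split or embed step.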

Practice

Basic: Sketch a minimal text splitter snippet.
Intermediate: Run a notebook cell that splits a loaded document into chunks.
Advanced: Break a text splitter intentionally (for example, set chunk_overlap larger than chunk_size) and interpret the error.

Recap

You can explain text splitters clearly.
You know one mistake to avoid.
You see how this connects to the next lesson.

Next: Embeddings
