Module 1
Days 1–12
Introduction to NLP & Ambiguity
Understand what NLP is, the core challenges of natural language, and why ambiguity is the central problem.
- NLP definition and real-world applications
- Lexical ambiguity — same word, multiple meanings
- Syntactic ambiguity — multiple parse trees
- Semantic and pragmatic resolution strategies
- Overview of NLP tasks: POS tagging, NER, parsing, machine translation
Start Module 1 →
Module 2
Days 13–25
End-to-End NLP Pipeline
Trace the full lifecycle of an NLP system from raw data acquisition to model deployment.
- Data scraping and acquisition strategies
- Text cleaning and noise removal
- Embedding computation and representation
- Model training and evaluation workflow
- Deployment and serving NLP APIs
Start Module 2 →
Module 3
Days 26–38
Text Preprocessing Techniques
Master all standard text normalization steps before feeding text into any model.
- Tokenization — word, subword, sentence-level
- Lowercasing and punctuation handling
- Stopword removal: when to apply it and when to skip it
- Stemming vs. Lemmatization — tradeoffs
- Regex-based cleaning and custom pipelines
Start Module 3 →
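The Module 3 normalization steps (lowercasing, punctuation handling, tokenization, stopword removal) can be sketched with the Python standard library alone. This is a minimal illustration, not the course's own pipeline: the function names `preprocess` and `split_sentences`, the regular expressions, and the tiny `STOPWORDS` set are all assumptions chosen for the example.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in"}


def preprocess(text, remove_stopwords=True):
    """Lowercase the text, drop punctuation, and return word-level tokens."""
    text = text.lower()
    # Keep runs of letters/digits, allowing one internal apostrophe ("don't").
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text)
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def split_sentences(text):
    """Naive sentence-level tokenization: split after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


print(preprocess("The cats are running in the garden!"))
print(split_sentences("Hello world. How are you? Fine!"))
```

Stopword removal is kept behind a flag because it can hurt tasks where function words carry signal, such as negation in sentiment analysis.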
Module 4
Days 39–53
Text Vectorization & TF-IDF
Convert raw text into numerical representations that machine learning models can consume.
- One-Hot Encoding for vocabulary
- Bag-of-Words (BoW) representation
- N-grams and co-occurrence matrices
- TF-IDF: term frequency weighted by inverse document frequency
- Sparse vs. dense representation tradeoffs
Start Module 4 →
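To make the Module 4 ideas concrete, the sketch below builds n-grams and TF-IDF weights from tokenized documents. It assumes one common smoothed IDF variant, log((1 + N) / (1 + df)) + 1, and length-normalized term frequency; other definitions are in use, and the names `ngrams` and `tf_idf` are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """docs: list of token lists. Returns (idf dict, per-document weight dicts)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    # Smoothed IDF (one common variant, assumed here).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    weights = []
    for doc in docs:
        counts = Counter(doc)  # bag-of-words counts for this document
        total = len(doc)
        weights.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return idf, weights


docs = [["cat", "sat"], ["cat", "ran"]]
idf, weights = tf_idf(docs)
print(ngrams(["a", "b", "c"], 2))
print(weights[0])
```

Each document is stored as a dictionary of non-zero weights, which is the sparse representation; a dense vector would allocate one slot per vocabulary term.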
Module 5
Days 54–65
Word Embeddings (Word2Vec)
Learn dense vector representations that capture semantic meaning and word relationships.
- Limitations of sparse representations
- CBOW — predicting target from context
- Skip-gram — predicting context from target
- Negative sampling for efficient training
- GloVe and FastText comparisons
Start Module 5 →
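The difference between CBOW and Skip-gram is easiest to see in the training examples each one generates. The sketch below only prepares data; it does not train embeddings. The function names and the default window size are assumptions, and the negative sampler draws from unigram counts raised to the 0.75 power, the distribution used in the original Word2Vec work.

```python
import random
from collections import Counter


def skipgram_pairs(tokens, window=2):
    """Skip-gram: one (target, context) pair per context word."""
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs


def cbow_examples(tokens, window=2):
    """CBOW: one (context words, target) example per position."""
    examples = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        if context:
            examples.append((context, target))
    return examples


def negative_samples(tokens, positive, k=2, rng=None):
    """Draw k noise words (never the positive word), weighted by count ** 0.75."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    counts = Counter(tokens)
    candidates = [w for w in counts if w != positive]
    if not candidates:
        return []
    weights = [counts[w] ** 0.75 for w in candidates]
    return rng.choices(candidates, weights=weights, k=k)


sentence = ["the", "quick", "brown", "fox"]
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

Negative sampling replaces a softmax over the whole vocabulary with a handful of binary decisions per pair, which is why it makes training cheaper.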
Module 6
Days 66–78
Text Classification Models
Build classifiers that label text — from spam detection to sentiment analysis and topic categorization.
- Naive Bayes — conditional probabilities & Laplace smoothing
- Logistic Regression for multi-class text
- Support Vector Machines with text kernels
- Evaluation: accuracy, precision, recall, F1
- Handling class imbalance in NLP
Start Module 6 →
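As a rough illustration of the first and fourth Module 6 topics, here is a small multinomial Naive Bayes classifier with Laplace (add-alpha) smoothing, plus precision, recall, and F1 for one class. The class name `NaiveBayes`, the toy spam/ham data, and the choice to ignore out-of-vocabulary tokens at prediction time are assumptions made for the sketch.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing strength

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, y in zip(docs, labels):
            self.word_counts[y].update(tokens)
        self.vocab = {t for c in self.word_counts.values() for t in c}
        return self

    def predict(self, tokens):
        n = sum(self.class_counts.values())
        v = len(self.vocab)
        best, best_score = None, -math.inf
        for y, cy in self.class_counts.items():
            total = sum(self.word_counts[y].values())
            score = math.log(cy / n)  # log prior
            for t in tokens:
                if t in self.vocab:  # skip words never seen in training
                    num = self.word_counts[y][t] + self.alpha
                    score += math.log(num / (total + self.alpha * v))
            if score > best_score:
                best, best_score = y, score
        return best


def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


docs = [["win", "money", "now"], ["meeting", "at", "noon"],
        ["free", "money"], ["lunch", "meeting"]]
labels = ["spam", "ham", "spam", "ham"]
model = NaiveBayes().fit(docs, labels)
print(model.predict(["free", "money", "now"]))
```

Working in log space avoids underflow when many small probabilities are multiplied. With imbalanced classes, accuracy alone can look good while recall on the rare class is poor, which is why the per-class metrics matter.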
Module 7
Days 79–90
POS Tagging & Hidden Markov Models
Model sequential linguistic structure using probabilistic graphical models and dynamic programming.
- Part-of-Speech tag sets (Penn Treebank)
- HMM — states, transitions, emissions
- Forward-Backward algorithm
- Viterbi decoding for optimal tag sequence
- Named Entity Recognition (NER) with HMMs
Start Module 7 →
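Viterbi decoding from Module 7 can be sketched as a short dynamic program over log probabilities. The three-tag HMM below is a made-up toy: its tag set, start, transition, and emission probabilities are assumptions for illustration and are not estimated from the Penn Treebank or any other corpus.

```python
import math


def _log(p):
    """log(p), with log(0) mapped to -inf so impossible paths lose."""
    return math.log(p) if p > 0 else -math.inf


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    if not obs:
        return []
    # best[t][s]: log-probability of the best path ending in state s at time t.
    best = [{s: _log(start_p[s]) + _log(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + _log(trans_p[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + _log(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev
    # Follow back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


states = ["DET", "NOUN", "VERB"]
start_p = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
emit_p = {
    "DET": {"the": 0.9, "a": 0.1},
    "NOUN": {"dog": 0.5, "cat": 0.3, "barks": 0.2},
    "VERB": {"barks": 0.6, "runs": 0.4},
}
print(viterbi(["the", "dog", "barks"], states, start_p, trans_p, emit_p))
```

In this toy model "barks" can be emitted by both NOUN and VERB, so the transition probabilities decide the tag. The Forward algorithm has the same table structure but sums over predecessors instead of taking the maximum.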
Module 8
Days 91–100
Duplicate Question Detection
End-to-end NLP case study using the Quora Question Pairs dataset — a real-world similarity problem.
- Problem framing: semantic similarity as binary classification
- Cosine similarity and Jaccard similarity (intersection over union)
- Fuzzy matching with edit distance
- Feature engineering from text pairs
- XGBoost on engineered similarity vectors
Start Module 8 →
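The similarity measures listed for Module 8 can be combined into one feature vector per question pair. The sketch below is a simplified illustration: whitespace tokenization, the particular four features, and the name `pair_features` are assumptions, and the classifier that would consume these vectors (XGBoost in this module) is not shown.

```python
import math
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity: shared tokens divided by all distinct tokens."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def edit_distance(s, t):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def pair_features(q1, q2):
    """Hypothetical feature vector for one question pair."""
    t1, t2 = q1.lower().split(), q2.lower().split()
    return [
        jaccard(t1, t2),
        cosine(t1, t2),
        edit_distance(q1.lower(), q2.lower()),
        abs(len(t1) - len(t2)),
    ]


print(pair_features("How do I learn NLP", "How can I learn NLP"))
```

Stacking one such vector per pair gives the feature matrix for the binary duplicate/not-duplicate classifier.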