Module 7: Recurrent Neural Networks, LSTMs & GRUs
Process sequences step by step: design the recurrence behind RNN hidden states, examine vanishing gradients in backpropagation through time (BPTT), and trace the gating logic of LSTMs, GRUs, and bidirectional RNNs.
Sequential Data
Why this matters
Sequential data needs models that respect order — time, text, audio.
56.2.4 Data Flow Visualization
Input Processing
Input shape: (batch_size, 4, 5)
→ 4 timesteps × 5 features each
→ Sequential processing through the RNN
RNN Processing Steps
Figure 56.4: Mermaid diagram
56.3 RNN Forward Propagation: Complete Technical Guide
Unlike feedforward networks, recurrent neural networks process sequential inputs by maintaining a hidden state vector $h_t$ that carries information from earlier inputs across time steps: $$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$ In theory an RNN can track arbitrarily long sequences. In practice, backpropagation through time (BPTT) multiplies the gradient by one Jacobian per time step, so gradients tend to vanish over long spans, which prevents the network from capturing long-term dependencies.
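The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the layer sizes, the random weights, and the all-zero initial state $h_0$ are assumptions made for the example.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla RNN over one sequence.

    x has shape (timesteps, features); returns every hidden state,
    shape (timesteps, hidden).
    """
    h = np.zeros(W_hh.shape[0])   # h_0: assumed all-zero initial state
    states = []
    for x_t in x:                 # one update per timestep, in order
        h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
timesteps, features, hidden = 4, 5, 3   # illustrative sizes
x = rng.normal(size=(timesteps, features))
W_xh = rng.normal(scale=0.5, size=(hidden, features))
W_hh = rng.normal(scale=0.5, size=(hidden, hidden))
b_h = np.zeros(hidden)

H = rnn_forward(x, W_xh, W_hh, b_h)
print(H.shape)   # (4, 3)
```

The same weights are reused at every step; only the hidden state changes, which is what lets the network handle sequences of any length.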
Common mistakes
- Not padding or masking variable-length sequences within a batch.
- Ignoring vanishing gradients on long sequences.
- Using teacher forcing during training with no plan for inference, where ground-truth previous tokens are unavailable.
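The first mistake is usually avoided by padding each batch to a common length and keeping a mask of the real positions. The sketch below shows the idea in NumPy; the helper name `pad_sequences_post` and the token ids are hypothetical, and reserving id 0 for padding is an assumption (it matches the `mask_zero=True` convention of the Keras `Embedding` layer).

```python
import numpy as np

def pad_sequences_post(seqs, pad_value=0):
    """Right-pad integer sequences to a common length; return data and mask."""
    max_len = max(len(s) for s in seqs)
    padded = np.full((len(seqs), max_len), pad_value, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s
        mask[i, :len(s)] = True   # True marks real tokens, False marks padding
    return padded, mask

batch = [[4, 9, 2], [7], [3, 5]]   # hypothetical token ids; 0 is reserved for padding
padded, mask = pad_sequences_post(batch)
print(padded)
print(mask.sum(axis=1))   # true length of each sequence
```

A model that ignores the mask treats the padded zeros as real inputs, so the final hidden state of a short sequence ends up dominated by padding.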
Interview checkpoints
- Q: RNN vs CNN? A: RNNs suit ordered sequences; CNNs suit spatial grids such as images.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces the vanishing-gradient problem.
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
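For the windowed forecasting exercise, the series first has to be cut into (window, next value) pairs with the (batch, time, features) layout. A minimal sketch, assuming a one-step-ahead target and a hypothetical helper name `make_windows`:

```python
import numpy as np

def make_windows(series, window):
    """Return X of shape (samples, window, 1) and y of shape (samples,)."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window:]            # value that follows each window
    return X[..., np.newaxis], y   # add the trailing features axis

series = np.arange(10.0)           # toy series 0..9
X, y = make_windows(series, window=3)
print(X.shape, y.shape)            # (7, 3, 1) (7,)
```

`X` can then be fed to a recurrent layer and `y` used as the regression target.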
Recap
- Sequential data calls for models built for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Next: Day 67 — Vanilla RNN
Vanilla RNN
3.2. Neural Network Architectures: A Visual Guide

| Component | Purpose | Description |
|---|---|---|
| Encoder | Compression | Reduces input to latent space |
| Latent Space | Representation | Compact encoding of data |
| Decoder | Reconstruction | Rebuilds input from latent space |

Variants
Vanilla Autoencoder
- Basic: Simple encoder-decoder structure
- Undercomplete: Latent dimension smaller than input
- Purpose: Dimensionality reduction, feature learning
Variational Autoencoder (VAE)
- Probabilistic: Encodes to a distribution, not a point
- Generative: Can sample from latent space
- Structure: Adds KL divergence to loss function
- Formula: Loss = Reconstruction Error + KL Divergence
Denoising Autoencoder
- Corruption: Input deliberately noised
- Cleaning: Must reconstruct clean output
- Robust: Learns noise-invariant features
Sparse Autoencoder
- Regularization: Penalizes active neurons
- Sparse: Only a small subset of neurons active
- Goal: Learn more efficient representations
Applications
- Dimensionality reduction
- Feature learning
- Anomaly detection
- Image denoising
- Data compression
Key Properties
- Unsupervised: No labels needed
- Self-supervised: Creates own supervision signal
- Data-specific: Works best on similar data distribution
Why this matters
Vanilla RNN maintains hidden state across timesteps.
33.3.10 Visualization Tools
Interactive Demos
1. 3D Loss Surface Visualization
   - Shows a ball rolling on the loss landscape
   - Compares vanilla GD vs momentum
2. Contour Plot Animation
   - 2D view of optimization path
   - Clear view of oscillation damping
3. Parameter Space Navigation
   - Click anywhere to start optimization
   - Compare different algorithms side-by-side
Key Observations from Visualizations
- Blue path: Vanilla gradient descent (slow, direct)
- Purple path: Momentum (fast, may overshoot)
- Local minima: Momentum escapes, vanilla GD gets stuck
- Oscillations: Gradually dampen with momentum
To address vanishing gradients, Hochreiter & Schmidhuber proposed the LSTM in 1997. An LSTM controls information flow using a **cell state** $C_t$ and three gates:
- Forget Gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ controls what is erased from the previous cell state.
- Input Gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ and $\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ control what gets written. The cell state is then updated as $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$.
- Output Gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ and $h_t = o_t * \tanh(C_t)$ yield the output state.
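The gate equations can be traced with a single-step NumPy sketch. It uses the standard cell update $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ to connect the gates; the parameter dictionary layout, sizes, and random weights are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    """One LSTM time step built from the forget, input, and output gates."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # input gate
    C_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate values
    C_t = f_t * C_prev + i_t * C_tilde                     # cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate
    h_t = o_t * np.tanh(C_t)                               # new hidden state
    return h_t, C_t

rng = np.random.default_rng(0)
features, hidden = 5, 3   # illustrative sizes
params = {}
for gate in "fico":       # forget, input, candidate, output
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden, hidden + features))
    params[f"b_{gate}"] = np.zeros(hidden)

h, C = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(4, features)):   # 4 timesteps
    h, C = lstm_step(x_t, h, C, params)
print(h.shape, C.shape)   # (3,) (3,)
```

Because the cell update is additive rather than a repeated matrix multiplication, a forget gate close to 1 lets information (and gradient) pass through many steps largely unchanged.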
Common mistakes
- Not padding or masking variable-length sequences within a batch.
- Ignoring vanishing gradients on long sequences.
- Using teacher forcing during training with no plan for inference, where ground-truth previous tokens are unavailable.
Interview checkpoints
- Q: RNN vs CNN? A: RNNs suit ordered sequences; CNNs suit spatial grids such as images.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces the vanishing-gradient problem.
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- Vanilla RNN for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Next: Day 68 — Hidden States
Hidden States
3.2. Neural Network Architectures: A Visual Guide

| Component | Purpose | Description |
|---|---|---|
| Input Layer | Receives sequence data | One element at a time |
| Hidden State | Maintains memory | Updated with each input |
| Output Layer | Generates predictions | Can output at each step |
| Recurrent Connection | Enables memory | Connects hidden state to itself |

Core Mechanism
The fundamental RNN computation follows this formula: $$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$$
Where:
- $x_t$: Input at time step $t$
- $h_t$: Hidden state at time step $t$
- $h_{t-1}$: Previous hidden state
- $W_x$: Input weight matrix
- $W_h$: Hidden state weight matrix
- $b$: Bias vector
- $\tanh$: Hyperbolic tangent activation function
This equation shows how RNNs combine current input with previous memory to produce new hidden states, enabling temporal pattern recognition.
RNN Variants
LSTM (Long Short-Term Memory)
- Gates: Input, forget, output
- Cell state: Long-term memory storage
- Protection: Guards against vanishing/exploding gradients
- Performance: Better at capturing long-range dependencies
GRU (Gated Recurrent Unit)
- Gates: Reset, update
- Simplified: Fewer parameters than LSTM
- Efficiency: Faster training, similar performance
Bidirectional RNNs
- Two directions: Forward and backward
- Context: Captures both past and future information
- Enhanced: Better performance for many applications
Applications
- Natural language processing
- Speech recognition
- Time series prediction
Why this matters
Hidden state summarizes past — bottleneck for long sequences.
65.1.12 Key Takeaways
Remember: GRU is a simplified, more efficient alternative to LSTM that often performs comparably well while being faster to train and requiring fewer parameters.
Core Benefits of GRU

| Benefit | Impact |
|---|---|
| Simplicity | Easier to understand and implement |
| Efficiency | Faster training and inference |
| Effectiveness | Good performance on many tasks |
| Flexibility | Good starting point for sequence modeling |
Chapter 66 Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU
66.1 Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU
66.2 Bidirectional RNN - Comprehensive Notes
66.2.1 Overview
Bidirectional Recurrent Neural Networks (BiRNNs) are an advanced architecture that processes sequences in both forward and backward directions, capturing context from both past and future inputs.
Learning Path Progress
Figure 66.1: Mermaid diagram
66.2.2 Why Bidirectional RNNs?
The Limitation of Unidirectional RNNs
In traditional RNNs, information flows in one direction (left to right):
x_1 -> [RNN] -> x_2 -> [RNN] -> x_3 -> [RNN] -> Output
Problem: The output at time t depends only on past and current inputs (x_1, x_2, ..., x_t).
The Need for Future Context
Some scenarios require future inputs to affect earlier outputs.
Example: Named Entity Recognition (NER). Consider these sentences:
1. “I love Amazon, it’s a great website”: Amazon → Organization (ORG)
2. “I love Amazon, it’s a beautiful river”: Amazon → Location (LOC)
Key Insight: We can’t determine whether “Amazon” is ORG or LOC until we read the future context!
66.2.3 Bidirectional RNN Architecture
Core Concept
A BiRNN uses two separate RNNs:
- Forward RNN (→): Processes the sequence left to right
- Backward RNN (←): Processes the sequence right to left
Visual Architecture
[Diagram: the forward RNN reads x_1, x_2, x_3, ... left to right and produces forward hidden states; the backward RNN reads the same inputs right to left and produces backward hidden states. Each output combines the two states for its position, e.g. $y_1 = \sigma(V[\overrightarrow{h}_1; \overleftarrow{h}_1] + b)$.]
Mathematical Formulation

| Component | Equation |
|---|---|
| Forward Hidden State | $\overrightarrow{h}_t = \tanh(W_{hf} \overrightarrow{h}_{t-1} + W_{xf} x_t + b_f)$ |
| Backward Hidden State | $\overleftarrow{h}_t = \tanh(W_{hb} \overleftarrow{h}_{t+1} + W_{xb} x_t + b_b)$ |
| Output | $y_t = \sigma(V[\overrightarrow{h}_t; \overleftarrow{h}_t] + b)$ |

Where:
- $[\overrightarrow{h}_t; \overleftarrow{h}_t]$ represents concatenation
- $\sigma$ is the sigmoid activation function
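The forward and backward passes can be sketched with two independent vanilla RNNs whose states are concatenated per timestep. This is an illustrative NumPy version under assumed sizes and random weights; the helper names `run_rnn` and `birnn_forward` are hypothetical, and the output projection $V$ is left out to keep the example short.

```python
import numpy as np

def run_rnn(x, W_x, W_h, b):
    """Vanilla RNN over x (timesteps, features); returns (timesteps, hidden)."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_h @ h + W_x @ x_t + b)
        states.append(h)
    return np.stack(states)

def birnn_forward(x, fwd, bwd):
    """Concatenate forward and backward hidden states at every timestep."""
    h_fwd = run_rnn(x, *fwd)                # reads x_1 ... x_T
    h_bwd = run_rnn(x[::-1], *bwd)[::-1]    # reads x_T ... x_1, then re-aligned
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
T, features, hidden = 4, 5, 3   # illustrative sizes

def init():
    """Random (W_x, W_h, b) for one direction."""
    return (rng.normal(scale=0.5, size=(hidden, features)),
            rng.normal(scale=0.5, size=(hidden, hidden)),
            np.zeros(hidden))

fwd, bwd = init(), init()       # two independent parameter sets
x = rng.normal(size=(T, features))
H = birnn_forward(x, fwd, bwd)
print(H.shape)   # (4, 6)
```

Each row of `H` holds the forward state (first 3 values) and the backward state (last 3 values) for one timestep, so the representation of the first input already depends on the last one.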
66.2.4 Implementation in Keras
Basic BiRNN Implementation
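from tensorflow.keras import Sequential, Input
from tensorflow.keras.layers import Bidirectional, SimpleRNN, LSTM, GRU
# Each model wraps a 5-unit recurrent layer in Bidirectional.
# The input feature size of 32 is an assumption; it is consistent with
# the 190 parameters listed for SimpleRNN(5) under Parameter Comparison.
# Simple BiRNN
birnn = Sequential([Input(shape=(None, 32)), Bidirectional(SimpleRNN(5))])
# BiLSTM (Most Common)
bilstm = Sequential([Input(shape=(None, 32)), Bidirectional(LSTM(5))])
# BiGRU
bigru = Sequential([Input(shape=(None, 32)), Bidirectional(GRU(5))])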
Parameter Comparison

| Architecture | Parameters | Multiplier |
|---|---|---|
| SimpleRNN | 190 | 1x |
| Bidirectional(SimpleRNN) | 380 | 2x |
| LSTM | Higher | 1x |
| Bidirectional(LSTM) | 2x Higher | 2x |

Note: The Bidirectional wrapper doubles the parameter count because it uses two RNNs.
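The doubling can be checked by counting weights by hand. A vanilla RNN layer with `units` hidden units has an input matrix, a recurrent matrix, and a bias. The notes do not state the input size behind the 190 figure; the value 32 below is an assumption chosen because it reproduces that number for 5 units.

```python
def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + bias of one vanilla RNN layer."""
    return units * (input_dim + units + 1)

def bidirectional_params(input_dim, units):
    """Two independent RNNs, one per direction."""
    return 2 * simple_rnn_params(input_dim, units)

input_dim, units = 32, 5   # input_dim=32 is an assumption, not stated in the notes
print(simple_rnn_params(input_dim, units))      # 190
print(bidirectional_params(input_dim, units))   # 380
```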
66.2.5 Applications
Primary Use Cases

| Application | Description | Why BiRNN? |
|---|---|---|
| Named Entity Recognition (NER) | Identify entities in text | Future context helps disambiguate |
| Part-of-Speech Tagging | Assign grammatical tags | Context from both directions |
| Machine Translation | Translate between languages | Better context understanding |
| Sentiment Analysis | Determine text sentiment | Captures full sentence context |
| Time Series Forecasting | Predict future values | Patterns from both directions |

Success Areas
Figure 66.2: Mermaid diagram
66.2.6 Advantages & Drawbacks
Advantages
- Complete Context: Access to both past and future information
- Better Performance: Often outperforms unidirectional RNNs
- Improved Accuracy: Especially for sequence labeling tasks
Drawbacks

| Issue | Description | Impact |
|---|---|---|
| Computational Complexity | 2x parameters and computation | Higher training time |
| Overfitting Risk | More parameters = more complexity | Need more regularization |
| Latency Issues | Need complete sequence before processing | Not suitable for real-time |
| Memory Requirements | Stores both forward and backward states | Higher memory usage |

Real-time Limitations
Figure 66.3: Mermaid diagram
66.2.7 Best Practices
When to Use BiRNN
Use when:
- Complete sequence is available
- Context from both directions is valuable
- Accuracy is more important than speed
- Working with NLP tasks like NER, POS tagging
Avoid when:
- Real-time processing is required
- Working with streaming data
- Memory/computational resources are limited
- Simple patterns suffice
Implementation Tips
1. Start Simple: Try unidirectional first, then compare with bidirectional
2. Regularization: Use dropout to combat overfitting
3. Architecture Choice: BiLSTM is most commonly used
4. Batch Processing: Process multiple sequences together for efficiency
66.2.8 Summary
Bidirectional RNNs are powerful architectures that leverage both past and future context to make better predictions. While they come with increased computational costs and aren’t suitable for real-time applications, they excel in many NLP tasks where complete context improves performance significantly.
Key Takeaways
- Dual Processing: Forward + Backward RNNs
- Better Context: Captures information from entire sequence
- Easy Implementation: Simple wrapper in modern frameworks
- Trade-offs: Better accuracy vs. higher complexity
- Best for: NLP tasks with complete sequences available
Part XIII History of Large Language Models
Chapter 67 The Epic History of Large Language Models (LLMs) | From LSTMs to ChatGPT | CampusX
67.1 The Epic History of Large Language Models (LLMs) | From LSTMs to ChatGPT | CampusX
Figure 67.1: image
67.2 Sequence Tasks and Types: Comprehensive Guide
67.2.1 Sequence Processing Architecture
Figure 67.2: image
67.2.2 RNN Input-Output Patterns

| Pattern Type | Input | Output | Examples |
|---|---|---|---|
| Many-to-One | Sequence | Scalar (1, 0) | Sentiment analysis, Classification |
| One-to-Many | Scalar/Image | Sequence | Image captioning, Description |
| Many-to-Many (Async) | Sequence | Sequence | Translation, Summarization |
| Many-to-Many (Sync) | Sequence | Sequence | POS Tagging, NER |
67.2.3 Key Applications of Sequence Models
- Text Processing:
  - Sentiment analysis (positive/negative)
  - Text generation & summarization
  - Machine translation (Google Translate)
- Vision & Language:
  - Image captioning (image → description)
  - Visual question answering
- Time Series:
  - Financial forecasting
  - Weather prediction
  - Anomaly detection
- Bioinformatics:
  - Protein sequence analysis
  - DNA sequence classification
The GRU is a simplified variant of LSTM that merges the cell state and hidden state, and uses only two gates: a **Reset Gate** (controls how to combine new input with past memory) and an **Update Gate** (acts as both forget and input gate).
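A single GRU step can be sketched as follows. The notes do not give the GRU equations, so this uses one commonly cited formulation as an assumption: $z_t$ is the update gate, $r_t$ the reset gate, and $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$. Some references swap the roles of $z_t$ and $1 - z_t$. Sizes and weights are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; there is no separate cell state, only h."""
    z_in = np.concatenate([h_prev, x_t])
    z_t = sigmoid(p["W_z"] @ z_in + p["b_z"])   # update gate
    r_t = sigmoid(p["W_r"] @ z_in + p["b_r"])   # reset gate
    # The reset gate scales the past state before it meets the new input.
    cand_in = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(p["W_h"] @ cand_in + p["b_h"])
    # The update gate interpolates: (1 - z) keeps old state, z writes new.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

rng = np.random.default_rng(0)
features, hidden = 5, 3   # illustrative sizes
p = {}
for name in "zrh":        # update, reset, candidate
    p[f"W_{name}"] = rng.normal(scale=0.5, size=(hidden, hidden + features))
    p[f"b_{name}"] = np.zeros(hidden)

h = np.zeros(hidden)
for x_t in rng.normal(size=(4, features)):   # 4 timesteps
    h = gru_step(x_t, h, p)
print(h.shape)   # (3,)
```

The single interpolation line is where the update gate plays both roles: whatever fraction of the old state is forgotten is exactly the fraction of the candidate that is written.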
Common mistakes
- Not padding or masking variable-length sequences within a batch.
- Ignoring vanishing gradients on long sequences.
- Using teacher forcing during training with no plan for inference, where ground-truth previous tokens are unavailable.
Interview checkpoints
- Q: RNN vs CNN? A: RNNs suit ordered sequences; CNNs suit spatial grids such as images.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces the vanishing-gradient problem.
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- Hidden States for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Next: Day 69 — BPTT
BPTT
Why this matters
BPTT backpropagates through time — expensive and unstable.
57.0.11 Summary
This implementation demonstrates the complete pipeline for text-based sentiment analysis using RNNs in Keras, covering:
1. Text Preprocessing: Tokenization and sequence conversion
2. Data Preparation: Padding and normalization
3. Model Building: Both simple and embedding-based approaches
4. Training: Compilation and execution strategies
The embedding approach consistently outperforms simple integer encoding due to its ability to capture semantic relationships and provide dense representations of textual data.
Chapter 58 Types of RNN | Many to Many | One to Many | Many to One RNNs
58.1 Types of RNN | Many to Many | One to Many | Many to One RNNs
58.1.1 Overview
This comprehensive guide covers the four main types of Recurrent Neural Network (RNN) architectures, their applications, and implementation patterns based on input-output sequence relationships.
58.1.2 Video Content Summary
58.1.3 Four Main RNN Architecture Types
Architecture Classification Matrix

| Input Type | Output Type | Architecture | Applications |
|---|---|---|---|
| Sequence | Single | Many to One | Sentiment Analysis, Rating Prediction |
| Single | Sequence | One to Many | Image Captioning, Music Generation |
| Sequence | Sequence | Many to Many | Translation, NER, POS Tagging |
| Single | Single | One to One | Image Classification |
58.1.4 1. Many to One Architecture
Core Concept
- Input: Sequential data (sentences, time series)
- Output: Single value/classification
Architecture Flow
Figure 58.1: image
Key Applications
1. Sentiment Analysis
   - Input: “This movie is amazing!” (sequence of words)
   - Output: 1 (Positive) or 0 (Negative)
   - Process: Analyze entire sentence context → Single sentiment score
2. Rating Prediction
   - Input: Product review text
   - Output: Star rating (1-5)
   - Use Case: Movie reviews → Predicted user rating
Architecture Details
- Hidden States: Each time step maintains a hidden state
- Final Output: Only from the last time step
- Information Flow: Sequential processing with memory
58.1.5 2. One to Many Architecture
Core Concept
- Input: Single non-sequential data (image, number)
- Output: Sequential data (text, music notes)
Architecture Flow
Figure 58.2: image
Key Applications
1. Image Captioning
   - Input: Image of person playing cricket
   - Output: “A man is playing cricket”
   - Process: CNN extracts features → RNN generates sequential text
2. Music Generation
   - Input: Musical seed/style parameter
   - Output: Sequence of musical notes
   - Process: Generate continuous musical composition
Technical Implementation
- Initial Input: Provided once at start
- Subsequent Steps: Previous output becomes next input
- Generation: Continues until stop condition
58.1.6 3. Many to Many Architecture
Core Concept
- Input: Sequential data
- Output: Sequential data
- Also Known As: Sequence-to-Sequence (Seq2Seq) models
Two Subtypes
3A. Same Length Many-to-Many
Characteristic: Input sequence length = Output sequence length
Applications
Part-of-Speech (POS) Tagging

| Input Word | Output Tag |
|---|---|
| “The” | Article |
| “quick” | Adjective |
| “brown” | Adjective |
| “fox” | Noun |

Named Entity Recognition (NER)

| Input | Output |
|---|---|
| “Let’s meet at 7:00 PM at the airport” | [O, O, O, TIME, TIME, O, O, LOCATION] |

Architecture Flow
Figure 58.3: image
3B. Variable Length Many-to-Many
Characteristic: Input length ≠ Output length
Primary Application: Machine Translation
Example Translation

| Language | Sentence | Word Count |
|---|---|---|
| English | “My name is Nitish” | 4 words |
| Hindi | “mera naam Nitish hai” | 4 words |

Note: Different languages may use different word counts for the same meaning.
Why Encoder-Decoder?
Translation Logic: Complete sentence understanding is required before translation.
- Word-by-word translation loses context
- Full sentence comprehension preserves meaning, grammar, and context
Figure 58.4: image
58.1.7 4. One to One Architecture
Core Concept
- Input: Non-sequential data
- Output: Non-sequential data
- Note: Technically not an RNN; this is a regular neural network
Architecture Flow
Figure 58.5: image
Key Applications
Image Classification
- Input: Image data
- Output: Class label (Cat/Dog, 0/1)
- Networks: CNN, ANN (not RNN)
Technical Note
- No Recurrence: No feedback loops or time steps
- No Memory: No hidden state preservation
- Standard Networks: ANN, CNN architectures
58.1.8 Summary Table

| Architecture | Input | Output | Memory | Applications |
|---|---|---|---|---|
| Many to One | Sequence | Single | Yes | Sentiment Analysis, Classification |
| One to Many | Single | Sequence | Yes | Image Captioning, Generation |
| Many to Many (Same) | Sequence | Sequence | Yes | POS Tagging, NER |
| Many to Many (Variable) | Sequence | Sequence | Yes | Machine Translation |
| One to One | Single | Single | No | Image Classification |

Key Takeaways
1. Architecture Choice: Depends on input-output relationship
2. Sequential Processing: Core strength of RNNs
3. Memory Mechanism: Hidden states preserve temporal information
4. Application Diversity: Wide range of NLP and sequence modeling tasks
Chapter 59 How Backpropagation works in RNN | Backpropagation Through Time
59.1 How Backpropagation works in RNN | Backpropagation Through Time
59.1.1 Overview
This comprehensive guide covers the fundamental concepts of Backpropagation Through Time (BPTT) in Recurrent Neural Networks, including detailed mathematical derivations and practical examples.
59.1.2 Introduction to RNN Backpropagation
Key Concepts

| Concept | Description | Importance |
|---|---|---|
| BPTT | Backpropagation Through Time | Core learning algorithm for RNNs |
| Temporal Dependencies | Learning from sequential data | Essential for time-series analysis |
| Gradient Flow | How gradients propagate through time | Critical for understanding vanishing gradients |

Why BPTT?
Key Insight: RNNs process sequential data where the output at each time step depends on both the current input and the previous hidden state. This creates a computational graph that unfolds through time.
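Because the graph unfolds through time, the gradient reaching an early state is a product of one Jacobian per step. The sketch below tracks the norm of that product for a tanh RNN. It is a small numerical illustration under assumptions: the hidden size, the random inputs, and in particular the rescaling of the recurrent matrix to spectral norm 0.9, which guarantees decay. Larger weights could make the same product explode instead.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, steps = 8, 50   # illustrative sizes

# Recurrent weights rescaled to spectral norm 0.9 (an assumption chosen
# so that the decay is guaranteed for this demonstration).
W_hh = rng.normal(size=(hidden, hidden))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

h = np.zeros(hidden)
J = np.eye(hidden)      # accumulates d h_t / d h_0
norms = []
for _ in range(steps):
    x_t = rng.normal(size=hidden)            # stands in for the projected input
    h = np.tanh(W_hh @ h + x_t)
    step_jac = np.diag(1.0 - h**2) @ W_hh    # d h_t / d h_{t-1}
    J = step_jac @ J
    norms.append(np.linalg.norm(J, 2))

print(norms[0], norms[9], norms[-1])
```

Each factor has norm at most 0.9 here, so after 50 steps the sensitivity of the state to $h_0$ is bounded by $0.9^{50}$, which is why early inputs stop influencing the weight updates.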
59.1.3 RNN Architecture Review
Mathematical Representation
The RNN operates with the following parameters:

| Parameter | Dimension | Description |
|---|---|---|
| W_i | 3×3 | Input weight matrix |
| W_h | 3×3 | Hidden weight matrix |
| W_o | 1×3 | Output weight matrix |

Example Setup: Sentiment Analysis
Consider a toy dataset with three reviews:
1. Review 1: “cat mat cat” → Label: 1 (Positive)
2. Review 2: “rat rat mat” → Label: 0 (Negative)
3. Review 3: “mat cat mat” → Label: 1 (Positive)
Vocabulary Encoding
Vocabulary = {
    "cat": [1, 0, 0],
    "mat": [0, 1, 0],
    "rat": [0, 0, 1]
}
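The toy setup can be run end to end as a many-to-one forward pass. This sketch uses the one-hot vocabulary and the 3×3, 3×3, and 1×3 weight shapes from the notes; the random weight values, the tanh hidden activation, the sigmoid output, and the omission of biases are assumptions made to keep the example short.

```python
import numpy as np

vocab = {"cat": [1, 0, 0], "mat": [0, 1, 0], "rat": [0, 0, 1]}
reviews = [("cat mat cat", 1), ("rat rat mat", 0), ("mat cat mat", 1)]

rng = np.random.default_rng(0)
W_i = rng.normal(scale=0.5, size=(3, 3))   # input weights (random, for illustration)
W_h = rng.normal(scale=0.5, size=(3, 3))   # hidden weights
W_o = rng.normal(scale=0.5, size=(1, 3))   # output weights

def predict(text):
    """Many-to-one forward pass: read every word, score the final state."""
    h = np.zeros(3)
    for word in text.split():
        x_t = np.array(vocab[word], dtype=float)
        h = np.tanh(W_i @ x_t + W_h @ h)   # biases omitted for brevity
    logit = (W_o @ h)[0]
    return 1.0 / (1.0 + np.exp(-logit))    # sigmoid: P(label = 1)

for text, label in reviews:
    print(text, label, round(predict(text), 3))
```

With untrained weights the probabilities are arbitrary; BPTT is the procedure that adjusts `W_i`, `W_h`, and `W_o` so that they move toward the labels.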
59.1.4 Forward Propagation
Content sourced from CampusX Deep Learning notes (PDF).
Common mistakes
- Not padding or masking variable-length sequences within a batch.
- Ignoring vanishing gradients on long sequences.
- Using teacher forcing during training with no plan for inference, where ground-truth previous tokens are unavailable.
Interview checkpoints
- Q: RNN vs CNN? A: RNNs suit ordered sequences; CNNs suit spatial grids such as images.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces the vanishing-gradient problem.
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- BPTT for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Next: Day 70 — LSTM Gates
LSTM Gates
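from tensorflow import keras
vocab_size = 10000   # assumed vocabulary size; index 0 is reserved for padding
num_classes = 2      # assumed number of target classes
model = keras.Sequential([
    # mask_zero=True makes downstream layers skip padded (index 0) timesteps
    keras.layers.Embedding(vocab_size, 64, mask_zero=True),
    # return_sequences=False keeps only the final hidden state (many-to-one)
    keras.layers.LSTM(64, return_sequences=False),
    keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
Why this matters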
LSTM gates control forget/input/output — long-range memory.
65.1.11 LSTM vs GRU Comparison
Feature Comparison
Feature | LSTM | GRU
Gates | 3 (Input, Forget, Output) | 2 (Reset, Update)
Memory Units | Cell State + Hidden State | Hidden State only
Parameters | $4[(d \times h) + h^2] + 4h$ | $3[(d \times h) + h^2] + 3h$
Complexity | Higher | Lower
Speed | Slower | Faster
Performance Characteristics
Figure 65.2: image
When to Use Each
Choose LSTM When:
- Complex, long sequences
- Large datasets available
- Computational resources abundant
- Maximum performance needed
Choose GRU When:
- Simpler tasks
- Limited computational resources
- Faster training required
- Smaller datasets
- Starting point for experimentation
Parameter Count Formula
LSTM Parameters: $P_{LSTM} = 4[d \times h + h^2 + h]$
GRU Parameters: $P_{GRU} = 3[d \times h + h^2 + h]$
Where:
- $d$ = input dimension
- $h$ = hidden dimension
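The two parameter-count formulas can be checked with a few lines of Python. This is a minimal sketch: the function names and the sizes d = 100, h = 150 are chosen here for illustration only. It also assumes one bias vector per gate, as in the formulas above; some library implementations keep an extra bias per gate, so a framework's reported count may be slightly higher.

```python
def lstm_params(d: int, h: int) -> int:
    """Four weight blocks (forget, input, candidate, output), each with
    d*h input weights, h*h recurrent weights and h biases."""
    return 4 * (d * h + h * h + h)


def gru_params(d: int, h: int) -> int:
    """Three weight blocks (reset, update, candidate), same per-block size."""
    return 3 * (d * h + h * h + h)


d, h = 100, 150  # illustrative sizes, not taken from a specific model
print(lstm_params(d, h))                     # 150600
print(gru_params(d, h))                      # 112950
print(gru_params(d, h) / lstm_params(d, h))  # 0.75
```

For equal d and h, a GRU layer therefore has three quarters of the parameters of an LSTM layer under this counting.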
Content sourced from CampusX Deep Learning notes (PDF). Run merge script for full body.
Common mistakes
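- Not padding and masking variable-length batches before feeding them to an LSTM.
- Ignoring vanishing gradients on long sequences: gating reduces the problem but does not remove it.
- Using teacher forcing during training without a plan for inference, where the true previous tokens are not available.
Interview checkpoints
- Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.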
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- LSTM Gates for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Next: Day 71 — LSTM Cell State
LSTM Cell State
3.2. Neural Network Architectures: A Visual Guide
Component | Purpose | Description
Input Layer | Receives sequence data | One element at a time
Hidden State | Maintains memory | Updated with each input
Output Layer | Generates predictions | Can output at each step
Recurrent Connection | Enables memory | Connects hidden state to itself
Core Mechanism
The fundamental RNN computation follows this formula:
$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$
Where:
- $x_t$: Input at time step $t$
- $h_t$: Hidden state at time step $t$
- $h_{t-1}$: Previous hidden state
- $W_x$: Input weight matrix
- $W_h$: Hidden state weight matrix
- $b$: Bias vector
- $\tanh$: Hyperbolic tangent activation function
This equation shows how an RNN combines the current input with its previous memory to produce a new hidden state, which enables temporal pattern recognition.
RNN Variants
LSTM (Long Short-Term Memory)
- Gates: Input, forget, output
- Cell state: Long-term memory storage
- Protection: Mitigates vanishing gradients (exploding gradients are usually handled separately, for example with gradient clipping)
- Performance: Better at capturing long-range dependencies
GRU (Gated Recurrent Unit)
- Gates: Reset, update
- Simplified: Fewer parameters than LSTM
- Efficiency: Faster training, often similar performance
Bidirectional RNNs
- Two directions: Forward and backward
- Context: Captures both past and future information
- Enhanced: Better performance for many applications
Applications
- Natural language processing
- Speech recognition
- Time series prediction
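The recurrence $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$ can be unrolled by hand for a short sequence. The sketch below uses NumPy with randomly initialised weights; the function name `rnn_forward` and the sizes (3 timesteps, 4 input features, 5 hidden units) are illustrative assumptions, not part of the notes.

```python
import numpy as np


def rnn_forward(xs, W_x, W_h, b, h0=None):
    """Apply h_t = tanh(W_x x_t + W_h h_{t-1} + b) over a sequence.

    xs has shape (time, features); returns all hidden states, shape (time, hidden).
    """
    h = np.zeros(W_h.shape[0]) if h0 is None else h0
    states = []
    for x_t in xs:  # one timestep at a time, reusing the same weights
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
T, d, n = 3, 4, 5                      # timesteps, input size, hidden size
xs = rng.normal(size=(T, d))
W_x = rng.normal(scale=0.1, size=(n, d))
W_h = rng.normal(scale=0.1, size=(n, n))
b = np.zeros(n)

hs = rnn_forward(xs, W_x, W_h, b)
print(hs.shape)  # (3, 5)
```

The same three matrices are reused at every timestep; only the hidden state changes, which is what "recurrent connection" refers to.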
Why this matters
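The cell state is the LSTM's memory highway: it carries information across timesteps and is changed only by element-wise gate operations.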
62.1.12 4. Complete LSTM Cell Animation
Step-by-Step Workflow
Figure 62.6: image
Complete Mathematical Flow
All LSTM Equations:
1. $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (Forget gate)
2. $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (Input gate)
3. $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (Candidate values)
4. $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (Cell state update)
5. $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (Output gate)
6. $h_t = o_t \odot \tanh(C_t)$ (Hidden state)
Key Takeaways
Feature | Purpose | Benefit
Three Gates | Control information flow | Selective memory
Cell State Highway | Direct gradient path | Mitigates vanishing gradients
Pointwise Operations | Element-wise control | Fine-grained memory management
Dual Memory | Long & short term | Comprehensive context
Animation Summary
1. Forget Phase: Remove irrelevant past info
2. Input Phase: Add new relevant info
3. Output Phase: Select what to output now
Complexity Analysis (n = hidden size)
Operation | Time | Space | Purpose
Forget Gate | $O(n^2)$ | $O(n)$ | Memory filtering
Input Gate | $O(n^2)$ | $O(n)$ | Information addition
Output Gate | $O(n^2)$ | $O(n)$ | Output generation
Total | $O(n^2)$ | $O(n)$ | Per timestep
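The six LSTM equations translate almost line for line into NumPy. The following is a minimal sketch of one cell step, not a framework implementation: the helper names, the dictionary layout of the weights, and the sizes are assumptions made for illustration. Each weight matrix acts on the concatenation $[h_{t-1}, x_t]$, as in the equations.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM timestep. W[k] has shape (hidden, hidden + input), b[k] shape (hidden,)."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f = sigmoid(W["f"] @ z + b["f"])         # 1. forget gate
    i = sigmoid(W["i"] @ z + b["i"])         # 2. input gate
    c_hat = np.tanh(W["c"] @ z + b["c"])     # 3. candidate values
    c = f * c_prev + i * c_hat               # 4. cell state update (element-wise)
    o = sigmoid(W["o"] @ z + b["o"])         # 5. output gate
    h = o * np.tanh(c)                       # 6. hidden state
    return h, c


rng = np.random.default_rng(0)
d, n = 3, 4                                  # input size, hidden size
W = {k: rng.normal(scale=0.1, size=(n, n + d)) for k in "fico"}
b = {k: np.zeros(n) for k in "fico"}

h, c = np.zeros(n), np.zeros(n)
for x_t in rng.normal(size=(5, d)):          # run five timesteps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (4,) (4,)
```

Setting the forget-gate bias strongly positive and the input-gate bias strongly negative makes the cell state pass through a step almost unchanged, which is the "highway" behaviour described in the takeaways.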
Chapter 63 LSTM Part 3 Next Word Predictor Using CampusX
63.1 LSTM | Part 3 | Next Word Predictor Using | CampusX
63.1.1 1. Introduction
What is a Next Word Predictor?
A Next Word Predictor is an AI system that suggests the most likely word to follow a given sequence of words. It is essentially a text generation model that predicts one word at a time.
Figure 63.1: image
Key Characteristics
Feature | Description | Impact
Sequential Processing | Analyzes word order and context | High accuracy
Pattern Recognition | Learns from large text corpora | Better predictions
Context Awareness | Uses previous words for prediction | Natural flow
Real-time Prediction | Instant suggestions | User-friendly
63.1.2 2. Real-World Applications
Industry Impact
Application | User Base | Time Saved | Adoption Rate
Mobile Keyboards | 3B+ users | 30% typing time | 85%
Email Composers | 1.5B users | 20% email time | 65%
Code Completion | 50M developers | 40% coding time | 75%
Chat Applications | 2B+ users | 25% messaging time | 70%
63.1.3 3. Implementation Strategy
Converting Text Generation to Supervised Learning
Data Transformation Process
Step 1: Sentence to Sequences
Original sentence: “Hi my name is Nitish”
Input Sequence | Target Word
“Hi” | “my”
“Hi my” | “name”
“Hi my name” | “is”
“Hi my name is” | “Nitish”
Step 2: Word to Number Mapping
Figure 63.2: image
Step 3: Numerical Dataset
Input Sequence | Target
[1] | 2
[1, 2] | 3
[1, 2, 3] | 4
[1, 2, 3, 4] | 5
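The three steps above (split a sentence into prefixes, map words to numbers, pair each prefix with the next word) can be sketched in plain Python. The function name `make_training_pairs` is hypothetical, and the word ids are assigned in order of first appearance starting at 1, which is an assumption that happens to reproduce the table; a real tokenizer typically orders ids by word frequency.

```python
def make_training_pairs(sentence: str):
    """Return (vocab, pairs) where each pair is (input id prefix, target id)."""
    words = sentence.split()
    vocab = {}
    for w in words:
        # ids start at 1 so that 0 stays free for padding
        vocab.setdefault(w, len(vocab) + 1)
    ids = [vocab[w] for w in words]
    # every prefix of length >= 1 predicts the word that follows it
    pairs = [(ids[:i], ids[i]) for i in range(1, len(ids))]
    return vocab, pairs


vocab, pairs = make_training_pairs("Hi my name is Nitish")
for prefix, target in pairs:
    print(prefix, "->", target)
# [1] -> 2
# [1, 2] -> 3
# [1, 2, 3] -> 4
# [1, 2, 3, 4] -> 5
```

A sentence of n words yields n - 1 training pairs, so even a small corpus produces many supervised examples.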
63.1.4 4. Data Preprocessing
Tokenization Pipeline
63.2 Key Steps in Preprocessing
Figure 63.3: image
Preprocessing Components
Component | Purpose | Output
Tokenizer | Convert text to tokens | Word indices
Vocabulary | Store unique words | Word-to-ID mapping
Sequence Generator | Create input sequences | Training pairs
Padding | Uniform sequence length | Fixed-size inputs
Example Code Structure
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# `text` is the raw corpus as one string (assumed to be loaded already)
sentences = text.split('\n')

# Initialize the tokenizer and build the word-to-index mapping
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Convert each sentence to a list of word indices
sequences = tokenizer.texts_to_sequences(sentences)
63.2.1 5. Model Architecture
LSTM Network Design
Figure 63.4: image
Layer Configuration
Layer Type | Parameters | Purpose
Embedding | vocab_size × 100 | Convert tokens to vectors
LSTM 1 | 150 units, return_sequences=True | Capture sequence patterns
LSTM 2 | 100 units | Extract high-level features
Dense | vocab_size units | Output probabilities
Softmax | - | Normalize probabilities
63.2.2 6. Code Implementation
Complete Implementation Workflow
Step 1: Data Preparation
- Load text data
- Split into sentences
- Create token mappings
Step 2: Sequence Creation
- Convert text to numbers
- Create input-output pairs
- Pad sequences to uniform length
Step 3: Model Construction
- Build LSTM architecture
- Configure hyperparameters
- Compile with optimizer
Step 4: Training Process
- Train on prepared data
- Monitor loss metrics
- Save best model
63.2.3 7. Training & Evaluation
Training Configuration
Hyperparameter | Value | Purpose
Batch Size | 64 | Training efficiency
Epochs | 100 | Model convergence
Learning Rate | 0.001 | Optimization speed
Dropout | 0.2 | Prevent overfitting
Optimizer | Adam | Adaptive learning
63.2.4 1. Dataset Overview
Dataset Statistics
Metric | Value | Description
Total Words | ~283 unique | Vocabulary size
Document Type | FAQ Text | Q&A format
Language | English | Technical content
Size | Small | Demo purposes
Implementation Steps
Step 1: Tokenization
from tensorflow.keras.preprocessing.text import Tokenizer

# `faqs` is the FAQ corpus as a single string
tokenizer = Tokenizer()
tokenizer.fit_on_texts([faqs])
# Creates the word-to-index mapping (tokenizer.word_index)
Step 2: Sequence Generation
input_sequences = []
for sentence in faqs.split('\n'):
    tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
    # Keep every prefix of length >= 2: the last id is the target, the rest is the input
    for i in range(1, len(tokenized_sentence)):
        input_sequences.append(tokenized_sentence[:i+1])
Sequence Creation Example
Original sentence: “What is the fee”
Input Sequences | Target
[What] | is
[What, is] | the
[What, is, the] | fee
Padding Configuration
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = max(len(x) for x in input_sequences)  # 56
padded_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')
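Because each generated sequence ends with its target word, the padded array still has to be split into inputs and targets before training. The sketch below imitates pre-padding in plain Python so it runs without TensorFlow; `pad_pre` and `split_inputs_targets` are hypothetical helper names, and the three short sequences are toy data rather than the FAQ corpus.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) each sequence to maxlen."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    return [[value] * (maxlen - len(s)) + list(s[-maxlen:]) for s in sequences]


def split_inputs_targets(padded):
    """The last id of every row is the target; everything before it is the input."""
    X = [row[:-1] for row in padded]
    y = [row[-1] for row in padded]
    return X, y


seqs = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]   # toy prefix sequences
padded = pad_pre(seqs)
X, y = split_inputs_targets(padded)
print(padded)  # [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
print(X)       # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(y)       # [2, 3, 4]
```

Pre-padding keeps the most recent words next to the target position, and the input rows end up one element shorter than the padded length. With a categorical cross-entropy loss, the targets would additionally need one-hot encoding.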
63.2.5 3. Model Architecture Deep Dive
Complete Architecture Visualization
Figure 63.5: image
Layer-by-Layer Breakdown
1 Embedding Layer
Parameter | Value | Purpose
Input Dim | 283 | Vocabulary size
Output Dim | 100 | Dense vector size
Input Length | 56 | Max sequence length
Parameters | 28,300 | 283 × 100
2 LSTM Layer
Feature | Configuration | Calculation
Units | 150 | Hidden state dimension
Time Steps | 56 | Sequential processing
Input per Step | 100 | From embedding
Output | 150 | Final hidden state
3 Dense Output Layer
Component | Value | Function
Units | 283 | One per word
Activation | Softmax | Probability distribution
Parameters | 42,733 | 150 × 283 + 283
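63.2.6 4. Implementation Code
Complete Model Building
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Build model
model = Sequential()
model.add(Embedding(283, 100, input_length=56))  # 283-word vocabulary -> 100-dim vectors
model.add(LSTM(150))                             # returns the final 150-dim hidden state
model.add(Dense(283, activation='softmax'))      # one probability per vocabulary word

# Compile
model.compile(loss='categorical_crossentropy',
              optimizer='adam')  # optimizer assumed: Adam, per the training configuration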
Common mistakes
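- Not padding and masking variable-length batches before feeding them to an LSTM.
- Ignoring vanishing gradients on long sequences: the cell state reduces the problem but does not remove it.
- Using teacher forcing during training without a plan for inference, where the true previous tokens are not available.
Interview checkpoints
- Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.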
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- LSTM Cell State for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Next: Day 72 — GRU
GRU
Contents
I Introduction to Deep Learning
1 Course Announcement
1.1 100 Days of Deep Learning Course Announcement
1.2 Deep Learning Course Content
1.2.1 1. Curriculum
1.2.2 Deep Learning Curriculum Structure
1.3 Artificial Neural Networks (ANN)
1.3.1 Basics
1.3.2 Perceptron
1.3.3 MLP [Multi-layer perceptron]
1.3.4 Training an MLP [Most used Algorithm]
1.3.5 Practical with Keras
1.3.6 How to improve an ANN
1.3.7 Advanced Topics
1.3.8 Project
1.3.9 Features
1.3.10 Prerequisites
1.3.11 Extra Content
2 What is Deep Learning? Deep Learning Vs Machine Learning
2.1 What is Deep Learning? Deep Learning Vs Machine Learning
2.2 Deep Learning: Comprehensive Notes
2.2.1 Definition & Relationship to AI
2.2.2 Biological Inspiration
2.2.3 Neural Network Structure
2.3 Machine Learning vs Deep Learning: A Comprehensive Comparison
2.3.1 1. Machine Learning (ML)
2.3.2 2. Deep Learning (DL)
2.3.3 3. Detailed Comparison
2.3.4 4. When to Use Each Approach
2.3.5 5. Real-World Applications
2.3.6 6. The ML-DL Relationship
2.4 Neural Network Architectures Explained
2.4.1 1. Artificial Neural Networks (ANN)
2.4.2 2. Convolutional Neural Networks (CNN)
2.4.3 3. Recurrent Neural Networks (RNN)
2.4.4 4. Generative Adversarial Networks (GAN)
2.4.5 Comparative Overview
2.5 The Rise of Deep Learning: Applications & Performance
2.5.1 Introduction: Why Deep Learning Has Transformed AI
2.5.2 1. Applications: Transforming Industries
Why this matters
GRU simplifies LSTM with fewer gates.
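To make "fewer gates" concrete, here is a minimal NumPy sketch of one GRU step with its two gates (reset and update) and a single hidden state. The helper names, weight layout, and sizes are illustrative assumptions. The final interpolation follows one common convention, $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$; some references and libraries swap the roles of $z_t$ and $1 - z_t$.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def gru_step(x_t, h_prev, W, b):
    """One GRU timestep. W[k] has shape (hidden, hidden + input), b[k] shape (hidden,)."""
    hx = np.concatenate([h_prev, x_t])
    r = sigmoid(W["r"] @ hx + b["r"])                     # reset gate
    z = sigmoid(W["z"] @ hx + b["z"])                     # update gate
    # candidate state: the reset gate scales how much of h_prev is used
    h_hat = np.tanh(W["h"] @ np.concatenate([r * h_prev, x_t]) + b["h"])
    return (1.0 - z) * h_prev + z * h_hat                 # blend old state and candidate


rng = np.random.default_rng(0)
d, n = 3, 4                                               # input size, hidden size
W = {k: rng.normal(scale=0.1, size=(n, n + d)) for k in "rzh"}
b = {k: np.zeros(n) for k in "rzh"}

h = np.zeros(n)
for x_t in rng.normal(size=(5, d)):                       # run five timesteps
    h = gru_step(x_t, h, W, b)
print(h.shape)  # (4,)
```

Compared with an LSTM step there is no separate cell state and no output gate: the update gate alone decides how much of the previous hidden state is kept.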
20.1.1 Introduction
Topic: the vanishing gradient problem, an important topic in deep learning and a frequent source of interview questions. It appears in several variants, and when it occurs the neural network cannot train properly. What will be covered: what the vanishing gradient problem is, why it happens, and five ways to solve it.
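A small numeric experiment shows why the gradient vanishes. During backpropagation the gradient is multiplied by one activation derivative (and one weight) per layer or timestep, and the sigmoid derivative is at most 0.25. The sketch below is a simplified scalar model under stated assumptions: every layer has the same pre-activation and the same weight, and the function names are made up for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # maximum value 0.25, reached at x = 0


def gradient_scale(depth, pre_activation=0.0, weight=1.0):
    """Factor by which a gradient is scaled after passing back through `depth` layers."""
    return (weight * sigmoid_grad(pre_activation)) ** depth


for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in the best case for the sigmoid (derivative 0.25 at every layer, weight 1), the factor after 10 layers is 0.25 to the power 10, roughly one millionth, so the earliest layers or timesteps receive almost no learning signal.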
Common mistakes
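- Not padding and masking variable-length batches before feeding them to a GRU.
- Ignoring vanishing gradients on long sequences: gating reduces the problem but does not remove it.
- Using teacher forcing during training without a plan for inference, where the true previous tokens are not available.
Interview checkpoints
- Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.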
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- GRU for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Bidirectional RNN
Contents
65.1.5 Input Processing
65.1.6 GRU Architecture
65.1.7 Hidden State Fundamentals
65.1.8 GRU Architecture Overview
65.1.9 Mathematical Formulations
65.1.10 Step-by-Step Process
65.1.11 LSTM vs GRU Comparison
65.1.12 Key Takeaways
66 Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU
66.1 Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU
66.2 Bidirectional RNN - Comprehensive Notes
66.2.1 Overview
66.2.2 Why Bidirectional RNNs?
66.2.3 Bidirectional RNN Architecture
66.2.4 Implementation in Keras
66.2.5 Applications
66.2.6 Advantages & Drawbacks
66.2.7 Best Practices
66.2.8 Summary
XIII History of Large Language Models
67 The Epic History of Large Language Models (LLMs) From LSTMs to ChatGPT CampusX
67.1 The Epic History of Large Language Models (LLMs) | From LSTMs to ChatGPT | CampusX
67.2 Sequence Tasks and Types: Comprehensive Guide
67.2.1 Sequence Processing Architecture
67.2.2 RNN Input-Output Patterns
67.2.3 Key Applications of Sequence Models
67.2.4 Translation Example
67.3 Seq2Seq Tasks in NLP
67.3.1 Architecture Overview
67.3.2 Key Seq2Seq NLP Tasks
67.3.3 Seq2Seq Task Flow Visualization
67.3.4 Key Insights
67.3.5 Timeline: From Simple to Sophisticated
67.3.6 The Five Evolutionary Stages
67.3.7 Key Developments in Each Stage
67.3.8 The Seq2Seq Revolution
67.4 Stage 1 - Encoder Decoder Architecture
67.4.1 Historical Context
67.4.2 Encoder-Decoder Architecture Overview
67.4.3 Research Paper Reference
67.4.4 Working Mechanism Explained
67.4.5 Implementation Details
67.4.6 Limitations
Why this matters
A bidirectional RNN sees both past and future context at every timestep, which helps in NLP tagging tasks.
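The idea of "seeing past and future" can be sketched with two simple tanh RNNs in NumPy: one reads the sequence left to right, the other right to left, and their hidden states are concatenated per timestep. The function names, the plain tanh cells, and the sizes are illustrative assumptions; in practice the two directions are usually LSTM or GRU layers.

```python
import numpy as np


def rnn_states(xs, W_x, W_h, b):
    """Hidden states of a simple tanh RNN, shape (time, hidden)."""
    h = np.zeros(W_h.shape[0])
    out = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        out.append(h)
    return np.stack(out)


def bidirectional_states(xs, fwd, bwd):
    """Concatenate forward and backward hidden states for every timestep."""
    h_fwd = rnn_states(xs, *fwd)                 # reads x_1 ... x_T
    h_bwd = rnn_states(xs[::-1], *bwd)[::-1]     # reads x_T ... x_1, then re-aligned in time
    return np.concatenate([h_fwd, h_bwd], axis=-1)


rng = np.random.default_rng(0)
T, d, n = 3, 4, 5                                # timesteps, input size, hidden size per direction


def init_params():
    return (rng.normal(scale=0.5, size=(n, d)),  # W_x
            rng.normal(scale=0.5, size=(n, n)),  # W_h
            np.zeros(n))                         # b


fwd, bwd = init_params(), init_params()          # separate weights per direction
xs = rng.normal(size=(T, d))
out = bidirectional_states(xs, fwd, bwd)
print(out.shape)  # (3, 10)
```

At the first timestep the forward half has seen only the first input, while the backward half has already processed the whole sequence in reverse, so the concatenated vector carries context from both sides.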
Common mistakes
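- Not padding and masking variable-length batches before feeding them to a bidirectional layer.
- Ignoring vanishing gradients on long NLP sequences: gating reduces the problem but does not remove it.
- Using teacher forcing during training without a plan for inference, where the true previous tokens are not available.
Interview checkpoints
- Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.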
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- Bidirectional RNN for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Next: Day 74 — Stacked LSTMs
Stacked LSTMs
Contents
62.1.5 5. Mathematical Representations
62.1.6 6. Pointwise Operations ⊙
62.1.7 7. Neural Network Layers
62.1.8 8. Complete LSTM Workflow
62.1.9 1. The Forget Gate
62.1.10 2. The Input Gate
62.1.11 3. The Output Gate
62.1.12 4. Complete LSTM Cell Animation
63 LSTM Part 3 Next Word Predictor Using CampusX
63.1 LSTM | Part 3 | Next Word Predictor Using | CampusX
63.1.1 1. Introduction
63.1.2 2. Real-World Applications
63.1.3 3. Implementation Strategy
63.1.4 4. Data Preprocessing
63.2 Key Steps in Preprocessing
63.2.1 5. Model Architecture
63.2.2 6. Code Implementation
63.2.3 7. Training & Evaluation
63.2.4 1. Dataset Overview
63.2.5 3. Model Architecture Deep Dive
63.2.6 4. Implementation Code
63.2.7 5. Training Process
63.2.8 6. Prediction Mechanism
63.2.9 7. Performance Optimization
63.2.10 8. Results & Examples
64 Deep RNNs Stacked RNNs Stacked LSTMs Stacked GRUs CampusX
64.1 Deep RNNs | Stacked RNNs | Stacked LSTMs | Stacked GRUs | CampusX
64.1.1 Introduction
64.1.2 Fundamental Concepts
64.1.3 Architecture Deep Dive
64.1.4 Information Flow
64.1.5 Implementation Details
64.1.6 Key Mathematical Concepts Covered:
64.2 Deep RNN Complete Guide - Part 2: Advanced Concepts & Implementation
64.2.1 Mathematical Notation System
64.2.2 Why & When to Use Deep RNNs
64.2.3 Variants & Extensions
64.2.4 Key Takeaways & Next Steps
65 Gated Recurrent Unit Deep Learning GRU CampusX
65.1 Gated Recurrent Unit | Deep Learning | GRU | CampusX
65.1.1 Introduction
65.1.2 Why GRU Exists
Why this matters
Stacked LSTMs add depth across layers: each layer's hidden-state sequence becomes the input sequence of the layer above it.
63.2.10 8. Results & Examples
Prediction Examples
Input                 Prediction                                                                            Quality
“mail”                “mail us at nitish.campusx@gmail.com”                                                 Perfect
“what is the fee”     “what is the fee of the course for data science”                                      Excellent
“total duration”      “total duration of the course is 7 months so the total course fee becomes 799”        Very Good
“both are”            “both are not a part of this program’s curriculum”                                    Contextual

Key Insights
Aspect        Observation                                   Recommendation
Strengths     Good pattern recognition on training data     Build on this foundation
Weaknesses    Limited vocabulary, potential overfitting     Add validation split
Next Steps    Scale to larger datasets                      Use transfer learning

Chapter 64
Deep RNNs | Stacked RNNs | Stacked LSTMs | Stacked GRUs | CampusX
64.1 Deep RNNs | Stacked RNNs | Stacked LSTMs | Stacked GRUs | CampusX
64.1.1 Introduction
Deep RNNs (also called Stacked RNNs) are an extension of traditional RNNs where multiple RNN layers are stacked vertically to increase the model’s representational power and ability to capture complex patterns in sequential data.

Key Motivation
Problem                              Solution                      Benefit
Limited representational power       Add more hidden layers        Increased model complexity
Poor performance on complex tasks    Stack multiple RNN cells      Better pattern recognition
Insufficient feature extraction      Vertical layer composition    Hierarchical feature learning
64.1.2 Fundamental Concepts
Evolution from Simple to Deep
Neural Network Complexity Progression

Problem Setup: Sentiment Analysis
Task: Classify movie reviews as positive (1) or negative (0)

Example Dataset
Review              Label           Length
“cat mat rat”       1 (positive)    3 words
“good bad ugly”     0 (negative)    3 words
“love hate fear”    1 (positive)    3 words

Word Encoding (One-Hot)
Word    Encoding
cat     [1, 0, 0]
mat     [0, 1, 0]
rat     [0, 0, 1]
64.1.3 Architecture Deep Dive
Standard RNN Architecture

Single RNN Cell Structure
Input Layer (3D) -> RNN Cell (3 units) -> Output Layer (1D)
                    (feedback loop)

Mathematical Representation
For a single RNN cell at time step $t$:
$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$
$$y_t = \sigma(W_{hy} h_t + b_y)$$
Where:
- $h_t$: Hidden state at time $t$
- $x_t$: Input at time $t$
- $W_{hh}$: Hidden-to-hidden weight matrix (3×3)
- $W_{xh}$: Input-to-hidden weight matrix (3×3)
- $W_{hy}$: Hidden-to-output weight matrix (1×3)
- $\sigma$: Sigmoid activation function

Deep RNN Architecture

Two-Layer Deep RNN Structure
Input Layer (3D) -> RNN Layer 1 (3 units) -> RNN Layer 2 (2 units) -> Output (1D)
                    (feedback loop)          (feedback loop)

Mathematical Formulation
Layer 1:
$$h^{(1)}_t = \tanh(W^{(1)}_{hh} h^{(1)}_{t-1} + W^{(1)}_{xh} x_t + b^{(1)}_h)$$
Layer 2:
$$h^{(2)}_t = \tanh(W^{(2)}_{hh} h^{(2)}_{t-1} + W_{h^{(1)}h^{(2)}} h^{(1)}_t + b^{(2)}_h)$$
Output:
$$y_t = \sigma(W_{hy} h^{(2)}_t + b_y)$$

Weight Matrix Dimensions
Connection           Matrix                    Dimensions
Input → Layer 1      $W^{(1)}_{xh}$            3×3
Layer 1 → Layer 1    $W^{(1)}_{hh}$            3×3
Layer 1 → Layer 2    $W_{h^{(1)}h^{(2)}}$      2×3
Layer 2 → Layer 2    $W^{(2)}_{hh}$            2×2
Layer 2 → Output     $W_{hy}$                  1×2
64.1.4 Information Flow
Time Step Analysis

Time Step 1 ($t = 1$)
Input: “cat” → [1, 0, 0]
Layer 1 Computation:
$$h^{(1)}_1 = \tanh(W^{(1)}_{hh} \cdot [0,0,0] + W^{(1)}_{xh} \cdot [1,0,0] + b^{(1)}_h)$$
Layer 2 Computation:
$$h^{(2)}_1 = \tanh(W^{(2)}_{hh} \cdot [0,0] + W_{h^{(1)}h^{(2)}} h^{(1)}_1 + b^{(2)}_h)$$

Time Step 2 ($t = 2$)
Input: “mat” → [0, 1, 0]
Layer 1 Computation:
$$h^{(1)}_2 = \tanh(W^{(1)}_{hh} h^{(1)}_1 + W^{(1)}_{xh} \cdot [0,1,0] + b^{(1)}_h)$$
Layer 2 Computation:
$$h^{(2)}_2 = \tanh(W^{(2)}_{hh} h^{(2)}_1 + W_{h^{(1)}h^{(2)}} h^{(1)}_2 + b^{(2)}_h)$$

Time Step 3 ($t = 3$)
Input: “rat” → [0, 0, 1]
Final Output Computation:
$$y_3 = \sigma(W_{hy} h^{(2)}_3 + b_y)$$

Unfolded Architecture Visualization
Figure 64.1: image
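The time-step walkthrough for “cat mat rat” can be traced in code. The sketch below is a plain-Python illustration under stated assumptions: the weights are random placeholders (not trained values), biases start at zero, and the helper names `matvec`, `rnn_step`, and `rand_matrix` are hypothetical. The matrix shapes follow the 3-unit and 2-unit layers described in this section.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(W_hh, W_in, b, h_prev, x):
    """One recurrent update: tanh(W_hh . h_prev + W_in . x + b)."""
    rec = matvec(W_hh, h_prev)
    inp = matvec(W_in, x)
    return [math.tanh(r + i + bias) for r, i, bias in zip(rec, inp, b)]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)  # fixed seed so the run is repeatable
W1_hh, W1_xh, b1 = rand_matrix(3, 3, rng), rand_matrix(3, 3, rng), [0.0] * 3
W2_hh, W12, b2 = rand_matrix(2, 2, rng), rand_matrix(2, 3, rng), [0.0] * 2
W_hy, b_y = rand_matrix(1, 2, rng), [0.0]

one_hot = {"cat": [1, 0, 0], "mat": [0, 1, 0], "rat": [0, 0, 1]}

h1, h2 = [0.0] * 3, [0.0] * 2  # initial hidden states are zero vectors
for word in ["cat", "mat", "rat"]:
    h1 = rnn_step(W1_hh, W1_xh, b1, h1, one_hot[word])  # layer 1 reads the input
    h2 = rnn_step(W2_hh, W12, b2, h2, h1)               # layer 2 reads layer 1's state

# Sigmoid output computed from the final layer-2 state
logit = matvec(W_hy, h2)[0] + b_y[0]
y3 = 1.0 / (1.0 + math.exp(-logit))
print("h1:", h1)
print("h2:", h2)
print("y3:", y3)
```

Note that layer 2 never sees the raw input: at each step it consumes the layer-1 hidden state computed at that same step, which is what distinguishes stacking from simply widening a single layer.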
64.1.5 Implementation Details
Memory Requirements
Component          Memory Usage
Layer 1 weights    (3×3) + (3×3) = 18 parameters
Layer 2 weights    (2×3) + (2×2) = 10 parameters
Output weights     1×2 = 2 parameters
Total              30 parameters (weight matrices only; bias vectors are not counted)

Computational Complexity
∗ Forward Pass: $O(T \times \sum_{l=1}^{L} n_l^2)$
∗ Backward Pass: $O(T \times \sum_{l=1}^{L} n_l^2)$
Where:
- $T$: Sequence length
- $L$: Number of layers
- $n_l$: Number of units in layer $l$
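The weight counts for the two-layer example can be reproduced with a small helper. This is a sketch only: the function name `deep_rnn_param_counts` is hypothetical, and it reports bias vectors separately so that the weight-only figure and the full trainable count can both be read off.

```python
def deep_rnn_param_counts(input_dim, layer_units, output_dim):
    """Return (weight_count, bias_count) for a stacked simple RNN with a dense output."""
    weights, biases = 0, 0
    prev = input_dim
    for units in layer_units:
        weights += units * prev    # input-to-hidden (or layer-to-layer) matrix
        weights += units * units   # hidden-to-hidden matrix
        biases += units
        prev = units
    weights += output_dim * prev   # hidden-to-output matrix
    biases += output_dim
    return weights, biases


weights, biases = deep_rnn_param_counts(input_dim=3, layer_units=[3, 2], output_dim=1)
print("weights:", weights)
print("biases:", biases)
print("total trainable:", weights + biases)
```

Changing `layer_units` shows how quickly the hidden-to-hidden term grows, since it is quadratic in the number of units per layer.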
Advantages of Deep RNNs
Advantage                      Description                                          Impact
Hierarchical Representation    Each layer learns different levels of abstraction    High
Better Feature Extraction      Multiple layers capture complex patterns             High
Improved Performance           Better accuracy on complex tasks                     Medium
Flexible Architecture          Can vary units per layer                             Medium

Challenges
Challenge              Description                           Mitigation
Vanishing Gradients    Gradients diminish through layers     LSTM/GRU cells
Computational Cost     More parameters and operations        Efficient implementations
Overfitting            Complex model may overfit             Regularization techniques
Common mistakes
- Not padding/masking variable-length stacked batches.
- Vanishing gradients across deep stacks and long sequences.
- Teacher forcing only at train without plan for inference.
Interview checkpoints
- Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
- Q: LSTM vs vanilla? A: Gated memory reduces vanishing.
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- Stacked LSTMs for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Time Series Forecasting
Why this matters
Time series forecasting needs careful scaling and windows.
53.3.10 Expected Performance
∗ Accuracy: ~90-95% (typical for this approach)
∗ Training Time: Faster than training from scratch
∗ Data Efficiency: Works well with limited data
Part XI
Advanced Keras
Chapter 54
Keras Functional Model
54.1 Functional API in Keras - Detailed Notes
54.1.1 Introduction
This tutorial covers the Functional API in Keras, which allows building non-linear neural network topologies, unlike the Sequential API that only supports linear layer stacking.
54.1.2 Why Functional API?
Limitations of Sequential Model
∗ Sequential models follow a linear topology: one layer after another
∗ Input → Layer 1 → Layer 2 → ... → Output
∗ Cannot handle:
  · Multiple inputs
  · Multiple outputs
  · Branching architectures
  · Shared layers
When to Use Functional API
Example 1: Multi-Output Model
- Input: Human face images
- Outputs:
  - Age prediction (regression)
  - Emotion classification (happy, sad, angry)
- Requires branching architecture with shared CNN base
Example 2: Multi-Input Model
- E-commerce pricing prediction
- Inputs:
  - Tabular metadata (color, size)
  - Text description
  - Product image
- Output: Price prediction
- Different inputs need different processing (Dense, RNN, CNN)
54.1.3 Basic Functional API Syntax
Key Components
from keras.models import Model
from keras.layers import Input, Dense

# Define input layer (input_shape is the number of input features)
input_layer = Input(shape=(input_shape,))

# Build network by connecting layers
hidden = Dense(64, activation='relu')(input_layer)
output = Dense(1)(hidden)

# Create model
model = Model(inputs=input_layer, outputs=output)
Important Differences from Sequential:
1. Each layer's output is assigned to a variable so that later layers can refer to it
2. Layers are connected by calling them on the output of previous layers
3. The model is created by specifying its inputs and outputs
54.1.4 Code Examples
1. Simple Multi-Output Model
from keras.layers import Input, Dense
from keras.models import Model

# Input layer
x = Input(shape=(3,))

# Shared layers
hidden1 = Dense(128, activation='relu')(x)
hidden2 = Dense(64, activation='relu')(hidden1)

# Two output branches
output1 = Dense(1, activation='linear', name='age')(hidden2)
output2 = Dense(1, activation='sigmoid', name='place')(hidden2)

# Create model with multiple outputs
model = Model(inputs=x, outputs=[output1, output2])

# Compile with multiple losses (keys match the output layer names)
model.compile(
    optimizer='adam',
    loss={
        'age': 'mse',
        'place': 'binary_crossentropy'
    }
)
2. Multi-Input Model with Concatenation
from keras.layers import Input, Dense, concatenate
from keras.models import Model

# Define two inputs
inputA = Input(shape=(32,))
inputB = Input(shape=(128,))

# Branch 1
x = Dense(8, activation="relu")(inputA)
x1 = Dense(4, activation="relu")(x)

# Branch 2
y = Dense(64, activation="relu")(inputB)
y1 = Dense(32, activation="relu")(y)
y2 = Dense(4, activation="relu")(y1)

# Concatenate branches (4 + 4 = 8 features)
combined = concatenate([x1, y2])

# Final layers
z = Dense(2, activation="relu")(combined)
z1 = Dense(1, activation="linear")(z)

# Model with multiple inputs
model = Model(inputs=[inputA, inputB], outputs=z1)
3. Practical Example: UTKFace Dataset
Dataset: Face images with age and gender labels
Task: Predict both age and gender from face images
Data Preparation
import os
import pandas as pd

age, gender, img_path = [], [], []

# Extract age and gender from filename (the first two '_'-separated fields)
for file in os.listdir(folder_path):
    age.append(int(file.split('_')[0]))
    gender.append(int(file.split('_')[1]))
    img_path.append(file)

# Create DataFrame
df = pd.DataFrame({'age': age, 'gender': gender, 'img': img_path})

# Shuffle once, then split so train and test rows do not overlap
shuffled = df.sample(frac=1, random_state=0)
train_df = shuffled.iloc[:20000]
test_df = shuffled.iloc[20000:]
Data Augmentation
from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

train_generator = train_datagen.flow_from_dataframe(
    train_df,
    directory=folder_path,
    x_col='img',
    y_col=['age', 'gender'],  # Multiple outputs
    target_size=(200, 200),
    class_mode='multi_output'
)
Model Architecture with Transfer Learning
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Flatten
from keras.models import Model

# Load pre-trained ResNet50 without its classification head
resnet = ResNet50(include_top=False, input_shape=(200, 200, 3))
resnet.trainable = False  # freeze the convolutional base

# Get output from last layer
output = resnet.layers[-1].output
flatten = Flatten()(output)

# Create branches for age and gender
# Age branch
dense1 = Dense(512, activation='relu')(flatten)
dense3 = Dense(512, activation='relu')(dense1)
output1 = Dense(1, activation='linear', name='age')(dense3)

# Gender branch
dense2 = Dense(512, activation='relu')(flatten)
dense4 = Dense(512, activation='relu')(dense2)
output2 = Dense(1, activation='sigmoid', name='gender')(dense4)

# Create model
model = Model(inputs=resnet.input, outputs=[output1, output2])
Compilation with Multiple Losses
model.compile(
    optimizer='adam',
    loss={
        'age': 'mae',                    # Mean Absolute Error for regression
        'gender': 'binary_crossentropy'  # For binary classification
    },
    metrics={
        'age': 'mae',
        'gender': 'accuracy'
    },
    loss_weights={
        'age': 1,
        'gender': 99  # Higher weight for gender loss
    }
)
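With `loss_weights`, the quantity being minimized is a weighted sum of the per-output losses. The sketch below reproduces that arithmetic in plain Python on a made-up batch; the sample values and helper names are hypothetical, and any regularization terms a real model might add are ignored.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def weighted_total(losses, weights):
    """Weighted sum of named losses, mirroring the loss_weights dictionary."""
    return sum(weights[name] * losses[name] for name in losses)


# Hypothetical batch of two samples
age_true, age_pred = [25, 40], [30, 38]
gender_true, gender_pred = [1, 0], [0.9, 0.2]

losses = {
    "age": mean_absolute_error(age_true, age_pred),
    "gender": binary_crossentropy(gender_true, gender_pred),
}
total = weighted_total(losses, {"age": 1, "gender": 99})
print(losses, total)
```

An age error measured in years is usually much larger in magnitude than a cross-entropy value, so a large weight on the classification loss is one way to keep the regression term from dominating the gradient.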
54.1.5 Key Advantages of Functional API
1. Flexibility: Create any network topology
2. Multiple inputs/outputs: Handle complex data flows
3. Shared layers: Reuse layers in different branches
4. Model visualization: Easy to visualize with plot_model()
54.1.6 Visualization
from keras.utils import plot_model

# Draw the layer graph, annotating each layer with its input/output shapes
plot_model(model, show_shapes=True)
54.1.7 Best Practices
1. Naming layers: Give meaningful names to important layers
2. Variable naming: Use descriptive variable names for layer outputs
3. Loss weights: Adjust loss weights for multi-output models based on task importance
4. Transfer learning: Combine pre-trained models with custom architectures
54.1.8 Common Architectures with Functional API
1. Siamese Networks: Shared weights between branches
2. Multi-modal Networks: Different input types (text, image, tabular)
3. Residual Networks: Skip connections
4. Attention Mechanisms: Complex routing between layers
54.1.9 Resources
∗ Keras Functional API Documentation
∗ Machine Learning Mastery Blog Post
The Functional API enables building neural network architectures that go beyond simple sequential models, which makes it essential for multi-input, multi-output, and branching designs.
Part XII
Recurrent Neural Networks
Chapter 55
Why RNNs are needed | RNNs Vs ANNs | RNN Part 1
55.1 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1
Figure 55.1: image
55.1.1 Neural Network Types Covered So Far
Neural Network Type                    Primary Use Case         Data Type
Artificial Neural Networks (ANN)       General purpose          Tabular Data
Convolutional Neural Networks (CNN)    Image processing         Grid-like Data (Images, Videos)
Recurrent Neural Networks (RNN)        Sequential processing    Sequential Data
55.1.2 What are Recurrent Neural Networks?
Definition
RNN = A special type of sequential model specifically designed to work on sequential data

Key Characteristics
∗ Purpose: Process sequential information
∗ Memory: Maintains context from previous inputs
∗ Applications: NLP, time series, speech recognition
55.1.3 Understanding Sequential Data
Non-Sequential vs Sequential Data

Non-Sequential Data Example
Student Placement Prediction:
Input Features -> Neural Network -> Prediction
  Age: 22
  Marks: 85%    -> ANN -> Placement: Yes/No
  Gender: Male
Note: Order doesn’t matter; the features can be rearranged without affecting the outcome.

Sequential Data Examples
Data Type      Example                    Why Sequence Matters
Text           “Hey my name is Nitish”    Word order determines meaning
Time Series    Stock prices over years    Past values influence future trends
Audio          Speech waveforms           Temporal patterns create meaning
Biological     DNA sequences              Gene order affects function

Text Processing Example
"Hey   my    name  is    Nitish"
 |     |     |     |     |
 Word  Word  Word  Word  Word
 1     2     3     4     5
Sequential Processing:
- Read word by word
- Retain context from previous words
- Build understanding progressively
- Combine all information for final meaning

Time Series Example
Stock Price Progression:
2001 -> 2002 -> 2003 -> 2004 -> ...
$50     $55     $48     $62
Sequential Dependency:
- Current price influenced by historical trends
- Past performance affects future predictions
- Temporal relationships are crucial
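The idea that past values feed the prediction of the next one is usually turned into training data by sliding a window over the series. The sketch below shows one common way to do this, together with min-max scaling; the helper names are hypothetical and the four prices are the toy values from the stock example.

```python
def make_windows(series, window):
    """Pair each run of `window` consecutive values with the value that follows it."""
    X, y = [], []
    for start in range(len(series) - window):
        X.append(series[start:start + window])
        y.append(series[start + window])
    return X, y


def min_max_scale(series):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant series
    return [(v - lo) / span for v in series]


prices = [50, 55, 48, 62]

X, y = make_windows(prices, window=2)
for inputs, target in zip(X, y):
    print(inputs, "->", target)

scaled = min_max_scale(prices)
print(scaled)
```

In a real forecasting setup the minimum and maximum would normally be computed on the training portion only, so that information from the test period does not leak into the scaling.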
55.1.4 Why RNNs are Essential
The Sequential Data Challenge
Traditional neural networks (ANN, CNN) cannot handle sequential dependencies because:
∗ Fixed Input Size: Cannot process variable-length sequences
∗ No Memory: Cannot retain information from previous inputs
∗ Order Ignorance: Treat all inputs as independent

RNN Advantages
∗ Memory Capability: Remembers previous inputs
∗ Sequential Processing: Processes one element at a time
∗ Variable Length: Handles sequences of different lengths
∗ Context Awareness: Maintains context throughout sequence
55.1.5 Applications of RNNs
Natural Language Processing (NLP)
∗ Text Classification
∗ Language Translation
∗ Sentiment Analysis
∗ Text Generation
Time Series Analysis
∗ Stock Price Prediction
∗ Weather Forecasting
∗ Sales Forecasting
Speech & Audio
∗ Speech Recognition
∗ Music Generation
∗ Audio Classification
55.2 RNN Fundamentals - Why Use RNNs?
55.2.1 Core Question
Why do we need RNNs (Recurrent Neural Networks)? What specific problems exist that prevent us from using regular neural networks on sequential data?
55.2.2 The Sequential Data Challenge
Text Classification Example
Consider sentiment analysis:
- Input: Text sentences
- Output: Positive/Negative sentiment

Example Sentences         Expected Output
“Hi my name is Nitish”    Positive/Negative
“My name”                 Positive/Negative
“Name is”                 Positive/Negative
55.2.3 Problem 1: Text Representation
Challenge
Neural networks cannot understand text directly; we need a numerical representation.

Solution: One-Hot Encoding
Vocabulary Creation Process
1. Find unique words in entire vocabulary
2. Create vector representation for each word

Example Implementation
Sample Text: “Hi my name is Nitish”
- Unique words: 12 total words in vocabulary
- Vector size: 12 dimensions per word

Word      One-Hot Vector
“Hi”      [1,0,0,0,0,0,0,0,0,0,0,0]
“my”      [0,1,0,0,0,0,0,0,0,0,0,0]
“name”    [0,0,1,0,0,0,0,0,0,0,0,0]

Vector Stacking
Input Matrix = [Hi_vector, my_vector, name_vector, is_vector, Nitish_vector]
Result: Vertically stacked vectors
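The vocabulary and vector-stacking steps can be sketched in a few lines. This is an illustration under stated assumptions: the helpers `build_vocab`, `one_hot`, and `encode` are hypothetical, and the vocabulary is built from the single sample sentence, so it has 5 entries here rather than the 12 used in the table.

```python
def build_vocab(sentences):
    """Map each unique lower-cased word to an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[vocab[word.lower()]] = 1
    return vec


def encode(sentence, vocab):
    """Stack one one-hot row per word, giving a (words x vocab_size) matrix."""
    return [one_hot(word, vocab) for word in sentence.split()]


vocab = build_vocab(["Hi my name is Nitish"])
matrix = encode("Hi my name is Nitish", vocab)

for word, row in zip("Hi my name is Nitish".split(), matrix):
    print(f"{word:>7}: {row}")
```

Each row has exactly one non-zero entry, so the matrix grows with both sentence length and vocabulary size, which is why one-hot inputs become impractical for large vocabularies.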
55.2.4 Problem 2: Variable Input Sizes
The Core Issue
Sentence                  Word Count    Input Size
“Hi my name is Nitish”    5 words       5×12 = 60
“My name is”              3 words       3×12 = 36
“Name is”                 2 words       2×12 = 24
Problem: Neural networks require fixed input size

Why This Breaks Neural Networks
Figure 55.2: image
55.2.5 Solution: Zero Padding
Implementation Strategy
1. Find maximum sentence length in dataset
2. Pad shorter sentences with zero vectors

Example Implementation
Step 1: Identify Maximum Length
∗ Longest sentence: “Hi my name is Nitish” (5 words)
∗ Padding target: 5 words for all sentences
Step 2: Apply Padding
Original: "My name is" (3 words)
Padded:   "My name is [0] [0]" (5 words)

Where [0] = [0,0,0,0,0,0,0,0,0,0,0,0]
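The padding step can be sketched as appending all-zero vectors until every sentence has the same number of rows. The code below is a minimal illustration: the helper names and the word indices are hypothetical, and the vector size of 12 matches the vocabulary size used in the example.

```python
VECTOR_SIZE = 12  # vocabulary size from the example
MAX_LEN = 5       # length of the longest sentence


def one_hot(index, size=VECTOR_SIZE):
    vec = [0] * size
    vec[index] = 1
    return vec


def pad_with_zero_vectors(matrix, max_len, vector_size):
    """Append all-zero rows until the matrix has max_len rows."""
    padding = [[0] * vector_size for _ in range(max_len - len(matrix))]
    return matrix + padding


def flatten(matrix):
    """Concatenate the rows into the single fixed-size vector a dense network expects."""
    return [value for row in matrix for value in row]


# "My name is" with hypothetical vocabulary indices 1, 2, 3
my_name_is = [one_hot(1), one_hot(2), one_hot(3)]

padded = pad_with_zero_vectors(my_name_is, MAX_LEN, VECTOR_SIZE)
flat = flatten(padded)

print(len(padded), "rows,", len(flat), "values in the flattened input")
```

After padding, every sentence flattens to the same 60 values, which fixes the input size but still leaves the network with no notion of word order beyond position.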
53.3.10 Expected Performance
∗Accuracy: ~90-95% (typical for this approach) ∗Training Time: Faster than training from scratch ∗Data Efficiency: Works well with limited data ∗ 606
Part XI
Advanced Keras
607Chapter 54
Keras Functional Model
54.1 FunctionalAPIinKeras-DetailedNotes
54.1.1 Introduction
This tutorial covers theFunctional APIin Keras, which allows building
non-linear neural network topologiesunlike the Sequential API that
only supports linear layer stacking.
54.1.2 Why Functional API?
Limitations of Sequential Model
∗Sequential models follow alinear topology- one layer after another
∗Input→Layer 1→Layer 2→...→Output
∗Cannot handle:
·Multiple inputs
·Multiple outputs
·Branching architectures
·Shared layers
When to Use Functional API
Example 1: Multi-Output Model-Input: Human face images -
Outputs: - Age prediction (regression) - Emotion classification (happy,
sad, angry) - Requires branching architecture with shared CNN base
Example 2: Multi-Input Model-E-commerce pricing prediction
-Inputs: - Tabular metadata (color, size) - Text description - Product im-
age -Output: Price prediction - Different inputs need different processing
(Dense, RNN, CNN)
54.1.3 Basic Functional API Syntax
Key Components
1fromkeras.modelsimportModel
2fromkeras.layersimportInput, Dense
3
4# Define input layer
5input_layer = Input(shape=(input_shape,))
6
60854.1. Functional API in Keras - Detailed Notes
7# Build network by connecting layers
8hidden = Dense(64, activation=’relu’)(input_layer)
9output = Dense(1)(hidden)
10
11# Create model
12model = Model(inputs=input_layer, outputs=output)
Important Differences from Sequential:
1. Each layer must be given a name or variable
2. Layers are connected by calling them on previous layers
3. Model is created by specifying inputs and outputs
54.1.4 Code Examples
1. Simple Multi-Output Model
1fromkeras.layersimportInput, Dense
2fromkeras.modelsimportModel
3
4# Input layer
5x = Input(shape=(3,))
6
7# Shared layers
8hidden1 = Dense(128, activation=’relu’)(x)
9hidden2 = Dense(64, activation=’relu’)(hidden1)
10
11# Two output branches
12output1 = Dense(1, activation=’linear’, name=’age’)(hidden2)
13output2 = Dense(1, activation=’sigmoid’, name=’place’)(hidden2)
14
15# Create model with multiple outputs
16model = Model(inputs=x, outputs=[output1, output2])
17
18# Compile with multiple losses
19model.compile(
20optimizer=’adam’,
21loss={
22’age’: ’mse’,
23’place’: ’binary_crossentropy’
24}
25)
2. Multi-Input Model with Concatenation
1# Define two inputs
2inputA = Input(shape=(32,))
3inputB = Input(shape=(128,))
4
5# Branch 1
6x = Dense(8, activation="relu")(inputA)
7x1 = Dense(4, activation="relu")(x)
8
609Chapter 54. Keras Functional Model
9# Branch 2
10y = Dense(64, activation="relu")(inputB)
11y1 = Dense(32, activation="relu")(y)
12y2 = Dense(4, activation="relu")(y1)
13
14# Concatenate branches
15combined = concatenate([x1, y2])
16
17# Final layers
18z = Dense(2, activation="relu")(combined)
19z1 = Dense(1, activation="linear")(z)
20
21# Model with multiple inputs
22model = Model(inputs=[inputA, inputB], outputs=z1)
3. Practical Example: UTKFace Dataset
Dataset: Face images with age and gender labelsTask: Predict both age
and gender from face images
Data Preparation
1# Extract age and gender from filename
2for file inos.listdir(folder_path):
3age.append(int(file.split(’_’)[0]))
4gender.append(int(file.split(’_’)[1]))
5img_path.append(file)
6
7# Create DataFrame
8df = pd.DataFrame({’age’:age, ’gender’:gender, ’img’:img_path})
9
10# Split data
11train_df = df.sample(frac=1, random_state=0).iloc[:20000]
12test_df = df.sample(frac=1, random_state=0).iloc[20000:]
Data Augmentation
1train_datagen = ImageDataGenerator(
2rescale=1./255,
3rotation_range=30,
4width_shift_range=0.2,
5height_shift_range=0.2,
6shear_range=0.2,
7zoom_range=0.2,
8horizontal_flip=True
9)
10
11train_generator = train_datagen.flow_from_dataframe(
12train_df,
13directory=folder_path,
14x_col=’img’,
15y_col=[’age’,’gender’],# Multiple outputs
16target_size=(200,200),
17class_mode=’multi_output’
18)
61054.1. Functional API in Keras - Detailed Notes
Model Architecture with Transfer Learning
1fromkeras.applications.resnet50importResNet50
2
3# Load pre-trained ResNet50
4resnet = ResNet50(include_top=False, input_shape=(200,200,3))
5resnet.trainable = False
6
7# Get output from last layer
8output = resnet.layers[-1].output
9flatten = Flatten()(output)
10
11# Create branches for age and gender
12# Age branch
13dense1 = Dense(512, activation=’relu’)(flatten)
14dense3 = Dense(512, activation=’relu’)(dense1)
15output1 = Dense(1, activation=’linear’, name=’age’)(dense3)
16
17# Gender branch
18dense2 = Dense(512, activation=’relu’)(flatten)
19dense4 = Dense(512, activation=’relu’)(dense2)
20output2 = Dense(1, activation=’sigmoid’, name=’gender’)(dense4)
21
22# Create model
23model = Model(inputs=resnet.input, outputs=[output1, output2])
Compilation with Multiple Losses
1model.compile(
2optimizer=’adam’,
3loss={
4’age’: ’mae’,# Mean Absolute Error for regression
5’gender’: ’binary_crossentropy’# For binary classification
6},
7metrics={
8’age’: ’mae’,
9’gender’: ’accuracy’
10},
11loss_weights={
12’age’: 1,
13’gender’: 99# Higher weight for gender loss
14}
15)
54.1.5 Key Advantages of Functional API
1.Flexibility: Create any network topology
2.Multiple inputs/outputs: Handle complex data flows
3.Shared layers: Reuse layers in different branches
4.Model visualization: Easy to visualize withplot_model()
54.1.6 Visualization
1fromkeras.utilsimportplot_model
2plot_model(model, show_shapes=True)
611Chapter 54. Keras Functional Model
54.1.7 Best Practices
1.Naming layers: Give meaningful names to important layers
2.Variable naming: Use descriptive variable names for layer outputs
3.Loss weights: Adjust loss weights for multi-output models based
on task importance
4.Transfer learning: Combine pre-trained models with custom ar-
chitectures
54.1.8 Common Architectures with Functional API
1. Siamese Networks: Shared weights between branches
2. Multi-modal Networks: Different input types (text, image, tabular)
3. Residual Networks: Skip connections
4. Attention Mechanisms: Complex routing between layers
54.1.9 Resources
- Keras Functional API Documentation
- Machine Learning Mastery Blog Post
This guide shows how the Functional API enables building sophisticated neural network architectures that go beyond simple sequential models, making it essential for complex deep learning applications.
Part XII Recurrent Neural Networks
Chapter 55 Why RNNs are needed RNNs Vs ANNs RNN Part 1
55.1 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1
Figure 55.1: image
55.1.1 Neural Network Types Covered So Far
Neural Network Type | Primary Use Case | Data Type
Artificial Neural Networks (ANN) | General purpose | Tabular Data
Convolutional Neural Networks (CNN) | Image processing | Grid-like Data (Images, Videos)
Recurrent Neural Networks (RNN) | Sequential processing | Sequential Data
55.1.2 What are Recurrent Neural Networks?
Definition
RNN = A special type of sequential model specifically designed to work on sequential data
Key Characteristics
- Purpose: Process sequential information
- Memory: Maintains context from previous inputs
- Applications: NLP, time series, speech recognition
55.1.3 Understanding Sequential Data
Non-Sequential vs Sequential Data
Non-Sequential Data Example
Student Placement Prediction:
Input Features -> Neural Network -> Prediction
- Age: 22
- Marks: 85%      -> ANN -> Placement: Yes/No
- Gender: Male
Note: Order doesn't matter; the features can be rearranged without affecting the outcome.
Sequential Data Examples
Data Type | Example | Why Sequence Matters
Text | “Hey my name is Nitish” | Word order determines meaning
Time Series | Stock prices over years | Past values influence future trends
Audio | Speech waveforms | Temporal patterns create meaning
Biological | DNA sequences | Gene order affects function
Text Processing Example
"Hey   my    name  is    Nitish"
  |     |     |     |     |
 Word  Word  Word  Word  Word
  1     2     3     4     5
Sequential Processing:
- Read word by word
- Retain context from previous words
- Build understanding progressively
- Combine all information for final meaning
Time Series Example
Stock Price Progression:
2001 -> 2002 -> 2003 -> 2004 -> ...
$50     $55     $48     $62
Sequential Dependency:
- Current price influenced by historical trends
- Past performance affects future predictions
- Temporal relationships are crucial
55.1.4 Why RNNs are Essential
The Sequential Data Challenge
Traditional neural networks (ANN, CNN) cannot handle sequential dependencies because:
- Fixed Input Size: Cannot process variable-length sequences
- No Memory: Cannot retain information from previous inputs
- Order Ignorance: Treat all inputs as independent
RNN Advantages
- Memory Capability: Remembers previous inputs
- Sequential Processing: Processes one element at a time
- Variable Length: Handles sequences of different lengths
- Context Awareness: Maintains context throughout sequence
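These advantages all follow from one recurrence applied at every timestep. The notes do not state the formula at this point, so the sketch below assumes the standard vanilla RNN update h_t = tanh(W_x x_t + W_h h_{t-1} + b); the function name `rnn_forward`, the sizes, and the random weights are illustrative choices, not taken from the source.

```python
import numpy as np


def rnn_forward(inputs, W_x, W_h, b):
    """Run a vanilla RNN over one sequence and return the final hidden state.

    inputs has shape (timesteps, input_dim); timesteps may differ between calls.
    """
    h = np.zeros(W_h.shape[0])            # initial hidden state (the "memory")
    for x_t in inputs:                    # one element at a time, in order
        h = np.tanh(W_x @ x_t + W_h @ h + b)
    return h


rng = np.random.default_rng(0)
input_dim, hidden_dim = 12, 4             # illustrative sizes (assumed)
W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

short_seq = rng.normal(size=(2, input_dim))   # 2 timesteps
long_seq = rng.normal(size=(5, input_dim))    # 5 timesteps

h_short = rnn_forward(short_seq, W_x, W_h, b)
h_long = rnn_forward(long_seq, W_x, W_h, b)
print(h_short.shape, h_long.shape)        # same weights handle both lengths

# Feeding the same elements in reverse order gives a different final state.
h_reversed = rnn_forward(long_seq[::-1], W_x, W_h, b)
print(np.allclose(h_long, h_reversed))
```

Because the same weights are reused at each step, the number of parameters does not depend on sequence length, which is what lets one network accept inputs of 2 or 5 timesteps.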
55.1.5 Applications of RNNs
Natural Language Processing (NLP)
- Text Classification
- Language Translation
- Sentiment Analysis
- Text Generation
Time Series Analysis
- Stock Price Prediction
- Weather Forecasting
- Sales Forecasting
Speech & Audio
- Speech Recognition
- Music Generation
- Audio Classification
55.2 RNN Fundamentals - Why Use RNNs?
55.2.1 Core Question
Why do we need RNNs (Recurrent Neural Networks)? What specific problems exist that prevent us from using regular neural networks on sequential data?
55.2.2 The Sequential Data Challenge
Text Classification Example
Consider sentiment analysis:
- Input: Text sentences
- Output: Positive/Negative sentiment
Example Sentences | Expected Output
“Hi my name is Nitish” | Positive/Negative
“My name” | Positive/Negative
“Name is” | Positive/Negative
55.2.3 Problem 1: Text Representation
Challenge
Neural networks cannot understand text directly; we need a numerical representation.
Solution: One-Hot Encoding
Vocabulary Creation Process
1. Find the unique words in the entire dataset
2. Create a vector representation for each word
Example Implementation
Sample Text: “Hi my name is Nitish”
- Unique words: 12 total words in vocabulary
- Vector size: 12 dimensions per word
Word | One-Hot Vector
“Hi” | [1,0,0,0,0,0,0,0,0,0,0,0]
“my” | [0,1,0,0,0,0,0,0,0,0,0,0]
“name” | [0,0,1,0,0,0,0,0,0,0,0,0]
Vector Stacking
Input Matrix = [Hi_vector, my_vector, name_vector, is_vector, Nitish_vector]
Result: Vertically stacked vectors
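The two vocabulary steps and the vector stacking can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the vocabulary is built only from the three example sentences used in these notes, so it has 5 entries instead of the 12 in the table; the helper names `one_hot` and `encode` are hypothetical.

```python
sentences = ["Hi my name is Nitish", "My name is", "Name is"]

# 1. Find the unique words (lowercased, in order of first appearance).
vocab = []
for sentence in sentences:
    for word in sentence.lower().split():
        if word not in vocab:
            vocab.append(word)
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a list with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word.lower()]] = 1
    return vector


def encode(sentence):
    """Stack one one-hot vector per word: num_words rows of vocab_size columns."""
    return [one_hot(word) for word in sentence.split()]


matrix = encode("Hi my name is Nitish")
print(vocab)        # ['hi', 'my', 'name', 'is', 'nitish']
print(matrix[0])    # [1, 0, 0, 0, 0]
print(matrix[1])    # [0, 1, 0, 0, 0]
```

Each sentence becomes a matrix with one row per word, so sentences of different lengths produce matrices with different numbers of rows.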
55.2.4 Problem 2: Variable Input Sizes
The Core Issue
Sentence | Word Count | Input Size
“Hi my name is Nitish” | 5 words | 5 × 12 = 60
“My name is” | 3 words | 3 × 12 = 36
“Name is” | 2 words | 2 × 12 = 24
Problem: Neural networks require a fixed input size
Why This Breaks Neural Networks
Figure 55.2: image
55.2.5 Solution: Zero Padding
Implementation Strategy
1. Find the maximum sentence length in the dataset
2. Pad shorter sentences with zero vectors
Example Implementation
Step 1: Identify Maximum Length
- Longest sentence: “Hi my name is Nitish” (5 words)
- Padding target: 5 words for all sentences
Step 2: Apply Padding
Original: "My name is" (3 words)
Padded: "My name is [0] [0]" (5 words)

Where [0] = [0,0,0,0,0,0,0,0,0,0,0,0]
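A small sketch of the padding step, assuming each sentence is already a list of 12-dimensional one-hot vectors. The helpers `pad_to_length` and `fake_one_hot` are hypothetical, and the index used for "is" is an assumption made for illustration (the indices for "my" and "name" follow the one-hot table).

```python
VOCAB_SIZE = 12                      # vector size used in the example
ZERO_VECTOR = [0] * VOCAB_SIZE       # the [0] padding vector


def pad_to_length(vectors, max_len):
    """Append zero vectors until the sentence has max_len rows."""
    missing = max_len - len(vectors)
    if missing < 0:
        raise ValueError("sentence is longer than max_len")
    return vectors + [list(ZERO_VECTOR) for _ in range(missing)]


def fake_one_hot(index):
    """Hypothetical helper: one-hot vector with a 1 at the given index."""
    return [1 if i == index else 0 for i in range(VOCAB_SIZE)]


# "My name is" -> indices 1, 2, 3 (index 3 for "is" is assumed).
my_name_is = [fake_one_hot(i) for i in (1, 2, 3)]
padded = pad_to_length(my_name_is, max_len=5)

print(len(padded), len(padded[0]))   # 5 12 -> flattened input size 5 * 12 = 60
print(padded[3] == ZERO_VECTOR)      # True
```

After padding, every sentence flattens to the same 60 values, at the cost of feeding the network rows that carry no information.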
Content sourced from CampusX Deep Learning notes (PDF).
Common mistakes
- Not padding/masking variable-length forecast batches.
- Vanishing gradients in long window sequences.
- Using teacher forcing during training without a plan for inference.
Interview checkpoints
- Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- Time Series Forecasting for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
Text Generation RNN
Why this matters
Text generation samples the next character or token autoregressively: each sampled token is appended to the context before the next one is predicted.
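The autoregressive loop can be sketched without any trained network. In the toy example below, a hand-written lookup table (`NEXT_TOKEN_PROBS`, entirely hypothetical) stands in for a model's predicted distribution. It conditions only on the previous token, whereas an RNN would condition on the whole prefix through its hidden state.

```python
import random

# Hypothetical next-token table standing in for a trained model's predictions.
NEXT_TOKEN_PROBS = {
    "<start>": {"hi": 1.0},
    "hi": {"my": 1.0},
    "my": {"name": 1.0},
    "name": {"is": 1.0},
    "is": {"nitish": 0.5, "<end>": 0.5},
    "nitish": {"<end>": 1.0},
}


def sample_next(token, rng):
    """Sample the next token from the distribution conditioned on `token`."""
    candidates = NEXT_TOKEN_PROBS[token]
    return rng.choices(list(candidates), weights=list(candidates.values()), k=1)[0]


def generate(max_tokens=10, seed=0):
    """Generate tokens one at a time, feeding each output back in as input."""
    rng = random.Random(seed)
    tokens = ["<start>"]
    while len(tokens) <= max_tokens:
        next_token = sample_next(tokens[-1], rng)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]


print(" ".join(generate()))
```

Sampling (instead of always taking the most likely token) is what makes repeated runs with different seeds produce different continuations.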
62.1.12 4. Complete LSTM Cell Animation
Step-by-Step Workflow
Figure 62.6: image
Complete Mathematical Flow
All LSTM Equations:
1. $f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)$ (forget gate)
2. $i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)$ (input gate)
3. $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, X_t] + b_C)$ (candidate values)
4. $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (cell state update)
5. $o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)$ (output gate)
6. $h_t = o_t \odot \tanh(C_t)$ (hidden state)
Key Takeaways
Feature | Purpose | Benefit
Three Gates | Control information flow | Selective memory
Cell State Highway | Direct gradient path | Mitigates vanishing gradients
Pointwise Operations | Element-wise control | Fine-grained memory management
Dual Memory | Long & short term | Comprehensive context
Animation Summary
1. Forget Phase: Remove irrelevant past info
2. Input Phase: Add new relevant info
3. Output Phase: Select what to output now
Complexity Analysis
Operation | Time | Space | Purpose
Forget Gate | O(n^2) | O(n) | Memory filtering
Input Gate | O(n^2) | O(n) | Information addition
Output Gate | O(n^2) | O(n) | Output generation
Total | O(n^2) | O(n) | Per timestep
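The six equations translate almost line for line into NumPy. The sketch below is a minimal single-timestep implementation under assumed sizes and random weights; the names `lstm_step` and `params` are illustrative, and each weight matrix acts on the concatenation of the previous hidden state and the current input.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep following the six equations."""
    concat = np.concatenate([h_prev, x_t])                    # [h_{t-1}, X_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])     # 1. forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])     # 2. input gate
    c_hat = np.tanh(params["W_C"] @ concat + params["b_C"])   # 3. candidate values
    c_t = f_t * c_prev + i_t * c_hat                          # 4. cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])     # 5. output gate
    h_t = o_t * np.tanh(c_t)                                  # 6. hidden state
    return h_t, c_t


input_dim, hidden_dim = 3, 4          # illustrative sizes (assumed)
rng = np.random.default_rng(0)
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):   # run 5 timesteps
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)               # (4,) (4,)
```

The cell state update in step 4 is purely element-wise, which is why gradients can flow along the cell state without passing through a weight matrix at every timestep.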
Chapter 63 LSTM Part 3 Next Word Predictor Using CampusX
63.1 LSTM | Part 3 | Next Word Predictor Using | CampusX
63.1.1 1. Introduction
What is a Next Word Predictor?
A Next Word Predictor is an AI system that suggests the most likely word to follow a given sequence of words. It’s essentially a text generation model that predicts one word at a time.
Figure 63.1: image
Key Characteristics
Feature | Description | Impact
Sequential Processing | Analyzes word order and context | High accuracy
Pattern Recognition | Learns from large text corpora | Better predictions
Context Awareness | Uses previous words for prediction | Natural flow
Real-time Prediction | Instant suggestions | User-friendly
63.1.2 2. Real-World Applications
Industry Impact
Application | User Base | Time Saved | Adoption Rate
Mobile Keyboards | 3B+ users | 30% typing time | 85%
Email Composers | 1.5B users | 20% email time | 65%
Code Completion | 50M developers | 40% coding time | 75%
Chat Applications | 2B+ users | 25% messaging time | 70%
63.1.3 3. Implementation Strategy
Converting Text Generation to Supervised Learning
Data Transformation Process
Step 1: Sentence to Sequences
Original Sentence | Input Sequence | Target Word
“Hi my name is Nitish” | “Hi” | “my”
 | “Hi my” | “name”
 | “Hi my name” | “is”
 | “Hi my name is” | “Nitish”
Step 2: Word to Number Mapping
Figure 63.2: image
Step 3: Numerical Dataset
Input Sequence | Target
[1] | 2
[1, 2] | 3
[1, 2, 3] | 4
[1, 2, 3, 4] | 5
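The three steps can be reproduced in plain Python. This sketch assumes whitespace tokenization and assigns IDs in order of first appearance starting at 1; `make_training_pairs` is a hypothetical helper name.

```python
def make_training_pairs(sentence):
    """Return (input_words, target_word) pairs for every prefix of the sentence."""
    words = sentence.split()
    return [(words[:i], words[i]) for i in range(1, len(words))]


sentence = "Hi my name is Nitish"

# Step 1: sentence to (prefix, next word) pairs.
pairs = make_training_pairs(sentence)
for inputs, target in pairs:
    print(" ".join(inputs), "->", target)

# Step 2: map each word to an integer ID (1-based, order of first appearance).
word_to_id = {}
for word in sentence.split():
    word_to_id.setdefault(word, len(word_to_id) + 1)

# Step 3: numerical dataset.
numeric_pairs = [([word_to_id[w] for w in inputs], word_to_id[target])
                 for inputs, target in pairs]
print(numeric_pairs)
# [([1], 2), ([1, 2], 3), ([1, 2, 3], 4), ([1, 2, 3, 4], 5)]
```

A sentence of n words yields n - 1 training examples, so even a small corpus produces a usable number of input-target pairs.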
63.1.4 4. Data Preprocessing
Tokenization Pipeline
63.2 Key Steps in Preprocessing
Figure 63.3: image
Preprocessing Components
Component | Purpose | Output
Tokenizer | Convert text to tokens | Word indices
Vocabulary | Store unique words | Word-to-ID mapping
Sequence Generator | Create input sequences | Training pairs
Padding | Uniform sequence length | Fixed-size inputs
Example Code Structure
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Initialize tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Convert text to sequences
sequences = tokenizer.texts_to_sequences(sentences)
63.2.1 5. Model Architecture
LSTM Network Design
Figure 63.4: image
Layer Configuration
Layer Type | Parameters | Purpose
Embedding | vocab_size × 100 | Convert tokens to vectors
LSTM 1 | 150 units, return_sequences=True | Capture sequence patterns
LSTM 2 | 100 units | Extract high-level features
Dense | vocab_size units | Output probabilities
Softmax | - | Normalize probabilities
63.2.2 6. Code Implementation
Complete Implementation Workflow
Step 1: Data Preparation
- Load text data
- Split into sentences
- Create token mappings
Step 2: Sequence Creation
- Convert text to numbers
- Create input-output pairs
- Pad sequences to uniform length
Step 3: Model Construction
- Build LSTM architecture
- Configure hyperparameters
- Compile with optimizer
Step 4: Training Process
- Train on prepared data
- Monitor loss metrics
- Save best model
63.2.3 7. Training & Evaluation
Training Configuration
Hyperparameter | Value | Purpose
Batch Size | 64 | Training efficiency
Epochs | 100 | Model convergence
Learning Rate | 0.001 | Optimization speed
Dropout | 0.2 | Prevent overfitting
Optimizer | Adam | Adaptive learning
63.2.4 1. Dataset Overview
Dataset Statistics
Metric | Value | Description
Total Words | ~283 unique | Vocabulary size
Document Type | FAQ Text | Q&A format
Language | English | Technical content
Size | Small | Demo purposes
Implementation Steps
Step 1: Tokenization
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts([faqs])
# Creates word-to-index mapping
Step 2: Sequence Generation
input_sequences = []
for sentence in faqs.split('\n'):
    tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1, len(tokenized_sentence)):
        input_sequences.append(tokenized_sentence[:i+1])
Sequence Creation Example
Original Sentence | Input Sequences | Target
“What is the fee” | [What] | is
 | [What, is] | the
 | [What, is, the] | fee
Padding Configuration
max_len = max([len(x) for x in input_sequences])  # 56
padded_sequences = pad_sequences(input_sequences,
                                 maxlen=max_len,
                                 padding='pre')
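Each generated sequence ends with the word to be predicted, so pre-padding keeps that target in the last column of every row. The sketch below mimics `padding='pre'` in plain Python so it runs without TensorFlow; `pad_pre` is a hypothetical stand-in for `pad_sequences`, and the token IDs are assumed for illustration.

```python
def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros to length maxlen (like padding='pre')."""
    return [[0] * (maxlen - len(seq)) + list(seq) for seq in sequences]


# Token IDs for "What is the fee" (IDs assumed for illustration).
tokenized_sentence = [1, 2, 3, 4]
input_sequences = [tokenized_sentence[:i + 1]
                   for i in range(1, len(tokenized_sentence))]

max_len = max(len(seq) for seq in input_sequences)
padded = pad_pre(input_sequences, max_len)

# Each padded row ends with the word to predict: split it off as the label.
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(padded)  # [[0, 0, 1, 2], [0, 1, 2, 3], [1, 2, 3, 4]]
print(X)       # [[0, 0, 1], [0, 1, 2], [1, 2, 3]]
print(y)       # [2, 3, 4]
```

With post-padding the target would land in a different column for each row, so the simple last-column split would no longer work.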
63.2.5 3. Model Architecture Deep Dive
Complete Architecture Visualization
Figure 63.5: image
Layer-by-Layer Breakdown
1 Embedding Layer
Parameter | Value | Purpose
Input Dim | 283 | Vocabulary size
Output Dim | 100 | Dense vector size
Input Length | 56 | Max sequence length
Parameters | 28,300 | 283 × 100
2 LSTM Layer
Feature | Configuration | Calculation
Units | 150 | Hidden state dimension
Time Steps | 56 | Sequential processing
Input per Step | 100 | From embedding
Output | 150 | Final hidden state
3 Dense Output Layer
Component | Value | Function
Units | 283 | One per word
Activation | Softmax | Probability distribution
Parameters | 42,733 | 150 × 283 + 283
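The parameter counts can be checked with a few lines of arithmetic. The LSTM formula below is not given in the notes; it assumes a standard LSTM layer with four weight sets (forget, input, candidate, output), each acting on the concatenated hidden state and input plus a bias. Note that 150 × 283 + 283 evaluates to 42,733.

```python
def embedding_params(vocab_size, embed_dim):
    """One embed_dim-sized vector per vocabulary entry."""
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    """Four gates, each with weights over [h_{t-1}, x_t] and a bias vector."""
    return 4 * (units * (units + input_dim) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


vocab_size, embed_dim, lstm_units = 283, 100, 150
print(embedding_params(vocab_size, embed_dim))  # 28300
print(lstm_params(embed_dim, lstm_units))       # 150600
print(dense_params(lstm_units, vocab_size))     # 42733
```

Most of the parameters sit in the LSTM layer, and none of the counts depend on the 56 time steps because the same weights are reused at every step.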
63.2.6 4. Implementation Code
Complete Model Building
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Build model
model = Sequential()
model.add(Embedding(283, 100, input_length=56))
model.add(LSTM(150))
model.add(Dense(283, activation='softmax'))

# Compile (the source listing is cut off after the loss argument;
# the Adam optimizer is assumed from the Training Configuration table)
model.compile(loss='categorical_crossentropy',
              optimizer='adam')
Common mistakes
- Not padding/masking variable-length generation batches.
- Vanishing gradients in long sequences.
- Using teacher forcing during training without a plan for inference.
Interview checkpoints
- Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
- Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.
Practice
- Basic: Sketch unrolled RNN for 3 timesteps.
- Intermediate: LSTM layer on padded sequences.
- Advanced: Forecast univariate series with windowed LSTM.
Recap
- Text Generation RNN for sequence modeling.
- Watch shape (batch, time, features).
- Transformers often replace RNNs now.
