Module 7 · 100 Days of DL

Module 7: Recurrent Neural Networks, LSTMs & GRUs

Process sequential text step by step: design the recurrence in RNN hidden states, see how gradients vanish in BPTT, and trace the gating logic in LSTMs, GRUs, and bidirectional models.

⏱ 35 Min Read Author: GenAIWallah Team Updated: May 2026
Day 66

Sequential Data

Contents

53.3.8 Key Implementation Details
53.3.9 Fine-Tuning Strategy
53.3.10 Expected Performance

XI Advanced Keras
54 Keras Functional Model
54.1 Functional API in Keras - Detailed Notes
54.1.1 Introduction
54.1.2 Why Functional API?
54.1.3 Basic Functional API Syntax
54.1.4 Code Examples
54.1.5 Key Advantages of Functional API
54.1.6 Visualization
54.1.7 Best Practices
54.1.8 Common Architectures with Functional API
54.1.9 Resources

XII Recurrent Neural Networks
55 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1
55.1 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1
55.1.1 Neural Network Types Covered So Far
55.1.2 What are Recurrent Neural Networks?
55.1.3 Understanding Sequential Data
55.1.4 Why RNNs are Essential
55.1.5 Applications of RNNs
55.2 RNN Fundamentals - Why Use RNNs?
55.2.1 Core Question
55.2.2 The Sequential Data Challenge
55.2.3 Problem 1: Text Representation
55.2.4 Problem 2: Variable Input Sizes
55.2.5 Solution: Zero Padding
55.2.6 Problems with Zero Padding
55.2.7 Why Traditional Neural Networks Fail
55.3 RNN Applications & Learning Roadmap
55.3.1 Core Problems Summary
55.3.2 Real-World RNN Applications
55.3.3 Additional RNN Applications
55.3.4 RNN Learning Roadmap
56 Recurrent Neural Network | Forward Propagation | Architecture
56.1 Recurrent Neural Network | Forward Propagation | Architecture
56.1.1 Introduction
56.1.2 Why RNNs?
56.1.3 Data Format for RNNs
56.1.4 RNN Architecture

Why this matters

Sequential data needs models that respect order — time, text, audio.

56.2.4 Data Flow Visualization

Input processing: the input has shape (batch_size, 4, 5), that is, 4 timesteps with 5 features each, and the RNN processes the timesteps sequentially.

Figure 56.4: RNN processing steps (Mermaid diagram)

56.3 RNN Forward Propagation: Complete Technical Guide


Unlike feedforward networks, Recurrent Neural Networks process sequential inputs by maintaining a hidden state vector $h_t$ that carries information from earlier inputs across time steps: $$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$ In principle an RNN can track dependencies of any length. In practice, backpropagation through time (BPTT) multiplies the gradient by one Jacobian per time step, so gradients tend to vanish over long sequences and long-term dependencies are hard to learn.
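The recurrence above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the zero initial state, the hidden size of 3, and the random weights are assumptions chosen for the demo, and the input uses the (batch, 4, 5) shape from the data flow example.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b_h):
    """Run a vanilla RNN over x of shape (batch, time, features)."""
    batch, time, _ = x.shape
    hidden = W_hh.shape[0]
    h = np.zeros((batch, hidden))            # h_0 = 0 (assumed initial state)
    states = []
    for t in range(time):
        # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h), applied row-wise to the batch
        h = np.tanh(h @ W_hh.T + x[:, t, :] @ W_xh.T + b_h)
        states.append(h)
    return np.stack(states, axis=1)          # (batch, time, hidden)

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 5))               # 2 sequences, 4 timesteps, 5 features
W_xh = rng.normal(scale=0.1, size=(3, 5))    # input-to-hidden weights
W_hh = rng.normal(scale=0.1, size=(3, 3))    # hidden-to-hidden weights
b_h = np.zeros(3)

states = rnn_forward(x, W_xh, W_hh, b_h)
print(states.shape)  # (2, 4, 3)
```

The same weights are reused at every timestep; only the hidden state changes as the loop advances.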

Common mistakes

  • Not padding or masking batches of variable-length sequences.
  • Ignoring vanishing gradients on long sequences.
  • Using teacher forcing during training without a plan for inference, when ground-truth targets are not available.
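The first mistake is easy to see on a small batch. The sketch below pads token-id sequences to a common length and builds a boolean mask marking the real timesteps. The helper name `pad_sequences`, the use of 0 as the padding id, and padding at the end of each sequence are assumptions made for this example.

```python
import numpy as np

def pad_sequences(seqs, pad_value=0):
    """Right-pad integer sequences to equal length and return (padded, mask)."""
    max_len = max(len(s) for s in seqs)
    padded = np.full((len(seqs), max_len), pad_value, dtype=np.int64)
    mask = np.zeros((len(seqs), max_len), dtype=bool)
    for i, s in enumerate(seqs):
        padded[i, :len(s)] = s       # copy the real tokens
        mask[i, :len(s)] = True      # True marks real (non-padding) timesteps
    return padded, mask

seqs = [[4, 7, 2], [9], [3, 5]]      # hypothetical token ids, 0 reserved for padding
padded, mask = pad_sequences(seqs)
lengths = mask.sum(axis=1)           # true length of each sequence

print(padded)
print(lengths)
```

A model that ignores the mask treats the padded zeros as real inputs, which is why the padding id must not collide with a real token.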

Interview checkpoints

  • Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
  • Q: LSTM vs vanilla RNN? A: Gated memory mitigates vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.
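As a starting point for the windowed forecasting exercise, one possible way to cut a univariate series into (batch, time, features) windows is sketched below. The function name `make_windows` and the one-step-ahead target are assumptions for this example; the exercise itself does not prescribe them.

```python
import numpy as np

def make_windows(series, window):
    """Turn a 1-D series into inputs of shape (n, window, 1) and next-step targets."""
    series = np.asarray(series, dtype=float)
    n = len(series) - window
    X = np.stack([series[i:i + window] for i in range(n)])
    y = series[window:]              # the value right after each window
    return X[..., None], y           # add a trailing feature axis of size 1

X, y = make_windows(np.arange(10), window=3)
print(X.shape, y.shape)  # (7, 3, 1) (7,)
```

Each row of `X` can be fed to a recurrent layer, with the matching entry of `y` as the regression target.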

Recap

  • Sequential Data for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 67 — Vanilla RNN

Day 67

Vanilla RNN

3.2. Neural Network Architectures: A Visual Guide

Component | Purpose | Description
Encoder | Compression | Reduces input to latent space
Latent Space | Representation | Compact encoding of data
Decoder | Reconstruction | Rebuilds input from latent space

Variants

Vanilla Autoencoder
  • Basic: Simple encoder-decoder structure
  • Undercomplete: Latent dimension smaller than input
  • Purpose: Dimensionality reduction, feature learning

Variational Autoencoder (VAE)
  • Probabilistic: Encodes to distribution, not point
  • Generative: Can sample from latent space
  • Structure: Adds KL divergence to loss function
  • Formula: Loss = Reconstruction Error + KL Divergence

Denoising Autoencoder
  • Corruption: Input deliberately noised
  • Cleaning: Must reconstruct clean output
  • Robust: Learns noise-invariant features

Sparse Autoencoder
  • Regularization: Penalizes active neurons
  • Sparse: Only small subset of neurons active
  • Goal: Learn more efficient representations

Applications
  • Dimensionality reduction
  • Feature learning
  • Anomaly detection
  • Image denoising
  • Data compression

Key Properties
  • Unsupervised: No labels needed
  • Self-supervised: Creates own supervision signal
  • Data-specific: Works best on similar data distribution

Why this matters

Vanilla RNN maintains hidden state across timesteps.

33.3.10 Visualization Tools

Interactive Demos
  1. 3D Loss Surface Visualization
     • Shows ball rolling on loss landscape
     • Compares vanilla GD vs momentum
  2. Contour Plot Animation
     • 2D view of optimization path
     • Clear view of oscillation damping
  3. Parameter Space Navigation
     • Click anywhere to start optimization
     • Compare different algorithms side-by-side

Key Observations from Visualizations
  • Blue path: Vanilla gradient descent (slow, direct)
  • Purple path: Momentum (fast, may overshoot)
  • Local minima: Momentum escapes, vanilla GD gets stuck
  • Oscillations: Gradually dampen with momentum


To mitigate vanishing gradients, Hochreiter & Schmidhuber proposed LSTMs in 1997. LSTMs control information flow using a **cell state** $C_t$ and three gates:

  • Forget Gate: $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ controls what is erased from the cell state.
  • Input Gate: $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ and $\tilde{C}_t = \tanh(W_c \cdot [h_{t-1}, x_t] + b_c)$ control what gets written.
  • Cell Update: $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ combines the retained history with the new candidate.
  • Output Gate: $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ and $h_t = o_t * \tanh(C_t)$ yield the new hidden state.
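
Figure: LSTM cell architecture and gating mechanics. The cell state $C_{t-1}$ and hidden state $h_{t-1}$ enter the cell; the forget gate $f_t$, input gate $i_t$, and output gate $o_t$ act through an elementwise product (×) and a sum (+) to produce the cell state $C_t$ and hidden state $h_t$.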
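To make the gating concrete, here is a minimal NumPy sketch of one LSTM step using the gate equations and the standard cell update $C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$. Stacking the four weight matrices into a single matrix `W`, and the sizes and random initialization, are assumptions made for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_prev, x_t] to pre-activations stacked as f, i, c~, o."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])                  # forget gate
    i = sigmoid(z[n:2 * n])              # input gate
    c_tilde = np.tanh(z[2 * n:3 * n])    # candidate cell values
    o = sigmoid(z[3 * n:4 * n])          # output gate
    c = f * c_prev + i * c_tilde         # cell update: keep some history, write some new
    h = o * np.tanh(c)                   # new hidden state
    return h, c

rng = np.random.default_rng(0)
d, hidden = 5, 3                         # assumed input and hidden sizes
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + d))
b = np.zeros(4 * hidden)

h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(4, d)):      # 4 timesteps
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)  # (3,) (3,)
```

Because the cell state is updated by an elementwise product and a sum rather than a full matrix multiply followed by tanh, gradients along $C_t$ can survive more timesteps when the forget gate stays close to 1.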

Common mistakes

  • Not padding or masking batches of variable-length sequences.
  • Ignoring vanishing gradients when hidden states are carried across long sequences.
  • Using teacher forcing during training without a plan for inference, when ground-truth targets are not available.

Interview checkpoints

  • Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
  • Q: LSTM vs vanilla RNN? A: Gated memory mitigates vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • Vanilla RNN for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 68 — Hidden States

Day 68

Hidden States

3.2. Neural Network Architectures: A Visual Guide

Component | Purpose | Description
Input Layer | Receives sequence data | One element at a time
Hidden State | Maintains memory | Updated with each input
Output Layer | Generates predictions | Can output at each step
Recurrent Connection | Enables memory | Connects hidden state to itself

Core Mechanism

The fundamental RNN computation follows this formula: $$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$$

Where:
  • $x_t$: Input at time step $t$
  • $h_t$: Hidden state at time step $t$
  • $h_{t-1}$: Previous hidden state
  • $W_x$: Input weight matrix
  • $W_h$: Hidden state weight matrix
  • $b$: Bias vector
  • $\tanh$: Hyperbolic tangent activation function

This equation shows how RNNs combine the current input with previous memory to produce new hidden states, enabling temporal pattern recognition.

RNN Variants

LSTM (Long Short-Term Memory)
  • Gates: Input, forget, output
  • Cell state: Long-term memory storage
  • Protection: Mitigates vanishing gradients (exploding gradients are usually handled separately, for example with gradient clipping)
  • Performance: Better at capturing long-range dependencies

GRU (Gated Recurrent Unit)
  • Gates: Reset, update
  • Simplified: Fewer parameters than LSTM
  • Efficiency: Faster training, similar performance

Bidirectional RNNs
  • Two directions: Forward and backward
  • Context: Captures both past and future information
  • Enhanced: Better performance for many applications

Applications
  • Natural language processing
  • Speech recognition
  • Time series prediction

Why this matters

Hidden state summarizes past — bottleneck for long sequences.

65.1.12 Key Takeaways

Remember: GRU is a simplified, more efficient alternative to LSTM that often performs comparably well while being faster to train and requiring fewer parameters.

Core Benefits of GRU

Benefit | Impact
Simplicity | Easier to understand and implement
Efficiency | Faster training and inference
Effectiveness | Good performance on many tasks
Flexibility | Good starting point for sequence modeling

Chapter 66: Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU

66.1 Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU

66.2 Bidirectional RNN - Comprehensive Notes

66.2.1 Overview

Bidirectional Recurrent Neural Networks (BiRNNs) are an advanced architecture that processes sequences in both forward and backward directions, capturing context from both past and future inputs.

Figure 66.1: Learning path progress (Mermaid diagram)

66.2.2 Why Bidirectional RNNs?

The Limitation of Unidirectional RNNs

In traditional RNNs, information flows in one direction (left to right):

x_1 -> [RNN] -> x_2 -> [RNN] -> x_3 -> [RNN] -> Output

Problem: The output at time $t$ depends only on the inputs seen so far ($x_1, x_2, \ldots, x_t$).

The Need for Future Context

Some tasks need later inputs to influence earlier outputs.

Example: Named Entity Recognition (NER). Consider these sentences:
  1. “I love Amazon, it’s a great website”: Amazon → Organization (ORG)
  2. “I love Amazon, it’s a beautiful river”: Amazon → Location (LOC)

Key Insight: We can’t determine if “Amazon” is ORG or LOC until we read the future context!

66.2.3 Bidirectional RNN Architecture

Core Concept

BiRNN uses two separate RNNs:
  • Forward RNN (→): Processes the sequence left to right
  • Backward RNN (←): Processes the sequence right to left

Visual Architecture

Forward:   x_1 -> x_2 -> x_3 -> ...   produces forward states  h→_1, h→_2, h→_3, ...
Backward:  ... -> x_3 -> x_2 -> x_1   produces backward states ..., h←_3, h←_2, h←_1
Output:    y_1 = sigma(V[h→_1; h←_1] + b)

Mathematical Formulation

Component | Equation
Forward Hidden State | $\overrightarrow{h}_t = \tanh(W_{hf} \overrightarrow{h}_{t-1} + W_{xf} x_t + b_f)$
Backward Hidden State | $\overleftarrow{h}_t = \tanh(W_{hb} \overleftarrow{h}_{t+1} + W_{xb} x_t + b_b)$
Output | $y_t = \sigma(V[\overrightarrow{h}_t; \overleftarrow{h}_t] + b)$

Where:
  • $[\overrightarrow{h}_t; \overleftarrow{h}_t]$ represents concatenation
  • $\sigma$ is the sigmoid activation function
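The three equations translate almost directly into NumPy. The sketch below runs one pass left to right and one right to left, then concatenates the two states at each position before the sigmoid output. The sequence length, layer sizes, random weights, and zero initial states are assumptions for this illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def run_rnn(x, W_x, W_h, b, reverse=False):
    """Return hidden states of shape (T, hidden), indexed by input position."""
    T, hidden = x.shape[0], W_h.shape[0]
    h = np.zeros(hidden)
    out = np.zeros((T, hidden))
    steps = range(T - 1, -1, -1) if reverse else range(T)
    for t in steps:
        h = np.tanh(W_h @ h + W_x @ x[t] + b)
        out[t] = h                                 # store at position t in both directions
    return out

rng = np.random.default_rng(0)
T, d, hidden = 3, 4, 2
x = rng.normal(size=(T, d))
W_xf, W_hf, b_f = rng.normal(scale=0.5, size=(hidden, d)), rng.normal(scale=0.5, size=(hidden, hidden)), np.zeros(hidden)
W_xb, W_hb, b_b = rng.normal(scale=0.5, size=(hidden, d)), rng.normal(scale=0.5, size=(hidden, hidden)), np.zeros(hidden)
V, b = rng.normal(scale=0.5, size=(1, 2 * hidden)), np.zeros(1)

h_fwd = run_rnn(x, W_xf, W_hf, b_f)                # depends on x_1..x_t
h_bwd = run_rnn(x, W_xb, W_hb, b_b, reverse=True)  # depends on x_t..x_T
h_cat = np.concatenate([h_fwd, h_bwd], axis=1)     # [h_fwd_t ; h_bwd_t]
y = sigmoid(h_cat @ V.T + b)                       # one output per timestep
print(h_cat.shape, y.shape)  # (3, 4) (3, 1)
```

Changing the last input alters the backward state at the first position but leaves the forward state there untouched, which is exactly the future context a unidirectional RNN lacks.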

66.2.4 Implementation in Keras

Basic BiRNN Implementation

Python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Bidirectional, SimpleRNN, LSTM, GRU

# Each model below is an independent alternative; the wrapped layer has 5 units per direction.

# Simple BiRNN
birnn = Sequential([Bidirectional(SimpleRNN(5))])

# BiLSTM (Most Common)
bilstm = Sequential([Bidirectional(LSTM(5))])

# BiGRU
bigru = Sequential([Bidirectional(GRU(5))])
Parameter Comparison

Architecture | Parameters | Multiplier
SimpleRNN | 190 | 1x
Bidirectional(SimpleRNN) | 380 | 2x
LSTM | Higher | 1x
Bidirectional(LSTM) | 2x Higher | 2x

Note: The Bidirectional wrapper doubles the parameter count because it uses two RNNs.
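The counts in the table can be checked by hand. A SimpleRNN with input dimension d and h units has h × (d + h + 1) parameters: input weights, recurrent weights, and biases. The notes do not state the input dimension; d = 32 is an assumption here, chosen because it reproduces the 190 shown for 5 units.

```python
def simple_rnn_params(d, h):
    """Parameters of a SimpleRNN: input weights (d*h) + recurrent weights (h*h) + biases (h)."""
    return h * (d + h + 1)

d, h = 32, 5                      # d = 32 is assumed; h = 5 matches SimpleRNN(5)
uni = simple_rnn_params(d, h)
bi = 2 * uni                      # the wrapper holds one forward and one backward copy
print(uni, bi)  # 190 380
```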
66.2.5 Applications

Primary Use Cases

Application | Description | Why BiRNN?
Named Entity Recognition (NER) | Identify entities in text | Future context helps disambiguate
Part-of-Speech Tagging | Assign grammatical tags | Context from both directions
Machine Translation | Translate between languages | Better context understanding
Sentiment Analysis | Determine text sentiment | Captures full sentence context
Time Series Forecasting | Predict future values | Patterns from both directions

Figure 66.2: Success areas (Mermaid diagram)

66.2.6 Advantages & Drawbacks

Advantages
  • Complete Context: Access to both past and future information
  • Better Performance: Often outperforms unidirectional RNNs
  • Improved Accuracy: Especially for sequence labeling tasks

Drawbacks

Issue | Description | Impact
Computational Complexity | 2x parameters and computation | Higher training time
Overfitting Risk | More parameters = more complexity | Need more regularization
Latency Issues | Need complete sequence before processing | Not suitable for real-time
Memory Requirements | Stores both forward and backward states | Higher memory usage

Figure 66.3: Real-time limitations (Mermaid diagram)

66.2.7 Best Practices

When to Use BiRNN

Use when:
  • Complete sequence is available
  • Context from both directions is valuable
  • Accuracy is more important than speed
  • Working with NLP tasks like NER, POS tagging

Avoid when:
  • Real-time processing is required
  • Working with streaming data
  • Memory/computational resources are limited
  • Simple patterns suffice

Implementation Tips
  1. Start Simple: Try unidirectional first, then compare with bidirectional
  2. Regularization: Use dropout to combat overfitting
  3. Architecture Choice: BiLSTM is most commonly used
  4. Batch Processing: Process multiple sequences together for efficiency

66.2.8 Summary

Bidirectional RNNs are powerful architectures that leverage both past and future context to make better predictions. While they come with increased computational costs and aren’t suitable for real-time applications, they excel in many NLP tasks where complete context improves performance significantly.

Key Takeaways
  • Dual Processing: Forward + Backward RNNs
  • Better Context: Captures information from entire sequence
  • Easy Implementation: Simple wrapper in modern frameworks
  • Trade-offs: Better accuracy vs. higher complexity
  • Best for: NLP tasks with complete sequences available

Part XIII: History of Large Language Models

Chapter 67: The Epic History of Large Language Models (LLMs) | From LSTMs to ChatGPT | CampusX

67.1 The Epic History of Large Language Models (LLMs) | From LSTMs to ChatGPT | CampusX

Figure 67.1: image

67.2 Sequence Tasks and Types: Comprehensive Guide

67.2.1 Sequence Processing Architecture

Figure 67.2: image

67.2.2 RNN Input-Output Patterns

Pattern Type | Input | Output | Examples
Many-to-One | Sequence | Scalar (1 or 0) | Sentiment analysis, Classification
One-to-Many | Scalar/Image | Sequence | Image captioning, Description
Many-to-Many (Async) | Sequence | Sequence | Translation, Summarization
Many-to-Many (Sync) | Sequence | Sequence | POS Tagging, NER

67.2.3 Key Applications of Sequence Models

Text Processing:
  • Sentiment analysis (positive/negative)
  • Text generation & summarization
  • Machine translation (Google Translate)

Vision & Language:
  • Image captioning (image → description)
  • Visual question answering

Time Series:
  • Financial forecasting
  • Weather prediction
  • Anomaly detection

Bioinformatics:
  • Protein sequence analysis
  • DNA sequence classification

The GRU is a simplified variant of LSTM that merges the cell state and hidden state, and uses only two gates: a **Reset Gate** (controls how to combine new input with past memory) and an **Update Gate** (acts as both forget and input gate).
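A single GRU step can be sketched as follows. The paragraph above does not give equations, so the formulation here is an assumption: it uses a common one in which the update gate $z_t$ blends the old state and the candidate as $h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t$ (some references swap the roles of $z_t$ and $1 - z_t$). The parameter names, sizes, and random weights are also illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step; p holds W_z, W_r, W_h acting on [h, x] and biases b_z, b_r, b_h."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(p["W_z"] @ hx + p["b_z"])          # update gate: how much to overwrite
    r = sigmoid(p["W_r"] @ hx + p["b_r"])          # reset gate: how much past to use
    rhx = np.concatenate([r * h_prev, x_t])
    h_tilde = np.tanh(p["W_h"] @ rhx + p["b_h"])   # candidate state
    return (1.0 - z) * h_prev + z * h_tilde        # single state, no separate cell

rng = np.random.default_rng(0)
d, hidden = 4, 3                                    # assumed sizes
params = {}
for gate in ("z", "r", "h"):
    params["W_" + gate] = rng.normal(scale=0.3, size=(hidden, hidden + d))
    params["b_" + gate] = np.zeros(hidden)

h = np.zeros(hidden)
for x_t in rng.normal(size=(5, d)):                 # 5 timesteps
    h = gru_step(x_t, h, params)
print(h.shape)  # (3,)
```

When the update gate is near 0 the state is copied forward almost unchanged, which plays the role of the LSTM's forget and input gates at once.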

Common mistakes

  • Not padding or masking batches of variable-length sequences.
  • Ignoring vanishing gradients when a single hidden state is the bottleneck for a long sequence.
  • Using teacher forcing during training without a plan for inference, when ground-truth targets are not available.

Interview checkpoints

  • Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
  • Q: LSTM vs vanilla RNN? A: Gated memory mitigates vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • Hidden States for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 69 — BPTT

Day 69

BPTT

Contents

59.1.2 Introduction to RNN Backpropagation
59.1.3 RNN Architecture Review
59.1.4 Forward Propagation
59.1.5 Backpropagation Through Time (BPTT)
59.1.6 Gradient Calculations
59.1.7 Implementation Details
59.1.8 Summary
60 Problems with RNN | 100 Days of Deep Learning
60.1 Problems with RNN | 100 Days of Deep Learning
60.1.1 RNN Fundamentals Recap
60.1.2 Major Problems with RNNs
60.1.3 Why this happens:
60.1.4 Real-world Example:
60.1.5 Technical Deep Dive
60.2 RNN Mathematical Analysis: Long-Term Dependency & Gradient Problems
60.2.1 Course Context & Prerequisites
60.2.2 Problem #1: Long-Term Dependency - Mathematical Analysis
60.2.3 Gradient Calculation & Chain Rule Application
60.2.4 Complete Mathematical Derivation
60.2.5 Vanishing Gradient Problem - Mathematical Proof
60.2.6 Solutions to Vanishing Gradients
60.2.7 Problem #2: Exploding Gradients
60.2.8 Solutions to Exploding Gradients
60.2.9 Summary & Mathematical Insights
61 LSTM | Long Short Term Memory | Part 1 | The What? | CampusX
61.1 LSTM | Long Short Term Memory | Part 1 | The What? | CampusX
61.1.1 Recap: From ANN to RNN
61.1.2 The Critical Problem: Long Sequences
61.1.3 LSTM: The Solution
61.2 LSTM Core Concepts & Architecture - Deep Dive Notes
61.2.1 Learning Objective
61.2.2 The Story-Based Learning Approach
61.2.3 Human Brain Memory Processing Model
61.2.4 The RNN Memory Problem
61.2.5 LSTM Solution: Dual Memory Architecture
61.2.6 The Three Gates: LSTM’s Control System
61.2.7 Pronoun Resolution Example
61.2.8 LSTM as a Computer System

Why this matters

BPTT backpropagates through time — expensive and unstable.
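The instability can be observed numerically. In a tanh RNN, the Jacobian of one step is $\partial h_t / \partial h_{t-1} = \mathrm{diag}(1 - h_t^2)\, W_{hh}$, and BPTT multiplies one such Jacobian per timestep. The sketch below tracks the norm of that product; the sizes, random inputs, and the choice to rescale $W_{hh}$ to spectral norm 0.9 are assumptions picked to show the vanishing case.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, features, steps = 8, 4, 50

W_hh = rng.normal(size=(hidden, hidden))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)        # force largest singular value to 0.9 (assumed)
W_xh = rng.normal(scale=0.5, size=(hidden, features))

h = np.zeros(hidden)
J = np.eye(hidden)                            # accumulates d h_t / d h_0
norms = []
for t in range(steps):
    x_t = rng.normal(size=features)
    h = np.tanh(W_hh @ h + W_xh @ x_t)
    J = np.diag(1.0 - h ** 2) @ W_hh @ J      # chain rule through one more timestep
    norms.append(np.linalg.norm(J, 2))

print(f"after 1 step: {norms[0]:.3e}, after {steps} steps: {norms[-1]:.3e}")
```

Since the tanh derivative is at most 1, each factor here has norm at most 0.9, so the product shrinks at least geometrically. With a recurrent matrix whose largest singular value is well above 1 the same product can grow instead, which is the exploding-gradient case.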

57.0.11 Summary

This implementation demonstrates the complete pipeline for text-based sentiment analysis using RNNs in Keras, covering:

  1. Text Preprocessing: Tokenization and sequence conversion
  2. Data Preparation: Padding and normalization
  3. Model Building: Both simple and embedding-based approaches
  4. Training: Compilation and execution strategies

The embedding approach consistently outperforms simple integer encoding due to its ability to capture semantic relationships and provide dense representations of textual data.

Chapter 58: Types of RNN | Many to Many | One to Many | Many to One RNNs

58.1 Types of RNN | Many to Many | One to Many | Many to One RNNs

58.1.1 Overview

This comprehensive guide covers the four main types of Recurrent Neural Network (RNN) architectures, their applications, and implementation patterns based on input-output sequence relationships.

58.1.2 Video Content Summary

58.1.3 Four Main RNN Architecture Types

Architecture Classification Matrix

Input Type | Output Type | Architecture | Applications
Sequence | Single | Many to One | Sentiment Analysis, Rating Prediction
Single | Sequence | One to Many | Image Captioning, Music Generation
Sequence | Sequence | Many to Many | Translation, NER, POS Tagging
Single | Single | One to One | Image Classification

58.1.4 1. Many to One Architecture

Core Concept
  • Input: Sequential data (sentences, time series)
  • Output: Single value/classification

Architecture Flow

Figure 58.1: image

Key Applications

1. Sentiment Analysis
  • Input: “This movie is amazing!” (sequence of words)
  • Output: 1 (Positive) or 0 (Negative)
  • Process: Analyze entire sentence context → Single sentiment score

2. Rating Prediction
  • Input: Product review text
  • Output: Star rating (1-5)
  • Use Case: Movie reviews → Predicted user rating

Architecture Details
  • Hidden States: Each time step maintains hidden state
  • Final Output: Only from last time step
  • Information Flow: Sequential processing with memory

58.1.5 2. One to Many Architecture

Core Concept
  • Input: Single non-sequential data (image, number)
  • Output: Sequential data (text, music notes)

Architecture Flow

Figure 58.2: image

Key Applications

1. Image Captioning
  • Input: Image of person playing cricket
  • Output: “A man is playing cricket”
  • Process: CNN extracts features → RNN generates sequential text

2. Music Generation
  • Input: Musical seed/style parameter
  • Output: Sequence of musical notes
  • Process: Generate continuous musical composition

Technical Implementation
  • Initial Input: Provided once at start
  • Subsequent Steps: Previous output becomes next input
  • Generation: Continues until stop condition

58.1.6 3. Many to Many Architecture

Core Concept
  • Input: Sequential data
  • Output: Sequential data
  • Also Known As: Sequence-to-Sequence (Seq2Seq) models

Two Subtypes

3A. Same Length Many-to-Many

Characteristic: Input sequence length = Output sequence length

Applications

Part-of-Speech (POS) Tagging

Input Word | Output Tag
“The” | Article
“quick” | Adjective
“brown” | Adjective
“fox” | Noun

Named Entity Recognition (NER)

Input | Output
“Let’s meet at 7:00 PM at the airport” | [O, O, O, TIME, TIME, O, O, LOCATION]

Architecture Flow

Figure 58.3: image

3B. Variable Length Many-to-Many

Characteristic: Input length ≠ Output length

Primary Application: Machine Translation

Example Translation

Language | Sentence | Word Count
English | “My name is Nitish” | 4 words
Hindi | “mera naam Nitish hai” | 4 words

Note: Different languages may use different word counts for the same meaning, so the output length cannot be assumed to match the input length.

Why Encoder-Decoder?

Translation Logic: Complete sentence understanding is required before translation.
  • Word-by-word translation loses context
  • Full sentence comprehension preserves meaning, grammar, and context

Figure 58.4: image

58.1.7 4. One to One Architecture

Core Concept
  • Input: Non-sequential data
  • Output: Non-sequential data
  • Note: Technically not an RNN; this is a regular neural network

Architecture Flow

Figure 58.5: image

Key Applications

Image Classification
  • Input: Image data
  • Output: Class label (Cat/Dog, 0/1)
  • Networks: CNN, ANN (not RNN)

Technical Note
  • No Recurrence: No feedback loops or time steps
  • No Memory: No hidden state preservation
  • Standard Networks: ANN, CNN architectures

58.1.8 Summary Table

Architecture | Input | Output | Memory | Applications
Many to One | Sequence | Single | Yes | Sentiment Analysis, Classification
One to Many | Single | Sequence | Yes | Image Captioning, Generation
Many to Many (Same) | Sequence | Sequence | Yes | POS Tagging, NER
Many to Many (Variable) | Sequence | Sequence | Yes | Machine Translation
One to One | Single | Single | No | Image Classification

Key Takeaways
  1. Architecture Choice: Depends on input-output relationship
  2. Sequential Processing: Core strength of RNNs
  3. Memory Mechanism: Hidden states preserve temporal information
  4. Application Diversity: Wide range of NLP and sequence modeling tasks

Chapter 59: How Backpropagation works in RNN | Backpropagation Through Time

59.1 How Backpropagation works in RNN | Backpropagation Through Time

59.1.1 Overview

This comprehensive guide covers the fundamental concepts of Backpropagation Through Time (BPTT) in Recurrent Neural Networks, including detailed mathematical derivations and practical examples.

59.1.2 Introduction to RNN Backpropagation

Key Concepts

Concept | Description | Importance
BPTT | Backpropagation Through Time | Core learning algorithm for RNNs
Temporal Dependencies | Learning from sequential data | Essential for time-series analysis
Gradient Flow | How gradients propagate through time | Critical for understanding vanishing gradients

Why BPTT?

Key Insight: RNNs process sequential data where the output at each time step depends on both the current input and the previous hidden state. This creates a computational graph that unfolds through time.

59.1.3 RNN Architecture Review

Mathematical Representation

The RNN operates with the following parameters:

Parameter | Dimension | Description
W_i | 3×3 | Input weight matrix
W_h | 3×3 | Hidden weight matrix
W_o | 1×3 | Output weight matrix

Example Setup: Sentiment Analysis

Consider a toy dataset with three reviews:
  1. Review 1: “cat mat cat” → Label: 1 (Positive)
  2. Review 2: “rat rat mat” → Label: 0 (Negative)
  3. Review 3: “mat cat mat” → Label: 1 (Positive)

Vocabulary Encoding

Vocabulary = {
    "cat": [1, 0, 0],
    "mat": [0, 1, 0],
    "rat": [0, 0, 1]
}

59.1.4 Forward Propagation

Content sourced from CampusX Deep Learning notes (PDF).
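For the toy setup above (one-hot vocabulary, W_i and W_h of size 3×3, W_o of size 1×3), forward propagation and BPTT on Review 1 might look like the sketch below. The notes list only the three weight matrices, so several details are assumptions: tanh hidden units, a sigmoid output read from the last timestep, binary cross-entropy loss, a zero initial state, no bias terms, and random initial weights.

```python
import numpy as np

vocab = {"cat": [1, 0, 0], "mat": [0, 1, 0], "rat": [0, 0, 1]}

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(words, W_i, W_h, W_o):
    """Return all hidden states (h_0 first) and the predicted probability."""
    hs = [np.zeros(3)]                                   # h_0 = 0 (assumed)
    for w in words:
        x = np.array(vocab[w], dtype=float)
        hs.append(np.tanh(W_i @ x + W_h @ hs[-1]))
    y = sigmoid(W_o @ hs[-1])[0]                         # output from last timestep only
    return hs, y

def loss_fn(y, label):
    return -(label * np.log(y) + (1 - label) * np.log(1 - y))

def bptt(words, label, W_i, W_h, W_o):
    """Gradients of the loss with respect to W_i, W_h, W_o."""
    hs, y = forward(words, W_i, W_h, W_o)
    d_out = y - label                                    # dL/d(logit) for sigmoid + BCE
    dW_o = d_out * hs[-1][None, :]
    dh = W_o[0] * d_out                                  # gradient reaching the last state
    dW_i, dW_h = np.zeros_like(W_i), np.zeros_like(W_h)
    for t in range(len(words), 0, -1):                   # walk backward through time
        dz = dh * (1.0 - hs[t] ** 2)                     # through tanh
        x = np.array(vocab[words[t - 1]], dtype=float)
        dW_i += np.outer(dz, x)                          # shared weights: sum over timesteps
        dW_h += np.outer(dz, hs[t - 1])
        dh = W_h.T @ dz                                  # pass gradient to h_{t-1}
    return dW_i, dW_h, dW_o

rng = np.random.default_rng(0)
W_i = rng.normal(scale=0.5, size=(3, 3))
W_h = rng.normal(scale=0.5, size=(3, 3))
W_o = rng.normal(scale=0.5, size=(1, 3))

words, label = "cat mat cat".split(), 1                  # Review 1
hs, y = forward(words, W_i, W_h, W_o)
dW_i, dW_h, dW_o = bptt(words, label, W_i, W_h, W_o)
print(f"prediction={y:.3f}, loss={loss_fn(y, label):.3f}")
```

Because the same W_i and W_h are used at every timestep, their gradients are sums of per-timestep contributions, and the repeated multiplication by W_h transposed is where vanishing or exploding gradients originate.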

Common mistakes

  • Not padding or masking batches of variable-length sequences.
  • Assuming truncated BPTT removes vanishing gradients; on long sequences it only limits how far back gradients flow.
  • Using teacher forcing during training without a plan for inference, when ground-truth targets are not available.

Interview checkpoints

  • Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
  • Q: LSTM vs vanilla RNN? A: Gated memory mitigates vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • BPTT for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 70 — LSTM Gates

Day 70

LSTM Gates


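Keras LSTM
from tensorflow import keras

vocab_size = 10000  # assumed vocabulary size; index 0 is reserved for padding
num_classes = 2     # assumed number of target classes

model = keras.Sequential([
    # mask_zero=True lets downstream layers skip padded (index 0) timesteps
    keras.layers.Embedding(vocab_size, 64, mask_zero=True),
    # return_sequences=False keeps only the final hidden state (many-to-one)
    keras.layers.LSTM(64, return_sequences=False),
    keras.layers.Dense(num_classes, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])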

Why this matters

LSTM gates (forget, input, output) decide what the cell discards, stores, and exposes at each step, which is how the network retains long-range context.

65.1.11 LSTM vs GRU Comparison

Feature Comparison

Feature | LSTM | GRU
Gates | 3 (Input, Forget, Output) | 2 (Reset, Update)
Memory Units | Cell State + Hidden State | Hidden State only
Parameters | $4[(d \times h) + h^2] + 4h$ | $3[(d \times h) + h^2] + 3h$
Complexity | Higher | Lower
Speed | Slower | Faster

Performance Characteristics

[Figure 65.2: image]

When to Use Each

Choose LSTM when:

  • Complex, long sequences
  • Large datasets available
  • Computational resources abundant
  • Maximum performance needed

Choose GRU when:

  • Simpler tasks
  • Limited computational resources
  • Faster training required
  • Smaller datasets
  • Starting point for experimentation

Parameter Count Formula

LSTM parameters: $P_{\text{LSTM}} = 4\,[d \times h + h^2 + h]$

GRU parameters: $P_{\text{GRU}} = 3\,[d \times h + h^2 + h]$

Where:

  • d = input dimension
  • h = hidden dimension
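The parameter count formulas above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names and the sizes d = 64, h = 64 are illustrative assumptions, not values from the notes.

```python
def lstm_params(d: int, h: int) -> int:
    """P_LSTM = 4 * (d*h + h^2 + h): four weight blocks (3 gates + candidate)."""
    return 4 * (d * h + h * h + h)


def gru_params(d: int, h: int) -> int:
    """P_GRU = 3 * (d*h + h^2 + h): three weight blocks (2 gates + candidate)."""
    return 3 * (d * h + h * h + h)


# Illustrative sizes (assumed): 64-dim inputs feeding 64 hidden units.
d, h = 64, 64
lstm_count = lstm_params(d, h)
gru_count = gru_params(d, h)

print("LSTM:", lstm_count)
print("GRU: ", gru_count)
print("GRU / LSTM ratio:", gru_count / lstm_count)
```

For equal d and h, a GRU layer has three quarters of the LSTM's parameters. Library counts can differ slightly from the textbook formula: Keras's GRU with its default `reset_after=True` keeps two bias vectors per block, so `model.summary()` reports 3(d×h + h² + 2h).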


Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

  • Feeding variable-length batches to an LSTM without padding and masking them.
  • Overlooking vanishing gradients on long sequences.
  • Using teacher forcing during training with no plan for step-by-step decoding at inference.

Interview checkpoints

  • Q: RNN vs CNN? A: RNNs suit sequences; CNNs suit spatial grids such as images.
  • Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • LSTM Gates for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 71 — LSTM Cell State

Day 71

LSTM Cell State

3.2. Neural Network Architectures: A Visual Guide

Component | Purpose | Description
Input Layer | Receives sequence data | One element at a time
Hidden State | Maintains memory | Updated with each input
Output Layer | Generates predictions | Can output at each step
Recurrent Connection | Enables memory | Connects hidden state to itself

Core Mechanism

The fundamental RNN computation follows this formula:

$h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$

Where:

  • $x_t$: Input at time step t
  • $h_t$: Hidden state at time step t
  • $h_{t-1}$: Previous hidden state
  • $W_x$: Input weight matrix
  • $W_h$: Hidden state weight matrix
  • $b$: Bias vector
  • tanh: Hyperbolic tangent activation function

This equation shows how RNNs combine current input with previous memory to produce new hidden states, enabling temporal pattern recognition.

RNN Variants

LSTM (Long Short-Term Memory)

  • Gates: Input, forget, output
  • Cell state: Long-term memory storage
  • Protection: Guards against vanishing/exploding gradients
  • Performance: Better at capturing long-range dependencies

GRU (Gated Recurrent Unit)

  • Gates: Reset, update
  • Simplified: Fewer parameters than LSTM
  • Efficiency: Faster training, similar performance

Bidirectional RNNs

  • Two directions: Forward and backward
  • Context: Captures both past and future information
  • Enhanced: Better performance for many applications

Applications

  • Natural language processing
  • Speech recognition
  • Time series prediction
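The recurrence $h_t = \tanh(W_x x_t + W_h h_{t-1} + b)$ from the Core Mechanism can be unrolled over a short sequence in NumPy. This is a minimal sketch: the dimensions, the random weights, and the name `rnn_step` are assumptions chosen for illustration.

```python
import numpy as np


def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One vanilla RNN update: mix the current input with the previous hidden state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)


rng = np.random.default_rng(0)
input_dim, hidden_dim, timesteps = 3, 4, 5  # illustrative sizes

W_x = rng.normal(scale=0.5, size=(hidden_dim, input_dim))   # input weights
W_h = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # recurrent weights
b = np.zeros(hidden_dim)                                    # bias

xs = rng.normal(size=(timesteps, input_dim))  # one toy input sequence
h = np.zeros(hidden_dim)                      # initial hidden state h_0

hidden_states = []
for x_t in xs:
    h = rnn_step(x_t, h, W_x, W_h, b)  # the same weights are reused at every step
    hidden_states.append(h)
hidden_states = np.stack(hidden_states)

print(hidden_states.shape)  # one hidden vector per time step
```

The same `W_x`, `W_h`, and `b` are applied at every time step; only the hidden state changes, which is what "recurrent connection" means in the table.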

Why this matters

The cell state is the LSTM's memory highway: information travels along it from step to step with only element-wise changes.

62.1.12 4. Complete LSTM Cell Animation

Step-by-Step Workflow

[Figure 62.6: image]

Complete Mathematical Flow

All LSTM equations:

  1. $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (forget gate)
  2. $i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$ (input gate)
  3. $\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$ (candidate values)
  4. $C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$ (cell state update)
  5. $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$ (output gate)
  6. $h_t = o_t \odot \tanh(C_t)$ (hidden state)

Key Takeaways

Feature | Purpose | Benefit
Three Gates | Control information flow | Selective memory
Cell State Highway | Direct gradient path | Mitigates vanishing gradients
Pointwise Operations | Element-wise control | Fine-grained memory management
Dual Memory | Long & short term | Comprehensive context

Animation Summary

  1. Forget Phase: Remove irrelevant past info
  2. Input Phase: Add new relevant info
  3. Output Phase: Select what to output now

Complexity Analysis

Operation | Time | Space | Purpose
Forget Gate | O(n²) | O(n) | Memory filtering
Input Gate | O(n²) | O(n) | Information addition
Output Gate | O(n²) | O(n) | Output generation
Total | O(n²) | O(n) | Per timestep
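The six LSTM equations map almost line for line onto NumPy. The sketch below is illustrative only: the parameter dictionary layout, the dimensions, and the random initial weights are assumptions, and real libraries fuse the four weight matrices for speed.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step following the forget / input / candidate / output order."""
    z = np.concatenate([h_prev, x_t])                    # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])     # 1. forget gate
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])     # 2. input gate
    c_hat = np.tanh(params["W_C"] @ z + params["b_C"])   # 3. candidate values
    c_t = f_t * c_prev + i_t * c_hat                     # 4. cell state update
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])     # 5. output gate
    h_t = o_t * np.tanh(c_t)                             # 6. hidden state
    return h_t, c_t


rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4  # illustrative sizes

params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim + input_dim))
    params[f"b_{gate}"] = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)  # short-term memory (hidden state)
c = np.zeros(hidden_dim)  # long-term memory (cell state)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)
```

Step 4 is the only place the previous cell state enters, and it does so through element-wise multiplication and addition rather than a matrix product and squashing function, which is why the cell state is described as a direct path.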

Chapter 63 LSTM Part 3 Next Word Predictor Using CampusX

63.1 LSTM | Part 3 | Next Word Predictor Using | CampusX

63.1.1 1. Introduction

What is a Next Word Predictor?

A Next Word Predictor is an AI system that suggests the most likely word to follow a given sequence of words. It's essentially a text generation model that predicts one word at a time.

[Figure 63.1: image]

Key Characteristics

Feature | Description | Impact
Sequential Processing | Analyzes word order and context | High accuracy
Pattern Recognition | Learns from large text corpora | Better predictions
Context Awareness | Uses previous words for prediction | Natural flow
Real-time Prediction | Instant suggestions | User-friendly

63.1.2 2. Real-World Applications

Industry Impact

Application | User Base | Time Saved | Adoption Rate
Mobile Keyboards | 3B+ users | 30% typing time | 85%
Email Composers | 1.5B users | 20% email time | 65%
Code Completion | 50M developers | 40% coding time | 75%
Chat Applications | 2B+ users | 25% messaging time | 70%

63.1.3 3. Implementation Strategy

Converting Text Generation to Supervised Learning

Data Transformation Process

Step 1: Sentence to Sequences

Original sentence: "Hi my name is Nitish"

Input Sequence | Target Word
"Hi" | "my"
"Hi my" | "name"
"Hi my name" | "is"
"Hi my name is" | "Nitish"

Step 2: Word to Number Mapping

[Figure 63.2: image]

Step 3: Numerical Dataset

Input Sequence | Target
[1] | 2
[1, 2] | 3
[1, 2, 3] | 4
[1, 2, 3, 4] | 5
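The three steps above (sentence to prefixes, words to numbers, numerical dataset) can be sketched in plain Python. `make_training_pairs` is a hypothetical helper written for this illustration; it assigns IDs in order of first appearance, which is an assumption that happens to reproduce the table for this sentence.

```python
def make_training_pairs(sentence: str):
    """Turn one sentence into (input prefix, target word) pairs of integer IDs."""
    words = sentence.split()

    # Word-to-number mapping: IDs start at 1 so that 0 stays free for padding.
    vocab = {}
    for word in words:
        vocab.setdefault(word, len(vocab) + 1)

    ids = [vocab[word] for word in words]

    # Each prefix of the sentence predicts the word that follows it.
    pairs = [(ids[:i], ids[i]) for i in range(1, len(ids))]
    return vocab, pairs


vocab, pairs = make_training_pairs("Hi my name is Nitish")
for prefix, target in pairs:
    print(prefix, "->", target)
```

A sentence of n words yields n - 1 training examples, so even a small corpus produces a reasonably sized supervised dataset.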

63.1.4 4. Data Preprocessing

Tokenization Pipeline

63.2 Key Steps in Preprocessing

[Figure 63.3: image]

Preprocessing Components

Component | Purpose | Output
Tokenizer | Convert text to tokens | Word indices
Vocabulary | Store unique words | Word-to-ID mapping
Sequence Generator | Create input sequences | Training pairs
Padding | Uniform sequence length | Fixed-size inputs

Example Code Structure

Python
# Import necessary libraries
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# `text` is the raw corpus as one string; `sentences` is a list of strings
# Initialize tokenizer and build the word-to-index vocabulary
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Convert text to sequences of word indices
sequences = tokenizer.texts_to_sequences(sentences)

63.2.1 5. Model Architecture

LSTM Network Design

[Figure 63.4: image]

Layer Configuration

Layer Type | Parameters | Purpose
Embedding | vocab_size × 100 | Convert tokens to vectors
LSTM 1 | 150 units, return_sequences=True | Capture sequence patterns
LSTM 2 | 100 units | Extract high-level features
Dense | vocab_size units | Output probabilities
Softmax | - | Normalize probabilities

63.2.2 6. Code Implementation

Complete Implementation Workflow

Step 1: Data Preparation

  • Load text data
  • Split into sentences
  • Create token mappings

Step 2: Sequence Creation

  • Convert text to numbers
  • Create input-output pairs
  • Pad sequences to uniform length

Step 3: Model Construction

  • Build LSTM architecture
  • Configure hyperparameters
  • Compile with optimizer

Step 4: Training Process

  • Train on prepared data
  • Monitor loss metrics
  • Save best model

63.2.3 7. Training & Evaluation

Training Configuration

Hyperparameter | Value | Purpose
Batch Size | 64 | Training efficiency
Epochs | 100 | Model convergence
Learning Rate | 0.001 | Optimization speed
Dropout | 0.2 | Prevent overfitting
Optimizer | Adam | Adaptive learning

63.2.4 1. Dataset Overview

Dataset Statistics

Metric | Value | Description
Total Words | ~283 unique | Vocabulary size
Document Type | FAQ Text | Q&A format
Language | English | Technical content
Size | Small | Demo purposes

Implementation Steps

Step 1: Tokenization

Python
from tensorflow.keras.preprocessing.text import Tokenizer

# `faqs` is the FAQ corpus loaded as a single string
tokenizer = Tokenizer()
tokenizer.fit_on_texts([faqs])  # creates the word-to-index mapping

Step 2: Sequence Generation

Python
input_sequences = []
for sentence in faqs.split('\n'):
    tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
    # Keep every prefix of length >= 2: the last token is the target, the rest is the input
    for i in range(1, len(tokenized_sentence)):
        input_sequences.append(tokenized_sentence[:i+1])

Sequence Creation Example

Original sentence: "What is the fee"

Input Sequences | Target
[What] | is
[What, is] | the
[What, is, the] | fee

Padding Configuration

Python
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_len = max(len(x) for x in input_sequences)  # 56
# Pre-padding puts the zeros on the left, so the target stays the last token
padded_sequences = pad_sequences(input_sequences,
                                 maxlen=max_len,
                                 padding='pre')
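To see what pre-padding does to the data, here is a NumPy-only sketch. `pad_pre` is a hypothetical stand-in for Keras's `pad_sequences(..., padding='pre')`. The split into inputs `X` and targets `y` and the one-hot step are assumptions about how the padded array would be used with a `categorical_crossentropy` loss; the notes do not show that step.

```python
import numpy as np


def pad_pre(sequences, maxlen):
    """Left-pad integer sequences with zeros to a common length."""
    out = np.zeros((len(sequences), maxlen), dtype=np.int64)
    for row, seq in enumerate(sequences):
        seq = seq[-maxlen:]  # keep the most recent tokens if a sequence is too long
        if seq:
            out[row, -len(seq):] = seq
    return out


# Toy prefixes; each one ends with its target word ID.
input_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
max_len = max(len(seq) for seq in input_sequences)
padded = pad_pre(input_sequences, max_len)

X = padded[:, :-1]  # everything except the last token is the model input
y = padded[:, -1]   # the last token is the word to predict

vocab_size = 5      # 4 word IDs plus the padding ID 0 (assumed for this toy data)
y_one_hot = np.eye(vocab_size, dtype=np.float32)[y]

print(padded)
print(X.shape, y.tolist(), y_one_hot.shape)
```

With post-padding the last column would mostly hold zeros, so slicing off the final token would no longer give the target. Pre-padding avoids that.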

63.2.5 3. Model Architecture Deep Dive

Complete Architecture Visualization

[Figure 63.5: image]

Layer-by-Layer Breakdown

1. Embedding Layer

Parameter | Value | Purpose
Input Dim | 283 | Vocabulary size
Output Dim | 100 | Dense vector size
Input Length | 56 | Max sequence length
Parameters | 28,300 | 283 × 100

2. LSTM Layer

Feature | Configuration | Calculation
Units | 150 | Hidden state dimension
Time Steps | 56 | Sequential processing
Input per Step | 100 | From embedding
Output | 150 | Final hidden state

3. Dense Output Layer

Component | Value | Function
Units | 283 | One per word
Activation | Softmax | Probability distribution
Parameters | 42,733 | 150 × 283 + 283
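The per-layer parameter counts follow from the layer sizes alone, so they can be verified without building the model. This is a small arithmetic sketch; the LSTM count uses the standard 4(d×h + h² + h) formula, which the breakdown above does not state explicitly.

```python
vocab_size = 283  # vocabulary size from the dataset overview
embed_dim = 100   # embedding output dimension
lstm_units = 150  # LSTM hidden state dimension

# Embedding: one embed_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embed_dim

# LSTM: four blocks (3 gates + candidate), each with input weights,
# recurrent weights, and a bias.
lstm_layer_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: a weight per (LSTM unit, word) pair plus one bias per word.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_layer_params + dense_params

print("Embedding:", embedding_params)
print("LSTM:     ", lstm_layer_params)
print("Dense:    ", dense_params)
print("Total:    ", total_params)
```

The LSTM layer dominates the total, which is typical when the vocabulary is small.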

63.2.6 4. Implementation Code

Complete Model Building

Python
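from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Build model: 283-word vocabulary, 100-dim embeddings, 56-token inputs
model = Sequential()
model.add(Embedding(283, 100, input_length=56))
model.add(LSTM(150))
model.add(Dense(283, activation='softmax'))

# Compile (the source listing breaks off after `loss`; the optimizer below is
# assumed from the Training Configuration table)
model.compile(loss='categorical_crossentropy',
              optimizer='adam')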


Common mistakes

  • Feeding variable-length batches to the network without padding and masking them.
  • Overlooking vanishing gradients on long sequences.
  • Using teacher forcing during training with no plan for step-by-step decoding at inference.

Interview checkpoints

  • Q: RNN vs CNN? A: RNNs suit sequences; CNNs suit spatial grids such as images.
  • Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • LSTM Cell State for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 72 — GRU

Day 72

GRU


Why this matters

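A GRU simplifies the LSTM: it uses two gates (reset and update) instead of three and keeps only a hidden state, so it has fewer parameters and trains faster.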

20.1.1 Introduction

Topic: the vanishing gradient problem, an important deep learning topic and a frequent source of interview questions. It appears in several variants across architectures, and when it occurs the network cannot train properly. What will be covered: what the vanishing gradient problem is, why it happens, and five ways to solve it.
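A quick numerical sketch shows why gradients vanish. Backpropagation multiplies one activation derivative per layer (or per time step), and the sigmoid's derivative never exceeds 0.25. The code below tracks only that best-case factor and ignores the weight matrices, which is a simplifying assumption made for illustration.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z: float) -> float:
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative peaks at z = 0, so this is the largest factor one layer can contribute.
max_grad = sigmoid_grad(0.0)

# Upper bound on the product of activation derivatives after `depth` layers or time steps.
bounds = {depth: max_grad ** depth for depth in (5, 10, 20, 50)}
for depth, bound in bounds.items():
    print(f"depth {depth:2d}: gradient factor <= {bound:.3e}")
```

Even in this best case the factor shrinks geometrically with depth, so the earliest layers or time steps receive almost no learning signal.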


Common mistakes

  • Feeding variable-length batches to a GRU without padding and masking them.
  • Overlooking vanishing gradients on long sequences.
  • Using teacher forcing during training with no plan for step-by-step decoding at inference.

Interview checkpoints

  • Q: RNN vs CNN? A: RNNs suit sequences; CNNs suit spatial grids such as images.
  • Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • GRU for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 73 — Bidirectional RNN

Day 73

Bidirectional RNN

Contents

65.1.5 Input Processing

65.1.6 GRU Architecture

65.1.7 Hidden State Fundamentals

65.1.8 GRU Architecture Overview

65.1.9 Mathematical Formulations

65.1.10 Step-by-Step Process

65.1.11 LSTM vs GRU Comparison

65.1.12 Key Takeaways

66 Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU

66.1 Bidirectional RNN | BiLSTM | Bidirectional LSTM | Bidirectional GRU

66.2 Bidirectional RNN - Comprehensive Notes

66.2.1 Overview

66.2.2 Why Bidirectional RNNs?

66.2.3 Bidirectional RNN Architecture

66.2.4 Implementation in Keras

66.2.5 Applications

66.2.6 Advantages & Drawbacks

66.2.7 Best Practices

66.2.8 Summary

Why this matters

A bidirectional RNN reads the sequence in both directions, so every position sees both past and future context, which helps NLP tagging tasks.
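The idea can be sketched with two independent vanilla RNNs in NumPy: one reads the sequence left to right, the other right to left, and their hidden states are concatenated per time step. The dimensions, random weights, and helper names here are illustrative assumptions.

```python
import numpy as np


def run_rnn(xs, W_x, W_h, b):
    """Run a vanilla tanh RNN over xs and return the hidden state at every step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)


rng = np.random.default_rng(0)
timesteps, input_dim, hidden_dim = 6, 3, 4  # illustrative sizes
xs = rng.normal(size=(timesteps, input_dim))


def init_params():
    """Random weights for one direction: (W_x, W_h, b)."""
    return (rng.normal(scale=0.5, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.5, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))


fwd_params = init_params()  # each direction has its own weights
bwd_params = init_params()

forward_states = run_rnn(xs, *fwd_params)
# Feed the reversed sequence, then flip the outputs back so index t lines up with x_t.
backward_states = run_rnn(xs[::-1], *bwd_params)[::-1]

# Position t now carries a summary of x_1..x_t and of x_t..x_T.
bi_states = np.concatenate([forward_states, backward_states], axis=-1)
print(bi_states.shape)
```

The output width doubles because the two directions are concatenated. In Keras, wrapping a recurrent layer in `keras.layers.Bidirectional` applies the same pattern, with concatenation as the default merge mode.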

62.1.12 4. Complete LSTM Cell Animation

Step-by-Step Workflow Figure 62.6: image 711

Chapter 62. LSTM Architecture Part 2 The How CampusX Complete Mathematical Flow All LSTM Equations: 11. ft = sigma(Wf.[ht-1,Xt] + bf) # Forget gate 22. it = sigma(Wi.[ht-1,Xt] + bi) # Input gate 33. C?t = tanh(WC.[ht-1,Xt] + bC) # Candidate values 44. Ct = ft?Ct-1 + it?C?t # Cell state update 55. ot = sigma(Wo.[ht-1,Xt] + bo) # Output gate 66. ht = ot?tanh(Ct) # Hidden state Key Takeaways Feature Purpose Benefit Three GatesControl information flow Selective memory Cell State HighwayDirect gradient path Solves vanishing gradients Pointwise OperationsElement-wise control Fine-grained memory management Dual MemoryLong & short term Comprehensive context Animation Summary 1.Forget Phase: Remove irrelevant past info 2.Input Phase: Add new relevant info 3.Output Phase: Select what to output now Complexity Analysis Operation Time Space Purpose Forget Gate O(n2) O(n) Memory filtering Input Gate O(n2) O(n) Information addition Output Gate O(n2) O(n) Output generation Total O(n2) O(n)Per timestep 712

62.1. LSTM Architecture | Part 2 | The How? | CampusX 713

Chapter 63 LSTM Part 3 Next Word Pre- dictor Using CampusX

63.1 LSTM | Part 3 | Next Word Predictor

Using | CampusX

63.1.1 1. Introduction

What is a Next Word Predictor? ANext Word Predictoris an AI system that suggests the most likely word to follow a given sequence of words. It’s essentially a text generation model that predicts one word at a time. Figure 63.1: image Key Characteristics Feature Description Impact Sequential ProcessingAnalyzes word order and context High accuracy Pattern RecognitionLearns from large text corpora Better predictions Context AwarenessUses previous words for prediction Natural flow Real-time PredictionInstant suggestions User-friendly 714

63.1. LSTM | Part 3 | Next Word Predictor Using | CampusX

63.1.2 2. Real-World Applications

Industry Impact Application User Base Time Saved Adoption Rate Mobile Keyboards 3B+ users 30% typing time 85% Email Composers1.5B users 20% email time 65% Code Completion50M developers 40% coding time 75% Chat Applications2B+ users 25% messaging time 70%

63.1.3 3. Implementation Strategy

Converting Text Generation to Supervised Learning Data Transformation Process Original Sentence Input Sequence Target Word “Hi my name is Nitish” “Hi” “my” “Hi my” “name” “Hi my name” “is” “Hi my name is” “Nitish” Step 1: Sentence to Sequences Figure 63.2: image Step 2: Word to Number Mapping 715

Chapter 63. LSTM Part 3 Next Word Predictor Using CampusX Input Sequence Target [1] 2 [1, 2] 3 [1, 2, 3] 4 [1, 2, 3, 4] 5 Step 3: Numerical Dataset

63.1.4 4. Data Preprocessing

Tokenization Pipeline

63.2 Key Steps in Preprocessing

Figure 63.3: image Preprocessing Components Component Purpose Output TokenizerConvert text to tokens Word indices VocabularyStore unique words Word-to-ID mapping Sequence GeneratorCreate input sequences Training pairs PaddingUniform sequence length Fixed-size inputs Example Code Structure 1# Import necessary libraries 2importtensorflowastf

Python
3fromtensorflow.keras.preprocessing.textimportTokenizer
4fromtensorflow.keras.preprocessing.sequenceimportpad_sequences
5
6# Initialize tokenizer
7tokenizer = Tokenizer()
8tokenizer.fit_on_texts([text])
9
10# Convert text to sequences
11sequences = tokenizer.texts_to_sequences(sentences)
716

63.2. Key Steps in Preprocessing

63.2.1 5. Model Architecture

LSTM Network Design Figure 63.4: image 717

Chapter 63. LSTM Part 3 Next Word Predictor Using CampusX Layer Configuration Layer Type Parameters Purpose Embeddingvocab_size×100 Convert tokens to vectors LSTM 1150 units, return_sequences=True Capture sequence patterns LSTM 2100 units Extract high-level features Densevocab_size units Output probabilities Softmax- Normalize probabilities

63.2.2 6. Code Implementation

Complete Implementation Workflow Step 1: Data Preparation ∗Load text data ∗Split into sentences ∗Create token mappings Step 2: Sequence Creation ∗Convert text to numbers ∗Create input-output pairs ∗Pad sequences to uniform length Step 3: Model Construction ∗Build LSTM architecture ∗Configure hyperparameters ∗Compile with optimizer Step 4: Training Process ∗Train on prepared data ∗Monitor loss metrics ∗Save best model 718

63.2. Key Steps in Preprocessing

63.2.3 7. Training & Evaluation

Training Configuration Hyperparameter Value Purpose Batch Size64 Training efficiency Epochs100 Model convergence Learning Rate0.001 Optimization speed Dropout0.2 Prevent overfitting OptimizerAdam Adaptive learning

63.2.4 1. Dataset Overview

Dataset Statistics Metric Value Description Total Words~283 unique Vocabulary size Document TypeFAQ Text Q&A format LanguageEnglish Technical content SizeSmall Demo purposes Implementation Steps Step 1: Tokenization

Python
1fromtensorflow.keras.preprocessing.textimportTokenizer
2
3tokenizer = Tokenizer()
4tokenizer.fit_on_texts([faqs])
5# Creates word-to-index mapping
Step 2: Sequence Generation
1input_sequences = []
2forsentenceinfaqs.split(’\n’):
3tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
4foriin range(1,len(tokenized_sentence)):
5input_sequences.append(tokenized_sentence[:i+1])
719

Chapter 63. LSTM Part 3 Next Word Predictor Using CampusX Sequence Creation Example Original Sentence Input Sequences Target “What is the fee” [What] is [What, is] the [What, is, the] fee Padding Configuration 1max_len =max([len(x)forxininput_sequences])# 56 2padded_sequences = pad_sequences(input_sequences, 3maxlen=max_len, 4padding=’pre’) 720

63.2. Key Steps in Preprocessing

63.2.5 3. Model Architecture Deep Dive

Complete Architecture Visualization

Figure 63.5: image

Layer-by-Layer Breakdown

1. Embedding Layer

| Parameter | Value | Purpose |
|---|---|---|
| Input Dim | 283 | Vocabulary size |
| Output Dim | 100 | Dense vector size |
| Input Length | 56 | Max sequence length |
| Parameters | 28,300 | 283 × 100 |

2. LSTM Layer

| Feature | Configuration | Calculation |
|---|---|---|
| Units | 150 | Hidden state dimension |
| Time Steps | 56 | Sequential processing |
| Input per Step | 100 | From embedding |
| Output | 150 | Final hidden state |

3. Dense Output Layer

| Component | Value | Function |
|---|---|---|
| Units | 283 | One per word |
| Activation | Softmax | Probability distribution |
| Parameters | 42,733 | 150 × 283 + 283 |
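The parameter counts in this breakdown can be checked with a few lines of arithmetic. The sketch below is only a sanity check under stated assumptions: the helper names are made up for the example, and the LSTM count (not given in the tables) uses the standard formula for an LSTM layer with biases, four weight sets each holding an input kernel, a recurrent kernel and a bias vector. Evaluating 150 × 283 + 283 for the Dense layer gives 42,733.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four weight sets (forget, input, candidate, output), each with an
    # input kernel, a recurrent kernel and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


vocab_size, embed_dim, lstm_units = 283, 100, 150
counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense": dense_params(lstm_units, vocab_size),
}
total = sum(counts.values())

print(counts)  # {'embedding': 28300, 'lstm': 150600, 'dense': 42733}
print(total)   # 221633
```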

63.2.6 4. Implementation Code

Complete Model Building

Python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Build model
model = Sequential()
model.add(Embedding(283, 100, input_length=56))
model.add(LSTM(150))
model.add(Dense(283, activation='softmax'))

# Compile (the source listing is cut off after the loss argument;
# Adam is taken from the training configuration above)
model.compile(loss='categorical_crossentropy',
              optimizer='adam')

Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

  • Batching variable-length sequences without padding or masking them.
  • Ignoring vanishing gradients on long NLP sequences.
  • Training with teacher forcing but having no plan for how to decode at inference time.

Interview checkpoints

  • Q: RNN vs CNN? A: RNNs are built for sequences; CNNs are built for spatial grids such as images.
  • Q: LSTM vs vanilla RNN? A: The gated cell state mitigates vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • Bidirectional RNN for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 74 — Stacked LSTMs

Day 74

Stacked LSTMs

Contents

62.1.5 5. Mathematical Representations
62.1.6 6. Pointwise Operations ⊙
62.1.7 7. Neural Network Layers
62.1.8 8. Complete LSTM Workflow
62.1.9 1. The Forget Gate
62.1.10 2. The Input Gate
62.1.11 3. The Output Gate
62.1.12 4. Complete LSTM Cell Animation

63 LSTM Part 3 Next Word Predictor Using CampusX
63.1 LSTM | Part 3 | Next Word Predictor Using | CampusX
63.1.1 1. Introduction
63.1.2 2. Real-World Applications
63.1.3 3. Implementation Strategy
63.1.4 4. Data Preprocessing
63.2 Key Steps in Preprocessing
63.2.1 5. Model Architecture
63.2.2 6. Code Implementation
63.2.3 7. Training & Evaluation
63.2.4 1. Dataset Overview
63.2.5 3. Model Architecture Deep Dive
63.2.6 4. Implementation Code
63.2.7 5. Training Process
63.2.8 6. Prediction Mechanism
63.2.9 7. Performance Optimization
63.2.10 8. Results & Examples

64 Deep RNNs Stacked RNNs Stacked LSTMs Stacked GRUs CampusX
64.1 Deep RNNs | Stacked RNNs | Stacked LSTMs | Stacked GRUs | CampusX
64.1.1 Introduction
64.1.2 Fundamental Concepts
64.1.3 Architecture Deep Dive
64.1.4 Information Flow
64.1.5 Implementation Details
64.1.6 Key Mathematical Concepts Covered:
64.2 Deep RNN Complete Guide - Part 2: Advanced Concepts & Implementation
64.2.1 Mathematical Notation System
64.2.2 Why & When to Use Deep RNNs
64.2.3 Variants & Extensions
64.2.4 Key Takeaways & Next Steps

65 Gated Recurrent Unit Deep Learning GRU CampusX
65.1 Gated Recurrent Unit | Deep Learning | GRU | CampusX
65.1.1 Introduction
65.1.2 Why GRU Exists

Why this matters

Stacked LSTMs add depth across layers at every time step: each layer's hidden-state sequence becomes the input to the layer above it.

63.2.10 8. Results & Examples

Prediction Examples

| Input | Prediction | Quality |
|---|---|---|
| “mail” | “mail us at nitish.campusx@gmail.com” | Perfect |
| “what is the fee” | “what is the fee of the course for data science” | Excellent |
| “total duration” | “total duration of the course is 7 months so the total course fee becomes 799” | Very Good |
| “both are” | “both are not a part of this program’s curriculum” | Contextual |

Key Insights

| Aspect | Observation | Recommendation |
|---|---|---|
| Strengths | Good pattern recognition on training data | Build on this foundation |
| Weaknesses | Limited vocabulary, potential overfitting | Add validation split |
| Next Steps | Scale to larger datasets | Use transfer learning |
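Multi-word completions like the ones in the prediction examples are typically produced by predicting one word, appending it to the input, and repeating. The sketch below shows only that loop. The `predict_next` function is a hypothetical stand-in (a small lookup table built for this example); in the real predictor it would tokenize the text, pad it, run the trained model and map the most probable index back to a word.

```python
def generate(seed_text, predict_next, num_words):
    """Greedy autoregressive loop: predict one word, append it, repeat."""
    words = seed_text.split()
    for _ in range(num_words):
        next_word = predict_next(words)
        if next_word is None:  # the stand-in has no continuation
            break
        words.append(next_word)
    return " ".join(words)


# Hypothetical stand-in for the trained model: looks up the last two words.
table = {
    ("what",): "is",
    ("what", "is"): "the",
    ("is", "the"): "fee",
    ("the", "fee"): "of",
    ("fee", "of"): "the",
    ("of", "the"): "course",
}


def predict_next(words):
    return table.get(tuple(words[-2:]))


print(generate("what", predict_next, 6))  # what is the fee of the course
```

Because each step feeds on the previous prediction, one wrong word early on changes everything generated after it, which is one reason a small, overfit model can look better on training sentences than it is.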

Chapter 64 Deep RNNs | Stacked RNNs | Stacked LSTMs | Stacked GRUs | CampusX

64.1 Deep RNNs | Stacked RNNs | Stacked LSTMs | Stacked GRUs | CampusX

64.1.1 Introduction

Deep RNNs (also called Stacked RNNs) are an extension of traditional RNNs where multiple RNN layers are stacked vertically to increase the model’s representational power and ability to capture complex patterns in sequential data.

Key Motivation

| Problem | Solution | Benefit |
|---|---|---|
| Limited representational power | Add more hidden layers | Increased model complexity |
| Poor performance on complex tasks | Stack multiple RNN cells | Better pattern recognition |
| Insufficient feature extraction | Vertical layer composition | Hierarchical feature learning |

64.1.2 Fundamental Concepts

Evolution from Simple to Deep

Neural Network Complexity Progression

Problem Setup: Sentiment Analysis

Task: Classify movie reviews as positive (1) or negative (0)

Example Dataset

| Review | Label | Length |
|---|---|---|
| “cat mat rat” | 1 (positive) | 3 words |
| “good bad ugly” | 0 (negative) | 3 words |
| “love hate fear” | 1 (positive) | 3 words |

Word Encoding (One-Hot)

| Word | Encoding |
|---|---|
| cat | [1, 0, 0] |
| mat | [0, 1, 0] |
| rat | [0, 0, 1] |

64.1.3 Architecture Deep Dive

Standard RNN Architecture

Single RNN Cell Structure

Input Layer (3D) -> RNN Cell (3 units) -> Output Layer (1D)
(the RNN cell has a feedback loop onto itself)

Mathematical Representation

For a single RNN cell at time step $t$:

$$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$$

$$y_t = \sigma(W_{hy} \cdot h_t + b_y)$$

Where:
- $h_t$: Hidden state at time $t$
- $x_t$: Input at time $t$
- $W_{hh}$: Hidden-to-hidden weight matrix (3×3)
- $W_{xh}$: Input-to-hidden weight matrix (3×3)
- $W_{hy}$: Hidden-to-output weight matrix (1×3)
- $\sigma$: Sigmoid activation function

Deep RNN Architecture

Two-Layer Deep RNN Structure

Input Layer (3D) -> RNN Layer 1 (3 units) -> RNN Layer 2 (2 units) -> Output (1D)
(each RNN layer has its own feedback loop)
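The single-cell recurrence above can be traced in a few lines of plain Python. This is a sketch for intuition only: the weight values are arbitrary numbers chosen for the example (not taken from the notes), and `matvec` and `rnn_step` are helper names introduced here.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One recurrence step: h_t = tanh(W_hh·h_prev + W_xh·x_t + b_h)."""
    pre = [r + i + b for r, i, b in zip(matvec(W_hh, h_prev), matvec(W_xh, x_t), b_h)]
    return [math.tanh(p) for p in pre]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Arbitrary illustrative weights with the shapes listed above.
W_xh = [[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [0.2, 0.1, 0.6]]  # 3x3
W_hh = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]    # 3x3
b_h = [0.0, 0.0, 0.0]
W_hy = [[0.4, -0.3, 0.2]]                                     # 1x3
b_y = [0.0]

h = [0.0, 0.0, 0.0]  # initial hidden state
for x_t in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):  # one-hot "cat", "mat", "rat"
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)

# Sigmoid output computed from the final hidden state.
y = sigmoid(matvec(W_hy, h)[0] + b_y[0])
print(h, y)
```

The same `W_xh`, `W_hh` and `b_h` are reused at every time step; only the hidden state `h` changes as the sequence is read.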

Mathematical Formulation

Layer 1:

$$h^{(1)}_t = \tanh\left(W^{(1)}_{hh} \cdot h^{(1)}_{t-1} + W^{(1)}_{xh} \cdot x_t + b^{(1)}_h\right)$$

Layer 2:

$$h^{(2)}_t = \tanh\left(W^{(2)}_{hh} \cdot h^{(2)}_{t-1} + W_{h^{(1)}h^{(2)}} \cdot h^{(1)}_t + b^{(2)}_h\right)$$

Output:

$$y_t = \sigma\left(W_{hy} \cdot h^{(2)}_t + b_y\right)$$

Weight Matrix Dimensions

| Connection | Matrix | Dimensions |
|---|---|---|
| Input → Layer 1 | $W^{(1)}_{xh}$ | 3×3 |
| Layer 1 → Layer 1 | $W^{(1)}_{hh}$ | 3×3 |
| Layer 1 → Layer 2 | $W_{h^{(1)}h^{(2)}}$ | 2×3 |
| Layer 2 → Layer 2 | $W^{(2)}_{hh}$ | 2×2 |
| Layer 2 → Output | $W_{hy}$ | 1×2 |

64.1.4 Information Flow

Time Step Analysis

Time Step 1 ($t = 1$)

Input: “cat” → [1, 0, 0]

Layer 1 Computation:

$$h^{(1)}_1 = \tanh\left(W^{(1)}_{hh} \cdot [0,0,0] + W^{(1)}_{xh} \cdot [1,0,0] + b^{(1)}_h\right)$$

Layer 2 Computation:

$$h^{(2)}_1 = \tanh\left(W^{(2)}_{hh} \cdot [0,0] + W_{h^{(1)}h^{(2)}} \cdot h^{(1)}_1 + b^{(2)}_h\right)$$

Time Step 2 ($t = 2$)

Input: “mat” → [0, 1, 0]

Layer 1 Computation:

$$h^{(1)}_2 = \tanh\left(W^{(1)}_{hh} \cdot h^{(1)}_1 + W^{(1)}_{xh} \cdot [0,1,0] + b^{(1)}_h\right)$$

Layer 2 Computation:

$$h^{(2)}_2 = \tanh\left(W^{(2)}_{hh} \cdot h^{(2)}_1 + W_{h^{(1)}h^{(2)}} \cdot h^{(1)}_2 + b^{(2)}_h\right)$$

Time Step 3 ($t = 3$)

Input: “rat” → [0, 0, 1]

Final Output Computation:

$$y_3 = \sigma\left(W_{hy} \cdot h^{(2)}_3 + b_y\right)$$

Unfolded Architecture Visualization

Figure 64.1: image
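The three time steps above can be run end to end as a small stacked forward pass. The following is an illustrative sketch, not the course code: the weights are arbitrary values with the 3×3, 3×3, 2×3, 2×2 and 1×2 shapes from the dimension table, and `layer_step` and the `layers` list are names invented for the example.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def layer_step(inp, h_prev, W_in, W_rec, b):
    """h_t = tanh(W_rec·h_prev + W_in·inp + b) for one layer at one time step."""
    pre = [i + r + c for i, r, c in zip(matvec(W_in, inp), matvec(W_rec, h_prev), b)]
    return [math.tanh(p) for p in pre]


# Each layer is (W_in, W_rec, bias); values are arbitrary, shapes follow the text.
layers = [
    # Layer 1: input-to-hidden 3x3, hidden-to-hidden 3x3
    ([[0.5, -0.2, 0.1], [0.0, 0.3, -0.4], [0.2, 0.1, 0.6]],
     [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]],
     [0.0, 0.0, 0.0]),
    # Layer 2: layer1-to-layer2 2x3, hidden-to-hidden 2x2
    ([[0.3, -0.1, 0.2], [0.1, 0.4, -0.2]],
     [[0.2, 0.0], [0.0, 0.2]],
     [0.0, 0.0]),
]
W_hy, b_y = [[0.7, -0.5]], [0.0]  # 1x2 output weights

one_hot = {"cat": [1, 0, 0], "mat": [0, 1, 0], "rat": [0, 0, 1]}
states = [[0.0] * len(b) for _, _, b in layers]  # zero initial state per layer

for word in "cat mat rat".split():
    inp = one_hot[word]
    for i, (W_in, W_rec, b) in enumerate(layers):
        states[i] = layer_step(inp, states[i], W_in, W_rec, b)
        inp = states[i]  # this layer's output is the next layer's input

# Output from the top layer's state after the last time step.
y3 = 1.0 / (1.0 + math.exp(-(matvec(W_hy, states[-1])[0] + b_y[0])))
print(states, y3)
```

At each time step the layers are evaluated bottom to top, and each layer keeps its own hidden state from one step to the next.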

64.1.5 Implementation Details

Memory Requirements

| Component | Memory Usage |
|---|---|
| Layer 1 weights | (3×3) + (3×3) = 18 parameters |
| Layer 2 weights | (2×3) + (2×2) = 10 parameters |
| Output weights | 1×2 = 2 parameters |
| Total | 30 parameters |

Computational Complexity
∗ Forward Pass: $O\left(T \times \sum_{l=1}^{L} n_l^2\right)$
∗ Backward Pass: $O\left(T \times \sum_{l=1}^{L} n_l^2\right)$

Where:
- $T$: Sequence length
- $L$: Number of layers
- $n_l$: Number of units in layer $l$
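The 30-parameter total counts weight matrices only. A short helper makes the arithmetic reusable for other layer sizes. This is a sketch with a hypothetical function name, `deep_rnn_param_count`, and the optional bias term is an addition for completeness: the table above does not include biases.

```python
def deep_rnn_param_count(input_dim, layer_units, output_dim, include_bias=False):
    """Count parameters of a stacked simple RNN followed by a dense output."""
    total = 0
    prev = input_dim
    for n in layer_units:
        total += n * prev + n * n  # input-to-hidden + hidden-to-hidden
        if include_bias:
            total += n
        prev = n
    total += output_dim * prev     # hidden-to-output
    if include_bias:
        total += output_dim
    return total


print(deep_rnn_param_count(3, [3, 2], 1))                     # 30
print(deep_rnn_param_count(3, [3, 2], 1, include_bias=True))  # 36
```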

Advantages of Deep RNNs

| Advantage | Description | Impact |
|---|---|---|
| Hierarchical Representation | Each layer learns different levels of abstraction | High |
| Better Feature Extraction | Multiple layers capture complex patterns | High |
| Improved Performance | Better accuracy on complex tasks | Medium |
| Flexible Architecture | Can vary units per layer | Medium |

Challenges

| Challenge | Description | Mitigation |
|---|---|---|
| Vanishing Gradients | Gradients diminish through layers | LSTM/GRU cells |
| Computational Cost | More parameters and operations | Efficient implementations |
| Overfitting | Complex model may overfit | Regularization techniques |


Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

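  • Batching variable-length sequences for a stacked model without padding or masking them.
  • Ignoring vanishing gradients, which worsen as both sequence length and layer depth grow.
  • Training with teacher forcing but having no plan for how to decode at inference time.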

Interview checkpoints

  • Q: RNN vs CNN? A: RNNs are built for sequences; CNNs are built for spatial grids such as images.
  • Q: LSTM vs vanilla RNN? A: The gated cell state mitigates vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • Stacked LSTMs for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 75 — Time Series Forecasting

Day 75

Time Series Forecasting


Why this matters

Time series forecasting is sensitive to how the series is scaled and how it is cut into input windows, so both need care.

53.3.10 Expected Performance

∗ Accuracy: ~90-95% (typical for this approach)
∗ Training Time: Faster than training from scratch
∗ Data Efficiency: Works well with limited data

Part XI Advanced Keras

Chapter 54 Keras Functional Model

54.1 Functional API in Keras - Detailed Notes

54.1.1 Introduction

This tutorial covers the Functional API in Keras, which allows building non-linear neural network topologies, unlike the Sequential API, which only supports linear layer stacking.
54.1.2 Why Functional API?

Limitations of Sequential Model
∗ Sequential models follow a linear topology: one layer after another
∗ Input → Layer 1 → Layer 2 → ... → Output
∗ Cannot handle:
  · Multiple inputs
  · Multiple outputs
  · Branching architectures
  · Shared layers

When to Use Functional API

Example 1: Multi-Output Model
- Input: Human face images
- Outputs:
  - Age prediction (regression)
  - Emotion classification (happy, sad, angry)
- Requires branching architecture with shared CNN base

Example 2: Multi-Input Model (e-commerce pricing prediction)
- Inputs:
  - Tabular metadata (color, size)
  - Text description
  - Product image
- Output: Price prediction
- Different inputs need different processing (Dense, RNN, CNN)
54.1.3 Basic Functional API Syntax

Key Components

from keras.models import Model
from keras.layers import Input, Dense

# Define input layer
# (input_shape is a placeholder for the number of input features)
input_layer = Input(shape=(input_shape,))

# Build network by connecting layers
hidden = Dense(64, activation='relu')(input_layer)
output = Dense(1)(hidden)

# Create model
model = Model(inputs=input_layer, outputs=output)

Important Differences from Sequential:
1. Each layer's output must be assigned to a variable so it can be passed on
2. Layers are connected by calling them on the output of previous layers
3. Model is created by specifying inputs and outputs
54.1.4 Code Examples

1. Simple Multi-Output Model

from keras.layers import Input, Dense
from keras.models import Model

# Input layer
x = Input(shape=(3,))

# Shared layers
hidden1 = Dense(128, activation='relu')(x)
hidden2 = Dense(64, activation='relu')(hidden1)

# Two output branches
output1 = Dense(1, activation='linear', name='age')(hidden2)
output2 = Dense(1, activation='sigmoid', name='place')(hidden2)

# Create model with multiple outputs
model = Model(inputs=x, outputs=[output1, output2])

# Compile with multiple losses (keys match the output layer names)
model.compile(
    optimizer='adam',
    loss={
        'age': 'mse',
        'place': 'binary_crossentropy',
    },
)
2. Multi-Input Model with Concatenation

from keras.layers import Input, Dense, concatenate
from keras.models import Model

# Define two inputs
inputA = Input(shape=(32,))
inputB = Input(shape=(128,))

# Branch 1
x = Dense(8, activation="relu")(inputA)
x1 = Dense(4, activation="relu")(x)

# Branch 2
y = Dense(64, activation="relu")(inputB)
y1 = Dense(32, activation="relu")(y)
y2 = Dense(4, activation="relu")(y1)

# Concatenate branches
combined = concatenate([x1, y2])

# Final layers
z = Dense(2, activation="relu")(combined)
z1 = Dense(1, activation="linear")(z)

# Model with multiple inputs
model = Model(inputs=[inputA, inputB], outputs=z1)
3. Practical Example: UTKFace Dataset

Dataset: Face images with age and gender labels
Task: Predict both age and gender from face images

Data Preparation

import os
import pandas as pd

# folder_path: directory holding the UTKFace images (must be set beforehand)
age, gender, img_path = [], [], []

# Extract age and gender from filename
for file in os.listdir(folder_path):
    age.append(int(file.split('_')[0]))
    gender.append(int(file.split('_')[1]))
    img_path.append(file)

# Create DataFrame
df = pd.DataFrame({'age': age, 'gender': gender, 'img': img_path})

# Split data: shuffle once, then slice so train and test do not overlap
shuffled = df.sample(frac=1, random_state=0)
train_df = shuffled.iloc[:20000]
test_df = shuffled.iloc[20000:]
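The filename loop above raises an exception on any file whose name does not start with two integer fields. A more defensive variant is sketched below, assuming names begin with `age_gender_` as the loop does. The helper `parse_utkface_name` and the sample filenames are hypothetical and exist only for this illustration.

```python
def parse_utkface_name(filename):
    """Return (age, gender) from an 'age_gender_...' name, or None if malformed."""
    parts = filename.split("_")
    if len(parts) < 2:
        return None
    try:
        return int(parts[0]), int(parts[1])
    except ValueError:
        return None


# Hypothetical filenames, including two that should be skipped.
names = ["25_0_2_20170116.jpg", "61_1_0_20170109.jpg", "notes.txt", "x_1_0.jpg"]

valid = []
for name in names:
    parsed = parse_utkface_name(name)
    if parsed is not None:
        age, gender = parsed
        valid.append({"age": age, "gender": gender, "img": name})

print(valid)
```

Rows collected this way can be passed to `pd.DataFrame(valid)` in place of the three parallel lists.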
Data Augmentation

from keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=30,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

train_generator = train_datagen.flow_from_dataframe(
    train_df,
    directory=folder_path,
    x_col='img',
    y_col=['age', 'gender'],  # Multiple outputs
    target_size=(200, 200),
    class_mode='multi_output'
)
Model Architecture with Transfer Learning

from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Flatten
from keras.models import Model

# Load pre-trained ResNet50 and freeze its weights
resnet = ResNet50(include_top=False, input_shape=(200, 200, 3))
resnet.trainable = False

# Get output from last layer
output = resnet.layers[-1].output
flatten = Flatten()(output)

# Create branches for age and gender
# Age branch
dense1 = Dense(512, activation='relu')(flatten)
dense3 = Dense(512, activation='relu')(dense1)
output1 = Dense(1, activation='linear', name='age')(dense3)

# Gender branch
dense2 = Dense(512, activation='relu')(flatten)
dense4 = Dense(512, activation='relu')(dense2)
output2 = Dense(1, activation='sigmoid', name='gender')(dense4)

# Create model
model = Model(inputs=resnet.input, outputs=[output1, output2])
Compilation with Multiple Losses

model.compile(
    optimizer='adam',
    loss={
        'age': 'mae',                     # Mean Absolute Error for regression
        'gender': 'binary_crossentropy',  # For binary classification
    },
    metrics={
        'age': 'mae',
        'gender': 'accuracy',
    },
    loss_weights={
        'age': 1,
        'gender': 99,  # Higher weight for gender loss
    },
)
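With `loss_weights`, the quantity Keras minimizes is the weighted sum of the per-output losses. The arithmetic is sketched below in plain Python. The per-output loss values are made-up numbers for illustration, and `weighted_total_loss` is a hypothetical helper, not a Keras function.

```python
def weighted_total_loss(losses, loss_weights):
    """Combine per-output losses into one scalar: sum of weight * loss."""
    return sum(loss_weights.get(name, 1.0) * value for name, value in losses.items())


# Hypothetical per-output loss values for one batch.
losses = {"age": 8.0, "gender": 0.5}
loss_weights = {"age": 1, "gender": 99}

total = weighted_total_loss(losses, loss_weights)
print(total)  # 57.5
```

An age MAE measured in years is usually much larger than a binary cross-entropy value, so a large gender weight is one way to keep the age term from dominating the gradient. The right ratio is a tuning choice rather than a fixed rule.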
54.1.5 Key Advantages of Functional API

1. Flexibility: Create any network topology
2. Multiple inputs/outputs: Handle complex data flows
3. Shared layers: Reuse layers in different branches
4. Model visualization: Easy to visualize with plot_model()

54.1.6 Visualization

from keras.utils import plot_model

plot_model(model, show_shapes=True)
54.1.7 Best Practices

1. Naming layers: Give meaningful names to important layers
2. Variable naming: Use descriptive variable names for layer outputs
3. Loss weights: Adjust loss weights for multi-output models based on task importance
4. Transfer learning: Combine pre-trained models with custom architectures

54.1.8 Common Architectures with Functional API

1. Siamese Networks: Shared weights between branches
2. Multi-modal Networks: Different input types (text, image, tabular)
3. Residual Networks: Skip connections
4. Attention Mechanisms: Complex routing between layers

54.1.9 Resources

∗ Keras Functional API Documentation
∗ Machine Learning Mastery Blog Post

This guide shows how the Functional API enables building sophisticated neural network architectures that go beyond simple sequential models, making it essential for complex deep learning applications.

Part XII Recurrent Neural Networks

Chapter 55 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1

55.1 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1

Figure 55.1: image

55.1.1 Neural Network Types Covered So Far

| Neural Network Type | Primary Use Case | Data Type |
|---|---|---|
| Artificial Neural Networks (ANN) | General purpose | Tabular Data |
| Convolutional Neural Networks (CNN) | Image processing | Grid-like Data (Images, Videos) |
| Recurrent Neural Networks (RNN) | Sequential processing | Sequential Data |

55.1.2 What are Recurrent Neural Networks?

Definition

RNN = A special type of sequential model specifically designed to work on sequential data

Key Characteristics
∗ Purpose: Process sequential information
∗ Memory: Maintains context from previous inputs
∗ Applications: NLP, time series, speech recognition

55.1.3 Understanding Sequential Data

Non-Sequential vs Sequential Data

Non-Sequential Data Example

Student Placement Prediction:

Input Features    -> Neural Network -> Prediction
  Age: 22
  Marks: 85%      -> ANN            -> Placement: Yes/No
  Gender: Male

Note: Order doesn’t matter. The features can be rearranged without affecting the outcome.

Sequential Data Examples

| Data Type | Example | Why Sequence Matters |
|---|---|---|
| Text | “Hey my name is Nitish” | Word order determines meaning |
| Time Series | Stock prices over years | Past values influence future trends |
| Audio | Speech waveforms | Temporal patterns create meaning |
| Biological | DNA sequences | Gene order affects function |

Text Processing Example

"Hey   my    name  is    Nitish"
 Word  Word  Word  Word  Word
 1     2     3     4     5

Sequential Processing:
- Read word by word
- Retain context from previous words
- Build understanding progressively
- Combine all information for final meaning

Time Series Example

Stock Price Progression:
2001 -> 2002 -> 2003 -> 2004 -> ...
$50     $55     $48     $62

Sequential Dependency:
- Current price influenced by historical trends
- Past performance affects future predictions
- Temporal relationships are crucial

55.1.4 Why RNNs are Essential

The Sequential Data Challenge

Traditional neural networks (ANN, CNN) cannot handle sequential dependencies because:
∗ Fixed Input Size: Cannot process variable-length sequences
∗ No Memory: Cannot retain information from previous inputs
∗ Order Ignorance: Treat all inputs as independent

RNN Advantages
∗ Memory Capability: Remembers previous inputs
∗ Sequential Processing: Processes one element at a time
∗ Variable Length: Handles sequences of different lengths
∗ Context Awareness: Maintains context throughout sequence

55.1.5 Applications of RNNs

Natural Language Processing (NLP)
∗ Text Classification
∗ Language Translation
∗ Sentiment Analysis
∗ Text Generation

Time Series Analysis
∗ Stock Price Prediction
∗ Weather Forecasting
∗ Sales Forecasting

Speech & Audio
∗ Speech Recognition
∗ Music Generation
∗ Audio Classification

55.2 RNN Fundamentals - Why Use RNNs?

55.2.1 Core Question

Why do we need RNNs (Recurrent Neural Networks)? What specific problems exist that prevent us from using regular neural networks on sequential data?

55.2.2 The Sequential Data Challenge

Text Classification Example

Consider sentiment analysis:
- Input: Text sentences
- Output: Positive/Negative sentiment

| Example Sentences | Expected Output |
|---|---|
| “Hi my name is Nitish” | Positive/Negative |
| “My name” | Positive/Negative |
| “Name is” | Positive/Negative |

55.2.3 Problem 1: Text Representation

Challenge

Neural networks cannot understand text directly; we need a numerical representation.

Solution: One-Hot Encoding

Vocabulary Creation Process
1. Find unique words in the entire vocabulary
2. Create a vector representation for each word

Example Implementation

Sample Text: “Hi my name is Nitish”
- Unique words: 12 total words in vocabulary
- Vector size: 12 dimensions per word

| Word | One-Hot Vector |
|---|---|
| “Hi” | [1,0,0,0,0,0,0,0,0,0,0,0] |
| “my” | [0,1,0,0,0,0,0,0,0,0,0,0] |
| “name” | [0,0,1,0,0,0,0,0,0,0,0,0] |

Vector Stacking

Input Matrix = [Hi_vector, my_vector, name_vector, is_vector, Nitish_vector]
Result: Vertically stacked vectors
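The vocabulary-then-vector process above can be sketched in plain Python. Note the assumptions: the notes use a 12-word vocabulary, while this example builds its vocabulary from the single sample sentence, so the vectors here have 5 dimensions. `build_vocab` and `one_hot` are helper names made up for the illustration.

```python
def build_vocab(sentences):
    """Map each unique (lower-cased) word to an index, in order of first appearance."""
    vocab = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word.lower()]] = 1
    return vec


sentences = ["Hi my name is Nitish"]
vocab = build_vocab(sentences)

# Stack one vector per word: the result has shape (words, vocabulary size).
matrix = [one_hot(word, vocab) for word in sentences[0].split()]

for row in matrix:
    print(row)
```

Each word costs one full vocabulary-sized vector, so the representation grows quickly with vocabulary size. That cost is the usual motivation for learned embeddings.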


55.2.4 Problem 2: Variable Input Sizes

The Core Issue

| Sentence | Word Count | Input Size |
|---|---|---|
| “Hi my name is Nitish” | 5 words | 5×12 = 60 |
| “My name is” | 3 words | 3×12 = 36 |
| “Name is” | 2 words | 2×12 = 24 |

Problem: Neural networks require a fixed input size

Why This Breaks Neural Networks

Figure 55.2: image

55.2.5 Solution: Zero Padding

Implementation Strategy
1. Find maximum sentence length in dataset
2. Pad shorter sentences with zero vectors

Example Implementation

Step 1: Identify Maximum Length
∗ Longest sentence: “Hi my name is Nitish” (5 words)
∗ Padding target: 5 words for all sentences

Step 2: Apply Padding

Original: "My name is" (3 words)
Padded:   "My name is [0] [0]" (5 words)

Where [0] = [0,0,0,0,0,0,0,0,0,0,0,0]
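The padding step can be checked with a short pure-Python sketch. It uses the 12-dimensional vectors from the text. The index assignments for "is" and "nitish" are assumptions (the notes only show the first three words), and `encode` and `pad_with_zero_vectors` are names introduced for this example.

```python
VOCAB_SIZE = 12  # vocabulary size used in the text

# Indices for "hi", "my", "name" follow the one-hot table; "is" and "nitish" are assumed.
word_index = {"hi": 0, "my": 1, "name": 2, "is": 3, "nitish": 4}


def encode(sentence):
    """Turn a sentence into a list of one-hot vectors, one per word."""
    rows = []
    for word in sentence.lower().split():
        vec = [0] * VOCAB_SIZE
        vec[word_index[word]] = 1
        rows.append(vec)
    return rows


def pad_with_zero_vectors(sequence, max_len, vector_size):
    """Append all-zero vectors until the sequence has max_len rows."""
    padding = [[0] * vector_size for _ in range(max_len - len(sequence))]
    return sequence + padding


sentences = ["Hi my name is Nitish", "My name is", "Name is"]
encoded = [encode(s) for s in sentences]
max_len = max(len(e) for e in encoded)
padded = [pad_with_zero_vectors(e, max_len, VOCAB_SIZE) for e in encoded]

# After padding, every sentence flattens to the same input size.
flat_sizes = [len(p) * VOCAB_SIZE for p in padded]
print([len(e) * VOCAB_SIZE for e in encoded])  # [60, 36, 24]
print(flat_sizes)                              # [60, 60, 60]
```

The zero vectors are appended at the end here, matching the example above. Padding at the front is an equally valid convention as long as it is applied consistently.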

53.3.10 Expected Performance

∗Accuracy: ~90-95% (typical for this approach) ∗Training Time: Faster than training from scratch ∗Data Efficiency: Works well with limited data ∗ 606

Part XI

Python
Advanced Keras
607

Chapter 54

Python
Keras Functional Model
54.1 FunctionalAPIinKeras-DetailedNotes
54.1.1 Introduction
This tutorial covers theFunctional APIin Keras, which allows building
non-linear neural network topologiesunlike the Sequential API that
only supports linear layer stacking.
54.1.2 Why Functional API?
Limitations of Sequential Model
∗Sequential models follow alinear topology- one layer after another
∗Input→Layer 1→Layer 2→...→Output
∗Cannot handle:
·Multiple inputs
·Multiple outputs
·Branching architectures
·Shared layers
When to Use Functional API
Example 1: Multi-Output Model-Input: Human face images -
Outputs: - Age prediction (regression) - Emotion classification (happy,
sad, angry) - Requires branching architecture with shared CNN base
Example 2: Multi-Input Model-E-commerce pricing prediction
-Inputs: - Tabular metadata (color, size) - Text description - Product im-
age -Output: Price prediction - Different inputs need different processing
(Dense, RNN, CNN)
54.1.3 Basic Functional API Syntax
Key Components
1fromkeras.modelsimportModel
2fromkeras.layersimportInput, Dense
3
4# Define input layer
5input_layer = Input(shape=(input_shape,))
6
608
Python
54.1. Functional API in Keras - Detailed Notes
7# Build network by connecting layers
8hidden = Dense(64, activation=’relu’)(input_layer)
9output = Dense(1)(hidden)
10
11# Create model
12model = Model(inputs=input_layer, outputs=output)
Important Differences from Sequential:
1. Each layer must be given a name or variable
2. Layers are connected by calling them on previous layers
3. Model is created by specifying inputs and outputs
54.1.4 Code Examples
1. Simple Multi-Output Model
1fromkeras.layersimportInput, Dense
2fromkeras.modelsimportModel
3
4# Input layer
5x = Input(shape=(3,))
6
7# Shared layers
8hidden1 = Dense(128, activation=’relu’)(x)
9hidden2 = Dense(64, activation=’relu’)(hidden1)
10
11# Two output branches
12output1 = Dense(1, activation=’linear’, name=’age’)(hidden2)
13output2 = Dense(1, activation=’sigmoid’, name=’place’)(hidden2)
14
15# Create model with multiple outputs
16model = Model(inputs=x, outputs=[output1, output2])
17
18# Compile with multiple losses
19model.compile(
20optimizer=’adam’,
21loss={
22’age’: ’mse’,
23’place’: ’binary_crossentropy’
24}
25)
2. Multi-Input Model with Concatenation
1# Define two inputs
2inputA = Input(shape=(32,))
3inputB = Input(shape=(128,))
4
5# Branch 1
6x = Dense(8, activation="relu")(inputA)
7x1 = Dense(4, activation="relu")(x)
8
609
Python
Chapter 54. Keras Functional Model
9# Branch 2
10y = Dense(64, activation="relu")(inputB)
11y1 = Dense(32, activation="relu")(y)
12y2 = Dense(4, activation="relu")(y1)
13
14# Concatenate branches
15combined = concatenate([x1, y2])
16
17# Final layers
18z = Dense(2, activation="relu")(combined)
19z1 = Dense(1, activation="linear")(z)
20
21# Model with multiple inputs
22model = Model(inputs=[inputA, inputB], outputs=z1)
3. Practical Example: UTKFace Dataset
Dataset: Face images with age and gender labelsTask: Predict both age
and gender from face images
Data Preparation
1# Extract age and gender from filename
2for file inos.listdir(folder_path):
3age.append(int(file.split(’_’)[0]))
4gender.append(int(file.split(’_’)[1]))
5img_path.append(file)
6
7# Create DataFrame
8df = pd.DataFrame({’age’:age, ’gender’:gender, ’img’:img_path})
9
10# Split data
11train_df = df.sample(frac=1, random_state=0).iloc[:20000]
12test_df = df.sample(frac=1, random_state=0).iloc[20000:]
Data Augmentation
1train_datagen = ImageDataGenerator(
2rescale=1./255,
3rotation_range=30,
4width_shift_range=0.2,
5height_shift_range=0.2,
6shear_range=0.2,
7zoom_range=0.2,
8horizontal_flip=True
9)
10
11train_generator = train_datagen.flow_from_dataframe(
12train_df,
13directory=folder_path,
14x_col=’img’,
15y_col=[’age’,’gender’],# Multiple outputs
16target_size=(200,200),
17class_mode=’multi_output’
18)
610
Python
54.1. Functional API in Keras - Detailed Notes
Model Architecture with Transfer Learning
    from keras.applications.resnet50 import ResNet50
    from keras.layers import Flatten, Dense
    from keras.models import Model

    # Load pre-trained ResNet50 without its classification head
    resnet = ResNet50(include_top=False, input_shape=(200, 200, 3))
    resnet.trainable = False  # freeze the convolutional base

    # Get output from last layer
    output = resnet.layers[-1].output
    flatten = Flatten()(output)

    # Create branches for age and gender
    # Age branch
    dense1 = Dense(512, activation='relu')(flatten)
    dense3 = Dense(512, activation='relu')(dense1)
    output1 = Dense(1, activation='linear', name='age')(dense3)

    # Gender branch
    dense2 = Dense(512, activation='relu')(flatten)
    dense4 = Dense(512, activation='relu')(dense2)
    output2 = Dense(1, activation='sigmoid', name='gender')(dense4)

    # Create model
    model = Model(inputs=resnet.input, outputs=[output1, output2])
Compilation with Multiple Losses
    model.compile(
        optimizer='adam',
        loss={
            'age': 'mae',                     # Mean Absolute Error for regression
            'gender': 'binary_crossentropy'   # For binary classification
        },
        metrics={
            'age': 'mae',
            'gender': 'accuracy'
        },
        loss_weights={
            'age': 1,
            'gender': 99                      # Higher weight for gender loss
        }
    )
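To make the effect of `loss_weights` concrete, here is a small sketch of the arithmetic. A multi-output model is trained on the weighted sum of its per-output losses. The helper name `weighted_total_loss` and the sample loss values are made up for illustration and are not results from the UTKFace run.

```python
def weighted_total_loss(losses, weights):
    """Weighted sum of per-output losses, the single quantity being minimized."""
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-batch losses: MAE for age is measured in years,
# while binary cross-entropy for gender is typically well below 1.
losses = {"age": 8.0, "gender": 0.5}
weights = {"age": 1, "gender": 99}

total = weighted_total_loss(losses, weights)
print(total)  # 8.0 * 1 + 0.5 * 99 = 57.5
```

Without the weight, the age term (8.0) would dominate the gender term (0.5); the large gender weight keeps the classification task from being ignored.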
54.1.5 Key Advantages of Functional API
1. Flexibility: Create any network topology
2. Multiple inputs/outputs: Handle complex data flows
3. Shared layers: Reuse layers in different branches
4. Model visualization: Easy to visualize with plot_model()
54.1.6 Visualization
    from keras.utils import plot_model
    plot_model(model, show_shapes=True)
54.1.7 Best Practices
1. Naming layers: Give meaningful names to important layers
2. Variable naming: Use descriptive variable names for layer outputs
3. Loss weights: Adjust loss weights for multi-output models based on task importance
4. Transfer learning: Combine pre-trained models with custom architectures
54.1.8 Common Architectures with Functional API
1. Siamese Networks: Shared weights between branches
2. Multi-modal Networks: Different input types (text, image, tabular)
3. Residual Networks: Skip connections
4. Attention Mechanisms: Complex routing between layers
54.1.9 Resources
∗ Keras Functional API Documentation
∗ Machine Learning Mastery Blog Post

The Functional API makes it possible to build neural network architectures that go beyond simple sequential models, which makes it essential for complex deep learning applications.

Part XII Recurrent Neural Networks

Chapter 55 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1

55.1 Why RNNs are needed | RNNs Vs ANNs | RNN Part 1

Figure 55.1: image

55.1.1 Neural Network Types Covered So Far

| Neural Network Type | Primary Use Case | Data Type |
| --- | --- | --- |
| Artificial Neural Networks (ANN) | General purpose | Tabular Data |
| Convolutional Neural Networks (CNN) | Image processing | Grid-like Data (Images, Videos) |
| Recurrent Neural Networks (RNN) | Sequential processing | Sequential Data |

55.1.2 What are Recurrent Neural Networks?

Definition
RNN = A special type of sequential model specifically designed to work on sequential data

Key Characteristics
∗ Purpose: Process sequential information
∗ Memory: Maintains context from previous inputs
∗ Applications: NLP, time series, speech recognition

55.1.3 Understanding Sequential Data

Non-Sequential vs Sequential Data

Non-Sequential Data Example
Student Placement Prediction:

    Input Features    -> Neural Network -> Prediction
    - Age: 22
    - Marks: 85%      -> ANN            -> Placement: Yes/No
    - Gender: Male

Note: Order doesn’t matter; the features can be rearranged without affecting the outcome.

Sequential Data Examples

| Data Type | Example | Why Sequence Matters |
| --- | --- | --- |
| Text | “Hey my name is Nitish” | Word order determines meaning |
| Time Series | Stock prices over years | Past values influence future trends |
| Audio | Speech waveforms | Temporal patterns create meaning |
| Biological | DNA sequences | Gene order affects function |

Text Processing Example

    "Hey    my     name   is     Nitish"
      ↓      ↓      ↓      ↓      ↓
     Word   Word   Word   Word   Word
      1      2      3      4      5

Sequential Processing:
- Read word by word
- Retain context from previous words
- Build understanding progressively
- Combine all information for final meaning

Time Series Example

    Stock Price Progression:
    2001 -> 2002 -> 2003 -> 2004 -> ...
    $50     $55     $48     $62

Sequential Dependency:
- Current price influenced by historical trends
- Past performance affects future predictions
- Temporal relationships are crucial

55.1.4 Why RNNs are Essential

The Sequential Data Challenge
Traditional neural networks (ANN, CNN) cannot handle sequential dependencies because:
∗ Fixed Input Size: Cannot process variable-length sequences
∗ No Memory: Cannot retain information from previous inputs
∗ Order Ignorance: Treat all inputs as independent

RNN Advantages
∗ Memory Capability: Remembers previous inputs
∗ Sequential Processing: Processes one element at a time
∗ Variable Length: Handles sequences of different lengths
∗ Context Awareness: Maintains context throughout sequence

55.1.5 Applications of RNNs

Natural Language Processing (NLP)
∗ Text Classification
∗ Language Translation
∗ Sentiment Analysis
∗ Text Generation

Time Series Analysis
∗ Stock Price Prediction
∗ Weather Forecasting
∗ Sales Forecasting

Speech & Audio
∗ Speech Recognition
∗ Music Generation
∗ Audio Classification

55.2 RNN Fundamentals - Why Use RNNs?

55.2.1 Core Question

Why do we need RNNs (Recurrent Neural Networks)? What specific problems exist that prevent us from using regular neural networks on sequential data?

55.2.2 The Sequential Data Challenge

Text Classification Example
Consider sentiment analysis:
- Input: Text sentences
- Output: Positive/Negative sentiment

| Example Sentences | Expected Output |
| --- | --- |
| “Hi my name is Nitish” | Positive/Negative |
| “My name” | Positive/Negative |
| “Name is” | Positive/Negative |

55.2.3 Problem 1: Text Representation

Challenge
Neural networks cannot understand text directly; we need a numerical representation.

Solution: One-Hot Encoding

Vocabulary Creation Process
1. Find the unique words in the entire vocabulary
2. Create a vector representation for each word

Example Implementation
Sample Text: “Hi my name is Nitish”
- Unique words: 12 total words in the vocabulary (counted across the whole dataset, not just this sentence)
- Vector size: 12 dimensions per word

| Word | One-Hot Vector |
| --- | --- |
| “Hi” | [1,0,0,0,0,0,0,0,0,0,0,0] |
| “my” | [0,1,0,0,0,0,0,0,0,0,0,0] |
| “name” | [0,0,1,0,0,0,0,0,0,0,0,0] |

Vector Stacking

    Input Matrix = [Hi_vector, my_vector, name_vector, is_vector, Nitish_vector]
    Result: Vertically stacked vectors (one 12-dimensional row per word)
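The one-hot encoding and stacking steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the notes' own code: only the first five vocabulary words come from the sample sentence, and the remaining seven (`w6` to `w12`) are placeholders added so that the vector size matches the 12-word vocabulary. The function name `one_hot_encode` is also hypothetical.

```python
import numpy as np

# Toy 12-word vocabulary. Only the first five words come from the example
# sentence; the other seven are placeholders so the vector size is 12.
vocab = ["hi", "my", "name", "is", "nitish",
         "w6", "w7", "w8", "w9", "w10", "w11", "w12"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot_encode(sentence):
    """Return a (num_words, vocab_size) matrix with a single 1 in each row."""
    words = sentence.lower().split()
    matrix = np.zeros((len(words), len(vocab)), dtype=int)
    for row, word in enumerate(words):
        matrix[row, word_to_index[word]] = 1
    return matrix

encoded = one_hot_encode("Hi my name is Nitish")
print(encoded.shape)  # (5, 12)
print(encoded[0])     # [1 0 0 0 0 0 0 0 0 0 0 0]
```

Each row is one word vector, so a 5-word sentence becomes a 5×12 matrix.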

55.2.4 Problem 2: Variable Input Sizes

The Core Issue

| Sentence | Word Count | Input Size |
| --- | --- | --- |
| “Hi my name is Nitish” | 5 words | 5×12 = 60 |
| “My name is” | 3 words | 3×12 = 36 |
| “Name is” | 2 words | 2×12 = 24 |

Problem: Neural networks require a fixed input size.

Why This Breaks Neural Networks
Figure 55.2: image

55.2.5 Solution: Zero Padding

Implementation Strategy
1. Find the maximum sentence length in the dataset
2. Pad shorter sentences with zero vectors

Example Implementation
Step 1: Identify Maximum Length
∗ Longest sentence: “Hi my name is Nitish” (5 words)
∗ Padding target: 5 words for all sentences

Step 2: Apply Padding

    Original: "My name is" (3 words)
    Padded:   "My name is [0] [0]" (5 words)

    Where [0] = [0,0,0,0,0,0,0,0,0,0,0,0]
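The two padding steps can be sketched with NumPy as follows. This is an illustrative version under simple assumptions: the sentences are represented by stand-in one-hot matrices (rows taken from an identity matrix) rather than real encoded text, and `pad_to_length` is a hypothetical helper name.

```python
import numpy as np

VOCAB_SIZE = 12  # vector size used in the example

def pad_to_length(matrix, max_len):
    """Append all-zero rows until the matrix has max_len rows."""
    missing = max_len - matrix.shape[0]
    if missing < 0:
        raise ValueError("sequence is longer than max_len")
    padding = np.zeros((missing, matrix.shape[1]), dtype=matrix.dtype)
    return np.vstack([matrix, padding])

# Stand-in one-hot matrices for sentences of 5, 3 and 2 words.
# np.eye gives one distinct one-hot row per word, which is enough here.
sentences = [np.eye(VOCAB_SIZE, dtype=int)[:n] for n in (5, 3, 2)]

# Step 1: find the maximum sentence length
max_len = max(s.shape[0] for s in sentences)

# Step 2: pad every sentence to that length and stack into one batch
batch = np.stack([pad_to_length(s, max_len) for s in sentences])
print(batch.shape)                 # (3, 5, 12)
print(batch.reshape(3, -1).shape)  # (3, 60)
```

After padding, every sentence flattens to the same 60 inputs, at the cost of feeding the network many zeros for short sentences.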

Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

  • Not padding or masking variable-length sequences within a batch.
  • Ignoring vanishing gradients on long sequences.
  • Using teacher forcing during training without a plan for inference.

Interview checkpoints

  • Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
  • Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • Time Series Forecasting for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 76 — Text Generation RNN

Day 76

Text Generation RNN

Contents

3.4.6 5. Adding Audio to Mute Videos

3.4.7 6. Image Caption Generation

3.4.8 7. Text Translation

3.4.9 8. Pixel Restoration

3.4.10 9. Object Detection/Identification (Google Photos)

3.4.11 10. GANs (Generative Adversarial Networks)

3.4.12 11. Deep Dreams

3.4.13 The Technical Foundation

3.4.14 The Future of Deep Learning Applications

3.5 Artificial Intelligence & Deep Learning Resources

3.5.1 Neural Network Architectures

3.5.2 Key Researchers & Databases

3.5.3 Generative AI Models & Applications

3.5.4 Advanced Techniques & Demonstrations

3.5.5 AI Development Timeline

3.5.6 Key AI Capabilities Showcase

II Perceptrons

4 What is perceptron Perceptron vs Neuron Perceptron Geometric Intuition

4.1 Perceptron: The Building Block of Neural Networks

4.1.1 Introduction to Perceptrons

4.1.2 Training and Prediction Process

4.1.3 Example Application

4.1.4 Neuron vs. Perceptron

4.1.5 Interpretation

4.1.6 Geometric Intuition

4.1.7 Code Example

4.1.8 Understanding Weights

4.1.9 Key Takeaways

4.1.10 Next Steps

5 Perceptron Trick How to train a perceptron Part 2

5.1 The Perceptron Trick: Training Linear Classifiers Through Geometric Intuition

5.1.1 The Perceptron’s Learning Challenge

5.1.2 The Geometric Intuition and Transformations

5.1.3 Mathematical Foundation

5.1.4 Positive & Negative Regions

5.1.5 The Transformation Magic

5.1.6 Simplified Learning Algorithm

5.1.7 Learning in Action: The Convergence Process

5.1.8 Why This Matters

Why this matters

Text generation samples next character/token autoregressively.
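A toy sketch of what "autoregressive" means in practice: each sampled token is fed back in as context for the next step. The probability table below is entirely hypothetical and stands in for a trained model. A real RNN or LSTM would condition on the whole prefix through its hidden state, whereas this toy looks only at the previous token.

```python
import random

# Hypothetical next-token probabilities standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "<start>": {"hi": 1.0},
    "hi": {"my": 1.0},
    "my": {"name": 1.0},
    "name": {"is": 1.0},
    "is": {"nitish": 0.5, "<end>": 0.5},
    "nitish": {"<end>": 1.0},
}

def generate(max_tokens=10, seed=0):
    """Sample one token at a time, feeding each choice back in as context."""
    rng = random.Random(seed)
    token, output = "<start>", []
    for _ in range(max_tokens):
        candidates = NEXT_TOKEN_PROBS[token]
        token = rng.choices(list(candidates), weights=list(candidates.values()))[0]
        if token == "<end>":
            break
        output.append(token)
    return output

generated = generate()
print(" ".join(generated))
```

Because the next token is sampled rather than always taken as the most likely one, different seeds can produce different continuations after "is".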

62.1.12 4. Complete LSTM Cell Animation

Step-by-Step Workflow
Figure 62.6: image

Complete Mathematical Flow

All LSTM Equations:

    1. f_t = \sigma(W_f \cdot [h_{t-1}, X_t] + b_f)          # Forget gate
    2. i_t = \sigma(W_i \cdot [h_{t-1}, X_t] + b_i)          # Input gate
    3. \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, X_t] + b_C)   # Candidate values
    4. C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t       # Cell state update
    5. o_t = \sigma(W_o \cdot [h_{t-1}, X_t] + b_o)          # Output gate
    6. h_t = o_t \odot \tanh(C_t)                            # Hidden state

Here \sigma is the sigmoid function and \odot is element-wise multiplication.

Key Takeaways

| Feature | Purpose | Benefit |
| --- | --- | --- |
| Three Gates | Control information flow | Selective memory |
| Cell State Highway | Direct gradient path | Mitigates vanishing gradients |
| Pointwise Operations | Element-wise control | Fine-grained memory management |
| Dual Memory | Long & short term | Comprehensive context |

Animation Summary
1. Forget Phase: Remove irrelevant past info
2. Input Phase: Add new relevant info
3. Output Phase: Select what to output now

Complexity Analysis (n = number of hidden units)

| Operation | Time | Space | Purpose |
| --- | --- | --- | --- |
| Forget Gate | O(n^2) | O(n) | Memory filtering |
| Input Gate | O(n^2) | O(n) | Information addition |
| Output Gate | O(n^2) | O(n) | Output generation |
| Total | O(n^2) | O(n) | Per timestep |
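The six equations translate almost line for line into NumPy. The sketch below is for intuition only: the hidden and input sizes are arbitrary, the weights are random rather than trained, and `lstm_step` and `params` are names chosen for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM timestep following equations 1-6."""
    concat = np.concatenate([h_prev, x_t])                     # [h_{t-1}, X_t]
    f_t = sigmoid(params["W_f"] @ concat + params["b_f"])      # 1. forget gate
    i_t = sigmoid(params["W_i"] @ concat + params["b_i"])      # 2. input gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])  # 3. candidate values
    c_t = f_t * c_prev + i_t * c_tilde                         # 4. cell state update
    o_t = sigmoid(params["W_o"] @ concat + params["b_o"])      # 5. output gate
    h_t = o_t * np.tanh(c_t)                                   # 6. hidden state
    return h_t, c_t

# Illustrative sizes and random weights (not trained values)
hidden_size, input_size = 4, 3
rng = np.random.default_rng(0)
params = {}
for gate in ("f", "i", "C", "o"):
    params[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size + input_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

# Run the cell over a sequence of 5 timesteps, carrying h and c forward
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c, params)

print(h.shape, c.shape)  # (4,) (4,)
```

Each weight matrix has shape (hidden_size, hidden_size + input_size), which is where the quadratic per-timestep cost in the complexity table comes from.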

Chapter 63 LSTM Part 3 Next Word Predictor Using CampusX

63.1 LSTM | Part 3 | Next Word Predictor Using | CampusX

63.1.1 1. Introduction

What is a Next Word Predictor?
A Next Word Predictor is an AI system that suggests the most likely word to follow a given sequence of words. It’s essentially a text generation model that predicts one word at a time.

Figure 63.1: image

Key Characteristics

| Feature | Description | Impact |
| --- | --- | --- |
| Sequential Processing | Analyzes word order and context | High accuracy |
| Pattern Recognition | Learns from large text corpora | Better predictions |
| Context Awareness | Uses previous words for prediction | Natural flow |
| Real-time Prediction | Instant suggestions | User-friendly |

63.1.2 2. Real-World Applications

Industry Impact

| Application | User Base | Time Saved | Adoption Rate |
| --- | --- | --- | --- |
| Mobile Keyboards | 3B+ users | 30% typing time | 85% |
| Email Composers | 1.5B users | 20% email time | 65% |
| Code Completion | 50M developers | 40% coding time | 75% |
| Chat Applications | 2B+ users | 25% messaging time | 70% |

63.1.3 3. Implementation Strategy

Converting Text Generation to Supervised Learning

Data Transformation Process

Step 1: Sentence to Sequences

| Original Sentence | Input Sequence | Target Word |
| --- | --- | --- |
| “Hi my name is Nitish” | “Hi” | “my” |
| | “Hi my” | “name” |
| | “Hi my name” | “is” |
| | “Hi my name is” | “Nitish” |

Step 2: Word to Number Mapping
Figure 63.2: image

Step 3: Numerical Dataset

| Input Sequence | Target |
| --- | --- |
| [1] | 2 |
| [1, 2] | 3 |
| [1, 2, 3] | 4 |
| [1, 2, 3, 4] | 5 |
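The three steps above can be sketched in plain Python. This is a minimal illustration with a hypothetical helper, `build_training_pairs`, that numbers words in order of first appearance starting at 1, an assumption chosen so the output matches the numerical dataset shown above. A real tokenizer would typically number words by frequency over the whole corpus.

```python
def build_training_pairs(sentence):
    """Turn one sentence into (input prefix, target word) index pairs."""
    words = sentence.lower().split()
    # Assign ids in order of first appearance, starting at 1
    # (0 is usually reserved for padding).
    word_to_id = {}
    for word in words:
        word_to_id.setdefault(word, len(word_to_id) + 1)
    ids = [word_to_id[w] for w in words]
    # Every prefix of the sentence predicts the word that follows it
    pairs = [(ids[:i], ids[i]) for i in range(1, len(ids))]
    return word_to_id, pairs

word_to_id, pairs = build_training_pairs("Hi my name is Nitish")
for prefix, target in pairs:
    print(prefix, "->", target)
# [1] -> 2
# [1, 2] -> 3
# [1, 2, 3] -> 4
# [1, 2, 3, 4] -> 5
```

A sentence of n words yields n-1 training pairs, so even a small corpus produces a reasonable number of examples.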

63.1.4 4. Data Preprocessing

Tokenization Pipeline

63.2 Key Steps in Preprocessing

Figure 63.3: image

Preprocessing Components

| Component | Purpose | Output |
| --- | --- | --- |
| Tokenizer | Convert text to tokens | Word indices |
| Vocabulary | Store unique words | Word-to-ID mapping |
| Sequence Generator | Create input sequences | Training pairs |
| Padding | Uniform sequence length | Fixed-size inputs |

Example Code Structure

    # Import necessary libraries
    import tensorflow as tf
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Initialize tokenizer (text is the full training corpus as one string)
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([text])

    # Convert text to sequences (sentences is a list of strings)
    sequences = tokenizer.texts_to_sequences(sentences)

63.2.1 5. Model Architecture

LSTM Network Design
Figure 63.4: image

Layer Configuration

| Layer Type | Parameters | Purpose |
| --- | --- | --- |
| Embedding | vocab_size × 100 | Convert tokens to vectors |
| LSTM 1 | 150 units, return_sequences=True | Capture sequence patterns |
| LSTM 2 | 100 units | Extract high-level features |
| Dense | vocab_size units | Output probabilities |
| Softmax | - | Normalize probabilities |

63.2.2 6. Code Implementation

Complete Implementation Workflow

Step 1: Data Preparation
∗ Load text data
∗ Split into sentences
∗ Create token mappings

Step 2: Sequence Creation
∗ Convert text to numbers
∗ Create input-output pairs
∗ Pad sequences to uniform length

Step 3: Model Construction
∗ Build LSTM architecture
∗ Configure hyperparameters
∗ Compile with optimizer

Step 4: Training Process
∗ Train on prepared data
∗ Monitor loss metrics
∗ Save best model

63.2.3 7. Training & Evaluation

Training Configuration

| Hyperparameter | Value | Purpose |
| --- | --- | --- |
| Batch Size | 64 | Training efficiency |
| Epochs | 100 | Model convergence |
| Learning Rate | 0.001 | Optimization speed |
| Dropout | 0.2 | Prevent overfitting |
| Optimizer | Adam | Adaptive learning |

63.2.4 1. Dataset Overview

Dataset Statistics

| Metric | Value | Description |
| --- | --- | --- |
| Total Words | ~283 unique | Vocabulary size |
| Document Type | FAQ Text | Q&A format |
| Language | English | Technical content |
| Size | Small | Demo purposes |

Implementation Steps
Step 1: Tokenization

    from tensorflow.keras.preprocessing.text import Tokenizer

    # faqs is the FAQ text as a single string
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts([faqs])
    # Creates word-to-index mapping

Step 2: Sequence Generation

    input_sequences = []
    for sentence in faqs.split('\n'):
        tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
        # Every prefix of length >= 2 becomes one training sequence
        for i in range(1, len(tokenized_sentence)):
            input_sequences.append(tokenized_sentence[:i+1])

Sequence Creation Example

| Original Sentence | Input Sequences | Target |
| --- | --- | --- |
| “What is the fee” | [What] | is |
| | [What, is] | the |
| | [What, is, the] | fee |

Padding Configuration

    max_len = max([len(x) for x in input_sequences])  # 56
    padded_sequences = pad_sequences(input_sequences,
                                     maxlen=max_len,
                                     padding='pre')
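To show what `padding='pre'` does to the data, here is a NumPy sketch with a hand-written stand-in, `pad_pre`, for the Keras utility. The final split into `X` and `y` is an assumption about the next step, which these notes do not show. It relies on the fact that pre-padding leaves the target word in the last column of every row.

```python
import numpy as np

def pad_pre(sequences, max_len):
    """Left-pad integer sequences with zeros, mirroring padding='pre'.

    Assumes no sequence is longer than max_len.
    """
    padded = np.zeros((len(sequences), max_len), dtype=int)
    for row, seq in enumerate(sequences):
        padded[row, max_len - len(seq):] = seq
    return padded

# Prefix sequences for a 4-word sentence with word ids 1..4
input_sequences = [[1, 2], [1, 2, 3], [1, 2, 3, 4]]
max_len = max(len(seq) for seq in input_sequences)

padded = pad_pre(input_sequences, max_len)
X, y = padded[:, :-1], padded[:, -1]  # last column is the word to predict

print(padded)
# [[0 0 1 2]
#  [0 1 2 3]
#  [1 2 3 4]]
print(y)  # [2 3 4]
```

With post-padding the target would land in a different column for each row, which is why pre-padding is the convenient choice for this setup.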

63.2.5 3. Model Architecture Deep Dive

Complete Architecture Visualization
Figure 63.5: image

Layer-by-Layer Breakdown

1 Embedding Layer

| Parameter | Value | Purpose |
| --- | --- | --- |
| Input Dim | 283 | Vocabulary size |
| Output Dim | 100 | Dense vector size |
| Input Length | 56 | Max sequence length |
| Parameters | 28,300 | 283×100 |

2 LSTM Layer

| Feature | Configuration | Calculation |
| --- | --- | --- |
| Units | 150 | Hidden state dimension |
| Time Steps | 56 | Sequential processing |
| Input per Step | 100 | From embedding |
| Output | 150 | Final hidden state |

3 Dense Output Layer

| Component | Value | Function |
| --- | --- | --- |
| Units | 283 | One per word |
| Activation | Softmax | Probability distribution |
| Parameters | 42,733 | 150×283 + 283 |
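The parameter counts for the three layers can be checked with a few lines of arithmetic. The helper names below are made up for this sketch. The LSTM formula is the standard count for an LSTM layer with biases: four weight blocks, each acting on the concatenated hidden state and input.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four weight blocks (forget, input, candidate, output), each acting on
    # the concatenated [h_{t-1}, x_t] vector, plus one bias vector per block.
    return 4 * (units * (units + input_dim) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size, embed_dim, lstm_units = 283, 100, 150

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    "lstm": lstm_params(embed_dim, lstm_units),
    "dense": dense_params(lstm_units, vocab_size),
}
print(counts)                # {'embedding': 28300, 'lstm': 150600, 'dense': 42733}
print(sum(counts.values()))  # 221633
```

The LSTM layer holds most of the parameters, roughly two thirds of the total, even though the vocabulary here is small.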

63.2.6 4. Implementation Code

Complete Model Building

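    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense

    # Build model
    model = Sequential()
    model.add(Embedding(283, 100, input_length=56))  # 283-word vocabulary -> 100-d vectors
    model.add(LSTM(150))                             # returns the final hidden state
    model.add(Dense(283, activation='softmax'))      # one probability per word

    # Compile (optimizer assumed to be Adam, as listed in the training configuration)
    model.compile(loss='categorical_crossentropy',
                  optimizer='adam')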

Common mistakes

  • Not padding or masking variable-length sequences within a batch.
  • Ignoring vanishing gradients on long sequences.
  • Using teacher forcing during training without a plan for inference.

Interview checkpoints

  • Q: RNN vs CNN? A: RNN for sequences; CNN for spatial grids.
  • Q: LSTM vs vanilla RNN? A: Gated memory reduces vanishing gradients.

Practice

  1. Basic: Sketch unrolled RNN for 3 timesteps.
  2. Intermediate: LSTM layer on padded sequences.
  3. Advanced: Forecast univariate series with windowed LSTM.

Recap

  • Text Generation RNN for sequence modeling.
  • Watch shape (batch, time, features).
  • Transformers often replace RNNs now.

Next: Day 77 — Seq2Seq Architecture
