Module 3 · 100 Days of DL

Module 3: Gradient Descent Variations & Tuning

Analyze optimization trajectories in Batch, Stochastic, and Mini-batch Gradient Descent. Address vanishing and exploding gradients, and run hyperparameter search with Keras Tuner.

⏱ 30 Min Read • Author: GenAIWallah Team • Updated: May 2026

Day 23

Batch vs SGD

VI Gradient Problems in Neural Networks

19 Gradient Descent in Neural Network: Batch vs Stochastic vs Mini-Batch

19.1 Introduction to Gradient Descent
19.2 Neural Network Context: Back Propagation Algorithm
19.2.1 Back Propagation Process
19.3 Three Types of Gradient Descent
19.4 Performance Metrics Comparison
19.4.1 Example: 500 Rows, 10 Epochs
19.5 Batch Gradient Descent
19.5.1 How It Works
19.5.2 Pseudo Code Structure
19.5.3 Key Characteristics
19.6 Stochastic Gradient Descent (SGD)
19.6.1 How It Works
19.6.2 Pseudo Code Structure
19.6.3 Key Characteristics
19.7 Mini-Batch Gradient Descent
19.7.1 Best of Both Worlds
19.7.2 Pseudo Code Structure
19.7.3 Characteristics Comparison
19.8 Implementation Code Comparison

Why this matters

Batch GD uses the full dataset per step — stable but slow; SGD uses one sample — noisy but fast.

19.15.1 Batch Size Parameter

In deep learning frameworks, the gradient descent variant is controlled by the batch_size parameter:
– Batch GD: batch_size = total_rows (note that in Keras, batch_size=None falls back to the default of 32 rather than the full dataset)
– Stochastic GD: batch_size = 1
– Mini-batch GD: batch_size = small_number (e.g. 32, 64, 128)
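To make the effect of batch_size concrete, the sketch below counts weight updates for each variant. The 500-row, 10-epoch figures are taken as an illustrative assumption, and `updates_per_run` is a hypothetical helper, not a framework function.

```python
import math

def updates_per_run(n_rows: int, batch_size: int, epochs: int) -> int:
    """Number of weight updates when every epoch is split into batches."""
    steps_per_epoch = math.ceil(n_rows / batch_size)  # the last batch may be smaller
    return steps_per_epoch * epochs

n_rows, epochs = 500, 10  # assumed example sizes
for name, bs in [("batch", n_rows), ("stochastic", 1), ("mini-batch", 32)]:
    print(f"{name:>10}: batch_size={bs:>3} -> {updates_per_run(n_rows, bs, epochs)} updates")
```

Under these assumptions, batch GD makes one update per epoch, SGD makes one per row, and mini-batch GD sits in between.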

Weights are adjusted to minimize error. Depending on how much data is processed per update step, we choose from three variations:

Batch Gradient Descent: Computes the gradient over the entire dataset before each update. The trajectory is smooth, but every step is slow and memory-intensive on large datasets.
Stochastic Gradient Descent (SGD): Computes the gradient for a single random sample per step. Each update is very cheap, but the trajectory oscillates heavily.
Mini-batch Gradient Descent: Processes a batch of size $M$ (typically 32 to 512) per step, trading a little gradient noise for much better throughput. Combines the best of both worlds.
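The three variations above differ only in how many rows feed each update. A minimal NumPy sketch on a synthetic linear-regression problem follows; the data, the `train_linear` helper, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

def train_linear(X, y, batch_size, lr=0.1, epochs=50, seed=0):
    """Fit y ~ X @ w + b by gradient descent on mean squared error."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                   # reshuffle rows every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]            # residuals on this batch only
            w -= lr * 2 * X[idx].T @ err / len(idx)  # gradient of MSE w.r.t. w
            b -= lr * 2 * err.mean()                 # gradient of MSE w.r.t. b
    return w, b

# Synthetic data (assumed): 500 rows, 3 features, small noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 3.0 + 0.01 * rng.normal(size=500)

for name, bs in [("batch", len(X)), ("stochastic", 1), ("mini-batch", 32)]:
    w, b = train_linear(X, y, batch_size=bs)
    print(f"{name:>10}: w={np.round(w, 2)}, b={b:.2f}")
```

Setting batch_size to the row count gives batch GD, 1 gives SGD, and anything in between gives mini-batch GD; the update rule itself never changes.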

Gradient Descent Optimization Convergence Paths

Common mistakes

Ignoring the effect of batch size on convergence.
Not monitoring the loss curve while training with SGD.
Tuning hyperparameters on the test set instead of a validation set.

Interview checkpoints

Q: Batch vs SGD in one sentence? A: Batch GD uses the full dataset per step (stable but slow); SGD uses one sample per step (noisy but fast).
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Batch vs SGD plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand Batch GD vs SGD.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 24 — Mini-batch Training

Day 24

Mini-batch Training

Chapter 12. Handwritten Digit Classification using ANN (MNIST Dataset)

Python

...
    batch_size=32  # Smaller batch size
)

Different Optimizers

Python

# Try different optimizers
optimizers_to_try = [
    'adam',     # Default choice (usually best)
    'sgd',      # Simple gradient descent
    'rmsprop',  # Alternative optimizer
    'adagrad'   # Another option
]

for opt in optimizers_to_try:
    model.compile(
        loss='sparse_categorical_crossentropy',
        optimizer=opt,
        metrics=['accuracy']
    )

Expected Improvements:
• More layers: Can capture more complex patterns
• Dropout: Reduces overfitting, improves generalization
• Larger networks: Higher capacity for learning
• Longer training: Better convergence (watch for overfitting)

Things to Watch:
• Overfitting: Training accuracy ≫ Validation accuracy
• Training time: Larger models take longer
• Diminishing returns: More complex ≠ always better

12.1.9 Advanced Concepts

Understanding Multi-Class Output

Python

# Softmax ensures probabilities sum to 1
sample_output = model.predict(X_test[0].reshape(1, 28, 28))[0]
print("Individual probabilities:")
for i, prob in enumerate(sample_output):
    print(f"Digit {i}: {prob:.4f}")
print(f"Sum of probabilities: {sample_output.sum():.4f}")  # Should be 1.0

Confusion Matrix Analysis

Python

from sklearn.metrics import confusion_matrix, classification_report
import seaborn as sns

# Generate predictions

Why this matters

Mini-batches balance noise and throughput — default batch sizes 32–256 on GPUs.

19.7 Mini-Batch Gradient Descent

As error signals are backpropagated through many layers, gradients can shrink exponentially (vanishing gradients) or grow exponentially (exploding gradients) during matrix multiplications. Vanishing gradients are highly prevalent when using activation functions like **Sigmoid** or **Tanh**, whose derivatives saturate near 0.
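The exponential shrinkage and growth described above can be seen in a toy chain of scalar layers. This is only a sketch: the `chain_gradient` helper, the depth of 20, and the weights 1.0 and 1.5 are assumptions chosen for illustration, and real networks multiply matrices rather than scalars.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def chain_gradient(depth, weight, activation="sigmoid", x=0.5):
    """d(output)/d(input) for a chain of scalar layers a_k = act(weight * a_{k-1}).

    The result is the product of per-layer local derivatives, which is the
    same product backpropagation multiplies together.
    """
    a, grad = x, 1.0
    for _ in range(depth):
        z = weight * a
        if activation == "sigmoid":
            a = sigmoid(z)
            local = a * (1.0 - a)   # sigmoid'(z), never larger than 0.25
        else:                        # linear activation
            a = z
            local = 1.0
        grad *= weight * local
    return grad

print(f"sigmoid, weight=1.0: {chain_gradient(20, 1.0):.3e}")           # vanishes
print(f"linear,  weight=1.5: {chain_gradient(20, 1.5, 'linear'):.3e}")  # explodes
```

With sigmoid layers every factor is at most 0.25, so twenty layers shrink the gradient below 0.25^20; with a linear chain and a weight above 1 the same product grows as 1.5^20.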

Common mistakes

Ignoring the effect of mini-batch size on convergence.
Forgetting to shuffle the data between epochs.
Tuning hyperparameters on the test set instead of a validation set.

Interview checkpoints

Q: Mini-batch Training in one sentence? A: Mini-batches balance gradient noise and throughput; batch sizes of 32–256 are common defaults on GPUs.
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Mini-batch Training plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand mini-batch training.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 25 — Vanishing Gradients

Day 25

Vanishing Gradients

Contents

59.1.2 Introduction to RNN Backpropagation
59.1.3 RNN Architecture Review
59.1.4 Forward Propagation
59.1.5 Backpropagation Through Time (BPTT)
59.1.6 Gradient Calculations
59.1.7 Implementation Details
59.1.8 Summary
60 Problems with RNN 100 Days of Deep Learning
60.1 Problems with RNN | 100 Days of Deep Learning
60.1.1 RNN Fundamentals Recap
60.1.2 Major Problems with RNNs
60.1.3 Why this happens:
60.1.4 Real-world Example:
60.1.5 Technical Deep Dive
60.2 RNN Mathematical Analysis: Long-Term Dependency & Gradient Problems
60.2.1 Course Context & Prerequisites
60.2.2 Problem #1: Long-Term Dependency - Mathematical Analysis
60.2.3 Gradient Calculation & Chain Rule Application
60.2.4 Complete Mathematical Derivation
60.2.5 Vanishing Gradient Problem - Mathematical Proof
60.2.6 Solutions to Vanishing Gradients
60.2.7 Problem #2: Exploding Gradients
60.2.8 Solutions to Exploding Gradients
60.2.9 Summary & Mathematical Insights
61 LSTM Long Short Term Memory Part 1 The What CampusX
61.1 LSTM | Long Short Term Memory | Part 1 | The What? | CampusX
61.1.1 Recap: From ANN to RNN
61.1.2 The Critical Problem: Long Sequences
61.1.3 LSTM: The Solution
61.2 LSTM Core Concepts & Architecture - Deep Dive Notes
61.2.1 Learning Objective
61.2.2 The Story-Based Learning Approach
61.2.3 Human Brain Memory Processing Model
61.2.4 The RNN Memory Problem
61.2.5 LSTM Solution: Dual Memory Architecture
61.2.6 The Three Gates: LSTM’s Control System
61.2.7 Pronoun Resolution Example
61.2.8 LSTM as a Computer System

Why this matters

Vanishing gradients stall deep sigmoid/tanh nets — activations and init matter.

68.3.10 Key Training Insights

Critical Success Factors

| Factor | Description | Impact |
|---|---|---|
| Teacher Forcing | Use correct tokens during training | Faster convergence |
| Proper Loss Function | Categorical cross-entropy for multi-class | Accurate gradients |
| Learning Rate | Balance between speed and stability | Training success |
| Sufficient Data | Large parallel dataset | Model generalization |

Common Training Challenges

| Challenge | Symptom | Solution |
|---|---|---|
| Vanishing Gradients | Poor long sequence learning | Use LSTM/GRU |
| Exploding Gradients | Training instability | Gradient clipping |
| Overfitting | Good training, poor validation | Regularization |
| Slow Convergence | High loss after many epochs | Adjust learning rate |

Training Success Indicators

| Metric | Good Performance | Poor Performance |
|---|---|---|
| Training Loss | Steadily decreasing | Oscillating/increasing |
| Validation Loss | Following training loss | Much higher than training |
| BLEU Score | > 25 for translation | < 10 |
| Convergence Time | Few hundred epochs | Never converges |

Chapter 68. Encoder Decoder Sequence to Sequence Architecture Deep Learning CampusX

68.4 Encoder-Decoder: Prediction & Advanced Improvements Guide

68.4.1 From Basic Architecture to Production-Ready Models

68.4.2 Prediction Process After Training

Training vs Prediction Mode

| Aspect | Training Mode | Prediction Mode |
|---|---|---|
| Weights Status | Continuously updating | Frozen (fixed values) |
| Data Availability | Has target sequences | No target sequences |
| Teacher Forcing | Uses correct inputs | Uses model predictions |
| Backpropagation | Required for learning | Not needed |
| Purpose | Learning patterns | Making predictions |

Prediction Workflow

Figure 68.14: image

Step-by-Step Prediction Example

| Step | Component | Input | Process | Output | Action |
|---|---|---|---|---|---|
| 1 | Encoder | “think” | Process token | h1, c1 | Continue |
| 2 | Encoder | “about” | Process token | h2, c2 | Continue |
| 3 | Encoder | “it” | Process token | Context Vector | Transfer to decoder |
| 4 | Decoder | <START> + Context | Generate probabilities | “saocha” (highest prob) | Use as next input |
| 5 | Decoder | “saocha” + States | Generate probabilities | “jaaao” (highest prob) | Use as next input |
| 6 | Decoder | “jaaao” + States | Generate probabilities | “lao” (highest prob) | Use as next input |
| 7 | Decoder | “lao” + States | Generate probabilities | <END> (highest prob) | Stop generation |

Input: “Think about it” → Expected: “saocha jaaao lao”

Key Differences from Training

Critical Change: During prediction, we cannot use teacher forcing because we don’t know the correct target sequence. The model must rely on its own predictions.

Figure 68.15: image

Autoregressive Generation Process
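The decoder loop in the steps above can be sketched as greedy autoregressive decoding. `TOY_DECODER` is a hypothetical lookup table with made-up probabilities standing in for a trained decoder; a real decoder would also carry the LSTM states forward.

```python
# Hypothetical stand-in for a trained decoder: maps the previous token to
# next-token probabilities (numbers invented for illustration).
TOY_DECODER = {
    "<START>": {"saocha": 0.7, "lao": 0.2, "<END>": 0.1},
    "saocha": {"jaaao": 0.8, "lao": 0.1, "<END>": 0.1},
    "jaaao": {"lao": 0.6, "saocha": 0.1, "<END>": 0.3},
    "lao": {"<END>": 0.9, "saocha": 0.1},
}

def greedy_decode(decoder, max_len=10):
    """Feed each prediction back in as the next input until <END>."""
    token, output = "<START>", []
    for _ in range(max_len):
        probs = decoder[token]
        token = max(probs, key=probs.get)  # highest-probability token
        if token == "<END>":
            break                          # stop generation
        output.append(token)               # no teacher forcing: reuse own output
    return output

print(" ".join(greedy_decode(TOY_DECODER)))
```

The max_len cap is a common safeguard in case the model never emits the end token.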

68.4.3 Improvement 1: Embeddings Over One-Hot Encoding

Problem with One-Hot Encoding

| Issue | Small Vocabulary | Large Vocabulary | Impact |
|---|---|---|---|
| Dimension Size | 5-7 dimensions | 100,000+ dimensions | Memory explosion |
| Sparsity | Mostly zeros | 99.999% zeros | Computational waste |
| Semantic Information | None captured | None captured | No word relationships |
| Storage Requirements | Manageable | Prohibitive | Infrastructure strain |

Solution: Word Embeddings

Embedding Architecture

| Aspect | One-Hot | Embeddings | Improvement |
|---|---|---|---|
| Dimensionality | Vocabulary size | Fixed (e.g., 300) | 99%+ reduction |
| Semantic Info | None | Rich relationships | Context capture |
| Memory Usage | O(V) per word | O(d) per word | Massive savings |
| Training Speed | Slower | Faster | Computational efficiency |
| Density | 99%+ zeros | 100% non-zero | Information dense |

Embedding Benefits Comparison

Implementation Options

| Strategy | Method | Pros | Cons | Best For |
|---|---|---|---|---|
| Pre-trained | Word2Vec, GloVe | Ready to use; general knowledge | May not fit domain | General applications |
| Custom Training | Train with network | Domain-specific; task-optimized | Requires more data | Specialized domains |

Embedding Strategies

68.4.4 Improvement 2: Deep LSTMs (Multi-Layer Architecture)

Single vs Multi-Layer Comparison

| Architecture | Single Layer | Multi-Layer |
|---|---|---|
| LSTM Layers | 1 layer | 3-4 layers |
| Context Vectors | 1 vector | Multiple vectors |
| Parameter Count | Lower | Higher |
| Learning Capacity | Limited | Enhanced |
| Long Dependencies | Moderate | Excellent |

Deep LSTM Architecture

Figure 68.16: image

Three Key Benefits of Deep LSTMs

| Sequence Length | Single Layer | Multi-Layer | Performance Gap |
|---|---|---|---|
| Short (< 20 words) | Good | Excellent | +15% accuracy |
| Medium (20-50 words) | Moderate | Good | +25% accuracy |
| Long (50+ words) | Poor | Good | +40% accuracy |

1 Enhanced Long-Term Dependencies

Why: Multiple context vectors provide more “memory slots” to store sequence information

Figure 68.17: image

2 Hierarchical Representation Learning

| Metric | Single Layer | Multi-Layer | Benefit |
|---|---|---|---|
| Parameters | ~100K | ~400K | 4x model capacity |
| Learning Ability | Basic patterns | Complex patterns | Advanced feature extraction |
| Generalization | Limited | Strong | Better unseen data performance |
| Data Variations | Struggles | Handles well | Robust to input diversity |

3 Increased Model Capacity

Original Paper Results

Research Finding: Sutskever et al. used 4-layer Deep LSTMs and achieved significant improvements over single-layer baselines.

| Model Type | BLEU Score | Improvement | Architecture |
|---|---|---|---|
| Baseline | 25.2 | - | Single layer |
| Deep LSTM | 34.8 | +38% | 4 layers, 1000 units each |

Performance Comparison

68.4.5 Improvement 3: Input Sequence Reversal

Concept Overview

| Approach | Input Order | Distance to First Output | Gradient Flow |
|---|---|---|---|
| Normal | “Think about it” | 3 timesteps | Longer path |
| Reversed | “it about Think” | 1 timestep | Shorter path |

Normal vs Reversed Input Processing

The Science Behind Reversal

Figure 68.18: image

Benefits Analysis

| Benefit | Explanation | Impact |
|---|---|---|
| Shorter Gradient Path | First input word closer to first output | Better learning |
| Faster Convergence | Reduced vanishing gradient effect | Training speed |
| Better Context Capture | Initial words get more attention | Translation quality |

Advantages

| Challenge | Explanation | Mitigation |
|---|---|---|
| Later Words Distance | End words become farther from outputs | Language-specific testing |
| Language Dependency | Not all language pairs benefit equally | Empirical validation |
| Experimental Nature | Requires case-by-case evaluation | A/B testing |

Trade-offs

Language-Specific Effectiveness

| Language Type | Reversal Benefit | Reason | Examples |
|---|---|---|---|
| Front-Heavy | High | Critical info at start | English, French |
| Balanced | Medium | Even information distribution | Spanish, Italian |
| End-Heavy | Low/None | Critical info at end | Japanese, Korean |

Effectiveness by Language Characteristics

Historical Note: Original Sutskever paper used English→French translation and saw significant improvements with input reversal.

68.4.6 Original Research Paper Summary

Sutskever et al. Architecture Specifications

| Specification | Value | Details |
|---|---|---|
| Translation Task | English→French | Machine translation |
| Dataset Size | 12M sentences | Massive parallel corpus |
| English Words | 304M words | Source language |
| French Words | 348M words | Target language |
| Dataset Type | Public corpus | Reproducible research |

Task & Dataset

| Component | Size | Special Handling |
|---|---|---|
| English Vocab | 160K words | Input vocabulary |
| French Vocab | 80K words | Output vocabulary |
| Special Tokens | EOS (End of Sequence) | Instead of START/END |
| Unknown Words | <UNK> token | Out-of-vocabulary handling |

Vocabulary & Tokens

Figure 68.19: image

Architecture Details

| Component | Configuration | Purpose |
|---|---|---|
| Embedding Dimension | 1000D vectors | Rich word representation |
| LSTM Layers | 4 layers each side | Deep hierarchical learning |
| LSTM Units | 1000 units per layer | High model capacity |
| Output Function | Softmax activation | Probability distribution |
| Input Processing | Reversed sequences | Improved gradient flow |

Technical Specifications

Performance Results

| Metric | Baseline | Sutskever Model | Improvement |
|---|---|---|---|
| BLEU Score | ~25-30 | 34.8 | +15-35% |
| Translation Quality | Standard | State-of-art | Revolutionary |
| Research Impact | - | High citations | Field-defining |

BLEU Score Achievement

Historical Significance: This paper established encoder-decoder as the foundation for modern neural machine translation, paving the way for attention mechanisms and transformer architectures.

Key Takeaways for Implementation

| Feature | Implementation | Impact |
|---|---|---|
| Embeddings | 300-1000D vectors | Memory efficiency |
| Deep Architecture | 3-4 LSTM layers | Better performance |
| Input Reversal | Reverse source sequences | Language-dependent gains |
| Special Tokens | EOS, UNK handling | Robust processing |

Must-Have Features

Figure 68.20: image

Figure 68.21: image

Recommended Starting Points

Chapter 69 Attention Mechanism in 1 video | Seq2Seq Networks | Encoder Decoder Architecture

69.1 Attention Mechanism in 1 video | Seq2Seq Networks | Encoder Decoder Architecture

69.1.1 Learning Objectives

| Objective | Description |
|---|---|
| Understanding the Need | Why attention mechanism is necessary |
| Problem Identification | Issues with traditional encoder-decoder |
| Solution Exploration | How attention mechanism works |
| Implementation Insights | Step-by-step breakdown |

69.1.2 The Problem with Encoder-Decoder Architecture

Architecture Overview

Figure 69.1: image

Core Issues Identified

| Challenge | Description | Impact |
|---|---|---|
| Memory Overload | Encoder must compress entire sentence into single vector | High |
| Long Sequences | Performance degrades with sentences >25 words | Critical |
| Information Loss | Important details get lost in compression | High |

1. Information Bottleneck Problem

Human Analogy: Just like humans struggle to memorize and translate a 50-word sentence all at once!

To automate architecture decisions, we use **Keras Tuner** to search hyperparameter spaces (number of dense units, learning rate, dropout rates) using algorithms like Random Search, Hyperband, or Bayesian Optimization.
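To illustrate what a random search over such a space does, here is a library-free sketch. It is not Keras Tuner's API: `SEARCH_SPACE`, `sample_config`, and especially `toy_validation_score` (a made-up formula replacing "train a model and measure validation accuracy") are assumptions for illustration only.

```python
import math
import random

# Hypothetical search space mirroring the knobs named above.
SEARCH_SPACE = {
    "units": [32, 64, 128, 256, 512],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.5],
}

def sample_config(rng):
    """Draw one configuration; the learning rate is sampled log-uniformly."""
    config = {name: rng.choice(values) for name, values in SEARCH_SPACE.items()}
    config["learning_rate"] = 10 ** rng.uniform(-4, -2)
    return config

def toy_validation_score(config):
    """Made-up stand-in for training a model and returning validation accuracy."""
    lr_penalty = (math.log10(config["learning_rate"]) + 3) ** 2
    units_penalty = 0.1 * abs(math.log2(config["units"] / 128))
    dropout_penalty = abs(config["dropout"] - 0.2)
    return 1.0 - lr_penalty - units_penalty - dropout_penalty

def random_search(n_trials, seed=0):
    """Sample n_trials configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    trials = [sample_config(rng) for _ in range(n_trials)]
    return max(trials, key=toy_validation_score)

best = random_search(n_trials=20)
print(best, round(toy_validation_score(best), 3))
```

In a real run the scoring function is the expensive part, which is why strategies such as Hyperband and Bayesian Optimization try to spend fewer full training runs on unpromising configurations.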

Common mistakes

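Ignoring the effect of vanishing gradients on convergence.
Not monitoring activations (for example, dead ReLU units) during training.
Tuning hyperparameters on the test set instead of a validation set.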

Interview checkpoints

Q: Vanishing Gradients in one sentence? A: Vanishing gradients stall deep sigmoid/tanh networks, so the choice of activation and initialization matters.
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Vanishing Gradients plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand vanishing gradients.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 26 — Exploding Gradients

Day 26

Exploding Gradients

Why this matters

Exploding gradients blow up weights — clip or lower LR.
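One common form of clipping rescales all gradients together when their combined norm exceeds a threshold. The sketch below shows that idea in NumPy; `clip_by_global_norm` and the threshold of 1.0 are illustrative assumptions, not a specific framework's function.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))  # 1.0 means "leave untouched"
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]  # joint norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, round(norm_after, 6))
```

Because every array is multiplied by the same factor, the update direction is preserved and only its length shrinks.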

Chapter 56. Recurrent Neural Network Forward Propagation Architecture

56.4. RNN Forward Propagation: Complete Technical Deep Dive

56.4.7 Key Insights & Best Practices

Critical Understanding Points

1 Two Inputs Per Timestep (Except First)

Python

# t=1: Only current input
inputs_t1 = [x_1]  # h_0 = zeros

# t=2+: Current input + previous memory
inputs_t2_plus = [x_t, h_prev]  # h_prev is h_{t-1}

2 Weight Sharing Across Time

Python

# SAME weights used at EVERY timestep
assert W_input_t1 is W_input_t2 is W_input_t3  # True!
assert W_hidden_t1 is W_hidden_t2 is W_hidden_t3  # True!

# This enables parameter efficiency and generalization

3 Memory Accumulation

Python

# Each hidden state contains cumulative information
# h_1: Information from x_1
# h_2: Information from x_1 + x_2
# h_3: Information from x_1 + x_2 + x_3

4 Sequential Processing Limitation

Python

# Cannot parallelize across timesteps
# (h[0] is computed from x[0] and the zero initial state)
for t in range(1, sequence_length):
    h[t] = f(x[t], h[t - 1])  # h[t] depends on h[t-1]

# Can parallelize across batch dimension
batch_processing = True  # Different sequences in parallel

Activation Function Selection

| Task Type | Hidden Activation | Output Activation | Justification |
|---|---|---|---|
| Binary Classification | tanh | sigmoid | Bounded hidden states, probability output |
| Multi-class Classification | tanh | softmax | Probability distribution over classes |
| Regression | tanh | linear | Continuous output values |
| Language Modeling | tanh | softmax | Next word probability |

Common Pitfalls & Solutions

Pitfall 1: Gradient Issues

Python

# Problem: Vanishing/Exploding Gradients
# Solutions:
# - Gradient clipping: np.clip(gradients, -1, 1)
# - Better architectures: LSTM, GRU
# - Proper weight initialization
# - Learning rate scheduling

Pitfall 2: Memory Limitations

Python

# Problem: Limited long-term memory
# Solutions:
# - Use LSTM/GRU for longer sequences
# - Attention mechanisms
# - Truncated backpropagation
# - Hierarchical processing

Pitfall 3: Training Instability

Python

# Problem: Training doesn't converge
# Solutions:
# - Proper weight initialization (Xavier/He)
# - Batch normalization
# - Dropout for regularization
# - Learning rate tuning

Performance Optimization Tips

1 Efficient Matrix Operations

Python

# Vectorized operations
h_t = np.tanh(W_input @ x_t + W_hidden @ h_prev + bias)

# Avoid loops for matrix multiplication

2 Memory Management

Python

# Batch processing
batch_size = 32
# sequences.shape == (batch_size, sequence_length, vocab_size)

# Gradient checkpointing for long sequences

3 Numerical Stability

Python

import numpy as np

# Clip extreme values
def stable_sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def stable_tanh(x):
    return np.tanh(np.clip(x, -500, 500))
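Putting the points above together (zero initial state, shared weights, sequential dependence), a forward pass over one sequence can be sketched in NumPy. The shapes, the random weights, and the `rnn_forward` name are assumptions for illustration.

```python
import numpy as np

def rnn_forward(xs, W_input, W_hidden, bias):
    """Run a simple RNN over xs of shape (T, input_dim); return (T, hidden_dim) states."""
    h = np.zeros(W_hidden.shape[0])   # h_0 = zeros
    states = []
    for x_t in xs:                    # sequential: each step needs the previous h
        h = np.tanh(W_input @ x_t + W_hidden @ h + bias)  # same weights every step
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4   # assumed sizes
xs = rng.normal(size=(T, input_dim))
W_input = rng.normal(scale=0.5, size=(hidden_dim, input_dim))
W_hidden = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))
bias = np.zeros(hidden_dim)

hs = rnn_forward(xs, W_input, W_hidden, bias)
print(hs.shape)
```

The loop over timesteps cannot be vectorized away, but a batch dimension could be added to xs and h so that many sequences advance in parallel.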

Chapter 57 RNN Sentiment Analysis RNN Code Example in Keras CampusX

57.0.1 Deep Learning with RNNs: Text Processing & Sentiment Analysis Guide

57.0.2 Overview

This comprehensive guide covers implementing Recurrent Neural Networks (RNNs) using Keras for sentiment analysis, including text preprocessing, tokenization, and embedding techniques.

Colab Notebook: https://colab.research.google.com/drive/1uY7NEHi59w4FkB8TViwLjUDKxgCA8W5G?usp=sharing
Colab Notebook: https://colab.research.google.com/drive/1FLJZ0LeMiW_6OkzFrC-o035YZPBFEFR4?usp=sharing

57.0.3 Text Preprocessing Pipeline

Complete Workflow Figure 57.1: Mermaid diagram

57.0.4 Code Implementation Breakdown

1 Dataset Creation & Tokenization

Python

# Sample dataset
docs = ['go india',
        'india india',
        'hip hip hurray',
        'jeetega bhai jeetega india jeetega',
        'bharat mata ki jai',
        'kohli kohli',
        'sachin sachin',
        'dhoni dhoni',
        'modi ji ki jai',
        'inquilab zindabad']

Tokenizer Configuration

Python

from keras.preprocessing.text import Tokenizer

# Initialize tokenizer with OOV (Out of Vocabulary) handling
tokenizer = Tokenizer(oov_token='<nothing>')
tokenizer.fit_on_texts(docs)

| Tokenizer Attributes | Description | Value |
|---|---|---|
| word_index | Dictionary mapping words to indices | {'india': 1, 'jeetega': 2, ...} |
| word_counts | Frequency count of each word | {'india': 4, 'jeetega': 3, ...} |
| document_count | Total number of documents | 10 |

2 Text to Sequence Conversion

Python

# Convert text to integer sequences
sequences = tokenizer.texts_to_sequences(docs)

| Original Text | Integer Sequence |
|---|---|
| ‘go india’ | [10, 1] |
| ‘india india’ | [1, 1] |
| ‘jeetega bhai jeetega india jeetega’ | [2, 3, 2, 1, 2] |

Transformation Example

3 Sequence Padding

Python

from keras.utils import pad_sequences

# Pad sequences to ensure uniform length
sequences = pad_sequences(sequences, padding='post')

Padding Strategy
• Purpose: Make all sequences the same length
• Method: padding='post' adds zeros at the end
• Alternative: padding='pre' adds zeros at the beginning

Before vs After Padding

Before: [[10, 1], [1, 1], [7, 7, 8], [2, 3, 2, 1, 2], ...]
After: [[10, 1, 0, 0, 0], [1, 1, 0, 0, 0], [7, 7, 8, 0, 0], [2, 3, 2, 1, 2], ...]
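To see what the two padding modes do without installing Keras, here is a plain-Python sketch. The `pad` helper is hypothetical and only covers padding to the longest sequence; it does not reproduce the truncation options of the real pad_sequences.

```python
def pad(sequences, padding="post", value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [value] * (max_len - len(seq))
        # 'post' appends the filler, 'pre' puts it in front
        padded.append(list(seq) + filler if padding == "post" else filler + list(seq))
    return padded

sequences = [[10, 1], [1, 1], [7, 7, 8], [2, 3, 2, 1, 2]]
print(pad(sequences, padding="post"))
print(pad(sequences, padding="pre"))
```

Post-padding keeps real tokens at the start of each row, while pre-padding keeps them next to the final timestep that the RNN reads last.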
57.0.5 Model Architecture Options
Option 1: Simple RNN with Integer Encoding
from keras import Sequential
from keras.layers import Dense, SimpleRNN

model = Sequential()
model.add(SimpleRNN(32, input_shape=(50, 1), return_sequences=False))
model.add(Dense(1, activation='sigmoid'))

Layer      Configuration                   Output Shape
SimpleRNN  32 neurons, no sequence return  (None, 32)
Dense      1 neuron, sigmoid activation    (None, 1)
Architecture Breakdown
Option 2: RNN with Embedding Layer
from keras.layers import Embedding

model = Sequential()
# vocab_size=10000, embedding_dim=2, input_length=50.
# input_length is passed by keyword because the third positional
# argument of Embedding is embeddings_initializer, not the sequence length.
model.add(Embedding(input_dim=10000, output_dim=2, input_length=50))
model.add(SimpleRNN(32, return_sequences=False))
model.add(Dense(1, activation='sigmoid'))

Advantage             Description                        Impact
Dense Representation  Non-zero values, lower dimensions  Efficiency
Semantic Meaning      Captures word relationships        Accuracy
Learnable Weights     Adapts to specific dataset         Performance
Embedding Layer Benefits
57.0.6 IMDB Dataset Implementation
Dataset Overview
from keras.datasets import imdb

# Load pre-processed IMDB dataset.
# num_words=10000 keeps word indices inside the Embedding vocabulary of 10000.
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=10000)

Metric    Training                   Testing
Samples   25,000                     25,000
Features  Variable length sequences  Variable length sequences
Labels    Binary (0/1)               Binary (0/1)
Dataset Statistics

Data Preprocessing
# Pad sequences to fixed length
X_train = pad_sequences(X_train, padding='post', maxlen=50)
X_test = pad_sequences(X_test, padding='post', maxlen=50)

57.0.7 Model Training & Compilation

Model Configuration

model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

Parameter      Value                Purpose
Optimizer      Adam                 Adaptive learning rate
Loss Function  Binary Crossentropy  Binary classification
Metrics        Accuracy             Performance evaluation
Epochs         5                    Training iterations
Training Parameters

Training Execution
history = model.fit(
    X_train, y_train,
    epochs=5,
    validation_data=(X_test, y_test)
)
57.0.8 Performance Comparison
Results Analysis
Approach          Training Accuracy  Key Features
Integer Encoding  ~60-70%            Direct sequence processing
Embedding Layer   ~80-98%            Dense representation, semantic meaning
57.0.9 Key Concepts Explained
Embedding Layer Mathematics
Vocabulary Size: 17 unique words
Embedding Dimension: 2
Weight Matrix: 17 * 2 = 34 parameters

For each word:
Input: One-hot vector (17 dimensions)
Output: Dense vector (2 dimensions)

RNN Parameter Calculation
SimpleRNN(32 units):
- Input weights: input_dim * 32
- Recurrent weights: 32 * 32
- Bias: 32
Total: (input_dim + 32 + 1) * 32 parameters
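The two counting rules above can be checked with a few lines of Python. The function names are illustrative only; the arithmetic is exactly the formulas stated in the notes.

```python
def embedding_params(vocab_size, embedding_dim):
    """One learnable vector of length embedding_dim per vocabulary entry."""
    return vocab_size * embedding_dim

def simple_rnn_params(input_dim, units):
    """Input weights + recurrent weights + one bias per unit."""
    return (input_dim + units + 1) * units

toy_embedding = embedding_params(17, 2)        # 17 * 2
rnn_on_integers = simple_rnn_params(1, 32)     # one feature per timestep
rnn_on_embeddings = simple_rnn_params(2, 32)   # two features per timestep
```

With one integer feature per timestep the SimpleRNN layer has 1,088 parameters; fed by a 2-dimensional embedding it has 1,120, since only the input-weight term changes.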

Content sourced from CampusX Deep Learning notes (PDF). Run merge script for full body.

Common mistakes

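Ignoring the effect of exploding gradients on convergence.
Not monitoring gradient norms during training.
Tuning on test data.

Interview checkpoints

Q: Exploding Gradients in one sentence? A: When the derivatives multiplied during backpropagation are greater than 1, their product becomes very large, so weight updates become extreme and the loss stops decreasing.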
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Exploding Gradients plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand exploding gradients.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 27 — Gradient Clipping

Day 27

Gradient Clipping

1.3. Artificial Neural Networks (ANN)

1.3.3 MLP [Multi-layer perceptron]

•Intuition of MLP •MLP Notation •Prediction in MLP

1.3.4 Training an MLP [Most used Algorithm]

•Gradient Descent •Backpropagation

1.3.5 Practical with Keras
•CPU Vs GPU
•Installation
•Example 1 - Regression using Keras
•Example 2 - Classification using Keras
1.3.6 How to improve an ANN
•Vanishing Gradients
•Exploding Gradients
•Dropouts
•Regularization
•Weight Initialization
•Optimizers
•Gradient Checking and Clipping
•Batch Normalization
•Hyperparameter Tuning
1.3.7 Advanced Topics
•Callbacks
•Tensorboard
•Pretrained Models
•Keras Functional API
•Saving and Loading a Keras model
•Building a Streamlit Application
1.3.8 Project
•End-to-End Final Project
•AWS deployment
Convolutional Neural Networks (CNN)
•Convolution operations and filters
•Pooling layers and techniques
•Feature maps and visualization

Why this matters

Gradient clipping caps update magnitude — essential in RNNs and some transformers.

Chapter 20. Vanishing Gradient Problem in ANN Exploding Gradient Problem Code Example

20.1.11 Brief Introduction: Exploding Gradient Problem

Opposite Problem
This is exactly opposite to the vanishing gradient problem, and it is more commonly seen in Recurrent Neural Networks (RNNs).

Mathematical Principle
Opposite logic: If you have numbers greater than 1 and you multiply them, you get a number larger than all of them.

What Happens
1. When calculating derivatives: If all derivatives are greater than 1
2. Result: You get a very large number
3. Weight updates: Become extremely large
4. Example:
W_new = W_old - learning_rate * large_gradient
W_new = 1 - 0.1 * 100 = 1 - 10 = -9
5. Next iteration: Weight can become 100 or even larger
6. Consequence: Weights become so large that the model starts behaving randomly and the loss doesn't reduce

Solution Preview
The Gradient Clipping technique will be covered when studying RNNs; a detailed video will show what the exploding gradient problem is and how to use gradient clipping to avoid it.
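As a preview of the idea, here is a minimal sketch of clipping by norm applied to the worked example above (W_old = 1, learning rate 0.1, gradient 100). The helper names and the threshold max_norm=1.0 are assumptions chosen for illustration, not values from the notes.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale gradients so their L2 norm does not exceed max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm == 0.0 or norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

def sgd_step(weight, grad, learning_rate):
    """Plain gradient descent update for a single weight."""
    return weight - learning_rate * grad

# Without clipping: the update from the notes, 1 - 0.1 * 100
w_unclipped = sgd_step(1.0, 100.0, 0.1)

# With clipping: the gradient is rescaled to norm 1 before the update
clipped_grad = clip_by_norm([100.0], max_norm=1.0)[0]
w_clipped = sgd_step(1.0, clipped_grad, 0.1)
```

Clipping keeps the direction of the gradient and only limits its size, so the weight moves by at most learning_rate * max_norm per step. In Keras the same effect is usually requested through the optimizer's clipnorm or clipvalue arguments rather than written by hand.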


Common mistakes

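Ignoring the effect of clipping on convergence.
Not monitoring the gradient norm during training.
Tuning on test data.

Interview checkpoints

Q: Gradient Clipping in one sentence? A: It caps the magnitude of each update so that very large gradients cannot destabilize training, which matters most in RNNs.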
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Gradient Clipping plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand gradient clipping.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 28 — Weight Initialization

Day 28

Weight Initialization

Contents
25.0.8 Key Takeaways
25.0.9 Additional Notes
26 Regularization in Deep Learning L2 Regularization in ANN L1 Regularization Weight Decay in ANN
26.0.1 Introduction to Regularization in Neural Networks
26.0.2 Building Neural Networks: Basics
26.0.3 Understanding Overfitting
26.0.4 Ways to Reduce Overfitting
26.0.5 Complete Cost Function with Regularization
26.0.6 Regularization Types
26.0.7 Parameter Definitions
26.0.8 Weight Structure
26.0.9 Regularization: How It Works
26.0.10 Intuition Behind Regularization
26.0.11 Practical Implementation & Code Demo
26.0.12 Comparison Table: With vs Without Regularization
26.0.13 Visual Summary: Regularization Process
26.0.14 Key Takeaways
26.0.15 Tips & Best Practices
26.0.16 Conclusion
27 Activation Functions in Deep Learning Sigmoid, Tanh and Relu Activation Function
27.1 Activation Functions in Neural Networks
27.1.1 Introduction to Activation Functions
27.1.2 Why Activation Functions are Needed
27.1.3 Ideal Activation Function Properties
27.1.4 Sigmoid Activation Function
27.1.5 Tanh Activation Function
27.1.6 ReLU Activation Function
27.1.7 Summary and Comparison
27.1.8 Final Takeaways
28 Relu Variants Explained Leaky Relu Parametric Relu Selu Activation Functions Part 2
28.0.1 Introduction to Activation Functions
28.0.2 Why Activation Functions are Needed
28.0.3 Ideal Activation Function Properties
28.0.4 Sigmoid Activation Function
28.0.5 Tanh Activation Function
28.0.6 ReLU Activation Function
28.0.7 Summary and Comparison
28.0.8 Key Takeaways & Architecture Guide
29 Weight Initialization Techniques What not to do Deep Learning
30 Xavier Glorot And He Weight Initialization in Deep Learning
30.1 Neural Network Weight Initialization Techniques

Why this matters

Weight initialization sets trainability — He/Xavier match activation.
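To make "match activation" concrete, here is a small sketch of the standard deviations commonly used by the two schemes in their normal-distribution form. The function names are illustrative; the pairing of Xavier with sigmoid/tanh and He with ReLU is the usual convention and is stated here as an assumption rather than taken from the notes.

```python
import math

def xavier_std(fan_in, fan_out):
    """Xavier (Glorot) normal: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """He normal: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

# A layer with 8 inputs and 32 units, as in the Dense(32, input_dim=8) examples
std_for_tanh_layer = xavier_std(fan_in=8, fan_out=32)
std_for_relu_layer = he_std(fan_in=8)
```

He initialization uses a larger spread for the same layer because ReLU zeroes roughly half of its inputs, so more variance is needed to keep the signal from shrinking layer by layer.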


Common mistakes

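Ignoring the effect of initialization on convergence.
Not checking that the initializer (He or Xavier) matches the activation function.
Tuning on test data.

Interview checkpoints

Q: Weight Initialization in one sentence? A: It sets the starting weights and therefore whether the network can train at all; He and Xavier schemes are chosen to match the activation function.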
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Weight Initialization plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand weight initialization.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 29 — Learning Rate Schedules

Day 29

Learning Rate Schedules

32.1. Optimizers in Deep Learning | Part 1 | Complete Deep Learning Course

3 Mini-batch Gradient Descent
for epoch in range(num_epochs):
    for batch in mini_batches:
        gradients = compute_gradients(batch)
        weights = weights - learning_rate * gradients

32.1.4 Challenges with Traditional Optimizers

1 Learning Rate Selection
Learning Rate  Effect                   Visualization
Too Small      Slow convergence         Painfully slow
Too Large      Overshooting/Divergence  Unstable
Just Right     Optimal convergence      Perfect
The Goldilocks Problem

2 Learning Rate Scheduling
Problem: Pre-defined schedules don't adapt to data
# Common scheduling strategies (lr0 is the initial learning rate)
strategies = {
    "Step Decay": "lr = lr * 0.1 every 30 epochs",
    "Exponential": "lr = lr0 * exp(-decay * epoch)",
    "Cosine Annealing": "lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total))"
}

3 Same Learning Rate for All Parameters
Issue: Different parameters may need different learning rates
Figure 32.3: image
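The three strategies listed above can be written as small functions of the epoch number. This is a sketch under stated assumptions: the function names and the default decay=0.05 are made up for illustration, while the step-decay defaults (factor 0.1 every 30 epochs) follow the notes.

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Multiply the initial rate by `drop` once per `every` completed epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, decay=0.05):
    """Smooth decay from the initial rate: lr0 * exp(-decay * epoch)."""
    return lr0 * math.exp(-decay * epoch)

def cosine_annealing(lr_min, lr_max, epoch, total):
    """Move from lr_max at epoch 0 to lr_min at epoch `total` along a half cosine."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total))

# Learning rate per epoch for a 90-epoch run starting at 0.1
step_schedule = [step_decay(0.1, e) for e in range(90)]
cosine_schedule = [cosine_annealing(0.001, 0.1, e, total=90) for e in range(91)]
```

All three are fixed in advance: they depend only on the epoch counter, never on the gradients, which is the limitation the notes point out before moving on to adaptive optimizers.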

Why this matters

Learning rate schedules decay η over time — cosine, step, exponential.

49.1. Cat Vs Dog Image Classification Project | Deep Learning Project | CNN Project

49.1.11 Key Learning Points

Technical Concepts Applied
1. Convolutional Neural Networks: Feature extraction using filters
2. Data Generators: Efficient handling of large datasets
3. Batch Processing: Training with mini-batches
4. Regularization: Preventing overfitting
5. Transfer Learning Concepts: Building custom architecture

Best Practices Demonstrated
1. GPU Utilization: Using Google Colab's free GPU
2. Data Normalization: Essential preprocessing step
3. Model Monitoring: Plotting training curves
4. Overfitting Detection: Recognizing performance gaps
5. Iterative Improvement: Adding regularization techniques


Common mistakes

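Ignoring the effect of the learning rate on convergence.
Not monitoring the scheduled learning rate during training.
Tuning on test data.

Interview checkpoints

Q: Learning Rate Schedules in one sentence? A: They decay the learning rate η over training according to a fixed rule such as step, exponential, or cosine decay.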
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Learning Rate Schedules plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand learning rate schedules.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 30 — Keras Tuner

Day 30

Keras Tuner

from tensorflow import keras
import keras_tuner as kt

def build_model(hp):
    model = keras.Sequential([
        keras.layers.Dense(hp.Int('units', 64, 256, step=64), activation='relu', input_shape=(20,)),
        keras.layers.Dense(1, activation='sigmoid'),
    ])
    model.compile(optimizer=keras.optimizers.Adam(hp.Choice('lr', [1e-3, 1e-4])),
                  loss='binary_crossentropy', metrics=['accuracy'])
    return model

tuner = kt.RandomSearch(build_model, objective='val_accuracy', max_trials=5, directory='tune_dir')
tuner.search(x_train, y_train, validation_split=0.2, epochs=10)

Why this matters

Keras Tuner automates hyperparameter search — use val loss, not train.

37.2. RMSProp Optimizer: Complete Deep Learning Notes

37.2.11 Key Insights Covered:

The Core Innovation
Adagrad: v_t = Σ (∇w_i)^2 -> grows forever -> learning rate -> 0
RMSProp: v_t = beta*v_{t-1} + (1-beta)*(∇w_t)^2 -> controlled growth

Performance Characteristics
– Excellent for neural networks and non-convex problems
– Handles sparse data efficiently
– No major disadvantages (still competitive with ADAM)
– Was the gold standard before ADAM arrived

Modern Usage
– Second choice after ADAM for most problems
– First choice when ADAM doesn't perform well
– Particularly good for RNNs and memory-constrained environments
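The difference between the two accumulators is easy to see numerically. The sketch below feeds the same constant gradient to both rules; the function names and the choice of beta=0.9 and 100 steps are assumptions for illustration.

```python
def adagrad_accumulator(grads):
    """Running sum of squared gradients (never decreases)."""
    v, history = 0.0, []
    for g in grads:
        v += g * g
        history.append(v)
    return history

def rmsprop_accumulator(grads, beta=0.9):
    """Exponentially weighted average of squared gradients (stays bounded)."""
    v, history = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g * g
        history.append(v)
    return history

grads = [1.0] * 100                 # a constant gradient of 1 for 100 steps
v_adagrad = adagrad_accumulator(grads)
v_rmsprop = rmsprop_accumulator(grads)
```

Both optimizers divide the learning rate by the square root of v. Under Adagrad v keeps growing with every step, so the effective step size shrinks towards zero; under RMSProp v levels off near the recent squared-gradient magnitude, so learning continues.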

Chapter 38 Adam Optimizer Explained in Detail with Animations Optimizers in Deep Learning Part 5

38.1 Adam Optimizer Explained in Detail with Animations | Optimizers in Deep Learning Part 5

38.2 ADAM Optimizer: Complete Deep Learning Notes

38.2.1 Introduction & Overview

What is ADAM?
ADAM = Adaptive Moment Estimation

Feature       Description
Type          Gradient-based optimization algorithm
Popularity    Most widely used optimizer in deep learning
Applications  ANNs, CNNs, RNNs, and most neural architectures
Key Strength  Combines momentum and adaptive learning rates

Key Insight: ADAM is currently the most powerful optimization technique and is used in most deep learning implementations.

38.2.2 Background: Evolution of Optimization

Optimization Techniques Timeline
Figure 38.1: image

Comparison of Optimization Methods
Method    Speed  Oscillations  Sparse Data  Learning Rate Decay  Convergence
SGD/BGD   Slow   Minimal       Poor         Manual               Good but slow
Momentum  Fast   High          Poor         Manual               Fast but oscillates
NAG       Fast   Reduced       Poor         Manual               Good
Adagrad   Fast   Minimal       Excellent    Too aggressive       Stops learning
RMSprop   Fast   Minimal       Good         Controlled           Excellent
ADAM      Fast   Minimal       Excellent    Automatic            Best Overall

Problem-Solution Evolution
1 Batch Gradient Descent Problem
– Issue: Very slow convergence
– Solution: Momentum → Uses past gradients for current update
2 Momentum Problem
– Issue: High oscillations around minimum
– Solution: NAG (Nesterov Accelerated Gradient) → Dampens oscillations
3 Sparse Data Problem
– Issue: Poor performance on sparse features
– Solution: Adagrad → Adaptive learning rates per parameter
4 Adagrad Problem
– Issue: Learning rate becomes too small, stops learning
– Solution: RMSprop → Controls learning rate decay
5 Integration Opportunity
– Observation: Two successful concepts exist:
  – Momentum (velocity concept)
  – Adaptive learning rate decay
– Solution: ADAM → Combines both concepts

38.2.3 Mathematical Formulation

Core ADAM Equations
The ADAM algorithm uses the following mathematical formulation:

Weight Update Rule:
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \times \hat{m}_t$$

Momentum Estimation (1st Moment):
$$m_t = \beta_1 \times m_{t-1} + (1 - \beta_1) \times \nabla w_t$$

Velocity Estimation (2nd Moment):
$$v_t = \beta_2 \times v_{t-1} + (1 - \beta_2) \times (\nabla w_t)^2$$

Bias Correction:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Default Hyperparameters
Parameter       Symbol  Default Value  Purpose
Learning Rate   η       0.001          Step size control
Momentum Decay  β1      0.9            Controls momentum
Velocity Decay  β2      0.999          Controls adaptive learning
Epsilon         ε       1e-8           Numerical stability

38.2.4 Algorithm Components

ADAM Algorithm Breakdown
Step 1: Calculate First Moment (Momentum)
# Exponentially weighted average of gradients
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient

Step 2: Calculate Second Moment (Velocity)
# Exponentially weighted average of squared gradients
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2

Step 3: Bias Correction
# Correct for initialization bias
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)

Step 4: Parameter Update
# Update weights
w = w - learning_rate * m_hat / (sqrt(v_hat) + epsilon)

Why Bias Correction?
Problem: Initially, both m = 0 and v = 0
Effect: Creates bias towards zero in early iterations
Solution: Bias correction factors (1 - β1^t) and (1 - β2^t) offset this bias
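The four steps can be combined into one runnable function for a single scalar weight. This is a minimal sketch, not a library implementation: adam_step is a hypothetical name, the defaults follow the hyperparameter values quoted in these notes, and the toy objective f(w) = w^2 with learning rate 0.01 is an assumption chosen only to have a gradient to feed in.

```python
import math

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one ADAM update; t is the 1-based step counter."""
    m = beta1 * m + (1 - beta1) * grad           # Step 1: first moment
    v = beta2 * v + (1 - beta2) * grad * grad    # Step 2: second moment
    m_hat = m / (1 - beta1 ** t)                 # Step 3: bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)  # Step 4: parameter update
    return w, m, v

# First step from w = 1.0 on f(w) = w^2, whose gradient is 2w
w_first, m_first, v_first = adam_step(1.0, 2.0, 0.0, 0.0, t=1, lr=0.01)

# Keep stepping on the same toy objective
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, lr=0.01)
```

The first step shows why bias correction matters: with m and v both starting at zero, the corrected ratio m_hat / sqrt(v_hat) is close to the sign of the gradient, so the weight moves by roughly the learning rate instead of by a much smaller, zero-biased amount.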

38.2.5 Visual Understanding

ADAM Behavior Animation Analysis
Scenario                 ADAM Behavior             Comparison
Sparse Data              Direct descent to center  Better than Momentum's zigzag
Convergence Speed        Fastest convergence       Beats all previous methods
Oscillation Control      Minimal oscillations      Stable approach to minimum
Non-convex Optimization  Excellent performance     Ideal for neural networks
Performance Characteristics

Convergence Comparison Chart
Figure 38.2: image

38.2.6 Implementation Guidelines

Practical Usage Recommendations
First Choice Strategy:
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

# Start with ADAM - most cases
optimizer = Adam(learning_rate=0.001)

Alternative Options:
# If ADAM doesn't perform well
optimizer_rmsprop = RMSprop(learning_rate=0.001)
optimizer_momentum = SGD(learning_rate=0.01, momentum=0.9)

Hyperparameter Tuning Guide
Parameter      Typical Range  When to Adjust
Learning Rate  0.0001 - 0.01  Always tune first
β1 (Momentum)  0.8 - 0.95     For different momentum needs
β2 (Velocity)  0.99 - 0.999   For adaptive rate sensitivity
Epsilon        1e-8 - 1e-6    For numerical stability issues

Decision Framework
Figure 38.3: image

38.2.7 Performance Analysis

Why ADAM is Superior
Automatic Learning Rate Management:
– No manual learning rate scheduling needed
– Adaptive decay prevents overshooting
– Balances exploration vs exploitation
Robust to Hyperparameters:
– Default values work well in most cases
– Less sensitive to initial learning rate choice
– Consistent performance across problems
Memory Efficiency:
– Only stores first and second moment estimates
– O(p) memory complexity (p = parameters)
– Computationally efficient

Empirical Results Summary
Research Findings: Over the past 3-4 years, ADAM has consistently delivered better results across different types of problems compared to other optimizers.
Success Metrics:
– Faster convergence (typically 2-5x speedup)
– Better final performance
– More stable training
– Requires less hyperparameter tuning

38.2.8 Key Takeaways

Core Concepts to Remember
1. Combination: ADAM = Momentum + Adaptive Learning Rate
2. Mathematics: Uses both first and second moment estimates
3. Bias Correction: Essential for proper initialization
4. Default Choice: Start with ADAM for most deep learning problems
5. Flexibility: Can fall back to RMSprop or Momentum if needed

Best Practices
– Start with ADAM as your default optimizer
– Monitor convergence and compare with alternatives
– Tune learning rate first, other parameters later
– Use early stopping to prevent overfitting
– Experiment with different optimizers for specific problems

Part IX Hyperparameter Tuning

Chapter 39 Keras Tuner Hyperparameter Tuning a Neural Network

39.1 Keras Tuner | Hyperparameter Tuning a Neural Network

39.2 Hyperparameter Tuning with Keras Tuner - Complete Guide

39.2.1 Introduction
Problem Statement
When building neural networks, we face multiple decisions:
- How many hidden layers?
- How many neurons per layer?
- Which activation function?
- What batch size?
- Which optimizer?

Solution: Keras Tuner
Keras Tuner is one of the most famous hyperparameter tuning libraries; it helps automate the process of finding optimal hyperparameters.
39.2.2 Setup and Installation
Required Libraries
# Core libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner
import keras_tuner as kt

Installation
pip install keras-tuner
39.2.3 Dataset Preparation
Dataset: Pima Indians Diabetes
Feature                   Description                 Type
Pregnancies               Number of pregnancies       Numeric
Glucose                   Glucose concentration       Numeric
BloodPressure             Blood pressure              Numeric
SkinThickness             Skin thickness              Numeric
Insulin                   Insulin level               Numeric
BMI                       Body Mass Index             Numeric
DiabetesPedigreeFunction  Diabetes pedigree function  Numeric
Age                       Age                         Numeric
Outcome                   Diabetes (0/1)              Binary

Data Preprocessing Steps
# Load dataset
data = pd.read_csv('diabetes.csv')

# Separate features and target
X = data.iloc[:, :-1]  # All columns except last
y = data.iloc[:, -1]   # Last column (Outcome)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
39.2.4 Basic Model Building
Manual Approach (Before Tuning)
model = Sequential([
    Dense(32, activation='relu', input_dim=8),
    Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

Results Analysis
Approach          Accuracy   Issue
Manual            ~70%       Trial and error
Intuition-based   Variable   Time-consuming
Automated Tuning  Optimized  Systematic
39.2.5 Optimizer Selection
Step 1: Define Build Function
def build_model(hp):
    model = Sequential()

    # Fixed architecture for optimizer testing
    model.add(Dense(32, activation='relu', input_dim=8))
    model.add(Dense(1, activation='sigmoid'))

    # Hyperparameter: Optimizer selection
    optimizer = hp.Choice(
        'optimizer',
        values=['adam', 'rmsprop', 'sgd', 'adagrad']
    )

    model.compile(
        optimizer=optimizer,
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model
Step 2: Create Tuner Object
tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    directory='my_dir',
    project_name='optimizer_tuning'
)
Step 3: Search for Best Optimizer
tuner.search(
    X_train, y_train,
    epochs=10,
    validation_data=(X_test, y_test)
)

# Get best hyperparameters
best_params = tuner.get_best_hyperparameters()[0]
print(f"Best optimizer: {best_params.get('optimizer')}")
Optimizer Comparison Results
Optimizer  Validation Accuracy
RMSprop    0.538
Adam       0.650
SGD        0.570
Adagrad    0.650
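Conceptually, a random search samples a configuration, scores it, and remembers the best one. The sketch below shows that loop in plain Python. It is a simplified illustration, not how Keras Tuner is implemented: random_search and toy_objective are hypothetical names, and the objective is a made-up stand-in for "train the model and return validation accuracy".

```python
import random

def random_search(objective, space, max_trials=5, seed=0):
    """Sample max_trials configurations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(max_trials):
        # Draw one value per hyperparameter, like hp.Choice over a list
        params = {name: rng.choice(values) for name, values in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_objective(params):
    """Made-up score that peaks at 64 units; stands in for validation accuracy."""
    return -abs(params["units"] - 64)

space = {"units": list(range(8, 129, 8)),
         "optimizer": ["adam", "rmsprop", "sgd", "adagrad"]}
best_params, best_score = random_search(toy_objective, space, max_trials=5)
```

With only a handful of trials the search may never sample some options at all, so a small max_trials gives a best-of-what-was-tried answer rather than a full ranking of the search space.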
39.2.6 Number of Neurons Optimization
Hyperparameter: Units Selection
def build_model(hp):
    model = Sequential()

    # Variable number of units
    units = hp.Int('units', min_value=8, max_value=128, step=8)

    model.add(Dense(
        units=units,
        activation='relu',
        input_dim=8
    ))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer='rmsprop',  # Held fixed here; substitute the best optimizer from the previous step
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model

Units Testing Range
Figure 39.1: Mermaid diagram

Best Results
– Optimal Units: 120 neurons
– Validation Accuracy: Improved performance
– Pattern: More neurons generally better (up to a point)
39.2.7 Number of Layers Optimization
Dynamic Layer Creation
def build_model(hp):
    model = Sequential()

    # Variable number of layers
    num_layers = hp.Int('num_layers', min_value=1, max_value=10)

    for i in range(num_layers):
        if i == 0:
            # First layer with input dimension
            model.add(Dense(
                units=hp.Int(f'units_{i}', 8, 128, step=8),
                activation='relu',
                input_dim=8
            ))
        else:
            # Hidden layers
            model.add(Dense(
                units=hp.Int(f'units_{i}', 8, 128, step=8),
                activation='relu'
            ))

    # Output layer
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer='rmsprop',
        loss='binary_crosse


Common mistakes

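Ignoring how the tuned hyperparameters affect convergence.
Not monitoring the validation metric across trials.
Tuning on test data.

Interview checkpoints

Q: Keras Tuner in one sentence? A: It automates hyperparameter search and ranks configurations by a validation metric rather than the training metric.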
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Keras Tuner plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand keras tuner.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 31 — Training Curves

Day 31

Training Curves

Chapter 9. Multi Layer Perceptron MLP Intuition

Platform Features
Feature               Description                Benefit
Visual Interface      No coding required         Easy experimentation
Real-time Training    Watch learning process     Immediate feedback
Multiple Datasets     Various complexity levels  Progressive learning
Architecture Control  Modify layers/nodes        Hands-on understanding

Demo Results Summary
Architecture                   Activation  Result
2 input → 2 hidden → 1 output  Sigmoid     Failed
2 input → 4 hidden → 1 output  Sigmoid     Success
2 input → 4 hidden → 1 output  ReLU        Fast Success
XOR Problem Solution

Key Observations
Important Findings:
1. More hidden nodes = Better non-linear capability
2. ReLU activation = Faster convergence
3. Complex data needs deeper networks
4. Layer-by-layer visualization shows learning progression

9.1.7 Performance Comparison

Single Perceptron vs MLP
Aspect             Single Perceptron  Multi-Layer Perceptron
Decision Boundary  Linear only        Non-linear curves
XOR Problem        Cannot solve       Easily solved
Complex Data       Fails              Handles well
Training Time      Fast               Longer
Parameters         Few                Many

Why this matters

Training curves diagnose bias/variance — watch train-val gap.
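A rough way to read the train-val gap programmatically is sketched below. Everything in it is an assumption for illustration: the helper names, the thresholds gap_tol and high_loss, and the loss values, which are invented numbers rather than results from any run.

```python
def best_epoch(val_loss):
    """Return the 1-based epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1

def diagnose(train_loss, val_loss, gap_tol=0.1, high_loss=0.5):
    """Classify the final epoch using two hypothetical thresholds."""
    if train_loss[-1] > high_loss:
        return "underfitting"      # high bias: even the training loss is high
    if val_loss[-1] - train_loss[-1] > gap_tol:
        return "overfitting"       # high variance: validation lags training
    return "reasonable fit"

# Invented per-epoch losses, shaped like the lists in history.history
train_loss = [0.69, 0.50, 0.35, 0.22, 0.12]
val_loss = [0.70, 0.55, 0.48, 0.50, 0.58]

verdict = diagnose(train_loss, val_loss)
stop_at = best_epoch(val_loss)
```

In this made-up run the training loss keeps falling while the validation loss turns upward after the third epoch, which is the widening gap the section asks you to watch for.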


Common mistakes

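Ignoring what the training curves show about convergence.
Not monitoring for overfitting during training.
Tuning on test data.

Interview checkpoints

Q: Training Curves in one sentence? A: Plots of training and validation metrics per epoch that diagnose bias and variance; the gap between the two curves is the key signal.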
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Training Curves plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand training curves.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 32 — Hyperparameter Tuning

Day 32

Hyperparameter Tuning

Contents
21.1.3 Part 1: Hyperparameter Tuning
21.1.4 Types of Gradient Descent
21.1.5 Part 2: Common Deep Learning Problems
21.1.6 Solution Techniques Summary
21.1.7 Future Learning Roadmap
21.1.8 Key Takeaways
22 Early Stopping In Neural Networks End to End Deep Learning Course
22.1 Early Stopping in Neural Networks
22.1.1 Overview
22.1.2 Learning Objectives
22.1.3 The Problem: When to Stop Training?
22.1.4 What is Early Stopping?
22.1.5 Practical Implementation
22.1.6 Early Stopping Parameters
22.1.7 Training Flow with Early Stopping
22.1.8 Best Practices
22.1.9 Advanced Configuration
22.1.10 Real-World Benefits
22.1.11 Quick Start Checklist
22.1.12 Key Takeaways
23 Data Scaling in Neural Network Feature Scaling in ANN End to End Deep Learning Course
23.1 Deep Learning: Feature Scaling and Normalization - Detailed Notes
23.1.1 Problem Statement
23.1.2 Technical Intuition
23.1.3 Solutions: Feature Scaling Techniques
23.1.4 When to Use Which Technique?
23.1.5 Practical Implementation
23.1.6 Results Comparison
23.1.7 Neural Network Architecture Used
23.1.8 Key Insights and Best Practices
23.1.9 Summary and Takeaways
23.1.10 Practical Checklist
24 Dropout Layer in Deep Learning Dropouts in ANN End to End Deep Learning
24.0.1 Detailed Notes on Dropout in Neural Networks
25 Dropout Layers in ANN Code Example Regression Classification
25.0.1 Data and Model Setup
25.0.2 Overfitting Observation
25.0.3 Dropout Implementation
25.0.4 Classification Example
25.0.5 Practical Tips for Dropout
25.0.6 Limitations and Challenges
25.0.7 Visual Summary

Why this matters

Hyperparameter tuning is experimental science — one change at a time.

35.1. Nesterov Accelerated Gradient (NAG) Explained in Detail | Animations | Optimizers in Deep Learning

35.1.10 Hyperparameter Guidelines

Parameter          Typical Range  Recommendation
Learning Rate (η)  0.001 - 0.1    Start with 0.01
Momentum (β)       0.8 - 0.99     0.9 for most cases
Decay Factor       -              Adjust based on oscillations


Common mistakes

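Ignoring the effect of hyperparameters on convergence.
Not tracking each search configuration and its result.
Tuning on test data.

Interview checkpoints

Q: Hyperparameter Tuning in one sentence? A: It is an experimental process: change one setting at a time and compare the results on validation data.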
Q: Debug step? A: Plot gradients per layer.

Practice

Basic: Explain Hyperparameter Tuning plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand hyperparameter tuning.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 33 — Gradient Flow Project

Day 33

Gradient Flow Project

Contents
6.1 Detailed Notes: Loss Functions in Perceptron
6.1.1 Recap of Perceptron
6.1.2 Problems with Perceptron Trick
6.1.3 Introduction to Loss Functions
6.1.4 Perceptron Loss Function
6.1.5 Geometric Intuition of Loss Function
6.1.6 Sklearn -> Perceptron Loss Function
6.1.7 Sklearn Implementation of Perceptron Loss Function
6.1.8 Gradient Descent
6.1.9 Code Implementation
6.1.10 Flexibility of Perceptron Model
6.1.11 Key Takeaways
7 Problem with perceptron
7.0.1 Problem with Perceptron
7.0.2 Code Implementation Details
7.0.3 Code Structure Breakdown
7.0.4 TensorFlow Playground Demonstration
7.0.5 Key Observations
7.0.6 Learning Outcomes
7.0.7 Educational Value
7.0.8 Code Access & Usage
7.0.9 Video Overview
III Multi-Layer Perceptrons
8 MLP Notation
8.1 Multi-Layer Perceptron (MLP) Notation
8.1.1 Learning Objectives
8.1.2 Neural Network Architecture Setup
8.1.3 Trainable Parameters Calculation
8.1.4 Color Coding System for Weights
8.1.5 Weight Notation System
8.1.6 Bias Notation System
8.1.7 Output Notation System
8.1.8 Practice Exercise
8.1.9 Next Video Preview
8.1.10 Key Takeaways
9 Multi Layer Perceptron MLP Intuition
9.1 Multi-Layer Perceptron: MLP Intuition
9.1.1 The Core Problem
9.1.2 Perceptron with Sigmoid Activation
9.1.3 Multi-Layer Perceptron Construction

Why this matters

The gradient flow project ties together vanishing gradients, weight initialization, and optimizers.

17.1.13 Practice Recommendations

Hands-On Exercises: https://developers-dot-devsite-v2-prod.appspot.com/machine-learning/crash-course/backprop-scroll

1. Interactive Exploration
– Use TensorFlow Playground extensively
– Try different learning rates and observe effects
– Experiment with various data patterns
2. Implementation Practice
– Code backpropagation from scratch
– Verify gradients numerically
– Compare with framework implementations
3. Visualization Projects
– Create animated gradient descent
– Plot loss landscapes in 2D/3D
– Show parameter evolution over time
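One way to approach the "verify gradients numerically" exercise is a central finite difference check. The sketch below assumes a hypothetical single sigmoid neuron with squared-error loss. The model, the input values, and the function names are illustrative and do not come from the notes.

```python
import math

def loss(w, x=1.5, target=0.2):
    """Squared error of one sigmoid neuron (hypothetical toy model)."""
    y = 1.0 / (1.0 + math.exp(-w * x))
    return (y - target) ** 2

def analytic_grad(w, x=1.5, target=0.2):
    """dL/dw via the chain rule: dL/dy * dy/dz * dz/dw."""
    y = 1.0 / (1.0 + math.exp(-w * x))
    return 2.0 * (y - target) * y * (1.0 - y) * x

def numeric_grad(f, w, eps=1e-5):
    """Central finite difference approximation of df/dw."""
    return (f(w + eps) - f(w - eps)) / (2.0 * eps)

w = 0.7
diff = abs(analytic_grad(w) - numeric_grad(loss, w))
print(f"analytic={analytic_grad(w):.8f} numeric={numeric_grad(loss, w):.8f} diff={diff:.2e}")
```

A tiny difference suggests the hand-derived gradient is right; a large one usually points to a missing chain rule factor.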

Chapter 18 MLP Memoization Complete Deep Learning Playlist

18.1 Memoization in Backpropagation

Optimizing Neural Network Training with Computer Science Techniques

18.1.1 Part 1: What is Memoization?

Wikipedia Definition
“In computing, memoization is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same input occurs again.”

Simple Explanation
If you have written a program where the same output is being calculated repeatedly, then you store the result when you calculate the output for the first time. When you need the output again for the same input, you don’t calculate it again but take the stored result and return it.

Trade-off Analysis
– Benefit: Your program becomes faster and takes less time
– Cost: You have to spend a little space to store things
– Result: This is a very famous technique in computer science

Applications in Computer Science
This technique is used in a branch of programming called Dynamic Programming.

18.1.2 Part 2: Fibonacci Sequence Example

Problem Statement
Everyone knows the Fibonacci series, where any term’s value is obtained by adding the previous two terms. The goal is to create a function called fibonacci that receives n as input and returns the nth term of the Fibonacci series.

Naive Implementation (Inefficient)

```python
def fibonacci(n):
    # Base cases: the notes define the first two terms as 1
    if n == 0 or n == 1:
        return 1
    # Each call spawns two more calls, so the work grows exponentially
    return fibonacci(n - 1) + fibonacci(n - 2)
```

Performance Analysis

Time Complexity Issues
Demonstrating the performance problems:
– For input 36: takes several seconds
– For input 38: takes even more time (around a minute)
– For input 40: takes a few minutes, about 2.6 times the work of input 38, because each increment of n multiplies the number of calls by roughly 1.6
– For input 50: takes hours by the same scaling
– This is a highly inefficient approach

Redundant Calculations Problem
The exponential time complexity comes from the number of redundant calculations. To calculate fibonacci(5):
– fibonacci(3) is calculated 2 times
– fibonacci(2) is calculated 3 times
– fibonacci(1) is calculated multiple times
– fibonacci(0) is calculated multiple times
To calculate just one value, many repeated calculations are required.

Optimized Implementation (With Memoization)

```python
def fibonacci_memo(n, memo=None):
    # Create the cache on the first call; a mutable default argument
    # would be shared silently across unrelated calls
    if memo is None:
        memo = {}
    if n in memo:
        return memo[n]  # reuse a stored result
    if n == 0 or n == 1:
        return 1
    memo[n] = fibonacci_memo(n - 1, memo) + fibonacci_memo(n - 2, memo)
    return memo[n]
```

Performance Improvement
After implementing memoization:
– For input 38: takes very little time
– For input 100: still takes very little time (about the same)

Memoization Summary
Memoization is a computer science technique where you spend space to reduce time, making programs faster.
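The redundancy described above can be made concrete by counting function calls instead of timing them. This is a small sketch: the helper names `count_naive_calls` and `count_memo_calls` are hypothetical, and the base cases follow the notes (terms 0 and 1 both equal 1).

```python
def count_naive_calls(n):
    """Return (value, number of calls) for the naive recursion."""
    if n == 0 or n == 1:
        return 1, 1
    a, calls_a = count_naive_calls(n - 1)
    b, calls_b = count_naive_calls(n - 2)
    return a + b, calls_a + calls_b + 1

def count_memo_calls(n):
    """Return (value, number of calls) for the memoized recursion."""
    memo = {}
    calls = 0

    def fib(k):
        nonlocal calls
        calls += 1
        if k in memo:
            return memo[k]
        if k == 0 or k == 1:
            return 1
        memo[k] = fib(k - 1) + fib(k - 2)
        return memo[k]

    return fib(n), calls

value_naive, calls_naive = count_naive_calls(20)
value_memo, calls_memo = count_memo_calls(20)
print(value_naive, calls_naive, calls_memo)
```

For n = 20 both versions return the same value, but the naive version makes 21,891 calls against 39 for the memoized one.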

18.1.3 Part 3: Multi-Layer Neural Networks

Network Architecture Complexity
Until now the notes have worked only with networks that have a single hidden layer. This part looks at neural networks with multiple hidden layers, which increases complexity slightly.

Example Network Structure
The example network has four layers:
– Input layer
– Two hidden layers
– Output layer

Figure 18.1: image

Counting weights plus biases layer by layer gives 3×3 + 3 = 12, 3×2 + 2 = 8, and 2×1 + 1 = 3, for a total of 23 trainable parameters.

Derivative Calculation Challenge
Target: calculate $\frac{\partial L}{\partial W^{1}_{11}}$.
To update parameters, you need the derivatives of the loss with respect to all weights and biases. For the first-layer weights, these derivatives become slightly complex and tricky.

Chain Rule Application
The loss $L$ depends on the output $\hat{y}$, and $\hat{y}$ depends on $O_{21}$. The calculation requires:

$$\frac{\partial L}{\partial W^{2}_{11}} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial O_{21}} \times \frac{\partial O_{21}}{\partial W^{2}_{11}}$$
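The parameter count can be checked with a few lines of code. The layer sizes [3, 3, 2, 1] are inferred from the products quoted in the notes (3×3, 3×2, 2×1) rather than stated explicitly, and the function name is hypothetical.

```python
def count_trainable_params(layer_sizes):
    """Weights (n_in * n_out) plus biases (n_out) for each dense layer."""
    per_layer = [n_in * n_out + n_out
                 for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
    return per_layer, sum(per_layer)

# Assumed architecture: 3 inputs, hidden layers of 3 and 2 nodes, 1 output
per_layer, total = count_trainable_params([3, 3, 2, 1])
print(per_layer, total)
```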

Figure 18.2: image

Multiple Path Problem

Complex Routing Issue
The main problem occurs when a node's output travels along two paths. When you change $W^{1}_{11}$, the output of that node changes, and that output then moves forward through two routes.

Mathematical Solution for Multiple Paths
When a quantity influences the result through more than one intermediate function, you track both parts and add their contributions:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial f(x)} \times \frac{\partial f(x)}{\partial x} + \frac{\partial L}{\partial g(x)} \times \frac{\partial g(x)}{\partial x}$$

Specific Network Calculation
For the specific network:
– $W^{1}_{11}$ affects $O_{11}$
– $O_{11}$ goes to $O_{21}$ and also affects the loss through another path via $W^{2}_{21}$
– The complete derivative requires tracking both paths
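The two-path rule can be sanity-checked on a toy function. In this sketch the choices f(x) = x², g(x) = sin(x) and L = f · g are hypothetical. The notes only state the general formula.

```python
import math

def f(x):
    return x ** 2

def g(x):
    return math.sin(x)

def L(x):
    return f(x) * g(x)  # L depends on x through both f and g

def dL_dx_two_paths(x):
    dL_df, df_dx = g(x), 2.0 * x       # path through f
    dL_dg, dg_dx = f(x), math.cos(x)   # path through g
    return dL_df * df_dx + dL_dg * dg_dx

x = 1.2
eps = 1e-6
numeric = (L(x + eps) - L(x - eps)) / (2.0 * eps)  # finite difference reference
two_path = dL_dx_two_paths(x)
print(two_path, numeric)
```

Dropping either path leaves a derivative that no longer matches the finite difference reference.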

18.1.4 Part 4: Complex Derivative Calculations

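Complete Mathematical Expression

First Path Calculation
The contribution that reaches the loss through $O_{21}$:

$$\frac{\partial L}{\partial O_{21}} \times \frac{\partial O_{21}}{\partial W^{1}_{11}}$$

Second Path Calculation
The contribution that reaches the loss through $O_{22}$:

$$\frac{\partial L}{\partial O_{22}} \times \frac{\partial O_{22}}{\partial W^{1}_{11}}$$

Complete Expression
Expanding each factor with the chain rule and adding the two paths, the final answer becomes:

$$\frac{\partial L}{\partial W^{1}_{11}} = \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial O_{21}} \times \frac{\partial O_{21}}{\partial O_{11}} \times \frac{\partial O_{11}}{\partial W^{1}_{11}} + \frac{\partial L}{\partial \hat{y}} \times \frac{\partial \hat{y}}{\partial O_{22}} \times \frac{\partial O_{22}}{\partial O_{11}} \times \frac{\partial O_{11}}{\partial W^{1}_{11}}$$

Part 5: Memoization Application in Backpropagation
Both terms share the factors $\frac{\partial L}{\partial \hat{y}}$ and $\frac{\partial O_{11}}{\partial W^{1}_{11}}$, so these can be computed once, stored, and reused.
Backpropagation = Chain Rule + Memoization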
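A minimal sketch of the idea on a reduced network: one input feeds $O_{11}$, which feeds both $O_{21}$ and $O_{22}$, which combine into $\hat{y}$. The scalar weights a, b, c, d, the sigmoid activations, the linear output, and the squared-error loss are all assumptions made for illustration. The notes do not fix these details.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w1, x, a=0.5, b=-0.8, c=1.2, d=0.7):
    o11 = sigmoid(w1 * x)          # first hidden layer node
    o21 = sigmoid(a * o11)         # second hidden layer, node 1
    o22 = sigmoid(b * o11)         # second hidden layer, node 2
    y_hat = c * o21 + d * o22      # linear output (assumption)
    return o11, o21, o22, y_hat

def loss(w1, x, target):
    return (forward(w1, x)[3] - target) ** 2

def grad_w1(w1, x, target, a=0.5, b=-0.8, c=1.2, d=0.7):
    o11, o21, o22, y_hat = forward(w1, x, a, b, c, d)
    # Factors shared by both paths are computed once and stored
    cache = {
        "dL_dyhat": 2.0 * (y_hat - target),
        "dO11_dW": o11 * (1.0 - o11) * x,
    }
    path_21 = c * o21 * (1.0 - o21) * a   # dyhat/dO21 * dO21/dO11
    path_22 = d * o22 * (1.0 - o22) * b   # dyhat/dO22 * dO22/dO11
    return cache["dL_dyhat"] * (path_21 + path_22) * cache["dO11_dW"]

w1, x, target = 0.4, 1.5, 0.3
eps = 1e-6
numeric = (loss(w1 + eps, x, target) - loss(w1 - eps, x, target)) / (2.0 * eps)
analytic = grad_w1(w1, x, target)
print(analytic, numeric)
```

Factoring out the cached terms gives the same value as summing the two full products in the complete expression, with fewer multiplications.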

18.1.5 Key Takeaways

Essential Understanding
1. Mathematical Foundation
– Chain rule enables gradient calculation in deep networks
– Complexity grows with network depth
– Multiple paths create redundant calculations
2. Computer Science Optimization
– Memoization eliminates redundant calculations
– Time-space trade-off: memory for speed
– Critical for making deep learning practical
3. Hybrid Approach
– Modern backpropagation combines mathematics with computer science
– Libraries automatically implement these optimizations
– Understanding both components is valuable

18.1.6 Conclusion

Two-Part Learning
The notes summarize two things that were learned:
1. As you go deeper in neural networks, calculating derivatives takes more time and the formula for each derivative becomes more complex
2. With many layers, the same derivatives would have to be recalculated repeatedly, but memoization eliminates this redundancy

Optimization Strategy
To optimize the overall algorithm, memoization is used. It is a technique from dynamic programming in computer science, and combining it with the chain rule makes the gradient computation efficient.

Part VI Gradient Problems in Neural Networks

Chapter 19 Gradient Descent in Neural Network: Batch vs Stochastic vs Mini-Batch

19.1 Introduction to Gradient Descent

Gradient Descent is the most popular algorithm for optimization and one of the most common ways to optimize neural networks. It is an optimization algorithm used to find the optimal solution.
– Goal: Minimize the loss function (objective function)
– Method: Update parameters in the opposite direction of the gradient of the objective function
– Process: Move step by step, like going downhill, to get to the minimum point
– Learning Rate: Controls the step size towards the minimum
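Written as a formula, the method and learning rate bullets correspond to the standard update rule below. The worked step uses a hypothetical loss $L(w) = (w - 3)^2$, a starting point $w = 0$ and $\eta = 0.1$, none of which come from the notes.

```latex
% Update rule: step against the gradient, scaled by the learning rate \eta
w \leftarrow w - \eta \, \frac{\partial L}{\partial w}

% One worked step for the assumed loss L(w) = (w - 3)^2 at w = 0, \eta = 0.1
\frac{\partial L}{\partial w}\Big|_{w=0} = 2(0 - 3) = -6,
\qquad
w_{\text{new}} = 0 - 0.1 \times (-6) = 0.6

% The loss drops from L(0) = 9 to L(0.6) = (0.6 - 3)^2 = 5.76
```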

19.2 Neural Network Context: Back Propagation Algorithm

19.2.1 Back Propagation Process

1. Decide number of epochs
2. For each epoch:
– Take one data point at a time
– Calculate prediction for that point
– Calculate loss
– Update weights using equations
3. Calculate average loss when epoch completes
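The steps above can be sketched as a plain loop. To stay self-contained, the example swaps the neural network for a hypothetical one-weight linear model y = w·x + b with squared-error loss, so the "equations" are just the two gradient updates shown. The data and function name are made up for illustration.

```python
def train_per_point(xs, ys, epochs=200, lr=0.05):
    """Update after every single data point; record average loss per epoch."""
    w, b = 0.0, 0.0
    history = []
    for _ in range(epochs):                    # 1. fixed number of epochs
        total_loss = 0.0
        for x, y in zip(xs, ys):               # 2. one data point at a time
            error = (w * x + b) - y            #    prediction minus target
            total_loss += error ** 2           #    squared-error loss
            w -= lr * 2.0 * error * x          #    weight update
            b -= lr * 2.0 * error              #    bias update
        history.append(total_loss / len(xs))   # 3. average loss for the epoch
    return w, b, history

xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]                 # generated by y = 2x + 1
w, b, history = train_per_point(xs, ys)
print(round(w, 3), round(b, 3), history[0], history[-1])
```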

Figure 19.1: Back Propagation Process in Neural Network

19.3 Three Types of Gradient Descent

The three flavors differ in how much data is used to compute the gradient of the objective function:
– Batch Gradient Descent: Uses the entire dataset at once
– Stochastic Gradient Descent (SGD): Uses a single random data point
– Mini-Batch Gradient Descent: Uses a small batch of data points

Aspect | Batch GD | Mini-Batch GD | Stochastic GD
Data Points Used | Entire dataset at once | Small batches (e.g. 32, 64, 128) | Single data point
Batch Size | batch_size = total_rows | batch_size = small_number | batch_size = 1
Updates per Epoch | 1 update per epoch | Number of batches per epoch | Number of rows per epoch
Speed per Epoch | Fastest | Medium | Slowest
Convergence Speed | Slowest | Medium | Fastest
Memory Usage | Highest (entire dataset) | Medium (batch size) | Lowest (single point)
Loss Behavior | Very stable and smooth | Moderately stable | Very unstable and noisy
Vectorization | Fully utilized | Partially utilized | Not utilized
Randomization | No shuffling needed | Shuffle before each epoch | Random point selection
Solution Accuracy | Exact solution | Good approximation | Approximate solution
Local Minima Escape | Poor (can get stuck) | Moderate | Excellent (random jumps)
Real-world Usage | Rare (small datasets only) | Most common | Less common
Implementation | Simple (single loop) | Moderate (batch handling) | Simple (point-by-point)

Table 19.1: Comparison of the Three Gradient Descent Methods

Note: in Keras `model.fit`, leaving batch_size as None falls back to the default of 32, so full-batch training requires setting batch_size to the number of rows explicitly.
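The three columns of the table can be seen as one loop with different batch sizes. The following sketch assumes a hypothetical linear model y = w·x + b on noise-free toy data. The function name `gradient_descent`, the learning rate, and the data are illustrative, not taken from the notes.

```python
import random

def gradient_descent(xs, ys, batch_size, epochs=300, lr=0.2, seed=0):
    """Fit y = w*x + b; batch_size selects the flavor of gradient descent."""
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    w, b, updates = 0.0, 0.0, 0
    for _ in range(epochs):
        rng.shuffle(indices)  # reshuffle rows every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the squared-error gradient over the rows in this batch
            grad_w = sum(2.0 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2.0 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
            updates += 1
    return w, b, updates

xs = [i / 4 for i in range(8)]        # 8 rows: 0.0, 0.25, ..., 1.75
ys = [2.0 * x + 1.0 for x in xs]      # hypothetical noise-free target
results = {
    "batch": gradient_descent(xs, ys, batch_size=len(xs)),
    "mini-batch": gradient_descent(xs, ys, batch_size=4),
    "stochastic": gradient_descent(xs, ys, batch_size=1),
}
for name, (w, b, updates) in results.items():
    print(f"{name}: w={w:.4f} b={b:.4f} updates={updates}")
```

With 8 rows and 300 epochs, the update counts are 300, 600 and 2,400, matching the "Updates per Epoch" row. The shuffle is harmless for the full-batch case, where row order does not change the averaged gradient.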

19.4 Performance Metrics Comparison

Scenario | Batch GD | Mini-Batch GD | Stochastic GD
Time to complete 10 epochs | ∼0.5 seconds | ∼5 seconds | ∼10 seconds
Updates for 320 rows, 10 epochs | 10 updates | ∼100 updates (batch_32) | 3,200 updates
Epochs needed for convergence | 50–100 epochs | 20–50 epochs | 10–20 epochs
Final validation accuracy
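The update counts in the second row follow from simple arithmetic: batches per epoch times epochs. A short check, with a hypothetical helper name:

```python
import math

def total_updates(n_rows, batch_size, epochs):
    """Number of weight updates: batches per epoch times number of epochs."""
    return math.ceil(n_rows / batch_size) * epochs

n_rows, epochs = 320, 10
counts = {
    "batch": total_updates(n_rows, n_rows, epochs),
    "mini-batch (32)": total_updates(n_rows, 32, epochs),
    "stochastic": total_updates(n_rows, 1, epochs),
}
print(counts)
```

`math.ceil` accounts for a final partial batch when the row count is not a multiple of the batch size.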

Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

Ignoring how initialization and optimizer choices affect convergence.
Not monitoring per-layer gradients during training.
Tuning on test data.

Interview checkpoints

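Q: Gradient Flow Project in one sentence? A: A hands-on check of how gradients propagate through the layers, tying together vanishing gradients, initialization, and optimizers.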
Q: Debug step? A: Plot gradients per layer.
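To make the debug step concrete, here is a minimal sketch that prints gradient magnitudes per layer. It assumes a hypothetical chain of one-unit sigmoid layers with all weights set to 1 and takes the loss to be the final activation. A real project would read the gradients from the framework instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_gradients(x, weights):
    """Backpropagate through a chain of one-unit sigmoid layers.

    Returns |dL/dw_k| for each layer k, with L equal to the final activation.
    """
    # Forward pass: store every activation for reuse in the backward pass
    activations = [x]
    for w in weights:
        activations.append(sigmoid(w * activations[-1]))

    grads = [0.0] * len(weights)
    upstream = 1.0                                 # dL/d(output) when L = output
    for k in reversed(range(len(weights))):
        a_out, a_in = activations[k + 1], activations[k]
        local = a_out * (1.0 - a_out)              # derivative of sigmoid at layer k
        grads[k] = abs(upstream * local * a_in)    # gradient for this layer's weight
        upstream = upstream * local * weights[k]   # pass gradient to the layer below
    return grads

grads = layer_gradients(x=1.0, weights=[1.0] * 6)
for k, g in enumerate(grads, start=1):
    print(f"layer {k}: |grad| = {g:.2e}")
```

In this setup the magnitudes shrink steadily toward the first layer, which is the vanishing gradient pattern a per-layer plot would reveal.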

Practice

Basic: Explain Gradient Flow Project plainly.
Intermediate: Experiment on MNIST with one change.
Advanced: Document before/after metrics.

Recap

Understand gradient flow project.
Link to loss curves and init.
Prepare for regularization module.

Next: Day 34 — Overfitting in DL

← Module 2: MLPs Module 4: Performance & Regularization →