Module 5 · 100 Days of DL

Module 5: Batch Normalization & Optimizers

Stabilize training with Batch Normalization, then compare how Momentum, Nesterov, AdaGrad, RMSProp, and Adam update parameters and adapt their step sizes.

⏱ 32 Min Read • Author: GenAIWallah Team • Updated: May 2026

Day 45

Momentum SGD

Why this matters

Momentum accumulates velocity — smooths noisy gradients.
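A minimal sketch of the velocity idea, assuming one common form of the update (`v = beta * v - lr * g`, then `w = w + v`). The function name `momentum_sgd`, the toy loss `(w - 3)^2`, and the noise level are illustrative assumptions, not taken from the notes.

```python
import random

def momentum_sgd(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Run SGD with momentum on a scalar parameter and return the final w."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w)
        v = beta * v - lr * g   # velocity: decaying sum of past gradient steps
        w = w + v               # move along the accumulated velocity
    return w

# Hypothetical toy loss L(w) = (w - 3)^2, gradient 2 * (w - 3)
clean_grad = lambda w: 2.0 * (w - 3.0)

# Same gradient with Gaussian noise added, to mimic mini-batch noise
rng = random.Random(0)
noisy_grad = lambda w: clean_grad(w) + rng.gauss(0.0, 0.5)

w_clean = momentum_sgd(clean_grad, w0=0.0)
w_noisy = momentum_sgd(noisy_grad, w0=0.0)
print(w_clean, w_noisy)
```

Because each step averages over recent gradients, a single noisy gradient moves `w` less than it would under plain SGD with the same learning rate.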

**Batch Normalization** accelerates convergence by standardizing each layer's activations across a mini-batch of size $m$:

$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$$

Here $\gamma$ (scale) and $\beta$ (shift) are learnable parameters, and $\epsilon$ is a small constant that keeps the division numerically stable. The technique was introduced to reduce internal covariate shift, and it stabilizes training.
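The Batch Normalization formulas translate almost line by line into code. This is a minimal sketch for a mini-batch of scalar activations in training mode only; the function name `batch_norm` and the sample values are illustrative assumptions.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch of scalar activations, then scale and shift."""
    m = len(xs)
    mu = sum(xs) / m                                # batch mean
    var = sum((x - mu) ** 2 for x in xs) / m        # biased (1/m) batch variance
    x_hat = [(x - mu) / math.sqrt(var + eps) for x in xs]
    return [gamma * xh + beta for xh in x_hat]      # learnable scale and shift

batch = [1.0, 2.0, 3.0, 4.0]
out = batch_norm(batch)
out_mean = sum(out) / len(out)
out_var = sum((o - out_mean) ** 2 for o in out) / len(out)
print(out_mean, out_var)
```

With `gamma=1` and `beta=0` the output has mean close to 0 and variance close to 1. At inference time, frameworks typically replace the batch statistics with running averages, which this sketch leaves out.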

Common mistakes

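Adding momentum without retuning the base learning rate.
Changing the batch size without rechecking how the velocity term reacts to the new gradient noise.
Comparing optimizers on single runs with different seeds, which leads to false conclusions.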

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe Momentum SGD.
Intermediate: Train same net with SGD vs Adam.
Advanced: Plot the loss curve of each optimizer at equal compute.

Recap

Momentum SGD changes optimization dynamics.
Fair comparisons need same budget.
Module 6: convolutions next.

Next: Day 46 — Nesterov Momentum

Day 46

Nesterov Momentum

Why this matters

Nesterov looks ahead before gradient step — faster convergence often.
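The look-ahead can be sketched as follows, assuming the common formulation in which the gradient is evaluated at `w + beta * v` rather than at `w`. The name `nesterov_sgd` and the toy loss `(w - 3)^2` are illustrative assumptions.

```python
def nesterov_sgd(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Nesterov momentum on a scalar parameter; returns the final w."""
    w, v = w0, 0.0
    for _ in range(steps):
        g = grad_fn(w + beta * v)   # gradient at the look-ahead point
        v = beta * v - lr * g       # update the velocity with that gradient
        w = w + v
    return w

# Hypothetical toy loss L(w) = (w - 3)^2
w_final = nesterov_sgd(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

The only change from plain momentum is where the gradient is evaluated. Evaluating it at the look-ahead point lets the velocity be corrected before an overshoot happens rather than after.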

37.2.11 Key Insights Covered:

The Core Innovation

1. AdaGrad: $v_t = \sum_{i=1}^{t} (\nabla w_i)^2$ → grows forever → learning rate → 0
2. RMSProp: $v_t = \beta v_{t-1} + (1 - \beta)(\nabla w_t)^2$ → controlled growth

Performance Characteristics

– Excellent for neural networks and non-convex problems
– Handles sparse data efficiently
– No major disadvantages (still competitive with ADAM)
– Was the gold standard before ADAM arrived

Modern Usage

– Second choice after ADAM for most problems
– First choice when ADAM doesn't perform well
– Particularly good for RNNs and memory-constrained environments
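The difference between the two accumulators is easy to see numerically. This is a minimal sketch, assuming a constant gradient of 1.0 as a hypothetical worst case for AdaGrad; `effective_lr` is an illustrative helper, not a library function.

```python
import math

def effective_lr(kind, grads, lr=0.01, beta=0.9, eps=1e-8):
    """Return the per-step effective learning rate lr / (sqrt(v_t) + eps)."""
    v, rates = 0.0, []
    for g in grads:
        if kind == "adagrad":
            v = v + g * g                       # running sum: never shrinks
        else:  # "rmsprop"
            v = beta * v + (1 - beta) * g * g   # decaying average: stays bounded
        rates.append(lr / (math.sqrt(v) + eps))
    return rates

grads = [1.0] * 1000   # hypothetical constant gradient
ada = effective_lr("adagrad", grads)
rms = effective_lr("rmsprop", grads)
print(ada[-1], rms[-1])
```

After 1000 steps the AdaGrad rate has shrunk by a factor of about `sqrt(1000)`, while the RMSProp rate settles near the base learning rate.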

Chapter 38: Adam Optimizer Explained in Detail with Animations | Optimizers in Deep Learning Part 5

38.1 Adam Optimizer Explained in Detail with Animations | Optimizers in Deep Learning Part 5

38.2 ADAM Optimizer: Complete Deep Learning Notes

38.2.1 Introduction & Overview

What is ADAM? ADAM = Adaptive Moment Estimation.

| Feature | Description |
|---|---|
| Type | Gradient-based optimization algorithm |
| Popularity | Most widely used optimizer in deep learning |
| Applications | ANNs, CNNs, RNNs, and most neural architectures |
| Key Strength | Combines momentum and adaptive learning rates |

Key Insight: ADAM is one of the most effective general-purpose optimization techniques and is the default in many deep learning implementations.

38.2.2 Background: Evolution of Optimization

Figure 38.1: Optimization Techniques Timeline

Comparison of Optimization Methods

| Method | Speed | Oscillations | Sparse Data | Learning Rate Decay | Convergence |
|---|---|---|---|---|---|
| SGD/BGD | Slow | Minimal | Poor | Manual | Good but slow |
| Momentum | Fast | High | Poor | Manual | Fast but oscillates |
| NAG | Fast | Reduced | Poor | Manual | Good |
| Adagrad | Fast | Minimal | Excellent | Too aggressive | Stops learning |
| RMSprop | Fast | Minimal | Good | Controlled | Excellent |
| ADAM | Fast | Minimal | Excellent | Automatic | Best Overall |

Problem-Solution Evolution

1. Batch Gradient Descent Problem
   – Issue: Very slow convergence
   – Solution: Momentum → Uses past gradients for current update
2. Momentum Problem
   – Issue: High oscillations around minimum
   – Solution: NAG (Nesterov Accelerated Gradient) → Dampens oscillations
3. Sparse Data Problem
   – Issue: Poor performance on sparse features
   – Solution: Adagrad → Adaptive learning rates per parameter
4. Adagrad Problem
   – Issue: Learning rate becomes too small, stops learning
   – Solution: RMSprop → Controls learning rate decay
5. Integration Opportunity
   – Observation: Two successful concepts exist: Momentum (velocity concept) and adaptive learning rate decay
   – Solution: ADAM → Combines both concepts

38.2.3 Mathematical Formulation

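Core ADAM Equations

The ADAM algorithm uses the following mathematical formulation:

Weight Update Rule:

$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$

Momentum Estimation (1st Moment):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla w_t$$

Velocity Estimation (2nd Moment):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla w_t)^2$$

Bias Correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Default Hyperparameters

| Parameter | Symbol | Default Value | Purpose |
|---|---|---|---|
| Learning Rate | $\eta$ | 0.001 | Step size control |
| Momentum Decay | $\beta_1$ | 0.9 | Controls momentum |
| Velocity Decay | $\beta_2$ | 0.999 | Controls adaptive learning |
| Epsilon | $\epsilon$ | 1e-8 | Numerical stability |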

38.2.4 Algorithm Components

ADAM Algorithm Breakdown

Step 1: Calculate First Moment (Momentum)

```python
# Exponentially weighted average of gradients
m = beta1 * m + (1 - beta1) * gradient
```

Step 2: Calculate Second Moment (Velocity)

```python
# Exponentially weighted average of squared gradients
v = beta2 * v + (1 - beta2) * gradient ** 2
```

Step 3: Bias Correction

```python
# Correct for initialization bias (t is the step count, starting at 1)
m_hat = m / (1 - beta1 ** t)
v_hat = v / (1 - beta2 ** t)
```

Step 4: Parameter Update

```python
# Update weights
w = w - learning_rate * m_hat / (v_hat ** 0.5 + epsilon)
```

Why Bias Correction?

– Problem: Initially, both m = 0 and v = 0
– Effect: Creates bias towards zero in early iterations
– Solution: Dividing by the bias correction factors $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ offsets this bias
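The four steps can be combined into a single loop. This is a minimal sketch for a scalar parameter, not a reference implementation. The function name `adam_minimize`, the toy loss `(w - 3)^2`, and the larger learning rate of 0.1 (chosen so the toy problem converges quickly) are illustrative assumptions.

```python
import math

def adam_minimize(grad_fn, w0, lr=0.001, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Run Adam on a scalar parameter and return the final value."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g         # Step 1: first moment
        v = beta2 * v + (1 - beta2) * g * g     # Step 2: second moment
        m_hat = m / (1 - beta1 ** t)            # Step 3: bias correction
        v_hat = v / (1 - beta2 ** t)
        w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # Step 4: update
    return w

# Hypothetical toy loss L(w) = (w - 3)^2, gradient 2 * (w - 3)
w_final = adam_minimize(lambda w: 2.0 * (w - 3.0), w0=0.0, lr=0.1)
print(w_final)
```

Note that `t` starts at 1. With `t = 0` both correction factors would be zero and the division would fail.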

38.2.5 Visual Understanding

ADAM Behavior Animation Analysis

| Scenario | ADAM Behavior | Comparison |
|---|---|---|
| Sparse Data | Direct descent to center | Better than Momentum's zigzag |
| Convergence Speed | Fastest convergence | Beats all previous methods |
| Oscillation Control | Minimal oscillations | Stable approach to minimum |
| Non-convex Optimization | Excellent performance | Ideal for neural networks |

Performance Characteristics:

Figure 38.2: Convergence Comparison Chart

38.2.6 Implementation Guidelines

Practical Usage Recommendations

First Choice Strategy:

```python
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

# Start with ADAM - most cases
optimizer = Adam(learning_rate=0.001)
```

Alternative Options:

```python
# If ADAM doesn't perform well
optimizer_rmsprop = RMSprop(learning_rate=0.001)
optimizer_momentum = SGD(learning_rate=0.01, momentum=0.9)
```

Hyperparameter Tuning Guide

| Parameter | Typical Range | When to Adjust |
|---|---|---|
| Learning Rate | 0.0001 - 0.01 | Always tune first |
| β1 (Momentum) | 0.8 - 0.95 | For different momentum needs |
| β2 (Velocity) | 0.99 - 0.999 | For adaptive rate sensitivity |
| Epsilon | 1e-8 - 1e-6 | For numerical stability issues |

Figure 38.3: Decision Framework

38.2.7 Performance Analysis

Why ADAM is Superior

Automatic Learning Rate Management:

– No manual learning rate scheduling needed
– Adaptive decay prevents overshooting
– Balances exploration vs exploitation

Robust to Hyperparameters:

– Default values work well in most cases
– Less sensitive to initial learning rate choice
– Consistent performance across problems

Memory Efficiency:

– Only stores first and second moment estimates
– O(p) memory complexity (p = parameters)
– Computationally efficient

Empirical Results Summary

Research Findings: Over the past 3-4 years, ADAM has consistently delivered better results across different types of problems compared to other optimizers.

Success Metrics:

– Faster convergence (typically 2-5x speedup)
– Better final performance
– More stable training
– Requires less hyperparameter tuning

38.2.8 Key Takeaways

Core Concepts to Remember

1. Combination: ADAM = Momentum + Adaptive Learning Rate
2. Mathematics: Uses both first and second moment estimates
3. Bias Correction: Essential for proper initialization
4. Default Choice: Start with ADAM for most deep learning problems
5. Flexibility: Can fall back to RMSprop or Momentum if needed

Best Practices

– Start with ADAM as your default optimizer
– Monitor convergence and compare with alternatives
– Tune learning rate first, other parameters later
– Use early stopping to prevent overfitting
– Experiment with different optimizers for specific problems

Part IX: Hyperparameter Tuning

Chapter 39: Keras Tuner | Hyperparameter Tuning a Neural Network

39.1 Keras Tuner | Hyperparameter Tuning a Neural Network

39.2 Hyperparameter Tuning with Keras Tuner - Complete Guide

39.2.1 Introduction

Problem Statement

When building neural networks, we face multiple decisions:

– How many hidden layers?
– How many neurons per layer?
– Which activation function?
– What batch size?
– Which optimizer?

Solution: Keras Tuner

Keras Tuner is one of the best-known hyperparameter tuning libraries. It helps automate the process of finding optimal hyperparameters.
39.2.2 Setup and Installation

Required Libraries

```python
# Core libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner
import keras_tuner as kt
```

Installation

```bash
pip install keras-tuner
```
39.2.3 Dataset Preparation

Dataset: Pima Indians Diabetes

| Feature | Description | Type |
|---|---|---|
| Pregnancies | Number of pregnancies | Numeric |
| Glucose | Glucose concentration | Numeric |
| BloodPressure | Blood pressure | Numeric |
| SkinThickness | Skin thickness | Numeric |
| Insulin | Insulin level | Numeric |
| BMI | Body Mass Index | Numeric |
| DiabetesPedigreeFunction | Diabetes pedigree function | Numeric |
| Age | Age | Numeric |
| Outcome | Diabetes (0/1) | Binary |

Data Preprocessing Steps

```python
# Load dataset
data = pd.read_csv('diabetes.csv')

# Separate features and target
X = data.iloc[:, :-1]  # All columns except last
y = data.iloc[:, -1]   # Last column (Outcome)

# Scale features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```
39.2.4 Basic Model Building

Manual Approach (Before Tuning)

```python
model = Sequential([
    Dense(32, activation='relu', input_dim=8),
    Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
```

Results Analysis

| Approach | Accuracy | Issue |
|---|---|---|
| Manual | ~70% | Trial and error |
| Intuition-based | Variable | Time-consuming |
| Automated Tuning | Optimized | Systematic |
39.2.5 Optimizer Selection

Step 1: Define Build Function

```python
def build_model(hp):
    model = Sequential()

    # Fixed architecture for optimizer testing
    model.add(Dense(32, activation='relu', input_dim=8))
    model.add(Dense(1, activation='sigmoid'))

    # Hyperparameter: Optimizer selection
    optimizer = hp.Choice(
        'optimizer',
        values=['adam', 'rmsprop', 'sgd', 'adagrad']
    )

    model.compile(
        optimizer=optimizer,
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model
```

Step 2: Create Tuner Object

```python
tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    directory='my_dir',
    project_name='optimizer_tuning'
)
```

Step 3: Search for Best Optimizer

```python
tuner.search(
    X_train, y_train,
    epochs=10,
    validation_data=(X_test, y_test)
)

# Get best hyperparameters
best_params = tuner.get_best_hyperparameters()[0]
print(f"Best optimizer: {best_params.get('optimizer')}")
```

Optimizer Comparison Results

| Optimizer | Validation Accuracy |
|---|---|
| RMSprop | 0.538 |
| Adam | 0.650 |
| SGD | 0.570 |
| Adagrad | 0.650 |
39.2.6 Number of Neurons Optimization

Hyperparameter: Units Selection

```python
def build_model(hp):
    model = Sequential()

    # Variable number of units
    units = hp.Int('units', min_value=8, max_value=128, step=8)

    model.add(Dense(
        units=units,
        activation='relu',
        input_dim=8
    ))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer='rmsprop',  # Use best from previous step
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model
```

Figure 39.1: Units Testing Range (Mermaid diagram)

Best Results

– Optimal Units: 120 neurons
– Validation Accuracy: Improved performance
– Pattern: More neurons generally better (up to a point)
39.2.7 Number of Layers Optimization

Dynamic Layer Creation

```python
def build_model(hp):
    model = Sequential()

    # Variable number of layers
    num_layers = hp.Int('num_layers', min_value=1, max_value=10)

    for i in range(num_layers):
        if i == 0:
            # First layer with input dimension
            model.add(Dense(
                units=hp.Int(f'units_{i}', 8, 128, step=8),
                activation='relu',
                input_dim=8
            ))
        else:
            # Hidden layers
            model.add(Dense(
                units=hp.Int(f'units_{i}', 8, 128, step=8),
                activation='relu'
            ))

    # Output layer
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer='rmsprop',
        loss='binary_crosse
```

Modern optimization techniques improve on plain gradient descent by reusing gradient history, either to smooth the update direction or to scale the learning rate per parameter:

Momentum: Accumulates a velocity from past gradients, which damps oscillations and helps updates carry through flat regions and shallow local minima.
RMSProp: Adapts the learning rate per parameter using an exponentially decaying average of squared gradients.
Adam (Adaptive Moment Estimation): Combines Momentum (first moment estimate) and RMSProp (second moment estimate).

Common mistakes

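Enabling Nesterov momentum without retuning the base learning rate.
Changing the batch size without rechecking how the look-ahead step reacts to the new gradient noise.
Comparing optimizers on single runs with different seeds, which leads to false conclusions.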

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe Nesterov Momentum.
Intermediate: Train same net with SGD vs Adam.
Advanced: Plot the loss curve of each optimizer at equal compute.

Recap

Nesterov Momentum changes optimization dynamics.
Fair comparisons need same budget.
Module 6: convolutions next.

Next: Day 47 — AdaGrad

Day 47

AdaGrad

Why this matters

AdaGrad adapts per-parameter LR — good for sparse features.
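A minimal sketch of why this helps with sparse features. It assumes the standard AdaGrad accumulator (a per-parameter running sum of squared gradients). The helper `adagrad_step` and the gradient pattern (one dense feature, one feature active every 10th step) are illustrative assumptions.

```python
import math

def adagrad_step(w, g, cache, lr=0.1, eps=1e-8):
    """One AdaGrad update; cache holds per-parameter sums of squared gradients."""
    for i in range(len(w)):
        cache[i] += g[i] ** 2
        w[i] -= lr * g[i] / (math.sqrt(cache[i]) + eps)
    return w, cache

w, cache = [0.0, 0.0], [0.0, 0.0]
for t in range(100):
    # Hypothetical gradients: feature 0 is dense, feature 1 fires every 10th step
    g = [1.0, 1.0 if t % 10 == 0 else 0.0]
    w, cache = adagrad_step(w, g, cache)

# Effective step size each parameter would get on its next nonzero gradient
step_dense = 0.1 / (math.sqrt(cache[0]) + 1e-8)
step_sparse = 0.1 / (math.sqrt(cache[1]) + 1e-8)
print(cache, step_dense, step_sparse)
```

The rarely updated parameter keeps a larger effective step size because its accumulator grows more slowly. The same accumulator never shrinks, which is the decay problem RMSProp addresses.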

Adam updates parameters using both first-moment vector $m_t$ and second-moment vector $v_t$: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ With bias-correction terms: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$ $$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
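A short worked example shows what the bias correction does on the very first step. The initial values $m_0 = v_0 = 0$ and the gradient $g_1 = 0.5$ are illustrative assumptions, with $\beta_1 = 0.9$ and $\beta_2 = 0.999$:

$$m_1 = (1 - \beta_1) g_1 = 0.1 \times 0.5 = 0.05, \quad v_1 = (1 - \beta_2) g_1^2 = 0.001 \times 0.25 = 0.00025$$

$$\hat{m}_1 = \frac{0.05}{1 - 0.9} = 0.5 = g_1, \quad \hat{v}_1 = \frac{0.00025}{1 - 0.999} = 0.25 = g_1^2$$

$$w_2 = w_1 - \frac{\eta}{\sqrt{0.25} + \epsilon} \times 0.5 \approx w_1 - \eta$$

Without the correction, $m_1$ would be ten times smaller than the actual gradient. With it, the first step has magnitude close to $\eta$ regardless of the gradient's scale.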

Adam Optimization Step Flow

Common mistakes

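Using AdaGrad without tuning the base learning rate.
Changing the batch size without rechecking how often sparse features receive nonzero gradients.
Comparing optimizers on single runs with different seeds, which leads to false conclusions.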

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe AdaGrad.
Intermediate: Train same net with SGD vs Adam.
Advanced: Plot the loss curve of each optimizer at equal compute.

Recap

AdaGrad changes optimization dynamics.
Fair comparisons need same budget.
Module 6: convolutions next.

Next: Day 48 — RMSProp

Day 48

RMSProp

Why this matters

RMSProp fixes AdaGrad decay — popular before Adam.

Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

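Using RMSProp without tuning the base learning rate.
Changing the batch size without rechecking how the moving average of squared gradients reacts to the new gradient noise.
Comparing optimizers on single runs with different seeds, which leads to false conclusions.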

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe RMSProp.
Intermediate: Train same net with SGD vs Adam.
Advanced: Plot the loss curve of each optimizer at equal compute.

Recap

RMSProp changes optimization dynamics.
Fair comparisons need same budget.
Module 6: convolutions next.

Next: Day 49 — Adam Optimizer

Day 49

Adam Optimizer

Why this matters

Adam combines momentum + adaptive LR — default for many tasks.

Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

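Using Adam without tuning the base learning rate.
Changing the batch size without rechecking how the moment estimates react to the new gradient noise.
Comparing optimizers on single runs with different seeds, which leads to false conclusions.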

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe Adam Optimizer.
Intermediate: Train same net with SGD vs Adam.
Advanced: Plot the loss curve of each optimizer at equal compute.

Recap

Adam Optimizer changes optimization dynamics.
Fair comparisons need same budget.
Module 6: convolutions next.

Next: Day 50 — AdamW

Day 50

AdamW

When to Use Which Optimizer?

| Scenario | Recommended Optimizer | Reason |
|---|---|---|
| Sparse Data | AdaGrad | Per-parameter adaptation |
| Computer Vision | SGD with Momentum | Well-tested, reliable |
| NLP/Transformers | Adam/AdamW | Handles varying gradients |
| Online Learning | Stochastic GD | Single sample updates |
| Research/Experimentation | Adam | Good default choice |

32.1.7 Key Takeaways

Essential Points

1. Optimizers are crucial for training neural networks efficiently
2. Learning rate selection is one of the most important hyperparameters
3. Modern optimizers solve many limitations of vanilla gradient descent
4. Adam is often a good default choice for most applications
5. No single optimizer works best for all problems

Why this matters

AdamW decouples weight decay — better generalization than Adam+L2.
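The difference can be sketched for a single scalar weight. This assumes the usual description of the two variants: with L2 regularization the decay term is added to the gradient and passes through the moment estimates, while AdamW subtracts the decay directly from the weight. The function `adam_like_step` and its `decoupled` flag are illustrative assumptions, not a library API.

```python
import math

def adam_like_step(w, g, state, lr=0.001, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, decoupled=True):
    """One Adam step with either decoupled decay (AdamW) or L2 in the gradient."""
    m, v, t = state
    t += 1
    if not decoupled:
        g = g + weight_decay * w            # Adam + L2: decay enters the moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new -= lr * weight_decay * w      # AdamW: decay applied to the weight itself
    return w_new, (m, v, t)

# Zero loss gradient isolates the effect of the decay term
w_adamw, _ = adam_like_step(1.0, 0.0, (0.0, 0.0, 0), decoupled=True)
w_l2, _ = adam_like_step(1.0, 0.0, (0.0, 0.0, 0), decoupled=False)
print(w_adamw, w_l2)
```

With a zero loss gradient, the decoupled step shrinks the weight by `lr * weight_decay * w`. The L2 variant moves it by roughly `lr`, because the adaptive scaling normalizes the decay term along with the gradient.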

31.1.10 Summary & Key Takeaways

Core Benefits Recap

Benefit | Impact | Explanation
Speed | 2-14x faster | Higher learning rates possible
Stability | Much improved | Reduces internal covariate shift
Regularization | Mild effect | Batch statistics add noise
Robustness | High | Less sensitive to initialization

When to Use Batch Normalization
- Deep networks (> 3 layers)
- Computer vision tasks
- When using high learning rates
- Large batch sizes available
- Training from scratch

When to Avoid
- Very small batch sizes (< 16)
- Online learning (batch size = 1)
- Some RNN architectures
- When training time is critical
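The batch-size caveats above are easier to see in code. The following is a minimal sketch of the training-mode forward pass only (per-feature normalization over the batch, then scale and shift); the function name is an assumption for illustration, and the running statistics used at inference time are deliberately left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch axis, then scale by gamma and shift by beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))   # 64 samples, 4 features
gamma, beta = np.ones(4), np.zeros(4)

out = batch_norm_train(x, gamma, beta)
single = batch_norm_train(x[:1], gamma, beta)      # batch size = 1

print(out.mean(axis=0), out.std(axis=0))  # close to 0 and 1 per feature
print(single)                             # all zeros: the input signal is erased
```

With a batch of one, every activation equals its own batch mean, so the normalized output collapses to `beta`. That is one way to see why the notes advise against batch normalization for online learning.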

Chapter 32 Optimizers in Deep Learning Part 1 Complete Deep Learning Course

32.1 Optimizers in Deep Learning | Part 1 | Complete Deep Learning Course

32.1.1 Introduction to Optimizers

What are Optimizers?
Optimizers are algorithms that adjust the parameters of neural networks to minimize the loss function and improve model performance.

Why Do We Need Optimizers?

Need | Description | Impact
Speed up training | Reduce time to convergence | High
Find optimal parameters | Locate global minimum | Critical
Handle complex loss surfaces | Navigate non-convex functions | Essential
Adaptive learning | Adjust to different data patterns | Important


32.1.2 Role of Optimizers

The Optimization Process
Figure 32.1: image

Mathematical Foundation
The core update rule for gradient descent:

$$w_{t+1} = w_t - \eta \frac{\partial L}{\partial w}$$

Where:
- $w_t$ = weights at time $t$
- $\eta$ = learning rate
- $\frac{\partial L}{\partial w}$ = gradient of the loss with respect to the weights

32.1.3 Types of Gradient Descent

Comparison Table

Type | Update Frequency | Batch Size | Advantages | Disadvantages
Batch GD | After full dataset | All samples | Stable convergence | Slow, memory intensive
Stochastic GD | After each sample | 1 sample | Fast, online learning | Noisy updates
Mini-batch GD | After mini-batch | 32-512 samples | Balanced approach | Hyperparameter tuning

Visual Comparison
Figure 32.2: image

Update Rules Comparison

1. Batch Gradient Descent

    for epoch in range(num_epochs):
        gradients = compute_gradients(entire_dataset)
        weights = weights - learning_rate * gradients

2. Stochastic Gradient Descent

    for epoch in range(num_epochs):
        for sample in dataset:
            gradient = compute_gradient(sample)
            weights = weights - learning_rate * gradient

3. Mini-batch Gradient Descent

    for epoch in range(num_epochs):
        for batch in mini_batches:
            gradients = compute_gradients(batch)
            weights = weights - learning_rate * gradients
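The three loops above differ only in how many samples feed each update, so a single runnable sketch can cover all of them by varying `batch_size`. The toy linear-regression data, the `gradient` and `train` helpers, and the hyperparameters below are assumptions chosen for illustration, not part of the course notes.

```python
import numpy as np

def gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) with respect to w."""
    return X.T @ (X @ w - y) / len(y)

def train(X, y, batch_size, lr=0.1, num_epochs=50, seed=0):
    """Gradient descent where batch_size selects batch, stochastic, or mini-batch mode."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    updates = 0
    for _ in range(num_epochs):
        order = rng.permutation(n)                 # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            w = w - lr * gradient(w, X[idx], y[idx])
            updates += 1
    return w, updates

rng = np.random.default_rng(42)
X = rng.normal(size=(64, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w                                     # noiseless targets

w_batch, n_batch = train(X, y, batch_size=len(y))  # batch GD
w_sgd, n_sgd = train(X, y, batch_size=1)           # stochastic GD
w_mini, n_mini = train(X, y, batch_size=16)        # mini-batch GD

print(n_batch, n_sgd, n_mini)  # 50, 3200, and 200 updates for the same 50 epochs
```

The update counts make the trade-off in the table visible: for the same number of epochs, stochastic GD performs 64 times as many (noisier) updates as batch GD on this 64-sample dataset.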

32.1.4 Challenges with Traditional Optimizers

1. Learning Rate Selection

Learning Rate | Effect | Visualization
Too small | Slow convergence | Painfully slow
Too large | Overshooting/divergence | Unstable
Just right | Optimal convergence | Perfect

The Goldilocks Problem

2. Learning Rate Scheduling
Problem: Pre-defined schedules don't adapt to data

    # Common scheduling strategies (lr0 is the initial learning rate)
    strategies = {
        "Step Decay": "lr = lr * 0.1 every 30 epochs",
        "Exponential": "lr = lr0 * exp(-decay * epoch)",
        "Cosine Annealing": "lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total))",
    }

3. Same Learning Rate for All Parameters
Issue: Different parameters may need different learning rates
Figure 32.3: image
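The scheduling strategies listed above are given as strings; the sketch below turns them into small functions so the formulas can be evaluated. The function names and the default `decay` value are assumptions for illustration, and `lr0` denotes the initial learning rate.

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, decay=0.05):
    """Smooth exponential decay from the initial learning rate."""
    return lr0 * math.exp(-decay * epoch)

def cosine_annealing(lr_min, lr_max, epoch, total):
    """Half-cosine from lr_max at epoch 0 down to lr_min at epoch == total."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total))

for epoch in (0, 30, 60, 90):
    print(
        epoch,
        step_decay(0.1, epoch),
        exponential_decay(0.1, epoch),
        cosine_annealing(1e-4, 0.1, epoch, total=90),
    )
```

All three depend only on the epoch index, which is the limitation the notes point out: none of them reacts to what the loss or the gradients are actually doing.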

Local Minima Problem
Figure 32.4: image (Visualization of Local vs Global Minima)

32.1.5 Modern Optimization Algorithms

Overview of Advanced Optimizers

Optimizer | Key Innovation | Best Use Case
Momentum | Velocity accumulation | Smooth loss surfaces
AdaGrad | Adaptive learning rates | Sparse data
NAG | Look-ahead gradient | Faster convergence
RMSprop | Running average of squared gradients | Non-stationary objectives
Adam | Momentum + adaptive LR | General purpose

Mathematical Formulations

1. Momentum

$$v_t = \beta v_{t-1} + \eta \nabla_w L$$
$$w_{t+1} = w_t - v_t$$

2. AdaGrad

$$g_t = g_{t-1} + (\nabla_w L)^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{g_t} + \epsilon} \nabla_w L$$

3. RMSprop

$$v_t = \beta v_{t-1} + (1 - \beta)(\nabla_w L)^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \nabla_w L$$

4. Adam

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_w L$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_w L)^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
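The four update rules above translate almost line for line into NumPy. The sketch below is one possible transcription; the function names, the `state` dictionary, and the default hyperparameters are assumptions for illustration. It also assumes $\epsilon$ is added outside the square root, as common library implementations do.

```python
import numpy as np

def momentum_step(w, grad, state, lr=0.01, beta=0.9):
    state["v"] = beta * state.get("v", 0.0) + lr * grad
    return w - state["v"]

def adagrad_step(w, grad, state, lr=0.1, eps=1e-8):
    state["g"] = state.get("g", 0.0) + grad ** 2          # sum of squared gradients
    return w - lr / (np.sqrt(state["g"]) + eps) * grad

def rmsprop_step(w, grad, state, lr=0.01, beta=0.9, eps=1e-8):
    state["v"] = beta * state.get("v", 0.0) + (1 - beta) * grad ** 2
    return w - lr / (np.sqrt(state["v"]) + eps) * grad

def adam_step(w, grad, state, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    state["m"] = b1 * state.get("m", 0.0) + (1 - b1) * grad
    state["v"] = b2 * state.get("v", 0.0) + (1 - b2) * grad ** 2
    m_hat = state["m"] / (1 - b1 ** t)                    # bias-corrected first moment
    v_hat = state["v"] / (1 - b2 ** t)                    # bias-corrected second moment
    return w - lr / (np.sqrt(v_hat) + eps) * m_hat

# One step on L(w) = 0.5 * ||w||^2, whose gradient is simply w.
w0 = np.array([1.0, -2.0])
w_momentum = momentum_step(w0, w0, {})
w_adagrad = adagrad_step(w0, w0, {})
w_rmsprop = rmsprop_step(w0, w0, {})
w_adam = adam_step(w0, w0, {})
print(w_momentum, w_adagrad, w_rmsprop, w_adam)
```

On the first step, AdaGrad and Adam move every coordinate by about `lr` regardless of the gradient's size, whereas momentum moves each coordinate in proportion to its gradient. That is the per-parameter adaptation the table refers to.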

32.1.6 Practical Implementation

Python

TensorFlow/Keras Example

    import tensorflow as tf

    # Different optimizers in Keras
    optimizers = {
        'sgd': tf.keras.optimizers.SGD(learning_rate=0.01),
        'momentum': tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
        'adagrad': tf.keras.optimizers.Adagrad(learning_rate=0.01),
        'rmsprop': tf.keras.optimizers.RMSprop(learning_rate=0.001),
        'adam': tf.keras.optimizers.Adam(learning_rate=0.001),
    }

    # Compile model with optimizer (`model` is an already-built tf.keras.Model)
    model.compile(
        optimizer=optimizers['adam'],
        loss='categorical_crossentropy',
        metrics=['accuracy'],
    )

PyTorch Example

    import torch.optim as optim

    # Different optimizers in PyTorch (`model` is an already-built torch.nn.Module)
    optimizers = {
        'sgd': optim.SGD(model.parameters(), lr=0.01),
        'momentum': optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
        'adagrad': optim.Adagrad(model.parameters(), lr=0.01),
        'rmsprop': optim.RMSprop(model.parameters(), lr=0.001),
        'adam': optim.Adam(model.parameters(), lr=0.001),
    }


Chapter 33 Exponentially Weighted Moving Average or Exponential Weighted Average Deep Learning

33.1 Exponentially Weighted Moving Average or Exponential Weighted Average | Deep Learning

33.2 SGD with Momentum Optimization

33.2.1 Introduction

Momentum is a crucial optimization technique in deep learning that accelerates gradient descent by accumulating velocity from past gradients. It's particularly effective for:
- Speeding up convergence
- Escaping local minima
- Navigating elongated valleys in loss landscapes

Key Insight
Momentum works like a ball rolling down a hill: it accumulates velocity in consistent directions and dampens oscillations in inconsistent directions.

33.2.2 Understanding Graph Representations

Three Types of Visualizations

Graph Type | Dimension | Purpose | Visual Representation
2D Loss Plot | Loss vs single parameter | Simple optimization view | Parabolic curve
3D Surface Plot | Loss vs two parameters | Complete loss landscape | Mountain/valley view
Contour Plot | 2D projection of 3D | Top-down view | Concentric circles

Visual Interpretation Guide
Figure 33.1: image
- Yellow/Orange = High altitude (high loss)
- Blue/Purple = Low altitude (low loss)
- Circular contours = Well-conditioned optimization
- Elongated contours = Ill-conditioned optimization

33.2.3 Convex vs Non-Convex Optimization

Comparison Table

Aspect | Convex Optimization | Non-Convex Optimization
Shape | Bowl-like | Multiple valleys
Minima | Single global minimum | Multiple local minima
Challenges | Relatively simple | Complex navigation
Convergence | Guaranteed | Not guaranteed

Three Major Problems in Non-Convex Optimization

1. Local Minima
- Problem: Algorithm gets stuck in suboptimal solutions
- Visual: Small valleys that trap the optimizer
- Impact: Poor model performance

2. Saddle Points
- Problem: Flat regions with mixed curvature
- Visual: Areas that curve up in one direction, down in another
- Impact: Extremely slow convergence

3. High Curvature
- Problem: Sharp turns in loss landscape
- Visual: Narrow valleys or ridges
- Impact: Oscillations and instability

33.2.4 Why Momentum?

Problems with Vanilla Gradient Descent
Figure 33.2: image

Momentum Solutions

Problem | How Momentum Helps
Consistent gradients | Accelerates in consistent directions
Inconsistent gradients | Dampens oscillations
Local minima | Builds up speed to escape
Flat regions | Maintains velocity through plateaus
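A small experiment can illustrate the acceleration claim. The sketch below minimizes an elongated quadratic valley with and without momentum, using the update $v_t = \beta v_{t-1} + \eta \nabla_w L$, $w_{t+1} = w_t - v_t$. The loss function, starting point, and hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def run(beta, lr=0.02, steps=100):
    """Minimize L(w) = 0.5 * (w0^2 + 50 * w1^2); beta = 0 gives vanilla gradient descent."""
    scales = np.array([1.0, 50.0])    # curvature differs 50x between the two directions
    w = np.array([10.0, 1.0])
    v = np.zeros_like(w)
    for _ in range(steps):
        grad = scales * w
        v = beta * v + lr * grad      # accumulate velocity from past gradients
        w = w - v
    return 0.5 * float(np.sum(scales * w ** 2))

loss_gd = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(f"vanilla GD: {loss_gd:.6f}   momentum: {loss_momentum:.6f}")
```

The learning rate is capped by the steep direction, so vanilla gradient descent crawls along the flat one. With the same learning rate, the accumulated velocity lets momentum cover far more ground along the flat direction in the same 100 steps.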

Content sourced from CampusX Deep Learning notes (PDF). Run merge script for full body.

Common mistakes

Using AdamW without tuning the base learning rate.
Ignoring how decoupled weight decay interacts with batch size.
Comparing runs that use different seeds, which leads to false conclusions.

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe AdamW.
Intermediate: Train the same network with SGD and with Adam.
Advanced: Plot the loss of each optimizer at equal compute.

Recap

AdamW changes optimization dynamics.
Fair comparisons need the same budget.
Module 6: convolutions next.

Next: Day 51 — Optimizer Comparison

Day 51

Optimizer Comparison


Why this matters

Compare optimizers on the same data, the same compute budget, and the same random seed.
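One way to hold data, budget, and seed fixed is to route every optimizer through the same function. The sketch below does this for plain SGD and SGD with momentum on a synthetic regression problem; `make_data`, `run`, and all hyperparameter values are hypothetical names and choices made for illustration.

```python
import numpy as np

def make_data(seed, n=128, d=3):
    """Synthetic regression data; the same seed always yields the same dataset."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    return X, y

def run(optimizer, seed, steps=200, batch_size=32, lr=0.05):
    """Train for a fixed number of steps; only the update rule differs between runs."""
    X, y = make_data(seed)
    rng = np.random.default_rng(seed)     # identical mini-batch order for every optimizer
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)
    losses = []
    for _ in range(steps):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        if optimizer == "sgd":
            w = w - lr * grad
        elif optimizer == "momentum":
            v = 0.9 * v + lr * grad
            w = w - v
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
        losses.append(float(np.mean((X @ w - y) ** 2)))   # full-data MSE after the update
    return losses

results = {name: run(name, seed=0) for name in ("sgd", "momentum")}
for name, losses in results.items():
    print(f"{name:>8}: final MSE = {losses[-1]:.4f}")
```

Because the seed fixes both the dataset and the batch order, any difference between the two loss curves comes from the update rule alone. Repeating the comparison over several seeds guards against drawing conclusions from a single lucky run.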

19.10.1 Speed Comparison

- Given the same number of epochs: Batch GD is faster
- Reason: Batch GD performs fewer updates (10 epochs = 10 updates, versus 10 × n updates in SGD, where n is the number of training samples)

Chapter 19. Gradient Descent in Neural Network: Batch vs Stochastic vs Mini-Batch



Common mistakes

Comparing optimizers without tuning the base learning rate for each one.
Ignoring how the random seed interacts with batch size.
Comparing runs that use different seeds, which leads to false conclusions.

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe how to compare optimizers fairly.
Intermediate: Train the same network with SGD and with Adam.
Advanced: Plot the loss of each optimizer at equal compute.

Recap

The choice of optimizer changes optimization dynamics.
Fair comparisons need the same budget.
Module 6: convolutions next.

Next: Day 52 — Warmup Schedules

Day 52

Warmup Schedules

32.1.4 Challenges with Traditional Optimizers

Why this matters

Warmup gradually increases the learning rate over the first training steps; transformers often need it.
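As a concrete illustration, here is a minimal sketch of a linear warmup schedule. The cosine decay that follows the warmup phase is an assumption added to make the example complete (it is one common pairing, not something the notes prescribe), and the function name and default values are likewise illustrative.

```python
import math

def lr_with_warmup(step, base_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Linear warmup to base_lr, then cosine decay to zero (the decay part is an assumption)."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

for step in (0, 499, 999, 5500, 10000):
    print(step, lr_with_warmup(step))
```

During the first `warmup_steps` updates the learning rate climbs from a small value to `base_lr`, so the earliest and noisiest gradient estimates only produce small parameter changes.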


Common mistakes

Using warmup without tuning the base learning rate.
Ignoring how the warmup length for transformers interacts with batch size.
Comparing runs that use different seeds, which leads to false conclusions.

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe warmup schedules.
Intermediate: Train the same network with SGD and with Adam.
Advanced: Plot the loss of each optimizer at equal compute.

Recap

Warmup schedules change optimization dynamics.
Fair comparisons need the same budget.
Module 6: convolutions next.

Next: Day 53 — Optimizer Project

Day 53

Optimizer Project

1.3. Artificial Neural Networks (ANN)

1.3.3 MLP [Multi-layer perceptron]

• Intuition of MLP
• MLP Notation
• Prediction in MLP

1.3.4 Training an MLP [Most used Algorithm]

• Gradient Descent
• Backpropagation

Python

1.3.5 Practical with Keras
• CPU vs GPU
• Installation
• Example 1 - Regression using Keras
• Example 2 - Classification using Keras
1.3.6 How to improve an ANN
• Vanishing Gradients
• Exploding Gradients
• Dropouts
• Regularization
• Weight Initialization
• Optimizers
• Gradient Checking and Clipping
• Batch Normalization
• Hyperparameter Tuning
1.3.7 Advanced Topics
• Callbacks
• Tensorboard
• Pretrained Models
• Keras Functional API
• Saving and Loading a Keras model
• Building a Streamlit Application
1.3.8 Project
• End-to-End Final Project
• AWS deployment
Convolutional Neural Networks (CNN)
• Convolution operations and filters
• Pooling layers and techniques
• Feature maps and visualization

Why this matters

Optimizer project: for each optimizer, log the convergence speed and the final validation metric.
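One simple way to record both quantities is to keep the per-epoch validation loss and summarize it afterwards. In the sketch below, the helper name `summarize_run`, the threshold, and the loss values in `history` are all hypothetical placeholders for illustration; they are not measured results.

```python
def summarize_run(val_losses, threshold):
    """Return (first epoch at which val loss <= threshold, or None; final val loss)."""
    epochs_to_threshold = next(
        (epoch for epoch, loss in enumerate(val_losses, start=1) if loss <= threshold),
        None,
    )
    return epochs_to_threshold, val_losses[-1]

# Placeholder histories: replace with the validation losses logged during training.
history = {
    "run_a": [0.90, 0.50, 0.30, 0.25],
    "run_b": [0.90, 0.80, 0.70, 0.65],
}

for name, val_losses in history.items():
    speed, final = summarize_run(val_losses, threshold=0.4)
    print(f"{name}: epochs to reach 0.4 = {speed}, final val loss = {final}")
```

Reporting "epochs to reach a threshold" alongside the final metric keeps the two questions separate: an optimizer can converge quickly and still end at a worse validation score.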

33.3.11 Summary & Best Practices

When to Use Momentum

Use when:
- Training deep neural networks
- Dealing with elongated loss valleys
- Need to escape local minima
- Gradients are relatively consistent

Avoid when:
- Near convergence (consider reducing β)
- Extremely noisy gradients
- Need precise convergence

Python code

    import numpy as np
    import matplotlib.pyplot as plt
    from typing import List, Tuple


    class ExponentialMovingAverage:
        """
        Exponential Moving Average (EMA) implementation from scratch.

        Mathematical Formula: V_t = beta * V_{t-1} + (1-beta) * theta_t
        Where:
        - V_t: EMA value at time t
        - beta: smoothing factor (0 < beta < 1)
        - theta_t: actual data point at time t
        """

        def __init__(self, beta: float = 0.8):
            """
            Initialize EMA calculator.

            Args:
                beta (float): Smoothing factor (0 < beta < 1)
                    Higher beta = more smoothing (slower response)
                    Lower beta = less smoothing (faster response)
            """
            if not 0 < beta < 1:
                raise ValueError("Beta must be between 0 and 1 (exclusive)")

            self.beta = beta
            self.alpha = 1 - beta  # Weight for new observations
            self.ema_values = []
            self.data_points = []

        def calculate_ema_single(self, data: List[float],
                                 initial_value: float = 0) -> List[float]:
            """
            Calculate EMA for a complete dataset using recursive formula.

            Args:
                data (List[float]): Input data points
                initial_value (float): Initial EMA value (default: 0)

            Returns:
                List[float]: EMA values for each data point
            """
            if not data:
                return []

            ema_values = []
            current_ema = initial_value

            for i, data_point in enumerate(data):
                if i == 0 and initial_value == 0:
                    # First value: V_1 = (1-beta) * theta_1
                    current_ema = self.alpha * data_point
                else:
                    # Recursive formula: V_t = beta * V_{t-1} + (1-beta) * theta_t
                    current_ema = self.beta * current_ema + self.alpha * data_point

                ema_values.append(current_ema)

            return ema_values

        def calculate_ema_step_by_step(self, data: List[float]) -> Tuple[List[float], List[dict]]:
            """
            Calculate EMA with detailed step-by-step breakdown.

            Args:
                data (List[float]): Input data points

            Returns:
                Tuple[List[float], List[dict]]: EMA values and calculation details
            """
            if not data:
                return [], []

            ema_values = []
            calculations = []
            current_ema = 0

            for i, data_point in enumerate(data):
                if i == 0:
                    # First calculation
                    current_ema = self.alpha * data_point
                    calc_detail = {
                        'step': i + 1,
                        'data_point': data_point,
                        'formula': f'V_1 = (1-beta) * theta_1 = {self.alpha:.3f} * {data_point:.3f}',
                        'calculation': f'{self.alpha:.3f} * {data_point:.3f} = {current_ema:.6f}',
                        'ema_value': current_ema
                    }
                else:
                    # Recursive calculation
                    prev_ema = current_ema
                    current_ema = self.beta * current_ema + self.alpha * data_point
                    calc_detail = {
                        'step': i + 1,
                        'data_point': data_point,
                        'formula': f'V_{i+1} = beta * V_{i} + (1-beta) * theta_{i+1}',
                        'calculation': f'{self.beta:.3f} * {prev_ema:.6f} + {self.alpha:.3f} * {data_point:.3f} = {current_ema:.6f}',
                        'ema_value': current_ema
                    }

                ema_values.append(current_ema)
                calculations.append(calc_detail)

            return ema_values, calculations

        def calculate_weights(self, n_periods: int) -> List[Tuple[int, float]]:
            """
            Calculate the exponential weights for historical data points.

            Args:
                n_periods (int): Number of historical periods to calculate weights for

            Returns:
                List[Tuple[int, float]]: List of (age, weight) tuples
            """
            weights = []
            for k in range(n_periods):
                # Weight for data point k periods ago: beta^k * (1-beta)
                weight = (self.beta ** k) * self.alpha
                weights.append((k, weight))

            return weights

        def expand_ema_formula(self, n_terms: int = 5) -> str:
            """
            Generate the expanded EMA formula showing weights for historical data.

            Args:
                n_terms (int): Number of terms to show in expansion

            Returns:
                str: Mathematical formula as string
            """
            terms = []
            for i in range(n_terms):
                if i == 0:
                    terms.append("(1-beta)theta_n")
                else:
                    terms.append(f"beta^{i}(1-beta)theta_(n-{i})")

            formula = f"V_n = {' + '.join(terms)}"
            if n_terms > 1:
                formula += " + ..."

            return formula

        def compare_with_sma(self, data: List[float], sma_window: int) -> dict:
            """
            Compare EMA with Simple Moving Average (SMA).

            Args:
                data (List[float]): Input data
                sma_window (int): Window size for SMA calculation

            Returns:
                dict: Comparison results
            """
            ema_values = self.calculate_ema_single(data)

            # Calculate SMA (undefined until a full window is available)
            sma_values = []
            for i in range(len(data)):
                if i < sma_window - 1:
                    sma_values.append(np.nan)
                else:
                    sma_values.append(np.mean(data[i - sma_window + 1:i + 1]))

            return {
                'data': data,
                'ema': ema_values,
                'sma': sma_values,
                'ema_params': {'beta': self.beta, 'alpha': self.alpha},
                'sma_params': {'window': sma_window}
            }

        def plot_ema_analysis(self, data: List[float], title: str = "EMA Analysis"):
            """
            Create comprehensive plots for EMA analysis.

            Args:
                data (List[float]): Input data
                title (str): Plot title
            """
            ema_values = self.calculate_ema_single(data)
            weights = self.calculate_weights(min(10, len(data)))

            fig, axes = plt.subplots(2, 2, figsize=(15, 10))
            fig.suptitle(title, fontsize=16)

            # Plot 1: Original data vs EMA
            axes[0, 0].plot(data, 'b-o', label='Original Data', markersize=4)
            axes[0, 0].plot(ema_values, 'r-', label=f'EMA (beta={self.beta})', linewidth=2)
            axes[0, 0].set_title('Data vs EMA')
            axes[0, 0].set_xlabel('Time Period')
            axes[0, 0].set_ylabel('Value')
            axes[0, 0].legend()
            axes[0, 0].grid(True, alpha=0.3)

            # Plot 2: Weight distribution
            ages, weight_values = zip(*weights)
            axes[0, 1].bar(ages, weight_values, alpha=0.7, color='green')
            axes[0, 1].set_title('Exponential Weight Distribution')
            axes[0, 1].set_xlabel('Periods Ago')
            axes[0, 1].set_ylabel('Weight')
            axes[0, 1].grid(True, alpha=0.3)

            # Plot 3: Convergence analysis
            differences = [abs(ema_values[i] - data[i]) for i in range(len(data))]
            axes[1, 0].plot(differences, 'purple', marker='s', markersize=3)
            axes[1, 0].set_title('EMA-Data Absolute Difference')
            axes[1, 0].set_xlabel('Time Period')
            axes[1, 0].set_ylabel('|EMA - Data|')
            axes[1, 0].grid(True, alpha=0.3)

            # Plot 4: Cumulative weight (showing memory effect)
            cumulative_weights = np.cumsum([w[1] for w in weights])
            axes[1, 1].plot(ages, cumulative_weights, 'orange', marker='d', markersize=4)
            axes[1, 1].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Memory')
            axes[1, 1].set_title('Cumulative Weight Distribution')
            axes[1, 1].set_xlabel('Periods Ago')
            axes[1, 1].set_ylabel('Cumulative Weight')
            axes[1, 1].legend()
            axes[1, 1].grid(True, alpha=0.3)

            plt.tight_layout()
            plt.show()


    # Example usage and demonstration
    def demo_ema():
        """Demonstrate EMA functionality with examples."""

        print("=" * 60)
        print("EXPONENTIAL MOVING AVERAGE - PYTHON IMPLEMENTATION")
        print("=" * 60)

        # Create sample data (converted to a plain list so the emptiness
        # checks in the class work; `not array` is ambiguous for NumPy arrays)
        np.random.seed(42)
        trend_data = (np.linspace(10, 20, 10) + np.random.normal(0, 0.5, 10)).tolist()
        print(f"\nSample Data: {[round(x, 2) for x in trend_data]}")

        # Initialize EMA calculator
        ema_calc = ExponentialMovingAverage(beta=0.8)

        # Calculate EMA
        ema_values, calculations = ema_calc.calculate_ema_step_by_step(trend_data)

        print("\nEMA Parameters:")
        print(f"beta = {ema_calc.beta}")
        print(f"alpha = 1-beta = {ema_calc.alpha}")

        print(f"\n{'-'*80}")
        print("STEP-BY-STEP EMA CALCULATIONS:")
        print(f"{'-'*80}")

        for calc in calculations[:5]:  # Show first 5 steps
            print(f"Step {calc['step']}: Data = {calc['data_point']:.3f}")
            print(f"  Formula: {calc['formula']}")
            print(f"  Calculation: {calc['calculation']}")
            print(f"  EMA Value: {calc['ema_value']:.6f}")
            print()

        # Show expanded formula
        print(f"\n{'-'*80}")
        print("EXPANDED EMA FORMULA:")
        print(f"{'-'*80}")
        expanded_formula = ema_calc.expand_ema_formula(4)
        print(expanded_formula)

        # Show weight distribution
        print(f"\n{'-'*80}")
        print("EXPONENTIAL WEIGHT DISTRIBUTION:")
        print(f"{'-'*80}")
        weights = ema_calc.calculate_weights(8)
        print(f"{'Periods Ago':<12} {'Weight':<10} {'Percentage':<12}")
        print("-" * 35)
        total_weight = sum(w[1] for w in weights)
        for age, weight in weights:
            percentage = (weight / total_weight) * 100
            print(f"{age:<12} {weight:<10.6f} {percentage:<10.2f}%")

        # Mathematical verification
        print(f"\n{'-'*80}")
        print("MATHEMATICAL VERIFICATION:")
        print(f"{'-'*80}")

        # Verify last EMA value using expanded formula
        n = len(trend_data)
        manual_ema = 0
        for i, data_point in enumerate(trend_data):
            age = n - 1 - i
            weight = (ema_calc.beta ** age) * ema_calc.alpha
            manual_ema += weight * data_point

        print(f"EMA (recursive method): {ema_values[-1]:.8f}")
        print(f"EMA (expanded formula): {manual_ema:.8f}")
        print(f"Difference: {abs(ema_values[-1] - manual_ema):.2e}")

        return ema_calc, trend_data, ema_values


    # Run demonstration
    if __name__ == "__main__":
        ema_calculator, data, ema_result = demo_ema()

        # Optional: Create plots (uncomment if matplotlib is available)
        # ema_calculator.plot_ema_analysis(data, "EMA Mathematical Demonstration")

        print(f"\n{'='*60}")
        print("DEMONSTRATION COMPLETE")
        print(f"{'='*60}")

Key Takeaways
1. Momentum = velocity accumulation from past gradients
2. β controls history influence (0.9 is standard)
3. Accelerates convergence but may overshoot
4. Escapes local minima better than vanilla GD
5. Dampens oscillations in narrow valleys
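To see how β controls the influence of history, recall that the EMA weight on a point k periods ago is (1 - β) β^k. Summing these weights over the k most recent points gives 1 - β^k, and 1 / (1 - β) is a common rule of thumb for the effective averaging window. The helper name below is an assumption; the sketch only evaluates these two expressions.

```python
def cumulative_weight(beta, k):
    """Total EMA weight on the k most recent points: sum of (1 - beta) * beta**j for j < k."""
    return 1 - beta ** k

for beta in (0.5, 0.9, 0.99):
    window = round(1 / (1 - beta))   # rule-of-thumb effective window
    share = cumulative_weight(beta, window)
    print(f"beta={beta}: window of about {window} steps holds {share:.0%} of the weight")
```

Moving β from 0.9 to 0.99 stretches the effective window from about 10 steps to about 100, which is why a larger β gives smoother but slower-reacting velocity estimates.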


Chapter 34 SGD with Momentum Explained in Detail with Animations Optimizers in Deep Learning Part 2

34.1 Deep Learning Optimization Techniques: Momentum with SGD

34.1.1 Introduction

This guide covers optimization techniques in deep learning, specifically focusing on SGD with Momentum. In deep learning, we deal with complex loss landscapes that require sophisticated optimization algorithms to navigate effectively.

Key Concepts
- Loss Function: $L(\theta) = f(W, b)$, where $\theta$ represents all parameters
- Objective: Find $\theta^* = \arg\min_\theta L(\theta)$
- Challenge: Non-convex optimization in high-dimensional spaces

34.1.2 Understanding Graph Visualizations

Figure 34.1: image

1. 2D Loss Function Plot
- X-axis: Single parameter (e.g., weight $w$)
- Y-axis: Loss value $L(w)$
- Purpose: Visualize how loss changes with one parameter

$$L = f(w)$$

2. 3D Loss Surface
- X, Y axes: Two parameters (e.g., $w_1, w_2$)
- Z-axis: Loss value $L(w_1, w_2)$
- Purpose: Visualize loss landscape in 3D space

$$L = f(w_1, w_2)$$

3. Contour Plot
- 2D projection of 3D loss surface
- Contour lines: Connect points of equal loss
- Color coding:
  - Blue = Lower loss (minima)
  - Yellow/Red = Higher loss (maxima)



Common mistakes

Running the project without tuning the base learning rate for each optimizer.
Ignoring how the logged metrics interact with batch size.
Comparing runs that use different seeds, which leads to false conclusions.

Interview checkpoints

Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.

Practice

Basic: Describe the optimizer project.
Intermediate: Train the same network with SGD and with Adam.
Advanced: Plot the loss of each optimizer at equal compute.

Recap

The choice of optimizer changes optimization dynamics.
Fair comparisons need the same budget.
Module 6: convolutions next.

Next: Day 54 — Convolution Operation
