Module 5: Batch Normalization & Optimizers
Stabilize intermediate training: implement Batch Normalization scaling. Map learning rate schedules in Momentum, Nesterov, AdaGrad, RMSProp, and Adam.
Momentum SGD
Why this matters
Momentum accumulates a velocity from past gradients, which smooths noisy gradient steps.
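The velocity idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference code: it assumes the classical heavy-ball form (v ← βv − ηg, then w ← w + v), and the function name `momentum_sgd`, the toy quadratic loss, and all hyperparameter values are hypothetical choices made for the example.

```python
import numpy as np

def momentum_sgd(grad_fn, w0, lr=0.05, beta=0.9, steps=300):
    """Heavy-ball momentum: v <- beta * v - lr * g, then w <- w + v."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)               # velocity starts at zero
    for _ in range(steps):
        g = grad_fn(w)                 # gradient at the current weights
        v = beta * v - lr * g          # decay old velocity, add the new gradient step
        w = w + v                      # move along the accumulated velocity
    return w

def grad(w):
    # Gradient of the toy loss 0.5 * (w[0]**2 + 10 * w[1]**2), an elongated bowl
    return np.array([1.0, 10.0]) * w

w_final = momentum_sgd(grad, [5.0, 5.0])
print(w_final)  # both coordinates end up very close to the minimum at 0
```

Setting `beta=0.0` recovers plain gradient descent, which makes it easy to compare the two on the same loss.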
To accelerate convergence, **Batch Normalization** standardizes the activations of each layer across a mini-batch of size $m$:
$$\mu_B = \frac{1}{m} \sum_{i=1}^{m} x_i, \quad \sigma_B^2 = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2$$
$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \quad y_i = \gamma \hat{x}_i + \beta$$
Here $\gamma$ (scale) and $\beta$ (shift) are learnable parameters, and $\epsilon$ is a small constant for numerical stability. Batch Normalization was introduced to reduce internal covariate shift, and in practice it stabilizes training.
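The formulas above translate almost line for line into NumPy. The sketch below covers only the training-time forward pass on a `(batch, features)` array; the function name `batch_norm_forward`, the value of `eps`, and the random input are assumptions for illustration, and the running statistics a real layer keeps for inference are left out.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode batch norm for a (batch, features) array."""
    mu = x.mean(axis=0)                      # mu_B, one mean per feature
    var = x.var(axis=0)                      # sigma_B^2 (divides by m, as in the formula)
    x_hat = (x - mu) / np.sqrt(var + eps)    # standardized activations
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))   # activations with non-zero mean
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))

print(y.mean(axis=0))  # approximately 0 for every feature
print(y.std(axis=0))   # approximately 1 for every feature
```

With `gamma = 1` and `beta = 0` the output is simply the standardized input; during training the network is free to learn other values and undo the normalization where that helps.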
Common mistakes
- Using momentum without tuning the base learning rate.
- Ignoring how the velocity term interacts with batch size.
- Comparing runs with different random seeds, which can lead to false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe Momentum SGD.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- Momentum SGD changes optimization dynamics.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Nesterov Momentum
Why this matters
Nesterov momentum evaluates the gradient at a look-ahead position before taking the step, which often speeds up convergence.
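The look-ahead can be made concrete with a small NumPy sketch. It assumes one common formulation of Nesterov Accelerated Gradient (gradient taken at w + βv, then the usual velocity update); the function name `nesterov_sgd`, the toy quadratic loss, and the hyperparameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def nesterov_sgd(grad_fn, w0, lr=0.05, beta=0.9, steps=300):
    """Nesterov momentum: take the gradient at the look-ahead point w + beta * v."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w + beta * v       # where the current velocity alone would carry us
        g = grad_fn(lookahead)         # gradient evaluated at the look-ahead point
        v = beta * v - lr * g          # velocity corrected by the look-ahead gradient
        w = w + v
    return w

def grad(w):
    # Gradient of the toy loss 0.5 * (w[0]**2 + 10 * w[1]**2)
    return np.array([1.0, 10.0]) * w

w_final = nesterov_sgd(grad, [5.0, 5.0])
print(w_final)  # both coordinates end up very close to the minimum at 0
```

The only difference from classical momentum is the point at which `grad_fn` is evaluated; replacing `lookahead` with `w` gives back the plain momentum update.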
37.2.11 Key Insights Covered:
The Core Innovation
- AdaGrad: $v_t = \sum_{i=1}^{t} (\nabla w_i)^2$, which grows forever, so the effective learning rate shrinks toward 0.
- RMSProp: $v_t = \beta v_{t-1} + (1 - \beta)(\nabla w_t)^2$, which keeps the growth controlled.

Performance Characteristics
- Excellent for neural networks and non-convex problems
- Handles sparse data efficiently
- No major disadvantages (still competitive with ADAM)
- Was the gold standard before ADAM arrived

Modern Usage
- Second choice after ADAM for most problems
- First choice when ADAM doesn't perform well
- Particularly good for RNNs and memory-constrained environments
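The difference between the two accumulators is easy to see numerically. The sketch below is a toy comparison under stated assumptions: a constant gradient of 1.0 for 1000 steps, a hypothetical helper `accumulate`, and illustrative values for `lr`, `beta`, and `eps`.

```python
import numpy as np

def accumulate(grads, beta=None):
    """History of v_t: AdaGrad running sum if beta is None, else RMSProp moving average."""
    v = 0.0
    history = []
    for g in grads:
        if beta is None:
            v = v + g ** 2                         # AdaGrad: sum of squared gradients
        else:
            v = beta * v + (1 - beta) * g ** 2     # RMSProp: exponentially decaying average
        history.append(v)
    return np.array(history)

grads = np.ones(1000)      # toy assumption: the gradient is always 1.0
lr, eps = 0.01, 1e-8

v_adagrad = accumulate(grads)
v_rmsprop = accumulate(grads, beta=0.9)

step_adagrad = lr / (np.sqrt(v_adagrad) + eps)     # effective per-step learning rate
step_rmsprop = lr / (np.sqrt(v_rmsprop) + eps)

print(v_adagrad[-1], step_adagrad[-1])   # accumulator keeps growing, step keeps shrinking
print(v_rmsprop[-1], step_rmsprop[-1])   # accumulator levels off near 1, step stays near lr
```

With a constant gradient the AdaGrad step decays like 1/sqrt(t), while the RMSProp step settles at roughly `lr`, which is the "controlled growth" described above.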
Chapter 38 Adam Optimizer Explained in Detail with Animations | Optimizers in Deep Learning Part 5
38.1 Adam Optimizer Explained in Detail with Animations | Optimizers in Deep Learning Part 5
38.2 ADAM Optimizer: Complete Deep Learning Notes
38.2.1 Introduction & Overview
What is ADAM?
ADAM = Adaptive Moment Estimation

| Feature | Description |
| --- | --- |
| Type | Gradient-based optimization algorithm |
| Popularity | Most widely used optimizer in deep learning |
| Applications | ANNs, CNNs, RNNs, and most neural architectures |
| Key Strength | Combines momentum and adaptive learning rates |

Key Insight: ADAM is among the most effective general-purpose optimization techniques and is used in most deep learning implementations.
38.2.2 Background: Evolution of Optimization
Figure 38.1: Optimization techniques timeline
Comparison of Optimization Methods

| Method | Speed | Oscillations | Sparse Data | Learning Rate Decay | Convergence |
| --- | --- | --- | --- | --- | --- |
| SGD/BGD | Slow | Minimal | Poor | Manual | Good but slow |
| Momentum | Fast | High | Poor | Manual | Fast but oscillates |
| NAG | Fast | Reduced | Poor | Manual | Good |
| Adagrad | Fast | Minimal | Excellent | Too aggressive | Stops learning |
| RMSprop | Fast | Minimal | Good | Controlled | Excellent |
| ADAM | Fast | Minimal | Excellent | Automatic | Best overall |

Problem-Solution Evolution
1. Batch Gradient Descent Problem
   - Issue: Very slow convergence
   - Solution: Momentum → uses past gradients for the current update
2. Momentum Problem
   - Issue: High oscillations around the minimum
   - Solution: NAG (Nesterov Accelerated Gradient) → dampens oscillations
3. Sparse Data Problem
   - Issue: Poor performance on sparse features
   - Solution: Adagrad → adaptive learning rates per parameter
4. Adagrad Problem
   - Issue: Learning rate becomes too small, so learning stops
   - Solution: RMSprop → controls learning rate decay
5. Integration Opportunity
   - Observation: Two successful concepts exist: momentum (the velocity concept) and adaptive learning rate decay
   - Solution: ADAM → combines both concepts
38.2.3 Mathematical Formulation
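Core ADAM Equations
The ADAM algorithm uses the following mathematical formulation:

Weight Update Rule:
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
Momentum Estimation (1st Moment):
$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla w_t$$
Velocity Estimation (2nd Moment):
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) (\nabla w_t)^2$$
Bias Correction:
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Default Hyperparameters

| Parameter | Symbol | Default Value | Purpose |
| --- | --- | --- | --- |
| Learning Rate | $\eta$ | 0.001 | Step size control |
| Momentum Decay | $\beta_1$ | 0.9 | Controls momentum |
| Velocity Decay | $\beta_2$ | 0.999 | Controls adaptive learning |
| Epsilon | $\epsilon$ | 1e-8 | Numerical stability |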
38.2.4 Algorithm Components
ADAM Algorithm Breakdown

Step 1: Calculate First Moment (Momentum)
# Exponentially weighted average of gradients
m_t = beta1 * m_{t-1} + (1 - beta1) * gradient

Step 2: Calculate Second Moment (Velocity)
# Exponentially weighted average of squared gradients
v_t = beta2 * v_{t-1} + (1 - beta2) * gradient^2

Step 3: Bias Correction
# Correct for initialization bias
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)

Step 4: Parameter Update
# Update weights
w = w - learning_rate * m_hat / (sqrt(v_hat) + epsilon)

Why Bias Correction?
- Problem: Initially, both m = 0 and v = 0
- Effect: Creates a bias towards zero in early iterations
- Solution: The bias-correction factors $(1 - \beta_1^t)$ and $(1 - \beta_2^t)$ offset this bias
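The four steps above can be put together as a runnable NumPy sketch. It is an illustration under stated assumptions rather than a library implementation: the function name `adam_step`, the toy loss `sum(w**2)`, the starting point, and the learning rate of 0.01 are all hypothetical.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step counter used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad            # Step 1: first moment
    v = beta2 * v + (1 - beta2) * grad ** 2       # Step 2: second moment
    m_hat = m / (1 - beta1 ** t)                  # Step 3: bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # Step 4: parameter update
    return w, m, v

w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
w_history = [w.copy()]

for t in range(1, 1001):
    grad = 2 * w                                  # gradient of the toy loss sum(w**2)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
    w_history.append(w.copy())

print(w_history[1])   # first step moves each weight by about lr toward zero
print(w)              # after 1000 steps both weights are much closer to zero
```

Because of the bias correction, the very first step has magnitude close to `lr` in every coordinate regardless of the gradient's scale.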
38.2.5 Visual Understanding
ADAM Behavior Animation Analysis

| Scenario | ADAM Behavior | Comparison |
| --- | --- | --- |
| Sparse Data | Direct descent to center | Better than Momentum's zigzag |
| Convergence Speed | Fastest convergence | Beats all previous methods |
| Oscillation Control | Minimal oscillations | Stable approach to minimum |
| Non-convex Optimization | Excellent performance | Ideal for neural networks |

Performance Characteristics:
Figure 38.2: Convergence comparison chart
38.2.6 Implementation Guidelines
Practical Usage Recommendations

First Choice Strategy:
from tensorflow.keras.optimizers import Adam, RMSprop, SGD

# Start with ADAM - most cases
optimizer = Adam(learning_rate=0.001)

Alternative Options:
# If ADAM doesn't perform well
optimizer_rmsprop = RMSprop(learning_rate=0.001)
optimizer_momentum = SGD(learning_rate=0.01, momentum=0.9)

Hyperparameter Tuning Guide

| Parameter | Typical Range | When to Adjust |
| --- | --- | --- |
| Learning Rate | 0.0001 - 0.01 | Always tune first |
| β1 (Momentum) | 0.8 - 0.95 | For different momentum needs |
| β2 (Velocity) | 0.99 - 0.999 | For adaptive rate sensitivity |
| Epsilon | 1e-8 - 1e-6 | For numerical stability issues |

Decision Framework
Figure 38.3: Decision framework
38.2.7 Performance Analysis
Why ADAM is Superior

Automatic Learning Rate Management:
- No manual learning rate scheduling needed
- Adaptive decay prevents overshooting
- Balances exploration vs exploitation

Robust to Hyperparameters:
- Default values work well in most cases
- Less sensitive to initial learning rate choice
- Consistent performance across problems

Memory Efficiency:
- Only stores first and second moment estimates
- O(p) memory complexity (p = parameters)
- Computationally efficient

Empirical Results Summary
Research Findings: Over the past 3-4 years, ADAM has consistently delivered better results across different types of problems compared to other optimizers.

Success Metrics:
- Faster convergence (typically 2-5x speedup)
- Better final performance
- More stable training
- Requires less hyperparameter tuning
38.2.8 Key Takeaways
Core Concepts to Remember
1. Combination: ADAM = Momentum + Adaptive Learning Rate
2. Mathematics: Uses both first and second moment estimates
3. Bias Correction: Essential for proper initialization
4. Default Choice: Start with ADAM for most deep learning problems
5. Flexibility: Can fall back to RMSprop or Momentum if needed

Best Practices
- Start with ADAM as your default optimizer
- Monitor convergence and compare with alternatives
- Tune learning rate first, other parameters later
- Use early stopping to prevent overfitting
- Experiment with different optimizers for specific problems
Part IX Hyperparameter Tuning
Chapter 39 Keras Tuner | Hyperparameter Tuning a Neural Network
39.1 Keras Tuner | Hyperparameter Tuning a Neural Network
39.2 Hyperparameter Tuning with Keras Tuner - Complete Guide
39.2.1 Introduction
Problem Statement
When building neural networks, we face multiple decisions:
- How many hidden layers?
- How many neurons per layer?
- Which activation function?
- What batch size?
- Which optimizer?
Solution: Keras Tuner
Keras Tuner is a widely used hyperparameter tuning library that helps automate the process of finding good hyperparameters.
39.2.2 Setup and Installation
Required Libraries
# Core libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# TensorFlow/Keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Keras Tuner
import keras_tuner as kt
Installation
pip install keras-tuner
39.2.3 Dataset Preparation
Dataset: Pima Indians Diabetes
| Feature | Description | Type |
| --- | --- | --- |
| Pregnancies | Number of pregnancies | Numeric |
| Glucose | Glucose concentration | Numeric |
| BloodPressure | Blood pressure | Numeric |
| SkinThickness | Skin thickness | Numeric |
| Insulin | Insulin level | Numeric |
| BMI | Body Mass Index | Numeric |
| DiabetesPedigreeFunction | Diabetes pedigree function | Numeric |
| Age | Age | Numeric |
| Outcome | Diabetes (0/1) | Binary |
Data Preprocessing Steps
# Load dataset
data = pd.read_csv('diabetes.csv')

# Separate features and target
X = data.iloc[:, :-1]  # All columns except last
y = data.iloc[:, -1]   # Last column (Outcome)

# Split data first so the scaler never sees the test rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features: fit on the training split, then apply to both splits
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
39.2.4 Basic Model Building
Manual Approach (Before Tuning)
model = Sequential([
    Dense(32, activation='relu', input_dim=8),
    Dense(1, activation='sigmoid')
])

model.compile(
    optimizer='rmsprop',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
Results Analysis
| Approach | Accuracy | Issue |
| --- | --- | --- |
| Manual | ~70% | Trial and error |
| Intuition-based | Variable | Time-consuming |
| Automated Tuning | Optimized | Systematic |
39.2.5 Optimizer Selection
Step 1: Define Build Function
def build_model(hp):
    model = Sequential()

    # Fixed architecture for optimizer testing
    model.add(Dense(32, activation='relu', input_dim=8))
    model.add(Dense(1, activation='sigmoid'))

    # Hyperparameter: Optimizer selection
    optimizer = hp.Choice(
        'optimizer',
        values=['adam', 'rmsprop', 'sgd', 'adagrad']
    )

    model.compile(
        optimizer=optimizer,
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model
Step 2: Create Tuner Object
tuner = kt.RandomSearch(
    build_model,
    objective='val_accuracy',
    max_trials=5,
    directory='my_dir',
    project_name='optimizer_tuning'
)
Step 3: Search for Best Optimizer
tuner.search(
    X_train, y_train,
    epochs=10,
    validation_data=(X_test, y_test)
)

# Get best hyperparameters
best_params = tuner.get_best_hyperparameters()[0]
print(f"Best optimizer: {best_params.get('optimizer')}")
Optimizer Comparison Results
| Optimizer | Validation Accuracy |
| --- | --- |
| RMSprop | 0.538 |
| Adam | 0.650 |
| SGD | 0.570 |
| Adagrad | 0.650 |
39.2.6 Number of Neurons Optimization
Hyperparameter: Units Selection
def build_model(hp):
    model = Sequential()

    # Variable number of units
    units = hp.Int('units', min_value=8, max_value=128, step=8)

    model.add(Dense(
        units=units,
        activation='relu',
        input_dim=8
    ))
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer='rmsprop',  # Held fixed here; substitute the best optimizer from the previous step
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model
Units Testing Range
Figure 39.1: Mermaid diagram
Best Results
– Optimal Units: 120 neurons
– Validation Accuracy: Improved performance
– Pattern: More neurons generally better (up to a point)
39.2.7 Number of Layers Optimization
Dynamic Layer Creation
def build_model(hp):
    model = Sequential()

    # Variable number of layers
    num_layers = hp.Int('num_layers', min_value=1, max_value=10)

    for i in range(num_layers):
        if i == 0:
            # First layer with input dimension
            model.add(Dense(
                units=hp.Int(f'units_{i}', 8, 128, step=8),
                activation='relu',
                input_dim=8
            ))
        else:
            # Hidden layers
            model.add(Dense(
                units=hp.Int(f'units_{i}', 8, 128, step=8),
                activation='relu'
            ))

    # Output layer
    model.add(Dense(1, activation='sigmoid'))

    model.compile(
        optimizer='rmsprop',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

    return model

Modern optimization techniques use the history of past gradients to shape each update, and several of them adapt the learning rate per parameter:
- Momentum: Introduces a velocity term that carries updates through shallow local minima and damps oscillations.
- RMSProp: Adapts learning rates based on an exponentially decaying average of squared gradients.
- Adam (Adaptive Moment Estimation): Combines Momentum (first moment estimate) and RMSProp (second moment estimate).
Common mistakes
- Using Nesterov momentum without tuning the base learning rate.
- Ignoring how the look-ahead step interacts with batch size.
- Comparing runs with different random seeds, which can lead to false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe Nesterov Momentum.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- Nesterov Momentum changes optimization dynamics.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Next: Day 47 — AdaGrad
AdaGrad
Why this matters
AdaGrad adapts per-parameter LR — good for sparse features.
Adam updates parameters using both first-moment vector $m_t$ and second-moment vector $v_t$: $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$ With bias-correction terms: $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$ $$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
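As a worked example of why the bias-correction terms matter, consider the first step $t = 1$, assuming the usual initialization $m_0 = v_0 = 0$ and the common defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$ (these values are assumptions for the example, not stated in the paragraph above):
$$m_1 = \beta_1 \cdot 0 + (1 - \beta_1) g_1 = 0.1\, g_1, \quad \hat{m}_1 = \frac{0.1\, g_1}{1 - 0.9^1} = g_1$$
$$v_1 = \beta_2 \cdot 0 + (1 - \beta_2) g_1^2 = 0.001\, g_1^2, \quad \hat{v}_1 = \frac{0.001\, g_1^2}{1 - 0.999^1} = g_1^2$$
$$w_2 = w_1 - \frac{\eta}{\sqrt{g_1^2} + \epsilon}\, g_1 \approx w_1 - \eta\, \mathrm{sign}(g_1)$$
Without the correction, $m_1$ would be ten times smaller than the gradient and $v_1$ a thousand times smaller than its square, so the early steps would be badly scaled. With it, the first step has size roughly $\eta$ in the direction opposite to the gradient.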
Common mistakes
- Using AdaGrad without tuning the base learning rate.
- Ignoring how sparse gradients interact with batch size.
- Comparing runs with different random seeds, which can lead to false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe AdaGrad.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- AdaGrad changes optimization dynamics.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Next: Day 48 — RMSProp
RMSProp
Why this matters
RMSProp fixes AdaGrad decay — popular before Adam.
Content sourced from CampusX Deep Learning notes (PDF).
Common mistakes
- Using RMSProp without tuning the base learning rate.
- Ignoring how the moving average of squared gradients interacts with batch size.
- Comparing runs with different random seeds, which can lead to false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe RMSProp.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- RMSProp changes optimization dynamics.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Next: Day 49 — Adam Optimizer
Adam Optimizer
Why this matters
Adam combines momentum + adaptive LR — default for many tasks.
Common mistakes
- Using Adam without tuning the base learning rate.
- Ignoring how bias correction interacts with batch size.
- Comparing runs with different random seeds, which can lead to false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe Adam Optimizer.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- Adam Optimizer changes optimization dynamics.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Next: Day 50 — AdamW
AdamW
When to Use Which Optimizer?

| Scenario | Recommended Optimizer | Reason |
| --- | --- | --- |
| Sparse Data | AdaGrad | Per-parameter adaptation |
| Computer Vision | SGD with Momentum | Well-tested, reliable |
| NLP/Transformers | Adam/AdamW | Handles varying gradients |
| Online Learning | Stochastic GD | Single sample updates |
| Research/Experimentation | Adam | Good default choice |
32.1.7 Key Takeaways
Essential Points
1. Optimizers are crucial for training neural networks efficiently
2. Learning rate selection is one of the most important hyperparameters
3. Modern optimizers solve many limitations of vanilla gradient descent
4. Adam is often a good default choice for most applications
5. No single optimizer works best for all problems
Why this matters
AdamW decouples weight decay from the gradient update, which often generalizes better than Adam with L2 regularization.
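The distinction can be shown with a single update step in NumPy. The sketch below is an illustration under stated assumptions, not a library implementation: the function name `adam_like_step`, the `decoupled` flag, and the exaggerated values of `lr` and `weight_decay` are hypothetical, and the loss gradient is set to zero so that only the decay term acts.

```python
import numpy as np

def adam_like_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, decoupled=True):
    """One step of Adam with either L2 regularization or decoupled (AdamW-style) decay."""
    if not decoupled:
        grad = grad + weight_decay * w        # Adam + L2: decay passes through the moments
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = lr * m_hat / (np.sqrt(v_hat) + eps)
    if decoupled:
        update = update + lr * weight_decay * w   # AdamW: decay applied directly to w
    return w - update, m, v

w0 = np.array([1.0, 100.0])       # one small and one large weight
zero_grad = np.zeros_like(w0)     # no loss gradient, so only the decay acts
state = np.zeros_like(w0)

w_l2, _, _ = adam_like_step(w0, zero_grad, state, state, t=1,
                            lr=0.1, weight_decay=0.1, decoupled=False)
w_decoupled, _, _ = adam_like_step(w0, zero_grad, state, state, t=1,
                                   lr=0.1, weight_decay=0.1, decoupled=True)

print(w_l2)          # both weights shrink by about the same absolute amount
print(w_decoupled)   # both weights shrink by the same 1% fraction
```

With L2, the decay term is rescaled by the adaptive denominator, so large and small weights are pulled back by nearly the same absolute step. With decoupled decay, each weight shrinks in proportion to its own size.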
31.1.10 Summary & Key Takeaways
Core Benefits Recap

| Benefit | Impact | Explanation |
| --- | --- | --- |
| **Speed** | 2-14x faster | Higher learning rates possible |
| **Stability** | Much improved | Reduces internal covariate shift |
| **Regularization** | Mild effect | Batch statistics add noise |
| **Robustness** | High | Less sensitive to initialization |

When to Use Batch Normalization
- Deep networks (> 3 layers)
- Computer vision tasks
- When using high learning rates
- Large batch sizes available
- Training from scratch

When to Avoid
- Very small batch sizes (< 16)
- Online learning (batch size = 1)
- Some RNN architectures
- When training time is critical
Chapter 32 Optimizers in Deep Learning | Part 1 | Complete Deep Learning Course
32.1 Optimizers in Deep Learning | Part 1 | Complete Deep Learning Course
32.1.1 Introduction to Optimizers
What are Optimizers?
Optimizers are algorithms that adjust the parameters of neural networks to minimize the loss function and improve model performance.

Why Do We Need Optimizers?

| Need | Description | Impact |
|---|---|---|
| Speed Up Training | Reduce time to convergence | High |
| Find Optimal Parameters | Locate global minimum | Critical |
| Handle Complex Loss Surfaces | Navigate non-convex functions | Essential |
| Adaptive Learning | Adjust to different data patterns | Important |
32.1.2 Role of Optimizers
The Optimization Process
Figure 32.1: image

Mathematical Foundation
The core update rule for gradient descent:

$$w_{t+1} = w_t - \eta \frac{\partial L}{\partial w}$$

Where:
- $w_t$ = weights at time $t$
- $\eta$ = learning rate
- $\frac{\partial L}{\partial w}$ = gradient of the loss with respect to the weights
32.1.3 Types of Gradient Descent
Comparison Table

| Type | Update Frequency | Batch Size | Advantages | Disadvantages |
|---|---|---|---|---|
| Batch GD | After full dataset | All samples | Stable convergence | Slow, memory intensive |
| Stochastic GD | After each sample | 1 sample | Fast, online learning | Noisy updates |
| Mini-batch GD | After mini-batch | 32-512 samples | Balanced approach | Hyperparameter tuning |

Visual Comparison
Figure 32.2: image

Update Rules Comparison (pseudocode; compute_gradients and compute_gradient stand for the backpropagation step)

1. Batch Gradient Descent
for epoch in range(num_epochs):
    gradients = compute_gradients(entire_dataset)
    weights = weights - learning_rate * gradients

2. Stochastic Gradient Descent
for epoch in range(num_epochs):
    for sample in dataset:
        gradient = compute_gradient(sample)
        weights = weights - learning_rate * gradient

3. Mini-batch Gradient Descent
for epoch in range(num_epochs):
    for batch in mini_batches:
        gradients = compute_gradients(batch)
        weights = weights - learning_rate * gradients
32.1.4 Challenges with Traditional Optimizers
1. Learning Rate Selection (The Goldilocks Problem)

| Learning Rate | Effect | Visualization |
|---|---|---|
| Too Small | Slow convergence | Painfully slow |
| Too Large | Overshooting/Divergence | Unstable |
| Just Right | Optimal convergence | Perfect |

2. Learning Rate Scheduling
Problem: Pre-defined schedules don't adapt to data

# Common scheduling strategies
strategies = {
    "Step Decay": "lr = lr * 0.1 every 30 epochs",
    "Exponential": "lr = lr_0 * exp(-decay * epoch)",
    "Cosine Annealing": "lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total))",
}

3. Same Learning Rate for All Parameters
Issue: Different parameters may need different learning rates
Figure 32.3: image

4. Local Minima Problem
Figure 32.4: image (Visualization of Local vs Global Minima)
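The three scheduling strategies listed in this section can be written as small functions. This is a sketch under stated assumptions: the function names and the default values (`drop=0.1`, `every=30`, `decay=0.05`) are illustrative choices, not values prescribed by the notes beyond the step-decay example.

```python
import math

def step_decay(lr0, epoch, drop=0.1, every=30):
    """Multiply the initial rate by `drop` once per `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exponential_decay(lr0, epoch, decay=0.05):
    """Smooth exponential decay from the initial rate."""
    return lr0 * math.exp(-decay * epoch)

def cosine_annealing(lr_min, lr_max, epoch, total):
    """Follow half a cosine wave from lr_max (epoch 0) to lr_min (epoch == total)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / total))

for epoch in (0, 30, 60, 90):
    print(epoch,
          step_decay(0.1, epoch),
          exponential_decay(0.1, epoch),
          cosine_annealing(0.0, 0.1, epoch, total=90))
```

All three depend only on the epoch counter, which is the limitation noted above: none of them reacts to what the loss or the gradients are actually doing.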
32.1.5 Modern Optimization Algorithms
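Overview of Advanced Optimizers

| Optimizer | Key Innovation | Best Use Case |
|---|---|---|
| Momentum | Velocity accumulation | Smooth loss surfaces |
| AdaGrad | Adaptive learning rates | Sparse data |
| NAG | Look-ahead gradient | Faster convergence |
| RMSprop | Running average of squared gradients | Non-stationary objectives |
| Adam | Momentum + Adaptive LR | General purpose |

Mathematical Formulations

1. Momentum

$$v_t = \beta v_{t-1} + \eta \nabla_w L$$
$$w_{t+1} = w_t - v_t$$

2. AdaGrad

$$g_t = g_{t-1} + (\nabla_w L)^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{g_t} + \epsilon} \nabla_w L$$

3. RMSprop

$$v_t = \beta v_{t-1} + (1 - \beta)(\nabla_w L)^2$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t} + \epsilon} \nabla_w L$$

4. Adam

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) \nabla_w L$$
$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)(\nabla_w L)^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$
$$w_{t+1} = w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$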
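The four update rules above translate almost line by line into code. The sketch below applies them to a single scalar weight on a toy loss $L(w) = (w - 3)^2$. The loss, the helper `run`, and all hyperparameter values are assumptions made for illustration, with $\epsilon$ placed outside the square root.

```python
import math

def grad(w):
    # Gradient of the assumed toy loss L(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def momentum(w, g, v, t, lr=0.05, beta=0.9):
    v = beta * v + lr * g                     # v_t = beta * v_{t-1} + eta * grad
    return w - v, v

def adagrad(w, g, acc, t, lr=0.5, eps=1e-8):
    acc = acc + g * g                         # accumulate all squared gradients
    return w - lr / (math.sqrt(acc) + eps) * g, acc

def rmsprop(w, g, v, t, lr=0.05, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * g * g         # running average of squared gradients
    return w - lr / (math.sqrt(v) + eps) * g, v

def adam(w, g, state, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m, v = state
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)                 # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    return w - lr / (math.sqrt(v_hat) + eps) * m_hat, (m, v)

def run(update, state, w=0.0, steps=200):
    """Apply `update` for a fixed number of steps and return the final weight."""
    for t in range(1, steps + 1):
        w, state = update(w, grad(w), state, t)
    return w

for name, update, state in [("momentum", momentum, 0.0), ("adagrad", adagrad, 0.0),
                            ("rmsprop", rmsprop, 0.0), ("adam", adam, (0.0, 0.0))]:
    print(name, run(update, state))
```

Each function returns the new weight together with its updated state, which keeps the optimizers interchangeable inside `run`. The learning rates differ between rules on purpose, because a value that suits one update rule is rarely a good choice for another.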
32.1.6 Practical Implementation
TensorFlow/Keras Example
# Different optimizers in Keras
import tensorflow as tf

optimizers = {
    'sgd': tf.keras.optimizers.SGD(learning_rate=0.01),
    'momentum': tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    'adagrad': tf.keras.optimizers.Adagrad(learning_rate=0.01),
    'rmsprop': tf.keras.optimizers.RMSprop(learning_rate=0.001),
    'adam': tf.keras.optimizers.Adam(learning_rate=0.001),
}

# Compile model with optimizer
# (assumes `model` is an already-built tf.keras.Model)
model.compile(
    optimizer=optimizers['adam'],
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)
PyTorch Example
# Different optimizers in PyTorch
# (assumes `model` is an existing torch.nn.Module)
import torch.optim as optim

optimizers = {
    'sgd': optim.SGD(model.parameters(), lr=0.01),
    'momentum': optim.SGD(model.parameters(), lr=0.01, momentum=0.9),
    'adagrad': optim.Adagrad(model.parameters(), lr=0.01),
    'rmsprop': optim.RMSprop(model.parameters(), lr=0.001),
    'adam': optim.Adam(model.parameters(), lr=0.001),
}
When to Use Which Optimizer?

| Scenario | Recommended Optimizer | Reason |
|---|---|---|
| Sparse Data | AdaGrad | Per-parameter adaptation |
| Computer Vision | SGD with Momentum | Well-tested, reliable |
| NLP/Transformers | Adam/AdamW | Handles varying gradients |
| Online Learning | Stochastic GD | Single sample updates |
| Research/Experimentation | Adam | Good default choice |

32.1.7 Key Takeaways
Essential Points
1. Optimizers are crucial for training neural networks efficiently
2. Learning rate selection is one of the most important hyperparameters
3. Modern optimizers solve many limitations of vanilla gradient descent
4. Adam is often a good default choice for most applications
5. No single optimizer works best for all problems
Chapter 33 Exponentially Weighted Moving Average or Exponential Weighted Average | Deep Learning
33.1 Exponentially Weighted Moving Average or Exponential Weighted Average | Deep Learning
33.2 SGD with Momentum Optimization
33.2.1 Introduction
Momentum is a crucial optimization technique in deep learning that accelerates gradient descent by accumulating velocity from past gradients. It's particularly effective for:
- Speeding up convergence
- Escaping local minima
- Navigating elongated valleys in loss landscapes

Key Insight: Momentum works like a ball rolling down a hill. It accumulates velocity in consistent directions and dampens oscillations in inconsistent directions.
33.2.2 Understanding Graph Representations
Three Types of Visualizations

| Graph Type | Dimension | Purpose | Visual Representation |
|---|---|---|---|
| 2D Loss Plot | Loss vs Single Parameter | Simple optimization view | Parabolic curve |
| 3D Surface Plot | Loss vs Two Parameters | Complete loss landscape | Mountain/valley view |
| Contour Plot | 2D projection of 3D | Top-down view | Concentric circles |

Visual Interpretation Guide
Figure 33.1: image
- Yellow/Orange = High altitude (high loss)
- Blue/Purple = Low altitude (low loss)
- Circular contours = Well-conditioned optimization
- Elongated contours = Ill-conditioned optimization
33.2.3 Convex vs Non-Convex Optimization
Comparison Table

| Aspect | Convex Optimization | Non-Convex Optimization |
|---|---|---|
| Shape | Bowl-like | Multiple valleys |
| Minima | Single global minimum | Multiple local minima |
| Challenges | Relatively simple | Complex navigation |
| Convergence | Guaranteed | Not guaranteed |

Three Major Problems in Non-Convex Optimization

1. Local Minima
- Problem: Algorithm gets stuck in suboptimal solutions
- Visual: Small valleys that trap the optimizer
- Impact: Poor model performance

2. Saddle Points
- Problem: Flat regions with mixed curvature
- Visual: Areas that curve up in one direction, down in another
- Impact: Extremely slow convergence

3. High Curvature
- Problem: Sharp turns in loss landscape
- Visual: Narrow valleys or ridges
- Impact: Oscillations and instability
33.2.4 Why Momentum?
Problems with Vanilla Gradient Descent
Figure 33.2: image

Momentum Solutions

| Problem | How Momentum Helps |
|---|---|
| Consistent gradients | Accelerates in consistent directions |
| Inconsistent gradients | Dampens oscillations |
| Local minima | Builds up speed to escape |
| Flat regions | Maintains velocity through plateaus |
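The "accelerates in consistent directions" row can be checked on a small ill-conditioned problem. The sketch below compares plain gradient descent with momentum on an assumed elongated quadratic bowl, $L(x, y) = \frac{1}{2}(x^2 + 25y^2)$. The loss, starting point, learning rate, and step count are all illustrative choices.

```python
def loss(x, y):
    # Elongated bowl: curvature 1 along x, 25 along y (assumed toy problem).
    return 0.5 * (x * x + 25.0 * y * y)

def grad(x, y):
    return x, 25.0 * y

def gradient_descent(x, y, lr=0.035, steps=100):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return loss(x, y)

def momentum_descent(x, y, lr=0.035, beta=0.9, steps=100):
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        vx = beta * vx + lr * gx   # velocity builds up along the shallow x direction
        vy = beta * vy + lr * gy
        x, y = x - vx, y - vy
    return loss(x, y)

loss_gd = gradient_descent(10.0, 1.0)
loss_mom = momentum_descent(10.0, 1.0)
print(f"plain GD: {loss_gd:.5f}  momentum: {loss_mom:.5f}")
```

With the same learning rate and step budget, the momentum run ends at a lower loss. The steep direction caps the learning rate, which leaves plain gradient descent crawling along the shallow direction, while the accumulated velocity speeds up progress there.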
Content sourced from CampusX Deep Learning notes (PDF).
Common mistakes
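- Using AdamW without tuning the base learning rate.
- Ignoring that decoupled weight decay is applied at every step, so its total effect changes with the learning rate and with the number of updates (and therefore with batch size).
- Comparing runs that use different random seeds, which can lead to false conclusions.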
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe AdamW.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- AdamW applies weight decay directly to the weights, separately from the adaptive gradient step.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Optimizer Comparison
Why this matters
Compare optimizers on the same data, the same compute budget, and the same random seed.
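A minimal harness for such a comparison might look like the sketch below. Everything in it is an assumption made for illustration: the toy linear-regression data, the helper names (`make_data`, `train`), and the shared learning rate. The point is only that the data, the initial weights, and the step budget are identical across optimizers.

```python
import math
import random

def make_data(seed, n=64):
    rng = random.Random(seed)            # same seed -> same dataset
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [2.0 * x + 0.5 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def mse_and_grads(w, b, xs, ys):
    n = len(xs)
    errs = [w * x + b - y for x, y in zip(xs, ys)]
    loss = sum(e * e for e in errs) / n
    gw = 2 * sum(e * x for e, x in zip(errs, xs)) / n
    gb = 2 * sum(errs) / n
    return loss, gw, gb

def sgd(params, grads, state, t, lr):
    return tuple(p - lr * g for p, g in zip(params, grads))

def adam(params, grads, state, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    m = state.setdefault("m", [0.0] * len(params))
    v = state.setdefault("v", [0.0] * len(params))
    new_params = []
    for i, (p, g) in enumerate(zip(params, grads)):
        m[i] = b1 * m[i] + (1 - b1) * g
        v[i] = b2 * v[i] + (1 - b2) * g * g
        m_hat = m[i] / (1 - b1 ** t)
        v_hat = v[i] / (1 - b2 ** t)
        new_params.append(p - lr * m_hat / (math.sqrt(v_hat) + eps))
    return tuple(new_params)

def train(optimizer, seed=0, steps=200, lr=0.05):
    """Full-batch training with a fixed budget; returns the loss history."""
    xs, ys = make_data(seed)
    w = b = 0.0                          # identical initialization for every run
    state, history = {}, []
    for t in range(1, steps + 1):
        loss, gw, gb = mse_and_grads(w, b, xs, ys)
        history.append(loss)
        w, b = optimizer((w, b), (gw, gb), state, t, lr)
    return history

for name, opt in [("sgd", sgd), ("adam", adam)]:
    hist = train(opt, seed=0, steps=200, lr=0.05)
    print(f"{name}: first loss {hist[0]:.4f}, final loss {hist[-1]:.4f}")
```

The learning rate is shared here only to keep the sketch short. In a real comparison each optimizer's learning rate would be tuned separately, and the whole experiment repeated over several seeds.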
19.10.1 Speed Comparison
- Given the same number of epochs, batch GD finishes sooner in wall-clock time.
- Reason: batch GD performs far fewer weight updates (10 epochs = 10 updates, versus 10 × n updates for SGD on n samples).
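The update counts behind this comparison follow from simple arithmetic. The helper below is a hypothetical illustration; the dataset size of 1000 and the mini-batch size of 32 are assumed values.

```python
import math

def updates_per_run(n_samples, epochs, batch_size):
    """Number of weight updates: one per batch, per epoch."""
    return epochs * math.ceil(n_samples / batch_size)

n, epochs = 1000, 10
print("batch GD:     ", updates_per_run(n, epochs, batch_size=n))   # 10
print("stochastic GD:", updates_per_run(n, epochs, batch_size=1))   # 10000
print("mini-batch GD:", updates_per_run(n, epochs, batch_size=32))  # 320
```

Fewer updates per epoch makes each epoch cheaper to run, but it also means the weights get fewer chances to improve. This is why speed per epoch and speed of convergence are separate questions.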
Content sourced from CampusX Deep Learning notes (PDF). Run merge script for full body.
Common mistakes
- Comparing optimizers without tuning each one's base learning rate.
- Ignoring that the random seed and the batch size both change data order and gradient noise.
- Different seeds → false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe Optimizer Comparison.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- The choice of optimizer changes optimization dynamics.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Warmup Schedules
Why this matters
Warmup gradually increases the learning rate over the first training steps. Transformers often need it to train stably.
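One common way to express this is linear warmup followed by cosine decay. The sketch below is one possible formulation, not a specific library's scheduler; the function name `warmup_cosine_lr` and the example values (peak rate, warmup length, total steps) are assumptions.

```python
import math

def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given 0-based step."""
    if step < warmup_steps:
        # Linear ramp: reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine decay from peak_lr down to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 49, 99, 100, 550, 1000):
    print(step, warmup_cosine_lr(step, peak_lr=1e-3, warmup_steps=100, total_steps=1000))
```

The ramp keeps the first updates small while Adam's moment estimates are still based on very few gradients. After that the schedule behaves like ordinary cosine annealing.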
Common mistakes
- Using warmup without tuning the base (peak) learning rate.
- Ignoring how batch size affects the warmup length and peak learning rate that transformer training needs.
- Different seeds → false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: SGD+momentum when? A: Vision with careful schedule can beat Adam.
Practice
- Basic: Describe Warmup Schedules.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
Recap
- Warmup schedules change optimization dynamics early in training.
- Fair comparisons need same budget.
- Module 6: convolutions next.
Optimizer Project
1.3. Artificial Neural Networks (ANN)
1.3.3 MLP [Multi-layer perceptron]
•Intuition of MLP
•MLP Notation
•Prediction in MLP
1.3.4 Training an MLP [Most used Algorithm]
•Gradient Descent
•Backpropagation
1.3.5 Practical with Keras
•CPU Vs GPU
•Installation
•Example 1 - Regression using Keras
•Example 2 - Classification using Keras
1.3.6 How to improve an ANN
•Vanishing Gradients
•Exploding Gradients
•Dropouts
•Regularization
•Weight Initialization
•Optimizers
•Gradient Checking and Clipping
•Batch Normalization
•Hyperparameter Tuning
1.3.7 Advanced Topics
•Callbacks
•Tensorboard
•Pretrained Models
•Keras Functional API
•Saving and Loading a Keras model
•Building a Streamlit Application
1.3.8 Project
•End-to-End Final Project
•AWS deployment
Convolutional Neural Networks (CNN)
•Convolution operations and filters
•Pooling layers and techniques
•Feature maps and visualization
Why this matters
Optimizer project: for each optimizer, log the convergence speed and the final validation metric.
33.3.11 Summary & Best Practices
When to Use Momentum Use when:- Training deep neural networks - Dealing with elongated loss valleys - Need to escape local minima - Gradients are relatively consistent Avoid when:- Near convergence (consider reducingβ) - Extremely noisy gradients - Need precise convergence Python code 1importnumpyasnp 2importmatplotlib.pyplotasplt 3fromtypingimportList, Tuple, Optional 4importpandasaspd 5 357
Chapter 33. Exponentially Weighted Moving Average or Exponential Weighted Average Deep Learning 6classExponentialMovingAverage: 7""" 8Exponential Moving Average (EMA) implementation from scratch. 9 10Mathematical Formula: V_t = beta * V_{t-1} + (1-beta) * theta_t 11Where: 12- V_t: EMA value at time t 13- beta: smoothing factor (0 < beta < 1) 14- theta_t: actual data point at time t 15""" 16 17def__init__(self, beta:float= 0.8): 18""" 19Initialize EMA calculator. 20 21Args: 22beta (float): Smoothing factor (0 < beta < 1) 23Higher beta = more smoothing (slower response) 24Lower beta = less smoothing (faster response) 25""" 26if not0 < beta < 1: 27raiseValueError("Beta must be between 0 and 1 (exclusive)") 28 29self.beta = beta 30self.alpha = 1 - beta# Weight for new observations 31self.ema_values = [] 32self.data_points = [] 33 34defcalculate_ema_single(self, data: List[float], 35initial_value:float= 0) -> List[float]: 36""" 37Calculate EMA for a complete dataset using recursive formula. 38 39Args: 40data (List[float]): Input data points 41initial_value (float): Initial EMA value (default: 0) 42 43Returns: 44List[float]: EMA values for each data point 45""" 46if notdata: 47return[] 48 49ema_values = [] 50current_ema = initial_value 51 52fori, data_pointin enumerate(data): 53ifi == 0andinitial_value == 0: 54# First value: V_1 = (1-beta) * theta_1 55current_ema = self.alpha * data_point 56else: 57# Recursive formula: V_t = beta * V_{t-1} + (1-beta) * theta_t 58current_ema = self.beta * current_ema + self.alpha * data_point 59 60ema_values.append(current_ema) 61 358
33.3. Exponential Moving Average (EMA) - Mathematical Intuition 62returnema_values 63 64defcalculate_ema_step_by_step(self, data: List[float]) -> Tuple[List[ float], List[dict]]: 65""" 66Calculate EMA with detailed step-by-step breakdown. 67 68Args: 69data (List[float]): Input data points 70 71Returns: 72Tuple[List[float], List[dict]]: EMA values and calculation details 73""" 74if notdata: 75return[], [] 76 77ema_values = [] 78calculations = [] 79current_ema = 0 80 81fori, data_pointin enumerate(data): 82ifi == 0: 83# First calculation 84current_ema = self.alpha * data_point 85calc_detail = { 86’step’: i + 1, 87’data_point’: data_point, 88’formula’: f’V_1 = (1-beta) * theta_1 = {self.alpha:.3 f} * {data_point:.3f}’, 89’calculation’: f’{self.alpha:.3f} * {data_point:.3f} = {current_ema:.6f}’, 90’ema_value’: current_ema 91} 92else: 93# Recursive calculation 94prev_ema = current_ema 95current_ema = self.beta * current_ema + self.alpha * data_point 96calc_detail = { 97’step’: i + 1, 98’data_point’: data_point, 99’formula’: f’V_{i+1} = beta * V_{i} + (1-beta) * theta_{i+1}’, 100’calculation’: f’{self.beta:.3f} * {prev_ema:.6f} + { self.alpha:.3f} * {data_point:.3f} = {current_ema:.6f}’, 101’ema_value’: current_ema 102} 103 104ema_values.append(current_ema) 105calculations.append(calc_detail) 106 107returnema_values, calculations 108 109defcalculate_weights(self, n_periods:int) -> List[Tuple[int,float ]]: 110""" 111Calculate the exponential weights for historical data points. 359
Chapter 33. Exponentially Weighted Moving Average or Exponential Weighted Average Deep Learning 112 113Args: 114n_periods (int): Number of historical periods to calculate weights for 115 116Returns: 117List[Tuple[int, float]]: List of (age, weight) tuples 118""" 119weights = [] 120forkin range(n_periods): 121# Weight for data point k periods ago: beta^k * (1-beta) 122weight = (self.beta ** k) * self.alpha 123weights.append((k, weight)) 124 125returnweights 126 127defexpand_ema_formula(self, n_terms:int= 5) ->str: 128""" 129Generate the expanded EMA formula showing weights for historical data. 130 131Args: 132n_terms (int): Number of terms to show in expansion 133 134Returns: 135str: Mathematical formula as string 136""" 137terms = [] 138foriin range(n_terms): 139ifi == 0: 140terms.append(f"(1-beta)theta_n") 141else: 142terms.append(f"beta^{i}(1-beta)theta_(n-{i})") 143 144formula = f"V_n = {’ + ’.join(terms)}" 145ifn_terms > 1: 146formula += " + ..." 147 148returnformula 149 150defcompare_with_sma(self, data: List[float], sma_window:int) ->dict : 151""" 152Compare EMA with Simple Moving Average (SMA). 153 154Args: 155data (List[float]): Input data 156sma_window (int): Window size for SMA calculation 157 158Returns: 159dict: Comparison results 160""" 161ema_values = self.calculate_ema_single(data) 162 163# Calculate SMA 164sma_values = [] 165foriin range(len(data)): 166ifi < sma_window - 1: 360
                sma_values.append(np.nan)
            else:
                sma_values.append(np.mean(data[i - sma_window + 1:i + 1]))

        return {
            'data': data,
            'ema': ema_values,
            'sma': sma_values,
            'ema_params': {'beta': self.beta, 'alpha': self.alpha},
            'sma_params': {'window': sma_window}
        }

    def plot_ema_analysis(self, data: List[float], title: str = "EMA Analysis"):
        """
        Create comprehensive plots for EMA analysis.

        Args:
            data (List[float]): Input data
            title (str): Plot title
        """
        ema_values = self.calculate_ema_single(data)
        weights = self.calculate_weights(min(10, len(data)))

        fig, axes = plt.subplots(2, 2, figsize=(15, 10))
        fig.suptitle(title, fontsize=16)

        # Plot 1: Original data vs EMA
        axes[0, 0].plot(data, 'b-o', label='Original Data', markersize=4)
        axes[0, 0].plot(ema_values, 'r-', label=f'EMA (beta={self.beta})', linewidth=2)
        axes[0, 0].set_title('Data vs EMA')
        axes[0, 0].set_xlabel('Time Period')
        axes[0, 0].set_ylabel('Value')
        axes[0, 0].legend()
        axes[0, 0].grid(True, alpha=0.3)

        # Plot 2: Weight distribution
        ages, weight_values = zip(*weights)
        axes[0, 1].bar(ages, weight_values, alpha=0.7, color='green')
        axes[0, 1].set_title('Exponential Weight Distribution')
        axes[0, 1].set_xlabel('Periods Ago')
        axes[0, 1].set_ylabel('Weight')
        axes[0, 1].grid(True, alpha=0.3)

        # Plot 3: Convergence analysis
        differences = [abs(ema_values[i] - data[i]) for i in range(len(data))]
        axes[1, 0].plot(differences, 'purple', marker='s', markersize=3)
        axes[1, 0].set_title('EMA-Data Absolute Difference')
        axes[1, 0].set_xlabel('Time Period')
        axes[1, 0].set_ylabel('|EMA - Data|')
        axes[1, 0].grid(True, alpha=0.3)

        # Plot 4: Cumulative weight (showing memory effect)
        cumulative_weights = np.cumsum([w[1] for w in weights])
        axes[1, 1].plot(ages, cumulative_weights, 'orange', marker='d', markersize=4)
        axes[1, 1].axhline(y=0.95, color='red', linestyle='--', alpha=0.7, label='95% Memory')
        axes[1, 1].set_title('Cumulative Weight Distribution')
        axes[1, 1].set_xlabel('Periods Ago')
        axes[1, 1].set_ylabel('Cumulative Weight')
        axes[1, 1].legend()
        axes[1, 1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()


# Example usage and demonstration
def demo_ema():
    """Demonstrate EMA functionality with examples."""

    print("=" * 60)
    print("EXPONENTIAL MOVING AVERAGE - PYTHON IMPLEMENTATION")
    print("=" * 60)

    # Create sample data
    np.random.seed(42)
    trend_data = np.linspace(10, 20, 10) + np.random.normal(0, 0.5, 10)
    print(f"\nSample Data: {[round(x, 2) for x in trend_data]}")

    # Initialize EMA calculator
    ema_calc = ExponentialMovingAverage(beta=0.8)

    # Calculate EMA
    ema_values, calculations = ema_calc.calculate_ema_step_by_step(trend_data)

    print(f"\nEMA Parameters:")
    print(f"beta (beta) = {ema_calc.beta}")
    print(f"alpha (alpha) = 1-beta = {ema_calc.alpha}")

    print(f"\n{'-'*80}")
    print("STEP-BY-STEP EMA CALCULATIONS:")
    print(f"{'-'*80}")

    for calc in calculations[:5]:  # Show first 5 steps
        print(f"Step {calc['step']}: Data = {calc['data_point']:.3f}")
        print(f" Formula: {calc['formula']}")
        print(f" Calculation: {calc['calculation']}")
        print(f" EMA Value: {calc['ema_value']:.6f}")
        print()

    # Show expanded formula
    print(f"\n{'-'*80}")
    print("EXPANDED EMA FORMULA:")
    print(f"{'-'*80}")
    expanded_formula = ema_calc.expand_ema_formula(4)
    print(expanded_formula)

    # Show weight distribution
    print(f"\n{'-'*80}")
    print("EXPONENTIAL WEIGHT DISTRIBUTION:")
    print(f"{'-'*80}")
    weights = ema_calc.calculate_weights(8)
    print(f"{'Periods Ago':<12} {'Weight':<10} {'Percentage':<12}")
    print("-" * 35)
    total_weight = sum(w[1] for w in weights)
    for age, weight in weights:
        percentage = (weight / total_weight) * 100
        print(f"{age:<12} {weight:<10.6f} {percentage:<10.2f}%")

    # Mathematical verification
    print(f"\n{'-'*80}")
    print("MATHEMATICAL VERIFICATION:")
    print(f"{'-'*80}")

    # Verify last EMA value using expanded formula
    n = len(trend_data)
    manual_ema = 0
    for i, data_point in enumerate(trend_data):
        age = n - 1 - i
        weight = (ema_calc.beta ** age) * ema_calc.alpha
        manual_ema += weight * data_point

    print(f"EMA (recursive method): {ema_values[-1]:.8f}")
    print(f"EMA (expanded formula): {manual_ema:.8f}")
    print(f"Difference: {abs(ema_values[-1] - manual_ema):.2e}")

    return ema_calc, trend_data, ema_values


# Run demonstration
if __name__ == "__main__":
    ema_calculator, data, ema_result = demo_ema()

    # Optional: Create plots (uncomment if matplotlib is available)
    # ema_calculator.plot_ema_analysis(data, "EMA Mathematical Demonstration")

    print(f"\n{'='*60}")
    print("DEMONSTRATION COMPLETE")
    print(f"{'='*60}")

Key Takeaways
1. Momentum = velocity accumulation from past gradients
2. β controls history influence (0.9 is standard)
3. Accelerates convergence but may overshoot
4. Escapes local minima better than vanilla GD
5. Dampens oscillations in narrow valleys
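The mathematical verification step in the demo compares the recursive EMA with its expanded weighted sum. The sketch below reproduces that check in isolation, under the assumption that the recursion starts from zero (v_0 = 0); the function names `ema_recursive` and `ema_weights` are hypothetical and are not part of the `ExponentialMovingAverage` class. If the recursion is instead initialized with the first data point, the oldest weight becomes β^(n-1) rather than (1-β)·β^(n-1), and the two values would no longer agree exactly.

```python
def ema_recursive(data, beta):
    """Recursive EMA with zero start: v_t = beta * v_{t-1} + (1 - beta) * x_t."""
    v = 0.0  # assumption: zero initialization
    history = []
    for x in data:
        v = beta * v + (1.0 - beta) * x
        history.append(v)
    return history


def ema_weights(n, beta):
    """Weight carried by each of n points, oldest first: (1 - beta) * beta**age."""
    return [(1.0 - beta) * beta ** (n - 1 - i) for i in range(n)]


data = [10.0, 11.0, 12.5, 12.0, 14.0, 15.5]  # illustrative values only
beta = 0.8

recursive = ema_recursive(data, beta)
weights = ema_weights(len(data), beta)
expanded = sum(w * x for w, x in zip(weights, data))

print(f"EMA (recursive): {recursive[-1]:.8f}")
print(f"EMA (expanded):  {expanded:.8f}")
# With a zero start the weights sum to 1 - beta**n, not to 1
print(f"sum of weights:  {sum(weights):.6f}")
print(f"1 - beta**n:     {1 - beta ** len(data):.6f}")
```

The same recursion, applied to gradients instead of data points, is what accumulates the velocity in momentum-based optimizers.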
Chapter 34 SGD with Momentum Explained in Detail with Animations Optimizers in Deep Learning Part 2
34.1 Deep Learning Optimization Techniques: Momentum with SGD
34.1.1 Introduction
This guide covers optimization techniques in deep learning, focusing specifically on SGD with Momentum. In deep learning, we deal with complex loss landscapes that require sophisticated optimization algorithms to navigate effectively.
Key Concepts
- Loss Function: L(θ) = f(W, b), where θ represents all parameters
- Objective: Find θ* = arg min_θ L(θ)
- Challenge: Non-convex optimization in high-dimensional spaces
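As a rough illustration of the objective θ* = arg min_θ L(θ), the sketch below minimizes a toy two-parameter loss with and without momentum. Everything in it is an assumption made for illustration: the quadratic loss, the starting point, the learning rate of 0.03, and β = 0.9. It also uses the exact gradient rather than mini-batch estimates, so it shows the momentum update rule only, not the stochastic part of SGD.

```python
def loss(w1, w2):
    """Toy ill-conditioned quadratic: steep along w2, shallow along w1."""
    return 0.5 * (w1 ** 2 + 50.0 * w2 ** 2)


def grad(w1, w2):
    """Exact gradient of the toy loss."""
    return w1, 50.0 * w2


def run(lr, beta, steps, start=(10.0, 1.0)):
    """Velocity form: v = beta * v + lr * grad, then w = w - v.

    beta = 0 reduces to plain gradient descent.
    """
    w1, w2 = start
    v1 = v2 = 0.0
    for _ in range(steps):
        g1, g2 = grad(w1, w2)
        v1 = beta * v1 + lr * g1
        v2 = beta * v2 + lr * g2
        w1 -= v1
        w2 -= v2
    return loss(w1, w2)


plain_loss = run(lr=0.03, beta=0.0, steps=100)
momentum_loss = run(lr=0.03, beta=0.9, steps=100)
print(f"plain GD loss after 100 steps: {plain_loss:.6f}")
print(f"momentum loss after 100 steps: {momentum_loss:.6f}")
```

On this particular loss the shallow w1 direction limits plain gradient descent, while the accumulated velocity lets the momentum run cover more ground in the same number of steps. Other losses or hyperparameters may behave differently.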
34.1.2 Understanding Graph Visualizations
Figure 34.1: image
1. 2D Loss Function Plot
   - X-axis: Single parameter (e.g., weight w)
   - Y-axis: Loss value L(w)
   - Purpose: Visualize how loss changes with one parameter
   L = f(w)
2. 3D Loss Surface
   - X, Y axes: Two parameters (e.g., w1, w2)
   - Z-axis: Loss value L(w1, w2)
   - Purpose: Visualize loss landscape in 3D space
   L = f(w1, w2)
3. Contour Plot
   - 2D projection of 3D loss surface
   - Contour lines: Connect points of equal loss
   - Color coding:
     - Blue = Lower loss (minima)
     - Yellow/Red = Higher loss (maxima)
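To make the contour idea concrete without a plotting library, the sketch below evaluates a hypothetical two-parameter loss on a grid and prints it as a text map, where each symbol stands for one band of loss values. The loss function, grid ranges, and symbol set are all assumptions chosen for illustration; a real plot would typically use a plotting library instead.

```python
def loss(w1, w2):
    """Hypothetical bowl-shaped loss with its minimum at (1, -0.5)."""
    return (w1 - 1.0) ** 2 + 2.0 * (w2 + 0.5) ** 2


def grid(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


w1_values = grid(-2.0, 4.0, 25)
w2_values = grid(-2.0, 1.0, 13)

# surface[j][i] is the loss at (w1_values[i], w2_values[j]): the "Z-axis" of the 3D surface
surface = [[loss(w1, w2) for w1 in w1_values] for w2 in w2_values]

# Bin each loss into a band; cells sharing a symbol lie between the same two contour lines
symbols = " .:-=+*#%@"  # blank = lowest loss, '@' = highest loss
max_loss = max(max(row) for row in surface)
for row in reversed(surface):  # print the largest w2 at the top
    line = ""
    for value in row:
        band = min(int(value / max_loss * len(symbols)), len(symbols) - 1)
        line += symbols[band]
    print(line)

# Locate the grid cell with the lowest loss
best_loss, best_w1, best_w2 = min(
    (surface[j][i], w1_values[i], w2_values[j])
    for j in range(len(w2_values))
    for i in range(len(w1_values))
)
print(f"lowest grid loss {best_loss:.4f} at w1={best_w1:.2f}, w2={best_w2:.2f}")
```

The nested rings of symbols play the role of contour lines: moving across a ring boundary changes the loss band, while moving along a ring keeps the loss roughly constant.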
Content sourced from CampusX Deep Learning notes (PDF).
Common mistakes
- Using an optimizer without tuning its base learning rate.
- Ignoring how the learning rate interacts with batch size.
- Comparing runs that use different random seeds, which can lead to false conclusions.
Interview checkpoints
- Q: Default optimizer today? A: Often Adam/AdamW with tuned LR.
- Q: When to prefer SGD with momentum? A: On vision tasks, SGD with momentum and a careful schedule can beat Adam.
Practice
- Basic: Describe Optimizer Project.
- Intermediate: Train same net with SGD vs Adam.
- Advanced: Plot loss per optimizer at equal compute.
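One possible starting point for the intermediate and advanced exercises is sketched below. It stands in a tiny linear regression for "the same net" and compares SGD with momentum against Adam under the same step budget, the same data, and the same mini-batch sequence. The data, the learning rate of 0.05 for both optimizers, and all function names are assumptions for illustration; a fair comparison would tune the learning rate separately for each optimizer.

```python
import math
import random


def make_data(seed, n=200):
    """Synthetic 1-D regression data: y = 3x + 2 + small Gaussian noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + 2.0 + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def mse(params, xs, ys):
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def batch_grad(params, xs, ys, idx):
    """Gradient of the mini-batch MSE with respect to (w, b)."""
    w, b = params
    gw = gb = 0.0
    for i in idx:
        err = w * xs[i] + b - ys[i]
        gw += 2.0 * err * xs[i]
        gb += 2.0 * err
    return [gw / len(idx), gb / len(idx)]


def sgd_momentum_step(params, grads, state, lr=0.05, beta=0.9):
    state["v"] = [beta * v + lr * g for v, g in zip(state["v"], grads)]
    return [p - v for p, v in zip(params, state["v"])]


def adam_step(params, grads, state, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] = [b1 * m + (1 - b1) * g for m, g in zip(state["m"], grads)]
    state["s"] = [b2 * s + (1 - b2) * g * g for s, g in zip(state["s"], grads)]
    m_hat = [m / (1 - b1 ** state["t"]) for m in state["m"]]  # bias correction
    s_hat = [s / (1 - b2 ** state["t"]) for s in state["s"]]
    return [p - lr * m / (math.sqrt(s) + eps) for p, m, s in zip(params, m_hat, s_hat)]


def train(step_fn, state, xs, ys, steps=300, batch_size=16, seed=0):
    rng = random.Random(seed)  # same seed gives every optimizer the same mini-batches
    params = [0.0, 0.0]
    curve = [mse(params, xs, ys)]
    for _ in range(steps):
        idx = rng.sample(range(len(xs)), batch_size)
        params = step_fn(params, batch_grad(params, xs, ys, idx), state)
        curve.append(mse(params, xs, ys))
    return curve


xs, ys = make_data(seed=42)
sgd_curve = train(sgd_momentum_step, {"v": [0.0, 0.0]}, xs, ys)
adam_curve = train(adam_step, {"t": 0, "m": [0.0, 0.0], "s": [0.0, 0.0]}, xs, ys)
for step in (0, 50, 100, 300):
    print(f"step {step:3d}  SGD+momentum {sgd_curve[step]:.4f}  Adam {adam_curve[step]:.4f}")
```

The two lists `sgd_curve` and `adam_curve` hold the full-dataset loss after every step, so they can be passed directly to a plotting routine. Repeating the run over several seeds before drawing conclusions guards against the seed-related mistake listed above.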
Recap
- Optimizer Project changes optimization dynamics.
- Fair comparisons need same budget.
- Module 6: convolutions next.
