Module 4 · 100 Days of DL

Module 4: Performance Hacks, Regularization & Activations

Tune neural networks for better generalization: apply L1/L2 regularization (weight decay), add Dropout layers, scale inputs, and compare ReLU with its LeakyReLU and ELU variants.

⏱ 30 Min Read • Author: GenAIWallah Team • Updated: May 2026

Day 34

Overfitting in DL


Why this matters

Deep networks overfit easily: a model that memorizes its training data shows high training accuracy but poor validation accuracy.

25.0.2 Overfitting Observation

– Model Behavior:
∗ Training loss decreases, but validation loss plateaus or increases, indicating overfitting.
∗ The decision boundary (regression line) closely fits training points but does not generalize well to test points.
∗ The gap between training and validation loss is a clear sign of overfitting.
– Visualization:
∗ Plotting training vs validation loss shows a widening gap as training progresses.
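The widening gap described above can also be checked numerically. The following is a minimal sketch, not taken from the source notes: the function names, the `patience` parameter, and the loss values are all illustrative assumptions.

```python
def generalization_gap(train_loss, val_loss):
    """Per-epoch gap between validation loss and training loss."""
    return [v - t for t, v in zip(train_loss, val_loss)]


def first_overfit_epoch(train_loss, val_loss, patience=2):
    """Return the last epoch before validation loss starts rising for
    `patience` consecutive epochs while training loss keeps falling.
    Returns None if the curves never diverge that way."""
    streak = 0
    for epoch in range(1, len(val_loss)):
        val_up = val_loss[epoch] > val_loss[epoch - 1]
        train_down = train_loss[epoch] < train_loss[epoch - 1]
        streak = streak + 1 if (val_up and train_down) else 0
        if streak == patience:
            return epoch - patience
    return None


# Hypothetical loss curves (made-up numbers, for illustration only).
train = [0.90, 0.60, 0.40, 0.30, 0.22, 0.15]
val = [0.95, 0.70, 0.55, 0.58, 0.63, 0.70]

gaps = generalization_gap(train, val)
print("gap per epoch:", [round(g, 2) for g in gaps])
print("divergence starts after epoch index:", first_overfit_epoch(train, val))
```

With a Keras model, the two lists would typically come from `history.history["loss"]` and `history.history["val_loss"]` after calling `fit` with validation data.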


Improving convergence performance requires careful calibration of data scaling, weight initialization, and activation bounds.

Common mistakes

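Treating a low training loss as evidence that the model generalizes.
Combining too many regularization techniques without ablation.
No validation split.

Interview checkpoints

Q: How do you detect overfitting? A: Validation loss diverges from training loss.
Q: Dropout at test time? A: Disable it. With inverted dropout (the Keras default) no rescaling is needed; with the original formulation, scale activations by the keep probability.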

Practice

Basic: Define overfitting in deep learning.
Intermediate: Train a Keras model until it overfits; plot training and validation curves.
Advanced: Ablation table of regularizers.

Recap

Overfitting shows up as a widening gap between training and validation loss.
Always measure on validation.
Combine with good baselines.

Next: Day 35 — L2 Regularization

Day 35

L2 Regularization


Why this matters

L2 weight decay penalizes large weights — smoother decision boundaries.

26.0.10 Intuition Behind Regularization

– Why Does It Work?
∗ Regularization penalizes large weights, reducing the model’s ability to fit noise.
∗ Result: Model becomes simpler, focuses on major patterns, and generalizes better.
∗ Visualization: Weights are distributed closer to zero, reducing model complexity.
– L1 vs L2 Regularization
∗ L1 (Lasso): Can make some weights exactly zero, useful for feature selection.
∗ L2 (Ridge): Shrinks weights but rarely to zero, better for general use.
∗ Elastic Net: Combines both L1 and L2 penalties.


To prevent overfitting, we add a penalty on weight magnitudes to the loss function. For L2 regularization the penalty is the sum of squared weights: $$L_{reg} = L + \lambda \sum w^2$$ (L1 regularization uses $\lambda \sum |w|$ instead.) Alternatively, **Dropout** randomly disables a fraction $p$ of hidden units at each training iteration, forcing the network to learn redundant representations and preventing co-adaptation of units.
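To make the formula $L_{reg} = L + \lambda \sum w^2$ concrete, here is a small pure-Python sketch. The function names, the learning rate, and the weight values are assumptions chosen for illustration; the data gradient is set to zero so that only the shrinkage from the penalty is visible.

```python
def l2_penalty(weights, lam):
    """Penalty term: lambda * sum of squared weights."""
    return lam * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, lam):
    """L_reg = L + lambda * sum(w^2)."""
    return data_loss + l2_penalty(weights, lam)


def sgd_step_l2(weights, grads, lr, lam):
    """One gradient step on L_reg; the penalty contributes 2 * lambda * w."""
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]


weights = [0.5, -2.0, 3.0]
zero_grads = [0.0, 0.0, 0.0]  # isolate the effect of the penalty

print("regularized loss:", regularized_loss(1.0, weights, lam=0.01))
print("after one step:", sgd_step_l2(weights, zero_grads, lr=0.1, lam=0.01))
```

Each step multiplies every weight by $(1 - 2 \cdot lr \cdot \lambda)$, which is why L2 regularization is often described as weight decay: large weights lose more in absolute terms, but no weight is driven exactly to zero.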

Figure: Standard Network vs. Network with Dropout Applied

Common mistakes

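Setting the L2 strength so high that the model underfits.
Stacking several regularizers (L2, dropout, early stopping) without ablation.
No validation split.

Interview checkpoints

Q: When to use L2 regularization? A: When validation loss diverges from training loss.
Q: Dropout at test time? A: Disable it. With inverted dropout (the Keras default) no rescaling is needed; with the original formulation, scale activations by the keep probability.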

Practice

Basic: Define L2 Regularization.
Intermediate: Add L2 Regularization to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

L2 Regularization reduces overfitting.
Always measure on validation.
Combine with good baselines.

Next: Day 36 — L1 Regularization

Day 36

L1 Regularization


Why this matters

L1 sparsifies weights — feature selection effect.
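The sparsifying effect can be seen by comparing one update step under each penalty. This sketch uses the soft-thresholding (proximal) update for L1, which is one common way to handle the non-differentiable $|w|$ term; the source notes do not specify an optimizer, and the function names and numbers below are illustrative assumptions.

```python
def l1_soft_threshold(weights, lr, lam):
    """Proximal step for lam * sum(|w|): move each weight toward zero by
    lr * lam and set it to exactly zero if it would cross zero."""
    t = lr * lam
    out = []
    for w in weights:
        if w > t:
            out.append(w - t)
        elif w < -t:
            out.append(w + t)
        else:
            out.append(0.0)
    return out


def l2_shrink(weights, lr, lam):
    """Gradient step on lam * sum(w^2): scale each weight by (1 - 2*lr*lam)."""
    return [w * (1.0 - 2.0 * lr * lam) for w in weights]


weights = [0.03, -0.8, 0.002, 1.5]
after_l1 = l1_soft_threshold(weights, lr=0.1, lam=0.5)  # threshold is 0.05
after_l2 = l2_shrink(weights, lr=0.1, lam=0.5)

print("L1:", after_l1)  # small weights become exactly zero
print("L2:", after_l2)  # all weights shrink but stay non-zero
```

The two small weights fall inside the L1 threshold and are zeroed, which is the feature-selection effect; under L2 they are only scaled down.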

26.0.10 Intuition Behind Regularization

Non-linear layers prevent the collapse of multiple layers into a single linear matrix operation. While **ReLU** ($f(x) = \max(0, x)$) is the standard hidden layer default, variants like **LeakyReLU** ($f(x) = \max(\alpha x, x)$) and **ELU** mitigate the "dying ReLU" problem where inactive nodes yield zero gradients.
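A short sketch of the three activations and their gradients shows where the "dying ReLU" problem comes from. The slope values (`alpha=0.01` for LeakyReLU, `alpha=1.0` for ELU) are illustrative choices, not values given in the source.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    # Equivalent to max(alpha * x, x) when 0 < alpha < 1.
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha


def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)


for x in (-2.0, 0.5):
    print(x, relu(x), leaky_relu(x), round(elu(x), 4))
    print("  grads:", relu_grad(x), leaky_relu_grad(x), round(elu_grad(x), 4))
```

For a negative input, the ReLU gradient is exactly zero, so a unit stuck in that region receives no update. LeakyReLU and ELU keep a small non-zero gradient there, which lets the unit recover.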

Common mistakes

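Setting the L1 strength so high that useful weights are zeroed out.
Combining too many sparsity-inducing techniques without ablation.
No validation split.

Interview checkpoints

Q: When to use L1 regularization? A: When validation loss diverges from training loss and a sparse set of weights is desirable.
Q: Dropout at test time? A: Disable it. With inverted dropout (the Keras default) no rescaling is needed; with the original formulation, scale activations by the keep probability.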

Practice

Basic: Define L1 Regularization.
Intermediate: Add L1 Regularization to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

L1 Regularization reduces overfitting.
Always measure on validation.
Combine with good baselines.

Next: Day 37 — Dropout

Day 37

Dropout


Why this matters

Dropout randomly drops units during training — ensemble effect at test time.

25.0.3 Dropout Implementation

– New Model with Dropout:
∗ Same architecture as before, but dropout layers are added after each hidden layer.
∗ Dropout rate: 0.2 (i.e., 20% of neurons are randomly dropped during training).
∗ Dropout is applied only during training, not during testing.
– Training and Results:
∗ Training loss increases slightly compared to the previous model.
∗ Validation loss is reduced, and the gap between training and validation loss narrows.
∗ The model becomes less sensitive to small fluctuations in the data.
– Comparison Table:

| Model Type | Training Loss | Validation Loss | Overfitting | Notes |
| --- | --- | --- | --- | --- |
| Without Dropout | Low | High | Yes | Decision boundary fits training data |
| With Dropout (0.2) | Slightly high | Lower | No | Smoother, less sensitive to noise |
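The train-only behaviour of dropout can be sketched in a few lines of pure Python. This is an illustration of inverted dropout (the variant Keras uses, where surviving activations are scaled up during training), not the code from the source notebook; the function name and the seeded random generator are assumptions.

```python
import random


def dropout(activations, rate=0.2, training=True, rng=None):
    """Inverted dropout: during training, zero each unit with probability
    `rate` and scale the survivors by 1 / (1 - rate). At inference the
    input is returned unchanged."""
    if not training or rate == 0.0:
        return list(activations)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # seeded only to make the demo repeatable
hidden = [1.0] * 10000

train_out = dropout(hidden, rate=0.2, training=True, rng=rng)
test_out = dropout(hidden, rate=0.2, training=False)

dropped = sum(1 for v in train_out if v == 0.0) / len(train_out)
print("fraction dropped in training:", round(dropped, 3))  # close to 0.2
print("mean activation in training:", round(sum(train_out) / len(train_out), 3))
print("unchanged at inference:", test_out == hidden)
```

Because the survivors are scaled by `1 / (1 - rate)`, the expected activation is the same in both modes, so nothing needs to be rescaled at test time.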


Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

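Leaving dropout active at inference.
Combining too many regularization techniques without ablation.
No validation split.

Interview checkpoints

Q: When to use Dropout? A: When validation loss diverges from training loss.
Q: Dropout at test time? A: Disable it. With inverted dropout (the Keras default) no rescaling is needed; with the original formulation, scale activations by the keep probability.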

Practice

Basic: Define Dropout.
Intermediate: Add Dropout to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

Dropout reduces overfitting.
Always measure on validation.
Combine with good baselines.

Next: Day 38 — Batch Normalization

Day 38

Batch Normalization


Why this matters

BatchNorm stabilizes activations, but it behaves differently in training (batch statistics) and at inference (running averages).
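The difference between training and inference behaviour can be sketched for a single scalar feature. The class name, the `momentum` convention (weight on the old running value, as in Keras), and the sample batch are assumptions made for illustration.

```python
import math


class BatchNorm1Feature:
    """Batch normalization for one scalar feature (illustrative sketch)."""

    def __init__(self, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = 1.0, 0.0  # learnable scale and shift
        self.running_mean, self.running_var = 0.0, 1.0
        self.momentum, self.eps = momentum, eps

    def __call__(self, batch, training):
        if training:
            # Training: normalize with statistics of the current batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Keep running averages for later use at inference.
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # Inference: use the stored running statistics instead.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in batch]


bn = BatchNorm1Feature()
batch = [10.0, 12.0, 14.0, 16.0]

train_out = bn(batch, training=True)
eval_out = bn(batch, training=False)

print("training output:", [round(v, 3) for v in train_out])
print("inference output:", [round(v, 3) for v in eval_out])
```

After a single training batch the running averages are still far from the batch statistics, so the two outputs differ noticeably. Forgetting to switch modes at inference is a common source of inconsistent predictions.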

79.2.2 Why Don’t We Use Batch Normalization in Transformers? (23:01 - 38:25)


Common mistakes

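Leaving BatchNorm in training mode at inference, so it uses batch statistics instead of running averages.
Combining too many normalization and regularization techniques without ablation.
No validation split.

Interview checkpoints

Q: When to use Batch Normalization? A: When training a deep network is slow or unstable.
Q: Dropout at test time? A: Disable it. With inverted dropout (the Keras default) no rescaling is needed; with the original formulation, scale activations by the keep probability.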

Practice

Basic: Define Batch Normalization.
Intermediate: Add Batch Normalization to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

Batch Normalization stabilizes training and can have a mild regularizing effect.
Always measure on validation.
Combine with good baselines.

Next: Day 39 — Layer Normalization

Day 39

Layer Normalization


Why this matters

LayerNorm normalizes per token — standard in transformers.
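A minimal sketch of per-token normalization follows. Each token's feature vector is normalized on its own, with no dependence on other tokens or on the batch. The function name and the toy sequence are illustrative assumptions.

```python
import math


def layer_norm(token, gamma=None, beta=None, eps=1e-5):
    """Normalize one token's feature vector to zero mean and unit variance,
    then apply an optional per-feature scale (gamma) and shift (beta)."""
    d = len(token)
    gamma = gamma or [1.0] * d
    beta = beta or [0.0] * d
    mean = sum(token) / d
    var = sum((x - mean) ** 2 for x in token) / d
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (x - mean) * inv_std + b for x, g, b in zip(token, gamma, beta)]


# Two tokens with very different scales (made-up values).
sequence = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 20.0, 30.0, 40.0],
]
normed = [layer_norm(tok) for tok in sequence]

for tok in normed:
    print([round(v, 3) for v in tok])
```

Both tokens end up with nearly identical normalized values even though their raw scales differ by a factor of ten. Since no batch statistics are involved, the computation is the same in training and at inference.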

79.2 Why Batch Normalization Fails in Transformers & Layer Normalization Solution


Common mistakes

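Normalizing over the batch axis instead of the feature axis.
Combining too many normalization techniques without ablation.
No validation split.

Interview checkpoints

Q: When to use Layer Normalization? A: In transformers and other sequence models, where statistics are computed per token rather than per batch.
Q: Dropout at test time? A: Disable it. With inverted dropout (the Keras default) no rescaling is needed; with the original formulation, scale activations by the keep probability.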

Practice

Basic: Define Layer Normalization.
Intermediate: Add Layer Normalization to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

Layer Normalization stabilizes training and behaves identically in training and at inference.
Always measure on validation.
Combine with good baselines.

Next: Day 40 — Data Augmentation

Day 40

Data Augmentation


Why this matters

Data augmentation multiplies effective training data for vision.
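As a rough illustration of how augmentation produces new training samples from one image, the sketch below applies a random horizontal flip and a small shift to an image stored as nested lists. The function names, the 0-or-1 pixel shift, and the tiny 2x3 "image" are assumptions for demonstration; a real pipeline would use a library such as Keras preprocessing layers.

```python
import random


def horizontal_flip(image):
    """Mirror an image stored as rows of pixel values."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift every row right by `pixels`, padding the left edge with `fill`."""
    if pixels <= 0:
        return [list(row) for row in image]
    out = []
    for row in image:
        p = min(pixels, len(row))
        out.append([fill] * p + list(row[:len(row) - p]))
    return out


def augment(image, rng):
    """Apply a random flip followed by a random shift of 0 or 1 pixels."""
    flipped = horizontal_flip(image) if rng.random() < 0.5 else image
    return shift_right(flipped, rng.randint(0, 1))


image = [[1, 2, 3],
         [4, 5, 6]]
rng = random.Random(42)  # seeded only to make the demo repeatable

variants = [augment(image, rng) for _ in range(4)]
for v in variants:
    print(v)
```

Every variant keeps the original shape and label, so the model sees several plausible versions of the same example instead of one.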

53.1.10 Practical Application Example

Cat vs Dog Classification

Challenge: ImageNet doesn’t specifically contain “Cat” and “Dog” as separate classes among its 1000 categories.

Solution: Transfer Learning Approach
1. Use VGG16 pre-trained on ImageNet
2. Remove top classification layers
3. Add binary classification head
4. Train only new layers on cat/dog dataset

53.2 Why Transfer Learning Works & Implementation Methods

53.2.1 Why Transfer Learning Works - The Science Behind It

Quick Recap: Transfer Learning Process

The Core Philosophy: “Don’t Reinvent the Wheel”

Key Insight: The wheel is already built, so use it to build a car.

53.2.2 Feature Hierarchy in CNNs

Layer-wise Feature Learning Progression

| Layer Position | Feature Type | Examples | Transferability |
| --- | --- | --- | --- |
| Early Layers | Primitive Features | Edges, corners, textures | Highly Transferable |
| Middle Layers | Intermediate Patterns | Shapes, simple objects | Moderately Transferable |
| Deep Layers | Complex Features | Specific objects, faces | Task-Specific |

Universal Feature Concept

Figure 53.5: image

Why Primitive Features are Universal

| Real-World Object | Primitive Features Required |
| --- | --- |
| Cat | Edges, curves, textures |
| Dog | Edges, curves, textures |
| Phone | Edges, rectangles, textures |
| Car | Edges, curves, metallic textures |

Core Principle: All real-world objects share similar primitive building blocks, regardless of the specific classification task!

53.2.3 Two Main Approaches to Transfer Learning

Method 1: Feature Extraction

| Component | Status | Purpose |
| --- | --- | --- |
| Convolutional Base | Frozen | Feature extraction |
| FC Layers | Trainable | Task-specific classification |
| Weights | Fixed in conv base | Preserve learned features |

Configuration Details

When to Use Feature Extraction

Ideal Scenarios:
- Your task classes are similar to pre-training data
- Example: Cat/Dog classification (ImageNet has animals)
- Limited computational resources
- Small dataset available

Figure 53.6: image

Architecture Modification

Method 2: Fine Tuning

| Layer Section | Training Status | Learning Rate |
| --- | --- | --- |
| Early Conv Layers | Frozen | N/A |
| Late Conv Layers | Trainable | Very Low |
| Custom FC Layers | Trainable | Standard |

Fine Tuning Strategy

When to Use Fine Tuning

Ideal Scenarios:
- Your task is significantly different from pre-training data
- Example: Phone vs Tablet (not well represented in ImageNet)
- Larger dataset available
- More computational resources available

| Aspect | Feature Extraction | Fine Tuning |
| --- | --- | --- |
| Training Time | Fast | Slower |
| Memory Usage | Low | Higher |
| Flexibility | Limited | High |
| Data Requirements | Small dataset OK | Larger dataset preferred |

Trade-offs Comparison

53.2.4 Technical Implementation Strategy

Feature Extraction Implementation

```python
# Pseudo-code structure
from tensorflow.keras import Sequential
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D

model = VGG16(weights='imagenet', include_top=False)  # Remove top layers
model.trainable = False  # Freeze convolutional base

# Add custom classification head
custom_model = Sequential([
    model,
    GlobalAveragePooling2D(),
    Dense(128, activation='relu'),
    Dense(1, activation='sigmoid'),  # Binary classification
])
```

Fine Tuning Implementation

```python
# Pseudo-code structure
from tensorflow.keras.applications import VGG16

model = VGG16(weights='imagenet', include_top=False)

# Freeze early layers, unfreeze later layers
for layer in model.layers[:-4]:  # Freeze all but last 4 layers
    layer.trainable = False
for layer in model.layers[-4:]:  # Unfreeze last 4 layers
    layer.trainable = True

# Add custom head with lower learning rate
```
53.2.5 Decision Framework: Which Method to Choose?

Decision Tree

Figure 53.7: image

Quick Reference Guide

| Your Scenario | Recommended Approach | Expected Results |
| --- | --- | --- |
| Animal Classification | Feature Extraction | Excellent |
| Tech Device Classification | Fine Tuning | Very Good |
| Medical Imaging | Fine Tuning | Good with care |
| Art Style Classification | Fine Tuning | Very Good |


53.2.6 Next Steps: Practical Implementation

Learning Path
1. Feature Extraction Demo - Cat/Dog classification
2. Fine Tuning Demo - Custom classification task
3. Performance Comparison - Both methods side-by-side
4. Real-world Application - Deploy your model

Coming Up: Hands-on implementation of both Feature Extraction and Fine Tuning methods using Keras, with practical examples and performance comparisons!
53.2.7 Transfer Learning Implementations
Dataset
∗ Dogs vs Cats: https://www.kaggle.com/datasets/salader/dogs-vs-cats
Notebooks
1. Feature Extraction (Basic): https://colab.research.google.com/drive/1VxoR4vMmZJAOCsDUnfezPuFQqHdKabcL?usp=sharing
2. Feature Extraction (+ Augmentation): https://colab.research.google.com/drive/1q_INiVDAzhSy1L_A87fBTf2wC3l3MiWy?usp=sharing
3. Fine Tuning: https://colab.research.google.com/drive/1q_INiVDAzhSy1L_A87fBTf2wC3l3MiWy?usp=sharing

# Transfer Learning: Feature Extraction Implementation
# Feature Extraction + Data Augmentation

## Dataset Setup

```python
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!kaggle datasets download -d salader/dogs-vs-cats

import zipfile

zip_ref = zipfile.ZipFile('/content/dogs-vs-cats.zip', 'r')
zip_ref.extractall('/content')
zip_ref.close()
```

## Model Architecture

```python
import tensorflow
from tensorflow import keras
from keras import Sequential
from keras.layers import Dense, Flatten
from keras.applications.vgg16 import VGG16

conv_base = VGG16(
    weights='imagenet',
    include_top=False,
    input_shape=(150, 150, 3)
)

model = Sequential()
model.add(conv_base)
model.add(Flatten())
model.add(Dense(256, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

conv_base.trainable = False  # Freeze the convolutional base
```

## Data Augmentation Pipeline

```python
from keras.preprocessing.image import ImageDataGenerator

batch_size = 32

# Augmentation is applied to the training images only
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

# Validation images are only rescaled
test_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    '/content/train',
    target_size=(150, 150),
    batch_size=batch_size,
    class_mode='binary'
)

validation_generator = test_datagen.flow_from_directory(
    '/content/test',
    target_size=(150, 150),
    batch_size=batch_size,
    class_mode='binary'
)
```

## Training

```python
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# model.fit accepts generators directly; fit_generator is deprecated in TF 2.x
history = model.fit(
    train_generator,
    epochs=10,
    validation_data=validation_generator
)
```

## Visualization

```python
import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], color='red', label='train')
plt.plot(history.history['val_accuracy'], color='blue', label='validation')
plt.legend()
plt.show()

plt.plot(history.history['loss'], color='red', label='train')
plt.plot(history.history['val_loss'], color='blue', label='validation')
plt.legend()
plt.show()
```

## Augmentation Settings

- Rescale: 1./255
- Shear: 0.2
- Zoom: 0.2
- Horizontal Flip: True

## Key Difference

Uses ImageDataGenerator for data augmentation instead of basic preprocessing.
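To make two of the augmentation settings concrete, here is a minimal pure-Python sketch of rescaling and a random horizontal flip on a tiny nested-list "image". The helper names (`rescale`, `random_horizontal_flip`) and the 0.5 flip probability are illustrative assumptions, not the Keras implementation.

```python
import random


def rescale(image, factor=1.0 / 255):
    """Scale every pixel value, mirroring the rescale=1./255 setting."""
    return [[pixel * factor for pixel in row] for row in image]


def random_horizontal_flip(image, rng, probability=0.5):
    """Mirror the image left-to-right with the given probability (assumed 0.5)."""
    if rng.random() < probability:
        return [list(reversed(row)) for row in image]
    return image


# A tiny 2x3 grayscale "image" with raw pixel values in [0, 255].
image = [[0, 51, 102], [153, 204, 255]]

flipped = [list(reversed(row)) for row in image]  # deterministic flip, for comparison
scaled = rescale(image)

rng = random.Random(0)  # fixed seed so the run is repeatable
augmented = rescale(random_horizontal_flip(image, rng))
```

Only the training pipeline applies the random flip; the validation pipeline in the listing above just rescales, so validation scores stay comparable from epoch to epoch.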

53.3 Transfer Learning: Fine-Tuning Implementation

53.3.1 Dataset Setup

    # Download and extract Dogs vs Cats dataset
    !mkdir -p ~/.kaggle
    !cp kaggle.json ~/.kaggle/
    !kaggle datasets download -d salader/dogs-vs-cats

    import zipfile
    zip_ref = zipfile.ZipFile('/content/dogs-vs-cats.zip', 'r')
    zip_ref.extractall('/content')
    zip_ref.close()

53.3.2 Model Architecture Setup

Import Libraries

    import tensorflow
    from tensorflow import keras
    from keras import Sequential
    from keras.layers import Dense, Flatten
    from keras.applications.vgg16 import VGG16

Load VGG16 Base Model

    conv_base = VGG16(
        weights='imagenet',        # Pre-trained weights
        include_top=False,         # Remove top classification layers
        input_shape=(150, 150, 3)  # Input image dimensions
    )
53.3.3 Fine-Tuning Configuration

Selective Layer Unfreezing

    conv_base.trainable = True
    set_trainable = False

    for layer in conv_base.layers:
        if layer.name == 'block5_conv1':  # Start unfreezing from here
            set_trainable = True
        if set_trainable:
            layer.trainable = True
        else:
            layer.trainable = False

Layer Status Verification

    for layer in conv_base.layers:
        print(layer.name, layer.trainable)
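The flag-based unfreezing loop can be traced without downloading any weights. The sketch below runs the same logic over a plain list of layer names, assuming the names Keras gives VGG16 with `include_top=False`. The input layer's real name varies by Keras version, so `input_layer` is a placeholder, and `select_trainable` is a hypothetical helper rather than a Keras function.

```python
# Layer names of VGG16 (include_top=False), in order. The input layer's real
# name depends on the Keras version; "input_layer" is a placeholder.
VGG16_LAYERS = [
    "input_layer",
    "block1_conv1", "block1_conv2", "block1_pool",
    "block2_conv1", "block2_conv2", "block2_pool",
    "block3_conv1", "block3_conv2", "block3_conv3", "block3_pool",
    "block4_conv1", "block4_conv2", "block4_conv3", "block4_pool",
    "block5_conv1", "block5_conv2", "block5_conv3", "block5_pool",
]


def select_trainable(layer_names, first_trainable):
    """Return {name: trainable} using the same flag logic as the Keras loop."""
    set_trainable = False
    status = {}
    for name in layer_names:
        if name == first_trainable:
            set_trainable = True  # every layer from here on is unfrozen
        status[name] = set_trainable
    return status


status = select_trainable(VGG16_LAYERS, "block5_conv1")
trainable = [name for name, flag in status.items() if flag]
print(trainable)
```

Under these assumptions only the four block5 layers end up trainable, which matches the "last 4 layers" slice used in the earlier pseudo-code.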
53.3.4 Complete Model Assembly

    model = Sequential()
    model.add(conv_base)                       # Pre-trained base
    model.add(Flatten())                       # Flatten for dense layers
    model.add(Dense(256, activation='relu'))   # Custom dense layer
    model.add(Dense(1, activation='sigmoid'))  # Binary output

53.3.5 Data Pipeline Setup

Data Generators

    train_ds = keras.utils.image_dataset_from_directory(
        directory='/content/train',
        labels='inferred',
        label_mode='int',
        batch_size=32,
        image_size=(150, 150)
    )

    validation_ds = keras.utils.image_dataset_from_directory(
        directory='/content/test',
        labels='inferred',
        label_mode='int',
        batch_size=32,
        image_size=(150, 150)
    )

Data Normalization

    def process(image, label):
        image = tensorflow.cast(image / 255., tensorflow.float32)
        return image, label

    train_ds = train_ds.map(process)
    validation_ds = validation_ds.map(process)
53.3.6 Model Compilation & Training

Compilation Settings

    model.compile(
        # Very low learning rate; `learning_rate` replaces the deprecated `lr` argument
        optimizer=keras.optimizers.RMSprop(learning_rate=1e-5),
        loss='binary_crossentropy',
        metrics=['accuracy']
    )

Model Training

    history = model.fit(
        train_ds,
        epochs=10,
        validation_data=validation_ds
    )
53.3.7 Results Visualization

Training Metrics Plot

    import matplotlib.pyplot as plt

    # Accuracy Plot
    plt.plot(history.history['accuracy'], color='red', label='train')
    plt.plot(history.history['val_accuracy'], color='blue', label='validation')
    plt.legend()
    plt.show()

    # Loss Plot
    plt.plot(history.history['loss'], color='red', label='train')
    plt.plot(history.history['val_loss'], color='blue', label='validation')
    plt.legend()
    plt.show()
53.3.8 Key Implementation Details

Component | Configuration | Purpose
Base Model | VGG16 (ImageNet) | Feature extraction backbone
Frozen Layers | block1-block4 | Preserve low-level features
Trainable Layers | block5_conv1 onwards | Task-specific adaptation
Learning Rate | 1e-5 | Very low for fine-tuning
Input Size | 150x150x3 | Optimized for efficiency
Batch Size | 32 | Memory-efficient training
53.3.9 Fine-Tuning Strategy

Frozen Section
∗ Layers: block1_conv1 → block4_pool
∗ Purpose: Preserve primitive feature extraction
∗ Status: Weights unchanged during training

Trainable Section
∗ Layers: block5_conv1 → block5_pool
∗ Purpose: Adapt high-level features to cats/dogs
∗ Status: Weights updated with very low learning rate



Content sourced from CampusX Deep Learning notes (PDF). Run merge script for full body.

Common mistakes

Applying augmentation at inference; random transforms belong in the training pipeline only.
Combining too many techniques at once without an ablation study.
No validation split.

Interview checkpoints

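Q: When should you use Data Augmentation? A: When validation loss diverges from training loss, or when training data is limited.
Q: What happens to Dropout at test time? A: Dropout is disabled; activations are scaled (at test time, or during training with inverted dropout) so their expected values match.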
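The dropout answer can be made concrete with a minimal sketch of inverted dropout, one common way to do the scaling: surviving activations are scaled up during training, so nothing needs to change at test time. The function name and the list-based representation are illustrative assumptions.

```python
import random


def inverted_dropout(activations, rate, training, rng):
    """Apply inverted dropout to a list of activations.

    During training each unit is zeroed with probability `rate`, and the
    survivors are divided by the keep probability so the expected value is
    unchanged. At test time the input passes through untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


rng = random.Random(42)
activations = [1.0] * 10000

train_out = inverted_dropout(activations, rate=0.5, training=True, rng=rng)
test_out = inverted_dropout(activations, rate=0.5, training=False, rng=rng)

train_mean = sum(train_out) / len(train_out)  # close to 1.0 on average
test_mean = sum(test_out) / len(test_out)     # unchanged input, so exactly 1.0
```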

Practice

Basic: Define Data Augmentation.
Intermediate: Add Data Augmentation to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

Data Augmentation reduces overfitting.
Always measure on validation.
Combine with good baselines.

Next: Day 41 — ELU & LeakyReLU

Day 41

ELU & LeakyReLU

Problem 3: Slow Training
Solutions:
- Better Optimizers: Adam, AdaGrad instead of basic SGD
- Learning Rate Schedulers: Adjust learning rate during training
- Hardware Optimization: Use GPUs effectively

Problem 4: Overfitting
Issue: Model memorizes training data, poor generalization
Solutions:
- Regularization: L1, L2 penalties
- Dropout: Randomly disable neurons during training
- Early Stopping: Stop before overfitting occurs

21.1.6 Solution Techniques Summary

Figure 21.4: image

For Vanishing/Exploding Gradients

Technique | Description
Weight Initialization | Xavier, He initialization
Activation Functions | ReLU, LeakyReLU, ELU
Batch Normalization | Normalize layer inputs
Gradient Clipping | Limit gradient values

Why this matters

ELU and LeakyReLU reduce the dead ReLU problem by keeping a non-zero gradient for negative inputs.
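A minimal sketch of the three activations and their derivatives shows where the difference lies: for negative inputs ReLU's gradient is exactly zero, while LeakyReLU and ELU keep a small non-zero gradient. The `alpha` defaults (0.01 and 1.0) are common illustrative choices, assumed here rather than taken from the notes.

```python
import math


def relu(x):
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Small slope `alpha` for negative inputs instead of a hard zero."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """Smooth exponential curve for negative inputs, saturating at -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def relu_grad(x):
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha


def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)


# Compare gradients for two negative inputs and one positive input.
for x in (-2.0, -0.5, 1.5):
    print(x, relu_grad(x), leaky_relu_grad(x), round(elu_grad(x), 4))
```

A unit whose pre-activation stays negative receives no gradient under ReLU and stops learning; with the other two, a small update still flows through.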

20.1.12 Key Takeaways

Understanding
1. Mathematical foundation: Problem arises from multiplying many small numbers (<1)
2. Deep networks only: Affects networks with 8-10+ layers
3. Activation dependent: Mainly with sigmoid/tanh functions
4. Training failure: Results in inability to learn

Detection Methods
1. Loss monitoring: Watch for plateau in loss curves
2. Weight tracking: Monitor weight changes across epochs
3. Training observation: Loss not reducing after many epochs

Solution Priority
1. Most practical: Use ReLU activation function
2. Modern approach: Implement batch normalization
3. Architecture: Consider ResNet for very deep networks
4. Fundamentals: Use proper weight initialization
5. Last resort: Reduce model complexity

[Code link: https://colab.research.google.com/drive/1j1qAWzo6sjNU3f_vkMMijOAuFi1JoV8p?usp=sharing]
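The "multiplying many small numbers" point can be checked with a few lines of arithmetic. The sigmoid derivative never exceeds 0.25, so even in the best case the factor reaching the early layers shrinks geometrically with depth. This sketch ignores the weight terms in the chain rule, which is a simplifying assumption.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def gradient_scale(per_layer_grad, num_layers):
    """Product of identical per-layer derivative factors (weights ignored)."""
    return per_layer_grad ** num_layers


best_case = sigmoid_grad(0.0)  # 0.25, the largest value the derivative can take
for depth in (2, 5, 10, 20):
    print(depth, gradient_scale(best_case, depth))
```

At a depth of 10 the factor is already below one in a million, which is consistent with the note that the problem shows up in networks of roughly 8-10 layers or more.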


Part VII Improving Neural Network Performance

Chapter 21 How to Improve the Performance of a Neural Network

21.1 How to Improve Neural Networks

21.1.1 Overview

This video covers techniques to improve neural network performance, building on the basic concepts of perceptrons, multi-layer perceptrons, forward propagation, and backpropagation.

21.1.2 Main Objective

Learn how to improve an already trained artificial neural network's performance, moving from basic accuracy (like 90%) to higher performance (like 99%).

Figure 21.1: image


21.1.3 Part 1: Hyperparameter Tuning

Key Hyperparameters Table

Hyperparameter | Description | Impact
Number of Hidden Layers | Depth of the network | More layers → Better complex pattern recognition
Neurons per Layer | Width of each layer | More neurons → Greater learning capacity
Learning Rate | Speed of gradient descent | Too small → slow training, Too large → poor results
Optimizer | Algorithm for weight updates | Affects convergence speed and stability
Batch Size | Number of samples per update | Affects training speed and generalization
Activation Function | Non-linear transformation | Affects gradient flow and learning
Epochs | Number of complete data passes | More epochs → better learning (until overfitting)

Network Architecture Decisions

Figure 21.2: image

Number of Hidden Layers
– Recommendation: Use multiple layers rather than a single wide layer
– Reason: Deep networks enable representation learning
  ∗ Early layers: Capture primitive features (lines, edges)
  ∗ Middle layers: Combine primitives into shapes
  ∗ Final layers: Form complex patterns (faces, objects)

Architecture Comparison:

    Wide & Shallow: [Input] -> [512 neurons] -> [Output]
    Deep & Narrow: [Input] -> [128] -> [64] -> [32] -> [Output]

Neurons per Layer

Traditional Pyramid Approach:
- Decreasing neurons as you go deeper
- Logic: Primitive features (many) → Complex features (few)
- Example: 512 → 256 → 128 → 64

Figure 21.3: image

Modern Approach:
- Equal neurons across layers also works well
- Key Rule: Always use sufficient neurons
- Start with more, reduce only if overfitting occurs
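To compare the two architectures above on equal footing, it helps to count their trainable parameters. The sketch below does this for fully connected layers. The input size (100 features) and output size (1 unit) are assumed for illustration, since the notes do not fix them, and `dense_param_count` is a hypothetical helper.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers.

    layer_sizes = [inputs, hidden_1, ..., outputs]
    """
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Assumed sizes for illustration: 100 input features and 1 output unit.
wide_shallow = dense_param_count([100, 512, 1])
deep_narrow = dense_param_count([100, 128, 64, 32, 1])
print(wide_shallow, deep_narrow)
```

With these assumed sizes the deep, narrow stack uses fewer than half the parameters of the wide, shallow one while still stacking three levels of features.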

Batch Size Strategies

21.1.4 Types of Gradient Descent

1) Batch GD
2) Stochastic GD
3) Mini-Batch GD

Approach | Batch Size | Advantages | Disadvantages
Small Batch | 32, 64 | Better generalization, Stable training | Slower training
Large Batch | 512, 1024+ | Faster training | May not generalize well
Warm-up Strategy | Variable | Combines benefits of both | More complex implementation

Learning Rate Warm-up Technique
1. Start: Small learning rate with large batch size
2. Progress: Gradually increase learning rate
3. Result: Fast training + Good accuracy

Epochs and Early Stopping

Strategy:
- Set a high number of epochs (don't worry about the exact number)
- Use the Early Stopping callback
- The system automatically stops when no improvement is detected (auto-stop when performance plateaus)
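The warm-up steps can be sketched as a small schedule function. The notes only say the learning rate is increased "gradually", so the linear ramp, the step counts, and the rate values below are illustrative assumptions.

```python
def warmup_learning_rate(step, warmup_steps, start_lr, target_lr):
    """Linearly ramp from start_lr to target_lr, then hold target_lr."""
    if step >= warmup_steps:
        return target_lr
    fraction = step / warmup_steps
    return start_lr + fraction * (target_lr - start_lr)


# Hypothetical settings: ramp from 0.001 to 0.01 over the first 5 steps.
schedule = [
    warmup_learning_rate(s, warmup_steps=5, start_lr=0.001, target_lr=0.01)
    for s in range(8)
]
print(schedule)
```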

21.1.5 Part 2: Common Deep Learning Problems

Problem 1: Vanishing/Exploding Gradients
Issue:
- Gradients become too small (vanishing) or too large (exploding)
- Affects weight updates in early layers
- Training becomes ineffective
Solutions:
- Weight Initialization: Better initial weight values
- Activation Functions: Use ReLU instead of sigmoid
- Batch Normalization: Normalize inputs to each layer
- Gradient Clipping: Limit gradient magnitude

Problem 2: Insufficient Data
Challenge: Deep learning is data-hungry
Solutions:
- Transfer Learning: Use pre-trained models
- Data Augmentation: Create more training samples
- Unsupervised Learning: Learn from unlabeled data
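Gradient clipping, listed among the solutions above, can be sketched in a few lines. This version clips by L2 norm, which is one common variant and assumed here, since the notes do not say which form is meant. The threshold and gradient values are made up for illustration.

```python
import math


def clip_by_norm(gradients, max_norm):
    """Rescale the gradient vector so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm or norm == 0.0:
        return list(gradients)
    scale = max_norm / norm
    return [g * scale for g in gradients]


exploding = [30.0, -40.0]                   # L2 norm = 50
clipped = clip_by_norm(exploding, max_norm=5.0)
small = clip_by_norm([0.3, 0.4], max_norm=5.0)  # already within the limit
```

Clipping by norm keeps the direction of the update and only shortens its length, so a single exploding step cannot throw the weights far off course.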

21.1.6 Solution Techniques Summary

For Insufficient Data

Technique | Description
Transfer Learning | Use pre-trained models
Data Augmentation | Rotation, scaling, noise
Unsupervised Learning | Learn representations first

For Slow Training

Technique | Description
Advanced Optimizers | Adam, RMSprop, AdaGrad
Learning Rate Scheduling | Decay, cyclic, warm restart
Better Hardware | GPU utilization

For Overfitting

Technique | Description
Dropout | Random neuron deactivation
Regularization | L1/L2 weight penalties
Early Stopping | Stop at optimal point
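The L1/L2 weight penalties in the overfitting table amount to an extra term added to the training loss. Here is a minimal sketch, assuming the plain forms lam * sum(|w|) and lam * sum(w^2) (some texts put a factor of 1/2 on the L2 term). The weights, data loss, and `lam` are made-up illustration values.

```python
def l1_penalty(weights, lam):
    """lam * sum(|w|): pushes weights toward exactly zero (sparsity)."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum(w^2): shrinks large weights most strongly."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -1.5, 2.0]
data_loss = 0.30  # hypothetical loss on the training batch
lam = 0.01        # hypothetical regularization strength

total_l1 = data_loss + l1_penalty(weights, lam)
total_l2 = data_loss + l2_penalty(weights, lam)
print(total_l1, total_l2)
```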

21.1.7 Future Learning Roadmap

Upcoming Topics (Next 10-15 Videos)
1. Weight Initialization techniques
2. Activation Functions in detail
3. Optimizers comparison and implementation
4. Batch Normalization theory and practice
5. Gradient Clipping implementation
6. Transfer Learning practical applications
7. Dropout and regularization
8. Learning Rate Schedulers
9. Data Augmentation techniques
10. Hyperparameter Optimization

21.1.8 Key Takeaways

General Guidelines
1. Start Complex: Begin with more layers/neurons, reduce if needed
2. Sufficient Capacity: Always ensure enough neurons per layer
3. Experiment: Try different combinations systematically
4. Monitor: Use early stopping and validation metrics
5. Transfer Learning: Leverage pre-trained models when possible

Performance Improvement Strategy

    Step 1: Tune Hyperparameters
    Step 2: Address Specific Problems
    Step 3: Implement Advanced Techniques
    Step 4: Monitor and Iterate

Chapter 22 Early Stopping In Neural Networks End to End Deep Learning Course

22.1 Early Stopping in Neural Networks

22.1.1 Overview

Early Stopping is a technique to prevent overfitting by automatically stopping the training process when the model's performance stops improving on validation data.

22.1.2 Learning Objectives

– Understand what early stopping is and why it's essential
– Learn how to implement early stopping in Keras/TensorFlow
– Master the key parameters for effective early stopping
– Prevent overfitting automatically during training

22.1.3 The Problem: When to Stop Training?

Common Dilemma
– Question: How many epochs should I train my model?
– Naive Approach: Train for many epochs (100, 1000+) and see what happens
– Problem: This often leads to overfitting!

Overfitting Scenario

    Training Data Performance: Excellent results
    New/Test Data Performance: Poor results

Why this happens:
- Model memorizes training data instead of learning patterns
- Performance degrades on unseen data
- Training continues beyond optimal point

22.1.4 What is Early Stopping?

Definition

Early Stopping is an automatic mechanism that:
- Monitors validation performance during training
- Detects when further training becomes harmful
- Stops training at the optimal point
- Prevents overfitting automatically

Visual Concept

    Training Loss: Continuously decreasing
    Validation Loss: Decreases initially -> Starts increasing
                                         ^
                              Optimal stopping point
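The monitor, detect, and stop cycle can be sketched as a plain loop over recorded validation losses. The parameter names `patience` and `min_delta` mirror the Keras `EarlyStopping` callback, but this function is an illustrative stand-in, not the callback's implementation, and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once `patience` consecutive epochs fail to improve the
    best loss by more than `min_delta`. Epochs are 0-indexed.
    """
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Never ran out of patience: training used every epoch.
    return len(val_losses) - 1, best_epoch


# Hypothetical curve: validation loss falls, then starts rising.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

Training halts a few epochs after the best one, so in practice the weights from the best epoch are usually the ones kept.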

22.1.5 Practical Implementation


Common mistakes

Using a different activation at inference than the one the model was trained with.
Combining too many techniques at once without an ablation study.
No validation split.

Interview checkpoints

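Q: When should you use ELU or LeakyReLU? A: When ReLU units die, that is, they output zero for every input and stop receiving gradient.
Q: What happens to Dropout at test time? A: Dropout is disabled; activations are scaled (at test time, or during training with inverted dropout) so their expected values match.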

Practice

Basic: Define ELU & LeakyReLU.
Intermediate: Add ELU & LeakyReLU to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

ELU and LeakyReLU mitigate the dead ReLU problem by keeping a non-zero gradient for negative inputs.
Always measure on validation.
Combine with good baselines.

Next: Day 42 — Early Stopping

Day 42

Early Stopping

Contents

21.1.3 Part 1: Hyperparameter Tuning
21.1.4 Types of Gradient Descent
21.1.5 Part 2: Common Deep Learning Problems
21.1.6 Solution Techniques Summary
21.1.7 Future Learning Roadmap
21.1.8 Key Takeaways
22 Early Stopping In Neural Networks End to End Deep Learning Course
22.1 Early Stopping in Neural Networks
22.1.1 Overview
22.1.2 Learning Objectives
22.1.3 The Problem: When to Stop Training?
22.1.4 What is Early Stopping?
22.1.5 Practical Implementation
22.1.6 Early Stopping Parameters
22.1.7 Training Flow with Early Stopping
22.1.8 Best Practices
22.1.9 Advanced Configuration
23.1 Deep Learning: Feature Scaling and Normalization - Detailed Notes
23.1.1 Problem Statement
23.1.2 Technical Intuition
23.1.3 Solutions: Feature Scaling Techniques
23.1.4 When to Use Which Technique?
23.1.5 Practical Implementation
23.1.6 Results Comparison
23.1.7 Neural Network Architecture Used
23.1.8 Key Insights and Best Practices
23.1.9 Summary and Takeaways
23.1.10 Practical Checklist
24 Dropout Layer in Deep Learning Dropouts in ANN End to End Deep Learning
24.0.1 Detailed Notes on Dropout in Neural Networks

Why this matters

Early stopping halts training when the validation metric stops improving, which makes it a cheap regularizer.


Common mistakes

Treating early stopping as an inference-time setting; it only affects training.
Combining too many techniques at once without an ablation study.
No validation split.

Interview checkpoints

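Q: When should you use Early Stopping? A: When validation loss starts to diverge from training loss.
Q: What happens to Dropout at test time? A: Dropout is disabled; activations are scaled (at test time, or during training with inverted dropout) so their expected values match.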

Practice

Basic: Define Early Stopping.
Intermediate: Add Early Stopping to Keras model; plot curves.
Advanced: Ablation table of regularizers.

Recap

Early Stopping reduces overfitting.
Always measure on validation.
Combine with good baselines.

Next: Day 43 — Regularization Project

Day 43

Regularization Project

1.3. Artificial Neural Networks (ANN)

1.3.3 MLP [Multi-layer perceptron]

•Intuition of MLP
•MLP Notation
•Prediction in MLP

1.3.4 Training an MLP [Most used Algorithm]

•Gradient Descent
•Backpropagation

1.3.5 Practical with Keras
•CPU Vs GPU
•Installation
•Example 1 - Regression using Keras
•Example 2 - Classification using Keras
1.3.6 How to improve an ANN
•Vanishing Gradients
•Exploding Gradients
•Dropouts
•Regularization
•Weight Initialization
•Optimizers
•Gradient Checking and Clipping
•Batch Normalization
•Hyperparameter Tuning
1.3.7 Advanced Topics
•Callbacks
•Tensorboard
•Pretrained Models
•Keras Functional API
•Saving and Loading a Keras model
•Building a Streamlit Application
1.3.8 Project
•End-to-End Final Project
•AWS deployment
Convolutional Neural Networks (CNN)
•Convolution operations and filters
•Pooling layers and techniques
•Feature maps and visualization

Why this matters

Regularization project compares techniques on same baseline.

49.1.12 Project Extensions and Improvements

Suggested Enhancements
1. Data Augmentation: Rotate, flip, zoom images
2. Transfer Learning: Use pre-trained models (VGG, ResNet)
3. Hyperparameter Tuning: Learning rate, batch size optimization
4. Advanced Regularization: L1/L2 penalties
5. More Complex Architectures: Deeper networks

Performance Optimization
1. Learning Rate Scheduling: Adaptive learning rates
2. Early Stopping: Prevent overfitting automatically
3. Model Checkpointing: Save best performing models
4. Cross-Validation: Better performance estimation


Common mistakes

Leaving training-only regularizers (such as dropout) active at inference.
Combining too many techniques at once without an ablation study.
No validation split.

Interview checkpoints

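Q: When should you add regularization? A: When validation loss diverges from training loss.
Q: What happens to Dropout at test time? A: Dropout is disabled; activations are scaled (at test time, or during training with inverted dropout) so their expected values match.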

Practice

Basic: Define regularization.
Intermediate: Add a regularizer to a Keras model; plot the training curves.
Advanced: Ablation table of regularizers.

Recap

Regularization reduces overfitting; the project compares techniques on the same baseline.
Always measure on validation.
Combine with good baselines.

Next: Day 44 — Model Comparison

Day 44

Model Comparison

Contents

9.1.7 Performance Comparison
9.1.8 Key Learning Outcomes
10 Forward Propagation How a neural network predicts output
10.1 Neural Network Forward Propagation
10.1.1 Course Continuation Overview
10.1.2 Today's Focus: Forward Propagation
10.1.3 Video Objectives
10.1.4 Notation Explanation
10.1.5 Course Structure Difference
10.1.6 Key Takeaways
IV Practical Applications with ANN
11 Customer Churn Prediction using ANN Keras and Tensorflow Deep Learning Classification
11.1 Neural Networks for Customer Churn Prediction - Complete Guide
11.1.1 Introduction
11.1.2 Dataset Overview
11.1.3 Data Preprocessing
11.1.4 Building Neural Network
11.1.5 Model Training
11.1.6 Model Evaluation
11.1.7 Visualization
11.1.8 Model Improvement
11.1.9 Advanced Techniques (Mentioned in Video)
11.1.10 Summary
12 Handwritten Digit Classification using ANN MNIST Dataset
12.1 MNIST Digit Classification with Neural Networks - Complete Guide
12.1.1 Introduction to Multi-Class Classification
12.1.2 MNIST Dataset Overview
12.1.3 Data Preprocessing
12.1.4 Building the Neural Network
12.1.5 Model Training
12.1.6 Model Evaluation
12.1.7 Visualization & Analysis
12.1.8 Model Improvements
12.1.9 Advanced Concepts
12.1.10 Summary
13 Graduate Admission Prediction using ANN
13.1 Neural Networks for Regression - Graduate Admission Prediction
13.1.1 Introduction
13.1.2 Dataset Overview
13.1.3 Data Preprocessing
13.1.4 Building Neural Network
13.1.5 Model Training

Why this matters

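Model comparison must use the same data splits and the same metrics for every model. Otherwise a difference in scores may reflect the evaluation setup rather than the models themselves.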
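A minimal sketch of what "same splits and metrics" can look like in practice: one seeded split is computed once and reused, and every model is scored with the same metric on the same validation rows. The helper names (`split_indices`, `accuracy`, `compare`) and the toy models are assumptions made for illustration.

```python
import random

def split_indices(n, val_fraction=0.2, seed=0):
    """Return (train_idx, val_idx) from one seeded shuffle."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = max(1, int(n * val_fraction))
    return idx[n_val:], idx[:n_val]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def compare(models, X, y, val_idx, metric=accuracy):
    """Score every model on the SAME validation rows with the SAME metric."""
    X_val = [X[i] for i in val_idx]
    y_val = [y[i] for i in val_idx]
    return {name: metric(y_val, [predict(x) for x in X_val])
            for name, predict in models.items()}

# Toy data and toy "models" (plain functions), purely for illustration.
X = list(range(20))
y = [x % 2 for x in X]
train_idx, val_idx = split_indices(len(X), val_fraction=0.25, seed=42)
models = {"always_zero": lambda x: 0, "parity": lambda x: x % 2}
scores = compare(models, X, y, val_idx)
```

Fixing the seed makes the split reproducible, so a rerun weeks later compares against exactly the same validation rows.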

26.0.12 Comparison Table: With vs Without Regularization

Aspect | Without Regularization | With Regularization
Model Complexity | High | Reduced
Weights | Large, spread out | Small, close to zero
Overfitting | Yes | Reduced
Generalization | Poor | Improved
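The "Weights" row of the comparison can be illustrated with a small sketch: fitting a single weight by gradient descent with and without an L2 penalty. The function `fit_linear`, the toy data, and the penalty strength are assumptions chosen for illustration, not values from the notes.

```python
def fit_linear(xs, ys, l2=0.0, lr=0.01, steps=2000):
    """Fit y ~ w * x by gradient descent on mean squared error + l2 * w**2."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean squared error with respect to w.
        grad_mse = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        # Gradient of the L2 penalty; it always pulls w toward zero.
        grad_penalty = 2 * l2 * w
        w -= lr * (grad_mse + grad_penalty)
    return w

# Toy data generated from y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w_plain = fit_linear(xs, ys, l2=0.0)
w_reg = fit_linear(xs, ys, l2=1.0)
```

For this loss the minimum sits at w = mean(x*y) / (mean(x^2) + l2), so the unpenalized fit recovers 2.0 while the penalized fit settles at 15 / 8.5, a smaller weight. That shrinkage is the effect the table summarizes as "small, close to zero".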

Chapter 26. Regularization in Deep Learning L2 Regularization in ANN L1 Regularization Weight Decay in ANN


Content sourced from CampusX Deep Learning notes (PDF).

Common mistakes

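Evaluating the compared models inconsistently at inference (for example, leaving dropout active for one of them).
Changing too many techniques at once without an ablation, so the comparison cannot show which change helped.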
No validation split.

Interview checkpoints

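Q: When should you run a model comparison? A: When you need to choose between candidates, for example when a model's validation loss diverges from its training loss and a regularized variant might generalize better.
Q: What happens to dropout at test time? A: Dropout is disabled. With inverted dropout, as in the Keras Dropout layer, kept activations are scaled up by 1 / (1 - rate) during training, so no extra scaling is needed at inference.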
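The divergence between validation and training loss mentioned above can be checked programmatically from recorded loss histories. The sketch below uses one simple heuristic (validation loss rising while training loss falls for several consecutive epochs); the function name `divergence_epoch`, the `patience` parameter, and the loss values are made up for illustration.

```python
def divergence_epoch(train_loss, val_loss, patience=3):
    """Return the first epoch of a run of `patience` consecutive epochs in
    which validation loss rises while training loss falls, or None."""
    run = 0
    for epoch in range(1, min(len(train_loss), len(val_loss))):
        train_down = train_loss[epoch] < train_loss[epoch - 1]
        val_up = val_loss[epoch] > val_loss[epoch - 1]
        run = run + 1 if (train_down and val_up) else 0
        if run == patience:
            return epoch - patience + 1
    return None

# Illustrative loss histories (made-up numbers, not measured results).
train = [1.00, 0.80, 0.60, 0.50, 0.40, 0.30]
val_overfit = [1.00, 0.90, 0.80, 0.85, 0.90, 0.95]
val_healthy = [1.00, 0.90, 0.80, 0.75, 0.70, 0.68]

start_overfit = divergence_epoch(train, val_overfit)
start_healthy = divergence_epoch(train, val_healthy)
```

Here `start_overfit` points at the epoch where the curves begin to separate, which is a natural place to start comparing regularized variants against the baseline.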

Practice

Basic: Define what makes a model comparison fair.
Intermediate: Compare a baseline Keras model with a regularized one on the same split; plot the training and validation curves.
Advanced: Build an ablation table of regularizers.

Recap

Model comparison shows whether a change such as regularization actually reduces overfitting.
Always measure on validation.
Combine with good baselines.

Next: Day 45 — Momentum SGD
