Module 2 · 100 Days of DL

Module 2: Multi-Layer Perceptrons & Training

Dive into stacked Multi-Layer Perceptrons (MLPs), vector notation for layers, matrix computations in forward propagation, and training with backpropagation, which applies the chain rule to compute gradients.

⏱ 35 Min Read • Author: GenAIWallah Team • Updated: May 2026

Day 11

MLP Architecture

Why this matters

Stacking layers with nonlinear activations turns linear perceptrons into universal function approximators, and the fully connected layer remains a building block of most deep networks.

A multi-layer perceptron (MLP) stacks fully connected layers: input → one or more hidden layers → output. Each layer applies a weight matrix, bias, and nonlinear activation.
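To see why the nonlinear activation matters, here is a minimal NumPy sketch. The layer sizes, the random seed, and the variable names are assumptions chosen for illustration. It shows that two stacked layers without an activation collapse into a single affine layer, while inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 4 inputs, 5 hidden units, 3 outputs, batch of 8.
X = rng.normal(size=(8, 4))
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)

# Two stacked layers with NO activation in between ...
two_linear = (X @ W1 + b1) @ W2 + b2

# ... equal one affine layer with merged weights and bias.
W_merged = W1 @ W2
b_merged = b1 @ W2 + b2
one_linear = X @ W_merged + b_merged
linear_collapses = np.allclose(two_linear, one_linear)

# With a ReLU in between, the merged layer no longer reproduces the output.
with_relu = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
relu_differs = not np.allclose(with_relu, one_linear)

print(linear_collapses, relu_differs)
```

Depth only adds expressive power when a nonlinearity sits between the affine maps.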

Layer roles

Input: raw features (e.g. flattened pixels).
Hidden: learned intermediate representations.
Output: task head — softmax (multi-class), sigmoid (binary), or linear (regression).

[Figure: Multi-layer Perceptron (MLP) feedforward structure]

Common mistakes

Too few hidden units (underfitting).
Too many units with no regularization (overfitting).
Mismatched output activation vs loss (sigmoid + MSE on classification).

Interview checkpoints

Q: MLP layer roles? A: Input → hidden (learn features) → output (task head).
Q: Universal approximation? A: One hidden layer with enough units and a suitable nonlinear activation can approximate any continuous function on a bounded (compact) input domain to arbitrary accuracy.

Practice

Basic: Draw a 2-hidden-layer MLP for 10 inputs, 3 classes.
Intermediate: Count parameters for layers [784,128,64,10].
Advanced: Explain why depth often beats width in practice.

Recap

MLP = fully connected stack with nonlinear activations.
Depth creates composed nonlinear features.
Output layer matches task (softmax, sigmoid, linear).

Next: Day 12 — Forward Propagation

Day 12

Forward Propagation

Why this matters

Forward propagation is how predictions are computed — you must trace tensor shapes layer by layer to debug any network.

Forward propagation computes predictions by passing inputs through each layer in order. Cache every activation — backprop needs them.

One neuron in layer $l$

$$z_j^{[l]} = \sum_k w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]}, \quad a_j^{[l]} = \sigma(z_j^{[l]})$$

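Vectorized for a batch, with one sample per row and $W^{[l]}$ stored with shape $(n_{in}, n_{out})$ (the transpose of the $w_{jk}$ indexing above): $Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}$, then $A^{[l]} = \sigma(Z^{[l]})$.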
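The batch formula can be sketched in NumPy as follows. This is one possible layout, not a reference implementation: the `forward` helper, the ReLU hidden activation, the linear last layer, and the layer widths are all assumptions made for the example. The cache holds every $Z$ and $A$ so a backward pass could reuse them.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, params):
    """Forward pass; returns the output and a cache of (Z, A) per layer.

    params is a list of (W, b) with W of shape (n_in, n_out).
    Hidden layers use ReLU; the last layer is left linear (logits).
    """
    A = X
    cache = [(None, X)]          # A^[0] is the input batch
    last = len(params) - 1
    for l, (W, b) in enumerate(params):
        Z = A @ W + b            # (batch, n_in) @ (n_in, n_out) + (n_out,)
        A = Z if l == last else relu(Z)
        cache.append((Z, A))
    return A, cache

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                # hypothetical layer widths
params = [(rng.normal(size=(m, n)) * 0.1, np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]
X = rng.normal(size=(8, 4))      # batch of 8 samples
logits, cache = forward(X, params)
print(logits.shape, [a.shape for _, a in cache])
```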

Common mistakes

Forgetting bias in z = Wx + b.
Applying an activation on the last layer when the loss expects raw logits.
Batch dimension dropped (shape (features,) vs (batch, features)).

Interview checkpoints

Q: Forward pass for layer l? A: a^[l] = σ(W^[l] a^[l-1] + b^[l]) (single sample, column-vector convention).
Q: Why cache activations? A: Needed for backprop.

Practice

Basic: Compute z and a for 2 inputs, 1 neuron, ReLU.
Intermediate: Implement forward pass in NumPy for 2-layer net.
Advanced: Vectorize forward pass with matrix multiply.

Recap

Forward = repeated affine transform + activation.
Shapes: (batch, n_in) @ (n_in, n_out).
Store activations for backward pass.

Next: Day 13 — Matrix Notation

Day 13

Matrix Notation

Why this matters

Matrix notation lets a GPU process a whole batch with a few large matrix multiplies instead of per-sample loops; at scale, vectorization is not optional.

Matrix notation is how frameworks run forward passes on GPUs. Always track shapes: $(\text{batch}, n_{in}) @ (n_{in}, n_{out}) \rightarrow (\text{batch}, n_{out})$.

Symbol	Meaning
$A^{[0]}$	Input batch
$W^{[l]}, b^{[l]}$	Weights and biases for layer $l$
$\sigma$	Activation (ReLU, sigmoid, …)
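As a quick check of the shape rule, the sketch below compares a per-sample loop with one batched matrix multiply. The sizes and names (`A_prev`, `Z_loop`, `Z_batch`) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_in, n_out = 6, 4, 3          # hypothetical sizes
A_prev = rng.normal(size=(B, n_in))
W = rng.normal(size=(n_in, n_out))
b = rng.normal(size=n_out)

# Per-sample loop: one vector-matrix product per row.
Z_loop = np.stack([A_prev[i] @ W + b for i in range(B)])

# Batched form: a single matrix multiply, bias broadcast over rows.
Z_batch = A_prev @ W + b

print(Z_batch.shape, np.allclose(Z_loop, Z_batch))  # prints: (6, 3) True
```

Both forms compute the same numbers; the batched form hands the whole computation to one optimized matmul call.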

Common mistakes

Looping over samples instead of batch matmul.
Wrong transpose in W @ x vs x @ W.
Mixing (batch, features) with (features, batch).

Interview checkpoints

Q: Batch matrix multiply shape? A: (B, n) @ (n, m) → (B, m).
Q: Why vectorize? A: BLAS/GPU parallelism.

Practice

Basic: Write shapes for batch=32, features=784, hidden=128.
Intermediate: Replace Python loops with np.dot on toy data.
Advanced: Profile loop vs vectorized forward pass.

Recap

One batch step = matrix multiplies across layers.
Consistent shape discipline prevents bugs.
Keras handles this; you still verify with model.summary().

Next: Day 14 — Sigmoid Activation

Day 14

Sigmoid Activation

Why this matters

Sigmoid squashes outputs to (0,1) — historically vital, now mostly for binary output layers and gates.

Sigmoid maps logits to $(0, 1)$: $\sigma(z) = 1 / (1 + e^{-z})$. Use on binary output neurons with binary cross-entropy.

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))$$

⚠️ Avoid sigmoid in deep hidden layers: gradients shrink (max derivative 0.25 at $z=0$).
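The formula and the 0.25 bound can be checked numerically. The sketch below is one way to write a sigmoid that avoids overflow for large $|z|$; the helper names and the 10-layer figure are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    z = np.asarray(z, dtype=float)
    e = np.exp(-np.abs(z))                  # always in (0, 1]
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10.0, 10.0, 2001)          # grid that includes z = 0
peak = sigmoid_grad(z).max()
# Upper bound on the activation-derivative factor after 10 sigmoid layers
# (ignores the weight terms, which also enter the product).
bound_10_layers = 0.25 ** 10
print(peak, bound_10_layers)
```

Even in the best case, ten sigmoid layers scale the gradient by less than one millionth through the activation derivatives alone, which is the shrinking effect described above.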

Common mistakes

Sigmoid in deep hidden layers → vanishing gradients.
Feeding sigmoid outputs into a loss configured for logits (the sigmoid is then applied twice).
Ignoring saturation near 0 and 1.

Interview checkpoints

Q: Sigmoid derivative max? A: 0.25 at z=0.
Q: When still use sigmoid? A: Binary output + BCE, LSTM gates.

Practice

Basic: Plot sigmoid and derivative.
Intermediate: Train 2-layer net with sigmoid hidden vs ReLU on MNIST subset.
Advanced: Explain saturation and slow learning.

Recap

Sigmoid = smooth S-curve.
Derivative small when |z| large.
Avoid in deep hidden stacks.

Next: Day 15 — Tanh Activation

Day 15

Tanh Activation

Why this matters

Tanh centers outputs around zero — often better than sigmoid in hidden layers but still suffers vanishing gradients when deep.

Tanh outputs in $(-1, 1)$ and is zero-centered: $\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$. Often preferred over sigmoid in hidden layers historically, but still saturates.
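Two properties of tanh are easy to verify numerically: it is a rescaled sigmoid, and its derivative can be computed from its own output. The grid of test points and the finite-difference step are arbitrary choices for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-4.0, 4.0, 9)

# tanh is a shifted, rescaled sigmoid: tanh(z) = 2*sigmoid(2z) - 1.
identity_holds = np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

# The derivative can be computed from the output alone: 1 - tanh(z)^2.
t = np.tanh(z)
analytic = 1.0 - t ** 2
h = 1e-6
numeric = (np.tanh(z + h) - np.tanh(z - h)) / (2.0 * h)
derivative_matches = np.allclose(analytic, numeric, atol=1e-8)

print(identity_holds, derivative_matches, analytic.max())
```

The derivative peaks at 1 for $z = 0$, four times the sigmoid's peak, but it still decays toward 0 as $|z|$ grows.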

Compare: ReLU for hidden stacks today; tanh/sigmoid for gates (LSTM) or bounded outputs.

Common mistakes

Assuming tanh and sigmoid are interchangeable without checking output range.
Using tanh on pixel inputs already in [0,255] without scaling.
Forgetting tanh derivative depends on output (1 - tanh²).

Interview checkpoints

Q: Tanh vs sigmoid? A: Tanh zero-centered; often converges faster in hidden layers.
Q: Output range? A: (-1, 1).

Practice

Basic: Compare tanh(0) and sigmoid(0).
Intermediate: Swap activations in same MLP; compare val loss.
Advanced: Explain when to prefer LeCun initialization with tanh.

Recap

Tanh is a zero-centered, rescaled sigmoid: tanh(z) = 2σ(2z) − 1.
Still vanishes in very deep nets.
LSTM often uses tanh in cell update.

Next: Day 16 — ReLU Activation

Day 16

ReLU Activation

Why this matters

ReLU is the default hidden activation — cheap, sparse, and avoids saturation on the positive side.

ReLU (Rectified Linear Unit): $\text{ReLU}(z) = \max(0, z)$. Default for hidden layers — cheap, sparse activations, no saturation for $z > 0$.

Dead ReLU: neuron stuck at 0 if weights push all inputs negative — use Leaky ReLU or better init.
Derivative: 1 if $z > 0$, else 0.
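The definitions above can be sketched in NumPy. The `dead_fraction` helper, the Leaky ReLU slope `alpha=0.01`, and the deliberately bad bias of -10 are assumptions for illustration; a unit is counted as dead here if it outputs 0 for every sample in the batch.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Convention assumed here: derivative 0 at z == 0.
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    # alpha is a hypothetical small negative-side slope.
    return np.where(z > 0, z, alpha * z)

def dead_fraction(Z):
    """Share of units whose pre-activation is <= 0 for every sample in the batch."""
    return float(np.mean(np.all(Z <= 0, axis=0)))

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))
W = rng.normal(size=(20, 50)) * 0.1

healthy = dead_fraction(X @ W)            # zero bias
broken = dead_fraction(X @ W - 10.0)      # large negative bias kills every unit
print(healthy, broken)
```

A dead unit has zero gradient on every sample, so plain gradient descent cannot revive it; Leaky ReLU keeps a small slope on the negative side to avoid that.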

Common mistakes

Dead ReLU: neuron never activates if weights push z always negative.
Using ReLU on a regression output without checking that targets are non-negative.
Forgetting that LeakyReLU/ELU exist to mitigate dead neurons.

Interview checkpoints

Q: ReLU formula? A: max(0, z).
Q: Dead ReLU fix? A: Lower LR, He initialization, LeakyReLU.

Practice

Basic: Count % dead units after bad init on random data.
Intermediate: Train same net ReLU vs tanh hidden.
Advanced: He initialization intuition for ReLU.

Recap

ReLU = default for CNN/MLP hidden layers.
Watch dead neuron fraction.
Pair with He init.

Next: Day 17 — Loss Functions

Day 17

Loss Functions

Why this matters

The loss function defines what 'wrong' means — wrong loss means the model optimizes the wrong objective.

The loss function measures prediction error. It must match your output activation and task.

Task	Output	Loss
Regression	Linear	MSE, MAE, Huber
Binary classification	Sigmoid	Binary cross-entropy
Multi-class	Softmax	Categorical cross-entropy
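For the multi-class row, a common way to pair softmax with cross-entropy is to compute both from raw logits in one numerically stable step. The sketch below assumes integer class labels; the function names and sample logits are made up for the example.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log-softmax using the max-shift (log-sum-exp) trick."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def sparse_categorical_cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels and raw logits."""
    log_probs = log_softmax(logits)
    picked = log_probs[np.arange(len(labels)), labels]
    return float(-picked.mean())

logits = np.array([[2.0, 0.5, -1.0],
                   [0.0, 0.0, 0.0],
                   [1000.0, 0.0, -1000.0]])   # large values must not overflow
labels = np.array([0, 2, 0])
loss = sparse_categorical_cross_entropy(logits, labels)
print(loss)
```

Subtracting the row maximum leaves the softmax unchanged but keeps every exponent at or below 0, so extreme logits do not overflow.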

Common mistakes

MSE on classification.
Cross-entropy on raw multi-class outputs without softmax (unless the loss is configured to take logits).
Dismissing a training loss that never decreases as 'slow convergence'; it usually signals a bug.

Interview checkpoints

Q: Binary classification loss? A: Binary cross-entropy (log loss).
Q: Multi-class? A: Categorical cross-entropy + softmax.

Practice

Basic: Match loss to task type table (regression, binary, multi-class).
Intermediate: Implement MSE and BCE in NumPy.
Advanced: Why CE beats MSE for classification probabilistically.

Recap

Loss drives all gradients.
Output activation must align with loss.
Monitor train vs val loss.

Next: Day 18 — Backpropagation

Day 18

Backpropagation

Why this matters

Backpropagation is the engine of deep learning — it distributes output error backward to every weight efficiently.

Backpropagation applies the chain rule to compute $\partial L / \partial w$ for every weight, flowing error from output back to input.

$$\delta_j^{[l]} = \frac{\partial L}{\partial z_j^{[l]}}, \quad \frac{\partial L}{\partial w_{jk}^{[l]}} = \delta_j^{[l]} \, a_k^{[l-1]}$$

Implementation pattern: forward pass (cache $a, z$) → compute loss → backward pass → optimizer step.
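The pattern can be sketched on a tiny two-layer network. Several choices here are assumptions for illustration and do not come from the text: a tanh hidden layer, a linear output with a squared-error loss, and weights stored as $(n_{in}, n_{out})$, so the weight gradient is written as $A^{\top}\delta$. The last lines compare one analytic gradient entry against a central finite difference.

```python
import numpy as np

def forward(X, W1, b1, W2, b2):
    Z1 = X @ W1 + b1
    A1 = np.tanh(Z1)
    Z2 = A1 @ W2 + b2            # linear output for regression
    return Z1, A1, Z2

def loss_fn(Y_hat, Y):
    return 0.5 * np.mean(np.sum((Y_hat - Y) ** 2, axis=1))

def backward(X, Y, W2, A1, Z2):
    B = X.shape[0]
    dZ2 = (Z2 - Y) / B                   # delta at the output layer
    dW2 = A1.T @ dZ2                     # delta_j * a_k, summed over the batch
    db2 = dZ2.sum(axis=0)
    dA1 = dZ2 @ W2.T                     # push delta back through the weights
    dZ1 = dA1 * (1.0 - A1 ** 2)          # times tanh'(Z1), from the cached A1
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return dW1, db1, dW2, db2

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(5, 3)), rng.normal(size=(5, 2))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)

Z1, A1, Z2 = forward(X, W1, b1, W2, b2)
dW1, db1, dW2, db2 = backward(X, Y, W2, A1, Z2)

# Gradient check: central finite difference on one entry of W1.
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_fn(forward(X, W1_plus, b1, W2, b2)[2], Y)
           - loss_fn(forward(X, W1_minus, b1, W2, b2)[2], Y)) / (2 * eps)
print(dW1[0, 0], numeric)
```

The finite-difference check needs two forward passes per weight, while the backward pass produces every gradient at roughly the cost of one extra forward pass.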

Common mistakes

Manual backprop shape errors in custom layers.
Forgetting to zero_grad / reset tape in frameworks.
Stopping at loss without checking intermediate gradients.

Interview checkpoints

Q: Backprop core idea? A: Chain rule applied layer-wise from loss to weights.
Q: Computational graph? A: Nodes = ops; backward = reverse topological order.

Practice

Basic: Backprop through 2-layer net on paper.
Intermediate: Use tf.GradientTape on toy function.
Advanced: Explain why backprop is efficient vs finite differences.

Recap

Backprop = chain rule + caching.
Frameworks automate; you verify shapes.
Vanishing/exploding gradients first show up here, in the backward pass.

Next: Day 19 — Chain Rule

Day 19

Chain Rule

Why this matters

The chain rule is the calculus backbone of backprop — one broken link breaks the entire gradient flow.

The chain rule links nested functions: if $L = f(g(h(x)))$, then $\frac{dL}{dx} = \frac{df}{dg}\frac{dg}{dh}\frac{dh}{dx}$, with each factor evaluated at the value passed into it.

In nets, each layer is one link in the chain. Backprop multiplies local gradients (activation derivative × upstream delta) across layers.
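A small scalar example makes the product of local derivatives concrete. The three functions below are arbitrary choices for illustration; the analytic derivative is compared with a central finite difference.

```python
import math

def h(x):
    return 3.0 * x + 1.0

def g(u):
    return u ** 2

def f(v):
    return math.sin(v)

def L(x):
    return f(g(h(x)))

def dL_dx(x):
    u = h(x)
    v = g(u)
    df_dg = math.cos(v)      # f'(v)
    dg_dh = 2.0 * u          # g'(u)
    dh_dx = 3.0              # h'(x)
    return df_dg * dg_dh * dh_dx

x0 = 0.2
eps = 1e-6
numeric = (L(x0 + eps) - L(x0 - eps)) / (2.0 * eps)
print(dL_dx(x0), numeric)
```

Each local derivative is evaluated at the intermediate value from the forward computation, which is why backprop caches those values.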

Common mistakes

Mixing partial derivatives when multiple paths exist (need sum of paths).
Treating independent variables as dependent.
Numerical instability when multiplying many small derivatives.

Interview checkpoints

Q: Chain rule example? A: dL/dx = dL/dy · dy/dx.
Q: ReLU chain rule at z=0? A: Subgradient convention (0 or 1).

Practice

Basic: Differentiate composed function f(g(x)).
Intermediate: Trace chain for loss → sigmoid → affine.
Advanced: Multi-path graph (branching) gradient sum.

Recap

Master chain rule on scalars then tensors.
Each layer is a composed function.
Vanishing = product of many small terms.

Next: Day 20 — Gradient Descent

Day 20

Gradient Descent

Why this matters

Gradient descent turns gradients into weight updates, and the learning rate is often the most important hyperparameter to tune.

Gradient descent updates weights opposite the loss gradient: $w \leftarrow w - \eta \frac{\partial L}{\partial w}$. $\eta$ is the learning rate.

Batch: full dataset per step — stable, expensive.
SGD: one sample — noisy, fast.
Mini-batch: practical default (e.g. 32–256).
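The update rule and the batch-size variants can be sketched on a linear regression problem. The synthetic data, the learning rate of 0.1, and the `train` helper are assumptions for this example; setting `batch_size` to the dataset size gives full-batch gradient descent, and a smaller value gives mini-batch updates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from a known linear model (hypothetical ground truth).
true_w, true_b = np.array([2.0, -3.0]), 0.5
X = rng.normal(size=(512, 2))
y = X @ true_w + true_b + 0.01 * rng.normal(size=512)

def train(batch_size, lr=0.1, epochs=50):
    w, b = np.zeros(2), 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)               # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]        # residuals on this batch
            grad_w = X[idx].T @ err / len(idx)   # gradient of 0.5 * mean(err^2)
            grad_b = err.mean()
            w -= lr * grad_w                     # w <- w - eta * dL/dw
            b -= lr * grad_b
    return w, b

w_mini, b_mini = train(batch_size=32)     # mini-batch
w_full, b_full = train(batch_size=512)    # full batch: one update per epoch
print(w_mini, b_mini, w_full, b_full)
```

With the same number of epochs, the mini-batch run takes 16 times as many update steps as the full-batch run, which is one reason it tends to make faster progress per pass over the data.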

Common mistakes

Learning rate too high: loss oscillates or diverges.
Learning rate too low: convergence becomes impractically slow.
Updating weights on validation set (data leakage).
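The learning-rate failure modes are easy to reproduce on the one-dimensional loss $L(w) = w^2$, where each step multiplies $w$ by $(1 - 2\eta)$. The three learning rates below are arbitrary values picked to land in each regime.

```python
def run_gd(lr, steps=20, w0=1.0):
    """Gradient descent on L(w) = w^2, whose gradient is 2w."""
    w = w0
    for _ in range(steps):
        w = w - lr * 2.0 * w      # equivalent to w *= (1 - 2*lr)
    return w

too_low = run_gd(lr=0.001)   # barely moves: factor 0.998 per step
good = run_gd(lr=0.3)        # shrinks quickly: factor 0.4 per step
too_high = run_gd(lr=1.1)    # |factor| = 1.2 > 1: sign flips and magnitude grows
print(too_low, good, too_high)
```

For this loss, any $\eta > 1$ makes the iterates grow in magnitude while alternating sign, which is the oscillating divergence seen in real training curves.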

Interview checkpoints

Q: Update rule? A: w ← w - η ∇L.
Q: Why mini-batch? A: GPU-efficient batches, plus gradient noise that often helps generalization.

Practice

Basic: Sketch loss curve for too high vs too low η.
Intermediate: Train with lr=0.1 vs 0.001; compare.
Advanced: Link to Module 3 optimizers (Adam, etc.).

Recap

GD needs a loss that is differentiable (at least almost everywhere).
η controls step size.
Mini-batch SGD is the practical default.

Next: Day 21 — MLP in Keras

Day 21

MLP in Keras

Why this matters

Keras lets you build and train MLPs in minutes — production workflows start with a correct Sequential or Functional model.

Keras

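from tensorflow import keras

# Assumes MNIST: flatten 28x28 images to 784 features, scale pixels to [0, 1].
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),
    keras.layers.Dense(10, activation='softmax'),
])
# Integer labels (0-9) pair with sparse_categorical_crossentropy.
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()  # check layer shapes and parameter counts before training
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)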

Build and train an MLP in Keras with Sequential: stack layers, compile with optimizer + loss, then fit.
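A useful sanity check is to work out by hand the parameter counts that summary() should report for the model above: Dense(128) on 784 inputs followed by Dense(10). The `dense_params` helper is a hypothetical name for this sketch.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out

hidden = dense_params(784, 128)   # Dense(128) on 784 inputs
output = dense_params(128, 10)    # Dense(10) on 128 hidden units
total = hidden + output           # Dropout adds no trainable parameters
print(hidden, output, total)      # prints: 100480 1290 101770
```

If the totals in summary() differ from the hand count, the input shape or a layer width is probably not what you intended.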

Common mistakes

Input shape missing on first layer.
Wrong loss in compile() for last-layer activation.
Not calling model.summary() before training.

Interview checkpoints

Q: model.compile args? A: optimizer, loss, metrics.
Q: fit() key args? A: x, y, epochs, batch_size, validation_data.

Practice

Basic: Sequential model: Dense(128,relu) → Dense(10,softmax) on MNIST.
Intermediate: Add EarlyStopping + ModelCheckpoint.
Advanced: Functional API two-input model.

Recap

Keras abstracts forward/backprop.
Always verify shapes with summary().
Callbacks automate training hygiene.

Next: Day 22 — MLP Project

Day 22

MLP Project

Why this matters

An end-to-end MLP project consolidates architecture, training, evaluation, and error analysis — portfolio-worthy if documented.

MLP project checklist: define baseline → build pipeline (scale features, train/val split) → train MLP with sane architecture → track metrics → compare to logistic regression / XGBoost on tabular data.

Document input shape, layer sizes, activation, loss.
Plot learning curves (loss / accuracy vs epoch).
Save best model with ModelCheckpoint.
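The pipeline step (scale features, train/val split) can be sketched in NumPy. The synthetic feature matrix and the 80/20 split ratio are assumptions for illustration. The point is that scaling statistics come from the training split only and are reused for validation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(1000, 8))   # hypothetical tabular features

# Shuffle once, then hold out 20% for validation.
idx = rng.permutation(len(X))
n_val = int(0.2 * len(X))
val_idx, train_idx = idx[:n_val], idx[n_val:]
X_train, X_val = X[train_idx], X[val_idx]

# Fit the scaler on the training split only, then reuse it for validation.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0) + 1e-8      # guard against constant columns
X_train_s = (X_train - mean) / std
X_val_s = (X_val - mean) / std

print(X_train_s.shape, X_val_s.shape)  # prints: (800, 8) (200, 8)
```

Computing the mean and standard deviation on the full dataset would leak validation information into training.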

Common mistakes

No held-out test set.
Tuning on test set.
Reporting accuracy only on imbalanced data.
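The last mistake is easy to demonstrate. In this sketch, with a made-up 95/5 class split, a model that always predicts the majority class reaches high accuracy while finding none of the positives.

```python
import numpy as np

# Hypothetical imbalanced labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)            # a model that always predicts 0

accuracy = float((y_pred == y_true).mean())

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))
recall = tp / (tp + fn)                   # share of positives that were found

print(accuracy, recall)                   # prints: 0.95 0.0
```

Reporting recall, precision, or a confusion matrix alongside accuracy exposes this failure.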

Interview checkpoints

Q: Project checklist? A: EDA → baseline → model → val metrics → error analysis.
Q: What to log? A: Config, metrics, confusion matrix, failure cases.

Practice

Basic: Define problem, metric, and baseline.
Intermediate: Train MLP; plot learning curves.
Advanced: Write README with reproducibility steps.

Recap

Projects prove you can ship, not just copy notebooks.
Document data, splits, and metrics.
Ready for gradient and regularization modules.

Next: Day 23 — Batch vs SGD
