Module 2 · 100 Days of DL

Module 2: Multi-Layer Perceptrons & Training

This module covers stacked multi-layer perceptrons (MLPs), vector and matrix notation for forward propagation, and training with backpropagation, the chain rule, and gradient descent.

⏱ 35 Min Read Author: GenAIWallah Team Updated: May 2026
Day 11

MLP Architecture

Why this matters

Stacking layers with nonlinear activations turns simple perceptrons into universal function approximators, and fully connected layers remain a building block of most deep networks.

A multi-layer perceptron (MLP) stacks fully connected layers: input → one or more hidden layers → output. Each layer after the input applies a weight matrix, a bias, and an activation, which is nonlinear in the hidden layers.
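Because every layer is a weight matrix plus a bias vector, the size of an MLP follows directly from its layer widths. The sketch below counts parameters for a fully connected stack; the helper name `count_mlp_parameters` and the example widths (16 and 8 hidden units) are illustrative assumptions, not part of the course material.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases in a fully connected stack.

    layer_sizes lists the width of every layer, input first.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = n_in * n_out   # one weight per input-output pair
        biases = n_out           # one bias per output unit
        total += weights + biases
    return total


# Hypothetical sizes: 10 inputs, hidden layers of 16 and 8 units, 3 classes.
print(count_mlp_parameters([10, 16, 8, 3]))  # 176 + 136 + 27 = 339
```

The first layer usually dominates the count when the input is wide, which is worth keeping in mind when choosing hidden sizes.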

Layer roles

  • Input: raw features (e.g. flattened pixels).
  • Hidden: learned intermediate representations.
  • Output: task head — softmax (multi-class), sigmoid (binary), or linear (regression).

[Figure: Multi-layer Perceptron (MLP) feedforward structure, showing the input layer, hidden layer, and output layer]

Common mistakes

  • Too few hidden units (underfitting).
  • Too many units with no regularization (overfitting).
  • Mismatched output activation vs loss (sigmoid + MSE on classification).

Interview checkpoints

  • Q: MLP layer roles? A: Input → hidden (learn features) → output (task head).
  • Q: Universal approximation? A: One hidden layer with enough units and a suitable nonlinear activation can approximate any continuous function on a compact domain to arbitrary accuracy.

Practice

  1. Basic: Draw a 2-hidden-layer MLP for 10 inputs, 3 classes.
  2. Intermediate: Count parameters for layers [784,128,64,10].
  3. Advanced: Explain why depth often beats width in practice.

Recap

  • MLP = fully connected stack with nonlinear activations.
  • Depth creates composed nonlinear features.
  • Output layer matches task (softmax, sigmoid, linear).

Next: Day 12 — Forward Propagation

Day 12

Forward Propagation

Why this matters

Forward propagation is how predictions are computed — you must trace tensor shapes layer by layer to debug any network.

Forward propagation computes predictions by passing inputs through each layer in order. Cache every activation — backprop needs them.

One neuron in layer \(l\)

$$z_j^{[l]} = \sum_k w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]}, \quad a_j^{[l]} = \sigma(z_j^{[l]})$$

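Vectorized for a batch with one sample per row: \(Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}\), then \(A^{[l]} = \sigma(Z^{[l]})\). Here \(W^{[l]}\) is stored with shape \((n_{in}, n_{out})\), the transpose of the \(w_{jk}\) index order above, and \(b^{[l]}\) is broadcast across the batch.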
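The batch formula can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: a 2-layer network with ReLU in the hidden layer and a linear output, one sample per row, and weights stored as (n_in, n_out). The function name `forward`, the `params` and `cache` dictionaries, and all sizes are hypothetical.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def forward(X, params):
    """Forward pass through a 2-layer MLP, caching Z and A for backprop.

    X has shape (batch, n_in); each W has shape (n_in, n_out).
    """
    cache = {"A0": X}
    cache["Z1"] = cache["A0"] @ params["W1"] + params["b1"]  # (batch, hidden)
    cache["A1"] = relu(cache["Z1"])
    cache["Z2"] = cache["A1"] @ params["W2"] + params["b2"]  # (batch, n_out)
    cache["A2"] = cache["Z2"]  # linear output (logits), no activation here
    return cache["A2"], cache


rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(size=(4, 5)), "b1": np.zeros(5),
    "W2": rng.normal(size=(5, 3)), "b2": np.zeros(3),
}
X = rng.normal(size=(8, 4))  # batch of 8 samples, 4 features each
logits, cache = forward(X, params)
print(logits.shape)  # (8, 3)
```

Keeping every `Z` and `A` in the cache is what makes the backward pass possible without recomputing the forward pass.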

Common mistakes

  • Forgetting bias in z = Wx + b.
  • Applying softmax or sigmoid to the final layer's output when the loss expects raw logits.
  • Batch dimension dropped (shape (features,) vs (batch, features)).

Interview checkpoints

  • Q: Forward pass for layer l? A: For one sample as a column vector, a^[l] = σ(W^[l] a^[l-1] + b^[l]), with W^[l] of shape (n_out, n_in).
  • Q: Why cache activations? A: Needed for backprop.

Practice

  1. Basic: Compute z and a for 2 inputs, 1 neuron, ReLU.
  2. Intermediate: Implement forward pass in NumPy for 2-layer net.
  3. Advanced: Vectorize forward pass with matrix multiply.

Recap

  • Forward = repeated affine transform + activation.
  • Shapes: (batch, n_in) @ (n_in, n_out).
  • Store activations for backward pass.

Next: Day 13 — Matrix Notation

Day 13

Matrix Notation

Why this matters

Matrix notation maps directly onto the batched matrix multiplies that GPUs execute efficiently, so vectorization is not optional at scale.

Matrix notation is how frameworks run forward passes on GPUs. Always track shapes: \((\text{batch}, n_{in}) @ (n_{in}, n_{out}) \rightarrow (\text{batch}, n_{out})\).

| Symbol | Meaning |
| --- | --- |
| \(A^{[0]}\) | Input batch |
| \(W^{[l]}, b^{[l]}\) | Weights and biases for layer \(l\) |
| \(\sigma\) | Activation (ReLU, sigmoid, …) |
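To make the shape rule concrete, the sketch below computes one layer's pre-activations twice: once with a per-sample Python loop and once with a single batched matrix multiply. The sizes (batch 4, 3 inputs, 2 outputs) and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A0 = rng.normal(size=(4, 3))   # (batch=4, n_in=3)
W1 = rng.normal(size=(3, 2))   # (n_in=3, n_out=2)
b1 = rng.normal(size=(2,))     # one bias per output unit, broadcast over the batch

# Per-sample loop: one vector-matrix product at a time.
Z_loop = np.stack([A0[i] @ W1 + b1 for i in range(A0.shape[0])])

# Vectorized: a single batched matrix multiply.
Z_vec = A0 @ W1 + b1

print(Z_vec.shape)                 # (4, 2)
print(np.allclose(Z_loop, Z_vec))  # True
```

Both versions give the same numbers; the vectorized form simply hands the whole batch to the underlying BLAS or GPU kernel in one call.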

Common mistakes

  • Looping over samples instead of batch matmul.
  • Wrong transpose in W @ x vs x @ W.
  • Mixing (batch, features) with (features, batch).

Interview checkpoints

  • Q: Batch matrix multiply shape? A: (B, n) @ (n, m) → (B, m).
  • Q: Why vectorize? A: BLAS/GPU parallelism.

Practice

  1. Basic: Write shapes for batch=32, features=784, hidden=128.
  2. Intermediate: Replace Python loops with np.dot on toy data.
  3. Advanced: Profile loop vs vectorized forward pass.

Recap

  • One batch step = matrix multiplies across layers.
  • Consistent shape discipline prevents bugs.
  • Keras handles this; you still verify with model.summary().

Next: Day 14 — Sigmoid Activation

Day 14

Sigmoid Activation

Why this matters

Sigmoid squashes outputs to (0,1) — historically vital, now mostly for binary output layers and gates.

Sigmoid maps a logit \(z\) to a value in \((0, 1)\) that can be read as a probability. Use it on binary output neurons together with binary cross-entropy.

$$\sigma(z) = \frac{1}{1 + e^{-z}}, \quad \sigma'(z) = \sigma(z)(1 - \sigma(z))$$

⚠️ Avoid sigmoid in deep hidden layers: each sigmoid layer scales the backpropagated gradient by at most 0.25 (the derivative's maximum, at \(z=0\)), so gradients shrink quickly with depth.
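The 0.25 bound is easy to check numerically. The following sketch uses only the standard library; the split into two branches is one common way to keep the exponential from overflowing, and the function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # avoids overflow in exp(-z) for large negative z
    return ez / (1.0 + ez)


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


print(sigmoid_grad(0.0))  # 0.25, the maximum
print(sigmoid_grad(5.0))  # much smaller: the unit is saturating
print(0.25 ** 10)         # upper bound on the product of 10 sigmoid derivatives
```

The last line shows why depth hurts: even in the best case, ten stacked sigmoid derivatives multiply to less than one millionth.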

Common mistakes

  • Sigmoid in deep hidden layers → vanishing gradients.
  • Passing sigmoid outputs (probabilities) to a loss configured for logits, which applies the sigmoid twice.
  • Ignoring saturation near 0 and 1.

Interview checkpoints

  • Q: Sigmoid derivative max? A: 0.25 at z=0.
  • Q: When still use sigmoid? A: Binary output + BCE, LSTM gates.

Practice

  1. Basic: Plot sigmoid and derivative.
  2. Intermediate: Train 2-layer net with sigmoid hidden vs ReLU on MNIST subset.
  3. Advanced: Explain saturation and slow learning.

Recap

  • Sigmoid = smooth S-curve.
  • Derivative small when |z| large.
  • Avoid in deep hidden stacks.

Next: Day 15 — Tanh Activation

Day 15

Tanh Activation

Why this matters

Tanh centers outputs around zero — often better than sigmoid in hidden layers but still suffers vanishing gradients when deep.

Tanh outputs in \((-1, 1)\) and is zero-centered: \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\). Often preferred over sigmoid in hidden layers historically, but still saturates.

Compare: ReLU for hidden stacks today; tanh/sigmoid for gates (LSTM) or bounded outputs.
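The relationship between the two S-shaped activations can be checked directly: tanh is a rescaled and shifted sigmoid, \(\tanh(z) = 2\sigma(2z) - 1\). The sketch below verifies that identity at a few arbitrary points and evaluates the tanh derivative from its own output; the helper names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_grad(z):
    """Derivative of tanh, written in terms of its own output."""
    t = math.tanh(z)
    return 1.0 - t * t


for z in (-2.0, 0.0, 0.5, 3.0):
    # tanh is a rescaled, shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
    assert math.isclose(math.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0, abs_tol=1e-12)

print(math.tanh(0.0), sigmoid(0.0))    # 0.0 0.5 (zero-centered vs centered at 0.5)
print(tanh_grad(0.0), tanh_grad(3.0))  # 1.0 at the origin, close to 0 when saturated
```

The derivative peaks at 1.0 rather than 0.25, which is one reason tanh tends to train better than sigmoid in hidden layers, although it still saturates for large |z|.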

Common mistakes

  • Assuming tanh and sigmoid are interchangeable without checking output range.
  • Using tanh on pixel inputs already in [0,255] without scaling.
  • Forgetting that the tanh derivative is computed from the output: \(1 - \tanh^2(z)\).

Interview checkpoints

  • Q: Tanh vs sigmoid? A: Tanh zero-centered; often converges faster in hidden layers.
  • Q: Output range? A: (-1, 1).

Practice

  1. Basic: Compare tanh(0) and sigmoid(0).
  2. Intermediate: Swap activations in same MLP; compare val loss.
  3. Advanced: Explain when you would prefer LeCun initialization with tanh.

Recap

  • Tanh is a zero-centered cousin of sigmoid.
  • Its gradients still vanish in very deep nets.
  • LSTMs typically use tanh in the cell update.

Next: Day 16 — ReLU Activation

Day 16

ReLU Activation

Why this matters

ReLU is the default hidden activation — cheap, sparse, and avoids saturation on the positive side.

ReLU (Rectified Linear Unit): \(\text{ReLU}(z) = \max(0, z)\). Default for hidden layers — cheap, sparse activations, no saturation for \(z > 0\).

  • Dead ReLU: a neuron is stuck at 0 when its weights and bias make \(z\) negative for every input, so its gradient is also 0. Mitigate with Leaky ReLU or better initialization.
  • Derivative: 1 if \(z > 0\), else 0.
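The two points above can be sketched in NumPy. The code assumes a derivative of 0 at exactly z = 0, a Leaky ReLU slope of 0.01, and a deliberately bad bias of -100 to force dead units; `dead_fraction` is a hypothetical helper that measures deadness on one batch only.

```python
import numpy as np


def relu(z):
    return np.maximum(0.0, z)


def relu_grad(z):
    # Convention assumed here: derivative 0 at z == 0.
    return (z > 0).astype(z.dtype)


def leaky_relu(z, alpha=0.01):
    # Small negative slope keeps a nonzero gradient for z < 0.
    return np.where(z > 0, z, alpha * z)


def dead_fraction(Z):
    """Fraction of units whose pre-activation is <= 0 for every sample in Z."""
    return float(np.mean(np.all(Z <= 0, axis=0)))


rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))
W = rng.normal(size=(20, 50)) * np.sqrt(2.0 / 20)  # He-style scaling
b_bad = np.full(50, -100.0)                        # deliberately bad bias

print(dead_fraction(X @ W))          # 0.0: no unit is dead on this batch
print(dead_fraction(X @ W + b_bad))  # 1.0: every unit is dead
```

A dead unit outputs 0 and receives a gradient of 0, so plain gradient descent cannot revive it; Leaky ReLU keeps a small gradient flowing instead.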

Common mistakes

  • Dead ReLU: a neuron never activates because its weights keep z negative for every input.
  • Using ReLU on a regression output without checking that the targets are non-negative.
  • Overlooking LeakyReLU/ELU, which exist to mitigate dead neurons.

Interview checkpoints

  • Q: ReLU formula? A: max(0, z).
  • Q: Dead ReLU fix? A: Lower learning rate, better initialization (e.g. He init), or LeakyReLU.

Practice

  1. Basic: Count % dead units after bad init on random data.
  2. Intermediate: Train same net ReLU vs tanh hidden.
  3. Advanced: He initialization intuition for ReLU.

Recap

  • ReLU = default for CNN/MLP hidden layers.
  • Watch dead neuron fraction.
  • Pair with He init.

Next: Day 17 — Loss Functions

Day 17

Loss Functions

Why this matters

The loss function defines what 'wrong' means — wrong loss means the model optimizes the wrong objective.

The loss function measures prediction error. It must match your output activation and task.

| Task | Output activation | Loss |
| --- | --- | --- |
| Regression | Linear | MSE, MAE, Huber |
| Binary classification | Sigmoid | Binary cross-entropy |
| Multi-class classification | Softmax | Categorical cross-entropy |
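The losses in the table can be written in a few lines of NumPy. This is a minimal sketch, not a framework implementation: it assumes the predictions are already probabilities (sigmoid or softmax outputs), and the clipping constant `EPS` is an implementation choice to keep the logarithm finite.

```python
import numpy as np

EPS = 1e-12  # clipping constant to keep log() finite (an implementation choice)


def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))


def binary_cross_entropy(y_true, p):
    """y_true in {0, 1}; p is the sigmoid output, shape (batch,)."""
    p = np.clip(p, EPS, 1.0 - EPS)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))


def categorical_cross_entropy(y_onehot, probs):
    """y_onehot and probs have shape (batch, classes); probs are softmax outputs."""
    probs = np.clip(probs, EPS, 1.0)
    return float(-np.mean(np.sum(y_onehot * np.log(probs), axis=1)))


print(mse(np.array([1.0, 2.0]), np.array([1.0, 4.0])))                   # 2.0
print(binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.5, 0.5])))  # ln 2, about 0.693
print(categorical_cross_entropy(np.eye(3), np.full((3, 3), 1 / 3)))      # ln 3, about 1.099
```

The two cross-entropy values are the losses of a model that guesses uniformly, which makes them useful sanity checks for the first training step.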

Common mistakes

  • MSE on classification.
  • Multi-class cross-entropy with neither a softmax output nor a logits-aware loss (e.g. from_logits=True in Keras).
  • Loss that never decreases during training (usually a bug, not 'slow convergence').

Interview checkpoints

  • Q: Binary classification loss? A: Binary cross-entropy (log loss).
  • Q: Multi-class? A: Categorical cross-entropy + softmax.

Practice

  1. Basic: Match loss to task type table (regression, binary, multi-class).
  2. Intermediate: Implement MSE and BCE in NumPy.
  3. Advanced: Explain from a probabilistic (maximum-likelihood) view why cross-entropy beats MSE for classification.

Recap

  • Loss drives all gradients.
  • Output activation must align with loss.
  • Monitor train vs val loss.

Next: Day 18 — Backpropagation

Day 18

Backpropagation

Why this matters

Backpropagation is the engine of deep learning — it distributes output error backward to every weight efficiently.

Backpropagation applies the chain rule to compute \(\partial L / \partial w\) for every weight, flowing error from output back to input.

$$\delta_j^{[l]} = \frac{\partial L}{\partial z_j^{[l]}}, \quad \frac{\partial L}{\partial w_{jk}^{[l]}} = \delta_j^{[l]} \, a_k^{[l-1]}$$

Implementation pattern: forward pass (cache \(a, z\)) → compute loss → backward pass → optimizer step.
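The pattern above can be traced end to end on a tiny network. The sketch below is one possible NumPy implementation under explicit assumptions: a 2-layer MLP with a sigmoid hidden layer, a linear output, a half mean-squared-error loss, samples stored as rows, and weights of shape (n_in, n_out). The function name `loss_and_grads` and all sizes are hypothetical. A finite-difference check on one weight confirms the analytic gradient.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def loss_and_grads(X, y, W1, b1, W2, b2):
    """Half mean-squared-error loss of a 2-layer MLP and its gradients."""
    # Forward pass: cache z and a for every layer.
    Z1 = X @ W1 + b1
    A1 = sigmoid(Z1)
    Z2 = A1 @ W2 + b2          # linear output
    n = X.shape[0]
    loss = 0.5 * np.sum((Z2 - y) ** 2) / n

    # Backward pass: delta = dL/dz, propagated from output to input.
    delta2 = (Z2 - y) / n                        # (batch, n_out)
    dW2 = A1.T @ delta2                          # delta times previous activation
    db2 = delta2.sum(axis=0)
    delta1 = (delta2 @ W2.T) * A1 * (1.0 - A1)   # chain rule through sigmoid
    dW1 = X.T @ delta1
    db1 = delta1.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)


rng = np.random.default_rng(0)
X, y = rng.normal(size=(6, 3)), rng.normal(size=(6, 2))
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 2)), np.zeros(2)
loss, (dW1, db1, dW2, db2) = loss_and_grads(X, y, W1, b1, W2, b2)

# Gradient check: central finite difference on one weight.
h = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += h
W1_minus[0, 0] -= h
numeric = (loss_and_grads(X, y, W1_plus, b1, W2, b2)[0]
           - loss_and_grads(X, y, W1_minus, b1, W2, b2)[0]) / (2 * h)
print(np.isclose(numeric, dW1[0, 0], atol=1e-6))  # True
```

The finite-difference check needs two forward passes per weight, while the backward pass yields all gradients at once; that difference is the efficiency argument for backprop.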

Common mistakes

  • Manual backprop shape errors in custom layers.
  • Forgetting to reset gradients between steps (e.g. optimizer.zero_grad() in PyTorch) or to record each step inside a fresh tf.GradientTape.
  • Stopping at loss without checking intermediate gradients.

Interview checkpoints

  • Q: Backprop core idea? A: Chain rule applied layer-wise from loss to weights.
  • Q: Computational graph? A: Nodes = ops; backward = reverse topological order.

Practice

  1. Basic: Backprop through 2-layer net on paper.
  2. Intermediate: Use tf.GradientTape on toy function.
  3. Advanced: Explain why backprop is efficient vs finite differences.

Recap

  • Backprop = chain rule + caching.
  • Frameworks automate; you verify shapes.
  • Vanishing and exploding gradients show up here first.

Next: Day 19 — Chain Rule

Day 19

Chain Rule

Why this matters

The chain rule is the calculus backbone of backprop — one broken link breaks the entire gradient flow.

The chain rule links nested functions: if \(L = f(g(h(x)))\), then \(\frac{dL}{dx} = \frac{df}{dg}\,\frac{dg}{dh}\,\frac{dh}{dx}\), with each factor evaluated at the corresponding intermediate value.

In nets, each layer is one link in the chain. Backprop multiplies local gradients (activation derivative × upstream delta) across layers.
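A single neuron is enough to see the links being multiplied. The sketch below assumes a scalar chain of affine → sigmoid → squared error, with arbitrary illustrative values for `w`, `b`, `x`, and `y`, and compares the chain-rule product against a finite difference.

```python
import math


def forward(w, b, x, y):
    z = w * x + b                    # affine
    a = 1.0 / (1.0 + math.exp(-z))   # sigmoid
    loss = 0.5 * (a - y) ** 2        # squared error
    return z, a, loss


w, b, x, y = 0.8, -0.3, 1.5, 1.0
z, a, loss = forward(w, b, x, y)

# Local derivatives, one per link in the chain.
dL_da = a - y
da_dz = a * (1.0 - a)
dz_dw = x

dL_dw = dL_da * da_dz * dz_dw   # chain rule: multiply the links

# Compare against a central finite difference.
h = 1e-6
numeric = (forward(w + h, b, x, y)[2] - forward(w - h, b, x, y)[2]) / (2 * h)
print(math.isclose(dL_dw, numeric, rel_tol=1e-6))  # True
```

In a full network the same multiplication happens with matrices instead of scalars, and each additional layer contributes one more factor to the product.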

Common mistakes

  • Mixing partial derivatives when multiple paths exist (need sum of paths).
  • Treating independent variables as dependent.
  • Numerical instability when multiplying many small derivatives.

Interview checkpoints

  • Q: Chain rule example? A: dL/dx = dL/dy · dy/dx.
  • Q: ReLU chain rule at z=0? A: ReLU is not differentiable there; any subgradient in [0, 1] is valid, and frameworks typically use 0.

Practice

  1. Basic: Differentiate composed function f(g(x)).
  2. Intermediate: Trace chain for loss → sigmoid → affine.
  3. Advanced: Multi-path graph (branching) gradient sum.

Recap

  • Master chain rule on scalars then tensors.
  • Each layer is a composed function.
  • Vanishing = product of many small terms.

Next: Day 20 — Gradient Descent

Day 20

Gradient Descent

Why this matters

Gradient descent turns gradients into weight updates, and the learning rate is often the most important hyperparameter to tune.

Gradient descent updates weights opposite the loss gradient: \(w \leftarrow w - \eta \frac{\partial L}{\partial w}\). \(\eta\) is the learning rate.

  • Batch: the full dataset per step; stable but expensive.
  • SGD: one sample per step; noisy but cheap per update.
  • Mini-batch: a small batch per step (e.g. 32–256); the practical default.
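The three variants differ only in how many samples feed each update. The sketch below applies the update rule to a toy linear-regression problem with mini-batches; the synthetic data, the MSE loss, and the values of `eta`, `batch_size`, and `epochs` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 2))
true_w, true_b = np.array([2.0, -3.0]), 0.5
y = X @ true_w + true_b            # noiseless toy targets

w, b = np.zeros(2), 0.0
eta, batch_size, epochs = 0.1, 32, 50   # illustrative values

for _ in range(epochs):
    order = rng.permutation(len(X))     # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        err = X[idx] @ w + b - y[idx]               # prediction error on the mini-batch
        grad_w = 2.0 * X[idx].T @ err / len(idx)    # d(MSE)/dw
        grad_b = 2.0 * err.mean()                   # d(MSE)/db
        w -= eta * grad_w                           # w <- w - eta * dL/dw
        b -= eta * grad_b

print(np.round(w, 3), round(float(b), 3))  # close to [ 2. -3.] and 0.5
```

Setting `batch_size = 1` turns this loop into per-sample SGD, and `batch_size = len(X)` turns it into full-batch gradient descent.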

Common mistakes

  • Learning rate too high: the loss oscillates or diverges.
  • Learning rate too low: convergence is impractically slow.
  • Updating weights on validation set (data leakage).

Interview checkpoints

  • Q: Update rule? A: w ← w - η ∇L.
  • Q: Why mini-batch? A: Noise + GPU efficiency + better generalization.

Practice

  1. Basic: Sketch loss curve for too high vs too low η.
  2. Intermediate: Train with lr=0.1 vs 0.001; compare.
  3. Advanced: Link to Module 3 optimizers (Adam, etc.).

Recap

  • GD needs differentiable loss.
  • η controls step size.
  • Mini-batch SGD is the practical default.

Next: Day 21 — MLP in Keras

Day 21

MLP in Keras

Why this matters

Keras lets you build and train MLPs in minutes — production workflows start with a correct Sequential or Functional model.

Keras
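from tensorflow import keras

# Assumes MNIST: flatten 28x28 images to 784 features and scale pixels to [0, 1]
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

model = keras.Sequential([
    keras.Input(shape=(784,)),                     # explicit input shape
    keras.layers.Dense(128, activation='relu'),
    keras.layers.Dropout(0.2),                     # regularization
    keras.layers.Dense(10, activation='softmax'),  # 10 class probabilities
])
# Integer labels + softmax output -> sparse categorical cross-entropy
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()  # verify shapes and parameter counts before training
model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)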

Build and train an MLP in Keras with Sequential: stack layers, compile with optimizer + loss, then fit.

Common mistakes

  • Input shape missing on first layer.
  • Wrong loss in compile() for last-layer activation.
  • Not calling model.summary() before training.

Interview checkpoints

  • Q: model.compile args? A: optimizer, loss, metrics.
  • Q: fit() key args? A: x, y, epochs, batch_size, validation_data.

Practice

  1. Basic: Sequential model: Dense(128,relu) → Dense(10,softmax) on MNIST.
  2. Intermediate: Add EarlyStopping + ModelCheckpoint.
  3. Advanced: Functional API two-input model.

Recap

  • Keras abstracts forward/backprop.
  • Always verify shapes with summary().
  • Callbacks automate training hygiene.

Next: Day 22 — MLP Project

Day 22

MLP Project

Why this matters

An end-to-end MLP project consolidates architecture, training, evaluation, and error analysis — portfolio-worthy if documented.

MLP project checklist: define baseline → build pipeline (scale features, train/val split) → train MLP with sane architecture → track metrics → compare to logistic regression / XGBoost on tabular data.

  • Document input shape, layer sizes, activation, loss.
  • Plot learning curves (loss / accuracy vs epoch).
  • Save best model with ModelCheckpoint.
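The pipeline step of the checklist (split, then scale) can be sketched in NumPy as follows. This is an illustration under stated assumptions: the helper names `train_val_test_split` and `fit_scaler`, the 70/15/15 split, and the synthetic stand-in data are all hypothetical. The point it shows is that scaling statistics come from the training split only and are then reused for validation and test.

```python
import numpy as np


def train_val_test_split(X, y, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle once, then cut into train / validation / test partitions."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_frac))
    n_val = int(round(len(X) * val_frac))
    test = order[:n_test]
    val = order[n_test:n_test + n_val]
    train = order[n_test + n_val:]
    return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])


def fit_scaler(X_train):
    """Standardization statistics computed on the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-8   # guard against constant features
    return mean, std


# Synthetic stand-in data; a real project would load its own dataset here.
rng = np.random.default_rng(1)
X = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y = rng.integers(0, 2, size=200)

(X_tr, y_tr), (X_val, y_val), (X_te, y_te) = train_val_test_split(X, y)
mean, std = fit_scaler(X_tr)
X_tr, X_val, X_te = [(part - mean) / std for part in (X_tr, X_val, X_te)]

print(X_tr.shape, X_val.shape, X_te.shape)  # (140, 4) (30, 4) (30, 4)
```

Fitting the scaler on all data would leak validation and test statistics into training, which is a quieter version of the test-set tuning mistake.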

Common mistakes

  • No held-out test set.
  • Tuning on test set.
  • Reporting accuracy only on imbalanced data.

Interview checkpoints

  • Q: Project checklist? A: EDA → baseline → model → val metrics → error analysis.
  • Q: What to log? A: Config, metrics, confusion matrix, failure cases.

Practice

  1. Basic: Define problem, metric, and baseline.
  2. Intermediate: Train MLP; plot learning curves.
  3. Advanced: Write README with reproducibility steps.

Recap

  • Projects prove you can ship, not just copy notebooks.
  • Document data, splits, and metrics.
  • Ready for gradient and regularization modules.

Next: Day 23 — Batch vs SGD
