Module 2: Multi-Layer Perceptrons & Training
Dive into stacked multi-layer perceptrons (MLPs), vector notation for nodes, the matrix computations behind forward propagation, and training with backpropagation, which applies the chain rule to compute gradients.
MLP Architecture
Why this matters
Stacking layers with nonlinear activations turns linear perceptrons into universal function approximators, and fully connected (MLP) blocks appear inside most deep architectures.
A multi-layer perceptron (MLP) stacks fully connected layers: input → one or more hidden layers → output. Each layer applies a weight matrix, bias, and nonlinear activation.
Layer roles
- Input: raw features (e.g. flattened pixels).
- Hidden: learned intermediate representations.
- Output: task head — softmax (multi-class), sigmoid (binary), or linear (regression).
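As a rough illustration of how layer sizes determine model size, the sketch below counts weights and biases for a fully connected stack. The helper name `count_params` and the hidden sizes 16 and 8 are assumptions chosen for illustration; only the 10 inputs and 3 classes echo the exercises in this section.

```python
def count_params(layer_sizes):
    """Count weights and biases in a fully connected stack.

    layer_sizes lists the units per layer, input first, e.g. [10, 16, 8, 3].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out  # weight matrix of shape (n_in, n_out)
        total += n_out         # one bias per output unit
    return total

# Hypothetical sizes: 10 inputs, hidden layers of 16 and 8 units, 3 classes
print(count_params([10, 16, 8, 3]))  # 176 + 136 + 27 = 339
```

Each layer contributes `n_in * n_out` weights plus `n_out` biases, so the first layer usually dominates when the input is wide.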
Common mistakes
- Too few hidden units (underfitting).
- Too many units with no regularization (overfitting).
- Mismatched output activation vs loss (sigmoid + MSE on classification).
Interview checkpoints
- Q: MLP layer roles? A: Input → hidden (learn features) → output (task head).
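- Q: Universal approximation? A: One hidden layer with enough units and a suitable nonlinear activation can approximate any continuous function on a bounded domain to arbitrary accuracy; the theorem says nothing about how many units are needed or how to train them.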
Practice
- Basic: Draw a 2-hidden-layer MLP for 10 inputs, 3 classes.
- Intermediate: Count parameters for layers [784,128,64,10].
- Advanced: Explain why depth often beats width in practice.
Recap
- MLP = fully connected stack with nonlinear activations.
- Depth creates composed nonlinear features.
- Output layer matches task (softmax, sigmoid, linear).
Forward Propagation
Why this matters
Forward propagation is how predictions are computed — you must trace tensor shapes layer by layer to debug any network.
Forward propagation computes predictions by passing inputs through each layer in order. Cache every activation — backprop needs them.
One neuron in layer \(l\)
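$$z_j^{[l]} = \sum_k w_{jk}^{[l]} a_k^{[l-1]} + b_j^{[l]}, \quad a_j^{[l]} = \sigma(z_j^{[l]})$$
Vectorized for a batch (one sample per row): \(Z^{[l]} = A^{[l-1]} W^{[l]} + b^{[l]}\), then \(A^{[l]} = \sigma(Z^{[l]})\). In this row-per-sample form \(W^{[l]}\) has shape \((n_{in}, n_{out})\), i.e. it is the transpose of the matrix with entries \(w_{jk}^{[l]}\) above.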
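The batch formula can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the function names `relu` and `forward`, the layer sizes, and the choice to return raw logits from the last layer are all assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward(X, params):
    """Run a batch through the network and cache (Z, A) for every layer.

    X has shape (batch, n_in); params is a list of (W, b) pairs with
    W of shape (n_in, n_out) and b of shape (n_out,).
    """
    A = X
    cache = [(None, A)]  # A^[0] is the input batch; it has no Z
    for i, (W, b) in enumerate(params):
        Z = A @ W + b                  # affine step; b broadcasts over the batch
        is_last = i == len(params) - 1
        A = Z if is_last else relu(Z)  # assumption: last layer returns raw logits
        cache.append((Z, A))
    return A, cache

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                        # batch of 4 samples, 3 features
params = [(rng.normal(size=(3, 5)), np.zeros(5)),  # hidden layer: 3 -> 5
          (rng.normal(size=(5, 2)), np.zeros(2))]  # output layer: 5 -> 2
logits, cache = forward(X, params)
print(logits.shape)  # (4, 2)
```

The cache holds one `(Z, A)` pair per layer, which is exactly what a backward pass would read.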
Common mistakes
- Forgetting bias in z = Wx + b.
- Applying an activation on the last layer when the loss expects logits.
- Batch dimension dropped (shape (features,) vs (batch, features)).
Interview checkpoints
- Q: Forward pass for layer l? A: a^[l] = σ(W^[l] a^[l-1] + b^[l]).
- Q: Why cache activations? A: Needed for backprop.
Practice
- Basic: Compute z and a for 2 inputs, 1 neuron, ReLU.
- Intermediate: Implement forward pass in NumPy for 2-layer net.
- Advanced: Vectorize forward pass with matrix multiply.
Recap
- Forward = repeated affine transform + activation.
- Shapes: (batch, n_in) @ (n_in, n_out).
- Store activations for backward pass.
Matrix Notation
Why this matters
Matrix notation lets a GPU process a whole batch with one matrix multiply instead of a Python loop over samples, so vectorization is not optional at scale.
Matrix notation is how frameworks run forward passes on GPUs. Always track shapes: \((\text{batch}, n_{in}) @ (n_{in}, n_{out}) \rightarrow (\text{batch}, n_{out})\).
| Symbol | Meaning |
|---|---|
| \(A^{[0]}\) | Input batch |
| \(W^{[l]}, b^{[l]}\) | Weights and biases for layer \(l\) |
| \(\sigma\) | Activation (ReLU, sigmoid, …) |
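To make the shape rule concrete, here is a small NumPy sketch comparing a per-sample loop, the batched row-per-sample form, and the transposed column-per-sample form. The sizes `B`, `n_in`, `n_out` and the variable names are hypothetical and chosen only to keep the example small.

```python
import numpy as np

rng = np.random.default_rng(0)
B, n_in, n_out = 4, 3, 2             # hypothetical small sizes
A_prev = rng.normal(size=(B, n_in))  # one sample per row
W = rng.normal(size=(n_in, n_out))
b = rng.normal(size=(n_out,))

# Per-sample loop: each row is multiplied separately
Z_loop = np.stack([A_prev[i] @ W + b for i in range(B)])

# Batched form: (B, n_in) @ (n_in, n_out) -> (B, n_out)
Z_batch = A_prev @ W + b

# Column-per-sample convention: (n_out, n_in) @ (n_in, B) -> (n_out, B)
Z_cols = W.T @ A_prev.T + b[:, None]

print(Z_batch.shape, Z_cols.shape)  # (4, 2) (2, 4)
```

All three compute the same numbers; the two conventions differ only by a transpose, which is where most `W @ x` versus `x @ W` bugs come from.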
Common mistakes
- Looping over samples instead of batch matmul.
- Wrong transpose in W @ x vs x @ W.
- Mixing (batch, features) with (features, batch).
Interview checkpoints
- Q: Batch matrix multiply shape? A: (B, n) @ (n, m) → (B, m).
- Q: Why vectorize? A: BLAS/GPU parallelism.
Practice
- Basic: Write shapes for batch=32, features=784, hidden=128.
- Intermediate: Replace Python loops with np.dot on toy data.
- Advanced: Profile loop vs vectorized forward pass.
Recap
- One batch step = matrix multiplies across layers.
- Consistent shape discipline prevents bugs.
- Keras handles this; you still verify with model.summary().
Sigmoid Activation
Why this matters
Sigmoid squashes outputs to (0,1) — historically vital, now mostly for binary output layers and gates.
Sigmoid maps logits to \((0, 1)\): \(\sigma(z) = 1 / (1 + e^{-z})\). Use on binary output neurons with binary cross-entropy.
Avoid sigmoid in deep hidden layers — gradients shrink (max derivative 0.25 at \(z=0\)).
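The numbers above can be checked with a short sketch. The function names are assumptions, and the branch on the sign of `z` is one common way to avoid overflow rather than the only one.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: never exponentiates a large positive number."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # derivative expressed through the output

print(sigmoid(0.0), sigmoid_grad(0.0))  # 0.5 0.25
# Upper bound on the product of 10 sigmoid derivatives (each at its maximum)
print(0.25 ** 10)  # roughly 9.5e-07
```

Even in the best case, ten stacked sigmoid layers scale the gradient by less than one millionth, before weights are taken into account.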
Common mistakes
- Sigmoid in deep hidden layers → vanishing gradients.
- Passing sigmoid outputs to a cross-entropy loss that expects logits (the sigmoid is then applied twice).
- Ignoring saturation near 0 and 1.
Interview checkpoints
- Q: Sigmoid derivative max? A: 0.25 at z=0.
- Q: When still use sigmoid? A: Binary output + BCE, LSTM gates.
Practice
- Basic: Plot sigmoid and derivative.
- Intermediate: Train 2-layer net with sigmoid hidden vs ReLU on MNIST subset.
- Advanced: Explain saturation and slow learning.
Recap
- Sigmoid = smooth S-curve.
- Derivative small when |z| large.
- Avoid in deep hidden stacks.
Tanh Activation
Why this matters
Tanh centers outputs around zero — often better than sigmoid in hidden layers but still suffers vanishing gradients when deep.
Tanh outputs in \((-1, 1)\) and is zero-centered: \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\). Often preferred over sigmoid in hidden layers historically, but still saturates.
Compare: ReLU for hidden stacks today; tanh/sigmoid for gates (LSTM) or bounded outputs.
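A small sketch of how tanh relates to sigmoid and how its derivative is computed from its own output. The helper names are assumptions for illustration.

```python
import math

def tanh_grad(z):
    """Derivative of tanh, written in terms of the output: 1 - tanh(z)^2."""
    t = math.tanh(z)
    return 1.0 - t * t

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 2.0):
    # tanh is a rescaled, shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
    print(z, math.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0, tanh_grad(z))
```

The derivative peaks at 1 for \(z = 0\) (four times the sigmoid maximum) but still falls toward 0 as \(|z|\) grows, which is the saturation mentioned above.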
Common mistakes
- Assuming tanh and sigmoid are interchangeable without checking output range.
- Using tanh on pixel inputs already in [0,255] without scaling.
- Forgetting tanh derivative depends on output (1 - tanh²).
Interview checkpoints
- Q: Tanh vs sigmoid? A: Tanh zero-centered; often converges faster in hidden layers.
- Q: Output range? A: (-1, 1).
Practice
- Basic: Compare tanh(0) and sigmoid(0).
- Intermediate: Swap activations in same MLP; compare val loss.
- Advanced: Explain when to prefer LeCun initialization with tanh.
Recap
- Tanh is zero-centered sigmoid cousin.
- Gradients still vanish in very deep nets.
- LSTM often uses tanh in cell update.
ReLU Activation
Why this matters
ReLU is the default hidden activation — cheap, sparse, and avoids saturation on the positive side.
ReLU (Rectified Linear Unit): \(\text{ReLU}(z) = \max(0, z)\). Default for hidden layers — cheap, sparse activations, no saturation for \(z > 0\).
- Dead ReLU: neuron stuck at 0 if weights push all inputs negative — use Leaky ReLU or better init.
- Derivative: 1 if \(z > 0\), else 0.
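The definitions above can be sketched in NumPy as follows. The function names, the derivative convention at exactly \(z = 0\), and the Leaky ReLU slope `alpha=0.01` are assumptions for this example.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Convention used here: the derivative at z == 0 is taken to be 0
    return (z > 0).astype(z.dtype)

def leaky_relu(z, alpha=0.01):
    # alpha is a hypothetical default; it keeps a small slope for z < 0
    return np.where(z > 0, z, alpha * z)

z = np.array([-2.0, 0.0, 3.0])
print(relu(z))        # values 0, 0, 3
print(relu_grad(z))   # values 0, 0, 1
print(leaky_relu(z))  # values -0.02, 0, 3
```

A unit whose pre-activation is negative for every input receives a zero gradient from `relu_grad`, so its weights stop updating; the nonzero slope in `leaky_relu` is what lets such a unit recover.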
Common mistakes
- Dead ReLU: neuron never activates if weights push z always negative.
- Using ReLU on a regression output without checking that targets are non-negative (ReLU cannot produce negative predictions).
- Ignoring Leaky ReLU/ELU, which exist to mitigate dead neurons.
Interview checkpoints
- Q: ReLU formula? A: max(0, z).
- Q: Dead ReLU fix? A: Lower learning rate, He initialization, or Leaky ReLU.
Practice
- Basic: Count % dead units after bad init on random data.
- Intermediate: Train same net ReLU vs tanh hidden.
- Advanced: He initialization intuition for ReLU.
Recap
- ReLU = default for CNN/MLP hidden layers.
- Watch dead neuron fraction.
- Pair with He init.
Loss Functions
Why this matters
The loss function defines what 'wrong' means — wrong loss means the model optimizes the wrong objective.
The loss function measures prediction error. It must match your output activation and task.
| Task | Output | Loss |
|---|---|---|
| Regression | Linear | MSE, MAE, Huber |
| Binary classification | Sigmoid | Binary cross-entropy |
| Multi-class | Softmax | Categorical cross-entropy |
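As one possible sketch of the table's regression and multi-class rows, the NumPy code below implements MSE and a softmax cross-entropy computed directly from logits. The function names are assumptions, and subtracting the row maximum is a standard stability trick rather than something the text prescribes.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max before exponentiating for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def categorical_cross_entropy(logits, labels):
    """Mean cross-entropy for integer class labels, computed from raw logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def mse(y_pred, y_true):
    return np.mean((y_pred - y_true) ** 2)

logits = np.zeros((2, 3))   # uniform scores over 3 classes
labels = np.array([0, 2])   # integer class labels
print(categorical_cross_entropy(logits, labels))  # ln(3), about 1.0986
```

A model that assigns equal probability to all \(K\) classes scores \(\ln K\), which is a handy sanity check for the first training step.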
Common mistakes
- MSE on classification.
- Cross-entropy applied to raw logits without softmax (unless the loss is configured to accept logits).
- Loss that does not decrease during training (usually a bug, not 'slow convergence').
Interview checkpoints
- Q: Binary classification loss? A: Binary cross-entropy (log loss).
- Q: Multi-class? A: Categorical cross-entropy + softmax.
Practice
- Basic: Match loss to task type table (regression, binary, multi-class).
- Intermediate: Implement MSE and BCE in NumPy.
- Advanced: Why CE beats MSE for classification probabilistically.
Recap
- Loss drives all gradients.
- Output activation must align with loss.
- Monitor train vs val loss.
Backpropagation
Why this matters
Backpropagation is the engine of deep learning — it distributes output error backward to every weight efficiently.
Backpropagation applies the chain rule to compute \(\partial L / \partial w\) for every weight, flowing error from output back to input.
Implementation pattern: forward pass (cache \(a, z\)) → compute loss → backward pass → optimizer step.
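The pattern can be sketched for a tiny one-hidden-layer regression net in NumPy. Everything specific here is an assumption made for the example: the tanh hidden layer, the MSE loss, the layer sizes, the random data, and the learning rate. The finite-difference comparison at the end is a quick check that the analytic gradient is plausible.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))  # batch of 8 samples, 3 features
y = rng.normal(size=(8, 1))  # regression targets
W1, b1 = rng.normal(size=(3, 4)) * 0.5, np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)) * 0.5, np.zeros(1)

def loss_and_grads(W1, b1, W2, b2):
    # Forward pass: keep Z1 and A1 for the backward pass
    Z1 = X @ W1 + b1
    A1 = np.tanh(Z1)
    Y_hat = A1 @ W2 + b2
    loss = np.mean((Y_hat - y) ** 2)
    # Backward pass: chain rule from the loss back to each parameter
    dY = 2.0 * (Y_hat - y) / y.size  # dL/dY_hat for the mean squared error
    dW2 = A1.T @ dY
    db2 = dY.sum(axis=0)
    dA1 = dY @ W2.T
    dZ1 = dA1 * (1.0 - A1 ** 2)      # tanh derivative uses the cached output
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return loss, (dW1, db1, dW2, db2)

loss, (dW1, db1, dW2, db2) = loss_and_grads(W1, b1, W2, b2)

# Finite-difference check on a single weight
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (loss_and_grads(W1_plus, b1, W2, b2)[0]
           - loss_and_grads(W1_minus, b1, W2, b2)[0]) / (2 * eps)
print(dW1[0, 0], numeric)  # the two values should agree closely

# Optimizer step: plain gradient descent with a hypothetical learning rate
lr = 0.01
W1_new, b1_new = W1 - lr * dW1, b1 - lr * db1
W2_new, b2_new = W2 - lr * dW2, b2 - lr * db2
```

One backward pass yields every gradient at roughly the cost of a forward pass, whereas the finite-difference check needs two forward passes per parameter.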
Common mistakes
- Manual backprop shape errors in custom layers.
- Forgetting to zero_grad / reset tape in frameworks.
- Stopping at loss without checking intermediate gradients.
Interview checkpoints
- Q: Backprop core idea? A: Chain rule applied layer-wise from loss to weights.
- Q: Computational graph? A: Nodes = ops; backward = reverse topological order.
Practice
- Basic: Backprop through 2-layer net on paper.
- Intermediate: Use tf.GradientTape on toy function.
- Advanced: Explain why backprop is efficient vs finite differences.
Recap
- Backprop = chain rule + caching.
- Frameworks automate; you verify shapes.
- Vanishing/exploding gradients appear here first.
Chain Rule
Why this matters
The chain rule is the calculus backbone of backprop — one broken link breaks the entire gradient flow.
The chain rule links nested functions: if \(L = f(g(h(x)))\), then \(\frac{dL}{dx} = \frac{dL}{dg}\frac{dg}{dh}\frac{dh}{dx}\), where each factor is evaluated at the value produced by the forward pass.
In nets, each layer is one link in the chain. Backprop multiplies local gradients (activation derivative × upstream delta) across layers.
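A scalar sketch of one such chain, loss → sigmoid → affine, with the analytic gradient compared against a central finite difference. The squared-error loss, the helper names, and the specific values of `w`, `b`, `x`, `y` are assumptions picked so the arithmetic is easy to follow.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    return (sigmoid(w * x + b) - y) ** 2

def grad_w(w, b, x, y):
    z = w * x + b
    a = sigmoid(z)
    dL_da = 2.0 * (a - y)  # loss -> activation
    da_dz = a * (1.0 - a)  # activation -> pre-activation
    dz_dw = x              # pre-activation -> weight
    return dL_da * da_dz * dz_dw

w, b, x, y = 0.5, -1.0, 2.0, 1.0  # chosen so that z = 0 and a = 0.5
analytic = grad_w(w, b, x, y)      # (-1.0) * 0.25 * 2.0 = -0.5
eps = 1e-6
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(analytic, numeric)
```

Each local factor is simple on its own; the gradient with respect to `w` is just their product.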
Common mistakes
- Mixing partial derivatives when multiple paths exist (need sum of paths).
- Treating independent variables as dependent.
- Numerical instability when multiplying many small derivatives.
Interview checkpoints
- Q: Chain rule example? A: dL/dx = dL/dy · dy/dx.
- Q: ReLU chain rule at z=0? A: Subgradient convention (0 or 1).
Practice
- Basic: Differentiate composed function f(g(x)).
- Intermediate: Trace chain for loss → sigmoid → affine.
- Advanced: Multi-path graph (branching) gradient sum.
Recap
- Master chain rule on scalars then tensors.
- Each layer is a composed function.
- Vanishing = product of many small terms.
Gradient Descent
Why this matters
Gradient descent turns gradients into weight updates, and the learning rate is often the single most important hyperparameter to tune.
Gradient descent updates weights opposite the loss gradient: \(w \leftarrow w - \eta \frac{\partial L}{\partial w}\). \(\eta\) is the learning rate.
- Batch: full dataset per step — stable, expensive.
- SGD: one sample — noisy, fast.
- Mini-batch: practical default (e.g. 32–256).
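To see how \(\eta\) changes behavior, here is a minimal sketch of the update rule on a one-dimensional quadratic. The loss \(L(w) = (w - 3)^2\), the function name, and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(lr, steps=50, w0=0.0):
    """Minimize L(w) = (w - 3)^2 with plain gradient descent."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)  # dL/dw
        w = w - lr * grad       # step against the gradient
    return w

for lr in (0.001, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

On this loss each step multiplies the distance to the minimum by \(1 - 2\eta\): with \(\eta = 0.1\) the factor is 0.8 and the iterate converges, with \(\eta = 0.001\) it barely moves in 50 steps, and with \(\eta = 1.1\) the factor is \(-1.2\), so the iterate overshoots with alternating sign and diverges.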
Common mistakes
- Learning rate too high: loss oscillates or diverges.
- Learning rate too low: convergence becomes impractically slow.
- Updating weights on validation set (data leakage).
Interview checkpoints
- Q: Update rule? A: w ← w - η ∇L.
- Q: Why mini-batch? A: Noise + GPU efficiency + better generalization.
Practice
- Basic: Sketch loss curve for too high vs too low η.
- Intermediate: Train with lr=0.1 vs 0.001; compare.
- Advanced: Link to Module 3 optimizers (Adam, etc.).
Recap
- GD needs differentiable loss.
- η controls step size.
- Mini-batch SGD is the practical default.
MLP in Keras
Why this matters
Keras lets you build and train MLPs in minutes — production workflows start with a correct Sequential or Functional model.
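Build and train an MLP in Keras with Sequential: stack layers, compile with optimizer + loss, then fit.

    import tensorflow as tf
    from tensorflow import keras

    # Assumption: MNIST as the dataset (784 = 28*28 inputs, 10 classes)
    (x_train, y_train), _ = keras.datasets.mnist.load_data()
    x_train = x_train.reshape(-1, 784).astype("float32") / 255.0  # flatten, scale to [0, 1]

    model = keras.Sequential([
        keras.Input(shape=(784,)),                     # explicit input shape
        keras.layers.Dense(128, activation='relu'),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(10, activation='softmax'),  # 10-class probabilities
    ])
    # Integer labels + softmax output -> sparse categorical cross-entropy
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()  # check shapes and parameter counts before training
    model.fit(x_train, y_train, epochs=10, batch_size=128, validation_split=0.1)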
Common mistakes
- Input shape missing on first layer.
- Wrong loss in compile() for last-layer activation.
- Not calling model.summary() before training.
Interview checkpoints
- Q: model.compile args? A: optimizer, loss, metrics.
- Q: fit() key args? A: x, y, epochs, batch_size, validation_data.
Practice
- Basic: Sequential model: Dense(128,relu) → Dense(10,softmax) on MNIST.
- Intermediate: Add EarlyStopping + ModelCheckpoint.
- Advanced: Functional API two-input model.
Recap
- Keras abstracts forward/backprop.
- Always verify shapes with summary().
- Callbacks automate training hygiene.
MLP Project
Why this matters
An end-to-end MLP project consolidates architecture, training, evaluation, and error analysis — portfolio-worthy if documented.
MLP project checklist: define baseline → build pipeline (scale features, train/val split) → train MLP with sane architecture → track metrics → compare to logistic regression / XGBoost on tabular data.
- Document input shape, layer sizes, activation, loss.
- Plot learning curves (loss / accuracy vs epoch).
- Save best model with ModelCheckpoint.
Common mistakes
- No held-out test set.
- Tuning on test set.
- Reporting accuracy only on imbalanced data.
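The last mistake is easy to demonstrate with a sketch. The labels below are hypothetical (95 negatives, 5 positives), and the "model" simply predicts the majority class; the helper names are assumptions.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred, positive=1):
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts the majority class

print(accuracy(y_true, y_pred))  # 0.95
print(recall(y_true, y_pred))    # 0.0
```

A 95% accuracy here hides the fact that no positive example is ever found, which is why a confusion matrix or per-class recall belongs in the report.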
Interview checkpoints
- Q: Project checklist? A: EDA → baseline → model → val metrics → error analysis.
- Q: What to log? A: Config, metrics, confusion matrix, failure cases.
Practice
- Basic: Define problem, metric, and baseline.
- Intermediate: Train MLP; plot learning curves.
- Advanced: Write README with reproducibility steps.
Recap
- Projects prove you can ship, not just copy notebooks.
- Document data, splits, and metrics.
- Ready for gradient and regularization modules.
