Module 1: Deep Learning Foundations & Perceptrons
Master deep learning foundations: compare deep learning with classical machine learning, trace the perceptron's biological inspiration and logic, and study Rosenblatt's step function and binary classification boundaries.
What is DL?
Why this matters
You must know where deep learning sits in the AI stack and when representation learning beats classical ML — this frames every architecture choice later.
Deep Learning (DL) is a subfield of Machine Learning that learns hierarchical representations from data using neural networks with many layers. Instead of hand-engineering features, the model discovers useful features automatically.
AI → ML → DL
- Artificial Intelligence: Any system that mimics intelligent behavior (rules, search, ML, DL).
- Machine Learning: Learns patterns from data without explicit rules for every case.
- Deep Learning: ML with deep neural networks — especially strong on images, audio, and text.
| Aspect | Machine Learning | Deep Learning |
|---|---|---|
| Features | Often manual (domain expertise) | Learned automatically from raw inputs |
| Data | Works on smaller structured datasets | Needs large datasets; shines at scale |
| Compute | CPU is often enough | GPUs/TPUs for parallel matrix math |
| Interpretability | Easier with linear models, trees | Harder — black-box tradeoffs |
Common mistakes
- Using deep learning on tiny tabular data where gradient boosting wins.
- Assuming more layers always help without enough data or regularization.
- Ignoring compute cost (GPU memory, training time) in project planning.
Interview checkpoints
- Q: ML vs DL in one line? A: DL learns hierarchical features automatically; classical ML often needs hand-crafted features.
- Q: Why did DL take off after 2012? A: Big data + GPUs + better activations/optimizers + breakthrough architectures.
- Q: When not to use DL? A: Small data, strict interpretability, or simple rules suffice.
Practice
- Basic: List 3 DL applications and the input type (image, text, audio).
- Intermediate: Draw the feature-engineering pipeline for ML vs end-to-end DL.
- Advanced: Argue DL vs ML for a 5k-row fraud dataset with 40 features.
Recap
- DL learns representations from raw data using stacked nonlinear layers.
- It needs data, compute, and careful regularization.
- Choose DL when signal is in high-dimensional raw inputs.
Biological Neurons
Why this matters
The biological metaphor explains why networks use weighted sums, thresholds, and layers — it makes perceptrons intuitive, not magical.
Biological neurons inspired early neural network design. An artificial neuron is a simplified mathematical unit — not a literal copy of biology, but a useful mental model.
Biological neuron (simplified)
- Dendrites receive signals from other neurons.
- The cell body integrates incoming signals.
- If activation exceeds a threshold, the axon fires an output signal.
Artificial neuron mapping
| Biology | Artificial model |
|---|---|
| Inputs from other cells | Feature values \(x_1, x_2, \ldots\) |
| Synaptic strength | Weights \(w_i\) |
| Resting potential / threshold | Bias \(b\) |
| Fire or not | Activation function (e.g. step, ReLU) |
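The mapping in the table can be sketched in a few lines of NumPy. The function name `neuron` and the example numbers are illustrative assumptions, not values from the course material.

```python
import numpy as np

def step(z):
    """Fire (1) when the pre-activation reaches the threshold at 0, else 0."""
    return np.where(z >= 0, 1, 0)

def relu(z):
    """Pass positive pre-activations through and clip negatives to 0."""
    return np.maximum(0, z)

def neuron(x, w, b, activation):
    """Weighted sum of inputs plus bias, followed by an activation."""
    z = np.dot(w, x) + b
    return activation(z)

# Illustrative values (assumed): three inputs, three synaptic weights, one bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.4, 0.3, 0.1])
b = 0.2

# Pre-activation: 0.2 - 0.3 + 0.2 + 0.2 = 0.3
fired = neuron(x, w, b, step)    # all-or-nothing output
graded = neuron(x, w, b, relu)   # graded output, roughly 0.3
print(int(fired), float(graded))
```

The same weighted sum feeds both activations; only the final "fire or not" rule changes.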
Common mistakes
- Treating artificial neurons as literal copies of biology (they are abstractions).
- Forgetting that dendrites map to inputs and axon to output.
- Ignoring that biological spikes are not identical to ReLU outputs.
Interview checkpoints
- Q: What does a neuron compute? A: Weighted sum of inputs plus bias, then an activation function.
- Q: Why bias matters? A: Shifts the decision boundary without changing input weights.
Practice
- Basic: Label inputs, weights, bias, activation on a neuron diagram.
- Intermediate: Compare integrate-and-fire vs perceptron step activation.
- Advanced: Explain why biological plausibility is not required for useful ANNs.
Recap
- ANNs are inspired by biology but optimized for math and GPUs.
- Weighted sum + activation is the universal building block.
- Next: formal perceptron model.
Perceptron Model
Why this matters
The perceptron is the simplest trainable classifier — mastering it explains weights, bias, and linear decision boundaries before MLPs.
The perceptron (Frank Rosenblatt, 1958) is a linear binary classifier: it computes a weighted sum and applies a step function to produce 0 or 1.
Perceptron output
$$z = \sum_{i=1}^{n} w_i x_i + b, \quad \hat{y} = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}$$
Worked example
Let \(x_1=2, x_2=1\), \(w_1=1, w_2=-1\), \(b=0\). Then \(z = 2 - 1 = 1 \geq 0\), so prediction is 1. The decision boundary is the line \(x_1 = x_2\).
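The worked example can be reproduced with a minimal sketch. The helper name `perceptron_predict` and the second test point are assumptions added for illustration.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """Return (z, y_hat): the pre-activation score and the 0/1 step output."""
    z = float(np.dot(w, x) + b)
    y_hat = 1 if z >= 0 else 0
    return z, y_hat

w = np.array([1.0, -1.0])
b = 0.0

z, y_hat = perceptron_predict(np.array([2.0, 1.0]), w, b)
print(z, y_hat)  # 1.0 1

# A point on the other side of the line x1 = x2 gets the other label.
z_other, y_other = perceptron_predict(np.array([1.0, 3.0]), w, b)
print(z_other, y_other)  # -2.0 0
```

Returning both `z` and `y_hat` keeps the pre-activation score separate from the final prediction.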
What a single perceptron can learn
Only linearly separable patterns (AND, OR gates). It cannot solve XOR without hidden layers — that limitation motivated multi-layer networks.
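As a sketch of why AND and OR are within reach, the weights below are one hand-picked choice among many (an assumption for illustration, not learned values).

```python
import numpy as np

def perceptron(X, w, b):
    """Step-activated perceptron applied to each row of X."""
    return (X @ w + b >= 0).astype(int)

# All four binary input pairs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Same weights, different bias: the bias slides the line across the unit square.
and_out = perceptron(X, np.array([1.0, 1.0]), -1.5)  # fires only when both inputs are 1
or_out = perceptron(X, np.array([1.0, 1.0]), -0.5)   # fires when at least one input is 1

print(and_out)  # [0 0 0 1]
print(or_out)   # [0 1 1 1]
```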
Common mistakes
- Omitting bias and wondering why the boundary must pass through origin.
- Confusing pre-activation score z with output prediction.
- Using perceptron on multi-class without one-vs-rest strategy.
Interview checkpoints
- Q: Perceptron output rule? A: y = 1 if w·x + b ≥ 0 else 0 (with step activation).
- Q: What can a single perceptron learn? A: Only linearly separable patterns.
Practice
- Basic: Compute perceptron output for w=[1,-1], b=0, x=[2,1].
- Intermediate: Plot the decision boundary for 2D weights.
- Advanced: Implement perceptron update rule in NumPy on a toy dataset.
Recap
- Perceptron = linear classifier + step activation.
- Geometry: hyperplane divides feature space.
- Single layer cannot solve XOR.
Step Activation
Why this matters
Activation functions introduce nonlinearity — without them, stacked layers collapse to one linear map.
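Concretely, two stacked layers with no activation compute \(W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2)\), which is again a single linear layer. A nonlinear activation between the layers is what prevents this collapse.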
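The collapse of stacked linear layers can be checked numerically. This is a minimal sketch; the layer shapes and the random seed are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two stacked layers with no activation (shapes chosen arbitrarily).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

collapsed = bool(np.allclose(two_layers, one_layer))
print(collapsed)  # True
```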
Common activations
| Function | Formula / rule | Typical use |
|---|---|---|
| Step | 1 if \(z \geq 0\) else 0 | Historical perceptron |
| Sigmoid | \(\sigma(z) = 1/(1+e^{-z})\) | Binary output (probability) |
| Tanh | Output in \((-1, 1)\), zero-centered | Hidden layers (older nets) |
| ReLU | \(\max(0, z)\) | Default for hidden layers today |
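The four rules in the table translate directly into NumPy. This is a sketch for experimentation; the sample inputs are assumptions.

```python
import numpy as np

def step(z):
    """1 where z >= 0, else 0."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Squash to (0, 1); usable as a probability for binary outputs."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Squash to (-1, 1), centered on zero."""
    return np.tanh(z)

def relu(z):
    """Keep positive values, clip negatives to 0."""
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 2.0])
for name, fn in [("step", step), ("sigmoid", sigmoid), ("tanh", tanh), ("relu", relu)]:
    print(f"{name:8s}", np.round(fn(z), 3))
```

Evaluating all four on the same inputs makes the different output ranges easy to compare before sketching them by hand.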
Failure mode
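Using sigmoid in many stacked hidden layers tends to cause vanishing gradients: the sigmoid's derivative is at most 0.25, so the backpropagated gradient shrinks at every layer and the early layers stop learning. Prefer ReLU in hidden stacks.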
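A back-of-the-envelope sketch of the effect, looking only at the activation derivatives and ignoring the weight matrices (which also scale the gradient). The depth of 10 is an assumed value.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: s(z) * (1 - s(z)), which peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

depth = 10  # assumed number of hidden layers

# Best case for sigmoid: every layer sits at z = 0, where the slope is largest.
sigmoid_factor = sigmoid_grad(0.0) ** depth
# An active ReLU unit passes the gradient through unchanged.
relu_factor = relu_grad(1.0) ** depth

print(sigmoid_factor)  # about 9.5e-07
print(relu_factor)     # 1.0
```

Even in the sigmoid's best case the activation term alone shrinks the gradient by about six orders of magnitude over ten layers.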
Common mistakes
- Using step function in deep networks (zero gradient almost everywhere).
- Applying softmax on hidden layers instead of output for classification.
- Mixing up activation output range and loss function expectations.
Interview checkpoints
- Q: Why not linear activation in hidden layers? A: Composition of linear maps is still linear.
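- Q: Step vs sigmoid? A: Sigmoid is differentiable everywhere with a nonzero gradient; step has zero gradient wherever it is defined and is not differentiable at 0, so it cannot be trained with backpropagation (historical perceptron only).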
Practice
- Basic: Sketch step, sigmoid, ReLU on the same axis.
- Intermediate: Identify vanishing gradient risk for sigmoid in deep nets.
- Advanced: Pick an activation for output layer on binary vs multi-class tasks.
Recap
- Activations enable nonlinear decision surfaces.
- Step is for understanding; smooth activations train deep nets.
- Match activation to loss (sigmoid+BCE, softmax+CE).
Perceptron Learning Rule
Why this matters
The perceptron learning rule is an early error-driven update and a precursor of the gradient-based training used today: it shows geometrically how errors drive weight updates.
The perceptron learning rule updates weights only when the model misclassifies a training example.
Update rule
$$w_i \leftarrow w_i + \eta (y - \hat{y}) x_i, \quad b \leftarrow b + \eta (y - \hat{y})$$
\(\eta\) = learning rate. Update only when \(y \neq \hat{y}\).
Convergence
If the data is linearly separable, the algorithm converges in finite steps. On noisy or non-separable data, it may never settle — use logistic regression or an MLP instead.
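A minimal NumPy sketch of the learning rule on AND gate data. The function name `train_perceptron`, the zero initialization, and the `max_epochs` cap are assumptions made for this example.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron learning rule; stops after the first epoch with no mistakes."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0
            error = y_i - y_hat  # 0 when correct, +1 or -1 when wrong
            if error != 0:
                w += lr * error * x_i
                b += lr * error
                mistakes += 1
        if mistakes == 0:
            return w, b, epoch
    return w, b, None  # did not converge within max_epochs

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w, b, epochs = train_perceptron(X, y_and)
predictions = [(1 if np.dot(w, x_i) + b >= 0 else 0) for x_i in X]
print(predictions)  # [0, 0, 0, 1]
```

Returning `None` for the epoch count when the cap is hit makes non-convergence explicit, which is the case to expect on non-separable data.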
Common mistakes
- Updating weights when prediction is correct (wastes steps).
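- Blaming the learning rate for non-convergence: on linearly separable data the rule converges for any \(\eta > 0\) (with zero-initialized weights and bias, \(\eta\) only rescales them).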
- Expecting convergence on non-separable noisy data.
Interview checkpoints
- Q: Perceptron update when wrong? A: w ← w + η(y − ŷ)x (and bias similarly).
- Q: Convergence guarantee? A: Only if data is linearly separable.
Practice
- Basic: Apply one manual update step on a misclassified point.
- Intermediate: Train perceptron until convergence on AND gate data.
- Advanced: Show failure on XOR with single perceptron.
Recap
- Each mistake shifts the boundary toward classifying that example correctly.
- Converges only for linearly separable sets.
- XOR needs hidden layer (MLP).
XOR Problem
Why this matters
XOR is the famous proof that shallow linear models fail — it motivated multi-layer networks and modern deep learning.
The XOR problem showed that a single perceptron cannot learn non-linear boundaries. This limitation, highlighted by Minsky and Papert in 1969, contributed to the first AI winter and later motivated multi-layer networks.
XOR truth table
| x₁ | x₂ | y (XOR) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single straight line separates the 1s from the 0s in 2D. You need at least one hidden layer with a nonlinear activation so the network can bend the boundary.
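The claim can be probed with a brute-force sketch: try many candidate lines and record the best score any of them reaches on the four XOR points. The grid range and step are assumptions; the search illustrates the result rather than proving it.

```python
import itertools
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

# Coarse grid of candidate weights and biases: -2.0 to 2.0 in steps of 0.25.
grid = np.linspace(-2.0, 2.0, 17)

best_correct = 0
for w1, w2, b in itertools.product(grid, repeat=3):
    predictions = (X @ np.array([w1, w2]) + b >= 0).astype(int)
    best_correct = max(best_correct, int((predictions == y_xor).sum()))

print(best_correct)  # 3: no candidate line classifies all four points
```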
Intuition
AND and OR are linearly separable; XOR is not. That is why depth matters — not more neurons alone, but composition of nonlinear layers.
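To make the composition idea concrete, here is a sketch of a two-unit hidden layer with ReLU that computes XOR. The weights are hand-picked for illustration (an assumption), not the result of training.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hand-picked weights, not learned.
W_hidden = np.array([[1.0, 1.0],   # h1 = relu(x1 + x2)
                     [1.0, 1.0]])  # h2 = relu(x1 + x2 - 1)
b_hidden = np.array([0.0, -1.0])
w_out = np.array([1.0, -2.0])      # score = h1 - 2 * h2 + b_out
b_out = -0.5                       # puts the threshold halfway between 0 and 1

hidden = relu(X @ W_hidden.T + b_hidden)
score = hidden @ w_out + b_out
xor_out = (score >= 0).astype(int)

print(xor_out)  # [0 1 1 0]
```

The second hidden unit only activates when both inputs are 1, and the output layer uses it to cancel the first unit, which is the "bend" a single line cannot make.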
Common mistakes
- Thinking more epochs will make single-layer perceptron learn XOR.
- Not visualizing why no single line separates XOR classes.
- Skipping to deep nets without understanding why depth helps.
Interview checkpoints
- Q: Why XOR breaks perceptron? A: Not linearly separable in 2D input space.
- Q: Minimal fix? A: Add hidden layer with nonlinear activation (MLP).
Practice
- Basic: Draw XOR points and show no single line works.
- Intermediate: Add one hidden unit and sketch new boundary idea.
- Advanced: Train 2-layer MLP in Keras on XOR.
Recap
- XOR requires hidden representations.
- This limitation contributed to the first AI winter.
- MLPs solve it with depth + nonlinearity.
Linear Separability
Why this matters
Linear separability tells you whether a single layer suffices — essential before choosing model depth.
A dataset is linearly separable if there exists a hyperplane that separates the classes with zero training error (for a linear classifier).
How to check (2D)
- Plot points colored by class.
- Try to draw one straight line separating them (a curved boundary only counts after a kernel or feature transform).
- If the convex hulls of the two classes intersect, no such line exists and the data is not linearly separable.
Support Vector Machines find the maximum-margin separator. A perceptron finds some separator if one exists, but not necessarily the maximum-margin one.
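Separability depends on the feature space being checked. As a sketch, XOR is not separable in its raw 2D inputs, but it becomes separable after adding one product feature. The transform and the weights are hand-chosen assumptions for illustration.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([0, 1, 1, 0])

# Add the product x1 * x2 as a third feature.
X_lifted = np.column_stack([X, X[:, 0] * X[:, 1]])

# In the lifted 3D space one hyperplane is enough: x1 + x2 - 2*x1*x2 - 0.5 = 0.
w = np.array([1.0, 1.0, -2.0])
b = -0.5
lifted_out = (X_lifted @ w + b >= 0).astype(int)

print(lifted_out)  # [0 1 1 0]
```

Kernel methods apply the same idea implicitly, without building the extra features by hand.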
Common mistakes
- Checking separability in raw space when features should be transformed first.
- Confusing linear separability with linear regression assumptions.
- Ignoring soft-margin SVM as alternative to perceptron.
Interview checkpoints
- Q: Define linear separability. A: ∃ hyperplane that separates classes perfectly.
- Q: Test in 2D? A: Try to draw a line; convex hulls disjoint ⇒ separable.
Practice
- Basic: Classify 4 small 2D datasets as separable or not.
- Intermediate: Use sklearn LinearSVC vs Perceptron on same data.
- Advanced: Kernel trick intuition: separable in higher dimension.
Recap
- One neuron = one hyperplane.
- Non-separable ⇒ need features, depth, or kernels.
- Always visualize 2D/3D when possible.
Decision Boundaries
Why this matters
Decision boundaries connect math to intuition — you debug models by seeing where they flip predictions.
The decision boundary is where the model's score equals the threshold (0 for perceptron). In 2D it is a line; in higher dimensions, a hyperplane.
- Perceptron: One hyperplane — simple but limited.
- MLP with ReLU: Piecewise-linear boundaries — can approximate complex shapes.
- Overfitting: Overly wiggly boundaries on training data often fail on validation data.
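For a perceptron in 2D, the boundary line can be computed directly from the weights. The helper names and the example weights below are assumptions for illustration.

```python
import numpy as np

def boundary_x2(x1, w, b):
    """Solve w1*x1 + w2*x2 + b = 0 for x2 (requires w2 != 0)."""
    return -(w[0] * x1 + b) / w[1]

def signed_distance(x, w, b):
    """Signed distance from point x to the boundary; the sign gives the predicted side."""
    return (np.dot(w, x) + b) / np.linalg.norm(w)

# Illustrative weights: the boundary is x1 + 2*x2 - 4 = 0.
w = np.array([1.0, 2.0])
b = -4.0

x1 = np.array([0.0, 2.0, 6.0])
x2_on_line = boundary_x2(x1, w, b)  # x2 = 2, 1, -1 at x1 = 0, 2, 6
print(x2_on_line)

d = signed_distance(np.array([3.0, 3.0]), w, b)  # positive, so predicted class 1
print(d)
```

The points returned by `boundary_x2` are enough to draw the line on a scatter plot, and the sign of `signed_distance` shows where the prediction flips.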
Common mistakes
- Plotting boundaries without scaling features (distorted geometry).
- Ignoring that ReLU nets create piecewise-linear boundaries.
- Only looking at accuracy, not boundary complexity (overfitting).
Interview checkpoints
- Q: Effect of more hidden units on boundary? A: More pieces, more complex shapes.
- Q: L2 regularization effect? A: Simpler, smoother boundaries, less overfit.
Practice
- Basic: Sketch boundary for AND, OR, XOR.
- Intermediate: Plot 2D decision regions for small MLP.
- Advanced: Compare boundaries: perceptron vs 2-layer ReLU MLP.
Recap
- Boundaries visualize what the model learned.
- Depth increases boundary complexity.
- Regularization keeps boundaries sane.
DL vs ML
Why this matters
Teams waste money choosing DL when sklearn suffices — this day is the decision framework for real projects.
Choosing ML vs DL saves time, money, and reliability. Not every problem needs a neural network.
| Scenario | Prefer | Why |
|---|---|---|
| 5k rows, 40 tabular features, fraud detection | ML (XGBoost, logistic) | Less data, need interpretability |
| 1M labeled images, object detection | DL (CNN + transfer learning) | Raw pixels, representation learning |
| Small text dataset, intent classification | Start ML; try fine-tuned BERT if needed | Baseline first, then scale model |
Common mistakes
- Defaulting to ResNet for 500 labeled tabular rows.
- Skipping baselines (logistic regression, XGBoost) before CNNs.
- Underestimating labeling cost for DL data hunger.
Interview checkpoints
- Q: Image 1M labels vs 500 tabular rows — pick? A: CNN/transfer vs boosted trees.
- Q: Interpretability requirement? A: Favor classical ML or explainable models.
Practice
- Basic: For 5 scenarios, pick ML or DL and justify in one sentence.
- Intermediate: Build sklearn baseline then DL model; compare metric/cost.
- Advanced: Write a one-page model selection memo for a startup use case.
Recap
- DL wins on raw high-dimensional data at scale.
- ML wins on small structured data and interpretability.
- Always baseline simple first.
Keras & TensorFlow Setup
Why this matters
A reproducible Keras/TF environment prevents silent GPU/CPU bugs and version skew across the rest of the 100 days.
Set up TensorFlow and Keras once correctly — wrong CUDA versions and missing GPU detection cause hours of debugging later.
```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Confirm the installed version and whether TensorFlow can see a GPU
print("TF version:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))

# Quick sanity check on an MNIST subset
(x_train, y_train), _ = keras.datasets.mnist.load_data()
x_train = x_train[:5000].reshape(-1, 784).astype("float32") / 255.0
y_train = y_train[:5000]

model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # inspect shapes and parameter counts before training
model.fit(x_train, y_train, epochs=3, batch_size=128, validation_split=0.1)
```
- Pin versions in `requirements.txt`.
- Verify GPU with `nvidia-smi` and the TensorFlow device list.
- Use `model.summary()` before every long training run.
Common mistakes
- Installing TF GPU without matching CUDA/cuDNN versions.
- Not pinning package versions in requirements.txt.
- Running training on CPU while thinking GPU is active.
Interview checkpoints
- Q: Check GPU visible in TF? A: tf.config.list_physical_devices('GPU') or nvidia-smi.
- Q: Keras 3 backend? A: Can use TF, JAX, or PyTorch as backend — know your install.
Practice
- Basic: Install TF/Keras; run Hello tensor addition.
- Intermediate: Train a 2-layer MLP on MNIST subset in <2 min.
- Advanced: Dockerfile with pinned TF-GPU for reproducible training.
Recap
- Verify GPU and versions before big experiments.
- Keras Sequential API is enough for early modules.
- Ready for Module 2: MLPs.
