/* ============================================================
   PART I — ANN (Artificial Neural Networks)
   ============================================================
   Renders sections 02–09 into the reader, then mounts React for
   the interactive figures (forward pass, backprop animation,
   learning rate playground, vanishing gradient).
*/

(function() {
  const reader = document.getElementById('reader');
  const ANCHOR = document.getElementById('anchor-ann');

  const html = `
<!-- ========== 02 The artificial neuron ========== -->
<article id="ann-neuron" class="screen" data-screen-label="02 The artificial neuron">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 02 · 4 min</div>
    <h2>The artificial neuron is just a tiny calculator.</h2>
    <p class="section-lede">Stripped of biological metaphor, a neuron is three steps: weight the inputs, add a bias, squash the result through a nonlinearity. That's it.</p>
  </div>

  <div class="prose">
    <p>The unit you'll see drawn as a circle in every diagram does this and only this:</p>

    <div class="eq">
      <span class="var">y</span> <span class="op">=</span> <span class="var">σ</span><span class="op">(</span>
      <span class="var">w</span><span class="sub">1</span><span class="var">x</span><span class="sub">1</span> <span class="op">+</span>
      <span class="var">w</span><span class="sub">2</span><span class="var">x</span><span class="sub">2</span> <span class="op">+</span>
      <span class="op">…</span> <span class="op">+</span>
      <span class="var">w</span><span class="sub">n</span><span class="var">x</span><span class="sub">n</span> <span class="op">+</span>
      <span class="var">b</span><span class="op">)</span>
      <span class="lbl">one neuron, one number out</span>
    </div>
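
    <p>Here's that same three-step computation as a minimal NumPy sketch — toy numbers, nothing learned yet:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span><span class="kw">import</span> numpy <span class="kw">as</span> np

<span class="kw">def</span> <span class="fn">neuron</span>(x, w, b):
    z = np.dot(w, x) + b            <span class="com"># weight the inputs, add the bias</span>
    <span class="kw">return</span> 1 / (1 + np.exp(-z))    <span class="com"># squash through a sigmoid</span>

x = np.array([0.5, -1.2, 3.0])      <span class="com"># inputs</span>
w = np.array([0.8, 0.1, -0.4])      <span class="com"># one weight per input</span>
<span class="fn">print</span>(neuron(x, w, b=0.2))          <span class="com"># one neuron, one number out</span></pre>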

    <div class="fig">
      <div class="fig-title"><strong>Anatomy of a neuron</strong><span>inputs · weights · sum · activation · output</span></div>
      <div id="fig-neuron"></div>
    </div>
    <div class="caption">Inputs flow in from the left, each multiplied by its <span class="tok-data">weight</span>. The neuron sums them, adds a bias, then runs the result through an <span class="tok-act">activation function</span>.</div>

    <h3>Why the activation function matters</h3>
    <p>If you removed the squash — the σ — every layer would just be a matrix multiply. Stack a hundred of them and you still have one (giant) linear function, whose decision boundary is a flat hyperplane. Linear functions can't learn XOR, never mind a face. The nonlinearity is what gives the network its expressive power.</p>

    <div class="fig fig-soft">
      <div class="fig-title"><strong>Three activations you'll meet daily</strong><span>shape determines behavior</span></div>
      <div id="fig-activations"></div>
    </div>
    <div class="caption"><b>Sigmoid</b> squashes to (0,1) — historic, now mostly retired. <b>Tanh</b> squashes to (−1,1) — zero-centered, useful inside RNNs. <b>ReLU</b> is just <code>max(0,x)</code> — almost free to compute, the modern default.</div>

    <div class="callout">
      <div class="callout-title">Developer takeaway</div>
      <p>When in doubt, use <b>ReLU</b> for hidden layers. Use <b>sigmoid</b> on a single output for binary classification, <b>softmax</b> for multi-class, and <b>nothing</b> (linear) for regression. We'll see why later.</p>
    </div>
  </div>
</article>

<!-- ========== 03 Stacking into a network ========== -->
<article id="ann-network" class="screen" data-screen-label="03 Stacking into a network">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 03 · 3 min</div>
    <h2>Wire a few thousand of them up. That's a network.</h2>
    <p class="section-lede">A layer is a column of neurons that all see the same inputs. A network is layers stacked back-to-back. The output of layer L becomes the input of layer L+1.</p>
  </div>
  <div class="prose">
    <div class="fig fig-wide">
      <div class="fig-title"><strong>From neuron to network</strong><span>same operation, repeated</span></div>
      <div id="fig-network"></div>
    </div>
    <div class="caption">Every neuron in a "dense" layer is connected to every neuron in the next. We call this <b>fully-connected</b> or <b>dense</b>. Each connection has its own weight; each neuron has its own bias.</div>

    <p>For a network with input dimension <em>n</em> and a hidden layer of <em>m</em> neurons, the layer is just a matrix multiply:</p>
    <div class="eq">
      <span class="var">h</span> <span class="op">=</span> <span class="var">σ</span>(<span class="var">W</span><span class="var">x</span> <span class="op">+</span> <span class="var">b</span>)
      <span class="lbl">W is m×n, b is m, h is m. Linear algebra all the way down.</span>
    </div>
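
    <p>In code, a whole layer really is one line. A NumPy sketch with made-up sizes:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span><span class="kw">import</span> numpy <span class="kw">as</span> np

n, m = 4, 6                       <span class="com"># input dim, hidden units</span>
W = np.random.randn(m, n) * 0.1   <span class="com"># the m×n weight matrix</span>
b = np.zeros(m)                   <span class="com"># one bias per neuron</span>
x = np.random.randn(n)            <span class="com"># an input vector</span>

h = np.maximum(0, W @ x + b)      <span class="com"># the whole layer: ReLU(Wx + b)</span>
<span class="fn">print</span>(h.shape)                    <span class="com"># (6,)</span></pre>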

    <p>"Deep" learning just means: more than a couple of these. A modern image classifier might stack 50 or 150. A large language model: hundreds. The principle doesn't change.</p>

    <h3>Counting parameters</h3>
    <p>A dense layer from 784 inputs (a flattened 28×28 MNIST image) to 128 hidden units has <b>784 × 128 + 128 = 100,480 parameters</b>. Three layers of that size and you're already past a quarter-million weights — and we haven't even started training.</p>
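
    <p>Counting is worth automating the moment you stack more than two layers — a quick sketch, counting the exact model we'll build in Section 09:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span><span class="kw">def</span> <span class="fn">dense_params</span>(n_in, n_out):
    <span class="kw">return</span> n_in * n_out + n_out   <span class="com"># weights + biases</span>

sizes = [784, 128, 64, 10]        <span class="com"># the Section 09 architecture</span>
<span class="fn">print</span>(<span class="fn">sum</span>(dense_params(a, b) <span class="kw">for</span> a, b <span class="kw">in</span> <span class="fn">zip</span>(sizes, sizes[1:])))   <span class="com"># 109386</span></pre>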
  </div>
</article>

<!-- ========== 04 Forward pass animated ========== -->
<article id="ann-forward" class="screen" data-screen-label="04 Forward pass">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 04 · 4 min</div>
    <h2>The forward pass: data flows left to right.</h2>
    <p class="section-lede">Press play. Watch a vector turn into a prediction, one layer at a time.</p>
  </div>
  <div class="prose">
    <div class="fig fig-wide" id="fig-forward-mount"></div>
    <div class="caption">The activation pulses are colored by sign and sized by magnitude. After ~6 layers, the input vector has been compressed and reshaped into a 3-class probability distribution.</div>

    <p>Three things to notice as you watch:</p>
    <ol>
      <li><b>Information bottlenecks.</b> Each layer can be wider or narrower than the last. Forcing data through a narrow layer makes the network find a compressed representation — this is how autoencoders work.</li>
      <li><b>The output is just another layer.</b> A softmax output is a dense layer with the softmax activation. Nothing special.</li>
      <li><b>It's all matrix math.</b> Every animation pulse is a multiply-accumulate. A modern GPU does billions of them per millisecond. (The whole pass is sketched in code below.)</li>
    </ol>
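
    <p>And here is that whole forward pass as a framework-free sketch — the same five-layer shape as the animation above, with random weights standing in for trained ones:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span><span class="kw">import</span> numpy <span class="kw">as</span> np

sizes = [4, 6, 6, 4, 3]                 <span class="com"># the animated network</span>
params = [(np.random.randn(m, n) * 0.5, np.zeros(m))
          <span class="kw">for</span> n, m <span class="kw">in</span> <span class="fn">zip</span>(sizes, sizes[1:])]

<span class="kw">def</span> <span class="fn">softmax</span>(z):
    e = np.exp(z - z.max())             <span class="com"># subtract max for numerical stability</span>
    <span class="kw">return</span> e / e.sum()

x = np.random.randn(4)
<span class="kw">for</span> i, (W, b) <span class="kw">in</span> <span class="fn">enumerate</span>(params):
    z = W @ x + b
    x = softmax(z) <span class="kw">if</span> i == <span class="fn">len</span>(params) - 1 <span class="kw">else</span> np.maximum(0, z)
<span class="fn">print</span>(x, x.sum())                       <span class="com"># 3 probabilities, summing to 1</span></pre>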
  </div>
</article>

<!-- ========== 05 Loss ========== -->
<article id="ann-loss" class="screen" data-screen-label="05 Loss">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 05 · 4 min</div>
    <h2>Loss measures how wrong we are.</h2>
    <p class="section-lede">Until we put a number on the network's mistakes, there's nothing for gradient descent to descend on.</p>
  </div>
  <div class="prose">
    <p>The <b>loss function</b> takes the network's prediction <em>ŷ</em> and the true label <em>y</em>, and returns a single scalar — bigger is worse. The two you'll meet 95% of the time:</p>

    <div class="split">
      <div>
        <h4>Mean Squared Error · regression</h4>
        <div class="eq" style="font-size:18px;text-align:left;border:none;padding:0;margin:8px 0">L = ½ · ( ŷ − y )<span class="sup">2</span></div>
        <p style="font-size:14px;color:var(--ink-soft);margin:0">Penalizes by squared distance. Predict house prices, temperatures, anything continuous.</p>
      </div>
      <div>
        <h4>Cross-entropy · classification</h4>
        <div class="eq" style="font-size:18px;text-align:left;border:none;padding:0;margin:8px 0">L = − Σ y<span class="sub">i</span> log ŷ<span class="sub">i</span></div>
        <p style="font-size:14px;color:var(--ink-soft);margin:0">Punishes confident-wrong predictions hard. Pair with softmax for multi-class.</p>
      </div>
    </div>
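
    <p>Both fit in a line apiece. A sketch, with a toy 3-class example showing how hard cross-entropy punishes confident mistakes:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span><span class="kw">import</span> numpy <span class="kw">as</span> np

<span class="kw">def</span> <span class="fn">mse</span>(y_hat, y):
    <span class="kw">return</span> 0.5 * (y_hat - y) ** 2

<span class="kw">def</span> <span class="fn">cross_entropy</span>(y_hat, y):                    <span class="com"># y one-hot, y_hat a distribution</span>
    <span class="kw">return</span> -np.sum(y * np.log(y_hat + 1e-12))   <span class="com"># epsilon guards log(0)</span>

<span class="fn">print</span>(mse(2.5, 3.0))                                 <span class="com"># 0.125</span>
y = np.array([0, 1, 0])
<span class="fn">print</span>(cross_entropy(np.array([0.1, 0.8, 0.1]), y))   <span class="com"># ≈ 0.22 — confident and right</span>
<span class="fn">print</span>(cross_entropy(np.array([0.8, 0.1, 0.1]), y))   <span class="com"># ≈ 2.30 — confident and wrong</span></pre>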

    <div class="fig fig-soft">
      <div class="fig-title"><strong>The loss landscape</strong><span>training is a hike downhill</span></div>
      <div id="fig-landscape"></div>
    </div>
    <div class="caption">Each axis is a weight. The surface is the loss for every weight combination. Training picks a starting point at random and rolls downhill — guided by the gradient.</div>

    <p class="aside">In a real network, the landscape lives in millions of dimensions, not two. We can't visualize it; we just trust the gradient and step.</p>
  </div>
</article>

<!-- ========== 06 Backpropagation ========== -->
<article id="ann-backprop" class="screen" data-screen-label="06 Backpropagation">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 06 · 7 min</div>
    <h2>Backpropagation: assigning blame, layer by layer.</h2>
    <p class="section-lede">Every weight in the network deserves a share of credit (or blame) for the final error. Backprop is the bookkeeping.</p>
  </div>
  <div class="prose">
    <p>The intuition is older than computers: the chain rule. If <em>L</em> depends on <em>z</em> which depends on <em>w</em>, then</p>
    <div class="eq">∂L/∂w = (∂L/∂z) · (∂z/∂w)<span class="lbl">chain rule — calculus in two clauses</span></div>
    <p>Backprop is just the chain rule applied <b>repeatedly</b>, from the output back to the very first weight. The clever bit is that it reuses partial computations: the gradient flowing into layer <em>L</em> is exactly what you need to compute the gradient inside layer <em>L</em>.</p>
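
    <p>A worked micro-example — one weight, squared-error loss, made-up numbers:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span>w, x, y = 2.0, 3.0, 5.0
z = w * x                  <span class="com"># forward: z = 6</span>
L = 0.5 * (z - y) ** 2     <span class="com"># forward: L = 0.5</span>

dL_dz = z - y              <span class="com"># ∂L/∂z = 1</span>
dz_dw = x                  <span class="com"># ∂z/∂w = 3</span>
dL_dw = dL_dz * dz_dw      <span class="com"># chain rule: ∂L/∂w = 3</span>
<span class="fn">print</span>(dL_dw)               <span class="com"># 3.0</span></pre>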

    <div class="fig fig-wide" id="fig-backprop-mount"></div>
    <div class="caption">Forward pass in <span class="tok-data">blue</span> carries activations left-to-right. Backward pass in <span class="tok-act">red</span> carries gradients right-to-left. <b>Each weight gets updated by its own contribution to the loss.</b></div>

    <h3>The training loop, in five lines</h3>
    <pre class="code"><span class="code-tag">pseudocode</span><span class="kw">for</span> epoch <span class="kw">in</span> <span class="fn">range</span>(<span class="num">N</span>):
    <span class="kw">for</span> x, y <span class="kw">in</span> dataset:
        y_hat = model(x)              <span class="com"># forward</span>
        loss  = loss_fn(y_hat, y)     <span class="com"># scalar</span>
        grads = loss.backward()       <span class="com"># backprop</span>
        weights -= lr * grads         <span class="com"># step</span></pre>

    <p>That's the entire algorithm. Every framework — TensorFlow, PyTorch, JAX — is fundamentally a high-performance implementation of these five lines.</p>
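
    <p>To see the loop with nothing hidden, here's a framework-free sketch for a one-weight linear model — the gradient is written by hand, because there's no autograd:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span><span class="kw">import</span> numpy <span class="kw">as</span> np

xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs                           <span class="com"># the "dataset": y = 2x</span>
w, lr = 0.0, 0.05

<span class="kw">for</span> epoch <span class="kw">in</span> <span class="fn">range</span>(50):
    y_hat = w * xs                      <span class="com"># forward</span>
    loss  = np.mean(0.5 * (y_hat - ys) ** 2)
    grad  = np.mean((y_hat - ys) * xs)  <span class="com"># backprop, by hand</span>
    w    -= lr * grad                   <span class="com"># step</span>
<span class="fn">print</span>(w)                                <span class="com"># ≈ 2.0</span></pre>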

    <div class="callout">
      <div class="callout-title">Why &ldquo;automatic differentiation&rdquo; matters</div>
      <p>You don't write the gradient yourself. The framework records every operation in the forward pass as a graph, then walks it backward applying the chain rule mechanically. This is why building new architectures is mostly composing forward passes and letting autograd handle the rest.</p>
    </div>
  </div>
</article>

<!-- ========== 07 Learning rate playground ========== -->
<article id="ann-lr" class="screen" data-screen-label="07 Learning rate playground">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 07 · 4 min · interactive</div>
    <h2>The learning rate is the most-tuned hyperparameter you'll ever touch.</h2>
    <p class="section-lede">Too small: training crawls. Too large: it explodes. Just right: a smooth descent.</p>
  </div>
  <div class="prose">
    <p>Drag the slider. Watch the loss curve and the path on the landscape change in real time.</p>

    <div class="fig fig-wide" id="fig-lr-mount"></div>
    <div class="caption">The same network, the same data, the same starting weights. Only the learning rate changes. Notice how a 100× difference in <em>lr</em> turns "learns in 30 steps" into "diverges in 3."</div>

    <h3>Field notes on learning rates</h3>
    <ul>
      <li><b>Start with 1e-3</b> for Adam. <b>1e-2</b> for SGD with momentum. These are reasonable defaults for 90% of problems.</li>
      <li><b>If loss explodes to NaN</b> in the first few steps — your lr is too high. Cut by 10×.</li>
      <li><b>If loss flatlines</b> for hundreds of steps — your lr is probably too low (or your data is broken).</li>
      <li><b>Use a scheduler.</b> Most modern training drops the lr by a factor at certain epochs, or follows a cosine curve. <code>tf.keras.callbacks.ReduceLROnPlateau</code> is a fine starting point — wired up in the sketch below.</li>
    </ul>
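
    <p>Wiring that scheduler in is one extra argument to <code>fit</code> — a sketch that assumes the compiled model from Section 09:</p>

    <pre class="code"><span class="code-tag">tensorflow / keras · illustrative</span><span class="kw">import</span> tensorflow <span class="kw">as</span> tf

<span class="com"># Halve the lr whenever validation loss stalls for 3 epochs</span>
reduce_lr = tf.keras.callbacks.<span class="fn">ReduceLROnPlateau</span>(
    monitor=<span class="str">'val_loss'</span>, factor=<span class="num">0.5</span>, patience=<span class="num">3</span>, min_lr=<span class="num">1e-6</span>)

model.<span class="fn">fit</span>(x_train, y_train, epochs=<span class="num">30</span>,
          validation_data=(x_val, y_val),
          callbacks=[reduce_lr])</pre>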
  </div>
</article>

<!-- ========== 08 Vanishing gradient ========== -->
<article id="ann-vanish" class="screen" data-screen-label="08 Vanishing gradient">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 08 · 5 min</div>
    <h2>The vanishing gradient: why deep used to be impossible.</h2>
    <p class="section-lede">For decades, training networks deeper than a few layers simply didn't work. The reason was a small, sneaky multiplication.</p>
  </div>
  <div class="prose">
    <p>Backprop multiplies gradients through every layer. If each layer's local gradient is <em>less than 1</em> on average — and for sigmoid/tanh, it almost always is — then 20 layers later the signal has been multiplied by something like <code>0.25<span class="sup">20</span> ≈ 10<span class="sup">−12</span></code>. The first layers learn essentially nothing.</p>
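
    <p>The arithmetic is easy to check for yourself:</p>

    <pre class="code"><span class="code-tag">python · illustrative</span>g = 1.0
<span class="kw">for</span> layer <span class="kw">in</span> <span class="fn">range</span>(20):
    g *= 0.25             <span class="com"># sigmoid's best-case derivative</span>
<span class="fn">print</span>(g)                  <span class="com"># ≈ 9.1e-13 — the signal is gone</span></pre>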

    <div class="fig" id="fig-vanish-mount"></div>
    <div class="caption">Each bar is the gradient magnitude at one layer. With sigmoid activations, gradients <em>halve</em> roughly every layer; by layer 10 they're statistically zero.</div>

    <h3>What rescued deep learning</h3>
    <ol>
      <li><b>ReLU activations</b> — gradient is exactly 1 for positive inputs, no shrinkage.</li>
      <li><b>Better initialization</b> (He, Xavier) — keep activations from collapsing or exploding at layer 1.</li>
      <li><b>Batch / Layer normalization</b> — re-center activations between layers so the signal doesn't drift.</li>
      <li><b>Residual connections</b> — let gradients skip layers entirely. ResNet's central trick.</li>
    </ol>
    <p>Together, these turned "deep" from a research curiosity into engineering reality. Keep them in your back pocket — they're the answer when training silently fails to learn.</p>
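
    <p>In Keras, the first three fixes are a keyword argument and one extra layer — a sketch, reusing the imports from Section 09:</p>

    <pre class="code"><span class="code-tag">tensorflow / keras · illustrative</span>model = models.<span class="fn">Sequential</span>([
    layers.<span class="fn">Input</span>(shape=(<span class="num">784</span>,)),
    layers.<span class="fn">Dense</span>(<span class="num">128</span>, activation=<span class="str">'relu'</span>,
                 kernel_initializer=<span class="str">'he_normal'</span>),   <span class="com"># fixes 1 + 2</span>
    layers.<span class="fn">BatchNormalization</span>(),                    <span class="com"># fix 3</span>
    layers.<span class="fn">Dense</span>(<span class="num">10</span>, activation=<span class="str">'softmax'</span>)
])</pre>
    <p>Residual connections (fix 4) need the functional API rather than <code>Sequential</code>: they add a layer's input straight back onto its output.</p>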

    <div class="aside">The same issue, in <em>time</em> instead of <em>depth</em>, is the reason we'll need LSTMs in Part IV. Same disease, different vector.</div>
  </div>
</article>

<!-- ========== 09 Keras code ========== -->
<article id="ann-code" class="screen" data-screen-label="09 ANN in Keras">
  <div class="section-head">
    <div class="section-eyebrow">Part I · Section 09 · 3 min</div>
    <h2>An ANN in Keras, in 12 lines.</h2>
    <p class="section-lede">Everything we've covered, expressed as code. Skim now; come back when you're ready to run it.</p>
  </div>
  <div class="prose">
    <pre class="code"><span class="code-tag">tensorflow / keras</span><span class="kw">import</span> tensorflow <span class="kw">as</span> tf
<span class="kw">from</span> tensorflow.keras <span class="kw">import</span> layers, models

<span class="com"># 1. Define the architecture</span>
model = models.<span class="fn">Sequential</span>([
    layers.<span class="fn">Input</span>(shape=(<span class="num">784</span>,)),               <span class="com"># flat MNIST image</span>
    layers.<span class="fn">Dense</span>(<span class="num">128</span>, activation=<span class="str">'relu'</span>),
    layers.<span class="fn">Dense</span>(<span class="num">64</span>,  activation=<span class="str">'relu'</span>),
    layers.<span class="fn">Dense</span>(<span class="num">10</span>,  activation=<span class="str">'softmax'</span>)    <span class="com"># 10 digit classes</span>
])

<span class="com"># 2. Tell it what loss & optimizer to use</span>
model.<span class="fn">compile</span>(
    optimizer=tf.keras.optimizers.<span class="fn">Adam</span>(learning_rate=<span class="num">1e-3</span>),
    loss=<span class="str">'sparse_categorical_crossentropy'</span>,
    metrics=[<span class="str">'accuracy'</span>]
)

<span class="com"># 3. Train</span>
model.<span class="fn">fit</span>(x_train, y_train, epochs=<span class="num">10</span>, batch_size=<span class="num">64</span>,
          validation_data=(x_val, y_val))</pre>

    <p>Map this back to what we just learned:</p>
    <ul>
      <li><code>Sequential</code> ↔ stack of layers, output of one feeds the next.</li>
      <li><code>Dense(128, relu)</code> ↔ a fully-connected layer of 128 neurons with ReLU activation. (The <em>W</em> and <em>b</em> are created automatically.)</li>
      <li><code>Adam</code> ↔ a smarter SGD that adapts the learning rate per-weight.</li>
      <li><code>fit</code> ↔ the five-line training loop, hidden behind one call.</li>
    </ul>
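
    <p>The only missing piece is the data itself. For MNIST it's a few lines, using the loader bundled with Keras:</p>

    <pre class="code"><span class="code-tag">tensorflow / keras · illustrative</span>(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.<span class="fn">load_data</span>()
x_train = x_train.reshape(-1, <span class="num">784</span>).astype(<span class="str">'float32'</span>) / <span class="num">255.0</span>   <span class="com"># flatten + scale to [0,1]</span>
x_test  = x_test.reshape(-1, <span class="num">784</span>).astype(<span class="str">'float32'</span>) / <span class="num">255.0</span>
x_val, y_val = x_train[-<span class="num">5000</span>:], y_train[-<span class="num">5000</span>:]                <span class="com"># hold out a validation slice</span>
x_train, y_train = x_train[:-<span class="num">5000</span>], y_train[:-<span class="num">5000</span>]
<span class="com"># labels stay as integers 0–9 — exactly what sparse_categorical_crossentropy expects</span></pre>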

    <div class="callout">
      <div class="callout-title">Practical sanity check</div>
      <p>On MNIST, this exact model reaches ~98% test accuracy in under a minute on a CPU. If yours doesn't — your data isn't normalized to <code>[0,1]</code>, or your labels are one-hot but you used <code>sparse_categorical_crossentropy</code> (or vice versa). It's almost always one of those two.</p>
    </div>
  </div>
</article>
  `;

  if (ANCHOR) {
    ANCHOR.insertAdjacentHTML('afterend', html);
  }

  // ============================================================
  // Static SVG figures (mounted as innerHTML)
  // ============================================================

  // ---- Neuron figure ----
  const figNeuron = document.getElementById('fig-neuron');
  if (figNeuron) {
    figNeuron.innerHTML = `
      <svg viewBox="0 0 720 280" width="100%">
        <!-- inputs -->
        ${[0,1,2,3].map(i => {
          const y = 60 + i * 50;
          return `
            <circle cx="80" cy="${y}" r="14" fill="#dde7f7" stroke="#1f6feb" stroke-width="1.4"/>
            <text x="80" y="${y+4}" text-anchor="middle" font-family="Fraunces" font-size="12" font-weight="600" fill="#1f6feb">x${'₁₂₃₄'[i]}</text>
            <line x1="94" y1="${y}" x2="320" y2="140" stroke="#1f6feb" stroke-width="1.2" opacity="0.5"/>
            <text x="200" y="${y - 24 + (i*4)}" font-family="JetBrains Mono" font-size="11" fill="#1f6feb" opacity="0.85">w${'₁₂₃₄'[i]}</text>
          `;
        }).join('')}
        <text x="40" y="40" font-family="JetBrains Mono" font-size="10" fill="#8a877f" letter-spacing="1">INPUTS</text>
        <text x="200" y="40" font-family="JetBrains Mono" font-size="10" fill="#8a877f" letter-spacing="1">WEIGHTS</text>

        <!-- summation -->
        <circle cx="340" cy="140" r="32" fill="#fbfaf6" stroke="#1a1a1a" stroke-width="1.5"/>
        <text x="340" y="146" text-anchor="middle" font-family="Fraunces" font-size="22" font-weight="500">Σ</text>
        <text x="340" y="190" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#8a877f" letter-spacing="1">SUM + BIAS</text>

        <!-- bias -->
        <circle cx="340" cy="60" r="12" fill="#f4f1e8" stroke="#8a877f" stroke-width="1"/>
        <text x="340" y="64" text-anchor="middle" font-family="Fraunces" font-size="11" font-weight="500">b</text>
        <line x1="340" y1="72" x2="340" y2="108" stroke="#8a877f" stroke-width="1.2" stroke-dasharray="2 2"/>

        <!-- activation -->
        <rect x="420" y="110" width="80" height="60" rx="3" fill="#f6e4d8" stroke="#c84e1d" stroke-width="1.5"/>
        <path d="M 432 154 Q 450 154 460 140 Q 470 126 488 126" stroke="#c84e1d" stroke-width="2" fill="none"/>
        <text x="460" y="190" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#c84e1d" letter-spacing="1">σ ACTIVATION</text>
        <line x1="372" y1="140" x2="420" y2="140" stroke="#1a1a1a" stroke-width="1.2"/>

        <!-- output -->
        <line x1="500" y1="140" x2="600" y2="140" stroke="#1a7a4c" stroke-width="1.5"/>
        <circle cx="616" cy="140" r="14" fill="#d6e8de" stroke="#1a7a4c" stroke-width="1.4"/>
        <text x="616" y="144" text-anchor="middle" font-family="Fraunces" font-size="12" font-weight="600" fill="#1a7a4c">y</text>
        <text x="616" y="180" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#1a7a4c" letter-spacing="1">OUTPUT</text>

        <!-- pulse -->
        <circle r="3" fill="#c84e1d">
          <animate attributeName="cx" values="80;340;460;616" dur="3s" repeatCount="indefinite"/>
          <animate attributeName="cy" values="110;140;140;140" dur="3s" repeatCount="indefinite"/>
          <animate attributeName="opacity" values="0;1;1;0" keyTimes="0;0.1;0.9;1" dur="3s" repeatCount="indefinite"/>
        </circle>
      </svg>
    `;
  }

  // ---- Activations figure ----
  const figAct = document.getElementById('fig-activations');
  if (figAct) {
    const W = 720, H = 220;
    function curve(fn) {
      const pts = [];
      for (let i = 0; i <= 60; i++) {
        const x = -6 + (i / 60) * 12;
        const y = fn(x);
        pts.push([x, y]);
      }
      return pts;
    }
    function plot(pts, x0, w, h) {
      const yPad = 10;
      const xs = pts.map(p => p[0]);
      const ys = pts.map(p => p[1]);
      const xmin = Math.min(...xs), xmax = Math.max(...xs);
      const ymin = -1.2, ymax = 2.2;
      const X = (x) => x0 + ((x - xmin) / (xmax - xmin)) * w;
      const Y = (y) => yPad + (1 - (y - ymin) / (ymax - ymin)) * (h - 2*yPad);
      let d = '';
      pts.forEach((p, i) => { d += (i ? ' L ' : 'M ') + X(p[0]).toFixed(1) + ' ' + Y(p[1]).toFixed(1); });
      return { path: d, X, Y, x0, w };
    }
    const sig = plot(curve(x => 1/(1+Math.exp(-x))), 30, 200, H);
    const tah = plot(curve(x => Math.tanh(x)), 270, 200, H);
    const rel = plot(curve(x => Math.max(0,x)), 510, 200, H);

    function axes(ax) {
      const yMid = ax.Y(0);
      const xMid = ax.X(0);
      return `
        <line x1="${ax.x0}" y1="${yMid}" x2="${ax.x0+ax.w}" y2="${yMid}" stroke="#c9c2ad" stroke-width="0.8"/>
        <line x1="${xMid}" y1="10" x2="${xMid}" y2="${H-10}" stroke="#c9c2ad" stroke-width="0.8"/>
      `;
    }
    figAct.innerHTML = `
      <svg viewBox="0 0 ${W} ${H+40}" width="100%">
        ${axes(sig)}
        <path d="${sig.path}" stroke="#1f6feb" stroke-width="2" fill="none"/>
        <text x="${sig.x0+ax_w(sig)/2}" y="${H+20}" text-anchor="middle" font-family="Fraunces" font-size="14" font-weight="600">sigmoid</text>
        <text x="${sig.x0+ax_w(sig)/2}" y="${H+34}" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#8a877f">σ(x) = 1 / (1 + e⁻ˣ)</text>

        ${axes(tah)}
        <path d="${tah.path}" stroke="#c84e1d" stroke-width="2" fill="none"/>
        <text x="${tah.x0+ax_w(tah)/2}" y="${H+20}" text-anchor="middle" font-family="Fraunces" font-size="14" font-weight="600">tanh</text>
        <text x="${tah.x0+ax_w(tah)/2}" y="${H+34}" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#8a877f">tanh(x)</text>

        ${axes(rel)}
        <path d="${rel.path}" stroke="#1a7a4c" stroke-width="2" fill="none"/>
        <text x="${rel.x0+ax_w(rel)/2}" y="${H+20}" text-anchor="middle" font-family="Fraunces" font-size="14" font-weight="600">ReLU</text>
        <text x="${rel.x0+ax_w(rel)/2}" y="${H+34}" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#8a877f">max(0, x)</text>
      </svg>
    `;
  }

  // ---- Network figure ----
  const figNet = document.getElementById('fig-network');
  if (figNet) {
    const W = 880, H = 320;
    const layers = [
      { x: 100, n: 4, label: 'INPUT (4)' },
      { x: 320, n: 6, label: 'HIDDEN (6)' },
      { x: 540, n: 6, label: 'HIDDEN (6)' },
      { x: 760, n: 3, label: 'OUTPUT (3)' },
    ];
    const positions = layers.map(l => {
      const sp = Math.min(40, (H-80) / Math.max(1, l.n-1));
      const total = sp * (l.n-1);
      return Array.from({length:l.n},(_,i)=>({x:l.x, y:H/2 - total/2 + i*sp}));
    });
    let edges = '';
    for (let i = 0; i < positions.length-1; i++) {
      for (const a of positions[i]) {
        for (const b of positions[i+1]) {
          edges += `<line x1="${a.x}" y1="${a.y}" x2="${b.x}" y2="${b.y}" stroke="#1a1a1a" stroke-width="0.5" opacity="${0.08+Math.random()*0.18}"/>`;
        }
      }
    }
    let nodes = '';
    positions.forEach((layer, li) => {
      layer.forEach(p => {
        const fill = li === 0 ? '#dde7f7' : li === positions.length-1 ? '#d6e8de' : '#fbfaf6';
        const stroke = li === 0 ? '#1f6feb' : li === positions.length-1 ? '#1a7a4c' : '#1a1a1a';
        nodes += `<circle cx="${p.x}" cy="${p.y}" r="9" fill="${fill}" stroke="${stroke}" stroke-width="1.3"/>`;
      });
    });
    let labels = '';
    layers.forEach(l => {
      labels += `<text x="${l.x}" y="${H-15}" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#8a877f" letter-spacing="1">${l.label}</text>`;
    });
    figNet.innerHTML = `<svg viewBox="0 0 ${W} ${H}" width="100%">${edges}${nodes}${labels}</svg>`;
  }

  // ---- Loss landscape ----
  const figLand = document.getElementById('fig-landscape');
  if (figLand) {
    const W = 720, H = 320;
    // Draw concentric ovals as a contour map + a descent path
    let contours = '';
    for (let i = 1; i <= 7; i++) {
      const rx = i * 38;
      const ry = i * 24;
      const op = 0.08 + i * 0.05;
      contours += `<ellipse cx="${W/2+30}" cy="${H/2+10}" rx="${rx}" ry="${ry}" fill="none" stroke="#c84e1d" stroke-width="0.8" opacity="${op}" transform="rotate(-18 ${W/2+30} ${H/2+10})"/>`;
    }
    // descent path
    const path = [
      [120, 60], [180, 100], [230, 140], [270, 175], [310, 200], [340, 220], [365, 232], [380, 240], [388, 246]
    ];
    let pathD = path.map((p,i)=>(i?'L':'M')+' '+p[0]+' '+p[1]).join(' ');
    let dots = path.map((p,i)=>`<circle cx="${p[0]}" cy="${p[1]}" r="${i===path.length-1?5:3}" fill="${i===path.length-1?'#1a7a4c':'#1a1a1a'}"/>`).join('');
    figLand.innerHTML = `
      <svg viewBox="0 0 ${W} ${H}" width="100%">
        ${contours}
        <text x="${W/2+30}" y="${H/2+14}" text-anchor="middle" font-family="JetBrains Mono" font-size="10" fill="#c84e1d" letter-spacing="1">MIN</text>
        <path d="${pathD}" stroke="#1a1a1a" stroke-width="1.5" fill="none" stroke-dasharray="3 3"/>
        ${dots}
        <text x="120" y="48" font-family="JetBrains Mono" font-size="10" fill="#8a877f">START · random init</text>
        <text x="20" y="${H-20}" font-family="JetBrains Mono" font-size="10" fill="#8a877f">w₁ →</text>
        <text x="20" y="30" font-family="JetBrains Mono" font-size="10" fill="#8a877f">↑ w₂</text>
      </svg>
    `;
  }

  // ============================================================
  // React-mounted interactive figures (forward, backprop, lr, vanish)
  // ============================================================
  // We rely on React being available — this whole script is type="text/babel".
  // Mounting happens at the end (window load) so DOM nodes exist.
  window.__mountAnnFigures = function() {
    if (window.__annMounted) return;
    if (!window.React || !window.ReactDOM) return;
    window.__annMounted = true;

    const { useState, useEffect, useMemo } = React;

    // ---------- Forward pass animation ----------
    function ForwardPassFig() {
      const layers = [4, 6, 6, 4, 3];
      const W = 900, H = 320;
      const xs = layers.map((_,i) => 80 + i * (W-160) / (layers.length-1));
      const positions = layers.map((n, i) => {
        const sp = Math.min(40, (H-80) / Math.max(1, n-1));
        const total = sp * (n-1);
        return Array.from({length:n},(_,j)=>({x:xs[i], y:H/2 - total/2 + j*sp}));
      });
      const [tick, setTick] = useState(0);
      const [playing, setPlaying] = useState(true);
      useEffect(() => {
        if (!playing) return;
        const id = setInterval(() => setTick(t => (t+1) % 600), 30);
        return () => clearInterval(id);
      }, [playing]);

      // moving "front" of activation
      const T = (tick % 200) / 200;  // 0..1
      const segIdx = Math.min(layers.length - 2, Math.floor(T * (layers.length - 1)));
      const segT = (T * (layers.length - 1)) - segIdx;

      // Determine which layers are "lit"
      const litUpTo = Math.floor(T * layers.length);

      const edges = [];
      for (let i = 0; i < positions.length - 1; i++) {
        for (const a of positions[i]) {
          for (const b of positions[i+1]) {
            const lit = i === segIdx;
            edges.push(
              <line key={`${i}-${a.x}-${a.y}-${b.x}-${b.y}`}
                x1={a.x} y1={a.y} x2={b.x} y2={b.y}
                stroke={lit ? '#c84e1d' : '#1a1a1a'}
                strokeWidth={lit ? 1.1 : 0.5}
                opacity={lit ? 0.5 : 0.12}/>
            );
          }
        }
      }

      const nodes = [];
      positions.forEach((layer, li) => {
        const lit = li <= litUpTo;
        layer.forEach((p, ni) => {
          const fill = li === 0 ? '#dde7f7' : li === positions.length-1 ? '#d6e8de' : (lit ? '#f6e4d8' : '#fbfaf6');
          const stroke = li === 0 ? '#1f6feb' : li === positions.length-1 ? '#1a7a4c' : (lit ? '#c84e1d' : '#1a1a1a');
          const r = lit ? 11 : 9;
          nodes.push(<circle key={`n-${li}-${ni}`} cx={p.x} cy={p.y} r={r} fill={fill} stroke={stroke} strokeWidth={1.3}/>);
        });
      });

      // pulse particles between segIdx and segIdx+1
      const pulses = [];
      const A = positions[segIdx];
      const B = positions[segIdx+1];
      if (A && B) {
        for (let i = 0; i < A.length; i++) {
          for (let j = 0; j < B.length; j++) {
            const x = A[i].x + (B[j].x - A[i].x) * segT;
            const y = A[i].y + (B[j].y - A[i].y) * segT;
            pulses.push(<circle key={`p-${i}-${j}`} cx={x} cy={y} r={1.6} fill="#c84e1d" opacity={0.7}/>);
          }
        }
      }

      // labels
      const labels = ['INPUT', 'HIDDEN 1', 'HIDDEN 2', 'HIDDEN 3', 'OUTPUT'];
      const labelEls = layers.map((_,i) => (
        <text key={'lb'+i} x={xs[i]} y={H-12} textAnchor="middle"
          fontFamily="JetBrains Mono" fontSize="10" fill="#8a877f" letterSpacing="1">{labels[i]}</text>
      ));

      // probability bars at output
      const outProbs = [0.12, 0.71, 0.17];
      const outX = xs[xs.length-1] + 30;
      const probEls = outProbs.map((p,i) => {
        const y = H/2 - 30 + i * 22;
        return (
          <g key={'pr'+i}>
            <rect x={outX} y={y-8} width={60*p} height={14} fill="#1a7a4c" opacity={0.3+p*0.7}/>
            <text x={outX-8} y={y+3} textAnchor="end" fontFamily="JetBrains Mono" fontSize="11" fill="#1a7a4c">cls {i+1}</text>
            <text x={outX+62} y={y+3} fontFamily="JetBrains Mono" fontSize="11" fill="#1a7a4c">{(p*100).toFixed(0)}%</text>
          </g>
        );
      });

      return (
        <div>
          <div className="fig-title"><strong>Forward pass — animated</strong><span>data flowing left → right</span></div>
          <svg viewBox={`0 0 ${W+120} ${H}`} width="100%">
            {edges}
            {pulses}
            {nodes}
            {labelEls}
            {probEls}
          </svg>
          <div className="fig-controls">
            <button className="btn-ghost btn-sm" onClick={() => setPlaying(p => !p)}>{playing ? '⏸ Pause' : '▶ Play'}</button>
            <button className="btn-ghost btn-sm" onClick={() => setTick(0)}>↺ Reset</button>
            <span className="ctrl-label" style={{marginLeft:12}}>step</span>
            <input type="range" min="0" max="599" value={tick} onChange={e => { setPlaying(false); setTick(+e.target.value); }} />
            <span className="ctrl-value">{tick}</span>
          </div>
        </div>
      );
    }

    const fwdMount = document.getElementById('fig-forward-mount');
    if (fwdMount) ReactDOM.createRoot(fwdMount).render(<ForwardPassFig/>);

    // ---------- Backprop animation ----------
    function BackpropFig() {
      const layers = [3, 5, 5, 3, 2];
      const W = 900, H = 280;
      const xs = layers.map((_,i) => 80 + i * (W-160) / (layers.length-1));
      const positions = layers.map((n, i) => {
        const sp = Math.min(36, (H-80) / Math.max(1, n-1));
        const total = sp * (n-1);
        return Array.from({length:n},(_,j)=>({x:xs[i], y:H/2 - total/2 + j*sp}));
      });
      const [tick, setTick] = useState(0);
      useEffect(() => {
        const id = setInterval(() => setTick(t => (t+1) % 800), 30);
        return () => clearInterval(id);
      }, []);

      // Phase: 0..0.5 forward, 0.5..1 backward
      const T = (tick % 400) / 400;
      const isForward = T < 0.5;
      const phaseT = isForward ? T * 2 : (T - 0.5) * 2;  // 0..1 within phase

      const segCount = layers.length - 1;
      const segIdx = isForward
        ? Math.min(segCount-1, Math.floor(phaseT * segCount))
        : Math.max(0, segCount - 1 - Math.floor(phaseT * segCount));
      const segT = (phaseT * segCount) - Math.floor(phaseT * segCount);

      const edges = [];
      for (let i = 0; i < positions.length - 1; i++) {
        for (const a of positions[i]) {
          for (const b of positions[i+1]) {
            const lit = i === segIdx;
            const color = isForward ? '#1f6feb' : '#c84e1d';
            edges.push(
              <line key={`bp-${i}-${a.x}-${a.y}-${b.x}-${b.y}`}
                x1={a.x} y1={a.y} x2={b.x} y2={b.y}
                stroke={lit ? color : '#1a1a1a'}
                strokeWidth={lit ? 1.1 : 0.4}
                opacity={lit ? 0.55 : 0.1}/>
            );
          }
        }
      }
      const nodes = [];
      positions.forEach((layer, li) => layer.forEach((p, ni) => {
        nodes.push(<circle key={`bn-${li}-${ni}`} cx={p.x} cy={p.y} r={8} fill="#fbfaf6" stroke="#1a1a1a" strokeWidth={1.2}/>);
      }));

      // pulses
      const pulses = [];
      const A = positions[segIdx];
      const B = positions[segIdx+1];
      if (A && B) {
        for (let i = 0; i < A.length; i++) {
          for (let j = 0; j < B.length; j++) {
            const fromA = isForward;
            const x = fromA ? A[i].x + (B[j].x - A[i].x) * segT : B[j].x + (A[i].x - B[j].x) * segT;
            const y = fromA ? A[i].y + (B[j].y - A[i].y) * segT : B[j].y + (A[i].y - B[j].y) * segT;
            pulses.push(<circle key={`bpp-${i}-${j}`} cx={x} cy={y} r={1.8} fill={isForward ? '#1f6feb' : '#c84e1d'}/>);
          }
        }
      }

      // Loss bubble at right
      const lossX = xs[xs.length-1] + 60;
      const lossY = H/2;

      return (
        <div className="fig fig-wide">
          <div className="fig-title"><strong>Forward & backward pass</strong>
            <span style={{color: isForward ? '#1f6feb' : '#c84e1d', fontWeight:600}}>
              {isForward ? '→ FORWARD: activations' : '← BACKWARD: gradients'}
            </span>
          </div>
          <svg viewBox={`0 0 ${W+120} ${H}`} width="100%">
            {edges}
            {pulses}
            {nodes}
            {/* Loss node */}
            <circle cx={lossX} cy={lossY} r={20} fill={isForward ? '#fbfaf6' : '#f6e4d8'} stroke="#c84e1d" strokeWidth={1.4}/>
            <text x={lossX} y={lossY+4} textAnchor="middle" fontFamily="Fraunces" fontSize="14" fontWeight="600" fill="#c84e1d">L</text>
            <text x={lossX} y={lossY+38} textAnchor="middle" fontFamily="JetBrains Mono" fontSize="10" fill="#8a877f">LOSS</text>
            {/* connector from output to loss */}
            {positions[positions.length-1].map((p, i) =>
              <line key={'lc'+i} x1={p.x} y1={p.y} x2={lossX} y2={lossY} stroke="#c84e1d" strokeWidth={0.5} opacity={0.3}/>
            )}
          </svg>
        </div>
      );
    }
    const bpMount = document.getElementById('fig-backprop-mount');
    if (bpMount) ReactDOM.createRoot(bpMount).render(<BackpropFig/>);

    // ---------- Learning rate playground ----------
    function LRFig() {
      const [lr, setLr] = useState(0.05);
      const [seed, setSeed] = useState(0);
      // Simulate gradient descent on a 2D bowl: f(x,y) = a*x^2 + b*y^2
      // Path persists; recompute when lr or seed changes.
      const path = useMemo(() => {
        const rng = mulberry32(seed * 7919 + 13);
        let x = -1.6 + rng()*0.4, y = 1.4 - rng()*0.4;
        const a = 1.0, b = 4.0;
        const pts = [[x, y]];
        for (let i = 0; i < 60; i++) {
          const gx = 2*a*x, gy = 2*b*y;
          x = x - lr * gx;
          y = y - lr * gy;
          if (Math.abs(x) > 5 || Math.abs(y) > 5) { pts.push([x, y]); break; }
          pts.push([x, y]);
        }
        return pts;
      }, [lr, seed]);

      const losses = path.map(([x,y]) => x*x + 4*y*y);

      // Project to SVG: x in [-2,2], y in [-2,2]
      const W = 880, H = 320;
      // left: 2D landscape, right: loss curve
      const lw = 380, lh = 280;
      const lx0 = 30, ly0 = 20;
      const X = (x) => lx0 + ((x + 2) / 4) * lw;
      const Y = (y) => ly0 + (1 - (y + 2) / 4) * lh;

      const contourEls = [];
      for (let i = 1; i <= 6; i++) {
        const rx = i * 30;
        const ry = i * 15;
        contourEls.push(<ellipse key={'c'+i} cx={X(0)} cy={Y(0)} rx={rx} ry={ry} fill="none" stroke="#c84e1d" strokeWidth="0.8" opacity={0.07 + i*0.05}/>);
      }
      const pathD = path.map(([x,y],i)=>(i?'L':'M')+X(x).toFixed(1)+' '+Y(y).toFixed(1)).join(' ');
      const dots = path.map(([x,y],i) => <circle key={'d'+i} cx={X(x)} cy={Y(y)} r={i === 0 ? 4 : i === path.length-1 ? 5 : 2} fill={i===0?'#1f6feb':i===path.length-1?'#1a7a4c':'#1a1a1a'}/>);

      // loss curve
      const cw = 380, ch = 280;
      const cx0 = 470, cy0 = 20;
      const lossMax = Math.max(...losses, 8);
      const cX = (i) => cx0 + (i / Math.max(1, losses.length-1)) * cw;
      const cY = (v) => cy0 + (1 - Math.min(v, lossMax) / lossMax) * ch;
      const lossPath = losses.map((v,i)=>(i?'L':'M')+cX(i).toFixed(1)+' '+cY(v).toFixed(1)).join(' ');

      // Verdict
      let verdict = '';
      let verdictColor = '#1a7a4c';
      const final = losses[losses.length-1];
      const div = path.some(([x,y]) => Math.abs(x)>3 || Math.abs(y)>3);
      if (div) { verdict = 'diverges — learning rate too high'; verdictColor = '#c84e1d'; }
      else if (final < 0.05) { verdict = 'converged smoothly'; verdictColor = '#1a7a4c'; }
      else if (final < 0.5) { verdict = 'converging slowly'; verdictColor = '#b8860b'; }
      else { verdict = 'too slow — lr too low'; verdictColor = '#b8860b'; }

      return (
        <div>
          <div className="fig-title"><strong>Learning rate playground</strong><span>same problem, different lr</span></div>
          <svg viewBox={`0 0 ${W} ${H}`} width="100%">
            {/* Landscape */}
            <rect x={lx0} y={ly0} width={lw} height={lh} fill="none" stroke="#c9c2ad" strokeWidth="0.8"/>
            {contourEls}
            <text x={X(0)} y={Y(0)+4} textAnchor="middle" fontFamily="JetBrains Mono" fontSize="10" fill="#c84e1d">MIN</text>
            <path d={pathD} stroke="#1a1a1a" strokeWidth={1.2} fill="none"/>
            {dots}
            <text x={lx0} y={ly0-6} fontFamily="JetBrains Mono" fontSize="10" fill="#8a877f">LOSS LANDSCAPE</text>

            {/* Loss curve */}
            <rect x={cx0} y={cy0} width={cw} height={ch} fill="none" stroke="#c9c2ad" strokeWidth="0.8"/>
            <path d={lossPath} stroke="#c84e1d" strokeWidth={1.6} fill="none"/>
            {/* horizontal grid */}
            {[0.25, 0.5, 0.75].map(g => (
              <line key={'g'+g} x1={cx0} y1={cy0+g*ch} x2={cx0+cw} y2={cy0+g*ch} stroke="#c9c2ad" strokeWidth="0.4" strokeDasharray="2 3"/>
            ))}
            <text x={cx0} y={cy0-6} fontFamily="JetBrains Mono" fontSize="10" fill="#8a877f">LOSS OVER STEPS</text>
            <text x={cx0+cw} y={cy0+ch+18} fontFamily="JetBrains Mono" fontSize="10" fill="#8a877f" textAnchor="end">step {losses.length-1}</text>
          </svg>
          <div className="fig-controls">
            <span className="ctrl-label">learning rate</span>
            <input type="range" min="0.001" max="0.6" step="0.001" value={lr} onChange={e => setLr(+e.target.value)} />
            <span className="ctrl-value">{lr.toFixed(3)}</span>
            <button className="btn-ghost btn-sm" onClick={() => setSeed(s => s+1)}>↻ new start</button>
            <span style={{flex:1}}/>
            <span className="ctrl-label" style={{color:verdictColor, fontWeight:600}}>{verdict}</span>
          </div>
        </div>
      );
    }
    function mulberry32(a) {
      return function() {
        a |= 0; a = a + 0x6D2B79F5 | 0;
        let t = a;
        t = Math.imul(t ^ t >>> 15, t | 1);
        t ^= t + Math.imul(t ^ t >>> 7, t | 61);
        return ((t ^ t >>> 14) >>> 0) / 4294967296;
      };
    }
    const lrMount = document.getElementById('fig-lr-mount');
    if (lrMount) ReactDOM.createRoot(lrMount).render(<LRFig/>);

    // ---------- Vanishing gradient figure ----------
    function VanishFig() {
      const [activation, setActivation] = useState('sigmoid');
      const layers = 12;
      // simulated gradient magnitudes
      const data = useMemo(() => {
        const arr = [];
        let g = 1.0;
        for (let i = 0; i < layers; i++) {
          arr.push(g);
          if (activation === 'sigmoid') g *= 0.45 + Math.random()*0.1;
          else if (activation === 'tanh') g *= 0.55 + Math.random()*0.15;
          else g *= 0.92 + Math.random()*0.10; // ReLU loses very little
        }
        return arr.reverse();  // reverse so layer 1 (furthest from the loss) sits on the left
      }, [activation]);

      const W = 760, H = 240;
      const barW = (W - 80) / layers;
      const bars = data.map((g, i) => {
        const h = Math.max(2, Math.min(180, Math.log10(g+1e-12) * 30 + 180));
        const color = activation === 'relu' ? '#1a7a4c' : activation === 'tanh' ? '#b8860b' : '#c84e1d';
        return (
          <g key={'b'+i}>
            <rect x={40 + i*barW + 4} y={H - 40 - h} width={barW - 8} height={h} fill={color} opacity={0.85}/>
            <text x={40 + i*barW + barW/2} y={H - 26} textAnchor="middle" fontFamily="JetBrains Mono" fontSize="10" fill="#8a877f">L{i+1}</text>
            <text x={40 + i*barW + barW/2} y={H - 40 - h - 6} textAnchor="middle" fontFamily="JetBrains Mono" fontSize="9" fill="#8a877f">{g < 1e-4 ? g.toExponential(0) : g.toFixed(3)}</text>
          </g>
        );
      });

      return (
        <div>
          <div className="fig-title"><strong>Gradient magnitude per layer</strong><span>L1 = first layer (furthest from loss)</span></div>
          <svg viewBox={`0 0 ${W} ${H}`} width="100%">
            <line x1="40" y1={H-40} x2={W-20} y2={H-40} stroke="#c9c2ad" strokeWidth={0.8}/>
            {bars}
          </svg>
          <div className="fig-controls">
            <span className="ctrl-label">activation</span>
            {['sigmoid', 'tanh', 'relu'].map(a => (
              <button key={a} className={"btn-ghost btn-sm"} style={{
                background: activation===a ? '#1a1a1a' : 'transparent',
                color: activation===a ? '#fbfaf6' : '#1a1a1a',
                borderColor: '#1a1a1a',
              }} onClick={() => setActivation(a)}>{a}</button>
            ))}
          </div>
        </div>
      );
    }
    const vanMount = document.getElementById('fig-vanish-mount');
    if (vanMount) ReactDOM.createRoot(vanMount).render(<VanishFig/>);
  };
})();
