/* ============================================================ PART II — CNN ============================================================ */ (function() { const ANCHOR = document.getElementById('anchor-cnn'); const html = `

Part II · Section 10 · 4 min

Why a dense network is wrong for images.

A dense ANN treats every pixel as an independent input. That throws away the most important fact about an image: nearby pixels are related.

Imagine training the MNIST classifier we wrote at the end of Part I. To feed it a 28×28 digit, we flatten the image to a 784-vector. Now consider:

The pixel at (14, 14) ends up at index 406. Its right-hand neighbour at (14, 15) ends up at index 407, but the pixel directly below, at (15, 14), lands at index 434. Any adjacency that survives is an accident of the flattening order; the network doesn't know which inputs were neighbours.
Shift the digit one pixel to the right and every stroke lands on a different set of inputs. Nothing the model learned at the old positions carries over automatically.
For a single 224×224 RGB image (the size ImageNet models eat), a dense layer of 1,000 units would need about 150 million weights (224 × 224 × 3 × 1,000). For one layer.
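The arithmetic in the three points above can be checked in a few lines. This is a minimal sketch assuming row-major flattening and zero-based (row, col) coordinates; `flat_index` and `dense_weights` are hypothetical helper names for illustration, not part of the course code.

```python
def flat_index(row, col, width=28):
    """Index of pixel (row, col) after row-major flattening."""
    return row * width + col


def dense_weights(height, width, channels, units):
    """Weights (biases ignored) in a dense layer fed by a flattened image."""
    return height * width * channels * units


print(flat_index(14, 14))                # 406
print(flat_index(14, 15))                # 407, the horizontal neighbour
print(flat_index(15, 14))                # 434, the vertical neighbour is 28 slots away
print(dense_weights(224, 224, 3, 1000))  # 150528000
```

The vertical neighbour is the telling case: two pixels that touch in the image sit a full row apart in the vector.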

The flattening problem · spatial information, gone

A 2D image becomes a 1D row of numbers, indistinguishable in shape from a tabular dataset. Whatever neighbourhood structure existed in the pixels is destroyed at this step.

We want a layer that respects spatial locality and shares parameters across positions. That layer is the convolution.

Part II · Section 11 · 6 min · animated

The convolution: a small window, slid everywhere.

A convolution uses a tiny grid of weights, called a kernel and usually 3×3, that slides across the image, multiplying overlapping pixels and summing the result.

At each position, the kernel produces one number. As the kernel sweeps across the image, those numbers form a new, smaller grid called a feature map.

The 3×3 kernel on the left slides across the input. At each position, it computes a weighted sum and writes one cell into the feature map on the right. Same nine weights, used at every position.
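The sliding-window computation can be sketched in plain Python. This is an illustrative version, not an optimized one: it assumes a square single-channel image, no padding, and stride 1, and `conv2d_valid` is a hypothetical name. Like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over a square image (no padding, stride 1)."""
    n, k = len(image), len(kernel)
    m = n - k + 1  # output side length
    out = []
    for i in range(m):
        row = []
        for j in range(m):
            # Weighted sum of the k x k window whose top-left corner is (i, j).
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(k) for dj in range(k))
            row.append(s)
        out.append(row)
    return out


# Vertical edge detector (Sobel-style), the same nine weights at every position.
kernel = [[-1, 0, 1],
          [-2, 0, 2],
          [-1, 0, 1]]

# 7x7 image: dark left three columns, bright right four columns.
image = [[0 if c < 3 else 1 for c in range(7)] for _ in range(7)]

feature_map = conv2d_valid(image, kernel)
print(len(feature_map), len(feature_map[0]))  # 5 5
print(feature_map[0])                         # [0, 4, 4, 0, 0]
```

The feature map is non-zero only where the window straddles the dark-to-bright boundary, which is what "detecting a vertical edge" means in practice.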

Why this is brilliant, in three points

Weight sharing. The same nine weights detect the feature anywhere in the image. A "vertical edge detector" works in the top-left corner and in the bottom-right.
Translation equivariance. Move the cat 50 pixels to the right and the feature map for "cat ears" moves 50 pixels to the right with it. The signal is preserved.
Drastic parameter reduction. A 3×3 conv has 9 weights (+1 bias) per filter on a single-channel input. A dense layer connecting two 224×224 maps would have about 2.5 billion (50,176 × 50,176). Different universes.
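To put numbers on the parameter comparison, here is a small sketch. The helper names `conv_params` and `dense_weights_between` are hypothetical; the conv formula assumes one bias per filter, and the dense count ignores biases.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a kernel x kernel conv layer."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_weights_between(n_in, n_out):
    """Weights (no biases) in a dense layer connecting n_in inputs to n_out outputs."""
    return n_in * n_out


pixels = 224 * 224
print(conv_params(3, 1, 1))                   # 10
print(conv_params(3, 1, 32))                  # 320
print(dense_weights_between(pixels, pixels))  # 2517630976
```

A conv layer's parameter count does not depend on the image size at all, only on the kernel size, the input channels, and the number of filters.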

The four numbers that define a conv layer

kernel size · 3×3 most common
stride · how many pixels per slide
padding · zero-pad the edges?
# filters · how many distinct kernels in this layer

A typical first layer says: "give me 32 different 3×3 kernels, slide them with stride 1, pad the edges so output is the same size as input." Output: 32 stacked feature maps. Now the next layer can convolve over those.
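Three of those four numbers determine the output's height and width (the number of filters sets its depth). The sketch below uses the standard output-size formula and assumes the same number of zero-padded pixels on every side; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(n, kernel, stride=1, padding=0):
    """Output side length for an n-pixel side: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


print(conv_output_size(28, 3))                       # 26: no padding shrinks the map
print(conv_output_size(28, 3, padding=1))            # 28: "same" size for a 3x3 kernel
print(conv_output_size(28, 3, stride=2, padding=1))  # 14: stride 2 halves the map
```

For a 3×3 kernel at stride 1, one pixel of padding per side is what keeps the output the same size as the input.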

Part II · Section 12 · 4 min

Filters learn to detect features. We don't tell them what.

In a trained CNN, the early kernels reliably specialize in edges, blobs, and color contrasts — without ever being told.

What each layer sees · hierarchy of features

Layer 1 detects edges. Layer 2 combines edges into textures and corners. Layer 3 combines those into parts — a wheel, an eye. Layer 4 sees objects. Each layer composes the previous one.

This hierarchy is emergent — the architecture rewards it without the loss function ever mentioning "edge" or "wheel." Cross-entropy on a final classification is enough; the rest falls out of stochastic gradient descent over millions of images.

If you've ever heard that CNNs "see like brains" — this is the half-truth in that claim. The visual cortex really does seem to organize into similar feature hierarchies. But the algorithm that finds those weights is gradient descent, not biology.

Part II · Section 13 · 4 min · animated

Pooling: shrink the map, keep the signal.

After a few convs, the feature map is still huge. Pooling downsamples it — usually by 2× — so deeper layers can see a wider area through a smaller grid.

Max pooling over a 2×2 window: take the largest value in the window, throw the rest away. Output is half the size in each dimension, a quarter the area.
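As a concrete illustration, here is a minimal plain-Python sketch of 2×2 max pooling with stride 2. It assumes the grid has even height and width; `max_pool_2x2` is a hypothetical name.

```python
def max_pool_2x2(grid):
    """Keep the largest value in each non-overlapping 2x2 block."""
    rows, cols = len(grid) // 2, len(grid[0]) // 2
    return [[max(grid[2 * i][2 * j], grid[2 * i][2 * j + 1],
                 grid[2 * i + 1][2 * j], grid[2 * i + 1][2 * j + 1])
             for j in range(cols)]
            for i in range(rows)]


grid = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]

pooled = max_pool_2x2(grid)
print(pooled)  # [[4, 2], [2, 8]]
```

Sixteen cells go in and four come out: half the size in each dimension, a quarter of the area.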

Why max — and not average?

Max pooling says "tell me whether this feature appeared anywhere in this neighbourhood, not where exactly." That fits how detection works: the existence of an edge in this region matters more than its precise pixel offset. Average pooling exists too. It is smoother, but it dilutes a strong activation among its weaker neighbours. Modern nets often skip pooling entirely in favor of strided convolutions (a conv with stride 2 downsamples and learns how to do so).

Receptive field — the why behind everything

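Each cell in a deep feature map is influenced by a region of input pixels: its receptive field. Each 3×3 conv widens this region by two steps of the current stride; each 2×2 pool adds one step and doubles the stride, so every later layer widens the region twice as fast. With 3×3 convs and 2×2 pools ordered conv, conv, pool, conv, conv, pool, conv (5 convs, 2 pools), a single cell sees a 24×24 region of the original image. That's how a network "knows" it's looking at a face: deep cells see big areas.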
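The receptive field of a stack can be computed layer by layer. The sketch below uses the standard recurrence (the field grows by (kernel − 1) times the cumulative stride, and the cumulative stride is multiplied by each layer's stride). The layer orderings are assumptions chosen for illustration, and `receptive_field` is a hypothetical helper; the result depends heavily on kernel sizes and on where the pools sit.

```python
def receptive_field(layers):
    """layers: (kernel, stride) pairs from input to output. Returns the side length."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent cells
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


CONV = (3, 1)  # 3x3 conv, stride 1
POOL = (2, 2)  # 2x2 pool, stride 2

# Assumed ordering with 5 convs and 2 pools.
print(receptive_field([CONV, CONV, POOL, CONV, CONV, POOL, CONV]))  # 24

# A conv, pool, conv, pool stack.
print(receptive_field([CONV, POOL, CONV, POOL]))  # 10
```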

Part II · Section 14 · 4 min

The full stack: conv → pool → conv → pool → flatten → dense.

A typical image classifier alternates conv and pool blocks until the spatial map is small (say 7×7), then flattens and feeds a dense head for the final classification.

A toy CNN for MNIST. Notice: the spatial dimensions shrink with each pool while the channel count grows. The network trades resolution for abstraction.

The shape calculus, demystified

Input: 28 × 28 × 1 (grayscale).
Conv 32 filters, 3×3, padding=same: 28 × 28 × 32.
MaxPool 2×2: 14 × 14 × 32.
Conv 64 filters, 3×3, padding=same: 14 × 14 × 64.
MaxPool 2×2: 7 × 7 × 64.
Flatten: 3136.
Dense 128 → Dense 10 → softmax.
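The same walk-through can be done in code, tracking the parameter count alongside the shape. This is a sketch assuming 3×3 kernels, "same" padding, 2×2 pooling, and one bias per filter or unit; the helper names are hypothetical.

```python
def conv_same(shape, filters, kernel=3):
    """Conv with 'same' padding: height and width unchanged, depth becomes `filters`."""
    h, w, c = shape
    return (h, w, filters), (kernel * kernel * c + 1) * filters


def max_pool(shape, size=2):
    """Pooling halves height and width and has no parameters."""
    h, w, c = shape
    return (h // size, w // size, c), 0


def dense(n_in, units):
    """Dense layer: one weight per input per unit, plus one bias per unit."""
    return units, (n_in + 1) * units


shape, total = (28, 28, 1), 0
for layer, arg in [(conv_same, 32), (max_pool, 2), (conv_same, 64), (max_pool, 2)]:
    shape, params = layer(shape, arg)
    total += params
    print(layer.__name__, shape, params)

flat = shape[0] * shape[1] * shape[2]  # 7 * 7 * 64 = 3136
for units in (128, 10):
    flat_next, params = dense(flat, units)
    total += params
    print("dense", flat_next, params)
    flat = flat_next

print(total)  # 421642
```

Most of the parameters (401,536 of them) sit in the Dense 128 layer; the two conv layers together hold 18,816.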

This single small network (~420,000 params, about 400,000 of them in the Dense 128 layer) hits ~99% on MNIST. The two conv layers that do the feature extraction hold fewer than 19,000 weights between them, and the network is significantly more accurate than the dense ANN. That gap is the convolutional inductive bias paying off.

Part II · Section 15 · 5 min · interactive

Draw a digit. Watch a CNN classify it.

Use the canvas. The simulated CNN extracts feature maps in real time and shows its prediction. Slow scribbles are easier than fast ones.

The "model" here is a small hand-built stand-in running entirely in your browser: four fixed 3×3 filters produce the feature maps, and a simple heuristic produces the prediction. Feature maps update on every stroke.

What you're seeing on the right are real convolution outputs, computed in JavaScript on your CPU: four fixed 3×3 filters (vertical edge, horizontal edge, diagonal, blob), each followed by ReLU. They are hand-set rather than trained, but they resemble what the first layer of a trained CNN tends to learn. Notice how each map looks like an edge-filtered version of your drawing.

Part II · Section 16 · 3 min

A CNN in Keras.

The whole MNIST classifier we just discussed, in roughly 20 lines of code.

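tensorflow / keras

from tensorflow.keras import layers, models
from tensorflow.keras.datasets import mnist

# Load MNIST, scale pixels to [0, 1], and add the channel axis: (N, 28, 28, 1).
(x_train, y_train), _ = mnist.load_data()
x_train = x_train[..., None].astype('float32') / 255.0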

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),

    layers.Conv2D(32, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),

    layers.Conv2D(64, kernel_size=3, padding='same', activation='relu'),
    layers.MaxPooling2D(pool_size=2),

    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),                    # regularization
    layers.Dense(10, activation='softmax')
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=128)

Things to notice

Channels-last shape. (28, 28, 1) — height, width, channels. RGB would be (H, W, 3).
Dropout randomly zeros 30% of activations during training and is switched off at inference. It helps keep the dense head from memorizing the training set.
No Flatten needed for classic conv stacks until you hit the dense head. The conv and pool layers happily process the 3D tensor.
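To make the Dropout point concrete, here is a minimal plain-Python sketch of "inverted" dropout at training time: each activation is zeroed with probability `rate`, and the survivors are scaled by 1 / (1 − rate) so the expected sum stays the same, which is also how the Keras layer behaves. The `dropout` function and its `rng` parameter are hypothetical names for illustration.

```python
import random


def dropout(activations, rate=0.3, rng=None):
    """Training-time inverted dropout over a list of activations."""
    rng = rng or random.Random()
    keep = 1.0 - rate
    # Keep each value with probability `keep`, scaling it up to compensate.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


acts = [1.0] * 10
dropped = dropout(acts, rate=0.3, rng=random.Random(0))
print(dropped)  # each entry is either 0.0 or 1 / 0.7
```

At inference nothing is dropped and nothing is scaled, so the layer becomes the identity.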

When to reach for a pretrained model instead

If you're classifying anything more complex than MNIST (real photos, medical images), avoid training from scratch. Start from a pretrained EfficientNet or ResNet50 in tf.keras.applications, freeze the conv stack, and train only the dense head on your data. You will usually get better accuracy with far less labelled data than training from scratch would need.

`; if (ANCHOR) ANCHOR.insertAdjacentHTML('afterend', html); // ============================================================ // Static figures // ============================================================ // Flatten figure const fl = document.getElementById('fig-flatten'); if (fl) { const cells = []; for (let r = 0; r < 6; r++) for (let c = 0; c < 6; c++) { const v = Math.sin(r*0.7) + Math.cos(c*0.6); const a = Math.max(0.1, Math.min(1, (v+1.5)/3)); cells.push(``); } const flatCells = []; for (let i = 0; i < 36; i++) { const r = Math.floor(i/6), c = i % 6; const v = Math.sin(r*0.7) + Math.cos(c*0.6); const a = Math.max(0.1, Math.min(1, (v+1.5)/3)); flatCells.push(``); } fl.innerHTML = ` `; } // Hierarchy figure const fh = document.getElementById('fig-hierarchy'); if (fh) { const colW = 170; const lvls = [ { x: 30, title: 'LAYER 1 — edges', desc: 'oriented bars', pat: 'edges' }, { x: 220, title: 'LAYER 2 — textures', desc: 'corners & curves', pat: 'tex' }, { x: 410, title: 'LAYER 3 — parts', desc: 'wheels, eyes, ears', pat: 'parts' }, { x: 600, title: 'LAYER 4 — objects', desc: 'faces, cars', pat: 'objs' }, ]; function patch(x0, y0, kind, key) { let inner = ''; const sz = 36; if (kind === 'edges') { // 9 small filters with edges for (let r = 0; r < 3; r++) for (let c = 0; c < 3; c++) { const a = (r*3+c) * 22; const cx = x0 + 8 + c*(sz+4) + sz/2; const cy = y0 + 8 + r*(sz+4) + sz/2; inner += ``; inner += ``; } } else if (kind === 'tex') { for (let r = 0; r < 3; r++) for (let c = 0; c < 3; c++) { const cx = x0 + 8 + c*(sz+4) + sz/2; const cy = y0 + 8 + r*(sz+4) + sz/2; inner += ``; // little arc / corner const seed = r*3+c; if (seed % 3 === 0) inner += ``; else if (seed % 3 === 1) inner += ``; else inner += ``; } } else if (kind === 'parts') { // four bigger thumbnails const items = [ 'wheel','eye','ear','headlight' ]; for (let r = 0; r < 2; r++) for (let c = 0; c < 2; c++) { const cx = x0 + 18 + c*(sz*1.6+10) + sz*0.8; const cy = y0 + 14 + r*(sz*1.6+10) + sz*0.8; 
inner += ``; const k = r*2+c; if (k === 0) { inner += ``; inner += ``; } else if (k === 1) { inner += ``; inner += ``; } else if (k === 2) { inner += ``; } else { inner += ``; inner += ``; } } } else { // object thumbnails for (let r = 0; r < 2; r++) for (let c = 0; c < 2; c++) { const cx = x0 + 18 + c*(sz*1.6+10) + sz*0.8; const cy = y0 + 14 + r*(sz*1.6+10) + sz*0.8; inner += ``; const k = r*2+c; if (k === 0) { // face inner += ``; inner += ``; inner += ``; inner += ``; } else if (k === 1) { // car inner += ``; inner += ``; inner += ``; inner += ``; } else if (k === 2) { // cat-ish inner += ``; inner += ``; inner += ``; } else { // dog-ish inner += ``; inner += ``; inner += ``; } } } return inner; } let svg = ''; lvls.forEach(l => { svg += ``; svg += patch(l.x, 14, l.pat, l.title); svg += `${l.title}`; svg += `${l.desc}`; }); // arrows between for (let i = 0; i < 3; i++) { const x1 = lvls[i].x + 162; const x2 = lvls[i+1].x - 2; svg += ``; } fh.innerHTML = ` `; } // CNN stack figure const fcs = document.getElementById('fig-cnn-stack'); if (fcs) { const stages = [ { x: 30, w: 84, h: 84, depth: 1, label: 'Input', sub: '28×28×1' }, { x: 150, w: 84, h: 84, depth: 12, label: 'Conv32', sub: '28×28×32' }, { x: 280, w: 60, h: 60, depth: 12, label: 'Pool', sub: '14×14×32' }, { x: 380, w: 60, h: 60, depth: 22, label: 'Conv64', sub: '14×14×64' }, { x: 490, w: 36, h: 36, depth: 22, label: 'Pool', sub: '7×7×64' }, { x: 580, w: 14, h: 110, depth: 1, label: 'Flatten', sub: '3136' }, { x: 660, w: 14, h: 80, depth: 1, label: 'Dense128', sub: '128' }, { x: 740, w: 14, h: 30, depth: 1, label: 'Output', sub: '10' }, ]; let svg = ''; stages.forEach((s, i) => { const y = 60; // depth illusion: draw stacked rects offset by 2px for (let d = s.depth-1; d >= 0; d--) { const off = d * 1.4; const opacity = d === 0 ? 1 : 0.18; const fill = i === 0 ? '#dde7f7' : i === stages.length-1 ? '#d6e8de' : (s.label.startsWith('Pool') ? '#f1e4c2' : '#f6e4d8'); const stroke = i === 0 ? 
'#1f6feb' : i === stages.length-1 ? '#1a7a4c' : (s.label.startsWith('Pool') ? '#b8860b' : '#c84e1d'); svg += ``; } svg += `${s.label}`; svg += `${s.sub}`; if (i < stages.length-1) { svg += ``; } }); fcs.innerHTML = ` `; } // ============================================================ // React-mounted interactive figures // ============================================================ window.__mountCnnFigures = function() { if (window.__cnnMounted) return; if (!window.React || !window.ReactDOM) return; window.__cnnMounted = true; const { useState, useEffect, useRef, useMemo } = React; // ---------- Convolution animation ---------- function ConvFig() { const N = 7, M = 5; // input N, output M (no padding, kernel 3, stride 1) const kernel = [ [-1, 0, 1], [-2, 0, 2], [-1, 0, 1], ]; // simple input image: gradient-ish + a vertical edge const input = useMemo(() => { const arr = []; for (let r = 0; r < N; r++) { const row = []; for (let c = 0; c < N; c++) { row.push(c < N/2 ? 0.15 : 0.85); } arr.push(row); } return arr; }, []); const [pos, setPos] = useState(0); // 0..M*M-1 const [playing, setPlaying] = useState(true); useEffect(() => { if (!playing) return; const id = setInterval(() => setPos(p => (p+1) % (M*M)), 700); return () => clearInterval(id); }, [playing]); const r = Math.floor(pos / M), c = pos % M; // compute convolution full (memoized) const output = useMemo(() => { const out = []; for (let i = 0; i < M; i++) { const row = []; for (let j = 0; j < M; j++) { let s = 0; for (let dr = 0; dr < 3; dr++) for (let dc = 0; dc < 3; dc++) { s += input[i+dr][j+dc] * kernel[dr][dc]; } row.push(s); } out.push(row); } return out; }, [input]); const cell = 36; const inX = 30, inY = 30; const kX = inX + N*cell + 50, kY = 60; const outX = kX + 3*cell + 70, outY = 50; // input grid const inputCells = []; for (let i = 0; i < N; i++) for (let j = 0; j < N; j++) { const v = input[i][j]; const inWindow = i >= r && i < r+3 && j >= c && j < c+3; inputCells.push( ); } // window 
highlight const winRect = ( ); // kernel display const kernelCells = []; for (let i = 0; i < 3; i++) for (let j = 0; j < 3; j++) { const v = kernel[i][j]; const fill = v > 0 ? `rgba(31,111,235,${0.25 + 0.25*Math.abs(v)})` : v < 0 ? `rgba(200,78,29,${0.25 + 0.25*Math.abs(v)})` : '#fbfaf6'; kernelCells.push( {v} ); } // output grid const outputCells = []; const visited = pos; for (let i = 0; i < M; i++) for (let j = 0; j < M; j++) { const idx = i*M + j; const filled = idx <= visited; const v = output[i][j]; // map v from roughly [-3,3] to color const t = Math.max(-1, Math.min(1, v / 3)); const fill = filled ? (t > 0 ? `rgba(31,111,235,${Math.abs(t)*0.7+0.15})` : `rgba(200,78,29,${Math.abs(t)*0.7+0.15})`) : '#fbfaf6'; const isCurrent = i === r && j === c; outputCells.push( {filled && {v.toFixed(1)}} ); } // arrows const arrows = ( ); return (

Convolution, step by step · kernel = vertical edge detector

position { setPlaying(false); setPos(+e.target.value); }} /> {pos+1}/{M*M}

); } const cm = document.getElementById('fig-conv-mount'); if (cm) ReactDOM.createRoot(cm).render(); // ---------- Pooling animation ---------- function PoolFig() { const N = 6; const grid = useMemo(() => { const arr = []; for (let r = 0; r < N; r++) { const row = []; for (let c = 0; c < N; c++) row.push(Math.round(Math.random()*9)); arr.push(row); } return arr; }, []); const [pos, setPos] = useState(0); const [playing, setPlaying] = useState(true); useEffect(() => { if (!playing) return; const id = setInterval(() => setPos(p => (p+1) % 9), 800); return () => clearInterval(id); }, [playing]); const M = N/2; const r = Math.floor(pos / M), c = pos % M; // output const out = []; for (let i = 0; i < M; i++) { const row = []; for (let j = 0; j < M; j++) { row.push(Math.max(grid[i*2][j*2], grid[i*2][j*2+1], grid[i*2+1][j*2], grid[i*2+1][j*2+1])); } out.push(row); } const cell = 40; const ix = 30, iy = 30, ox = ix + N*cell + 80, oy = iy + cell; const inputCells = []; for (let i = 0; i < N; i++) for (let j = 0; j < N; j++) { const inWin = (Math.floor(i/2) === r) && (Math.floor(j/2) === c); inputCells.push( {grid[i][j]} ); } const outCells = []; for (let i = 0; i < M; i++) for (let j = 0; j < M; j++) { const idx = i*M + j; const filled = idx <= pos; const isCurrent = i === r && j === c; outCells.push( {filled && {out[i][j]}} ); } // arrow const arr = ( ); return (

Max pooling, 2×2 stride 2 · 4 cells in → 1 cell out

); } const pm = document.getElementById('fig-pool-mount'); if (pm) ReactDOM.createRoot(pm).render(); // ---------- Draw a digit ---------- function DrawDigit() { const canvasRef = useRef(null); const [tick, setTick] = useState(0); const [pixels, setPixels] = useState(() => new Float32Array(28*28)); const drawing = useRef(false); const last = useRef({ x: 0, y: 0 }); function clear() { const c = canvasRef.current; if (!c) return; const ctx = c.getContext('2d'); ctx.fillStyle = '#fbfaf6'; ctx.fillRect(0, 0, c.width, c.height); setPixels(new Float32Array(28*28)); setTick(t => t+1); } useEffect(() => { clear(); }, []); function getPos(e) { const c = canvasRef.current; const rect = c.getBoundingClientRect(); const t = e.touches && e.touches[0]; const x = (t ? t.clientX : e.clientX) - rect.left; const y = (t ? t.clientY : e.clientY) - rect.top; return { x: x * (c.width / rect.width), y: y * (c.height / rect.height) }; } function down(e) { e.preventDefault(); drawing.current = true; last.current = getPos(e); draw(e); } function up() { drawing.current = false; updatePixels(); } function draw(e) { if (!drawing.current) return; e.preventDefault(); const c = canvasRef.current; const ctx = c.getContext('2d'); const p = getPos(e); ctx.strokeStyle = '#1a1a1a'; ctx.lineWidth = 22; ctx.lineCap = 'round'; ctx.beginPath(); ctx.moveTo(last.current.x, last.current.y); ctx.lineTo(p.x, p.y); ctx.stroke(); last.current = p; // throttle pixel update updatePixels(); } function updatePixels() { const c = canvasRef.current; if (!c) return; const ctx = c.getContext('2d'); // downsample 280x280 -> 28x28 grayscale const img = ctx.getImageData(0,0,c.width,c.height).data; const W = c.width, H = c.height; const sx = W/28, sy = H/28; const pix = new Float32Array(28*28); for (let i = 0; i < 28; i++) for (let j = 0; j < 28; j++) { let acc = 0, cnt = 0; const x0 = Math.floor(j*sx), x1 = Math.floor((j+1)*sx); const y0 = Math.floor(i*sy), y1 = Math.floor((i+1)*sy); for (let y = y0; y < y1; y++) for (let 
x = x0; x < x1; x++) { const idx = (y*W + x)*4; const v = (img[idx] + img[idx+1] + img[idx+2]) / 3; acc += (255 - v) / 255; cnt++; } pix[i*28 + j] = acc / Math.max(1, cnt); } setPixels(pix); setTick(t => t+1); } // Compute simulated feature maps + prediction const { fmaps, prediction } = useMemo(() => { const k1 = [[1,0,-1],[2,0,-2],[1,0,-1]]; // vertical edge const k2 = [[1,2,1],[0,0,0],[-1,-2,-1]]; // horizontal edge const k3 = [[2,1,0],[1,0,-1],[0,-1,-2]]; // diagonal const k4 = [[-1,-1,-1],[-1,8,-1],[-1,-1,-1]]; // blob const filters = [k1, k2, k3, k4]; function conv(input, kern, N) { const M = N - 2; const out = new Float32Array(M*M); for (let i = 0; i < M; i++) for (let j = 0; j < M; j++) { let s = 0; for (let dr = 0; dr < 3; dr++) for (let dc = 0; dc < 3; dc++) { s += input[(i+dr)*N + (j+dc)] * kern[dr][dc]; } out[i*M + j] = Math.max(0, s); // ReLU } return out; } const fmaps = filters.map(k => conv(pixels, k, 28)); // Naive "classification" — heuristic on pixel mass distribution // (this is for show; runs in browser without weights) let probs = new Array(10).fill(0.0); // Centroid + density features let total = 0, mx = 0, my = 0; for (let i = 0; i < 28; i++) for (let j = 0; j < 28; j++) { const v = pixels[i*28+j]; total += v; mx += v*j; my += v*i; } if (total > 5) { mx /= total; my /= total; // Check loop closure (rough): center darkness vs ring darkness let centerDark = 0, ringDark = 0; for (let i = 0; i < 28; i++) for (let j = 0; j < 28; j++) { const v = pixels[i*28+j]; const dist = Math.hypot(i-14, j-14); if (dist < 5) centerDark += v; else if (dist < 11) ringDark += v; } // Vertical-ness vs horizontal-ness let vMass = 0, hMass = 0; for (let i = 0; i < 28; i++) for (let j = 0; j < 28; j++) { vMass += pixels[i*28+j] * Math.abs(j-14); hMass += pixels[i*28+j] * Math.abs(i-14); } // Heuristics for each digit probs[0] = ringDark > centerDark*1.5 ? 0.7 : 0.05; probs[1] = (hMass > vMass*1.6 && total < 30) ? 0.8 : 0.05; probs[2] = total > 20 && total < 60 ? 
0.4 : 0.1; probs[3] = total > 25 ? 0.3 : 0.1; probs[4] = total > 20 ? 0.25 : 0.05; probs[5] = total > 30 ? 0.3 : 0.1; probs[6] = ringDark > 8 && centerDark > 2 ? 0.5 : 0.1; probs[7] = total < 30 ? 0.3 : 0.05; probs[8] = ringDark > 10 && centerDark > 4 ? 0.55 : 0.05; probs[9] = ringDark > 5 && my < 14 ? 0.4 : 0.1; // normalize const ssum = probs.reduce((a,b) => a+b, 0); probs = probs.map(p => p / ssum); } else { probs = new Array(10).fill(0.1); } return { fmaps, prediction: probs }; }, [pixels]); // Render const W = 880, H = 360; const fmapSz = 26 * 4; // 26x26 feature maps drawn at scale // Render each feature map as colored rects function renderFmap(fm, x0, y0, label) { const M = 26; const cs = 4; const cells = []; let max = 0.001; for (let i = 0; i < fm.length; i++) max = Math.max(max, fm[i]); for (let i = 0; i < M; i++) for (let j = 0; j < M; j++) { const v = fm[i*M + j] / max; cells.push( ); } return ( {cells} {label} ); } // Top-3 predictions const ranked = prediction.map((p,i) => ({p, i})).sort((a,b) => b.p - a.p); const top = ranked[0]; return (

Draw a digit · live CNN · JS-only, runs locally

FEATURE MAPS · 4 HAND-SET FILTERS

PREDICTION · CLASS PROBABILITIES

{prediction.map((p, i) => (

{i}

{(p*100).toFixed(1)}%

))}

The classifier here is a tiny heuristic — a real Keras CNN trained on MNIST would be ~99% accurate. The point is to see the feature maps light up as you draw.

); } const dm = document.getElementById('fig-draw-mount'); if (dm) ReactDOM.createRoot(dm).render(); }; })();