/* ============================================================ PART IV — LSTM ============================================================ */ (function() { const ANCHOR = document.getElementById('anchor-lstm'); const html = `

Part IV · Section 22 · 5 min

LSTM: a conveyor belt of memory, with valves.

An LSTM keeps an extra piece of state called the cell, that flows along untouched unless the cell explicitly chooses to write to it or read from it.

The vanilla RNN had one piece of state, h, that got rewritten at every step. That rewrite is exactly where information from earlier steps gets lost. The LSTM's central idea is to separate the highway from the workshop:

The cell state C is a long-running memory that flows from step to step almost untouched. Picture a conveyor belt.
The hidden state h is the working output, the same role h played in the vanilla RNN.
Three small gates decide what to forget, what to write, and what to read.

The conveyor beltcell state C flows mostly straight through

The straight horizontal arrow is the cell state. Most of the time it's mostly unchanged; the gates only intervene where they need to.

The genius is in that "almost." If the gates choose, the cell state can carry information across hundreds of steps with no decay. The gradient flowing backward along the belt doesn't get multiplied by anything — so it doesn't vanish.

Part IV · Section 23 · 7 min

Three gates: forget, input, output.

Each gate is a sigmoid layer producing a vector of values between 0 and 1 — multiplied element-wise into the cell state. 0 = closed, 1 = open.

The three gatessigmoid → mask → modify cell

Each gate looks at the current input and previous hidden state, and decides — independently and per-dimension — how much information to let through.

Forget gate · what to drop

ft = σ(Wf·[ht−1, xt] + bf)multiplies into C — values close to 0 erase memory

"Should I remember the subject of the sentence after seeing a period? Probably not — close the gate."

Input gate · what to write

it = σ(Wi·[ht−1, xt] + bi) C̃t = tanh(WC·[ht−1, xt] + bC)i decides how much; C̃ is the candidate to write

"Just saw a new subject. Open the input gate, write it to memory."

Output gate · what to read

ot = σ(Wo·[ht−1, xt] + bo) ht = ot ⊙ tanh(Ct)controls what part of memory becomes the hidden output

"Predicting a verb? I need the subject's number — open the output gate on those dims."

All four equations together describe the entire LSTM step. They look intimidating; they're really just four small dense layers, three with sigmoid, one with tanh, glued together with element-wise multiplications.

Part IV · Section 24 · 6 min · animated

A single LSTM step, animated.

Watch a token enter the cell. Watch the gates compute. Watch the cell state update. Watch the new hidden state emerge.

Phase 1: forget. Phase 2: input + candidate. Phase 3: cell update. Phase 4: output. The cell state belt is yellow; the hidden state line is amber; gates pulse on as they activate.

Why this fixes the vanishing gradient

Trace the gradient backward along the cell state. At each step it's multiplied by the forget gate's activation — which the network can learn to keep close to 1 when long-range memory matters. Compare this to the vanilla RNN, where the gradient is always multiplied by W_hh, a fixed matrix.

Result: LSTMs can learn dependencies across hundreds, sometimes thousands of time steps. RNNs typically max out around 10–20.

Part IV · Section 25 · 6 min · interactive

Gate playground: set the gates, watch memory survive (or die).

Drag the three gate sliders. The chart shows what happens to a piece of information that enters memory at step 0, over 30 time steps.

Try this: set forget = 1, input = 1, output = 1 at step 0, then forget = 1 thereafter. The signal is preserved indefinitely. Now drop forget to 0.6 — watch decay return.

Three preset scenarios to try

Hodling memory. forget = 1.0 throughout, input pulse at step 0, output = 1.0 at step 29. The information arrives untouched 30 steps later.
Slow decay (vanilla RNN equivalent). forget = 0.7. The signal drops to ~0.04 by step 30 — already gone.
Selective overwrite. forget = 0.3 at step 15, input = 1.0 at step 15. The original memory is wiped and replaced with the new input.

Part IV · Section 26 · 4 min

An LSTM in Keras.

Same Keras one-liner discipline. Same training loop. The architecture takes care of itself.

tensorflow / kerasfrom tensorflow.keras import layers, models

# Stock-price next-day prediction · time series
WINDOW, FEATURES = 60, 5          # 60 days × (open, high, low, close, vol)

model = models.Sequential([
    layers.Input(shape=(WINDOW, FEATURES)),
    layers.LSTM(64, return_sequences=True),    # stack a second LSTM
    layers.Dropout(0.2),
    layers.LSTM(32),                          # last-step output
    layers.Dense(1)                              # next-day close
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=20, batch_size=32,
          validation_split=0.1)

Things to notice

Stacked LSTMs. First layer with return_sequences=True emits a state per step; that becomes the input to the second LSTM. Second one keeps only the final state for the dense head.
Same compile/fit pattern. The framework handles BPTT and gradient clipping defaults.
Input shape (window, features). Time first, features second.

When to use LSTM vs Transformer in 2025

Transformers won natural language. For most text tasks today — translation, summarization, sentiment — start with a pretrained transformer (BERT, T5, or an LLM API). Reach for LSTMs when: (1) sequences are very long but mostly local in dependency, (2) you have small data and limited compute, or (3) the input is genuinely streaming (audio, IoT, financial ticks) and you want a stateful model. They're far from obsolete; just no longer the default for text.

Wrap · Section 27 · 4 min

What you know now, and where to go.

Two hours ago we started with a single neuron. You now understand the four architectures that powered most of deep learning's first decade — and the conceptual seeds for what came after.

The arc you walkedeach rung adds an inductive bias

Where to go next

depth

Train the four models

Build the four Keras snippets in a Colab. There's a chasm between reading and running. Cross it this week.

breadth

Transformers

Self-attention generalizes the recurrence idea: every step looks at every other step. The architecture behind GPT and friends. Read "The Illustrated Transformer."

vision

ResNet & Vision Transformers

Residual connections solved CNN depth. ViTs apply attention to image patches. The frontier of computer vision is here.

tooling

PyTorch + JAX

Once Keras feels fluent, go one level lower. PyTorch for research; JAX for performance.

The single most important habit

When you read a new architecture paper, ask one question: what inductive bias is being built in? CNNs assume locality. RNNs assume temporal order. LSTMs assume long-range memory matters. Transformers assume any pair of positions might matter. Once you can name the bias, the model stops being a black box and becomes an opinion about your data.

Two hours well spent. Now go train something.

END · 27 of 27 sections Deep Learning · 2-hour session

`; if (ANCHOR) ANCHOR.insertAdjacentHTML('afterend', html); // ============================================================ // Static figures // ============================================================ const belt = document.getElementById('fig-belt'); if (belt) { belt.innerHTML = ` `; } const tg = document.getElementById('fig-three-gates'); if (tg) { tg.innerHTML = ` `; } const arc = document.getElementById('fig-arc'); if (arc) { arc.innerHTML = ` `; } // ============================================================ // React figures // ============================================================ window.__mountLstmFigures = function() { if (window.__lstmMounted) return; if (!window.React || !window.ReactDOM) return; window.__lstmMounted = true; const { useState, useEffect, useMemo } = React; // ---------- LSTM walk-through ---------- function LstmWalkFig() { const [phase, setPhase] = useState(0); // 0..3 const [playing, setPlaying] = useState(true); useEffect(() => { if (!playing) return; const id = setInterval(() => setPhase(p => (p+1) % 4), 1500); return () => clearInterval(id); }, [playing]); const W = 920, H = 320; // colors based on phase const fActive = phase >= 0; const iActive = phase >= 1; const cellUpd = phase >= 2; const oActive = phase >= 3; const phaseLabels = [ '1. forget — what to drop from cell state', '2. input — what new info to write', '3. update — combine forget + input into new cell state', '4. output — read out the new hidden state', ]; return (

LSTM step, phase by phase {phaseLabels[phase]}

phase { setPlaying(false); setPhase(+e.target.value); }} /> {phase+1}/4

); } const lwm = document.getElementById('fig-lstm-walk-mount'); if (lwm) ReactDOM.createRoot(lwm).render(); // ---------- Gate playground ---------- function LstmPlayFig() { const [forget, setForget] = useState(1.0); const [input, setInput] = useState(0.0); const [output, setOutput] = useState(1.0); const T = 30; // simulate: c_0 = 1.0 (a memory we want to keep). Each step c = forget*c + input*candidate(=1) // h = output * tanh(c). const data = useMemo(() => { const arr = []; let c = 0; for (let t = 0; t < T; t++) { // initial pulse: at step 0, write a 1.0 candidate via input gate const cand = (t === 0) ? 1.0 : 0.0; c = forget * c + input * cand; // override: at step 0, force c = 1 if user has input gate near 0 — this lets us study just forget. if (t === 0 && input < 0.1) c = 1.0; const h = output * Math.tanh(c); arr.push({ c, h }); } return arr; }, [forget, input, output]); const W = 880, H = 280; const px0 = 60, py0 = 30, pw = W - 120, ph = H - 80; const X = (i) => px0 + (i / (T-1)) * pw; const Yc = (v) => py0 + (1 - Math.max(-0.1, Math.min(1.1, v))) * ph * 0.5; const Yh = (v) => py0 + ph*0.5 + (1 - Math.max(-0.1, Math.min(1.1, v))) * ph * 0.5; const cPath = data.map((d,i)=>(i?'L':'M')+X(i).toFixed(1)+' '+Yc(d.c).toFixed(1)).join(' '); const hPath = data.map((d,i)=>(i?'L':'M')+X(i).toFixed(1)+' '+Yh(d.h).toFixed(1)).join(' '); const cDots = data.map((d,i) => ); const hDots = data.map((d,i) => ); const finalC = data[T-1].c; const survival = finalC > 0.5 ? 'memory survives' : finalC > 0.1 ? 'memory fading' : 'memory lost'; const survivalColor = finalC > 0.5 ? '#1a7a4c' : finalC > 0.1 ? '#b8860b' : '#c84e1d'; return (

Gate playgroundsame memory pulse, different gates

forget gate setForget(+e.target.value)}/> {forget.toFixed(2)}

input gate setInput(+e.target.value)}/> {input.toFixed(2)}

output gate setOutput(+e.target.value)}/> {output.toFixed(2)}

after 30 steps · C = {finalC.toFixed(3)} · {survival}

); } const lpm = document.getElementById('fig-lstm-play-mount'); if (lpm) ReactDOM.createRoot(lpm).render(); }; })();