/* ============================================================ PART III — RNN ============================================================ */ (function() { const ANCHOR = document.getElementById('anchor-rnn'); const html = `

Part III · Section 17 · 4 min

Sequence has a property neither images nor tables have: order.

"The cat sat on the mat" and "Mat the on sat cat the" use identical tokens. They mean very different things. The architecture must respect that.

For images, we exploited spatial locality with convolutions. For sequences — text, audio, time series, anything indexed by t — we need an architecture that:

Processes one token at a time, in order.
Carries a state that summarizes everything it has seen so far.
Updates that state with each new token.
Can produce an output at any (or every) step.

The same model, different sequence applicationsone-to-one is just an ANN

Sequence problems come in shapes: one-to-many (image captioning), many-to-one (sentiment), many-to-many same-length (POS tagging), many-to-many different-length (translation). RNNs handle them all.

Part III · Section 18 · 5 min

One cell. Used over and over.

An RNN is a single dense layer with a twist: it sees not just the current input, but its own previous output.

The recurrent cellfolded view

At every step, the cell receives the current input x_t AND its own previous hidden state h_t−1. It outputs a new state h_t, which becomes the input for step t+1.

The math is one line, on top of what you already know:

ht = tanh(Wxh xt + Whh ht−1 + b) Same W, every step. The "recurrent" weights are reused.

Like a CNN reuses its kernel across spatial positions, an RNN reuses its weights across time positions. Weight sharing through time is the parallel.

Hidden state as compressed memory

Whatever the network "remembers" about the past must fit inside h. If h is a 128-vector, the network has 128 floats to summarize "everything it has seen until now." This bottleneck is why long sequences are hard.

Part III · Section 19 · 5 min · animated

Unroll the RNN to see it as a deep network.

If you draw out the same cell once per time step, what you get looks suspiciously like a very deep feedforward network — with weight sharing.

Same cell, unrolled across 6 time steps. Watch the hidden state propagate left to right. The input at each step combines with the previous state to produce the next state and an output.

This view makes two things obvious:

It really is a deep network — for a 100-step sequence, you have a 100-layer network, all sharing weights.
Backprop applies as usual — except now it has to flow back through time, not just through layers. We call this backpropagation through time, or BPTT.

Two flavors of output

Last-step output (many-to-one): use only h_T. Good for sentiment, classification.
Every-step output (many-to-many): emit y_t at each step. Good for tagging, language modeling.

In Keras: SimpleRNN(64, return_sequences=False) vs SimpleRNN(64, return_sequences=True). One flag, big difference.

Part III · Section 20 · 6 min

BPTT, and why long-range memory fails.

Backprop through time is mathematically clean and operationally a nightmare — for the same reason as the vanishing gradient.

Recall: backprop multiplies gradients through layers. In an RNN unrolled to T steps, gradients flowing from time T back to time 1 get multiplied by the same recurrent weight matrix T times.

∂L/∂h1 ∝ (Whh)T−1 if eigenvalue < 1: vanish · > 1: explode

Gradient magnitude as it flows backward through time. With a recurrent weight ≈ 0.5, the signal halves each step. By step 20, it's effectively zero. By step 50, it's literally zero in float32.

Two failure modes, mirror images:

Vanishing gradients. Eigenvalues < 1 → signal decays exponentially. Network can't learn long-range dependencies. This is the common case.
Exploding gradients. Eigenvalues > 1 → signal blows up to NaN. Easy fix: gradient clipping. Add clipnorm=1.0 to your optimizer and forget about it.

The famous example

"In France, where I grew up speaking ___." A vanilla RNN, even at sequence length 30, struggles to remember "France" by the time it predicts the blank. The relevant signal has decayed below the noise floor. This is precisely the problem LSTM was invented to solve.

And so we arrive at the last station of our tour.

Part III · Section 21 · 3 min

An RNN in Keras.

A sentiment classifier, IMDB-style.

tensorflow / kerasfrom tensorflow.keras import layers, models

VOCAB, EMBED, MAXLEN = 10000, 64, 200

model = models.Sequential([
    layers.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB, EMBED),         # word index → vector
    layers.SimpleRNN(64),                  # return last state only
    layers.Dense(1, activation='sigmoid')   # pos / neg
])

model.compile(optimizer=tf.keras.optimizers.Adam(1e-3, clipnorm=1.0),
              loss='binary_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, batch_size=64)

Things to notice

Embedding turns an integer word index into a learned dense vector. Without it, RNNs would have to start from one-hot vectors (10,000-D for a 10k vocab — wasteful).
SimpleRNN(64) creates a 64-dimensional hidden state. Default return_sequences=False means we get only h_T.
clipnorm=1.0 guards against exploding gradients. Cheap insurance.

Honest disclaimer

You will probably never train a SimpleRNN in production. They train slowly and forget quickly. We're showing it to make the LSTM, next, feel like the obvious solution it is. In practice, replace SimpleRNN with LSTM in this exact code and you have a much better model.

`; if (ANCHOR) ANCHOR.insertAdjacentHTML('afterend', html); // ============================================================ // Static figures // ============================================================ // Sequence modes const sm = document.getElementById('fig-seq-modes'); if (sm) { function box(x,y,fill,stroke){return ``;} function modes(x0, label, sub, inputs, outputs, hidden) { let g = ''; const stepX = 28; const baseY = 90; const inY = baseY + 50, outY = baseY - 50, hY = baseY; // hidden line g += ``; for (let i = 0; i < hidden; i++) { const x = x0 + i*stepX; g += box(x, hY, '#f1e4c2', '#b8860b'); } // inputs inputs.forEach((t, i) => { const x = x0 + t*stepX; g += box(x, inY, '#dde7f7', '#1f6feb'); g += ``; }); // outputs outputs.forEach((t, i) => { const x = x0 + t*stepX; g += box(x, outY, '#d6e8de', '#1a7a4c'); g += ``; }); g += `${label}`; g += `${sub}`; return g; } sm.innerHTML = ` `; } // RNN cell folded const rc = document.getElementById('fig-rnn-cell'); if (rc) { rc.innerHTML = ` `; } // ============================================================ // React figures // ============================================================ window.__mountRnnFigures = function() { if (window.__rnnMounted) return; if (!window.React || !window.ReactDOM) return; window.__rnnMounted = true; const { useState, useEffect, useMemo } = React; // ---------- Unroll animation ---------- function UnrollFig() { const T = 6; const tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']; const [step, setStep] = useState(0); const [playing, setPlaying] = useState(true); useEffect(() => { if (!playing) return; const id = setInterval(() => setStep(s => (s+1) % (T+1)), 900); return () => clearInterval(id); }, [playing]); const W = 980, H = 320; const stepX = 130; const x0 = 80; const cellY = 150; const inY = 250; const outY = 50; const cells = []; for (let t = 0; t < T; t++) { const x = x0 + t*stepX; const active = t < step; const current = t === step - 1; // input cells.push( {tokens[t]} x{t} ); // cell cells.push( RNN h{t+1} ); // input -> cell arrow cells.push(); // output line cells.push( y{t} ); // recurrent arrow to next if (t < T-1) { const nx = x0 + (t+1)*stepX; cells.push( h ); } } // labels const labels = ( y_t cell x_t ); return (

RNN unrolled across timestep {step}/{T}

step { setPlaying(false); setStep(+e.target.value); }} /> {step}/{T}

); } const um = document.getElementById('fig-unroll-mount'); if (um) ReactDOM.createRoot(um).render(); // ---------- BPTT vanishing ---------- function BPTTFig() { const [w, setW] = useState(0.6); const T = 30; const data = useMemo(() => { const arr = []; let g = 1.0; for (let i = 0; i < T; i++) { arr.push(g); g *= w; } return arr.reverse(); // step 1 leftmost }, [w]); const W = 760, H = 240; const barW = (W - 80) / T; const bars = data.map((g, i) => { const logG = Math.log10(Math.max(g, 1e-15)); const h = Math.max(2, Math.min(170, (logG + 12) * 14)); const finalLayer = i === T-1; return ( ); }); return (

Gradient flowing back through timestep 1 (left) ← step 30 (right, near loss)

recurrent weight |W| setW(+e.target.value)}/> {w.toFixed(2)} {w < 0.95 ? 'vanishing' : w > 1.05 ? 'exploding' : 'critical regime'}

); } const bm = document.getElementById('fig-bptt-mount'); if (bm) ReactDOM.createRoot(bm).render(); }; })();