~ / track H / neighboring paradigms

Automatic differentiation

Advanced

Automatic differentiation (AD) computes the derivative of a function by manipulating the program that defines it — exactly, to machine precision, without symbolic algebra and without finite-difference approximation. It's how every deep-learning framework computes gradients, and it's a beautiful application of functional principles: programs as data, transformed by other programs.

Three things AD is not

Not symbolic. It doesn't return a closed-form expression for the derivative; it computes its value at a point.
Not numerical (finite differences). No (f(x+h) - f(x)) / h with all its precision and step-size issues. AD's answer is exact (up to floating point).
Not magic. It's just the chain rule, applied systematically by walking the structure of the computation.

Forward mode: dual numbers

The simplest implementation uses dual numbers. Pair every value x with its derivative x′. Define arithmetic on these pairs:

(a + bε) + (c + dε)  = (a+c) + (b+d)ε
(a + bε) * (c + dε)  = ac     + (ad + bc)ε

Here ε is "infinitesimal" — its square is treated as zero. Plug x = (x, 1) (value x, derivative 1 because we're differentiating w.r.t. itself) and run your function. The result's second component is f'(x).

In Clojure:

loading sci

press ⌘/Ctrl-↵ or click ▶ run to evaluate

You didn't write the derivative. You wrote the function — using dadd/dmul on dual numbers — and got both the value and the derivative out. This is forward-mode AD.

Reverse mode: backprop

Reverse-mode AD is what powers deep learning. The idea:

Forward pass records the computation graph (every op, every input).
Backward pass walks the graph in reverse, accumulating ∂output/∂input for every node using the chain rule.

Reverse mode shines when you have many inputs and few outputs — like a neural net with millions of parameters and a single scalar loss. Forward mode is better when there are few inputs and many outputs.

In libraries (PyTorch's autograd, JAX's grad, Clojure's dvl / neanderthal plus tape libraries), you write the model as if you were just computing the forward output; the library records the graph and gives you gradients on request.

Why this is a functional-programming triumph

Programs as data. AD treats your code as a structure to transform — the same instinct behind macros, optics, and term rewriting.
Composability. grad is a higher-order function: takes f, returns f'. You can grad (grad f) and get the second derivative.
Purity matters. AD assumes functions are pure mathematical mappings. Side effects in the middle of a differentiable function are a footgun.
Type-respecting. A scalar's gradient is a scalar; a vector input's gradient is a vector; a matrix's gradient is a matrix. The shape rules are what you'd derive on paper.

Where it shows up

Deep learning: every gradient-based optimizer.
Differentiable physics / graphics: PhysX, Mitsuba 3.
Probabilistic programming: Hamiltonian Monte Carlo needs gradients of the log-posterior.
Optimization in finance: greeks via AD instead of finite differences.

Check yourself

? quiz

When you compute `f'(3)` via forward-mode dual numbers as in the example, what is the role of the `1` in `[3 1]`?

Exercise

Using the dadd and dmul operators above, compute f(x) = x³ and its derivative at x = 2. (f'(x) = 3x² = 12 at x = 2.)

loading sci

press ⌘/Ctrl-↵ or click ▶ run to evaluate

● status: new