~ / track G / neighboring paradigms
Automatic differentiation
AdvancedAutomatic differentiation (AD) computes the derivative of a function by manipulating the program that defines it — exactly, to machine precision, without symbolic algebra and without finite-difference approximation. It's how every deep-learning framework computes gradients, and it's a beautiful application of functional principles: programs as data, transformed by other programs.
Three things AD is not
- Not symbolic. It doesn't return a closed-form expression for the derivative; it computes its value at a point.
- Not numerical (finite differences). No
(f(x+h) - f(x)) / hwith all its precision and step-size issues. AD's answer is exact (up to floating point). - Not magic. It's just the chain rule, applied systematically by walking the structure of the computation.
Forward mode: dual numbers
The simplest implementation uses dual numbers. Pair every value x with
its derivative x′. Define arithmetic on these pairs:
(a + bε) + (c + dε) = (a+c) + (b+d)ε
(a + bε) * (c + dε) = ac + (ad + bc)ε
Here ε is "infinitesimal" — its square is treated as zero. Plug
x = (x, 1) (value x, derivative 1 because we're differentiating w.r.t.
itself) and run your function. The result's second component is f'(x).
In Clojure:
You didn't write the derivative. You wrote the function — using
dadd/dmul on dual numbers — and got both the value and the derivative
out. This is forward-mode AD.
Reverse mode: backprop
Reverse-mode AD is what powers deep learning. The idea:
- Forward pass records the computation graph (every op, every input).
- Backward pass walks the graph in reverse, accumulating ∂output/∂input for every node using the chain rule.
Reverse mode shines when you have many inputs and few outputs — like a neural net with millions of parameters and a single scalar loss. Forward mode is better when there are few inputs and many outputs.
In libraries (PyTorch's autograd, JAX's grad, Clojure's dvl / neanderthal
plus tape libraries), you write the model as if you were just computing the
forward output; the library records the graph and gives you gradients on
request.
Why this is a functional-programming triumph
- Programs as data. AD treats your code as a structure to transform — the same instinct behind macros, optics, and term rewriting.
- Composability.
gradis a higher-order function: takesf, returnsf'. You cangrad (grad f)and get the second derivative. - Purity matters. AD assumes functions are pure mathematical mappings. Side effects in the middle of a differentiable function are a footgun.
- Type-respecting. A scalar's gradient is a scalar; a vector input's gradient is a vector; a matrix's gradient is a matrix. The shape rules are what you'd derive on paper.
Where it shows up
- Deep learning: every gradient-based optimizer.
- Differentiable physics / graphics: PhysX, Mitsuba 3.
- Probabilistic programming: Hamiltonian Monte Carlo needs gradients of the log-posterior.
- Optimization in finance: greeks via AD instead of finite differences.
Check yourself
? quiz
When you compute `f'(3)` via forward-mode dual numbers as in the example, what is the role of the `1` in `[3 1]`?
Exercise
Using the dadd and dmul operators above, compute f(x) = x³ and its
derivative at x = 2. (f'(x) = 3x² = 12 at x = 2.)