Let z, y and x be the (variable) positions of the car, the bicycle, and the walking man, respectively.
So, the rate of change of the relative positions of the car and the walking man is
\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}.\]
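If, for illustration (the specific rates are assumed here, not given in the surrounding text), the car travels twice as fast as the bicycle and the bicycle four times as fast as the walking man, then
\[\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} = 2 \cdot 4 = 8,\]
so the car's position changes eight times as fast as the walking man's.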
[3] Guillaume de l'Hôpital used the chain rule implicitly in his Analyse des infiniment petits.
The chain rule does not appear in any of Leonhard Euler's analysis books, even though they were written over a hundred years after Leibniz's discovery.
It is believed that the first "modern" version of the chain rule appears in Lagrange's 1797 Théorie des fonctions analytiques; it also appears in Cauchy's 1823 Résumé des leçons données à l'École royale polytechnique sur le calcul infinitésimal.
[3] The simplest form of the chain rule is for real-valued functions of one real variable.
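If g is differentiable at a point x and f is differentiable at g(x), this form states that
\[(f \circ g)'(x) = f'(g(x))\, g'(x),\]
or, in Leibniz notation with y = f(u) and u = g(x), dy/dx = (dy/du)⋅(du/dx).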
The chain rule forms the basis of the backpropagation algorithm, which is used to compute the gradients needed for gradient-descent training of neural networks in deep learning (artificial intelligence).
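As a minimal sketch (not taken from the article; the tiny one-neuron model and its parameter names are invented for illustration), the following Python snippet computes the gradient of y = w2⋅σ(w1⋅x) by multiplying local derivatives in the order the chain rule prescribes, which is the bookkeeping that backpropagation automates:

    import math

    def forward(x, w1, w2):
        u = w1 * x                      # inner function: u = w1*x
        s = 1.0 / (1.0 + math.exp(-u))  # sigmoid activation applied to u
        y = w2 * s                      # outer linear layer
        return u, s, y

    def backward(x, w1, w2):
        # Apply the chain rule step by step, from the output back to the weights.
        u, s, y = forward(x, w1, w2)
        dy_ds = w2                        # y = w2*s, so dy/ds = w2
        ds_du = s * (1.0 - s)             # derivative of the sigmoid
        du_dw1 = x                        # u = w1*x, so du/dw1 = x
        dy_dw2 = s                        # direct derivative of y with respect to w2
        dy_dw1 = dy_ds * ds_du * du_dw1   # chain rule: product of local derivatives
        return dy_dw1, dy_dw2

    print(backward(x=0.5, w1=1.2, w2=-0.7))

Automatic-differentiation libraries perform the same kind of bookkeeping over much larger compositions.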
[5] Faà di Bruno's formula generalizes the chain rule to higher derivatives.
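For instance, applying the single-variable rule twice gives the second derivative of a composite,
\[(f \circ g)''(x) = f''(g(x))\, g'(x)^2 + f'(g(x))\, g''(x),\]
and Faà di Bruno's formula organizes the analogous expansions for all higher orders.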
One proof of the chain rule begins by defining the derivative of the composite function f ∘ g, where we take the limit of the difference quotient for f ∘ g as x approaches a:
\[(f \circ g)'(a) = \lim_{x \to a} \frac{f(g(x)) - f(g(a))}{x - a}.\]
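When g(x) ≠ g(a) for x near a (the case where g(x) = g(a) arbitrarily close to a needs a separate adjustment, so this is only a sketch of the standard step), this quotient can be factored as
\[\frac{f(g(x)) - f(g(a))}{x - a} = \frac{f(g(x)) - f(g(a))}{g(x) - g(a)} \cdot \frac{g(x) - g(a)}{x - a}.\]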
The latter factor is the difference quotient for g at a, and because g is differentiable at a by assumption, its limit as x tends to a exists and equals g′(a).
Another way of proving the chain rule is to measure the error in the linear approximation determined by the derivative.
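Concretely, differentiability of f at a point a amounts to the existence of a function ε with ε(h) → 0 as h → 0 such that
\[f(a + h) = f(a) + f'(a)h + \varepsilon(h)h,\]
so that the term ε(h)h measures the error of the linear approximation f(a) + f′(a)h.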
In the situation of the chain rule, such a function ε exists because g is assumed to be differentiable at a.
Applying the same theorem on products of limits as in the first proof, the third bracketed term also tends to zero.
Constantin Carathéodory's alternative definition of the differentiability of a function can be used to give an elegant proof of the chain rule.
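In Carathéodory's formulation, a function f is differentiable at a point b if and only if there is a function q, continuous at b, with f(y) − f(b) = q(y)(y − b), and then necessarily q(b) = f′(b). As a sketch of how the proof runs, applying this to g at a (writing g(x) − g(a) = r(x)(x − a) with r continuous at a) and to f at g(a) gives
\[f(g(x)) - f(g(a)) = q(g(x))\,\bigl(g(x) - g(a)\bigr) = q(g(x))\, r(x)\,(x - a),\]
and the factor q(g(x))r(x) is continuous at a with value f′(g(a))g′(a), which is the chain rule.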
A similar approach works for continuously differentiable (vector-)functions of many variables.
This case, a composite of the form f(g1(x), …, gk(x)) in which the outer function f takes several arguments but each gi depends on the single real variable x, occurs often in the study of functions of a single variable, so it is worth describing it separately.
The usual notations for partial derivatives involve names for the arguments of the function.
As these arguments are not named in the above formula, it is simpler and clearer to use D-notation, and to denote by Di f the partial derivative of f with respect to its i-th argument and by Di f(z) its value at a point z.
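With this notation, the rule for this case can be written (a sketch of the standard statement, assuming each gi is differentiable at x and f is differentiable at (g1(x), …, gk(x))) as
\[\frac{d}{dx}\, f(g_1(x), \dots, g_k(x)) = \sum_{i=1}^{k} D_i f(g_1(x), \dots, g_k(x))\, g_i'(x).\]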
The higher-dimensional chain rule can be proved using a technique similar to the second proof given above.
[7] Because the total derivative is a linear transformation, the functions appearing in the formula can be rewritten as matrices.
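With D denoting the total derivative and J the Jacobian matrix, the higher-dimensional rule takes the form
\[D(f \circ g)(a) = Df(g(a)) \circ Dg(a), \qquad J_{f \circ g}(a) = J_f(g(a))\, J_g(a).\]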
When f and g are each functions of one real variable, their Jacobian matrices are the 1 × 1 matrices (f′(g(a))) and (g′(a)). The Jacobian of f ∘ g is the product of these 1 × 1 matrices, so it is f′(g(a))⋅g′(a), as expected from the one-dimensional chain rule.
Another way of writing the chain rule is used when f and g are expressed in terms of their components as y = f(u) = (f1(u), …, fk(u)) and u = g(x) = (g1(x), …, gm(x)).
Recall that when the total derivative exists, the partial derivative in the i-th coordinate direction is found by multiplying the Jacobian matrix by the i-th basis vector.
Since the entries of the Jacobian matrix are partial derivatives, we may simplify the above formula to get:
\[\frac{\partial y}{\partial x_i} = \sum_{\ell=1}^{m} \frac{\partial y}{\partial u_\ell}\, \frac{\partial u_\ell}{\partial x_i}.\]
Recalling that u = (g1, …, gm), the partial derivative ∂u / ∂xi is also a vector, and the chain rule says that:
\[\frac{\partial y}{\partial x_i} = \sum_{\ell=1}^{m} \frac{\partial y}{\partial u_\ell}\, \frac{\partial g_\ell}{\partial x_i}.\]
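For a concrete illustration (the specific functions below are chosen for this example and do not come from the text), take y = f(u1, u2) = u1u2 with u1 = g1(x1, x2) = x1 + x2 and u2 = g2(x1, x2) = x1x2. Then
\[\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial u_1}\,\frac{\partial g_1}{\partial x_1} + \frac{\partial y}{\partial u_2}\,\frac{\partial g_2}{\partial x_1} = u_2 \cdot 1 + u_1 \cdot x_2 = 2x_1 x_2 + x_2^2,\]
which agrees with differentiating y = (x1 + x2)x1x2 = x1²x2 + x1x2² directly.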
Faà di Bruno's formula for higher-order derivatives of single-variable functions generalizes to the multivariable case.
The chain rule is also valid for Fréchet derivatives in Banach spaces.
[8] This case and the previous one admit a simultaneous generalization to Banach manifolds.
The common feature of these examples is that they are expressions of the idea that the derivative is part of a functor.
This variant of the chain rule is not an example of a functor because the two functions being composed are of different types.