Matrix calculus

Matrix calculus refers to a number of different notations that use matrices and vectors to collect the derivative of each component of the dependent variable with respect to each component of the independent variable. This greatly simplifies operations such as finding the maximum or minimum of a multivariate function and solving systems of differential equations.

Two competing notational conventions split the field of matrix calculus into two separate groups.

For example, the derivative of a vector function y of size m with respect to a vector x of size n could be collected in an m×n matrix consisting of all of the possible derivative combinations.
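As a concrete numerical illustration, the following sketch (using NumPy; the helper name jacobian_fd is ours, not a standard API) collects all m·n partial derivatives of a function f : R^n → R^m into such an m×n matrix by central finite differences:

    import numpy as np

    def jacobian_fd(f, x, eps=1e-6):
        # Collect every partial derivative df_i/dx_j of f: R^n -> R^m
        # into an m-by-n matrix, one column per input component.
        x = np.asarray(x, dtype=float)
        m, n = np.atleast_1d(f(x)).size, x.size
        J = np.empty((m, n))
        for j in range(n):
            step = np.zeros(n)
            step[j] = eps
            J[:, j] = (f(x + step) - f(x - step)) / (2 * eps)
        return J

    # f : R^3 -> R^2, so the matrix of all derivative combinations is 2x3.
    f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
    print(jacobian_fd(f, [1.0, 2.0, 3.0]))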

The six kinds of derivatives that can be most neatly organized in matrix form are collected in the following table.
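One common presentation (result shapes given in numerator layout; y and x are scalars, y is an m×1 vector, x is an n×1 vector, Y is an m×n matrix, X is a p×q matrix):

    By scalar x:    ∂y/∂x (scalar)    ∂y/∂x (m×1)    ∂Y/∂x (m×n)
    By vector x:    ∂y/∂x (1×n)       ∂y/∂x (m×n)
    By matrix X:    ∂y/∂X (q×p)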

Notice that we could also talk about the derivative of a vector with respect to a matrix, or any of the other unfilled cells in our table.

However, these derivatives are most naturally organized in a tensor of rank higher than 2, so that they do not fit neatly into a matrix.

Matrix calculus is used for deriving optimal stochastic estimators, often involving the use of Lagrange multipliers.

This includes the derivation of estimators such as the Kalman filter and the Wiener filter.

The vector and matrix derivatives presented in the sections to follow take full advantage of matrix notation, using a single variable to represent a large number of variables.

Let M(m,n) denote the space of real m×n matrices; such matrices are denoted with bold capital letters: A, X, Y, etc. An element of M(n,1), that is, a column vector, is denoted with a boldface lowercase letter: a, x, y, etc. An element of M(1,1) is a scalar, denoted with a lowercase italic letter: a, t, x, etc.

X^T denotes the matrix transpose, tr(X) is the trace, and det(X) or |X| is the determinant.

NOTE: As mentioned above, there are competing notations for laying out systems of partial derivatives in vectors and matrices, and no standard appears to be emerging yet.

The next two introductory sections use the numerator layout convention simply for the purposes of convenience, to avoid overly complicating the discussion.

The tensor index notation with its Einstein summation convention is very similar to matrix calculus, except that one writes only a single component at a time.
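The correspondence is easy to check mechanically; a minimal sketch (NumPy assumed) writes the same matrix product both as a one-component-at-a-time index expression with a summed repeated index and in matrix notation:

    import numpy as np

    A = np.arange(6.0).reshape(2, 3)
    B = np.arange(12.0).reshape(3, 4)
    # Index notation: C_ik = A_ij B_jk, with Einstein summation over j.
    C_index = np.einsum('ij,jk->ik', A, B)
    # Matrix notation: C = A B.
    assert np.allclose(C_index, A @ B)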

The notations developed here can accommodate the usual operations of vector calculus by identifying the space M(n,1) of n-vectors with the Euclidean space R^n and the space M(1,1) of scalars with R. The corresponding concept from vector calculus is indicated at the end of each subsection.

NOTE: The discussion in this section assumes the numerator layout convention for pedagogical purposes.

The section on layout conventions discusses this issue in greater detail.

The derivative of a scalar y by a vector x = [x_1, x_2, ..., x_n]^T is written (in numerator layout notation) as

∂y/∂x = [∂y/∂x_1   ∂y/∂x_2   ...   ∂y/∂x_n].

In vector calculus, the gradient of a scalar field f : R^n → R (whose independent coordinates are the components of x) is the transpose of the derivative of a scalar by a vector: ∇f = (∂f/∂x)^T.
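A small numerical sketch of this transpose relationship (NumPy assumed; the quadratic form is merely an example): for y = x^T A x, the numerator-layout derivative is the row vector x^T(A + A^T) and the gradient is the column vector (A + A^T)x:

    import numpy as np

    rng = np.random.default_rng(0)
    n = 4
    A = rng.standard_normal((n, n))
    x = rng.standard_normal(n)
    # Numerator layout: dy/dx is the 1 x n row vector x^T (A + A^T).
    row = (x @ (A + A.T)).reshape(1, n)
    # Vector-calculus gradient: the n x 1 column vector (A + A^T) x.
    grad = ((A + A.T) @ x).reshape(n, 1)
    assert np.allclose(row, grad.T)  # one is the transpose of the other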

This section discusses the similarities and differences between notational conventions that are used in the various fields that take advantage of matrix calculus.

We also handle cases of scalar-by-scalar derivatives that involve an intermediate vector or matrix.
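For instance (a sketch, NumPy assumed): with the intermediate vector u(x) = (x, x^2) and y(u) = u_1 u_2, the scalar derivative dy/dx is the product of the 1×2 row ∂y/∂u and the 2×1 column ∂u/∂x:

    import numpy as np

    x = 1.5
    u = np.array([x, x**2])           # intermediate vector u(x)
    dy_du = np.array([u[1], u[0]])    # 1 x 2 row (numerator layout)
    du_dx = np.array([1.0, 2 * x])    # 2 x 1 column
    dy_dx = dy_du @ du_dx             # chain rule through the vector u
    assert np.isclose(dy_dx, 3 * x**2)  # since y(x) = x * x^2 = x^3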

For each of the various combinations, we give numerator-layout and denominator-layout results, except in the cases where denominator layout rarely occurs.

In cases involving matrices where it makes sense, we give numerator-layout and mixed-layout results.

Keep in mind that various authors use different combinations of numerator and denominator layouts for different types of derivatives, and there is no guarantee that an author will consistently use either numerator or denominator layout for all types.

For example, in attempting to find the maximum likelihood estimate of a multivariate normal distribution using matrix calculus, if the domain is a k×1 column vector, then the result using the numerator layout will be in the form of a 1×k row vector.
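A sketch of that shape convention for the mean parameter (NumPy assumed; the covariance below is an arbitrary valid example): the log-likelihood term ℓ(μ) = -½(x − μ)^T Σ^{-1}(x − μ) has numerator-layout derivative (x − μ)^T Σ^{-1}, a 1×k row vector:

    import numpy as np

    rng = np.random.default_rng(1)
    k = 3
    x = rng.standard_normal(k)
    mu = rng.standard_normal(k)
    S = np.eye(k) + 0.1 * np.ones((k, k))     # example covariance matrix
    Sinv = np.linalg.inv(S)
    loglik = lambda m: -0.5 * (x - m) @ Sinv @ (x - m)
    row = ((x - mu) @ Sinv).reshape(1, k)     # numerator layout: 1 x k
    # Finite-difference check of the first component.
    eps = 1e-6
    e0 = np.zeros(k); e0[0] = eps
    fd = (loglik(mu + e0) - loglik(mu - e0)) / (2 * eps)
    assert np.isclose(row[0, 0], fd, atol=1e-5)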

The chain rule applies in some of these cases, but unfortunately does not apply in matrix-by-scalar derivatives or scalar-by-matrix derivatives. In the latter case, the product rule can't quite be applied directly either, but the equivalent can be done with a bit more work using the differential identities.

In the identities below, quantities not being differentiated (scalars such as a, vectors such as a, and matrices such as A) are constant with respect to the variable of differentiation. Vector-by-vector identities are presented first because all of the operations that apply to vector-by-vector differentiation apply directly to vector-by-scalar or scalar-by-vector differentiation, simply by reducing the appropriate vector in the numerator or denominator to a scalar.

Identities whose outputs are matrices assume the matrices are laid out consistently with the vector layout, i.e., numerator-layout matrices with numerator-layout vectors and vice versa; otherwise, transpose the vector-by-vector derivatives.
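As a sanity check of one standard vector-by-vector identity, ∂(Ax)/∂x = A in numerator layout (A^T in denominator layout); a sketch, NumPy assumed:

    import numpy as np

    rng = np.random.default_rng(2)
    m, n = 2, 3
    A = rng.standard_normal((m, n))
    x = rng.standard_normal(n)
    f = lambda x: A @ x
    # Build the finite-difference Jacobian column by column.
    eps = 1e-6
    J = np.empty((m, n))
    for j in range(n):
        e = np.zeros(n); e[j] = eps
        J[:, j] = (f(x + e) - f(x - e)) / (2 * eps)
    assert np.allclose(J, A, atol=1e-5)  # numerator layout gives A itself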

However, the product rule of this sort does apply to the differential form (see below), and this is the way to derive many of the identities below involving the trace function, combined with the fact that the trace function allows transposing and cyclic permutation, i.e.:

tr(A) = tr(A^T)
tr(ABCD) = tr(BCDA) = tr(CDAB) = tr(DABC)

For example, to compute ∂ tr(AXB)/∂X, pass to the differential and rearrange with these rules:

d tr(AXB) = tr(d(AXB)) = tr(A dX B) = tr(BA dX),

so that, in numerator layout, ∂ tr(AXB)/∂X = BA (and A^T B^T in denominator layout).

It is often easier to work in differential form and then convert back to normal derivatives.
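The worked example above, ∂ tr(AXB)/∂X = BA in numerator layout (equivalently A^T B^T in denominator layout), can be verified numerically; a sketch, NumPy assumed:

    import numpy as np

    rng = np.random.default_rng(3)
    m, n, k = 2, 3, 4
    A = rng.standard_normal((k, m))
    B = rng.standard_normal((n, k))
    X = rng.standard_normal((m, n))
    y = lambda X: np.trace(A @ X @ B)
    # Finite-difference gradient with the same shape as X (denominator layout).
    eps = 1e-6
    G = np.empty_like(X)
    for i in range(m):
        for j in range(n):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (y(X + E) - y(X - E)) / (2 * eps)
    # Denominator layout gives A^T B^T = (BA)^T; numerator layout gives BA.
    assert np.allclose(G, (B @ A).T, atol=1e-5)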