Compositional data

In statistics, compositional data are quantitative descriptions of the parts of some whole, conveying relative information.

Mathematically, compositional data are represented by points on a simplex.

Measurements involving probabilities, proportions, percentages, and ppm can all be thought of as compositional data.

The use of a barycentric plot on three variables graphically depicts the ratios of the three variables as positions in an equilateral triangle.
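As an illustration of how a three-part composition becomes a position in the triangle, the following minimal sketch maps a composition to two-dimensional plotting coordinates. The function name `to_triangle` and the vertex layout (first part at the lower left, second at the lower right, third at the apex) are assumptions made for this example; other layouts are equally valid.

```python
import math

def to_triangle(composition):
    """Map a 3-part composition to 2-D coordinates in an equilateral triangle."""
    a, b, c = composition
    total = a + b + c
    # Only the relative sizes of the parts matter, so normalize to unit sum first.
    a, b, c = a / total, b / total, c / total
    # Assumed vertices: part 1 at (0, 0), part 2 at (1, 0), part 3 at (0.5, sqrt(3)/2).
    x = b + 0.5 * c
    y = (math.sqrt(3) / 2) * c
    return x, y

# Equal parts land on the centroid of the triangle.
print(to_triangle([1, 1, 1]))
# A composition dominated by the third part lands near the apex.
print(to_triangle([1, 1, 8]))
```

Because the input is normalized first, `[1, 1, 1]` and `[5, 5, 5]` map to the same point.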

In general, John Aitchison defined compositional data in 1982 to be proportions of some whole.[1]

In particular, a compositional data point (or composition for short) can be represented by a real vector with positive components.

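The sample space of compositional data is a simplex:

$$\mathcal{S}^D = \left\{ \mathbf{x} = [x_1, x_2, \dots, x_D] \in \mathbb{R}^D \;\middle|\; x_i > 0,\ i = 1, 2, \dots, D;\ \sum_{i=1}^{D} x_i = \kappa \right\}.$$

The only information is given by the ratios between components, so the information of a composition is preserved under multiplication by any positive constant.

Therefore, the sample space of compositional data can always be assumed to be a standard simplex, i.e. $\kappa = 1$.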

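In this context, normalization to the standard simplex is called closure and is denoted by $\mathcal{C}[\cdot]$:

$$\mathcal{C}[x_1, x_2, \dots, x_D] = \left[ \frac{x_1}{\sum_{i=1}^{D} x_i}, \frac{x_2}{\sum_{i=1}^{D} x_i}, \dots, \frac{x_D}{\sum_{i=1}^{D} x_i} \right],$$

where $D$ is the number of parts (components) and $[\cdot]$ denotes a row vector.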
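Closure amounts to dividing every part by the total. A minimal sketch, in which the function name `closure` and the check for strictly positive parts are choices made for this example:

```python
def closure(x):
    """Rescale a vector of positive parts so that the parts sum to 1."""
    if any(v <= 0 for v in x):
        raise ValueError("all parts must be strictly positive")
    total = sum(x)
    return [v / total for v in x]

# The same composition recorded as raw counts and as percentages.
counts = [12, 36, 72]
percentages = [10.0, 30.0, 60.0]

print(closure(counts))
print(closure(percentages))
```

Both inputs close to the same point of the standard simplex, which reflects the fact that multiplying a composition by a positive constant does not change its information.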

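The simplex can be given the structure of a vector space in several different ways. The following vector space structure is called the Aitchison geometry or the Aitchison simplex. Writing $\mathcal{S}^D$ for the simplex of $D$-part compositions and $\mathcal{C}[\cdot]$ for closure (normalization to unit sum), it has the following operations.

Perturbation (vector addition):

$$\mathbf{x} \oplus \mathbf{y} = \mathcal{C}[x_1 y_1, \dots, x_D y_D], \qquad \mathbf{x}, \mathbf{y} \in \mathcal{S}^D$$

Powering (scalar multiplication):

$$\alpha \odot \mathbf{x} = \mathcal{C}[x_1^{\alpha}, \dots, x_D^{\alpha}], \qquad \mathbf{x} \in \mathcal{S}^D,\ \alpha \in \mathbb{R}$$

Inner product:

$$\langle \mathbf{x}, \mathbf{y} \rangle = \frac{1}{2D} \sum_{i=1}^{D} \sum_{j=1}^{D} \log\frac{x_i}{x_j} \log\frac{y_i}{y_j}, \qquad \mathbf{x}, \mathbf{y} \in \mathcal{S}^D$$

Endowed with these operations, the Aitchison simplex forms a $(D-1)$-dimensional Euclidean inner product space. The uniform composition $\left[\frac{1}{D}, \dots, \frac{1}{D}\right]$ is the zero vector.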
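In the Aitchison geometry, vector addition is perturbation (a closed component-wise product), scalar multiplication is powering (a closed component-wise power), and the inner product can be computed as the ordinary dot product of centered log ratios. The sketch below illustrates these standard operations; the function names `perturb`, `power`, and `inner` are assumptions of this example, not an established API.

```python
import math

def closure(x):
    total = sum(x)
    return [v / total for v in x]

def perturb(x, y):
    """Perturbation: multiply component-wise, then close."""
    return closure([a * b for a, b in zip(x, y)])

def power(x, alpha):
    """Powering: raise every part to alpha, then close."""
    return closure([a ** alpha for a in x])

def inner(x, y):
    """Aitchison inner product via centered logarithms of the parts."""
    lx = [math.log(a) for a in x]
    ly = [math.log(b) for b in y]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    return sum((a - mx) * (b - my) for a, b in zip(lx, ly))

x = closure([1, 2, 7])
y = closure([3, 3, 4])
neutral = closure([1, 1, 1])  # the uniform composition plays the role of zero

print(perturb(x, neutral))       # perturbing by the neutral element returns x
print(perturb(x, power(x, -1)))  # x plus its additive inverse gives the neutral element
print(inner(x, y))
```

Centering the logarithms gives the same value as the double-sum form of the inner product, and avoids the quadratic loop over pairs of parts.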

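Since the Aitchison simplex forms a finite-dimensional Hilbert space, it is possible to construct orthonormal bases in the simplex. Every composition $\mathbf{x}$ can be decomposed as

$$\mathbf{x} = \bigoplus_{i=1}^{D-1} x_i^{*} \odot \mathbf{e}_i,$$

where $\oplus$ denotes perturbation, $\odot$ denotes powering, and $\mathbf{e}_1, \dots, \mathbf{e}_{D-1}$ forms an orthonormal basis in the simplex. The values $x_i^{*}$, $i = 1, \dots, D-1$, are the coordinates of $\mathbf{x}$ with respect to the given basis.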

There are three well-characterized isomorphisms that transform from the Aitchison simplex to real space.

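The additive log ratio (alr) transform is an isomorphism $\operatorname{alr}: \mathcal{S}^D \rightarrow \mathbb{R}^{D-1}$, given by

$$\operatorname{alr}(\mathbf{x}) = \left[ \log\frac{x_1}{x_D}, \dots, \log\frac{x_{D-1}}{x_D} \right].$$

The choice of denominator component is arbitrary, and could be any specified component. This transform is commonly used in chemistry with measurements such as pH.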

In addition, this is the transform most commonly used for multinomial logistic regression.

The alr transform is not an isometry, meaning that distances on transformed values will not be equivalent to distances on the original compositions in the simplex.
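The loss of distances under alr can be checked numerically. The following sketch takes the last part as the alr denominator (an arbitrary choice) and compares the Euclidean distance between alr coordinates with the Aitchison distance, computed here as the Euclidean distance between centered log ratios. All function names are assumptions of this example.

```python
import math

def closure(x):
    total = sum(x)
    return [v / total for v in x]

def alr(x):
    """Additive log ratio with the last part as the reference."""
    return [math.log(v / x[-1]) for v in x[:-1]]

def alr_inverse(z):
    """Map alr coordinates back to the standard simplex."""
    return closure([math.exp(v) for v in z] + [1.0])

def clr(x):
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

x = closure([1, 2, 7])
y = closure([3, 3, 4])

d_aitchison = euclidean(clr(x), clr(y))
d_alr = euclidean(alr(x), alr(y))
print(d_aitchison, d_alr)  # the two distances differ
print(alr_inverse(alr(x)))  # the transform itself is invertible
```

The round trip through `alr_inverse` recovers the composition, so no information is lost; only the metric is distorted.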

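The centered log ratio (clr) transform is both an isomorphism and an isometry, $\operatorname{clr}: \mathcal{S}^D \rightarrow U$ with $U \subset \mathbb{R}^D$, given by

$$\operatorname{clr}(\mathbf{x}) = \left[ \log\frac{x_1}{g(\mathbf{x})}, \dots, \log\frac{x_D}{g(\mathbf{x})} \right],$$

where $g(\mathbf{x})$ is the geometric mean of $\mathbf{x}$. The inverse of this function is also known as the softmax function.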
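The clr transform takes the logarithm of each part divided by the geometric mean of all parts, which is the same as centering the log-parts. A minimal sketch with its inverse, which exponentiates and renormalizes; the names `clr` and `clr_inverse` are assumptions of this example.

```python
import math

def clr(x):
    """Centered log ratio: log of each part over the geometric mean."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]

def clr_inverse(z):
    """Inverse clr: exponentiate, then normalize to unit sum."""
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

x = [0.1, 0.2, 0.7]
z = clr(x)
print(z)               # the coordinates sum to zero
print(clr_inverse(z))  # recovers x
print(clr([1, 2, 7]))  # same coordinates: clr ignores the overall scale
```

The image of clr is the hyperplane of vectors summing to zero, so the $D$ clr coordinates are not linearly independent.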

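The isometric log ratio (ilr) transform is both an isomorphism and an isometry, $\operatorname{ilr}: \mathcal{S}^D \rightarrow \mathbb{R}^{D-1}$, given by

$$\operatorname{ilr}(\mathbf{x}) = \left[ \langle \mathbf{x}, \mathbf{e}_1 \rangle, \dots, \langle \mathbf{x}, \mathbf{e}_{D-1} \rangle \right],$$

where $\mathbf{e}_1, \dots, \mathbf{e}_{D-1}$ is an orthonormal basis of the simplex and $\langle \cdot, \cdot \rangle$ is the Aitchison inner product.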

There are multiple ways to construct orthonormal bases, including using Gram–Schmidt orthogonalization or a singular-value decomposition of clr-transformed data.
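As one possible illustration of the Gram–Schmidt route, the sketch below works in clr coordinates: it orthonormalizes a set of centered indicator vectors, which span the hyperplane of zero-sum vectors, and then obtains ilr coordinates as dot products of the clr vector with the resulting basis. The starting vectors and all function names are assumptions of this example; a different starting set yields a different, equally valid basis.

```python
import math

def clr(x):
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gram_schmidt_basis(D):
    """Orthonormal basis, in clr coordinates, of the zero-sum hyperplane of R^D."""
    basis = []
    for i in range(D - 1):
        # Centered indicator vector of part i; it sums to zero by construction.
        v = [(1.0 if j == i else 0.0) - 1.0 / D for j in range(D)]
        # Subtract the components along the vectors found so far.
        for b in basis:
            proj = dot(v, b)
            v = [vj - proj * bj for vj, bj in zip(v, b)]
        norm = math.sqrt(dot(v, v))
        basis.append([vj / norm for vj in v])
    return basis

def ilr(x, basis):
    """ilr coordinates: projections of clr(x) onto each basis vector."""
    z = clr(x)
    return [dot(z, b) for b in basis]

basis = gram_schmidt_basis(4)
x = [0.1, 0.2, 0.3, 0.4]
y = [0.4, 0.4, 0.1, 0.1]
print(ilr(x, basis))
print(ilr(y, basis))
```

Since the basis is orthonormal, the Euclidean distance between the two ilr vectors equals the Euclidean distance between the corresponding clr vectors.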

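Another alternative is to construct log contrasts from a bifurcating tree. Given a bifurcating tree, a basis can be constructed from its internal nodes. Each internal node $\ell$ splits the tips beneath it into two subtrees, and the corresponding basis vector is

$$\mathbf{e}_\ell = \mathcal{C}\left[ \exp\left( \underbrace{0, \dots, 0}_{k}, \underbrace{a, \dots, a}_{r}, \underbrace{b, \dots, b}_{s}, \underbrace{0, \dots, 0}_{t} \right) \right],$$

where $\mathcal{C}[\cdot]$ denotes closure and

$$a = \frac{\sqrt{s}}{\sqrt{r(r+s)}}, \qquad b = -\frac{\sqrt{r}}{\sqrt{s(r+s)}}.$$

Here $k$, $r$, $s$, $t$ are the respective number of tips in the corresponding subtrees shown in the figure. It can be shown that the resulting basis is orthonormal.

Once the basis is built, the ilr transform can be calculated as

$$\operatorname{ilr}(\mathbf{x}) = \operatorname{clr}(\mathbf{x}) \Psi^{T},$$

where the rows of $\Psi$ are the clr-transformed basis vectors. Each element of the ilr-transformed data then has the form

$$\operatorname{ilr}(\mathbf{x})_\ell = \sqrt{\frac{rs}{r+s}} \log\frac{g(\mathbf{x}_R)}{g(\mathbf{x}_S)},$$

where $g$ is the geometric mean and $\mathbf{x}_R$ and $\mathbf{x}_S$ are the sets of values corresponding to the tips in the subtrees $R$ and $S$.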
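For a tree-based basis, each internal node contributes one coordinate (often called a balance): a scaled log ratio of the geometric means of the parts in its two subtrees, with scaling $\sqrt{rs/(r+s)}$ for subtrees of $r$ and $s$ tips. The sketch below computes such coordinates for a small hypothetical tree `((x0, x1), x2)`, and also builds the matching contrast vectors in clr coordinates. The tree, the index sets, and the function names are assumptions of this example.

```python
import math

def geometric_mean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def balance(x, R, S):
    """Coordinate for one internal node with subtree index sets R and S."""
    r, s = len(R), len(S)
    g_r = geometric_mean([x[i] for i in R])
    g_s = geometric_mean([x[i] for i in S])
    return math.sqrt(r * s / (r + s)) * math.log(g_r / g_s)

def contrast_vector(D, R, S):
    """clr-space basis vector for the same node: a on R, b on S, 0 elsewhere."""
    r, s = len(R), len(S)
    a = math.sqrt(s / (r * (r + s)))
    b = -math.sqrt(r / (s * (r + s)))
    psi = [0.0] * D
    for i in R:
        psi[i] = a
    for i in S:
        psi[i] = b
    return psi

# Hypothetical tree ((x0, x1), x2): the root splits {x0, x1} from {x2},
# and the remaining internal node splits {x0} from {x1}.
nodes = [([0, 1], [2]), ([0], [1])]

x = [0.2, 0.3, 0.5]
coords = [balance(x, R, S) for R, S in nodes]
psi = [contrast_vector(len(x), R, S) for R, S in nodes]
print(coords)
print(psi)
```

Taking the dot product of the clr-transformed composition with each contrast vector gives the same coordinates as `balance`, which is the matrix form of the transform for this tree.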

An illustration of the Aitchison simplex. Here there are 3 parts, which represent values of different proportions. A, B, C, D and E are 5 different compositions within the simplex. A, B and C are all equivalent, and D and E are equivalent.

A representation of a tree in terms of its orthogonal components. l represents an internal node, an element of the orthonormal basis. This is a precursor to using the tree as a scaffold for the ilr transform.