Correspondence analysis

Correspondence analysis (CA) is a multivariate statistical technique proposed[1] by Herman Otto Hartley (Hirschfeld)[2] and later developed by Jean-Paul Benzécri.[3]

It is conceptually similar to principal component analysis, but applies to categorical rather than continuous data.

In a similar manner to principal component analysis, it provides a means of displaying or summarising a set of data in two-dimensional graphical form.

Its aim is to display in a biplot any structure hidden in the multivariate setting of the data table.

Since the variant of CA described here can be applied with a focus on either the rows or the columns, it should in fact be called simple (symmetric) correspondence analysis.[4]

It is traditionally applied to the contingency table of a pair of nominal variables, where each cell contains either a count or a zero value.

If more than two categorical variables are to be summarized, a variant called multiple correspondence analysis should be chosen instead.

Depending on the scores used, CA preserves the chi-square distance[5][6] between either the rows or the columns of the table.

Because CA is a descriptive technique, it can be applied to tables regardless of whether the chi-squared test is significant.

Understanding the following computations requires knowledge of matrix algebra.

Before proceeding to the central computational step of the algorithm, the values in matrix C have to be transformed.

The row and column weights are derived from the row and column masses, respectively, i.e. their elements are the inverses of the square roots of the masses.
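The transformation described here can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the example table and all names (`C`, `P`, `row_mass`, `col_mass`, `W_m`, `W_n`, `S`) are introduced here for illustration, and the sketch assumes the usual CA convention of dividing the table by its grand total and centring it with the outer product of the masses.

```python
import numpy as np

# Hypothetical 3x3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 15.0]])


def standardized_residuals(C):
    """Return the masses and the weighted, centred matrix that CA decomposes."""
    P = C / C.sum()                          # table of proportions
    row_mass = P.sum(axis=1)                 # row masses
    col_mass = P.sum(axis=0)                 # column masses
    # Diagonal weights: inverses of the square roots of the masses.
    W_m = np.diag(1.0 / np.sqrt(row_mass))
    W_n = np.diag(1.0 / np.sqrt(col_mass))
    # Centre with the outer product of the masses, then weight both sides.
    S = W_m @ (P - np.outer(row_mass, col_mass)) @ W_n
    return row_mass, col_mass, S


row_mass, col_mass, S = standardized_residuals(C)
```

The matrix `S` is what a singular value decomposition would then be applied to; both sets of masses sum to one by construction.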

However, since CA is not an inferential method, the term independence model is inappropriate here.

The amount of inertia covered by the i-th set of singular vectors is the i-th principal inertia, i.e. the square of the i-th singular value.
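As a hedged reconstruction in standard CA notation, the inertia relations can be written as below. The symbols are assumptions rather than notation taken from the text: $\sigma_i$ denotes the i-th singular value, $\iota_i$ the i-th principal inertia, $\mathrm{I}$ the total inertia, $\chi^2$ the chi-squared statistic of the table and $n$ its grand total.

```latex
\iota_i = \sigma_i^2, \qquad
\mathrm{I} = \sum_i \sigma_i^2 = \frac{\chi^2}{n}, \qquad
\frac{\iota_i}{\mathrm{I}} = \frac{\sigma_i^2}{\sum_j \sigma_j^2}
```

The last ratio is the share of the total inertia accounted for by the i-th pair of singular vectors.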

To transform the singular vectors to coordinates which preserve the chi-square distances between rows or columns, an additional weighting step is necessary.

But since all modern algorithms for CA are based on a singular value decomposition, this terminology should be avoided.

In the French tradition of CA, the coordinates are sometimes called (factor) scores.

Factor scores or principal coordinates for the rows of matrix C are computed by scaling the left singular vectors by the inverse of the square roots of the row masses and by the singular values.
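The following NumPy sketch illustrates this scaling and checks numerically that, when all dimensions are kept, Euclidean distances between rows in principal coordinates equal the chi-square distances between the row profiles. The example table and the names `P`, `r`, `c`, `S`, `F` and `chi_square_distance` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical 3x3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 15.0]])

P = C / C.sum()                  # proportions
r = P.sum(axis=1)                # row masses
c = P.sum(axis=0)                # column masses

# Centred matrix weighted by the inverse square roots of the masses.
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)

# Row principal coordinates: left singular vectors scaled by the inverse
# square roots of the row masses and by the singular values.
F = (U / np.sqrt(r)[:, None]) * sigma


def chi_square_distance(i, j):
    """Chi-square distance between the profiles of rows i and j."""
    profile_i = P[i] / r[i]
    profile_j = P[j] / r[j]
    return np.sqrt(np.sum((profile_i - profile_j) ** 2 / c))


d_table = chi_square_distance(0, 1)      # distance computed from the table
d_map = np.linalg.norm(F[0] - F[1])      # distance in principal coordinates
```

In a two-dimensional display only the first two columns of `F` would be plotted, so the mapped distances are then approximations of the chi-square distances.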

Because principal coordinates are computed using singular values, they contain the information about the spread between the rows (or columns) in the original table.

This ensures the existence of an inner product between the two sets of coordinates, i.e. it leads to meaningful interpretations of their spatial relations in a biplot.[17]

The standard coordinates for the rows are the left singular vectors scaled by the inverse of the square roots of the row masses, and those for the columns are the right singular vectors scaled by the inverse of the square roots of the column masses.

Note that in ecology a scaling 1[15] biplot implies that the rows are in principal and the columns in standard coordinates, while scaling 2 implies that the rows are in standard and the columns in principal coordinates.
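A minimal NumPy sketch of the two scalings follows, assuming the usual CA preprocessing (proportions, masses, weighted centred matrix, singular value decomposition). All names and the example table are illustrative assumptions. The sketch checks that pairing principal with standard coordinates yields an inner product that reproduces the relative deviations of the table from the product of its masses, whereas pairing principal with principal coordinates does not.

```python
import numpy as np

# Hypothetical 3x3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 15.0]])

P = C / C.sum()
r, c = P.sum(axis=1), P.sum(axis=0)              # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
V = Vt.T

# Standard coordinates: singular vectors scaled by the masses only.
row_standard = U / np.sqrt(r)[:, None]
col_standard = V / np.sqrt(c)[:, None]

# Principal coordinates: standard coordinates times the singular values.
row_principal = row_standard * sigma
col_principal = col_standard * sigma

# Scaling 1: rows principal, columns standard.
inner_scaling1 = row_principal @ col_standard.T
# Scaling 2: rows standard, columns principal.
inner_scaling2 = row_standard @ col_principal.T
# Both coordinate sets principal: no exact inner product relation.
inner_both_principal = row_principal @ col_principal.T

# Relative deviation of each cell from the product of its masses.
target = P / np.outer(r, c) - 1.0
```

With all dimensions retained the reconstruction is exact; a two-dimensional biplot uses only the first two columns of each coordinate matrix and therefore approximates `target`.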

The visualization of a CA result always starts with displaying the scree plot of the principal inertia values to evaluate how well the first few singular vectors summarize the spread.
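The quantities behind such a scree plot can be sketched as follows. The function name `principal_inertias`, the example table and the text-based bar display are assumptions chosen to keep the example free of plotting dependencies; a real analysis would typically draw the same values as a bar chart.

```python
import numpy as np

# Hypothetical 3x3 contingency table (illustrative counts only).
C = np.array([[20.0, 10.0, 5.0],
              [10.0, 15.0, 10.0],
              [5.0, 10.0, 15.0]])


def principal_inertias(C):
    """Squared singular values of the weighted, centred table."""
    P = C / C.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    sigma = np.linalg.svd(S, compute_uv=False)
    # At most min(rows, columns) - 1 dimensions carry inertia.
    k = min(C.shape) - 1
    return sigma[:k] ** 2


inertias = principal_inertias(C)
shares = inertias / inertias.sum()

# Text scree plot: one bar per dimension, length proportional to its share.
for i, share in enumerate(shares, start=1):
    print(f"dim {i}: {share:6.1%} " + "#" * int(round(50 * share)))
```

A steep drop after the first one or two bars suggests that a two-dimensional display captures most of the structure in the table.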

The actual ordination is presented in a graph which could, at first glance, be confused with a complicated scatter plot.

In fact it consists of two scatter plots superimposed on each other, one set of points for the rows and one for the columns.

But because it is a biplot, a clear interpretation rule relates the two coordinate matrices used.

A biplot is in fact a low-dimensional mapping of a part of the information contained in the original table.

Following the French tradition in CA,[18] early CA biplots mapped both entities in the same coordinate version, usually principal coordinates.

This kind of display is misleading: "Although this is called a biplot, it does not have any useful inner product relationship between the row and column scores", as Brian Ripley, maintainer of the R package MASS, correctly points out.[19]

Today that kind of display should be avoided, since non-specialists are usually unaware that the two point sets lack such a relation.

A scaling 1[15] biplot (rows in principal coordinates, columns in standard coordinates) is interpreted according to specific rules.[20]

Several variants of CA are available, including detrended correspondence analysis (DCA) and canonical correspondence analysis (CCA).

In the social sciences, correspondence analysis, and particularly its extension multiple correspondence analysis, became known outside France through its application by the French sociologist Pierre Bourdieu.