Model-based clustering

Standard statistical model selection criteria such as the Bayesian information criterion (BIC) can then be used to choose the number of mixture components.[4]

The integrated completed likelihood (ICL)[5] is a different criterion, designed to choose the number of clusters rather than the number of mixture components in the model; these will often be different if highly non-Gaussian clusters are present.
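As an illustration of BIC-based selection (a sketch in Python with scikit-learn on synthetic data, not the software discussed later in this article; scikit-learn's sign convention makes lower BIC better):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two well-separated Gaussian clusters (illustrative only).
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
    rng.normal(loc=[4, 4], scale=0.5, size=(100, 2)),
])

# Fit mixtures with 1..5 components and score each by BIC.
bics = {}
for g in range(1, 6):
    gm = GaussianMixture(n_components=g, covariance_type="full",
                         random_state=0).fit(X)
    bics[g] = gm.bic(X)

best_g = min(bics, key=bics.get)
print(best_g)  # the BIC-selected number of components
```

Here BIC balances fit against the number of estimated parameters, so it stops rewarding extra components once they no longer improve the likelihood enough.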

However, using a full covariance matrix for each mixture component requires the estimation of many parameters, which can result in a loss of precision, generalizability and interpretability.

Thus it is common to use more parsimonious component covariance matrices, exploiting their geometric interpretation.

Gaussian clusters are ellipsoidal, with their volume, shape and orientation determined by the covariance matrix.[9][10]

Different Gaussian model-based clustering methods have been developed with an eye to handling high-dimensional data.
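scikit-learn's `covariance_type` options give a coarser analogue of these parsimonious covariance families (the full family of geometric models in packages such as mclust is richer); comparing them by BIC on synthetic spherical data illustrates the trade-off (an illustration, not the article's example):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two spherical clusters: the most parsimonious ("spherical") structure
# should win on BIC, fitting as well as "full" with far fewer parameters.
X = np.vstack([
    rng.normal([0, 0], 0.5, size=(150, 2)),
    rng.normal([5, 0], 0.5, size=(150, 2)),
])

# "spherical" = round clusters, "diag" = axis-aligned ellipsoids,
# "tied" = one shared covariance, "full" = unconstrained per component.
scores = {}
for cov in ["spherical", "diag", "tied", "full"]:
    gm = GaussianMixture(n_components=2, covariance_type=cov,
                         random_state=0).fit(X)
    scores[cov] = gm.bic(X)

best = min(scores, key=scores.get)
print(best)
```

Constraining volume, shape or orientation reduces the parameter count per component, which is exactly what BIC rewards when the constraint matches the data.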

This model fits the data well, because the normal patients have low values of both sspg and insulin, while the distributions of the chemical and overt diabetes groups are elongated, but in different directions.

Thus the volumes, shapes and orientations of the three groups are clearly different, and so the unconstrained model is appropriate, as selected by the model-based clustering method.

The classification was quite accurate, with a 12% error rate as defined by the clinical classification.

One way of modeling outliers in model-based clustering is to include an additional mixture component that is very dispersed, with, for example, a uniform distribution.[20]

A different approach is to use mixtures of complex component densities to represent non-Gaussian clusters.
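The uniform-component idea for outliers can be sketched in one dimension with a short EM routine (my own illustration in NumPy, not code from any cited package): one Gaussian cluster plus a uniform component spanning the observed data range, with points assigned to whichever component explains them better.

```python
import numpy as np

rng = np.random.default_rng(2)
# One tight Gaussian cluster plus a few scattered outliers.
x = np.concatenate([rng.normal(0.0, 1.0, 200), rng.uniform(-20, 20, 10)])

lo, hi = x.min(), x.max()
u_dens = 1.0 / (hi - lo)          # density of the uniform outlier component

# Initial guesses (assumptions for this sketch).
pi, mu, sigma = 0.9, np.median(x), x.std()

for _ in range(100):              # EM iterations
    # E-step: responsibility of the Gaussian component for each point.
    g = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = pi * g / (pi * g + (1 - pi) * u_dens)
    # M-step: update mixing weight and Gaussian parameters.
    pi = r.mean()
    mu = (r * x).sum() / r.sum()
    sigma = np.sqrt((r * (x - mu) ** 2).sum() / r.sum())

outliers = r < 0.5                # points mostly explained by the uniform
print(round(mu, 2), int(outliers.sum()))
```

Because the uniform component absorbs the extreme points, the Gaussian parameter estimates stay close to those of the clean cluster instead of being dragged outward.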

The latent class model assumes that the data arise from a finite mixture model, where within each cluster the variables are independent.

A latent class model for mixed data assumes local independence between the variables.[25]

The simplest model-based clustering approach for multivariate count data is based on finite mixtures with locally independent Poisson distributions, similar to the latent class model.[26]

More flexible approaches include methods based on the multivariate Poisson distribution, the multivariate Poisson-log normal distribution, the integer-valued autoregressive (INAR) model and the Gaussian Cox model.
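The locally independent Poisson mixture can be sketched with a short EM routine (an illustrative NumPy implementation on synthetic rates, not code from the cited packages):

```python
import numpy as np

rng = np.random.default_rng(3)
# Two latent classes of 3-dimensional count data, each with
# independent Poisson margins (rates are illustrative).
lam_true = np.array([[1.0, 2.0, 1.5],
                     [8.0, 6.0, 9.0]])
X = np.vstack([rng.poisson(lam_true[0], size=(150, 3)),
               rng.poisson(lam_true[1], size=(150, 3))])

G = 2
pi = np.full(G, 1.0 / G)
# Stable starting rates: split observations on total count (a heuristic).
tot = X.sum(axis=1)
lam = np.vstack([X[tot <= np.median(tot)].mean(axis=0),
                 X[tot > np.median(tot)].mean(axis=0)])

for _ in range(200):  # EM iterations
    # E-step: log p(x | class g) up to the k! term, which is the same
    # for every class and therefore cancels in the responsibilities.
    logp = X @ np.log(lam).T - lam.sum(axis=1) + np.log(pi)
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: class proportions and per-variable Poisson rates.
    pi = r.mean(axis=0)
    lam = (r.T @ X) / r.sum(axis=0)[:, None]

est = np.sort(lam[:, 0])  # estimated rates for the first variable
print(np.round(est, 1))
```

Local independence means each class's likelihood factorizes over the variables, which is what makes the E-step a simple sum of per-variable Poisson log-densities.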

Model-based clustering has also been developed for sequence data, which consist of sequences of categorical values from a finite set of possibilities, such as life course trajectories.

Rank data consist of ordered lists of objects and arise in voting, education, marketing and other areas.[31]

Network data consist of the presence, absence or strength of connections between individuals or nodes, and are widespread in the social sciences and biology.

Many of these are listed in the CRAN Task View on Cluster Analysis and Finite Mixture Models.[34]

The most used such package is mclust,[35][36] which is used to cluster continuous data and has been downloaded over 8 million times.[37]

The poLCA package[38] clusters categorical data using the latent class model.

The clustMD package[25] clusters mixed data, including continuous, binary, ordinal and nominal variables.[41]

In 1959, Lazarsfeld gave a lecture on latent structure analysis at the University of California, Berkeley, where John H. Wolfe was an M.A. student. This led Wolfe to think about how to do the same thing for continuous data, and in 1965 he did so, proposing the Gaussian mixture model for clustering.[44] However, Wolfe deserves credit as the inventor of model-based clustering for continuous data.

Murtagh and Raftery (1984) developed a model-based clustering method based on the eigenvalue decomposition of the component covariance matrices.[45]

McLachlan and Basford (1988) was the first book on the approach, advancing methodology and sparking interest.[46]

Banfield and Raftery (1993) coined the term "model-based clustering", introduced the family of parsimonious models, described an information criterion for choosing the number of clusters, proposed the uniform model for outliers, and introduced the mclust software.[6]

Celeux and Govaert (1995) showed how to perform maximum likelihood estimation for the models.[7]

Thus, by 1995 the core components of the methodology were in place, laying the groundwork for extensive development since then.

Figure: BIC plot for model-based clustering of diabetes data.
Figure: Model-based classification of diabetes data.