Note that expectation maximization applied to such a model will typically fail to produce realistic results, due (among other things) to the excessive number of parameters.
Typically two sorts of additional components are added to the model:

The following example is based on an example in Christopher M. Bishop, Pattern Recognition and Machine Learning.
The financial example above is one direct application of the mixture model, a situation in which we assume an underlying mechanism so that each observation belongs to one of a number of different sources or categories.
A multivariate Gaussian mixture model is used to cluster the feature data into k groups, where each group corresponds to one state of the machine.
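As a minimal sketch of this kind of clustering, the snippet below fits a Gaussian mixture with scikit-learn's GaussianMixture to synthetic two-dimensional feature vectors; the feature values and the choice of k = 3 machine states are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical 2-D vibration features for three assumed machine states.
features = np.vstack([
    rng.normal([0.2, 1.0], 0.05, size=(100, 2)),
    rng.normal([0.8, 1.4], 0.07, size=(100, 2)),
    rng.normal([1.5, 0.6], 0.10, size=(100, 2)),
])

k = 3  # assumed number of machine states
gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
labels = gmm.fit_predict(features)

print(gmm.means_)           # one estimated mean vector per inferred state
print(np.bincount(labels))  # number of observations assigned to each state
```

In practice k is often not known in advance and is chosen by a model-selection criterion rather than fixed as above.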
Combined with other analytic or geometric tools (e.g., phase transitions over diffusive boundaries), such spatially regularized mixture models could lead to more realistic and computationally efficient segmentation methods.
State-of-the-art methods include, for example, coherent point drift (CPD)[9] and Student's t-distribution mixture models (TMM).[10] The results of recent research demonstrate the superiority of hybrid mixture models[11] (e.g. combining a Student's t-distribution and a Watson/Bingham distribution to model spatial positions and axis orientations separately) compared to CPD and TMM, in terms of inherent robustness, accuracy and discriminative capacity.
Some notable departures are the graphical methods outlined in Tarter and Lock,[12] more recently the minimum message length (MML) techniques such as Figueiredo and Jain,[13] and, to some extent, the moment matching pattern analysis routines suggested by McWilliam and Loh (2009).[14] Expectation maximization (EM) is seemingly the most popular technique used to determine the parameters of a mixture with an a priori given number of components.
Dempster[15] also showed that each successive EM iteration will not decrease the likelihood, a property not shared by other gradient-based maximization techniques. Moreover, EM naturally embeds constraints on the probability vector and, for sufficiently large sample sizes, positive definiteness of the covariance iterates.
This is a key advantage since explicitly constrained methods incur extra computational costs to check and maintain appropriate values.
Redner and Walker (1984)[full citation needed] make this point, arguing in favour of superlinear and second-order Newton and quasi-Newton methods and reporting slow convergence of EM on the basis of their empirical tests.[16] Other common objections to the use of EM are that it has a propensity to spuriously identify local maxima, as well as displaying sensitivity to initial values.
Figueiredo and Jain[13] note that convergence to 'meaningless' parameter values obtained at the boundary (where regularity conditions break down, e.g., Ghosh and Sen (1985)) is frequently observed when the number of model components exceeds the optimal/true one.
On this basis they suggest a unified approach to estimation and identification in which the initial number of components n is chosen to greatly exceed the expected optimal value.
Their optimization routine is constructed via a minimum message length (MML) criterion that effectively eliminates a candidate component if there is insufficient information to support it.
The component model parameters θi are also calculated by expectation maximization using data points xj that have been weighted using the membership values.
For example, if θ is a mean μ, the maximization step takes the membership-weighted average of the data, μi = Σj wij xj / Σj wij, where wij denotes the membership weight of data point xj in component i. With new estimates for the ai and the θi's, the expectation step is repeated to recompute new membership values.
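A minimal, self-contained sketch of this alternating scheme for a one-dimensional two-component Gaussian mixture is given below; the synthetic data, initialisation and number of iterations are assumptions made for illustration, not taken from any particular reference.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic 1-D data from two components (purely illustrative).
x = np.concatenate([rng.normal(-2.0, 0.7, 200), rng.normal(2.5, 1.2, 300)])

k = 2
a = np.full(k, 1.0 / k)      # mixture weights a_i
mu = rng.choice(x, size=k)   # component means, initialised from the data
var = np.full(k, x.var())    # component variances

for _ in range(200):
    # Expectation step: w[j, i] is the membership weight of point x[j] in component i.
    dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    w = a * dens
    w /= w.sum(axis=1, keepdims=True)

    # Maximization step: weights, then membership-weighted means and variances.
    nk = w.sum(axis=0)
    a = nk / len(x)
    mu = (w * x[:, None]).sum(axis=0) / nk
    var = (w * (x[:, None] - mu) ** 2).sum(axis=0) / nk

print(a, mu, np.sqrt(var))
```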
As an alternative to the EM algorithm, the mixture model parameters can be deduced using posterior sampling as indicated by Bayes' theorem.
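As a hedged sketch of what such posterior sampling can look like, the following Gibbs sampler alternately draws component assignments, mixture weights and component means for a one-dimensional two-component Gaussian mixture with known component variance; the conjugate priors (Dirichlet on the weights, normal on the means), the data and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data from two components with known variance sigma2 (illustrative).
sigma2 = 1.0
x = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 100)])
k = 2

# Assumed priors: Dirichlet(alpha) on the weights, Normal(0, tau2) on each mean.
alpha, tau2 = np.ones(k), 10.0

weights = np.full(k, 1.0 / k)
means = rng.normal(0.0, 1.0, k)

for _ in range(2000):
    # 1. Sample each point's component assignment given the weights and means.
    log_p = np.log(weights) - 0.5 * (x[:, None] - means) ** 2 / sigma2
    p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    z = np.array([rng.choice(k, p=row) for row in p])

    # 2. Sample the weights from their Dirichlet posterior given the counts.
    counts = np.bincount(z, minlength=k)
    weights = rng.dirichlet(alpha + counts)

    # 3. Sample each mean from its conjugate normal posterior.
    for i in range(k):
        xi = x[z == i]
        prec = 1.0 / tau2 + len(xi) / sigma2
        means[i] = rng.normal(xi.sum() / sigma2 / prec, np.sqrt(1.0 / prec))

print("one posterior draw of the means:", np.sort(means))
```

Retaining the draws from many iterations (after a burn-in period) gives samples from the posterior over the mixture parameters rather than a single point estimate.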
The method of moment matching is one of the oldest techniques for determining the mixture parameters, dating back to Karl Pearson's seminal work of 1894.[20] McWilliam and Loh (2009) consider the characterisation of a hyper-cuboid normal mixture copula in large-dimensional systems for which EM would be computationally prohibitive.
Here a pattern analysis routine is used to generate multivariate tail-dependencies consistent with a set of univariate and (in some sense) bivariate moments.
The performance of this method is then evaluated using equity log-return data with Kolmogorov–Smirnov test statistics suggesting a good descriptive fit.
Spectral methods of learning mixture models are based on the use of the singular value decomposition (SVD) of a matrix that contains the data points.
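A minimal sketch of this general idea (not any specific published spectral algorithm) is to project the centred data matrix onto its leading right singular vectors and then cluster in the reduced space; the synthetic data, dimensions and choice of k below are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Synthetic data from an assumed 3-component mixture in 50 dimensions.
k, d = 3, 50
centers = rng.normal(0.0, 4.0, size=(k, d))
X = np.vstack([rng.normal(c, 1.0, size=(200, d)) for c in centers])

# SVD of the centred data matrix; keep the top-k right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_proj = Xc @ Vt[:k].T

# Cluster in the low-dimensional projection, where the components separate more easily.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_proj)
print(np.bincount(labels))
```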
Tarter and Lock[12] describe a graphical approach to mixture identification in which a kernel function is applied to an empirical frequency plot so as to reduce intra-component variance.
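A loose analogue of this idea, using SciPy's gaussian_kde rather than Tarter and Lock's specific procedure, smooths the empirical frequency plot with a kernel and inspects the local modes of the result; the data and bandwidth below are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical ratio measurements drawn from a two-component mixture.
ratios = np.concatenate([rng.normal(0.58, 0.02, 400), rng.normal(0.64, 0.02, 600)])

# Kernel-smoothed version of the empirical frequency plot.
kde = gaussian_kde(ratios, bw_method=0.2)
grid = np.linspace(ratios.min(), ratios.max(), 500)
density = kde(grid)

# Local maxima of the smoothed density suggest the number of components.
is_peak = (density[1:-1] > density[:-2]) & (density[1:-1] > density[2:])
print(grid[1:-1][is_peak])
```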
Mixture distributions and the problem of mixture decomposition, that is, the identification of its constituent components and the parameters thereof, have been cited in the literature as far back as 1846 (Quetelet in McLachlan,[17] 2000), although common reference is made to the work of Karl Pearson (1894)[21] as the first author to explicitly address the decomposition problem in characterising non-normal attributes of forehead-to-body-length ratios in female shore crab populations.
The motivation for this work was provided by the zoologist Walter Frank Raphael Weldon who had speculated in 1893 (in Tarter and Lock[12]) that asymmetry in the histogram of these ratios could signal evolutionary divergence.
While his work was successful in identifying two potentially distinct sub-populations and in demonstrating the flexibility of mixtures as a moment matching tool, the formulation required the solution of a 9th-degree (nonic) polynomial, which at the time posed a significant computational challenge.
Subsequent works focused on addressing these problems, but it was not until the advent of the modern computer and the popularisation of maximum likelihood estimation (MLE) parameterisation techniques that research really took off.