Dirichlet process

A bag of dice manufactured using a crude process 100 years ago will likely have probabilities that deviate wildly from the uniform pmf, whereas a bag of state-of-the-art dice used by Las Vegas casinos may have barely perceptible imperfections.

A particularly important application of Dirichlet processes is as a prior probability distribution in infinite mixture models.

[2] It has since been applied in data mining and machine learning, including natural language processing, computer vision and bioinformatics.

Dirichlet processes are usually used when modelling data that tends to repeat previous values in a so-called "rich get richer" fashion.

This fact can be shown by calculating the joint probability distribution of the observations and noticing that the resulting formula only depends on which values occur in the observations and how many repetitions each of them has; in other words, the observations are exchangeable.

Because of this exchangeability, de Finetti's representation theorem applies and it implies that the observations $X_1, X_2, \ldots$ are conditionally independent given a (latent) distribution $P$. This $P$ is itself a random variable, and its distribution (a distribution over distributions) is the Dirichlet process.

In summary, this means that we get an equivalent procedure to the above algorithm: first draw a distribution $P \sim \operatorname{DP}(H, \alpha)$, then draw observations $X_1, X_2, \ldots$ independently from $P$. In practice, however, drawing a concrete distribution $P$ is impossible, since its full specification requires an infinite amount of information.

This is a common phenomenon in the context of Bayesian non-parametric statistics where a typical task is to learn distributions on function spaces, which involve effectively infinitely many parameters.

The key insight is that in many applications the infinite-dimensional distributions appear only as an intermediary computational device and are not required for either the initial specification of prior beliefs or for the statement of the final inference.

Given a measurable set $S$, a base probability distribution $H$ and a positive real number $\alpha$, the Dirichlet process $\operatorname{DP}(H, \alpha)$ is a stochastic process whose sample path (i.e. a realisation of the process) is a probability distribution over $S$, such that the following holds: for any measurable finite partition of $S$, denoted $\{B_i\}_{i=1}^{n}$, if $X \sim \operatorname{DP}(H, \alpha)$ then $(X(B_1), \ldots, X(B_n)) \sim \operatorname{Dir}(\alpha H(B_1), \ldots, \alpha H(B_n))$, where $\operatorname{Dir}$ denotes the Dirichlet distribution.
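As a sanity check on this definition, here is a minimal simulation sketch (the standard-normal base distribution, the set $B = (0, \infty)$ and the truncation level are our illustrative assumptions; it uses the stick-breaking representation introduced below). For a single set $B$, the definition reduces to $X(B) \sim \operatorname{Beta}(\alpha H(B), \alpha(1 - H(B)))$, whose moments the simulation reproduces:

```python
# Illustrative check of the defining property for one set B: if X ~ DP(H, alpha),
# the random mass X(B) is Beta(alpha*H(B), alpha*(1 - H(B))). Here H = N(0, 1)
# and B = (0, inf), so H(B) = 0.5.
import numpy as np

rng = np.random.default_rng(0)
alpha = 5.0

def mass_of_B(n_sticks=2000):
    """Mass that one truncated stick-breaking draw from DP(N(0,1), alpha) puts on (0, inf)."""
    betas = rng.beta(1.0, alpha, size=n_sticks)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * stick_left            # pi_k = beta_k * prod_{i<k}(1 - beta_i)
    atoms = rng.normal(size=n_sticks)       # atom locations theta_k ~ H
    return weights[atoms > 0.0].sum()

masses = np.array([mass_of_B() for _ in range(4000)])
print(masses.mean())                          # should be close to H(B) = 0.5
print(masses.var(), 1 / (4 * (alpha + 1)))    # Beta(alpha/2, alpha/2) variance
```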

Additionally, in the Chinese restaurant process view, a customer opens a new table with a probability proportional to the scaling parameter $\alpha$.
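To make the seating rule concrete, here is a minimal sketch in Python (the function name and the choice of returning table occupancy counts are ours, not the article's):

```python
# Minimal Chinese-restaurant-process sketch: customer n+1 joins table k with
# probability count_k / (n + alpha) and opens a new table with
# probability alpha / (n + alpha).
import random

def chinese_restaurant(n_customers, alpha, seed=0):
    rng = random.Random(seed)
    tables = []                      # tables[k] = customers seated at table k
    for n in range(n_customers):
        r = rng.uniform(0, n + alpha)
        acc = 0.0
        for k, count in enumerate(tables):
            acc += count
            if r < acc:
                tables[k] += 1       # join an occupied table
                break
        else:
            tables.append(1)         # open a new table (the alpha share)
    return tables

print(chinese_restaurant(100, alpha=2.0))  # typically a few big tables, many small ones
```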

Because customers sit at a table with a probability proportional to the number of customers already sitting at the table, two properties of the DP can be deduced: first, the DP exhibits a self-reinforcing property, in that the more often a given value has been sampled in the past, the more likely it is to be sampled again; second, even if $H$ is a distribution over an uncountable set, there is a nonzero probability that two samples have exactly the same value, because the probability mass concentrates on a small number of tables.

A third approach to the Dirichlet process is the so-called stick-breaking process view.

Conceptually, this involves repeatedly breaking off and discarding a random fraction (sampled from a Beta distribution) of a "stick" that is initially of length 1.

Formally, a draw from $\operatorname{DP}(H, \alpha)$ can be written as $f = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k}$, where the atom locations $\theta_k$ are i.i.d. draws from $H$, $\delta_{\theta_k}$ is the point mass at $\theta_k$, and the weights are $\pi_k = \beta_k \prod_{i=1}^{k-1} (1 - \beta_i)$ with $\beta_k \sim \operatorname{Beta}(1, \alpha)$. The formula can be understood by noting that after the first $k - 1$ values have their portions assigned, the length of the remainder of the stick is $\prod_{i=1}^{k-1}(1 - \beta_i)$, and this piece is broken according to $\beta_k$, with the broken-off part assigned to $\pi_k$. The smaller $\alpha$ is, the less of the stick will be left for subsequent values (on average), yielding more concentrated distributions.
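The construction translates directly into code; here is a minimal sketch (the truncation level and the standard-normal base distribution in the example are our illustrative assumptions), of the kind used to produce the draws shown in the figures at the end of this section:

```python
# Truncated stick-breaking draw from DP(H, alpha): returns the weights pi_k
# and atom locations theta_k of one approximate realisation.
import numpy as np

def stick_breaking_draw(alpha, base_sampler, n_sticks=1000, seed=None):
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_sticks)            # beta_k ~ Beta(1, alpha)
    stick_left = np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    weights = betas * stick_left                           # pi_k = beta_k * prod_{i<k}(1 - beta_i)
    atoms = base_sampler(rng, n_sticks)                    # theta_k ~ H, i.i.d.
    return weights, atoms

# Example with H = N(0, 1): small alpha concentrates the mass on a few atoms,
# large alpha spreads it over many.
w, a = stick_breaking_draw(alpha=1.0, base_sampler=lambda r, n: r.normal(size=n), seed=0)
print(w[:5].round(3), "total mass kept:", w.sum().round(4))
```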

Furthermore, when we draw a black ball, if rather than generating a new colour we instead pick a random value from a base distribution $H$ and use that value to label the new ball, the resulting distribution over labels is the same as the distribution over values in a Dirichlet process.
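A hedged sketch of this urn view (often called the Blackwell–MacQueen urn scheme; the standard-normal base distribution is our illustrative assumption): each new value copies a previously drawn value with probability proportional to its count, or is a fresh draw from $H$ (the "black ball" case) with probability proportional to $\alpha$.

```python
# Urn-scheme predictive sampling: draws a sequence of values whose joint
# distribution matches observations from a Dirichlet process.
import numpy as np

def urn_draws(n, alpha, rng):
    values = []
    for i in range(n):
        if rng.uniform(0, i + alpha) < alpha:
            values.append(rng.normal())             # black ball: fresh label from H
        else:
            values.append(values[rng.integers(i)])  # copy an existing value
    return values

draws = urn_draws(20, alpha=1.0, rng=np.random.default_rng(0))
print(np.round(draws, 3))   # note the exact repeats ("rich get richer")
```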

In this section, we consider the model $X_1, \ldots, X_n \mid P \sim P$ (independently and identically distributed), with $P \sim \operatorname{DP}(H, \alpha)$. The Dirichlet process distribution satisfies prior conjugacy, posterior consistency, and the Bernstein–von Mises theorem.
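Concretely, prior conjugacy means that the posterior over $P$ is again a Dirichlet process; in the (base distribution, concentration) parametrisation used above, the standard result reads

$P \mid X_1, \ldots, X_n \sim \operatorname{DP}\left(\frac{\alpha H + \sum_{i=1}^{n} \delta_{X_i}}{\alpha + n},\; \alpha + n\right),$

where $\delta_{X_i}$ denotes the point mass at the observation $X_i$: the base distribution is pulled towards the empirical distribution of the data, and the concentration grows with the sample size.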

A reasonable model for this situation might be to classify each voter as a liberal, a conservative or a moderate and then model the event that a voter says "Yes" to any particular question as a Bernoulli random variable with the probability dependent on which political cluster they belong to.

By looking at how votes were cast in previous years on similar pieces of legislation one could fit a predictive model using a simple clustering algorithm such as k-means.

That algorithm, however, requires knowing in advance the number of clusters that generated the data.

In many situations, it is not possible to determine this ahead of time, and even when we can reasonably assume a number of clusters we would still like to be able to check this assumption.

In this case it is far from obvious how to determine a priori how many clusters (of common velocities) there should be and any model for this would be highly suspect and should be checked against the data.

We assume that we have no initial information distinguishing the clusters, which is captured by the symmetric prior $\operatorname{Dir}(\alpha/K, \ldots, \alpha/K)$ over the mixing weights of the $K$ candidate clusters.
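For illustration, here is a hedged sketch of sampling from such a finite mixture under the symmetric prior (the Gaussian component family, $K = 50$ and all numeric values are our assumptions): with $\alpha/K$ small, only a handful of the $K$ available clusters receive appreciable weight, which foreshadows the $K \to \infty$ limit discussed next.

```python
# Finite mixture with a symmetric Dirichlet prior on the mixing weights.
import numpy as np

rng = np.random.default_rng(0)
K, alpha, n = 50, 1.0, 500

weights = rng.dirichlet(np.full(K, alpha / K))   # Dir(alpha/K, ..., alpha/K)
means = rng.normal(0.0, 3.0, size=K)             # cluster parameters from a base H = N(0, 9)
z = rng.choice(K, size=n, p=weights)             # cluster assignment per observation
x = rng.normal(means[z], 1.0)                    # observations from the assigned cluster

print(len(np.unique(z)), "of", K, "clusters actually used")
```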

To understand the connection to Dirichlet process priors we rewrite this model in an equivalent but more suggestive form: instead of imagining that each data point is first assigned a cluster and then drawn from the distribution associated to that cluster, we now think of each observation as being associated with a parameter $\bar{\theta}_i$ drawn from some discrete distribution $G$ whose support is the $K$ cluster parameters.

In the limit $K \to \infty$, $G$ becomes a random discrete distribution with countably infinite support, and this is exactly what is accomplished by the model $G \sim \operatorname{DP}(H, \alpha)$, with $\bar{\theta}_i \mid G \sim G$ and $x_i \mid \bar{\theta}_i \sim F(\bar{\theta}_i)$, where $F(\theta)$ denotes the observation distribution with parameter $\theta$. With this in hand we can better understand the computational merits of the Dirichlet process.

In the infinite dimensional case it is obviously impossible to write down the posterior explicitly.
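Inference remains tractable nonetheless, because the predictive distribution implied by the posterior is finite-dimensional; by the conjugacy stated above it is the standard Blackwell–MacQueen rule

$X_{n+1} \mid X_1, \ldots, X_n \sim \frac{1}{\alpha + n}\left(\alpha H + \sum_{i=1}^{n} \delta_{X_i}\right),$

so observations can be sampled, and posteriors explored, without ever representing the infinite-dimensional $P$ explicitly.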

The Dirichlet process has also been used for developing mixture-of-experts models, in the context of supervised learning algorithms (regression or classification settings).

[8][9] As draws from a Dirichlet process are discrete, an important use is as a prior probability in infinite mixture models.

The generative process is therefore that a sample is drawn from a Dirichlet process, and for each data point, in turn, a value is drawn from this sample distribution and used as the component distribution for that data point.
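The following is a hedged sketch of that generative process, using the equivalent restaurant-style predictive rule so that no infinite object has to be represented (the Gaussian components and all parameter values are illustrative assumptions, not from the article):

```python
# Generative sketch of a Dirichlet process mixture via the predictive rule.
import numpy as np

rng = np.random.default_rng(0)
alpha, n = 1.0, 1000

component_means = []   # one mean per mixture component discovered so far
counts = []            # points already assigned to each component
data = []

for i in range(n):
    # New component with prob. alpha/(i + alpha); otherwise an existing
    # component with prob. proportional to its current count.
    if rng.uniform(0, i + alpha) < alpha:
        component_means.append(rng.normal(0.0, 3.0))   # new mean from base H = N(0, 9)
        counts.append(0)
        k = len(counts) - 1
    else:
        k = rng.choice(len(counts), p=np.array(counts) / i)
    counts[k] += 1
    data.append(rng.normal(component_means[k], 1.0))   # observation from component k

print(len(counts), "components; largest:", sorted(counts, reverse=True)[:5])
```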

For instance, Bayesian nonparametric versions of the Wilcoxon rank-sum test and the Wilcoxon signed-rank test have been developed by using the imprecise Dirichlet process, a prior ignorance Dirichlet process.

Draws from the Dirichlet process. The four rows use different values of $\alpha$ (top to bottom: 1, 10, 100 and 1000) and each row contains three repetitions of the same experiment. As seen from the graphs, draws from a Dirichlet process are discrete distributions and they become less concentrated (more spread out) with increasing $\alpha$. The graphs were generated using the stick-breaking process view of the Dirichlet process.
Animation of a Chinese restaurant process with scaling parameter $\alpha$. Tables are hidden once the customers of a table can no longer be displayed; however, every table has infinitely many seats. (Recording of an interactive animation.[3])
Simulation of 1000 observations drawn from a Dirichlet mixture model. Each observation within a cluster is drawn independently from a multivariate normal distribution. The cluster means are drawn from a distribution $G$ which itself is drawn from a Dirichlet process with concentration parameter $\alpha$ and base distribution $H$. Each row is a new simulation.
Animation of the clustering process for one-dimensional data using Gaussian distributions drawn from a Dirichlet process. The histograms of the clusters are shown in different colours. During the parameter estimation process, new clusters are created and grow as more data arrive. The legend shows the cluster colours and the number of data points assigned to each cluster.