Categorical distribution

A categorical distribution (also called a generalized Bernoulli distribution or multinoulli distribution) is a discrete probability distribution that describes the possible results of a random variable that can take on one of K possible categories, with the probability of each category specified separately. There is no innate underlying ordering of these outcomes, but numerical labels are often attached for convenience in describing the distribution (e.g. 1 to K).

The parameters specifying the probabilities of each possible outcome are constrained only by the fact that each must be in the range 0 to 1, and all must sum to 1.

The categorical distribution is the generalization of the Bernoulli distribution for a categorical random variable, i.e. for a discrete variable with more than two possible outcomes, such as the roll of a die.

On the other hand, the categorical distribution is a special case of the multinomial distribution, in that it gives the probabilities of potential outcomes of a single drawing rather than multiple drawings.

In some fields, such as machine learning and natural language processing, the categorical and multinomial distributions are conflated, and it is common to speak of a "multinomial distribution" when a "categorical distribution" would be more precise.[2] This imprecise usage stems from the fact that it is sometimes convenient to express the outcome of a categorical distribution as a "1-of-K" vector (a vector with one element containing a 1 and all other elements containing a 0) rather than as an integer in the range 1 to K; in this form, a categorical distribution is equivalent to a multinomial distribution for a single observation (see below).

For example, in a Dirichlet-multinomial distribution, which arises commonly in natural language processing models (although not usually with this name) as a result of collapsed Gibbs sampling where Dirichlet distributions are collapsed out of a hierarchical Bayesian model, it is very important to distinguish categorical from multinomial.

Both forms have very similar-looking probability mass functions (PMFs), which both make reference to multinomial-style counts of nodes in a category. However, the multinomial-style PMF carries an extra factor, a multinomial coefficient, which is a constant equal to 1 in the categorical-style PMF.

Confusing the two can easily lead to incorrect results in settings where this extra factor is not constant with respect to the distributions of interest.

The factor is frequently constant in the complete conditionals used in Gibbs sampling and the optimal distributions in variational methods.

In one formulation of the distribution, the sample space is taken to be a finite sequence of integers.

The exact integers used as labels are unimportant; they might be {0, 1, ..., k − 1} or {1, 2, ..., k} or any other arbitrary set of values.

In the following descriptions, we use {1, 2, ..., k} for convenience, although this disagrees with the convention for the Bernoulli distribution, which uses {0, 1}.
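With labels {1, 2, ..., k}, the probability mass function in this formulation can be written compactly as

f(x = i \mid \mathbf{p}) = p_i,

where \mathbf{p} = (p_1, \ldots, p_k) is the parameter vector, p_i is the probability of seeing element i, and \sum_{i=1}^{k} p_i = 1.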

Another formulation that appears more complex but facilitates mathematical manipulations is as follows, using the Iverson bracket:[3]

f(x \mid \mathbf{p}) = \prod_{i=1}^{k} p_i^{[x = i]},

where [x = i] evaluates to 1 if x = i and to 0 otherwise.

In Bayesian statistics, the Dirichlet distribution is the conjugate prior distribution of the categorical distribution (and also of the multinomial distribution). This means that in a model consisting of a data point having a categorical distribution with unknown parameter vector p, where (in standard Bayesian style) we choose to treat this parameter as a random variable and give it a prior distribution defined using a Dirichlet distribution, the posterior distribution of the parameter, after incorporating the knowledge gained from the observed data, is also a Dirichlet distribution.

Intuitively, in such a case, starting from what is known about the parameter prior to observing the data point, knowledge can then be updated based on the data point, yielding a new distribution of the same form as the old one.

As such, knowledge of a parameter can be successively updated by incorporating new observations one at a time, without running into mathematical difficulties.

Formally, this can be expressed as follows. Given a model

\boldsymbol\alpha = (\alpha_1, \ldots, \alpha_K) \quad \text{(concentration hyperparameters)}
\mathbf{p} \mid \boldsymbol\alpha \sim \operatorname{Dir}(K, \boldsymbol\alpha)
\mathbb{X} = (x_1, \ldots, x_N), \quad x_n \mid \mathbf{p} \sim \operatorname{Cat}(K, \mathbf{p}),

then the following holds:[2]

\mathbf{p} \mid \mathbb{X}, \boldsymbol\alpha \sim \operatorname{Dir}(K, \mathbf{c} + \boldsymbol\alpha) = \operatorname{Dir}(K, c_1 + \alpha_1, \ldots, c_K + \alpha_K),

where c_i denotes the number of observations x_n falling in category i. This relationship is used in Bayesian statistics to estimate the underlying parameter p of a categorical distribution given a collection of N samples.

Intuitively, we can view the hyperprior vector α as pseudocounts, i.e. as representing the number of observations in each category that we have already seen.

Then we simply add in the counts for all the new observations (the vector c) in order to derive the posterior distribution.

For example, if the observed data contain three categories in the ratio 40:5:55, then, ignoring the effect of the prior distribution, the true parameter (i.e. the true, underlying distribution that generated our observed data) would be expected to have the average value of (0.40, 0.05, 0.55), which is indeed what the posterior reveals.
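As an illustrative sketch (the flat prior and the counts below are hypothetical values chosen to match this example), the conjugate update amounts to a few lines of Python:

# Dirichlet-categorical conjugate update: the posterior parameters are
# the prior pseudocounts plus the observed counts in each category.
alpha = [1.0, 1.0, 1.0]      # hypothetical flat Dirichlet prior
counts = [40, 5, 55]         # observed counts per category (ratio 40:5:55)

posterior = [a + c for a, c in zip(alpha, counts)]   # parameters of Dir(41, 6, 56)

# Posterior mean: E[p_i | data] = (c_i + alpha_i) / (N + sum(alpha))
total = sum(posterior)
posterior_mean = [b / total for b in posterior]
print(posterior_mean)        # approximately [0.398, 0.058, 0.544]

With more data, or with a weaker prior, the posterior mean approaches the empirical proportions (0.40, 0.05, 0.55).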

Technically, the interpretation of α as pseudocounts is slightly off: a Dirichlet prior with parameters (α_1, ..., α_K) corresponds to α_i − 1 prior observations of category i rather than α_i. This can be seen from the fact that the Dirichlet distribution with every α_i = 1 has a completely flat shape, essentially a uniform distribution over the simplex of possible values of p. Logically, a flat distribution of this sort represents total ignorance, corresponding to no observations of any sort. However, the mathematical updating of the posterior works fine if we ignore the extra −1 term and simply think of the α vector as directly representing a set of pseudocounts.

The maximum-a-posteriori estimate of the parameter p in the above model is simply the mode of the posterior Dirichlet distribution, i.e.[2]

\arg\max_{\mathbf{p}} \; p(\mathbf{p} \mid \mathbb{X}): \quad p_i = \frac{c_i + \alpha_i - 1}{N + \sum_{k} \alpha_k - K},

which is well defined only when c_i + α_i > 1 for every category i. In many practical applications, the only way to guarantee this condition is to set α_i > 1 for all i.
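Continuing the hypothetical numbers from the sketch above, but with a prior of α_i = 2 for every category so that the condition holds, the MAP estimate can be computed as:

# MAP estimate: mode of the posterior Dirichlet,
# p_i = (c_i + alpha_i - 1) / (N + sum(alpha) - K)
alpha = [2.0, 2.0, 2.0]      # hypothetical prior with every alpha_i > 1
counts = [40, 5, 55]
N, K = sum(counts), len(counts)

denom = N + sum(alpha) - K
p_map = [(c + a - 1) / denom for c, a in zip(counts, alpha)]
print(p_map)                 # approximately [0.398, 0.058, 0.544]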

The posterior predictive distribution of a new observation in the above model is the distribution that a new observation \tilde{x} would take, given the set \mathbb{X} of N categorical observations. As shown in the Dirichlet-multinomial distribution article, it has a very simple form:[2]

p(\tilde{x} = i \mid \mathbb{X}, \boldsymbol\alpha) = \frac{c_i + \alpha_i}{N + \sum_k \alpha_k} = \operatorname{E}[p_i \mid \mathbb{X}, \boldsymbol\alpha] \;\propto\; c_i + \alpha_i.

There are various relationships among this formula and the previous ones: the posterior predictive probability of seeing a particular category equals the relative proportion of previous observations of that category (counting the pseudo-observations of the prior), and it equals the expected value of the posterior distribution of p_i. The reason for the equivalence between posterior predictive probability and the expected value of the posterior distribution of p is evident with re-examination of the above formula.

Since the predictive probability of a category grows with the number of observations already seen in that category, each observation makes future observations of the same category more likely; this type of scenario is often termed a preferential attachment (or "rich get richer") model.

In networks that include categorical variables with Dirichlet priors (e.g. mixture models and models including mixture components), the Dirichlet distributions are often "collapsed out" (marginalized out) of the network, which introduces dependencies among the various categorical nodes dependent on a given prior (specifically, their joint distribution is a Dirichlet-multinomial distribution).

The conditional distribution of one categorical node given all the others then assumes an extremely simple form:

p(x_n = i \mid \mathbb{X}^{(-n)}, \boldsymbol\alpha) \;\propto\; c_i^{(-n)} + \alpha_i,

where c_i^{(-n)} is the number of nodes having category i among the nodes other than node n.

There are a number of ways to sample from a categorical distribution, but the most common uses a type of inverse transform sampling. Assume a distribution is expressed as "proportional to" some expression, with unknown normalizing constant. One first normalizes the weights so that they sum to 1 and forms their cumulative sums; each sample is then drawn by picking a uniformly distributed number between 0 and 1 and returning the first category whose cumulative probability reaches that number.
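A minimal sketch of this procedure in Python (the function name and the example weights are illustrative, not from the original text); scaling the uniform draw by the total weight is equivalent to normalizing the weights first:

import random

def sample_categorical(weights, rng=random):
    """Inverse transform sampling from unnormalized category weights."""
    total = sum(weights)              # unknown normalizing constant
    u = rng.random() * total          # uniform draw on [0, total)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w               # running cumulative sum
        if u < cumulative:
            return i                  # first category whose cumulative weight exceeds u
    return len(weights) - 1           # guard against floating-point round-off

# Example: draw from a distribution known only up to proportionality.
draws = [sample_categorical([4.0, 0.5, 5.5]) for _ in range(10000)]
print([draws.count(i) / len(draws) for i in range(3)])   # roughly [0.40, 0.05, 0.55]

For repeated sampling from the same distribution, the cumulative sums can be precomputed once and searched with binary search.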

In machine learning it is typical to parametrize the categorical distribution p via an unconstrained representation in \mathbb{R}^k, whose components are given by

\gamma_i = \log p_i + C,

where C is an arbitrary constant. Given this representation, p can be recovered using the softmax function,

p_i = \frac{\exp(\gamma_i)}{\sum_{j=1}^{k} \exp(\gamma_j)},

which can then be sampled using the techniques described above.
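As a brief sketch (reusing the hypothetical sample_categorical helper from above; the scores are illustrative), recovering p from unconstrained parameters and then sampling might look as follows:

import math

def softmax(gamma):
    """Map unconstrained scores to a probability vector."""
    m = max(gamma)                    # subtract the maximum for numerical stability
    exps = [math.exp(g - m) for g in gamma]
    total = sum(exps)
    return [e / total for e in exps]

gamma = [1.2, -0.3, 0.5]    # hypothetical unconstrained parameters
p = softmax(gamma)          # probabilities summing to 1
x = sample_categorical(p)   # sample using the inverse transform sampler above
print(p, x)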

The possible probabilities for the categorical distribution with k = 3 are the points of the 2-simplex p_1 + p_2 + p_3 = 1, embedded in 3-space.