Here the probability vector p is drawn from a Dirichlet distribution with parameter vector \boldsymbol\alpha, and the observation is drawn from a multinomial distribution with probability vector p and number of trials n. The Dirichlet parameter vector captures the prior belief about the situation and can be seen as a pseudocount: observations of each outcome that occur before the actual data is collected.
Another form for this same compound distribution, written more compactly in terms of the beta function B, is as follows:

\Pr(\mathbf{x}\mid\boldsymbol\alpha) = \frac{n\,B(\alpha_0, n)}{\prod_{k\,:\,x_k>0} x_k\,B(\alpha_k, x_k)},

where \alpha_0 = \sum_k \alpha_k and n = \sum_k x_k.
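For readers who want to evaluate this density numerically, the following is a minimal sketch (not part of the article) using NumPy and SciPy; the function name dirichlet_multinomial_logpmf and the example numbers are illustrative only. It evaluates the compound form above through log-gamma functions for numerical stability.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_logpmf(x, alpha):
    """Log-probability of count vector x under a Dirichlet-multinomial with pseudocounts alpha."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    n, a0 = x.sum(), alpha.sum()
    # log[n! Gamma(alpha_0) / Gamma(n + alpha_0)]
    out = gammaln(n + 1) + gammaln(a0) - gammaln(n + a0)
    # sum_k log[Gamma(x_k + alpha_k) / (x_k! Gamma(alpha_k))]
    out += np.sum(gammaln(x + alpha) - gammaln(x + 1) - gammaln(alpha))
    return out

# Example: probability of the counts (2, 3, 5) in n = 10 trials with pseudocounts (1, 2, 3)
print(np.exp(dirichlet_multinomial_logpmf([2, 3, 5], [1.0, 2.0, 3.0])))
```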
The Dirichlet-multinomial distribution can also be motivated via an urn model for positive integer values of the vector \boldsymbol\alpha, known as the Pólya urn scheme. Specifically, imagine an urn containing balls of K colors, with \alpha_k balls of color k; a ball is drawn uniformly at random and its color observed, then it is returned to the urn together with an additional ball of the same color, and the process is repeated n times.
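A minimal simulation sketch of this urn scheme (not from the article; the function name, seed and parameter values are arbitrary): the color counts of the n draws follow the Dirichlet-multinomial distribution with parameters \boldsymbol\alpha and n.

```python
import numpy as np

def polya_urn_counts(alpha, n, seed=0):
    """Counts of each colour seen in n draws from a Polya urn starting with alpha[k] balls of colour k."""
    rng = np.random.default_rng(seed)
    balls = np.array(alpha, dtype=float)       # current number of balls of each colour
    counts = np.zeros_like(balls)
    for _ in range(n):
        k = rng.choice(len(balls), p=balls / balls.sum())  # draw a ball uniformly at random
        counts[k] += 1                                     # record its colour
        balls[k] += 1                                      # return it, plus one extra of the same colour
    return counts

print(polya_urn_counts([1, 2, 3], n=10))
```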
Letting \alpha_0 = \sum_k \alpha_k and p_i = \alpha_i/\alpha_0, the expected number of times the outcome i was observed over n trials is

\operatorname{E}(X_i) = n p_i = \frac{n\alpha_i}{\alpha_0}.

The covariance matrix is as follows. Each diagonal entry is the variance of a beta-binomially distributed random variable,

\operatorname{var}(X_i) = n p_i (1 - p_i)\,\frac{n + \alpha_0}{1 + \alpha_0},

and the off-diagonal entries are the covariances,

\operatorname{cov}(X_i, X_j) = -n p_i p_j\,\frac{n + \alpha_0}{1 + \alpha_0} \quad (i \neq j).

The entries of the corresponding correlation matrix are \rho(X_i, X_i) = 1 and, for i \neq j,

\rho(X_i, X_j) = -\sqrt{\frac{\alpha_i \alpha_j}{(\alpha_0 - \alpha_i)(\alpha_0 - \alpha_j)}}.

The sample size drops out of this expression. In matrix notation the covariance can be written as

\operatorname{var}(\mathbf{X}) = n\{\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\mathsf T}\}\,\frac{n + \alpha_0}{1 + \alpha_0} = n\{\operatorname{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^{\mathsf T}\}\,[1 + (n-1)\rho^2],

where \rho^2 = 1/(1 + \alpha_0). The parameter \rho is known as the intra-class or intra-cluster correlation, and it is this positive correlation which gives rise to overdispersion relative to the multinomial distribution.
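The moment formulas above can be checked by simulation. The sketch below (not from the article; the sample size, seed and parameter values are arbitrary) draws from the compound definition, p from a Dirichlet and then counts from a multinomial, and compares the empirical mean and variance with n p_i and n p_i (1 - p_i)(n + \alpha_0)/(1 + \alpha_0).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.array([1.0, 2.0, 3.0])
n, a0 = 10, alpha.sum()
p = alpha / a0

# Compound sampling: p ~ Dirichlet(alpha), then x | p ~ Multinomial(n, p)
ps = rng.dirichlet(alpha, size=50_000)
xs = np.array([rng.multinomial(n, q) for q in ps])

print("empirical mean :", xs.mean(axis=0))
print("theory    mean :", n * p)

factor = (n + a0) / (1 + a0)       # over-dispersion factor relative to the multinomial
print("empirical var  :", xs.var(axis=0))
print("theory    var  :", n * p * (1 - p) * factor)
```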
If X = (X_1, \ldots, X_K) \sim \operatorname{DM}(n, \alpha_1, \ldots, \alpha_K), then, if the random variables with subscripts i and j are dropped from the vector and replaced by their sum,

X' = (X_1, \ldots, X_i + X_j, \ldots, X_K) \sim \operatorname{DM}(n, \alpha_1, \ldots, \alpha_i + \alpha_j, \ldots, \alpha_K).

This aggregation property may be used to derive the marginal distribution of X_i mentioned above.
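As a numerical check of the aggregation property (an illustrative sketch, not from the article), the exact probability that the first two components sum to s can be computed either by summing the three-component Dirichlet-multinomial pmf over all splits of s, or directly from the two-component distribution with the merged parameter \alpha_1 + \alpha_2. The helper name dm_logpmf and the parameter values are arbitrary.

```python
import numpy as np
from scipy.special import gammaln

def dm_logpmf(x, alpha):
    """Log Dirichlet-multinomial pmf of counts x given pseudocounts alpha."""
    x, alpha = np.asarray(x, float), np.asarray(alpha, float)
    n, a0 = x.sum(), alpha.sum()
    return (gammaln(n + 1) + gammaln(a0) - gammaln(n + a0)
            + np.sum(gammaln(x + alpha) - gammaln(x + 1) - gammaln(alpha)))

alpha, n, s = np.array([1.0, 2.0, 3.0]), 10, 4
# P(X_1 + X_2 = s, X_3 = n - s): sum the K = 3 pmf over every split of s between the first two components ...
lhs = sum(np.exp(dm_logpmf([x1, s - x1, n - s], alpha)) for x1 in range(s + 1))
# ... which matches the K = 2 pmf with the first two pseudocounts merged
rhs = np.exp(dm_logpmf([s, n - s], [alpha[0] + alpha[1], alpha[2]]))
print(lhs, rhs)   # the two values agree
```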
Another useful formula, particularly in the context of Gibbs sampling, asks what the conditional density of a given variable z_n is, conditioned on all the other variables (which we will denote \mathbb{Z}^{(-n)}). It turns out to have an extremely simple form:

\Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol\alpha) \propto n_k^{(-n)} + \alpha_k,

where n_k^{(-n)} is the number of variables other than z_n that take the value k, and N is the total number of variables. To derive this, we use the notation n_k^{(-n)} defined above, together with

n_k = \begin{cases} n_k^{(-n)} & \text{if } z_n \neq k \\ n_k^{(-n)} + 1 & \text{if } z_n = k. \end{cases}

We also use the fact that \Gamma(n + 1) = n\,\Gamma(n). Then the factors of the joint distribution that do not involve z_n cancel, leaving the expression above. In general, it is not necessary to worry about the normalizing constant at the time of deriving the equations for conditional distributions. However, when the conditional distribution is written in the simple form above, it turns out that the normalizing constant assumes a simple form:

\sum_{k=1}^{K}\bigl(n_k^{(-n)} + \alpha_k\bigr) = N - 1 + \alpha_0.

Hence

\Pr(z_n = k \mid \mathbb{Z}^{(-n)}, \boldsymbol\alpha) = \frac{n_k^{(-n)} + \alpha_k}{N - 1 + \alpha_0}.

This formula is closely related to the Chinese restaurant process, which results from taking the limit as K \to \infty.
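The conditional density above is exactly what a collapsed Gibbs sampler uses. The following is a small illustrative sketch (not from the article; K, N, the prior and the number of sweeps are arbitrary) that repeatedly resamples each z_i from (n_k^{(-i)} + \alpha_k)/(N - 1 + \alpha_0).

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = np.full(3, 0.5)                  # symmetric Dirichlet prior over K = 3 categories
z = rng.integers(0, 3, size=20)          # current assignments of N = 20 categorical variables

def resample(z, i):
    """One collapsed Gibbs update of z[i] given all the other assignments."""
    counts = np.bincount(np.delete(z, i), minlength=len(alpha))   # n_k^(-i)
    probs = (counts + alpha) / (len(z) - 1 + alpha.sum())         # (n_k^(-i) + alpha_k) / (N - 1 + alpha_0)
    z[i] = rng.choice(len(alpha), p=probs)

for _ in range(100):                     # full Gibbs sweeps over all variables
    for i in range(len(z)):
        resample(z, i)
print(np.bincount(z, minlength=len(alpha)))
```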
In the following sections, we discuss different configurations commonly found in Bayesian networks.
Here the set \mathbb{Z}_d is simply the collection of categorical variables dependent on prior d. Accordingly, the conditional probability distribution can be written as follows:

\Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, \boldsymbol\alpha) \propto n_k^{(-dn),d} + \alpha_k,

where n_k^{(-dn),d} is the number of variables in \mathbb{Z}_d, excluding z_{dn} itself, that take the value k. Only the counts of the group sharing the same prior enter the formula; the variables attached to other priors play no role.
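A small sketch of this update (illustrative only, not from the article): with several groups of categorical variables, each attached to its own collapsed Dirichlet prior, only the counts from the variable's own group enter its conditional distribution. The group sizes, prior values and names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = np.full(3, 0.5)                                      # Dirichlet parameters, K = 3 outcomes
groups = [rng.integers(0, 3, size=n) for n in (5, 8, 12)]    # variables grouped by the prior d they depend on

def conditional(z_d, i):
    """Pr(z_{dn} = k | everything else): only the counts of the same group d appear."""
    counts = np.bincount(np.delete(z_d, i), minlength=len(alpha))   # n_k^(-dn),d
    return (counts + alpha) / (len(z_d) - 1 + alpha.sum())

print(conditional(groups[0], 2))   # uses group 0's counts only; the other groups are irrelevant
```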
Again, in the joint distribution, only the categorical variables dependent on the same prior are linked into a single Dirichlet-multinomial factor. The conditional distribution of the categorical variables, conditioned only on their parents and ancestors, would have the identical form as above in the simpler case; in Gibbs sampling, however, each variable must also be conditioned on its dependent children, which contribute an additional factor of their own probability.
Now imagine we have a hierarchical model as follows:

\boldsymbol\alpha: a K-dimensional vector of positive reals (shared prior for the per-document topic distributions)
\boldsymbol\beta: a V-dimensional vector of positive reals (shared prior for the per-topic word distributions)
\boldsymbol\theta_{d=1\dots M} \sim \operatorname{Dirichlet}_K(\boldsymbol\alpha)
\boldsymbol\varphi_{k=1\dots K} \sim \operatorname{Dirichlet}_V(\boldsymbol\beta)
z_{d=1\dots M,\ n=1\dots N_d} \sim \operatorname{Categorical}_K(\boldsymbol\theta_d)
w_{d=1\dots M,\ n=1\dots N_d} \sim \operatorname{Categorical}_V(\boldsymbol\varphi_{z_{dn}})

Here we have a tricky situation: as before, there are multiple Dirichlet priors and a set of dependent categorical variables, but unlike before, the relationship between the priors and the dependent variables isn't fixed.
This occurs, for example, in topic models, and indeed the names of the variables above are meant to correspond to those in latent Dirichlet allocation.
However, the topic membership of a given word isn't fixed; rather, it's determined from a set of latent topic variables z_{dn}, one per word.
In this case, all variables dependent on a given prior are tied together (i.e. correlated) in a group, as before — specifically, all words belonging to a given topic are linked.
In the standard LDA model, the words are completely observed, and hence we never need to resample them.
(If the words were unobserved or only partially observed, we would have to resample them as well. In such a case, we would want to initialize the distribution over the words in some reasonable fashion — e.g. from the output of some process that generates sentences, such as a machine translation model — in order for the resulting posterior latent variable distributions to make any sense.)
Using the above formulas, we can write down the conditional probabilities directly:

\Pr(w_{dn} = v \mid \mathbb{W}^{(-dn)}, z_{dn} = k, \mathbb{Z}^{(-dn)}, \boldsymbol\beta) \propto n_v^{(-dn),k} + \beta_v
\Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, w_{dn} = v, \mathbb{W}^{(-dn)}, \boldsymbol\alpha, \boldsymbol\beta) \propto \bigl(n_k^{(-dn),d} + \alpha_k\bigr)\,\Pr(w_{dn} = v \mid \mathbb{W}^{(-dn)}, z_{dn} = k, \boldsymbol\beta)

Here we have defined the counts more explicitly to clearly separate counts of words and counts of topics:

n_v^{(-dn),k}: the number of words with value v among the words assigned to topic k, excluding w_{dn};
n_k^{(-dn),d}: the number of topic assignments with value k in document d, excluding z_{dn}.

As in the scenario above with categorical variables with dependent children, the conditional probability of those dependent children appears in the definition of the parent's conditional probability.
In this case, each latent variable has only a single dependent child word, so only one such term appears.
Note, however, that the conditional probability of the dependent word w_{dn} above was written only up to proportionality. Hence we have to normalize by summing over all word symbols:

\Pr(z_{dn} = k \mid \mathbb{Z}^{(-dn)}, w_{dn} = v, \mathbb{W}^{(-dn)}, \boldsymbol\alpha, \boldsymbol\beta) \propto \bigl(n_k^{(-dn),d} + \alpha_k\bigr)\,\frac{n_v^{(-dn),k} + \beta_v}{n^{(-dn),k} + \sum_{v'} \beta_{v'}},

where n^{(-dn),k} = \sum_{v'} n_{v'}^{(-dn),k} is simply the total number of words assigned to topic k, excluding w_{dn} itself. It's also worth making another point in detail, which concerns the second factor above in the conditional probability.
Remember that the conditional distribution is in general derived from the joint distribution and simplified by removing terms that do not depend on the variable whose conditional is being computed (the part on the left side of the vertical bar).
Usually there is one factor for each dependent node, and it has the same density function as the distribution appearing in the mathematical definition.
In the case of the dependent child word above, however, that child has a Dirichlet co-parent that we have collapsed out, which induces a Dirichlet-multinomial over the entire set of nodes sharing that co-parent (here, all of the words generated by the same topic).
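Putting the two factors together gives the familiar collapsed Gibbs sampler for a latent Dirichlet allocation style model. The sketch below (illustrative only, not from the article) keeps document-topic and topic-word count tables and resamples each topic assignment from the product of the two factors derived above; the toy corpus, hyperparameter values and variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
K, V = 2, 5                                        # number of topics and vocabulary size (toy values)
alpha = np.full(K, 0.1)                            # document-topic Dirichlet parameters
beta = np.full(V, 0.01)                            # topic-word Dirichlet parameters
docs = [rng.integers(0, V, size=n) for n in (6, 9)]          # word ids of each document
z = [rng.integers(0, K, size=len(d)) for d in docs]          # current topic assignment of each word

# Count tables: n_dk[d, k] = topic counts per document, n_kv[k, v] = word counts per topic
n_dk = np.zeros((len(docs), K)); n_kv = np.zeros((K, V))
for d, (ws, zs) in enumerate(zip(docs, z)):
    for w, t in zip(ws, zs):
        n_dk[d, t] += 1; n_kv[t, w] += 1

def gibbs_sweep():
    for d, (ws, zs) in enumerate(zip(docs, z)):
        for i, w in enumerate(ws):
            t = zs[i]
            n_dk[d, t] -= 1; n_kv[t, w] -= 1       # remove the word's current assignment (the "-dn" counts)
            # first factor: document-topic counts; second factor: normalized topic-word counts
            probs = (n_dk[d] + alpha) * (n_kv[:, w] + beta[w]) / (n_kv.sum(axis=1) + beta.sum())
            probs /= probs.sum()
            t = zs[i] = rng.choice(K, p=probs)     # draw a new topic for the word
            n_dk[d, t] += 1; n_kv[t, w] += 1

for _ in range(50):
    gibbs_sweep()
print(n_dk)                                        # topic usage per document after sampling
```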
However, we don't already know the correct category of any document; instead, we want to cluster the documents based on their mutual similarities.
In this case, the joint distribution needs to be taken over all words in all documents whose label assignment equals the value of the label being resampled, and it cannot be broken down into a conditional distribution over a single word.
Rather, we can reduce it down only to a smaller joint conditional distribution over the words in the document for the label in question, and hence we cannot simplify it using the trick above that yields a simple sum of expected count and prior.
Although it is in fact possible to rewrite it as a product of such individual sums, the number of factors is very large, and doing so is not clearly more efficient than directly computing the Dirichlet-multinomial distribution probability.
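For concreteness, here is a minimal sketch (not from the article) of that computation: for each candidate label, the score multiplies the collapsed label prior by the Dirichlet-multinomial probability of the document's words given the word counts of the other documents carrying that label. The toy data, hyperparameters and helper names (log_dm, label_scores) are illustrative only.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(6)
C, V = 2, 5                                   # number of clusters and vocabulary size (toy values)
alpha = np.full(C, 1.0)                       # Dirichlet prior over cluster labels
beta = np.full(V, 0.1)                        # Dirichlet prior over each cluster's word distribution
docs = [rng.integers(0, V, size=n) for n in (8, 5, 7)]    # word ids of each document
labels = rng.integers(0, C, size=len(docs))               # current cluster label of each document

def log_dm(counts, prior):
    """Log Dirichlet-multinomial probability of the count vector `counts` given pseudocounts `prior`."""
    n, a0 = counts.sum(), prior.sum()
    return (gammaln(n + 1) - np.sum(gammaln(counts + 1))
            + gammaln(a0) - gammaln(n + a0)
            + np.sum(gammaln(counts + prior) - gammaln(prior)))

def label_scores(d):
    """Unnormalized log Pr(label_d = c | everything else) for each cluster c."""
    x_d = np.bincount(docs[d], minlength=V)                           # word counts of document d
    scores = np.empty(C)
    for c in range(C):
        others = [e for e in range(len(docs)) if e != d and labels[e] == c]
        n_c = sum(np.bincount(docs[e], minlength=V) for e in others)  # word counts of cluster c, without doc d
        n_c = np.asarray(n_c) if others else np.zeros(V)
        # collapsed label prior times the DM probability of the document's words under cluster c
        scores[c] = np.log(len(others) + alpha[c]) + log_dm(x_d, n_c + beta)
    return scores

print(label_scores(0))
```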
[2] The Dirichlet-multinomial distribution is used in automated document classification and clustering, genetics, economics, combat modeling, and quantitative marketing.