Dirichlet–multinomial mixture model: Gibbs sampling
This model is similar to the previous Dirichlet–multinomial mixture model with known groups, except that this time the document–group assignments are no longer observed.
Random variables
- $x_{d,n} \in \{1, \dots, V\}$: the $n$-th token of document $d$ (observed), where $d \in \{1, \dots, D\}$ and $n \in \{1, \dots, N_d\}$
- $z_d \in \{1, \dots, K\}$: the group assignment of document $d$ (latent), where $d \in \{1, \dots, D\}$
where
- $N_d$: number of tokens in document $d$
- $D$: number of documents
- $V$: number of vocabularies (the vocabulary size)
- $K$: number of groups
Generative process
$$
\begin{aligned}
\pi &\sim \operatorname{Dirichlet}(\beta) \\
\theta_k &\sim \operatorname{Dirichlet}(\alpha), & k &= 1, \dots, K \\
z_d &\sim \operatorname{Categorical}(\pi), & d &= 1, \dots, D \\
x_{d,n} &\sim \operatorname{Categorical}(\theta_{z_d}), & n &= 1, \dots, N_d
\end{aligned}
$$
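As a concrete illustration, the generative process can be simulated directly. The sketch below uses symmetric Dirichlet hyperparameters and illustrative sizes for $D$, $V$, $K$ and $N_d$; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not from the text).
D, V, K = 100, 50, 3                  # documents, vocabulary size, groups
N = rng.integers(20, 40, size=D)      # N_d: tokens per document

alpha = np.full(V, 0.1)               # Dirichlet prior over token distributions
beta = np.full(K, 1.0)                # Dirichlet prior over the group distribution

pi = rng.dirichlet(beta)                          # group distribution
theta = rng.dirichlet(alpha, size=K)              # theta_k for each group
z = rng.choice(K, size=D, p=pi)                   # latent group assignments z_d
x = [rng.choice(V, size=N[d], p=theta[z[d]])      # observed tokens x_{d,n}
     for d in range(D)]
```

Running this yields a corpus `x` of $D$ token arrays together with the latent assignments `z` that the Gibbs sampler below will try to recover.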
Challenges in computing posterior
The posterior of interest can be factorized as
$$
p(\pi, \theta, z \mid x) = p(\pi, \theta \mid z, x) \, p(z \mid x)
$$
where
$$
p(\pi, \theta \mid z, x) = p(\pi \mid z) \prod_{k=1}^{K} p(\theta_k \mid z, x)
$$
which is a joint of Dirichlet distributions that we know how to compute.
As for $p(z \mid x)$, it can be factorized as follows.
$$
p(z \mid x) = \frac{p(x, z)}{p(x)}
$$
where the numerator $p(x, z)$ is the evidence for the data and the assignments, which we know how to compute as well.
However, the computation of the denominator is intractable.
$$
p(x) = \sum_{z_1 = 1}^{K} \cdots \sum_{z_D = 1}^{K} p(x, z)
$$
This is a sum of $K^D$ terms which cannot be factorized any more, because marginalizing out $\pi$ and $\theta$ introduces dependencies among the assignments $z_1, \dots, z_D$ and the documents $x_1, \dots, x_D$.
Furthermore, any other factorization of the posterior runs into the same problem: the computation always ends up intractable.
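To make the blow-up concrete, the snippet below enumerates every assignment $z = (z_1, \dots, z_D)$ for small, made-up sizes; the sum over $z$ has exactly $K^D$ terms, which becomes astronomical for realistic corpora.

```python
from itertools import product

K, D = 3, 10  # illustrative sizes; real corpora have far larger D
# One term of the sum for each assignment z = (z_1, ..., z_D).
terms = sum(1 for _ in product(range(K), repeat=D))
print(terms)  # 59049 == K ** D; for D = 1000 it would be 3 ** 1000
```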
Gibbs sampling
Gibbs sampling is a Markov chain Monte Carlo method for sampling from a joint distribution over more than one random variable that is known only up to a normalization constant. For random variables $y = (y_1, \dots, y_M)$, to sample from $p(y_1, \dots, y_M)$, a Gibbs sampler instead samples from the full conditionals $p(y_i \mid y_{-i})$ iteratively.
- initialize $y^{(0)} = (y_1^{(0)}, \dots, y_M^{(0)})$ somehow.
- on iteration $t = 1, 2, \dots$, sample, for each $i = 1, \dots, M$,
$$
y_i^{(t)} \sim p\left(y_i \,\middle|\, y_1^{(t)}, \dots, y_{i-1}^{(t)}, y_{i+1}^{(t-1)}, \dots, y_M^{(t-1)}\right)
$$
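As a minimal illustration of the scheme above (not part of the original text), the sketch below runs a Gibbs sampler on a standard bivariate Gaussian with correlation $\rho$, whose full conditionals are the one-dimensional Gaussians $y_1 \mid y_2 \sim \mathcal{N}(\rho y_2, 1 - \rho^2)$ and symmetrically for $y_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8            # target: standard bivariate Gaussian with correlation rho
T = 20_000

y1, y2 = 0.0, 0.0    # initialize somehow
samples = np.empty((T, 2))
for t in range(T):
    # Sample each coordinate from its full conditional,
    # conditioning on the latest value of the other coordinate.
    y1 = rng.normal(rho * y2, np.sqrt(1 - rho ** 2))
    y2 = rng.normal(rho * y1, np.sqrt(1 - rho ** 2))
    samples[t] = (y1, y2)

# After burn-in, the empirical correlation should be close to rho.
print(np.corrcoef(samples[1_000:].T)[0, 1])
```

Although no single step draws from the joint distribution, the chain of coordinate-wise draws converges to it, which is exactly the property exploited below for the mixture model.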
Gibbs sampler for Dirichlet–multinomial mixture model
Let $x_d = (x_{d,1}, \dots, x_{d,N_d})$ denote the tokens associated with document $d$. On each iteration, the sampler draws every $z_d$ in turn from its full conditional
$$
p(z_d \mid z_{-d}, x) = \frac{p(z_d, z_{-d}, x)}{p(z_{-d}, x)}
$$
where the denominator is a normalization constant which can be ignored, because we are able to sample from an unnormalized discrete distribution.
The numerator is the evidence for document $d$ and its document–group assignment $z_d$, jointly with the rest of the data and assignments:
$$
p(z_d = k, z_{-d}, x) = p(x_d \mid z_d = k, z_{-d}, x_{-d}) \, p(z_d = k \mid z_{-d}) \, p(z_{-d}, x_{-d})
$$
where
- $z_{-d} = (z_1, \dots, z_{d-1}, z_{d+1}, \dots, z_D)$, and $x_{-d}$ is defined analogously;
- on iteration $t$, $z_{d'}$ takes the value $z_{d'}^{(t)}$ if $d' < d$; $z_{d'}^{(t-1)}$ if $d' > d$
- $p(z_d = k \mid z_{-d}) = \dfrac{D_k^{-d} + \beta_k}{D - 1 + \sum_{k'=1}^{K} \beta_{k'}}$, where $D_k^{-d}$ is the number of documents other than $d$ assigned to group $k$;
- $N_{k,v}^{-d} = \sum_{d' \neq d} N_{d',v} \, [z_{d'} = k]$, that is, the number of occurrences of vocabulary $v$ in the documents other than $d$ assigned to group $k$, and $N_k^{-d} = \sum_{v=1}^{V} N_{k,v}^{-d}$
Let $N_d$ denote the number of tokens in document $d$ and $N_{d,v}$ denote the number of occurrences of the $v$-th vocabulary item in document $d$. The probability can then be written as follows.
$$
p(z_d = k \mid z_{-d}, x) \propto \left(D_k^{-d} + \beta_k\right) \cdot \frac{\prod_{v=1}^{V} \prod_{j=0}^{N_{d,v} - 1} \left(N_{k,v}^{-d} + \alpha_v + j\right)}{\prod_{j=0}^{N_d - 1} \left(N_k^{-d} + \sum_{v=1}^{V} \alpha_v + j\right)}
$$
Here the data $x$ is observed, so when $z_{-d}$ is fixed, we can evaluate this probability for every $k = 1, \dots, K$ and draw a sample of $z_d$ from the resulting unnormalized discrete distribution.
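Putting this together, here is a sketch of the collapsed Gibbs sampler. It assumes symmetric scalar hyperparameters $\alpha$ and $\beta$ (a simplification of the vector-valued priors above) and evaluates the unnormalized conditional in log space for numerical stability; none of the function or variable names come from the text.

```python
import numpy as np

def gibbs_dmm(x, V, K, alpha, beta, iters=50, seed=0):
    """Collapsed Gibbs sampler for the Dirichlet-multinomial mixture.

    x: list of 1-D integer arrays of token ids in [0, V), one per document.
    alpha, beta: symmetric Dirichlet hyperparameters (scalars for brevity).
    Returns the final group assignments z.
    """
    rng = np.random.default_rng(seed)
    D = len(x)

    # Per-document counts N_{d,v} and document lengths N_d.
    Ndv = np.zeros((D, V))
    for d, doc in enumerate(x):
        np.add.at(Ndv[d], doc, 1)
    Nd = Ndv.sum(axis=1)

    z = rng.integers(K, size=D)                     # initialize somehow
    Dk = np.bincount(z, minlength=K).astype(float)  # documents per group
    Nkv = np.zeros((K, V))                          # token counts per group
    for d in range(D):
        Nkv[z[d]] += Ndv[d]

    for _ in range(iters):
        for d in range(D):
            # Remove document d to obtain the "-d" statistics.
            Dk[z[d]] -= 1
            Nkv[z[d]] -= Ndv[d]
            Nk = Nkv.sum(axis=1)

            # Unnormalized log p(z_d = k | z_{-d}, x), for all k at once.
            logp = np.log(Dk + beta)
            for v in np.nonzero(Ndv[d])[0]:
                j = np.arange(Ndv[d, v])
                logp += np.log(Nkv[:, v, None] + alpha + j).sum(axis=1)
            logp -= np.log(Nk[:, None] + V * alpha + np.arange(Nd[d])).sum(axis=1)

            # Draw z_d from the normalized conditional and restore the counts.
            p = np.exp(logp - logp.max())
            z[d] = rng.choice(K, p=p / p.sum())
            Dk[z[d]] += 1
            Nkv[z[d]] += Ndv[d]
    return z
```

Subtracting `logp.max()` before exponentiating leaves the normalized probabilities unchanged while avoiding underflow, since the products over tokens can be astronomically small.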