Bayesian Methods for Text

Dirichlet–multinomial mixture model: Gibbs sampling

This model is identical to the previous Dirichlet–multinomial mixture model with known groups, except that this time the document–group assignment \bm{z} is no longer observed.

Random variables

  • \D=\{w_1,\dots,w_N\} where w_n\in\{1,2,\dots,V\}
  • \Psi=\{\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta},\bm{z}\}
    • \gbm{\phi}_t=(\phi_{1|t},\phi_{2|t},\dots,\phi_{V|t})
    • \gbm{\theta}=(\theta_1,\theta_2,\dots,\theta_T)
    • \bm{z}=(z_1,z_2,\dots,z_D) where z_d\in\{1,2,\dots,T\}

where

  • N: number of tokens
  • D: number of documents
  • V: size of the vocabulary (number of distinct word types)
  • T: number of groups

Generative process

  • \gbm{\theta}\sim\Dir(\gbm{\theta};\alpha,\bm{m})
  • \gbm{\phi}_1,\dots,\gbm{\phi}_T
\sim\prod_{t=1}^T\Dir(\gbm{\phi}_t;\beta,\bm{n})
  • z_d\sim\gbm{\theta} for d=1,\dots,D
  • each token of document d is drawn from \gbm{\phi}_{z_d}
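The generative process can be sketched with NumPy. All sizes, hyperparameter values, and the Poisson document lengths below are hypothetical choices for illustration; note that \Dir(\cdot;\alpha,\bm{m}) in this notation corresponds to standard Dirichlet parameters \alpha m_t.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (not from the text).
T, V, D = 3, 5, 10           # groups, vocabulary size, documents
alpha, beta = 2.0, 1.0       # concentration parameters
m = np.full(T, 1.0 / T)      # base measure over groups
n = np.full(V, 1.0 / V)      # base measure over the vocabulary

# theta ~ Dir(theta; alpha, m): standard Dirichlet parameters alpha * m
theta = rng.dirichlet(alpha * m)
# phi_t ~ Dir(phi_t; beta, n) independently for each group t
phi = rng.dirichlet(beta * n, size=T)

docs = []
for d in range(D):
    z_d = rng.choice(T, p=theta)     # document-group assignment
    N_d = rng.poisson(20) + 1        # document length (assumed Poisson)
    docs.append(rng.choice(V, size=N_d, p=phi[z_d]))
```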

Challenges in computing the posterior

P(\Psi|\D,\H)
&= P(\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta},\bm{z}|\D,\H) \\
&= P(\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta}|\bm{z},\D,\H)
   P(\bm{z}|\D,\H)

where

P(\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta}|\bm{z},\D,\H)
=  \Dir\left(\gbm{\theta}; D+\alpha, \left(
   \frac{D_1+\alpha m_1}{D+\alpha},\dots,
   \frac{D_T+\alpha m_T}{D+\alpha}\right)\right)
   \prod_{t=1}^T
   \Dir\left(\gbm{\phi}_t; N_t+\beta, \left(
   \frac{N_{1|t}+\beta n_1}{N_t+\beta},\dots,
   \frac{N_{V|t}+\beta n_V}{N_t+\beta}\right)\right)

which is a product of (T+1) Dirichlet distributions that we know how to compute.
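Given a fixed \bm{z}, sampling from this conditional posterior is straightforward: each Dirichlet factor \Dir(\cdot;c,\bm{p}) has standard parameters c\,p_i, i.e. the count-updated values D_t+\alpha m_t and N_{v|t}+\beta n_v. A minimal sketch (function name and signature are ours):

```python
import numpy as np

def sample_posterior_given_z(z, docs, T, V, alpha, beta, m, n, rng):
    """Draw (theta, phi_1..phi_T) from P(theta, phi | z, D, H),
    a product of T + 1 Dirichlet distributions (a sketch)."""
    # Count N_{v|t} and D_t from the fixed assignments z.
    Nvt = np.zeros((T, V))
    for d, doc in enumerate(docs):
        np.add.at(Nvt[z[d]], doc, 1)
    Dt = np.bincount(z, minlength=T)

    # Dir(theta; D+alpha, (D_t+alpha m_t)/(D+alpha)) has standard
    # Dirichlet parameters D_t + alpha * m_t.
    theta = rng.dirichlet(Dt + alpha * m)
    # Likewise Dir(phi_t; N_t+beta, ...) has parameters N_{v|t} + beta * n_v.
    phi = np.stack([rng.dirichlet(Nvt[t] + beta * n) for t in range(T)])
    return theta, phi
```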

As for P(\bm{z}|\D,\H), it can be written via Bayes' rule as follows.

P(\bm{z}|\D,\H)=\frac{P(\D,\bm{z}|\H)}{P(\D|\H)}
=\frac{P(\D,\bm{z}|\H)}{\sum_{\bm{z}}P(\D,\bm{z}|\H)}

where the numerator P(\D,\bm{z}|\H) is the joint evidence for the data and \bm{z}, which we also know how to compute.

P(\D,\bm{z}|\H)
=  \left(
   \prod_{t=1}^T\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)}
   \right)
   \left(
   \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t+\alpha m_t)}{\Gamma(D+\alpha)}
   \right)
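Since this expression is a product of many Gamma-function ratios, it is best evaluated in log space. A sketch using scipy.special.gammaln (the function below and its signature are ours):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(z, docs, T, V, alpha, beta, m, n):
    """log P(D, z | H) from the closed form above (a sketch)."""
    z = np.asarray(z)
    # Count matrices: N_{v|t}, N_t, D_t
    Nvt = np.zeros((T, V))
    for d, doc in enumerate(docs):
        np.add.at(Nvt[z[d]], doc, 1)
    Nt = Nvt.sum(axis=1)
    Dt = np.bincount(z, minlength=T)
    D = len(docs)

    # Word part: prod over groups t of the Gamma ratios.
    lp = T * (gammaln(beta) - gammaln(beta * n).sum())
    lp += gammaln(Nvt + beta * n).sum() - gammaln(Nt + beta).sum()
    # Group part.
    lp += gammaln(alpha) - gammaln(alpha * m).sum()
    lp += gammaln(Dt + alpha * m).sum() - gammaln(D + alpha)
    return lp
```

As a sanity check, for a corpus of one single-token document the evidence reduces to P(\D,z)=n_w m_t, so summing over all tokens and assignments gives 1.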

However, the computation of the denominator P(\D|\H)=\sum_{\bm{z}}P(\D,\bm{z}|\H) is intractable.

\sum_{\bm{z}}P(\D,\bm{z}|\H)
&= \sum_{z_1=1}^T \sum_{z_2=1}^T \cdots \sum_{z_D=1}^T
   \left(
   \prod_{t=1}^T\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)}
   \right)
   \left(
   \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t+\alpha m_t)}{\Gamma(D+\alpha)}
   \right)

This is a sum of T^D terms that cannot be factorized further, because the counts D_t, N_t and N_{v|t} couple all of z_1,\dots,z_D. For example, the chain-rule decomposition

P(\D,\bm{z})
=\prod_{d=1}^D P(\D_d,z_d|\D_1,\dots,\D_{d-1},z_1,\dots,z_{d-1})

does not help, since each factor still depends on all earlier documents and assignments. Other factorizations of the posterior likewise end up computationally intractable.

Gibbs sampling

Gibbs sampling is a Markov chain Monte Carlo (MCMC) method for sampling from a joint distribution over several random variables that is known only up to a normalization constant. For \bm{x}=(x_1,\dots,x_n), instead of sampling from P(\bm{x}) directly, a Gibbs sampler iteratively samples each coordinate from its full conditional P(x_i|\bm{x}_{\setminus i})=P(x_i|\bm{x}\setminus x_i)
=P(x_i|x_1,\dots,x_{i-1},x_{i+1},\dots,x_n).

  • initialize x_1^{(0)}, \dots, x_n^{(0)} arbitrarily.
  • on iteration t+1
    • x_1^{(t+1)}\sim P(x_1|x_2^{(t)},\dots,x_n^{(t)})
    • x_2^{(t+1)}\sim P(x_2|x_1^{(t+1)},x_3^{(t)},\dots,x_n^{(t)})
    • x_3^{(t+1)}\sim P(x_3|x_1^{(t+1)},x_2^{(t+1)},x_4^{(t)},
\dots,x_n^{(t)})
    • \vdots
    • x_n^{(t+1)}\sim P(x_n|x_1^{(t+1)},\dots,x_{n-1}^{(t+1)})
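As a toy example unrelated to the text model, a Gibbs sampler for a standard bivariate normal with correlation rho alternates between its two normal full conditionals, x_1|x_2\sim\mathcal{N}(\rho x_2, 1-\rho^2) and symmetrically for x_2. A minimal sketch (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(rho, iters=5000):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    x1, x2 = 0.0, 0.0                  # arbitrary initialization
    samples = np.empty((iters, 2))
    sd = np.sqrt(1.0 - rho ** 2)       # conditional standard deviation
    for s in range(iters):
        x1 = rng.normal(rho * x2, sd)  # sample x1 from P(x1 | x2)
        x2 = rng.normal(rho * x1, sd)  # sample x2 given the updated x1
        samples[s] = x1, x2
    return samples

samples = gibbs_bivariate_normal(0.8, iters=20000)
```

After discarding a burn-in prefix, the empirical correlation of the draws approaches rho, even though the sampler only ever used the one-dimensional conditionals.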

Gibbs sampler for Dirichlet–multinomial mixture model

Let \D_d denote the tokens associated with document d.

z_d^{(s)}\sim P(z_d|\D,\zz_\d)
=\frac{P(\D_d,z_d | \D_\d,\zz_\d)}{P(\D_d | \D_\d,\zz_\d)}

where the denominator P(\D_d | \D_\d,\zz_\d) is a normalization constant which can be ignored because we are able to sample from an unnormalized discrete distribution.
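Sampling from an unnormalized discrete distribution only requires normalizing the T weights before drawing. A minimal sketch (function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_discrete(weights, rng):
    """Draw an index from an unnormalized discrete distribution."""
    w = np.asarray(weights, dtype=float)
    return rng.choice(len(w), p=w / w.sum())

t = sample_discrete([0.2, 3.0, 0.8], rng)  # index 1 is drawn most often
```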

The numerator P(\D_d,z_d | \D_\d,\zz_\d) is the evidence for document d and its document–group assignment z_d.

P(\D_d,z_d | \D_\d,\zz_\d)
&= \left(
   \prod_{t=1}^T
   \frac{\Gamma(N_t^\d+\beta)}{\prod_{v=1}^V\Gamma(N_{v|t}^\d+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}^d+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_t^d+N_t^\d+\beta)}
   \right)
   \left(
   \frac{\Gamma(D^\d+\alpha)}{\prod_{t=1}^T\Gamma(D_t^\d+\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t^d+D_t^\d+\alpha m_t)}
   {\Gamma(D^d+D^\d+\alpha)}
   \right)

where

  • N_t = N_t^d + N_t^\d; N_{v|t} = N_{v|t}^d + N_{v|t}^\d
  • N_{v|t}^d = 0 if t \ne z_d; N_t^d = 0 if t \ne z_d
  • D^d = 1; D^\d = D - 1
  • D_t^d = 0 if t \ne z_d, that is, D_t^d = \I(t=z_d) and D_t^\d = D_t - \I(t=z_d)

P(\D_d,z_d=t | \D_\d,\zz_\d)
&= \left(
   \prod_{t'=1}^T
   \frac{\Gamma(N_{t'}^\d+\beta)}{\prod_{v=1}^V\Gamma(N_{v|t'}^\d+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t'}^d+N_{v|t'}^\d+\beta n_v)}
   {\Gamma(N_{t'}^d+N_{t'}^\d+\beta)}
   \right)
   \left(
   \frac{\Gamma(D^\d+\alpha)}{\prod_{t'=1}^T\Gamma(D_{t'}^\d+\alpha m_{t'})}
   \frac{\prod_{t'=1}^T\Gamma(D_{t'}^d+D_{t'}^\d+\alpha m_{t'})}
   {\Gamma(D^d+D^\d+\alpha)}
   \right) \\
&= \left(
   \frac{\Gamma(N_t^\d+\beta)}{\prod_{v=1}^V\Gamma(N_{v|t}^\d+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}^d+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_t^d+N_t^\d+\beta)}
   \right)
   \left(
   \frac{\Gamma(D-1+\alpha)}{\Gamma(D_t^\d+\alpha m_t)}
   \frac{\Gamma(1+D_t^\d+\alpha m_t)}{\Gamma(D+\alpha)}
   \right) \\
&= \frac{\Gamma(N_t^\d+\beta)}{\Gamma(N_t^d+N_t^\d+\beta)}
   \left(
   \prod_{v=1}^V
   \frac{\Gamma(N_{v|t}^d+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_{v|t}^\d+\beta n_v)}
   \right)
   \frac{D_t^\d+\alpha m_t}{D-1+\alpha}

Let N_d denote the number of tokens in document d and N_{v|d} the number of occurrences of word type v in document d. Since N_t^d=N_d and N_{v|t}^d=N_{v|d} when z_d=t, the probability P(\D_d,z_d=t | \D_\d,\zz_\d) can be written as follows.

P(\D_d,z_d=t | \D_\d,\zz_\d)
&= \frac{\Gamma(N_t^\d+\beta)}{\Gamma(N_d+N_t^\d+\beta)}
   \left(
   \prod_{v=1}^V
   \frac{\Gamma(N_{v|d}+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_{v|t}^\d+\beta n_v)}
   \right)
   \frac{D_t^\d+\alpha m_t}{D-1+\alpha}

Here the data \D=\D_d\cup\D_\d is observed, so once \zz_\d is fixed we can evaluate P(\D_d,z_d=t | \D_\d,\zz_\d) for all t=1,2,\dots,T and draw a sample of z_d from the resulting unnormalized discrete distribution.
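Putting the pieces together, one full Gibbs sweep over the documents might look like the following sketch. It keeps running counts N_{v|t}, N_t, D_t, removes document d to obtain the \d ("minus d") counts, evaluates the final expression above in log space, and samples z_d from the unnormalized distribution. All function and variable names are ours.

```python
import numpy as np
from scipy.special import gammaln

def gibbs_sweep(z, docs, Nvt, Nt, Dt, alpha, beta, m, n, rng):
    """One Gibbs sweep: resample every z_d (a sketch).

    Nvt[t, v] = N_{v|t}, Nt[t] = N_t, Dt[t] = D_t are running counts
    kept consistent with the current assignments z.
    """
    D = len(docs)
    for d, doc in enumerate(docs):
        t_old = z[d]
        # Remove document d from the counts -> the "minus d" counts.
        np.add.at(Nvt[t_old], doc, -1)
        Nt[t_old] -= len(doc)
        Dt[t_old] -= 1

        # Per-document counts N_{v|d} and N_d.
        Nvd = np.bincount(doc, minlength=Nvt.shape[1])
        Nd = len(doc)

        # log P(D_d, z_d=t | ...) for every t, from the final formula.
        logp = (gammaln(Nt + beta) - gammaln(Nd + Nt + beta)
                + (gammaln(Nvd + Nvt + beta * n)
                   - gammaln(Nvt + beta * n)).sum(axis=1)
                + np.log(Dt + alpha * m) - np.log(D - 1 + alpha))

        # Sample z_d from the unnormalized discrete distribution.
        p = np.exp(logp - logp.max())
        t_new = rng.choice(len(p), p=p / p.sum())

        # Add document d back under its new assignment.
        z[d] = t_new
        np.add.at(Nvt[t_new], doc, 1)
        Nt[t_new] += len(doc)
        Dt[t_new] += 1
    return z
```

Subtracting the maximum of logp before exponentiating avoids underflow without changing the normalized distribution.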
