Bayesian Methods for Text

Dirichlet–multinomial mixture model: Gibbs sampling

This model is identical to the previous Dirichlet–multinomial mixture model with known groups, except that this time the document–group assignment \bm{z} is no longer observed.

Random variables

  • \D=\{w_1,\dots,w_N\} where w_n\in\{1,2,\dots,V\}
  • \Psi=\{\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta},\bm{z}\}
    • \gbm{\phi}_t=(\phi_{1|t},\phi_{2|t},\dots,\phi_{V|t})
    • \gbm{\theta}=(\theta_1,\theta_2,\dots,\theta_T)
    • \bm{z}=(z_1,z_2,\dots,z_D) where z_d\in\{1,2,\dots,T\}

where

  • N: number of tokens
  • D: number of documents
  • V: size of the vocabulary (number of distinct word types)
  • T: number of groups

Generative process

  • \gbm{\theta}\sim\Dir(\gbm{\theta};\alpha,\bm{m})
  • \gbm{\phi}_1,\dots,\gbm{\phi}_T
\sim\prod_{t=1}^T\Dir(\gbm{\phi}_t;\beta,\bm{n})
  • z_d\sim\gbm{\theta} for d=1,\dots,D
  • each token of document d is drawn from \gbm{\phi}_{z_d}
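The generative process can be sketched with NumPy. All sizes, hyperparameter values, and the Poisson document lengths below are hypothetical choices for illustration; note that \Dir(\cdot;\alpha,\bm{m}) in this notation corresponds to standard Dirichlet parameters \alpha m_t.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes and hyperparameters (not from the text).
T, V, D = 3, 5, 10           # groups, vocabulary size, documents
alpha, beta = 2.0, 1.0       # concentration parameters
m = np.full(T, 1.0 / T)      # base measure over groups
n = np.full(V, 1.0 / V)      # base measure over the vocabulary

# theta ~ Dir(theta; alpha, m): standard Dirichlet parameters alpha * m
theta = rng.dirichlet(alpha * m)
# phi_t ~ Dir(phi_t; beta, n) independently for each group t
phi = rng.dirichlet(beta * n, size=T)

docs = []
for d in range(D):
    z_d = rng.choice(T, p=theta)     # document-group assignment
    N_d = rng.poisson(20) + 1        # document length (assumed Poisson)
    docs.append(rng.choice(V, size=N_d, p=phi[z_d]))
```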

Challenges in computing the posterior

P(\Psi|\D,\H)
&= P(\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta},\bm{z}|\D,\H) \\
&= P(\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta}|\bm{z},\D,\H)
   P(\bm{z}|\D,\H)

where

P(\gbm{\phi}_1,\dots,\gbm{\phi}_T,\gbm{\theta}|\bm{z},\D,\H)
=  \Dir\left(\gbm{\theta}; D+\alpha, \left(
   \frac{D_1+\alpha m_1}{D+\alpha},\dots,
   \frac{D_T+\alpha m_T}{D+\alpha}\right)\right)
   \prod_{t=1}^T
   \Dir\left(\gbm{\phi}_t; N_t+\beta, \left(
   \frac{N_{1|t}+\beta n_1}{N_t+\beta},\dots,
   \frac{N_{V|t}+\beta n_V}{N_t+\beta}\right)\right)

which is a product of (T+1) Dirichlet distributions that we know how to compute.
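Given a fixed \bm{z}, sampling from this conditional posterior is straightforward: each Dirichlet factor \Dir(\cdot;c,\bm{p}) has standard parameters c\,p_i, i.e. the count-updated values D_t+\alpha m_t and N_{v|t}+\beta n_v. A minimal sketch (function name and signature are ours):

```python
import numpy as np

def sample_posterior_given_z(z, docs, T, V, alpha, beta, m, n, rng):
    """Draw (theta, phi_1..phi_T) from P(theta, phi | z, D, H),
    a product of T + 1 Dirichlet distributions (a sketch)."""
    # Count N_{v|t} and D_t from the fixed assignments z.
    Nvt = np.zeros((T, V))
    for d, doc in enumerate(docs):
        np.add.at(Nvt[z[d]], doc, 1)
    Dt = np.bincount(z, minlength=T)

    # Dir(theta; D+alpha, (D_t+alpha m_t)/(D+alpha)) has standard
    # Dirichlet parameters D_t + alpha * m_t.
    theta = rng.dirichlet(Dt + alpha * m)
    # Likewise Dir(phi_t; N_t+beta, ...) has parameters N_{v|t} + beta * n_v.
    phi = np.stack([rng.dirichlet(Nvt[t] + beta * n) for t in range(T)])
    return theta, phi
```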

As for P(\bm{z}|\D,\H), it can be written via Bayes' rule as follows.

P(\bm{z}|\D,\H)=\frac{P(\D,\bm{z}|\H)}{P(\D|\H)}
=\frac{P(\D,\bm{z}|\H)}{\sum_{\bm{z}}P(\D,\bm{z}|\H)}

where the numerator P(\D,\bm{z}|\H) is the joint evidence for the data and \bm{z}, which we also know how to compute.

P(\D,\bm{z}|\H)
=  \left(
   \prod_{t=1}^T\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)}
   \right)
   \left(
   \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t+\alpha m_t)}{\Gamma(D+\alpha)}
   \right)
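Since this expression is a product of many Gamma-function ratios, it is best evaluated in log space. A sketch using scipy.special.gammaln (the function below and its signature are ours):

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(z, docs, T, V, alpha, beta, m, n):
    """log P(D, z | H) from the closed form above (a sketch)."""
    z = np.asarray(z)
    # Count matrices: N_{v|t}, N_t, D_t
    Nvt = np.zeros((T, V))
    for d, doc in enumerate(docs):
        np.add.at(Nvt[z[d]], doc, 1)
    Nt = Nvt.sum(axis=1)
    Dt = np.bincount(z, minlength=T)
    D = len(docs)

    # Word part: prod over groups t of the Gamma ratios.
    lp = T * (gammaln(beta) - gammaln(beta * n).sum())
    lp += gammaln(Nvt + beta * n).sum() - gammaln(Nt + beta).sum()
    # Group part.
    lp += gammaln(alpha) - gammaln(alpha * m).sum()
    lp += gammaln(Dt + alpha * m).sum() - gammaln(D + alpha)
    return lp
```

As a sanity check, for a corpus of one single-token document the evidence reduces to P(\D,z)=n_w m_t, so summing over all tokens and assignments gives 1.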

However, the computation of the denominator P(\D|\H)=\sum_{\bm{z}}P(\D,\bm{z}|\H) is intractable.

\sum_{\bm{z}}P(\D,\bm{z}|\H)
&= \sum_{z_1=1}^T \sum_{z_2=1}^T \cdots \sum_{z_D=1}^T
   \left(
   \prod_{t=1}^T\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)}
   \right)
   \left(
   \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t+\alpha m_t)}{\Gamma(D+\alpha)}
   \right)

This is a sum of T^D terms that cannot be factorized further, because the counts D_t, N_t and N_{v|t} couple all of z_1,\dots,z_D. For example, the chain-rule decomposition

P(\D,\bm{z})
=\prod_{d=1}^D P(\D_d,z_d|\D_1,\dots,\D_{d-1},z_1,\dots,z_{d-1})

does not help, since each factor still depends on all earlier documents and assignments. Other factorizations of the posterior likewise end up computationally intractable.

Gibbs sampling

Gibbs sampling is a Markov chain Monte Carlo (MCMC) method for sampling from a joint distribution over several random variables that is known only up to a normalization constant. For \bm{x}=(x_1,\dots,x_n), instead of sampling from P(\bm{x}) directly, a Gibbs sampler iteratively samples each coordinate from its full conditional P(x_i|\bm{x}_{\setminus i})=P(x_i|\bm{x}\setminus x_i)
=P(x_i|x_1,\dots,x_{i-1},x_{i+1},\dots,x_n).

  • initialize x_1^{(0)}, \dots, x_n^{(0)} arbitrarily.
  • on iteration t+1
    • x_1^{(t+1)}\sim P(x_1|x_2^{(t)},\dots,x_n^{(t)})
    • x_2^{(t+1)}\sim P(x_2|x_1^{(t+1)},x_3^{(t)},\dots,x_n^{(t)})
    • x_3^{(t+1)}\sim P(x_3|x_1^{(t+1)},x_2^{(t+1)},x_4^{(t)},
\dots,x_n^{(t)})
    • \vdots
    • x_n^{(t+1)}\sim P(x_n|x_1^{(t+1)},\dots,x_{n-1}^{(t+1)})
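As a toy example unrelated to the text model, a Gibbs sampler for a standard bivariate normal with correlation rho alternates between its two normal full conditionals, x_1|x_2\sim\mathcal{N}(\rho x_2, 1-\rho^2) and symmetrically for x_2. A minimal sketch (all names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_normal(rho, iters=5000):
    """Gibbs sampler for a standard bivariate normal with correlation rho."""
    x1, x2 = 0.0, 0.0                  # arbitrary initialization
    samples = np.empty((iters, 2))
    sd = np.sqrt(1.0 - rho ** 2)       # conditional standard deviation
    for s in range(iters):
        x1 = rng.normal(rho * x2, sd)  # sample x1 from P(x1 | x2)
        x2 = rng.normal(rho * x1, sd)  # sample x2 given the updated x1
        samples[s] = x1, x2
    return samples

samples = gibbs_bivariate_normal(0.8, iters=20000)
```

After discarding a burn-in prefix, the empirical correlation of the draws approaches rho, even though the sampler only ever used the one-dimensional conditionals.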

Gibbs sampler for Dirichlet–multinomial mixture model

Let \D_d denote the tokens associated with document d.

z_d^{(s)}\sim P(z_d|\D,\zz_\d)
=\frac{P(\D_d,z_d | \D_\d,\zz_\d)}{P(\D_d | \D_\d,\zz_\d)}

where the denominator P(\D_d | \D_\d,\zz_\d) is a normalization constant which can be ignored because we are able to sample from an unnormalized discrete distribution.
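Sampling from an unnormalized discrete distribution only requires normalizing the T weights before drawing. A minimal sketch (function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_discrete(weights, rng):
    """Draw an index from an unnormalized discrete distribution."""
    w = np.asarray(weights, dtype=float)
    return rng.choice(len(w), p=w / w.sum())

t = sample_discrete([0.2, 3.0, 0.8], rng)  # index 1 is drawn most often
```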

The numerator P(\D_d,z_d | \D_\d,\zz_\d) is the evidence for document d and its document–group assignment z_d.

P(\D_d,z_d | \D_\d,\zz_\d)
&= \left(
   \prod_{t=1}^T
   \frac{\Gamma(N_t^\d+\beta)}{\prod_{v=1}^V\Gamma(N_{v|t}^\d+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}^d+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_t^d+N_t^\d+\beta)}
   \right)
   \left(
   \frac{\Gamma(D^\d+\alpha)}{\prod_{t=1}^T\Gamma(D_t^\d+\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t^d+D_t^\d+\alpha m_t)}
   {\Gamma(D^d+D^\d+\alpha)}
   \right)

where

  • N_t = N_t^d + N_t^\d; N_{v|t} = N_{v|t}^d + N_{v|t}^\d
  • N_{v|t}^d = 0 if t \ne z_d; N_t^d = 0 if t \ne z_d
  • D^d = 1; D^\d = D - 1
  • D_t^d = 0 if t \ne z_d, that is, D_t^d = \I(t=z_d) and D_t^\d = D_t - \I(t=z_d)

P(\D_d,z_d=t | \D_\d,\zz_\d)
&= \left(
   \prod_{t'=1}^T
   \frac{\Gamma(N_{t'}^\d+\beta)}{\prod_{v=1}^V\Gamma(N_{v|t'}^\d+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t'}^d+N_{v|t'}^\d+\beta n_v)}
   {\Gamma(N_{t'}^d+N_{t'}^\d+\beta)}
   \right)
   \left(
   \frac{\Gamma(D^\d+\alpha)}{\prod_{t'=1}^T\Gamma(D_{t'}^\d+\alpha m_{t'})}
   \frac{\prod_{t'=1}^T\Gamma(D_{t'}^d+D_{t'}^\d+\alpha m_{t'})}
   {\Gamma(D^d+D^\d+\alpha)}
   \right) \\
&= \left(
   \frac{\Gamma(N_t^\d+\beta)}{\prod_{v=1}^V\Gamma(N_{v|t}^\d+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}^d+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_t^d+N_t^\d+\beta)}
   \right)
   \left(
   \frac{\Gamma(D-1+\alpha)}{\Gamma(D_t^\d+\alpha m_t)}
   \frac{\Gamma(1+D_t^\d+\alpha m_t)}{\Gamma(D+\alpha)}
   \right) \\
&= \frac{\Gamma(N_t^\d+\beta)}{\Gamma(N_t^d+N_t^\d+\beta)}
   \left(
   \prod_{v=1}^V
   \frac{\Gamma(N_{v|t}^d+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_{v|t}^\d+\beta n_v)}
   \right)
   \frac{D_t^\d+\alpha m_t}{D-1+\alpha}

Let N_d denote the number of tokens in document d and N_{v|d} the number of occurrences of word type v in document d. Since N_t^d=N_d and N_{v|t}^d=N_{v|d} when z_d=t, the probability P(\D_d,z_d=t | \D_\d,\zz_\d) can be written as follows.

P(\D_d,z_d=t | \D_\d,\zz_\d)
&= \frac{\Gamma(N_t^\d+\beta)}{\Gamma(N_d+N_t^\d+\beta)}
   \left(
   \prod_{v=1}^V
   \frac{\Gamma(N_{v|d}+N_{v|t}^\d+\beta n_v)}
   {\Gamma(N_{v|t}^\d+\beta n_v)}
   \right)
   \frac{D_t^\d+\alpha m_t}{D-1+\alpha}

Here the data \D=\D_d\cup\D_\d is observed, so once \zz_\d is fixed we can evaluate P(\D_d,z_d=t | \D_\d,\zz_\d) for all t=1,2,\dots,T and draw a sample of z_d from the resulting unnormalized discrete distribution.
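Putting the pieces together, one full Gibbs sweep over the documents might look like the following sketch. It keeps running counts N_{v|t}, N_t, D_t, removes document d to obtain the \d ("minus d") counts, evaluates the final expression above in log space, and samples z_d from the unnormalized distribution. All function and variable names are ours.

```python
import numpy as np
from scipy.special import gammaln

def gibbs_sweep(z, docs, Nvt, Nt, Dt, alpha, beta, m, n, rng):
    """One Gibbs sweep: resample every z_d (a sketch).

    Nvt[t, v] = N_{v|t}, Nt[t] = N_t, Dt[t] = D_t are running counts
    kept consistent with the current assignments z.
    """
    D = len(docs)
    for d, doc in enumerate(docs):
        t_old = z[d]
        # Remove document d from the counts -> the "minus d" counts.
        np.add.at(Nvt[t_old], doc, -1)
        Nt[t_old] -= len(doc)
        Dt[t_old] -= 1

        # Per-document counts N_{v|d} and N_d.
        Nvd = np.bincount(doc, minlength=Nvt.shape[1])
        Nd = len(doc)

        # log P(D_d, z_d=t | ...) for every t, from the final formula.
        logp = (gammaln(Nt + beta) - gammaln(Nd + Nt + beta)
                + (gammaln(Nvd + Nvt + beta * n)
                   - gammaln(Nvt + beta * n)).sum(axis=1)
                + np.log(Dt + alpha * m) - np.log(D - 1 + alpha))

        # Sample z_d from the unnormalized discrete distribution.
        p = np.exp(logp - logp.max())
        t_new = rng.choice(len(p), p=p / p.sum())

        # Add document d back under its new assignment.
        z[d] = t_new
        np.add.at(Nvt[t_new], doc, 1)
        Nt[t_new] += len(doc)
        Dt[t_new] += 1
    return z
```

Subtracting the maximum of logp before exponentiating avoids underflow without changing the normalized distribution.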
