Bayesian Methods for Text

Latent Dirichlet allocation: Gibbs sampling

«  Dirichlet–multinomial mixture model: exploration and prediction   ::   Contents   ::   Hyperparameter inference: slice sampling  »


Mixtures vs admixtures

Mixture model

\Psi=\{\pphi_1, \dots, \pphi_T, \ttheta, \zz\} where \zz=(z_1,\dots,z_D), z_i\in\{1,\dots,T\}

z_d \sim \ttheta

Admixture model

\Psi=\{\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D, \zz\} where \ttheta_d=(\theta_{1|d},\dots,\theta_{T|d}) and \zz=(z_1,\dots,z_N)

z_n \sim \ttheta_{d_n}

Model

\D=\{\ww=(w_1,\dots,w_N)\}

\Psi=\{\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D, \zz=(z_1,\dots,z_N)\}

\pphi_t \sim \Dir(\pphi_t; \beta \nn)

P(\pphi_1,\dots,\pphi_T | \H) = \prod_{t=1}^T \Dir(\pphi_t; \beta \nn)

P(\ttheta_1,\dots,\ttheta_D | \H) =
\prod_{d=1}^D \Dir(\ttheta_d; \alpha \mm)

  • N tokens
  • T model components
  • any token can be drawn from any model component

Generative process

for t=1,\dots,T
\pphi_t \sim \Dir(\pphi_t; \beta \nn)

for d=1,\dots,D
\ttheta_d \sim \Dir(\ttheta_d; \alpha \mm)

for n=1,\dots,N
z_n \sim \ttheta_{d_n}
w_n \sim \pphi_{z_n}

Note that D and the document lengths N_d (and hence N=\sum_d N_d) can be generated from Poisson distributions.
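As a sketch, the generative process above can be written directly in NumPy. All sizes and hyperparameter values below are illustrative, not from the notes; `beta_n` and `alpha_m` hold the products \beta\nn and \alpha\mm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T topics, D documents, V vocabulary words.
T, D, V = 3, 5, 8
beta_n = np.full(V, 0.1)   # beta * n (concentration times base measure)
alpha_m = np.full(T, 0.5)  # alpha * m

# for t = 1..T: phi_t ~ Dir(beta * n)
phi = rng.dirichlet(beta_n, size=T)        # shape (T, V)
# for d = 1..D: theta_d ~ Dir(alpha * m)
theta = rng.dirichlet(alpha_m, size=D)     # shape (D, T)

# Document lengths N_d drawn from a Poisson, as the note suggests.
N_d = rng.poisson(10, size=D) + 1

docs, topics = [], []
for d in range(D):
    # for each token n in document d: z_n ~ theta_d, then w_n ~ phi_{z_n}
    z = rng.choice(T, size=N_d[d], p=theta[d])
    w = np.array([rng.choice(V, p=phi[t]) for t in z])
    docs.append(w)
    topics.append(z)
```

Each token's topic is drawn from its document's \ttheta_d, so a document mixes several topics, which is exactly what distinguishes the admixture from the mixture model above.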

Model explanation

P(\Psi|\D)
&= P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D,\zz | \D) \\
&= P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D | \D,\zz) P(\zz | \D)

P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D | \D,\zz)
&= \frac{P(\D,\zz | \pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D)
         P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D)}
        {P(\D, \zz)}

P(\D,\zz | \pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D)
&= P(\D | \zz,\pphi_1,\dots,\pphi_T) P(\zz | \ttheta_1,\dots,\ttheta_D) \\
&= \prod_{t=1}^T \prod_{v=1}^V \phi_{v|t}^{N_{v|t}}
   \prod_{d=1}^D \prod_{t=1}^T \theta_{t|d}^{N_{t|d}}

prior

P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D)
= \prod_{t=1}^T \Dir(\pphi_t; \beta\nn)
  \prod_{d=1}^D \Dir(\ttheta_d; \alpha\mm)

evidence

P(\D,\zz)
&=\int d\pphi_1 \cdots \int d\pphi_T \int d\ttheta_1 \cdots \int d\ttheta_D
P(\D,\zz | \pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D)
P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D) \\
&= \prod_{t=1}^T \left[
   \int d\pphi_t \prod_{v=1}^V \phi_{v|t}^{N_{v|t}} \Dir(\pphi_t; \beta\nn)
   \right]
   \prod_{d=1}^D \left[
   \int d\ttheta_d \prod_{t=1}^T \theta_{t|d}^{N_{t|d}}
   \Dir(\ttheta_d;\alpha\mm) \right] \\
&= \prod_{t=1}^T \frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)}
   \prod_{d=1}^D \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(N_{t|d}+\alpha m_t)}{\Gamma(N_d+\alpha)}
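The closed form above can be evaluated in log space with `lgamma`. The following sketch (function and argument names are my own) computes \log P(\D,\zz) from the count matrices:

```python
import numpy as np
from math import lgamma

def log_evidence(Nvt, Ntd, beta, n, alpha, m):
    """log P(D, z) as a product of Dirichlet-multinomial terms:
    one per topic (over the vocabulary) and one per document (over topics).
    Nvt: (V, T) counts N_{v|t}; Ntd: (T, D) counts N_{t|d};
    n, m: base measures; beta, alpha: concentration parameters."""
    V, T = Nvt.shape
    D = Ntd.shape[1]
    lp = 0.0
    for t in range(T):                       # topic-word terms
        Nt = Nvt[:, t].sum()
        lp += lgamma(beta) - lgamma(Nt + beta)
        for v in range(V):
            lp += lgamma(Nvt[v, t] + beta * n[v]) - lgamma(beta * n[v])
    for d in range(D):                       # document-topic terms
        Nd = Ntd[:, d].sum()
        lp += lgamma(alpha) - lgamma(Nd + alpha)
        for t in range(T):
            lp += lgamma(Ntd[t, d] + alpha * m[t]) - lgamma(alpha * m[t])
    return lp
```

With all counts zero every \Gamma ratio cancels and the log evidence is 0, which is a quick sanity check on the implementation.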

posterior

&P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D | \D, \zz) \\
&= P(\pphi_1,\dots,\pphi_T | \D, \zz)
   P(\ttheta_1,\dots,\ttheta_D | \D, \zz) \\
&= \left\{
   \prod_{t=1}^T \Dir(\pphi_t; N_t+\beta, \left(
   \frac{N_{1|t}+\beta n_1}{N_t+\beta},
   \dots,\frac{N_{V|t}+\beta n_V}{N_t+\beta}\right))
   \right\}
   \left\{
   \prod_{d=1}^D \Dir(\ttheta_d; N_d+\alpha,
   \left(\frac{N_{1|d}+\alpha m_1}{N_d+\alpha}, \dots,
   \frac{N_{T|d}+\alpha m_T}{N_d+\alpha}\right))
   \right\} \\

Goal:

P(\pphi_1,\dots,\pphi_T,\ttheta_1,\dots,\ttheta_D, \zz | \D)
= P(\pphi_1,\dots,\pphi_T | \D, \zz)
  P(\ttheta_1,\dots,\ttheta_D | \D, \zz) P(\zz | \D)

P(\zz | \D)
= \frac{P(\D | \zz) P(\zz)}{P(\D)}

Gibbs sampling

Use Gibbs sampling to draw samples from P(\zz|\D):

P(z_n | \D, \zz_{\setminus n})
= \frac{P(w_n, z_n | \D_{\setminus n}, \zz_{\setminus n})}
  {\sum_{z_n} P(w_n, z_n | \D_{\setminus n}, \zz_{\setminus n})}
\propto P(w_n, z_n | \D_{\setminus n}, \zz_{\setminus n})

P(w_n=v, z_n=t | \D_{\setminus n}, \zz_{\setminus n})
= P(w_n=v | z_n=t, \D_{\setminus n}, \zz_{\setminus n})
  P(z_n=t | \D_{\setminus n}, \zz_{\setminus n})

P(w_n=v | z_n=t, \D_{\setminus n}, \zz_{\setminus n})
&= \int d\pphi_1 \cdots \int d\pphi_T
   \phi_{v|t}\prod_{t=1}^T\Dir(\pphi_t; N_t^{\setminus n}+\beta,
   \left(\frac{N_{1|t}^{\setminus n}+\beta n_1}{N_t^{\setminus n}+\beta},
   \dots,\frac{N_{V|t}^{\setminus n}+\beta n_V}{N_t^{\setminus n}+\beta}
   \right)) \\
&= \E[\phi_{v|t}] \\
&= \frac{N_{v|t}^{\setminus n}+\beta n_v}{N_t^{\setminus n}+\beta}

P(z_n=t | \D_{\setminus n}, \zz_{\setminus n})
&= \int d\ttheta_1 \cdots \int d\ttheta_D
   \theta_{t|d}\prod_{d=1}^D\Dir(\ttheta_d; N_d^{\setminus n}+\alpha,
   \left(\frac{N_{1|d}^{\setminus n}+\alpha m_1}{N_d^{\setminus n}+\alpha},
   \dots,\frac{N_{T|d}^{\setminus n}+\alpha m_T}{N_d^{\setminus n}+\alpha}
   \right)) \\
&= \E[\theta_{t|d}] \\
&= \frac{N_{t|d}^{\setminus n}+\alpha m_t}{N_d^{\setminus n}+\alpha} \\
&= \frac{N_{t|d}^{\setminus n}+\alpha m_t}{N_d-1+\alpha}

The denominator N_d-1+\alpha does not depend on t, so it can be dropped inside the innermost loop.

z_n^{(s)}\sim P(z_n|\D,z_1^{(s)},\dots,z_{n-1}^{(s)},
                z_{n+1}^{(s-1)},\dots,z_N^{(s-1)})
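A minimal collapsed Gibbs sweep over the tokens, following the update above. All names are illustrative; `beta_n` and `alpha_m` hold the products \beta n_v and \alpha m_t:

```python
import numpy as np

def gibbs_sweep(w, d, z, Nvt, Nt, Ntd, beta_n, alpha_m, rng):
    """One Gibbs sweep over all tokens. w, d, z are length-N arrays of
    word ids, document ids, and current topic assignments; Nvt (V, T),
    Nt (T,), Ntd (T, D) are the count statistics, updated in place."""
    beta = beta_n.sum()
    for i in range(len(w)):
        v, doc, t = w[i], d[i], z[i]
        # Remove token i from the counts (the \setminus n statistics).
        Nvt[v, t] -= 1; Nt[t] -= 1; Ntd[t, doc] -= 1
        # P(z_n=t | ...) proportional to
        #   (N_{v|t} + beta n_v) / (N_t + beta) * (N_{t|d} + alpha m_t);
        # the denominator N_d - 1 + alpha is constant in t and dropped.
        p = (Nvt[v] + beta_n[v]) / (Nt + beta) * (Ntd[:, doc] + alpha_m)
        t = rng.choice(len(p), p=p / p.sum())
        # Add token i back with its new assignment.
        z[i] = t
        Nvt[v, t] += 1; Nt[t] += 1; Ntd[t, doc] += 1
    return z
```

Because the counts are updated in place, each token's conditional automatically uses the freshest assignments of all other tokens, which is the sequential-scan sampler written above.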

Exploration

\E[\phi_{v|t}]
&= \int d\Psi \phi_{v|t} P(\Psi | \D) \\
&= \int d\pphi_t \phi_{v|t} P(\pphi_t | \D) \\
&= \sum_{\zz} \int d\pphi_t \phi_{v|t} P(\pphi_t | \zz, \D) P(\zz | \D) \\
&= \sum_{\zz} \frac{N_{v|t}+\beta n_v}{N_t+\beta} P(\zz | \D) \\
&\approx
   \frac{1}{S} \sum_{s=1}^S \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta}

likewise,

\E[\theta_{t|d}]
\approx \frac{1}{S}\sum_{s=1}^S
\frac{N_{t|d}^{(s)}+\alpha m_t}{N_d^{(s)}+\alpha}

However, due to the label-switching problem, topic indices are not comparable across samples, so in practice only the single sample with the highest probability is used to estimate these expectations.
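From the count statistics of a single sample, the plug-in estimates above take one line each in NumPy (function and variable names are illustrative):

```python
import numpy as np

def point_estimates(Nvt, Nt, Ntd, Nd, beta_n, alpha_m):
    """Estimates of phi and theta from the counts of one Gibbs sample.
    Shapes: Nvt (V, T), Nt (T,), Ntd (T, D), Nd (D,);
    beta_n = beta * n (length V), alpha_m = alpha * m (length T)."""
    # E[phi_{v|t}] = (N_{v|t} + beta n_v) / (N_t + beta)
    phi = (Nvt + beta_n[:, None]) / (Nt + beta_n.sum())
    # E[theta_{t|d}] = (N_{t|d} + alpha m_t) / (N_d + alpha)
    theta = (Ntd + alpha_m[:, None]) / (Nd + alpha_m.sum())
    return phi, theta
```

Each column of `phi` and of `theta` sums to one by construction, since the numerators sum to the corresponding denominator.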

Prediction

P(\D' | \D)
&= \sum_\zz \sum_{\zz'} P(\D',\zz'|\D,\zz) P(\zz|\D) \\
&\approx \frac{1}{S} \sum_{s=1}^S \sum_{\zz'} P(\D',\zz'|\D,\zz^{(s)})

P(\D' | \D, \zz^{(s)})
&= \prod_{n=1}^{N'} P(w_n'|w_1',\dots,w_{n-1}',\D,\zz^{(s)}) \\
&= \prod_{n=1}^{N'} \sum_{z_1'}\cdots\sum_{z_n'}
   P(w_n',z_1',\dots,z_n' | w_1',\dots,w_{n-1}',\D,\zz^{(s)}) \\
&= \prod_{n=1}^{N'} \sum_{z_1'}\cdots\sum_{z_n'}
   P(w_n',z_n' | w_1',\dots,w_{n-1}',z_1',\dots,z_{n-1}',\D,\zz^{(s)})
   P(z_1',\dots,z_{n-1}' | w_1',\dots,w_{n-1}',\D,\zz^{(s)}) \\
&= \prod_{n=1}^{N'} \sum_{z_1'}\cdots\sum_{z_{n-1}'}
   \sum_{t=1}^T
   \frac{N_{v|t}^{'<n}+N_{v|t}^{(s)}+\beta n_v}{N_t^{'<n}+N_t^{(s)}+\beta}
   \frac{N_{t|d_n'}^{'<n}+N_{t|d_n'}^{(s)}+\alpha m_t}
   {N_{d_n'}^{'<n}+N_{d_n'}^{(s)}+\alpha}
   P(z_1',\dots,z_{n-1}' | w_1',\dots,w_{n-1}',\D,\zz^{(s)}) \\
&\approx
   \prod_{n=1}^{N'} \frac{1}{R} \sum_{r=1}^R \sum_{t=1}^T
   \frac{N_{v|t}^{'<n(r)}+N_{v|t}^{(s)}+\beta n_v}
   {N_t^{'<n(r)}+N_t^{(s)}+\beta}
   \frac{N_{t|d_n'}^{'<n(r)}+N_{t|d_n'}^{(s)}+\alpha m_t}
   {N_{d_n'}^{'<n(r)}+N_{d_n'}^{(s)}+\alpha} \\
&= \prod_{n=1}^{N'} \frac{1}{R} \sum_{r=1}^R \sum_{t=1}^T
   \frac{N_{v|t}^{'<n(r)}+N_{v|t}^{(s)}+\beta n_v}
   {N_t^{'<n(r)}+N_t^{(s)}+\beta}
   \frac{N_{t|d_n'}^{'<n(r)}+\alpha m_t}
   {N_{d_n'}^{'<n(r)}+\alpha}

where, from the third-to-last line on, v denotes the value of w_n' and d_n' denotes the document containing w_n'. In addition, N_{t|d_n'}^{(s)}=0 and N_{d_n'}^{(s)}=0 because test documents do not appear in the training set.
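The per-token factor in the last line can be sketched as follows. Training counts (suffix `_s`) enter only the topic-word term, since the test document's topic counts start at zero; all names are illustrative:

```python
import numpy as np

def next_token_prob(v, doc, Nvt_s, Nt_s, Nvt_new, Nt_new, Ntd_new, Nd_new,
                    beta_n, alpha_m):
    """Predictive probability of word v as the next token of test document
    doc, mixing over topics: sum_t P(w=v | t) P(t | doc).
    _s: training counts from one Gibbs sample; _new: counts over the
    test tokens sampled so far (the '<n(r)' statistics)."""
    # (N_{v|t}^{'<n} + N_{v|t}^{(s)} + beta n_v) / (N_t^{'<n} + N_t^{(s)} + beta)
    pw = (Nvt_s[v] + Nvt_new[v] + beta_n[v]) / (Nt_s + Nt_new + beta_n.sum())
    # (N_{t|d'}^{'<n} + alpha m_t) / (N_{d'}^{'<n} + alpha)
    pt = (Ntd_new[:, doc] + alpha_m) / (Nd_new[doc] + alpha_m.sum())
    return float(pw @ pt)
```

With all counts zero this reduces to \sum_t n_v m_t = n_v, i.e. the prior base measure over the vocabulary, as expected.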

The formula below is applied in the derivation:

P(w_n=v,z_n=t | \D_{\setminus n},\zz_{\setminus n})
=\frac{N_{v|t}^{\setminus n}+\beta n_v}{N_t^{\setminus n}+\beta}
 \frac{N_{t|d}^{\setminus n}+\alpha m_t}{N_d-1+\alpha}
