Bayesian Methods for Text

Dirichlet–multinomial mixture model: exploration and prediction

Recap

  • \D=\{w_1,\dots,w_N\}
  • \D_d: tokens in document d
  • \Psi=\{\pphi_1,\dots,\pphi_T,\gbm{\theta},\zz\}

The posterior of the Dirichlet–multinomial mixture model is given by

P(\Psi|\D,\H) = P(\pphi_1,\dots,\pphi_T,\gbm{\theta}|\D,\zz,\H) P(\zz|\D,\H)

where P(\pphi_1,\dots,\pphi_T,\gbm{\theta}|\D,\zz,\H) is a product of T+1 Dirichlet distributions. The other factor can be obtained by

P(\zz|\D,\H)
= \frac{P(\D|\zz) P(\zz)}{P(\D)}
= \frac{P(\D|\zz) P(\zz)}{\sum_\zz P(\D|\zz) P(\zz)}

in which the normalizing sum \sum_\zz P(\D|\zz) P(\zz) runs over all T^D topic assignments and is intractable to compute directly; instead, samples from P(\zz|\D,\H) are drawn with Gibbs sampling.
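
For reference, below is a minimal sketch of one sweep of the collapsed Gibbs sampler from the previous section, which is how samples of \zz are obtained in practice. The count arrays mirror the notation of these notes (N_{v|t}, N_t, D_t with pseudo-counts \alpha m_t and \beta n_v), but every function and variable name is an assumption of the sketch, not something fixed by the notes:

    import numpy as np
    from scipy.special import gammaln

    def gibbs_sweep(docs, z, N_vt, N_t, D_t, alpha_m, beta_n, rng):
        """One sweep of collapsed Gibbs sampling for the DMM (one topic per document).

        docs    : list of 1-D int arrays, the word ids of each document
        z       : current topic assignment of each document, length D
        N_vt    : word-topic counts N_{v|t}, shape (V, T)
        N_t     : token counts per topic, shape (T,)
        D_t     : document counts per topic, shape (T,)
        alpha_m : pseudo-counts alpha * m_t, shape (T,)
        beta_n  : pseudo-counts beta * n_v, shape (V,)
        """
        beta = beta_n.sum()
        T = len(D_t)
        for d, w in enumerate(docs):
            vocab, counts = np.unique(w, return_counts=True)
            # remove document d from the counts of its current topic
            t_old = z[d]
            N_vt[vocab, t_old] -= counts
            N_t[t_old] -= len(w)
            D_t[t_old] -= 1
            # log of the full conditional P(z_d = t | z_{-d}, D) for every topic t
            logp = np.log(D_t + alpha_m)
            logp += gammaln(N_t + beta) - gammaln(N_t + len(w) + beta)
            logp += (gammaln(N_vt[vocab, :] + counts[:, None] + beta_n[vocab, None])
                     - gammaln(N_vt[vocab, :] + beta_n[vocab, None])).sum(axis=0)
            p = np.exp(logp - logp.max())
            t_new = rng.choice(T, p=p / p.sum())
            # add document d back under its new topic
            z[d] = t_new
            N_vt[vocab, t_new] += counts
            N_t[t_new] += len(w)
            D_t[t_new] += 1
        return z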

Exploration

P(\pphi_t|\D,\H)
&= \int d\Psi_{\setminus\pphi_t}\,P(\Psi|\D,\H) \\
&= \sum_{\zz} P(\pphi_t,\zz|\D,\H) \\
&= \sum_{\zz} P(\pphi_t|\zz,\D,\H) P(\zz|\D,\H)

\E_{P(\pphi|\D,\H)}[\pphi_t]
&= \int d\pphi_t \pphi_t P(\pphi_t|\D,\H) \\
&= \int d\pphi_t \pphi_t \sum_\zz P(\pphi_t|\zz,\D,\H) P(\zz|\D,\H) \\
&= \sum_\zz \int d\pphi_t \pphi_t P(\pphi_t|\zz,\D,\H) P(\zz|\D,\H) \\

where, using the fact that P(\pphi_t|\zz,\D,\H) is the Dirichlet distribution with parameters (\beta n_1 + N_{1|t},\dots,\beta n_V + N_{V|t}) and its mean is the normalized parameter vector,

\int d\pphi_t \pphi_t P(\pphi_t|\zz,\D,\H)
&= \left(\frac{N_{1|t}+\beta n_1}{N_t+\beta},\dots,
         \frac{N_{V|t}+\beta n_V}{N_t+\beta}\right)

For a single component \phi_{v|t} of \pphi_t,

\E_{P(\pphi|\D,\H)}[\phi_{v|t}]
&= \sum_\zz \frac{N_{v|t}+\beta n_v}{N_t+\beta} P(\zz|\D,\H)

The sum over \zz is intractable, so we approximate it with a sampling technique: draw samples \zz^{(1)},\dots,\zz^{(S)}\sim P(\zz|\D) by Gibbs sampling.

Can we do the following Monte Carlo approximation?

\E_{P(\pphi|\D,\H)}[\phi_{v|t}]
&\approx \frac{1}{S}\sum_{s=1}^S
   \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta}

No, because of label switching.

Label switching

Label switching happens for several reasons:

  • the posterior is invariant under the T! permutations of the topic labels, so every mode has T! equivalent copies
  • the posterior may also have multiple genuinely different modes

Label switching also happens within a single Gibbs run; if it never does, the Markov chain is not mixing well.

Dealing with label switching

  • Matching up topics across samples: very hard in practice, and even harder when the posterior is multi-modal.
  • Averaging samples within a short window of the Markov chain: there is no guarantee that label switching does not happen within that window.
  • Using only one sample, in particular the sample with the highest posterior probability.

In our case, we use a single sample to estimate the latent parameter:

\E_{P(\pphi|\D,\H)}[\phi_{v|t}]
&\approx \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta}
\qquad\text{where } s = \arg\max_{s'} P(\zz^{(s')}|\D)
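
In code, this single-sample estimate is just a smoothed, column-normalized word-topic count matrix built from the chosen sample. A minimal numpy sketch, where the array names N_vt and beta_n are assumptions of the sketch:

    import numpy as np

    def estimate_phi(N_vt, beta_n):
        """Posterior-mean estimate of phi_{v|t} from a single Gibbs sample.

        N_vt   : word-topic counts N_{v|t}^{(s)} of the chosen sample, shape (V, T)
        beta_n : pseudo-counts beta * n_v, shape (V,)
        Returns phi_hat with phi_hat[v, t] = (N_{v|t} + beta n_v) / (N_t + beta).
        """
        N_t = N_vt.sum(axis=0)          # tokens assigned to each topic
        beta = beta_n.sum()
        return (N_vt + beta_n[:, None]) / (N_t + beta)

    # e.g. the ten most probable words of topic t:
    # top_words = np.argsort(-estimate_phi(N_vt, beta_n)[:, t])[:10]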

Prediction

P(\D'|\D,\H)
&= \sum_{\zz'} P(\D',\zz'|\D,\H) \\
&= \sum_{\zz'} \int d\Psi P(\D',\zz'|\Psi,\H) P(\Psi|\D,\H) \\
&= \sum_{\zz'} \int d\Psi P(\D'|\zz',\Psi) P(\zz'|\Psi) P(\Psi|\D,\H) \\

Alternatively,

P(\D'|\D,\H)
&= \sum_{\zz'} \sum_\zz P(\D',\zz'|\D,\zz) P(\zz|\D)

Single new token

P(w_{N+1}=v|\D,\H)
&= \sum_{z_{d_{N+1}}} \sum_\zz P(w_{N+1}=v,z_{d_{N+1}}|\D,\zz) P(\zz|\D) \\
&= \sum_{t=1}^T \sum_\zz P(w_{N+1}=v,z_{d_{N+1}}=t|\D,\zz) P(\zz|\D) \\
&\approx \frac{1}{S} \sum_{t=1}^T \sum_{s=1}^S \left(
   \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot
   \frac{D_t^{(s)}+\alpha m_t}{D+\alpha}\right) \\
&= \frac{1}{S} \sum_{s=1}^S \left[\sum_{t=1}^T \left(
   \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot
   \frac{D_t^{(s)}+\alpha m_t}{D+\alpha}\right)\right]

where \zz^{(1)},\dots,\zz^{(S)}\sim P(\zz|\D).

The Monte Carlo estimate here is not susceptible to label switching: each inner sum \sum_{t=1}^T \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot \frac{D_t^{(s)}+\alpha m_t}{D+\alpha} is itself an estimate of the probability P(w_{N+1}=v|\D,\H) and is invariant to permutations of the topic labels, even though the individual term \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot \frac{D_t^{(s)}+\alpha m_t}{D+\alpha} for a particular t is susceptible to label switching.
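
A minimal sketch of this Monte Carlo estimate, assuming each stored Gibbs sample is kept as a pair of count arrays (N_{v|t}^{(s)}, D_t^{(s)}); all names below are hypothetical:

    import numpy as np

    def predict_token(samples, alpha_m, beta_n, D):
        """Monte Carlo estimate of P(w_{N+1} = v | D, H) for every word v.

        samples : list of (N_vt, D_t) pairs, one per Gibbs sample z^{(s)}
        alpha_m : pseudo-counts alpha * m_t, shape (T,)
        beta_n  : pseudo-counts beta * n_v, shape (V,)
        D       : number of training documents
        """
        alpha, beta = alpha_m.sum(), beta_n.sum()
        p = np.zeros(len(beta_n))
        for N_vt, D_t in samples:
            phi = (N_vt + beta_n[:, None]) / (N_vt.sum(axis=0) + beta)   # (V, T)
            theta = (D_t + alpha_m) / (D + alpha)                        # (T,)
            p += phi @ theta   # sum over topics; invariant to topic label permutations
        return p / len(samples)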

Multiple new tokens

P(\D'|\D,\H)
&= \sum_{\zz} P(\D'|\D,\zz) P(\zz|\D) \\
&\approx
\frac{1}{S} \sum_{s=1}^S P(\D'|\D,\zz^{(s)})

where \zz^{(1)},\dots,\zz^{(S)}\sim P(\zz|\D).

&P(\D'|\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'} P(\D_d'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'}\sum_{z_1'=1}^T\cdots\sum_{z_d'=1}^T
   P(\D_d',\zz'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'}\sum_{z_1'=1}^T\cdots\sum_{z_d'=1}^T
   P(\D_d',z_d'|\D_1',\dots,\D_{d-1}',z_1',\dots,z_{d-1}',\D,\zz^{(s)})
   P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'}\sum_{z_1'=1}^T\cdots\sum_{z_d'=1}^T
   \frac
   {\Gamma(N_\zd^{'<d} + N_\zd^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|\zd}^{'<d} + N_{v|\zd}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|\zd}^{'<d} + N_{v|\zd}^{(s)} + \beta n_v)}
   {\Gamma(N'_d + N_{\zd}^{'<d} + N_{\zd}^{(s)} + \beta)}
   \frac
   {D_\zd^{'<d} + D_\zd^{(s)} + \alpha m_\zd}
   {d - 1 + D + \alpha}
   P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&\approx \prod_{d=1}^{D'}\frac{1}{R}\sum_{r=1}^R\sum_{\zd=1}^T
   \frac
   {\Gamma(N_\zd^{'<d,(r)} + N_\zd^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|\zd}^{'<d,(r)} + N_{v|\zd}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|\zd}^{'<d,(r)} + N_{v|\zd}^{(s)} +
    \beta n_v)}
   {\Gamma(N'_d + N_{\zd}^{'<d,(r)} + N_{\zd}^{(s)} + \beta)}
   \frac
   {D_\zd^{'<d,(r)} + D_\zd^{(s)} + \alpha m_\zd}
   {d - 1 + D + \alpha} \\
&= \prod_{d=1}^{D'}\frac{1}{R}\sum_{r=1}^R\sum_{t=1}^T
   \frac
   {\Gamma(N_t^{'<d,(r)} + N_t^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} +
    \beta n_v)}
   {\Gamma(N'_d + N_{t}^{'<d,(r)} + N_{t}^{(s)} + \beta)}
   \frac
   {D_t^{'<d,(r)} + D_t^{(s)} + \alpha m_t}
   {d - 1 + D + \alpha}

where (z_1'^{(r)},\dots,z_{d-1}'^{(r)}) \sim
P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) for r=1,\dots,R.

Overall, the predictive probability is given by

P(\D'|\D,\H)
\approx
\frac{1}{S} \sum_{s=1}^S
   \prod_{d=1}^{D'}\frac{1}{R}\sum_{r=1}^R\sum_{t=1}^T
   \frac
   {\Gamma(N_t^{'<d,(r)} + N_t^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} +
    \beta n_v)}
   {\Gamma(N'_d + N_{t}^{'<d,(r)} + N_{t}^{(s)} + \beta)}
   \frac
   {D_t^{'<d,(r)} + D_t^{(s)} + \alpha m_t}
   {d - 1 + D + \alpha}

  • N_t^{(s)}, N_{v|t}^{(s)}, and D_t^{(s)} are constants within a single Gibbs sample \zz^{(s)}
  • Not susceptible to label switching
  • Computationally expensive: in practice only one sample (the one with the highest posterior probability) is used, that is, S=1 (see the sketch below)
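
Below is a minimal sketch of this S=1 computation for a set of new documents. It follows the formula above, obtaining the inner samples of (z_1',\dots,z_{d-1}') by extending R left-to-right "particles": each particle samples z_d' from its conditional right after document d's term has been computed. This is one practical way to approximate draws from P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}), not the only one, and every function and variable name is an assumption of the sketch:

    import numpy as np
    from scipy.special import gammaln

    def log_doc_terms(w, N_vt_r, N_t_r, D_t_r, alpha_m, beta_n, n_docs_before):
        """Log of the per-topic term for one new document d:
        the Gamma-ratio times (D_t + alpha m_t) / (d - 1 + D + alpha),
        with counts that already include z^{(s)} and the particle's sampled
        topics of the previous new documents.  Returns shape (T,)."""
        alpha, beta = alpha_m.sum(), beta_n.sum()
        vocab, counts = np.unique(w, return_counts=True)
        logp = np.log(D_t_r + alpha_m) - np.log(n_docs_before + alpha)
        logp += gammaln(N_t_r + beta) - gammaln(N_t_r + len(w) + beta)
        logp += (gammaln(N_vt_r[vocab, :] + counts[:, None] + beta_n[vocab, None])
                 - gammaln(N_vt_r[vocab, :] + beta_n[vocab, None])).sum(axis=0)
        return logp

    def log_predictive(new_docs, N_vt_s, D_t_s, alpha_m, beta_n, D, R, rng):
        """Estimate of log P(D' | D, z^{(s)}) with S = 1 and R particles."""
        T = len(D_t_s)
        # per-particle copies of the counts, extended as new documents are processed
        parts = [(N_vt_s.copy(), N_vt_s.sum(axis=0).copy(), D_t_s.copy())
                 for _ in range(R)]
        total = 0.0
        for d, w in enumerate(new_docs):           # d is 0-based, so d plays the role of d - 1
            per_particle = np.empty(R)
            for r, (N_vt_r, N_t_r, D_t_r) in enumerate(parts):
                logp_t = log_doc_terms(w, N_vt_r, N_t_r, D_t_r,
                                       alpha_m, beta_n, D + d)
                m = logp_t.max()
                per_particle[r] = m + np.log(np.exp(logp_t - m).sum())  # sum over t
                # extend the particle: sample z_d' and add document d's counts
                p = np.exp(logp_t - per_particle[r])
                t_new = rng.choice(T, p=p / p.sum())
                vocab, counts = np.unique(w, return_counts=True)
                N_vt_r[vocab, t_new] += counts
                N_t_r[t_new] += len(w)
                D_t_r[t_new] += 1
            # average over the R particles, then multiply into the product (add logs)
            m = per_particle.max()
            total += m + np.log(np.exp(per_particle - m).sum() / R)
        return total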
