Bayesian Methods for Text

Dirichlet–multinomial mixture model: exploration and prediction

Recap

  • \D=\{w_1,\dots,w_N\}
  • \D_d: tokens in document d
  • \Psi=\{\pphi_1,\dots,\pphi_T,\gbm{\theta},\zz\}

The posterior of the Dirichlet–multinomial mixture model is given by

P(\Psi|\D,\H) = P(\pphi_1,\dots,\pphi_T,\gbm{\theta}|\D,\zz,\H) P(\zz|\D,\H)

where P(\pphi_1,\dots,\pphi_T,\gbm{\theta}|\D,\zz,\H) is a product of T+1 Dirichlet distributions. The other factor can be obtained by

P(\zz|\D,\H)
= \frac{P(\D|\zz) P(\zz)}{P(\D)}
= \frac{P(\D|\zz) P(\zz)}{\sum_\zz P(\D|\zz) P(\zz)}

in which the normalizing sum \sum_\zz P(\D|\zz) P(\zz) runs over all T^D topic assignments and is intractable to compute directly; instead, samples from P(\zz|\D,\H) are drawn with Gibbs sampling.
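
For reference, below is a minimal sketch of one sweep of the collapsed Gibbs sampler from the previous section, which is how samples of \zz are obtained in practice. The count arrays mirror the notation of these notes (N_{v|t}, N_t, D_t with pseudo-counts \alpha m_t and \beta n_v), but every function and variable name is an assumption of the sketch, not something fixed by the notes:

    import numpy as np
    from scipy.special import gammaln

    def gibbs_sweep(docs, z, N_vt, N_t, D_t, alpha_m, beta_n, rng):
        """One sweep of collapsed Gibbs sampling for the DMM (one topic per document).

        docs    : list of 1-D int arrays, the word ids of each document
        z       : current topic assignment of each document, length D
        N_vt    : word-topic counts N_{v|t}, shape (V, T)
        N_t     : token counts per topic, shape (T,)
        D_t     : document counts per topic, shape (T,)
        alpha_m : pseudo-counts alpha * m_t, shape (T,)
        beta_n  : pseudo-counts beta * n_v, shape (V,)
        """
        beta = beta_n.sum()
        T = len(D_t)
        for d, w in enumerate(docs):
            vocab, counts = np.unique(w, return_counts=True)
            # remove document d from the counts of its current topic
            t_old = z[d]
            N_vt[vocab, t_old] -= counts
            N_t[t_old] -= len(w)
            D_t[t_old] -= 1
            # log of the full conditional P(z_d = t | z_{-d}, D) for every topic t
            logp = np.log(D_t + alpha_m)
            logp += gammaln(N_t + beta) - gammaln(N_t + len(w) + beta)
            logp += (gammaln(N_vt[vocab, :] + counts[:, None] + beta_n[vocab, None])
                     - gammaln(N_vt[vocab, :] + beta_n[vocab, None])).sum(axis=0)
            p = np.exp(logp - logp.max())
            t_new = rng.choice(T, p=p / p.sum())
            # add document d back under its new topic
            z[d] = t_new
            N_vt[vocab, t_new] += counts
            N_t[t_new] += len(w)
            D_t[t_new] += 1
        return z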

Exploration

P(\pphi_t|\D,\H)
&= \int d\Psi_{\setminus\pphi_t}\,P(\Psi|\D,\H) \\
&= \sum_{\zz} P(\pphi_t,\zz|\D,\H) \\
&= \sum_{\zz} P(\pphi_t|\zz,\D,\H) P(\zz|\D,\H)

\E_{P(\pphi|\D,\H)}[\pphi_t]
&= \int d\pphi_t \pphi_t P(\pphi_t|\D,\H) \\
&= \int d\pphi_t \pphi_t \sum_\zz P(\pphi_t|\zz,\D,\H) P(\zz|\D,\H) \\
&= \sum_\zz \int d\pphi_t \pphi_t P(\pphi_t|\zz,\D,\H) P(\zz|\D,\H) \\

where, using the fact that P(\pphi_t|\zz,\D,\H) is the Dirichlet distribution with parameters (\beta n_1 + N_{1|t},\dots,\beta n_V + N_{V|t}) and its mean is the normalized parameter vector,

\int d\pphi_t \pphi_t P(\pphi_t|\zz,\D,\H)
&= \left(\frac{N_{1|t}+\beta n_1}{N_t+\beta},\dots,
         \frac{N_{V|t}+\beta n_V}{N_t+\beta}\right)

For a single component \phi_{v|t} of \pphi_t,

\E_{P(\pphi|\D,\H)}[\phi_{v|t}]
&= \sum_\zz \frac{N_{v|t}+\beta n_v}{N_t+\beta} P(\zz|\D,\H)

The sum over \zz is intractable, so we approximate it with a sampling technique: draw samples \zz^{(1)},\dots,\zz^{(S)}\sim P(\zz|\D) by Gibbs sampling.

Can we do the following Monte Carlo approximation?

\E_{P(\pphi|\D,\H)}[\phi_{v|t}]
&\approx \frac{1}{S}\sum_{s=1}^S
   \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta}

No, because of label switching.

Label switching

Label switching happens for several reasons:

  • the posterior is invariant under the T! permutations of the topic labels, so every mode has T! equivalent copies
  • the posterior may also have multiple genuinely different modes

Label switching also happens within a single Gibbs run; if it never does, the Markov chain is not mixing well.

Dealing with label switching

  • Matching up topics across samples: very hard in practice, and even harder when the posterior is multi-modal.
  • Averaging samples within a short window of the Markov chain: there is no guarantee that label switching does not happen within that window.
  • Using only one sample, in particular the sample with the highest posterior probability.

In our case, we use a single sample to estimate the latent parameter:

\E_{P(\pphi|\D,\H)}[\phi_{v|t}]
&\approx \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta}
\qquad\text{where } s = \arg\max_{s'} P(\zz^{(s')}|\D)
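
In code, this single-sample estimate is just a smoothed, column-normalized word-topic count matrix built from the chosen sample. A minimal numpy sketch, where the array names N_vt and beta_n are assumptions of the sketch:

    import numpy as np

    def estimate_phi(N_vt, beta_n):
        """Posterior-mean estimate of phi_{v|t} from a single Gibbs sample.

        N_vt   : word-topic counts N_{v|t}^{(s)} of the chosen sample, shape (V, T)
        beta_n : pseudo-counts beta * n_v, shape (V,)
        Returns phi_hat with phi_hat[v, t] = (N_{v|t} + beta n_v) / (N_t + beta).
        """
        N_t = N_vt.sum(axis=0)          # tokens assigned to each topic
        beta = beta_n.sum()
        return (N_vt + beta_n[:, None]) / (N_t + beta)

    # e.g. the ten most probable words of topic t:
    # top_words = np.argsort(-estimate_phi(N_vt, beta_n)[:, t])[:10]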

Prediction

P(\D'|\D,\H)
&= \sum_{\zz'} P(\D',\zz'|\D,\H) \\
&= \sum_{\zz'} \int d\Psi P(\D',\zz'|\Psi,\H) P(\Psi|\D,\H) \\
&= \sum_{\zz'} \int d\Psi P(\D'|\zz',\Psi) P(\zz'|\Psi) P(\Psi|\D,\H) \\

Alternatively,

P(\D'|\D,\H)
&= \sum_{\zz'} \sum_\zz P(\D',\zz'|\D,\zz) P(\zz|\D)

Single new token

P(w_{N+1}=v|\D,\H)
&= \sum_{z_{d_{N+1}}} \sum_\zz P(w_{N+1}=v,z_{d_{N+1}}|\D,\zz) P(\zz|\D) \\
&= \sum_{t=1}^T \sum_\zz P(w_{N+1}=v,z_{d_{N+1}}=t|\D,\zz) P(\zz|\D) \\
&\approx \frac{1}{S} \sum_{t=1}^T \sum_{s=1}^S \left(
   \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot
   \frac{D_t^{(s)}+\alpha m_t}{D+\alpha}\right) \\
&= \frac{1}{S} \sum_{s=1}^S \left[\sum_{t=1}^T \left(
   \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot
   \frac{D_t^{(s)}+\alpha m_t}{D+\alpha}\right)\right]

where \zz^{(1)},\dots,\zz^{(S)}\sim P(\zz|\D).

The Monte Carlo estimate here is not susceptible to label switching: each inner sum \sum_{t=1}^T \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot \frac{D_t^{(s)}+\alpha m_t}{D+\alpha} is itself an estimate of the probability P(w_{N+1}=v|\D,\H) and is invariant to permutations of the topic labels, even though the individual term \frac{N_{v|t}^{(s)}+\beta n_v}{N_t^{(s)}+\beta} \cdot \frac{D_t^{(s)}+\alpha m_t}{D+\alpha} for a particular t is susceptible to label switching.
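
A minimal sketch of this Monte Carlo estimate, assuming each stored Gibbs sample is kept as a pair of count arrays (N_{v|t}^{(s)}, D_t^{(s)}); all names below are hypothetical:

    import numpy as np

    def predict_token(samples, alpha_m, beta_n, D):
        """Monte Carlo estimate of P(w_{N+1} = v | D, H) for every word v.

        samples : list of (N_vt, D_t) pairs, one per Gibbs sample z^{(s)}
        alpha_m : pseudo-counts alpha * m_t, shape (T,)
        beta_n  : pseudo-counts beta * n_v, shape (V,)
        D       : number of training documents
        """
        alpha, beta = alpha_m.sum(), beta_n.sum()
        p = np.zeros(len(beta_n))
        for N_vt, D_t in samples:
            phi = (N_vt + beta_n[:, None]) / (N_vt.sum(axis=0) + beta)   # (V, T)
            theta = (D_t + alpha_m) / (D + alpha)                        # (T,)
            p += phi @ theta   # sum over topics; invariant to topic label permutations
        return p / len(samples)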

Multiple new tokens

P(\D'|\D,\H)
&= \sum_{\zz} P(\D'|\D,\zz) P(\zz|\D) \\
&\approx
\frac{1}{S} \sum_{s=1}^S P(\D'|\D,\zz^{(s)})

where \zz^{(1)},\dots,\zz^{(S)}\sim P(\zz|\D).

&P(\D'|\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'} P(\D_d'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'}\sum_{z_1'=1}^T\cdots\sum_{z_d'=1}^T
   P(\D_d',\zz'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'}\sum_{z_1'=1}^T\cdots\sum_{z_d'=1}^T
   P(\D_d',z_d'|\D_1',\dots,\D_{d-1}',z_1',\dots,z_{d-1}',\D,\zz^{(s)})
   P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&= \prod_{d=1}^{D'}\sum_{z_1'=1}^T\cdots\sum_{z_d'=1}^T
   \frac
   {\Gamma(N_\zd^{'<d} + N_\zd^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|\zd}^{'<d} + N_{v|\zd}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|\zd}^{'<d} + N_{v|\zd}^{(s)} + \beta n_v)}
   {\Gamma(N'_d + N_{\zd}^{'<d} + N_{\zd}^{(s)} + \beta)}
   \frac
   {D_\zd^{'<d} + D_\zd^{(s)} + \alpha m_\zd}
   {d - 1 + D + \alpha}
   P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) \\
&\approx \prod_{d=1}^{D'}\frac{1}{R}\sum_{r=1}^R\sum_{\zd=1}^T
   \frac
   {\Gamma(N_\zd^{'<d,(r)} + N_\zd^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|\zd}^{'<d,(r)} + N_{v|\zd}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|\zd}^{'<d,(r)} + N_{v|\zd}^{(s)} +
    \beta n_v)}
   {\Gamma(N'_d + N_{\zd}^{'<d,(r)} + N_{\zd}^{(s)} + \beta)}
   \frac
   {D_\zd^{'<d,(r)} + D_\zd^{(s)} + \alpha m_\zd}
   {d - 1 + D + \alpha} \\
&= \prod_{d=1}^{D'}\frac{1}{R}\sum_{r=1}^R\sum_{t=1}^T
   \frac
   {\Gamma(N_t^{'<d,(r)} + N_t^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} +
    \beta n_v)}
   {\Gamma(N'_d + N_{t}^{'<d,(r)} + N_{t}^{(s)} + \beta)}
   \frac
   {D_t^{'<d,(r)} + D_t^{(s)} + \alpha m_t}
   {d - 1 + D + \alpha}

where (z_1'^{(r)},\dots,z_{d-1}'^{(r)}) \sim
P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}) for r=1,\dots,R.

Overall, the predictive probability is given by

P(\D'|\D,\H)
\approx
\frac{1}{S} \sum_{s=1}^S
   \prod_{d=1}^{D'}\frac{1}{R}\sum_{r=1}^R\sum_{t=1}^T
   \frac
   {\Gamma(N_t^{'<d,(r)} + N_t^{(s)} +\beta)}
   {\prod_v\Gamma(N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} + \beta n_v)}
   \frac
   {\prod_v\Gamma(N'_{v|d} + N_{v|t}^{'<d,(r)} + N_{v|t}^{(s)} +
    \beta n_v)}
   {\Gamma(N'_d + N_{t}^{'<d,(r)} + N_{t}^{(s)} + \beta)}
   \frac
   {D_t^{'<d,(r)} + D_t^{(s)} + \alpha m_t}
   {d - 1 + D + \alpha}

  • N_t^{(s)}, N_{v|t}^{(s)}, and D_t^{(s)} are constants within a single Gibbs sample \zz^{(s)}
  • Not susceptible to label switching
  • Computationally expensive: in practice only one sample (the one with the highest posterior probability) is used, that is, S=1 (see the sketch below)
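
Below is a minimal sketch of this S=1 computation for a set of new documents. It follows the formula above, obtaining the inner samples of (z_1',\dots,z_{d-1}') by extending R left-to-right "particles": each particle samples z_d' from its conditional right after document d's term has been computed. This is one practical way to approximate draws from P(z_1',\dots,z_{d-1}'|\D_1',\dots,\D_{d-1}',\D,\zz^{(s)}), not the only one, and every function and variable name is an assumption of the sketch:

    import numpy as np
    from scipy.special import gammaln

    def log_doc_terms(w, N_vt_r, N_t_r, D_t_r, alpha_m, beta_n, n_docs_before):
        """Log of the per-topic term for one new document d:
        the Gamma-ratio times (D_t + alpha m_t) / (d - 1 + D + alpha),
        with counts that already include z^{(s)} and the particle's sampled
        topics of the previous new documents.  Returns shape (T,)."""
        alpha, beta = alpha_m.sum(), beta_n.sum()
        vocab, counts = np.unique(w, return_counts=True)
        logp = np.log(D_t_r + alpha_m) - np.log(n_docs_before + alpha)
        logp += gammaln(N_t_r + beta) - gammaln(N_t_r + len(w) + beta)
        logp += (gammaln(N_vt_r[vocab, :] + counts[:, None] + beta_n[vocab, None])
                 - gammaln(N_vt_r[vocab, :] + beta_n[vocab, None])).sum(axis=0)
        return logp

    def log_predictive(new_docs, N_vt_s, D_t_s, alpha_m, beta_n, D, R, rng):
        """Estimate of log P(D' | D, z^{(s)}) with S = 1 and R particles."""
        T = len(D_t_s)
        # per-particle copies of the counts, extended as new documents are processed
        parts = [(N_vt_s.copy(), N_vt_s.sum(axis=0).copy(), D_t_s.copy())
                 for _ in range(R)]
        total = 0.0
        for d, w in enumerate(new_docs):           # d is 0-based, so d plays the role of d - 1
            per_particle = np.empty(R)
            for r, (N_vt_r, N_t_r, D_t_r) in enumerate(parts):
                logp_t = log_doc_terms(w, N_vt_r, N_t_r, D_t_r,
                                       alpha_m, beta_n, D + d)
                m = logp_t.max()
                per_particle[r] = m + np.log(np.exp(logp_t - m).sum())  # sum over t
                # extend the particle: sample z_d' and add document d's counts
                p = np.exp(logp_t - per_particle[r])
                t_new = rng.choice(T, p=p / p.sum())
                vocab, counts = np.unique(w, return_counts=True)
                N_vt_r[vocab, t_new] += counts
                N_t_r[t_new] += len(w)
                D_t_r[t_new] += 1
            # average over the R particles, then multiply into the product (add logs)
            m = per_particle.max()
            total += m + np.log(np.exp(per_particle - m).sum() / R)
        return total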
