Bayesian Methods for Text


Dirichlet–multinomial mixture model: known groups

Data

Assume we observe a set of documents, consisting of their tokens and topic labels,

\D=\{w_1, \dots, w_N, z_1,\dots,z_D\}

  • N: number of tokens
  • D: number of documents
  • V: vocabulary size
  • T: number of topics

Tokens: w_1, w_2, \dots, w_N, where w_n\in\{1,2,\dots,V\}

Topic of each document: z_1, z_2, \dots, z_D, where z_d\in\{1,2,\dots,T\}

Latent variables

Token distribution of the t-th topic: \gbm{\phi}_t = (\phi_{1|t}, \phi_{2|t}, \dots, \phi_{V|t})

Independence:

  • w_i is independent of w_j for i\ne j given the topics of the documents they belong to.
  • \gbm{\phi}_1,\cdots,\gbm{\phi}_T are i.i.d.

Topic distribution \gbm{\theta}=(\theta_1,\theta_2,\dots,\theta_T)

Overall we have \Psi=\{\gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T, \gbm{\theta}\}.

Prior

P(\gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T | \H)
= \prod_{t=1}^T P(\gbm{\phi}_t | \H)
= \prod_{t=1}^T \Dir(\gbm{\phi}_t; \beta, \bm{n})

P(\gbm{\theta} | \H) = \Dir(\gbm{\theta}; \alpha, \bm{m})

Overall prior

P(\gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T, \gbm{\theta} | \H)
&= P(\gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T | \H)
   P(\gbm{\theta} | \H) \\
&= \Dir(\gbm{\theta}; \alpha, \bm{m})
   \prod_{t=1}^T \Dir(\gbm{\phi}_t; \beta, \bm{n})
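Throughout, \Dir(\,\cdot\,; c, \bm{b}) denotes a Dirichlet with concentration c and base measure \bm{b}, i.e. standard parameters c b_i; this matches the gamma-function ratios in the evidence below. Drawing from the overall prior is then just T+1 independent Dirichlet draws. A minimal sketch in Python/NumPy, with illustrative function and variable names:

    import numpy as np

    def sample_prior(alpha, m, beta, n, T, rng=None):
        """Draw Psi = (phi_1, ..., phi_T, theta) from the prior.

        Dir(x; c, b) in the text is a Dirichlet with standard parameters
        c * b, so each draw rescales the base measure by the concentration.
        """
        rng = rng if rng is not None else np.random.default_rng()
        theta = rng.dirichlet(alpha * np.asarray(m))        # topic distribution
        phi = np.column_stack([rng.dirichlet(beta * np.asarray(n))
                               for _ in range(T)])          # (V, T); column t is phi_t
        return phi, theta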

Notation

  • topic t is responsible for D_t documents
    • D = \sum_{t=1}^T D_t
  • the documents associated with topic t contain N_t tokens in total
  • N_{v|t} of those tokens are of type v
    • N_t = \sum_{v=1}^V N_{v|t}
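These counts are the sufficient statistics for everything below. A minimal way to accumulate them, assuming hypothetical arrays tokens (the token types w_n), doc_of_token (the document index of each token), and doc_topics (the known labels z_d):

    import numpy as np

    def count_statistics(tokens, doc_of_token, doc_topics, V, T):
        """Accumulate N_{v|t}, N_t, and D_t from the observed data."""
        doc_topics = np.asarray(doc_topics)
        N_vt = np.zeros((V, T), dtype=np.int64)             # N_{v|t}
        for w, d in zip(tokens, doc_of_token):
            N_vt[w, doc_topics[d]] += 1                     # token w in a topic-z_d document
        N_t = N_vt.sum(axis=0)                              # N_t = sum_v N_{v|t}
        D_t = np.bincount(doc_topics, minlength=T)          # documents per topic
        return N_vt, N_t, D_t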

Likelihood

P(\D|\Psi,\H)
&= P(w_1, w_2, \dots, w_N, z_1, z_2, \dots, z_D |
    \gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T, \gbm{\theta}) \\
&= P(w_1, w_2, \dots, w_N | \gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T)
   P(z_1, z_2, \dots, z_D | \gbm{\theta})

Let t_n denote the topic of the document containing token w_n, i.e. t_n = z_d when w_n belongs to document d, for all n=1, 2, \dots, N, where t_n\in\{1, 2, \dots, T\}.

P(w_1, w_2, \dots, w_N | \gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T)
&= \prod_{n=1}^N P(w_n | \gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T) \\
&= \prod_{n=1}^N \phi_{w_n | t_n} \\
&= \prod_{n=1}^N \prod_{v=1}^V \prod_{t=1}^T \phi_{v|t}^
   {\delta(w_n=v)\delta(t_n=t)} \\
&= \prod_{v=1}^V \prod_{t=1}^T \phi_{v|t}^
   {\sum_{n=1}^N\delta(w_n=v)\delta(t_n=t)} \\
&= \prod_{v=1}^V \prod_{t=1}^T \phi_{v|t}^{N_{v|t}}

P(z_1, z_2, \dots, z_D | \gbm{\theta})
&= \prod_{d=1}^D \theta_{z_d} \\
&= \prod_{d=1}^D \prod_{t=1}^T \theta_t^{\delta(z_d=t)} \\
&= \prod_{t=1}^T \theta_t^{\sum_{d=1}^D\delta(z_d=t)} \\
&= \prod_{t=1}^T \theta_t^{D_t}

The likelihood is therefore

P(\D|\Psi,\H)
&= P(w_1, w_2, \dots, w_N, z_1, z_2, \dots, z_D |
    \gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T, \gbm{\theta}) \\
&= \left(\prod_{v=1}^V \prod_{t=1}^T \phi_{v|t}^{N_{v|t}}\right)
   \left(\prod_{t=1}^T \theta_t^{D_t}\right)
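In count form, the log-likelihood is just two weighted sums. A minimal sketch reusing the hypothetical count arrays above; scipy.special.xlogy returns x*log(y) with the convention 0*log(0)=0, so zero counts are handled safely:

    import numpy as np
    from scipy.special import xlogy

    def log_likelihood(N_vt, D_t, phi, theta):
        """log P(D | Psi, H) = sum_{v,t} N_{v|t} log phi_{v|t} + sum_t D_t log theta_t.

        phi:   (V, T) array whose column t is the token distribution phi_t
        theta: length-T array, the topic distribution
        """
        return xlogy(N_vt, phi).sum() + xlogy(D_t, theta).sum()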

Evidence

P(\D|\H)
&= \int d\Psi P(\D|\Psi,\H)P(\Psi|\H) \\
&= \int d\gbm{\phi}_1 \int d\gbm{\phi}_2 \cdots \int d\gbm{\phi}_T
   \int d\gbm{\theta}
   \left(\prod_{v=1}^V \prod_{t=1}^T \phi_{v|t}^{N_{v|t}}\right)
   \left(\prod_{t=1}^T \theta_t^{D_t}\right)
   \Dir(\gbm{\theta}; \alpha, \bm{m})
   \left(\prod_{t=1}^T \Dir(\gbm{\phi}_t; \beta, \bm{n})\right) \\
&= \left[\int d\gbm{\phi}_1 \int d\gbm{\phi}_2 \cdots \int d\gbm{\phi}_T
   \prod_{t=1}^T \left(\Dir(\gbm{\phi}_t; \beta, \bm{n})
   \prod_{v=1}^V \phi_{v|t}^{N_{v|t}}\right)\right]
   \left[
   \int d\gbm{\theta}
   \left(\prod_{t=1}^T \theta_t^{D_t}\right)
   \Dir(\gbm{\theta}; \alpha, \bm{m})
   \right] \\
&= \left[\prod_{t=1}^T \int d\gbm{\phi}_t
   \left(\Dir(\gbm{\phi}_t; \beta, \bm{n})
   \prod_{v=1}^V \phi_{v|t}^{N_{v|t}}\right)\right]
   \left[
   \int d\gbm{\theta}
   \left(\prod_{t=1}^T \theta_t^{D_t}\right)
   \Dir(\gbm{\theta}; \alpha, \bm{m})
   \right] \\
&= \left(
   \prod_{t=1}^T\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)}
   \right)
   \left(
   \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t+\alpha m_t)}{\Gamma(D+\alpha)}
   \right)
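Each bracketed factor is a standard Dirichlet integral, so the evidence is a product of gamma-function ratios and should be evaluated in log space to avoid overflow. A minimal sketch with scipy.special.gammaln, using the same hypothetical count arrays:

    import numpy as np
    from scipy.special import gammaln

    def log_evidence(N_vt, N_t, D_t, alpha, m, beta, n):
        """log P(D | H) for the Dirichlet-multinomial mixture with known groups."""
        m, n = np.asarray(m), np.asarray(n)
        T = N_vt.shape[1]
        log_p = 0.0
        for t in range(T):                                  # one phi_t integral per topic
            log_p += (gammaln(beta) - gammaln(beta * n).sum()
                      + gammaln(N_vt[:, t] + beta * n).sum()
                      - gammaln(N_t[t] + beta))
        # single theta integral for the document labels
        log_p += (gammaln(alpha) - gammaln(alpha * m).sum()
                  + gammaln(D_t + alpha * m).sum()
                  - gammaln(D_t.sum() + alpha))
        return log_p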

Posterior

P(\Psi|\D,\H)
&= P(\gbm{\phi}_1, \gbm{\phi}_2, \dots, \gbm{\phi}_T, \gbm{\theta}|
w_1, w_2, \dots, w_N, z_1, z_2, \dots, z_D) \\
&= \Dir\left(\gbm{\theta}; D+\alpha, \left(
   \frac{D_1+\alpha m_1}{D+\alpha},
   \frac{D_2+\alpha m_2}{D+\alpha}, \dots,
   \frac{D_T+\alpha m_T}{D+\alpha}\right)\right)
   \prod_{t=1}^T
   \Dir\left(\gbm{\phi}_t; N_t+\beta, \left(
   \frac{N_{1|t}+\beta n_1}{N_t+\beta},
   \frac{N_{2|t}+\beta n_2}{N_t+\beta}, \dots,
   \frac{N_{V|t}+\beta n_V}{N_t+\beta}\right)\right)
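Because the posterior keeps this factorized Dirichlet form, an exact posterior sample of \Psi costs only T+1 Dirichlet draws with the updated parameters. A minimal sketch:

    import numpy as np

    def sample_posterior(N_vt, D_t, alpha, m, beta, n, rng=None):
        """Draw (phi_1, ..., phi_T, theta) from the factorized posterior."""
        rng = rng if rng is not None else np.random.default_rng()
        T = N_vt.shape[1]
        theta = rng.dirichlet(D_t + alpha * np.asarray(m))  # Dir with params D_t + alpha m_t
        phi = np.column_stack([rng.dirichlet(N_vt[:, t] + beta * np.asarray(n))
                               for t in range(T)])          # Dir with params N_{v|t} + beta n_v
        return phi, theta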

Prediction

P(\D'|\D,\H)=\int d\Psi P(\D'|\Psi,\H)P(\Psi|\D,\H)

Consider the case where \D' consists of a single token in a new document, \D'=\{w_{N+1}=v, z_{D+1}=t\}:

P(\D'|\D,\H)
&= P(w_{N+1}=v, z_{D+1}=t | \D, \H) \\
&= \int d\Psi P(w_{N+1}=v, z_{D+1}=t | \Psi, \H) P(\Psi|\D, \H) \\
&= \int d\Psi P(w_{N+1}=v | z_{D+1}=t, \gbm{\phi}_t)
   P(z_{D+1}=t | \gbm{\theta}) P(\Psi|\D, \H) \\
&= \int d\Psi \phi_{v|t} \theta_t P(\Psi|\D, \H) \\
&= \frac{N_{v|t}+\beta n_v}{N_t+\beta}
   \frac{D_t+\alpha m_t}{D+\alpha}
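The two factors are the posterior means of \phi_{v|t} and \theta_t, so the prediction reduces to a product of smoothed count ratios. A minimal sketch:

    def predict_token_new_doc(v, t, N_vt, N_t, D_t, alpha, m, beta, n):
        """P(w_{N+1}=v, z_{D+1}=t | D, H): posterior mean of phi_{v|t} times that of theta_t."""
        D = D_t.sum()
        return ((N_vt[v, t] + beta * n[v]) / (N_t[t] + beta)
                * (D_t[t] + alpha * m[t]) / (D + alpha))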

For the case of a single token added to an existing document, \D'=\{w_{N+1}=v\}, where w_{N+1} belongs to document D with topic z_D=t, the predictive probability is

P(\D'|\D,\H)
&= P(w_{N+1}=v | \D, \H) \\
&= \int d\Psi P(w_{N+1}=v | \Psi, \H) P(\Psi|\D, \H) \\
&= \int d\Psi P(w_{N+1}=v | z_D=t, \gbm{\phi}_t) P(\Psi|\D, \H) \\
&= \int d\Psi \phi_{v|t} P(\Psi|\D, \H) \\
&= \frac{N_{v|t}+\beta n_v}{N_t+\beta}
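Only the \phi_{v|t} factor remains, since the topic of document D is already observed. A minimal sketch:

    def predict_token_existing_doc(v, t, N_vt, N_t, beta, n):
        """P(w_{N+1}=v | D, H) when the token joins an existing document of topic t."""
        return (N_vt[v, t] + beta * n[v]) / (N_t[t] + beta)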

For a new dataset consisting of multiple documents, \D'=\{w_{N+1}, w_{N+2}, \dots, w_{N+N'},
z_{D+1}, z_{D+2}, \dots, z_{D+D'}\}, with N_{v|t}', N_t', D_t', N', and D' denoting the counts in \D' analogous to those in \D, the predictive probability is

P(\D'|\D,\H)
&= \int d\Psi P(w_{N+1}, \dots, w_{N+N'} |
   z_{D+1}, \dots, z_{D+D'},
   \gbm{\phi}_1, \dots, \gbm{\phi}_T)
   P(z_{D+1}, \dots, z_{D+D'} | \gbm{\theta}) P(\Psi|\D, \H) \\
&= \int d\Psi \left(\prod_{t=1}^T \prod_{v=1}^V \phi_{v|t}^{N_{v|t}'}\right)
   \left(\prod_{t=1}^T \theta_t^{D_t'}\right) P(\Psi|\D, \H) \\
&= \left(\prod_{t=1}^T
   \frac{\Gamma(N_t+\beta)}{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}'+N_{v|t}+\beta n_v)}
   {\Gamma(N_t'+N_t+\beta)}\right)
   \left(\frac{\Gamma(D+\alpha)}{\prod_{t=1}^T\Gamma(D_t+\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(D_t'+D_t+\alpha m_t)}{\Gamma(D'+D+\alpha)}
   \right)
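This is the evidence of the combined dataset divided by the evidence of \D, hence the same gamma-ratio structure with the posterior counts replacing the prior parameters; it is again best evaluated in log space. A minimal sketch, where N_vt_new and D_t_new are hypothetical count arrays for \D' built exactly like those for \D:

    import numpy as np
    from scipy.special import gammaln

    def log_predictive(N_vt_new, D_t_new, N_vt, N_t, D_t, alpha, m, beta, n):
        """log P(D' | D, H) for a new batch of documents with known topics."""
        m, n = np.asarray(m), np.asarray(n)
        T = N_vt.shape[1]
        log_p = 0.0
        for t in range(T):
            log_p += (gammaln(N_t[t] + beta)
                      - gammaln(N_vt[:, t] + beta * n).sum()
                      + gammaln(N_vt_new[:, t] + N_vt[:, t] + beta * n).sum()
                      - gammaln(N_vt_new[:, t].sum() + N_t[t] + beta))
        log_p += (gammaln(D_t.sum() + alpha)
                  - gammaln(D_t + alpha * m).sum()
                  + gammaln(D_t_new + D_t + alpha * m).sum()
                  - gammaln(D_t_new.sum() + D_t.sum() + alpha))
        return log_p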
