Bayesian Methods for Text

Dirichlet–multinomial unigram language model

Tasks

  1. Specify model
  2. Explanation
  3. Exploration
  4. Prediction

The Dirichlet–multinomial unigram model generalizes the beta–binomial unigram model from two token types to an arbitrary finite number of token types.

Data

\D=\{w_1, w_2, \dots, w_N\}

  • N: number of tokens
  • V: number of unique word types
  • each token is represented by the index of its word type in the vocabulary, that is, w_n\in\{1,2,\dots,V\}.
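The index representation above can be sketched in a few lines of Python. The toy corpus and the names `tokens`, `vocab`, `w` and `N_v` are illustrative assumptions, not part of the original notes; sorting the vocabulary is just one arbitrary way to fix the indices.

```python
from collections import Counter

# Hypothetical toy corpus; any list of string tokens would do.
tokens = ["the", "cat", "sat", "on", "the", "mat"]

# Give each distinct word type an index in 1..V (sorted order is an arbitrary choice).
vocab = {word: v for v, word in enumerate(sorted(set(tokens)), start=1)}

# D = {w_1, ..., w_N}: the corpus as a sequence of vocabulary indices.
w = [vocab[t] for t in tokens]

N = len(w)      # number of tokens
V = len(vocab)  # number of unique word types

# Count how many tokens have each type v = 1..V.
counts = Counter(w)
N_v = [counts[v] for v in range(1, V + 1)]

print(w, N, V, N_v)
```

The per-type counts are the only summary of the data that the rest of the model needs.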

Each token w_n follows a categorical distribution parameterized by \gbm{\phi}=(\phi_1,\phi_2,\dots,\phi_V) that satisfies

  • 0\le \phi_v \le 1
  • \sum_{v=1}^V \phi_v = 1

The sum-to-one constraint implies the categorical distribution can be parameterized by only V-1 parameters, for instance, (\phi_1,\phi_2,\dots,\phi_{V-1}).

In the notation used previously, \Psi=\{\phi_1,\phi_2,\dots,\phi_V\}=\{\gbm{\phi}\}.

Prior

P(\phi_1,\phi_2,\dots,\phi_V|\H)
= P(\pphi|\H)
= \Dir(\pphi;\beta,\nn)

Dirichlet distribution

The Dirichlet distribution is a distribution over discrete probability distributions, that is, over vectors \pphi whose entries are nonnegative and sum to one:

\Dir(\pphi;\beta,\nn)
= \frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
\prod_{v=1}^V \phi_v^{\beta n_v - 1}
\cdot
\delta\left(1-\sum_{v=1}^V \phi_v\right)

where \delta(x)=1 when x=0; otherwise, \delta(x)=0. This delta function explicitly specifies the support of the Dirichlet distribution.
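As a minimal sketch, the density above can be evaluated in log space with the standard-library `math.lgamma`. The function name `log_dirichlet_pdf` and the tolerance `1e-9` are assumptions made for illustration; the sketch also assumes every \phi_v > 0 and every n_v > 0, so boundary points are not handled.

```python
import math

def log_dirichlet_pdf(phi, beta, n):
    """Log of Dir(phi; beta, n) in the (beta, n) parameterization.

    Assumes all phi_v > 0 and all n_v > 0 (interior of the simplex only).
    """
    # Support check: this plays the role of the delta factor.
    if any(p < 0.0 for p in phi) or abs(sum(phi) - 1.0) > 1e-9:
        return -math.inf
    # log of Gamma(beta) / prod_v Gamma(beta * n_v)
    log_norm = math.lgamma(beta) - sum(math.lgamma(beta * nv) for nv in n)
    # log of prod_v phi_v^(beta * n_v - 1)
    log_kernel = sum((beta * nv - 1.0) * math.log(p) for nv, p in zip(n, phi))
    return log_norm + log_kernel

# With beta = V and n_v = 1/V all exponents vanish, so the density is flat.
print(math.exp(log_dirichlet_pdf([0.2, 0.3, 0.5], 3.0, [1 / 3, 1 / 3, 1 / 3])))
```

For V=3, \beta=3 and uniform \nn the density reduces to the constant \Gamma(3)=2 everywhere on the simplex, which is a convenient sanity check.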

Parameters

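  • \nn=(n_1,n_2,\dots,n_V): base measure, which is also the mean of the distribution; it satisfies n_v\ge 0 and \sum_{v=1}^V n_v=1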
  • \beta: concentration parameter

Expectation

\E_{\Dir(\pphi;\beta,\nn)}[\pphi]
= \left(
\E_{\Dir(\pphi;\beta,\nn)}[\phi_1],
\E_{\Dir(\pphi;\beta,\nn)}[\phi_2],
\dots,
\E_{\Dir(\pphi;\beta,\nn)}[\phi_V]
\right)

\E_{\Dir(\pphi;\beta,\nn)}[\phi_v]
&= \int d\pphi\,\phi_v \Dir(\pphi;\beta,\nn) \\
&= \int d\pphi\,\phi_v \frac{\Gamma(\beta)}{\prod_{i=1}^V\Gamma(\beta n_i)}
   \prod_{i=1}^V \phi_i^{\beta n_i - 1} \\
&= \int d\pphi\frac{\Gamma(\beta)}{\prod_{i=1}^V\Gamma(\beta n_i)}
   \phi_v^{\beta n_v}\prod_{i\ne v} \phi_i^{\beta n_i-1} \\
&= \frac{\Gamma(\beta)}{\prod_{i=1}^V\Gamma(\beta n_i)}
   \int d\pphi\,\phi_v^{\beta n_v}\prod_{i\ne v} \phi_i^{\beta n_i-1} \\
&= \frac{\Gamma(\beta)}{\prod_{i=1}^V\Gamma(\beta n_i)}
   \frac{\Gamma(\beta n_v + 1)\prod_{i\ne v}\Gamma(\beta n_i)}
   {\Gamma(\beta+1)} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta+1)}\,
   \frac{\Gamma(\beta n_v + 1)}{\Gamma(\beta n_v)} \\
&= \frac{\beta n_v}{\beta} \\
&= n_v

The integrals run over the simplex, so the \delta factor is left implicit, and the last steps use \Gamma(x+1)=x\,\Gamma(x).
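The result \E[\phi_v]=n_v can be checked numerically. The sketch below draws Dirichlet samples by normalizing independent gamma variates, a standard construction that is not taken from these notes. The helper `sample_dirichlet`, the seed, the sample size and the parameter values are all illustrative assumptions.

```python
import random

def sample_dirichlet(beta, n, rng):
    """Draw one phi ~ Dir(beta, n) by normalizing Gamma(beta * n_v, 1) variates."""
    g = [rng.gammavariate(beta * nv, 1.0) for nv in n]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)          # fixed seed so the run is repeatable
beta, n = 5.0, [0.5, 0.3, 0.2]  # hypothetical concentration and base measure
S = 20000                       # number of Monte Carlo samples

# Running average of each coordinate of phi.
mean = [0.0] * len(n)
for _ in range(S):
    phi = sample_dirichlet(beta, n, rng)
    for v, p in enumerate(phi):
        mean[v] += p / S

print(mean)  # each entry should be close to the corresponding n_v
```

The empirical means should agree with \nn up to Monte Carlo error, whatever the value of \beta; changing \beta only changes how tightly the samples cluster around \nn.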

Likelihood

P(\D|\Psi,\H)
&= P(w_1,w_2,\dots,w_N|\pphi,\H) \\
&= \prod_{n=1}^N P(w_n|\pphi,\H) \\
&= \prod_{n=1}^N \sum_{v=1}^V\delta(w_n=v)P(w_n=v|\pphi,\H) \\
&= \prod_{n=1}^N \prod_{v=1}^V P(w_n=v|\pphi,\H)^{\delta(w_n=v)} \\
&= \prod_{v=1}^V \phi_v^{\sum_{n=1}^N\delta(w_n=v)} \\
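&= \prod_{v=1}^V \phi_v^{N_v}

Here \delta(w_n=v) equals 1 if w_n=v and 0 otherwise, and N_v=\sum_{n=1}^N\delta(w_n=v) is the number of tokens of type v in \D, so that \sum_{v=1}^V N_v=N.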
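The derivation says the likelihood depends on the data only through the per-type counts. A small sketch, with hypothetical function names and toy values, computes the log-likelihood both token by token and from counts to show that the two agree.

```python
import math

def log_likelihood(w, phi):
    """log P(D | phi) as a sum over tokens; w holds indices in 1..V."""
    return sum(math.log(phi[wn - 1]) for wn in w)

def log_likelihood_from_counts(N_v, phi):
    """Same quantity as sum_v N_v * log(phi_v); types with zero count are skipped."""
    return sum(c * math.log(p) for c, p in zip(N_v, phi) if c > 0)

w = [1, 3, 1, 2, 1]    # hypothetical corpus of N = 5 tokens over V = 3 types
phi = [0.5, 0.2, 0.3]  # hypothetical categorical parameters

# Count the tokens of each type v = 1..V.
N_v = [sum(1 for wn in w if wn == v) for v in range(1, len(phi) + 1)]

print(N_v, log_likelihood(w, phi), log_likelihood_from_counts(N_v, phi))
```

Working in log space avoids underflow, since the product of many probabilities becomes vanishingly small for corpora of realistic size.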

Evidence

P(\D|\H)
&= \int d\Psi P(\D|\Psi,\H)P(\Psi|\H) \\
&= \int d\pphi \prod_{v=1}^V \phi_v^{N_v}
   \frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \prod_{v=1}^V \phi_v^{\beta n_v - 1} \\
&= \frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \int d\pphi \prod_{v=1}^V \phi_v^{N_v+\beta n_v-1} \\
&= \frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_v+\beta n_v)}{\Gamma(N+\beta)}
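The closed form for the evidence is a ratio of gamma functions, which is best evaluated in log space. The following is a minimal sketch assuming every n_v > 0 (`math.lgamma` raises an error at 0); the name `log_evidence` and the example values are illustrative.

```python
import math

def log_evidence(N_v, beta, n):
    """log P(D | H) for counts N_v under a Dir(beta, n) prior; assumes all n_v > 0."""
    N = sum(N_v)
    # log of Gamma(beta) / prod_v Gamma(beta * n_v)
    log_prior_norm = math.lgamma(beta) - sum(math.lgamma(beta * nv) for nv in n)
    # log of prod_v Gamma(N_v + beta * n_v) / Gamma(N + beta)
    log_integral = (sum(math.lgamma(c + beta * nv) for c, nv in zip(N_v, n))
                    - math.lgamma(N + beta))
    return log_prior_norm + log_integral

# V = 2, uniform prior (beta = 2, n = (1/2, 1/2)).
print(math.exp(log_evidence([1, 0], 2.0, [0.5, 0.5])))  # one token of type 1
print(math.exp(log_evidence([1, 1], 2.0, [0.5, 0.5])))  # one token of each type
```

Under this uniform prior a single token has evidence 1/2, and a two-token sequence with one token of each type has evidence \Gamma(2)\Gamma(2)/\Gamma(4)=1/6. Note that the formula gives the probability of one particular token sequence, not of the count vector.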

Remarks

  • \lim_{x\to 0^+}\Gamma(x) = \infty; \lim_{x\to 0^-}\Gamma(x) = -\infty
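  • P(\D|\H) = 0 whenever \D contains a token w_i=v of a type with n_v=0, because the factor \Gamma(N_v+\beta n_v)/\Gamma(\beta n_v) tends to 0 as n_v\to 0^+ when N_v\ge 1.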
  • \text{Beta}(x; \beta, n_1=0) is a degenerate one-point distribution: a Dirac delta spike at the right end (x=1) that carries probability 1, with zero probability everywhere else.

Posterior

P(\Psi|\D,\H)
&=\frac{P(\D|\Psi,\H)P(\Psi|\H)}{P(\D|\H)} \\
&= \frac{\left(\prod_{v=1}^V \phi_v^{N_v}\right)
         \displaystyle\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
         \prod_{v=1}^V \phi_v^{\beta n_v - 1}}
        {\displaystyle\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
         \frac{\prod_{v=1}^V\Gamma(N_v+\beta n_v)}{\Gamma(N+\beta)}} \\
&= \frac{\Gamma(N+\beta)}{\prod_{v=1}^V \Gamma(N_v+\beta n_v)}
   \prod_{v=1}^V\phi_v^{N_v+\beta n_v-1} \\
&= \Dir\left(\pphi;N+\beta,\left(\frac{N_1+\beta n_1}{N+\beta},
        \frac{N_2+\beta n_2}{N+\beta},\dots,
        \frac{N_V+\beta n_V}{N+\beta}\right)\right)
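Because the posterior is again a Dirichlet distribution, updating on data amounts to updating the two parameters. A minimal sketch, with the hypothetical helper `posterior_params` and toy counts:

```python
def posterior_params(N_v, beta, n):
    """Map prior (beta, n) and counts N_v to the posterior (beta', n')."""
    N = sum(N_v)
    beta_post = N + beta  # the concentration grows by the number of tokens
    # New base measure: observed counts blended with prior pseudo-counts.
    n_post = [(c + beta * nv) / beta_post for c, nv in zip(N_v, n)]
    return beta_post, n_post

# Hypothetical example: V = 3, N = 4 tokens, prior beta = 2, n = (0.5, 0.25, 0.25).
beta_post, n_post = posterior_params([3, 1, 0], 2.0, [0.5, 0.25, 0.25])
print(beta_post, n_post)
```

Here the posterior concentration is 6 and the posterior base measure is (4/6, 1.5/6, 0.5/6). The unseen third type keeps nonzero mass because its prior n_v is nonzero.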

Prediction

P(\D'|\D,\H)=\int d\Psi P(\D'|\Psi,\H)P(\Psi|\D,\H)

Consider the case where \D' consists of a single token, \D'=\{w_{N+1}=v\}:

P(w_{N+1}=v|\D,\H)
&= \int d\pphi P(w_{N+1}=v|\pphi,\H)P(\pphi|\D,\H) \\
&= \int d\pphi\, \phi_v
   \Dir\left(\pphi;N+\beta,\left(\frac{N_1+\beta n_1}{N+\beta},
   \frac{N_2+\beta n_2}{N+\beta},\dots,
   \frac{N_V+\beta n_V}{N+\beta}\right)\right) \\
&= \E_{\Dir\left(\pphi;N+\beta,\left(\frac{N_1+\beta n_1}{N+\beta},
   \frac{N_2+\beta n_2}{N+\beta},\dots,
   \frac{N_V+\beta n_V}{N+\beta}\right)\right)}[\phi_v] \\
&= \frac{N_v+\beta n_v}{N+\beta}
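The single-token predictive probability is simply the posterior mean, which makes it a form of additive smoothing. As a sketch under illustrative assumptions (the function name `predictive` and the counts are hypothetical), choosing \beta=V and n_v=1/V gives (N_v+1)/(N+V), the familiar add-one estimate.

```python
def predictive(v, N_v, beta, n):
    """P(w_{N+1} = v | D, H) for a type index v in 1..V."""
    return (N_v[v - 1] + beta * n[v - 1]) / (sum(N_v) + beta)

N_v = [3, 1, 0]  # hypothetical counts, N = 4
V = len(N_v)
beta, n = float(V), [1.0 / V] * V  # beta = V with uniform n_v: add-one smoothing

probs = [predictive(v, N_v, beta, n) for v in range(1, V + 1)]
print(probs)  # approximately [4/7, 2/7, 1/7]
```

The probabilities sum to one, and the type with zero count still receives probability 1/7 rather than zero.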

In general, if \D' contains N' tokens in total, of which N_v' are of type v, then

P(\D'|\D,\H)
&= \int d\pphi \left(\prod_{v=1}^V \phi_v^{N_v'}\right)
   \frac{\Gamma(N+\beta)}{\prod_{v=1}^V \Gamma(N_v+\beta n_v)}
   \prod_{v=1}^V\phi_v^{N_v+\beta n_v-1} \\
&= \frac{\Gamma(N+\beta)}{\prod_{v=1}^V \Gamma(N_v+\beta n_v)}
   \int d\pphi \prod_{v=1}^V\phi_v^{N_v'+N_v+\beta n_v-1} \\
&= \frac{\Gamma(N+\beta)}{\prod_{v=1}^V \Gamma(N_v+\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_v'+N_v+\beta n_v)}
   {\Gamma(N'+N+\beta)}
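The general predictive formula can be sketched in log space in the same way as the evidence. The names `log_predictive` and `Np_v` (standing for the new counts N_v') and the example values are assumptions for illustration; the sketch assumes every n_v > 0. As a check, the two-token case is compared with the product of two successive single-token predictions.

```python
import math

def log_predictive(Np_v, N_v, beta, n):
    """log P(D' | D, H), where Np_v holds the new counts N_v'; assumes all n_v > 0."""
    N, Np = sum(N_v), sum(Np_v)
    # Posterior pseudo-counts N_v + beta * n_v.
    a = [c + beta * nv for c, nv in zip(N_v, n)]
    return (math.lgamma(N + beta) - sum(math.lgamma(x) for x in a)
            + sum(math.lgamma(cp + x) for cp, x in zip(Np_v, a))
            - math.lgamma(Np + N + beta))

N_v = [3, 1, 0]                       # hypothetical observed counts
beta, n = 3.0, [1 / 3, 1 / 3, 1 / 3]  # hypothetical prior

# One new token of type 2, then a two-token sequence with types 1 and 2.
p_single = math.exp(log_predictive([0, 1, 0], N_v, beta, n))
p_pair = math.exp(log_predictive([1, 1, 0], N_v, beta, n))
print(p_single, p_pair)
```

With these values the single token has probability 2/7, and the pair has probability (4/7)(2/8)=1/7, matching what one gets by predicting the first token, adding it to the counts, and predicting the second. The formula gives the probability of one particular ordering of the new tokens.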
