Bayesian Methods for Text


Beta–binomial unigram language model

Recap

Task

  • specify model structure
  • specify probability distributions
  • specify independence assumptions
  • specify other modeling assumptions

Goal

  • explanation: form posterior of latent variable to explain data
  • prediction: form probability of unseen data

Notation

  • \Psi: all unknown random variables of the model
  • \D: observed data
  • \H: modeling assumptions
  • \D': new data

Text model

The corpus \D consists of N word tokens; each token is Bernoulli distributed and characterized by \phi_1, the probability of the type “no”.

NOTE: no document boundaries are considered

\D = \{w_1, \dots, w_N\}, w_i\in\{\text{no},\text{yes}\}

  • Independence assumption: w_i \perp w_j \mid \phi_1 for all i \ne j (tokens are conditionally independent given \phi_1)
  • Exchangeable: bag-of-words model, e.g., P(\text{yes}, \text{yes}, \text{no}, \text{yes}, \text{no})
= P(\text{yes}, \text{no}, \text{yes}, \text{no}, \text{yes})

\Psi = \{\phi_1\} where \phi_1 = P(w=\text{no}|\phi_1,\H).

Specify prior P(\phi_1|\H): specify the degree of belief in the value of \phi_1 over the unit interval [0, 1].

P(\Psi|\H) = \text{Beta}(\phi_1;\beta,n_1)
          = \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
            \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1}

Hyperparameters of Beta distribution

  • \beta > 0: concentration parameter – how concentrated samples are around the mean
  • 0 < n_1 < 1: mean of the Beta distribution, i.e., \E[\phi_1]
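
This (\beta, n_1) mean parameterization maps to the standard shape parameterization via a = \beta n_1 and b = \beta(1-n_1), as can be read off the exponents of the density above. A minimal Python sanity check (hyperparameter values are arbitrary; scipy.stats.beta takes the shape parameters):

    from scipy import stats

    beta, n1 = 10.0, 0.3                  # concentration and mean hyperparameters
    a, b = beta * n1, beta * (1 - n1)     # scipy's standard shape parameters

    prior = stats.beta(a, b)
    print(prior.mean())                   # 0.3, i.e., n_1
    print(prior.var())                    # n_1(1 - n_1)/(beta + 1); shrinks as beta grows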

Properties of Gamma function

  • \Gamma(1) = 1
  • \Gamma(x + 1) = x\Gamma(x)

Mean of Beta distribution

\E_{\text{Beta}(\phi_1;\beta,n_1)}[\phi_1]
&= \int d\phi_1 \phi_1 \text{Beta}(\phi_1;\beta,n_1) \\
&= \int d\phi_1 \phi_1
   \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \int d\phi_1 \phi_1^{\beta n_1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \frac{\Gamma(\beta n_1 + 1) \Gamma(\beta(1-n_1))}{\Gamma(\beta+1)}
   \int d\phi_1
   \text{Beta}\left(\phi_1;\beta+1,\frac{\beta n_1 + 1}{\beta + 1}\right) \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \frac{\Gamma(\beta n_1 + 1) \Gamma(\beta(1-n_1))}{\Gamma(\beta+1)} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)}
   \frac{\beta n_1 \Gamma(\beta n_1)}{\beta \Gamma(\beta)} \\
&= n_1

The second-to-last step uses the fact that the Beta density integrates to 1; the last step applies \Gamma(x + 1) = x\Gamma(x) to both \Gamma(\beta n_1 + 1) and \Gamma(\beta + 1).
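
The same result can be checked numerically by integrating \phi_1 against the density, e.g. with quadrature (a sketch with arbitrary hyperparameter values):

    from scipy import stats
    from scipy.integrate import quad

    beta, n1 = 10.0, 0.3
    pdf = stats.beta(beta * n1, beta * (1 - n1)).pdf
    mean, _ = quad(lambda phi: phi * pdf(phi), 0.0, 1.0)
    print(mean)                           # ~0.3, matching n_1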

Beta–binomial model

Generating process

  • \phi_1\sim \text{Beta}(\phi_1;\beta, n_1)
  • w_n\sim \text{Bern}(\phi_1) for n from 1 to N
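
The two-step process is straightforward to simulate; a minimal sketch (the seed and hyperparameter values are arbitrary, and True encodes “no”):

    import numpy as np

    rng = np.random.default_rng(0)
    beta, n1, N = 10.0, 0.3, 1000

    phi1 = rng.beta(beta * n1, beta * (1 - n1))   # phi_1 ~ Beta(beta, n_1)
    w = rng.random(N) < phi1                      # w_n ~ Bern(phi_1); True encodes "no"
    N1 = int(w.sum())
    print(phi1, N1 / N)                           # empirical frequency tracks phi_1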

Observed data

  • \D = \{w_1, \dots, w_N\}, w_i\in\{\text{no},\text{yes}\}
  • number of “no”: N_1
  • number of “yes”: N-N_1

Likelihood

P(\D|\Psi,\H) &= P(w_1, \dots, w_N|\phi_1,\H) \\
&= \prod_{n=1}^N P(w_n|\phi_1,\H) \\
&= \prod_{n=1}^N \phi_1^{\delta(w_n = \text{no})}
(1-\phi_1)^{\delta(w_n = \text{yes})} \\
&= \phi_1^{\sum_{n=1}^N \delta(w_n = \text{no})}
(1-\phi_1)^{\sum_{n=1}^N \delta(w_n = \text{yes})} \\
&= \phi_1^{N_1} (1-\phi_1)^{N-N_1}
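
In code, the log-likelihood depends only on the sufficient statistics N_1 and N; a sketch (the helper name is illustrative, and log1p is used for numerical stability):

    import numpy as np

    def log_likelihood(phi1, N1, N):
        # log P(D | phi_1, H) = N_1 log(phi_1) + (N - N_1) log(1 - phi_1)
        return N1 * np.log(phi1) + (N - N1) * np.log1p(-phi1)

    print(log_likelihood(0.3, N1=7, N=10))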

Prior

P(\Psi|\H) &= P(\phi_1|\H) \\
&= \text{Beta}(\phi_1; \beta, n_1) \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1}

Evidence

P(\D|\H) &= \int d\Psi P(\D|\Psi,\H) P(\Psi|\H) \\
&= \int d\phi_1 \phi_1^{N_1} (1-\phi_1)^{N-N_1}
\text{Beta}(\phi_1; \beta, n_1) \\
&= \int d\phi_1 \phi_1^{N_1} (1-\phi_1)^{N-N_1}
   \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \int d\phi_1 \phi_1^{N_1 + \beta n_1 - 1}
   (1-\phi_1)^{N - N_1 + \beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \frac{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}{\Gamma(N+\beta)}
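
Because the evidence is a ratio of Gamma functions, it is best evaluated in log space with gammaln; a sketch under the same (\beta, n_1) parameterization (helper name and values are illustrative):

    from scipy.special import gammaln

    def log_evidence(N1, N, beta, n1):
        a, b = beta * n1, beta * (1 - n1)         # a + b = beta
        return (gammaln(a + b) - gammaln(a) - gammaln(b)
                + gammaln(N1 + a) + gammaln(N - N1 + b)
                - gammaln(N + a + b))

    print(log_evidence(N1=7, N=10, beta=10.0, n1=0.3))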

Posterior

P(\Psi|\D,\H) &= \frac{P(\D|\Psi,\H) P(\Psi|\H)}{P(\D|\H)} \\
&= P(\D|\H)^{-1} \phi_1^{N_1} (1-\phi_1)^{N-N_1}
\frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
\phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(N+\beta)}{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}
\phi_1^{N_1 + \beta n_1 - 1} (1-\phi_1)^{N - N_1 + \beta(1-n_1) - 1} \\
&= \text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right)
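
Conjugacy makes the posterior update a one-liner: add the observed counts to the prior pseudo-counts. A sketch converting back to the (concentration, mean) form used above (helper name and counts are illustrative):

    def posterior_params(N1, N, beta, n1):
        a_post = N1 + beta * n1                   # "no" count plus prior pseudo-count
        b_post = (N - N1) + beta * (1 - n1)       # "yes" count plus prior pseudo-count
        # back to (concentration, mean) form
        return a_post + b_post, a_post / (a_post + b_post)

    print(posterior_params(N1=7, N=10, beta=10.0, n1=0.3))   # (20.0, 0.5)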

Remarks

  • In the Bayesian framework, the prior performs theoretically sound smoothing of the likelihood: \beta n_1 and \beta(1-n_1) act as pseudo-counts added to the observed counts N_1 and N - N_1.
  • Conjugate prior: posterior and prior have the same functional form (here, both Beta)

Exploration

Summarize posterior by its mean

\E_{P(\Psi|\D,\H)}[\Psi]
= \E_{\text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right)}
[\phi_1]
= \frac{N_1+\beta n_1}{N+\beta}
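
Numerically, the posterior mean interpolates between the maximum-likelihood estimate N_1/N and the prior mean n_1; a small worked example with arbitrary counts:

    N1, N = 7, 10
    beta, n1 = 10.0, 0.3
    post_mean = (N1 + beta * n1) / (N + beta)
    print(N1 / N, post_mean, n1)          # 0.7 (MLE), 0.5 (posterior mean), 0.3 (prior mean)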

Prediction

The predictive distribution of a single unseen example is

P(\D'=\{w_{N+1}=\text{no}\}|\D,\H)
&= \int d\Psi P(\D'=\{w_{N+1}=\text{no}\}|\Psi,\H)P(\Psi|\D,\H) \\
&= \int d\phi_1 P(w_{N+1} = \text{no}|\phi_1,\H) P(\phi_1|\D,\H) \\
&= \int d\phi_1 \phi_1
\text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right) \\
&= \E_{P(\Psi|\D,\H)}[\phi_1] \\
&= \frac{N_1+\beta n_1}{N+\beta}
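
The identity “predictive probability = posterior mean” can also be verified by Monte Carlo, averaging \phi_1 over posterior samples; a sketch with arbitrary counts and seed:

    import numpy as np

    rng = np.random.default_rng(1)
    N1, N, beta, n1 = 7, 10, 10.0, 0.3
    a, b = N1 + beta * n1, (N - N1) + beta * (1 - n1)
    phis = rng.beta(a, b, size=100_000)           # draws from the posterior
    print(phis.mean(), (N1 + beta * n1) / (N + beta))   # ~0.5 vs exactly 0.5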

In general, supposing \D' contains N_1' tokens of “no” and N' - N_1' tokens of “yes”, the predictive distribution is as follows.

P(\D'|\D,\H)
&= \int d\Psi P(\D'|\Psi,\H)P(\Psi|\D,\H) \\
&= \int d\phi_1 \phi_1^{N_1'} (1-\phi_1)^{N' - N_1'}
\text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right) \\
&= \frac{\Gamma(N+\beta)}{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}
\int d\phi_1 \phi_1^{N_1' + N_1 + \beta n_1 - 1}
(1-\phi_1)^{N'-N_1'+N-N_1+\beta(1-n_1)-1} \\
&= \frac{\Gamma(N+\beta)}
{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}
\frac{\Gamma(N_1' + N_1 + \beta n_1)\Gamma(N'-N_1'+N-N_1+\beta(1-n_1))}
{\Gamma(N'+N+\beta)}
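
Like the evidence, the general predictive probability is a ratio of Gamma functions and should be evaluated with gammaln; a sketch mirroring the formula above (helper name and values are illustrative):

    from scipy.special import gammaln

    def log_predictive(N1_new, N_new, N1, N, beta, n1):
        a = N1 + beta * n1                        # posterior pseudo-count for "no"
        b = (N - N1) + beta * (1 - n1)            # posterior pseudo-count for "yes"
        return (gammaln(a + b) - gammaln(a) - gammaln(b)
                + gammaln(N1_new + a) + gammaln(N_new - N1_new + b)
                - gammaln(N_new + a + b))

    print(log_predictive(N1_new=3, N_new=5, N1=7, N=10, beta=10.0, n1=0.3))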
