Bayesian Methods for Text


Beta–binomial unigram language model

Recap

Task

  • specify model structure
  • specify probability distributions
  • specify independence assumptions
  • specify other modeling assumptions

Goal

  • explanation: form posterior of latent variable to explain data
  • prediction: form probability of unseen data

Notation

  • \Psi: all unknown random variables of the model
  • \D: observed data
  • \H: modeling assumptions
  • \D': new data

Text model

The corpus \D consists of N word tokens; each token is Bernoulli distributed and characterized by \phi_1, the probability of the type “no”.

NOTE: no document boundaries are considered

\D = \{w_1, \dots, w_N\}, w_i\in\{\text{no},\text{yes}\}

  • Independence assumption: w_i \perp w_j \mid \phi_1 for all i \ne j (tokens are conditionally independent given \phi_1)
  • Exchangeable: bag-of-words model, e.g., P(\text{yes}, \text{yes}, \text{no}, \text{yes}, \text{no})
= P(\text{yes}, \text{no}, \text{yes}, \text{no}, \text{yes})

\Psi = \{\phi_1\} where \phi_1 = P(w=\text{no}|\phi_1,\H).

Specify prior P(\phi_1|\H): specify the degree of belief in the value of \phi_1 over the unit interval [0, 1].

P(\Psi|\H) = \text{Beta}(\phi_1;\beta,n_1)
          = \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
            \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1}

Hyperparameters of Beta distribution

  • \beta > 0: concentration parameter – how concentrated samples are around the mean
  • 0 < n_1 < 1: mean of the Beta distribution, i.e., \E[\phi_1]
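
This (\beta, n_1) mean parameterization maps to the standard shape parameterization via a = \beta n_1 and b = \beta(1-n_1), as can be read off the exponents of the density above. A minimal Python sanity check (hyperparameter values are arbitrary; scipy.stats.beta takes the shape parameters):

    from scipy import stats

    beta, n1 = 10.0, 0.3                  # concentration and mean hyperparameters
    a, b = beta * n1, beta * (1 - n1)     # scipy's standard shape parameters

    prior = stats.beta(a, b)
    print(prior.mean())                   # 0.3, i.e., n_1
    print(prior.var())                    # n_1(1 - n_1)/(beta + 1); shrinks as beta grows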

Properties of Gamma function

  • \Gamma(1) = 1
  • \Gamma(x + 1) = x\Gamma(x)

Mean of Beta distribution

\E_{\text{Beta}(\phi_1;\beta,n_1)}[\phi_1]
&= \int d\phi_1 \phi_1 \text{Beta}(\phi_1;\beta,n_1) \\
&= \int d\phi_1 \phi_1
   \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \int d\phi_1 \phi_1^{\beta n_1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \frac{\Gamma(\beta n_1 + 1) \Gamma(\beta(1-n_1))}{\Gamma(\beta+1)}
   \int d\phi_1
   \text{Beta}\left(\phi_1;\beta+1,\frac{\beta n_1 + 1}{\beta + 1}\right) \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \frac{\Gamma(\beta n_1 + 1) \Gamma(\beta(1-n_1))}{\Gamma(\beta+1)} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)}
   \frac{\beta n_1 \Gamma(\beta n_1)}{\beta \Gamma(\beta)} \\
&= n_1

The second-to-last step uses the fact that the Beta density integrates to 1; the last step applies \Gamma(x + 1) = x\Gamma(x) to both \Gamma(\beta n_1 + 1) and \Gamma(\beta + 1).
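
The same result can be checked numerically by integrating \phi_1 against the density, e.g. with quadrature (a sketch with arbitrary hyperparameter values):

    from scipy import stats
    from scipy.integrate import quad

    beta, n1 = 10.0, 0.3
    pdf = stats.beta(beta * n1, beta * (1 - n1)).pdf
    mean, _ = quad(lambda phi: phi * pdf(phi), 0.0, 1.0)
    print(mean)                           # ~0.3, matching n_1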

Beta–binomial model

Generating process

  • \phi_1\sim \text{Beta}(\phi_1;\beta, n_1)
  • w_n\sim \text{Bern}(\phi_1) for n from 1 to N
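
The two-step process is straightforward to simulate; a minimal sketch (the seed and hyperparameter values are arbitrary, and True encodes “no”):

    import numpy as np

    rng = np.random.default_rng(0)
    beta, n1, N = 10.0, 0.3, 1000

    phi1 = rng.beta(beta * n1, beta * (1 - n1))   # phi_1 ~ Beta(beta, n_1)
    w = rng.random(N) < phi1                      # w_n ~ Bern(phi_1); True encodes "no"
    N1 = int(w.sum())
    print(phi1, N1 / N)                           # empirical frequency tracks phi_1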

Observed data

  • \D = \{w_1, \dots, w_N\}, w_i\in\{\text{no},\text{yes}\}
  • number of “no”: N_1
  • number of “yes”: N-N_1

Likelihood

P(\D|\Psi,\H) &= P(w_1, \dots, w_N|\phi_1,\H) \\
&= \prod_{n=1}^N P(w_n|\phi_1,\H) \\
&= \prod_{n=1}^N \phi_1^{\delta(w_n = \text{no})}
(1-\phi_1)^{\delta(w_n = \text{yes})} \\
&= \phi_1^{\sum_{n=1}^N \delta(w_n = \text{no})}
(1-\phi_1)^{\sum_{n=1}^N \delta(w_n = \text{yes})} \\
&= \phi_1^{N_1} (1-\phi_1)^{N-N_1}
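
In code, the log-likelihood depends only on the sufficient statistics N_1 and N; a sketch (the helper name is illustrative, and log1p is used for numerical stability):

    import numpy as np

    def log_likelihood(phi1, N1, N):
        # log P(D | phi_1, H) = N_1 log(phi_1) + (N - N_1) log(1 - phi_1)
        return N1 * np.log(phi1) + (N - N1) * np.log1p(-phi1)

    print(log_likelihood(0.3, N1=7, N=10))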

Prior

P(\Psi|\H) &= P(\phi_1|\H) \\
&= \text{Beta}(\phi_1; \beta, n_1) \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1}

Evidence

P(\D|\H) &= \int d\Psi P(\D|\Psi,\H) P(\Psi|\H) \\
&= \int d\phi_1 \phi_1^{N_1} (1-\phi_1)^{N-N_1}
\text{Beta}(\phi_1; \beta, n_1) \\
&= \int d\phi_1 \phi_1^{N_1} (1-\phi_1)^{N-N_1}
   \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \int d\phi_1 \phi_1^{N_1 + \beta n_1 - 1}
   (1-\phi_1)^{N - N_1 + \beta(1-n_1) - 1} \\
&= \frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
   \frac{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}{\Gamma(N+\beta)}
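
Because the evidence is a ratio of Gamma functions, it is best evaluated in log space with gammaln; a sketch under the same (\beta, n_1) parameterization (helper name and values are illustrative):

    from scipy.special import gammaln

    def log_evidence(N1, N, beta, n1):
        a, b = beta * n1, beta * (1 - n1)         # a + b = beta
        return (gammaln(a + b) - gammaln(a) - gammaln(b)
                + gammaln(N1 + a) + gammaln(N - N1 + b)
                - gammaln(N + a + b))

    print(log_evidence(N1=7, N=10, beta=10.0, n1=0.3))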

Posterior

P(\Psi|\D,\H) &= \frac{P(\D|\Psi,\H) P(\Psi|\H)}{P(\D|\H)} \\
&= P(\D|\H)^{-1} \phi_1^{N_1} (1-\phi_1)^{N-N_1}
\frac{\Gamma(\beta)}{\Gamma(\beta n_1)\Gamma(\beta(1-n_1))}
\phi_1^{\beta n_1 - 1} (1-\phi_1)^{\beta(1-n_1) - 1} \\
&= \frac{\Gamma(N+\beta)}{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}
\phi_1^{N_1 + \beta n_1 - 1} (1-\phi_1)^{N - N_1 + \beta(1-n_1) - 1} \\
&= \text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right)
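
Conjugacy makes the posterior update a one-liner: add the observed counts to the prior pseudo-counts. A sketch converting back to the (concentration, mean) form used above (helper name and counts are illustrative):

    def posterior_params(N1, N, beta, n1):
        a_post = N1 + beta * n1                   # "no" count plus prior pseudo-count
        b_post = (N - N1) + beta * (1 - n1)       # "yes" count plus prior pseudo-count
        # back to (concentration, mean) form
        return a_post + b_post, a_post / (a_post + b_post)

    print(posterior_params(N1=7, N=10, beta=10.0, n1=0.3))   # (20.0, 0.5)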

Remarks

  • In the Bayesian framework, the prior performs theoretically sound smoothing of the likelihood: \beta n_1 and \beta(1-n_1) act as pseudo-counts added to the observed counts N_1 and N - N_1.
  • Conjugate prior: posterior and prior have the same functional form (here, both Beta)

Exploration

Summarize posterior by its mean

\E_{P(\Psi|\D,\H)}[\Psi]
= \E_{\text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right)}
[\phi_1]
= \frac{N_1+\beta n_1}{N+\beta}
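
Numerically, the posterior mean interpolates between the maximum-likelihood estimate N_1/N and the prior mean n_1; a small worked example with arbitrary counts:

    N1, N = 7, 10
    beta, n1 = 10.0, 0.3
    post_mean = (N1 + beta * n1) / (N + beta)
    print(N1 / N, post_mean, n1)          # 0.7 (MLE), 0.5 (posterior mean), 0.3 (prior mean)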

Prediction

The predictive distribution of a single unseen example is

P(\D'=\{w_{N+1}=\text{no}\}|\D,\H)
&= \int d\Psi P(\D'=\{w_{N+1}=\text{no}\}|\Psi,\H)P(\Psi|\D,\H) \\
&= \int d\phi_1 P(w_{N+1} = \text{no}|\phi_1,\H) P(\phi_1|\D,\H) \\
&= \int d\phi_1 \phi_1
\text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right) \\
&= \E_{P(\Psi|\D,\H)}[\phi_1] \\
&= \frac{N_1+\beta n_1}{N+\beta}
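
The identity “predictive probability = posterior mean” can also be verified by Monte Carlo, averaging \phi_1 over posterior samples; a sketch with arbitrary counts and seed:

    import numpy as np

    rng = np.random.default_rng(1)
    N1, N, beta, n1 = 7, 10, 10.0, 0.3
    a, b = N1 + beta * n1, (N - N1) + beta * (1 - n1)
    phis = rng.beta(a, b, size=100_000)           # draws from the posterior
    print(phis.mean(), (N1 + beta * n1) / (N + beta))   # ~0.5 vs exactly 0.5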

In general, supposing \D' contains N_1' tokens of “no” and N' - N_1' tokens of “yes”, the predictive distribution is as follows.

P(\D'|\D,\H)
&= \int d\Psi P(\D'|\Psi,\H)P(\Psi|\D,\H) \\
&= \int d\phi_1 \phi_1^{N_1'} (1-\phi_1)^{N' - N_1'}
\text{Beta}\left(\phi_1;N+\beta, \frac{N_1+\beta n_1}{N+\beta}\right) \\
&= \frac{\Gamma(N+\beta)}{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}
\int d\phi_1 \phi_1^{N_1' + N_1 + \beta n_1 - 1}
(1-\phi_1)^{N'-N_1'+N-N_1+\beta(1-n_1)-1} \\
&= \frac{\Gamma(N+\beta)}
{\Gamma(N_1+\beta n_1)\Gamma(N-N_1+\beta(1-n_1))}
\frac{\Gamma(N_1' + N_1 + \beta n_1)\Gamma(N'-N_1'+N-N_1+\beta(1-n_1))}
{\Gamma(N'+N+\beta)}
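
Like the evidence, the general predictive probability is a ratio of Gamma functions and should be evaluated with gammaln; a sketch mirroring the formula above (helper name and values are illustrative):

    from scipy.special import gammaln

    def log_predictive(N1_new, N_new, N1, N, beta, n1):
        a = N1 + beta * n1                        # posterior pseudo-count for "no"
        b = (N - N1) + beta * (1 - n1)            # posterior pseudo-count for "yes"
        return (gammaln(a + b) - gammaln(a) - gammaln(b)
                + gammaln(N1_new + a) + gammaln(N_new - N1_new + b)
                - gammaln(N_new + a + b))

    print(log_predictive(N1_new=3, N_new=5, N1=7, N=10, beta=10.0, n1=0.3))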
