Hyperparameter optimization

The hyperparameters of LDA are $\alpha$, $\beta$, $\mm$, and $\nn$.
- $\mm$ : $T$ -dimensional
- $\nn$ : $V$ -dimensional
Previously we assumed that symmetric Dirichlet priors are used:
- $\mm=\{1/T, \dots, 1/T\}$
- $\nn=\{1/V, \dots, 1/V\}$
We now relax the symmetry assumption and consider asymmetric Dirichlet priors.
- How can the optimal hyperparameters be identified according to a given criterion?

We will be working on $\alpha\mm$ and $\beta\nn$ as a whole because they always occur together.

$\Psi = \{\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D, \zz, \alpha\mm, \beta\nn\}$

Assumption: the $\alpha m_t$ are drawn iid from a common prior for all $t$ , and so are the $\beta n_v$ for all $v$ .

Gamma distribution

$P(x|s,c) = \frac{1}{\Gamma(c)s}\left(\frac{x}{s}\right)^{c-1} \exp\left(-\frac{x}{s}\right)$

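Here $s$ is the scale and $c$ is the shape, so the mean is $sc$ . In the limit $sc=1$ , $c\to 0$ , the Gamma distribution approaches the noninformative (improper) prior $P(x) \propto 1/x$ , which is uniform over $\log x$ for $x > 0$ .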
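The density above can be evaluated numerically to inspect how its shape changes along $sc=1$ . The following is a minimal sketch using only the standard library; the function name `gamma_pdf` and the probe points are illustrative choices, not part of the original notes.

```python
import math

def gamma_pdf(x, s, c):
    """Gamma density P(x | s, c) with scale s and shape c, computed in log space."""
    log_p = -math.lgamma(c) - math.log(s) + (c - 1.0) * math.log(x / s) - x / s
    return math.exp(log_p)

# Keep the mean s*c fixed at 1 and vary the shape c.
for c in (1e-6, 1.0, 100.0):
    s = 1.0 / c
    ratio = gamma_pdf(1.0, s, c) / gamma_pdf(2.0, s, c)
    # For very small c the ratio approaches 2, i.e. the density behaves like 1/x.
    # For large c the mass concentrates near x = 1 and the ratio becomes very large.
    print(f"c={c:g}: P(1)/P(2) = {ratio:.4g}")
```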

Posterior

$\begin{aligned} P(\Psi|\D,\H) &= P(\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D | \zz, \alpha\mm, \beta\nn, \D, \H) P(\zz | \alpha\mm, \beta\nn, \D, \H) P(\alpha\mm, \beta\nn | \D, \H) \\ &= P(\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D | \zz, \alpha\mm, \beta\nn, \D, \H) P(\zz, \alpha\mm, \beta\nn | \D, \H) \end{aligned}$

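Given the hyperparameters, $\zz$ is sampled one token at a time from the collapsed conditional, where token $n$ has word type $v$ and belongs to document $d$ :

$P(z_n=t | \zz_{\setminus n}, \D, \alpha\mm, \beta\nn, \H) \propto \frac{N_{v|t}^{\setminus n} + \beta n_v}{N_t^{\setminus n} + \beta} \left(N_{t|d}^{\setminus n} + \alpha m_t\right)$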
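The conditional above can be sketched in a few lines of Python. This is an illustration under assumed data structures, not the notes' implementation: `N_vt[v][t]`, `N_t[t]` and `N_td[d][t]` are hypothetical count tables that already exclude token $n$ , and `alpha_m`, `beta_n` hold the products $\alpha m_t$ and $\beta n_v$ .

```python
def topic_conditional(v, d, N_vt, N_t, N_td, alpha_m, beta_n):
    """Normalized P(z_n = t | rest) for a token of word type v in document d.

    All counts are assumed to exclude the token being resampled.
    """
    beta = sum(beta_n)  # beta = sum_v beta * n_v, since n sums to one
    weights = [
        (N_vt[v][t] + beta_n[v]) / (N_t[t] + beta) * (N_td[d][t] + alpha_m[t])
        for t in range(len(alpha_m))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Tiny example with T = 2 topics and V = 2 word types.
probs = topic_conditional(
    v=0, d=0,
    N_vt=[[1, 0], [0, 1]],
    N_t=[1, 1],
    N_td=[[1, 1]],
    alpha_m=[0.5, 0.5],
    beta_n=[0.5, 0.5],
)
print(probs)
```

A new topic for the token could then be drawn with `random.choices(range(T), weights=probs)`.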

Hyperparameters optimization

Instead of sampling values from the posterior $P(\alpha m_1, \dots, \alpha m_T, \beta n_1, \dots, \beta n_V | \zz, \D, \H)$ , maximize the posterior probability directly.

It follows an EM-like framework:

repeat:
- sample $\zz$ with the hyperparameters held fixed
- optimize $\alpha\mm$ and $\beta\nn$ using the sampled $\zz$
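The alternation above can be sketched as a small driver loop. The names `sample_z` and `optimize_hyper` are hypothetical callables standing in for a Gibbs sweep and a hyperparameter optimizer; the fixed number of rounds is also an assumption, since the notes do not specify a stopping rule.

```python
def sample_then_optimize(z, hyper, sample_z, optimize_hyper, n_rounds=10):
    """Alternate between sampling z and optimizing the hyperparameters.

    sample_z(z, hyper) -> new z          (e.g. one or more Gibbs sweeps)
    optimize_hyper(z, hyper) -> new hyper (e.g. a few fixed-point iterations)
    """
    for _ in range(n_rounds):
        z = sample_z(z, hyper)              # sampling step, hyperparameters fixed
        hyper = optimize_hyper(z, hyper)    # optimization step, z fixed
    return z, hyper

# Dummy callables that only demonstrate the control flow.
z, hyper = sample_then_optimize(
    z=0, hyper=1.0,
    sample_z=lambda z, h: z + 1,
    optimize_hyper=lambda z, h: h * 0.5,
    n_rounds=3,
)
print(z, hyper)
```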

Notation:

$\U = \{\alpha m_1, \dots, \alpha m_T, \beta n_1, \dots, \beta n_V\}$

Objective:

maximize $P(\U | \zz, \D) \propto P(\zz, \D | \U)$ , where the proportionality assumes a noninformative (flat) prior over $\U$ .

$\begin{aligned} \U^* &= \argmax_{\U} P(\zz, \D | \U) \\ &= \argmax_{\U} P(\D | \zz, \U)P(\zz | \U) \end{aligned}$

where $\beta\nn$ appears only in $P(\D | \zz, \U)$ , and $\alpha\mm$ appears only in $P(\zz | \U)$ . The two factors can therefore be optimized separately, each by iterating its own update until convergence.

Note that $P(\zz | \alpha\mm)$ is concave in $\alpha\mm$ , which means the optimization converges to the global optimum.

Fixed-point iteration

Minka’s fixed-point iteration [1] is a fast algorithm to optimize the hyperparameters of LDA.

$P(\zz | \alpha\mm) = \prod_d \frac{\Gamma(\alpha)}{\prod_t \Gamma(\alpha m_t)} \frac{\prod_t \Gamma(N_{t|d} + \alpha m_t)}{\Gamma(N_d + \alpha)}$

$\begin{aligned} \log P(\zz | \alpha\mm) &= \sum_d \left[\log\Gamma(\alpha) - \sum_t\log\Gamma(\alpha m_t) + \sum_t\log\Gamma(N_{t|d} + \alpha m_t) - \log\Gamma(N_d + \alpha) \right] \\ &= \sum_d \left[\log\Gamma(\alpha) - \log\Gamma(N_d + \alpha) + \sum_t\left(\log\Gamma(N_{t|d} + \alpha m_t) - \log\Gamma(\alpha m_t) \right)\right] \end{aligned}$
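The log-probability above is straightforward to evaluate with the standard library's `math.lgamma`. The sketch below assumes a hypothetical count table `N_td[d][t]` (tokens in document $d$ assigned to topic $t$ ) and a list `alpha_m` holding the products $\alpha m_t$ .

```python
import math

def log_p_z(N_td, alpha_m):
    """log P(z | alpha*m) for per-document topic counts N_td[d][t]."""
    alpha = sum(alpha_m)  # alpha = sum_t alpha * m_t, since m sums to one
    total = 0.0
    for doc in N_td:
        n_d = sum(doc)  # N_d: number of tokens in this document
        total += math.lgamma(alpha) - math.lgamma(n_d + alpha)
        total += sum(
            math.lgamma(n + am) - math.lgamma(am)
            for n, am in zip(doc, alpha_m)
        )
    return total

# One document with a single token in topic 0, and alpha*m = (1, 1):
# P(z) = Gamma(2)/Gamma(3) * Gamma(2)/Gamma(1) = 1/2.
print(log_p_z([[1, 0]], [1.0, 1.0]))
```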

Bound 1

For any $x\in\R^{+}$ and $a\in\Z^{+}$

$\log\Gamma(x) - \log\Gamma(x+a) \ge \log\Gamma(\hat{x}) - \log\Gamma(\hat{x} + a) + \Big(\Psi(\hat{x} + a) - \Psi(\hat{x})\Big)(\hat{x} - x)$

Bound 2

For any $x\in\R^{+}$ and $a\in\Z^{+}$

$\log\Gamma(x+a) - \log\Gamma(x) \ge \log\Gamma(\hat{x}+a) - \log\Gamma(\hat{x}) + \hat{x}\Big(\Psi(\hat{x}+a)-\Psi(\hat{x})\Big)(\log x - \log\hat{x})$

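where $\Psi(x) = \displaystyle\frac{d}{dx}\log\Gamma(x)$ is the digamma function (not to be confused with the parameter set $\Psi$ defined earlier). Both bounds are tight at $x = \hat{x}$ .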
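Both bounds can be checked numerically. The sketch below is only a sanity check, not a proof; it assumes integer $a$ so that the digamma difference can be computed as the finite sum $\sum_{k=0}^{a-1} 1/(\hat{x}+k)$ , and the helper names are illustrative.

```python
import math

def digamma_diff(x, a):
    """Psi(x + a) - Psi(x) for a non-negative integer a."""
    return sum(1.0 / (x + k) for k in range(a))

def bound1_gap(x, x_hat, a):
    """Left side minus right side of Bound 1; should be >= 0."""
    lhs = math.lgamma(x) - math.lgamma(x + a)
    rhs = (math.lgamma(x_hat) - math.lgamma(x_hat + a)
           + digamma_diff(x_hat, a) * (x_hat - x))
    return lhs - rhs

def bound2_gap(x, x_hat, a):
    """Left side minus right side of Bound 2; should be >= 0."""
    lhs = math.lgamma(x + a) - math.lgamma(x)
    rhs = (math.lgamma(x_hat + a) - math.lgamma(x_hat)
           + x_hat * digamma_diff(x_hat, a) * (math.log(x) - math.log(x_hat)))
    return lhs - rhs

# Evaluate both gaps on a small grid of (x, x_hat, a).
grid = [0.1, 0.5, 1.0, 2.5, 10.0]
gaps = [
    (bound1_gap(x, xh, a), bound2_gap(x, xh, a))
    for x in grid for xh in grid for a in (1, 3, 20)
]
print(min(g1 for g1, _ in gaps), min(g2 for _, g2 in gaps))
```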

Fixed-point iteration for hyperparameters of LDA

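Suppose $\alpha\mm^{*}$ denotes the optimal parameters and $\alpha\mm$ the current estimate, with $\alpha^{*} = \sum_t \alpha m_t^{*}$ . Applying Bound 1 (with $\hat{x} = \alpha$ ) and Bound 2 (with $\hat{x} = \alpha m_t$ ) to $\log P(\zz | \alpha\mm^{*})$ , it follows that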

$\begin{aligned} \log P(\zz | \alpha\mm^{*}) &\ge B(\alpha\mm^{*}) \\ &= \sum_d\Bigg[\log\Gamma(\alpha)-\log\Gamma(N_d+\alpha)+ \Big(\Psi(N_d+\alpha)-\Psi(\alpha)\Big)(\alpha-\alpha^{*}) \\ &\qquad + \sum_t \left[\log\Gamma(N_{t|d}+\alpha m_t) - \log\Gamma(\alpha m_t) + \alpha m_t\Big(\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t)\Big) \Big(\log(\alpha m_t^{*}) - \log(\alpha m_t)\Big)\right]\Bigg] \\ &= \sum_d \left[ \Big(\Psi(N_d+\alpha)-\Psi(\alpha)\Big)(-\alpha^{*}) + \sum_t \alpha m_t \Big(\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t) \Big) \log \alpha m_t^{*} \right] + C \end{aligned}$

Here $C$ collects the terms that do not depend on $\alpha\mm^{*}$ .

$\frac{\partial B(\alpha\mm^{*})}{\partial \alpha m_t^{*}} = \sum_d \left[ \frac{\alpha m_t\left(\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t)\right)} {\alpha m_t^{*}} - \Big(\Psi(N_d + \alpha) - \Psi(\alpha)\Big) \right]$

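Setting the derivative of $B$ with respect to $\alpha m_t^{*}$ to zero gives the fixed-point update

$\alpha m_t^{*} = \alpha m_t \frac{\sum_d \left[\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t)\right]} {\sum_d \left[\Psi(N_d+\alpha) - \Psi(\alpha)\right]}$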
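A minimal sketch of this update in Python follows. It is an illustration under stated assumptions, not a reference implementation: `N_td[d][t]` is a hypothetical table of per-document topic counts, `alpha_m` holds the current $\alpha m_t$ , and because the counts are integers the digamma differences are computed exactly as $\Psi(x+n)-\Psi(x)=\sum_{k=0}^{n-1} 1/(x+k)$ . The iteration cap and tolerance are arbitrary choices.

```python
def digamma_diff(x, n):
    """Psi(x + n) - Psi(x) for a non-negative integer n."""
    return sum(1.0 / (x + k) for k in range(n))

def fixed_point_step(N_td, alpha_m):
    """One fixed-point update of the vector alpha*m given counts N_td[d][t]."""
    alpha = sum(alpha_m)
    # Denominator: sum over documents of Psi(N_d + alpha) - Psi(alpha).
    denom = sum(digamma_diff(alpha, sum(doc)) for doc in N_td)
    updated = []
    for t, am in enumerate(alpha_m):
        # Numerator: sum over documents of Psi(N_{t|d} + alpha m_t) - Psi(alpha m_t).
        numer = sum(digamma_diff(am, doc[t]) for doc in N_td)
        updated.append(am * numer / denom)
    return updated

def optimize_alpha_m(N_td, alpha_m, n_iters=100, tol=1e-8):
    """Repeat the update until the largest change drops below tol."""
    for _ in range(n_iters):
        updated = fixed_point_step(N_td, alpha_m)
        if max(abs(a - b) for a, b in zip(updated, alpha_m)) < tol:
            return updated
        alpha_m = updated
    return alpha_m

# Two documents, two topics, starting from alpha*m = (1, 1).
step = fixed_point_step([[2, 0], [1, 1]], [1.0, 1.0])
print(step)
```

The same update should apply to $\beta\nn$ by analogy, with topics playing the role of documents: $N_{v|t}$ in place of $N_{t|d}$ and $N_t$ in place of $N_d$ . The sketch assumes at least one non-empty document, otherwise the denominator is zero.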

How to approximate $\Psi(x)$ using the recurrence relationship of the gamma function?
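One possible answer, sketched under the assumption that the offset $n$ is a non-negative integer (as it is for the counts $N_{t|d}$ and $N_d$ ): the recurrence $\Gamma(x+1) = x\Gamma(x)$ yields the digamma differences exactly, so $\Psi$ itself never needs to be approximated in the update.

$\begin{aligned} \log\Gamma(x+1) &= \log x + \log\Gamma(x) \\ \Psi(x+1) &= \frac{1}{x} + \Psi(x) \\ \Psi(x+n) - \Psi(x) &= \sum_{k=0}^{n-1} \frac{1}{x+k} \end{aligned}$

The second line follows by differentiating the first with respect to $x$ , and the third by applying the second $n$ times.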

Remarks

- Run only a few iterations: use optimization as a proxy for sampling.
- Use the asymmetric model: stop words will stand out.
- Workaround: fix $\mm$ and $\nn$ to uniform distributions and still use fixed-point iteration to optimize $\alpha$ and $\beta$ .
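A sketch of what the workaround might look like for $\alpha$ , assuming $m_t = 1/T$ is substituted into the bound $B$ and only the scalar $\alpha^{*}$ is optimized. This derivation is not part of the original notes; the update for $\beta$ would be analogous, with $N_{v|t}$ , $N_t$ and $V$ in place of $N_{t|d}$ , $N_d$ and $T$ .

$\begin{aligned} B(\alpha^{*}) &= \sum_d \left[ -\Big(\Psi(N_d+\alpha)-\Psi(\alpha)\Big)\alpha^{*} + \sum_t \frac{\alpha}{T} \Big(\Psi(N_{t|d}+\alpha/T) - \Psi(\alpha/T)\Big) \log\frac{\alpha^{*}}{T} \right] + C \\ \alpha^{*} &= \alpha \frac{\sum_d \sum_t \frac{1}{T}\left[\Psi(N_{t|d}+\alpha/T) - \Psi(\alpha/T)\right]} {\sum_d \left[\Psi(N_d+\alpha) - \Psi(\alpha)\right]} \end{aligned}$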

References

[1] T. Minka. Estimating a Dirichlet distribution. http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf

Bayesian Methods for Text