Bayesian Methods for Text

Hyperparameter optimization

We treat \alpha\mm and \beta\nn as single quantities throughout, because \alpha and \mm always occur together, as do \beta and \nn.

\Psi = \{\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D,
         \zz, \alpha\mm, \beta\nn\}

Assumption: the \alpha m_t are drawn i.i.d. from some prior, identically for all t; likewise the \beta n_v for all v.

Gamma distribution

P(x|s,c) = \frac{1}{\Gamma(c)s}\left(\frac{x}{s}\right)^{c-1}
           \exp\left(-\frac{x}{s}\right)

In the limit sc = 1 with c \to 0 (so that s = 1/c \to \infty), the density becomes proportional to x^{c-1}e^{-cx} \to 1/x, an improper prior that is uniform in \log x over x > 0; in this sense the Gamma prior becomes vague over x > 0.

Posterior

P(\Psi|\D,\H)
&= P(\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D |
     \zz, \alpha\mm, \beta\nn, \D, \H)
   P(\zz | \alpha\mm, \beta\nn, \D, \H)
   P(\alpha\mm, \beta\nn | \D, \H) \\
&= P(\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D |
     \zz, \alpha\mm, \beta\nn, \D, \H)
   P(\zz, \alpha\mm, \beta\nn | \D, \H)

The collapsed Gibbs sampling update for a single topic assignment is

P(z_n=t | \zz_{\setminus n}, \D, \alpha\mm, \beta\nn, \H)
\propto \frac{N_{v|t}^{\setminus n} + \beta n_v}{N_t^{\setminus n} + \beta}
\left(N_{t|d}^{\setminus n} + \alpha m_t\right)

where the superscript \setminus n means token n is excluded from the counts, and \beta = \sum_v \beta n_v.
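
For illustration, a minimal sketch of this conditional for a single token. The array names and the helper itself are hypothetical; the counts are assumed to already exclude the token being resampled:

    import numpy as np

    def sample_topic(v, d, N_vt, N_t, N_td, alpha_m, beta_n, rng):
        """Sample a topic for word type v in document d.

        N_vt: (V, T) word-topic counts; N_t: (T,) topic totals;
        N_td: (D, T) document-topic counts -- all with the current token removed.
        alpha_m: (T,) vector alpha*m_t; beta_n: (V,) vector beta*n_v.
        """
        beta = beta_n.sum()  # beta = sum_v beta*n_v
        # Unnormalized conditional from the formula above.
        p = (N_vt[v] + beta_n[v]) / (N_t + beta) * (N_td[d] + alpha_m)
        return rng.choice(len(p), p=p / p.sum())

For example, rng = np.random.default_rng(0) gives a reproducible sampler.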

Hyperparameter optimization

Instead of sampling values from the posterior P(\alpha m_1, \dots, \alpha m_T, \beta n_1, \dots, \beta n_V |
\zz, \D, \H), we maximize the posterior probability itself.

It follows an EM-like framework (a code sketch is given below):

repeat
    sample \zz given the current \alpha\mm, \beta\nn
    optimize \alpha\mm, \beta\nn using the newly sampled \zz
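
A minimal sketch of this loop in Python. All helper functions here (initialize_topics, gibbs_sweep, optimize_alpha_m, optimize_beta_n) are hypothetical placeholders for the sampler above and the optimizers derived below:

    def fit(docs, alpha_m, beta_n, n_outer=50):
        # EM-like alternation between sampling and optimization.
        z = initialize_topics(docs)  # hypothetical: random initial assignments
        for _ in range(n_outer):
            # "E-like" step: resample topic assignments at fixed hyperparameters.
            z = gibbs_sweep(docs, z, alpha_m, beta_n)
            # "M-like" step: optimize the hyperparameters given the sampled z.
            alpha_m = optimize_alpha_m(z, alpha_m)
            beta_n = optimize_beta_n(docs, z, beta_n)
        return z, alpha_m, beta_n

Note this is not EM proper: the "E-like" step draws a single sample of \zz rather than computing an expectation.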

Notation:

\U = \{\alpha m_1, \dots, \alpha m_T, \beta n_1, \dots, \beta n_V\}

Objective:

maximize P(\U | \zz, \D) \propto P(\zz, \D | \U), where the proportionality follows from Bayes' rule and the effectively uniform prior on \U discussed above.

\U^*
&= \argmax_{\U} P(\zz, \D | \U) \\
&= \argmax_{\U} P(\D | \zz, \U)P(\zz | \U)

where \beta\nn appears only in P(\D | \zz, \U) and \alpha\mm appears only in P(\zz | \U). The two optimizations therefore decouple: we can optimize \alpha\mm and \beta\nn separately, iterating each until convergence.

Note that \log P(\zz | \alpha\mm) is concave in \alpha\mm, which means the iteration converges to the globally optimal value.

Fixed-point iteration

Minka’s fixed-point iteration [1] is a fast algorithm to optimize the hyperparameters of LDA.

P(\zz | \alpha\mm)
= \prod_d \frac{\Gamma(\alpha)}{\prod_t \Gamma(\alpha m_t)}
          \frac{\prod_t \Gamma(N_{t|d} + \alpha m_t)}{\Gamma(N_d + \alpha)}

\log P(\zz | \alpha\mm)
&= \sum_d \left[\log\Gamma(\alpha) - \sum_t\log\Gamma(\alpha m_t)
   + \sum_t\log\Gamma(N_{t|d} + \alpha m_t) - \log\Gamma(N_d + \alpha)
   \right] \\
&= \sum_d \left[\log\Gamma(\alpha) - \log\Gamma(N_d + \alpha)
   + \sum_t\left(\log\Gamma(N_{t|d} + \alpha m_t) - \log\Gamma(\alpha m_t)
   \right)\right]
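
For concreteness, this objective can be evaluated directly from the per-document topic counts. A sketch, where N_td is a hypothetical (D, T) matrix holding the counts N_{t|d}:

    import numpy as np
    from scipy.special import gammaln

    def log_evidence(N_td, alpha_m):
        """log P(z | alpha*m), the Dirichlet-multinomial (Polya) log-likelihood."""
        alpha = alpha_m.sum()            # alpha = sum_t alpha*m_t
        N_d = N_td.sum(axis=1)           # document lengths N_d
        per_doc = (gammaln(alpha) - gammaln(N_d + alpha)
                   + np.sum(gammaln(N_td + alpha_m) - gammaln(alpha_m), axis=1))
        return per_doc.sum()

This is useful for checking that each fixed-point update below does not decrease the objective.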

Bound 1

For any x, \hat{x}\in\R^{+} and a\in\Z^{+}

\log\Gamma(x) - \log\Gamma(x+a)
\ge \log\Gamma(\hat{x}) - \log\Gamma(\hat{x} + a)
+ \Big(\Psi(\hat{x} + a) - \Psi(\hat{x})\Big)(\hat{x} - x)

Bound 2

For any x, \hat{x}\in\R^{+} and a\in\Z^{+}

\log\Gamma(x+a) - \log\Gamma(x)
\ge \log\Gamma(\hat{x}+a) - \log\Gamma(\hat{x})
+ \hat{x}\Big(\Psi(\hat{x}+a)-\Psi(\hat{x})\Big)(\log x - \log\hat{x})

where \Psi(x) = \displaystyle\frac{d}{dx}\log\Gamma(x) is the digamma function. Both bounds are tight at x = \hat{x}: Bound 1 is the tangent-line lower bound of the convex function x \mapsto \log\Gamma(x) - \log\Gamma(x+a), and Bound 2 is the tangent-line lower bound, taken in \log x, of the concave function x \mapsto \log\Gamma(x+a) - \log\Gamma(x).
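
Both bounds are easy to sanity-check numerically; a small sketch (the values of x, \hat{x}, and a are arbitrary):

    import numpy as np
    from scipy.special import gammaln, digamma

    x, x_hat, a = 0.7, 2.3, 5  # test point, anchor point, integer offset

    # Bound 1: tangent lower bound of the convex function
    # f(x) = log Gamma(x) - log Gamma(x + a).
    lhs1 = gammaln(x) - gammaln(x + a)
    rhs1 = (gammaln(x_hat) - gammaln(x_hat + a)
            + (digamma(x_hat + a) - digamma(x_hat)) * (x_hat - x))
    assert lhs1 >= rhs1

    # Bound 2: tangent lower bound in log x of the concave function
    # g(x) = log Gamma(x + a) - log Gamma(x).
    lhs2 = gammaln(x + a) - gammaln(x)
    rhs2 = (gammaln(x_hat + a) - gammaln(x_hat)
            + x_hat * (digamma(x_hat + a) - digamma(x_hat))
              * (np.log(x) - np.log(x_hat)))
    assert lhs2 >= rhs2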

Fixed-point iteration for hyperparameters of LDA

Supposing \alpha\mm^{*} are the parameters being optimized, apply Bound 1 (with x = \alpha^{*}, \hat{x} = \alpha, a = N_d) and Bound 2 (with x = \alpha m_t^{*}, \hat{x} = \alpha m_t, a = N_{t|d}) to the log-likelihood above, where \alpha\mm is the current estimate. It follows that

\log P(\zz | \alpha\mm^{*})
&\ge B(\alpha\mm^{*}) \\
&= \sum_d\Bigg[\log\Gamma(\alpha)-\log\Gamma(N_d+\alpha)+
   \Big(\Psi(N_d+\alpha)-\Psi(\alpha)\Big)(\alpha-\alpha^{*}) \\
&\qquad
   + \sum_t \Big[\log\Gamma(N_{t|d}+\alpha m_t) - \log\Gamma(\alpha m_t) +
   \alpha m_t\Big(\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t)\Big)
   \Big(\log(\alpha m_t^{*}) - \log(\alpha m_t)\Big)\Big]\Bigg] \\
&= \sum_d \left[ \Big(\Psi(N_d+\alpha)-\Psi(\alpha)\Big)(-\alpha^{*})
   + \sum_t \alpha m_t \Big(\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t)
   \Big) \log \alpha m_t^{*} \right] + C

where \alpha^{*} = \sum_t \alpha m_t^{*} and C collects all terms that do not depend on \alpha\mm^{*}.

Taking the derivative of the bound with respect to \alpha m_t^{*} (note \alpha^{*} = \sum_t \alpha m_t^{*}, so \partial\alpha^{*}/\partial\alpha m_t^{*} = 1):

\frac{\partial B(\alpha\mm^{*})}{\partial \alpha m_t^{*}}
= \sum_d \left[
   \frac{\alpha m_t\left(\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t)\right)}
   {\alpha m_t^{*}} - \Big(\Psi(N_d + \alpha) - \Psi(\alpha)\Big)
   \right]

Setting the derivative to zero yields the fixed-point update

\alpha m_t^{*}
= \alpha m_t \frac{\sum_d \left[\Psi(N_{t|d}+\alpha m_t) - \Psi(\alpha m_t)\right]}
   {\sum_d \left[\Psi(N_d+\alpha) - \Psi(\alpha)\right]}
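
A direct implementation of this update as a sketch; N_td is a hypothetical (D, T) count matrix, and the convergence test is deliberately simple:

    import numpy as np
    from scipy.special import digamma

    def fixed_point_alpha_m(N_td, alpha_m, n_iter=200, tol=1e-8):
        """Iterate the fixed-point update for the vector alpha*m."""
        N_d = N_td.sum(axis=1)
        for _ in range(n_iter):
            alpha = alpha_m.sum()
            num = np.sum(digamma(N_td + alpha_m) - digamma(alpha_m), axis=0)  # per topic
            den = np.sum(digamma(N_d + alpha) - digamma(alpha))               # scalar
            new = alpha_m * num / den
            if np.max(np.abs(new - alpha_m)) < tol:
                return new
            alpha_m = new
        return alpha_m

Since P(\D | \zz, \beta\nn) has the same Polya form with topics in place of documents and word types in place of topics, the same function applied to the (T, V) matrix of counts N_{v|t} optimizes \beta\nn.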

How can \Psi(x) be handled using the recurrence relationship of the gamma function? Since \Gamma(x+1) = x\Gamma(x), taking logarithms and differentiating gives \Psi(x+1) = \Psi(x) + 1/x, and therefore

\Psi(x+n) - \Psi(x) = \sum_{k=0}^{n-1} \frac{1}{x+k}

for integer n \ge 0. Every digamma term in the update occurs in exactly such a difference, so the update can be computed with simple sums.
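
A digamma-free sketch of the resulting difference (exact for integer n):

    def psi_diff(x, n):
        """Psi(x + n) - Psi(x) for integer n >= 0, via Psi(x + 1) = Psi(x) + 1/x."""
        return sum(1.0 / (x + k) for k in range(n))

Substituting psi_diff for the digamma differences in the fixed-point update removes all special-function evaluations; implementations can additionally bucket the integer counts into histograms so the harmonic sums are shared across documents.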

Remarks

  • Run only a few fixed-point iterations per Gibbs sweep: the optimization serves as a cheap proxy for sampling the hyperparameters.
  • Use an asymmetric model: stop words will stand out, since optimized asymmetric priors tend to collect them into a few dedicated topics.
  • Workaround: fix \mm and \nn to the uniform distribution and still use the fixed-point iteration, now optimizing only the scalars \alpha and \beta.
