Bayesian Methods for Text

Hyperparameter inference: slice sampling

Hyperparameters of LDA

The hyperparameters of LDA are \alpha, \mm, \beta, and \nn.

In this lecture, we assume \mm is the uniform distribution over the T topics (m_t = 1/T) and \nn is the uniform distribution over the V word types (n_v = 1/V).

Also, \alpha > 0 and \beta > 0.

The full set of unknowns to be inferred is therefore

\Psi=\{\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D,
\zz, \alpha, \beta\}

In order to compute the posterior distribution, we have to first choose priors P(\alpha | \H) and P(\beta | \H) for hyperparameters \alpha, \beta.

Gamma distribution

P(x | s, c) = \frac{1}{\Gamma(c)s}\left(\frac{x}{s}\right)^{c-1}
\exp\left(-\frac{x}{s}\right)

  • Domain: (0, \infty)
  • Parameters:
    • s: scale parameter
    • c: shape parameter
  • Mean: sc
  • Variance: s^2 c
  • Keeping the mean sc = 1 fixed while letting s \to \infty (so c \to 0), the gamma distribution becomes very broad (its variance s^2 c = s grows without bound) and can be used as an approximately uninformative prior over the domain.
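
As a quick sanity check of the density and the moments above, one can compare against scipy.stats.gamma, whose shape parameter a is c and whose scale is s; the numbers below are arbitrary illustrative values, not anything prescribed by the lecture.

import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn

c, s = 2.0, 0.5                                       # shape and scale (arbitrary values)
x = 1.3

# Density from the formula above vs. scipy's implementation.
pdf_manual = 1.0 / (gamma_fn(c) * s) * (x / s) ** (c - 1) * np.exp(-x / s)
pdf_scipy = stats.gamma(a=c, scale=s).pdf(x)
print(pdf_manual, pdf_scipy)                          # identical

print(stats.gamma(a=c, scale=s).mean(), s * c)        # mean = s c
print(stats.gamma(a=c, scale=s).var(), s ** 2 * c)    # variance = s^2 c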

Hyperparameter inference

P(\Psi | \D, \H)
= P(\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D |
   \zz, \alpha, \beta, \D, \H)
   P(\zz, \alpha, \beta | \D)

How do we draw samples of \zz, \alpha, and \beta jointly?

P(\zz, \alpha, \beta | \D)
= \frac{P(\D, \zz, \alpha, \beta)}{P(\D)}

where the denominator P(\D) is a normalization constant. The numerator can be factorized as

P(\D, \zz, \alpha, \beta)
=  P(\D | \zz, \beta) P(\zz | \alpha)
   P(\alpha)P(\beta)

We have learned how to perform Gibbs sampling on \zz:

P(z_n=t | \zz_{\setminus n}, \D, \alpha, \beta)
\propto
\frac{N_{v|t}^{\setminus n} + \beta n_v}{N_t^{\setminus n} + \beta}
(N_{t|d}^{\setminus n} + \alpha m_t)

where the shorthand notation v=w_n and d=d_n is used.
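
As a reminder of how this is used, one such draw can be sketched in a few lines of Python; the array names Nvt (V x T topic-word counts), Nt (topic totals), and Ntd (T x D document-topic counts) are illustrative assumptions, not notation from the lecture.

import numpy as np

def sample_topic(v, d, t_old, Nvt, Nt, Ntd, alpha, m, beta, n, rng):
    """One collapsed-Gibbs draw of z_n for a token of word type v in document d."""
    # Remove the token's current assignment to obtain the "minus n" counts.
    Nvt[v, t_old] -= 1; Nt[t_old] -= 1; Ntd[t_old, d] -= 1

    # Unnormalized full conditional over topics, as in the formula above.
    p = (Nvt[v] + beta * n[v]) / (Nt + beta) * (Ntd[:, d] + alpha * m)
    t_new = rng.choice(len(p), p=p / p.sum())

    # Put the token back with its new assignment.
    Nvt[v, t_new] += 1; Nt[t_new] += 1; Ntd[t_new, d] += 1
    return t_new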

Blocked Gibbs sampling

Repeat the following steps

  1. Sample z_1, \dots, z_N by Gibbs sampling, usually for several rounds.
  2. Sample \alpha, \beta from P(\alpha, \beta | \zz, \D)
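
A minimal sketch of this outer loop follows; every name here is a placeholder (sample_topic is the per-token draw sketched above, and update_alpha / update_beta stand for the slice-sampling moves developed in the rest of this lecture).

rng = np.random.default_rng(0)
for it in range(num_iterations):
    # Step 1: several rounds of collapsed Gibbs over all N tokens.
    for _ in range(num_gibbs_rounds):
        for i in range(N):
            z[i] = sample_topic(w[i], doc[i], z[i], Nvt, Nt, Ntd, alpha, m, beta, nv, rng)
    # Step 2: resample the hyperparameters given the current assignments z
    # (placeholder functions; see the slice-sampling sections below).
    alpha = update_alpha(alpha, Ntd, Nd, m)
    beta = update_beta(beta, Nvt, Nt, nv)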

To sample \alpha we have to be able to compute the following distribution

P(\alpha | \D, \zz, \beta)
= \frac{P(\D, \zz, \alpha, \beta)}{P(\D, \zz, \beta)}

Notice that this is a continuous distribution, unlike that of \zz.

  1. Can we compute the denominator?
  2. Do we need to compute the denominator?

Slice sampling

Slice sampling [1] is applicable when we want to draw a sample x from P(x) but we can only compute the unnormalized distribution P^*(x).

The idea of slice sampling is to sample uniformly under the curve.

Slice sampling

start at some point x
evaluate P^*(x)
draw u'\sim U(0, P^*(x))
create a slice (l, r) that contains x        (stepping out)
while true
    draw x'\sim U(l, r)
    evaluate P^*(x')
    if P^*(x') > u'
        break
    else
        modify interval (l, r)               (shrinkage)

Stepping out

draw a\sim U(0, 1)
set l = x - aw
set r = l + w
while P^*(l) > u'
    set l = l - w
while P^*(r) > u'
    set r = r + w
  • Evaluating P^*(x) is expensive, so the two expansion loops are usually skipped; instead, w is made large enough from the start based on prior knowledge.
  • In practice, only a limited number of stepping-out iterations is allowed, since otherwise the interval might keep expanding in some rare cases.

Shrinkage

if x' < x
    l = x'
else
    r = x'
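
Putting the pieces together, here is a minimal runnable sketch of one slice-sampling update. It works with log P^*(x) rather than P^*(x) for numerical stability (drawing u' ~ U(0, P^*(x)) is equivalent to thresholding at log P^*(x) + log U(0, 1)), and it caps the number of stepping-out iterations as recommended above; the function and argument names are my own.

import math
import numpy as np

def slice_sample(x, log_p, w=1.0, max_steps=32, rng=None):
    """One slice-sampling update of a scalar x, given the log of P^*(x)."""
    rng = np.random.default_rng() if rng is None else rng
    log_u = log_p(x) + math.log(rng.uniform())    # log u', with u' ~ U(0, P^*(x))

    # Stepping out: randomly place an interval of width w around x, then expand
    # each end (at most max_steps times) while it still lies inside the slice.
    a = rng.uniform()
    l, r = x - a * w, x - a * w + w
    for _ in range(max_steps):
        if log_p(l) <= log_u:
            break
        l -= w
    for _ in range(max_steps):
        if log_p(r) <= log_u:
            break
        r += w

    # Shrinkage: propose uniformly from (l, r); on rejection, shrink the
    # interval towards the current point and try again.
    while True:
        x_new = rng.uniform(l, r)
        if log_p(x_new) > log_u:
            return x_new
        if x_new < x:
            l = x_new
        else:
            r = x_new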

Hyperparameter inference

P^*(\alpha | \D, \zz, \beta)
&= P(\D | \zz, \beta) P(\zz | \alpha) P(\alpha) P(\beta) \\
&= P(\zz | \alpha) P(\alpha) \\
&= P(\zz | \alpha) \\
&= \prod_{d=1}^D \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)}
   \frac{\prod_{t=1}^T\Gamma(N_{t|d}+\alpha m_t)}{\Gamma(N_d+\alpha)}

  • The equality here refers to P^*, so it holds only up to a constant factor.
  • The second line holds because \alpha does not appear in P(\D | \zz, \beta) or P(\beta), so these factors are constant with respect to \alpha.
  • The third line holds because P(\alpha) is assumed uninformative (like P(\beta)) and is therefore treated as a constant.
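
For the slice sampler all we need is the ability to evaluate this quantity; in practice we evaluate its logarithm to avoid overflow in the Gamma functions. A minimal sketch, assuming the document-topic counts are stored in an array Ntd of shape (T, D) with document lengths Nd (these names are assumptions, not notation from the lecture):

import numpy as np
from scipy.special import gammaln

def log_p_alpha(alpha, Ntd, Nd, m):
    """log P^*(alpha | z), up to an additive constant.

    Ntd[t, d] = N_{t|d}, Nd[d] = N_d, m = base measure over topics (length T).
    """
    D = Ntd.shape[1]
    am = alpha * m                                  # alpha * m_t for every topic t
    return (D * gammaln(alpha)
            - D * gammaln(am).sum()
            + gammaln(Ntd + am[:, None]).sum()
            - gammaln(Nd + alpha).sum())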

Random variable transformation

Consider the transformation function

Z = g(X)

When g is strictly monotone, the density of Z is

f_Z(z)
= f_X(g^{-1}(z))\left|\frac{d}{dz}g^{-1}(z)\right|
= f_X(x)\left|\frac{dx}{dz}\right|
= f_X(x) \frac{1}{|J|}

where x = g^{-1}(z) and J = \displaystyle\frac{dz}{dx} is the Jacobian of the transformation.

Change of variables

There is no easy way to draw samples in (0, \infty), so we instead consider the monotone mapping x=\log\alpha, which maps \alpha \in (0, \infty) onto the whole real line, and draw samples from the equivalent distribution in terms of x given by

P^*(x | \D, \zz, \beta) = P^*(\alpha | \D, \zz, \beta)
\frac{d\alpha}{dx}

where the Jacobian is \displaystyle\frac{d\alpha}{dx}
= \frac{1}{\left(\displaystyle\frac{dx}{d\alpha}\right)}
= \frac{1}{\left(\displaystyle\frac{d\log\alpha}{d\alpha}\right)}
= \alpha

Therefore the transformed density P^*(x | \D, \zz, \beta) can be written as

P^*(x | \D, \zz, \beta)
= P^*(\alpha | \D, \zz, \beta) \alpha
= \prod_{d=1}^D \frac{\Gamma(e^x)}{\prod_{t=1}^T\Gamma(e^x m_t)}
   \frac{\prod_{t=1}^T\Gamma(N_{t|d}+e^x m_t)}{\Gamma(N_d+e^x)}
   e^x
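
In code, the change of variables only adds the log-Jacobian term x to the log-density, and one update of \alpha becomes a single call to the slice sampler sketched earlier (reusing log_p_alpha and slice_sample from above):

def log_p_x_alpha(x, Ntd, Nd, m):
    # log P^*(x | ...) for x = log(alpha): the original log-density plus log(d alpha / dx) = x.
    return log_p_alpha(np.exp(x), Ntd, Nd, m) + x

# One update of alpha inside the blocked Gibbs loop:
# x_new = slice_sample(np.log(alpha), lambda x: log_p_x_alpha(x, Ntd, Nd, m))
# alpha = np.exp(x_new)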

Similarly,

P^*(\beta | \D, \zz, \alpha)
&= P(\D | \zz, \beta) P(\beta) \\
&= P(\D | \zz, \beta) \\
&= \prod_{t=1}^T\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)}

By the change of variables x=\log\beta,

P^*(x | \D, \zz, \alpha)
= P^*(\beta | \D, \zz, \alpha) \beta
= \prod_{t=1}^T\frac{\Gamma(e^x)}{\prod_{v=1}^V\Gamma(e^x n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+e^x n_v)}{\Gamma(N_t+e^x)}
   e^x
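
The same recipe gives the \beta update; a sketch assuming topic-word counts Nvt of shape (V, T) and topic totals Nt (again, assumed names):

import numpy as np
from scipy.special import gammaln

def log_p_x_beta(x, Nvt, Nt, n):
    """log P^*(x | z) for x = log(beta), including the log-Jacobian term x."""
    beta = np.exp(x)
    bn = beta * n                                   # beta * n_v for every word type v
    T = Nvt.shape[1]
    return (T * gammaln(beta)
            - T * gammaln(bn).sum()
            + gammaln(Nvt + bn[:, None]).sum()
            - gammaln(Nt + beta).sum()
            + x)

# x_new = slice_sample(np.log(beta), lambda x: log_p_x_beta(x, Nvt, Nt, n))
# beta = np.exp(x_new)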

Multivariate slice sampling

Instead of sampling each variable in turn, conditioned on the others, multivariate slice sampling draws several variables in one go.

Multivariate slice sampling

start at some point \xx
evaluate P^*(\xx)
draw u' \sim U(0, P^*(\xx))
for each dimension k=1,\dots,n
    draw a \sim U(0, 1)
    set l_k = x_k - a w_k
    set r_k = l_k + w_k
while true
    for each dimension k=1,\dots,n
        draw x_k' \sim U(l_k, r_k)
    evaluate P^*(\xx')
    if P^*(\xx') > u'
        break
    else
        for each dimension k=1,\dots,n
            modify interval (l_k, r_k)
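
A minimal Python sketch of this procedure, again working with log P^* and (as in the univariate notes) with no stepping out, so the per-dimension widths w are assumed to be chosen large enough; the names are my own.

import numpy as np

def slice_sample_multi(x, log_p, w, rng=None):
    """One multivariate slice-sampling update with per-dimension shrinkage.

    x: current point, shape (n,); log_p: log of P^*(x); w: per-dimension widths.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    log_u = log_p(x) + np.log(rng.uniform())

    # Initial box around x: one randomly placed interval per dimension.
    a = rng.uniform(size=x.shape)
    l, r = x - a * w, x - a * w + w

    while True:
        x_new = rng.uniform(l, r)                   # one uniform draw per dimension
        if log_p(x_new) > log_u:
            return x_new
        # Shrink each dimension of the box towards the current point.
        l = np.where(x_new < x, x_new, l)
        r = np.where(x_new < x, r, x_new)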

Back to hyperparameter inference

P^*(\alpha, \beta | \D, \zz)
&= P(\D | \zz, \beta) P(\zz | \alpha) P(\alpha) P(\beta) \\
&= P(\D, \zz | \alpha, \beta)

which is the evidence with known topics, see Evidence.

To draw from P(\alpha, \beta | \D, \zz) using multivariate slice sampling, let \xx = (\log\alpha, \log\beta) = (x_1, x_2); the Jacobian is then given by

J(x_1, x_2)
=
\begin{vmatrix}
\displaystyle\frac{\partial\alpha}{\partial x_1} &
\displaystyle\frac{\partial\beta}{\partial x_1} \\[10pt]
\displaystyle\frac{\partial\alpha}{\partial x_2} &
\displaystyle\frac{\partial\beta}{\partial x_2}
\end{vmatrix}
=
\begin{vmatrix}
\alpha & 0 \\
0 & \beta
\end{vmatrix}
= \alpha\beta = e^{x_1} e^{x_2}

So

P^*(x_1, x_2 | \D, \zz)
&= P^*(\alpha, \beta | \D, \zz) \alpha \beta \\
&= \prod_{d=1}^D \frac{\Gamma(e^{x_1})}{\prod_{t=1}^T\Gamma(e^{x_1} m_t)}
   \frac{\prod_{t=1}^T\Gamma(N_{t|d}+e^{x_1} m_t)}{\Gamma(N_d+e^{x_1})}
   \prod_{t=1}^T\frac{\Gamma(e^{x_2})}{\prod_{v=1}^V\Gamma(e^{x_2} n_v)}
   \frac{\prod_{v=1}^V\Gamma(N_{v|t}+e^{x_2} n_v)}{\Gamma(N_t+e^{x_2})}
   e^{x_1}
   e^{x_2}
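
Putting it together, the joint update of (\alpha, \beta) reuses the two per-hyperparameter log-densities defined above (each of which already carries its own log-Jacobian term) together with the multivariate slice sampler:

def log_p_x1_x2(x, Ntd, Nd, Nvt, Nt, m, n):
    # log P^*(x_1, x_2 | D, z): alpha part (with Jacobian x_1) plus beta part (with Jacobian x_2).
    x1, x2 = x
    return log_p_x_alpha(x1, Ntd, Nd, m) + log_p_x_beta(x2, Nvt, Nt, n)

# One joint hyperparameter update inside the blocked Gibbs loop:
# x_new = slice_sample_multi(np.log([alpha, beta]),
#                            lambda x: log_p_x1_x2(x, Ntd, Nd, Nvt, Nt, m, n),
#                            w=[2.0, 2.0])
# alpha, beta = np.exp(x_new)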
