Hyperparameter inference: slice sampling

Hyperparameters of LDA

The hyperparameters of LDA are $\alpha$ , $\mm$ , $\beta$ , and $\nn$ .

In this lecture, we assume $\mm$ is a $T$ -dimensional uniform distribution, and $\nn$ is a $V$ -dimensional uniform distribution.

Also, $\alpha > 0$ and $\beta > 0$ .

$\Psi=\{\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D, \zz, \alpha, \beta\}$

In order to compute the posterior distribution, we have to first choose priors $P(\alpha | \H)$ and $P(\beta | \H)$ for hyperparameters $\alpha, \beta$ .

Gamma distribution

$P(x | s, c) = \frac{1}{\Gamma(c)s}\left(\frac{x}{s}\right)^{c-1} \exp\left(-\frac{x}{s}\right)$

Domain: $(0, \infty)$
Parameters:
- $s$ : scale parameter
- $c$ : shape parameter
Mean: $sc$
Variance: $s^2 c$
In the limit $sc = 1$, $c\to 0$, the variance $s^2 c = 1/c$ grows without bound, giving a very broad gamma distribution. It can be used as an uninformative prior, which these notes treat as approximately uniform over the domain.
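As a quick numerical check of the mean and variance stated above, the following minimal sketch evaluates the density in log space and integrates it with a midpoint rule. The function names, the integration range, and the values $s=2$, $c=3$ are illustrative assumptions, not part of the lecture.

```python
import math

def gamma_pdf(x, s, c):
    """Gamma density with scale s and shape c, following the formula above."""
    if x <= 0:
        return 0.0
    # Work in log space so that a large shape c does not overflow Gamma(c).
    log_p = -math.lgamma(c) - math.log(s) + (c - 1) * math.log(x / s) - x / s
    return math.exp(log_p)

def moments(s, c, upper=200.0, n=200_000):
    """Midpoint-rule estimates of the mass, mean and variance on (0, upper)."""
    h = upper / n
    mass = mean = second = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        p = gamma_pdf(x, s, c) * h
        mass += p
        mean += x * p
        second += x * x * p
    return mass, mean, second - mean ** 2

# Illustrative values (assumption): s = 2, c = 3.
mass, mean, var = moments(s=2.0, c=3.0)
print(mass, mean, var)  # close to 1, s*c = 6 and s^2*c = 12
```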

Hyperparameter inference

$P(\Psi | \D, \H) = P(\pphi_1, \dots, \pphi_T, \ttheta_1, \dots, \ttheta_D | \zz, \alpha, \beta, \D, \H) P(\zz, \alpha, \beta | \D)$

How to draw samples of $\zz$ , $\alpha$ , and $\beta$ as a whole?

$P(\zz, \alpha, \beta | \D) = \frac{P(\D, \zz, \alpha, \beta)}{P(\D)}$

where the denominator $P(\D)$ is a normalization constant. The numerator can be factorized as

$P(\D, \zz, \alpha, \beta) = P(\D | \zz, \beta) P(\zz | \alpha) P(\alpha)P(\beta)$

We have learned how to perform Gibbs sampling on $\zz$ :

$P(z_n=t | \zz_{\setminus n}, \D, \alpha, \beta) \propto \frac{N_{v|t}^{\setminus n} + \beta n_v}{N_t^{\setminus n} + \beta} (N_{t|d}^{\setminus n} + \alpha m_t)$

where the shorthand notation $v=w_n$ and $d=d_n$ is used.
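The conditional above can be evaluated for a single token as in the sketch below. The argument names and the toy counts are hypothetical; the counts are assumed to already exclude token $n$.

```python
import random

def topic_conditional(N_vt, N_t, N_td, alpha, beta, m, n_v):
    """Normalised P(z_n = t | ...) for one token, with token n excluded from counts.

    N_vt[t]: count of the token's word type v in topic t
    N_t[t]:  total number of tokens assigned to topic t
    N_td[t]: number of tokens in the token's document d assigned to topic t
    m[t], n_v: base-measure entries m_t and n_v
    """
    weights = [
        (N_vt[t] + beta * n_v) / (N_t[t] + beta) * (N_td[t] + alpha * m[t])
        for t in range(len(N_t))
    ]
    total = sum(weights)
    return [w / total for w in weights]

# Toy counts (assumption) for T = 3 topics and V = 4 word types.
T, V = 3, 4
probs = topic_conditional(
    N_vt=[2, 0, 1], N_t=[10, 5, 7], N_td=[3, 1, 0],
    alpha=1.5, beta=0.8, m=[1 / T] * T, n_v=1 / V,
)
new_topic = random.choices(range(T), weights=probs)[0]
print(probs, new_topic)
```

Normalising the weights is only needed to draw the new topic; the proportionality in the formula is otherwise sufficient.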

Blocked Gibbs sampling

Repeat the following steps

Sample $z_1, \dots, z_N$ by Gibbs sampling, usually for several rounds.
Sample $\alpha, \beta$ from $P(\alpha, \beta | \zz, \D)$

To sample $\alpha$ we have to be able to compute the following distribution

$P(\alpha | \D, \zz, \beta) = \frac{P(\D, \zz, \alpha, \beta)}{P(\D, \zz, \beta)}$

Notice that this is a continuous distribution, unlike that of $\zz$ .

Can we compute the denominator?
Do we need to compute the denominator?

Slice sampling

Slice sampling [1] is applicable when we want to draw a sample $x$ from $P(x)$ but we can only compute the unnormalized distribution $P^*(x)$ .

The idea of slice sampling is to sample uniformly under the curve.

Slice sampling

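start at some point $x$
evaluate $P^*(x)$
draw $u' \sim U(0, P^*(x))$
create a slice $(l, r)$ that contains $x$ (stepping out)
while true
  draw $x' \sim U(l, r)$
  evaluate $P^*(x')$
  if $P^*(x') > u'$
    break
  else
    modify interval $(l, r)$ (shrinkage)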

Stepping out

draw $a\sim U(0, 1)$

set $l = x - aw$

set $r = l + w$

while $P^*(l) > u'$

set $l = l - w$

while $P^*(r) > u'$

set $r = r + w$

Evaluating $P^*(x)$ is expensive, so the last two loops are usually skipped and $w$ is instead chosen large enough at the outset, based on prior knowledge.
In practice, only a limited number of stepping-out steps is allowed; otherwise the interval might keep expanding in some rare cases.

Shrinkage

if $x' < x$

$l = x'$

else

$r = x'$
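The main loop, stepping out, and shrinkage described above can be combined into a single update, as in the sketch below. The function name, the default width `w`, and the way the cap `max_steps` is split at random between the two sides (as in the procedure of [1]) are assumptions of this sketch rather than details given in the lecture. The test target is an unnormalised standard normal.

```python
import math
import random

def slice_sample(p_star, x, w=1.0, max_steps=50, rng=random):
    """One slice-sampling update of x under the unnormalised density p_star."""
    u = rng.uniform(0.0, p_star(x))        # height u' under the curve at x

    # Stepping out: place an interval of width w at random around x ...
    l = x - rng.random() * w
    r = l + w
    # ... then widen it, sharing a cap of max_steps - 1 expansions between sides.
    j = int(rng.random() * max_steps)
    k = max_steps - 1 - j
    while j > 0 and p_star(l) > u:
        l -= w
        j -= 1
    while k > 0 and p_star(r) > u:
        r += w
        k -= 1

    while True:
        x_new = rng.uniform(l, r)
        if p_star(x_new) > u:
            return x_new
        # Shrinkage: pull in the end of the interval on x_new's side.
        if x_new < x:
            l = x_new
        else:
            r = x_new

rng = random.Random(0)
p_star = lambda x: math.exp(-0.5 * x * x)  # unnormalised standard normal
x, samples = 0.0, []
for _ in range(20_000):
    x = slice_sample(p_star, x, w=2.0, rng=rng)
    samples.append(x)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, var)  # should be near 0 and 1
```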

Hyperparameter inference

$\begin{aligned} P^*(\alpha | \D, \zz, \beta) &= P(\D | \zz, \beta) P(\zz | \alpha) P(\alpha) P(\beta) \\ &= P(\zz | \alpha) P(\alpha) \\ &= P(\zz | \alpha) \\ &= \prod_{d=1}^D \frac{\Gamma(\alpha)}{\prod_{t=1}^T\Gamma(\alpha m_t)} \frac{\prod_{t=1}^T\Gamma(N_{t|d}+\alpha m_t)}{\Gamma(N_d+\alpha)} \end{aligned}$

The equality sign here is applied to $P^*$, so it holds up to a constant factor.
The second line holds because $\alpha$ is not involved in $P(\D | \zz, \beta)$ and $P(\beta)$.
The third line holds because the prior $P(\alpha)$ is assumed to be uninformative and is therefore treated as a constant.
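The product of gamma functions above overflows quickly, so in code it is usually evaluated as a sum of log-gamma terms. The sketch below does this with `math.lgamma`; the function name and the layout of the count table `N_td` are assumptions made for illustration.

```python
import math

def log_p_alpha(alpha, N_td, m):
    """log P*(alpha | D, z, beta), up to an additive constant.

    N_td[d][t] is the number of tokens in document d assigned to topic t;
    m[t] is the base measure (uniform in these notes).
    """
    total = 0.0
    for counts in N_td:
        N_d = sum(counts)
        total += math.lgamma(alpha) - math.lgamma(N_d + alpha)
        for N, m_t in zip(counts, m):
            total += math.lgamma(N + alpha * m_t) - math.lgamma(alpha * m_t)
    return total

# Toy counts (assumption): D = 3 documents, T = 3 topics.
N_td = [[5, 0, 1], [0, 4, 0], [2, 2, 2]]
m = [1 / 3] * 3
for alpha in (0.1, 1.0, 10.0):
    print(alpha, log_p_alpha(alpha, N_td, m))
```

A useful sanity check: for a single document containing a single token, the expression reduces to $m_t$ whatever the value of $\alpha$.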

Random variable transformation

Consider the transformation function

$Z = g(X)$

When $g$ is strictly monotone

$f_Z(z) = f_X(g^{-1}(z))\left|\frac{d}{dz}g^{-1}(z)\right| = f_X(x)\left|\frac{dx}{dz}\right| = f_X(x) \frac{1}{|J|}$

where $J = \frac{dz}{dx} = g'(x)$ is the Jacobian of the transformation.
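To see why the derivative factor matters, consider the illustrative case (an assumption, not from the lecture) $X \sim \text{Exponential}(1)$ and $Z = \log X$, so that $g^{-1}(z) = e^z$. The sketch below integrates the transformed density numerically with and without the factor $|dx/dz| = e^z$.

```python
import math

def f_X(x):
    """Density of X ~ Exponential(1) on (0, inf)."""
    return math.exp(-x)

def f_Z(z):
    """Density of Z = log X: f_X(g^{-1}(z)) |d g^{-1}(z)/dz| with g^{-1}(z) = e^z."""
    return f_X(math.exp(z)) * math.exp(z)

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule integral of f over (lo, hi)."""
    h = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * h) for i in range(n)) * h

with_jacobian = integrate(f_Z, -30.0, 5.0)
without_jacobian = integrate(lambda z: f_X(math.exp(z)), -30.0, 5.0)
print(with_jacobian)     # close to 1: a proper density in z
print(without_jacobian)  # far from 1: dropping the factor breaks normalisation
```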

Change of variables

It is awkward to draw samples directly on the constrained domain $(0, \infty)$, so we instead consider the monotone mapping $x=\log\alpha$, which ranges over $(-\infty, \infty)$, and draw samples from the equivalent distribution in terms of $x$, given by

$P^*(x | \D, \zz, \beta) = P^*(\alpha | \D, \zz, \beta) \frac{d\alpha}{dx}$

where the Jacobian is $\displaystyle\frac{d\alpha}{dx} = \frac{1}{\left(\displaystyle\frac{dx}{d\alpha}\right)} = \frac{1}{\left(\displaystyle\frac{d\log\alpha}{d\alpha}\right)} = \alpha$

Therefore $P^*(x | \D, \zz, \beta)$ can be written as

$P^*(x | \D, \zz, \beta) = P^*(\alpha | \D, \zz, \beta) \alpha = \prod_{d=1}^D \frac{\Gamma(e^x)}{\prod_{t=1}^T\Gamma(e^x m_t)} \frac{\prod_{t=1}^T\Gamma(N_{t|d}+e^x m_t)}{\Gamma(N_d+e^x)} e^x$
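In log space the Jacobian factor $e^x$ becomes an additive term $x$, and the slice height $u'$ can also be drawn without leaving log space. The sketch below shows both steps; the helper names and the stand-in target $P^*(\alpha) = \alpha e^{-2\alpha}$ are hypothetical and stand in for the LDA expression above.

```python
import math
import random

def log_p_x(log_p_alpha, x):
    """log P*(x) for x = log(alpha): log P*(alpha) plus the log-Jacobian x."""
    return log_p_alpha(math.exp(x)) + x

def log_slice_height(log_p, x, rng=random):
    """log u' for u' ~ U(0, P*(x)), computed without leaving log space."""
    # 1 - rng.random() lies in (0, 1], so the logarithm is always defined.
    return log_p(x) + math.log(1.0 - rng.random())

# Hypothetical stand-in target: P*(alpha) = alpha * exp(-2 * alpha).
toy_log_p_alpha = lambda a: math.log(a) - 2.0 * a
target = lambda x: log_p_x(toy_log_p_alpha, x)

x = 0.0                                           # corresponds to alpha = 1
print(target(x))                                  # log(1) - 2 + 0 = -2.0
print(log_slice_height(target, x) <= target(x))   # True
```

A candidate $x'$ is then accepted when $\log P^*(x') > \log u'$, which is the same test as $P^*(x') > u'$.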

Similarly,

$\begin{aligned} P^*(\beta | \D, \zz, \alpha) &= P(\D | \zz, \beta) P(\beta) \\ &= P(\D | \zz, \beta) \\ &= \prod_{t=1}^T\frac{\Gamma(\beta)}{\prod_{v=1}^V\Gamma(\beta n_v)} \frac{\prod_{v=1}^V\Gamma(N_{v|t}+\beta n_v)}{\Gamma(N_t+\beta)} \end{aligned}$

By change of variable $x=\log\beta$

$P^*(x | \D, \zz, \alpha) = P^*(\beta | \D, \zz, \alpha) \beta = \prod_{t=1}^T\frac{\Gamma(e^x)}{\prod_{v=1}^V\Gamma(e^x n_v)} \frac{\prod_{v=1}^V\Gamma(N_{v|t}+e^x n_v)}{\Gamma(N_t+e^x)} e^x$

Multivariate slice sampling

Instead of sampling each variable in turn, conditioned on the others, multivariate slice sampling draws several variables in one go.

Multivariate slice sampling

evaluate $P^*(\xx)$

draw $u' \sim U(0, P^*(\xx))$

for each dimension $k=1,\dots,n$

draw $a_k \sim U(0, 1)$
set $l_k = x_k - a_k w_k$
set $r_k = l_k + w_k$

while true

for each dimension $k=1,\dots,n$

draw $x_k' \sim U(l_k, r_k)$

evaluate $P^*(\xx')$

if $P^*(\xx') > u'$

break

else

for each dimension $k=1,\dots,n$

modify interval $(l_k, r_k)$
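The multivariate procedure above can be sketched as follows. This version takes $\log P^*$ rather than $P^*$ and compares logarithms, which is an equivalent test that avoids overflow; that choice, the function name, and the per-dimension widths `w` are assumptions of the sketch. The test target is a pair of independent normals with standard deviations 1 and 2.

```python
import math
import random

def slice_sample_multi(log_p_star, x, w, rng=random):
    """One hyperrectangle slice-sampling update of the vector x.

    log_p_star returns log P*(x); w[k] is the initial width in dimension k.
    """
    log_u = log_p_star(x) + math.log(1.0 - rng.random())   # log of u'

    # Position a box of widths w at random around x (no stepping out).
    l = [x_k - rng.random() * w_k for x_k, w_k in zip(x, w)]
    r = [l_k + w_k for l_k, w_k in zip(l, w)]

    while True:
        x_new = [rng.uniform(l_k, r_k) for l_k, r_k in zip(l, r)]
        if log_p_star(x_new) > log_u:
            return x_new
        # Shrink every dimension towards the current point x.
        for k in range(len(x)):
            if x_new[k] < x[k]:
                l[k] = x_new[k]
            else:
                r[k] = x_new[k]

rng = random.Random(1)
log_p = lambda x: -0.5 * (x[0] ** 2 + x[1] ** 2 / 4.0)
x, xs = [0.0, 0.0], []
for _ in range(20_000):
    x = slice_sample_multi(log_p, x, w=[3.0, 6.0], rng=rng)
    xs.append(x)
means = [sum(p[k] for p in xs) / len(xs) for k in range(2)]
variances = [sum((p[k] - means[k]) ** 2 for p in xs) / len(xs) for k in range(2)]
print(means, variances)  # means near 0; variances near 1 and 4
```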

Back to hyperparameter inference

$\begin{aligned} P^*(\alpha, \beta | \D, \zz) &= P(\D | \zz, \beta) P(\zz | \alpha) P(\alpha) P(\beta) \\ &= P(\D, \zz | \alpha, \beta) \end{aligned}$

which is the evidence with known topics, see Evidence.

To draw from $P(\alpha, \beta | \D, \zz)$ using multivariate slice sampling, let $\xx = (\log\alpha, \log\beta) = (x_1, x_2)$. The Jacobian is given by

$J(x_1, x_2) = \begin{vmatrix} \displaystyle\frac{\partial\alpha}{\partial x_1} & \displaystyle\frac{\partial\beta}{\partial x_1} \\[10pt] \displaystyle\frac{\partial\alpha}{\partial x_2} & \displaystyle\frac{\partial\beta}{\partial x_2} \end{vmatrix} = \begin{vmatrix} \alpha & 0 \\ 0 & \beta \end{vmatrix} = \alpha\beta = e^{x_1} e^{x_2}$

$\begin{aligned} P^*(x_1, x_2 | \D, \zz) &= P^*(\alpha, \beta | \D, \zz) \alpha \beta \\ &= \prod_{d=1}^D \frac{\Gamma(e^{x_1})}{\prod_{t=1}^T\Gamma(e^{x_1} m_t)} \frac{\prod_{t=1}^T\Gamma(N_{t|d}+e^{x_1} m_t)}{\Gamma(N_d+e^{x_1})} \prod_{t=1}^T\frac{\Gamma(e^{x_2})}{\prod_{v=1}^V\Gamma(e^{x_2} n_v)} \frac{\prod_{v=1}^V\Gamma(N_{v|t}+e^{x_2} n_v)}{\Gamma(N_t+e^{x_2})} e^{x_1} e^{x_2} \end{aligned}$
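The joint target can be evaluated in log space as in the sketch below, which could be passed to a multivariate slice sampler. The function names and count-table layouts are assumptions. The base measures are taken to be uniform, as in these notes, and the per-topic total $N_t$ (the row sum of the word-topic counts) is used in the word-topic factor, matching the Gibbs update.

```python
import math

def log_dirmult(counts_rows, conc, base):
    """Sum over rows of the log Dirichlet-multinomial factor with concentration conc."""
    total = 0.0
    for counts in counts_rows:
        total += math.lgamma(conc) - math.lgamma(sum(counts) + conc)
        for N, b in zip(counts, base):
            total += math.lgamma(N + conc * b) - math.lgamma(conc * b)
    return total

def log_p_joint(x, N_td, N_vt):
    """log P*(x_1, x_2 | D, z) for x = (log alpha, log beta), uniform m and n.

    N_td[d][t]: number of tokens in document d assigned to topic t.
    N_vt[t][v]: number of tokens of word type v assigned to topic t.
    """
    x1, x2 = x
    T, V = len(N_td[0]), len(N_vt[0])
    doc_part = log_dirmult(N_td, math.exp(x1), [1.0 / T] * T)
    topic_part = log_dirmult(N_vt, math.exp(x2), [1.0 / V] * V)
    return doc_part + topic_part + x1 + x2   # Jacobian alpha * beta in log form

# Toy counts (assumption): D = 2 documents, T = 2 topics, V = 3 word types.
N_td = [[2, 1], [0, 3]]
N_vt = [[1, 1, 0], [2, 0, 2]]
print(log_p_joint((0.0, 0.0), N_td, N_vt))
```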

References

[1] http://www.cs.toronto.edu/~radford/ftp/slc-samp.pdf

Bayesian Methods for Text
