Dirichlet–multinomial unigram language model
Tasks
- Specify model
- Explanation
- Exploration
- Prediction
Dirichlet–multinomial unigram language model
The Dirichlet–multinomial unigram model generalizes the beta-binomial unigram model to an arbitrary finite number of token types.
Data
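- $N$: number of tokens
- $V$: number of unique word types
- Represent each token by its index into the vocabulary, that is, $w_n \in \{1, \dots, V\}$ for $n = 1, \dots, N$.
Each token follows a categorical distribution
$$P(w_n = v \mid \boldsymbol{\theta}) = \theta_v$$
parameterized by $\boldsymbol{\theta} = (\theta_1, \dots, \theta_V)$ that satisfies
$$\theta_v \ge 0, \qquad \sum_{v=1}^{V} \theta_v = 1.$$
The sum-to-one constraint implies that the categorical distribution can be
parameterized by only $V - 1$ parameters, for instance,
$\theta_1, \dots, \theta_{V-1}$, with $\theta_V = 1 - \sum_{v=1}^{V-1} \theta_v$.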
With previous notation,
.
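The index representation described above can be sketched in a few lines of Python. The helper name `encode`, the toy sentence, and the choice of a sorted vocabulary with 1-based indices are assumptions made for illustration only.

```python
from collections import Counter

def encode(tokens):
    """Map each token to a 1-based index into a sorted vocabulary (illustrative helper)."""
    vocab = sorted(set(tokens))
    index = {word: i + 1 for i, word in enumerate(vocab)}
    return vocab, [index[t] for t in tokens]

tokens = "the cat sat on the mat".split()   # toy data, not from the notes
vocab, w = encode(tokens)                    # w == [5, 1, 4, 3, 5, 2]
N, V = len(w), len(vocab)                    # number of tokens, number of word types
counts = Counter(w)                          # per-type token counts
```

Only the per-type counts matter for the model that follows, because the tokens are exchangeable under a unigram model.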
Prior
Dirichlet distribution
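The Dirichlet distribution is a distribution over discrete probability distributions $\boldsymbol{\theta} = (\theta_1, \dots, \theta_V)$:
$$\mathrm{Dir}(\boldsymbol{\theta} \mid \alpha \mathbf{m}) = \frac{\Gamma(\alpha)}{\prod_{v=1}^{V} \Gamma(\alpha m_v)} \prod_{v=1}^{V} \theta_v^{\alpha m_v - 1} \, \delta\!\left(\sum_{v=1}^{V} \theta_v - 1\right)$$
where $\delta(x) = 1$ when $x = 0$; otherwise, $\delta(x) = 0$.
This delta function explicitly specifies the support of the
Dirichlet distribution: the density is nonzero only where $\boldsymbol{\theta}$ sums to one.
Parameters
- $\mathbf{m} = (m_1, \dots, m_V)$ with $\sum_{v} m_v = 1$: base measure; mean
- $\alpha > 0$: concentration parameter
Expectation
$$\mathbb{E}[\theta_v] = m_v$$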
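To see the roles of the base measure and the concentration parameter, one can draw samples and compare their average with the base measure. The sketch below uses the standard construction of a Dirichlet draw as normalized independent Gamma variates; the function name `sample_dirichlet` and the numeric values are assumptions, and every component of the base measure is assumed to be strictly positive.

```python
import random

def sample_dirichlet(alpha, m, rng):
    """Draw theta ~ Dir(alpha * m) by normalizing independent Gamma(alpha * m_v, 1) draws."""
    gammas = [rng.gammavariate(alpha * m_v, 1.0) for m_v in m]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(0)          # fixed seed for reproducibility
m = [0.5, 0.3, 0.2]             # base measure (assumed values)
alpha = 10.0                    # concentration parameter (assumed value)

draws = [sample_dirichlet(alpha, m, rng) for _ in range(20000)]
# Empirical mean of each component; it should be close to the base measure m.
mean = [sum(d[v] for d in draws) / len(draws) for v in range(len(m))]
```

A larger concentration parameter makes individual draws cluster more tightly around the base measure, while a smaller one spreads them toward the corners of the simplex.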
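Likelihood
$$P(\mathbf{w} \mid \boldsymbol{\theta}) = \prod_{n=1}^{N} \theta_{w_n} = \prod_{v=1}^{V} \theta_v^{n_v}$$
where $n_v$ is the number of tokens of type $v$ in $\mathbf{w} = (w_1, \dots, w_N)$.
Evidence
$$P(\mathbf{w} \mid \alpha \mathbf{m}) = \int P(\mathbf{w} \mid \boldsymbol{\theta}) \, \mathrm{Dir}(\boldsymbol{\theta} \mid \alpha \mathbf{m}) \, d\boldsymbol{\theta} = \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{v=1}^{V} \frac{\Gamma(n_v + \alpha m_v)}{\Gamma(\alpha m_v)}$$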
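The evidence of a Dirichlet–multinomial model has a closed form in Gamma functions, which is best evaluated in log space to avoid overflow. The following is a minimal sketch, assuming the standard closed form for a specific token sequence with per-type counts `counts`; the function name `log_evidence` is hypothetical, and the base measure is assumed to be strictly positive for every type that actually occurs.

```python
from math import lgamma, exp

def log_evidence(counts, alpha, m):
    """Log marginal likelihood of a token sequence with the given per-type counts.

    Assumed closed form:
    Gamma(alpha) / Gamma(N + alpha) * prod_v Gamma(n_v + alpha*m_v) / Gamma(alpha*m_v)
    """
    N = sum(counts)
    total = lgamma(alpha) - lgamma(N + alpha)
    for n_v, m_v in zip(counts, m):
        if n_v > 0:  # types with zero count contribute a factor of 1
            total += lgamma(n_v + alpha * m_v) - lgamma(alpha * m_v)
    return total

m = [0.5, 0.3, 0.2]                               # assumed base measure
one_token = exp(log_evidence([1, 0, 0], 2.0, m))  # equals m[0] = 0.5
```

With a single observed token the evidence reduces to the base-measure probability of that token's type, which is a quick sanity check on the formula.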
Remarks
;
whenever there exists
in
where
.
is a 1 point degenerate distribution with a Dirac delta function spike at the right end (
) with probability 1, and zero probability everywhere else.
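Posterior
$$P(\boldsymbol{\theta} \mid \mathbf{w}, \alpha \mathbf{m}) = \mathrm{Dir}(\boldsymbol{\theta} \mid \mathbf{n} + \alpha \mathbf{m})$$
where $\mathbf{n} = (n_1, \dots, n_V)$ and $n_v$ is the number of observed tokens of type $v$.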
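Because the Dirichlet prior is conjugate to the categorical likelihood, the posterior update amounts to adding the observed counts to the prior pseudo-counts. A small sketch, under that assumption, which returns the posterior in the same (concentration, base measure) form as the prior; the function name `posterior` and the numbers are illustrative only.

```python
def posterior(counts, alpha, m):
    """Return (alpha_post, m_post) such that the posterior is Dir(alpha_post * m_post)."""
    N = sum(counts)
    alpha_post = alpha + N                      # concentration grows with the data size
    m_post = [(n_v + alpha * m_v) / alpha_post  # posterior mean of each theta_v
              for n_v, m_v in zip(counts, m)]
    return alpha_post, m_post

# Assumed toy values: counts for three word types, prior Dir(2.0 * [0.5, 0.3, 0.2]).
alpha_post, m_post = posterior([3, 1, 0], alpha=2.0, m=[0.5, 0.3, 0.2])
```

The posterior mean is a weighted average of the empirical frequencies and the prior base measure, with the prior's weight shrinking as more tokens are observed.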
Prediction
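Consider the case where the new data $\mathbf{w}'$ consists of a single token $w'$:
$$P(w' = v \mid \mathbf{w}, \alpha \mathbf{m}) = \frac{n_v + \alpha m_v}{N + \alpha}$$
where $n_v$ is the number of tokens of type $v$ among the $N$ observed tokens in $\mathbf{w}$.
In general, if there are $N'$ total tokens in $\mathbf{w}'$,
$n'_v$ of which are of type $v$:
$$P(\mathbf{w}' \mid \mathbf{w}, \alpha \mathbf{m}) = \frac{\Gamma(N + \alpha)}{\Gamma(N + N' + \alpha)} \prod_{v=1}^{V} \frac{\Gamma(n_v + n'_v + \alpha m_v)}{\Gamma(n_v + \alpha m_v)}$$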