Dirichlet–multinomial unigram language model
Tasks
- Specify model
- Explanation
- Exploration
- Prediction
Dirichlet–multinomial unigram language model
The Dirichlet–multinomial unigram model generalizes the beta-binomial unigram model to an arbitrary finite number of token types.
Data
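- $N$: number of tokens
- $V$: number of unique word types
- Represent each token by its index in the vocabulary, that is, $w_n \in \{1, \dots, V\}$ for $n = 1, \dots, N$.
Each token follows a categorical distribution parameterized by $\theta = (\theta_1, \dots, \theta_V)$ that satisfies
$$\theta_k \ge 0 \quad \text{and} \quad \sum_{k=1}^{V} \theta_k = 1.$$
The sum-to-one constraint implies that the categorical distribution can be parameterized by only $V - 1$ parameters, for instance, $\theta_1, \dots, \theta_{V-1}$ with $\theta_V = 1 - \sum_{k=1}^{V-1} \theta_k$.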
With previous notation, .
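The data representation described above can be sketched in a few lines of Python. The toy corpus and the variable names (`tokens`, `vocab`, `index`, `w`, `N_k`) are illustrative assumptions, and indices are 0-based here purely for programming convenience.

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized.
tokens = "the cat sat on the mat the end".split()

# Vocabulary: sorted unique word types, each mapped to an integer index.
vocab = sorted(set(tokens))
index = {word: k for k, word in enumerate(vocab)}

w = [index[t] for t in tokens]        # each token replaced by its vocabulary index
N = len(w)                            # number of tokens
V = len(vocab)                        # number of unique word types

counts = Counter(w)
N_k = [counts[k] for k in range(V)]   # number of tokens of each type; sums to N

print(N, V, N_k)
```

The per-type counts `N_k` are all the model needs from the data, because tokens are treated as exchangeable.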
Prior
Dirichlet distribution
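The Dirichlet distribution is a distribution over discrete probability distributions:
$$\mathrm{Dir}(\theta \mid \alpha m) = \frac{\Gamma(\alpha)}{\prod_{k=1}^{V} \Gamma(\alpha m_k)} \left( \prod_{k=1}^{V} \theta_k^{\alpha m_k - 1} \right) \delta\!\left( \sum_{k=1}^{V} \theta_k - 1 \right)$$
where $\delta(x) = 1$ when $x = 0$; otherwise, $\delta(x) = 0$. This delta function explicitly specifies the support of the Dirichlet distribution.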
Parameters
- $m = (m_1, \dots, m_V)$: base measure; mean
- $\alpha$: concentration parameter
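To see the roles of the two parameters, the following is a minimal sketch that samples from a Dirichlet distribution by normalizing independent Gamma draws (a standard construction). It assumes the Dirichlet parameters are the products of the concentration and the base measure, written `alpha * m[k]`; the function name `sample_dirichlet` and the numeric values are hypothetical.

```python
import random

def sample_dirichlet(alpha, m, rng):
    """Draw one probability vector from a Dirichlet with parameters alpha * m[k]."""
    # Each component is Gamma(shape=alpha * m_k, scale=1); needs alpha * m_k > 0.
    draws = [rng.gammavariate(alpha * m_k, 1.0) for m_k in m]
    total = sum(draws)
    return [g / total for g in draws]

rng = random.Random(0)
m = [0.5, 0.3, 0.2]   # hypothetical base measure over 3 types (sums to one)
alpha = 10.0          # hypothetical concentration parameter

samples = [sample_dirichlet(alpha, m, rng) for _ in range(20000)]
empirical_mean = [
    sum(s[k] for s in samples) / len(samples) for k in range(len(m))
]
print(empirical_mean)  # should be close to m
```

Increasing `alpha` makes the samples cluster more tightly around `m`, while small values spread them toward the corners of the simplex.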
Expectation
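Assuming the parameterization in which the Dirichlet parameters are $\alpha m_k$, with base measure $m$ summing to one and concentration $\alpha$, the standard Dirichlet mean reduces to the base measure, which is why $m$ is called the mean:

$$\mathbb{E}[\theta_k] = \frac{\alpha m_k}{\sum_{j=1}^{V} \alpha m_j} = m_k$$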
Likelihood
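In notation assumed here, let $w_n$ be the type index of the $n$-th token, $\theta_k$ the probability of type $k$, and $N_k$ the number of tokens of type $k$ in the data $D$. Treating the $N$ tokens as independent categorical draws, the likelihood can be written as:

$$P(D \mid \theta) = \prod_{n=1}^{N} \theta_{w_n} = \prod_{k=1}^{V} \theta_k^{N_k}$$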
Evidence
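The evidence integrates the parameters out of the likelihood under the prior. As a sketch in assumed notation ($N_k$ the count of type $k$ in $D$, $N = \sum_k N_k$, and Dirichlet parameters $\alpha m_k$), the standard closed form for a specific token sequence is:

$$P(D \mid \alpha m) = \int P(D \mid \theta)\, \mathrm{Dir}(\theta \mid \alpha m)\, d\theta = \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)} \prod_{k=1}^{V} \frac{\Gamma(N_k + \alpha m_k)}{\Gamma(\alpha m_k)}$$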
Remarks
- ;
- whenever there exists in where .
- is a 1 point degenerate distribution with a Dirac delta function spike at the right end () with probability 1, and zero probability everywhere else.
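The evidence of a Dirichlet–multinomial model is usually computed in log space with the log-gamma function to avoid overflow. The sketch below assumes the standard closed form for a specific token sequence, with Dirichlet parameters `alpha * m[k]` that are all strictly positive; the function name `log_evidence` is hypothetical.

```python
from math import lgamma, exp

def log_evidence(counts, alpha, m):
    """Log marginal likelihood of a token sequence with per-type counts `counts`.

    Assumes every alpha * m[k] > 0; math.lgamma raises ValueError at zero.
    """
    N = sum(counts)
    result = lgamma(alpha) - lgamma(N + alpha)
    for N_k, m_k in zip(counts, m):
        result += lgamma(N_k + alpha * m_k) - lgamma(alpha * m_k)
    return result

# Two types with a uniform prior (alpha = 2, m = [0.5, 0.5]):
# one token of each type has evidence 1/6 analytically.
print(exp(log_evidence([1, 1], 2.0, [0.5, 0.5])))
```

With two types this reduces to the beta-binomial case, which gives a convenient sanity check.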
Posterior
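Because the Dirichlet prior is conjugate to the categorical likelihood, the posterior is again a Dirichlet whose parameters are the prior parameters plus the observed counts. In assumed notation ($N_k$ the count of type $k$ in $D$, prior parameters $\alpha m_k$):

$$P(\theta \mid D, \alpha m) = \frac{P(D \mid \theta)\, \mathrm{Dir}(\theta \mid \alpha m)}{P(D \mid \alpha m)} = \mathrm{Dir}\!\left(\theta \mid N_1 + \alpha m_1, \dots, N_V + \alpha m_V\right)$$

Equivalently, the posterior has concentration $N + \alpha$ and mean $(N_k + \alpha m_k) / (N + \alpha)$.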
Prediction
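Consider the case where the new data $D'$ consists of a single token $w'$:
$$P(w' = k \mid D, \alpha m) = \int \theta_k \, P(\theta \mid D, \alpha m)\, d\theta = \frac{N_k + \alpha m_k}{N + \alpha}$$
where $N_k$ is the number of tokens of type $k$ in $D$.
In general, if there are $N'$ total tokens in $D'$, of which $N'_k$ are of type $k$:
$$P(D' \mid D, \alpha m) = \frac{\Gamma(N + \alpha)}{\Gamma(N + N' + \alpha)} \prod_{k=1}^{V} \frac{\Gamma(N_k + N'_k + \alpha m_k)}{\Gamma(N_k + \alpha m_k)}$$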
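Both prediction cases can be sketched in Python under the standard conjugate result: the probability that the next token has a given type is its observed count plus `alpha * m[k]`, divided by the total count plus `alpha`. The function names, counts, and prior values below are hypothetical.

```python
from math import lgamma, exp

def predictive(counts, alpha, m):
    """Posterior predictive probability of each type for a single new token."""
    N = sum(counts)
    return [(N_k + alpha * m_k) / (N + alpha) for N_k, m_k in zip(counts, m)]

def log_predictive(new_counts, counts, alpha, m):
    """Log probability of a new token sequence given the observed counts.

    Assumes every counts[k] + alpha * m[k] > 0.
    """
    N, N_new = sum(counts), sum(new_counts)
    result = lgamma(N + alpha) - lgamma(N + N_new + alpha)
    for N_k, M_k, m_k in zip(counts, new_counts, m):
        a_k = N_k + alpha * m_k          # posterior Dirichlet parameter
        result += lgamma(a_k + M_k) - lgamma(a_k)
    return result

counts = [3, 1, 0]          # hypothetical observed counts for 3 types
alpha = 2.0                 # hypothetical concentration
m = [0.5, 0.25, 0.25]       # hypothetical base measure

p = predictive(counts, alpha, m)
print(p)
# The general formula with a single new token of type 1 matches p[1].
print(exp(log_predictive([0, 1, 0], counts, alpha, m)))
```

Note that the unseen third type still receives nonzero probability, which is the smoothing effect of the prior.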