Dirichlet–multinomial mixture model: known groups
Data
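We observe a set of documents, with
- $N$: number of tokens
- $D$: number of documents
- $V$: vocabulary size
- $K$: number of topics
Tokens: $w_n \in \{1, \dots, V\}$, $n = 1, \dots, N$, where $d_n$ denotes the document containing token $n$
Topic of each document: $z_d \in \{1, \dots, K\}$, $d = 1, \dots, D$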
Latent variables
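Token distribution of the $k$-th topic: $\phi_k = (\phi_{k1}, \dots, \phi_{kV})$, with $p(w_n = v \mid z_{d_n} = k, \phi) = \phi_{kv}$, where $d_n$ is the document containing token $n$
Independence:
- $w_n$ is independent of $w_{n'}$ for $n \neq n'$ given the topic of the document.
- $z_d$ are i.i.d.
Topic distribution: $\theta = (\theta_1, \dots, \theta_K)$, with $p(z_d = k \mid \theta) = \theta_k$
Overall we have
$$p(w, z \mid \theta, \phi) = \prod_{d=1}^{D} \theta_{z_d} \prod_{n=1}^{N} \phi_{z_{d_n} w_n}.$$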
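Prior
$$\theta \sim \mathrm{Dir}(\alpha), \qquad \phi_k \sim \mathrm{Dir}(\beta) \quad \text{independently for } k = 1, \dots, K,$$
with $\alpha = (\alpha_1, \dots, \alpha_K)$ and $\beta = (\beta_1, \dots, \beta_V)$.
Overall prior
$$p(\theta, \phi \mid \alpha, \beta) = \frac{1}{B(\alpha)} \prod_{k=1}^{K} \theta_k^{\alpha_k - 1} \prod_{k=1}^{K} \frac{1}{B(\beta)} \prod_{v=1}^{V} \phi_{kv}^{\beta_v - 1},$$
where $B(a) = \prod_i \Gamma(a_i) / \Gamma\left(\sum_i a_i\right)$ is the multivariate beta function.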
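To make the generative story concrete, here is a minimal sketch that samples a small corpus from the model, assuming Dirichlet priors on the topic distribution and on each topic's token distribution. The function names, the fixed document lengths, and the hyperparameter values are illustrative assumptions, not part of the notes.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def sample_corpus(doc_lengths, alpha, beta, seed=0):
    """Sample (theta, phi, z, docs) from the mixture model.

    doc_lengths: tokens per document (treated as fixed, an assumption).
    alpha: Dirichlet parameter of length K for the topic distribution.
    beta: Dirichlet parameter of length V shared by all token distributions.
    """
    rng = random.Random(seed)
    K, V = len(alpha), len(beta)
    theta = sample_dirichlet(alpha, rng)                   # topic distribution
    phi = [sample_dirichlet(beta, rng) for _ in range(K)]  # one token distribution per topic
    z, docs = [], []
    for length in doc_lengths:
        topic = rng.choices(range(K), weights=theta)[0]    # one topic per document
        z.append(topic)
        # Tokens are i.i.d. given the topic of their document.
        docs.append(rng.choices(range(V), weights=phi[topic], k=length))
    return theta, phi, z, docs


theta, phi, z, docs = sample_corpus([5, 3, 4], alpha=[1.0, 1.0], beta=[0.5, 0.5, 0.5])
```

Normalising Gamma draws is a standard way to sample a Dirichlet vector using only the standard library.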
Notation
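- $m_k$: number of documents that topic $k$ is responsible for
- $n_k$: total number of tokens in those documents, i.e. tokens associated with topic $k$
- $n_{kv}$: number of tokens of type $v$ associated with topic $k$, so that $n_k = \sum_{v=1}^{V} n_{kv}$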
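The three counts above can be computed in a single pass over the data. The sketch below assumes documents are stored as lists of integer token types and topics as integer labels; the names `count_statistics`, `m`, `n_tot`, and `n` are illustrative.

```python
def count_statistics(docs, z, K, V):
    """Return (m, n_tot, n) for documents `docs` with known topics `z`.

    m[k]     : number of documents assigned to topic k
    n_tot[k] : total number of tokens in documents of topic k
    n[k][v]  : number of tokens of type v in documents of topic k
    """
    m = [0] * K
    n = [[0] * V for _ in range(K)]
    for tokens, k in zip(docs, z):
        m[k] += 1
        for v in tokens:
            n[k][v] += 1
    n_tot = [sum(row) for row in n]
    return m, n_tot, n


# Toy data: three documents over a vocabulary of size 3, two topics.
docs = [[0, 1, 1], [2, 2], [0, 0, 1, 2]]
z = [0, 1, 0]
m, n_tot, n = count_statistics(docs, z, K=2, V=3)
# m     == [2, 1]
# n_tot == [7, 2]
# n     == [[3, 3, 1], [0, 0, 2]]
```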
Likelihood
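Let $y_n$ denote the topic for token $w_n$ for all $n = 1, \dots, N$, where $y_n = z_{d_n}$ and $d_n$ is the document containing token $n$.
Likelihood is therefore
$$p(w, z \mid \theta, \phi) = \prod_{d=1}^{D} \theta_{z_d} \prod_{n=1}^{N} \phi_{y_n w_n} = \prod_{k=1}^{K} \theta_k^{m_k} \prod_{k=1}^{K} \prod_{v=1}^{V} \phi_{kv}^{n_{kv}}.$$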
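As a sanity check, the likelihood can be evaluated token by token or from the counts alone, and the two should agree. This is a small sketch on toy data; the parameter values are made up for illustration, and all probabilities are assumed strictly positive so the logarithms are defined.

```python
from math import log


def log_likelihood_direct(docs, z, theta, phi):
    """Sum log-probabilities document by document and token by token."""
    total = 0.0
    for tokens, k in zip(docs, z):
        total += log(theta[k])                        # topic of the document
        total += sum(log(phi[k][v]) for v in tokens)  # tokens given that topic
    return total


def log_likelihood_counts(m, n, theta, phi):
    """Same quantity computed from the counts m[k] and n[k][v] only."""
    total = sum(m_k * log(t) for m_k, t in zip(m, theta))
    for row, phi_k in zip(n, phi):
        total += sum(c * log(p) for c, p in zip(row, phi_k))
    return total


docs = [[0, 1, 1], [2, 2], [0, 0, 1, 2]]
z = [0, 1, 0]
m = [2, 1]                      # documents per topic
n = [[3, 3, 1], [0, 0, 2]]      # token-type counts per topic
theta = [0.6, 0.4]
phi = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]

ll_direct = log_likelihood_direct(docs, z, theta, phi)
ll_counts = log_likelihood_counts(m, n, theta, phi)
```

The agreement of the two values reflects that the counts are sufficient statistics for the parameters.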
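Evidence
$$p(w, z \mid \alpha, \beta) = \int p(w, z \mid \theta, \phi)\, p(\theta, \phi \mid \alpha, \beta)\, d\theta\, d\phi = \frac{B(\alpha + m)}{B(\alpha)} \prod_{k=1}^{K} \frac{B(\beta + \mathbf{n}_k)}{B(\beta)},$$
where $m = (m_1, \dots, m_K)$, $\mathbf{n}_k = (n_{k1}, \dots, n_{kV})$, and $B(a) = \prod_i \Gamma(a_i) / \Gamma\left(\sum_i a_i\right)$.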
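Because the evidence is a ratio of multivariate beta functions, it is usually evaluated in log space to avoid overflow. The sketch below assumes Dirichlet priors with parameter vectors `alpha` (length K) and `beta` (length V); the function names and toy counts are illustrative.

```python
from math import lgamma


def log_multivariate_beta(a):
    """log B(a) = sum_i log Gamma(a_i) - log Gamma(sum_i a_i)."""
    return sum(lgamma(x) for x in a) - lgamma(sum(a))


def log_evidence(m, n, alpha, beta):
    """Log marginal likelihood of tokens and topics given the hyperparameters.

    m[k]    : number of documents with topic k
    n[k][v] : number of tokens of type v in documents with topic k
    """
    # Topic part: log B(alpha + m) - log B(alpha)
    total = log_multivariate_beta([a + c for a, c in zip(alpha, m)])
    total -= log_multivariate_beta(alpha)
    # Token part: one Dirichlet-multinomial term per topic
    log_b_beta = log_multivariate_beta(beta)
    for row in n:
        total += log_multivariate_beta([b + c for b, c in zip(beta, row)]) - log_b_beta
    return total


m = [2, 1]
n = [[3, 3, 1], [0, 0, 2]]
value = log_evidence(m, n, alpha=[1.0, 1.0], beta=[1.0, 1.0, 1.0])
```

For these toy counts with uniform priors the three factors are 1/12, 1/5040, and 1/6, so the evidence is 1/362880.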
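Posterior
$$p(\theta, \phi \mid w, z, \alpha, \beta) = \mathrm{Dir}(\theta \mid \alpha + m) \prod_{k=1}^{K} \mathrm{Dir}(\phi_k \mid \beta + \mathbf{n}_k),$$
where $m$ and $\mathbf{n}_k$ collect the counts $m_k$ and $n_{kv}$ into vectors. The Dirichlet prior is conjugate, so the posterior only adds the observed counts to the prior parameters.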
Prediction
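Consider the case where the new data consists of a single token $\tilde{w}$ in a new document. The topic $\tilde{z}$ of the new document is not observed, so it is summed out:
$$p(\tilde{w} = v \mid w, z) = \sum_{k=1}^{K} \frac{\alpha_k + m_k}{\alpha_0 + D} \cdot \frac{\beta_v + n_{kv}}{\beta_0 + n_k}, \qquad \alpha_0 = \sum_{k} \alpha_k, \quad \beta_0 = \sum_{v} \beta_v.$$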
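For the case of a single token $\tilde{w}$ in an existing document $d$, where $d$ is of group $k$, the predictive probability is
$$p(\tilde{w} = v \mid z_d = k, w, z) = \frac{\beta_v + n_{kv}}{\beta_0 + n_k}, \qquad \beta_0 = \sum_{v} \beta_v.$$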
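Both single-token cases reduce to simple ratios of counts plus prior parameters. The following sketch assumes that the topic of a new document is unknown and is marginalised under the posterior over topics; the function names and toy counts are illustrative.

```python
def predictive_known_topic(v, k, n, beta):
    """p(next token = v | document has topic k, data)."""
    return (beta[v] + n[k][v]) / (sum(beta) + sum(n[k]))


def predictive_new_document(v, m, n, alpha, beta):
    """p(token = v | data) for a new document, summing over its unknown topic."""
    denom = sum(alpha) + sum(m)  # alpha_0 + D
    return sum(
        (alpha[k] + m[k]) / denom * predictive_known_topic(v, k, n, beta)
        for k in range(len(m))
    )


m = [2, 1]
n = [[3, 3, 1], [0, 0, 2]]
alpha = [1.0, 1.0]
beta = [1.0, 1.0, 1.0]

p_existing = predictive_known_topic(0, 0, n, beta)        # (1 + 3) / (3 + 7) = 0.4
p_new = predictive_new_document(0, m, n, alpha, beta)     # 0.6 * 0.4 + 0.4 * 0.2, about 0.32
```

Each predictive distribution sums to one over the vocabulary, which is a quick check on any implementation.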
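For a new dataset that consists of multiple documents with tokens $\tilde{w}$ and topics $\tilde{z}$, the predictive probability is
$$p(\tilde{w}, \tilde{z} \mid w, z) = \frac{B(\alpha + m + \tilde{m})}{B(\alpha + m)} \prod_{k=1}^{K} \frac{B(\beta + \mathbf{n}_k + \tilde{\mathbf{n}}_k)}{B(\beta + \mathbf{n}_k)},$$
where $\tilde{m}$ and $\tilde{\mathbf{n}}_k$ are the document and token counts of the new dataset, and $B$ is the multivariate beta function.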