=============================================================================
Hidden Markov Model and Bayesian Model Merging Toolkit Documentation

Kristie Seymore
Initial version: 7/9/98
Last updated: 9/8/99

Please report problems with this code to kseymore@jprc.com

=============================================================================

This toolkit consists of three programs: 
 * evaluate
 * bmm
 * train_hmm

1) evaluate:
------------

Usage: evaluate -hmm <model_file>
                -vit | -forward | -max_class
              [ -obs <observation_file> ]
              [ -out <output_file> ]
              [ -print_probs ]
              [ -calc_pp ]
              [ -trans_weight <weight> ]
              [ -print_model ]
              [ -print_state_id ]
              [ -read_obs_id ]
              [ -only_punc_trans ]
              [ -read_labels ]
              [ -server_mode <port_num> ]
              [ -vit_details ]
              [ -forward_details ]

Given a Hidden Markov model and a set of observation sequences, evaluate will 
return the most likely state sequence through the model for a particular 
observation sequence (via the Viterbi algorithm), the probability of the Viterbi
path, or the total probability of the observation (via the Forward algorithm).
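
The difference between the Viterbi and Forward results can be sketched in a
few lines of Python. This is an illustration only, not the toolkit's code:
the function names and the dictionary layout are assumptions, and the toy
model leaves out the 'start'/'end' states and minimum durations that the
real model file supports.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))

def lg(p):
    """Natural log that maps probability 0 to -infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi_and_forward(obs, states, start, trans, emit):
    """Return (best_path, best_logprob, total_logprob) for one sequence.

    start[s], trans[r][s] and emit[s][word] hold plain probabilities.
    """
    vit = {s: lg(start[s]) + lg(emit[s].get(obs[0], 0.0)) for s in states}
    fwd = dict(vit)
    back = []
    for word in obs[1:]:
        new_vit, new_fwd, ptr = {}, {}, {}
        for s in states:
            e = lg(emit[s].get(word, 0.0))
            # Viterbi keeps only the best predecessor ...
            best = max(states, key=lambda r: vit[r] + lg(trans[r].get(s, 0.0)))
            new_vit[s] = vit[best] + lg(trans[best].get(s, 0.0)) + e
            ptr[s] = best
            # ... while Forward sums over all predecessors.
            new_fwd[s] = logsumexp(
                [fwd[r] + lg(trans[r].get(s, 0.0)) for r in states]) + e
        vit, fwd = new_vit, new_fwd
        back.append(ptr)
    last = max(states, key=lambda s: vit[s])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, vit[last], logsumexp(list(fwd.values()))
```

The total (Forward) logprob is never smaller than the Viterbi logprob,
because the Viterbi path is one of the paths the Forward sum runs over.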

Explanation of the command line arguments:
	-hmm 		File containing the description of a Hidden Markov model,
				formatted per the instructions in 
				part (a) below.
	-vit		Returns the most likely state sequence for each
				observation sequence.
	-forward	Returns the total probability of each observation sequence
				according to the given model.
	-max_class	Returns the most likely individual state for each
				individual observation, independent of context.
	-obs		File containing the observation sequences (see 
				part (b)). If not specified, stdin is used.
	-out		File that output should be printed to. If not
				specified, stdout will be used.
	-print_probs	Prints the transition logprob and emission logprob
                                for each observation.
	-calc_pp	Calculates the perplexity of the Viterbi paths.
	-trans_weight	Sets the transition weight, which controls the relative weighting
				between the transitions and the emissions in choosing
				the most likely paths through the model.
	-print_model	Prints a summary of the HMM to stderr after it has been
				read in.
	-print_state_id	Prints the state id after the observation word
				and label for Viterbi paths and max_class outputs.
	-read_obs_id		Signals that each observation sequence will be 
				preceded by an integer identification number. 
				This number will be printed with each line 
				of output.
        -only_punc_trans  Transitions to a new state are only allowed on 
				words that end in [.,:!]. This punctuation
				is stripped when calculating emission
				probabilities. Periods are stripped when
				occurring at the end of a word that has more
				than one letter.
	-read_labels	Signals that a label follows each observation word in the
				input observation file. The label is read, and is
				used to constrain the Viterbi search (forced alignment).
	-server_mode	Runs program as a server through a socket with port 
				number <port_num>
	-vit_details	Prints to stderr lots of small details about the stages
				of the Viterbi search 
	-forward_details Prints to stderr lots of small details about the stages
				of the Forward calculation
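
One plausible reading of -trans_weight is sketched below. The exact formula
used by evaluate is not stated in this document, so the assumption here is
that the weight multiplies the transition logprob before it is added to the
emission logprob. The function name step_score is hypothetical.

```python
import math

def step_score(trans_prob, emit_prob, trans_weight=1.0):
    """Score of one search step.

    Assumption: the weight scales the transition logprob only.
    """
    return trans_weight * math.log(trans_prob) + math.log(emit_prob)

def prefers_staying(trans_weight):
    """Compare a likely self-loop with a poor emission against an
    unlikely transition with a good emission."""
    stay = step_score(0.9, 0.1, trans_weight)
    move = step_score(0.1, 0.5, trans_weight)
    return stay > move
```

Under this assumption a weight of 0 ignores the transitions entirely, and
larger weights make the search follow high-probability transitions more
closely.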


a) Hidden Markov Model file format:
The Hidden Markov model must be supplied to the program as a file, formatted
in the following way:

---------------------------
<num_distributions>
<distribution_label> <distribution_file>
etc.
<num_states>
<state_id> <state_label> <minimum_duration> <distribution_label> 
etc.
<from_state_id> <to_state_id> <prob>
etc.
---------------------------

where

<num_distributions>, <num_states>, <state_id>, <from_state_id>, 
	<to_state_id> are integers
<distribution_label> is a word
<distribution_file> is a file path
<state_label> is a word
<prob> is a real number

<minimum_duration> is an integer - it specifies the minimum number of
observations the state must emit before a transition out of that state 
is allowed. A default value of 1 should be used if there is no duration 
preference. 

<prob> is the probability of a transition. All transitions
out of a state should have a total probability of 1.

By convention, the start state in the HMM should have the label 'start', and
the end state should have the label 'end'. These states should be included in
the HMM file. Their distribution labels should be "null". The null distribution
does not need to be specified in the distribution list.
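
As a sketch of how the layout above can be read, the following Python parses
a model description from a string and checks that the outgoing transitions of
each state sum to 1. The function names, the sample model and the file name
body.arpa are made up for illustration. The sketch also assumes that
distribution paths contain no spaces.

```python
SAMPLE_MODEL = """\
1
body body.arpa
3
0 start 1 null
1 body 1 body
2 end 1 null
0 1 1.0
1 1 0.6
1 2 0.4
"""

def parse_hmm(text):
    """Return (distributions, states, transitions) from a model description."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    pos = 0
    num_dists = int(rows[pos][0])
    pos += 1
    dists = {}
    for _ in range(num_dists):
        label, path = rows[pos]
        dists[label] = path
        pos += 1
    num_states = int(rows[pos][0])
    pos += 1
    states = {}
    for _ in range(num_states):
        sid, label, min_dur, dist = rows[pos]
        states[int(sid)] = {"label": label,
                            "min_duration": int(min_dur),
                            "dist": dist}
        pos += 1
    # Every remaining line is one transition.
    transitions = {}
    for src, dst, prob in rows[pos:]:
        transitions.setdefault(int(src), {})[int(dst)] = float(prob)
    return dists, states, transitions

def unnormalised_states(transitions, tol=1e-6):
    """Ids of states whose outgoing probabilities do not sum to 1."""
    return [sid for sid, out in transitions.items()
            if abs(sum(out.values()) - 1.0) > tol]
```

For SAMPLE_MODEL, unnormalised_states returns an empty list. The 'end' state
has no outgoing transitions and therefore never appears in the check.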

b) Observation file format:
Observation sequences are expected to be in the form of conditioned,
space separated words. One observation sequence should be specified
per line. An integer id number may precede each observation sequence
if the -read_obs_id flag is set on the command line. State labels may
follow each word if the -read_labels flag is used for forced alignment.
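
A minimal sketch of reading one observation line under these options is
given below. The function name and its keyword arguments are illustrative;
they mirror the -read_obs_id and -read_labels flags but are not part of the
toolkit.

```python
def parse_observation_line(line, read_obs_id=False, read_labels=False):
    """Split one observation line into (sequence_id, words, labels)."""
    tokens = line.split()
    seq_id = None
    if read_obs_id:
        # The integer id precedes the observation sequence.
        seq_id = int(tokens[0])
        tokens = tokens[1:]
    if read_labels:
        # Words and labels alternate: word label word label ...
        if len(tokens) % 2 != 0:
            raise ValueError("expected a label after every word")
        return seq_id, tokens[0::2], tokens[1::2]
    return seq_id, tokens, None
```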

c) Probability distribution format: 
State emission distributions provided by the <distribution_file> paths in the
model definition file must be in arpa unigram language model format.
This means they must look like the following:

--------------------------------------------
\data\
ngram 1=<some_integer>

\1-grams:
<log_prob> <word>
<log_prob> <word>
<log_prob> <word>
<log_prob> <word>

\end\
-------------------------------------------

where

<some_integer> is the size of the vocabulary
<log_prob> is the probability of <word>, given in log10 form.
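
The unigram layout above can be read with a short Python sketch such as the
following. The function names are illustrative. Since the distribution files
store log10 values while the output in part (d) is reported in base e, a
conversion helper is included.

```python
import math

def parse_arpa_unigram(text):
    """Return (declared_vocab_size, {word: log10_prob}) from ARPA text."""
    declared = None
    logprobs = {}
    in_unigrams = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("ngram 1="):
            declared = int(line.split("=", 1)[1])
        elif line == "\\1-grams:":
            in_unigrams = True
        elif line == "\\end\\":
            break
        elif in_unigrams:
            log10_prob, word = line.split()[:2]
            logprobs[word] = float(log10_prob)
    return declared, logprobs

def log10_to_ln(log10_prob):
    """Convert a base-10 logprob to a natural-log logprob."""
    return log10_prob * math.log(10.0)
```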

d) Output:
Results from the Viterbi search or the forward calculation will be output 
in the following form:

Viterbi path search:
sequence_id: O_1 state_1 state_id_1 ltprob_1 leprob_1 O_2 state_2 state_id_2 ltprob_2 leprob_2 ...
etc.

where 
sequence_id = integer id assigned to string. The sequence id
	only appears when -read_obs_id is set.
O_i = observation token i, as read in from input
state_i = state label of state that emitted symbol O_i in Viterbi path	
		(may not be unique)
state_id_i = state id of state that emitted symbol O_i in Viterbi path
		(must be unique)
ltprob_i = log (base e) of the transition probability between states
		i and i-1
leprob_i = log (base e) of the emission probability of the word 
		from state i

Forward probability:
sequence_id: logprob (total logprob, base e)
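
The Viterbi output line can be post-processed with a sketch like the one
below. It assumes that the sequence id, the state id and both logprobs are
all present on the line, so that every observation contributes exactly five
fields. The function name and the sample line in the usage note are made up.

```python
def parse_viterbi_line(line):
    """Return (sequence_id, steps, summed_logprob) for one output line.

    Assumes the full five-field layout per observation:
    word, state label, state id, ltprob, leprob.
    """
    head, _, rest = line.partition(":")
    tokens = rest.split()
    if len(tokens) % 5 != 0:
        raise ValueError("expected five fields per observation")
    steps = []
    for i in range(0, len(tokens), 5):
        word, label, state_id, ltprob, leprob = tokens[i:i + 5]
        steps.append({"word": word,
                      "label": label,
                      "state_id": int(state_id),
                      "ltprob": float(ltprob),
                      "leprob": float(leprob)})
    # Sum of the listed transition and emission logprobs (base e).
    summed = sum(s["ltprob"] + s["leprob"] for s in steps)
    return int(head), steps, summed
```

For example, parse_viterbi_line("7: deep title 1 -0.5 -2.0 nets title 1 -0.1 -1.5")
yields sequence id 7 and two steps.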



2) bmm: 
-------

Bayesian Model Merging

Andreas Stolcke, "Bayesian Learning of Probabilistic Language Models",
Ph.D. thesis, University of California, Berkeley, CA, 1994.
http://www-speech.sri.com/people/stolcke/publications.html

Usage: bmm -data <data_file>
           -map | -mean | -ml | -structure
           -vocab <vocab_file>
           -out <output_dir>
           [ -closed_vocab ]
           [ -incremental <orig_num_strings_to_add> <num_strings_to_add> ]
           [ -lookahead <num_steps> ]
           [ -priors <dist_file>  (format of <word> <count>)]
           [ -print_model <start_iteration> <step> ]
           [ -evaluate_model <observation_file> <start_iteration> <step> ]
           [ -print_dist <directory> ]
           [ -smooth_dists <mode> (absolute discounting = 1, linear interpolation = 2) ]
           [ -read_count ]
           [ -read_label ]
           [ -read_id ]
           [ -initial_adj_collapse ]
           [ -initial_V_collapse ]
           [ -exit_after_initial_model ]
           [ -exit_after_initial_collapse ]
           [ -same_label ]
           [ -same_label_at_first ]
           [ -neighbors_only ]
           [ -no_prior | -structure_prior ]
           [ -narrow_emission_prior  (broad is default) ]
           [ -pw <prior_weight (1.0)> | -eff_sample_size <count> ]
           [ -fixed_dirichlet_prior_weight <weight> ]
           [ -trans_uniform_alpha <count (1.0)> ]
           [ -obs_uniform_alpha <count (1.0)> ]

Given a set of training observations, find the most probable 
Hidden Markov Model given the observations.

Explanation of the command line arguments:
	-data		File containing observation sequences. See format notes below.
	-map	 	Model parameters are set to their maximum a 
				posteriori estimates. Posterior of model structure and
				parameters is maximized, evaluated at the map parameter settings.
	-mean		Model parameters are set to their mean a posteriori estimates.
				Posterior of model structure and parameters is maximized,
                                evaluated at the mean parameter settings.
	-ml		Model parameters are set to their maximum likelihood estimates.
				Posterior of model structure and parameters is maximized,
                                evaluated at the ml parameter settings.
	-structure	Sets the objective function to be the maximization of
                                the posterior of the model structure. When necessary
				(during evaluation of model output), model parameters are
				set to their ml estimates.
        -vocab          Vocabulary file. Any words in the observation
                                file that do not appear in the vocabulary file will
                                be mapped to the unknown token. The unknown token is
				added to the vocabulary, unless -closed_vocab is specified.
	-out		Directory where output models will be printed.
				The final model will always be printed.
	-closed_vocab	The unknown word token is not added to the vocabulary. If any
				out-of-vocabulary words appear in the observation file, 
				the program will exit.
	-incremental 	Processes observation sequences incrementally. Merging will begin
				using the number of observation sequences specified in
				<orig_num_strings_to_add>. Once no more valid merges can
				be found, <num_strings_to_add> observation strings will
				be added. This process repeats until no more valid merges exist.
	-lookahead	Specifies the number of invalid merges to be carried out once no
				valid merges exist, in an effort to escape local minima.
	-priors		File where prior count distributions are specified. The
				file should contain lines of the form 
				"<dist_label> <dist_path>". Each distribution file
				specified by <dist_path> should contain data of
				the form "<word> <count>". These prior distributions
				will be used in addition to a uniform prior distribution.
				The alpha values of the uniform distribution are specified
				with -trans_uniform_alpha and -obs_uniform_alpha.
	-print_model	When set, the model is printed starting at iteration number 
				<start_iteration>. The model is then printed every
				time <step> merges occur.
	-evaluate_model The current model is evaluated using Viterbi decoding on the 
				observations specified in <observation_file>. This
				evaluation occurs beginning at iteration number <start_iteration>
				and repeats every <step> iterations. The Viterbi 
				decoding output is written to the -out directory.
	-print_dist	The emission distributions for each state are printed
				to the specified directory whenever the model is printed.
				The emission distributions are smoothed if smoothing
				is specified with the -smooth_dists flag. Otherwise,
				unsmoothed parameter estimates are calculated according
				to the model parameter settings selected (map, mean, ml).
	-smooth_dists 	Emission distributions are smoothed before being printed
				or evaluated. If "1" is specified, each distribution
				is smoothed using absolute discounting. For mode "2",
				the emission counts are linearly interpolated with
				the prior distribution and a uniform distribution. Mixture
				weights are set using leave-one-out expectation-maximization
				of the emission counts.
	-read_count	Signals that each observation sequence will be preceded 
				by an integer count value. This number will be
				the initial Viterbi count assigned to that 
				observation sequence in the initial model.
	-read_label	Signals that each word in the observation sequence file 
				will be followed by a word label. See format 
				notes below for more information.
	-read_id	Signals that each observation sequence will be preceded
				by an integer id value. This number will not be
				used for any other purpose.
	-initial_adj_collapse When set, any two states in the initial model that 
				have the same label and share a unique transition 
				(the source state has no other output
				transitions and the destination state has no other 
				input transitions) will be merged.
	-initial_V_collapse When set, any two states in the initial model that
				have the same label and that are children of the same
				state will be merged. This merging will be performed
				in the forward and backward directions, after 
				initial_adj_collapse, if specified.
	-exit_after_initial_model After the initial model is created from the
				observations, the model will be written and the 
				program will exit.
	-exit_after_initial_collapse After the initial model has undergone
				adjacent or V merging as specified, the model
				will be written and the program will exit.
	-same_label	When set, only states that have the same label are 
				considered for a merge.
	-same_label_at_first	Only states that have the same label are considered
				for a merge at first. Once no valid merges exist,
				this constraint is lifted.
	-neighbors_only When set, only states that are neighbors (share a 
				transition) are considered for a merge.
	-no_prior		Specifies that the prior component should not 
				be included in the calculations to determine the
				value of a merge.
	-structure_prior	Specifies that only the structural prior component should
                                be included in the calculations to determine the
                                value of a merge. (The parameter prior component is
				excluded.)
	-narrow_emission_prior	Specifies that prior counts should only be considered
				over words that actually have an emission count in each state.
				This is opposed to a broad emission prior, which considers
				all prior counts, whether or not those words have occurred
				in the current state. A broad emission prior is default.
	-pw 		Sets the prior weight, which controls the balance 
				between generalization and fit to the data:
				log P(M|X) = log P(X|M) + pw * log P(M).
				For pw > 1, the algorithm will stop merging 
				later; for pw < 1, earlier. Default value 
				is 1.0.
	-eff_sample_size The number of observation sequences in X multiplied 
				by 1/pw. Another interpretation for prior
				weight, the effective sample size can be increased
				in order to increase the impact of the likelihood 
				term in model maximization. An effective sample
				size greater than the number of observation
				sequences corresponds to a prior weight < 1.
	-fixed_dirichlet_prior_weight Sets the total dirichlet prior weight to the
				specified value. This value applies to both emissions
				and transition prior distributions.
	-trans_uniform_alpha	Sets the Dirichlet prior weight, alpha, to <count> for
				each transition parameter in the model.
        -obs_uniform_alpha	Sets the Dirichlet prior weight, alpha, to <count> for
                                each emission parameter in the model.
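
The relation between -pw and -eff_sample_size described above can be written
out as a small Python sketch. The function names are illustrative, and the
log likelihood and log prior are taken as given numbers; how bmm computes
them is not shown here.

```python
def weighted_log_posterior(log_likelihood, log_prior, pw=1.0):
    """log P(M|X) = log P(X|M) + pw * log P(M), as stated for -pw."""
    return log_likelihood + pw * log_prior

def prior_weight_from_eff_sample_size(num_sequences, eff_sample_size):
    """Invert eff_sample_size = num_sequences * (1 / pw)."""
    return num_sequences / eff_sample_size
```

With 100 observation sequences, an effective sample size of 200 corresponds
to pw = 0.5, which halves the contribution of the prior term.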


Format notes:

a) The observation file <data_file> must be given in the form
<count> <word> [<label>] <word> [<label>] <word> [<label>] etc.
<count> <word> [<label>] <word> [<label>] <word> [<label>] etc.
...

where <count> is an integer giving how many times that string occurred in the
training data, and is only required if the -read_count flag is set.
<word> is a conditioned word. We assume that no identical
strings exist in the list - all complete word sequences on one line are unique.
<label> is an optional word, assumed only when -read_label is set, where
the 2nd word of each word pair is used as the label
for the state corresponding to the 1st word token of the pair. 
Each state initially can emit one output symbol, which is the word
assigned to that state. Periods are stripped off the
ends of all words that have more than one letter; commas 
and colons are stripped off the ends of all words.
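
The stripping rule can be sketched as follows. Two points are assumptions
rather than documented behaviour: "more than one letter" is read as counting
the characters before the final period (so an initial such as "J." keeps its
period), and a token consisting only of a punctuation mark is left unchanged.
The function name is hypothetical.

```python
def condition_word(word):
    """Strip trailing punctuation from one word as described above."""
    # Commas and colons come off the end of every word.
    if len(word) > 1 and word[-1] in ",:":
        return word[:-1]
    # Periods come off only when more than one character precedes them
    # (assumed reading of "more than one letter").
    if word.endswith(".") and len(word) - 1 > 1:
        return word[:-1]
    return word
```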

b) Output file format: Models written to the output file will be in 
the Hidden Markov model format described above in the "evaluate" documentation.
If -print_dist is specified, arpa-formatted emission distributions
will be written to the specified directory.

The arpa-formatted emission distributions are written in an
abbreviated format. The vocabulary file is specified in
the header of the file, as well as the zeroton probability
value. Then all words that have a probability different than
the zeroton value are listed in the file. Any words not listed
in the arpa file that occur in the vocabulary file should be
assigned the zeroton prob value listed in the file header.
This format avoids long files for distributions with
large vocabularies and sparse emission counts, where the
majority of words are assigned the zeroton probability.
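
Looking up a word in such an abbreviated distribution can be sketched as
below. The header syntax is not described in this document, so the sketch
assumes the listed words, the vocabulary and the zeroton value have already
been read into Python objects. The function name is illustrative.

```python
def emission_log10(word, listed, vocab, zeroton_log10):
    """Return the log10 emission probability of a word.

    listed        -- {word: log10_prob} for words written in the file
    vocab         -- set of all words in the vocabulary file
    zeroton_log10 -- zeroton value from the file header
    """
    if word in listed:
        return listed[word]
    if word in vocab:
        # In the vocabulary but not listed: use the zeroton value.
        return zeroton_log10
    raise KeyError(word)
```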


3) train_hmm:
-------------

Usage: train_hmm -hmm <model_file>
                 -vocab <vocab_file>
               [ -closed_vocab ]
               [ -obs <observation_file> ]
               [ -out <output_file> ]
               [ -print_new_emissions <directory> ]
               [ -read_label ]
               [ -read_id ]
               [ -only_punc_trans ]
               [ -uniform ]
               [ -random <seed> ]
               [ -trans_only ]
               [ -bw_details ]

Given an HMM with initial parameter estimates and observation data, iterate 
forward-backward training until local maximum-likelihood parameter estimates 
are attained.
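
One forward-backward (Baum-Welch) re-estimation step can be sketched in
Python as follows. This is a textbook version for a small discrete HMM, not
the toolkit's implementation: it works with plain probabilities (no scaling,
so it only suits short sequences), has no 'start'/'end' null states, and all
names are illustrative.

```python
def forward(obs, states, start, trans, emit):
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: sum(prev[r] * trans[r][s] for r in states) * emit[s][o]
                      for s in states})
    return alpha

def backward(obs, states, trans, emit):
    beta = [{s: 1.0 for s in states}]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {r: sum(trans[r][s] * emit[s][o] * nxt[s] for s in states)
                        for r in states})
    return beta

def likelihood(seqs, states, start, trans, emit):
    total = 1.0
    for obs in seqs:
        total *= sum(forward(obs, states, start, trans, emit)[-1].values())
    return total

def normalise(counts):
    z = sum(counts.values())
    # Fall back to uniform if a state collected no counts.
    return {k: (v / z if z > 0 else 1.0 / len(counts)) for k, v in counts.items()}

def baum_welch_step(seqs, states, vocab, start, trans, emit):
    """Return re-estimated (start, trans, emit) after one EM iteration."""
    start_c = {s: 0.0 for s in states}
    trans_c = {r: {s: 0.0 for s in states} for r in states}
    emit_c = {s: {w: 0.0 for w in vocab} for s in states}
    for obs in seqs:
        alpha = forward(obs, states, start, trans, emit)
        beta = backward(obs, states, trans, emit)
        total = sum(alpha[-1].values())
        for t, o in enumerate(obs):
            for s in states:
                # Posterior probability of being in state s at time t.
                gamma = alpha[t][s] * beta[t][s] / total
                emit_c[s][o] += gamma
                if t == 0:
                    start_c[s] += gamma
            if t + 1 < len(obs):
                nxt = obs[t + 1]
                for r in states:
                    for s in states:
                        # Posterior probability of the transition r -> s.
                        trans_c[r][s] += (alpha[t][r] * trans[r][s]
                                          * emit[s][nxt] * beta[t + 1][s] / total)
    return (normalise(start_c),
            {r: normalise(trans_c[r]) for r in states},
            {s: normalise(emit_c[s]) for s in states})
```

Repeating baum_welch_step until the likelihood stops improving gives local
maximum-likelihood estimates; each step cannot decrease the likelihood of the
training data.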

Explanation of the command line arguments:
        -hmm            File containing the description of a Hidden Markov model,
                                formatted per the instructions in Section 1, part (a).
	-vocab 		The vocabulary file. Any observation word that
				is not in the vocabulary is changed to the <UNK> symbol.
				The <UNK> symbol is added to the vocabulary, unless
				-closed_vocab is specified.
        -closed_vocab   The unknown word token is not added to the vocabulary. If any
                                out-of-vocabulary words appear in the observation file,
                                the program will exit.
        -obs            File containing the observation sequences used for training (see Section 1,
                                part (b)). If not specified, stdin is used.
        -out            File that trained model should be printed to. If not
                                specified, stdout will be used.
	-print_new_emissions	Directory where newly estimated emission distributions should
				be printed to (in arpa format).
	-read_label	Indicates that each word in the observation sequences (-obs) will
				be followed by a label.
	-read_id	Indicates that each observation sequence will be preceded by an
                                integer id.
        -only_punc_trans  Transitions to a new state are only allowed on
                                words that end in [.,:!]. This punctuation
                                is stripped when calculating emission
                                probabilities. Periods are stripped when
                                occurring at the end of a word that has more
                                than one letter.
	-uniform	Initializes transition parameters to uniform probabilities. Non-zero
				transitions are limited to those indicated in the initial model file.
	-random		Initializes transition parameters to (normalized) random probabilities. Non-zero
                                transitions are limited to those indicated in the initial model file.
	-trans_only	Trains transition parameters only. Emission distribution values 
				are held fixed.
	-bw_details	Prints to stderr lots of small details about the stages
                                of Baum-Welch training.
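
The effect of -uniform and -random on the transition parameters can be
sketched as below. The sketch assumes that every transition listed in the
initial model file counts as allowed, and it uses Python's own random number
generator, so the values will not match those produced by train_hmm for the
same seed. The function name is hypothetical.

```python
import random

def reinitialise_transitions(transitions, mode="uniform", seed=0):
    """Return new transition probabilities over the same set of arcs.

    transitions -- {from_state_id: {to_state_id: prob}} from the model file
    mode        -- "uniform" or "random"
    """
    rng = random.Random(seed)
    result = {}
    for src, out in transitions.items():
        if mode == "uniform":
            weights = {dst: 1.0 for dst in out}
        else:
            # Small offset keeps every allowed arc strictly positive.
            weights = {dst: rng.random() + 1e-9 for dst in out}
        total = sum(weights.values())
        # Normalise so the outgoing probabilities of each state sum to 1.
        result[src] = {dst: w / total for dst, w in weights.items()}
    return result
```

Arcs that are absent from the initial model stay absent, so the topology of
the HMM is unchanged by either initialization.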


