Topics from the whole semester are eligible to appear on the final exam. We will tend to focus on material covered in problem sets, exercises, Piazza discussions, or lecture.
You may bring a single “cheat sheet” to the final exam, if you like: a single sheet of paper with notes, front and back.
Among the topics covered after the midterm (or emphasized after the midterm), here are the ones that may appear on the final exam. If a topic was covered after the midterm but doesn’t appear here, it won’t be on the exam.
Perceptron learning algorithm
Classification perceptron
Structured perceptron
Averaged perceptron
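For concreteness, a minimal sketch of the classification perceptron with weight averaging, assuming numpy feature vectors and labels in {-1, +1} (the data format here is illustrative, not the problem-set interface):

    import numpy as np

    def averaged_perceptron(examples, n_features, n_epochs=5):
        # examples: list of (feature_vector, label) pairs, with label in {-1, +1}
        w = np.zeros(n_features)       # current weights
        w_sum = np.zeros(n_features)   # running sum of weights, for averaging
        n_seen = 0
        for _ in range(n_epochs):
            for x, y in examples:
                if y * np.dot(w, x) <= 0:   # mistake (or zero score): update
                    w = w + y * x           # the perceptron update rule
                w_sum = w_sum + w           # accumulate after every example
                n_seen += 1
        return w_sum / n_seen               # averaged weights tend to generalize better

The structured perceptron uses the same mistake-driven update, but replaces the two-class prediction with an argmax over structures (e.g., tag sequences found by Viterbi).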
Sequence tagging
Discriminative sequence models (in order of increasing richness)
ILR (independent logistic regression)
MEMM (maximum entropy Markov model)
Structured perceptron, with a first-order chain model
Features
BIO representation of spans
Decoding algorithms, and how they interact with model richness
Greedy, Viterbi, beam search
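A minimal sketch of Viterbi decoding for a first-order chain model, assuming a hypothetical local scoring function score(t, prev_tag, tag) (e.g., a structured perceptron’s weights dotted with local features). Greedy decoding would instead just take the best tag at each position given the previous choice:

    def viterbi(n, tags, score):
        # n: sentence length; tags: possible tags (e.g., BIO labels)
        # score(t, prev_tag, tag): local score at position t (hypothetical interface)
        START = "<s>"
        # best[t][y] = score of the best tag prefix ending in tag y at position t
        best = [{y: score(0, START, y) for y in tags}]
        back = [{}]                          # backpointers: best previous tag
        for t in range(1, n):
            best.append({})
            back.append({})
            for y in tags:
                prev = max(tags, key=lambda yp: best[t - 1][yp] + score(t, yp, y))
                best[t][y] = best[t - 1][prev] + score(t, prev, y)
                back[t][y] = prev
        # follow backpointers from the best final tag to recover the sequence
        y = max(tags, key=lambda yt: best[n - 1][yt])
        seq = [y]
        for t in range(n - 1, 0, -1):
            y = back[t][y]
            seq.append(y)
        return list(reversed(seq))

Viterbi is exact for a first-order model and costs O(n K^2) for K tags; beam search keeps only the top few partial sequences at each position, trading exactness for speed.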
Parsing
CFGs, PCFGs
Given a CFG and a sentence, is there a legal parse? What is one of the parses?
Given a PCFG and a parse of a sentence, what is its probability? (See the sketch below.)
Relationship between HMMs and PCFGs
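For the parse-probability question above, a minimal sketch with a toy PCFG (the grammar and numbers are made up): the probability of a tree is the product of the probabilities of the rules it uses.

    from math import prod

    # Toy PCFG: P(rule | left-hand side); probabilities for each LHS sum to 1.
    rule_prob = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("D", "N")): 0.4,  ("NP", ("N",)): 0.6,
        ("VP", ("V", "NP")): 1.0,
        ("D", ("the",)): 1.0,
        ("N", ("dog",)): 0.5,     ("N", ("cat",)): 0.5,
        ("V", ("chased",)): 1.0,
    }

    # Rules used in one parse of "the dog chased the cat" (one per tree node).
    rules_in_parse = [
        ("S", ("NP", "VP")),
        ("NP", ("D", "N")), ("D", ("the",)), ("N", ("dog",)),
        ("VP", ("V", "NP")), ("V", ("chased",)),
        ("NP", ("D", "N")), ("D", ("the",)), ("N", ("cat",)),
    ]

    # = 0.4 * 0.5 * 0.4 * 0.5 (the probability-1.0 rules drop out) = 0.04
    print(prod(rule_prob[r] for r in rules_in_parse))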
CKY algorithm
What does a span cell mean?
What is its complexity?
Be able to run the “is there a legal parse?” version of CKY (like the exercise)
Why are backpointers needed (what do they let you recover)?
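A minimal sketch of the recognizer (“is there a legal parse?”) version of CKY, assuming a grammar in Chomsky normal form given as sets of rules (an illustrative input format, not the exercise’s). Each chart cell chart[i][j] holds the set of nonterminals that can derive the span words[i:j]; storing backpointers (the split point k and the two child nonterminals) would let you recover an actual parse. The nested span and split loops give the O(n^3), times grammar size, complexity.

    def cky_recognize(words, unary_rules, binary_rules, start="S"):
        # unary_rules:  set of (A, word) for terminal rules A -> word
        # binary_rules: set of (A, B, C) for rules A -> B C
        n = len(words)
        # chart[i][j] = set of nonterminals that can derive words[i:j]
        chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            chart[i][i + 1] = {A for (A, word) in unary_rules if word == w}
        for width in range(2, n + 1):               # span width, smallest first
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):           # split point
                    for (A, B, C) in binary_rules:
                        if B in chart[i][k] and C in chart[k][j]:
                            chart[i][j].add(A)
        return start in chart[0][n]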
Head words and dependencies
What is the head word of a phrase, and why is it useful to know?
Head rules: don’t have to memorize them, but understand what typical ones look like, and how to apply them to a tree. See below.
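As a complement to the worked example at the end of this guide, a minimal sketch of applying head rules to a tree, assuming an illustrative tuple encoding of trees and a hand-written table saying which child is the head for each binary rule:

    # Which child (0 or 1) is the head, for binary rules like the example below.
    HEAD_CHILD = {
        ("S", ("NP", "VP")): 1,   # verb phrase heads the sentence
        ("VP", ("V", "NP")): 0,
        ("NP", ("NP", "PP")): 0,
        ("PP", ("P", "NP")): 0,
        ("NP", ("D", "N")): 1,
    }

    def head_word(tree):
        # tree = (label, word) for a preterminal, (label, child) for a unary rule,
        # or (label, left, right) for a binary rule
        if len(tree) == 2:
            return tree[1] if isinstance(tree[1], str) else head_word(tree[1])
        label, left, right = tree
        head = (left, right)[HEAD_CHILD[(label, (left[0], right[0]))]]
        return head_word(head)

    tree = ("S", ("NP", ("N", "Bob")),
                 ("VP", ("V", "loved"),
                        ("NP", ("NP", ("D", "the"), ("N", "show")),
                               ("PP", ("P", "on"), ("NP", ("N", "Broadway"))))))
    print(head_word(tree))   # "loved"

A word-word dependency is then an edge from the head word of the head child to the head word of each non-head sibling.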
Lexical semantics (WordNet-style)
Word senses versus word forms
Synonymy: equivalent meaning.
Hypernymy: more general meaning.
Set-theoretic interpretation. Which direction entailment goes when one word is substituted for the other (e.g., “a dog barked” entails “an animal barked,” but not the reverse).
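A minimal sketch of poking at these concepts with NLTK’s WordNet interface (assuming nltk and its wordnet data are installed; the example words are arbitrary):

    from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

    # One word form ("bank") maps to multiple senses (synsets).
    for synset in wn.synsets("bank"):
        print(synset.name(), "=", synset.definition())

    # Synonymy: lemmas sharing a synset have equivalent meaning in that sense.
    print(wn.synset("car.n.01").lemma_names())   # e.g. ['car', 'auto', 'automobile', ...]

    # Hypernymy: more general senses; substituting a hypernym preserves truth,
    # but not the other direction.
    print(wn.synset("dog.n.01").hypernyms())     # e.g. [Synset('canine.n.02'), ...]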
Lexical semantics from unlabeled data
Word-word distributional similarity: Cosine similarity and dot products on context vectors (see the sketch after this list)
Why does distributional similarity even work at all?
Hierarchical word clusters: Brown clustering
What’s useful about hierarchical clusters versus flat clusters?
Syntactic versus topical word clusters: statistics from a small context window capture syntactic properties of words, while a large window captures topical ones. This is why topic models, which (unlike HMMs) model an entire document together but ignore left-to-right sequence information, learn topical, non-syntactic word clusterings.
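A minimal sketch of cosine similarity on context-count vectors (the counts are made up): each word is represented by how often context words appear near it, and words used in similar contexts end up with similar vectors.

    import math
    from collections import Counter

    def cosine(u, v):
        # cosine similarity = dot product divided by the product of vector lengths
        dot = sum(u[k] * v[k] for k in u if k in v)
        return dot / (math.sqrt(sum(c * c for c in u.values())) *
                      math.sqrt(sum(c * c for c in v.values())))

    # Toy context-count vectors: counts of context words seen near each target word.
    cat = Counter({"purr": 4, "pet": 3, "fur": 2, "the": 10})
    dog = Counter({"bark": 5, "pet": 4, "fur": 3, "the": 12})
    car = Counter({"drive": 6, "engine": 4, "the": 11})

    print(cosine(cat, dog))   # higher: "cat" and "dog" share informative contexts
    print(cosine(cat, car))   # lower: they mostly just share "the"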
Topic model basics
Models a document as a mixture of K topics, and each topic has a unigram LM
Ignores sequence information
Trained in an unsupervised manner with EM
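A minimal sketch of what “a document is a mixture of K topics, each with a unigram LM” means, with made-up numbers (EM is what would actually fit theta and phi from unlabeled documents):

    import math

    # K = 2 toy topics, each a unigram LM (a distribution over words).
    phi = [
        {"game": 0.5, "team": 0.4, "the": 0.1},    # topic 0: sports-ish
        {"senate": 0.5, "vote": 0.4, "the": 0.1},  # topic 1: politics-ish
    ]
    theta = [0.7, 0.3]   # this document's mixture weights over the K topics

    doc = ["the", "team", "game", "vote"]   # word order is ignored (bag of words)

    # Each token: pick a topic from theta, then a word from that topic's unigram LM.
    log_lik = sum(
        math.log(sum(theta[k] * phi[k].get(w, 0.0) for k in range(len(theta))))
        for w in doc
    )
    print(log_lik)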
Precision and recall
How does a decision threshold affect P/R?
How do bias terms affect P/R?
For different types of problems, what P/R tradeoffs make sense?
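A minimal sketch of how a decision threshold trades off precision against recall, with made-up scores and labels. Raising a positive-class bias term has the same effect as lowering the threshold: more predicted positives, so recall can only go up while precision typically goes down.

    def precision_recall(scores, gold, threshold):
        # predict positive whenever score >= threshold; gold labels are 0/1
        tp = sum(1 for s, y in zip(scores, gold) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, gold) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, gold) if s < threshold and y == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        return precision, recall

    scores = [0.9, 0.8, 0.6, 0.4, 0.2]
    gold   = [1,   1,   0,   1,   0]
    print(precision_recall(scores, gold, 0.7))   # (1.0, 0.667): precise, misses one
    print(precision_recall(scores, gold, 0.3))   # (0.75, 1.0): catches all, less precise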
Coreference: everything in lecture and PS4
Head rules example
Here are some head rules that go along with a binarized CFG, where the head child phrase is starred. (The Collins head rules are basically similar, but for more complicated PTB-style grammars.)
S -> NP VP* (verb heads the sentence)
VP -> V* NP (verb is the head over a direct object)
NP -> NP* PP (prep phrase modifies the noun)
PP -> P* NP
NP -> D N* (determiner modifies the noun)
Given a specification of head rules and a parse tree, extract head words for each phrase, and word-word dependencies. Example:
(S (NP (N Bob))
(VP (V loved)
(NP (NP (D the) (N show))
(PP (P on) (NP (N Broadway))))))
With the above rules, here’s the headedness marked up on the tree with a “*”, in places where there are siblings,