Topics from the whole semester are eligible to appear on the final exam. We will tend to focus on material covered in problem sets, exercises, Piazza discussions, or lecture.
You may bring a single “cheat sheet” to the final exam, if you like: a single sheet of paper with notes, front and back.
Among the topics covered after the midterm (or emphasized after the midterm), here are the ones that may appear on the final exam. If a topic was covered after the midterm but doesn’t appear here, it won’t be on the exam.
Perceptron learning algorithm
Classification perceptron
Structured perceptron
Averaged perceptron
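For concreteness, a minimal sketch of the classification perceptron with weight averaging, assuming numpy feature vectors and labels in {-1, +1} (the data format here is illustrative, not the problem-set interface):

    import numpy as np

    def averaged_perceptron(examples, n_features, n_epochs=5):
        # examples: list of (feature_vector, label) pairs, with label in {-1, +1}
        w = np.zeros(n_features)       # current weights
        w_sum = np.zeros(n_features)   # running sum of weights, for averaging
        n_seen = 0
        for _ in range(n_epochs):
            for x, y in examples:
                if y * np.dot(w, x) <= 0:   # mistake (or zero score): update
                    w = w + y * x           # the perceptron update rule
                w_sum = w_sum + w           # accumulate after every example
                n_seen += 1
        return w_sum / n_seen               # averaged weights tend to generalize better

The structured perceptron uses the same mistake-driven update, but replaces the two-class prediction with an argmax over structures (e.g., tag sequences found by Viterbi).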
Sequence tagging
Discriminative sequence models (in order of increasing richness)
ILR (independent logistic regression)
MEMM (maximum entropy Markov model)
Structured perceptron, with a first-order chain model
Features
BIO representation of spans
Decoding algorithms, and how they interact with model richness
Greedy, Viterbi, beam search
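A minimal sketch of Viterbi decoding for a first-order chain model, assuming a hypothetical local scoring function score(t, prev_tag, tag) (e.g., a structured perceptron’s weights dotted with local features). Greedy decoding would instead just take the best tag at each position given the previous choice:

    def viterbi(n, tags, score):
        # n: sentence length; tags: possible tags (e.g., BIO labels)
        # score(t, prev_tag, tag): local score at position t (hypothetical interface)
        START = "<s>"
        # best[t][y] = score of the best tag prefix ending in tag y at position t
        best = [{y: score(0, START, y) for y in tags}]
        back = [{}]                          # backpointers: best previous tag
        for t in range(1, n):
            best.append({})
            back.append({})
            for y in tags:
                prev = max(tags, key=lambda yp: best[t - 1][yp] + score(t, yp, y))
                best[t][y] = best[t - 1][prev] + score(t, prev, y)
                back[t][y] = prev
        # follow backpointers from the best final tag to recover the sequence
        y = max(tags, key=lambda yt: best[n - 1][yt])
        seq = [y]
        for t in range(n - 1, 0, -1):
            y = back[t][y]
            seq.append(y)
        return list(reversed(seq))

Viterbi is exact for a first-order model and costs O(n K^2) for K tags; beam search keeps only the top few partial sequences at each position, trading exactness for speed.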
Parsing
CFGs, PCFGs
Given a CFG and a sentence, is there a legal parse? What is one of the parses?
Given a PCFG and a parse of a sentence, what is its probability? (See the sketch below.)
Relationship between HMMs and PCFGs
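For the parse-probability question above, a minimal sketch with a toy PCFG (the grammar and numbers are made up): the probability of a tree is the product of the probabilities of the rules it uses.

    from math import prod

    # Toy PCFG: P(rule | left-hand side); probabilities for each LHS sum to 1.
    rule_prob = {
        ("S", ("NP", "VP")): 1.0,
        ("NP", ("D", "N")): 0.4,  ("NP", ("N",)): 0.6,
        ("VP", ("V", "NP")): 1.0,
        ("D", ("the",)): 1.0,
        ("N", ("dog",)): 0.5,     ("N", ("cat",)): 0.5,
        ("V", ("chased",)): 1.0,
    }

    # Rules used in one parse of "the dog chased the cat" (one per tree node).
    rules_in_parse = [
        ("S", ("NP", "VP")),
        ("NP", ("D", "N")), ("D", ("the",)), ("N", ("dog",)),
        ("VP", ("V", "NP")), ("V", ("chased",)),
        ("NP", ("D", "N")), ("D", ("the",)), ("N", ("cat",)),
    ]

    # = 0.4 * 0.5 * 0.4 * 0.5 (the probability-1.0 rules drop out) = 0.04
    print(prod(rule_prob[r] for r in rules_in_parse))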
CKY algorithm
What does a span cell mean?
What is its complexity?
Be able to run the “is there a legal parse?” version of CKY (like the exercise)
Why are backpointers needed (what do they let you recover)?
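A minimal sketch of the recognizer (“is there a legal parse?”) version of CKY, assuming a grammar in Chomsky normal form given as sets of rules (an illustrative input format, not the exercise’s). Each chart cell chart[i][j] holds the set of nonterminals that can derive the span words[i:j]; storing backpointers (the split point k and the two child nonterminals) would let you recover an actual parse. The nested span and split loops give the O(n^3), times grammar size, complexity.

    def cky_recognize(words, unary_rules, binary_rules, start="S"):
        # unary_rules:  set of (A, word) for terminal rules A -> word
        # binary_rules: set of (A, B, C) for rules A -> B C
        n = len(words)
        # chart[i][j] = set of nonterminals that can derive words[i:j]
        chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
        for i, w in enumerate(words):
            chart[i][i + 1] = {A for (A, word) in unary_rules if word == w}
        for width in range(2, n + 1):               # span width, smallest first
            for i in range(n - width + 1):
                j = i + width
                for k in range(i + 1, j):           # split point
                    for (A, B, C) in binary_rules:
                        if B in chart[i][k] and C in chart[k][j]:
                            chart[i][j].add(A)
        return start in chart[0][n]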
Head words and dependencies
What is the head word of a phrase, and why is it useful to know?
Head rules: don’t have to memorize them, but understand what typical ones look like, and how to apply them to a tree. See below.
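As a complement to the worked example at the end of this guide, a minimal sketch of applying head rules to a tree, assuming an illustrative tuple encoding of trees and a hand-written table saying which child is the head for each binary rule:

    # Which child (0 or 1) is the head, for binary rules like the example below.
    HEAD_CHILD = {
        ("S", ("NP", "VP")): 1,   # verb phrase heads the sentence
        ("VP", ("V", "NP")): 0,
        ("NP", ("NP", "PP")): 0,
        ("PP", ("P", "NP")): 0,
        ("NP", ("D", "N")): 1,
    }

    def head_word(tree):
        # tree = (label, word) for a preterminal, (label, child) for a unary rule,
        # or (label, left, right) for a binary rule
        if len(tree) == 2:
            return tree[1] if isinstance(tree[1], str) else head_word(tree[1])
        label, left, right = tree
        head = (left, right)[HEAD_CHILD[(label, (left[0], right[0]))]]
        return head_word(head)

    tree = ("S", ("NP", ("N", "Bob")),
                 ("VP", ("V", "loved"),
                        ("NP", ("NP", ("D", "the"), ("N", "show")),
                               ("PP", ("P", "on"), ("NP", ("N", "Broadway"))))))
    print(head_word(tree))   # "loved"

A word-word dependency is then an edge from the head word of the head child to the head word of each non-head sibling.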
Lexical semantics (WordNet-style)
Word senses versus word forms
Synonymy: equivalent meaning.
Hypernymy: more general meaning.
Set-theoretic interpretation. Which direction entailment goes when one word is substituted for the other (e.g., “a dog barked” entails “an animal barked,” but not the reverse).
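A minimal sketch of poking at these concepts with NLTK’s WordNet interface (assuming nltk and its wordnet data are installed; the example words are arbitrary):

    from nltk.corpus import wordnet as wn   # requires nltk.download("wordnet") once

    # One word form ("bank") maps to multiple senses (synsets).
    for synset in wn.synsets("bank"):
        print(synset.name(), "=", synset.definition())

    # Synonymy: lemmas sharing a synset have equivalent meaning in that sense.
    print(wn.synset("car.n.01").lemma_names())   # e.g. ['car', 'auto', 'automobile', ...]

    # Hypernymy: more general senses; substituting a hypernym preserves truth,
    # but not the other direction.
    print(wn.synset("dog.n.01").hypernyms())     # e.g. [Synset('canine.n.02'), ...]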
Lexical semantics from unlabeled data
Word-word distributional similarity: Cosine similarity and dot products on context vectors (see the sketch after this list)
Why does distributional similarity even work at all?
Hierarchical word clusters: Brown clustering
What’s useful about hierarchical clusters versus flat clusters?
Syntactic versus topical word clusters: statistics from a small context window capture syntactic properties of words, while a large window captures topical ones. This is why topic models, which (unlike HMMs) model an entire document together but ignore left-to-right sequence information, learn topical, non-syntactic word clusterings.
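A minimal sketch of cosine similarity on context-count vectors (the counts are made up): each word is represented by how often context words appear near it, and words used in similar contexts end up with similar vectors.

    import math
    from collections import Counter

    def cosine(u, v):
        # cosine similarity = dot product divided by the product of vector lengths
        dot = sum(u[k] * v[k] for k in u if k in v)
        return dot / (math.sqrt(sum(c * c for c in u.values())) *
                      math.sqrt(sum(c * c for c in v.values())))

    # Toy context-count vectors: counts of context words seen near each target word.
    cat = Counter({"purr": 4, "pet": 3, "fur": 2, "the": 10})
    dog = Counter({"bark": 5, "pet": 4, "fur": 3, "the": 12})
    car = Counter({"drive": 6, "engine": 4, "the": 11})

    print(cosine(cat, dog))   # higher: "cat" and "dog" share informative contexts
    print(cosine(cat, car))   # lower: they mostly just share "the"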
Topic model basics
Models a document as a mixture of K topics, and each topic has a unigram LM
Ignores sequence information
Trained in an unsupervised manner with EM
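A minimal sketch of what “a document is a mixture of K topics, each with a unigram LM” means, with made-up numbers (EM is what would actually fit theta and phi from unlabeled documents):

    import math

    # K = 2 toy topics, each a unigram LM (a distribution over words).
    phi = [
        {"game": 0.5, "team": 0.4, "the": 0.1},    # topic 0: sports-ish
        {"senate": 0.5, "vote": 0.4, "the": 0.1},  # topic 1: politics-ish
    ]
    theta = [0.7, 0.3]   # this document's mixture weights over the K topics

    doc = ["the", "team", "game", "vote"]   # word order is ignored (bag of words)

    # Each token: pick a topic from theta, then a word from that topic's unigram LM.
    log_lik = sum(
        math.log(sum(theta[k] * phi[k].get(w, 0.0) for k in range(len(theta))))
        for w in doc
    )
    print(log_lik)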
Precision and recall
How does a decision threshold affect P/R?
How do bias terms affect P/R?
For different types of problems, what P/R tradeoffs make sense?
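A minimal sketch of how a decision threshold trades off precision against recall, with made-up scores and labels. Raising a positive-class bias term has the same effect as lowering the threshold: more predicted positives, so recall can only go up while precision typically goes down.

    def precision_recall(scores, gold, threshold):
        # predict positive whenever score >= threshold; gold labels are 0/1
        tp = sum(1 for s, y in zip(scores, gold) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, gold) if s >= threshold and y == 0)
        fn = sum(1 for s, y in zip(scores, gold) if s < threshold and y == 1)
        precision = tp / (tp + fp) if tp + fp else 1.0
        recall = tp / (tp + fn) if tp + fn else 1.0
        return precision, recall

    scores = [0.9, 0.8, 0.6, 0.4, 0.2]
    gold   = [1,   1,   0,   1,   0]
    print(precision_recall(scores, gold, 0.7))   # (1.0, 0.667): precise, misses one
    print(precision_recall(scores, gold, 0.3))   # (0.75, 1.0): catches all, less precise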
Coreference: everything in lecture and PS4
Head rules example
Here are some head rules that go along with a binarized CFG, where the head child phrase is starred. (The Collins head rules are basically similar, but for more complicated PTB-style grammars.)
S -> NP VP* (verb heads the sentence)
VP -> V* NP (verb is the head over a direct object)
NP -> NP* PP (prep phrase modifies the noun)
PP -> P* NP
NP -> D N* (determiner modifies the noun)
Given a specification of head rules and a parse tree, extract head words for each phrase, and word-word dependencies. Example:
(S (NP (N Bob))
(VP (V loved)
(NP (NP (D the) (N show))
(PP (P on) (NP (N Broadway))))))
With the above rules, here’s the headedness marked up on the tree with a “*”, in places where there are siblings,