[Intro to NLP, CMPSCI 585, Fall 2014]


Distributional semantics

Please submit a zip file with both your writeup and your code.

This is based on parts of David’s 11/25 lecture. Feel free to also read internet resources on cosine similarity: for example, this chapter from Manning et al.’s IR textbook, or Wikipedia, or this blogpost by some guy, etc.

This is due on 12/5.

Starter code: ps5.zip

Cosine similarity

Recall that cosine similarity is defined as follows, where \(x\) and \(y\) are both vectors of context counts, \(i\) indexes over the context types, and \(x_i\) is the count of context \(i\):

\[\mathrm{cossim}(x,y) = \frac{ \sum_i x_i y_i }{ \sqrt{\sum_i x_i^2} \, \sqrt{\sum_i y_i^2} }\]
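For concreteness, here is a tiny made-up example (these numbers are invented for illustration; they are not from the provided data):

\[\mathrm{cossim}\big((2,1,0),(1,3,0)\big) = \frac{2 \cdot 1 + 1 \cdot 3 + 0 \cdot 0}{\sqrt{2^2+1^2+0^2}\,\sqrt{1^2+3^2+0^2}} = \frac{5}{\sqrt{5}\sqrt{10}} \approx 0.707\]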

The nice thing about cosine similarity is that it is normalized: for count vectors, whose entries are never negative, the output is always between 0 and 1. One way to see this is that cosine similarity really is the cosine of the angle between the two vectors, and the cosine has this property for vectors with non-negative entries. Another way is to work through the situations of maximum and minimum similarity between two context vectors, starting from the definition above.

Note: a good way to understand the cosine similarity function is that the numerator cares about whether the \(x\) and \(y\) vectors are correlated: if \(x\) and \(y\) tend to have high values for the same contexts, the numerator tends to be big. The denominator can be thought of as a normalization factor: if all the values of \(x\) are really large, for example, dividing by the square root of their sum of squares prevents the whole thing from getting arbitrarily large. In fact, dividing by both of these quantities (aka the vectors’ norms) means the whole thing can never exceed 1; this is exactly the Cauchy–Schwarz inequality.
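To make this concrete, here is a minimal Python sketch of the computation (this is illustrative, not the starter code; it assumes each context vector is represented as a dict mapping context strings to counts):

    import math

    def cossim(x, y):
        """Cosine similarity between two sparse count vectors (dicts: context -> count)."""
        # Numerator: only contexts present in both vectors contribute;
        # a context missing from either dict counts as 0.
        dot = sum(x[c] * y.get(c, 0) for c in x)
        # Denominator: product of the two Euclidean norms.
        norm_x = math.sqrt(sum(v * v for v in x.values()))
        norm_y = math.sqrt(sum(v * v for v in y.values()))
        return dot / (norm_x * norm_y)

    # The toy vectors from the worked example above; prints roughly 0.707.
    print(cossim({"a": 2, "b": 1}, {"a": 1, "b": 3}))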

Question 1: If the input vectors \(x\) and \(y\) are identical, what value is the cosine similarity? Your answer should be a single number. Please show how you derived this, starting from the definition above (this will be very short). This is the maximum possible value of cosine similarity.

Question 2: Say that there are no contexts in common: for every \(i\), whenever \(x_i>0\), it’s the case that \(y_i=0\), and whenever \(y_i>0\), then \(x_i=0\). In other words, if the words are “A” and “B”, then every context you see at least once for A never appears for B (and vice versa). In this case, what value is the cosine similarity? Explain why this follows from the definition above. This is the minimum possible value of cosine similarity.

Question 3: We have provided a file containing context count vectors for three words: “dog”, “cat”, and “cloud”. These are immediate left and right contexts from a small part of the unlabeled tweets corpus we’ve provided you. This is in the file cloud_cat_dog.from_first_100k_tweets.txt. You can open the file in a text editor since it’s quite small.

Please compute the cosine similarities between each pair of words (using Python code) and include the similarities in your writeup. For each pair of words, print out the word pair and their cosine similarity.

Please implement this in the script cossim.py. We’ve provided very simple code that loads the context count vectors from the data file.
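For what it’s worth, here is one hedged sketch of how the rest of cossim.py might be organized, using the cossim function sketched earlier. Treat the vectors argument as a placeholder for whatever the provided loading code gives you; its exact format may differ:

    import itertools

    def print_pairwise_sims(vectors):
        """Print the cosine similarity for every unordered pair of words.

        vectors: dict mapping each word to its context count vector
        (itself a dict: context -> count), e.g. as loaded by the starter code.
        """
        for w1, w2 in itertools.combinations(sorted(vectors), 2):
            print(w1, w2, cossim(vectors[w1], vectors[w2]))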

Note on how we made the context vectors

You don’t need to do this as part of your problem set, but if you like, take a look at allcounts.py, which we ran on the unlabeled tweets to create the context file. We ran it only on the first 100,000 tweets, in which these three words appeared only 202, 158, and 24 times, respectively. If you run the script on more tweets, you get bigger (and better) context vectors. You could try using cosine similarities as features in your NER system: for example, compute the cosine similarity to “bieber” or another celebrity name; or take the max among the cosine similarities against multiple names; or use contexts but not cosine similarity somehow, like taking high-frequency or highly discriminative contexts and using their presence as features … whatever you want. (A rough sketch of the context-counting idea is below.)
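In case it helps, here is a rough sketch of the context-counting idea. This is not allcounts.py itself, just an illustration; the real script’s tokenization, context encoding, and output format may well differ, and the "L="/"R=" prefixes are an invented convention here for keeping left and right neighbors distinct:

    from collections import Counter

    def context_counts(tokenized_tweets, targets):
        """Count the immediate left/right neighbors of each target word.

        tokenized_tweets: iterable of token lists, one per tweet.
        targets: set of words to collect context vectors for.
        Returns a dict {word: Counter of context -> count}.
        """
        counts = {w: Counter() for w in targets}
        for tokens in tokenized_tweets:
            for i, tok in enumerate(tokens):
                if tok in counts:
                    if i > 0:
                        counts[tok]["L=" + tokens[i - 1]] += 1
                    if i + 1 < len(tokens):
                        counts[tok]["R=" + tokens[i + 1]] += 1
        return counts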