Distributional semantics

Please submit a zip file with both your writeup and your code.

Please do not submit the data files.

This assignment focuses on experimenting with cosine similarity. You may find this chapter from Manning et al.’s IR textbook helpful, or Wikipedia.

This is due on Sunday, 12/13 (any time that day, or that night if necessary).

Starter code and context data: hw5.zip

Cosine similarity

Recall that cosine similarity is defined as follows, where \(x\) and \(y\) are both vectors of context counts, \(i\) indexes over the context types, and \(x_i\) is the count of context \(i\).

\[cossim(x,y) = \frac{ \sum_i x_i y_i }{ \sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2} }\]

The nice thing about cosine similarity is that it is normalized: for non-negative count vectors like these, the output is always between 0 and 1. One way to think of this is that cosine similarity is literally the cosine of the angle between the two vectors, and the angle between non-negative vectors can only range from 0 to 90 degrees. Another way is to work through the situations of maximum and minimum similarity between two context vectors, starting from the definition above.

Note: a good way to understand the cosine similarity function is that the numerator cares about whether the \(x\) and \(y\) vectors are correlated. If \(x\) and \(y\) tend to have high values for the same contexts, the numerator tends to be big. The denominator can be thought of as a normalization factor: if all the values of \(x\) are really large, for example, dividing by the square root of their sum-of-squares prevents the whole thing from getting arbitrarily large. In fact, dividing by both these things (aka their norms) means the whole thing can’t go higher than 1.
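
For concreteness, here is a minimal sketch of what cossim() might look like if each context vector is stored as a dictionary (or collections.Counter) mapping a context string to its count; that representation is an assumption about the starter code, so adapt it to whatever distsim.py actually uses.

    import math

    def cossim(x, y):
        # Dot product over the contexts the two vectors share; iterating over
        # the smaller dict keeps this cheap for sparse count vectors.
        if len(x) > len(y):
            x, y = y, x
        dot = sum(count * y.get(ctx, 0) for ctx, count in x.items())
        # Euclidean norms of each vector.
        norm_x = math.sqrt(sum(c * c for c in x.values()))
        norm_y = math.sqrt(sum(c * c for c in y.values()))
        if norm_x == 0 or norm_y == 0:
            return 0.0  # convention for an all-zero vector
        return dot / (norm_x * norm_y)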

Question 1: (5 points) If the input vectors \(x\) and \(y\) are identical, what value is the cosine similarity? Your answer should be a single number. Please show how you derived this, starting from the definition above. (This will be a very short derivation.) This is the maximum possible value of cosine similarity.

Question 2: (5 points) Say that there are no contexts in common: for every \(i\), whenever \(x_i>0\), it’s the case that \(y_i=0\), and when \(y_i>0\), then \(x_i=0\). In other words, say the words are “A” and “B”; this means, every context you see at least once for A never appears for B. In this case, what value is the cosine similarity? Explain why this follows from the definition above. This is the minimum possible value of cosine similarity (assuming \(x\) and \(y\) are non-negative).

Question 3: (10 points) See the file twcounts.cloud_cat_dog, which contains context count vectors for three words: “dog”, “cat”, and “cloud”. These are immediate left and right contexts from a Twitter corpus. You can open the file in a text editor, since it’s quite small.

Please complete cossim() and cloud_cat_dog() in distsim.py to compute and display the cosine similarities between each pair of these words. Include the similarities in your writeup. For each pair of words, print out the word pair and their cosine similarity. Comment on whether the relative similarities make sense.

We’ve provided very simple code that loads the context count vectors from the data file.
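
In case it helps to see the overall shape, here is one possible sketch of cloud_cat_dog(), assuming load_contexts() returns a dictionary mapping each word to its context-count dictionary (an assumption about the starter code; adjust the names to match distsim.py):

    from itertools import combinations

    def cloud_cat_dog():
        word_to_ccdict = load_contexts("twcounts.cloud_cat_dog")
        # Print the cosine similarity for every unordered pair of words.
        for w1, w2 in combinations(sorted(word_to_ccdict), 2):
            print(w1, w2, cossim(word_to_ccdict[w1], word_to_ccdict[w2]))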

Question 4: (10 points) Implement show_nearest(). Given a dictionary of word-context vectors and a particular query word w, it finds the 20 words most similar to w. For each, display that word and its similarity to the query word w.
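
One way to sketch show_nearest(), under the same dict-of-dicts assumption as above (word_to_ccdict maps each word to its context-count dictionary; names other than show_nearest are illustrative):

    def show_nearest(word_to_ccdict, w, num_neighbors=20):
        w_vec = word_to_ccdict[w]
        sims = []
        for other, vec in word_to_ccdict.items():
            if other == w:
                continue  # skip the query word itself
            sims.append((cossim(w_vec, vec), other))
        # Sort by similarity, highest first, and show the top neighbors.
        for sim, other in sorted(sims, reverse=True)[:num_neighbors]:
            print("%.4f  %s" % (sim, other))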

To make sure it’s working, feel free to use the small twcounts.cloud_cat_dog database.

Question 5: (20 points) Explore similarities in twcounts.4k, which contains context counts for about 4000 words in a sample of 1 million English-language tweets from May 2014. (This is a very small corpus relative to what people usually use for this.) The Twitter data was lowercased, and URLs and usernames were removed. The context counts are for the 2000 most common words on Twitter, as well as the 2000 most common words in the New York Times. (But all context counts are from Twitter.) The context counts only contain contexts that appeared for more than one word. The file vocab contains the list of all terms in this data, along with their total frequencies.

Choose at least six words. For each, show the output of show_nearest() and comment on whether the output makes sense. Comment on whether this approach to distributional similarity makes more or less sense for certain terms.

Four of your words should be:

You may also want to explore further words that appear in the most-similar list for one of these; you can think of this as traversing the similarity graph among words (as in the example below).
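
For example, an interactive session might look like the following; the particular query words here are only illustrations and are not guaranteed to be in the vocabulary:

    word_to_ccdict = load_contexts("twcounts.4k")
    show_nearest(word_to_ccdict, "monday")   # pick a starting word
    show_nearest(word_to_ccdict, "friday")   # then follow one of its neighbors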

Implementation note: You should use the interactive interpreter for this, because it might take some time to load the context counts file. On my laptop, loading it with the load_contexts() function takes about 500 MB of memory. If you don’t have enough memory available, your computer will get very slow because the OS will start swapping. If you have to use a machine without that much memory available, you can instead take a streaming approach, using the stream_contexts() generator function to access the data; this lets you iterate through the data from disk, one vector at a time, without putting everything into memory. You can see its use in the loading function, and a sketch appears below. (Alternatively, you could use a key-value store or another type of database, but that’s more work than this assignment requires.)
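
Here is a rough streaming sketch, assuming stream_contexts(filename) yields (word, context-count dict) pairs one at a time (check its use in the loading function for the actual interface):

    import heapq

    def show_nearest_streaming(filename, w, num_neighbors=20):
        # First pass: find the query word's context vector.
        w_vec = None
        for word, vec in stream_contexts(filename):
            if word == w:
                w_vec = vec
                break
        if w_vec is None:
            raise KeyError("query word not found: %s" % w)
        # Second pass: keep only the top-k most similar words in a small heap,
        # so only one other context vector is in memory at a time.
        best = []
        for word, vec in stream_contexts(filename):
            if word == w:
                continue
            heapq.heappush(best, (cossim(w_vec, vec), word))
            if len(best) > num_neighbors:
                heapq.heappop(best)
        for sim, word in sorted(best, reverse=True):
            print("%.4f  %s" % (sim, word))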

Extra note: You don’t need this, but for reference, the preprocessing scripts we used to create the context data are in the preproc/ directory.

Extra credit: (up to 10 points) Analyze word similarities with WordNet, and compare and contrast against the distributional similarity results.

For a fair comparison, limit yourself to words in the twcounts.4k vocabulary. First, calculate how many of the words are present in WordNet, at least according to the lookup method you’re using. (There is an issue that WordNet requires a part of speech for lookup, but we don’t have part-of-speech tags in our data; you’ll have to devise a solution.) Second, for the words you analyzed with distributional similarity above, now do the same with WordNet-based similarity as implemented in NLTK (search for “nltk wordnet similarity”). For a fair comparison, do the nearest-similarity ranking among the words in the twcounts.4k vocabulary. You may use path_similarity or any of the other similarity methods (e.g., res_similarity). Describe what you are doing. Compare and contrast the words you get. Does WordNet give similar or very different results? Why?
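
As one possible starting point, here is a rough NLTK sketch; it sidesteps the part-of-speech issue by taking the best similarity over all synset pairs, which is only one of several reasonable choices, and the helper names are illustrative:

    from nltk.corpus import wordnet as wn

    def wn_similarity(w1, w2):
        # Best path similarity over all synset pairs; None if either word
        # is missing from WordNet or no pair of synsets is comparable.
        best = None
        for s1 in wn.synsets(w1):
            for s2 in wn.synsets(w2):
                sim = s1.path_similarity(s2)
                if sim is not None and (best is None or sim > best):
                    best = sim
        return best

    def wn_show_nearest(vocab, w, num_neighbors=20):
        sims = [(wn_similarity(w, other), other) for other in vocab if other != w]
        sims = [(s, o) for s, o in sims if s is not None]
        for sim, other in sorted(sims, reverse=True)[:num_neighbors]:
            print("%.4f  %s" % (sim, other))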