Homework 5: Distributional semantics

This is due on 11/27 (11:55pm), submitted electronically.

How to do this problem set

Most of these questions require writing Python code and computing results; the rest have textual answers. Write all the textual answers in this document, show the output of your experiments in this document, and implement the functions in distsim.py. Once you are finished, upload this .ipynb file and distsim.py to Moodle.

  • When creating your final version of the problem set to hand in, please do a fresh restart and execute every cell in order. Then you'll be sure it's actually right. Make sure to press "Save"!

Your Name:

List collaborators, and how you collaborated, here: (see our grading and policies page for details on our collaboration policy).

  • name 1

Cosine Similarity

Recall that cosine similarity is defined as follows, where $x$ and $y$ are vectors of context counts (each for a different word), $i$ indexes over the context types, and $x_i$ is the count of context $i$.

$$cossim(x,y) = \frac{ \sum_i x_i y_i }{ \sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2} }$$

The nice thing about cosine similarity is that it is normalized: for non-negative input vectors (such as context counts), the output is always between 0 and 1. One way to see this is that cosine similarity is literally the cosine of the angle between the two vectors, which has this property when $x$ and $y$ are non-negative. Another way is to work through the situations of maximum and minimum similarity between two context vectors, starting from the definition above.

Note: a good way to understand the cosine similarity function is that the numerator measures whether $x$ and $y$ are correlated: if they tend to have high values for the same contexts, the numerator tends to be large. The denominator acts as a normalization factor: if all the values of $x$ are very large, for example, dividing by the square root of their sum of squares keeps the whole expression from growing arbitrarily large. In fact, dividing by both of these quantities (i.e., the vectors' norms) guarantees the result can never exceed 1.
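To make the definition concrete, here is a minimal worked example (not the distsim implementation you will write) that plugs two made-up three-dimensional count vectors into the formula using NumPy:

import numpy as np

# Two made-up context-count vectors over three context types.
x = np.array([2.0, 0.0, 1.0])
y = np.array([1.0, 3.0, 1.0])

# Numerator: sum_i x_i * y_i; denominator: product of the two norms.
numerator = np.dot(x, y)                                        # 3.0
denominator = np.sqrt(np.sum(x ** 2)) * np.sqrt(np.sum(y ** 2))
print("cosine similarity = %.4f" % (numerator / denominator))   # about 0.4045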

Question 1 (10 points)

See the file nytcounts.university_cat_dog, which contains context count vectors for three words: “dog”, “cat”, and “university”. These are immediate left and right contexts from a New York Times corpus. You can open the file in a text editor since it’s quite small.

Please complete cossim_sparse(v1,v2) in distsim.py to compute and display the cosine similarities between each pair of these words. Briefly comment on whether the relative similarities make sense.

Note that we’ve provided very simple code that tests the context count vectors from the data file.

In [ ]:
import distsim; reload(distsim)

word_to_ccdict = distsim.load_contexts("nytcounts.university_cat_dog")
print "Cosine similarity between cat and dog" ,distsim.cossim_sparse(word_to_ccdict['cat'],word_to_ccdict['dog'])
print "Cosine similarity between cat and university" ,distsim.cossim_sparse(word_to_ccdict['cat'],word_to_ccdict['university'])
print "Cosine similarity between university and dog" ,distsim.cossim_sparse(word_to_ccdict['university'],word_to_ccdict['dog'])

Write your response here:

Question 2 (15 points)

Implement show_nearest(). Given a dictionary of word-context vectors, the context vector of a particular query word w, a set of words to exclude from the responses (here, just the query word w itself), and the similarity metric to use (here, the cossim_sparse function you just implemented), show_nearest() finds the 20 words most similar to w. For each, display the word and its similarity to the query word w.
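One possible shape for the function, shown purely as a sketch (the names and details are up to you): score every candidate word, sort by similarity, and print the top 20.

def show_nearest_sketch(word_to_vec, query_vec, exclude, sim_fn, k=20):
    """Sketch only: score every word, sort by similarity, print the top k."""
    scored = []
    for word, vec in word_to_vec.items():
        if word in exclude:
            continue
        scored.append((sim_fn(query_vec, vec), word))
    scored.sort(reverse=True)          # highest similarity first
    for sim, word in scored[:k]:
        print("%-20s %.4f" % (word, sim))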

To make sure it’s working, feel free to use the small nytcounts.university_cat_dog database as follows.

In [ ]:
import distsim
word_to_ccdict = distsim.load_contexts("nytcounts.university_cat_dog")
distsim.show_nearest(word_to_ccdict, word_to_ccdict['dog'], set(['dog']), distsim.cossim_sparse)

Question 3 (20 points)

Explore similarities in nytcounts.4k, which contains context counts for about 4000 words in a sample of the New York Times. The news data was lowercased and URLs were removed. Context counts are provided for the 2000 most common words on Twitter, as well as for the 2000 most common words in the New York Times (but all of the counts themselves come from the New York Times). The context counts only include contexts that appeared for more than one word. The file vocab contains the list of all terms in this data, along with their total frequency. Choose six words. For each, show the output of show_nearest() and comment on whether the output makes sense. Comment on whether this approach to distributional similarity makes more or less sense for certain kinds of terms. Four of your words should be:

  • a name (for example: person, organization, or location)
  • a common noun
  • an adjective
  • a verb

You may also want to try exploring further words that are returned from a most-similar list from one of these. You can think of this as traversing the similarity graph among words.

Implementation note: On my laptop, loading the data with load_contexts() takes several hundred MB of memory. If you don't have enough memory available, your computer will slow down badly because the OS will start swapping. If you have to use a machine without that much memory, you can instead take a streaming approach with the stream_contexts() generator function, which lets you iterate through the data from disk one vector at a time without putting everything into memory; you can see its use in the loading function, and a sketch is given below. (You could also use a key-value store or another type of database, but that's more work than this assignment calls for.)
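If you take the streaming route, a sketch along these lines keeps only the current top-k matches in memory. It assumes stream_contexts() yields (word, context_count_dict) pairs; check the loading function in distsim.py for the exact signature before relying on it.

import heapq
import distsim

def nearest_streaming(filename, query_vec, exclude, sim_fn, k=20):
    """One pass over the file on disk, keeping only the top-k (similarity, word) pairs."""
    best = []  # min-heap: the smallest of the current top-k sits at best[0]
    for word, ccdict in distsim.stream_contexts(filename):
        if word in exclude:
            continue
        sim = sim_fn(query_vec, ccdict)
        if len(best) < k:
            heapq.heappush(best, (sim, word))
        elif sim > best[0][0]:
            heapq.heapreplace(best, (sim, word))
    return sorted(best, reverse=True)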

Extra note: You don't need this, but for reference, the preprocessing scripts we used to create the context data are in the preproc/ directory.

In [ ]:
import distsim; reload(distsim)
word_to_ccdict = distsim.load_contexts("nytcounts.4k")
###Provide your answer below; perhaps in another cell so you don't have to reload the data each time
In [ ]:
###Answer examples
distsim.show_nearest(word_to_ccdict, word_to_ccdict['jack'],set(['jack']),distsim.cossim_sparse)

Write your response here:

Question 4 (10 points)

In the next several questions, you'll examine similarities in trained word embeddings, instead of raw context counts.

See the file nyt_word2vec.university_cat_dog, which contains word embedding vectors pretrained by word2vec [1] for three words: “dog”, “cat”, and “university”. You can open the file in a text editor since it’s quite small.

Please complete cossim_dense(v1,v2) in distsim.py to compute and display the cosine similarities between each pair of these words.

Implementation note: Notice that the inputs to cossim_dense(v1,v2) are NumPy arrays. If you are not very familiar with basic NumPy operations, you can find examples in the basic-operations section of the quickstart guide here: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

If you know MATLAB but haven't tried NumPy before, the following link should be helpful: https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html
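For reference, the handful of NumPy operations you are likely to need here are elementwise arithmetic, np.dot, np.sum, np.sqrt, and np.linalg.norm. A few illustrative calls on made-up arrays:

import numpy as np

v1 = np.array([0.1, -0.4, 0.2])
v2 = np.array([0.3, 0.1, -0.2])

print("dot product:    %s" % np.dot(v1, v2))      # sum_i v1_i * v2_i
print("elementwise sq: %s" % (v1 ** 2))           # squares every entry
print("Euclidean norm: %s" % np.linalg.norm(v1))  # sqrt of the sum of squares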

[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013.

In [ ]:
import distsim; reload(distsim)
word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.university_cat_dog")
print "Cosine similarity between cat and dog" ,distsim.cossim_dense(word_to_vec_dict['cat'],word_to_vec_dict['dog'])
print "Cosine similarity between cat and university" ,distsim.cossim_dense(word_to_vec_dict['cat'],word_to_vec_dict['university'])
print "Cosine similarity between university and dog" ,distsim.cossim_dense(word_to_vec_dict['university'],word_to_vec_dict['dog'])

word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.university_cat_dog")
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['dog'], set(['dog']),distsim.cossim_dense)

Question 5 (25 points)

Repeat the process from Question 3, but now use the dense vectors from word2vec. Comment on whether the outputs make sense. Compare the outputs of show_nearest() on the word2vec vectors to the outputs on the sparse context vectors (we suggest using the same words as in Question 3). Which method works better on the query words you chose? Briefly explain why one method works better than the other in each case.

Note that we used gensim's default word2vec parameters to train these word embeddings.
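For context only (you do not need to retrain anything): training word2vec with gensim's defaults looks roughly like the sketch below. The exact keyword names and attribute access vary across gensim versions, and the toy sentences are made up, so treat this purely as an illustration.

from gensim.models import Word2Vec

# Toy corpus: an iterable of tokenized sentences (made up for illustration).
sentences = [["the", "dog", "barked"], ["the", "cat", "slept"]]

# Default hyperparameters, except min_count=1 so the tiny toy vocabulary is kept.
model = Word2Vec(sentences, min_count=1)

# Each word in the vocabulary now maps to a dense NumPy vector.
print(model.wv["dog"])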

In [ ]:
import distsim
word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.4k")
###Provide your answer below
In [ ]:
###Answer examples
distsim.show_nearest(word_to_vec_dict, word_to_vec_dict['jack'],set(['jack']),distsim.cossim_dense)

Write your response here:

Question 7 (15 points)

Once you have word embeddings, one of the interesting things you can do is perform analogical reasoning tasks. In the following example, we provide code that finds the words closest to the vector $v_{king}-v_{man}+v_{woman}$ in order to fill in the blank in the question:

king : man = __ : woman

Notice that word2vec is trained in an unsupervised manner; it is impressive that it can apparently perform an interesting type of reasoning. (For a contrary opinion, see Linzen 2016.)

Please come up with another analogical reasoning task (another triple of words), and output the answer using the same method. Comment on whether the output makes sense. If it does, explain why an unsupervised algorithm can capture this kind of relation between words: where does the information come from? If it does not make sense, propose an explanation for why the algorithm fails in this case.

In [ ]:
import distsim
king = word_to_vec_dict['king']
man = word_to_vec_dict['man']
woman = word_to_vec_dict['woman']
distsim.show_nearest(word_to_vec_dict,
                     king-man+woman,
                     set(['king','man','woman']),
                     distsim.cossim_dense)
###Provide your answer below

Write your response here:

Extra credit (up to 5 points)

Analyze word similarities with WordNet, and compare and contrast them with the distributional similarity results. For a fair comparison, limit yourself to words in the nytcounts.4k vocabulary. First, calculate how many of those words are present in WordNet, at least according to the lookup method you use. (One complication is that WordNet lookups require a part of speech, but our data doesn't include one; you'll have to devise a workaround.)

Second, for the words you analyzed with distributional similarity above, do the same with WordNet-based similarity as implemented in NLTK (as described in the NLTK documentation, or search for "nltk wordnet similarity"). For a fair comparison, do the nearest-similarity ranking over the words in the nytcounts.4k vocabulary. You may use path_similarity or any of the other similarity methods (e.g., res_similarity for Resnik similarity, which is one of the better ones). Describe what you are doing. Compare and contrast the words you get. Does WordNet give similar or very different results? Why?
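A sketch of how a WordNet lookup with NLTK might go, assuming the WordNet corpus has been downloaded (nltk.download('wordnet')) and taking each word's first noun synset as one crude workaround for the missing part-of-speech information:

from nltk.corpus import wordnet as wn

def first_noun_synset(word):
    """Crude POS workaround: use the word's first noun synset, if it has one."""
    synsets = wn.synsets(word, pos=wn.NOUN)
    return synsets[0] if synsets else None

s_dog = first_noun_synset("dog")
s_cat = first_noun_synset("cat")
if s_dog and s_cat:
    print("path_similarity(dog, cat) = %s" % s_dog.path_similarity(s_cat))

Note that Resnik similarity (res_similarity) additionally requires an information-content corpus, available through nltk.corpus.wordnet_ic.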

In [ ]:
 

Extra credit (up to 5 points)

Investigate a few of the alternative methods described in Linzen 2016 on the man/woman/king/queen analogy and on your new example. What does this tell you about the legitimacy of analogical reasoning tasks? How do you assess Linzen's arguments?

In [ ]: