Homework 5: Lexicons and Distributional Semantics

This is due on Friday, 11/10 (11pm)

How to do this problem set

Most of these questions require writing Python code and computing results; the rest have textual answers. Write all textual answers in this document, show the output of your experiments in this document, and implement the functions in the Python files.

Submit a PDF of this .ipynb to Gradescope, and the .ipynb and all Python files to Moodle.

The assignment has two parts:

  • In the first part, you will experiment with Turney's method for finding word polarities in a Twitter dataset, given some positive and negative seed words.
  • In the second part, you will experiment with distributional and vector semantics.

Your Name: <>

List collaborators, and how you collaborated, here: (see our grading and policies page for details on our collaboration policy).

  • name 1

Part 1: Lexicon semantics

Recall that the PMI of a pair of words is defined as:

$$PMI(x, y) = \log\frac{P(x, y)}{P(x)\,P(y)}$$

The Turney method defines a word's polarity as:

$$Polarity(word) = PMI(word, positive\_word) - PMI(word, negative\_word)$$

where the joint probability $P(w, v)$, or more specifically $P(w\ NEAR\ v)$, is the probability of the two words occurring "near" each other. We'll work with tweets, so here it means: if you choose a tweet at random, what is the chance it contains both $w$ and $v$?

(If you look at the Turney method as explained in the SLP3 chapter, the "hits" function is a count of web pages that contain at least one instance of the two words occurring near each other.)

The positive_word and negative_word terms are initially constructed by hand. For example, we might start with a single positive word ('excellent') and a single negative word ('bad'). We could also use a list of positive words ('excellent', 'perfect', 'love', ...) and a list of negative words ('bad', 'hate', 'filthy', ...).

If we're using a seed list of multiple terms, just combine them into a single symbol: e.g., rewrite all the positive seed words to POS_WORD (and similarly for NEG_WORD). Then $P(w, POS\_WORD)$ effectively means the co-occurrence of $w$ with any of the terms in the list.
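For concreteness, here is a minimal sketch of how the polarity scores could be computed from per-tweet co-occurrence counts. This is not the provided starter code; the function name `polarity_scores` and the smoothing constant are illustrative assumptions that you should revisit in your own solution.

```python
import math
from collections import Counter

POS_SEEDS = {"good", "nice", "love", "excellent", "fortunate", "correct", "superior"}
NEG_SEEDS = {"bad", "nasty", "poor", "hate", "unfortunate", "wrong", "inferior"}

def polarity_scores(tweets):
    """tweets: list of token lists. Returns {word: polarity score} (rough sketch)."""
    n_tweets = len(tweets)
    word_count = Counter()   # number of tweets containing the word
    pos_count = Counter()    # tweets containing the word AND some positive seed
    neg_count = Counter()    # tweets containing the word AND some negative seed
    n_pos = n_neg = 0        # tweets containing any positive / negative seed

    for tokens in tweets:
        words = set(tokens)
        has_pos = bool(words & POS_SEEDS)
        has_neg = bool(words & NEG_SEEDS)
        n_pos += has_pos
        n_neg += has_neg
        for w in words - POS_SEEDS - NEG_SEEDS:   # skip the seed words themselves
            word_count[w] += 1
            pos_count[w] += has_pos
            neg_count[w] += has_neg

    eps = 0.5  # illustrative smoothing to avoid log(0); justify your own choice
    scores = {}
    for w, c in word_count.items():
        pmi_pos = math.log((pos_count[w] + eps) * n_tweets / (c * n_pos))
        pmi_neg = math.log((neg_count[w] + eps) * n_tweets / (c * n_neg))
        scores[w] = pmi_pos - pmi_neg
    return scores
```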

For this assignment, we will use a Twitter dataset of 349,378 tweets, in the file named tweets.txt. These are tweets from one day, filtered such that each tweet contains at least one of the seed words we've selected.

Question 1 (15 points)

The file tweets.txt contains 349,378 tweets, one tweet per line. It is a random sample of public tweets, which we tokenized with twokenize.py's tokenizeRawTweetText(). The text you see has a space between each token, so you can just use .split() if you want. We also filtered the tweets to those that include at least one term from one of these seed lists:

  • Positive seed list: ["good", "nice", "love", "excellent", "fortunate", "correct", "superior"]
  • Negative seed list: ["bad", "nasty", "poor", "hate", "unfortunate", "wrong", "inferior"]

Each tweet contains at least one positive or negative seed word. Take a look at the file (e.g., with `less` and `grep`). Implement Turney's method to calculate polarity scores for all words.

Some things to keep in mind:

  • Ignore the seed words (i.e. don't calculate the polarity of the seed words).
  • You may want to ignore the polarity of words beginning with @ or #.

We recommend that you write code in a python file, but it's up to you.

QUESTION: You'll have to do something to handle zeros in the PMI equation. Please explain and justify your decision about this.

textual answer here

Question 2 (5 points)

Print the top 50 most-positive words (i.e., inferred positive words) and the top 50 most-negative words.

Many of the words won't make sense. Comment on at least two that do make sense, and at least two that don't. Why do you think this happens with this dataset and method?

In [2]:
# Write code to print words here

Textual answer here.

Question 3 (5 points)

Now filter out all words with total count < 500, then print the top 50 and bottom 50 polarity words.

From both sets of 50 words above, choose some that, in your judgment, make sense. Again, note that you will find many words which don't make sense. Do you think these results are better than the results you got in Question 1? Explain why.

In [4]:
# Write code to print words here

Textual answer here.

Question 4 (5 points)

Even after filtering out words with count < 500, many of the highest- and lowest-polarity words don't make sense. Identify what kinds of words these are and what could be done to filter them out. You can read some tweets in the file to see what's happening.

Textual answer here.

Part 2: Distributional Semantics

Cosine Similarity

Recall that cosine similarity is defined as follows, where $i$ indexes over the context types, and $x$ and $y$ are both vectors of context counts (each for a different word), with $x_i$ the count of context $i$:

$$cossim(x,y) = \frac{ \sum_i x_i y_i }{ \sqrt{\sum_i x_i^2} \sqrt{\sum_i y_i^2} }$$

The nice thing about cosine similarity is that it is normalized: no matter what the input vectors are, the output is between 0 and 1. One way to see this is that cosine similarity is just the cosine of the angle between the two vectors, which has this property for non-negative $x$ and $y$. Another way is to work through the situations of maximum and minimum similarity between two context vectors, starting from the definition above.

Note: a good way to understand the cosine similarity function is that the numerator cares about whether the $x$ and $y$ vectors are correlated. If $x$ and $y$ tend to have high values for the same contexts, the numerator tends to be big. The denominator can be thought of as a normalization factor: if all the values of $x$ are really large, for example, dividing by the square root of their sum-of-squares prevents the whole thing from getting arbitrarily large. In fact, dividing by both these things (aka their norms) means the whole thing can’t go higher than 1.

In this problem we'll work with vectors of raw context counts. (As you know, this is not necessarily the best representation.)
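As a concrete reference, here is a minimal sketch of cosine similarity over context-count dictionaries. The function name `cossim_dicts`, and the assumption that each vector is a dict mapping context strings to counts, are ours rather than part of distsim.py.

```python
import math

def cossim_dicts(x, y):
    """Cosine similarity between two context-count dictionaries
    (keys are contexts, values are counts)."""
    # Dot product over the contexts the two words share.
    dot = sum(cnt * y[ctx] for ctx, cnt in x.items() if ctx in y)
    norm_x = math.sqrt(sum(cnt * cnt for cnt in x.values()))
    norm_y = math.sqrt(sum(cnt * cnt for cnt in y.values()))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)
```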

Question 5 (5 points)

See the file nytcounts.university_cat_dog, which contains context count vectors for three words: “dog”, “cat”, and “university”. These are immediate left and right contexts from a New York Times corpus. You can open the file in a text editor since it’s quite small.

Write a function which takes the context-count dictionaries of two words and calculates the cosine similarity between them. The function should return a number between 0 and 1. Briefly comment on whether the relative similarities make sense.

In [14]:
import distsim; reload(distsim)

word_to_ccdict = distsim.load_contexts("nytcounts.university_cat_dog")

# write code here to show output (i.e. cosine similarity between these words.)
# We encourage you to write other functions in distsim.py itself. 
file nytcounts.university_cat_dog has contexts for 3 words

Write your response here:

Question 6 (20 points)

Explore similarities in nytcounts.4k, which contains context counts for about 4000 words in a sample of New York Times articles. The news data was lowercased and URLs were removed. The context counts are for the 2000 most common words in Twitter, plus the 2000 most common words in the New York Times. (All context counts are from the New York Times.) The context counts only contain contexts that appeared for more than one word. The file has three tab-separated fields: the word, its count, and a JSON-encoded dictionary of its context counts. You'll see it's just counts of the immediate left/right neighbors.

Choose six words. For each, show the 20 nearest words (using cosine similarity as the similarity metric). Comment on whether the output makes sense. Comment on whether this approach to distributional similarity makes more or less sense for certain terms. Four of your six words should be:

  • a name (for example: person, organization, or location)
  • a common noun
  • an adjective
  • a verb

You may also want to try exploring further words that are returned from a most-similar list from one of these. You can think of this as traversing the similarity graph among words.
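A minimal sketch of a nearest-words query is below. It assumes the `cossim_dicts` sketch from Question 5 is available and that `word_to_ccdict` is the dictionary returned by distsim.load_contexts; the helper name `nearest_words` is our own, not part of the starter code.

```python
def nearest_words(word_to_ccdict, query, k=20):
    """Return the k words whose context vectors are most similar to the query's."""
    query_vec = word_to_ccdict[query]
    scored = [(cossim_dicts(query_vec, vec), w)
              for w, vec in word_to_ccdict.items() if w != query]
    scored.sort(reverse=True)   # highest similarity first
    return scored[:k]

# e.g. nearest_words(word_to_ccdict, "university")
```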

Implementation note: On my laptop, loading the data with the load_contexts() function takes several hundred MB of memory. If you don't have enough memory available, your computer will get very slow because the OS will start swapping. If you have to use a machine without that much memory available, you can instead implement a streaming approach using the stream_contexts() generator function to access the data; this lets you iterate through the data from disk, one vector at a time, without putting everything into memory. You can see its use in the loading function. (You could also use a key-value or other type of database, but that's too much work for this assignment.)
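If memory is tight, the same query can be done in a single streaming pass. This sketch reuses `cossim_dicts` from above and assumes stream_contexts() yields (word, count, context_dict) tuples, matching how it is used in the loading function; check its actual definition in distsim.py before relying on it.

```python
import heapq
import distsim

def nearest_words_streaming(filename, query_vec, k=20):
    """Stream through the context-count file, keeping only the top-k matches."""
    heap = []  # min-heap of (similarity, word), capped at size k
    for word, count, ccdict in distsim.stream_contexts(filename):
        sim = cossim_dicts(query_vec, ccdict)
        if len(heap) < k:
            heapq.heappush(heap, (sim, word))
        else:
            heapq.heappushpop(heap, (sim, word))  # drop the current worst match
    return sorted(heap, reverse=True)
```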

In [ ]:
import distsim; reload(distsim)
word_to_ccdict = distsim.load_contexts("nytcounts.4k")
###Provide your answer below; perhaps in another cell so you don't have to reload the data each time

Question 7 (10 points)

In the next several questions, you'll examine similarities in trained word embeddings, instead of raw context counts.

See the file nyt_word2vec.university_cat_dog, which contains word embedding vectors pretrained by word2vec [1] for three words: “dog”, “cat”, and “university”, from the same corpus. You can open the file in a text editor since it’s quite small.

Write a function which takes the word embedding vectors of two words and calculates the cosine similarity between them. The function should return a number between -1 and 1. Briefly comment on whether the relative similarities make sense.

Implementation note: Notice that the inputs to this function are numpy arrays (v1 and v2). If you are not very familiar with basic numpy operations, you can find examples in the basic operations section here: https://docs.scipy.org/doc/numpy-dev/user/quickstart.html

If you know how to use Matlab but haven't tried numpy before, the following link should be helpful: https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html

[1] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS 2013.
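For reference, a minimal numpy version of cosine similarity; the function name `cossim_dense` is our own, and v1 and v2 are the numpy arrays mentioned in the implementation note above.

```python
import numpy as np

def cossim_dense(v1, v2):
    """Cosine similarity between two dense embedding vectors; result is in [-1, 1]."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```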

In [16]:
import distsim; reload(distsim)

word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.university_cat_dog")

# write code here to show output (i.e. cosine similarity between these words.)
# We encourage you to write other functions in distsim.py itself. 

Write your response here:

Question 8 (20 points)

Repeat the process you did in Question 6, but now use the dense vectors from word2vec. Comment on whether the outputs make sense. Compare the nearest-word outputs from word2vec with those from the sparse context vectors (so we suggest you use the same words as in Question 6). Which method works better on the query words you chose? Please briefly explain why one method works better than the other in each case.

Note: we used the default parameters of word2vec in gensim to get the word2vec word embeddings.

In [1]:
import distsim
word_to_vec_dict = distsim.load_word2vec("nyt_word2vec.4k")
###Provide your answer below

Question 9 (15 points)

An interesting thing to try with word embeddings is analogical reasoning tasks. The example below is intended to solve the analogy question "king is to man as what is to woman?", or in SAT-style notation,

king : man :: __ : woman

Some research has proposed using additive operations on word embeddings to solve the analogy: take the vector $(v_{king}-v_{man}+v_{woman})$ and find the most-similar word to it. One way to explain this idea: if you take "king", remove the attributes/contexts it shares with "man", and add in the attributes/contexts of "woman", you will hopefully arrive at a point in the space that has king-like attributes, but with the "man" ones replaced by "woman" ones.
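A minimal sketch of this analogy query, reusing `cossim_dense` from the Question 7 sketch; the helper name `analogy` and the convention of excluding the three query words from the ranking are our own choices, not part of the starter code.

```python
def analogy(word_to_vec_dict, a, b, c, k=20):
    """Solve a : b :: ? : c by ranking words near (v_a - v_b + v_c)."""
    target = word_to_vec_dict[a] - word_to_vec_dict[b] + word_to_vec_dict[c]
    scored = [(cossim_dense(target, vec), w)
              for w, vec in word_to_vec_dict.items() if w not in (a, b, c)]
    scored.sort(reverse=True)
    return scored[:k]

# e.g. analogy(word_to_vec_dict, "king", "man", "woman")
```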

Show the output of the 20 closest words you get by trying to solve that analogy with this method. Did it work?

Please come up with another analogical reasoning task (another triple of words) and output the answer using the same method. Comment on whether the output makes sense. If the output makes sense, explain why we can capture such a relation between words using an unsupervised algorithm: where does the information come from? If the output does not make sense, propose an explanation for why the algorithm fails in this case.

Note that word2vec is trained in an unsupervised manner, just with distributional statistics; it is interesting that it can apparently do any reasoning at all. For a critical view, see Linzen (2016).

In [10]:
# Write code to show output here.

Textual answer here.