Word embedding demo

UMass CS 490A, 2021-10-28

Load the data. Downloaded a year or two ago from https://nlp.stanford.edu/projects/glove/

2021-11-02: it appears the links on that webpage don't work, which was also an issue during the 10/28 lecture. In OH today we looked at http://web.archive.org (highly recommended!) and ended up finding this link, which currently appears to work: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip

There are some other copies on the web too.

For actual use (not just pedagogy), I'd recommend one of the larger embedding versions listed on that webpage.
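Here's a minimal loading sketch, assuming the 50-dimensional file from that zip (glove.6B.50d.txt) and illustrative variable names (not necessarily the ones used in lecture):

```python
import numpy as np

# Each line of the GloVe file is: word, then 50 space-separated floats.
words, rows = [], []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        rows.append(np.array(parts[1:], dtype=np.float32))

emb = np.vstack(rows)                            # vocab-size x 50 embedding matrix
word_index = {w: i for i, w in enumerate(words)} # word -> row index
print(emb.shape)                                 # should be (400000, 50) for glove.6B
```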

Look up a word in the matrix
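A quick sketch of the lookup, reusing the emb matrix and word_index dict from the loading sketch above; "dog" and "cat" here are placeholder example words, not necessarily the ones from lecture:

```python
# Fetch the embedding row for each word and peek at the first few dimensions.
v1 = emb[word_index["dog"]]
v2 = emb[word_index["cat"]]
print(v1[:5])
print(v2[:5])
```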

If you look carefully, the values in the two vectors are somewhat correlated: the first value is moderately high in both, the second is slightly negative in both, the third is more negative, etc.

So let's use a dot product (a.k.a. inner product) to make this comparison quantitative. Walking down the two vectors in parallel: when corresponding values are both highly positive, or both highly negative, the inner product will tend to be high.
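For example, with numpy (continuing with the example vectors above):

```python
# Raw (unnormalized) dot product as a similarity score.
print(np.dot(v1, v2))   # higher when the two vectors point in similar directions
print(np.dot(v1, v1))   # a vector against itself
```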

See paper notes: cosine similarity as a normalized dot product
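As a sketch: cosine similarity is the dot product divided by the product of the two vectors' lengths (Euclidean norms), so the score always lands between -1 and 1.

```python
def cossim(u, v):
    # dot product, normalized by both vectors' Euclidean norms
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```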

Sanity check: cosine similarity of a vector with itself gives the maximum possible value (1.0).
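For example:

```python
# A vector compared with itself: 1.0, up to floating-point rounding.
print(cossim(v1, v1))
```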

Sanity check: some opposite signs => very dissimilar.
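For example, negating a vector flips every sign:

```python
# A vector compared with its negation: -1.0, the minimum possible value.
print(cossim(v1, -v1))
```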

Let's find the top-10 most similar words to "ketchup": (1) calculate the similarity for each word in the vocab, (2) sort in descending order, then take the first 10 of the list.
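A sketch of that, reusing cossim and the embedding matrix from above. (A plain Python loop over 400k words is slow; a single matrix-vector product would be faster, but this keeps it simple.)

```python
# Score every vocabulary word against "ketchup", sort descending, keep the top 10.
# The top hit will be "ketchup" itself, with similarity 1.0.
query = emb[word_index["ketchup"]]
sims = [(cossim(query, emb[i]), w) for i, w in enumerate(words)]
sims.sort(reverse=True)
for score, word in sims[:10]:
    print(f"{score:.3f}  {word}")
```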

Question: what about the most DISSIMILAR words? Those don't turn out to be meaningful. This is typical in these spaces: similarities can say something interesting, but dissimilarities don't get you much.