Word embedding demo

UMass CS 485, 2023-11-02

Load the data. Downloaded a while ago from https://nlp.stanford.edu/projects/glove/

Sometimes the links on that page don't work. Via http://web.archive.org (highly recommended!) we found this, which appears to currently work: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip

There are some other copies on the web too.

For actual use (not just pedagogical demos), I'd recommend one of the larger embedding versions listed on that webpage.
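Here's a minimal loading sketch for the text format, assuming the 50-dimensional file from the zip above; the variable names `E`, `vocab`, and `words` are my own choices and are reused in the snippets below.

```python
import numpy as np

# Parse the GloVe text format: one word per line, followed by its vector values.
# Assumes glove.6B.50d.txt (from the zip linked above) is in the working directory.
words, vecs = [], []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        words.append(parts[0])
        vecs.append([float(x) for x in parts[1:]])

E = np.array(vecs)                           # (vocab_size, 50) embedding matrix
vocab = {w: i for i, w in enumerate(words)}  # word -> row index
print(E.shape)
```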

Look up a word in the matrix
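Using the `E`/`vocab` names from the loading sketch above:

```python
# A word's embedding is just the corresponding row of E.
v_life = E[vocab["life"]]
print(v_life[:10])   # peek at the first few dimensions
```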

If you look carefully, the values in the two vectors are loosely correlated position by position: the first value is moderately high in both, the second is slightly negative in both, the third is more negative, and so on.
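One way to eyeball this is to print the first few dimensions of two vectors side by side (the word pair here is just an illustrative choice, not necessarily the one from the demo):

```python
# First few dimensions of two word vectors, printed in parallel.
for a, b in zip(E[vocab["life"]][:5], E[vocab["death"]][:5]):
    print(f"{a: .3f}   {b: .3f}")
```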

So let's use a dot product (aka inner product) to make this comparison quantitative. Walking down the two vectors in parallel: whenever both values are strongly positive together, or strongly negative together, that position contributes a large positive term, so the inner product tends to be high.
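A quick sketch, again with my assumed names (and an arbitrary word pair):

```python
# Inner product between two word vectors.
print(np.dot(E[vocab["life"]], E[vocab["death"]]))
```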

See paper notes: cosine similarity as a normalized dot product
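Here's one way to write it; the helper name `cossim` is my own.

```python
def cossim(u, v):
    # Cosine similarity: dot product divided by the product of the vector norms.
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
```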

Sanity check: a vector's cosine similarity with itself is the maximum possible value, 1.0.
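```python
print(cossim(E[vocab["life"]], E[vocab["life"]]))   # prints ~1.0
```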

Sanity check: when the two vectors have opposite signs in many positions, they come out very dissimilar.
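The demo presumably compared two real words here; a sure-fire way to see the effect is to flip every sign of a vector by negating it:

```python
v = E[vocab["life"]]
print(cossim(v, -v))   # every sign flipped: cosine similarity of exactly -1.0
```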

Let's find the top-10 most similar words to "life". (1) calculate similarity for each word in the vocab. (2) sort in descending order, then take the first 10 of the list.

Useful trick: get a sorted list of indices with numpy.argsort(). (Or in vanilla Python: inds = sorted(range(N), key=lambda i: values[i]).) Sorting/ranking by scores, and getting the top-10, is something you do all the time when analyzing NLP systems.
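A sketch using the names defined above:

```python
# (1) cosine similarity between "life" and every row of E, in one vectorized shot
target = E[vocab["life"]]
sims = (E @ target) / (np.linalg.norm(E, axis=1) * np.linalg.norm(target))

# (2) argsort gives ascending order, so negate the scores to rank descending
top10 = np.argsort(-sims)[:10]
for i in top10:
    print(words[i], round(float(sims[i]), 3))   # the top hit is "life" itself
```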