
Count-Based Word Vectors

If you want to dive deeper into the theory behind NLP, a great resource is Stanford's course CS224n: Natural Language Processing with Deep Learning. Most of their material is online, and the 2019 version also has links to lecture videos.

Here, we concentrate on the practical side of NLP.

Let us first look at

import nltk
nltk.download('reuters')

In NLTK, you can find so-called corpora, which are often used for NLP tasks. We use the Reuters corpus (feel free to download it and have a look at it).
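To have a first look at the corpus from Python, something like the following should work once the download above has finished (the printed slices are arbitrary choices):

```python
from nltk.corpus import reuters

print(len(reuters.fileids()))                     # number of documents
print(reuters.categories()[:10])                  # some of the topic categories
print(reuters.words(reuters.fileids()[0])[:20])   # first words of one article
```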

Being able to create one's own corpus is a valuable skill. Here is a short writeup on web scraping that will help you develop this skill.

Co-Occurrence

Code:

  • Read a corpus (text).
  • Keep only distinct words (remove duplicates) and sort. (A sketch of both steps follows below.)
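A minimal sketch of these two steps, using the Reuters corpus from above. The function names and the restriction to a single category (and the lowercasing) are our own assumptions, not fixed by the exercise:

```python
from nltk.corpus import reuters

def read_corpus(category="crude"):
    # One list of lowercased words per document in the given category.
    return [[w.lower() for w in reuters.words(f)]
            for f in reuters.fileids(categories=category)]

def distinct_words(corpus):
    # Flatten the corpus, remove duplicates, and sort.
    words = sorted({w for doc in corpus for w in doc})
    return words, len(words)
```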

Exercise: Print some example words. How many words are there?

Code:

  • compute_co_occurrence_matrix returns a co-occurrence matrix and a dictionary mapping each word to its row/column index (see the sketch below).
  • If M is the co-occurrence matrix and D the dictionary, then M[D["word1"], D["word2"]] counts how often the two words co-occur in a window of the given size (4 by default).
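A possible implementation, building on distinct_words from above. The details, such as how the window is clipped at document boundaries, are assumptions:

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    # M[i, j] counts how often words i and j appear within
    # window_size tokens of each other; D maps words to indices.
    words, num_words = distinct_words(corpus)
    D = {w: i for i, w in enumerate(words)}
    M = np.zeros((num_words, num_words))
    for doc in corpus:
        for i, w in enumerate(doc):
            # Window around position i, clipped at the document boundaries.
            lo, hi = max(0, i - window_size), min(len(doc), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    M[D[w], D[doc[j]]] += 1
    return M, D
```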

Exercise: Find words that co-occur with each other. For example:

  • What is the largest co-occurrence count you can find?
  • Which words co-occur with "energy"?
  • What is the largest co-occurrence count between two nouns you can find?
  • Is co-occurrence symmetric?
  • Do words co-occur with themselves?
  • What happens if a word is not in the dictionary?
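To get started on these questions, queries like the following should work. The word "energy" is from the exercise; "oil" is our own pick and only an example:

```python
M, D = compute_co_occurrence_matrix(read_corpus(), window_size=4)

print(M[D["oil"], D["energy"]])  # how often do "oil" and "energy" co-occur?
print(np.allclose(M, M.T))       # is the matrix symmetric?
print(M[D["oil"], D["oil"]])     # do words co-occur with themselves?
# A word that is not in the dictionary raises a KeyError:
# M[D["notaword"], D["energy"]]
```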

Dimensionality Reduction

We will explain SVD/LSA in class. Roughly, given an $n \times n$-matrix $M$, we want to decompose it as

$$M \approx U S V^\top,$$

where $S$ is a $k \times k$-matrix and $k \ll n$. In our example $n$ is in the thousands and $k$ can be as small as $2$.

Code: reduce_to_k_dim turns an $n \times n$-matrix $M$ into an $n \times k$-matrix $US$.
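One way to implement this is with scikit-learn's TruncatedSVD, whose fit_transform returns exactly $US$; a sketch under that assumption:

```python
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    # Truncated SVD keeps only the top k singular values;
    # fit_transform(M) returns U*S, one k-dimensional row per word.
    svd = TruncatedSVD(n_components=k, n_iter=10)
    return svd.fit_transform(M)
```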

To compute the cosine similarity of two vectors a, b, use

np.dot(a,b)/np.linalg.norm(a)/np.linalg.norm(b)

Exercise: This continues the exercise about the co-occurrence matrix. Compute the similarity of various pairs of words.
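Putting the pieces together, a small helper like this should do (the word pair is again only an example):

```python
M_reduced = reduce_to_k_dim(M, k=2)

def similarity(w1, w2):
    # Cosine similarity of the reduced vectors of two words.
    a, b = M_reduced[D[w1]], M_reduced[D[w2]]
    return np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)

print(similarity("oil", "energy"))
```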

Co-Occurrence Plot Analysis

Exercise: Make a list of words that form two (or three) distinct clusters in the 2-dimensional plot. Hint: Download the corpus and have a look at the texts from which the words are taken.
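To produce such a plot, a matplotlib sketch along these lines should work. The word list is purely an example; replace it with words that actually occur in your dictionary D:

```python
import matplotlib.pyplot as plt

words_to_plot = ["oil", "energy", "petroleum", "wheat", "grain", "corn"]  # example words only
xs = [M_reduced[D[w], 0] for w in words_to_plot]
ys = [M_reduced[D[w], 1] for w in words_to_plot]

plt.scatter(xs, ys, marker="x")
for w, x, y in zip(words_to_plot, xs, ys):
    plt.annotate(w, (x, y))   # label each point with its word
plt.show()
```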