If you want to dive deeper into the theory behind NLP, a great resource is Stanford's course CS224n: Natural Language Processing with Deep Learning. They have most of their material online. The 2019 version also has links to videos.
Here, we concentrate on the practical side of NLP.
Let us first download the Reuters corpus:
nltk.download('reuters')
In NLTK, you can find so-called corpora, which are often used for NLP tasks. We use the Reuters corpus (feel free to download it and have a look at it).
Being able to create one's own corpus is a valuable skill. Here is a short writeup on web scraping that will help you develop this skill.
Code:
Exercise: Print some example words. How many words are there?
Code:
compute_co_occurrence_matrix
returns a co-occurrence matrix and a dictionary. If M
is the co-occurrence matrix and D
the dictionary, then M[D["word1"], D["word2"]]
counts how often the two words co-occur within a window of the given size. (Exercise: Find words that co-occur with each other. For example:
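The function itself is provided in the course material; the following is a simplified re-implementation to make the data layout concrete (the variable names and the toy corpus are mine):

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """corpus: a list of documents, each a list of tokens.
    Returns (M, D): M[i, j] counts how often words i and j co-occur
    within window_size tokens of each other; D maps a word to its row."""
    vocab = sorted({w for doc in corpus for w in doc})
    D = {w: i for i, w in enumerate(vocab)}
    M = np.zeros((len(vocab), len(vocab)))
    for doc in corpus:
        for i, w in enumerate(doc):
            # look at all positions within the window around position i
            lo, hi = max(0, i - window_size), min(len(doc), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    M[D[w], D[doc[j]]] += 1
    return M, D

# Example on two tiny "documents":
M, D = compute_co_occurrence_matrix(
    [["all", "that", "glitters"], ["all", "is", "well"]], window_size=1)
print(M[D["all"], D["that"]])  # 1.0: adjacent once in the first document
```

Note that M is symmetric: whenever word1 appears in word2's window, word2 also appears in word1's window.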
We will explain SVD/LSA in class. Roughly, given an n x m matrix M, the singular value decomposition factors it as M = U S V^T,
where U and V have orthonormal columns and S is diagonal with the singular values on its diagonal; keeping only the k largest singular values gives the best rank-k approximation of M.
Code: reduce_to_k_dim
turns an n x m matrix into an n x k matrix by projecting each row onto the top k singular directions.
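A minimal version of such a function, sketched with NumPy's SVD (the course code may instead use a truncated solver such as scikit-learn's TruncatedSVD, which is faster for large sparse matrices):

```python
import numpy as np

def reduce_to_k_dim(M, k=2):
    """Project the rows of an n x m matrix M onto the top-k singular
    directions, returning an n x k matrix of low-dimensional vectors."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * S[:k]  # each row is one k-dimensional word vector

M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0]])
M_reduced = reduce_to_k_dim(M, k=2)
print(M_reduced.shape)  # (3, 2)
```

Identical rows of M (here, rows 0 and 2) map to identical low-dimensional vectors, which is exactly what we want for words with identical co-occurrence patterns.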
To compute the cosine similarity of two vectors a and b, use
np.dot(a,b)/np.linalg.norm(a)/np.linalg.norm(b)
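For example (the vectors here are made up; in the exercise they would be rows of the reduced matrix):

```python
import numpy as np

def cosine_similarity(a, b):
    # close to 1 for (near-)parallel vectors, 0 for orthogonal ones
    return np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)

a = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(a, 2 * a))                                    # parallel
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal
```

Cosine similarity ignores vector length and compares only direction, which is why it is preferred over Euclidean distance for word vectors.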
Exercise: This continues the exercise about the co-occurrence matrix. Compute the similarity of various pairs of words.
Exercise: Make a list of words that forms two (or three) distinct clusters in the 2-dimensional plot. Hint: Download the corpus and have a look at the texts from which the words are taken.
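A plotting sketch for this exercise. The 2-d coordinates below are made-up placeholders; in the exercise they would come from reduce_to_k_dim applied to the co-occurrence matrix, with the rows looked up via the dictionary D:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line in a notebook
import matplotlib.pyplot as plt

# Placeholder 2-d word vectors (invented coordinates for illustration only).
vectors = {"oil": (2.1, 0.3), "barrel": (2.0, 0.5), "crude": (2.3, 0.4),
           "wheat": (-1.8, 1.9), "grain": (-2.0, 2.1), "corn": (-1.7, 2.2)}

fig, ax = plt.subplots()
for word, (x, y) in vectors.items():
    ax.scatter(x, y)
    ax.annotate(word, (x, y))  # label each point with its word
fig.savefig("clusters.png")
```

If your chosen words come from clearly different topics in the corpus, they should land in visibly separated regions of the plot.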