---
tags: NLP, 298, copyright haub and kurz
---
# Count-Based Word Vectors
If you want to dive deeper into the theory behind NLP, a great resource is Stanford's course [CS224n: Natural Language Processing with Deep Learning](https://web.stanford.edu/class/cs224n/). They have most of their material online. The [2019](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/) version also has links to videos.
Here, we concentrate on the practical side of NLP.
Let us first look at
```python
import nltk
nltk.download('reuters')
```
In NLTK you can find so-called [corpora](https://www.nltk.org/nltk_data/), which are often used for NLP tasks. We use the Reuters corpus (feel free to download it and look at it).
Being able to create one's own corpus is a valuable skill. Here is a short [writeup on webscraping](https://hackmd.io/ZY7oAUiuRVqONEoiYmhEmw?view) that will aid in developing this skill.
## Co-Occurrence
**Code:**
- Read a corpus (text).
- Keep only distinct words (remove duplicates) and sort them (a sketch of both steps follows below).
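A minimal sketch of these two steps, assuming the NLTK Reuters corpus; the helper names `read_corpus` and `distinct_words` and the category `"crude"` are illustrative choices, and the notebook's own code may differ:
```python
from nltk.corpus import reuters

def read_corpus(category="crude"):
    """Read all documents of one Reuters category as lists of lowercased tokens."""
    return [[w.lower() for w in reuters.words(f)] for f in reuters.fileids(category)]

def distinct_words(corpus):
    """Return the sorted list of distinct words in the corpus."""
    return sorted({w for doc in corpus for w in doc})
```
With this, `len(distinct_words(read_corpus()))` answers the count question below.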
**Exercise:** Print some example words. How many words are there?
**Code:**
- `compute_co_occurrence_matrix` returns a co-occurrence matrix and a dictionary (one possible implementation is sketched after this list).
- If `M` is the co-occurrence matrix and `D` the dictionary, then `M[D["word1"], D["word2"]]` counts how often the two words co-occur in a window of the given size ($4$ by default).
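One possible implementation, assuming the `distinct_words` helper sketched above (the notebook's version may differ in details):
```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """Count how often each pair of words occurs within `window_size`
    positions of each other, over all documents in the corpus."""
    words = distinct_words(corpus)
    word2ind = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for doc in corpus:
        for i, center in enumerate(doc):
            # Look at all positions within the window around the center word.
            for j in range(max(0, i - window_size), min(len(doc), i + window_size + 1)):
                if j != i:
                    M[word2ind[center], word2ind[doc[j]]] += 1
    return M, word2ind
```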
**Exercise:** Find words that co-occur with each other (the snippet after this list shows how to run such queries). For example:
- What is the largest co-occurrence count you can find?
- Which words co-occur with "energy"?
- What is the largest co-occurrence count between two nouns you can find?
- Is co-occurrence symmetric?
- Do words co-occur with themselves?
- What happens if a word is not in the dictionary?
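For instance, queries like the following can get you started (the word pair is illustrative; pick words that actually occur in your corpus):
```python
M, D = compute_co_occurrence_matrix(corpus)
print(M[D["energy"], D["oil"]])  # count for one illustrative pair
print(np.allclose(M, M.T))       # a quick symmetry check
# D["notaword"] raises a KeyError if the word is not in the dictionary
```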
## Dimensionality Reduction
We will explain SVD/LSA in class. Roughly, given an $n\times n$-matrix $M$, we want to decompose it as $$M\approx USV,$$
where $U$ is an $n\times k$-matrix, $S$ is a diagonal $k\times k$-matrix, $V$ is a $k\times n$-matrix, and $k\ll n$. In our example $n$ is in the thousands and $k$ can be as small as $2$.
**Code:** `reduce_to_k_dim` turns an $n\times n$-matrix $M$ into an $n\times k$-matrix $US$; a sketch follows.
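A minimal sketch, assuming scikit-learn's `TruncatedSVD` (its `fit_transform` returns exactly the rows of $US$; the choice `n_iter=10` is illustrative):
```python
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """Reduce the n x n matrix M to an n x k matrix US via truncated SVD."""
    return TruncatedSVD(n_components=k, n_iter=10).fit_transform(M)
```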
<!--**Exercise (optional):** ...-->
To compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of two vectors `a,b` use
```python
import numpy as np
np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)
```
**Exercise:** This continues the exercise about the co-occurrence matrix. Compute the similarity of various pairs of words.
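Putting the pieces together, a sketch using the helpers from above (the word pair is again illustrative):
```python
def cosine_similarity(a, b):
    """Cosine of the angle between two word vectors."""
    return np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)

M_reduced = reduce_to_k_dim(M, k=2)
print(cosine_similarity(M_reduced[D["oil"]], M_reduced[D["energy"]]))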
## Co-Occurrence Plot Analysis
**Exercise:** Make a list of words that form two (or three) distinct clusters in the 2-dimensional plot. Hint: [Download the corpus](http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html) and have a look at the texts from which the words are taken.
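A minimal plotting sketch using matplotlib, assuming `M_reduced` and `D` from above; the word list is purely illustrative, so choose words that occur in your corpus:
```python
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, word2ind, words):
    """Scatter-plot the 2-dimensional vectors of the given words, with labels."""
    for w in words:
        x, y = M_reduced[word2ind[w]]
        plt.scatter(x, y, marker="x", color="red")
        plt.annotate(w, (x, y))
    plt.show()

plot_embeddings(M_reduced, D, ["oil", "petroleum", "barrels", "output", "kuwait"])
```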