---
tags: NLP, 298, copyright haub and kurz
---

# Count-Based Word Vectors

If you want to dive deeper into the theory behind NLP, a great resource is Stanford's course [CS224n: Natural Language Processing with Deep Learning](https://web.stanford.edu/class/cs224n/). They have most of their material online. The [2019](https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/) version also has links to videos. Here, we concentrate on the practical side of NLP.

Let us first look at

```python
import nltk

nltk.download('reuters')
```

At NLTK, you can find so-called [corpora](https://www.nltk.org/nltk_data/) which are often used for NLP tasks. We use the Reuters corpus (feel free to download it and look at it).

Being able to create one's own corpus is a valuable skill. Here is a short [writeup on web scraping](https://hackmd.io/ZY7oAUiuRVqONEoiYmhEmw?view) that will aid in developing this skill.

## Co-Occurrence

**Code:**

- Read a corpus (text).
- Keep only distinct words (remove duplicates) and sort them.

(A possible implementation is sketched under "Code Sketches" at the end of this note.)

**Exercise:** Print some example words. How many words are there?

**Code:**

- `compute_co_occurrence_matrix` returns a co-occurrence matrix and a dictionary.
- If `M` is the co-occurrence matrix and `D` the dictionary, then `M[D["word1"], D["word2"]]` counts how often the two words co-occur in a window of the given size ($4$ by default). A sketch of this function also appears at the end of this note.

**Exercise:** Find words that co-occur with each other. For example:

- What is the largest co-occurrence count you can find?
- Which words co-occur with "energy"?
- What is the largest co-occurrence count between two nouns you can find?
- Is co-occurrence symmetric?
- Do words co-occur with themselves?
- What happens if a word is not in the dictionary?

## Dimensionality Reduction

We will explain SVD/LSA in class. Roughly, given an $n\times n$-matrix $M$, we want to decompose it as
$$M\approx USV,$$
where $S$ is a $k\times k$-matrix and $k\ll n$. In our example $n$ is in the thousands and $k$ can be as small as $2$.

**Code:** `reduce_to_k_dim` turns an $n\times n$-matrix $M$ into an $n\times k$-matrix $US$ (see the sketch at the end of this note).

<!--**Exercise (optional):** ...-->

To compute the [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity) of two vectors `a, b`, use

```python
import numpy as np

np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)
```

**Exercise:** This continues the exercise about the co-occurrence matrix. Compute the similarity of various pairs of words.

## Co-Occurrence Plot Analysis

**Exercise:** Make a list of words that forms two (or three) distinct clusters in the 2-dimensional plot. Hint: [Download the corpus](http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html) and have a look at the texts from which the words are taken.
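
## Code Sketches

The sketches below show one possible way to fill in the steps described above. They are not the official solutions; names like `read_corpus` and the choice of the `"crude"` category are our own assumptions.

First, reading the Reuters corpus and extracting the sorted list of distinct words:

```python
import nltk
from nltk.corpus import reuters

nltk.download('reuters')  # no-op if the corpus is already present

def read_corpus(category="crude"):
    """Return all documents of one Reuters category as lists of lowercase words."""
    files = reuters.fileids(category)
    return [[w.lower() for w in reuters.words(f)] for f in files]

def distinct_words(corpus):
    """Return the sorted list of distinct words across all documents."""
    return sorted({w for doc in corpus for w in doc})

corpus = read_corpus()
words = distinct_words(corpus)
print(words[:10])  # some example words
print(len(words))  # how many words there are
```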
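
Next, a sketch of what `compute_co_occurrence_matrix` might look like, assuming a symmetric window of `window_size` words on each side of the center word:

```python
import numpy as np

def compute_co_occurrence_matrix(corpus, window_size=4):
    """Return the co-occurrence matrix M and the dictionary word -> row index."""
    words = distinct_words(corpus)
    D = {w: i for i, w in enumerate(words)}
    M = np.zeros((len(words), len(words)))
    for doc in corpus:
        for i, word in enumerate(doc):
            # count every word within window_size positions of the center word
            lo = max(0, i - window_size)
            hi = min(len(doc), i + window_size + 1)
            for j in range(lo, hi):
                if j != i:
                    M[D[word], D[doc[j]]] += 1
    return M, D

M, D = compute_co_occurrence_matrix(corpus)
print(M[D["energy"], D["oil"]])  # assumes both words occur in the corpus
```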
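
For `reduce_to_k_dim`, one convenient option is scikit-learn's `TruncatedSVD`, whose `fit_transform` returns exactly the product $US$; other SVD routines would work as well:

```python
from sklearn.decomposition import TruncatedSVD

def reduce_to_k_dim(M, k=2):
    """Reduce the n x n matrix M to an n x k matrix of word vectors (US)."""
    svd = TruncatedSVD(n_components=k, n_iter=10)
    return svd.fit_transform(M)  # rows are the k-dimensional word vectors

M_reduced = reduce_to_k_dim(M, k=2)
```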
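
The cosine-similarity formula from above can be wrapped into a small helper that works on the reduced word vectors (the word pair is only an example):

```python
import numpy as np

def similarity(word1, word2, M_reduced, D):
    """Cosine similarity of two word vectors in the reduced space."""
    a = M_reduced[D[word1]]
    b = M_reduced[D[word2]]
    return np.dot(a, b) / np.linalg.norm(a) / np.linalg.norm(b)

print(similarity("energy", "oil", M_reduced, D))
```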
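
Finally, a sketch for the 2-dimensional plot used in the plot-analysis exercise, assuming `k=2` was used above; the word list here is only illustrative, and finding words that actually cluster is the point of the exercise:

```python
import matplotlib.pyplot as plt

def plot_embeddings(M_reduced, D, words):
    """Scatter-plot the 2-dimensional vectors of the given words."""
    for w in words:
        x, y = M_reduced[D[w]]
        plt.scatter(x, y, marker="x", color="red")
        plt.annotate(w, (x, y))
    plt.show()

plot_embeddings(M_reduced, D, ["energy", "oil", "gas", "industry", "output"])
```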