# Week 4 Assignment: Find Collocations with Skip-grams
Date: 2020/3/24
==Setup:==
All required files are on [Google Drive](https://drive.google.com/drive/folders/13Vup8nGtJWa6Ln6jd9oNXhNfuSTDQWW9?fbclid=IwAR12PWFZY4Lm-StOOoywrtIzLs0ETmZNH19GDepVdDM-U9yo5NtOUsKPsHs). There are 6 files:
1. Collocation.ipynb (template)
2. lang8.pos.m2.txt (dataset)
3. problem_word_list.txt
4. rules.tsv <font color='red'>(Optional, for grade B)</font>
5. Collocation-POS.ipynb <font color='red'>(Optional, template for grade B)</font>
6. linggle_api_new.py <font color='red'>(Optional, for grade C)</font>
You may use an IDE such as Jupyter Notebook to run the files.
Both .ipynb and .py submissions are acceptable.
==Description:==
In this assignment, you will be provided with the dataset **lang8.pos.m2.txt**, which contains grammatically incorrect sentences and their corrections. Each entry looks like:
```
S His//DET Kanji//PROPN 's//PART ability//NOUN is//AUX much//ADV better//ADJ than//SCONJ me//PRON .//PUNCT
A 2 3|||U:NOUN:POSS||||||REQUIRED|||-NONE-|||0
A 8 9|||R:PRON|||mine|||REQUIRED|||-NONE-|||0
```
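Each `A` line encodes a token span, an error type, and a correction, separated by `|||`. As a rough illustration (not the template's actual parser), one entry could be split like this:

```python
# A sketch of splitting one m2 entry into its sentence and annotation lines.
# The entry below is shortened from the example above.
entry = [
    'S I//PRON feel//VERB relax//VERB',
    'A 1 2|||R:VERB:TENSE|||felt|||REQUIRED|||-NONE-|||0',
]
sentence = entry[0][2:].split()          # drop the leading 'S '
edits = []
for line in entry[1:]:
    fields = line[2:].split('|||')       # drop the leading 'A '
    start, end = map(int, fields[0].split())
    edits.append({'span': (start, end), 'type': fields[1], 'correction': fields[2]})
print(edits[0])
```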
From this data, we will extract valid collocations using **Smadja's algorithm**.
**What is skip-gram?**
> skip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.
Reference: [Wikipedia: n-gram](https://en.wikipedia.org/wiki/N-gram#Skip-gram)
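To make the definition concrete, here is a minimal sketch (the function name is our own) that enumerates skip-bigrams, i.e. word pairs together with their signed distance, within a fixed window:

```python
def skip_bigrams(tokens, maxdist=2):
    """Enumerate (word, neighbor, signed distance) pairs within +-maxdist."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - maxdist), min(len(tokens), i + maxdist + 1)):
            if j != i:
                pairs.append((w, tokens[j], j - i))
    return pairs

print(skip_bigrams(['I', 'feel', 'relaxed']))
```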
**About Smadja's algorithm:**
Reference: [Automatic Collocation Extraction Based on Smadja's Algorithm (in Chinese)](https://blog.csdn.net/jinping_shi/article/details/61203395)
[Smadja, Frank. “Retrieving Collocations from Text: Xtract.” Computational Linguistics 19 (1993): 143-177.](https://www.semanticscholar.org/paper/Retrieving-Collocations-from-Text%3A-Xtract-Smadja/a3ff7801bcf72fea30117c88d397403a570c5c68)
**What is POS?**
> Part of speech tags. Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article, or determiner.
Reference: [Wikipedia: Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech)
---
## :memo: This assignment is to do the following:
### Step 1: Complete list_to_skipgram()
Input: one entry of the dataset:
```
['S I//PRON feel//VERB relax//VERB to//PART take//VERB a//DET bedrock//NOUN bath//NOUN ^//PUNCT ^//PUNCT',
'A 1 2|||R:VERB:TENSE|||felt|||REQUIRED|||-NONE-|||0',
'A 2 3|||R:ADJ:FORM|||relaxed|||REQUIRED|||-NONE-|||0']
```
Through **m2_to_wikiedit()**, the data is transformed into its edited form:
```
['I', '[-feel-]{+felt+}', 'relax', 'to', 'take', 'a', 'bedrock', 'bath', '^', '^']
```
The edited word in the example above is **[-feel-]{+felt+}**.
If a problem word appears within maxdist of an edit (±5 in this assignment), we consider the edit highly related to that problem word.
Output the results as (problem_word, edit_tag, position) tuples:
```
[('relax', '[-feel-]{+felt+}', 1),
('to', '[-feel-]{+felt+}', 2),
('feel', '[-relax-]{+relaxed+}', -1),
('to', '[-relax-]{+relaxed+}', 1)]
```
:::danger
:bulb:Hint:
An edit tag may be [- -]{+ +}, [- -], or {+ +}.
Do not split a combined [- -]{+ +} edit tag.
* [- -] indicates a deletion
* {+ +} indicates an insertion
* [- -]{+ +} indicates a replacement
:::
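One possible shape for **list_to_skipgram()** (a sketch under our own assumptions: edit tags are detected with a regex, only words from problem_word_list are paired, and the position is the problem word's index minus the edit's index, matching the sign convention in the example above; the template's exact behavior may differ):

```python
import re

# Matches [-old-], {+new+}, or the combined [-old-]{+new+} form.
EDIT_PAT = re.compile(r'^(\[-.*?-\])?(\{\+.*?\+\})?$')

def is_edit_tag(token):
    """True for [-..-], {+..+}, or combined [-..-]{+..+} tokens."""
    m = EDIT_PAT.match(token)
    return bool(m and (m.group(1) or m.group(2)))

def list_to_skipgram(tokens, problem_words, maxdist=5):
    """Pair every edit tag with the problem words within +-maxdist of it."""
    results = []
    for i, tok in enumerate(tokens):
        if not is_edit_tag(tok):
            continue
        lo, hi = max(0, i - maxdist), min(len(tokens), i + maxdist + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in problem_words:
                results.append((tokens[j], tok, j - i))
    return results
```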
### Step 2: Complete skgm dictionary
**corpus_list** is the segmented lang8 dataset.
The skgm dictionary has the following structure:
```
skgm['arrive']['{+at+}'][-1] = 24
```
* First layer **'arrive'**: the problem word
* Second layer **'{+at+}'**: the edit tag
* Third layer **-1**: the position of the edit relative to the problem word
* Value **24**: the frequency of that (problem word, edit tag, position) combination
Build the skgm dictionary using list_to_skipgram().
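A three-level `defaultdict` is a convenient way to build such a counter. The skip-gram tuples below are hypothetical stand-ins for the output of list_to_skipgram():

```python
from collections import defaultdict

# skgm[problem_word][edit_tag][position] -> frequency
skgm = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

# Hypothetical tuples as produced by list_to_skipgram() over the corpus:
skipgrams = [('arrive', '{+at+}', -1),
             ('arrive', '{+at+}', -1),
             ('arrive', '{+at+}', 2)]
for word, edit, pos in skipgrams:
    skgm[word][edit][pos] += 1

print(skgm['arrive']['{+at+}'][-1])  # 2
```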
### Step 3: Complete skipbigram_static dictionary
Calculate frequency, average position, spread, and strength for each problem word and its edit tags. Format as follows:
```
skipbigram_static['arrive']['{+at+}']
```
```
{
'freq': 25,
'avg_p': 2.5,
'spread': 46.45,
'strength': 13.413212074892048,
}
```
* **Frequency:** how often each (problem word, edit tag) combination occurs, regardless of position.
* **Average_position (avg_p):** the average frequency per position, i.e. frequency / 10, since there are 10 possible positions (−5 to +5, excluding 0).
* **Spread:** the variance of the per-position frequencies around avg_p (see the formula in the figure below).
* **Strength:** (frequency of the (problem word, edit tag) combination − mean of all frequencies for that problem word) / standard deviation of all frequencies for that problem word.
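Following the Xtract definitions in Smadja (1993), the four statistics could be computed as below (a sketch; `pos_counts` maps each distance to its count, `all_freqs` holds the frequencies of every edit tag seen with the problem word, and whether the assignment expects population or sample standard deviation is an assumption here):

```python
import statistics

def skipbigram_stats(pos_counts, all_freqs):
    """pos_counts: {distance: count} for one (problem_word, edit_tag) pair.
    all_freqs: frequencies of every edit tag seen with this problem word."""
    freq = sum(pos_counts.values())
    avg_p = freq / 10                  # 10 possible distances: -5..-1 and 1..5
    distances = list(range(-5, 0)) + list(range(1, 6))
    # spread: variance of the per-distance counts around avg_p
    spread = sum((pos_counts.get(d, 0) - avg_p) ** 2 for d in distances) / 10
    # strength: z-score of freq among all edit tags of this problem word
    mean_f = statistics.mean(all_freqs)
    std_f = statistics.pstdev(all_freqs)
    strength = (freq - mean_f) / std_f if std_f else 0.0
    return {'freq': freq, 'avg_p': avg_p, 'spread': spread, 'strength': strength}
```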

### Step 4: Complete valid_collocation()
Find valid collocations with **Smadja's algorithm**.
Output format as following:
```
valid_collocation('explain')
```
Format: problem_word edit_tag (distance, count)
```
explain [-about-] (1, 19)
explain {+the+} (1, 11)
explain {+to+} (1, 12)
explain {+it+} (1, 18)
explain {+will+} (-3, 5)
explain {+'ll+} (-1, 5)
explain {+this+} (1, 5)
```
**Smadja's algorithm** uses the following thresholds:

* K0 = 1
* K1 = 1
* U0 = 5
* Ui = spread
For each condition:
* C1: filter out collocations that occur infrequently in the data
* C2: a small spread means the distance-frequency distribution is uniform, so the pair is likely not a valid collocation; conversely, a large spread suggests a valid collocation
* C3: select the distances at which the collocation is valid
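Putting the thresholds and conditions together, **valid_collocation()** could be sketched as follows (C3 keeps a distance when its count reaches avg_p + K1·√spread, following Smadja's peak test; the dictionary layouts match Steps 2 and 3, and the parameter names are our own):

```python
import math

def valid_collocation(word, skipbigram_static, skgm, k0=1, k1=1, u0=5):
    """Return (word, edit_tag, distance, count) tuples passing C1-C3."""
    results = []
    for edit, st in skipbigram_static[word].items():
        if st['strength'] < k0:        # C1: drop low-strength (infrequent) pairs
            continue
        if st['spread'] < u0:          # C2: drop flat distance distributions
            continue
        peak = st['avg_p'] + k1 * math.sqrt(st['spread'])
        for dist, count in skgm[word][edit].items():
            if count >= peak:          # C3: keep only peaked distances
                results.append((word, edit, dist, count))
    return results
```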
---
### Optional: Grade B - Use known rules to improve correctness
The format of **rules.tsv** looks like:

In this assignment, we treat each combination of Problem_Word_POS, Edit_TYPE, Distance, and Count as a rule.
To add rules to the original method, count a result of **list_to_skipgram()** into the **skgm** dictionary <font color='blue'>only if the result agrees with the rules in **rules.tsv**</font>.
Every keyword in the **skgm** and **skipbigram_static** dictionaries must be concatenated with its POS tag, and each edit tag must likewise be concatenated with its corresponding tag.
```
* Grade B Example Input
skgm['arrive//VERB']['{+at//M:PREP+}'][]
skipbigram_static['arrive//VERB']['{+at//M:PREP+}']
```
The grade B result must differ from the grade A result.
Example output as follows:
```
* Grade A Result
In:
valid_collocation('explain')
Out:
explain [-about-] (1, 19)
explain {+the+} (1, 11)
explain {+to+} (1, 12)
explain {+it+} (1, 18)
explain {+will+} (-3, 5)
explain {+'ll+} (-1, 5)
explain {+this+} (1, 5)
```
```
* Grade B Result
In:
valid_collocation('explain//VERB')
Out:
explain//VERB [-about//U:PREP-] (1, 19)
explain//VERB {+to//M:PREP+} (1, 12)
explain//VERB {+it//M:PRON+} (1, 17)
```
---
### Optional: Grade C - Call linggle_api to calculate candidates' frequency
1. Tokenize an n-gram with **spacy**.
2. Find the valid_collocation of each token.
```
import spacy
nlp = spacy.load('en_core_web_lg')
ngram = 'busy in something'
collocation = []
for token in nlp(ngram):
    print(token.text + '//' + token.pos_)
    collocation.append(valid_collocation(token.text + '//' + token.pos_))
```
3. Generate all possible edited n-grams from the results of valid_collocation.
```
candidates = [['busy', 'with', 'something'],
['busy', 'my', 'in', 'something'],
['busy', 'something'],
['busy', 'in', 'a', 'something'],
['busy', 'in', 'the', 'something'],
['busy', 'now', 'in', 'something'],
['busy', 'in', 'my', 'something'],
['busy', 'in', 'an', 'something'],
['busy', 'in', 'our', 'something'],
['busy', 'in', 'this', 'something'],
['busy', 'in', 'to', 'something'],
['busy', 'in', 'their', 'something'],
['busy', 'in', 'something']]
```
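Generating candidates like those above amounts to applying each edit tag at some offset from its token. A minimal, hypothetical helper that applies a single edit at a given index (the real distance bookkeeping is left to your implementation):

```python
def apply_edit(tokens, i, edit):
    """Apply one edit tag at index i of tokens (hypothetical helper)."""
    out = list(tokens)
    if edit.startswith('[-') and '{+' in edit:   # [-old-]{+new+}: replacement
        out[i] = edit.split('-]{+')[1].rstrip('+}')
    elif edit.startswith('[-'):                  # [-old-]: deletion
        del out[i]
    else:                                        # {+new+}: insertion before i
        out.insert(i, edit[2:-2])
    return out

print(apply_edit(['busy', 'in', 'something'], 1, '[-in-]{+with+}'))
# ['busy', 'with', 'something']
```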
4. Count the frequency of each candidate by calling linggle_api. Output format as follows:
```
result count
0 busy with something 6420
1 busy something 67
2 busy in something 52
```
As you can see above, most candidates receive no count from linggle_api, which means they are rarely used in the corpus and may not be valid n-grams.
---
## 💯 Submission and Evaluation:
* **Grade A (80 points):** Find valid collocations with Smadja's algorithm.
Be prepared to explain your code, and show the result during demo.
* **Grade B (full score):** Put pos_tag into consideration by using **rules.tsv**.
Be prepared to show both results of grade A and grade B.
* **Grade C (bonus):** show the frequency results calculated by **linggle_api**.
Be prepared to show the results of grade A and grade B, as well as the candidate counts from linggle_api.
- [ ] Demo your code to TA before leaving the classroom
- [ ] Submit your code to iLMS
If you do not finish this assignment in class, you can still hand it in within a week at <font color='blue'>Delta 712</font>. <font color='blue'>The time of your submission will not affect your score.</font> TAs are available at:
* Tuesday 10:00 ~ 12:00
* Wednesday 15:30 ~ 17:30
* Thursday 15:30 ~ 17:30