# Week 4 Assignment: Find Collocations with Skip-grams

Date: 2020/3/24

==Setup:==

All required files are on [Google Drive](https://drive.google.com/drive/folders/13Vup8nGtJWa6Ln6jd9oNXhNfuSTDQWW9?fbclid=IwAR12PWFZY4Lm-StOOoywrtIzLs0ETmZNH19GDepVdDM-U9yo5NtOUsKPsHs). There are 6 files:

1. Collocation.ipynb (template)
2. lang8.pos.m2.txt (dataset)
3. problem_word_list.txt
4. rules.tsv <font color='red'>(optional, for grade B)</font>
5. Collocation-POS.ipynb <font color='red'>(optional, template for grade B)</font>
6. linggle_api_new.py <font color='red'>(optional, for grade C)</font>

You may use an IDE such as Jupyter Notebook to run the files. Both .ipynb and .py are acceptable.

==Description:==

In this assignment you are given the dataset **lang8.pos.m2.txt**, which contains grammatically incorrect sentences and their corrections. Each entry looks like:

```
S His//DET Kanji//PROPN 's//PART ability//NOUN is//AUX much//ADV better//ADJ than//SCONJ me//PRON .//PUNCT
A 2 3|||U:NOUN:POSS||||||REQUIRED|||-NONE-|||0
A 8 9|||R:PRON|||mine|||REQUIRED|||-NONE-|||0
```

From this data we will extract valid collocations with **Smadja's algorithm**.

**What is a skip-gram?**

> Skip-grams are a generalization of n-grams in which the components (typically words) need not be consecutive in the text under consideration, but may leave gaps that are skipped over.

Reference: [Wikipedia: n-gram](https://en.wikipedia.org/wiki/N-gram#Skip-gram)

**About Smadja's algorithm:**

References:
* [Automatic collocation extraction based on Smadja's algorithm (基于Smadja演算法的搭配詞自動提取)](https://blog.csdn.net/jinping_shi/article/details/61203395)
* [Smadja, Frank. "Retrieving Collocations from Text: Xtract." Computational Linguistics 19 (1993): 143-177.](https://www.semanticscholar.org/paper/Retrieving-Collocations-from-Text%3A-Xtract-Smadja/a3ff7801bcf72fea30117c88d397403a570c5c68)

**What is POS?**

> Part-of-speech tags.
> Commonly listed English parts of speech are noun, verb, adjective, adverb, pronoun, preposition, conjunction, interjection, numeral, article, and determiner.

Reference: [Wikipedia: Part of Speech](https://en.wikipedia.org/wiki/Part_of_speech)

---

## :memo: This assignment is to do the following:

### Step 1: Complete list_to_skipgram()

Input one piece of data:

```
['S I//PRON feel//VERB relax//VERB to//PART take//VERB a//DET bedrock//NOUN bath//NOUN ^//PUNCT ^//PUNCT',
 'A 1 2|||R:VERB:TENSE|||felt|||REQUIRED|||-NONE-|||0',
 'A 2 3|||R:ADJ:FORM|||relaxed|||REQUIRED|||-NONE-|||0']
```

Through **m2_to_wikiedit()**, the data is transformed into edited (wikiedit) form:

```
['I', '[-feel-]{+felt+}', 'relax', 'to', 'take', 'a', 'bedrock', 'bath', '^', '^']
```

The edited word in the example above is **[-feel-]{+felt+}**. If a problem word appears within maxdist (±5 in this assignment) of an edit, we consider the edit highly related to that problem word. Output the result as (problem_word, edit_tag, position) tuples:

```
[('relax', '[-feel-]{+felt+}', 1),
 ('to', '[-feel-]{+felt+}', 2),
 ('feel', '[-relax-]{+relaxed+}', -1),
 ('to', '[-relax-]{+relaxed+}', 1)]
```

:::danger
:bulb: Hint: An edit tag may be [- -]{+ +}, [- -], or {+ +}. Do not split a [- -]{+ +} tag into two.
* [- -] indicates a deletion
* {+ +} indicates an insertion
* [- -]{+ +} indicates a replacement
:::

### Step 2: Complete the skgm dictionary

**corpus_list** is the segmented lang8 dataset. The skgm dictionary has this structure:

```
skgm['arrive']['{+at+}'][-1] = 24
```

* The first key, **'arrive'**, is the problem word
* The second key, **'{+at+}'**, is the edit tag
* The third key, **-1**, is the position of the edit relative to the problem word
* The value, **24**, is the frequency of that (problem word, edit tag, position) combination

Complete the skgm dictionary using list_to_skipgram().
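A minimal sketch of Steps 1 and 2, assuming **m2_to_wikiedit()** has already produced the wikiedit token list and that `problem_words` is loaded from problem_word_list.txt. The `surface()` helper and the exact filtering rules are assumptions for illustration, not the official template:

```python
import re
from collections import defaultdict

def surface(tok):
    """Recover the surface word from a wikiedit token (assumed helper).
    For [-x-] and [-x-]{+y+} we keep the original word x; a bare {+y+}
    insertion has no original word, so we fall back to the inserted one."""
    m = re.match(r'\[-(.*?)-\]', tok)
    if m:
        return m.group(1)
    m = re.match(r'\{\+(.*?)\+\}', tok)
    return m.group(1) if m else tok

def list_to_skipgram(tokens, problem_words, maxdist=5):
    """Pair every edit tag with the problem words within +-maxdist."""
    results = []
    for i, tok in enumerate(tokens):
        if not (tok.startswith('[-') or tok.startswith('{+')):
            continue  # only edit tags anchor a skip-gram
        for j in range(max(0, i - maxdist), min(len(tokens), i + maxdist + 1)):
            if j == i:
                continue
            word = surface(tokens[j])
            if word in problem_words:
                results.append((word, tok, j - i))
    return results

# Step 2: accumulate counts into the nested skgm dictionary
skgm = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
tokens = ['I', '[-feel-]{+felt+}', '[-relax-]{+relaxed+}', 'to',
          'take', 'a', 'bedrock', 'bath', '^', '^']
for word, edit, pos in list_to_skipgram(tokens, {'feel', 'relax', 'to'}):
    skgm[word][edit][pos] += 1
```

Note that the example sentence contains two edits, so both `[-feel-]{+felt+}` and `[-relax-]{+relaxed+}` act as anchors, reproducing the four tuples shown above.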
### Step 3: Complete the skipbigram_static dictionary

Calculate frequency, average position, spread, and strength for each problem word and edit tag. Format as follows:

```
skipbigram_static['arrive']['{+at+}']
```
```
{
  'freq': 25,
  'avg_p': 2.5,
  'spread': 46.45,
  'strength': 13.413212074892048,
}
```

* **freq:** the frequency of the (problem word, edit tag) combination, regardless of position.
* **avg_p:** the average frequency per position of the combination, i.e. freq/10 (there are 10 possible distances: −5…−1 and 1…5).
* **spread:** computed by the formula in the figure below; it measures how unevenly the frequency is distributed over the 10 positions.
* **strength:** (frequency of this combination − the mean frequency over all edit tags of the same problem word) / the standard deviation of those frequencies.

![](https://i.imgur.com/QKBrD2k.png)

### Step 4: Complete valid_collocation()

Find valid collocations with **Smadja's algorithm**. Output format:

```
valid_collocation('explain')
```

Format: problem_word edit_tag (distance, count)

```
explain [-about-] (1, 19)
explain {+the+} (1, 11)
explain {+to+} (1, 12)
explain {+it+} (1, 18)
explain {+will+} (-3, 5)
explain {+'ll+} (-1, 5)
explain {+this+} (1, 5)
```

**Smadja's algorithm** is as follows:

![](https://i.imgur.com/3OgetxJ.png)

* K0 = 1
* K1 = 1
* U0 = 5
* Ui = spread

The three conditions:
* C1: filter out collocations that appear infrequently in the data
* C2: a small spread means the frequency is distributed evenly over the distances, so the pair is probably not a real collocation; conversely, a large spread suggests it may be a valid collocation
* C3: select the distances at which the collocation is valid

---

### Optional: Grade B - Use known rules to improve correctness

The format of **rules.tsv**:

![](https://i.imgur.com/Eq6dYKT.png)

In this assignment, we treat each combination of Problem_Word_POS, Edit_TYPE, Distance, and Count as a rule. To add rules to the original method, count the result of **list_to_skipgram()** into the **skgm** dictionary <font color='blue'>only if the result agrees with a rule in **rules.tsv**</font>. Every keyword in the **skgm** and **skipbigram_static** dictionaries must be concatenated with its POS tag, and each edit tag must be concatenated with its corresponding tag.
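A minimal sketch of Steps 3 and 4 above (the base, grade A pipeline), assuming `skgm[word][edit][pos]` holds counts as in Step 2. The use of the population standard deviation and the exact ≥/< threshold comparisons are assumptions about the template:

```python
import statistics

POSITIONS = [p for p in range(-5, 6) if p != 0]  # the 10 possible distances
K0, K1, U0 = 1, 1, 5

def make_stats(skgm):
    """Step 3 sketch: freq, avg_p, spread, strength per (word, edit tag)."""
    stats = {}
    for word, edits in skgm.items():
        freqs = [sum(pc.values()) for pc in edits.values()]
        mean = statistics.mean(freqs)
        std = statistics.pstdev(freqs)  # population stdev; an assumption
        stats[word] = {}
        for edit, pos_counts in edits.items():
            freq = sum(pos_counts.values())
            avg_p = freq / 10  # average count per position
            spread = sum((pos_counts.get(p, 0) - avg_p) ** 2
                         for p in POSITIONS) / 10
            strength = (freq - mean) / std if std else 0.0
            stats[word][edit] = {'freq': freq, 'avg_p': avg_p,
                                 'spread': spread, 'strength': strength}
    return stats

def valid_collocation(word, stats, skgm):
    """Step 4 sketch: apply Smadja's conditions C1-C3."""
    results = []
    for edit, s in stats.get(word, {}).items():
        if s['strength'] < K0:        # C1: drop infrequent pairs
            continue
        if s['spread'] < U0:          # C2: drop flat distance distributions
            continue
        peak = s['avg_p'] + K1 * s['spread'] ** 0.5
        for pos, count in skgm[word][edit].items():
            if count >= peak:         # C3: keep only the peaked distances
                results.append((word, edit, pos, count))
    return results
```

With the constants above, a pair survives only if its strength and spread clear K0 and U0, and only the distances whose count reaches avg_p + K1·√spread are reported.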
```
* Grade B Example Input
skgm['arrive//VERB']['{+at//M:PREP+}'][]
skipbigram_static['arrive//VERB']['{+at//M:PREP+}']
```

The grade B result must differ from the grade A result. Example output:

```
* Grade A Result
In: valid_collocation('explain')
Out:
explain [-about-] (1, 19)
explain {+the+} (1, 11)
explain {+to+} (1, 12)
explain {+it+} (1, 18)
explain {+will+} (-3, 5)
explain {+'ll+} (-1, 5)
explain {+this+} (1, 5)
```

```
* Grade B Result
In: valid_collocation('explain//VERB')
Out:
explain//VERB [-about//U:PREP-] (1, 19)
explain//VERB {+to//M:PREP+} (1, 12)
explain//VERB {+it//M:PRON+} (1, 17)
```

---

### Optional: Grade C - Call linggle_api to calculate candidates' frequency

1. Tokenize an n-gram with **spacy**.
2. Find the valid_collocation of each token.

```python
import spacy

nlp = spacy.load('en_core_web_lg')
ngram = 'busy in something'
collocation = []
for token in nlp(ngram):
    print(token.text + '//' + token.pos_)
    collocation.append(valid_collocation(token.text + '//' + token.pos_))
```

3. Generate all possible edited n-grams from the result of valid_collocation.

```
candidates = [['busy', 'with', 'something'], ['busy', 'my', 'in', 'something'],
              ['busy', 'something'], ['busy', 'in', 'a', 'something'],
              ['busy', 'in', 'the', 'something'], ['busy', 'now', 'in', 'something'],
              ['busy', 'in', 'my', 'something'], ['busy', 'in', 'an', 'something'],
              ['busy', 'in', 'our', 'something'], ['busy', 'in', 'this', 'something'],
              ['busy', 'in', 'to', 'something'], ['busy', 'in', 'their', 'something'],
              ['busy', 'in', 'something']]
```

4. Count the frequency of each candidate by calling linggle_api. Output format:

```
                result  count
0  busy with something   6420
1       busy something     67
2    busy in something     52
```

As you can see above, most candidates are not counted by linggle_api, which means those candidates rarely appear in the corpus and may not be valid n-grams.
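The counting step can be sketched like this. `get_count` is a stand-in for the real Linggle query in linggle_api_new.py (whose exact interface is not shown here), and the toy counts below merely mimic the example output:

```python
def rank_candidates(candidates, get_count):
    """Query a frequency for each candidate and sort, most frequent first.
    Candidates Linggle has never seen (count 0) are dropped."""
    rows = [(' '.join(cand), get_count(cand)) for cand in candidates]
    return sorted((r for r in rows if r[1] > 0),
                  key=lambda r: r[1], reverse=True)

# toy counts standing in for linggle_api responses (not real data)
fake_counts = {'busy with something': 6420, 'busy something': 67,
               'busy in something': 52}
cands = [['busy', 'with', 'something'], ['busy', 'in', 'a', 'something'],
         ['busy', 'something'], ['busy', 'in', 'something']]
ranked = rank_candidates(cands, lambda c: fake_counts.get(' '.join(c), 0))
for i, (result, count) in enumerate(ranked):
    print(i, result, count)  # surviving candidates in descending order
```

Replacing the lambda with an actual call into linggle_api_new.py should reproduce the table above.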
---

## 💯 Submission and Evaluation:

* **Grade A (80 points):** Find valid collocations with Smadja's algorithm. Be prepared to explain your code and show the result during the demo.
* **Grade B (full score):** Take POS tags into consideration by using **rules.tsv**. Be prepared to show the results of both grade A and grade B.
* **Grade C (bonus):** Show the frequency results calculated by **linggle_api**. Be prepared to show the results of grade A and grade B, plus the counts from linggle_api.

- [ ] Demo your code to a TA before leaving the classroom
- [ ] Submit your code to iLMS

If you do not finish this assignment in class, you can still hand it in within a week at <font color='blue'>Delta 712</font>. <font color='blue'>The time of your submission will not affect your score.</font>

TAs are available at:
* Tuesday 10:00 ~ 12:00
* Wednesday 15:30 ~ 17:30
* Thursday 15:30 ~ 17:30