# Week10 Assignment: Correction of miscollocation in a Sentence

Date: 2020/5/5

Lecture on Google Meet:

==Setup:==

All required files are on [Google Drive](https://drive.google.com/drive/folders/1kzx9YBW79ZNPiYQUKBkGHFnvwZH25Imq?fbclid=IwAR0euWfyxJsLDL8u1SpRpr9HdZEAb2XnvzAXI9f2jQqdHTU-pp7kyMgcLK8). There are 4 files:

1. miscollocations-template.ipynb (template)
2. bnc.vn.txt (data)
3. linggle_api.py
4. bnc.bin

You may use an IDE such as Jupyter Notebook to run the files. Both .ipynb and .py submissions are acceptable.

==Description:==

In this assignment, we are going to write a program that detects and corrects miscollocations in a sentence.

**Original sentence:** `'He have a little mistake'`

**Corrected sentence:** `'He make a little mistake', 'He admit a little mistake'`

Each line of **bnc.vn.txt** consists of:

```
A_verb V:obj:N An_object frequency_of_'V N'
```

```
affect V:obj:N people 113
affect V:obj:N way 69
...
```

:::danger
Note that the sample values of **mutual information** and **word similarity** in the template (Step 2 and Step 3) have been adjusted, and WILL be different from yours.
:::

---

## :memo: This assignment is to do the following:

## Step 1: Build index for verbs and nouns

Complete two dictionaries, **vocount** and **ovcount**.

* v = verb
* o = object

The data comes from **bnc.vn.txt**, which has already been read into **vo**. The two dictionaries have the following format:

* `vocount[verb][object] = count`
* `ovcount[object][verb] = count`

## Step 2: Mutual Information

Compute the mutual information with the formula log2( p(v,o) / ( p(v) * p(o) ) ), where:

* p(v,o) = c(v,o) / N
* p(v) = c(v) / N
* p(o) = c(o) / N

Use **vocount** and **ovcount** to obtain the counts.

## Step 3: Use Nested For Loop to Compute Word Similarity

Given two words w1, w2 and their part of speech, output the word similarity of the two words.

* pos = POS tag (part of speech)

Use **wn.synsets()** to find all synsets of a word in WordNet. Usage:

```
Input: wn.synsets('dog', 'n')
```

```
Output: [Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01')]
```

Find all synsets of w1 and all synsets of w2, then take the highest .path_similarity() over all pairs drawn from the two sets. This value is the word_similarity.

:::info
.path_similarity() usage: similarity = synset1.path_similarity(synset2)
:::

## Step 4: Set the cluster for verbs and nouns

Usage of **n_cluster()**:

* Input: two words (one collocation, verb + object) and a similarity_function()
* Output: other objects that the verb can take, with the corresponding mutual information and word similarity

```
Input: n_cluster('reach', 'purpose', wn_sim)
```

```
Output: [('end', 4.0842, 0.5), ('goal', 2.5279, 0.5), ('decision', 3.2595, 0.33333), ('target', 4.0371, 0.33333), ('destination', 6.7663, 0.33333)]
```

:::info
Here, the **wn_sim()** completed in Step 3 is used as the similarity_function()
:::

＞＞Complete the v_cluster part:

* Input: one collocation and a similarity_function()
* Output: collocations similar to the input, with the corresponding mutual information and word similarity

```
Input: v_cluster('reach', 'purpose', wn_sim)
```

```
Output: [('attain', 'goal', 5.3309, 1.0), ('achieve', 'goal', 5.3121, 1.0), ('accomplish', 'purpose', 4.6576, 1.0), ('achieve', 'target', 4.6033, 1.0), ('accomplish', 'goal', 4.4155, 1.0), ('attain', 'purpose', 4.1807, 1.0), ('achieve', 'end', 4.0749, 1.0), ('achieve', 'purpose', 3.9586, 1.0), ('attain', 'end', 3.9185, 1.0),
('accomplish', 'end', 3.3954, 1.0), ('succeed', 'purpose', 2.7491, 0.5), ('succeed', 'goal', 2.507, 0.5), ('attain', 'target', 1.9953, 1.0)]
```

1. Use n_cluster() to find other objects that the verb can take.
1. Find other verbs that can take these objects.
1. Compute the mutual information and word similarity of every verb + object combination.
1. Output the combinations whose mutual information and word similarity exceed a given threshold.

:::warning
You may tune this threshold yourself; any value that produces reasonable output is fine.
:::

## Step 5: Language Model

Parameter format of **LM()** (example):

* `vn_pair = [ [('reach', 1), ('dream', 3)] ]`
* `sentence = "He reach his dream".split()`

vn_pair may contain several collocations from the sentence. Each word must be tagged with its index in the sentence.

* Input: the collocation to change and the sentence it appears in
* Output: the changed collocation and the new sentence

```
Output: [('He achieve his dream', -14.745, 51212), ('He achieve his perfection', -16.949, 17226), ('He attain his perfection', -17.321, 12322)]
```

＞＞Procedure:

1. Use **linggle.search()** to verify whether the collocation is correct. If the result of **linggle.search()** is > 10000, treat the collocation as correct. If it is < 10000, treat it as wrong and look for a correct replacement collocation.
1. Use **v_cluster()** to find replacement collocations.
1. Use **linggle.search()** again to check whether each replacement collocation is correct, and discard it if the result is < 10000.
1. Output the correct replacement collocations.
1. Output the modified sentence.

## Step 6: Part-of-speech tags and dependencies by spaCy

Use spaCy's dependency parsing to extract the verb + object pairs in a sentence automatically. If a token's .dep_ is **dobj**, find that token's **root** and treat the two as one collocation. You also need to check that the parts of speech of the root and the dobj are correct.

[spaCy documentation link](https://spacy.io/api/annotation)

spaCy POS tag reference table: (Part-of-speech tagging >> English)

spaCy dependency labels: (Syntactic Dependency Parsing)

---

## 💯 Submission and Evaluation:

* (80 points) Complete up to step 5
* (100 points) Complete step 6

- [ ] Fill in the [demo time table](https://docs.google.com/spreadsheets/d/1Bw6qcqKV45LPLHe0Lc9DmFzQDikTCSqGbWEGuUcMNg8/edit#gid=1252683389), and wait for your turn
- [ ] Submit your code to iLMS

If you do not finish this assignment in class, you can still demo online within a week. The time of your submission will not affect your score.

If you have questions, post them on the iLMS discussion board.

Online demos available at:

* Tuesday 10:00 ~ 12:00
* Wednesday 15:30 ~ 17:30
* Thursday 15:30 ~ 17:30
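
As a reference for Steps 1 and 2, the following is a minimal sketch of the index building and the mutual information formula on a few toy rows. It is not the template's code. It assumes that each row of bnc.vn.txt is whitespace-separated, that N is the total frequency over all verb-object pairs, and that c(v) and c(o) are the sums of the matching rows. The helper name `mutual_info` and the two "make" rows are hypothetical.

```python
import math
from collections import defaultdict

# Toy rows in the bnc.vn.txt layout: verb, relation, object, frequency.
# The two "affect" rows come from the sample in the description;
# the "make" rows are made up for illustration only.
sample_lines = [
    "affect V:obj:N people 113",
    "affect V:obj:N way 69",
    "make V:obj:N way 18",      # hypothetical row
    "make V:obj:N mistake 50",  # hypothetical row
]

# Step 1: vocount[verb][object] = count, ovcount[object][verb] = count
vocount = defaultdict(lambda: defaultdict(int))
ovcount = defaultdict(lambda: defaultdict(int))
for line in sample_lines:
    verb, _rel, obj, freq = line.split()
    vocount[verb][obj] += int(freq)
    ovcount[obj][verb] += int(freq)

# Assumption: N is the total frequency of all verb-object pairs.
N = sum(sum(objs.values()) for objs in vocount.values())


def mutual_info(verb, obj):
    """Step 2: log2( p(v,o) / ( p(v) * p(o) ) ), with counts from the two indexes."""
    # .get() avoids creating empty entries for unseen words.
    c_vo = vocount.get(verb, {}).get(obj, 0)
    c_v = sum(vocount.get(verb, {}).values())
    c_o = sum(ovcount.get(obj, {}).values())
    if c_vo == 0:
        # Unseen pair: the ratio is 0, so report negative infinity.
        return float("-inf")
    return math.log2((c_vo / N) / ((c_v / N) * (c_o / N)))


print(round(mutual_info("make", "mistake"), 4))
```

With the real data, the values depend on how N, c(v) and c(o) are defined in the template, so they may not match this sketch.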