# Week 10 Assignment: Correction of Miscollocations in a Sentence
Date: 2020/5/5
Lecture on Google Meet:
==Setup:==
All required files are on [Google Drive](https://drive.google.com/drive/folders/1kzx9YBW79ZNPiYQUKBkGHFnvwZH25Imq?fbclid=IwAR0euWfyxJsLDL8u1SpRpr9HdZEAb2XnvzAXI9f2jQqdHTU-pp7kyMgcLK8). There are 4 files:
1. miscollocations-template.ipynb (template)
2. bnc.vn.txt (data)
3. linggle_api.py
4. bnc.bin
You may use an IDE such as Jupyter Notebook to run the files.
Both .ipynb and .py submissions are acceptable.
==Description:==
In this assignment, we are going to write a program for detecting and correcting miscollocations in a sentence.
**Original sentence:** `'He have a little mistake'`
**Corrected sentence:** `'He make a little mistake', 'He admit a little mistake'`
Each line of **bnc.vn.txt** consists of:
```
A_verb V:obj:N An_object frequency_of_'V N'
```
```
affect V:obj:N people 113
affect V:obj:N way 69
...
```
:::danger
Note that the sample values of **mutual information** and **word similarity** in the template (Step 2 and Step 3) have been adjusted and WILL differ from yours.
:::
---
## :memo: This assignment is to do the following:
## Step 1: Build index for verbs and nouns
Complete two dictionaries, **vocount** and **ovcount**.
* v = verb
* o = object
The data comes from **bnc.vn.txt**, which has already been read into **vo**.
The two dictionaries have the following format:
* `vocount[verb][object] = count`
* `ovcount[object][verb] = count`
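A minimal sketch of how the two indexes could be built. It assumes each entry can be obtained by splitting one line of **bnc.vn.txt** on whitespace; the actual shape of **vo** in the template may differ, and `build_indexes` is a hypothetical helper name. The third sample line is made up for illustration.

```python
from collections import defaultdict

# The first two lines come from the format example above; the third is hypothetical.
sample_lines = [
    "affect V:obj:N people 113",
    "affect V:obj:N way 69",
    "change V:obj:N way 40",
]

def build_indexes(lines):
    """Return (vocount, ovcount) as nested dicts of frequencies."""
    vocount = defaultdict(lambda: defaultdict(int))
    ovcount = defaultdict(lambda: defaultdict(int))
    for line in lines:
        verb, _relation, obj, freq = line.split()
        vocount[verb][obj] += int(freq)   # verb -> object -> count
        ovcount[obj][verb] += int(freq)   # object -> verb -> count
    return vocount, ovcount

vocount, ovcount = build_indexes(sample_lines)
print(vocount["affect"]["way"])   # 69
print(dict(ovcount["way"]))       # {'affect': 69, 'change': 40}
```

Using `+=` rather than `=` keeps the counts correct if the same pair appears on more than one line.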
## Step 2: Mutual Information
Compute the mutual information using the formula: log2( p(v,o) / ( p(v) * p(o) ) )
* p(v,o) = c(v,o) / N
* p(v) = c(v) / N
* p(o) = c(o) / N
Use **vocount** and **ovcount** to obtain the counts.
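The formula above can be sketched as follows. The assignment does not define N or the marginal counts explicitly, so this sketch assumes N is the sum of all pair frequencies, c(v) is the sum over all objects of the verb, and c(o) is the sum over all verbs of the object. The toy counts and the name `mutual_info` are illustrative only.

```python
import math

# Toy counts for illustration; real values come from bnc.vn.txt.
vocount = {"affect": {"people": 113, "way": 69}, "change": {"way": 40}}
ovcount = {"people": {"affect": 113}, "way": {"affect": 69, "change": 40}}

# Assumption: N is the total frequency of all verb-object pairs.
N = sum(sum(objs.values()) for objs in vocount.values())

def mutual_info(verb, obj):
    """log2( p(v,o) / (p(v) * p(o)) ), or -inf for an unseen pair."""
    c_vo = vocount.get(verb, {}).get(obj, 0)
    if c_vo == 0:
        return float("-inf")
    c_v = sum(vocount[verb].values())   # assumed marginal count of the verb
    c_o = sum(ovcount[obj].values())    # assumed marginal count of the object
    return math.log2((c_vo / N) / ((c_v / N) * (c_o / N)))

print(round(mutual_info("affect", "people"), 4))
```

Returning negative infinity for unseen pairs is one possible convention; skipping such pairs before calling the function works equally well.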
## Step 3: Use Nested For Loop to Compute Word Similarity
Given two words w1 and w2 and their part of speech, output the word similarity of the two words.
* pos = POS tag (part of speech)
Use **wn.synsets()** to find all synsets of a word in WordNet. Usage:
```
Input:
wn.synsets('dog', 'n')
```
```
Output:
[Synset('dog.n.01'),
Synset('frump.n.01'),
Synset('dog.n.03'),
Synset('cad.n.01'),
Synset('frank.n.02'),
Synset('pawl.n.01'),
Synset('andiron.n.01')]
```
Find all synsets of w1 and all synsets of w2,
then, across the two sets, find the pair with the highest .path_similarity(); that value is the word_similarity.
:::info
.path_similarity() usage:
similarity = synset1.path_similarity(synset2)
:::
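A possible shape for the nested loop, as a sketch rather than the template's solution. The signature `wn_sim(w1, w2, pos)` is an assumption, and `max_path_similarity` is a hypothetical helper. `path_similarity()` can return `None` when no path connects two synsets, so that case is skipped.

```python
def max_path_similarity(synsets1, synsets2):
    """Return the highest path_similarity over all synset pairs (0.0 if none)."""
    best = 0.0
    for s1 in synsets1:
        for s2 in synsets2:
            sim = s1.path_similarity(s2)   # may be None when no path exists
            if sim is not None and sim > best:
                best = sim
    return best

def wn_sim(w1, w2, pos):
    """Word similarity of w1 and w2 for one part of speech ('n', 'v', ...)."""
    # Imported lazily so the helper above can be used without NLTK installed.
    from nltk.corpus import wordnet as wn
    return max_path_similarity(wn.synsets(w1, pos), wn.synsets(w2, pos))
```

With NLTK and its WordNet corpus installed, `wn_sim('dog', 'cat', 'n')` would compare every noun synset of the two words and return the best score.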
## Step 4: Set the cluster for verbs and nouns
Usage of **n_cluster()**:
* Input: two words (a collocation, verb + object) and a similarity_function()
* Output: other objects that the verb can take, with the corresponding mutual information and word similarity
```
Input:
n_cluster('reach', 'purpose', wn_sim)
```
```
Output:
[('end', 4.0842, 0.5),
('goal', 2.5279, 0.5),
('decision', 3.2595, 0.33333),
('target', 4.0371, 0.33333),
('destination', 6.7663, 0.33333)]
```
:::info
Here, the **wn_sim()** completed in Step 3 is used as the similarity_function().
:::
<font color="blue" size=4>>>Complete the v_cluster part:</font>
* Input: a collocation and a similarity_function()
* Output: collocations similar to the input, with the corresponding mutual information and word similarity
```
Input:
v_cluster('reach', 'purpose', wn_sim)
```
```
Output:
[('attain', 'goal', 5.3309, 1.0),
('achieve', 'goal', 5.3121, 1.0),
('accomplish', 'purpose', 4.6576, 1.0),
('achieve', 'target', 4.6033, 1.0),
('accomplish', 'goal', 4.4155, 1.0),
('attain', 'purpose', 4.1807, 1.0),
('achieve', 'end', 4.0749, 1.0),
('achieve', 'purpose', 3.9586, 1.0),
('attain', 'end', 3.9185, 1.0),
('accomplish', 'end', 3.3954, 1.0),
('succeed', 'purpose', 2.7491, 0.5),
('succeed', 'goal', 2.507, 0.5),
('attain', 'target', 1.9953, 1.0)]
```
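1. Use n_cluster() to find other objects that the verb can take.
1. Find the other verbs that can take these objects.
1. Compute the mutual information and word similarity of every verb + object combination.
1. Output the combinations whose mutual information and word similarity are <font color="blue">above certain thresholds</font>.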
:::warning
These thresholds can be tuned as you see fit; any reasonable output is acceptable.
:::
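The four steps above might be sketched as follows. This is not the template's implementation: `n_cluster`, `ovcount`, and the mutual information function are passed in as parameters only to keep the example self-contained, the original object is added to the candidate objects, the similarity is taken between the candidate verb and the original verb, and the thresholds `min_mi` and `min_sim` are arbitrary. All toy data and `toy_*` names are hypothetical.

```python
def v_cluster_sketch(verb, obj, sim_func, n_cluster, ovcount, mi_func,
                     min_mi=2.0, min_sim=0.5):
    # 1. Objects the verb can take (from n_cluster), plus the original object.
    objects = [o for o, _mi, _sim in n_cluster(verb, obj, sim_func)] + [obj]
    results = []
    for o in dict.fromkeys(objects):            # de-duplicate, keep order
        # 2. Other verbs observed with this object.
        for v in ovcount.get(o, {}):
            if v == verb:
                continue
            # 3. Score the verb + object combination.
            mi = mi_func(v, o)
            sim = sim_func(v, verb, "v")        # assumed: verb-to-verb similarity
            # 4. Keep only combinations above both thresholds.
            if mi > min_mi and sim > min_sim:
                results.append((v, o, mi, sim))
    return sorted(results, key=lambda r: r[2], reverse=True)

# Hypothetical toy data standing in for the real counts and scores.
toy_ovcount = {"goal": {"reach": 10, "achieve": 20, "kick": 3},
               "purpose": {"achieve": 8, "serve": 12}}
toy_mi = {("achieve", "goal"): 5.3, ("kick", "goal"): 4.0,
          ("achieve", "purpose"): 3.9, ("serve", "purpose"): 6.0}
toy_sim = {"achieve": 1.0, "kick": 0.1, "serve": 0.2}

def toy_n_cluster(verb, obj, sim_func):
    return [("goal", 2.5, 0.5)]

def toy_sim_func(w1, w2, pos):
    return toy_sim.get(w1, 0.0)

def toy_mi_func(v, o):
    return toy_mi[(v, o)]

result = v_cluster_sketch("reach", "purpose", toy_sim_func,
                          toy_n_cluster, toy_ovcount, toy_mi_func)
print(result)  # [('achieve', 'goal', 5.3, 1.0), ('achieve', 'purpose', 3.9, 1.0)]
```

Sorting by mutual information mirrors the ordering of the sample output, though the template may order results differently.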
## Step 5: Language Model
Parameter format of **LM()** (example):
* `vn_pair = [ [('reach', 1), ('dream', 3)] ]`
* `sentence = "He reach his dream".split()`
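vn_pair can hold several collocations from the sentence; each word must be tagged with its index in the sentence.
* Input: the collocation to change and the sentence it appears in
* Output: the changed collocation and the new sentence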
```
Output:
[('He achieve his dream', -14.745, 51212),
('He achieve his perfection', -16.949, 17226),
('He attain his perfection', -17.321, 12322)]
```
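<font color='blue' size=4>>>Approach:</font>
1. Use **linggle.search()** to check whether a collocation is correct. If the result of **linggle.search()** is >10000, judge it correct. If it is <10000, treat it as an error and look for a correct replacement collocation.
1. Use **v_cluster()** to find replacement collocations.
1. Use **linggle.search()** again to confirm whether each replacement collocation is correct; exclude it if the result is <10000.
1. Output the correct replacement collocations.
1. Output the modified sentence.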
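The filtering and substitution logic above can be sketched without the real services. Here `count_func` is a hypothetical stand-in for whatever count is derived from **linggle.search()** (its actual return format is not shown in this handout), `cluster_func` stands in for **v_cluster()**, and the language model score that appears in the sample output is left out. Apart from the two counts taken from the sample output, the toy numbers are invented.

```python
THRESHOLD = 10000  # frequency cut-off stated in the steps above

def correct_pair(vn_pair, tokens, count_func, cluster_func):
    """Return [(new_sentence, count), ...] for one pair, or [] if it looks fine."""
    (verb, v_idx), (noun, n_idx) = vn_pair
    if count_func(verb, noun) > THRESHOLD:
        return []                                # judged correct, nothing to fix
    suggestions = []
    for new_verb, new_noun, _mi, _sim in cluster_func(verb, noun):
        count = count_func(new_verb, new_noun)
        if count < THRESHOLD:
            continue                             # replacement is too rare
        new_tokens = list(tokens)                # copy so the input is untouched
        new_tokens[v_idx] = new_verb
        new_tokens[n_idx] = new_noun
        suggestions.append((" ".join(new_tokens), count))
    return sorted(suggestions, key=lambda s: s[1], reverse=True)

# Hypothetical stand-ins for linggle.search() counts and v_cluster() output.
toy_counts = {("reach", "dream"): 4000, ("achieve", "dream"): 51212,
              ("attain", "dream"): 900, ("achieve", "perfection"): 17226}

def toy_count(verb, noun):
    return toy_counts.get((verb, noun), 0)

def toy_cluster(verb, noun):
    return [("achieve", "dream", 5.0, 1.0), ("attain", "dream", 3.0, 1.0),
            ("achieve", "perfection", 4.0, 1.0)]

vn_pair = [[("reach", 1), ("dream", 3)]]
sentence = "He reach his dream".split()
results = [s for pair in vn_pair
           for s in correct_pair(pair, sentence, toy_count, toy_cluster)]
print(results)
# [('He achieve his dream', 51212), ('He achieve his perfection', 17226)]
```

Passing the lookup functions as arguments makes it easy to test the control flow offline before wiring in the real API.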
## Step 6: Part-of-speech tags and dependencies by spaCy
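Use spaCy's dependency parsing to extract the verb + object combinations in a sentence automatically.
If a token's .dep_ is **dobj**, find that token's **root** and treat the two as a collocation.
You need to check that the parts of speech of the root and the dobj are correct.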
[spaCy documentation link](https://spacy.io/api/annotation)
spaCy POS tag reference table: (Part-of-speech tagging >> English)
spaCy dependency labels: (Syntactic Dependency Parsing)
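One way the extraction could look, as a sketch. It interprets the "root" of a dobj token as its syntactic head (`token.head`), accepts only `VERB` heads and `NOUN` objects, and returns pairs in the `vn_pair` format from Step 5; all three are assumptions. `extract_vn_pairs` and `parse_sentence` are hypothetical names, and `en_core_web_sm` is just one English model that could be used.

```python
def extract_vn_pairs(doc):
    """Collect [(verb, index), (object, index)] pairs from a parsed sentence."""
    pairs = []
    for token in doc:
        if token.dep_ != "dobj":
            continue
        head = token.head   # the word this direct object attaches to
        # Check both parts of speech before accepting the pair.
        if head.pos_ == "VERB" and token.pos_ == "NOUN":
            pairs.append([(head.text, head.i), (token.text, token.i)])
    return pairs

def parse_sentence(sentence, model="en_core_web_sm"):
    """Parse a sentence with spaCy (needs spaCy and the model installed)."""
    import spacy   # imported lazily so extract_vn_pairs works on any token-like objects
    return spacy.load(model)(sentence)

# Usage (not run here): extract_vn_pairs(parse_sentence("He reach his dream"))
```

Note that `token.i` is the index in spaCy's tokenization, which can differ from `sentence.split()` when punctuation is attached to words.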
---
## 💯 Submission and Evaluation:
* (80 points) Complete up to step 5
* (100 points) Complete step 6
- [ ] Fill in the [demo time table](https://docs.google.com/spreadsheets/d/1Bw6qcqKV45LPLHe0Lc9DmFzQDikTCSqGbWEGuUcMNg8/edit#gid=1252683389), and wait for your turn
- [ ] Submit your code to iLMS
If you do not finish this assignment in class, you can still demo online within a week. <font color='blue'>The time of your submission will not affect your score.</font> If you have questions, post them on the iLMS discussion board. Online demos are available at:
* Tuesday 10:00 ~ 12:00
* Wednesday 15:30 ~ 17:30
* Thursday 15:30 ~ 17:30