Week7 Assignment: Using Church-Gale algorithm to build an ERRANT for Mandarin

# Week7 Assignment: Using Church-Gale algorithm to build an ERRANT for Mandarin Date: 2020/4/14 Lecture on Google Meet: https://meet.google.com/jdj-atsb-vpz ==Setup:== All required files are on iLMS. There is 1 file: 1. udn-template.ipynb (template) Also, visit https://github.com/ckiplab/ckiptagger to install and download model files. Mind the requirements for ckip are: * python>=3.6 * tensorflow>=1.13.1,<2 / tensorflow-gpu>=1.13.1,<2 (one of them) * gdown (optional, for downloading model files from google drive) You may use IDE such as Jupyter Notebook to run the files. Both .ipynb and .py would be acceptable. ==Description:== **ERRANT**, grammatical ERRor ANnotation Toolkit, is a tool kit that can automatically annotate parallel English sentences with error type information[1]. For example: **Original sentence:** `This are gramamtical sentence .` **Corrected sentence:** `This is a grammatical sentence .` **Output:** ``` S This are gramamtical sentence . A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0 A 2 2|||M:DET|||a|||REQUIRED|||-NONE-|||0 A 2 3|||R:SPELL|||grammatical|||REQUIRED|||-NONE-|||0 A -1 -1|||noop|||-NONE-|||REQUIRED|||-NONE-|||1 ``` --- In this assignment, you will be required to build an ERRANT for Madarin using Church-Gale algorithm. **Original sentence:** `'這也間接突顯出鴻海' ` **Corrected sentence:** `'也直接透露出鴻海'` **Output:** `[(0, 'D', '這'), (2, 'R', '間接', '直接'), (3, 'R', '突顯出', '透露出')]` :::info Mind that the Gale-Church Algorithm is not originally used to solve error-correction problems. In this assignment, we will use its concept and modify it hugely. Gale-Church Algorithm is a classical algorithm that is built in nltk, and has many tutorials. Check it out yourself if interested. ::: **Gale-Church Algorithm** > A method for aligning corresponding sentences in a parallel corpus. It works on the principle that equivalent sentences should roughly correspond in length—that is, longer sentences in one language should correspond to longer sentences in the other language. Reference: [Wikipedia: Gale–Church alignment algorithm](https://en.wikipedia.org/wiki/Gale%E2%80%93Church_alignment_algorithm) **Quick explanation on Gale-Church Algorithm** Reference: [Sentence alignment - Gale & Church | MT talks #7](https://www.youtube.com/watch?v=_4lnyoC3mtQ&fbclid=IwAR2s3tKaQlzcl1vdN8d2lDJI32-IqYLdnCRPv1NcheozQiT6OdMzoXUFk1o) **What is ckip package?** 國產的自動化中文斷詞工具，由中研院CKIP lab研發 --- ## :memo: This assignment is to do the following: ### Step 1: 加入詞性參考（詞性來自ckip）完成seg_pos()。使用ckip的ws model和pos model進行斷詞、判斷詞性 ws = word segmantation pos = part-of-speech 這兩個model請依照以上description的指示，到github下載。兩個model的用法： * ws()：輸入含有一個句子的list, 輸出含有該句斷詞的list。輸入`['這也間接突顯出鴻海']` 輸出`['這','也','間接','突顯出','鴻海']` * pos()：輸入含有許多斷詞的list，輸出斷詞的詞性。輸入`['這','也','間接','突顯出','鴻海']` 輸出`['Nep','D','D','VJ','Nb']` seg_pos()需要你做的是：輸入一串string，`input = '這也間接突顯出鴻海'` 輸出一串list，將句子斷詞並標上詞性，`output = [('這', 'Nep'), ('也', 'D'), ('間接', 'D'), ('突顯出', 'VJ'), ('鴻海', 'Nb')]` :::info **Tips:** 善用zip() ::: ### Step 2: 計算 edit cost 返回cost_matrix, op_matrix 完成calculate_edit_cost()。＞＞首先，建造出空的cost_matrix和op_matrix，假設輸入的句子為： ``` orig = [('這', 'Nep'), ('也', 'D'), ('間接', 'D'), ('突顯出', 'VJ'), ('鴻海', 'Nb')] cor = [('也', 'D'), ('直接', 'VH'), ('透露出', 'VK'), ('鴻海', 'Nb')] ``` orig為修正前的句子，cor為修正後的句子。 cost_matrix和op_matrix的大小相同，都是(len(orig)+1) * (len(cor)+1)： | | 也D | 直接VH | 透露出VK | 鴻海Nb | | -------- | -------- | -------- | -------- | -------- | | 這Nep | | | | | | 也D | | | | | | 間接D | | | | | | 突顯出VJ | | | | | | 鴻海Nb | | | | | :::danger :exclamation:注意column為orig，row為cor。 ::: ＞＞再來，我們先填入cost_matrix和op_matrix的第一行和第一列（初始值）。 cost_matrix的初始值為斷詞+詞性與空字串之間的差異字數量，在此我們使用nltk.edit_distance()來計算，用法如下： `nltk.edit_distance('你好嗎', '你好') = 1`，因為兩個字串差了一個字'嗎' 故cost_matrix的初始值為： | 0 | ~~也D~~ 2 | ~~直接VH~~ 4 | ~~透露出VK~~ 5 | ~~鴻海Nb~~ 4 | | --------- | ---- | ------- | --------- | ------- | | ~~這Nep~~ 4 | | | | | | ~~也D~~ 2 | | | | | | ~~間接D~~ 3 | | | | | | ~~突顯出VJ~~ 5 | | | | | | ~~鴻海Nb~~ 4 | | | | | op_matrix的首列為 I（Insert），首行為D（Delete）: | 0 | ~~也D~~ I [0,j] | ~~直接VH~~ I | ~~透露出VK~~ I | ~~鴻海Nb~~ I | | ------------------- |:---------------:|:------------:| -------------- | ------------ | | ~~這Nep~~ D [i,0] | [i, j] | | | | | ~~也D~~ D [i+1,0] | [i+1, j] | | | | | ~~間接D~~ D | | | | | | ~~突顯出VJ~~ D | | | | | | ~~鴻海Nb~~ D | | | | | ＞＞接下來，你要填入matrix中剩下的值： **op_matrix**需要填入符號：**M**（Match）、**I**、**D**、**R**（Replace），或是混合型**I/D**（Insert & Delete）。以上述空的op_matrix為例， * [i,j]所代表的為：從[i,0]修正為[0,j]所發生的編輯種類。此例編輯種類應為Insert和Delete，故[i,j]格填入"I/D"。 * 再以[i+1,j]格為例，[i+1,0]和[0,j]並無修正，故[i+1,j]格填入"M" 這時會遇到一個問題：*I/D和R如何區分？* 我們會計算sub_cost（substitude_cost，及為replace_cost），若sub_cost < (insert_cost + delete_cost)，則以Replace標記。sub_cost的算法將在下面介紹。 **cost_matrix**需要填入數字，每個數字代表的是修正前跟修正後的差異性。根據在op_matrix中的標記，計算出相應的cost： * 若[i,j]在op_matrix中標記為I/D，我們需要計算出insert_cost和delete_cost，相加並存入cost_matrix[i,j]。 * 若標記為D，存入delete_cost。 * 若標記為I，存入insert_cost。 delete_cost算法：刪去的字串+詞性與空字串的相異字數。 insert_cost算法：加入的字串+詞性與空字串的相異字數。 :::info 上述兩者皆可利用nltk.edit_distance()來完成。字串與詞性皆須納入考慮，例如：將('鴻海', 'Nb')視為'鴻海Nb' ::: sub_cost的算法較為特別，我們會在step3時實作一個get_sub_cost()，將此函數返回的值，再加上被修正字符的位置與修正字符位置之間的距離，及為sub_cost。 `sub_cost = get_sub_cost(orig, cor) + 距離` ### Step 3: 計算Replace的cost 完成get_sub_cost(o, c)，並return一個cost（數字）；o代表修正前（original），c代表修正後（corrected）。 此部分為自由發揮，盡量找出I/D與Replace不同的特徵並賦予權重。 例如：這-->也：這會是I/D，因為"這"和"也"並沒有關聯性。突顯出-->透露出：這很有可能是Replace，因為這兩個詞高度相關。 :::warning :bulb:**Suggestion** **檢查詞性**、詞的長度、詞的差異等作為判斷依據。詞性簡寫所代表的意思：https://github.com/ckiplab/ckiptagger/wiki/POS-Tags 不斷執行calculate_edit_cost(orig, cor)檢查結果，以調整到最佳權重。 ::: ### Step 4: format cost_matrix and op_matrix 完成format_align()。利用cost_matrix和op_matrix的資訊，整理出該句的所有編輯。 Input: ``` orig = ['這', '也', '間接', '突顯出', '鴻海'] cor = ['也', '直接', '透露出', '鴻海'] ``` Output: `[(0, 'D', '這'), (2, 'R', '間接', '直接'), (3, 'R', '突顯出', '透露出')]` Output format: [(編輯位置, 編輯種類, Delete/Insert/Replace的詞), (), (), ...] * 若編輯種類為Replace，要印出被取代和取代的詞 * 若為Match，不用印任何東西 （4/14 22:34補）這個步驟是要從matrix中，判斷correction sentence對於original sentence做了什麼修正。故我們檢查op_matrix中的第二行，代表的是orig的"這"，發現一整行都是I/D，就表示"這"是被Delete的，故輸出(0, 'D', '這')。再舉個例子，第五行同時出現了1個I/D和3個R，取cost最小的那個，若cost相同則優先取R，及為(3, 'R', '突顯出', '透露出')。 --- ## 💯 Submission and Evaluation: * (80 points) Completing the two matrices Be prepared to explain how you design the matrices, and how you calculate sub_cost. * (100 points) Completing step 4 - [ ] Fill in the [demo time table](https://docs.google.com/spreadsheets/d/1Bw6qcqKV45LPLHe0Lc9DmFzQDikTCSqGbWEGuUcMNg8/edit#gid=1926325713), and wait for your turn - [ ] Submit your code to iLMS If you do not finish this assignment in class, you can still demo online within a week. The time of your submission will not affect on your score. If you have questions, put it on the iLMS discussion board. Online demos available at: * Tuesday 10:00 ~ 12:00 * Wednesday 15:30 ~ 17:30 * Thursday 15:30 ~ 17:30 --- ## :book: Reference > [1] Christopher Bryant, Mariano Felice, and Ted Briscoe. 2017. Automatic annotation and evaluation of error types for grammatical error correction. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vancouver, Canada. > [1] Mariano Felice, Christopher Bryant, and Ted Briscoe. 2016. Automatic extraction of learner errors in ESL sentences using linguistically enhanced alignments. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. Osaka, Japan.