Week 3 Assignment: Chinese word segmentation
===
Date: 2020/03/17
## ==Setup==
All required files are on iLMS. There are two files:
- `hw2.py` or `hw2.ipynb` (your code)
- `merge.tsv` (data)
## ==Description==
In this assignment, you will implement a Chinese word segmentation model.
You will use the model to tokenize a sentence into words. For example:
```
Input
中華民國於一九九五年三月一日開始實施全民健保
Output
['中華民國', '於', '一九九五年', '三月', '一日', '開始', '實施', '全民健保']
```
There are many ways to split a sentence into words. Take 水淹金山寺 for example; it may be segmented as
- 水 淹 金山寺
- 水淹 金山寺
- 水淹 金山 寺
- 水 淹 金 山 寺
- ...

Our mission is to find the best segmentation among those candidates.
So the first step is to expand the sentence into all possible segmentation paths; one efficient way to represent them is to build a DAG.
Second, we try to find the best path among those candidates; one way to do that is dynamic programming (DP).
:::info
hint:
you can refer to Lecture 1 for more information.
:::
## ==Requirements==
### 1. Convert a sentence to a DAG (Directed Acyclic Graph)
Convert the sentence's possible tokens into a DAG (Directed Acyclic Graph).
:::info
hint:
you can use merge.tsv to get all possible tokens.
the data format is
`token \t count`
we provide the following code to load the data:
```python=
# convert the data to a dict in the following way
# take '中華民國' for example:
#   tokens['中'] = count
#   tokens['中華'] = count
#   tokens['中華民'] = 0, if '中華民' is not in the data
#   tokens['中華民國'] = count
tokens = {}
N = 0
with open("merge.tsv", encoding="utf-8") as f:
    for line in f:
        name, count = line.rstrip("\n").split("\t")
        count = int(count)  # counts are stored as text in the tsv
        N += count
        tokens[name] = count
        # also record every prefix of the token (count 0 if unseen),
        # so the DAG builder knows when to stop extending a word
        for j in range(len(name)):
            t = name[: j + 1]
            if t not in tokens:
                tokens[t] = 0
```
:::
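As a quick sanity check after loading (a hypothetical snippet; the printed numbers depend on the actual data):
```python
print(N)                          # total count over all tokens
print(tokens.get('中華民國', 0))  # count of a token that appears in the data
print(tokens.get('中華民', 0))    # 0 if it was recorded only as a prefix
```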
```
take '火鍋是四川的特色' for example (character indices 0-7):
[
    {0: 1351, 1: 327},  # words starting at 0 ("火"): end 0 -> "火" (count 1351), end 1 -> "火鍋" (count 327)
    {1: 633},           # words starting at 1 ("鍋"): end 1 -> "鍋" (count 633)
    {2: 346856},        # words starting at 2 ("是"): end 2 -> "是" (count 346856)
    {3: 3640, 4: 348},  # words starting at 3 ("四"): end 3 -> "四" (count 3640), end 4 -> "四川" (count 348)
    {4: 138},           # words starting at 4 ("川"): end 4 -> "川" (count 138)
    {5: 1185058},       # words starting at 5 ("的"): end 5 -> "的" (count 1185058)
    {6: 1245, 7: 4453}, # words starting at 6 ("特"): end 6 -> "特" (count 1245), end 7 -> "特色" (count 4453)
    {7: 835}            # words starting at 7 ("色"): end 7 -> "色" (count 835)
]
```
Or you can use any data structure you like.
Code:
```python
def build_tree(sentence):
    # build the DAG for `sentence`, e.g. as the list of dicts shown above
    pass
```
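For reference, here is a minimal sketch of one possible `build_tree`, assuming the `tokens` dict loaded above is available as a global; the count-1 fallback for characters missing from the data is our assumption, not part of the spec:
```python
def build_tree(sentence):
    # dag[i] maps each reachable end index to the count of sentence[i:end+1]
    dag = []
    for start in range(len(sentence)):
        ends = {}
        end = start
        # extend the candidate word while its prefix is still in `tokens`;
        # prefixes with count 0 keep the scan going without adding an edge
        while end < len(sentence) and sentence[start:end + 1] in tokens:
            count = tokens[sentence[start:end + 1]]
            if count > 0:
                ends[end] = count
            end += 1
        if not ends:
            ends[start] = 1  # assumption: an unseen character gets count 1
        dag.append(ends)
    return dag
```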
### 2. Convert a sentence to tokens
You must use dynamic programming to solve this problem.
```python
from functools import lru_cache

# dp is a closure over sentence and tree, which act like private variables.
# If you don't know what a closure is, that's OK; just fill in the dp function.
# example:
# >>> print(parse('孫中山先生推翻滿清建立中華民國'))
# [-26.955841777998064, ['孫中山', '先生', '推翻', '滿清', '建立', '中華民國']]
def parse(sentence):
    tree = build_tree(sentence)
    # use dynamic programming to parse the sentence into tokens
    # !!!note!!! you should cache the results of dp calls.
    # !!!hint!!! Python's built-in lru_cache can cache function results for you.
    def dp(index=0):
        pass
    # !!!note!!! you must return [probability, [tokens]]
    return dp(0)
```
:::danger
:bulb: You must return `[probability, [tokens]]`.
For example:
[-10.970361739190729, ['今天', '可能', '下雨']]
:::
:::info
Hint:
1. To avoid underflow, please use log probabilities,
   e.g. log(p1 * p2 * p3) = log(p1) + log(p2) + log(p3).
2. Python's built-in lru_cache can cache function results for you.
:::
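Putting the hints together, here is a minimal sketch of one possible `dp` design, assuming `build_tree`, the `tokens` dict, and the total count `N` from above; it scores each word with a unigram model, log(count / N). Treat it as a starting point, not the required solution:
```python
import math
from functools import lru_cache

def parse(sentence):
    tree = build_tree(sentence)

    @lru_cache(maxsize=None)  # cache results per start index
    def dp(index=0):
        if index == len(sentence):
            return (0.0, ())  # empty suffix: log-probability 0
        best = (-math.inf, ())
        # try every word that starts at `index`
        for end, count in tree[index].items():
            word = sentence[index:end + 1]
            rest_prob, rest_tokens = dp(end + 1)
            score = math.log(count / N) + rest_prob
            if score > best[0]:
                best = (score, (word,) + rest_tokens)
        return best

    prob, words = dp(0)
    return [prob, list(words)]  # required format: [probability, [tokens]]
```
Tuples are used inside `dp` so the cached values stay immutable; the outer call converts back to the required `[probability, [tokens]]` format.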
---
## ==Testing data==
```
[-10.970361739190729, ['今天', '可能', '下雨']]
[-31.937781660374768, ['中華民國', '於', '一九九五年', '三月', '一日', '開始', '實施', '全民健保']]
[-26.955841777998064, ['孫中山', '先生', '推翻', '滿清', '建立', '中華民國']]
[-27.05945291889755, ['教授', '應該', '要', '為', '辛苦', '的', '助教', '加薪']]
[-16.95450965070056, ['火鍋', '是', '四川', '的', '特色']]
[-25.07845561290892, ['波士頓', '茶葉', '事件', '促使', '美國', '革命']]
[-23.817296657004494, ['羅馬', '帝國', '皇帝', '遭到', '殺害']]
```
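To check your implementation against the expected outputs above, you can run a loop like this (the test sentences are just the expected tokens joined back together):
```python
test_sentences = [
    '今天可能下雨',
    '中華民國於一九九五年三月一日開始實施全民健保',
    '孫中山先生推翻滿清建立中華民國',
    '教授應該要為辛苦的助教加薪',
    '火鍋是四川的特色',
    '波士頓茶葉事件促使美國革命',
    '羅馬帝國皇帝遭到殺害',
]
for s in test_sentences:
    print(parse(s))
```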
---
## ==Submission and Evaluation==
1. Submit your code to jedi.nlplab.cc:3000.
2. Demo your code to the TA and explain how you designed every function.