Week 3 Assignment: Chinese Word Segmentation
===
Date: 2020/03/17

## ==Setup==

All required files are on iLMS. There are two files:

- `hw2.py` or `hw2.ipynb` (your code)
- `merge.tsv` (data)

## ==Description==

In this assignment, you will implement a Chinese word segmentation model and use it to tokenize sentences into words. For example:

```
Input  中華民國於一九九五年三月一日開始實施全民健保
Output ['中華民國', '於', '一九九五年', '三月', '一日', '開始', '實施', '全民健保']
```

There are many ways to split a sentence into words. Take 水淹金山寺 for example; it could be segmented as:

```
水 淹 金山寺
水淹 金山寺
水淹 金山 寺
水 淹 金 山 寺
...
```

Our mission is to find the best segmentation among all of these. So the first step is to enumerate every possible segmentation path; an efficient way to do this is to build a DAG. The second step is to find the best path among those paths; one way to do that is dynamic programming (DP).

:::info
Hint: you can refer to Lecture 1 for more information.
:::

## ==Requirement==

### 1. Convert a sentence to a DAG (Directed Acyclic Graph)

Convert the sentence into a DAG whose edges are candidate tokens.

:::info
Hint: you can use merge.tsv to get all possible tokens. The data format is `token \t count`. We provide the following code to load the data:

```python=
# Convert the data into a dict of token counts, including every prefix.
# Take '中華民國' for example:
#   tokens['中'] = count
#   tokens['中華'] = count
#   tokens['中華民'] = 0, if '中華民' is not in the data
#   tokens['中華民國'] = count
tokens = {}
N = 0
for i in datafile("merge.tsv", sep="\t"):
    name, count = i[0], int(i[1])  # cast the count, in case it is read as a string
    N += count
    tokens[name] = count
    for j in range(len(name)):
        t = name[: j + 1]
        if t not in tokens:
            tokens[t] = 0
```
:::

```
Take '火鍋是四川的特色' for example:
      '01234567'  (character indices)
[
  {0: 1351, 1: 327},   # from index 0: "火" (count 1351), "火鍋" (count 327)
  {1: 633},            # from index 1: "鍋" (count 633)
  {2: 346856},         # from index 2: "是" (count 346856)
  {3: 3640, 4: 348},   # from index 3: "四" (count 3640), "四川" (count 348)
  {4: 138},            # from index 4: "川" (count 138)
  {5: 1185058},        # from index 5: "的" (count 1185058)
  {6: 1245, 7: 4453},  # from index 6: "特" (count 1245), "特色" (count 4453)
  {7: 835}             # from index 7: "色" (count 835)
]

Each list position is a start index; each key is the end index of a candidate
token and each value is that token's count. You may use any other data
structure you like.
```

Code:

```python
def build_tree(sentence):
    pass
```

### 2. Convert a sentence to tokens

You must use dynamic programming to solve this problem.

```python
# Use a closure to simulate private variables.
# If you don't know what a closure is, that's OK; just implement the dp function.
# Example:
# >>> print(parse('孫中山先生推翻滿清建立中華民國'))
# [-26.955841777998064, ['孫中山', '先生', '推翻', '滿清', '建立', '中華民國']]
def parse(sentence):
    tree = build_tree(sentence)

    # Use dynamic programming to parse the sentence into tokens.
    # !!!note!!! You should cache the function's results.
    # !!!hint!!! Python's built-in lru_cache can do the caching for you.
    def dp(index=0):
        pass

    # !!!note!!! You must return [probability, [tokens]].
    return dp(0)
```

:::danger
:bulb: You must return `[probability, [tokens]]`, for example `[-10.970361739190729, ['今天', '可能', '下雨']]`.
:::

:::info
Hints:
1. To avoid underflow, work with log probabilities, e.g. log(p1 * p2 * p3) = log(p1) + log(p2) + log(p3).
2. Python's built-in `functools.lru_cache` can cache function results for you.
:::
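For reference, here is a minimal sketch of one possible approach, assuming the `tokens` dict and total count `N` from the loading snippet above are in scope. The unigram probability `count / N` and the fallback count of 1 for unseen characters are illustrative assumptions, so the exact log-probabilities may differ from the reference outputs below.

```python
import math
from functools import lru_cache

# Sketch only: assumes `tokens` (token/prefix counts) and `N` (total count)
# from the loading snippet are already defined at module level.

def build_tree(sentence):
    """tree[i] maps end index j to the count of the candidate token sentence[i:j+1]."""
    tree = []
    for i in range(len(sentence)):
        ends = {}
        j = i
        frag = sentence[i]
        while j < len(sentence) and frag in tokens:
            if tokens[frag] > 0:      # frag is a real token, not just a prefix
                ends[j] = tokens[frag]
            j += 1
            frag = sentence[i:j + 1]
        if not ends:                  # unseen character: assume count 1 so log() is defined
            ends[i] = 1
        tree.append(ends)
    return tree

def parse(sentence):
    tree = build_tree(sentence)

    @lru_cache(maxsize=None)          # each suffix of the sentence is solved only once
    def dp(index=0):
        if index == len(sentence):
            return [0.0, []]          # log(1) = 0 and an empty token list
        best = [-math.inf, []]
        for end, count in tree[index].items():
            logp = math.log(count / N)            # unigram log-probability
            sub_logp, sub_tokens = dp(end + 1)
            if logp + sub_logp > best[0]:
                best = [logp + sub_logp, [sentence[index:end + 1]] + sub_tokens]
        return best

    return dp(0)
```

Because `dp(index)` depends only on `index`, `lru_cache` makes the recursion run in time proportional to the number of DAG edges rather than the number of segmentation paths.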
---

## ==Testing data==

```
[-10.970361739190729, ['今天', '可能', '下雨']]
[-31.937781660374768, ['中華民國', '於', '一九九五年', '三月', '一日', '開始', '實施', '全民健保']]
[-26.955841777998064, ['孫中山', '先生', '推翻', '滿清', '建立', '中華民國']]
[-27.05945291889755, ['教授', '應該', '要', '為', '辛苦', '的', '助教', '加薪']]
[-16.95450965070056, ['火鍋', '是', '四川', '的', '特色']]
[-25.07845561290892, ['波士頓', '茶葉', '事件', '促使', '美國', '革命']]
[-23.817296657004494, ['羅馬', '帝國', '皇帝', '遭到', '殺害']]
```

---

## ==Submission and Evaluation==

1. Submit your code to jedi.nlplab.cc:3000.
2. Demo your code to the TA and explain how you designed each function.
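As a quick sanity check before your demo, you can run `parse` over the testing sentences (recovered by joining the expected token lists above); this assumes `parse` is implemented as sketched earlier.

```python
# Sanity check: each sentence is the corresponding expected token list, joined back together.
test_sentences = [
    '今天可能下雨',
    '中華民國於一九九五年三月一日開始實施全民健保',
    '孫中山先生推翻滿清建立中華民國',
    '教授應該要為辛苦的助教加薪',
    '火鍋是四川的特色',
    '波士頓茶葉事件促使美國革命',
    '羅馬帝國皇帝遭到殺害',
]

for s in test_sentences:
    print(parse(s))  # expected: [log-probability, [tokens]] as listed above
```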