Week 3 Assignment: Chinese word segmentation
===
Date: 2020/03/17
## ==Setup==
All required files are on iLMS. There are two files:
- `hw2.py` or `hw2.ipynb` (your code)
- `merge.tsv` (data)
## ==Description==
In this assignment, you will implement a Chinese word segmentation model.
You will use the model to tokenize a sentence into words. For example:
```
Input
中華民國於一九九五年三月一日開始實施全民健保
Output
['中華民國', '於', '一九九五年', '三月', '一日', '開始', '實施', '全民健保']
```
There are many ways to split a sentence into words. Take 水淹金山寺 for example; it may be segmented as
- 水 淹 金山寺
- 水淹 金山寺
- 水淹 金山 寺
- 水 淹 金 山 寺
- ...

Our mission is to find the best segmentation among those candidates.
So the first step is to expand the sentence into all possible segmentation paths; one efficient way to represent them is to build a DAG.
Second, we try to find the best path among those candidates; one way to do that is dynamic programming (DP).
:::info
hint:
you can refer to Lecture 1 for more information.
:::
## ==Requirements==
### 1. Convert a sentence to a DAG (Directed Acyclic Graph)
Convert the sentence's possible tokens into a DAG (Directed Acyclic Graph).
:::info
hint:
you can use merge.tsv to get all possible tokens.
the data format is
`token \t count`
we provide the following code to load the data:
```python=
# convert the data to a dict in the following way
# take '中華民國' for example:
#   tokens['中'] = count
#   tokens['中華'] = count
#   tokens['中華民'] = 0, if '中華民' is not in the data
#   tokens['中華民國'] = count
tokens = {}
N = 0
with open("merge.tsv", encoding="utf-8") as f:
    for line in f:
        name, count = line.rstrip("\n").split("\t")
        count = int(count)  # counts are stored as text in the tsv
        N += count
        tokens[name] = count
        # also record every prefix of the token (count 0 if unseen),
        # so the DAG builder knows when to stop extending a word
        for j in range(len(name)):
            t = name[: j + 1]
            if t not in tokens:
                tokens[t] = 0
```
:::
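As a quick sanity check after loading (a hypothetical snippet; the printed numbers depend on the actual data):
```python
print(N)                          # total count over all tokens
print(tokens.get('中華民國', 0))  # count of a token that appears in the data
print(tokens.get('中華民', 0))    # 0 if it was recorded only as a prefix
```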
```
take '火鍋是四川的特色' for example (character indices 0-7):
[
    {0: 1351, 1: 327},  # words starting at 0 ("火"): end 0 -> "火" (count 1351), end 1 -> "火鍋" (count 327)
    {1: 633},           # words starting at 1 ("鍋"): end 1 -> "鍋" (count 633)
    {2: 346856},        # words starting at 2 ("是"): end 2 -> "是" (count 346856)
    {3: 3640, 4: 348},  # words starting at 3 ("四"): end 3 -> "四" (count 3640), end 4 -> "四川" (count 348)
    {4: 138},           # words starting at 4 ("川"): end 4 -> "川" (count 138)
    {5: 1185058},       # words starting at 5 ("的"): end 5 -> "的" (count 1185058)
    {6: 1245, 7: 4453}, # words starting at 6 ("特"): end 6 -> "特" (count 1245), end 7 -> "特色" (count 4453)
    {7: 835}            # words starting at 7 ("色"): end 7 -> "色" (count 835)
]
```
Or you can use any data structure you like.
Code:
```python
def build_tree(sentence):
    # build the DAG for `sentence`, e.g. as the list of dicts shown above
    pass
```
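For reference, here is a minimal sketch of one possible `build_tree`, assuming the `tokens` dict loaded above is available as a global; the count-1 fallback for characters missing from the data is our assumption, not part of the spec:
```python
def build_tree(sentence):
    # dag[i] maps each reachable end index to the count of sentence[i:end+1]
    dag = []
    for start in range(len(sentence)):
        ends = {}
        end = start
        # extend the candidate word while its prefix is still in `tokens`;
        # prefixes with count 0 keep the scan going without adding an edge
        while end < len(sentence) and sentence[start:end + 1] in tokens:
            count = tokens[sentence[start:end + 1]]
            if count > 0:
                ends[end] = count
            end += 1
        if not ends:
            ends[start] = 1  # assumption: an unseen character gets count 1
        dag.append(ends)
    return dag
```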
### 2. Convert a sentence to tokens
You must use dynamic programming to solve this problem.
```python
from functools import lru_cache

# dp is a closure over sentence and tree, which act like private variables.
# If you don't know what a closure is, that's OK; just fill in the dp function.
# example:
# >>> print(parse('孫中山先生推翻滿清建立中華民國'))
# [-26.955841777998064, ['孫中山', '先生', '推翻', '滿清', '建立', '中華民國']]
def parse(sentence):
    tree = build_tree(sentence)
    # use dynamic programming to parse the sentence into tokens
    # !!!note!!! you should cache the results of dp calls.
    # !!!hint!!! Python's built-in lru_cache can cache function results for you.
    def dp(index=0):
        pass
    # !!!note!!! you must return [probability, [tokens]]
    return dp(0)
```
:::danger
:bulb: You must return `[probability, [tokens]]`.
For example:
[-10.970361739190729, ['今天', '可能', '下雨']]
:::
:::info
Hint:
1. To avoid underflow, please use log probabilities,
   e.g. log(p1 * p2 * p3) = log(p1) + log(p2) + log(p3).
2. Python's built-in lru_cache can cache function results for you.
:::
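Putting the hints together, here is a minimal sketch of one possible `dp` design, assuming `build_tree`, the `tokens` dict, and the total count `N` from above; it scores each word with a unigram model, log(count / N). Treat it as a starting point, not the required solution:
```python
import math
from functools import lru_cache

def parse(sentence):
    tree = build_tree(sentence)

    @lru_cache(maxsize=None)  # cache results per start index
    def dp(index=0):
        if index == len(sentence):
            return (0.0, ())  # empty suffix: log-probability 0
        best = (-math.inf, ())
        # try every word that starts at `index`
        for end, count in tree[index].items():
            word = sentence[index:end + 1]
            rest_prob, rest_tokens = dp(end + 1)
            score = math.log(count / N) + rest_prob
            if score > best[0]:
                best = (score, (word,) + rest_tokens)
        return best

    prob, words = dp(0)
    return [prob, list(words)]  # required format: [probability, [tokens]]
```
Tuples are used inside `dp` so the cached values stay immutable; the outer call converts back to the required `[probability, [tokens]]` format.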
---
## ==Testing data==
```
[-10.970361739190729, ['今天', '可能', '下雨']]
[-31.937781660374768, ['中華民國', '於', '一九九五年', '三月', '一日', '開始', '實施', '全民健保']]
[-26.955841777998064, ['孫中山', '先生', '推翻', '滿清', '建立', '中華民國']]
[-27.05945291889755, ['教授', '應該', '要', '為', '辛苦', '的', '助教', '加薪']]
[-16.95450965070056, ['火鍋', '是', '四川', '的', '特色']]
[-25.07845561290892, ['波士頓', '茶葉', '事件', '促使', '美國', '革命']]
[-23.817296657004494, ['羅馬', '帝國', '皇帝', '遭到', '殺害']]
```
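To check your implementation against the expected outputs above, you can run a loop like this (the test sentences are just the expected tokens joined back together):
```python
test_sentences = [
    '今天可能下雨',
    '中華民國於一九九五年三月一日開始實施全民健保',
    '孫中山先生推翻滿清建立中華民國',
    '教授應該要為辛苦的助教加薪',
    '火鍋是四川的特色',
    '波士頓茶葉事件促使美國革命',
    '羅馬帝國皇帝遭到殺害',
]
for s in test_sentences:
    print(parse(s))
```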
---
## ==Submission and Evaluation==
1. Submit your code to jedi.nlplab.cc:3000.
2. Demo your code to the TA and explain how you designed every function.