---
title: Chinese Word Segmentation Techniques
tags: self-learning, NLP
---
# **_Chinese Word Segmentation Techniques_**
[TOC]
:::warning
**_References:_**
* [NLP — Using N-grams to Implement Input Text Prediction](https://medium.com/%E6%89%8B%E5%AF%AB%E7%AD%86%E8%A8%98/%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86-%E4%BD%BF%E7%94%A8-n-gram-%E5%AF%A6%E7%8F%BE%E8%BC%B8%E5%85%A5%E6%96%87%E5%AD%97%E9%A0%90%E6%B8%AC-10ac622aab7a)
* [How to Train a Language Model with Probabilities to Quickly Predict the Next Word](https://kknews.cc/zh-tw/education/ejqzjbr.html)
* [Language Models and Their Applications](https://kknews.cc/zh-tw/tech/l3qya2b.html)
* [Language Modeling Tutorial](https://www.slideshare.net/ckmarkohchang/language-modeling-51286122)
* [Hidden Markov Model](https://ckmarkoh.github.io/blog/2014/04/03/natural-language-processing-hidden-markov-models/)
* Supplementary
    * [An Association-Rule-Based Language Model for Extracting Long-Distance Word Associations in Chinese Text](https://www.aclweb.org/anthology/O01-1002.pdf)
    * [Using collections.namedtuple to Work with Simple Object Structures](https://qiubite31.github.io/2017/03/02/%E4%BD%BF%E7%94%A8collections%E4%B8%AD%E7%9A%84namedtuple%E4%BE%86%E6%93%8D%E4%BD%9C%E7%B0%A1%E5%96%AE%E7%9A%84%E7%89%A9%E4%BB%B6%E7%B5%90%E6%A7%8B/)
    * [How Bayesian Inference Works](https://brohrer.mcknote.com/zh-Hant/statistics/how_bayesian_inference_works.html)
:::
---
## **Preface**
* "[Deep learning](https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/585610/)" refers to the family of machine-learning ++algorithms++ built on neural networks with deep (multi-layer) structures.
* Deep learning has brought significant progress to the following NLP tasks:
    * [Named Entity Recognition (NER)](https://www.chainnews.com/zh-hant/articles/690356109154.htm)
    * [Part-of-speech Tagging (POS Tagging)](https://kknews.cc/zh-tw/education/rlv43kx.html)
    * [Sentiment Analysis](https://twgreatdaily.com/1o7BXW4BMH2_cNUgv02-.html)
    > [Deep Learning for Sentiment Analysis: A Survey](https://arxiv.org/abs/1801.07883)
    * Machine Translation
---
==The term "language model" may sound profound, but it simply refers to "the probability of a sentence."==
> [Language Models and N-gram Principles](https://medium.com/%E6%89%8B%E5%AF%AB%E7%AD%86%E8%A8%98/%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86-%E4%BD%BF%E7%94%A8-n-gram-%E5%AF%A6%E7%8F%BE%E8%BC%B8%E5%85%A5%E6%96%87%E5%AD%97%E9%A0%90%E6%B8%AC-10ac622aab7a)
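The idea can be shown with a minimal bigram sketch: a sentence's probability is the product of each character's conditional probability given the previous one. The probability table below is made up purely for illustration and is not estimated from any corpus.

```python
# Toy bigram table: P(current | previous). Values are invented for illustration.
bigram_prob = {
    ('<s>', '我'): 0.5,
    ('我', '愛'): 0.4,
    ('愛', '你'): 0.3,
    ('你', '</s>'): 0.6,
}

def sentence_prob(sentence):
    """Score a sentence as the product of bigram probabilities.

    Unseen bigrams get probability 0 (no smoothing in this sketch).
    """
    chars = ['<s>'] + list(sentence) + ['</s>']
    prob = 1.0
    for prev, cur in zip(chars, chars[1:]):
        prob *= bigram_prob.get((prev, cur), 0.0)
    return prob

print(round(sentence_prob('我愛你'), 3))  # 0.5 * 0.4 * 0.3 * 0.6 = 0.036
```

With real data, the table entries would come from relative-frequency counts over a corpus, which is exactly what the statistical implementation below does.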
---
## **Statistical Word Segmentation: Implementation**
### 0. Load the corpus and extract the text stored under the ['detailcontent'] key of the .json file
```python=
from collections import Counter, namedtuple
import json
import re

DATASET_DIR = './WebNews.json'

# Load the JSON corpus and keep only the 'detailcontent' text of each article.
with open(DATASET_DIR, encoding='utf8') as f:
    dataset = json.load(f)
seg_list = list(map(lambda d: d['detailcontent'], dataset))
```
### 1. Keep only the Han characters from each entry of seg_list and append the results to NEW_seg_list
```python=
rule = re.compile('[^\u4e00-\u9fa5]')  # everything outside the Unicode range of common Han characters
NEW_seg_list = []

## (Method 1) list comprehension
NEW_seg_list = [rule.sub('', seg) for seg in seg_list]

## (Method 2) explicit loops; equivalent to Method 1, so run one or the other
NEW_seg_list = []  # reset, otherwise Method 1's results would be duplicated
string = ''
for i in seg_list:
    for j in i:
        string += rule.sub('', j)
    NEW_seg_list.append(string)
    string = ''

print(NEW_seg_list[2])
```
### 2. Conditional probability: count how often each character follows a given context, then sort from most to least likely
```python=
def ngram(documents, N=2):
    ngram_prediction = dict()
    total_grams = list()
    words = list()
    Word = namedtuple('Word', ['word', 'prob'])  # a candidate next character and its probability
    for doc in documents:  # one article of the corpus
        split_words = ['<s>'] + list(doc) + ['</s>']
        # numerator: collect all N-grams
        for i in range(len(split_words) - N + 1):
            total_grams.append(tuple(split_words[i:i+N]))
        # denominator: collect all (N-1)-gram contexts
        for i in range(len(split_words) - N + 2):
            words.append(tuple(split_words[i:i+N-1]))
    total_word_counter = Counter(total_grams)
    word_counter = Counter(words)
    for key in total_word_counter:
        word = ''.join(key[:N-1])
        if word not in ngram_prediction:
            ngram_prediction[word] = set()
        # P(next | context) = count(context + next) / count(context)
        next_word_prob = total_word_counter[key] / word_counter[key[:N-1]]
        w = Word(key[-1], '{:.3g}'.format(next_word_prob))
        ngram_prediction[word].add(w)
    return ngram_prediction
```
```python=
tri_prediction = ngram(NEW_seg_list, N=3)
for word, ng in tri_prediction.items():
    # prob is stored as a string, so convert it for a numeric (not lexicographic) sort
    tri_prediction[word] = sorted(ng, key=lambda x: float(x.prob), reverse=True)
```
### 3. Print the five characters most likely to follow '韓國'
```python=
text = '韓國'
next_words = tri_prediction[text][:5]  # the value is already a list sorted by probability
for next_word in next_words:
    print('next word: {}, probability: {}'.format(next_word.word, next_word.prob))
```
---
## The Statistical Principle of the Language Model
* Given that '韓國' has already appeared, what is the probability that '隊' follows it? That is, estimate P(隊 | 韓國).
```python=
string = ''.join(NEW_seg_list)  # concatenate the entire cleaned corpus into one string

# numerator: occurrences of the joint pattern '韓國隊'
r1 = re.findall('韓國隊', string)
print("'韓國隊': {} occurrences.\n".format(len(r1)), r1, '\n')

# denominator: occurrences of the context '韓國' alone
r2 = re.findall('韓國', string)
print("'韓國': {} occurrences.\n".format(len(r2)), r2)

print('\n{} / {} = {}'.format(len(r1), len(r2), round(len(r1) / len(r2), 3)), '\n---\n')
```
## Output
```
'韓國隊': 5 occurrences.
['韓國隊', '韓國隊', '韓國隊', '韓國隊', '韓國隊']
'韓國': 28 occurrences.
['韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國']
5 / 28 = 0.179
---
next word: 隊, probability: 0.179
next word: 及, probability: 0.179
next word: 明, probability: 0.0714
next word: 日, probability: 0.0714
next word: 音, probability: 0.0357
```
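The same relative-frequency estimate can be sanity-checked on a toy string. Here `corpus`, `context`, and `nxt` are invented for illustration; the counting logic mirrors the corpus code above.

```python
import re

# Toy corpus: 'AB' occurs 4 times, and 3 of those occurrences are followed by 'C'.
corpus = 'ABCxABCyABzABC'
context, nxt = 'AB', 'C'

# P(next | context) = count(context + next) / count(context)
n_context = len(re.findall(context, corpus))      # occurrences of 'AB'
n_joint = len(re.findall(context + nxt, corpus))  # occurrences of 'ABC'
print('P({} | {}) = {} / {} = {}'.format(
    nxt, context, n_joint, n_context, round(n_joint / n_context, 3)))
# P(C | AB) = 3 / 4 = 0.75
```

Note that `re.findall` returns non-overlapping matches, which is fine here because neither pattern can overlap itself.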