---
title: Chinese Word Segmentation Techniques
tags: self-learning, NLP
---

# **_Chinese Word Segmentation Techniques_**

[TOC]

:::warning
**_Reference:_**
* [自然語言處理 — 使用 N-gram 實現輸入文字預測](https://medium.com/%E6%89%8B%E5%AF%AB%E7%AD%86%E8%A8%98/%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86-%E4%BD%BF%E7%94%A8-n-gram-%E5%AF%A6%E7%8F%BE%E8%BC%B8%E5%85%A5%E6%96%87%E5%AD%97%E9%A0%90%E6%B8%AC-10ac622aab7a)
* [如何通過機率去訓練語言模型來快速預測下一個單詞?](https://kknews.cc/zh-tw/education/ejqzjbr.html)
* [語言模型及其應用](https://kknews.cc/zh-tw/tech/l3qya2b.html)
* [Language Modeling Tutorial](https://www.slideshare.net/ckmarkohchang/language-modeling-51286122)
* [Hidden Markov Model](https://ckmarkoh.github.io/blog/2014/04/03/natural-language-processing-hidden-markov-models/)
* Supplementary
    * [使用關聯法則為主之語言模型於擷取長距離中文文字關聯性](https://www.aclweb.org/anthology/O01-1002.pdf)
    * [使用collections中的namedtuple來操作簡單的物件結構](https://qiubite31.github.io/2017/03/02/%E4%BD%BF%E7%94%A8collections%E4%B8%AD%E7%9A%84namedtuple%E4%BE%86%E6%93%8D%E4%BD%9C%E7%B0%A1%E5%96%AE%E7%9A%84%E7%89%A9%E4%BB%B6%E7%B5%90%E6%A7%8B/)
    * [貝葉斯推斷的運作原理](https://brohrer.mcknote.com/zh-Hant/statistics/how_bayesian_inference_works.html)
:::

---

## Preface

* "[Deep learning](https://codertw.com/%E7%A8%8B%E5%BC%8F%E8%AA%9E%E8%A8%80/585610/)" refers to machine-learning ++algorithms++ built on neural networks with deep, multi-layer structures.
* The following NLP tasks have all made significant progress through the use of deep learning:
    * [Named Entity Recognition (NER)](https://www.chainnews.com/zh-hant/articles/690356109154.htm)
    * [Part-of-Speech Tagging (POS Tagging)](https://kknews.cc/zh-tw/education/rlv43kx.html)
    * [Sentiment Analysis](https://twgreatdaily.com/1o7BXW4BMH2_cNUgv02-.html)
        > [Deep Learning for Sentiment Analysis: A Survey](https://arxiv.org/abs/1801.07883)
    * Machine translation

---

==The term "language model" may look profound, but it refers to nothing more than "the probability of a sentence."==
> [語言模型 (Language Model) 與 N-gram 原理](https://medium.com/%E6%89%8B%E5%AF%AB%E7%AD%86%E8%A8%98/%E8%87%AA%E7%84%B6%E8%AA%9E%E8%A8%80%E8%99%95%E7%90%86-%E4%BD%BF%E7%94%A8-n-gram-%E5%AF%A6%E7%8F%BE%E8%BC%B8%E5%85%A5%E6%96%87%E5%AD%97%E9%A0%90%E6%B8%AC-10ac622aab7a)

---

## Statistical Segmentation: Implementation

### 0. Load the corpus and extract the text under each JSON record's `detailcontent` field

```python=
from collections import Counter, namedtuple
import json
import re

DATASET_DIR = './WebNews.json'  # path to the news corpus

with open(DATASET_DIR, encoding='utf8') as f:
    dataset = json.load(f)

# Keep only the article body of each JSON record
seg_list = list(map(lambda d: d['detailcontent'], dataset))
```

### 1. Keep only the Chinese characters of each document in `seg_list`, and collect the results in `NEW_seg_list`

```python=
rule = re.compile('[^\u4e00-\u9fa5]')  # anything outside the Unicode range of CJK ideographs

# Method 1: list comprehension
NEW_seg_list = [rule.sub('', seg) for seg in seg_list]

# Method 2: explicit loops (equivalent result; use one method or the
# other, otherwise every document ends up in the list twice)
# NEW_seg_list = []
# for doc in seg_list:
#     cleaned = ''
#     for ch in doc:
#         cleaned += rule.sub('', ch)
#     NEW_seg_list.append(cleaned)

print(NEW_seg_list[2])
```
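Before building the model, it helps to see what "the probability of a sentence" means concretely: by the chain rule, $P(w_1 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-1})$ under a bigram assumption. The following is a minimal sketch of that idea on a made-up three-sentence corpus; the names `toy_corpus` and `sentence_prob` are invented for this illustration, and the `<s>`/`</s>` boundary markers mirror the ones used by the `ngram()` function in the next section.

```python=
from collections import Counter

# Hypothetical toy corpus: three short "sentences"
toy_corpus = ['韓國隊獲勝', '韓國隊出賽', '韓國演唱會']

# Count bigrams and their 1-character contexts, with boundary markers
bigrams, contexts = Counter(), Counter()
for doc in toy_corpus:
    chars = ['<s>'] + list(doc) + ['</s>']
    for a, b in zip(chars, chars[1:]):
        bigrams[(a, b)] += 1
        contexts[a] += 1

def sentence_prob(sentence):
    """P(sentence) ~= product of P(ch | previous ch), estimated by raw
    counts (maximum likelihood, no smoothing)."""
    chars = ['<s>'] + list(sentence) + ['</s>']
    prob = 1.0
    for a, b in zip(chars, chars[1:]):
        prob *= bigrams[(a, b)] / contexts[a]  # 0 if the bigram was never seen
    return prob

print(sentence_prob('韓國隊獲勝'))  # seen transitions -> 1/3
print(sentence_prob('韓國隊獲賽'))  # contains an unseen bigram -> 0.0
```

Because there is no smoothing, any sentence containing an unseen bigram gets probability 0; that is exactly the sparsity problem that motivates higher-order counts over a larger corpus below.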
### 2. Estimate the probability of each next character from counts, and sort the candidates from most to least probable

The `ngram()` function below computes the maximum-likelihood estimate of a conditional probability: $P(w_N \mid w_1 \ldots w_{N-1}) = \dfrac{C(w_1 \ldots w_N)}{C(w_1 \ldots w_{N-1})}$, where $C(\cdot)$ is a corpus count.

```python=
def ngram(documents, N=2):
    ngram_prediction = dict()
    total_grams = list()  # every N-gram in the corpus (numerator counts)
    words = list()        # every (N-1)-gram context (denominator counts)
    Word = namedtuple('Word', ['word', 'prob'])  # a candidate next character and its probability

    for doc in documents:  # one article at a time
        split_words = ['<s>'] + list(doc) + ['</s>']
        # collect the numerators: N-grams
        for i in range(len(split_words) - N + 1):
            total_grams.append(tuple(split_words[i:i+N]))
        # collect the denominators: (N-1)-gram contexts
        for i in range(len(split_words) - N + 2):
            words.append(tuple(split_words[i:i+N-1]))

    total_word_counter = Counter(total_grams)
    word_counter = Counter(words)

    for key in total_word_counter:
        word = ''.join(key[:N-1])
        if word not in ngram_prediction:
            ngram_prediction[word] = set()
        # MLE: P(next | context) = C(context + next) / C(context)
        next_word_prob = total_word_counter[key] / word_counter[key[:N-1]]
        # keep the probability as a float so that sorting is numeric
        # (formatted strings would sort lexicographically)
        ngram_prediction[word].add(Word(key[-1], next_word_prob))

    return ngram_prediction
```

```python=
tri_prediction = ngram(NEW_seg_list, N=3)
# Sort each context's candidates by probability, highest first
for word, ng in tri_prediction.items():
    tri_prediction[word] = sorted(ng, key=lambda x: x.prob, reverse=True)
```

### 3. Print the five characters most likely to follow '韓國'

```python=
text = '韓國'
next_words = tri_prediction[text][:5]
for next_word in next_words:
    print('next word: {}, probability: {:.3g}'.format(next_word.word, next_word.prob))
```

---

## The Statistics Behind the Language Model

* Given that '韓國' has just appeared, what is the probability that '隊' follows it? By the same maximum-likelihood estimate, $P(\text{隊} \mid \text{韓國}) = \dfrac{C(\text{韓國隊})}{C(\text{韓國})}$; the snippet below verifies this by counting the raw corpus directly.

```python=
# Concatenate the cleaned corpus into one string and count substrings
string = ''.join(NEW_seg_list)

pattern1 = re.compile('韓國隊')
r1 = pattern1.findall(string)
print('\'韓國隊\':共有 {} 個。\n'.format(len(r1)), r1, '\n')

pattern2 = re.compile('韓國')
r2 = pattern2.findall(string)
print('\'韓國\':共有 {} 個。\n'.format(len(r2)), r2)

print('\n{} / {} = {}'.format(len(r1), len(r2), round(len(r1) / len(r2), 3)), '\n---\n')
```

## Output

```
'韓國隊':共有 5 個。
 ['韓國隊', '韓國隊', '韓國隊', '韓國隊', '韓國隊'] 

'韓國':共有 28 個。
 ['韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國', '韓國']

5 / 28 = 0.179 
---

next word: 隊, probability: 0.179
next word: 及, probability: 0.179
next word: 明, probability: 0.0714
next word: 日, probability: 0.0714
next word: 音, probability: 0.0357
```
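Once the candidate lists are sorted, the same table supports the kind of input-text prediction described in the first reference: repeatedly take the last two characters as the context and append the top-ranked candidate. Below is a minimal greedy-generation sketch, assuming the `tri_prediction` dictionary built and sorted in step 2; the name `generate`, the `max_len` cutoff, and the stopping rules are choices made for this illustration.

```python=
def generate(seed, prediction, max_len=20):
    """Greedily extend `seed` (at least 2 characters) one character at a
    time, always taking the most probable candidate for the current
    2-character context of the trigram model."""
    out = seed
    while len(out) < max_len:
        candidates = prediction.get(out[-2:])
        if not candidates:           # context never seen in the corpus
            break
        best = candidates[0]         # lists were sorted by probability in step 2
        if best.word == '</s>':      # end-of-sentence marker: stop
            break
        out += best.word
    return out

print(generate('韓國', tri_prediction))
```

Greedy decoding always follows the single most probable branch, so it can loop through a repeating pattern; the `max_len` cutoff guards against that, and sampling from the sorted candidates instead of taking `candidates[0]` would give more varied output.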