# Using Sentiment Analysis to Identify Positive and Negative Reviews on a Food-Delivery Platform

After the previous analysis of YouTube spam comments I had a basic feel for sentiment analysis, so this time I am classifying reviews from a food-delivery platform as positive or negative. Unlike the previous exercise, which worked on English messages, the dataset here is Simplified Chinese data collected by ChineseNlpCorpus. I segment the text with jieba (結巴), a Chinese word-segmentation package, and classify it with an MLP (multilayer perceptron). At the end I also try an LSTM (Long Short-Term Memory) network to push the accuracy higher.

Recommended resource: ChineseNlpCorpus curates many Chinese datasets, covering sentiment/opinion/review-polarity analysis, recommender systems, and FAQ question answering. Everything is in Simplified Chinese, but it is still a good choice for practicing Chinese NLP.

![](https://i.imgur.com/4fr9KZz.png)

---

This exercise uses the ChineseNlpCorpus dataset [waimai_10k](https://github.com/SophonPlus/ChineseNlpCorpus/blob/master/datasets/waimai_10k/intro.ipynb).

## 1. About the data

The waimai_10k dataset has two columns:

1. label: the review class; 1 is a positive review, 0 is a negative review.
2. review: the review text.

![](https://i.imgur.com/E0VDqrR.png)

### Step 1: Download the data

```python
import os
import urllib.request

url = "https://raw.githubusercontent.com/SophonPlus/ChineseNlpCorpus/master/datasets/waimai_10k/waimai_10k.csv"
# Path and file name to save to
filepath = "waimai_10k.csv"
# Download only if the file does not exist yet
if not os.path.isfile(filepath):
    result = urllib.request.urlretrieve(url, filepath)
    print('downloaded:', result)
```

### Step 2: Inspect the data

Once the file is downloaded, check how many rows it contains:

```python
import pandas as pd

pd_all = pd.read_csv('waimai_10k.csv')
print('Reviews (total): %d' % pd_all.shape[0])
print('Reviews (positive): %d' % pd_all[pd_all.label == 1].shape[0])
print('Reviews (negative): %d' % pd_all[pd_all.label == 0].shape[0])
```

![](https://i.imgur.com/iija6Fj.png)

## 2. Data preprocessing

Next, pull the data out of the CSV file and preprocess it.

### Step 1: Read the CSV file

Define a `read_files()` function to load the data:

```python
import csv

def read_files():
    path = 'waimai_10k.csv'
    all_texts = []
    all_label = []
    # Read each review and its label in a single pass.
    # utf-8-sig also tolerates a byte-order mark, if the file has one.
    with open(path, newline='', encoding='utf-8-sig') as csvfile:
        for row in csv.DictReader(csvfile):
            all_texts.append(row['review'])
            # Labels are stored as strings, so convert them to int
            all_label.append(int(row['label']))
    return all_texts, all_label
```

Call `read_files()` to get the review texts `train` and the labels `label`:

```python
train, label = read_files()
print(train[3999])
print(label[3999])
print(train[4000])
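print(label[4000])
```

Looking at the output, the reviews before index 4000 are all positive and the ones from index 4000 onward are negative, as shown below:

![](https://i.imgur.com/AHALMDJ.png)

### Step 2: Shuffle the data

From the counts above we know there are 4000 positive reviews and 7987 negative ones. The two classes are badly imbalanced, so to even them out I keep only 4000 of the negative reviews:

```python
train = train[:8000]
label = label[:8000]
```

Because the data is sorted with all positive reviews first and all negative reviews after, shuffle it so the order looks more natural:

```python
import random

# Pair each review with its label so they stay aligned while shuffling
z_shuffle = list(zip(train, label))
random.shuffle(z_shuffle)
x_train, y_label = zip(*z_shuffle)
```

Then print the first 10 labels to compare the order before and after shuffling:

```python
print(label[:10])
print(y_label[:10])
```

![](https://i.imgur.com/ag2GYco.png)

### Step 3: One-hot encode the labels

To match the format the model expects, the labels need to be one-hot encoded:

```python
# Older Keras releases expose the same function as keras.utils.np_utils.to_categorical
from keras.utils import to_categorical

y_label = to_categorical(y_label, 2)
```

In the encoded result, [0 1] stands for label 1 (positive review) and [1 0] stands for label 0 (negative review):

![](https://i.imgur.com/NdBORmq.png)

### Step 4: Split into training and test data

The original dataset does not ship with a test set, so we have to split it ourselves: 80% of the data (6400 reviews) for training and 20% (1600 reviews) for testing:

```python
NUM_TRAIN = int(8000 * 0.8)
train, test = x_train[:NUM_TRAIN], x_train[NUM_TRAIN:]
labels_train, labels_test = y_label[:NUM_TRAIN], y_label[NUM_TRAIN:]
```

### Step 5: Get and load the stop words

Search online for a [stop word list](https://blog.csdn.net/shijiebei2009/article/details/39696571) to get words and symbols that carry no meaning for training, and save it as a txt file. Then read the file and split it on the newline character `\n` to get the `stopWords` list:

```python
stopWords = []
with open('stopWord.txt', 'r', encoding='utf8') as f:
    stopWords = f.read().split('\n')
# Treat the newline character itself as a stop word too
stopWords.append('\n')
```

![](https://i.imgur.com/eMf95mV.png)

### Step 6: Segment the text with jieba

Chinese does not separate words with spaces the way English does, so a tool is needed to split sentences into words. I chose jieba (結巴), an open-source Chinese word-segmentation package, to segment every review. For example:

`我今天很快樂` ("I am very happy today")

is split by jieba into

`['我','今天','很','快樂']`

After that, the `stopWords` list from the previous step is used to drop the meaningless tokens:

```python
import jieba

# Segment the train and test reviews with jieba (full mode)
sentence = [list(jieba.cut(content, cut_all=True)) for content in train]
sentence_test = [list(jieba.cut(content, cut_all=True)) for content in test]

# Remove the stop words from the segmented train and test data
remainderWords2 = [[w for w in content if w not in stopWords] for content in sentence]
remainderWords_test = [[w for w in content if w not in stopWords] for content in sentence_test]
```

Check the result:

![](https://i.imgur.com/qgaLOYQ.png)

### Step 7: Build the token dictionary

Use Tokenizer to build a dictionary capped at 3000. `fit_on_texts()` ranks the words in the training reviews by how often they occur, and only the most frequent words, up to that 3000 cap, are used as tokens.

```python
from keras.preprocessing.text import Tokenizer

token = Tokenizer(num_words=3000)
token.fit_on_texts(remainderWords2)
```

Once the dictionary is built, print its `word_index` attribute to see the words ordered from most to least frequent. (`word_index` itself lists every word it has seen; the `num_words` cap is applied when texts are converted to sequences.)

![](https://i.imgur.com/qK4ciXS.png)

### Step 8: Convert to lists of numbers

Next, use the tokenizer's `texts_to_sequences()` method to turn the training and test data into lists of numbers.

```python
x_train_seq = token.texts_to_sequences(remainderWords2)
x_test_seq = token.texts_to_sequences(remainderWords_test)
```

For example, the first review

`['超级', '美味', '神速']`

is looked up in the dictionary and becomes

`[44, 288, 338]`

which means the word `超级` is ranked 44th in the dictionary.

![](https://i.imgur.com/De9O8FD.png)

Keras only accepts inputs of equal length, so use `sequence.pad_sequences()` to fix the length of the training and test sequences at 50. Lists longer than 50 are truncated, and lists shorter than 50 are padded with 0 until they reach length 50.

```python
from keras.preprocessing import sequence

x_train = sequence.pad_sequences(x_train_seq, maxlen=50)
x_test = sequence.pad_sequences(x_test_seq, maxlen=50)
```

For example, the 100th record has 5 tokens, so the remaining 45 positions are filled with 0.

![](https://i.imgur.com/dBPnPfg.png)

## 3. Build the model

Before building the MLP model, import the modules it needs:

```python
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Embedding
```

Then build the model. The parameter values came from plain trial and error, looking for the settings that give the highest accuracy.

1. Add an Embedding layer with output_dim set to 128. input_dim matches the dictionary size set earlier (3000), and input_length matches the sequence length set earlier (50).
1. Add a Flatten layer. The Embedding output has shape 50 × 128, so flattening it gives 50*128 = 6400 values.
1. Add a hidden layer with 256 neurons and the relu activation, which sets negative values to 0 so the outputs range from 0 upward.
1.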
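   Add an output layer with 2 neurons and the sigmoid activation, which squashes each output to a value between 0 and 1.

```python
model = Sequential()
model.add(Embedding(output_dim=128, input_dim=3000, input_length=50))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(units=2, activation='sigmoid'))
model.summary()
```

The model summary:

![](https://i.imgur.com/Wa4IyLJ.png)

## 4. Train the model

With the MLP model built, configure the training setup with `model.compile()` and start training with `model.fit()`.

```python
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
train_history = model.fit(x_train, labels_train, batch_size=100, epochs=10,
                          verbose=2, validation_split=0.2)
```

Training starts and prints the result of each epoch, as shown below:

![](https://i.imgur.com/YMl0rTf.png)

## 5. Sentiment prediction results

Evaluate the model on the test data to get its accuracy.

```python
# Evaluate on the padded test sequences and their one-hot labels
scores = model.evaluate(x_test, labels_test, verbose=1)
scores[1]
```

Result:

![](https://i.imgur.com/pQ8Icta.png)

Next, get the predicted class of every test review as a one-dimensional array, and define a function to check whether a prediction is correct. In older Keras `model.predict_classes(x_test)` does this; taking the argmax of `model.predict()` gives the same classes and also works in newer releases.

```python
import numpy as np

# Index of the larger of the two outputs: 1 = positive, 0 = negative
predict = np.argmax(model.predict(x_test), axis=1)

def display_test_Sentiment(i):
    print(test[i])
    print('Actual:', labels_test[i])
    print('Predicted:', predict[i])
```

Call display_test_Sentiment() with the index of the record you want to inspect.

```python
display_test_Sentiment(0)
```

![](https://i.imgur.com/xRHay3t.png)

```python
import matplotlib.pyplot as plt

def show_train_history(train_acc, test_acc):
    plt.plot(train_history.history[train_acc])
    plt.plot(train_history.history[test_acc])
    plt.title('Train History')
    plt.ylabel('Accuracy')
    plt.xlabel('Epoch')
    plt.legend(['train', 'test'], loc='upper left')
    plt.show()

# Older Keras releases name these history keys 'acc' and 'val_acc'
show_train_history('accuracy', 'val_accuracy')
```

The plot compares accuracy on the training data with accuracy on the held-out data (the "test" curve is the 20% validation split used during training). Overfitting is severe: the model fits the training data too closely, and the held-out accuracy does not rise and even trends downward, as shown below:

![](https://i.imgur.com/kDqwpp3.png)

So the next step is to try to raise the accuracy and reduce the overfitting.

## 6. Improve accuracy

### Step 1: Remove frequent but uninformative words

Beyond the general stop words, the reviews also contain words that say little about whether a review is positive or negative. The plan is to find those words and add them to the stop words by hand. As before, the text goes through jieba first and the general stop words are removed. The data is split three ways (all training reviews, positive reviews, and negative reviews), which makes it easier to see later which words are uninformative:

```python
from collections import Counter

segments = []
segments_postive = []
segments_negative = []
# Segment all training reviews
for content in train: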
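    _sentence = list(jieba.cut(content, cut_all=True))
    segments += _sentence

# train_postive / train_negative are not defined in the original listing.
# Assumption: they are the training reviews split by the one-hot labels
# (column 1 set = positive, column 0 set = negative).
train_postive = [t for t, l in zip(train, labels_train) if l[1] == 1]
train_negative = [t for t, l in zip(train, labels_train) if l[0] == 1]

# Segment the positive reviews
for content in train_postive:
    segments_postive += list(jieba.cut(content, cut_all=True))
# Segment the negative reviews
for content in train_negative:
    segments_negative += list(jieba.cut(content, cut_all=True))

# Remove the stop words from the overall, positive, and negative tokens
remainderWords = [w for w in segments if w not in stopWords]
remainderWords_postive = [w for w in segments_postive if w not in stopWords]
remainderWords_negative = [w for w in segments_negative if w not in stopWords]
```

Count how often each word occurs in the three sets, sorted from most to least frequent:

```python
# most_common() returns (word, count) pairs sorted by count, highest first
Counter(remainderWords).most_common()
Counter(remainderWords_postive).most_common()
Counter(remainderWords_negative).most_common()
```

Looking at the results, I removed words that carry little meaning, such as `送` (deliver), `餐` (meal), and `吃` (eat), as well as words like `餅` (flatbread) and `卷` (wrap) that appear in positive and negative reviews in roughly equal proportion. Words whose proportions differ sharply between the two classes, such as `小時` (hour) or `不錯` (pretty good), stay.

![](https://i.imgur.com/1JQtoOo.png)

A fancier way to look at the same thing is a word cloud:

```python
from wordcloud import WordCloud
from matplotlib import pyplot as plt

# Remember to supply a font file with Chinese glyphs,
# otherwise the Chinese words cannot be rendered
wordcloud = WordCloud(font_path="Microsoft JhengHei.ttf")
wordcloud.generate_from_frequencies(frequencies=Counter(remainderWords))
plt.figure(figsize=(15, 15))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```

![](https://i.imgur.com/vwNeMhj.png)

### Step 2: Switch to an LSTM

Modify the earlier model by adding an LSTM layer:

```python
from keras.layers import LSTM

model = Sequential()
# Dictionary size reduced to 1000, review length set to 50.
# This assumes the Tokenizer is rebuilt with num_words=1000 and the sequences
# are regenerated, so that no index reaches the Embedding's input_dim.
model.add(Embedding(output_dim=128, input_dim=1000, input_length=50))
# units / recurrent_activation are the Keras 2 names for output_dim / inner_activation
model.add(LSTM(units=64, activation='sigmoid', recurrent_activation='hard_sigmoid'))
model.add(Dropout(0.5))
model.add(Dense(units=256, activation='relu'))
model.add(Dropout(0.5))  # Dropout rate raised to 0.5
model.add(Dense(units=2, activation='sigmoid'))
model.summary()
```

Train the model again:

```python
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
train_history = model.fit(x_train, labels_train, batch_size=100, epochs=10,
                          verbose=2, validation_split=0.2)
```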
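One detail about the smaller dictionary is worth spelling out: with `input_dim=1000`, every index fed to the Embedding layer has to stay below 1000, so the sequences need to come from a tokenizer capped at 1000 as well. The sketch below is a pure-Python approximation of what `Tokenizer(num_words=...)` and `pad_sequences(maxlen=...)` do with their default settings, as I understand the Keras documentation. `build_index`, `to_sequence`, and `pad` are made-up helper names, and the three reviews are invented for illustration.

```python
from collections import Counter

def build_index(tokenized_texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in tokenized_texts for word in text)
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(tokens, word_index, num_words):
    """Keep only words whose index is below num_words; drop everything else."""
    return [word_index[w] for w in tokens
            if w in word_index and word_index[w] < num_words]

def pad(seq, maxlen):
    """Keep the last maxlen items, then left-pad with zeros up to maxlen."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

# Invented, already segmented reviews (illustration only)
texts = [["好吃", "快"], ["好吃", "慢"], ["好吃", "快", "便宜"]]
word_index = build_index(texts)   # 好吃 -> 1, 快 -> 2, 慢 -> 3, 便宜 -> 4

seq = to_sequence(texts[2], word_index, num_words=3)
padded = pad(seq, maxlen=5)
print(seq)     # [1, 2]
print(padded)  # [0, 0, 0, 1, 2]
```

Here `便宜` has index 4, so with `num_words=3` it is dropped. The same thing has to happen to every word ranked 1000 or lower before the data goes into the LSTM model above.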
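![](https://i.imgur.com/4LYa5kn.png)

Checking the test results, the accuracy has risen to 0.83!

```python
scores = model.evaluate(x_test, labels_test, verbose=1)
scores[1]
```

![](https://i.imgur.com/WxIVisP.png)

The overfitting has also improved a lot:

![](https://i.imgur.com/vWoC3uI.png)

To reduce the overfitting I changed many parameters before arriving at this result, although I did not capture a screenshot of every attempt! Accuracy went from 0.81 and 0.82 up to 0.83. That is not a big gain, but the four plots below should still show how things improved along the way.

![](https://i.imgur.com/yY2TklG.png)

That's it for this exercise. Thanks for reading!

---

Improvements

1. Add a weight to each word: TF-IDF
2. Justify the reduction of the dictionary from 3000 to 1000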
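For the first item on the improvement list, here is a minimal sketch of TF-IDF weighting using the plain textbook form tf × log(N / df). Libraries such as scikit-learn apply smoothed variants, so their numbers will differ. `tf_idf` is a hypothetical helper and the three tokenized reviews are invented; none of this was run on waimai_10k.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents each word appears in
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Invented, already segmented reviews (illustration only)
docs = [["送", "餐", "很", "快"], ["送", "餐", "太", "慢"], ["送", "餐", "不錯"]]
weights = tf_idf(docs)
for doc_weights in weights:
    print({word: round(value, 3) for word, value in doc_weights.items()})
```

A word that appears in every review, such as `送` here, gets a weight of 0, which lines up with the manual decision in section 6 to drop words that show up everywhere. Words found in only one review, such as `不錯`, get the largest weights.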