R tm + wordcloud
text mining
If the corpus cannot be converted to a DocumentTermMatrix
--> and the error is something like "Rcpp didn't have previous_save",
delete the entire Rcpp package folder
and then run install.packages("Rcpp").
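The wordcloud folder contains 5 txt files.
File link: click here
After downloading, note where you saved it and change the dir path to match, haha.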
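Section-by-section walkthrough
- Text processing
- Building the semantic database
- Result + plots
Text processing
A Corpus can read its data from three kinds of source:
- DirSource()
- VectorSource()
- DataframeSource()
This time DirSource is used. Differences among the three
Output the Corpus
Build the corpus
You can call docs directly or use inspect() to view the Corpus.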
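As a minimal sketch of the corpus-building step, the block below reads a folder with DirSource. The temporary folder and the two sample files are hypothetical stand-ins for the downloaded wordcloud folder, and VCorpus is an assumption here (the original call may have been Corpus).

```r
library(tm)

# Hypothetical stand-in for the downloaded folder: write two small .txt
# files into a temporary directory so the example runs on its own.
dir <- file.path(tempdir(), "wordcloud_demo")
dir.create(dir, showWarnings = FALSE)
writeLines("The village has 3 wells. Don't waste water!", file.path(dir, "a.txt"))
writeLines("Villagers share the water from the wells.", file.path(dir, "b.txt"))

# Each file in the directory becomes one document in the corpus.
docs <- VCorpus(DirSource(dir, encoding = "UTF-8"))

docs           # short summary, including the number of documents
inspect(docs)  # per-document details
```

With the real data, dir would simply point at wherever the 5 txt files were saved.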
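Building the semantic database
- Remove numbers
- Convert all uppercase letters to lowercase
- Strip extra whitespace
- Remove punctuation
When punctuation is removed, the apostrophe in a word like don't is removed too, leaving dont.
- Remove entries from the stopword list (customizable)
- Add entries to the stopword list (customizable)
- Remove the stopwords from the corpus
- Stem the words
- Print the text to check it
docs[[n]] refers to the n-th document in the corpus, counting from 1 (there are 5 documents this time, so n is 1 to 5).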
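The cleaning steps above can be sketched with tm_map as follows. The two sample sentences and the stopword choices ("not", "water") are made up for illustration, and stemDocument assumes the SnowballC package is installed.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package

# Hypothetical sample text standing in for the 5 downloaded files.
docs <- VCorpus(VectorSource(c(
  "The village has 3 wells.   Don't waste water!",
  "Villagers share the water from the wells."
)))

writeLines(as.character(docs[[1]]))  # document 1 before processing

docs <- tm_map(docs, removeNumbers)                 # remove numbers
docs <- tm_map(docs, content_transformer(tolower))  # uppercase to lowercase
docs <- tm_map(docs, stripWhitespace)               # collapse extra whitespace
docs <- tm_map(docs, removePunctuation)             # "don't" becomes "dont"

# Customize the stopword list: drop one entry, add another.
my_stopwords <- setdiff(stopwords("english"), "not")
my_stopwords <- c(my_stopwords, "water")
docs <- tm_map(docs, removeWords, my_stopwords)     # remove stopwords

docs <- tm_map(docs, stemDocument)                  # reduce words to stems

writeLines(as.character(docs[[1]]))  # document 1 after processing
```

tolower is wrapped in content_transformer because it is a base R function rather than a tm transformation, so the wrapper keeps each document a proper tm text document.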
[Image: writeLines output before processing]
[Image: writeLines output after processing]
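- Convert the corpus to a document-term matrix for later analysis
After converting, print the matrix and check terms (the number of distinct words).
Use freq to count how often each word occurs.
Finally, check it with length.
748 = number of terms (in this example)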
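A minimal sketch of the matrix step, using a tiny made-up corpus instead of the real files (so the term count here is 3, not 748). The name freq follows the text; using colSums over the dense matrix is one common way to get the counts and is an assumption about how freq was computed.

```r
library(tm)

# Hypothetical, already-cleaned sample documents.
docs <- VCorpus(VectorSource(c(
  "village wells water",
  "village water water"
)))

dtm <- DocumentTermMatrix(docs)
dtm  # the printed summary reports the number of documents and terms

# Total occurrences of each term across all documents.
freq <- colSums(as.matrix(dtm))

length(freq)  # should match the number of terms reported for dtm
```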
Result + plots
- Sort by frequency
- Show the most and least frequent words
- Find the words that occur at least a given number of times (listed alphabetically, not by frequency)
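These three steps might look like the sketch below. The sample corpus and the threshold of 2 are made up for illustration; ord is a hypothetical helper name.

```r
library(tm)

# Hypothetical sample documents.
docs <- VCorpus(VectorSource(c(
  "village wells water",
  "village water water",
  "river water"
)))
dtm  <- DocumentTermMatrix(docs)
freq <- colSums(as.matrix(dtm))

# Indices of the terms, ordered from most to least frequent.
ord <- order(freq, decreasing = TRUE)

freq[head(ord)]  # most frequent terms
freq[tail(ord)]  # least frequent terms

# Terms that occur at least twice in total.
frequent <- findFreqTerms(dtm, lowfreq = 2)
frequent
```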
- Find the words that most often occur together with a given word
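In tm this kind of lookup is available through findAssocs, which reports terms whose per-document counts correlate with the given term above a chosen limit. The sketch below assumes that function is what was used; the sample corpus, the word "village", and the 0.5 limit are illustrative.

```r
library(tm)

# Hypothetical sample documents.
docs <- VCorpus(VectorSource(c(
  "village wells water",
  "village wells",
  "river water",
  "river bridge"
)))
dtm <- DocumentTermMatrix(docs)

# Terms whose correlation with "village" is at least 0.5.
assoc <- findAssocs(dtm, "village", corlimit = 0.5)
assoc
```

Lowering corlimit returns more, weaker associations; raising it keeps only words that almost always appear in the same documents.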
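- Build a dataframe for plotting
- Plot with geom_bar
At freq you can choose the minimum frequency a word needs in order to be shown; here it is >= 7.
Also note that stat = 'identity' must be written --> because the default is 'count'.
With count, it counts how many times each word appears in the dataframe --> which is 1,
so every bar ends up with height 1 --> not good.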
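A minimal sketch of the dataframe and bar chart, with made-up frequencies in place of the real freq vector. The column names word and freq and the rotated axis labels are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical frequencies standing in for the real `freq` vector.
freq <- c(village = 9, water = 12, wells = 7, river = 3)
df <- data.frame(word = names(freq), freq = as.numeric(freq))

# Keep only words that occur at least 7 times, then use the freq column
# directly as the bar height via stat = "identity".
p <- ggplot(subset(df, freq >= 7), aes(x = word, y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```

geom_col() is the ggplot2 shorthand for geom_bar(stat = "identity"), so it would draw the same chart.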

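- Load wordcloud to draw the word cloud
min.freq = n means only words that occur at least n times are shown in the wordcloud.
set.seed(42) is only there so the same cloud can be reproduced.
More brewer palettes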
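A word cloud along these lines can be sketched as below. The frequencies, the min.freq of 3, and the "Dark2" palette are illustrative assumptions, not necessarily the original settings.

```r
library(wordcloud)  # attaches RColorBrewer, which provides brewer.pal()

# Hypothetical frequencies standing in for the real `freq` vector.
freq <- c(village = 9, water = 12, wells = 7, river = 3, bridge = 1)

set.seed(42)  # fixes the random layout so the same cloud is drawn each time
wordcloud(names(freq), freq,
          min.freq = 3,                    # hide words seen fewer than 3 times
          colors = brewer.pal(6, "Dark2"))
```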

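Discussion
I noticed that the ends of some words were cut off, e.g. village --> villag, which is what the stemming step produces.
I may need to look into this further later on.
Still, as a warm-up exercise for email analysis,
I think it is good enough.
Full code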
reference
More tutorial / note
- my coding-blog
- my movie-blog