---
title: 'R 語言學習心得 Text Mining + WordCloud'
disqus: hackmd
---
R tm + wordcloud
text mining
===
![downloads](https://img.shields.io/badge/download-R-brightgreen)
![grade](https://img.shields.io/badge/Grade-新手-brightgreen)
----
:::warning
If the conversion to DocumentTermMatrix fails
---> with an error like "Rcpp didn't have previous_save"
delete the entire Rcpp package folder,
then run install.packages("Rcpp") again.
:::
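The fix above can also be done from the R console (a sketch; this removes and reinstalls the package, so only run it when you actually hit the error):
```r=
# remove the broken Rcpp install, then fetch a fresh copy
remove.packages("Rcpp")
install.packages("Rcpp")
# restart R afterwards so the new version gets loaded
```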
The wordcloud folder contains 5 txt files.
File link: [click me](https://drive.google.com/drive/folders/1h29B2rvMFmywGO9Q2BQLOd2KEBwV0WVg?usp=sharing)
After downloading, note where you put it and update `dir` accordingly ~~haha~~.
---
**Walkthrough**
1. Text preprocessing
2. Building the semantic database
3. Results + plotting
----
==Text preprocessing==
```r=
library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs
```
`Corpus` can read data from three kinds of sources:
1. DirSource()
2. VectorSource()
3. DataframeSource()
This walkthrough uses DirSource; see the [differences](http://yphuang.github.io/blog/2016/03/04/text-mining-tm-package/) between the three.
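For comparison, the other two sources can be used roughly like this (a minimal sketch; `texts` and `df` are made-up inputs, and DataframeSource expects the columns to be named `doc_id` and `text`):
```r=
library(tm)
# VectorSource: each element of a character vector becomes one document
texts <- c("first document", "second document")
vec_corpus <- Corpus(VectorSource(texts))

# DataframeSource: needs a data frame with doc_id and text columns
df <- data.frame(doc_id = c("a", "b"),
                 text   = c("first document", "second document"),
                 stringsAsFactors = FALSE)
df_corpus <- Corpus(DataframeSource(df))
```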
==Writing out the Corpus==
```r=
writeCorpus(Your_Corpus_Name, path = dir)
```
==Building the corpus==
```r=
library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs
inspect(docs)
```
:::success
You can print `docs` directly, or use inspect() to view the Corpus.
:::
![](https://i.imgur.com/POEHVjv.png)
----
==Building the semantic database==
1. Convert everything to lowercase
2. Remove numbers
3. Remove punctuation
4. Strip extra whitespace
:::warning
Removing punctuation also strips apostrophes, so e.g. don't becomes dont.
:::
```r=
docs <- tm_map(docs,content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
```
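If you want to keep don't intact, recent versions of tm let removePunctuation preserve intra-word apostrophes (a sketch; check that your tm version supports this argument):
```r=
library(tm)
# keep the apostrophe inside contractions such as don't
removePunctuation("don't stop!", preserve_intra_word_contractions = TRUE)
# the same option can be passed through tm_map:
# docs <- tm_map(docs, removePunctuation, preserve_intra_word_contractions = TRUE)
```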
----
5. Remove stop words (customizable)
6. Add extra stop words (customizable)
7. Strip the stop words from the corpus
```r=
# add two extra stop words: 'news' and 'via'
myStopwords <- c(stopwords("english"), "news", "via")
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(myStopwords, c("r", "big"))
docs <- tm_map(docs, removeWords, myStopwords)
```
8. Stem the words
```r=
# stem words
docs <- tm_map(docs, stemDocument)
```
9. Print the text to check the result
```r=
writeLines(as.character(docs[[1]]))
```
**docs[[n]] is the n-th document in the corpus, 1-indexed (there are 5 files here, so n runs 1~5).**
==writeLines before processing==
![](https://i.imgur.com/lR2X5yn.png)
==writeLines after processing==
![](https://i.imgur.com/C5u5iCo.png)
10. Convert the corpus to a document-term matrix for further analysis
```r=
dtm <- DocumentTermMatrix(docs)
dtm
freq <- colSums(as.matrix(dtm))
length(freq) == 748
```
:::info
After converting, print the matrix and note the number of terms.
freq counts how often each term appears,
and length() is a quick sanity check:
748 = number of terms (in this example).
:::
![](https://i.imgur.com/zq6cr7j.png)
----
==Results + plotting==
1. Sort the terms by frequency
2. Show the most and least frequent terms
3. Find terms that appear at least n times (sorted alphabetically, not by count)
```r=
ord <- order(freq,decreasing = TRUE)
#inspect most frequent occurring terms
freq[head(ord)]
#inspect least frequent occurring terms
freq[tail(ord)]
# lowfreq = minimum number of occurrences
findFreqTerms(dtm,lowfreq=7)
```
![](https://i.imgur.com/e4dH7HF.png)
----
4. Find the terms most strongly associated with a given term
```r=
# pick a term A, then find terms that tend to co-occur with it
# the number is a correlation lower bound (0~1), not a probability
# ex. 1 keeps only terms perfectly correlated with A
findAssocs(dtm,'die',0.6)
```
![](https://i.imgur.com/gRqDdI2.png)
----
5. Build a dataframe for plotting
6. Plot it with geom_bar
```r=
wf=data.frame(term=names(freq),occurrences=freq)
library(ggplot2)
p <- ggplot(subset(wf, occurrences >= 7), aes(term, occurrences))
p <- p + geom_bar(stat='identity')
p <- p + theme(axis.text.x=element_text(angle=20, hjust=1), plot.title = element_text(hjust = 0.5))
p <- p + ggtitle('Term-occurrence histogram (freq>=7)')
p
```
:::danger
The threshold controls which terms get shown; here terms must appear at least 7 times (>= 7).
Also note that stat = 'identity' must be set ---> because the default is 'count'.
With 'count', ggplot counts how many rows each term has in the dataframe ---> which is 1,
so every bar would end up with height 1 ---> not good.
:::
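The same plot can be written with geom_col(), which is ggplot2's shorthand for geom_bar(stat = 'identity') (a sketch, assuming the `wf` dataframe from the step above):
```r=
library(ggplot2)
# geom_col() draws bar heights straight from the data, no counting
p <- ggplot(subset(wf, occurrences >= 7), aes(term, occurrences)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        plot.title  = element_text(hjust = 0.5)) +
  ggtitle('Term-occurrence histogram (freq>=7)')
p
```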
![](https://i.imgur.com/1R6lbyY.png)
----
7. Use wordcloud to draw the word cloud
```r=
install.packages("wordcloud")
library(wordcloud)
set.seed(42)
#limit words by specifying min frequency
# brewer.pal(n, name) ---> n = number of colors in the palette (min 3); see the palette link below
wordcloud(names(freq),freq, min.freq=5,colors=brewer.pal(6,'Paired'))
# more brewer: https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/
```
:::info
min.freq = n means only terms appearing at least n times show up in the wordcloud.
set.seed(42) just makes the same cloud reproducible.
:::
[More brewer palettes](https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/)
![](https://i.imgur.com/IpGsOvx.png)
----
==Discussion==
:::success
Some word endings get cut off by stemming, e.g. village ---> villag.
Worth looking into later.
But as a warm-up exercise before email analysis,
I think this is good enough.
:::
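For the villag problem, tm also ships stemCompletion(), which maps stems back to full words given a dictionary of original terms (a minimal sketch; the dictionary here is made up, in practice you would keep a copy of the corpus from before stemDocument and use it as the dictionary):
```r=
library(tm)
# map each stem back to its most frequent completion in the dictionary
stemCompletion(c("villag", "communiti"),
               dictionary = c("village", "community", "time"))
```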
----
==Full CODE==
```r=
library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs
getTransformations()
writeLines(as.character(docs[[1]]))
docs <- tm_map(docs,content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
# add two extra stop words: 'news' and 'via'
myStopwords <- c(stopwords("english"), "news", "via")
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(myStopwords, c("r", "big"))
docs <- tm_map(docs, removeWords, myStopwords)
# stem words
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[[1]]))
dtm <- DocumentTermMatrix(docs)
dtm
freq <- colSums(as.matrix(dtm))
length(freq) == 748
ord <- order(freq,decreasing = TRUE)
#inspect most frequent occurring terms
freq[head(ord)]
#inspect least frequent occurring terms
freq[tail(ord)]
# lowfreq = minimum number of occurrences
findFreqTerms(dtm,lowfreq=7)
# pick a term A, then find terms that tend to co-occur with it
# the number is a correlation lower bound (0~1), not a probability
# ex. 1 keeps only terms perfectly correlated with A
findAssocs(dtm,'die',0.6)
wf=data.frame(term=names(freq),occurrences=freq)
library(ggplot2)
p <- ggplot(subset(wf, occurrences >= 7), aes(term, occurrences))
p <- p + geom_bar(stat='identity')
p <- p + theme(axis.text.x=element_text(angle=20, hjust=1), plot.title = element_text(hjust = 0.5))
p <- p + ggtitle('Term-occurrence histogram (freq>=7)')
p
# wordcloud
library(wordcloud)
#setting the same seed each time ensures consistent look across clouds
set.seed(42)
#limit words by specifying min frequency
# brewer.pal(n, name) ---> n = number of colors in the palette (min 3); see the palette link below
wordcloud(names(freq),freq, min.freq=5,colors=brewer.pal(6,'Paired'))
# more brewer: https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/
```
----
==References==
* [Part 1](https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/)
* [Part 2](https://rstudio-pubs-static.s3.amazonaws.com/66739_c4422a1761bd4ee0b0bb8821d7780e12.html)
* [Part 3](https://rpubs.com/olga_bradford/468821)
* [Part 4](https://cran.r-project.org/web/packages/tidytext/vignettes/tidying_casting.html)
----
## More tutorial / note
1. [my coding-blog](fatcatcat-lab.blogspot.com)
2. [my movie-blog](fatcatcat-movie.blogspot.com)
###### tags: `R` `beginner` `cat` `tutorial`