---
title: 'R Learning Notes: Text Mining + WordCloud'
disqus: hackmd
---

R tm + wordcloud text mining
===
![downloads](https://img.shields.io/badge/download-R-brightgreen) ![grade](https://img.shields.io/badge/Grade-beginner-brightgreen) ![chat](https://img.shields.io/discord/:serverId.svg)

----

:::warning
If the conversion to a `DocumentTermMatrix` fails with an Rcpp error along the lines of "Rcpp didn't have previous_save", delete the entire Rcpp package folder and then reinstall it with `install.packages("Rcpp")`.
:::

The wordcloud folder contains five txt files; download them [here](https://drive.google.com/drive/folders/1h29B2rvMFmywGO9Q2BQLOd2KEBwV0WVg?usp=sharing).
After downloading, note where you put the folder and change `dir` accordingly ~~haha~~.

---

**Outline**
1. Text preprocessing
2. Building the semantic database
3. Results + plots

----

==Text preprocessing==

```r=
library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs
```

`Corpus` can read data from three kinds of sources:
1. DirSource()
2. VectorSource()
3. DataframeSource()

This tutorial uses DirSource; see the [differences](http://yphuang.github.io/blog/2016/03/04/text-mining-tm-package/) between the three.

==Writing a Corpus out==
```r=
writeCorpus(Your_Corpus_Name, path = dir)
```

==Building the corpus==
```r=
library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs
inspect(docs)
```

:::success
You can print `docs` directly or use `inspect()` to examine the Corpus.
:::

![](https://i.imgur.com/POEHVjv.png)

----

==Building the semantic database==
1. Remove numbers
2. Convert uppercase letters to lowercase
3. Strip extra whitespace
4. Remove punctuation

:::warning
Removing punctuation also removes apostrophes, so "don't" becomes "dont".
:::

```r=
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
```

----

5. Remove stop words (customizable)
6. Add stop words (customizable)
7. Strip the stop words from the corpus

```r=
# add two extra stop words: 'news' and 'via'
myStopwords <- c(stopwords("english"), "news", "via")
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(myStopwords, c("r", "big"))
docs <- tm_map(docs, removeWords, myStopwords)
```

8. Stem the words
```r=
# stem words
docs <- tm_map(docs, stemDocument)
```
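The cleaning steps above can also be tried end-to-end on a tiny in-memory corpus built with `VectorSource()` instead of `DirSource()` — a minimal sketch, assuming the `tm` and `SnowballC` packages are installed (the two sample sentences are made up, and stop-word removal is omitted for brevity):

```r
library(tm)

# two made-up documents supplied as a character vector
texts <- c("The 3 dogs don't RUN fast!", "Running dogs ran, and ran.")
docs2 <- VCorpus(VectorSource(texts))

# the core transformations, in the same order as above
docs2 <- tm_map(docs2, content_transformer(tolower))
docs2 <- tm_map(docs2, removeNumbers)
docs2 <- tm_map(docs2, removePunctuation)
docs2 <- tm_map(docs2, stripWhitespace)
docs2 <- tm_map(docs2, stemDocument)

writeLines(as.character(docs2[[1]]))
# the dog dont run fast
```

Note the stemming effect in the second document: `stemDocument` applies Porter stemming, so "running" is reduced to "run" while the irregular "ran" is left untouched.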
9. Print the text to check the result
```r=
writeLines(as.character(docs[[1]]))
```
**`docs[[n]]` refers to the n-th document in the corpus, starting from 1 (there are 5 files here, so n is 1–5).**

==writeLines before processing==
![](https://i.imgur.com/lR2X5yn.png)
==writeLines after processing==
![](https://i.imgur.com/C5u5iCo.png)

10. Convert the corpus into a document-term matrix for later analysis
```r=
dtm <- DocumentTermMatrix(docs)
dtm
freq <- colSums(as.matrix(dtm))
length(freq) == 748
```

:::info
After the conversion, print the matrix and note the number of terms.
Use `colSums` to compute how often each term occurs.
Finally, check with `length`: 748 = the number of terms (in this example).
:::

![](https://i.imgur.com/zq6cr7j.png)

----

==Results + plots==
1. Sort the terms by frequency
2. Show the most and least frequent terms
3. Find the terms that occur at least a given number of times (sorted alphabetically, not by frequency)

```r=
ord <- order(freq, decreasing = TRUE)
# inspect most frequently occurring terms
freq[head(ord)]
# inspect least frequently occurring terms
freq[tail(ord)]
# lowfreq = minimum number of occurrences
findFreqTerms(dtm, lowfreq = 7)
```

![](https://i.imgur.com/e4dH7HF.png)

----

4. Find the terms most associated with a given term

```r=
# pick a term A, then find the terms associated with it
# the third argument is a correlation threshold (0~1), not a probability:
# only terms whose correlation with A is at least this value are returned
# e.g. a threshold of 1 returns only terms perfectly correlated with A
findAssocs(dtm, 'die', 0.6)
```

![](https://i.imgur.com/gRqDdI2.png)

----

5. Build a data frame for plotting
6. Plot with geom_bar

```r=
wf <- data.frame(term = names(freq), occurrences = freq)
library(ggplot2)
p <- ggplot(subset(wf, freq >= 7), aes(term, occurrences))
p <- p + geom_bar(stat = 'identity')
p <- p + theme(axis.text.x = element_text(angle = 20, hjust = 1),
               plot.title = element_text(hjust = 0.5),
               legend.position = 'none')
p <- p + ggtitle('Term-occurrence histogram (freq >= 7)')
p
```

:::danger
The `freq` filter sets the minimum frequency a term needs in order to be shown; here it is >= 7.
Note that `legend.position` takes values like `'none'`, `'top'`, `'bottom'`, `'left'`, `'right'` — `'best'` (a matplotlib convention) is not valid in ggplot2.
Also note that `stat = 'identity'` must be set ---> because the default is `'count'`.
With `'count'`, ggplot counts how many rows of the data frame each term occupies ---> which is 1, so every bar would have height 1 ---> not good.
:::

![](https://i.imgur.com/1R6lbyY.png)

----
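The matrix-to-frequency steps above (`DocumentTermMatrix`, `colSums`, the `wf` data frame) can be sketched on a tiny made-up two-document corpus — the fruit words are placeholders, assuming only that `tm` is installed:

```r
library(tm)

# two made-up "documents"
docs3 <- VCorpus(VectorSource(c("apple banana apple",
                                "banana cherry banana banana")))
dtm3 <- DocumentTermMatrix(docs3)

# one column per term; column sums give each term's total frequency
freq3 <- colSums(as.matrix(dtm3))
freq3
# apple banana cherry
#     2      4      1

# the same data frame shape used for the bar chart above
wf3 <- data.frame(term = names(freq3), occurrences = freq3)
subset(wf3, freq3 >= 2)  # keeps only apple and banana
```

Because the columns of a `DocumentTermMatrix` are sorted alphabetically, `freq` (and hence `findFreqTerms`) comes out in alphabetical order — which is why the tutorial sorts explicitly with `order()` before inspecting the extremes.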
7. Use the wordcloud package to draw a word cloud

```r=
install.packages("wordcloud")
library(wordcloud)
set.seed(42)
# limit words by specifying min frequency
# brewer.pal(n, palette) ---> n = number of colors from the palette, min = 3
# for the available palettes, see the link below
wordcloud(names(freq), freq, min.freq = 5, colors = brewer.pal(6, 'Paired'))
# more brewer: https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/
```

:::info
min.freq = n: only terms that occur at least n times are shown in the word cloud.
set.seed(42) is just there so the same cloud can be reproduced.
:::

[More brewer palettes](https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/)

![](https://i.imgur.com/IpGsOvx.png)

----

==Discussion==

:::success
I noticed that the endings of some words get cut off, e.g. village ---> villag.
This comes from the `stemDocument` step, which applies Porter stemming and reduces words to their stems.
Still, as a warm-up exercise for email analysis, I think this works well enough.
:::

----

==Full code==

```r=
library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs
getTransformations()
writeLines(as.character(docs[[1]]))

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

# add two extra stop words: 'news' and 'via'
myStopwords <- c(stopwords("english"), "news", "via")
# remove 'r' and 'big' from stopwords
myStopwords <- setdiff(myStopwords, c("r", "big"))
docs <- tm_map(docs, removeWords, myStopwords)

# stem words
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[[1]]))

dtm <- DocumentTermMatrix(docs)
dtm
freq <- colSums(as.matrix(dtm))
length(freq) == 748

ord <- order(freq, decreasing = TRUE)
# inspect most frequently occurring terms
freq[head(ord)]
# inspect least frequently occurring terms
freq[tail(ord)]
# lowfreq = minimum number of occurrences
findFreqTerms(dtm, lowfreq = 7)

# pick a term A, then find the terms associated with it
# the third argument is a correlation threshold (0~1), not a probability
# e.g.
# a threshold of 1 returns only terms perfectly correlated with A
findAssocs(dtm, 'die', 0.6)

wf <- data.frame(term = names(freq), occurrences = freq)
library(ggplot2)
p <- ggplot(subset(wf, freq >= 7), aes(term, occurrences))
p <- p + geom_bar(stat = 'identity')
p <- p + theme(axis.text.x = element_text(angle = 20, hjust = 1),
               plot.title = element_text(hjust = 0.5),
               legend.position = 'none')
p <- p + ggtitle('Term-occurrence histogram (freq >= 7)')
p

# wordcloud
library(wordcloud)
# setting the same seed each time ensures a consistent look across clouds
set.seed(42)
# limit words by specifying min frequency
# brewer.pal(n, palette) ---> n = number of colors from the palette, min = 3
wordcloud(names(freq), freq, min.freq = 5, colors = brewer.pal(6, 'Paired'))
# more brewer: https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/
```

----

==References==

* [Reference 1](https://eight2late.wordpress.com/2015/05/27/a-gentle-introduction-to-text-mining-using-r/)
* [Reference 2](https://rstudio-pubs-static.s3.amazonaws.com/66739_c4422a1761bd4ee0b0bb8821d7780e12.html)
* [Reference 3](https://rpubs.com/olga_bradford/468821)
* [Reference 4](https://cran.r-project.org/web/packages/tidytext/vignettes/tidying_casting.html)

----

## More tutorial / note

1. [my coding-blog](https://fatcatcat-lab.blogspot.com)
2. [my movie-blog](https://fatcatcat-movie.blogspot.com)

###### tags: `R` `beginner` `cat` `tutorial`