
R tm + wordcloud
text mining



If the corpus won't convert to a DocumentTermMatrix
and you get an Rcpp error along the lines of "Rcpp didn't have previous_save":
delete the whole Rcpp folder from your package library,
then run install.packages("Rcpp") again.
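As a minimal alternative sketch, you can usually let R remove the broken install instead of deleting the folder by hand:

# drop the broken Rcpp install, then fetch a fresh copy
remove.packages("Rcpp")
install.packages("Rcpp")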

The wordcloud folder contains 5 txt files
File link: click me
After downloading, note where you put it and change the dir path accordingly, haha


Walkthrough

  1. Text processing
  2. Building the semantic database
  3. Results + plots

Text processing

library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs

A Corpus can read data from three kinds of sources:

  1. DirSource()
  2. VectorSource()
  3. DataframeSource()

This post uses DirSource; the differences among the three are sketched below.
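For reference, a minimal sketch of the other two sources (the vector and dataframe below are made-up examples):

# VectorSource: each element of a character vector becomes one document
docs_vec <- Corpus(VectorSource(c("first document", "second document")))

# DataframeSource: expects columns doc_id and text (in recent tm versions);
# any extra columns become document metadata
df <- data.frame(doc_id = c("doc1", "doc2"),
                 text = c("first document", "second document"),
                 stringsAsFactors = FALSE)
docs_df <- Corpus(DataframeSource(df))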

Writing a Corpus to disk

writeCorpus(Your_Corpus_Name, path = dir)

Building the corpus

library(tm)
dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs
inspect(docs)

You can print docs directly or use inspect() to examine the Corpus



Building the semantic database

  1. Remove numbers
  2. Convert uppercase to lowercase
  3. Strip extra whitespace
  4. Remove punctuation

When punctuation is removed, apostrophes go too, so for example don't becomes dont

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)
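If you want to keep contractions such as don't intact, newer tm versions let removePunctuation preserve them; a minimal sketch, assuming tm >= 0.7 (argument name as I recall it from tm's docs), used in place of the plain removePunctuation call above:

# keep the apostrophe inside contractions such as don't
docs <- tm_map(docs, removePunctuation, preserve_intra_word_contractions = TRUE)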

  1. Remove stop words (customizable)
  2. Add extra stop words (customizable)
  3. Strip the stop words from the corpus
# add two extra stop words: 'news' and 'via'
myStopwords <- c(stopwords("english"), "news", "via")
# remove 'r' and 'big' from the stop word list
myStopwords <- setdiff(myStopwords, c("r", "big"))
docs <- tm_map(docs, removeWords, myStopwords)
  1. Stem the words
# stem words
docs <- tm_map(docs, stemDocument)
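To see what the stemmer actually does, you can run stemDocument on a plain character vector; a quick sanity check (stemDocument also has a character method):

# Porter/Snowball stemming of single words
stemDocument(c("running", "villages", "died"))
# [1] "run"    "villag" "die"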
  1. Print a document to check the result
writeLines(as.character(docs[[1]]))

docs[[n]] is the nth document in the corpus, counting from 1 (this example has 5 documents, so n runs from 1 to 5)

writeLines output before the cleaning steps:


writeLines output after the cleaning steps:


  1. Convert the corpus into a document-term matrix for later analysis
dtm <- DocumentTermMatrix(docs)
dtm
freq <- colSums(as.matrix(dtm))
length(freq) == 748

After the conversion, print the matrix and pay attention to terms (the number of distinct words).
freq computes how often each term occurs.
Finally, use length as a sanity check:
748 is the number of terms (in this example).
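Besides the length check, you can also peek at a corner of the matrix; a small sketch (the row/column ranges are arbitrary):

# rows = documents, columns = terms, cells = raw counts
inspect(dtm[1:2, 1:5])
# or look at the same slice as a plain matrix
as.matrix(dtm)[1:2, 1:5]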



Results + plots

  1. Sort the term frequencies
  2. Show the most frequent and least frequent terms
  3. Find terms that occur at least some minimum number of times (output is sorted alphabetically, not by count)
ord <- order(freq, decreasing = TRUE)
# inspect the most frequently occurring terms
freq[head(ord)]
# inspect the least frequently occurring terms
freq[tail(ord)]
# lowfreq = minimum number of occurrences
findFreqTerms(dtm, lowfreq = 7)



  1. Find the terms that most often appear together with a given term
# pick a term A, then find the terms that co-occur with it
# the last argument is a correlation threshold (0-1), not a probability
# e.g. setting it to 1 returns only terms that always appear whenever A does
findAssocs(dtm, 'die', 0.6)



  1. Build a dataframe for plotting
  2. Draw the bars with geom_bar
wf <- data.frame(term = names(freq), occurrences = freq)
library(ggplot2)
p <- ggplot(subset(wf, occurrences >= 7), aes(term, occurrences))
p <- p + geom_bar(stat = 'identity')
p <- p + theme(axis.text.x = element_text(angle = 20, hjust = 1),
               plot.title = element_text(hjust = 0.5))
p <- p + ggtitle('Term-occurrence histogram (freq >= 7)')
p

The subset() call is where you choose the minimum frequency a term needs before it gets shown; here it is >= 7.
Another point to note: stat = 'identity' has to be written out, because the default is 'count'.
With 'count', ggplot counts how many rows each term has in the dataframe, which is always 1,
so every bar would end up with height 1, which is not good.
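As an aside, since ggplot2 2.2.0 geom_col() is shorthand for geom_bar(stat = 'identity'), so an equivalent sketch of the plotting step is:

library(ggplot2)
# geom_col() takes the y value as the bar height, i.e. stat = 'identity'
p <- ggplot(subset(wf, occurrences >= 7), aes(term, occurrences)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 20, hjust = 1),
        plot.title = element_text(hjust = 0.5)) +
  ggtitle('Term-occurrence histogram (freq >= 7)')
p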


  1. Load wordcloud to draw the word cloud
install.packages("wordcloud")
library(wordcloud)
set.seed(42)
# limit words by specifying a minimum frequency
# brewer.pal(n, palette): n = number of colours taken from the palette, minimum 3
wordcloud(names(freq), freq, min.freq = 5, colors = brewer.pal(6, 'Paired'))
# more brewer palettes: https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/

min.freq = n means a term must occur at least n times to show up in the word cloud.
set.seed(42) is only there so the same cloud can be reproduced.
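wordcloud() has a few more knobs worth knowing; a hedged sketch with some commonly used ones (the values here are arbitrary):

# max.words caps how many terms get drawn
# random.order = FALSE puts the most frequent words in the centre
# rot.per is the fraction of words rotated 90 degrees
wordcloud(names(freq), freq,
          min.freq = 5,
          max.words = 50,
          random.order = FALSE,
          rot.per = 0.2,
          colors = brewer.pal(6, 'Paired'))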

More brewer palettes


Discussion

I noticed that the tails of some words get chopped off, e.g. village -> villag.
That is the stemming step (stemDocument) at work, and it is probably worth reading up on later.
Still, as warm-up practice for the email analysis,
I think this turned out fine.
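One lead to follow up: tm ships stemCompletion(), which maps stems back to the most frequent matching full word from a dictionary. A minimal sketch, assuming a copy of the corpus was saved before stemming (docs_prestem below is a hypothetical name for that copy):

# docs_prestem: corpus snapshot taken before tm_map(docs, stemDocument)
stemCompletion(c("villag", "die"), dictionary = docs_prestem)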


Full code

library(tm)

dir <- "C:\\Users\\user\\Desktop\\wordcloud"
docs <- Corpus(DirSource(dir))
docs

getTransformations()
writeLines(as.character(docs[[1]]))

docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)
docs <- tm_map(docs, stripWhitespace)

# add two extra stop words: 'news' and 'via'
myStopwords <- c(stopwords("english"), "news", "via")
# remove 'r' and 'big' from the stop word list
# myStopwords <- setdiff(myStopwords, c("r", "big"))
docs <- tm_map(docs, removeWords, myStopwords)

# stem words
docs <- tm_map(docs, stemDocument)
writeLines(as.character(docs[[1]]))

dtm <- DocumentTermMatrix(docs)
dtm
freq <- colSums(as.matrix(dtm))
length(freq) == 748

ord <- order(freq, decreasing = TRUE)
# inspect the most frequently occurring terms
freq[head(ord)]
# inspect the least frequently occurring terms
freq[tail(ord)]
# lowfreq = minimum number of occurrences
findFreqTerms(dtm, lowfreq = 7)

# pick a term A, then find the terms that co-occur with it
# the last argument is a correlation threshold (0-1), not a probability
# e.g. setting it to 1 returns only terms that always appear whenever A does
findAssocs(dtm, 'die', 0.6)

wf <- data.frame(term = names(freq), occurrences = freq)
library(ggplot2)
p <- ggplot(subset(wf, occurrences >= 7), aes(term, occurrences))
p <- p + geom_bar(stat = 'identity')
p <- p + theme(axis.text.x = element_text(angle = 20, hjust = 1),
               plot.title = element_text(hjust = 0.5))
p <- p + ggtitle('Term-occurrence histogram (freq >= 7)')
p

# wordcloud
library(wordcloud)
# setting the same seed each time ensures a consistent look across clouds
set.seed(42)
# limit words by specifying a minimum frequency
# brewer.pal(n, palette): n = number of colours taken from the palette, minimum 3
wordcloud(names(freq), freq, min.freq = 5, colors = brewer.pal(6, 'Paired'))
# more brewer palettes: https://www.datanovia.com/en/blog/the-a-z-of-rcolorbrewer-palette/


More tutorial / note

  1. my coding-blog
  2. my movie-blog
tags: R beginner cat tutorial