---
GA: UA-159972578-2
---
###### tags: `R` `Data Processing` `資料前處理` `Regex` `正則表達式` `文字分析` `Text Mining`
# Text Mining (Preprocessing)
Reference: [Ebook](https://www.tidytextmining.com/tidytext.html)
Dataset: [古騰堡書庫](https://www.gutenberg.org)
## Regex in R
| 尋找 | 符號 | 舉例 |
| -------- | -------- | -------- |
| 開頭為 | ^ | ^第[0:9]章 |
| 結尾為 | $ | 區$ |
| 包含 | [] | 區$ |
| 數字從1~10以上| [1-9]\|1[9] |1~19|
[找更多範例](https://atedev.wordpress.com/2007/11/23/%E6%AD%A3%E8%A6%8F%E8%A1%A8%E7%A4%BA%E5%BC%8F-regular-expression/)
## String Manipulation (package: stringr)
### str_detect(string, pattern)
+ 找出匹配的字元
+ 回傳布林值
```{r}
library(stringr)
library(janeaustenr) # austen_books
original_books <- austen_books() %>%
group_by(book) %>%
mutate(linenumber = row_number(),
chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]", ignore_case = TRUE)))) %>%
ungroup()
```
```
## # A tibble: 73,422 x 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 "by Jane Austen" Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 "(1811)" Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
## 7 "" Sense & Sensibility 7 0
## 8 "" Sense & Sensibility 8 0
## 9 "" Sense & Sensibility 9 0
## 10 "CHAPTER 1" Sense & Sensibility 10 1
## # … with 73,412 more rows
```
### str_count(string, pattern)
+ 計算字串內pattern數
```{r}
str_count('aaa444sssddd', "a") # 3
```
### str_split(string, pattern, n)
+ 分割字串
```{r}
val <- "abc,123,234,iuuu"
s1 <- str_split(val, ",") # 遇到,就切割
s2 <- str_split(val, ",", 2) # 切兩段
```
```
## "abc" "123" "234" "iuuu"
## "abc" "123,234,iuuu"
```
```{r}
strsplit(df$sentence,"[。!;?!?;]") # 以全形或半形 驚歎號、問號、分號 以及 全形句號 爲依據進行斷句
```
### sub_extract(string, pattern)
+ 提取匹配
```{r}
val <- c("abca4", 123, "cba2")
str_extract(val, "\\d") # 返還匹配的數字
str_extract(val, "[a-z]+") # 返回匹配的字符
str_extract_all(val, "\\d") # 返還匹配的數字
```
> [1] "4" "1" "2"
> [1] "abca" NA "cba"
>
> [[1]]
> [1] "4"
>
> [[2]]
> [1] "1" "2" "3"
>
> [[3]]
> [1] "2"
### substr(x, start, stop)
+ 從字串擷取字元
```{r}
substr(df, 1, 4)
```
### grep(pattern, x)
+ 找哪裡有關鍵字
+ 回傳index值
```{r}
grep("鄉$", df$區域別) # 找最後一個字為鄉
```
```
## [1] 163 164 165 166 167 168 169 170 175 176 177 178 179 180 181 182 183
## [18] 191 192 193 194 195 196 197 198 199 200 201 210 211 212 213 214 215
```
### gsub(pattern, replacement, x)
+ 替代字元
```{r}
gsub("區$", "鄉", df$區域別) # 找出區結尾的並用「鄉」代替「區」
```
### scan
```{r}
scan(file = "./dict/lexicon.txt", what=character(),
sep='\n', encoding='utf-8', fileEncoding='utf-8')
```
## Unstructure data to Meta data (Tidy Data)
### 1. gather(col_name, value_name, key)
把key合成單一變數以col_name儲存並在value_name填入value
#### Example 1
```{r}
library(tidyr)
preg2 <- preg %>%
gather(treatment, n, treatmenta:treatmentb) %>% # 將treatmenta和treatmentb收斂在treatment並總計n次
mutate(treatment = gsub("treatment", "", treatment)) %>% # 找出關鍵字treatment並以空值替代(只保留a和b)
arrange(name, treatment)
```
<center>
![](https://i.imgur.com/LmOeaxa.png =60%x)
![](https://i.imgur.com/ytPEADK.png =50%x)
</center>
#### Example 2
```{r}
tb <- as_tibble(read.csv("tb.csv", stringsAsFactors = FALSE))
tb2 <- tb %>%
gather(demo, n, -iso2, -year, na.rm = TRUE) # iso2,year維持不動, 將剩餘的收斂為demo並計算值到n
```
<center>
![](https://i.imgur.com/HZ3EIJx.png =95%x)
![](https://i.imgur.com/RhFePMN.png =35%x)
</center>
### 2. spread(key, value)
根據key展開做成複數個變數並填入value
#### Example 1
```{r}
library(gutenbergr)
bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767)) # 選擇書號
frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"), # 在bronte資料集新增全部值為Brontë Sisters的變數
mutate(tidy_hgwells, author = "H.G. Wells"),
mutate(tidy_books, author = "Jane Austen")) %>%
mutate(word = str_extract(word, "[a-z']+")) %>% # 去掉非a~z的字符
count(author, word) %>% # 計算三個作者對某詞使用次數
group_by(author) %>%
mutate(proportion = n / sum(n)) %>% # 將次數變為比例
select(-n) %>% # 把次數拿掉
spread(author, proportion) %>% # 將三個author展成三個變數Brontë Sisters, H.G. Wells, Jane Austen並拿proportion作為值
gather(author, proportion, `Brontë Sisters`:`H.G. Wells`) # 保留Jane Austen並將Brontë Sisters和H.G. Wells收斂為author
```
<center>
![](https://i.imgur.com/dw65qWK.png =80%x)
Raw Data
![](https://i.imgur.com/eT2WECG.png =50%x)
Proportion
![](https://i.imgur.com/xRaW469.png =80%x)
Spread Data
![](https://i.imgur.com/6DPuqZI.png =80%x)
Gather Data
</center>
#### Example 2: Visualization
```{r}
library(scales)
# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
facet_wrap(~author, ncol = 2) +
theme(legend.position="none") +
labs(y = "Jane Austen", x = NULL)
```
![](https://i.imgur.com/toPWfb0.png)
### 3. separate(data, col, into, sep)
+ sep
+ If character: as a regular expression.
+ If numeric: as positions to split at.
+ 正數從左開始數
+ 負數從右開始數
#### Example 1
```{r}
tb3 <- tb2 %>%
separate(col=demo, into=c("sex", "age"), sep=1) # 以第一個字切割, 將demo展開成兩個變數sex和age
```
<center>
![](https://i.imgur.com/pNzfCPO.png =42%x)
</center>
### 4. unite(col_name, word, word, sep)
tidyr’s unite() function is the inverse of separate(), and lets us recombine the columns into one.
#### Example 1
```
# A tibble: 44,784 x 3
book word1 word2
<fct> <chr> <chr>
1 Sense & Sensibility jane austen
2 Sense & Sensibility austen 1811
3 Sense & Sensibility 1811 chapter
4 Sense & Sensibility chapter 1
5 Sense & Sensibility norland park
6 Sense & Sensibility surrounding acquaintance
7 Sense & Sensibility late owner
8 Sense & Sensibility advanced age
9 Sense & Sensibility constant companion
10 Sense & Sensibility happened ten
# … with 44,774 more rows
```
```{r}
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
```
```
## # A tibble: 44,784 x 2
## book bigram
## <fct> <chr>
## 1 Sense & Sensibility jane austen
## 2 Sense & Sensibility austen 1811
## 3 Sense & Sensibility 1811 chapter
## 4 Sense & Sensibility chapter 1
## 5 Sense & Sensibility norland park
## 6 Sense & Sensibility surrounding acquaintance
## 7 Sense & Sensibility late owner
## 8 Sense & Sensibility advanced age
## 9 Sense & Sensibility constant companion
## 10 Sense & Sensibility happened ten
## # … with 44,774 more rows
```
## 斷句/斷詞&停用字
### tibble
+ tibble data frame
+ 不在意儲存格的型態,可以為list
+ 多一列定義欄位的class type
+ 斷句
```{r}
text <- c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")
text_df <- tibble(line = 1:4, text = text)
```
```
## # A tibble: 4 x 2
## line text
## <int> <chr>
## 1 1 Because I could not stop for Death -
## 2 2 He kindly stopped for me -
## 3 3 The Carriage held but just Ourselves -
## 4 4 and Immortality
```
### unnest_tokens
+ 斷詞
+ 去除標點
+ 轉為小寫
```{r}
library(text)
text_df %>%
unnest_tokens(word, text, to_lower=True) # 將text做斷詞並儲存結果到word
```
```
## # A tibble: 20 x 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 for
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## # … with 10 more rows
```
### anti_join
+ to remove stop words
```{r}
data(stop_words)
tidy_books <- tidy_books %>%
anti_join(stop_words)
```
### tidy
+ turn non-tidy data into tidy form(one-token-per-document-per-row)
+ similar to the `melt()` function from the reshape2 package for non-sparse matrices.
```
library(tidytext)
ap_td <- tidy(AssociatedPress)
ap_td
```
## 還原字為原形
### lemmatize_words
```
tidy_books$lemma = lemmatize_words(tidy_books$word)
```