R
Data Processing
Regex
Text Mining
Reference: Ebook
Dataset: Project Gutenberg
| Match | Symbol | Example |
|---|---|---|
| Starts with | `^` | `^第[0-9]章` |
| Ends with | `$` | `區$` |
| Contains any one of | `[]` | `[0-9]` |
| Numbers 1 to 19 | `\|` (alternation) | `[1-9]\|1[0-9]` matches 1~19 |
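A quick way to sanity-check these patterns is `stringr::str_detect()`. A minimal sketch; the vectors here are made-up examples:

```r
library(stringr)

x <- c("第1章", "信義區", "北投區", "三芝鄉")
str_detect(x, "^第[0-9]章")  # TRUE FALSE FALSE FALSE
str_detect(x, "區$")         # FALSE TRUE TRUE FALSE

# whole strings that are the numbers 1 through 19
str_detect(c("5", "19", "20"), "^([1-9]|1[0-9])$")  # TRUE TRUE FALSE
```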
library(dplyr)        # %>%, group_by(), mutate()
library(stringr)
library(janeaustenr)  # austen_books()

original_books <- austen_books() %>%
  group_by(book) %>%
  mutate(linenumber = row_number(),
         # count chapter headings such as "Chapter 1" or "CHAPTER IV"
         chapter = cumsum(str_detect(text, regex("^chapter [\\divxlc]",
                                                 ignore_case = TRUE)))) %>%
  ungroup()
## # A tibble: 73,422 x 4
## text book linenumber chapter
## <chr> <fct> <int> <int>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1 0
## 2 "" Sense & Sensibility 2 0
## 3 "by Jane Austen" Sense & Sensibility 3 0
## 4 "" Sense & Sensibility 4 0
## 5 "(1811)" Sense & Sensibility 5 0
## 6 "" Sense & Sensibility 6 0
## 7 "" Sense & Sensibility 7 0
## 8 "" Sense & Sensibility 8 0
## 9 "" Sense & Sensibility 9 0
## 10 "CHAPTER 1" Sense & Sensibility 10 1
## # … with 73,412 more rows
str_count('aaa444sssddd', "a") # 3
val <- "abc,123,234,iuuu"
s1 <- str_split(val, ",")     # split at every comma
s2 <- str_split(val, ",", 2)  # split into at most two pieces
## "abc" "123" "234" "iuuu"
## "abc" "123,234,iuuu"
strsplit(df$sentence, "[。!;?!?;]")  # split sentences on the full-width period and on full- and half-width exclamation marks, question marks, and semicolons
val <- c("abca4", 123, "cba2")
str_extract(val, "\\d")      # first matching digit in each element
str_extract(val, "[a-z]+")   # first run of lowercase letters in each element
str_extract_all(val, "\\d")  # all matching digits, returned as a list
## [1] "4" "1" "2"
## [1] "abca" NA     "cba"
## [[1]]
## [1] "4"
##
## [[2]]
## [1] "1" "2" "3"
##
## [[3]]
## [1] "2"
substr(df$區域別, 1, 4)  # first four characters of each string
grep("鄉$", df$區域別)  # row indices where 區域別 ends with 鄉 ("township")
## [1] 163 164 165 166 167 168 169 170 175 176 177 178 179 180 181 182 183
## [18] 191 192 193 194 195 196 197 198 199 200 201 210 211 212 213 214 215
gsub("區$", "鄉", df$區域別)  # replace a trailing 區 ("district") with 鄉 ("township")
scan(file = "./dict/lexicon.txt", what = character(),    # read a newline-delimited dictionary
     sep = '\n', encoding = 'UTF-8', fileEncoding = 'UTF-8')
gather(): collapse several key columns into a single key column (col_name) and store their values in a value column (value_name).
library(tidyr)
library(dplyr)

preg2 <- preg %>%
  gather(treatment, n, treatmenta:treatmentb) %>%          # collapse treatmenta/treatmentb into a treatment key with value n
  mutate(treatment = gsub("treatment", "", treatment)) %>% # strip the "treatment" prefix, keeping only "a"/"b"
  arrange(name, treatment)
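The note does not show `preg` itself; the tibble below is a minimal stand-in with hypothetical values, modeled on the classic tidy-data example, so the pipeline above can be run end to end:

```r
library(tidyr)
library(dplyr)

# hypothetical study data: one treatment column per condition
preg <- tibble(
  name       = c("Jane Doe", "John Smith"),
  treatmenta = c(4, NA),
  treatmentb = c(10, 2)
)

preg %>%
  gather(treatment, n, treatmenta:treatmentb) %>%       # wide -> long
  mutate(treatment = gsub("treatment", "", treatment)) %>%
  arrange(name, treatment)
# one row per (name, treatment) pair, with the count in n
```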
tb <- as_tibble(read.csv("tb.csv", stringsAsFactors = FALSE))
tb2 <- tb %>%
  gather(demo, n, -iso2, -year, na.rm = TRUE)  # keep iso2 and year; collapse the remaining columns into demo with values in n
spread(): the inverse of gather(); expand the key column into several variables and fill them with the values.
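A minimal sketch of spread() as the inverse of gather(), using made-up data:

```r
library(tidyr)

long <- tibble::tibble(
  author     = c("A", "A", "B", "B"),
  word       = c("of", "the", "of", "the"),
  proportion = c(0.03, 0.05, 0.02, 0.06)
)

# one column per author, filled with proportion
wide <- spread(long, author, proportion)
wide

# and gather() undoes it
gather(wide, author, proportion, A:B)
```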
library(dplyr)
library(stringr)
library(tidyr)
library(gutenbergr)

bronte <- gutenberg_download(c(1260, 768, 969, 9182, 767))  # download by Gutenberg book ID

# tidy_bronte / tidy_hgwells / tidy_books are the tokenized, stop-word-filtered texts
frequency <- bind_rows(mutate(tidy_bronte, author = "Brontë Sisters"),  # tag every row with its author
                       mutate(tidy_hgwells, author = "H.G. Wells"),
                       mutate(tidy_books, author = "Jane Austen")) %>%
  mutate(word = str_extract(word, "[a-z']+")) %>%  # keep only characters in a-z and apostrophes
  count(author, word) %>%                          # word counts per author
  group_by(author) %>%
  mutate(proportion = n / sum(n)) %>%              # turn counts into proportions
  select(-n) %>%                                   # drop the raw counts
  spread(author, proportion) %>%                   # one column per author, filled with proportion
  gather(author, proportion, `Brontë Sisters`:`H.G. Wells`)  # keep Jane Austen wide; re-gather the other two
(Figures: the frequency table at each step of the pipeline above: Raw Data, Proportion, Spread Data, Gather Data.)
library(scales)
# expect a warning about rows with missing values being removed
ggplot(frequency, aes(x = proportion, y = `Jane Austen`, color = abs(`Jane Austen` - proportion))) +
geom_abline(color = "gray40", lty = 2) +
geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
scale_x_log10(labels = percent_format()) +
scale_y_log10(labels = percent_format()) +
scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
facet_wrap(~author, ncol = 2) +
theme(legend.position="none") +
labs(y = "Jane Austen", x = NULL)
tb3 <- tb2 %>%
  separate(col = demo, into = c("sex", "age"), sep = 1)  # split demo after the first character into sex and age
tidyr’s unite() function is the inverse of separate(), and lets us recombine the columns into one.
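A minimal round-trip sketch (made-up values) showing unite() undoing a separate() call like the one above:

```r
library(tidyr)

df  <- tibble::tibble(demo = c("m014", "f65"))
sep <- separate(df, demo, into = c("sex", "age"), sep = 1)  # "m"/"014", "f"/"65"
unite(sep, demo, sex, age, sep = "")                        # back to "m014", "f65"
```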
## # A tibble: 44,784 x 3
##    book                word1       word2
##    <fct>               <chr>       <chr>
##  1 Sense & Sensibility jane        austen
##  2 Sense & Sensibility austen      1811
##  3 Sense & Sensibility 1811        chapter
##  4 Sense & Sensibility chapter     1
##  5 Sense & Sensibility norland     park
##  6 Sense & Sensibility surrounding acquaintance
##  7 Sense & Sensibility late        owner
##  8 Sense & Sensibility advanced    age
##  9 Sense & Sensibility constant    companion
## 10 Sense & Sensibility happened    ten
## # … with 44,774 more rows
bigrams_united <- bigrams_filtered %>%
unite(bigram, word1, word2, sep = " ")
## # A tibble: 44,784 x 2
## book bigram
## <fct> <chr>
## 1 Sense & Sensibility jane austen
## 2 Sense & Sensibility austen 1811
## 3 Sense & Sensibility 1811 chapter
## 4 Sense & Sensibility chapter 1
## 5 Sense & Sensibility norland park
## 6 Sense & Sensibility surrounding acquaintance
## 7 Sense & Sensibility late owner
## 8 Sense & Sensibility advanced age
## 9 Sense & Sensibility constant companion
## 10 Sense & Sensibility happened ten
## # … with 44,774 more rows
text <- c("Because I could not stop for Death -",
"He kindly stopped for me -",
"The Carriage held but just Ourselves -",
"and Immortality")
text_df <- tibble(line = 1:4, text = text)
## # A tibble: 4 x 2
## line text
## <int> <chr>
## 1 1 Because I could not stop for Death -
## 2 2 He kindly stopped for me -
## 3 3 The Carriage held but just Ourselves -
## 4 4 and Immortality
library(tidytext)  # unnest_tokens()

text_df %>%
  unnest_tokens(word, text, to_lower = TRUE)  # tokenize text into one lowercase word per row
## # A tibble: 20 x 2
## line word
## <int> <chr>
## 1 1 because
## 2 1 i
## 3 1 could
## 4 1 not
## 5 1 stop
## 6 1 for
## 7 1 death
## 8 2 he
## 9 2 kindly
## 10 2 stopped
## # … with 10 more rows
data(stop_words)
tidy_books <- tidy_books %>%
  anti_join(stop_words)  # drop common stop words (joins by "word")
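`stop_words` ships with tidytext; a self-contained sketch of what the anti_join() removes:

```r
library(dplyr)
library(tidytext)

data(stop_words)

words <- tibble(word = c("the", "pride", "and", "prejudice"))
words %>% anti_join(stop_words, by = "word")
# only "pride" and "prejudice" survive; "the" and "and" are stop words
```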
tidy() also works on a document-term matrix, turning it into a one-token-per-row tibble; it is analogous to the melt() function from the reshape2 package for non-sparse matrices.

library(tidytext)
# AssociatedPress is a DocumentTermMatrix shipped with the topicmodels package
data("AssociatedPress", package = "topicmodels")
ap_td <- tidy(AssociatedPress)
ap_td
tidy_books$lemma <- lemmatize_words(tidy_books$word)
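lemmatize_words() comes from the textstem package (an assumption; the note does not name the package). A small sketch:

```r
library(textstem)

lemmatize_words(c("running", "cars", "mice"))
# maps each word to its dictionary lemma, e.g. "running" -> "run"
```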