
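R
===
These notes mainly summarize the course <a href="https://www.ewant.org/admin/tool/mooccourse/allcourses.php?search=R%E8%AA%9E%E8%A8%80%E6%96%B0%E6%89%8B%E6%9D%91">磨課師 - R語言新手村</a> (MOOC: R Language Novice Village).

Download R (CRAN mirror; RStudio itself is a separate install) -> https://cran.csie.ntu.edu.tw/

[&lt;Basics&gt;](#Basic)
&emsp;&emsp;[Chapter 01 - Variables and Operations](#變數及運算)
&emsp;&emsp;[Chapter 02 - Data Structures (Vector)](#資料結構(向量))
&emsp;&emsp;[Chapter 03 - Data Structures (Matrix)](#資料結構(矩陣))
&emsp;&emsp;[Chapter 04 - Data Structures (Data Frame)](#資料結構(資料框))
&emsp;&emsp;[Chapter 05 - Data Structures (List)](#資料結構(列表))
&emsp;&emsp;[Chapter 06 - Conditionals](#條件選擇)
&emsp;&emsp;[Chapter 07 - Loops](#迴圈控制)
&emsp;&emsp;[Chapter 08 - Functions](#函式)
&emsp;&emsp;[Chapter 09 - Built-in Plotting](#內建繪圖)
[&lt;Advanced&gt;](#進階應用)
&emsp;&emsp;[Chapter 01 - Web Crawling](#網路爬蟲)
&emsp;&emsp;[Chapter 02 - Plotting Package](#繪圖套件)
&emsp;&emsp;[Chapter 03 - Machine Learning](#機器學習)
&emsp;&emsp;[Chapter 04 - Data Analysis](#資料分析)

# Basic
## *Variables and Operations* <a id="變數及運算"></a>
Variable declaration : variable_name <- value

R variable naming rules
The first character of a name must be a letter or a period ".".
When a name starts with a period ".", the character right after it cannot be a digit.

Comment : #
Output : print()

| Type | Meaning |
| -------- | -------- |
| Numeric | number |
| Character | text |
| Logical | logical value |

| Operator | Meaning |
| -------- | -------- |
| + | addition |
| - | subtraction |
| * | multiplication |
| / | division |
| ^ | exponentiation |
| %% | remainder |

```r
x <- 24
y <- "string" # single or double quotes both work, as long as they are paired
z <- TRUE     # or T; the opposite is FALSE or F
```

| mode(x) | str(x) | length(x) |
| -------- | -------- | -------- |
| type of the variable | details of the variable | number of elements |

```r
> a <- "Hello" # character type
> mode(a)
[1] "character"
> str(a)
 chr "Hello"
> length(a)
[1] 1
```

## *Data Structures (Vector)* <a id="資料結構(向量)"></a>
One-dimensional data
vector_name <- c( element1, element2, element3, … ) # elements of different types may be passed, but R coerces them all to one common type

| &emsp;c | ( "MON" | "TUE" | "WED" | "THU" | "FRI" ) |
| -------- | -------- | -------- | --- | --- | --- |
| Index | &emsp;&emsp;1 | &emsp;2 | &emsp;3 | &emsp;4 | &emsp;5 |

```r
num <- c(1, 2, 3, 4, 5, 6) # num <- 1:6 has the same effect
print(length(num)) # 6
print(min(num)) # the minimum is 1; there are also max(), sum(), mean()...
print(sort(num, decreasing=TRUE)) # 6 5 4 3 2 1  TRUE sorts descending; ascending is the default when decreasing is omitted
print(num + 5) # 6 7 8 9 10 11  arithmetic is applied to every element

# order() takes the same decreasing argument as sort() to choose ascending or descending
smallToBig <- c( 100, 1000, 10 )
print(order(smallToBig)) # 3 1 2  shows the ordering of the indices
# 1. Number the data first: 100 is 1, 1000 is 2, 10 is 3
# 2.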
#    Place the original numbers in sorted order, which gives 10 100 1000 -> 3 1 2
print(smallToBig[order(smallToBig)]) # 10 100 1000

# Reading or changing values
mix <- c(TRUE, 14.5, "A", 98, "24") # mixed types are all coerced to character
print(mix[3])          # "A"
mix[4] <- 100          # the 4th element becomes "100"
print(mix[2:4])        # "14.5" "A" "100"  elements 2 to 4
print(mix[c(1, 3, 5)]) # "TRUE" "A" "24"

# Naming each element
# Method 1
color <- c(紅='red', 綠='green', 藍='blue')
# Method 2
color <- c('red', 'green', 'blue')
names(color) <- c('紅', '綠', '藍')
print(color)
#      紅      綠      藍
#   "red" "green"  "blue"
```

## *Data Structures (Matrix)* <a id="資料結構(矩陣)"></a>
Two-dimensional data; every element must have the same type
matrix_name <- matrix( data = vector, nrow = number of rows, ncol = number of columns, byrow = FALSE)
ncol may be omitted: give only nrow and R works out the number of columns for you.
With byrow = TRUE the values fill the first row completely before moving on to the next row; otherwise they are filled column by column.
```r
mat <- matrix(1:9, nrow=3, byrow=TRUE)
```
| | [,1] | [,2] | [,3] |
| -------- | -------- | -------- | --- |
| [1,] | 1 | 2 | 3 |
| [2,] | 4 | 5 | 6 |
| [3,] | 7 | 8 | 9 |

```r
print(mat[2,]) # 4 5 6
print(mat[,3]) # 3 6 9
print(mat[1,1]) # 1

mat[1,] <- c(3, 3, 3)  # first row
mat[2:3, 2] <- c(5, 5) # column 2 of rows 2 and 3
mat[3,3] <- 7          # third row, third column
print(mat)
#      [,1] [,2] [,3]
# [1,]    3    3    3
# [2,]    4    5    6
# [3,]    7    5    7

# Naming rows and columns
rownames(mat) <- c('第一列','第二列','第三列')
colnames(mat) <- c('第一欄','第二欄','第三欄')
print(mat)
#        第一欄 第二欄 第三欄
# 第一列      3      3      3
# 第二列      4      5      6
# 第三列      7      5      7

# Sum of each row and sum of each column
print(rowSums(mat))
# 第一列 第二列 第三列
#      9     15     19
print(colSums(mat))
# 第一欄 第二欄 第三欄
#     14     13     16
```
***
```r
# Add columns to a matrix; both matrices and vectors can be used
cbind ( matrix1 , matrix2 , vector1 , … )
```
```r
# cbind Example
mat2 <- matrix(c(1, 2, 1, 2), nrow=2, byrow=TRUE)
print(mat2)
#      [,1] [,2]
# [1,]    1    2
# [2,]    1    2
mat2 <- cbind(mat2, c(3, 3))
print(mat2)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    2    3
```
```r
# Add rows to a matrix; both matrices and vectors can be used
rbind ( matrix1 , matrix2 , vector1 , … )
```
```r
# rbind Example
mat2 <- rbind(mat2, c(1, 2, 3))
print(mat2)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    1    2    3
# [3,]    1    2    3
```

## *Data Structures (Data Frame)* <a id="資料結構(資料框)"></a>
Like a matrix, a data frame is a two-dimensional structure.
It is most often used together with CSV files.
![](https://i.imgur.com/oUv6Xj4.png)

First, a look at R's built-in datasets (data frames)
https://2formosa.blogspot.com/2018/04/preloaded-datasets-in-r.html
In the console, just type the name of the dataset you want to display.
```r
> rock
    area     peri     shape   perm
1   4990 2791.900 0.0903296    6.3
2   7002 3892.600 0.1486220    6.3
3   7558 3930.660 0.1833120    6.3
4   7352 3869.320 0.1170630    6.3
5   7943 3948.540 0.1224170   17.1
6   7979 4010.150 0.1670450   17.1
7   9333 4345.750 0.1896510   17.1
8   8209 4344.750 0.1641270   17.1
9   8393 3682.040 0.2036540  119.0
10  6425 3098.650 0.1623940  119.0
11  9364 4480.050 0.1509440  119.0
12  8624 3986.240 0.1481410  119.0
13 10651 4036.540 0.2285950   82.4
14  8868 3518.040 0.2316230   82.4
15  9417 3999.370 0.1725670   82.4
16  8874 3629.070 0.1534810   82.4
17 10962 4608.660 0.2043140   58.6
18 10743 4787.620 0.2627270   58.6
19 11878 4864.220 0.2000710   58.6
20  9867 4479.410 0.1448100   58.6
21  7838 3428.740 0.1138520  142.0
22 11876 4353.140 0.2910290  142.0
23 12212 4697.650 0.2400770  142.0
24  8233 3518.440 0.1618650  142.0
25  6360 1977.390 0.2808870  740.0
26  4193 1379.350 0.1794550  740.0
27  7416 1916.240 0.1918020  740.0
28  5246 1585.420 0.1330830  740.0
29  6509 1851.210 0.2252140  890.0
30  4895 1239.660 0.3412730  890.0
31  6775 1728.140 0.3116460  890.0
32  7894 1461.060 0.2760160  890.0
33  5980 1426.760 0.1976530  950.0
34  5318  990.388 0.3266350  950.0
35  7392 1350.760 0.1541920  950.0
36  7894 1461.060 0.2760160  950.0
37  3469 1376.700 0.1769690  100.0
38  1468  476.322 0.4387120  100.0
39  3524 1189.460 0.1635860  100.0
40  5267 1644.960 0.2538320  100.0
41  5048  941.543 0.3286410 1300.0
42  1016  308.642 0.2300810 1300.0
43  5605 1145.690 0.4641250 1300.0
44  8793 2280.490 0.4204770 1300.0
45  3475 1174.110 0.2007440  580.0
46  1651  597.808 0.2626510  580.0
47  5514 1455.880 0.1824530  580.0
48  9718 1485.580 0.2004470  580.0

# first six rows
> head(rock)
  area    peri     shape perm
1 4990 2791.90 0.0903296  6.3
2 7002 3892.60 0.1486220  6.3
3 7558 3930.66 0.1833120  6.3
4 7352 3869.32 0.1170630  6.3
5 7943 3948.54 0.1224170 17.1
6 7979 4010.15 0.1670450 17.1

# last six rows
> tail(rock)
   area     peri    shape perm
43 5605 1145.690 0.464125 1300
44 8793 2280.490 0.420477 1300
45 3475 1174.110 0.200744  580
46 1651  597.808 0.262651  580
47 5514 1455.880 0.182453  580
48 9718 1485.580 0.200447  580
```
Syntax
dataframe_name <- data.frame( vector1, vector2, vector3, … )
```r
id <- c(1, 2, 3)
age <- c(25, 30, 35)
sex <- c("male", "female", "male")
pay <- c(1000, 1500, 2000)
info <- data.frame(id, age, sex, pay)
print(info)
#   id age    sex  pay
# 1  1  25   male 1000
# 2  2  30 female 1500
# 3  3  35   male 2000

# Values are read and changed the same way as in a matrix, [row, column], with a few extra forms
print(info[3, 2]) # 35
print(info[2:3, 1:2])
#   id age
# 2  2  30
# 3  3  35
print(info[2,])
#   id age    sex  pay
# 2  2  30 female 1500
print(info[1:2, "sex"])
# [1] male   female
# Levels: female male
# (output from R before 4.0; since R 4.0 strings are no longer converted
#  to factors by default, so this prints "male" "female")
print(info$pay)
# [1] 1000 1500 2000
```
***
```r
# subset() usage
subset ( dataframe_name , subset = logical condition )
```
```r
# Using info from above again
print(subset(info, subset = age <= 30))
#   id age    sex  pay
# 1  1  25   male 1000
# 2  2  30 female 1500
```
edit(dataframe) opens a graphical editor that makes the data easy to change. It returns the edited copy, so assign the result back if the changes should be kept.
```r
edit(info)
```
![](https://i.imgur.com/MlHRzZv.png)

### Reading CSV
Suppose there is a file a.csv under C:/Users/Hank/Desktop/
Using an absolute path
```r
dataframe_name <- read.csv("C:/Users/Hank/Desktop/a.csv", header=T, sep=',')
```
Or set the working directory
```r
setwd('C:/Users/Hank/Desktop/')
dataframe <- read.csv("a.csv", header=T, sep=',')
# getwd() returns the current working directory
```

## *Data Structures (List)* <a id="資料結構(列表)"></a>
A list accepts all kinds of data types and can hold the vectors, matrices and data frames described above.
ListName <- list( comp1, comp2, comp3, …, compN )
```r
# Vector: viewers per day over one week (一星期觀看人數)
一星期觀看人數 <- c(800, 550, 750, 900, 1950, 3000, 1750)
names(一星期觀看人數) <- c("星期一", "星期二", "星期三", "星期四", "星期五", "星期六", "星期日")

# Matrix: box office revenue (電影票房收入)
電影票房收入 <- matrix(c(2500, 2650), nrow=1, byrow=TRUE)
colnames(電影票房收入) <- c("上週票房收入(萬)", "本週票房收入(萬)")
rownames(電影票房收入) <- "末日異戰"

# Data frame: review information (電影評論資訊)
電影評分 <- c(4.5, 5.0, 4.2)
評語來源 <- c("電影真好看", "酸番茄", "爆米花影院")
評語 <- c("Amazing", "Wonderful", "Exciting")
電影評論資訊 <- data.frame(電影評分, 評語來源, 評語)

# Text
movieName <- "末日異戰(Invasion)"

# Combine everything into one list
末日異戰相關資訊 <- list(movieName, 一星期觀看人數, 電影票房收入, 電影評論資訊)
names(末日異戰相關資訊) <- c('電影名稱', '觀看人數', '票房收入', '電影評論')
print(末日異戰相關資訊)
# $電影名稱
# [1] "末日異戰(Invasion)"

# $觀看人數
# 星期一 星期二 星期三 星期四 星期五 星期六 星期日
#    800    550    750    900   1950   3000   1750

# $票房收入
#          上週票房收入(萬) 本週票房收入(萬)
# 末日異戰             2500             2650

# $電影評論
#   電影評分   評語來源      評語
# 1      4.5 電影真好看   Amazing
# 2      5.0     酸番茄 Wonderful
# 3      4.2 爆米花影院  Exciting
```
```r
# Getting one item
# By index ([[ ]] returns the item itself, so no $name header is printed)
print(末日異戰相關資訊[[1]])
# [1] "末日異戰(Invasion)"

# By item name
print(末日異戰相關資訊[["票房收入"]])
#          上週票房收入(萬) 本週票房收入(萬)
# 末日異戰             2500             2650

print(末日異戰相關資訊$電影評論)
#   電影評分   評語來源      評語
# 1      4.5 電影真好看   Amazing
# 2      5.0     酸番茄 Wonderful
# 3      4.2 爆米花影院  Exciting
```
***
```r
# Adding a new item
original_list <- c(original_list, item_name=item_data)
```
```r
releasedDate <- "2020年4月17日"
末日異戰相關資訊 <- c(末日異戰相關資訊, 上映日期=releasedDate)
print(末日異戰相關資訊)
# $電影名稱
# [1] "末日異戰(Invasion)"

# $觀看人數
# 星期一 星期二 星期三 星期四 星期五 星期六 星期日
#    800    550    750    900   1950   3000   1750

# $票房收入
#          上週票房收入(萬) 本週票房收入(萬)
# 末日異戰             2500             2650

# $電影評論
#   電影評分   評語來源      評語
# 1      4.5 電影真好看   Amazing
# 2      5.0     酸番茄 Wonderful
# 3      4.2 爆米花影院  Exciting

# $上映日期
# [1] "2020年4月17日"
```

## *Conditionals* <a id="條件選擇"></a>
if
```r
if (condition) {
    # do something
}
```
if-else
```r
if (condition) {
    # do something
} else { # every case other than the if
    # do something
}
```
ifelse()
```r
ifelse(condition, value_if_true, value_if_false)
```
switch()
```r
# switch(which position or which name to run,
#        first:  run A,
#        second: run B,
#        third:  run C,
#        ...)

# examples
switch(2,
       1+2,
       3*3, # this one runs
       4^4)

switch("caseA",
       caseA=1000, # this one runs
       caseB=2*100,
       caseC="Hello R")
```

## *Loops* <a id="迴圈控制"></a>
for
```r
# for (variable in range) {
#     do something
# }
# range can be a vector or a sequence such as 1:100

# example
total <- 0
for (i in 1:100) {
    total <- total + i
}
print(total) # 5050
```
while
```r
while (condition) {
    # do something
}
```
![](https://i.imgur.com/rnRwWb7.png)

repeat
```r
# A single statement can be wrapped in parentheses () or braces {} and runs forever
x <- 0
# repeat(print(x)) # prints 0 endlessly; commented out so the rest of the block can run

# Several statements require {}
x <- 0
repeat {
    print(x)
    x <- x + 1
    if (x == 5) {
        print("x=5, leaving the loop")
        break
    }
}
# [1] 0
# [1] 1
# [1] 2
# [1] 3
# [1] 4
# [1] "x=5, leaving the loop"
```
break and next
```
break  leaves the loop immediately; no further iterations run
next   skips the rest of the current iteration and continues with the next one
```

## *Functions* <a id="函式"></a>
```r
functionName <- function(parameters) {
    # do something
    return(value) # remove this line if nothing needs to be returned
}

# example
# 長方形面積或周長 = "rectangle area or perimeter"; 長 = length, 寬 = width
長方形面積或周長 <- function(長, 寬, 面積或周長=TRUE) { # default argument
    面積 <- 0 # initialize the area (面積) and the perimeter (周長)
    周長 <- 0
    if (面積或周長 == TRUE) {
        面積 <- 長 * 寬
        return(面積)
    } else {
        周長 <- 2 * (長 + 寬)
        return(周長)
    }
}

長方形面積或周長(10, 5) # 50
長方形面積或周長(10, 5, 面積或周長=FALSE) # 30
```
### Sampling with sample()
```r
# Signature: sample(x, size, replace = FALSE)
# x is the set to draw from
# size is how many values to draw at random
# replace = TRUE draws with replacement

# example
sample(1:6, 3,
       replace=TRUE) # roll a die 3 times, each draw choosing among all 6 faces
```
### rep()
```r
# Signature: rep(x, times)  repeats x, times times

# example
rep(NA, 3) # NA NA NA
```

## *Built-in Plotting* <a id="內建繪圖"></a>
![](https://i.imgur.com/iRg9eFg.png)
### Line chart
```r
plot(x, y, ...) # plus assorted graphical parameters
# If only one vector is given, it is plotted against its index,
# so 1:10 gives a straight diagonal from lower left to upper right
```
```r
plotNumber <- 1:10
plot(plotNumber,
     type="l",      # line style; "p" draws points
     col="blue",    # color
     lty=2,         # 1 solid, 2 dashed
     main="Linear", # title
     xlab="X",      # x-axis label
     ylab="Y")      # y-axis label
```
![](https://i.imgur.com/dd0ShZn.png)
### Dot chart
```r
dotchart(x, ...)
```
```r
dotNumber <- 1:10
dotchart(dotNumber,
         labels=dotNumber, # the value each triangle sits at
         pch=2,            # plotting symbol
         color="red",
         main="Dot",
         xlab="X",
         ylab="Y")
```
![](https://i.imgur.com/ZnMU0zq.png)![](https://i.imgur.com/XXzvhsg.png)
### Bar chart
```r
barplot(height, ...)
# height can be a vector or a matrix (most common)
```
```r
barNumber <- matrix ( 1:10 , nrow = 2 , ncol = 5 )
label <- c("L1", "L2", "L3", "L4", "L5")
barplot(barNumber,
        col=c("1", "2"), # 1 black, 2 red
        beside=TRUE,     # draw the bars side by side
        names.arg=label, # bar labels
        main="Barplot")
```
beside = TRUE & beside = FALSE
![](https://i.imgur.com/pRiw1Dx.png)![](https://i.imgur.com/edWpjCo.png)
```r
barNumber <- 1:10
label <- paste0("L", 1:10) # names.arg needs one label per bar, so 10 labels for 10 bars
barplot(barNumber, col=c("1", "2"), beside=TRUE, names.arg=label, main="Barplot")
# a plain vector only shows the trend of the bars
```
![](https://i.imgur.com/AJli73g.png)
### Pie chart
```r
pie(x, ...)
```
```r
pieNumber <- 1:5
label <- c("P1", "P2", "P3", "P4", "P5")
pie(pieNumber,
    labels=label,
    col=rainbow(length(pieNumber)), # rainbow colors
    main="圓餅圖")
```
![](https://i.imgur.com/q3LLX7u.png)
### Box plot
```r
boxplot(x, ...)
# several x vectors can be passed; their boxes are drawn side by side, left to right
```
```r
x <- 1:4
y <- 5:8
boxplot ( x , y , col = "red" , border = "blue" )
```
![](https://i.imgur.com/BpVnQBd.png)

# Advanced Applications <a id="進階應用"></a>
## *Web Crawling* <a id="網路爬蟲"></a>
The direct approach
```r
url <- 'https://www.ndhu.edu.tw/'
content <- readLines(url)
head(content)
```
Result ↓
```
[1] "<!DOCTYPE html>"
[2] "<html lang=\"zh-tw\">"
[3] "<head>"
[4] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">"
[5] "<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge,chrome=1\" />"
[6] "<meta name=\"viewport\" content=\"initial-scale=1.0, user-scalable=1, minimum-scale=1.0, maximum-scale=3.0\">"
```
Installing and using a package - rvest
```r
> install.packages('rvest')
# If using it raises an error such as: there is no package called 'stringi', install that package too
> install.packages('stringi')
```
If you are not familiar with how web pages are structured, install the Chrome extension SelectorGadget.
Enable it and click the data you want to scrape; the matching CSS selector appears at the bottom,
ready to be pasted in as the second argument of html_nodes.
![](https://i.imgur.com/nNM3Len.png)
```r
library(rvest) # load the package
url <- "https://www.imdb.com/title/tt4154756/"
avengers <- read_html(url)                     # read the page source
score <- html_nodes(avengers, "strong span")   # find the element matched by the CSS selector
ranking <- html_text(score)                    # text content of that element
```
```r
> score
{xml_nodeset (1)}
[1] 8.4
> as.numeric(ranking)
[1] 8.4
> ranking
[1] "8.4"
```
Filtering the data
Data scraped from the web often carries parts you do not need, so the text has to be filtered.
Here is a quick look at the gsub() function. Assume three pieces of information have already been scraped, as below.
```r
gsub(pattern_to_replace, text_to_change, replacement=replacement_text)
```
```r
title <- "Avengers: Infinity War (2018) "
time <- "\n 2h 29min"
inTheaterDate <- "25 April 2018 (Taiwan)\n"
```
```r
title <- gsub("\\s+", title, replacement="")
time <- gsub("\n\\s+", time, replacement="")
inTheaterDate <- gsub("\n", inTheaterDate, replacement="")
```
Result ↓
```r
> title
[1] "Avengers:InfinityWar(2018)"
> time
[1] "2h 29min"
> inTheaterDate
[1] "25 April 2018 (Taiwan)"
```
where
```
\\s  matches one whitespace character (a space, tab or newline)
\n   is the newline character
\\s+ matches at least one whitespace character, with no upper limit
```

## *Plotting Package* <a id="繪圖套件"></a>
ggplot2
```r
> install.packages("ggplot2")
```
```r
ggplot(df, aes(x=?, y=?)) + other_function() + ...
# layers are stacked together with the + sign
```
Example
Using the built-in dataset cars
```r
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
```
```r
library(ggplot2) # load the package
ggcars <- ggplot(cars, aes(x=speed, y=dist)) # coordinate base layer only
ggcars
```
![](https://i.imgur.com/wY0kt1h.png)
```r
library(ggplot2)
ggcars <- ggplot(cars, aes(x=speed, y=dist)) + geom_point() # add a scatter plot on top of the base layer
ggcars
```
![](https://i.imgur.com/rQ47PEU.png)

geom_
| Function | Chart |
| -------- | -------- |
| geom_histogram() | histogram |
| geom_boxplot() | box plot |
| geom_line() | line chart |
| geom_bar() | bar chart |

## *Machine Learning* <a id="機器學習"></a>
Supervised learning - decision tree
Package installation
```r
install.packages("rpart")
install.packages("rattle")
install.packages("RColorBrewer")
install.packages("rpart.plot")
```
**Download** 決策樹訓練資料評估新設店面.csv (decision tree training data: evaluating a new store location)
https://1drv.ms/u/s!AowAgK8Ip6CthV4MNGiaLwpSmo5w?e=hQcp17
```r
library(rpart) # load the packages
library(rpart.plot)
library(rattle)
library(RColorBrewer)

# 評估資料 = evaluation data
評估資料 <- read.csv("決策樹訓練資料評估新設店面.csv", header=T, sep=",")
評估資料

# predicted variable ~ predictor 1 + ... + predictor n
評估資料樹 <- rpart(決定 ~ 城市規模 + 平均收入 + 教育程度 + 當地投資者,
               data=評估資料, method="class",
               control = rpart.control(minsplit=5))
# control tunes the tree; minsplit is the minimum number of observations
# a node must hold before a split is attempted

plot(評估資料樹)
text(評估資料樹)
fancyRpartPlot(評估資料樹)
```
Results ↓
```r
> 評估資料
   城市規模 平均收入 教育程度 當地投資者 決定
1        大       高       高         是   是
2        中       中       中         否   否
3        小       低       低         是   否
4        大       高       高         否   是
5        小       中       高         是   否
6        中       高       中         是   是
7        中       中       中         是   否
8        大       中       中         否   否
9        中       高       低         是   否
10       小       高       高         否   是
11       小       中       高         否   否
12       中       高       中         否   否
```
plot + text
![](https://i.imgur.com/1FYnHDW.png)
fancyRpartPlot
![](https://i.imgur.com/hWrldxw.png)

Unsupervised learning - K-means
**Download** 客戶區隔.csv (customer segmentation)
https://1drv.ms/u/s!AowAgK8Ip6CthV3YMUrbsJ6ADCJE?e=qX7Ces
```r
# 客戶資料 = customer data
客戶資料 <- read.csv("客戶區隔.csv", header=T, sep=",")
客戶資料

# centers: number of clusters; nstart: number of runs (each with randomly chosen starting centers)
客戶分群 <- kmeans(客戶資料, centers=3, nstart = 10)
客戶分群

# x-axis: 客戶 (customer), y-axis: 購買總價 (total purchase); col colors each point by its cluster
plot(formula=購買總價 ~ 客戶, data=客戶資料, col=客戶分群$cluster)

# Mark the center of each cluster; pch=8 is the * symbol, cex is the symbol size
points(客戶分群$centers[, c("客戶", "購買總價")], col=1:3, pch=8, cex=2)
```
Results ↓
```r
> 客戶資料
   客戶 客戶收入.千 交易數 購買總價
1     1          95      5      450
2     2          88     10      800
3     3          76     15      900
4     4          30      2       50
5     5          60     18      900
6     6          45      9      250
7     7          56     14      500
8     8          22      8      150
9     9          38      7      200
10   10          65      9      500
11   11          72      1       30
12   12          48      6      180

> 客戶分群
K-means clustering with 3 clusters of sizes 3, 6, 3

Cluster means:
      客戶 客戶收入.千    交易數 購買總價
1 6.000000    72.00000  9.333333 483.3333
2 8.333333    42.50000  5.500000 143.3333
3 3.333333    74.66667 14.333333 866.6667

Clustering vector:
 [1] 1 3 3 2 3 2 1 2 2 1 2 2

Within cluster sum of squares by cluster:
[1]  2583.333 39135.667  7098.667
 (between_SS / total_SS =  95.6 %)

Available components:
[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"
[8] "iter"         "ifault"
```
![](https://i.imgur.com/J7NSCtP.png)

## *Data Analysis* <a id="資料分析"></a>
### Text Mining and Word Clouds <a id="文字探勘與文字雲"></a>
There are 3 steps
1. Build a corpus (Corpus)
2. Produce a term-document matrix (Term Document Matrix)
3. Mine the TDM for patterns

Install the packages
```r
install.packages("tm")
install.packages("wordcloud")
```
**Download** JobsSpeech.txt
https://1drv.ms/t/s!AowAgK8Ip6CthWMap8mdjdO6WTkb?e=Zf4TyV
```r
library(tm) # load the packages
library(wordcloud)

speech <- readLines("JobsSpeech.txt") # load the Jobs speech
corpus <- VCorpus(VectorSource(speech)) # wrap the text as a vector source, then build a volatile (in-memory) corpus
corpusSW <- tm_map(corpus, stripWhitespace) # collapse extra whitespace
corpusTL <- tm_map(corpusSW, content_transformer(tolower)) # convert everything to lower case
corpusRN <- tm_map(corpusTL, removeNumbers) # remove numbers
corpusRP <- tm_map(corpusRN, removePunctuation) # remove punctuation
corpusRW <- tm_map(corpusRP, removeWords, stopwords("english")) # stop words; the 3rd argument lists the words to remove

tdm <- TermDocumentMatrix(corpusRW) # build the TDM
tdmMatrix <- as.matrix(tdm) # convert to a matrix
tdmSort <- sort(rowSums(tdmMatrix), decreasing = T) # total the counts and sort from most to least frequent
tdmdf <- data.frame(freq = tdmSort)
head(tdmdf) # first six rows

tdmdfwd <- data.frame(tdmdf, words = names(tdmSort))
wordcloud(tdmdfwd$words, tdmdfwd$freq,
          colors = brewer.pal(8, "Set2"),
          min.freq = 10,
          random.order = F)
# brewer.pal(8, "Set2") picks 8 colors from the Set2 palette
# min.freq = 10 draws only words that occur at least 10 times (a number, not the string "10"; lower it to draw more words)
# random.order = F places frequent words in the center and rarer ones toward the outside
```
Results ↓
```r
        freq
life      15
college   13
one        9
years      9
apple      8
just       8
```
![](https://i.imgur.com/IPQ9phW.png)
The built-in English stop word set
```r
> stopwords("english")
  [1] "i" "me" "my" "myself" "we" "our" "ours"
  [8] "ourselves" "you" "your" "yours" "yourself" "yourselves" "he"
 [15] "him" "his" "himself" "she" "her" "hers" "herself"
 [22] "it" "its" "itself" "they" "them" "their" "theirs"
 [29] "themselves" "what" "which" "who" "whom" "this" "that"
 [36] "these" "those" "am" "is" "are" "was" "were"
 [43] "be" "been" "being" "have" "has" "had" "having"
 [50] "do" "does" "did" "doing" "would" "should" "could"
 [57] "ought" "i'm" "you're" "he's" "she's" "it's" "we're"
 [64] "they're" "i've" "you've" "we've" "they've" "i'd" "you'd"
 [71] "he'd" "she'd" "we'd" "they'd" "i'll" "you'll" "he'll"
 [78] "she'll" "we'll" "they'll" "isn't" "aren't" "wasn't" "weren't"
 [85] "hasn't" "haven't" "hadn't" "doesn't" "don't" "didn't" "won't"
 [92] "wouldn't" "shan't" "shouldn't" "can't" "cannot" "couldn't" "mustn't"
 [99] "let's" "that's" "who's" "what's" "here's" "there's" "when's"
[106] "where's" "why's" "how's" "a" "an" "the" "and"
[113] "but" "if" "or" "because" "as" "until" "while"
[120] "of" "at" "by" "for" "with" "about" "against"
[127] "between" "into" "through" "during" "before" "after" "above"
[134] "below" "to" "from" "up" "down" "in" "out"
[141] "on" "off" "over" "under" "again" "further" "then"
[148] "once" "here" "there" "when" "where" "why" "how"
[155] "all" "any" "both" "each" "few" "more" "most"
[162] "other" "some" "such" "no" "nor" "not" "only"
[169] "own" "same" "so" "than" "too" "very"
```
### The pipe operator %>%
**Package installation**
```r
install.packages("magrittr")
```
**When to use it**
When one piece of data goes through a chain of processing steps, with each step's output feeding the next step's input, the pipe can be used.
Part of the [Jobs speech](#文字探勘與文字雲) code above is rewritten here as an example.
Original
```r
corpus <- VCorpus(VectorSource(speech)) # wrap the text as a vector source, then build a volatile (in-memory) corpus
corpusSW <- tm_map(corpus, stripWhitespace) # collapse extra whitespace
corpusTL <- tm_map(corpusSW, content_transformer(tolower)) # convert everything to lower case
corpusRN <- tm_map(corpusTL, removeNumbers) # remove numbers
corpusRP <- tm_map(corpusRN, removePunctuation) # remove punctuation
corpusRW <- tm_map(corpusRP, removeWords, stopwords("english")) # stop words; the 3rd argument lists the words to remove
```
With %>%
```r
library(magrittr) # provides %>%
corpus <- VCorpus(VectorSource(speech)) %>%
    tm_map(stripWhitespace) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation) %>%
    tm_map(removeWords, stopwords("english"))
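```
See the difference?
In the original version every line stores its result in a variable, and the next line takes that variable as its input.
`%>%` therefore shortens the code and cuts down on intermediate variables.
The functions also no longer need the data passed as their first argument, because the pipe supplies it.

<center><h1 style="color: #8b4513;">Fin ~</h1></center>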
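The pipe rule from the last section can be checked on a tiny vector. The sketch below is not part of the course material: the vector `scores` and the rounding step are made up for illustration, and it assumes the magrittr package is installed. It shows that `x %>% f(y)` is evaluated as `f(x, y)`, so the nested form and the piped form return the same value.

```r
library(magrittr) # provides %>%

# Hypothetical data, chosen only for illustration
scores <- c(1, 4, 9, 16)

# Nested form: read from the inside out
nested <- round(sqrt(sum(scores)), 2)

# Piped form: read left to right; each result becomes
# the first argument of the next call
piped <- scores %>% sum() %>% sqrt() %>% round(2)

print(nested)                   # 5.48
print(piped)                    # 5.48
print(identical(nested, piped)) # TRUE
```

Since R 4.1, base R also ships a native pipe `|>` that covers this simple left-to-right case without loading a package.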