
# R Basic
###### tags: `普渡計畫`

## Basic R operations

### Data types in R

#### Logical
```r=
TRUE, FALSE
```
#### Numeric
```r=
6, 18.5, 825
```
#### Integer
```r=
8L, 45L
```
#### Complex
A real part plus an imaginary part.
```r=
7 + 5i
```
#### Character
```r=
'g', "Smith"
```
#### Raw
Data stored as raw bytes.
```r=
"Hello" is stored as 48 65 6c 6c 6f
```

### Rules for variable names
1. A name can consist of letters, numbers, underscores and periods.
2. If a name starts with a period, the period cannot be followed by a number.
3. Reserved words cannot be used.

### Operators
#### Types
1. Arithmetic Operators
2. Relational Operators
3. Logical Operators
4. Assignment Operators

#### Arithmetic Operators
Assume A = 100 and B = 20.
```r=
A^B ## 100 to the power of 20
```
```r=
A%%B ## remainder of the division (0)
```
```r=
A%/%B ## integer quotient of the division (5)
```
#### Relational Operators
`>`, `<` and the like; I won't list them here.

#### Logical Operators
`A %in% B`: whether A is contained in B.

`&&`, `&`: logical AND. `&` works element by element on vectors, while `&&` is for single values.

`||`, `|`: logical OR. `|` is used like `&`, and `||` like `&&`.

#### Assignment Operators
Simply `<-`.

## Data Structure

### Three steps of data handling
1. `Identify` the data structure
2. `Define` the values in the data
3. `Manipulate` the data

### Types of data structure
| Data Structure | Type          | Dimensionality |
|:--------------:|:-------------:|:--------------:|
| Atomic Vectors | Homogeneous   | 1              |
| List           | Heterogeneous | 1              |
| Matrix         | Homogeneous   | 2              |
| Array          | Homogeneous   | n              |
| Factor         | Homogeneous   | 1              |
| Data Frame     | Heterogeneous | 2              |

### Atomic Vectors
1. A one-dimensional data structure
2. All elements must be of the same type
3. Can contain:
    * Numeric Data Type
    * Integer Data Type
    * Character Data Type
    * Logical Data Type

For example:
```r=
a <- c(1, 2, 5, 3, 6, -2, 4)
b <- c("a", "b", "c")
c <- c(TRUE, TRUE, FALSE)
```
A run of consecutive numbers can be generated with `:`
```r=
X <- 2:5; X
[1] 2 3 4 5
```
Elements of a vector are located by index, using square brackets with a number to pick a specific element.
```r=
x <- c('1','2','3')
x[2]
[1] "2"
```
#### Basic vector usage
```r=
c(81,125)/60
> c(81,125)/60
[1] 1.350000 2.083333
```
You can also assign the vector to a variable first and then compute with it.
```r=
ratings <- c(81, 125)
ratings / 60
> ratings / 60
[1] 1.350000 2.083333
```
Consecutive numbers can be written as `c(a:b)`.
```r=
a <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
b <- c(1:10)
c <- c(10:1)
> a
 [1]  1  2  3  4  5  6  7  8  9 10
> b
 [1]  1  2  3  4  5  6  7  8  9 10
> c
 [1] 10  9  8  7  6  5  4  3  2  1
```
#### Kinds of vector
```r=
# Numeric Vector
a <- c(1985, 1999, 2010, 2002)
> a
[1] 1985 1999 2010 2002

# Character Vector
b <- c("Toy Story", "Akira", "The Artist", "City of God")
> b
[1] "Toy Story"   "Akira"       "The Artist"  "City of God"

# Logical Vector 1
> 1995>1997
[1] FALSE

# Logical Vector 2
movie_ratings <- c(7.3, 8.5, 8.3, 7.5, 6.9, 5.2, 8.2, 8.0, 7.9, 9.3)
> movie_ratings > 7.5
 [1] FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```
#### How to name the elements of a vector
Use `names()` to give the elements names.
```r=
drink <- c(1, 2, 3)
names(drink) <- c("set_A", "set_B", "set_C")
drink["set_B"]
> drink["set_B"]
set_B 
    2 
```
#### How to get the length of a vector
Use `length()` to count the elements of a vector.
```r=
drink <- c(1, 2, 3)
length(drink)
> length(drink)
[1] 3
```
#### How to sort a vector?
Use `sort()` to sort the elements of a vector.
```r=
year <- c(1985, 1991, 1782, 1963)
names(year) <- c("A", "B", "C", "D")
year_sorted <- sort(year)
year_sorted
> year_sorted
   C    D    A    B 
1782 1963 1985 1991 
```
#### Maximum and minimum
Use `max()` and `min()` to find the largest and smallest values.
```r=
year <- c(1985, 1991, 1782, 1963)
names(year) <- c("A", "B", "C", "D")
min_year <- min(year)
max_year <- max(year)
min_year
max_year
> min_year
[1] 1782
> max_year
[1] 1991
```
#### How to do calculations?
Use `sum()`, `mean()` and `summary()` to get a quick overview.
```r=
# Calculation
cash <- c(1999,2000,53345234,3458730458,348503948509286983)
sum(cash)
mean(cash)
summary(cash)
> sum(cash)
[1] 3.48504e+17
> mean(cash)
[1] 6.970079e+16
> summary(cash)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
1.999e+03 2.000e+03 5.335e+07 6.970e+16 3.459e+09 3.485e+17 
```
#### Vector Index
Note that, unlike Python, indexing starts at 1 rather than 0.
```r=
# Vector Index
cost_2014 <- c(8.6, 8.5, 8.1)
> cost_2014[2]
[1] 8.5
> cost_2014[c(2,3)]
[1] 8.5 8.1
> cost_2014[1:3]
[1] 8.6 8.5 8.1
> cost_2014[-1] # drop the element at index 1
[1] 8.5 8.1
> cost_2014[267] # an index beyond the length gives NA
[1] NA
> cost_2014[cost_2014>8.3] # a condition can also be used as a filter
[1] 8.6 8.5
```
#### Vector Arithmetic
Vectors can be used directly in calculations; the operation is applied element by element.
```r=
a <- c(20, 30, 40, 50)
b <- c(12, 12, 12, 3445)
a*b
> a*b
[1]    240    360    480 172250
```
### List
Whereas an array or a matrix can only hold values of a single type, a list can hold values of different types.
```r=
## List example
movie <- list("Toy Story", 1995, c("Animation","Adventure","Comedy"))
movie
> movie
[[1]]
[1] "Toy Story"

[[2]]
[1] 1995

[[3]]
[1] "Animation" "Adventure" "Comedy"

## Taking part of a list
movie[2:3]
> movie[2:3]
[[1]]
[1] 1995

[[2]]
[1] "Animation" "Adventure" "Comedy"
```
#### Naming list elements
List elements can be given names and then accessed with the dollar sign `$` or square brackets `[]`, e.g.
`movie$genre`, `movie["genre"]`
```r=
movie <- list(name = "Toy Story", year = 1995, genre = c("Animation","Adventure","Comedy"))
movie
> movie
$name
[1] "Toy Story"

$year
[1] 1995

$genre
[1] "Animation" "Adventure" "Comedy"

movie$genre
> movie$genre
[1] "Animation" "Adventure" "Comedy"

movie["genre"]
> movie["genre"]
$genre
[1] "Animation" "Adventure" "Comedy"
```
#### Adding elements to a list
Step 1: assign the new value to a new name in the list.
Step 2: inspect the list.
```r=
> movie["age"]<-18
> movie
$name
[1] "Toy Story"

$year
[1] 1995

$genre
[1] "Animation" "Adventure" "Comedy"

$age
[1] 18
```
#### Removing elements from a list
Assigning `NULL` (it is case-sensitive) to an element name removes that element.
```r=
movie["age"]<-NULL
movie
```
### Matrix
All values in a matrix must be of the same type, whether numeric, character or logical. Use `matrix()` to create one.
```r=
A <- matrix(1:9, nrow = 3, ncol = 3)
A
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

B <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)
B
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
```
```r=
movie_vector <- c("Akira", "Toy Story", "Room", "The Wave", "Whiplash", "Star Wars", "The Ring", "The Artist", "Jumanji")
movie_matrix <- matrix(movie_vector, nrow = 3, ncol = 3)
movie_matrix
## By default the matrix is filled column by column
> movie_matrix
     [,1]        [,2]        [,3]        
[1,] "Akira"     "The Wave"  "The Ring"  
[2,] "Toy Story" "Whiplash"  "The Artist"
[3,] "Room"      "Star Wars" "Jumanji"   

## Setting byrow = TRUE fills the matrix row by row instead
movie_matrix <- matrix(movie_vector, nrow = 3, ncol = 3, byrow = TRUE)
movie_matrix
> movie_matrix
     [,1]       [,2]         [,3]       
[1,] "Akira"    "Toy Story"  "Room"     
[2,] "The Wave" "Whiplash"   "Star Wars"
[3,] "The Ring" "The Artist" "Jumanji"  

## Elements can be selected by position
movie_matrix[2:3, 1:2]
> movie_matrix[2:3, 1:2]
     [,1]       [,2]        
[1,] "The Wave" "Whiplash"  
[2,] "The Ring" "The Artist"

## Or by passing c(x,y) index vectors
B
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
B[c(1,2),c(2,3)]
     [,1] [,2]
[1,]    2    3
[2,]    5    6
```
### Array
An array is much like a matrix but can store data in more dimensions (n >= 2). Use `array()` with `dim` to set the layout: `c(rows, columns, number of matrices)`.
```r=
a<- array(1:24, dim=c(3,4,2))
a
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24
```
```r=
## Using array()
## Turn an existing vector into an array
## dim = c(4,3) defines a 4*3 layout
## (the 9 values are recycled to fill the 12 cells)
movie_vector <- c("Akira", "Toy Story", "Room", "The Wave", "Whiplash", "Star Wars", "The Ring", "The Artist", "Jumanji")
movie_array <- array(movie_vector, dim=c(4,3))
movie_array
> movie_array
     [,1]        [,2]         [,3]       
[1,] "Akira"     "Whiplash"   "Jumanji"  
[2,] "Toy Story" "Star Wars"  "Akira"    
[3,] "Room"      "The Ring"   "Toy Story"
[4,] "The Wave"  "The Artist" "Room"     

## Select a single element
movie_array[1,2]
> movie_array[1,2]
[1] "Whiplash"

## Select a row
movie_array[1,]
> movie_array[1,]
[1] "Akira"    "Whiplash" "Jumanji" 

## Select a column
movie_array[,2]
> movie_array[,2]
[1] "Whiplash"   "Star Wars"  "The Ring"  
[4] "The Artist"

## Or select a block of values by position
movie_array[1:2,1:2]
> movie_array[1:2,1:2]
     [,1]        [,2]       
[1,] "Akira"     "Whiplash" 
[2,] "Toy Story" "Star Wars"
```
Elements can also be extracted with `[,,]`.
```r=
result<- array(1:24, dim=c(3,4,2))
print(result)
print(result[1,,1])
print(result[1,,2])
> print(result)
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    4    7   10
[2,]    2    5    8   11
[3,]    3    6    9   12

, , 2

     [,1] [,2] [,3] [,4]
[1,]   13   16   19   22
[2,]   14   17   20   23
[3,]   15   18   21   24

## [row, column, matrix]
> print(result[1,,1])
[1]  1  4  7 10
> print(result[1,,2])
[1] 13 16 19 22
> print(result[1,2,2])
[1] 16
```
### Quick comparison: Array vs. Matrices
Uni-dimensional arrays are called vectors in R. Two-dimensional arrays are called matrices.

### Factors
Factors take only a predefined, finite number of categorical values.
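To make "predefined, finite" concrete, here is a minimal sketch. The `sizes` data is made up for illustration; it shows that a value outside a factor's levels cannot be stored and turns into `NA`.

```r
# Hypothetical data: clothing sizes with a fixed set of allowed levels
sizes <- factor(c("S", "XL", "M"), levels = c("S", "M", "L"))

levels(sizes)      # the predefined categories: "S" "M" "L"
is.na(sizes)       # "XL" is not a level, so the second element is NA
as.integer(sizes)  # internally each value is the index of its level
```

The same mechanism explains any `<NA>` that shows up when a factor is built with an explicit `levels` argument and one of the input strings is spelled differently.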
Create one with `factor()`. A factor carries two attributes: `class` & `levels`. For example:
```r=
x <- factor(c("male","female", "female", "male"))
x
> x
[1] male   female female male  
Levels: female male

x <- factor(c("male", "female", "female", "male"), levels = c("male", "female"))
x
> x
[1] male   female female
[4] male  
Levels: male female

x[2]
> x[2]
[1] female
Levels: male female
```
```r=
genre_vector <- c("comedy", "comedy", "Animation", "Animation", "Crime")
genre_factor <- factor(genre_vector)
# Converting to a factor lists the distinct values automatically
genre_factor
> genre_factor
[1] comedy    comedy    Animation Animation Crime    
Levels: Animation comedy Crime

# summary() on a factor gives the count of each level
# summary() can also describe a plain vector
summary(genre_factor)
> summary(genre_factor)
Animation    comedy     Crime 
        2         2         1 

genre_vector <- c("comedy", "comedy", "Animation", "Animation", "Crime")
summary(genre_vector)
> summary(genre_vector)
   Length     Class      Mode 
        5 character character 
```
#### How to change the order with factor arguments
Without extra arguments, the levels come out in alphabetical order, which is usually not the meaningful order.
```r=
movielength_vector <- c("Very short", "short", "medium", "short", "long", "very short", "very long")
movielg <- factor(movielength_vector)
movielg
> movielg
[1] Very short short      medium     short      long       very short very long 
Levels: long medium short very long very short Very short
```
Setting `ordered = TRUE` together with `levels` gives the factor a defined ranking. The first element becomes `<NA>` below because "Very short" (capital V) is not among the given levels.
```r=
mvlength_factor <- factor(movielength_vector, ordered = TRUE, levels = c("very short", "short", "medium", "long", "very long"))
mvlength_factor
# ordered = TRUE: levels are ranked from smallest to largest
# levels = c("very short", "short", "medium", "long", "very long") sets the ranking
> mvlength_factor
[1] <NA>       short      medium     short      long       very short very long 
Levels: very short < short < medium < long < very long
```
### Dataframe
The data frame is the most commonly used data structure in R.
A data frame is a data structure that holds related information in rows and columns.
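As a small illustration of "heterogeneous, two-dimensional", the sketch below builds a data frame whose columns have different types. The `pets` data is made up. `stringsAsFactors = FALSE` is passed explicitly so text columns stay character in any R version; before R 4.0 the default was `TRUE`, which is why older output shows `Factor` columns.

```r
# Hypothetical data, for illustration only
pets <- data.frame(
  name = c("Rex", "Tom"),
  age = c(3, 5),
  vaccinated = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

sapply(pets, class)  # one class per column: character, numeric, logical
dim(pets)            # number of rows, number of columns
```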
Below are examples of data frames.
```r=
name <- c("Joe", "John", "Nancy")
sex <- c("M", "M", "F")
age <- c(27,26,26)
df <- data.frame(name, sex, age)
df
> df
   name sex age
1   Joe   M  27
2  John   M  26
3 Nancy   F  26
```
```r=
movie <- data.frame(name = c("Toy Story", "Akira", "The Breakfast Club", "The Artist", "Modern Times", "Fight Club", "City of God", "The Untouchables"),
                    year = c(1995, 1998, 1985, 2011, 1936, 1999, 2002, 1987))
movie
> movie
                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999
7        City of God 2002
8   The Untouchables 1987
```
#### How to extract data from a dataframe
Extract a column, e.g. `movie$name`, `movie[1]`. The `Levels:` line in the output below appears because it was produced by an R version before 4.0, where `data.frame()` converted strings to factors by default.
```r=
movie$name
> movie$name
[1] Toy Story          Akira             
[3] The Breakfast Club The Artist        
[5] Modern Times       Fight Club        
[7] City of God        The Untouchables  
8 Levels: Akira City of God ... Toy Story
```
```r=
movie[1]
> movie[1]
                name
1          Toy Story
2              Akira
3 The Breakfast Club
4         The Artist
5       Modern Times
6         Fight Club
7        City of God
8   The Untouchables
```
Extract the value at a specific position.
```r=
movie
movie[1,2]
> movie
                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999
7        City of God 2002
8   The Untouchables 1987
> movie[1,2]
[1] 1995
```
A pattern that is handy in practice: use column 1 to pick out one record, then name the attribute you want from it.
```r=
movie[movie[,1] == "Toy Story","year"]
> movie[movie[,1] == "Toy Story","year"]
[1] 1995
```
#### str(): a quick look at a dataframe
The `str()` function compactly displays the internal structure of an R object.
```r=
str(movie)
> str(movie)
'data.frame':	8 obs. of  2 variables:
 $ name: Factor w/ 8 levels "Akira","City of God",..: 8 1 6 5 4 3 2 7
 $ year: num  1995 1998 1985 2011 1936 ...
```
#### head() & tail(): show the first and last rows
`head()` and `tail()` show the first 6 and the last 6 rows.
```r=
head(movie)
> head(movie)
                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999
```
```r=
> tail(movie)
                name year
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999
7        City of God 2002
8   The Untouchables 1987
```
#### How to add a new column
A column and its data can be added by assignment. With this approach the number of values must match the number of rows.
```r
movie["Satisfactory"] <- c(4, 3.9, 3.8, 3.5, 2.1, 2.9, 4.1, 5)
movie
> movie
                name year Satisfactory
1          Toy Story 1995          4.0
2              Akira 1998          3.9
3 The Breakfast Club 1985          3.8
4         The Artist 2011          3.5
5       Modern Times 1936          2.1
6         Fight Club 1999          2.9
7        City of God 2002          4.1
8   The Untouchables 1987          5.0
```
#### Using rbind() to add a new record
```r=
movie <- rbind(movie, c(name ="Dr.Strangelove", year=1964, Satisfactory=2.5))
movie
Warning message:
In `[<-.factor`(`*tmp*`, ri, value = "Dr.Strangelove") :
  invalid factor level, NA generated
> movie
                name year Satisfactory
1          Toy Story 1995            4
2              Akira 1998          3.9
3 The Breakfast Club 1985          3.8
4         The Artist 2011          3.5
5       Modern Times 1936          2.1
6         Fight Club 1999          2.9
7        City of God 2002          4.1
8   The Untouchables 1987            5
9               <NA> 1964          2.5
```
The movie name did not make it in, and a warning was printed: the `name` column is a factor, and "Dr.Strangelove" is not one of its levels. Converting the column to character fixes this.
```r=
# convert variable to character
# as.character() turns the factor column into a character vector
movie$name <- as.character(movie$name)
movie <- rbind(movie, c(name ="Dr.Strangelove", year=1964, Satisfactory=2.5))
movie
# now the record is added
> movie
                name year Satisfactory
1          Toy Story 1995            4
2              Akira 1998          3.9
3 The Breakfast Club 1985          3.8
4         The Artist 2011          3.5
5       Modern Times 1936          2.1
6         Fight Club 1999          2.9
7        City of God 2002          4.1
8   The Untouchables 1987            5
9     Dr.Strangelove 1964          2.5
```
#### Deleting rows by reassignment
Rows can be removed by indexing the dataframe with negative numbers and assigning the result back. For example, to delete the ninth row, use
`movie <- movie[-9,]`:
```r=
movie <- movie[-9,]
movie
> movie
                name year Satisfactory
1          Toy Story 1995            4
2              Akira 1998          3.9
3 The Breakfast Club 1985          3.8
4         The Artist 2011          3.5
5       Modern Times 1936          2.1
6         Fight Club 1999          2.9
7        City of God 2002          4.1
8   The Untouchables 1987            5
```
To delete rows 7 through 9, use `movie <- movie[-9:-7,]`:
```r=
movie <- movie[-9:-7,]
movie
> movie
                name year Satisfactory
1          Toy Story 1995            4
2              Akira 1998          3.9
3 The Breakfast Club 1985          3.8
4         The Artist 2011          3.5
5       Modern Times 1936          2.1
6         Fight Club 1999          2.9
```
#### How to remove a column from a dataframe
Assigning `NULL` to a column removes it.
```r=
movie["Satisfactory"] <- NULL
movie
> movie
                name year
1          Toy Story 1995
2              Akira 1998
3 The Breakfast Club 1985
4         The Artist 2011
5       Modern Times 1936
6         Fight Club 1999
```
## Conditionals (if)
### If..else
Conditional blocks in R are written with curly braces `{}`, for example:
```r=
movie_year <- 2002
if(movie_year>2000){
  print("Movie year is greater than 2000")
}
> if(movie_year>2000){
+   print("Movie year is greater than 2000")
+ }
[1] "Movie year is greater than 2000"
```
```r=
movie_year <- 1997
if(movie_year>2000){
  print("Movie year is greater than 2000")
} else {
  print("Movie year is not greater than 2000")
}
> if(movie_year>2000){
+   print("Movie year is greater than 2000")
+ } else {
+   print("Movie year is not greater than 2000")
+ }
[1] "Movie year is not greater than 2000"
```
#### ifelse
```r=
age <- 20
ifelse(age>=18, "Major", "Minor")
> ifelse(age>=18, "Major", "Minor")
[1] "Major"
```
#### switch
```r=
age <- "Major"
switch (age,
  Major = {
    print("Age is greater than 18")
  },
  Minor = {
    print("Age is less than 18")
  }
)
```
#### Nested conditionals
```r=
x <- 0
if(x < 0) {
  print("Negative number")
} else if (x > 0) {
  print("Positive number")
} else print("Zero") ## braces are optional for a single-statement else
```
### Loops and break
#### For Loop
```r=
years <- c(1995, 1998, 1985, 2011, 1936, 1999)
for (yr in years) {
  print(yr)
}
> for (yr in years) {
+   print(yr)
+ }
[1] 1995
[1] 1998
[1] 1985
[1] 2011
[1] 1936
[1] 1999
```
A for loop can also contain a condition, so each element of the vector is checked in turn against it.
```r=
years <-
  c(1995, 1998, 1985, 2011, 1936, 1999)
for (yr in years) {
  if(yr < 1980) {
    print("Old Movie")
  } else {
    print("Not that old")
  }
}
> for (yr in years) {
+   if(yr < 1980) {
+     print("Old Movie")
+   } else {
+     print("Not that old")
+   }
+ }
[1] "Not that old"
[1] "Not that old"
[1] "Not that old"
[1] "Not that old"
[1] "Old Movie"
[1] "Not that old"
```
#### While Loop
```r=
count <- 1
while(count <= 5){
  print(c("Iteration number:", count))
  count <- count + 1
}
> while(count <= 5){
+   print(c("Iteration number:", count))
+   count <- count + 1
+ }
[1] "Iteration number:" "1"
[1] "Iteration number:" "2"
[1] "Iteration number:" "3"
[1] "Iteration number:" "4"
[1] "Iteration number:" "5"
```
#### Repeat loop
`repeat` loops forever on its own; put a condition with `break` inside it to stop the iteration.
```r=
x <- 1
repeat {
  print(x)
  x = x+1
  if(x == 6){
    break
  }
}
```
#### Break
Use `break` to leave the whole loop once some condition is met.
```r=
num <- 1:5
for (val in num) {
  if (val == 3) {
    break
  }
  print(val)
}
[1] 1
[1] 2
```
#### Next
Use `next` to skip the current iteration once some condition is met.
```r=
num <- 1:5
for (val in num) {
  if (val == 3) {
    next
  }
  print(val)
}
```
## Function
A block of code that can be reused in different parts of a program.
1. Pre-defined function
2. User-defined function

### Pre-defined Function
Functions that are built into R, such as `mean()` and `sort()`.
```r=
ratings <- c(8.7, 6.9, 8.5)
mean(ratings)
sort(ratings)
sort(ratings, decreasing = TRUE)
> mean(ratings)
[1] 8.033333
> sort(ratings)
[1] 6.9 8.5 8.7
> sort(ratings, decreasing = TRUE)
[1] 8.7 8.5 6.9
```
### User-defined function
You can define what a function does yourself, for example:
```r=
printHelloWorld <- function() {
  print("Hello World")
}
printHelloWorld()
> printHelloWorld()
[1] "Hello World"
```
A function can take parameters. Here the parameters are x and y; pass the arguments when calling `add` and it produces the value.
```r=
add <- function(x, y) {
  x+y
}
add(3, 4)
> add(3, 4)
[1] 7
```
### Using return
`return()` hands back a specific value once the result is computed. A function returns the value of its last evaluated expression anyway, so for a single value `return()` is optional.
```r=
add <- function(x, y) {
  return(x+y)
}
add(3, 4)
> add(3, 4)
[1] 7
```
When the result should come back in a particular structure, such as an array, build it and pass it to `return()`.
```r=
arithmetic = function(x,y) {
  # add
  add = x + y
  # subtract
  sub = x - y
  # multiply
  mul = x * y
  # divide
  div = x / y
  # return the results as a 1x4 array
  vector1 <- c(add, sub, mul, div)
  array1 <- array(vector1, dim=c(1,4))
  return(array1)
}
> arithmetic(10,20)
     [,1] [,2] [,3] [,4]
[1,]   30  -10  200  0.5
```
`return()` matters most when the returned value depends on a condition. Below, the value returned depends on the size of the rating.
```r=
isGoodRating <- function(rating) {
  if(rating < 7){
    return("No")
  }else{
    return("Yes")
  }
}
isGoodRating(10)
isGoodRating(6)
> isGoodRating(10)
[1] "Yes"
> isGoodRating(6)
[1] "No"
```
We can also tweak the parameters: adding a `threshold` argument with a default value lets the comparison value be changed on each call.
```r=
isGoodRating<-function(rating, threshold = 8){
  if(rating<threshold){
    return("You suck")
  }else{
    return("Nice")
  }
}
isGoodRating(2)
isGoodRating(2, threshold=1)
> isGoodRating(2)
[1] "You suck"
> isGoodRating(2, threshold=1)
[1] "Nice"
```
### Worked example
Task: take a movie name as input and decide whether the movie is good.
Step 1: Build a dataframe.
Step 2: Write a check that responds according to whether the rating is >= a given value.
Step 3: Write a function that pulls out the relevant data.
![](https://i.imgur.com/xEaX2md.png)
```r=
## Step 1
movie <- data.frame(
  name = c("Toy Story", "Akira", "The Breakfast Club", "The Artist", "Modern Times", "Fight Club"),
  year = c(1995, 1998, 1985, 2011, 1936, 1999),
  length_min = c(81,125,97,100,87,139),
  genre =
    c("Animation","Animation","Drama","Romance","Comedy","Drama"),
  average_rating = c(8.3, 8.1, 7.9, 8, 8.6, 8.9),
  cost_millions = c(30, 10.4, 1, 15, 1.5, 63),
  foreign = c(0, 1, 0, 1, 0, 0),
  age_restriction = c(0, 14, 14, 12, 10, 18)
)

## Step 2
isGoodRating <- function(rating, threshold){
  if(rating >= threshold){
    return("Good Movie")
  }else{
    return("Not Good")
  }
}

## Step 3
watchMovie <- function(moviename, my_threshold = 9){
  rating <- movie[movie[,1] == moviename,"average_rating"]
  isGoodRating(rating, threshold = my_threshold)
}
watchMovie("Akira")
> watchMovie("Akira")
[1] "Not Good"
```
### Global & Local
`<<-` assigns to a variable outside the function, creating it in the global environment if it does not exist yet.
`<-` assigns to a variable that is local to the function.
```r=
myFunction <- function() {
  y <<-3.14 #Global
  temp <- 'Hello World' #local
  return(temp) #Output
}
myFunction()
y
temp
> myFunction()
[1] "Hello World"
> y
[1] 3.14
> temp
Error: object 'temp' not found
```
## Object
### How to check the class of an object
The classes covered here are these five:
1. Numeric
2. Character
3. Logical
4. Integer
5. List

Use `class()` to check.
### How to change the class
Use functions such as `as.integer` or `as.character` to convert the class.
```r=
age_restriction <- c(12, 18, 19, 20)
class(age_restriction)
age_restriction <- as.integer(age_restriction)
class(age_restriction)
age_restriction <- as.character(age_restriction)
class(age_restriction)
> class(age_restriction)
[1] "numeric"
> age_restriction <- as.integer(age_restriction)
> class(age_restriction)
[1] "integer"
> age_restriction <- as.character(age_restriction)
> class(age_restriction)
[1] "character"
```
### How to debug
Use `tryCatch()` to catch errors. Without a handler, the error is reported as usual.
```r=
tryCatch("a"+10)
> tryCatch("a"+10)
Error in "a" + 10 : non-numeric argument to binary operator
```
`tryCatch()` can be used much like a conditional: the `error` handler only runs when the expression fails.
```r=
tryCatch(10+10, error=function(e) print("Oops, something went wrong!"))
> tryCatch(10+10, error=function(e)
+   print("Oops, something went wrong!"))
[1] 20

tryCatch(10+"a", error=function(e) print("Oops, something went wrong!"))
> tryCatch(10+"a", error=function(e)
+   print("Oops, something went wrong!"))
[1] "Oops, something went wrong!"
```
The same approach can guard loops or other functions.
```r=
tryCatch(
  for (i in 1:3) {
    print(i + "a")
  },
  error = function(e) print("Found error.")
)
```
`tryCatch()` can also customize warning messages.
```r=
as.integer("A")
> as.integer("A")
[1] NA
Warning message:
NAs introduced by coercion

tryCatch(as.integer("A"), warning = function(e) print("Warning."))
> tryCatch(as.integer("A"),
+   warning = function(e)
+   print("Warning."))
[1] "Warning."
```
## Text operations in R
### Read a text file: readLines()
```r=
summary <- readLines("C:/Users/GF63/Desktop/123.txt")
summary
> summary
[1] "From Wikipedia, the free encyclopedia"
[2] "Jump to navigationJump to search"
```
### Count the characters in a paragraph: nchar()
```r=
nchar(summary[1]) # number of characters in the paragraph
> nchar(summary[1])
[1] 37
```
### Convert to upper or lower case: toupper() & tolower()
```r=
toupper(summary[1]) # to upper case
> toupper(summary[1])
[1] "FROM WIKIPEDIA, THE FREE ENCYCLOPEDIA"
tolower(summary[1]) # to lower case
> tolower(summary[1])
[1] "from wikipedia, the free encyclopedia"
```
### Replace spaces with a symbol: chartr()
```r=
chartr(" ", "-", summary[1]) # replace every space with the given symbol
> chartr(" ", "-", summary[1])
[1] "From-Wikipedia,-the-free-encyclopedia"
```
### Split text on a symbol or space into a list: strsplit()
```r=
char_list <- strsplit(summary[1], " ")
char_list
> char_list
[[1]]
[1] "From"         "Wikipedia,"   "the"          "free"        
[5] "encyclopedia"
```
### Flatten the list into separate elements: unlist()
```r=
word_list <- unlist(char_list)
word_list
> word_list
[1] "From"         "Wikipedia,"   "the"          "free"        
[5] "encyclopedia"
```
### Sort the strings: sort()
```r=
sorted_list <- sort(word_list)
sorted_list
> sorted_list
[1] "encyclopedia" "free"         "From"         "the"         
[5] "Wikipedia,"  
```
### Extract text at a given position: substr()
```r=
sub_string <- substr(summary[1], start=4, stop = 50)
sub_string
> sub_string
[1] "m Wikipedia, the free encyclopedia"
```
### Trim leading and trailing whitespace: trimws()
```r=
trimws(sub_string)
```
### Extract text from the end: str_sub()
```r=
install.packages("stringr")
library(stringr)
str_sub(summary[1], -8, -1)
> str_sub(summary[1], -8, -1)
[1] "clopedia"
```
## Date handling
[Reference | lubridate](https://cran.r-project.org/web/packages/lubridate/lubridate.pdf)
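As a compact orientation for this section, here is a base-R sketch of the usual round trip: parse, compute, format. The timestamp and dates are illustrative, and `tz = "UTC"` is an assumption made so the result does not depend on the machine's local time zone.

```r
# A Unix timestamp: seconds since 1970-01-01 00:00:00 UTC (illustrative value)
ts <- 153360000

# Parse it as a date-time in UTC, then drop the time part
dt <- as.POSIXct(ts, origin = "1970-01-01", tz = "UTC")
d <- as.Date(dt, tz = "UTC")

# Subtracting two Date objects gives a difference in days
span <- as.Date("2009-12-01") - as.Date("2000-12-01")

format(d, "%d/%m/%Y")  # print the date in day/month/year form
as.numeric(span)       # the number of days as a plain number
```

Leaving `tz` out makes `as.POSIXct()` print in the local time zone, which is why the same timestamp can display with a different hour on different machines.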
### Convert UNIX time to a date: as.POSIXct() & as.Date()
```r=
bestActors <- data.frame(Actor.Name = c("Leonardo DiCaprio", "Eddie Redmayne", "Matthew McConaughey"),
                         Date.of.Birth = c(153360000, 379123200, -5011200))
bestActors
> bestActors
           Actor.Name Date.of.Birth
1   Leonardo DiCaprio     153360000
2      Eddie Redmayne     379123200
3 Matthew McConaughey      -5011200

# First use as.POSIXct to get date-times (YYYY-MM-DD HH:MM:SS)
actors.birthday <- as.POSIXct(bestActors$Date.of.Birth, origin="1970-01-01")
actors.birthday
> actors.birthday
[1] "1974-11-11 08:00:00 CST" "1982-01-06 08:00:00 CST"
[3] "1969-11-04 08:00:00 CST"

# The time is not needed, so use as.Date to keep just the date
actors.birthday <- as.Date(actors.birthday)
actors.birthday
> actors.birthday
[1] "1974-11-11" "1982-01-06" "1969-11-04"
```
### Convert dates written as YYYY/MM/DD
```r=
# Dates must be entered as quoted character strings.
# Typed without quotes, the slashes are evaluated as division.
bestActress <- data.frame(
  Actor.Name = c("Brie Larson", "Julianne Moore", "Cate Blanchett"),
  Date.of.Birth = c("1989/10/01", "1960/12/03", "1969/05/14"))

# Output when the dates were entered without quotes (1989/10/01 = 198.9):
> bestActress
      Actor.Name Date.of.Birth
1    Brie Larson     198.90000
2 Julianne Moore      54.44444
3 Cate Blanchett      28.12857

# Convert with as.Date, passing a format string
# so that R can parse YYYY/MM/DD into Date values
actresses.birthday <- as.Date(bestActress$Date.of.Birth, "%Y/%m/%d")
actresses.birthday
> actresses.birthday
[1] "1989-10-01" "1960-12-03" "1969-05-14"
```
### Usage and format codes for as.Date()
```r=
as.Date("27/06/94", "%d/%m/%y")
> as.Date("27/06/94", "%d/%m/%y")
[1] "1994-06-27"
as.Date("27/06/1994", "%d/%m/%Y")
> as.Date("27/06/1994", "%d/%m/%Y")
[1] "1994-06-27"
```
The date format codes at a glance:
1. `%a` abbreviated weekday name (e.g. Mon, Tue)
2. `%A` full weekday name (e.g. Monday, Tuesday)
3. `%b` abbreviated month name (e.g. Aug, Feb)
4. `%B` full month name (e.g. August, July)
5. `%d` day of the month
6. `%m` month as a number
7. `%y` two-digit year (e.g. 94 -> 1994)
8. `%Y` four-digit year (e.g.
2009, 2010)

### Date differences
```r=
as.Date("2009/12/1") - as.Date("2000/12/1")
> as.Date("2009/12/1") - as.Date("2000/12/1")
Time difference of 3287 days
```
### Comparing dates
```r=
as.Date("1994/12/1") > as.Date("1995/1/1")
> as.Date("1994/12/1") > as.Date("1995/1/1")
[1] FALSE
```
### Date arithmetic
```r=
as.Date("1994/06/27") - 15
> as.Date("1994/06/27") - 15
[1] "1994-06-12"
```
### Current date and time
```r=
Sys.Date() # the system date
> Sys.Date()
[1] "2022-03-13"
date() # the current date and time as a string
> date()
[1] "Sun Mar 13 20:16:33 2022"
Sys.time() # the system date and time
> Sys.time()
[1] "2022-03-13 20:23:10 CST"
```
### Extracting parts of a date: weekdays(), months() & quarters()
These take Date or date-time objects, such as the values returned by `Sys.Date()` and `Sys.time()`.
```r=
> Sys.Date()
[1] "2022-03-13"
> Sys.time()
[1] "2022-03-13 20:23:10 CST"
> weekdays(Sys.Date())
[1] "Sunday"
> weekdays(Sys.time())
[1] "Sunday"
> months(Sys.Date())
[1] "March"
> months(Sys.time())
[1] "March"
> quarters(Sys.Date())
[1] "Q1"
> quarters(Sys.time())
[1] "Q1"
```
### Other ways to present dates
```r=
# Julian day count: days since the origin (1970-01-01)
julian(Sys.Date())
> julian(Sys.Date())
[1] 19064
attr(,"origin")
[1] "1970-01-01"

# The same for a date-time
julian(Sys.time())
> julian(Sys.time())
Time difference of 19064.52 days

# List the same day in the following months
seq(Sys.Date(), by = "month", length.out = 4)
> seq(Sys.Date(), by = "month", length.out = 4)
[1] "2022-03-13" "2022-04-13" "2022-05-13" "2022-06-13"

# List the same date and time in the following months
seq(Sys.time(), by = "month", length.out = 4)
> seq(Sys.time(), by = "month", length.out = 4)
[1] "2022-03-13 20:26:38 CST" "2022-04-13 20:26:38 CST"
[3] "2022-05-13 20:26:38 CST" "2022-06-13 20:26:38 CST"
```
## Using Regular Expressions
### grep(): validating data
Regular expressions are used for matching patterns in strings.
![](https://i.imgur.com/Ai9UcYb.png)
![](https://i.imgur.com/QWQYrnv.png)
![](https://i.imgur.com/8qHobDW.png)
```r=
grep("@.*", c("test@testing.com", "not an email", "test2@testing.com"))
> grep("@.*", c("test@testing.com", "not an email", "test2@testing.com"))
[1] 1 3

grep("@.*", c("test@testing.com", "not an email", "test2@testing.com"), value=TRUE)
> grep("@.*", c("test@testing.com", "not an email", "test2@testing.com"), value=TRUE)
[1] "test@testing.com"  "test2@testing.com"
```
### gsub(): replacing data
```r=
gsub("@.*", "@newdomain.com", c("test@testing.com", "not an email", "test2@testing.com"))
> gsub("@.*", "@newdomain.com", c("test@testing.com", "not an email", "test2@testing.com"))
[1] "test@newdomain.com"  "not an email"        "test2@newdomain.com"
```
### regexpr() & regmatches(): extracting the matched data
```r=
matches <- regexpr("@.*", c("doej@example.com", "not an email", "erina@example.com"))
matches
> matches
[1]  5 -1  6
attr(,"match.length")
[1] 12 -1 12
attr(,"index.type")
[1] "chars"
attr(,"useBytes")
[1] TRUE

regmatches(c("doej@example.com", "not an email", "erina@example.com"), matches)
> regmatches(c("doej@example.com", "not an email", "erina@example.com"), matches)
[1] "@example.com" "@example.com"
```
## Working with data
### Inspecting data
### Data and parsing example
```r=
install.packages("readxl")
library(readxl)
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
setwd(choose.dir()) # pick the folder in a dialog (Windows only)
getwd() # show the current working directory
BankCustomer <- read.csv("Bank Customer data.csv")
View(BankCustomer)
str(BankCustomer) # str - structure
BankCustomer1 <- read.csv("Bank Customer data.csv", stringsAsFactors = TRUE) # treat text as factors
BankCustomer2 <- read.csv("Bank Customer data.csv", stringsAsFactors = FALSE) # do not treat text as factors
str(BankCustomer2)
```
![](https://i.imgur.com/S2CxBFz.png)
### Importing common data formats into R
#### Excel
```r=
#Example 1
library("readxl")
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
read_excel("1555058318_internet_dataset.xlsx")
```
```r=
#Example 2
library(gdata) #load gdata package
help(read.xls) #documentation
mydata = read.xls("mydata.xls") #read from first sheet
```
```r=
#Example 3
library(XLConnect)
wk = loadWorkbook("mydata.xls")
df = readWorksheet(wk, sheet="Sheet1")
```
#### Minitab
```r=
library(foreign)
help(read.mtp)
mydata = read.mtp("mydata.mtp")
```
#### Table
```r=
help(read.table)
mydata = read.table("mydata.txt")
```
#### CSV
```r=
help(read.csv)
mydata = read.csv("mydata.csv", sep=",")
```
### Exporting data from R
#### Table
```r=
help(write.table)
write.table(mydata, "output_path_and_file_name", sep="\t")
```
#### Excel
```r=
library(xlsx)
help(write.xlsx)
write.xlsx(mydata, "output_path_and_file_name")
```
#### CSV
```r=
help(write.csv)
write.csv(mydata, file="mydata.csv")
```
### More on reading data
File paths in R use forward slashes, as URLs do, rather than the Windows default backslashes. Remember to change them when copying and pasting a path.
#### read.csv()
![](https://i.imgur.com/IPzNuOI.png)
```r=
a <- read.csv("C:/Users/GF63/Desktop/2019_04_10350-02-01-2_臺中市勞資爭議案件.csv")
```
#### How to read an Excel file?
```r=
install.packages("readxl")
library(readxl)
read_excel("C:/Users/GF63/Desktop/2019_04_10350-02-01-2_臺中市勞資爭議案件.xlsx")
```
#### How to view the data that was read?
Assign it to a variable and refer to that variable.
```r=
my_data <- read.csv("C:/Users/GF63/Desktop/2019_04_10350-02-01-2_臺中市勞資爭議案件.csv")
my_data
```
Specific parts can also be extracted with `[]`.
```r=
my_data['Complex5'] # a specific column
my_data[1,] # a specific row
my_data[1, c("name", "length_min")] # specific attributes of a specific row (see the figure below)
```
![](https://i.imgur.com/DWWxOEe.png)
#### How to list all built-in datasets
```r=
data()
```
![](https://i.imgur.com/MBWJVPp.png)
Use `help()` to dig into the contents of a dataset.
```r=
help(CO2)
```
![](https://i.imgur.com/1v2stOz.png)
#### How to read a txt file
Use `readLines()`.
```r=
text <- readLines("C:/Users/GF63/Desktop/123.txt")
text
```
* Count the lines (paragraphs): `length()`
* Count the characters in each line: `nchar()`
* Get the file size in bytes: `file.size()`
* Read data into a vector or list: `scan()`

```r=
length(text)
>[1] 6
nchar(text)
>[1]  37  32  48   0 147 469
file.size("C:/Users/GF63/Desktop/123.txt")
>[1] 743
text <- scan("C:/Users/GF63/Desktop/123.txt", "")
>Read 128 items
```
![](https://i.imgur.com/OKKSdC1.png)
### How to export files
#### TXT files
The example below writes a matrix to a TXT file.
```r=
# How to save the matrix in a file?
m <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
write(m, file = "C:/Users/GF63/Desktop/matrix_as_text.txt", ncolumns = 3, sep = " ")
```
![](https://i.imgur.com/pdN1qEz.png)
#### How to export to a csv file
```r=
# How to save the matrix in a file?
m <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
write.csv(m, file = "C:/Users/GF63/Desktop/matrix_as_text.csv", row.names = FALSE)
```
![](https://i.imgur.com/ofk9tn4.png)
```r=
# How to save the matrix in a file?
m <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
write.table(m, file = "C:/Users/GF63/Desktop/data.csv", row.names = FALSE, col.names= FALSE, sep = ",")
```
![](https://i.imgur.com/cp0mPlX.png)
#### How to export to an Excel file
This needs the xlsx package, so install it first. R and Java must not be too old, or the package will not work.
```r=
install.packages("xlsx")
library(xlsx)
m <- matrix(c(1,2,3,4,5,6), nrow = 2, ncol = 3)
write.xlsx(m, file = "C:/Users/GF63/Desktop/data.xlsx", sheetName="Sheet1", col.names = TRUE, row.names = FALSE)
```
![](https://i.imgur.com/9BtSH5m.png)
#### How to export to .RData files
`save()` has changed somewhat across versions. Below is an adjusted example; for instance, the `safe` argument has been dropped.
```r=
# Note: this saves the character vector itself (an object named `list`).
# To save the variables whose names it holds, call
# save(list = c("var1", "var2", "var3"), file = ...) instead.
list = c("var1", "var2", "var3")
save(list, file = "C:/Users/GF63/Desktop/vars.RData")
```
![](https://i.imgur.com/aB57Gs6.png)
## Data Manipulation
The apply functions are used to perform a specific operation on each column or row of R objects.
```r=
apply()
lapply()
sapply()
tapply()
mapply()
vapply()
rapply()
```
### apply()
Applies a function to the rows or columns of a matrix and returns a vector, array, or list.
```r=
apply(x, margin, function)
# margin selects whether the function is applied to rows or columns
# margin = 1 indicates that function needs to be applied to each row
# margin = 2 indicates that function needs to be applied to each column
# function can be, for example, mean or sum
```
Example 1
```r=
m <- matrix(c(1,2,3,4),2,2)
apply(m, 1, sum) # Sum of rows
> apply(m, 1, sum)
[1] 4 6
apply(m, 2, sum) # Sum of columns
> apply(m, 2, sum)
[1] 3 7
```
### lapply()
Takes a list as an argument and works by looping through each element in the `list`. The output of the function is a `list`.
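A minimal sketch of the list-in, list-out behavior described here. The `scores` list is made up for illustration, and `unlist()` is used only to show how the result can be flattened afterwards.

```r
# Hypothetical data: exam scores per subject, with different lengths
scores <- list(math = c(80, 90), art = c(70, 75, 95))

# lapply() calls mean() once per element and keeps the element names
res <- lapply(scores, mean)

str(res)     # still a list, one entry per input element
res$art      # pick out one result by name
unlist(res)  # flatten to a named numeric vector when a list is not needed
```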
```r=
my_list <- list(a=c(1,1), b=c(2,2), c=c(3,3))
lapply(my_list, sum)
> lapply(my_list, sum)
$a
[1] 2

$b
[1] 4

$c
[1] 6

lapply(my_list, mean)
> lapply(my_list, mean)
$a
[1] 1

$b
[1] 2

$c
[1] 3
```
### sapply()
If the result is a list and every element in the list is size 1, then a vector is returned.
If the result is a list and every element is of the same size (>1), then a matrix is returned.
```r=
my_list <- list(a=c(1,1), b=c(2,2), c=c(3,3,9,5,6,7))
sapply(my_list, sum) # sum of each element
> sapply(my_list, sum)
 a  b  c 
 2  4 33 

my_list <- list(a=c(1,2), b=c(1,2,3), c=c(1,2,3,4))
> sapply(my_list, range) # range of each element
     a b c
[1,] 1 1 1
[2,] 2 3 4
```
## dplyr explained
The dplyr package transforms and summarizes tabular data with rows and columns.
1. select()
2. filter()
3. arrange()
4. mutate()
5. summarise()

The use of efficient data storage backends by dplyr results in quicker processing speed.
### Usage examples: mtcars & iris
```r=
library(dplyr)
View(mtcars) # view the cars dataset
select(mtcars, mpg, disp) # select specific columns
select(mtcars, mpg:carb) # select the columns from A through B
View(iris) # view the iris dataset
select(iris, starts_with("Petal")) # columns whose names start with Petal
select(iris, ends_with("Width")) # columns whose names end with Width
select(iris, contains("etal")) # columns whose names contain "etal"
select(iris, matches(".t.")) # regex: a "t" with a character on each side
```
### filter()
```r=
filter(mtcars, cyl == 8) # rows where cyl equals 8
filter(mtcars, cyl < 6) # rows where cyl < 6
```
```r=
a <- filter(mtcars, cyl == 8 | vs == 0) # OR
b <- filter(mtcars, cyl == 8 & vs == 0) # AND
c <- filter(mtcars, cyl == 8, vs == 0) # a comma also means AND
write.csv(a, file="C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets/mydata_a.csv")
write.csv(b, file="C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets/mydata_b.csv")
```
![](https://i.imgur.com/R5mqhqK.png)
![](https://i.imgur.com/YMuq0SZ.png)
### arrange()
It arranges the data in a specific order. The default is ascending; use `desc()` for descending order.
```r=
arrange(mtcars, desc(disp)) # sort by disp, descending
arrange(mtcars, cyl, disp) # sort by cyl ascending first, then by disp
```
### summarise()
It summarizes multiple values to a single value in a dataset.
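The same "many values down to one value per group" idea can be sketched in base R, which may be handy as a cross-check when dplyr is not installed. Note that this uses `aggregate()`, a base function swapped in for illustration, not dplyr's `summarise()`; the data is the built-in `mtcars`.

```r
# Mean displacement for each number of cylinders (one row per group)
by_cyl <- aggregate(disp ~ cyl, data = mtcars, FUN = mean)
by_cyl

# Without grouping, the whole column collapses to a single value
overall <- mean(mtcars$disp)
overall
```

`by_cyl` is an ordinary data frame with one row for each distinct `cyl` value, so it can be indexed or exported like any other data frame.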
Use group_by() to define the groups, then summarise each group with functions such as mean() or sd().
```r=
summarise(group_by(mtcars, cyl), mean(disp))
> summarise(group_by(mtcars, cyl), mean(disp))
# A tibble: 3 x 2
    cyl `mean(disp)`
  <dbl>        <dbl>
1     4         105.
2     6         183.
3     8         353.
```
```r=
summarise(group_by(mtcars, cyl), m = mean(disp), sd = sd(disp))
> summarise(group_by(mtcars, cyl), m = mean(disp), sd = sd(disp))
# A tibble: 3 x 3
    cyl     m    sd
  <dbl> <dbl> <dbl>
1     4  105.  26.9
2     6  183.  41.6
3     8  353.  67.8
```
### mutate()
mutate() adds new variables to the original dataset.
```r=
mutate(mtcars, my_custom_disp = disp/1.0237)
```
### Worked Example: BankCustomersData
#### Step 1: Import the data and load the package (plyr)
```r=
install.packages("plyr")
library(plyr)
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
BankCustomerData <- read.csv("Bank Customer data.csv")
```
#### Step 2: Rename a column with rename()
The column name contains unusual characters, so use str() or another method to look up the exact name first.
```r=
# "ï..age" is how the first column name came out after import (likely a byte-order-mark artifact)
BankCustomerData <- rename(BankCustomerData, c("ï..age" = "Age"))
View(BankCustomerData)
```
Use nested ifelse() calls inside transform() to add a conditional column.
```r=
BankCustomerDataCategorized <- transform(BankCustomerData,
    Generation = ifelse(Age < 22, "Z",
                 ifelse(Age < 41, "Y",
                 ifelse(Age < 53, "X", "Baby Boomers"))))
View(BankCustomerDataCategorized)
BankCustomerDataCategorized
```
Use table() to produce a frequency table.
```r=
## Make the frequency table
table(BankCustomerDataCategorized$Generation, BankCustomerDataCategorized$poutcome)
> table(BankCustomerDataCategorized$Generation, BankCustomerDataCategorized$poutcome)

               failure other success unknown
  Baby Boomers      79    25      36     610
  X                128    58      27    1126
  Y                282   112      65    1959
  Z                  1     2       1      10
```
![](https://i.imgur.com/tHUsqry.png)
## Classic Data Visualizations
### Bar Chart
#### Bar Chart (Vertical)
```r=
counts <- table(mtcars$gear) # tabulate the column and assign the result to counts
barplot(counts) # draw the counts as a bar chart
```
![](https://i.imgur.com/FCwROya.png)
#### Bar Chart (Horizontal)
```r=
counts <- table(mtcars$gear) # tabulate the column and assign the result to counts
barplot(counts, horiz = TRUE) # draw the bar chart with horiz set to TRUE
```
![](https://i.imgur.com/uxzcRYb.png)
### Bar Plots
Bar plots are horizontal or vertical bars used to show comparisons between categorical values.
They represent length, frequency, or proportion of categorical values.
```r=
counts <- table(mtcars$gear) # tabulate the column and assign the result to counts
barplot(counts, # the data to plot
        main = "Simple Bar Plot", # chart title
        xlab = "Improvement", # x-axis label
        ylab = "Frequency", # y-axis label
        legend = rownames(counts), # source of the legend labels
        col = c("blue", "yellow", "green") # bar colors
)
```
![](https://i.imgur.com/NbehBZE.png)
### Stacked bar plot with colors
```r=
counts <- table(mtcars$vs, mtcars$gear) # table(sub-category, main category)
barplot(counts,
        main = "Car Distribution by Gears and VS", # chart title
        xlab = "Number of Gears", # x-axis label
        col = c("Grey", "cornflowerblue"), # sub-category colors
        legend = rownames(counts) # source of the legend labels
)
```
![](https://i.imgur.com/UvYYvUF.png)
```r=
# Swapping the arguments puts VS on the x-axis and stacks each bar by gear
counts <- table(mtcars$gear, mtcars$vs)
barplot(counts, main = "Car Distribution by Gears and VS", xlab = "VS", col = c("Grey", "cornflowerblue"), legend = rownames(counts))
```
![](https://i.imgur.com/tEk1zfv.png)
#### Grouped Bar Plot
```r=
counts <- table(mtcars$vs, mtcars$gear)
barplot(counts, main = "Car Distribution by Gears and VS", xlab = "Number of Gears", col = c("grey", "cornflowerblue"), legend = rownames(counts), beside = TRUE)
# beside = TRUE places the bars side by side for pairwise comparison
```
![](https://i.imgur.com/dwt162c.png)
```r=
# Swapped arguments again: VS is on the x-axis, with one bar per gear
counts <- table(mtcars$gear, mtcars$vs)
barplot(counts, main = "Car Distribution by Gears and VS", xlab = "VS", col = c("grey", "cornflowerblue"), legend = rownames(counts), beside = TRUE)
```
![](https://i.imgur.com/eiGahiw.png)
### Pie Chart
```r=
slices <- c(10, 12, 4, 16, 8) # the values of the slices
lbls <- c("US", "UK", "Australia", "Germany", "France") # the label of each slice
pie(slices, labels = lbls, main = "Simple Pie Chart") # pie(values, labels, chart title)
```
![](https://i.imgur.com/PCT7uwb.png)
#### Pie Chart with color & percentage
```r=
slices <- c(10, 12, 4, 16, 8) # the values of the slices
pct <- round(slices/sum(slices)*100) # compute the percentages
# use paste() to pair each country with its percentage and append "%"
lbls <- paste(c("US", "UK", "Australia", "Germany", "France"), " ", pct, "%", sep = "")
pie(slices, labels =
    lbls, col = rainbow(5), main = "Pie Chart with Percentages") # rainbow(5) yields five distinct colors
```
![](https://i.imgur.com/9jmxKqa.png)
#### Pie Chart in 3D
```r=
install.packages("plotrix")
library(plotrix) # plotrix must be installed first
slices <- c(10, 12, 4, 16, 8)
pct <- round(slices/sum(slices)*100) # percentages used in the labels
lbls <- paste(c("US", "UK", "Australia", "Germany", "France"), " ", pct, "%", sep = "")
pie3D(slices, labels = lbls, explode = 0.0, main = "3D Pie Chart")
# explode sets how far the slices are pulled apart
```
![](https://i.imgur.com/JjSnkLi.png)
### Histogram
Creating a simple histogram using the mtcars dataset:
```r=
mtcars$mpg
hist(mtcars$mpg)
```
![](https://i.imgur.com/O5pvbLp.png)
#### Coloring the histogram
```r=
mtcars$mpg
hist(mtcars$mpg, breaks = 8, col = "darkgreen")
```
![](https://i.imgur.com/OzufFvH.png)
### Kernel Density Plot
A Kernel Density Plot shows the distribution of a continuous variable. The continuous curve can be drawn with plot().
```r=
density_data <- density(mtcars$mpg)
plot(density_data)
```
![](https://i.imgur.com/10sqA4X.png)
#### Filling density Plot with Color
```r=
density_data <- density(mtcars$mpg)
plot(density_data, main = "Kernel Density of Miles Per Gallon")
polygon(density_data, col = "skyblue", border = "black")
```
![](https://i.imgur.com/jpAKGu4.png)
### Line Chart
A line chart represents a series of data points connected by straight line segments.
```r=
weight <- c(2.5, 2.8, 3.2, 4.8, 5.1, 5.9, 6.8, 7.1, 7.8, 8.1)
months <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
plot(months, weight, type = "b", main = "Baby Weight Chart")
```
![](https://i.imgur.com/wF5RHQH.png)
#### Colored Line Chart
```r=
weight <- c(2.5, 2.8, 3.2, 4.8, 5.1, 5.9, 6.8, 7.1, 7.8, 8.1)
months <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
plot(months, weight, type = "b", main = "Baby Weight Chart", col = "Blue")
```
![](https://i.imgur.com/2WLI2Ur.png)
### Box Plot
A box plot shows the distribution of data based on the five-number summary:
1. Minimum
2. First Quartile
3. Median
4. Third Quartile
5. Maximum
```r=
boxplot(airquality$Ozone, main = "Mean Ozone in parts per billion at Roosevelt Island", xlab = "Parts Per Billion", ylab = "Ozone", horizontal = TRUE)
```
![](https://i.imgur.com/KFQa9Mq.png)
#### Colored Box Plot
```r=
boxplot(airquality$Ozone, main = "Mean Ozone in parts per billion at Roosevelt Island", xlab = "Parts Per Billion", ylab = "Ozone", col = "green", horizontal = TRUE)
```
![](https://i.imgur.com/0kTDQZ3.png)
### Heat Map
A heat map is a 2-dimensional representation of data that uses colors to represent the values. heatmap() requires a matrix, so the data must be converted to a matrix first.
1. Simple Heat Map: provides an immediate summary of the information
2. Elaborate Heat Map: helps in understanding complex datasets
```r=
mat <- as.matrix(mtcars)
heatmap(mat)
```
![](https://i.imgur.com/hKltX4y.png)
#### Editing Heat Map - Normalization
Setting scale to "column" standardizes the values within each column, so columns measured on different scales can be compared.
```r=
mat <- as.matrix(mtcars)
heatmap(mat, scale = "column")
```
![](https://i.imgur.com/XE0rsc7.png)
### ggplot2
It breaks up graphs into semantic components such as scales and layers. It is an alternative to the base graphics of R.
#### Example
Creating a bar plot with just one variable with bars.
## Statistics in R
### Hypothesis
A hypothesis is usually written in if...then form, for example: "If I eat more vegetables, then I will lose weight faster."
### Types of Hypotheses
#### Simple Hypothesis
In a simple hypothesis, there exists a relationship between two variables: one is called the independent variable or cause and the other is called the dependent variable or effect.
#### Complex Hypothesis
Refers to the prediction of a relationship between two or more independent variables or two or more dependent variables.
#### Null Hypothesis
A hypothesis of "no difference". It is denoted by H0.
#### Alternate Hypothesis
Is complementary to the null hypothesis. It is denoted by H1.
#### Statistical Hypothesis
A statistical hypothesis test is a method of statistical inference performed using data from a scientific study.
### Data Sampling
It is a statistical analysis technique used to select, manipulate, and analyze a subset of data points to discover hidden patterns and trends in the larger data set.
The sampling theory draws valid inferences about the population parameters on the basis of sample results.
### Type I and Type II Errors
#### Type I Error
Rejecting H0 when it is true. Its probability is denoted by α.
#### Type II Error
Failing to reject H0 when it is false, that is, when H1 is true. Its probability is denoted by β.
### Confidence Coefficient
The complement of the probability of a Type I error (1 - α); multiplied by 100%, it gives the confidence level.
### Parametric vs. Nonparametric Tests
A parametric statistical test is one that makes assumptions about the parameters (defining properties) of the population distribution(s) from which one's data is drawn.

In parametric statistics, information about the population distribution is known and is based on a fixed set of parameters. In nonparametric statistics, information about the population distribution is unknown and the parameters are not fixed, so hypotheses about the population must be tested.
#### Z-test example
If entrance exam scores follow a normal distribution with a mean of 72 and a standard deviation of 15.2, roughly what proportion of students scored 84 or above?
```r=
pnorm(84, mean = 72, sd = 15.2, lower.tail = FALSE)
# lower.tail = TRUE: proportion of students scoring 84 or below
# lower.tail = FALSE: proportion of students scoring above 84
```
#### t-test example
With 5 degrees of freedom under a t-distribution, find the 2.5th and 97.5th percentiles.
```r=
qt(c(.025, .975), df = 5)
```
#### ANOVA
```r=
# Load the data (read.csv is the comma-separated reader; read.csv2 expects ";" and decimal commas)
df1 <- read.csv("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets/fastfood-1.csv", header = TRUE)
data.frame(df1) # print the imported data as a data frame
colnames(df1) <- c("Item.1", "Item.2", "Item.3") # the original headers are malformed, so rename them
data.frame(df1)
r = c(t(as.matrix(df1))) # flatten the data row by row into one vector for the analysis
r
f = c("Item.1", "Item.2", "Item.3")
k = 3 # number of groups
n = 6 # number of samples per group
tm = gl(k, 1, n*k, factor(f)) # gl() generates a factor by specifying the pattern of its levels
> tm
 [1] Item.1 Item.2 Item.3 Item.1 Item.2 Item.3 Item.1 Item.2 Item.3 Item.1 Item.2 Item.3
[13] Item.1 Item.2 Item.3 Item.1 Item.2 Item.3
# Usage:
# gl(n, k, length, labels, ordered)
# Arguments:
# n: number of levels
# k: number of replications
# length: length of the result
# labels: labels for the levels (optional)
# ordered: logical value, whether the levels are ordered
tm
av = aov(r ~ tm)
summary(av)
> summary(av)
            Df Sum Sq Mean Sq F value Pr(>F)
tm           2  745.4   372.7   2.541  0.112
Residuals   15 2200.2   146.7
```
## Regression Analysis
![](https://i.imgur.com/2uDFvEx.png)
### Simple Regression
```r=
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
Class <- read.csv("Lesson 7_Regression Analysis/Demo 1_Perform simple linear regression.csv")
# data loading stage
View(Class)
str(Class)
summary(Class)
# data checking stage
results <- lm(formula = Weight ~ Height, data = Class)
results
# linear model stage
# option 1: pass the data frame through the data argument
results1 <- lm(formula = Class$Weight ~ Class$Height)
results1
# option 2: reference the columns directly
> results1

Call:
lm(formula = Class$Weight ~ Class$Height)

Coefficients:
 (Intercept)  Class$Height
    -143.027         3.899

summary(results)
> summary(results)

Call:
lm(formula = Weight ~ Height, data = Class)

Residuals:
     Min       1Q   Median       3Q      Max
-17.6807  -6.0642   0.5115   9.2846  18.3698

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -143.0269    32.2746  -4.432 0.000366 ***
Height         3.8990     0.5161   7.555 7.89e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11.23 on 17 degrees of freedom
Multiple R-squared:  0.7705, Adjusted R-squared:  0.757
F-statistic: 57.08 on 1 and 17 DF,  p-value: 7.887e-07
```
### Multiple Regression
```r=
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
cars_data <- read.csv("Lesson 7_Regression Analysis/Demo 2_ Perform Regression Analysis with multiple variables.csv")
# data loading stage
View(cars_data)
str(cars_data)
summary(cars_data)
# data checking stage
cars_results <- lm(formula = MPG_City ~ Type + Origin + DriveTrain + EngineSize + Cylinders + Horsepower + Weight + Wheelbase + Length, data = cars_data)
cars_results
> cars_results

Call:
lm(formula = MPG_City ~ Type + Origin + DriveTrain + EngineSize + Cylinders + Horsepower + Weight + Wheelbase + Length, data = cars_data)

Coefficients:
    (Intercept)        TypeSedan       TypeSports          TypeSUV        TypeTruck
      64.922761       -28.234400       -29.428606       -29.339329       -29.082762
      TypeWagon     OriginEurope        OriginUSA  DriveTrainFront   DriveTrainRear
     -28.157836        -0.537962        -0.371973         1.076169         0.148866
     EngineSize        Cylinders       Horsepower           Weight        Wheelbase
      -0.247646        -0.172933        -0.012773        -0.002739         0.067798
         Length
      -0.052474

summary(cars_results)
> summary(cars_results)

Call:
lm(formula = MPG_City ~ Type + Origin + DriveTrain + EngineSize + Cylinders + Horsepower + Weight +
    Wheelbase + Length, data = cars_data)

Residuals:
    Min      1Q  Median      3Q     Max
-8.0895 -1.2792 -0.1612  0.8440 13.7522

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)      6.492e+01  2.650e+00  24.496  < 2e-16 ***
TypeSedan       -2.823e+01  1.237e+00 -22.817  < 2e-16 ***
TypeSports      -2.943e+01  1.326e+00 -22.195  < 2e-16 ***
TypeSUV         -2.934e+01  1.299e+00 -22.581  < 2e-16 ***
TypeTruck       -2.908e+01  1.347e+00 -21.595  < 2e-16 ***
TypeWagon       -2.816e+01  1.292e+00 -21.801  < 2e-16 ***
OriginEurope    -5.380e-01  3.152e-01  -1.707  0.08867 .
OriginUSA       -3.720e-01  2.742e-01  -1.357  0.17565
DriveTrainFront  1.076e+00  3.289e-01   3.272  0.00116 **
DriveTrainRear   1.489e-01  3.694e-01   0.403  0.68713
EngineSize      -2.476e-01  3.223e-01  -0.768  0.44272
Cylinders       -1.729e-01  1.839e-01  -0.941  0.34746
Horsepower      -1.277e-02  3.206e-03  -3.984 8.02e-05 ***
Weight          -2.739e-03  3.869e-04  -7.080 6.30e-12 ***
Wheelbase        6.780e-02  3.393e-02   1.998  0.04638 *
Length          -5.247e-02  1.767e-02  -2.970  0.00315 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.09 on 410 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.847, Adjusted R-squared:  0.8414
F-statistic: 151.3 on 15 and 410 DF,  p-value: < 2.2e-16
```
### Cross Validation
Cross validation is a technique used to estimate how accurately a predictive model performs.
![](https://i.imgur.com/RMQGM5x.png)
### PCA (Principal Component Analysis)
Def: Principal components are linear combinations of the original variables. They tend to capture as much variance as possible in a dataset.
![](https://i.imgur.com/wCr8OSq.png)
PCA is a process of extracting variables from a dataset to explain maximum variance in the dataset. e.g. From "n" independent variables in a dataset, PCA extracts "k" new variables that explain the most variance in the dataset.
#### Usage
1. It is used to eliminate the duplicate variables in cases where many variables are present in the dataset, to avoid redundancy.
2. Since the dependent variable is not considered, this model can be categorized as an unsupervised model.
### Factor Analysis
Factor analysis is a commonly used technique to find latent variables or factors in a model. It is also considered a dimensionality reduction technique.
## Classification
![](https://i.imgur.com/Ews9DHT.png)
### SVM Example (Support Vector Machine)
```r=
install.packages("plyr")
library(plyr)
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
customer_churn <- read.csv("Lesson 8_Classification/Demo 1_ Support Vector Machines.csv")
View(customer_churn)
count(customer_churn$Churn) # count churned vs. retained customers
str(customer_churn) # Churn is not a factor yet, so the churn classes are not visible here
customer_churn$Churn <- factor(customer_churn$Churn) # convert the Churn column to a factor
str(customer_churn)
# split the data: 70% for training, the remaining 30% held out for testing
sample_split <- floor(.7*nrow(customer_churn))
set.seed(1)
training <- sample(seq_len(nrow(customer_churn)), size = sample_split)
training
churn_train <- customer_churn[training, ]
churn_test <- customer_churn[-training, ]
# support vector machine (SVM)
install.packages("e1071")
install.packages("caret")
library(e1071)
svm_churn <- svm(Churn ~ ., churn_train)
library(caret)
# confusionMatrix(data = predictions, reference = true labels)
confusionMatrix(predict(svm_churn), churn_train$Churn, positive = "1")
# test data (assumes Churn is the first column)
Prediction <- predict(svm_churn, churn_test[-1])
Prediction_results <- table(pred = Prediction, true = churn_test[, 1])
print(Prediction_results)
```
### Naive Bayes Classifier
![](https://i.imgur.com/7fU6e6j.png)
```r=
## Set up the working directory
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
## Install the required packages
# Load necessary packages
install.packages(c("e1071", "plyr", "caret", "mlbench")) # package names must be passed as one character vector
library(e1071)
library(plyr)
library(caret)
library(mlbench)
## Read the data
# Load and verify the bank data
bank_loan <- read.csv("Lesson 8_Classification/Demo 2_ Naive Bayes Classifier.csv")
View(bank_loan)
str(bank_loan)
## Convert the data to a factor
# Convert Default from int to factor
bank_loan$Default <- factor(bank_loan$Default)
## Feed the data into the naiveBayes model
# build the model
naive_model <- naiveBayes(Default ~ ., data = bank_loan)
print(naive_model)
# The model creates a conditional probability for each feature separately
# We also have the a priori probabilities, which indicate the distribution of our data
## Use predict() to make predictions with the model
## Use table() to tabulate the results
# predict
naive_predict <- predict(naive_model, bank_loan)
naive_predict
table(naive_predict, bank_loan$Default)
summary(bank_loan)
> table(naive_predict, bank_loan$Default)

naive_predict   0   1
            0 607 146
            1  93 154
```
### Decision Tree Classification
* A decision tree is a graph that uses a branching method to demonstrate every possible outcome of a decision.
* In classification, the data is segregated based on a series of questions.
* Faster than other classification methods
* Can use comparatively simple classification rules
* Can be used with SQL queries
* Relatively high classification accuracy

![](https://i.imgur.com/t8ZXhSE.png)
![](https://i.imgur.com/XQ0YMKN.png)
![](https://i.imgur.com/OsojNz9.png)
![](https://i.imgur.com/WRFtXPx.png)
#### Overfitting
##### Causes:
* Too many branches
* Poor accuracy on unseen cases
##### How to avoid it:
* Prepruning: Stop the construction of the tree early. If the goodness measure falls below a threshold, do not split the node.
* Postpruning: When selecting an appropriate threshold is difficult, remove branches from a fully grown tree to obtain a sequence of progressively pruned trees.
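The two pruning strategies can be sketched with rpart as follows. This is only an illustration under stated assumptions: it uses the `kyphosis` dataset that ships with rpart rather than the bank data, and the threshold values (`cp = 0.05`, `maxdepth = 3`, and so on) are arbitrary choices, not recommendations.

```r=
library(rpart)
set.seed(1) # the cross-validated error (xerror) depends on random folds

# Prepruning: stop growing early through rpart.control() thresholds
pre_pruned <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                    method = "class",
                    control = rpart.control(minsplit = 20, cp = 0.05, maxdepth = 3))

# Postpruning: grow a deliberately large tree first...
full_tree <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
                   method = "class",
                   control = rpart.control(minsplit = 2, cp = 0))

# ...then cut it back at the complexity value with the lowest cross-validated error
cp_table <- full_tree$cptable
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
post_pruned <- prune(full_tree, cp = best_cp)

# Compare tree sizes (number of nodes) before and after pruning
nrow(full_tree$frame)
nrow(post_pruned$frame)
```

The rows of `cptable` correspond to the sequence of progressively pruned trees mentioned above; `printcp()` and `plotcp()` show the same table and are used in the example that follows.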
[Reference: Learning Model : Decision Tree (1) - Classification Trees](https://medium.com/ai%E5%8F%8D%E6%96%97%E5%9F%8E/learning-model-decision-tree-1-%E5%88%86%E9%A1%9E%E6%A8%B9-5fbffd943c13)
[Reference case: Titanic model](https://rstudio-pubs-static.s3.amazonaws.com/275285_90aaf9a2a64d43a5846a86dbcde8eba9.html)
#### Example
![](https://i.imgur.com/jOwNNHt.png)
```r=
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
# Load necessary packages
install.packages(c("e1071", "plyr", "caret", "mlbench", "rpart")) # package names must be passed as one character vector
library(e1071)
library(plyr)
library(caret)
library(mlbench)
library(rpart)
# Load and verify the bank data
bank_loan <- read.csv("Lesson 8_Classification/Demo 3_ Decision Tree Classification.csv")
View(bank_loan)
str(bank_loan)
# Convert Default from int to factor
bank_loan$Default <- factor(bank_loan$Default)
# build the model
tree_model <- rpart(Default ~ ., data = bank_loan, method = "class")
tree_model
# analyze results
printcp(tree_model)
plotcp(tree_model)
print(tree_model)
summary(tree_model)
plot(tree_model)
```
### K-fold Cross Validation Algorithm
![](https://i.imgur.com/IjcSAoQ.png)
```r=
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
# Load necessary packages
install.packages(c("e1071", "plyr", "caret", "mlbench", "rpart"))
library(e1071)
library(plyr)
library(caret)
library(mlbench)
library(rpart)
# Load and verify the bank data
bank_loan <- read.csv("Lesson 8_Classification/Demo 3_ Decision Tree Classification.csv")
View(bank_loan)
str(bank_loan)
# Convert Default from int to factor
bank_loan$Default <- factor(bank_loan$Default)
# build the model
tree_model <- rpart(Default ~ ., data = bank_loan, method = "class")
tree_model
# analyze results
printcp(tree_model)
plotcp(tree_model)
print(tree_model)
summary(tree_model)
plot(tree_model)
# createFolds() expects the outcome vector, not the whole data frame
folded_up <- createFolds(bank_loan$Default, k = 10, list = TRUE, returnTrain = FALSE)
help("createFolds")
# with returnTrain = FALSE, each fold holds the row indices that are held out
held_out_rows <- folded_up[[1]]
bank_loan[held_out_rows, ]
```
## Cluster
### K-means Clustering Example
![](https://i.imgur.com/35JywX7.png)
```r=
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
help(set.seed)
set.seed(111) # fix the seed so the results are reproducible
customer_data <- read.csv("Bank Customer data.csv")
View(customer_data)
str(customer_data)
# kmeans(customer_data, 3, iter.max = 10) would fail at this point:
# the input matrix must be completely numeric
# data cleaning
del_vars <- names(customer_data) %in% c("job", "marital", "education", "default", "housing", "loan", "contact", "month", "poutcome") # %in% tests membership
customer_data_num <- customer_data[!del_vars] # ! negates: keep the columns not listed above
customer_data_num <- na.omit(customer_data_num) # drop rows with missing values
customer_data_num <- scale(customer_data_num) # scaling converts each column to z-scores
View(customer_data_num)
# k-means clustering: 3 segments, at most 10 iterations
cluster_up <- kmeans(customer_data_num, 3, iter.max = 10)
str(cluster_up)
customer_data_num <- cbind(customer_data_num, ClusterNum = cluster_up$cluster)
View(customer_data_num)
# graph and count of expected clusters
install.packages("mclust")
library(mclust)
fit <- Mclust(customer_data_num)
plot(fit)
```
![](https://i.imgur.com/AmMms6y.png)
### Hierarchical Clustering Example
![](https://i.imgur.com/X4EnAWY.png)
```r=
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
help(set.seed)
set.seed(111) # fix the seed so the results are reproducible
customer_data <- read.csv("Bank Customer data.csv")
View(customer_data)
str(customer_data)
# data cleaning
customer_data <- na.omit(customer_data)
# Hierarchical Clustering
# distances are only meaningful for numeric columns, so restrict dist() to those
num_cols <- sapply(customer_data, is.numeric)
cluster_h <- dist(customer_data[, num_cols], method = "euclidean")
fit <- hclust(cluster_h, method = "ward.D") # "ward" was renamed to "ward.D" in newer R versions
groups <- cutree(fit, k = 3)
groups
customer_data <- cbind(customer_data, ClusterNum = groups)
View(customer_data)
# graph
plot(fit)
```
![](https://i.imgur.com/UPDpE4V.png)
## Association
### Association Example
```r=
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
install.packages("arules")
library(arules)
Groceries_Item = read.transactions("Lesson 10_Association/Demo 1_Perform Association Using the Apriori Algorithm .csv", sep = ",")
inspect(Groceries_Item[1:10]) # inspect the first 10 transactions
AprioriForGroceries = apriori(Groceries_Item, parameter = list(support = .006, confidence = .5))
summary(AprioriForGroceries)
inspect(AprioriForGroceries)
AprioriForGroceries = apriori(Groceries_Item, parameter = list(support = .01, confidence = .5))
summary(AprioriForGroceries)
inspect(AprioriForGroceries)
inspect(sort(AprioriForGroceries, by = "confidence"))
```
## Project 1: Web Data Analysis
Background and Objective: The web analytics team of www.datadb.com wants to understand the web activity of the site, in particular the sources used to access the website. They have a database that records time on page, source group, bounces, exits, unique page views, and visits.
### Attributes
#### Bounce
A visitor who lands on your website, views only one web page and then leaves is called a "Bounce". Such a visitor either:
* immediately finds what they want and then leaves, or
* thinks the page/site is not relevant to their needs and leaves.
##### Data Def: Bounces
It represents the percentage of visitors who enter the site and "bounce" (leave the site) rather than continuing to view other pages within the same site.
#### Visits
Each visit by a person can consist of multiple page views. And a single person may have multiple visits over days or months. Once a website visitor closes the browser, the visit is considered over. Google Analytics focuses heavily on visits.
##### Data Def:
A visit counts all visitors, no matter how many times the same visitor may have been to your site.
#### Exit
What the exit rate metric means. Consider one session per day:
* Monday: Page B > Page A > Page C > Exit
* Tuesday: Page B > Exit
* Wednesday: Page A > Page C > Page B > Exit
* Thursday: Page C > Exit
* Friday: Page B > Page C > Page A > Exit

Exit percentage and bounce rate are calculated as follows. Exit rate:
* Page A: 33% (3 sessions included Page A, 1 session exited from Page A)
* Page B: 50% (4 sessions included Page B, 2 sessions exited from Page B)
* Page C: 50% (4 sessions included Page C, 2 sessions exited from Page C)
##### Data Def:
#### Time on page
##### Data Def:
It shows how long the user has spent on that particular page of the website.
#### Unique pageview
##### Data Def:
It represents the number of sessions during which that page was viewed one or more times.
### Analysis Tasks:
The team is targeting the following issues:
* Summarize the data
* Whether the unique page view value depends on visits.
* Find out the probable factors from the dataset, which could affect the exits.
* Find the variables which possibly have an effect on the time on page.
* Help the team in determining the factors that are impacting the bounce.
#### Summarize the data & the unique page view value depends on visits
```r=
library("readxl")
setwd("C:/Users/GF63/Desktop/R DS/Datasets_Updated/Datasets")
getwd()
data_set <- read_excel("1555058318_internet_dataset.xlsx")
summary(data_set)
str(data_set)
data_set$Bounces <- factor(data_set$Bounces)
data_set$Exits <- factor(data_set$Exits)
data_set$Continent <- factor(data_set$Continent)
data_set$Sourcegroup <- factor(data_set$Sourcegroup)
str(data_set)
View(data_set)
# Simple Regression
results <- lm(formula = Uniquepageviews ~ Visits, data = data_set)
results
```
```r=
# Decision Tree
install.packages("rpart.plot")
library(rpart) # provides rpart()
library(rpart.plot)
## First split the data into train = 0.7, test = 0.3
data_set$Sourcegroup <- NULL
data_set$BouncesNew <- NULL
set.seed(1)
train.index <- sample(x = 1:nrow(data_set), size = ceiling(0.7*nrow(data_set)))
train <- data_set[train.index, ]
test <- data_set[-train.index, ]
# CART model: Bounces is the response (Y), the remaining variables are the predictors (X)
cart.model <- rpart(Bounces ~ .
, data = data_set)
# print the details of each node (shown in the console)
cart.model
## build the model
tree_model <- rpart(Bounces ~ ., data = data_set, method = "class")
tree_model
## analyze results
printcp(tree_model)
plotcp(tree_model)
print(tree_model)
summary(tree_model)
plot(tree_model)
prp(cart.model, faclen = 0, fallen.leaves = TRUE, shadow.col = "gray", extra = 2)
## The partykit plotting package is an alternative; its functions are as.party() and plot()
install.packages("partykit")
library(partykit)
rparty.tree <- as.party(cart.model) # convert the CART decision tree
rparty.tree
plot(rparty.tree)
```
## Project 2: IBM HR Analytics Employee Attrition Modeling
DESCRIPTION

IBM is an American MNC operating in around 170 countries, with computing, software, and hardware as its major business verticals. Attrition (employees leaving the company) is a major risk to service-providing organizations where trained and experienced people are the assets of the company. The organization would like to identify the factors which influence the attrition of employees.

Data Dictionary
* Age: Age of employee
* Attrition: Employee attrition status
* Department: Department of work
* DistanceFromHome
* Education: 1-Below College; 2-College; 3-Bachelor; 4-Master; 5-Doctor
* EducationField
* EnvironmentSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High
* JobSatisfaction: 1-Low; 2-Medium; 3-High; 4-Very High
* MaritalStatus
* MonthlyIncome
* NumCompaniesWorked: Number of companies worked prior to IBM
* WorkLifeBalance: 1-Bad; 2-Good; 3-Better; 4-Best
* YearsAtCompany: Current years of service in IBM

Analysis Task:
- [x] Import attrition dataset
- [ ] Exploratory data analysis
    * Find the age distribution of employees in IBM
    * Find outliers
    * Explore attrition by age
    * Explore data for Left employees
    * Find out the distribution of employees by the education field
    * Give a bar chart for the number of married and unmarried employees
- [ ] Build up a logistic regression model to predict which employees are likely to attrite and find the significant variables
```r=
## Import attrition dataset
setwd("C:/Users/wuc25/Desktop")
getwd()
install.packages("readxl")
# barplot() is part of base R graphics, so there is no "barplot" package to install
library("readxl")
data_set <- read_excel("IBM Attrition Data.xlsx")
str(data_set)
View(data_set)
## Exploratory data analysis
## Age distribution of employees in IBM
Age_Dist = table(data_set$Age)
barplot(Age_Dist, main = "Age Distribution", xlab = "Age", ylab = "Number of Employees")
## Explore attrition by Age
Age_Attr_Dist = table(data_set$Attrition, data_set$Age)
barplot(Age_Attr_Dist, legend = c("No", "Yes"), main = "Age Distribution", xlab = "Age", ylab = "Number of Employees")
## Explore data for the employees who left
## Attrition rate is highest among HR and Sales and low in R&D
Attrition_by_Dept = table(data_set$Department, data_set$Attrition)
Attrition_by_Dept_df = as.data.frame(Attrition_by_Dept)
Attrition_by_Dept_df
str(Attrition_by_Dept_df)
## Use reshape() and transform() to combine the counts into an attrition rate
Attrition_by_Dept_df1 = reshape(Attrition_by_Dept_df, idvar = "Var1", timevar = "Var2", direction = "wide")
Attrition_by_Dept_df2 = transform(Attrition_by_Dept_df1, AttritionRate = Freq.Yes/(Freq.Yes + Freq.No))
Attrition_by_Dept_df2
## Low job satisfaction has an attrition rate of ~23%, while employees with high job satisfaction scores are less likely to leave
Attrition_by_JS = table(data_set$JobSatisfaction, data_set$Attrition)
Attrition_by_JS_df = as.data.frame(Attrition_by_JS)
Attrition_by_JS_df
Attrition_by_JS_df1 = reshape(Attrition_by_JS_df, idvar = "Var1", timevar = "Var2", direction = "wide")
Attrition_by_JS_df2 = transform(Attrition_by_JS_df1, AttritionRate = Freq.Yes/(Freq.Yes + Freq.No))
Attrition_by_JS_df2
## Employees with poor work-life balance are likely to leave
Attrition_by_Work_life_bal = table(data_set$WorkLifeBalance, data_set$Attrition)
Attrition_by_Work_life_bal_df = as.data.frame(Attrition_by_Work_life_bal)
Attrition_by_Work_life_bal_df
Attrition_by_Work_life_bal_df1 = reshape(Attrition_by_Work_life_bal_df, idvar = "Var1", timevar = "Var2", direction = "wide")
Attrition_by_Work_life_bal_df2 =
    transform(Attrition_by_Work_life_bal_df1, AttritionRate = Freq.Yes/(Freq.Yes + Freq.No))
Attrition_by_Work_life_bal_df2
## The more companies an employee has worked for, the higher the chance of losing that employee
Attrition_by_Companies_Worked = table(data_set$NumCompaniesWorked, data_set$Attrition)
Attrition_by_Companies_Worked_df = as.data.frame(Attrition_by_Companies_Worked)
Attrition_by_Companies_Worked_df
Attrition_by_Companies_Worked_df1 = reshape(Attrition_by_Companies_Worked_df, idvar = "Var1", timevar = "Var2", direction = "wide")
Attrition_by_Companies_Worked_df2 = transform(Attrition_by_Companies_Worked_df1, AttritionRate = Freq.Yes/(Freq.Yes + Freq.No))
Attrition_by_Companies_Worked_df2
## Find out the distribution of employees by education field
Edu_dist = table(data_set$EducationField)
Edu_dist
barplot(Edu_dist, main = "Employee Education Distribution", xlab = "Education", ylab = "Number of Employees")
## Bar chart for number of married and unmarried employees
Marital_Status_dist = table(data_set$MaritalStatus)
Marital_Status_dist
barplot(Marital_Status_dist, main = "Marital Status of employees", xlab = "Marital Status", ylab = "Number of Employees")
## Logistic regression model to predict which employees are likely to leave
data_set$Attrition = as.factor(data_set$Attrition)
LogisticModel = glm(Attrition ~ ., data_set, family = "binomial")
summary(LogisticModel)
# Dropping insignificant variables such as Department, Education and YearsAtCompany
LogisticModel = glm(Attrition ~ Age + DistanceFromHome + EducationField + EnvironmentSatisfaction + JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + WorkLifeBalance, data_set, family = "binomial")
summary(LogisticModel)
## Predicting the attrition probabilities
Predicted_Attrition = predict(LogisticModel, data_set, type = "response")
Attrition_Final = cbind(data_set, Predicted_Attrition)
View(Attrition_Final)
```
## Project 3
### Problem Statement
Background of Problem Statement: A UK-based online retail store has captured the sales data for different
products for the period of one year (Nov 2016 to Dec 2017). The organization sells gifts primarily on the online platform. The customers who make a purchase consume directly for themselves. There are small businesses that buy in bulk and sell to other customers through the retail outlet channel.

Project Objective: **Find significant customers for the business who make high purchases of their favourite products.** The organization wants to roll out a loyalty program to the high-value customers after identification of segments. Use the clustering methodology to segment customers into groups.

Domain: E-commerce

Dataset Description: This is a transnational dataset that contains all the transactions occurring between Nov-2016 to Dec-2017 for a UK-based online retail store.

Attribute: Description
* InvoiceNo: Invoice number (A 6-digit integral number uniquely assigned to each transaction)
* StockCode: Product (item) code
* Description: Product (item) name
* Quantity: The quantities of each product (item) per transaction
* InvoiceDate: The day when each transaction was generated
* UnitPrice: Unit price (Product price per unit)
* CustomerID: Customer number (Unique ID assigned to each customer)
* Country: Country name (The name of the country where each customer resides)

Analysis tasks to be performed: Use the clustering methodology to segment customers into groups. Use the following clustering algorithms: **K means** and Hierarchical.
* Identify the right number of customer segments.
* Provide the number of customers who are highly valued.
* Identify the clustering algorithm that gives maximum accuracy and explains robust clusters.
* If the number of observations is loaded in one of the clusters, break down that cluster further using the clustering algorithm.
[Hint: Here "loaded" means that one cluster has many more data points than the other clusters. In that case, split that cluster by increasing the number of clusters, then observe and compare the results with the previous results.]

## Project 4

One of the leading retail stores in the US, Walmart, would like to **predict the sales and demand accurately**. There are certain events and holidays which impact sales on each day. Sales data are available for 45 Walmart stores. The business faces a challenge from unforeseen demand and sometimes runs out of stock because of an inappropriate machine learning algorithm. **An ideal ML algorithm will predict demand accurately and ingest factors like economic conditions including CPI, Unemployment Index, etc.**

Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labour Day, Thanksgiving, and Christmas. The weeks including these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge presented by this competition is modeling the effects of markdowns on these holiday weeks in the absence of complete/ideal historical data. Historical sales data for 45 Walmart stores located in different regions are available.

Dataset Description

This is the historical data which covers sales from 2010-02-05 to 2012-11-01, in the file Walmart_Store_sales.
Within this file you will find the following fields:

* Store: the store number
* Date: the week of sales
* Weekly_Sales: sales for the given store
* Holiday_Flag: whether the week is a special holiday week (1 = holiday week, 0 = non-holiday week)
* Temperature: temperature on the day of sale
* Fuel_Price: cost of fuel in the region
* CPI: prevailing consumer price index
* Unemployment: prevailing unemployment rate

Holiday Events

* Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
* Labour Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
* Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
* Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13

Analysis Tasks

Basic Statistics tasks

* Which store has maximum sales

```r=
## Import dataset
setwd("C:/Users/wuc25/Desktop")
getwd()
install.packages("dplyr")  # only needed the first time
library("dplyr")
Walmart_data_set <- read.csv("Walmart_Store_sales.csv")
View(Walmart_data_set)

## Basic Statistics tasks: which store has maximum sales
summary(Walmart_data_set)
by_group_store <- Walmart_data_set %>% group_by(Store)
by_group_store
# Total sales per store; aggregate() names the summed column "x"
result_total <- aggregate(Walmart_data_set$Weekly_Sales, by = list(Store = Walmart_data_set$Store), FUN = sum)
View(result_total)
# Store with the highest total sales
result_total[which.max(result_total$x), ]
```

* Which store has maximum standard deviation i.e., the sales vary a lot.
Also, find the coefficient of variation (the ratio of standard deviation to mean).

```r=
## Which store has maximum standard deviation
result_sd <- summarise(group_by(Walmart_data_set, Store), SD = sd(Weekly_Sales))
View(result_sd)
result_sd[which.max(result_sd$SD), ]

## Which store has maximum coefficient of variation
result_cv <- summarise(group_by(Walmart_data_set, Store), CV = sd(Weekly_Sales) / mean(Weekly_Sales))
View(result_cv)
result_cv[which.max(result_cv$CV), ]
```

* Which store/s has good quarterly growth rate in Q3’2012

```r=
Walmart_data_set <- read.csv("Walmart_Store_sales.csv")
Walmart_data_set$Date <- as.Date(Walmart_data_set$Date, "%d-%m-%Y")
Walmart_data_set["Quarter"] <- quarters(Walmart_data_set$Date)
Walmart_data_set["Year"] <- format(Walmart_data_set$Date, format = "%Y")
# Keep only the weeks that fall in Q3 2012
Q3_2012_Sales <- filter(Walmart_data_set, Year == "2012" & Quarter == "Q3")
View(Q3_2012_Sales)
# Mean weekly sales per store in Q3 2012 (a growth rate would also need the Q2 2012 figures)
average_sales_2012Q3 <- summarise(group_by(Q3_2012_Sales, Store), mean(Weekly_Sales))
View(average_sales_2012Q3)
```

* Some holidays have a negative impact on sales. Find out holidays which have higher sales than the mean sales in non-holiday season for all stores together

```r=
non_holiday_sales <- filter(Walmart_data_set, Holiday_Flag == 0)
summary(non_holiday_sales)
# Mean weekly sales over non-holiday weeks, all stores together (about 1041256 for this dataset)
non_holiday_mean <- mean(non_holiday_sales$Weekly_Sales)
holiday_good_sales <- filter(Walmart_data_set, Holiday_Flag == 1 & Weekly_Sales > non_holiday_mean)
View(holiday_good_sales)
# Distinct holiday dates on which at least one store beat the non-holiday mean
holiday_season <- unique(holiday_good_sales$Date)
holiday_season
```

* Provide a **monthly** and **semester** view of sales in units and give insights

Statistical Model

* For Store 1 – Build prediction models to forecast demand
    * Linear Regression – Utilize variables like date and restructure dates as 1 for 5 Feb 2010 (starting from the earliest date in order).
    * Hypothesize if CPI, unemployment, and fuel price have any impact on sales.

Change dates into days by creating a new variable.

Select the model which gives the best accuracy.
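The Store 1 regression task above could be approached along these lines. This is a hedged sketch, not the notes' own solution: the data frame is simulated (only the column names and the date range follow the dataset description), `Date_Index` is a hypothetical name for the restructured date variable, and comparing a full and a reduced model by adjusted R-squared is just one way to pick the better fit.

```r
# Simulated stand-in for the Store 1 rows of Walmart_Store_sales.csv.
# Only the column names and date range follow the dataset description; the values are made up.
set.seed(1)
dates <- seq(as.Date("2010-02-05"), as.Date("2012-11-01"), by = "week")
n <- length(dates)
store1 <- data.frame(
  Date         = dates,
  CPI          = seq(211, 223, length.out = n) + rnorm(n, sd = 0.3),
  Unemployment = seq(8.1, 6.6, length.out = n) + rnorm(n, sd = 0.1),
  Fuel_Price   = 3.2 + rnorm(n, sd = 0.4)
)
store1$Weekly_Sales <- 1500000 + 1500 * store1$CPI + rnorm(n, sd = 100000)

# Restructure dates: 1 for the earliest week (5 Feb 2010), 2 for the next, and so on.
store1 <- store1[order(store1$Date), ]
store1$Date_Index <- seq_len(nrow(store1))

# Full model: do time, CPI, unemployment, and fuel price relate to sales?
full_model <- lm(Weekly_Sales ~ Date_Index + CPI + Unemployment + Fuel_Price, data = store1)
# The coefficient table carries a t-test p-value for each predictor.
coef_table <- summary(full_model)$coefficients

# Reduced model with the date index only, for comparison.
reduced_model <- lm(Weekly_Sales ~ Date_Index, data = store1)

# Compare the two fits by adjusted R-squared and by an F-test on the extra terms.
adj_r2 <- c(full    = summary(full_model)$adj.r.squared,
            reduced = summary(reduced_model)$adj.r.squared)
comparison <- anova(reduced_model, full_model)

print(coef_table)
print(adj_r2)
print(comparison)
```

With the real data, the p-values in `coef_table` are where the hypotheses about CPI, unemployment, and fuel price would be checked.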
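For the monthly and semester view of sales asked for in the analysis tasks, a small base R sketch might look like the following. The weekly sales are simulated for two hypothetical stores, and "semester" is assumed here to mean a half-year (H1 = January to June, H2 = July to December).

```r
# Simulated weekly sales for two hypothetical stores; the values are made up.
set.seed(7)
dates <- seq(as.Date("2010-02-05"), as.Date("2012-11-01"), by = "week")
sales <- data.frame(
  Store        = rep(1:2, each = length(dates)),
  Date         = rep(dates, times = 2),
  Weekly_Sales = round(runif(2 * length(dates), 1000000, 2000000))
)

# Derive calendar keys from the date.
sales$Year  <- format(sales$Date, "%Y")
sales$Month <- format(sales$Date, "%Y-%m")
# Semester (assumed definition): H1 covers January to June, H2 covers July to December.
month_number   <- as.integer(format(sales$Date, "%m"))
sales$Semester <- paste0(sales$Year, "-", ifelse(month_number <= 6, "H1", "H2"))

# Total sales per month and per semester, all stores together.
monthly_sales  <- aggregate(Weekly_Sales ~ Month, data = sales, FUN = sum)
semester_sales <- aggregate(Weekly_Sales ~ Semester, data = sales, FUN = sum)

print(head(monthly_sales))
print(semester_sales)
```

Passing `monthly_sales$Weekly_Sales` to `barplot()` would give a quick visual check for seasonal peaks.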