Introduction to the dplyr Package

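---
GA: UA-159972578-2
---

###### tags: `R` `Data Manipulation` `Data Preprocessing`

# Introduction to the dplyr Package

Read the [RPubs version](https://rpubs.com/RitaTang/dplyr_intro) (nicer formatting, includes code output)

## Setup

### 1. Install and load dplyr

The `tidyverse` package bundles `dplyr`, `ggplot2`, `stringr`, and other packages commonly used for data processing.

```{r}
# install.packages("tidyverse")  # run once to download the packages
library(tidyverse)               # load dplyr, ggplot2, stringr, etc. in one call
```

### 2. Read the data

```{r}
# setwd("~/camp")  # set the working directory
df = read.csv('athlete_all.csv', stringsAsFactors = FALSE)  # keep strings as character instead of converting them to factors
```

## About dplyr

### 1. What is dplyr?

+ The `%>%` (pipe) operator: a notation that is both readable and concise
+ Six main functions:

<table>
  <tr>
    <th> Function </th>
    <th> Description </th>
  </tr>
  <tr>
    <td> select() </td>
    <td> Select variables (columns) </td>
  </tr>
  <tr>
    <td> filter() </td>
    <td> Keep the rows that meet a condition </td>
  </tr>
  <tr>
    <td> arrange() </td>
    <td> Sort the rows by one or more variables </td>
  </tr>
  <tr>
    <td> summarise() </td>
    <td> Aggregate variables into summary values (per group when used with group_by()) </td>
  </tr>
  <tr>
    <td> mutate() </td>
    <td> Add new variables </td>
  </tr>
  <tr>
    <td> group_by() </td>
    <td> Group the data by a categorical variable </td>
  </tr>
</table>

- - -

### 2. dplyr Cheatsheet

![](https://i.imgur.com/CUJehYm.jpg =80%x)
![](https://i.imgur.com/KcYvBUG.jpg =80%x)

## Hands-on Practice

### 1. The `%>%` operator

<b>Suppose we want to check whether the year in the `Games` variable of df matches the `year` variable.</b>

#### 【Basic style】

```{r}
year = substr(df$Games, 1, 4)  # take characters 1 through 4
str(year)                      # check the data type: chr
year = as.integer(year)        # convert from character to integer
str(year)                      # check the data type: int
identical(df$year, year)       # check whether the two year vectors are identical
```

#### 【Advanced style】

```{r}
year = as.integer(substr(df$Games, 1, 4))  # merge the two steps into one
identical(df$year, year)
```

#### 【`%>%` style】

```{r}
df$Games %>% substr(1,4) %>% as.integer() %>% identical(df$year)
```

#### 【Summary】

| Style | Basic | Advanced | `%>%` |
| -------- | -------- | -------- | -------- |
| Pros | Easy to read | Concise | Easy to read and concise |
| Cons | Verbose | Hard to read | Unreadable if you have not learned it |

- - -

### 2. select()

Purpose: select variables (columns)

#### 【Basic】Show only name, age, and sex

```{r}
df %>% select(Name, Age, Sex) %>% head(10)  # show only the first ten rows
```

- - -

### 3. filter()

Purpose: keep the rows that meet a condition

#### 【Basic】Find female athletes aged 18 or older

```{r}
df %>% filter(Age>=18 & Sex=='F') %>% head(5)
```

#### 【Advanced】Find the name, age, and sport of female athletes aged 18 or older

```{r}
df %>% filter(Age>=18 & Sex=='F') %>%  # keep female athletes aged 18 or older
  select(Name, Age, Sport) %>%         # pick the columns we want
  head(10)
```

- - -

### 4. arrange()

Purpose: sort the rows by one or more variables

#### 【Basic】Sort the data with the most recent year first

```{r}
df %>% arrange(desc(year)) %>% head(3)  # ascending by default; desc() sorts in descending order
```

#### 【Advanced】Find the name, age, and sport of each distinct female athlete aged 18 or older

```{r}
df %>% filter(Age>=18 & Sex=='F') %>%
  select(ID, Name, Age, Sport) %>%
  arrange(Age, Sport, Name) %>%  # sort by age, then sport, then name
  .[!duplicated(.$ID),] %>%      # keep only the first row for each athlete ID
  head(10)
```

- - -

### 5. summarise()

Purpose: aggregate variables into summary values

#### 【Basic】Age of the oldest Olympic athlete across all years

```{r}
df %>% summarise(MaxAge = max(Age, na.rm=TRUE))
```

- - -

### 6. mutate()

Purpose: add new variables

#### 【Basic】Create a new variable: BMI

```{r}
df %>% mutate(BMI = (Weight/(Height*0.01)^2)) %>% head(3)
```

- - -

### 7. group_by()

Purpose: group the data by a variable

#### 【Basic】Group by medal

```{r}
df %>% group_by(Medal)
```

Looking at the data frame above, you can see that group_by() on its own does not change the rows or columns; it only records the grouping.
group_by() is a grouping function, not an aggregation: its purpose is to make the operations that follow run once per group, so it needs to be combined with other dplyr functions.

#### 【Advanced】Maximum age, mean age, sex ratio, and BMI of medal winners at the 2016 Olympics

```{r}
df %>% filter(year==2016) %>%                      # keep only the 2016 data
  group_by(Medal) %>%                              # group by medal
  mutate(BMI = (Weight/(Height*0.01)^2),           # add a BMI variable
         Male = (Sex=='M')) %>%                    # add a male indicator (TRUE: male; FALSE: female)
  summarise(MaxAge = max(Age, na.rm=TRUE),         # maximum age, ignoring missing values
            Age = mean(Age, na.rm=TRUE),           # mean age
            SexRatio = mean(Male),                 # proportion of male athletes
            BMI = mean(BMI, na.rm=TRUE)) %>%       # mean BMI (first(BMI) would keep only one athlete's value)
  .[c(2,3,1),]                                     # reorder the rows as Gold > Silver > Bronze
```

## Other Useful Functions

### 1. tapply()

+ tapply(x, index, function)
+ Splits x into groups by index and applies function to each group

```{r}
tapply(df$Sport, df$year, n_distinct)
```

- - -

### 2. aggregate()

+ aggregate(x, by=list(), function)
+ Splits x into groups by the variables in the list and applies function to each group
+ Similar to tapply, but can group by several variables at once

```{r}
aggregate(df$Event, list(Sport=df$Sport, Year=df$year), n_distinct) %>% head(20)
```

- - -

### 3. unique()

Purpose: return each distinct value only once

```{r}
unique(df$Sport)
```

- - -

### 4. left_join()

+ left_join(x, y, by=)
+ Keeps every row of x and, matching on the `by` variables, adds the columns that y has but x lacks; rows of x with no match in y get NA

```{r}
band_members      # example dataset shipped with dplyr
band_instruments  # example dataset shipped with dplyr
band_members %>% left_join(band_instruments, by = "name")  # join on the shared name column
```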
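
To make the behavior of `left_join()` more concrete, here is a minimal sketch using two small made-up data frames. The names `athletes` and `medals` and their contents are hypothetical, invented only for this illustration, and are not part of `athlete_all.csv`. It assumes the join key is passed explicitly through `by`.

```r
library(dplyr)

# Hypothetical tables, made up for this illustration only
athletes <- data.frame(
  ID   = c(1, 2, 3),
  Name = c("Ann", "Bob", "Cho"),
  stringsAsFactors = FALSE
)
medals <- data.frame(
  ID    = c(1, 3, 4),
  Medal = c("Gold", "Bronze", "Silver"),
  stringsAsFactors = FALSE
)

# Keep every row of `athletes`; attach `Medal` wherever the ID matches
joined <- athletes %>% left_join(medals, by = "ID")
joined
```

In this sketch, ID 2 has no match in `medals`, so its `Medal` is `NA`. ID 4 exists only in `medals`, so it does not appear in the result. Passing `by` explicitly also avoids the message dplyr prints when it has to guess the join columns.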