Data Visualization with `ggplot2`

---
disqus: yueswater
---

# Data Visualization with `ggplot2`

{%hackmd @themes/orangeheart %}

###### tags: `R Language`

First, install the `ggplot2` and `tidyverse` packages in your `R` environment. Together they play a role similar to `pandas` + `numpy` + `matplotlib` in `python`.[^1] Note that `tidyverse` already bundles `ggplot2`, so installing `tidyverse` alone would also work.

```r
install.packages("ggplot2")
install.packages("tidyverse")
```

Once installation finishes, load the two packages:

```r
library("ggplot2")
library("tidyverse")
```

## Basic Structure of a Plot

For usage notes on `ggplot2`, see the [cheat sheet](https://github.com/rstudio/cheatsheets/raw/main/data-visualization-2.1.pdf) provided on the `ggplot2` website.[^2]

![](https://i.imgur.com/mt50x2c.png)

Within the framework of data visualization, a figure produced by code can be split into two main parts:

- Raw data (data)
- Visual variables (aesthetics)

In "*[A Layered Grammar of Graphics](http://byrneslab.net/classes/biol607/readings/wickham_layered-grammar.pdf)*", H. Wickham breaks a figure down further into the following components:

![](https://i.imgur.com/wKJ48x3.png)

In other words, once we have a dataset, the code that draws it generally follows this pattern:

```r
ggplot([data], aes(x = [variable])) +
  [geometry layer]
```

The geometry layer (a `geom_*()` function) determines how the data is mapped onto the plane or into space. The most common examples are the scatter plot and the bar plot.

## Data Visualization in Practice

Now that we know the basic code structure of `ggplot2`, we can work with real data. Below we practice on two datasets.

### Pizza Delivery Dataset

We use the `pizza delivery data` from "*Introduction to statistics and data analysis*" (2016) by Christian Heumann, Michael Schomaker, and Shalabh. You can download it from the [companion website](https://chris.userweb.mwn.de/book/) of the textbook, or click [this link](https://chris.userweb.mwn.de/book/pizza_delivery.csv). You can also read the data directly in `R`:

```r
data <- read.csv("https://chris.userweb.mwn.de/book/pizza_delivery.csv")
```

Suppose we are interested in the `branch` variable and want to know which branch received the most orders. Because this variable is nominal, a bar plot is a suitable choice.

```r
ggplot(data, aes(x = branch)) +
  geom_bar()
```

<a><img src='https://svgshare.com/i/nZR.svg' title='' /></a>

We can clearly see that the `West` branch received the most orders. Note that the geometry layer is required: without it, the result is an empty plot.

```r
ggplot(data, aes(x = branch))
```

<a><img src='https://svgshare.com/i/nYN.svg' title='' /></a>

Layers are combined with `+`. A small tip: I usually break the line after each `+` so that no single line of code grows too long.

### Ramen Dataset

We use the [ramen dataset](https://www.kaggle.com/datasets/residentmario/ramen-ratings?resource=download&select=ramen-ratings.csv) from Kaggle. Note that in this dataset the `Stars` column is stored as character:

```r
ramen <- read.csv("./ramen-ratings.csv")
class(ramen$Stars)
## "character"
```

We can convert the variable to `numeric` and then drop the missing values (`NA`). `as.numeric()` has no `na.rm` argument, so the removal is done in a separate step.

```r
# Non-numeric entries become NA (R emits a coercion warning)
ramen$Stars <- as.numeric(ramen$Stars)
# Drop every row that contains an NA
ramen <- na.omit(ramen)
```

Suppose we want to see which packaging styles of instant noodles are used in each country. The following code does that:

```r
ggplot(ramen, aes(x = Country, fill = Style)) +
  geom_bar() +
  coord_flip()
```

<a><img src='https://svgshare.com/i/nZF.svg' title='' /></a>

Note the following points:

- `aes(..., fill=Style)`: the fill color of the bars is determined by the `Style` variable.
- `coord_flip()`: swaps the $x$ and $y$ axes.

Next, suppose we want to compare the stars received by each packaging style across countries. We can present that as follows.

```r
ggplot(ramen, aes(x = Style, y = Stars)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Country)
```

<a><img src='https://svgshare.com/i/nZG.svg' title='' /></a>

Here,

- `aes(x=Style, y=Stars)`: the $x$ axis is the packaging style and the $y$ axis is the number of stars received.
- `geom_bar(stat="identity")`: by default `geom_bar()` counts the rows for each value of $x$ (`stat = "count"`). With `stat = "identity"`, the bar heights are taken from the $y$ variable we specify instead. Rows that share the same $x$ are stacked, so each bar shows the total number of stars.
- `facet_wrap( ~ Country)`: draws one small panel per value of the `Country` variable.

For an introduction to the different geometry layers in `ggplot2`, see [資料視覺化：ggplot2 中的幾何圖形層](https://hackmd.io/@yueswater/geom).

[^1]: This article draws on Meng, Lee. “淺談資料視覺化以及 GGPLOT2 實踐.” LeeMeng, https://leemeng.tw/data-visualization-from-matplotlib-to-ggplot2.html.
[^2]: “Function Reference.” ggplot2, https://ggplot2.tidyverse.org/reference/.
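
As a recap of the ramen example, here is a minimal sketch that runs without downloading anything. It assumes a small made-up data frame, `toy`, which is hypothetical and only imitates the shape of the Kaggle file. It contrasts the default counting behaviour of `geom_bar()` with `stat = "identity"`, and uses `layer_data()` from `ggplot2` to inspect the values computed for each bar.

```r
library(ggplot2)

# Hypothetical toy data standing in for the ramen ratings (not the real dataset).
toy <- data.frame(
  Country = c("Japan", "Japan", "Japan", "Japan", "Taiwan", "Taiwan"),
  Style   = c("Cup", "Pack", "Pack", "Bowl", "Bowl", "Pack"),
  Stars   = c("3.5", "4", "3", "Unrated", "5", "4.25"),
  stringsAsFactors = FALSE
)

# Non-numeric ratings become NA (warning silenced here), then those rows are dropped.
toy$Stars <- suppressWarnings(as.numeric(toy$Stars))
toy <- toy[!is.na(toy$Stars), ]

# Default stat = "count": bar length is the number of rows per country and style.
p_count <- ggplot(toy, aes(x = Country, fill = Style)) +
  geom_bar() +
  coord_flip()

# stat = "identity": bar height comes from y; rows sharing an x are stacked.
p_sum <- ggplot(toy, aes(x = Style, y = Stars)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Country)

# Inspect the values ggplot2 computed for the first layer of each plot.
count_bars <- layer_data(p_count)
sum_bars <- layer_data(p_sum)
```

Printing `p_count` or `p_sum` draws the figures. Looking at `count_bars$count` and `sum_bars$ymax` is one way to check what a bar actually represents before interpreting it.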