Data visualisation

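# Data visualisation

## Week 1

### Aesthetics

![](https://hackmd.io/_uploads/BkGCOxh83.png =50%x)![](https://hackmd.io/_uploads/BkqTde2U3.png =50%x)

The points and text can look very different between the two plots.

### Accessibility

![](https://hackmd.io/_uploads/H11h5l2I3.png =30%x)![](https://hackmd.io/_uploads/B1f2qgn82.png =30%x)![](https://hackmd.io/_uploads/S14ncghUn.png =30%x)

For colour-blind viewers, it is really hard to tell the colours apart.

### Deceptive plots

![](https://hackmd.io/_uploads/S184B7n83.png =45%x)![](https://hackmd.io/_uploads/B1t4Bm3U3.png =45%x)

Pay attention to the units on the axis: a different unit can give a better presentation, and sometimes only an appropriate portion of the information should be shown.

:::success
*The first principle is that you must not fool yourself; and you are the easiest person to fool.*
**Richard P. Feynman**
:::

### EDA: Exploratory Data Analysis

![](https://hackmd.io/_uploads/ry-UIm3Lh.png =75%x)

EDA is the process of exploring and visualising data in the early stage of an analysis in order to understand its characteristics, structure, and potential relationships. Its purpose is to help analysts discover patterns, outliers, missing values, unusual distributions, and so on in the dataset.

### Model diagnostics

Model diagnostics is the process of evaluating and checking a statistical model after it has been fitted. It aims to check whether the model satisfies its statistical assumptions, and to assess how well the model fits the data and how well it predicts.

![](https://hackmd.io/_uploads/Skg1oPQh82.png =80%x)

:::info
**Residuals vs. Fitted**
Used to check whether the residuals are randomly scattered with constant variance. If these assumptions hold, the residuals should be spread evenly around zero across the fitted values, with no obvious trend or pattern. An obvious trend, pattern, or outlier suggests that the model does not fit the data adequately or that a statistical assumption is violated.
:::

:::info
**Quantile-Quantile plot (residual Q-Q plot)**
Used to check whether the residuals follow a normal distribution. The y-axis shows the quantiles of the residuals; the x-axis shows the quantiles of the standard normal distribution. If the residuals are normally distributed, the points should lie roughly along a straight line. Points that depart from the line suggest that the residuals are not normally distributed.
:::

:::info
**Scale-Location**
Used to check whether the residuals of a regression model have constant variance. If the assumption holds, the points should be spread evenly around a horizontal line across the fitted values, with no obvious trend or pattern.
:::

:::info
**Residuals vs. Leverage**
Used to check whether the fit of a regression model is affected by extreme observations (high-leverage points), that is, observations that have a large influence on the model. If the points cluster in the low-leverage region, the fit is less affected by individual observations and the model fits well in that region.
:::

### Visual Inference

![](https://hackmd.io/_uploads/BJ4xsm3Un.png =50%x)![](https://hackmd.io/_uploads/Bkdlom3L2.png =50%x)

Most people would judge the left plot to show the stronger relationship, but the opposite is true.

### Big Data Visualisation

![](https://hackmd.io/_uploads/HkS5oQhU3.png =30%x)![](https://hackmd.io/_uploads/SJd5oQnLh.png =30%x)![](https://hackmd.io/_uploads/H1c9o7hUn.png =30%x)

From left to right: scatterplot, contour plot, scatterplot with partial transparency.

Q: Which one is better?

## Week 2

### Principles of data visualisation

#### DON'T

* **Bad colour schemes**: e.g., rainbow colour map. Too many elements only dazzle the viewer.
* **Misuse of animation**: change blindness, and it is too easy to “forget” the earlier states, which easily misleads the audience.
* **Scale abuse**: inappropriate baseline, inconsistent scales. Units and scales are details that few people notice, so an inconsistent baseline easily causes misunderstanding.
* **3D abuse**: occlusion, projection, perceptual ambiguity. We already make mistakes reading 2D graphics, so why bother with 3D?
* **Don’t oversummarise**: e.g., plot the distribution rather than just the mean. Show what is there; too much interpretation leads to over-reading the data.

#### DO

* **Apprehension** (correct perception of relationships): the viewer can understand the relationships in the graph and the correlation between the data.
* **Clarity** (visually distinguish elements): elements can be told apart visually.
* **Consistency** (ability to interpret relative to similar graphs): the graph can be interpreted, and even when two graphs look similar their key differences are recognisable.
* **Efficacy** (portray data as simply as possible): keep the portrayal simple and clean.
* **Necessity** (is the graph necessary? are all of the graph elements necessary?): keep the graph itself simple as well.
* **Truthfulness** (scale, coordinate system are accurate, not misleading): do not mislead the audience.

## ggplot command

:::info
ggplot(data = <data>, mapping = aes(<aes> = <var>, ...)) +
scale_<aes>_<scale>() + ... +
geom_<layer>(mapping = aes(...)) + ... +
<etc.>
:::

* `<data>`: data.frame or tibble (the data structure)
* `<aes>`: aesthetic (x, y, color, fill, size, shape, etc.)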
* `<var>`: data variable of appropriate type
* `scale_<aes>_<scale>` (optional): modify rendering of `<aes>` to use `<scale>`
* `geom_<layer>` (required if you want a plot): render the specified aesthetics (or a newly specified aesthetic) using plot type `<layer>`
* `<etc.>`: labels (`xlab()`, `ylab()`, etc.), themes (`theme_classic()`, `theme_gray()`, etc.), etc.

## Color

### RGB and HCL

RGB: RGB is made up of three base colour channels: red, green, and blue. Computer screens commonly use the RGB colour model to produce and mix colours. In the RGB model, a colour is produced by increasing or decreasing the intensity of each channel, which is very convenient for computer hardware. However, this approach is **not intuitive for human colour perception.**

HCL: HCL stands for hue, chroma, and luminance. The HCL colour space is designed around human colour vision, so it is closer to the way we perceive colour. In HCL space, hue is what we usually call the colour (red, blue, green, etc.), chroma is the intensity or purity of the colour, and luminance is how light or dark the colour is. **Using the HCL model makes it easier to produce colour schemes that are visually more appealing and more readable.**

![](https://hackmd.io/_uploads/BydhYoAw3.png =45%x)

### Aside: opacity

Alpha (opacity):
* 0 = fully transparent, 1 = fully opaque

### Color blindness tool

:::info
library(colorblindr)
cvd_grid(fig)
![](https://hackmd.io/_uploads/SJubao0D3.png =45%x)![](https://hackmd.io/_uploads/SyobTjCvn.png =45%x)
:::

## Graphs for a single variable

### A single categorical variable

* Dot chart ![](https://hackmd.io/_uploads/B1dxJ20P2.png =50%x)
* Bar chart ![](https://hackmd.io/_uploads/Hy8XknRD3.png =50%x)
* Pie chart ![](https://hackmd.io/_uploads/B1UBJhRw2.png =50%x)
* Pictograph ![](https://hackmd.io/_uploads/ryhDk3Avh.png =50%x)

### A single quantitative variable

* Distribution plots
  * Box plot ![](https://hackmd.io/_uploads/Skc8ehRP2.png =35%x)
  * Dot plot ![](https://hackmd.io/_uploads/rJxtx2ADh.png =35%x)
  * Histogram ![](https://hackmd.io/_uploads/H1H5e2APn.png =35%x)
  * Density plot ![](https://hackmd.io/_uploads/SJI6e2ADh.png =35%x)
  * Q-Q plot ![](https://hackmd.io/_uploads/SyYybh0Dn.png =35%x)
  * Violin plot ![](https://hackmd.io/_uploads/r1Qb-n0w3.png =35%x)
  * Empirical CDF ![](https://hackmd.io/_uploads/ryIM-hCPn.png =35%x)

    $\hat{F}_n(t)=\frac{\text{number of elements in the sample}\le t}{n}=\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}(X_i\le t)$

    ![](https://hackmd.io/_uploads/rJ9lP2CD3.png =75%x)

### Hierarchy of mappings
![](https://hackmd.io/_uploads/ry7Ij2Cwh.png =25%x) > ![](https://hackmd.io/_uploads/S1ugK20P3.png =25%x) > ![](https://hackmd.io/_uploads/HJDfKnCw3.png =25%x) > ![](https://hackmd.io/_uploads/HJZkc2Avn.png =25%x)![](https://hackmd.io/_uploads/H1WWc20Dh.png =25%x)![](https://hackmd.io/_uploads/S1oZ5nAD3.png =25%x) > ![](https://hackmd.io/_uploads/rJ4mq3CP2.png =25%x) > ![](https://hackmd.io/_uploads/S1UNqhCD3.png =25%x)![](https://hackmd.io/_uploads/SJES9h0Pn.png =25%x) > ![](https://hackmd.io/_uploads/SJb89nAPh.png =25%x)![](https://hackmd.io/_uploads/rk5Lq2CDn.png =25%x)![](https://hackmd.io/_uploads/SywD93RP3.png =25%x)

## Week 3