Week 1 Data visualization

What would you like to show?

  1. Comparison: bar chart; box plot
  2. Distribution: histogram; box plot
  3. Composition: pie chart; stacked bar chart
  4. Relationship: scatter plot (bubble chart); heat map
  • Histogram: shows how the data are distributed
  • Scatter plot: shows the relationship between two variables
  • Bubble plot: shows the relationship among three variables

Using ggplot2 package

What ggplot cannot do:

  1. 3D graphics
  2. Graph-theory graphs: network diagrams of nodes and edges (nodes/edges layout)
  3. Interactive graphics

ggplot builds a graph from the following building blocks:

  1. data; aesthetic mapping; geometric object
  2. statistical transformations; scales
  3. coordinate system; position adjustments
  4. faceting
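
As a rough illustration of how these building blocks combine, the sketch below attaches one component per line. It uses R's built-in mtcars dataset as a stand-in (an assumption made so the example is self-contained; it is not the housing data used in these notes).

```r
library(ggplot2)

# mtcars is a built-in dataset, used here only as an assumed stand-in
p <- ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # data + aesthetic mapping
  geom_point(position = "jitter") +                    # geometric object + position adjustment
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # statistical transformation
  scale_color_discrete(name = "Cylinders") +           # scale
  coord_cartesian(ylim = c(10, 35)) +                  # coordinate system
  facet_wrap(~ am)                                     # faceting

print(p)
```

Each `+` adds one component to the plot object, so a line can be dropped or swapped without touching the rest.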

Basic structure:

aesthetics + geometric objects

Aesthetic parameters:

aes()

  • position
  • color (outline color)
  • fill (interior color)
  • shape
  • linetype
  • size

geometric objects: geom_()

  • geom_point()
  • geom_line()
  • geom_boxplot()
  • geom_histogram()
  • geom_bar()
  • geom_smooth()
  • geom_raster()

ggplot(): prepares the canvas

Using data: landdata.csv

Read the file

library(ggplot2)
housing <- read.csv("path/to/landdata.csv")  # replace with the location of your file
head(housing)

Histogram

# 1. Base R
hist(housing$Home.Value)
# 2. ggplot2
ggplot(housing, aes(x = Home.Value)) + geom_histogram()

More complex graphs

Traditional plot():

plot(Home.Value ~ Date, data=subset(housing, State == "MA"), type="l")
lines(Home.Value ~ Date, col="red", data=subset(housing, State == "TX"))
legend(1975, 400000, c("MA", "TX"), title="State", col=c("black", "red"), lty=c(1,1))

subset():
Queries a data frame and extracts the rows that match a condition.
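
A minimal sketch of subset() on a toy data frame may help here. The column names and values are hypothetical and chosen only for illustration.

```r
# Toy data frame (hypothetical values, for illustration only)
df <- data.frame(
  State = c("MA", "TX", "NY", "MA"),
  Value = c(10, 20, 30, 40)
)

# Keep only the rows where the condition is TRUE
ma_rows <- subset(df, State == "MA")
print(ma_rows)

# The optional select argument also restricts the columns
ma_values <- subset(df, State == "MA", select = Value)
print(ma_values)
```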

By ggplot():

data <- subset(housing, State %in% c("MA", "TX"))

ggplot(data,aes(x=Date, y=Home.Value, color = State)) + geom_line()

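Note on the %in% operator:
For each element on the left, it checks whether that element appears in the set on the right, returning TRUE if it does and FALSE if it does not.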
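
A small sketch of %in% on plain character vectors. The vector names are made up for this example.

```r
states <- c("MA", "TX", "NY")
wanted <- c("MA", "TX")

# One logical value per element on the left-hand side
in_wanted <- states %in% wanted
print(in_wanted)

# The logical vector can be used directly to filter
print(states[in_wanted])
```

This is why the filter uses `State %in% c("MA", "TX")` rather than `State == c("MA", "TX")`: `==` compares element by element and recycles the shorter vector, which would silently drop matching rows.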

Points (Scatter Plot)

hp2001Q1 <- subset(housing, Date == 2001.25) 
ggplot(hp2001Q1, aes(y = Structure.Cost, x = Land.Value)) + geom_point()
ggplot(hp2001Q1, aes(y = Structure.Cost, x = log(Land.Value))) + geom_point()

(The second plot takes the log of Land.Value.)


Lines

# Fit the regression line (intercept and coefficient)
model <- lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1)
# Predicted values
hp2001Q1$pred.SC <- predict(model)

p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))

p1 + geom_point(aes(color = Home.Value)) + geom_line(aes(y = pred.SC))
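
To make the link between the fitted coefficients and the predicted values concrete, here is a sketch on the built-in mtcars dataset (an assumed stand-in for the housing data). It checks that predict() with no new data returns intercept + slope × log(x) for each row.

```r
# Fit the same kind of model on mtcars (assumed stand-in data)
model <- lm(mpg ~ log(wt), data = mtcars)

# coef() returns the intercept and the slope on log(wt)
b <- coef(model)
print(b)

# With no newdata argument, predict() returns the fitted value
# for each row used in the fit
fitted_by_predict <- unname(predict(model))
fitted_by_hand <- unname(b[1] + b[2] * log(mtcars$wt))

print(all.equal(fitted_by_predict, fitted_by_hand))
```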

Smoothers (trend lines)

p1 + geom_point(aes(color = Home.Value)) + geom_smooth()

Aesthetic Mapping vs. Assignment (scatter plot)

p1 + geom_point(size = 2, color="red")

p1 + geom_point(aes(color=Home.Value, shape = region))

p1 + geom_point(aes(size=Home.Value, color = region))
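
The calls in this section differ in where the aesthetic is set. A value outside aes() is a fixed constant, while a value inside aes() is mapped from a column and gets a legend. The sketch below shows the difference and a common pitfall, using the built-in mtcars dataset as an assumed stand-in; the object names are made up for this example.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))

# Assignment: a constant set outside aes(); every point is red, no legend
p_assigned <- base + geom_point(color = "red", size = 2)

# Mapping: a variable set inside aes(); color varies with the data
# and a legend is drawn
p_mapped <- base + geom_point(aes(color = factor(cyl)))

# Common pitfall: a constant inside aes() is treated as data, so "red"
# becomes a one-level category drawn in the default palette color
p_pitfall <- base + geom_point(aes(color = "red"))
```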

Bar chart: geom_bar()

ggplot(housing, aes(x=region)) + geom_bar()

Stacked Bar Chart

ggplot(housing, aes(x=Year, fill=region)) +
      geom_bar() +
      labs(title = "Stacked Bar Chart", x = "YEAR", y = "Counts")

aggregate(data to summarize, grouping variable, function to apply (mean, sum, etc.))

housing.sum <- aggregate(housing["Home.Value"], housing["State"], FUN=mean)

ggplot(housing.sum, aes(x=State, y=Home.Value)) + geom_bar(stat='identity') # stat='identity' is required so the bar heights come from the y values in the data
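
To see what aggregate() returns before it is passed to ggplot(), here is a sketch on a toy data frame. The values are hypothetical and chosen so the group means are easy to check by hand.

```r
# Toy data frame (hypothetical values, for illustration only)
df <- data.frame(
  State = c("MA", "MA", "TX", "TX"),
  Home.Value = c(100, 300, 50, 150)
)

# Mean of Home.Value within each State: one row per group
df.mean <- aggregate(df["Home.Value"], df["State"], FUN = mean)
print(df.mean)

# The formula interface is another way to write the same summary
df.mean2 <- aggregate(Home.Value ~ State, data = df, FUN = mean)
print(df.mean2)
```

The result has one row per state, which is why the bar chart needs stat = 'identity': the heights are already computed rather than counted from raw rows.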


Pie chart

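ggplot2 draws a pie chart by converting the y axis of a bar chart to polar coordinates.
The x axis carries no data; fill color distinguishes the regions.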

housing2.sum <- aggregate(housing["Home.Value"], housing["region"], FUN=length) 

ggplot(housing2.sum, aes(x=region, y=Home.Value))+geom_bar(stat='identity')+labs(y="Counts")

ggplot(housing2.sum, aes(x="", y=Home.Value, fill =factor(region)))+geom_bar(stat='identity',width=1)+coord_polar(theta = "y", start=0)



Box plot

ggplot(housing, aes(x = region, y = Home.Value)) + geom_boxplot(fill = "red") + scale_y_continuous("Home value", breaks = seq(0, 800000, by = 100000))


Heat map

ggplot(housing, aes(x= Year, y= Qrtr)) + geom_raster(aes(fill = Home.Value)) +scale_fill_continuous(name="Value", breaks = c(200000, 500000, 800000), labels = c("'200", "'500", "'800"), low='gray', high='red')