
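Scatter Plot Matrix

tags: data mining

A scatter plot matrix shows the distributions of several individual variables and the pairwise relationships between them at the same time, which makes it a great way to identify trends for follow-up analysis.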

Pair plots in Python

In this notebook we will explore making pair plots in Python using the seaborn visualization library. We'll start with the default sns.pairplot.

# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
# matplotlib for plotting
import matplotlib.pyplot as plt
import matplotlib

# Set text size
matplotlib.rcParams['font.size'] = 18

# Seaborn for pairplots
import seaborn as sns

sns.set_context('talk', font_scale=1.2)

# Load the Gapminder data and give the columns short, code-friendly names
df = pd.read_csv('gapminder_data.csv')
df.columns = ['country', 'continent', 'year', 'life_exp', 'pop', 'gdp_per_cap']
df.head()


Default Pair Plot with All Data

Let's use the entire dataset to create a simple, yet useful plot.

sns.pairplot(df);


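We can see that life_exp and gdp_per_cap are positively correlated, which suggests that people in higher-income countries tend to live longer (although this does not establish a causal relationship). The plots also show that life expectancy worldwide has increased over time. From the histograms, we can see that the population and GDP variables are heavily right-skewed. (Right-skewed: the right tail is longer, and the mean is greater than the median.)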
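The right-skew observation can also be checked numerically instead of by eye. The following is a minimal sketch that uses a synthetic log-normal sample as a stand-in for a column such as pop; the sample and the names `sample`, `mean_value`, `median_value`, and `skewness` are illustrative assumptions, not part of the Gapminder data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a right-skewed column such as `pop`.
# This is NOT the Gapminder data; it only illustrates the check.
rng = np.random.default_rng(0)
sample = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=5000), name="synthetic_pop")

mean_value = sample.mean()
median_value = sample.median()
skewness = sample.skew()  # sample skewness; a positive value means a longer right tail

print(f"mean={mean_value:.1f}, median={median_value:.1f}, skew={skewness:.2f}")
```

On the real data, the same check would be `df['pop'].skew()` and a comparison of `df['pop'].mean()` with `df['pop'].median()`.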

Group and Color by a Variable

To better understand the data, we can color the pair plot by a categorical variable using the hue keyword. First, we will color the plots by continent.

# Larger text for this figure
matplotlib.rcParams['font.size'] = 40
# hue assigns one color per continent
sns.pairplot(df, hue = 'continent');


Transforming the Data

We can also see that the distributions of pop and gdp_per_cap are heavily skewed to the right. To better represent the data, we can take the base-10 log transform of those columns.

df['log_pop'] = np.log10(df['pop'])
df['log_gdp_per_cap'] = np.log10(df['gdp_per_cap'])

df = df.drop(columns = ['pop', 'gdp_per_cap'])
sns.pairplot(df, hue = 'continent');  
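To see why the log transform helps, the skewness before and after the transform can be compared. This is a rough sketch on a synthetic log-normal sample (an assumption standing in for pop or gdp_per_cap, not the real data); the names `raw`, `logged`, `skew_before`, and `skew_after` are illustrative. Note that `np.log10` needs strictly positive values, which holds for population and GDP per capita.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample; NOT the Gapminder data
rng = np.random.default_rng(42)
raw = pd.Series(rng.lognormal(mean=8, sigma=1.2, size=5000))

# Same transform as applied to `pop` and `gdp_per_cap` above
logged = np.log10(raw)

skew_before = raw.skew()
skew_after = logged.skew()

print(f"skew before: {skew_before:.2f}, skew after: {skew_after:.2f}")
```

A skewness close to zero after the transform indicates a roughly symmetric distribution, which is why the log-scaled panels spread the points more evenly.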


Summary

We can now see that Oceania and Europe tend to have the highest life expectancy, while Asia has the largest population.

Code source: https://github.com/WillKoehrsen/Data-Analysis/blob/master/pairplots/Pair%20Plots.ipynb