# Scatter plot matrix

###### tags: `Data Mining`

A scatter plot matrix shows both the **distribution of each individual variable** and the **pairwise relationships between them** at once, which makes it a good way to spot trends for follow-up analysis.

## Pair plots in Python

In this notebook we will explore making pair plots in Python using the seaborn visualization library. We'll start with the default `sns.pairplot`.

```
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
```

```
# matplotlib for plotting
import matplotlib.pyplot as plt
import matplotlib

# Set text size
matplotlib.rcParams['font.size'] = 18

# Seaborn for pairplots
import seaborn as sns

sns.set_context('talk', font_scale=1.2)
```

```
df = pd.read_csv('gapminder_data.csv')
df.columns = ['country', 'continent', 'year', 'life_exp', 'pop', 'gdp_per_cap']
df.head()
```

![](https://i.imgur.com/WNsgCPR.png)

## Default Pair Plot with All Data

Let's use the entire dataset to create a simple, yet useful plot.

```
sns.pairplot(df);
```

![](https://i.imgur.com/HEI4Wmk.png)

We can see that `life_exp` and `gdp_per_cap` are positively correlated, which suggests that people in higher-income countries tend to live longer (although this does not establish a causal relationship). The plot also shows that life expectancy worldwide has risen over time. From the histograms we can see that the population and GDP variables are heavily right-skewed. (Right-skewed: the right tail is longer, and the mean is typically greater than the median.)

## Group and Color by a Variable

In order to better understand the data, we can color the pair plot using a categorical variable and the `hue` keyword. First, we will color the plots by continent.

```
matplotlib.rcParams['font.size'] = 40
sns.pairplot(df, hue='continent');  # color the points by continent
```

![](https://i.imgur.com/Q8UTQke.png)

## Transforming the Data

We can also see that the distributions of `pop` and `gdp_per_cap` are heavily skewed to the right. To better represent the data, we can take the log transform of those columns.
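To see why a log transform helps with right-skewed values, here is a minimal sketch on synthetic data. It assumes a log-normal sample as a stand-in for a column such as `pop`; the sample, its parameters, and the variable names are illustrative and do not come from the Gapminder file. It uses only the standard library so it runs without the dataset.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so repeated runs draw the same sample

# Synthetic right-skewed sample (assumption: log-normal draws stand in for `pop`).
raw = [random.lognormvariate(0.0, 1.5) for _ in range(10_000)]

# Same transform the notebook applies to its columns, here via math.log10.
logged = [math.log10(x) for x in raw]

# In a right-skewed sample the long right tail pulls the mean above the median.
raw_gap = statistics.mean(raw) - statistics.median(raw)
log_gap = statistics.mean(logged) - statistics.median(logged)

print(f"raw:   mean - median = {raw_gap:.3f}")
print(f"log10: mean - median = {log_gap:.3f}")
```

On the raw sample the mean sits well above the median, while after `log10` the two nearly coincide, which is what a roughly symmetric distribution looks like. The same transform is applied to the dataset's columns next.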
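```
# Replace the skewed columns with their base-10 logarithms
df['log_pop'] = np.log10(df['pop'])
df['log_gdp_per_cap'] = np.log10(df['gdp_per_cap'])
df = df.drop(columns=['pop', 'gdp_per_cap'])
```

```
sns.pairplot(df, hue='continent');
```

![](https://i.imgur.com/K3CgXFF.png)

## Summary

We can now see that Oceania and Europe tend to have the highest life expectancy, while Asia has the largest populations.

<!-- Tutorial: [https://www.gushiciku.cn/dc_hk/106765614](https://www.gushiciku.cn/dc_hk/106765614) -->

Code source: [https://github.com/WillKoehrsen/Data-Analysis/blob/master/pairplots/Pair%20Plots.ipynb](https://github.com/WillKoehrsen/Data-Analysis/blob/master/pairplots/Pair%20Plots.ipynb)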
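The observations in the summary are read off the plot by eye. As a rough numeric cross-check, one could aggregate by continent. The sketch below assumes `df` already carries the `life_exp` and `log_pop` columns created earlier; `summarize_by_continent` is a hypothetical helper name and is not part of the original notebook.

```python
import pandas as pd


def summarize_by_continent(df: pd.DataFrame) -> pd.DataFrame:
    """Return median life expectancy and median log10 population per continent.

    Rows are sorted so the continent with the highest median life
    expectancy comes first.
    """
    return (
        df.groupby('continent')[['life_exp', 'log_pop']]
        .median()
        .sort_values('life_exp', ascending=False)
    )
```

Calling `summarize_by_continent(df)` after the log transform gives one row per continent, which can be compared against the impression from the colored pair plot. The median is used here rather than the mean because it is less sensitive to outlying countries.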