# Scatter plot matrix
###### tags: `Data Mining`
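A scatter plot matrix (pair plot) shows both the **distribution of each individual variable** and the **pairwise relationships between variables**, which makes it a useful way to spot trends for follow-up analysis.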
## Pair plots in Python
In this notebook we will explore making pair plots in Python using the seaborn visualization library. We'll start with the default `sns.pairplot`.
```
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
```
```
# matplotlib for plotting
import matplotlib.pyplot as plt
import matplotlib
# Set text size
matplotlib.rcParams['font.size'] = 18
# Seaborn for pairplots
import seaborn as sns
sns.set_context('talk', font_scale=1.2);
```
```
df = pd.read_csv('gapminder_data.csv')
df.columns = ['country', 'continent', 'year', 'life_exp', 'pop', 'gdp_per_cap']
df.head()
```

## Default Pair Plot with All Data
Let's use the entire dataset to create a simple, yet useful plot.
```
sns.pairplot(df);
```

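We can see that `life_exp` and `gdp_per_cap` are positively correlated, which suggests that people in higher-income countries tend to live longer (although this does not establish a causal relationship). The plot also shows that life expectancy worldwide has increased over time. From the histograms we can see that the `pop` and `gdp_per_cap` variables are heavily right-skewed (right-skewed: the right tail is longer, and the mean is greater than the median).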
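
The mean-versus-median rule of thumb for right skew can be checked numerically. The sketch below uses only the standard library and a small made-up sample (the values are illustrative, not taken from the Gapminder data); `is_right_skewed` is a hypothetical helper name introduced here.

```python
import statistics

# Made-up, illustrative values with one long right tail (NOT Gapminder data)
sample = [1, 2, 2, 3, 3, 4, 5, 50]


def is_right_skewed(values):
    """Rule of thumb: a right-skewed sample has its mean above its median."""
    return statistics.mean(values) > statistics.median(values)


mean_value = statistics.mean(sample)      # pulled upward by the large value 50
median_value = statistics.median(sample)  # depends only on the middle values
print(mean_value, median_value, is_right_skewed(sample))
```

On the actual DataFrame, the equivalent check would compare `df['pop'].mean()` with `df['pop'].median()`.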
## Group and Color by a Variable
To better understand the data, we can color the pair plot by a categorical variable using the `hue` keyword. First, we will color the plots by `continent`.
```
matplotlib.rcParams['font.size'] = 40
sns.pairplot(df, hue = 'continent'); # hue colors the points by continent
```
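
Conceptually, `hue` partitions the rows by the categorical column and draws each group in its own color. The sketch below mimics that grouping with the standard library only; it is not how seaborn is implemented, and the rows, the `palette` list, and the helper name `assign_colors` are all hypothetical.

```python
from collections import defaultdict

# Made-up (continent, life_exp) rows for illustration (NOT Gapminder data)
rows = [("Asia", 60.0), ("Europe", 75.0), ("Asia", 65.0), ("Africa", 50.0)]

# Hypothetical palette; seaborn chooses its own default colors
palette = ["tab:blue", "tab:orange", "tab:green"]


def assign_colors(rows, palette):
    """Map each category to one color, in order of first appearance."""
    colors = {}
    for category, _ in rows:
        if category not in colors:
            colors[category] = palette[len(colors) % len(palette)]
    return colors


colors = assign_colors(rows, palette)

# Collect the values per category, since each hue level is drawn as its own group
groups = defaultdict(list)
for category, value in rows:
    groups[category].append(value)

print(colors)
print(dict(groups))
```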

## Transforming the Data
We can also see that the distributions of `pop` and `gdp_per_cap` are heavily skewed to the right. To better represent the data, we can replace those columns with their base-10 logarithms.
```
df['log_pop'] = np.log10(df['pop'])
df['log_gdp_per_cap'] = np.log10(df['gdp_per_cap'])
df = df.drop(columns = ['pop', 'gdp_per_cap'])
```
```
sns.pairplot(df, hue = 'continent');
```
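
To see why a log transform tames right skew, consider values that span several orders of magnitude. The sketch below uses made-up numbers (not rows from the dataset) and the standard library's `math.log10` in place of `np.log10`; both compute the base-10 logarithm, with NumPy applying it element-wise to a whole column.

```python
import math

# Made-up values spanning four orders of magnitude (NOT Gapminder data)
raw = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]

# Base-10 log: each tenfold increase becomes a step of 1
logged = [math.log10(v) for v in raw]

raw_range = max(raw) - min(raw)        # dominated by the single largest value
log_range = max(logged) - min(logged)  # about 4, with evenly spaced steps
print(logged, raw_range, log_range)
```

The logarithm is only defined for strictly positive values, which should hold for population and GDP per capita.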

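## Summary
With the log-transformed plot, we can now see that Oceania and Europe tend to have the highest life expectancy, while Asia has the largest populations.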
<!-- Tutorial: [https://www.gushiciku.cn/dc_hk/106765614](https://www.gushiciku.cn/dc_hk/106765614)
-->
Code source: [https://github.com/WillKoehrsen/Data-Analysis/blob/master/pairplots/Pair%20Plots.ipynb](https://github.com/WillKoehrsen/Data-Analysis/blob/master/pairplots/Pair%20Plots.ipynb)