owned this note
owned this note
Published
Linked with GitHub
# numpy & pandas 筆記 (2)
***本篇資料來源為莫煩 python:**
https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/
pandas 跟 numpy 的使用情境:
numpy 用來替代 python 當中的 list (陣列,矩陣)
pandas 用來替代 python 當中的 dictionary (字典)
## pandas 基本功能
- **Pandas Series**
會自動加上字典的 index
```python=1
#coding=utf-8
import numpy as np
import pandas as pd
s = pd.Series([1,3,6, np.nan, 44, 1])
print(s)
'''
0 1.0
1 3.0
2 6.0
3 NaN
4 44.0
5 1.0
dtype: float64
'''
```
- **Pandas date_range**
```python=1
#coding=utf-8
import numpy as np
import pandas as pd
dates = pd.date_range('20170101',periods=6)
print(dates)
'''
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04',
'2017-01-05', '2017-01-06'],
dtype='datetime64[ns]', freq='D')
'''
```
- **Pandas DataFrame 基本性質**
```python=1
#coding=utf-8
import numpy as np
import pandas as pd
# 宣告 data frame
df=pd.DataFrame(np.random.randn(6,4), index=dates, columns=['a','b','c','d'])
print(df)
'''
a b c d
2017-01-01 1.159450 -1.578301 -0.652926 0.375754
2017-01-02 0.393195 -0.346388 -0.754170 -0.008133
2017-01-03 1.192782 0.804110 -0.842838 0.012096
2017-01-04 -1.058825 -0.323025 0.722646 -1.166700
2017-01-05 -1.180235 -0.450869 1.203138 -0.815103
2017-01-06 -1.722095 0.304247 0.175714 -1.654301
'''
# 預設的 index 為 0 1 2 3 ...
df1=pd.DataFrame(np.arange(12).reshape((3,4)))
print(df1)
'''
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
'''
```
- **Pandas DataFrame 用法:使用 dictionary 宣告**
```python=1
#coding=utf-8
import numpy as np
import pandas as pd
# 使用 dictionary 定義 data frame
df2=pd.DataFrame({'A':1.,
'B':pd.Timestamp('20170101'),
'C':pd.Series(1, index=list(range(4)),dtype='float32'),
'D':np.array([3]*4, dtype='int32'),
'E':pd.Categorical(["test","train","test","train"]),
'F':'foo'})
print(df2)
'''
A B C D E F
0 1.0 2017-01-01 1.0 3 test foo
1 1.0 2017-01-01 1.0 3 train foo
2 1.0 2017-01-01 1.0 3 test foo
3 1.0 2017-01-01 1.0 3 train foo
'''
```
- **Pandas DataFrame 使用 dtype 查看資料型態**
```python=20
# 查看資料格式
print(df2.dtypes)
'''
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
'''
```
- **Pandas DataFrame 查看 index 與資料數值**
```python=31
# 輸出所有 row index
print(df2.index)
'''
Int64Index([0, 1, 2, 3], dtype='int64')
'''
# 輸出所有 column
print(df2.columns)
'''
Index([u'A', u'B', u'C', u'D', u'E', u'F'], dtype='object')
'''
# 輸出所有值 values
print(df2.values)
'''
[[1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'train' 'foo']
[1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'test' 'foo']
[1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'train' 'foo']]
'''
```
- **Pandas DataFrame 使用 describe 查看數字資料描述**
```python=51
# 輸出 describe 查看數字形式的特性
print(df2.describe())
'''
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
'''
```
- **Pandas DataFrame 轉置資料**
```python=64
# transpose 轉置矩陣
print(df2.T)
'''
0 1 2 3
A 1 1 1 1
B 2017-01-01 00:00:00 2017-01-01 00:00:00 2017-01-01 00:00:00 2017-01-01 00:00:00
C 1 1 1 1
D 3 3 3 3
E test train test train
F foo foo foo foo
'''
```
- **Pandas DataFrame 排序資料 sort_index 與 sort_values**
```python=75
# 排序,對於 axis=1 以 row 方向排序 ascending=False 倒排序
print(df2.sort_index(axis=1, ascending=False))
''' 排序結果: F E D C B A
F E D C B A
0 foo test 3 1.0 2017-01-01 1.0
1 foo train 3 1.0 2017-01-01 1.0
2 foo test 3 1.0 2017-01-01 1.0
3 foo train 3 1.0 2017-01-01 1.0
'''
# 排序,對於 axis=0 以 col 方向反向排序 (ascending=Fals)
print(df2.sort_index(axis=0, ascending=False))
''' 排序結果: 3 2 1 0
A B C D E F
3 1.0 2017-01-01 1.0 3 train foo
2 1.0 2017-01-01 1.0 3 test foo
1 1.0 2017-01-01 1.0 3 train foo
0 1.0 2017-01-01 1.0 3 test foo
'''
# 排序 sort_values 針對單行的值進行排序
print(df2.sort_values(by='E'))
''' 排序結果: test train 兩倆排在一起
A B C D E F
0 1.0 2017-01-01 1.0 3 test foo
2 1.0 2017-01-01 1.0 3 test foo
1 1.0 2017-01-01 1.0 3 train foo
3 1.0 2017-01-01 1.0 3 train foo
'''
```
## pandas indexing
- **data frame 基本 indexing 方法**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
dates = pd.date_range('2017-01-01',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])
print(df['A'])
'''
2017-01-01 0
2017-01-02 4
2017-01-03 8
2017-01-04 12
2017-01-05 16
2017-01-06 20
Freq: D, Name: A, dtype: int64
'''
print(df.A)
'''
2017-01-01 0
2017-01-02 4
2017-01-03 8
2017-01-04 12
2017-01-05 16
2017-01-06 20
Freq: D, Name: A, dtype: int64
'''
print(df[0:3])
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 6 7
2017-01-03 8 9 10 11
'''
print(df['2017-01-02':'2017-01-04'])
'''
A B C D
2017-01-02 4 5 6 7
2017-01-03 8 9 10 11
2017-01-04 12 13 14 15
'''
```
- **data frame location 用法 loc**
```python=43
# 使用 data frame loc 來選擇查看數據
# 查看指定的 row 的內容
print(df.loc['2017-01-02'])
'''
A 4
B 5
C 6
D 7
Name: 2017-01-02 00:00:00, dtype: int64
'''
# 篩選 col 的指令欄位 row 部分全部印出
print(df.loc[:,['A','B']])
'''
A B
2017-01-01 0 1
2017-01-02 4 5
2017-01-03 8 9
2017-01-04 12 13
2017-01-05 16 17
2017-01-06 20 21
'''
# 指定 row 指定 col
print(df.loc['2017-01-03',['A','C']])
'''
A 8
C 10
Name: 2017-01-03 00:00:00, dtype: int64
'''
```
- **select py position 使用 iloc**
```python=74
# 使用 index 數字來指定 select by position: iloc
print(df.iloc[3])
'''
A 12
B 13
C 14
D 15
Name: 2017-01-04 00:00:00, dtype: int64
'''
# 第 3 row, 第 1 col
print(df.iloc[3,1]) # 13
# 第 3~5 row, 第 1~3 col
print(df.iloc[3:5,1:3])
'''
B C
2017-01-04 13 14
2017-01-05 17 18
'''
```
- **混合篩選 使用 ix**
```python=93
# 混合篩選 使用 ix
print(df.ix[:3, ['A','C']])
'''
A C
2017-01-01 0 2
2017-01-02 4 6
2017-01-03 8 10
'''
```
- **Booling indexing**
```python=101
# Booling indexing
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 6 7
2017-01-03 8 9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
'''
# 針對 A 欄位資料 > 8 的篩選出來
print(df[df.A > 8])
'''
A B C D
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
'''
```
## pandas 設值
- **在資料表格中根據 index 設定值**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
dates = pd.date_range('2017-01-01',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 6 7
2017-01-03 8 9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
'''
df.iloc[2,2] = 1111
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 6 7
2017-01-03 8 9 1111 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
'''
df.loc['2017-01-02','C'] = 222
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 222 7
2017-01-03 8 9 1111 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
'''
```
- **設定符合條件,整欄位修改**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
dates = pd.date_range('2017-01-01',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])
# 全部都改到
df[df.A>8] = 0
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 222 7
2017-01-03 8 9 1111 11
2017-01-04 0 0 0 0
2017-01-05 0 0 0 0
2017-01-06 0 0 0 0
'''
```
- **設定符合條件,指定特定欄位才修改**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
dates = pd.date_range('2017-01-01',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])
# 指定特定欄位修改
df.A[df.A>8] = 0
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 222 7
2017-01-03 8 9 1111 11
2017-01-04 0 13 14 15
2017-01-05 0 17 18 19
2017-01-06 0 21 22 23
'''
dates = pd.date_range('2017-01-01',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])
# 指定特定欄位修改
df.B[df.A>8] = 0
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 6 7
2017-01-03 8 9 10 11
2017-01-04 12 0 14 15
2017-01-05 16 0 18 19
2017-01-06 20 0 22 23
'''
```
- **新增欄位並設定值**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
dates = pd.date_range('2017-01-01',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])
# 新增欄位
df['E'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('2017-01-01',periods=6))
df['F'] = np.nan
print(df)
'''
A B C D E F
2017-01-01 0 1 2 3 1 NaN
2017-01-02 4 5 6 7 2 NaN
2017-01-03 8 9 10 11 3 NaN
2017-01-04 12 13 14 15 4 NaN
2017-01-05 16 17 18 19 5 NaN
2017-01-06 20 21 22 23 6 NaN
'''
```
## pandas 處理資料表中的 "空值"
- **dropna 丟掉整欄/整列資料**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
dates = pd.date_range('2017-01-01',periods=6)
df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D'])
print(df)
'''
A B C D
2017-01-01 0 1 2 3
2017-01-02 4 5 6 7
2017-01-03 8 9 10 11
2017-01-04 12 13 14 15
2017-01-05 16 17 18 19
2017-01-06 20 21 22 23
'''
# 設定兩個空值
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
print(df)
'''
A B C D
2017-01-01 0 NaN 2.0 3
2017-01-02 4 5.0 NaN 7
2017-01-03 8 9.0 10.0 11
2017-01-04 12 13.0 14.0 15
2017-01-05 16 17.0 18.0 19
2017-01-06 20 21.0 22.0 23
'''
# dropna: how = any 偵測 row (axis=0) 當中有 nan 的就整個丟掉
print(df.dropna(axis=0, how='any')) # how = 'any' 'all'
'''
A B C D
2017-01-03 8 9.0 10.0 11
2017-01-04 12 13.0 14.0 15
2017-01-05 16 17.0 18.0 19
2017-01-06 20 21.0 22.0 23
'''
# dropna: how = any 偵測 col (axis=1) 當中有 nan 的就整個丟掉
print(df.dropna(axis=1, how='any')) # how = 'any' 'all'
'''
A D
2017-01-01 0 3
2017-01-02 4 7
2017-01-03 8 11
2017-01-04 12 15
2017-01-05 16 19
2017-01-06 20 23
'''
# p.s. 如果 how = all 的話,要整排 NaN 才會丟掉
```
- **fillna 在 NaN 空格中填入指定的值 value**
```python=57
# fillna: 遇到 NaN 填入指定的數字
print(df.fillna(value=0))
'''
A B C D
2017-01-01 0 0.0 2.0 3
2017-01-02 4 5.0 0.0 7
2017-01-03 8 9.0 10.0 11
2017-01-04 12 13.0 14.0 15
2017-01-05 16 17.0 18.0 19
2017-01-06 20 21.0 22.0 23
'''
```
- **isnull 查看資料是否為空值 NaN**
```python=68
# isnull 查看資料格是否是 NaN
print(df.isnull())
'''
A B C D
2017-01-01 False True False False
2017-01-02 False False True False
2017-01-03 False False False False
2017-01-04 False False False False
2017-01-05 False False False False
2017-01-06 False False False False
'''
# isnull 的應用:查看整個資料表是否有空值 NaN
print(np.any(df.isnull())==True) # output: True
'''
True 代表 df.isnull 當中有 True --> 意思是資料表中有 NaN (丟失的數據)
'''
```
## pandas 讀寫資料檔案
首先準備一個 student.csv 檔
裡面資料長這樣
```
Student ID,name ,age,gender
1100,Kelly,22,Female
1101,Clo,21,Female
1102,Tilly,22,Female
1103,Tony,24,Male
1104,David,20,Male
1105,Catty,22,Female
1106,M,3,Female
1107,N,43,Male
1108,A,13,Male
1109,S,12,Male
1110,David,33,Male
1111,Dw,3,Female
1112,Q,23,Male
1113,W,21,Female
```
pandas 可以讀取的檔案格式詳情:
https://pandas.pydata.org/pandas-docs/stable/api.html#input-output
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
# 讀取
data = pd.read_csv('student.csv')
print(data)
'''
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 N 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
'''
# 存檔
data.to_pickle('student.pickle')
```
## pandas concatenating 合併資料
- **concat 設定 axis=0 為直向合併**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])
print(df1)
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
'''
print(df2)
'''
a b c d
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
'''
print(df3)
'''
a b c d
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
'''
# 使用 concat 合併 axis=0 為直向合併
res = pd.concat([df1,df2,df3],axis=0)
print(res)
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
0 1.0 1.0 1.0 1.0
1 1.0 1.0 1.0 1.0
2 1.0 1.0 1.0 1.0
0 2.0 2.0 2.0 2.0
1 2.0 2.0 2.0 2.0
2 2.0 2.0 2.0 2.0
'''
```
- **ignore_index = True 可以忽略合併時舊的 index 欄位,改採用自動產生的 index**
```python=46
# 目前上面結果的問題是,左側的 index 在合併時不連續,我們必須要設定 ignore index 才能解掉
res = pd.concat([df1,df2,df3],axis=0, ignore_index=True)
print(res)
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 2.0 2.0 2.0 2.0
7 2.0 2.0 2.0 2.0
8 2.0 2.0 2.0 2.0
'''
```
- **concat 的 join 屬性有兩種模式 inner, outer(預設)**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
# concat 使用 join 設定
# join 有兩種模式,分別為 inner, outer
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*0, columns=['b','c','d','e'],index=[2,3,4])
print(df1)
'''
a b c d
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
'''
print(df2)
'''
b c d e
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0
'''
# 使用 concat 合併時,他預設的 join 模式是 'outer',會直接把沒有的資料用 NaN 代替
res = pd.concat([df1,df2]) # 這兩行程式是全等的
res = pd.concat([df1,df2], join='outer') # 這兩行程式是全等的
print(res)
'''
a b c d e
1 0.0 0.0 0.0 0.0 NaN
2 0.0 0.0 0.0 0.0 NaN
3 0.0 0.0 0.0 0.0 NaN
2 NaN 0.0 0.0 0.0 0.0
3 NaN 0.0 0.0 0.0 0.0
4 NaN 0.0 0.0 0.0 0.0
'''
# 使用 concat 的 join 模式為 'inner',會直接把沒有完整資料的刪除掉
res = pd.concat([df1,df2], join='inner', ignore_index=True)
print(res)
'''
b c d
0 0.0 0.0 0.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 0.0
4 0.0 0.0 0.0
5 0.0 0.0 0.0
'''
```
- **concat 的 join_axes 功能,用於水瓶合併時指定 index 參考者**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
# concat 的 join_axes 功能
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'],index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*0, columns=['b','c','d','e'],index=[2,3,4])
# 設定左右合併 axis=1, join_axes 設定成按照 df1 的 index 來進行合併
res = pd.concat([df1,df2],axis=1, join_axes=[df1.index])
print(res)
'''
a b c d b c d e
1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
'''
```
- **使用 DataFrame append 來合併資料,新增資料**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
# concat 的 append 功能
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
# append 預設是往下加
res = df1.append(df2, ignore_index=True)
print(res)
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
'''
# append 多個
df3 = pd.DataFrame(np.ones((3,4))*3, columns=['a','b','c','d'])
res = df1.append([df2,df3], ignore_index=True)
print(res)
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 1.0 1.0 1.0
4 1.0 1.0 1.0 1.0
5 1.0 1.0 1.0 1.0
6 3.0 3.0 3.0 3.0
7 3.0 3.0 3.0 3.0
8 3.0 3.0 3.0 3.0
'''
# 直接 append 一組資料
s1 = pd.Series([1,2,3,4],index=['a','b','c','d'])
res = df1.append(s1, ignore_index=True)
print(res)
'''
a b c d
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 1.0 2.0 3.0 4.0
'''
```
## pandas merge 合併資料
- **merge by 一個 key**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
left = pd.DataFrame({
'key':['K0','K1','K2','K3'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({
'key':['K0','K1','K2','K3'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
print(left)
'''
A B key
0 A0 B0 K0
1 A1 B1 K1
2 A2 B2 K2
3 A3 B3 K3
'''
print(right)
'''
C D key
0 C0 D0 K0
1 C1 D1 K1
2 C2 D2 K2
3 C3 D3 K3
'''
# 目標,基於 key 把 left 與 right 合併
# 使用 merge
res = pd.merge(left,right, on='key')
print(res)
'''
A B key C D
0 A0 B0 K0 C0 D0
1 A1 B1 K1 C1 D1
2 A2 B2 K2 C2 D2
3 A3 B3 K3 C3 D3
'''
```
- **merge by 多個 key**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
left = pd.DataFrame({
'key1':['K0','K0','K1','K2'],
'key2':['K0','K1','K0','K1'],
'A':['A0','A1','A2','A3'],
'B':['B0','B1','B2','B3']})
right = pd.DataFrame({
'key1':['K0','K1','K1','K2'],
'key2':['K0','K0','K0','K0'],
'C':['C0','C1','C2','C3'],
'D':['D0','D1','D2','D3']})
print(left)
'''
A B key1 key2
0 A0 B0 K0 K0
1 A1 B1 K0 K1
2 A2 B2 K1 K0
3 A3 B3 K2 K1
'''
print(right)
'''
C D key1 key2
0 C0 D0 K0 K0
1 C1 D1 K1 K0
2 C2 D2 K1 K0
3 C3 D3 K2 K0
'''
```
- **inner 模式**
```python=34
# 目標,基於 key1, key2 把 left 與 right 合併
# 使用 merge 同時合併 by 多個 key 預設為 how='inner' 模式
res = pd.merge(left,right, on=['key1','key2']) # 這兩行效果一樣
res = pd.merge(left,right, on=['key1','key2'],how='inner') # 這兩行效果一樣
print(res)
''' 合併後,他預設只把相同的部分留下來 (inner 模式)
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
'''
```
- **outer 模式**
```python=46
# 使用 merge 同時合併 by 多個 key, how='outer' 模式
res = pd.merge(left,right, on=['key1','key2'],how='outer')
print(res)
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
5 NaN NaN K2 K0 C3 D3
'''
```
- **right 模式**
```python=58
# 使用 merge 同時合併 by 多個 key, how='right' 模式
res = pd.merge(left,right, on=['key1','key2'],how='right')
print(res)
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
3 NaN NaN K2 K0 C3 D3
'''
```
- **left 模式**
```python=68
# 使用 merge 同時合併 by 多個 key, how='left' 模式
res = pd.merge(left,right, on=['key1','key2'],how='left')
print(res)
'''
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A1 B1 K0 K1 NaN NaN
2 A2 B2 K1 K0 C1 D1
3 A2 B2 K1 K0 C2 D2
4 A3 B3 K2 K1 NaN NaN
'''
```
- **使用 indicator 顯示 merge 的 mode**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']})
df2 = pd.DataFrame({'col1':[1,2,3], 'col_right':[2,2,2]})
print(df1)
'''
col1 col_left
0 0 a
1 1 b
'''
print(df2)
'''
col1 col_right
0 1 2
1 2 2
2 3 2
'''
# 使用 indicator 可以顯示 merge 的方式
res = pd.merge(df1,df2, on='col1', how='outer',indicator=True)
print(res)
'''
col1 col_left col_right _merge
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 3 NaN 2.0 right_only
'''
```
- **設定 indicator 欄位的名字**
```python=34
# 設定 indicator 欄位的名字
res = pd.merge(df1,df2, on='col1', how='outer',indicator='indicator_column')
print(res)
'''
col1 col_left col_right indicator_column
0 0 a NaN left_only
1 1 b 2.0 both
2 2 NaN 2.0 right_only
3 3 NaN 2.0 right_only
'''
```
- **merge by index**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
# merge by index
left = pd.DataFrame({
'A':['A0','A1','A2'],
'B':['B0','B1','B2']},
index=['K0','K1','K2'])
right = pd.DataFrame({
'C':['C0','C2','C3'],
'D':['D0','D2','D3']},
index=['K0','K2','K3'])
print(left)
'''
A B
K0 A0 B0
K1 A1 B1
K2 A2 B2
'''
print(right)
'''
C D
K0 C0 D0
K2 C2 D2
K3 C3 D3
'''
res = pd.merge(left,right, left_index=True, right_index=True, how='outer')
print(res)
'''
A B C D
K0 A0 B0 C0 D0
K1 A1 B1 NaN NaN
K2 A2 B2 C2 D2
K3 NaN NaN C3 D3
'''
```
- **merge 合併時,處理欄位名稱相同衝突,以 suffixes 區別**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
# 處理 overlapping
boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]})
girls = pd.DataFrame({'k':['K0','K0','K3'],'age':[4,5,6]})
print(boys)
'''
age k
0 1 K0
1 2 K1
2 3 K2
'''
print(girls)
'''
age k
0 4 K0
1 5 K0
2 6 K3
'''
# 目前 age 欄位是重複的,我們為了要區別 boy 與 girl,必須要在新的合併表格中,為 age 欄位取新的名字
# 使用 suffixes 屬性即可辦到
res = pd.merge(boys,girls, on='k', suffixes=['_boy','_girl'], how='outer')
print(res)
'''
age_boy k age_girl
0 1.0 K0 4.0
1 1.0 K0 5.0
2 2.0 K1 NaN
3 3.0 K2 NaN
4 NaN K3 6.0
'''
```
## pandas plot
- **plot 基本用法 畫出 Series 數據**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# plot data
# Series 線性數據
data = pd.Series(np.random.randn(1000), index=np.arange(1000))
data = data.cumsum() # 做累加
data.plot()
plt.show()
```
- **plot 畫出 DataFrame 四個數據**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# DataFrame
data = pd.DataFrame(np.random.randn(1000,4),
index=np.arange(1000),
columns=list("ABCD"))
data = data.cumsum()
print(data.head(5)) # 印出前五個數據
'''
A B C D
0 -0.056981 -0.167990 -0.103564 -0.807399
1 1.008049 -1.633926 0.959755 0.405345
2 0.821038 -3.090023 2.821623 -0.880397
3 0.448243 -4.889474 4.477471 -1.378809
4 -1.453623 -3.347546 5.371346 -2.983690
'''
data.plot()
plt.show()
```
- **scatter 用法**
```python=1
#coding=utf-8
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# plot 的方法種類:
# 'bar', 'hist', 'box', 'kde', 'area', 'scatter', 'hexbin', 'pie'
# DataFrame
data = pd.DataFrame(np.random.randn(1000,4),
index=np.arange(1000),
columns=list("ABCD"))
data = data.cumsum()
print(data.head(5)) # 印出前五個數據
'''
A B C D
0 1.149112 -1.189742 -1.108183 -1.276239
1 0.889289 -0.979980 -0.821403 0.726542
2 1.219525 -0.753984 0.279848 1.686624
3 1.006253 -0.191323 0.595033 0.578449
4 0.710900 -0.820767 0.064716 1.539593
'''
# 使用 scatter
ax = data.plot.scatter(x='A',y='B', color='Red',label='Class 1') # 取 A 欄位 B 欄位
data.plot.scatter(x='A', y='C', color='Green', label='Class 2', ax=ax) # 取 A 欄位 C 欄位
plt.show()
```