numpy & pandas 筆記 (2)

*本篇資料來源為莫煩 python:
https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/

pandas 跟 numpy 的使用情境:
numpy 用來替代 python 當中的 list (陣列,矩陣)
pandas 用來替代 python 當中的 dictionary (字典)

pandas 基本功能

  • Pandas Series
    會自動加上字典的 index

    ​​​​#coding=utf-8 ​​​​import numpy as np ​​​​import pandas as pd ​​​​s = pd.Series([1,3,6, np.nan, 44, 1]) ​​​​print(s) ​​​​''' ​​​​0 1.0 ​​​​1 3.0 ​​​​2 6.0 ​​​​3 NaN ​​​​4 44.0 ​​​​5 1.0 ​​​​dtype: float64 ​​​​'''
  • Pandas date_range

    ​​​​#coding=utf-8 ​​​​import numpy as np ​​​​import pandas as pd ​​​​dates = pd.date_range('20170101',periods=6) ​​​​print(dates) ​​​​''' ​​​​DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04', ​​​​ '2017-01-05', '2017-01-06'], ​​​​ dtype='datetime64[ns]', freq='D') ​​​​'''
  • Pandas DataFrame 基本性質

    ​​​​#coding=utf-8 ​​​​import numpy as np ​​​​import pandas as pd ​​​​# 宣告 data frame ​​​​df=pd.DataFrame(np.random.randn(6,4), index=dates, columns=['a','b','c','d']) ​​​​print(df) ​​​​''' ​​​​ a b c d ​​​​2017-01-01 1.159450 -1.578301 -0.652926 0.375754 ​​​​2017-01-02 0.393195 -0.346388 -0.754170 -0.008133 ​​​​2017-01-03 1.192782 0.804110 -0.842838 0.012096 ​​​​2017-01-04 -1.058825 -0.323025 0.722646 -1.166700 ​​​​2017-01-05 -1.180235 -0.450869 1.203138 -0.815103 ​​​​2017-01-06 -1.722095 0.304247 0.175714 -1.654301 ​​​​''' ​​​​# 預設的 index 為 0 1 2 3 ... ​​​​df1=pd.DataFrame(np.arange(12).reshape((3,4))) ​​​​print(df1) ​​​​''' ​​​​ 0 1 2 3 ​​​​0 0 1 2 3 ​​​​1 4 5 6 7 ​​​​2 8 9 10 11 ​​​​'''
  • Pandas DataFrame 用法:使用 dictionary 宣告

    ​​​​#coding=utf-8 ​​​​import numpy as np ​​​​import pandas as pd ​​​​# 使用 dictionary 定義 data frame ​​​​df2=pd.DataFrame({'A':1., ​​​​ 'B':pd.Timestamp('20170101'), ​​​​ 'C':pd.Series(1, index=list(range(4)),dtype='float32'), ​​​​ 'D':np.array([3]*4, dtype='int32'), ​​​​ 'E':pd.Categorical(["test","train","test","train"]), ​​​​ 'F':'foo'}) ​​​​print(df2) ​​​​''' ​​​​ A B C D E F ​​​​0 1.0 2017-01-01 1.0 3 test foo ​​​​1 1.0 2017-01-01 1.0 3 train foo ​​​​2 1.0 2017-01-01 1.0 3 test foo ​​​​3 1.0 2017-01-01 1.0 3 train foo ​​​​'''
  • Pandas DataFrame 使用 dtype 查看資料型態

    ​​​​# 查看資料格式 ​​​​print(df2.dtypes) ​​​​''' ​​​​A float64 ​​​​B datetime64[ns] ​​​​C float32 ​​​​D int32 ​​​​E category ​​​​F object ​​​​dtype: object ​​​​'''
  • Pandas DataFrame 查看 index 與資料數值

    ​​​​# 輸出所有 row index ​​​​print(df2.index) ​​​​''' ​​​​Int64Index([0, 1, 2, 3], dtype='int64') ​​​​''' ​​​​# 輸出所有 column ​​​​print(df2.columns) ​​​​''' ​​​​Index([u'A', u'B', u'C', u'D', u'E', u'F'], dtype='object') ​​​​''' ​​​​# 輸出所有值 values ​​​​print(df2.values) ​​​​''' ​​​​[[1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'test' 'foo'] ​​​​ [1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'train' 'foo'] ​​​​ [1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'test' 'foo'] ​​​​ [1.0 Timestamp('2017-01-01 00:00:00') 1.0 3 'train' 'foo']] ​​​​'''
  • Pandas DataFrame 使用 describe 查看數字資料描述

    ​​​​# 輸出 describe 查看數字形式的特性 ​​​​print(df2.describe()) ​​​​''' ​​​​ A C D ​​​​count 4.0 4.0 4.0 ​​​​mean 1.0 1.0 3.0 ​​​​std 0.0 0.0 0.0 ​​​​min 1.0 1.0 3.0 ​​​​25% 1.0 1.0 3.0 ​​​​50% 1.0 1.0 3.0 ​​​​75% 1.0 1.0 3.0 ​​​​max 1.0 1.0 3.0 ​​​​'''
  • Pandas DataFrame 轉置資料

    ​​​​# transpose 轉置矩陣 ​​​​print(df2.T) ​​​​''' ​​​​ 0 1 2 3 ​​​​A 1 1 1 1 ​​​​B 2017-01-01 00:00:00 2017-01-01 00:00:00 2017-01-01 00:00:00 2017-01-01 00:00:00 ​​​​C 1 1 1 1 ​​​​D 3 3 3 3 ​​​​E test train test train ​​​​F foo foo foo foo ​​​​'''
  • Pandas DataFrame 排序資料 sort_index 與 sort_values

    ​​​​# 排序,對於 axis=1 以 row 方向排序 ascending=False 倒排序 ​​​​print(df2.sort_index(axis=1, ascending=False)) ​​​​''' 排序結果: F E D C B A ​​​​ F E D C B A ​​​​0 foo test 3 1.0 2017-01-01 1.0 ​​​​1 foo train 3 1.0 2017-01-01 1.0 ​​​​2 foo test 3 1.0 2017-01-01 1.0 ​​​​3 foo train 3 1.0 2017-01-01 1.0 ​​​​''' ​​​​# 排序,對於 axis=0 以 col 方向反向排序 (ascending=Fals) ​​​​print(df2.sort_index(axis=0, ascending=False)) ​​​​''' 排序結果: 3 2 1 0 ​​​​ A B C D E F ​​​​3 1.0 2017-01-01 1.0 3 train foo ​​​​2 1.0 2017-01-01 1.0 3 test foo ​​​​1 1.0 2017-01-01 1.0 3 train foo ​​​​0 1.0 2017-01-01 1.0 3 test foo ​​​​''' ​​​​# 排序 sort_values 針對單行的值進行排序 ​​​​print(df2.sort_values(by='E')) ​​​​''' 排序結果: test train 兩倆排在一起 ​​​​ A B C D E F ​​​​0 1.0 2017-01-01 1.0 3 test foo ​​​​2 1.0 2017-01-01 1.0 3 test foo ​​​​1 1.0 2017-01-01 1.0 3 train foo ​​​​3 1.0 2017-01-01 1.0 3 train foo ​​​​'''

pandas indexing

  • data frame 基本 indexing 方法

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​dates = pd.date_range('2017-01-01',periods=6) ​​​​df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) ​​​​print(df['A']) ​​​​''' ​​​​2017-01-01 0 ​​​​2017-01-02 4 ​​​​2017-01-03 8 ​​​​2017-01-04 12 ​​​​2017-01-05 16 ​​​​2017-01-06 20 ​​​​Freq: D, Name: A, dtype: int64 ​​​​''' ​​​​print(df.A) ​​​​''' ​​​​2017-01-01 0 ​​​​2017-01-02 4 ​​​​2017-01-03 8 ​​​​2017-01-04 12 ​​​​2017-01-05 16 ​​​​2017-01-06 20 ​​​​Freq: D, Name: A, dtype: int64 ​​​​''' ​​​​print(df[0:3]) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 6 7 ​​​​2017-01-03 8 9 10 11 ​​​​''' ​​​​print(df['2017-01-02':'2017-01-04']) ​​​​''' ​​​​ A B C D ​​​​2017-01-02 4 5 6 7 ​​​​2017-01-03 8 9 10 11 ​​​​2017-01-04 12 13 14 15 ​​​​'''
  • data frame location 用法 loc

    ​​​​# 使用 data frame loc 來選擇查看數據 ​​​​# 查看指定的 row 的內容 ​​​​print(df.loc['2017-01-02']) ​​​​''' ​​​​A 4 ​​​​B 5 ​​​​C 6 ​​​​D 7 ​​​​Name: 2017-01-02 00:00:00, dtype: int64 ​​​​''' ​​​​# 篩選 col 的指令欄位 row 部分全部印出 ​​​​print(df.loc[:,['A','B']]) ​​​​''' ​​​​ A B ​​​​2017-01-01 0 1 ​​​​2017-01-02 4 5 ​​​​2017-01-03 8 9 ​​​​2017-01-04 12 13 ​​​​2017-01-05 16 17 ​​​​2017-01-06 20 21 ​​​​''' ​​​​# 指定 row 指定 col ​​​​print(df.loc['2017-01-03',['A','C']]) ​​​​''' ​​​​A 8 ​​​​C 10 ​​​​Name: 2017-01-03 00:00:00, dtype: int64 ​​​​'''
  • select py position 使用 iloc

    ​​​​# 使用 index 數字來指定 select by position: iloc ​​​​print(df.iloc[3]) ​​​​''' ​​​​A 12 ​​​​B 13 ​​​​C 14 ​​​​D 15 ​​​​Name: 2017-01-04 00:00:00, dtype: int64 ​​​​''' ​​​​# 第 3 row, 第 1 col ​​​​print(df.iloc[3,1]) # 13 ​​​​# 第 3~5 row, 第 1~3 col ​​​​print(df.iloc[3:5,1:3]) ​​​​''' ​​​​ B C ​​​​2017-01-04 13 14 ​​​​2017-01-05 17 18 ​​​​'''
  • 混合篩選 使用 ix

    ​​​​# 混合篩選 使用 ix ​​​​print(df.ix[:3, ['A','C']]) ​​​​''' ​​​​ A C ​​​​2017-01-01 0 2 ​​​​2017-01-02 4 6 ​​​​2017-01-03 8 10 ​​​​'''
  • Booling indexing

    ​​​​# Booling indexing ​​​​print(df) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 6 7 ​​​​2017-01-03 8 9 10 11 ​​​​2017-01-04 12 13 14 15 ​​​​2017-01-05 16 17 18 19 ​​​​2017-01-06 20 21 22 23 ​​​​''' ​​​​# 針對 A 欄位資料 > 8 的篩選出來 ​​​​print(df[df.A > 8]) ​​​​''' ​​​​ A B C D ​​​​2017-01-04 12 13 14 15 ​​​​2017-01-05 16 17 18 19 ​​​​2017-01-06 20 21 22 23 ​​​​'''

pandas 設值

  • 在資料表格中根據 index 設定值

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​dates = pd.date_range('2017-01-01',periods=6) ​​​​df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) ​​​​print(df) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 6 7 ​​​​2017-01-03 8 9 10 11 ​​​​2017-01-04 12 13 14 15 ​​​​2017-01-05 16 17 18 19 ​​​​2017-01-06 20 21 22 23 ​​​​''' ​​​​df.iloc[2,2] = 1111 ​​​​print(df) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 6 7 ​​​​2017-01-03 8 9 1111 11 ​​​​2017-01-04 12 13 14 15 ​​​​2017-01-05 16 17 18 19 ​​​​2017-01-06 20 21 22 23 ​​​​''' ​​​​df.loc['2017-01-02','C'] = 222 ​​​​print(df) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 222 7 ​​​​2017-01-03 8 9 1111 11 ​​​​2017-01-04 12 13 14 15 ​​​​2017-01-05 16 17 18 19 ​​​​2017-01-06 20 21 22 23 ​​​​'''
  • 設定符合條件,整欄位修改

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​dates = pd.date_range('2017-01-01',periods=6) ​​​​df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) ​​​​# 全部都改到 ​​​​df[df.A>8] = 0 ​​​​print(df) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 222 7 ​​​​2017-01-03 8 9 1111 11 ​​​​2017-01-04 0 0 0 0 ​​​​2017-01-05 0 0 0 0 ​​​​2017-01-06 0 0 0 0 ​​​​'''
  • 設定符合條件,指定特定欄位才修改

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​dates = pd.date_range('2017-01-01',periods=6) ​​​​df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) ​​​​# 指定特定欄位修改 ​​​​df.A[df.A>8] = 0 ​​​​print(df) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 222 7 ​​​​2017-01-03 8 9 1111 11 ​​​​2017-01-04 0 13 14 15 ​​​​2017-01-05 0 17 18 19 ​​​​2017-01-06 0 21 22 23 ​​​​''' ​​​​dates = pd.date_range('2017-01-01',periods=6) ​​​​df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) ​​​​# 指定特定欄位修改 ​​​​df.B[df.A>8] = 0 ​​​​print(df) ​​​​''' ​​​​ A B C D ​​​​2017-01-01 0 1 2 3 ​​​​2017-01-02 4 5 6 7 ​​​​2017-01-03 8 9 10 11 ​​​​2017-01-04 12 0 14 15 ​​​​2017-01-05 16 0 18 19 ​​​​2017-01-06 20 0 22 23 ​​​​'''
  • 新增欄位並設定值

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​dates = pd.date_range('2017-01-01',periods=6) ​​​​df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) ​​​​# 新增欄位 ​​​​df['E'] = pd.Series([1,2,3,4,5,6],index=pd.date_range('2017-01-01',periods=6)) ​​​​df['F'] = np.nan ​​​​print(df) ​​​​''' ​​​​ A B C D E F ​​​​2017-01-01 0 1 2 3 1 NaN ​​​​2017-01-02 4 5 6 7 2 NaN ​​​​2017-01-03 8 9 10 11 3 NaN ​​​​2017-01-04 12 13 14 15 4 NaN ​​​​2017-01-05 16 17 18 19 5 NaN ​​​​2017-01-06 20 21 22 23 6 NaN ​​​​'''

pandas 處理資料表中的 "空值"

  • dropna 丟掉整欄/整列資料
#coding=utf-8 import pandas as pd import numpy as np dates = pd.date_range('2017-01-01',periods=6) df = pd.DataFrame(np.arange(24).reshape((6,4)), index=dates, columns=['A','B','C','D']) print(df) ''' A B C D 2017-01-01 0 1 2 3 2017-01-02 4 5 6 7 2017-01-03 8 9 10 11 2017-01-04 12 13 14 15 2017-01-05 16 17 18 19 2017-01-06 20 21 22 23 ''' # 設定兩個空值 df.iloc[0,1] = np.nan df.iloc[1,2] = np.nan print(df) ''' A B C D 2017-01-01 0 NaN 2.0 3 2017-01-02 4 5.0 NaN 7 2017-01-03 8 9.0 10.0 11 2017-01-04 12 13.0 14.0 15 2017-01-05 16 17.0 18.0 19 2017-01-06 20 21.0 22.0 23 ''' # dropna: how = any 偵測 row (axis=0) 當中有 nan 的就整個丟掉 print(df.dropna(axis=0, how='any')) # how = 'any' 'all' ''' A B C D 2017-01-03 8 9.0 10.0 11 2017-01-04 12 13.0 14.0 15 2017-01-05 16 17.0 18.0 19 2017-01-06 20 21.0 22.0 23 ''' # dropna: how = any 偵測 col (axis=1) 當中有 nan 的就整個丟掉 print(df.dropna(axis=1, how='any')) # how = 'any' 'all' ''' A D 2017-01-01 0 3 2017-01-02 4 7 2017-01-03 8 11 2017-01-04 12 15 2017-01-05 16 19 2017-01-06 20 23 ''' # p.s. 如果 how = all 的話,要整排 NaN 才會丟掉
  • fillna 在 NaN 空格中填入指定的值 value
# fillna: 遇到 NaN 填入指定的數字 print(df.fillna(value=0)) ''' A B C D 2017-01-01 0 0.0 2.0 3 2017-01-02 4 5.0 0.0 7 2017-01-03 8 9.0 10.0 11 2017-01-04 12 13.0 14.0 15 2017-01-05 16 17.0 18.0 19 2017-01-06 20 21.0 22.0 23 '''
  • isnull 查看資料是否為空值 NaN
# isnull 查看資料格是否是 NaN print(df.isnull()) ''' A B C D 2017-01-01 False True False False 2017-01-02 False False True False 2017-01-03 False False False False 2017-01-04 False False False False 2017-01-05 False False False False 2017-01-06 False False False False ''' # isnull 的應用:查看整個資料表是否有空值 NaN print(np.any(df.isnull())==True) # output: True ''' True 代表 df.isnull 當中有 True --> 意思是資料表中有 NaN (丟失的數據) '''

pandas 讀寫資料檔案

首先準備一個 student.csv 檔
裡面資料長這樣

Student ID,name ,age,gender
1100,Kelly,22,Female
1101,Clo,21,Female
1102,Tilly,22,Female
1103,Tony,24,Male
1104,David,20,Male
1105,Catty,22,Female
1106,M,3,Female
1107,N,43,Male
1108,A,13,Male
1109,S,12,Male
1110,David,33,Male
1111,Dw,3,Female
1112,Q,23,Male
1113,W,21,Female

pandas 可以讀取的檔案格式詳情:
https://pandas.pydata.org/pandas-docs/stable/api.html#input-output

#coding=utf-8 import pandas as pd import numpy as np # 讀取 data = pd.read_csv('student.csv') print(data) ''' Student ID name age gender 0 1100 Kelly 22 Female 1 1101 Clo 21 Female 2 1102 Tilly 22 Female 3 1103 Tony 24 Male 4 1104 David 20 Male 5 1105 Catty 22 Female 6 1106 M 3 Female 7 1107 N 43 Male 8 1108 A 13 Male 9 1109 S 12 Male 10 1110 David 33 Male 11 1111 Dw 3 Female 12 1112 Q 23 Male 13 1113 W 21 Female ''' # 存檔 data.to_pickle('student.pickle')

pandas concatenating 合併資料

  • concat 設定 axis=0 為直向合併
    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d']) ​​​​df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) ​​​​df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d']) ​​​​print(df1) ​​​​''' ​​​​ a b c d ​​​​0 0.0 0.0 0.0 0.0 ​​​​1 0.0 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 0.0 ​​​​''' ​​​​print(df2) ​​​​''' ​​​​ a b c d ​​​​0 1.0 1.0 1.0 1.0 ​​​​1 1.0 1.0 1.0 1.0 ​​​​2 1.0 1.0 1.0 1.0 ​​​​''' ​​​​print(df3) ​​​​''' ​​​​ a b c d ​​​​0 2.0 2.0 2.0 2.0 ​​​​1 2.0 2.0 2.0 2.0 ​​​​2 2.0 2.0 2.0 2.0 ​​​​''' ​​​​# 使用 concat 合併 axis=0 為直向合併 ​​​​res = pd.concat([df1,df2,df3],axis=0) ​​​​print(res) ​​​​''' ​​​​ a b c d ​​​​0 0.0 0.0 0.0 0.0 ​​​​1 0.0 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 0.0 ​​​​0 1.0 1.0 1.0 1.0 ​​​​1 1.0 1.0 1.0 1.0 ​​​​2 1.0 1.0 1.0 1.0 ​​​​0 2.0 2.0 2.0 2.0 ​​​​1 2.0 2.0 2.0 2.0 ​​​​2 2.0 2.0 2.0 2.0 ​​​​'''
  • ignore_index = True 可以忽略合併時舊的 index 欄位,改採用自動產生的 index
    ​​​​# 目前上面結果的問題是,左側的 index 在合併時不連續,我們必須要設定 ignore index 才能解掉 ​​​​res = pd.concat([df1,df2,df3],axis=0, ignore_index=True) ​​​​print(res) ​​​​''' ​​​​ a b c d ​​​​0 0.0 0.0 0.0 0.0 ​​​​1 0.0 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 0.0 ​​​​3 1.0 1.0 1.0 1.0 ​​​​4 1.0 1.0 1.0 1.0 ​​​​5 1.0 1.0 1.0 1.0 ​​​​6 2.0 2.0 2.0 2.0 ​​​​7 2.0 2.0 2.0 2.0 ​​​​8 2.0 2.0 2.0 2.0 ​​​​'''
  • concat 的 join 屬性有兩種模式 inner, outer(預設)
    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​# concat 使用 join 設定 ​​​​# join 有兩種模式,分別為 inner, outer ​​​​df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'],index=[1,2,3]) ​​​​df2 = pd.DataFrame(np.ones((3,4))*0, columns=['b','c','d','e'],index=[2,3,4]) ​​​​print(df1) ​​​​''' ​​​​ a b c d ​​​​1 0.0 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 0.0 ​​​​3 0.0 0.0 0.0 0.0 ​​​​''' ​​​​print(df2) ​​​​''' ​​​​ b c d e ​​​​2 0.0 0.0 0.0 0.0 ​​​​3 0.0 0.0 0.0 0.0 ​​​​4 0.0 0.0 0.0 0.0 ​​​​''' ​​​​# 使用 concat 合併時,他預設的 join 模式是 'outer',會直接把沒有的資料用 NaN 代替 ​​​​res = pd.concat([df1,df2]) # 這兩行程式是全等的 ​​​​res = pd.concat([df1,df2], join='outer') # 這兩行程式是全等的 ​​​​print(res) ​​​​''' ​​​​ a b c d e ​​​​1 0.0 0.0 0.0 0.0 NaN ​​​​2 0.0 0.0 0.0 0.0 NaN ​​​​3 0.0 0.0 0.0 0.0 NaN ​​​​2 NaN 0.0 0.0 0.0 0.0 ​​​​3 NaN 0.0 0.0 0.0 0.0 ​​​​4 NaN 0.0 0.0 0.0 0.0 ​​​​''' ​​​​# 使用 concat 的 join 模式為 'inner',會直接把沒有完整資料的刪除掉 ​​​​res = pd.concat([df1,df2], join='inner', ignore_index=True) ​​​​print(res) ​​​​''' ​​​​ b c d ​​​​0 0.0 0.0 0.0 ​​​​1 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 ​​​​3 0.0 0.0 0.0 ​​​​4 0.0 0.0 0.0 ​​​​5 0.0 0.0 0.0 ​​​​'''
  • concat 的 join_axes 功能,用於水瓶合併時指定 index 參考者
    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​# concat 的 join_axes 功能 ​​​​df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'],index=[1,2,3]) ​​​​df2 = pd.DataFrame(np.ones((3,4))*0, columns=['b','c','d','e'],index=[2,3,4]) ​​​​# 設定左右合併 axis=1, join_axes 設定成按照 df1 的 index 來進行合併 ​​​​res = pd.concat([df1,df2],axis=1, join_axes=[df1.index]) ​​​​print(res) ​​​​''' ​​​​ a b c d b c d e ​​​​1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN ​​​​2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ​​​​3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ​​​​'''
  • 使用 DataFrame append 來合併資料,新增資料
    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​# concat 的 append 功能 ​​​​df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d']) ​​​​df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d']) ​​​​# append 預設是往下加 ​​​​res = df1.append(df2, ignore_index=True) ​​​​print(res) ​​​​''' ​​​​ a b c d ​​​​0 0.0 0.0 0.0 0.0 ​​​​1 0.0 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 0.0 ​​​​3 1.0 1.0 1.0 1.0 ​​​​4 1.0 1.0 1.0 1.0 ​​​​5 1.0 1.0 1.0 1.0 ​​​​''' ​​​​# append 多個 ​​​​df3 = pd.DataFrame(np.ones((3,4))*3, columns=['a','b','c','d']) ​​​​res = df1.append([df2,df3], ignore_index=True) ​​​​print(res) ​​​​''' ​​​​ a b c d ​​​​0 0.0 0.0 0.0 0.0 ​​​​1 0.0 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 0.0 ​​​​3 1.0 1.0 1.0 1.0 ​​​​4 1.0 1.0 1.0 1.0 ​​​​5 1.0 1.0 1.0 1.0 ​​​​6 3.0 3.0 3.0 3.0 ​​​​7 3.0 3.0 3.0 3.0 ​​​​8 3.0 3.0 3.0 3.0 ​​​​''' ​​​​# 直接 append 一組資料 ​​​​s1 = pd.Series([1,2,3,4],index=['a','b','c','d']) ​​​​res = df1.append(s1, ignore_index=True) ​​​​print(res) ​​​​''' ​​​​ a b c d ​​​​0 0.0 0.0 0.0 0.0 ​​​​1 0.0 0.0 0.0 0.0 ​​​​2 0.0 0.0 0.0 0.0 ​​​​3 1.0 2.0 3.0 4.0 ​​​​'''

pandas merge 合併資料

  • merge by 一個 key

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​left = pd.DataFrame({ ​​​​ 'key':['K0','K1','K2','K3'], ​​​​ 'A':['A0','A1','A2','A3'], ​​​​ 'B':['B0','B1','B2','B3']}) ​​​​right = pd.DataFrame({ ​​​​ 'key':['K0','K1','K2','K3'], ​​​​ 'C':['C0','C1','C2','C3'], ​​​​ 'D':['D0','D1','D2','D3']}) ​​​​print(left) ​​​​''' ​​​​ A B key ​​​​0 A0 B0 K0 ​​​​1 A1 B1 K1 ​​​​2 A2 B2 K2 ​​​​3 A3 B3 K3 ​​​​''' ​​​​print(right) ​​​​''' ​​​​ C D key ​​​​0 C0 D0 K0 ​​​​1 C1 D1 K1 ​​​​2 C2 D2 K2 ​​​​3 C3 D3 K3 ​​​​''' ​​​​# 目標,基於 key 把 left 與 right 合併 ​​​​# 使用 merge ​​​​res = pd.merge(left,right, on='key') ​​​​print(res) ​​​​''' ​​​​ A B key C D ​​​​0 A0 B0 K0 C0 D0 ​​​​1 A1 B1 K1 C1 D1 ​​​​2 A2 B2 K2 C2 D2 ​​​​3 A3 B3 K3 C3 D3 ​​​​'''
  • merge by 多個 key

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​left = pd.DataFrame({ ​​​​ 'key1':['K0','K0','K1','K2'], ​​​​ 'key2':['K0','K1','K0','K1'], ​​​​ 'A':['A0','A1','A2','A3'], ​​​​ 'B':['B0','B1','B2','B3']}) ​​​​right = pd.DataFrame({ ​​​​ 'key1':['K0','K1','K1','K2'], ​​​​ 'key2':['K0','K0','K0','K0'], ​​​​ 'C':['C0','C1','C2','C3'], ​​​​ 'D':['D0','D1','D2','D3']}) ​​​​print(left) ​​​​''' ​​​​ A B key1 key2 ​​​​0 A0 B0 K0 K0 ​​​​1 A1 B1 K0 K1 ​​​​2 A2 B2 K1 K0 ​​​​3 A3 B3 K2 K1 ​​​​''' ​​​​print(right) ​​​​''' ​​​​ C D key1 key2 ​​​​0 C0 D0 K0 K0 ​​​​1 C1 D1 K1 K0 ​​​​2 C2 D2 K1 K0 ​​​​3 C3 D3 K2 K0 ​​​​''' ​​​​
    • inner 模式
    ​​​​# 目標,基於 key1, key2 把 left 與 right 合併 ​​​​# 使用 merge 同時合併 by 多個 key 預設為 how='inner' 模式 ​​​​res = pd.merge(left,right, on=['key1','key2']) # 這兩行效果一樣 ​​​​res = pd.merge(left,right, on=['key1','key2'],how='inner') # 這兩行效果一樣 ​​​​print(res) ​​​​''' 合併後,他預設只把相同的部分留下來 (inner 模式) ​​​​ A B key1 key2 C D ​​​​0 A0 B0 K0 K0 C0 D0 ​​​​1 A2 B2 K1 K0 C1 D1 ​​​​2 A2 B2 K1 K0 C2 D2 ​​​​'''
    • outer 模式
    ​​​​# 使用 merge 同時合併 by 多個 key, how='outer' 模式 ​​​​res = pd.merge(left,right, on=['key1','key2'],how='outer') ​​​​print(res) ​​​​''' ​​​​ A B key1 key2 C D ​​​​0 A0 B0 K0 K0 C0 D0 ​​​​1 A1 B1 K0 K1 NaN NaN ​​​​2 A2 B2 K1 K0 C1 D1 ​​​​3 A2 B2 K1 K0 C2 D2 ​​​​4 A3 B3 K2 K1 NaN NaN ​​​​5 NaN NaN K2 K0 C3 D3 ​​​​'''
    • right 模式
    ​​​​# 使用 merge 同時合併 by 多個 key, how='right' 模式 ​​​​res = pd.merge(left,right, on=['key1','key2'],how='right') ​​​​print(res) ​​​​''' ​​​​ A B key1 key2 C D ​​​​0 A0 B0 K0 K0 C0 D0 ​​​​1 A2 B2 K1 K0 C1 D1 ​​​​2 A2 B2 K1 K0 C2 D2 ​​​​3 NaN NaN K2 K0 C3 D3 ​​​​'''
    • left 模式
    ​​​​# 使用 merge 同時合併 by 多個 key, how='left' 模式 ​​​​res = pd.merge(left,right, on=['key1','key2'],how='left') ​​​​print(res) ​​​​''' ​​​​ A B key1 key2 C D ​​​​0 A0 B0 K0 K0 C0 D0 ​​​​1 A1 B1 K0 K1 NaN NaN ​​​​2 A2 B2 K1 K0 C1 D1 ​​​​3 A2 B2 K1 K0 C2 D2 ​​​​4 A3 B3 K2 K1 NaN NaN ​​​​'''
  • 使用 indicator 顯示 merge 的 mode

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​df1 = pd.DataFrame({'col1':[0,1], 'col_left':['a','b']}) ​​​​df2 = pd.DataFrame({'col1':[1,2,3], 'col_right':[2,2,2]}) ​​​​print(df1) ​​​​''' ​​​​ col1 col_left ​​​​0 0 a ​​​​1 1 b ​​​​''' ​​​​print(df2) ​​​​''' ​​​​ col1 col_right ​​​​0 1 2 ​​​​1 2 2 ​​​​2 3 2 ​​​​''' ​​​​# 使用 indicator 可以顯示 merge 的方式 ​​​​res = pd.merge(df1,df2, on='col1', how='outer',indicator=True) ​​​​print(res) ​​​​''' ​​​​ col1 col_left col_right _merge ​​​​0 0 a NaN left_only ​​​​1 1 b 2.0 both ​​​​2 2 NaN 2.0 right_only ​​​​3 3 NaN 2.0 right_only ​​​​'''
  • 設定 indicator 欄位的名字

    ​​​​# 設定 indicator 欄位的名字 ​​​​res = pd.merge(df1,df2, on='col1', how='outer',indicator='indicator_column') ​​​​print(res) ​​​​''' ​​​​ col1 col_left col_right indicator_column ​​​​0 0 a NaN left_only ​​​​1 1 b 2.0 both ​​​​2 2 NaN 2.0 right_only ​​​​3 3 NaN 2.0 right_only ​​​​'''
  • merge by index

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​# merge by index ​​​​left = pd.DataFrame({ ​​​​ 'A':['A0','A1','A2'], ​​​​ 'B':['B0','B1','B2']}, ​​​​ index=['K0','K1','K2']) ​​​​right = pd.DataFrame({ ​​​​ 'C':['C0','C2','C3'], ​​​​ 'D':['D0','D2','D3']}, ​​​​ index=['K0','K2','K3']) ​​​​print(left) ​​​​''' ​​​​ A B ​​​​K0 A0 B0 ​​​​K1 A1 B1 ​​​​K2 A2 B2 ​​​​''' ​​​​print(right) ​​​​''' ​​​​ C D ​​​​K0 C0 D0 ​​​​K2 C2 D2 ​​​​K3 C3 D3 ​​​​''' ​​​​res = pd.merge(left,right, left_index=True, right_index=True, how='outer') ​​​​print(res) ​​​​''' ​​​​ A B C D ​​​​K0 A0 B0 C0 D0 ​​​​K1 A1 B1 NaN NaN ​​​​K2 A2 B2 C2 D2 ​​​​K3 NaN NaN C3 D3 ​​​​'''
  • merge 合併時,處理欄位名稱相同衝突,以 suffixes 區別

    ​​​​#coding=utf-8 ​​​​import pandas as pd ​​​​import numpy as np ​​​​# 處理 overlapping ​​​​boys = pd.DataFrame({'k':['K0','K1','K2'],'age':[1,2,3]}) ​​​​girls = pd.DataFrame({'k':['K0','K0','K3'],'age':[4,5,6]}) ​​​​print(boys) ​​​​''' ​​​​ age k ​​​​0 1 K0 ​​​​1 2 K1 ​​​​2 3 K2 ​​​​''' ​​​​print(girls) ​​​​''' ​​​​ age k ​​​​0 4 K0 ​​​​1 5 K0 ​​​​2 6 K3 ​​​​''' ​​​​# 目前 age 欄位是重複的,我們為了要區別 boy 與 girl,必須要在新的合併表格中,為 age 欄位取新的名字 ​​​​# 使用 suffixes 屬性即可辦到 ​​​​res = pd.merge(boys,girls, on='k', suffixes=['_boy','_girl'], how='outer') ​​​​print(res) ​​​​''' ​​​​ age_boy k age_girl ​​​​0 1.0 K0 4.0 ​​​​1 1.0 K0 5.0 ​​​​2 2.0 K1 NaN ​​​​3 3.0 K2 NaN ​​​​4 NaN K3 6.0 ​​​​'''

pandas plot

  • plot 基本用法 畫出 Series 數據
#coding=utf-8 import pandas as pd import numpy as np import matplotlib.pyplot as plt # plot data # Series 線性數據 data = pd.Series(np.random.randn(1000), index=np.arange(1000)) data = data.cumsum() # 做累加 data.plot() plt.show()
  • plot 畫出 DataFrame 四個數據
#coding=utf-8 import pandas as pd import numpy as np import matplotlib.pyplot as plt # DataFrame data = pd.DataFrame(np.random.randn(1000,4), index=np.arange(1000), columns=list("ABCD")) data = data.cumsum() print(data.head(5)) # 印出前五個數據 ''' A B C D 0 -0.056981 -0.167990 -0.103564 -0.807399 1 1.008049 -1.633926 0.959755 0.405345 2 0.821038 -3.090023 2.821623 -0.880397 3 0.448243 -4.889474 4.477471 -1.378809 4 -1.453623 -3.347546 5.371346 -2.983690 ''' data.plot() plt.show()
  • scatter 用法
#coding=utf-8 import pandas as pd import numpy as np import matplotlib.pyplot as plt # plot 的方法種類: # 'bar', 'hist', 'box', 'kde', 'area', 'scatter', 'hexbin', 'pie' # DataFrame data = pd.DataFrame(np.random.randn(1000,4), index=np.arange(1000), columns=list("ABCD")) data = data.cumsum() print(data.head(5)) # 印出前五個數據 ''' A B C D 0 1.149112 -1.189742 -1.108183 -1.276239 1 0.889289 -0.979980 -0.821403 0.726542 2 1.219525 -0.753984 0.279848 1.686624 3 1.006253 -0.191323 0.595033 0.578449 4 0.710900 -0.820767 0.064716 1.539593 ''' # 使用 scatter ax = data.plot.scatter(x='A',y='B', color='Red',label='Class 1') # 取 A 欄位 B 欄位 data.plot.scatter(x='A', y='C', color='Green', label='Class 2', ax=ax) # 取 A 欄位 C 欄位 plt.show()
Select a repo