10 Minutes to pandas (Traditional Chinese Edition)

tags: `Pandas`

URL of this post: https://hackmd.io/@wiimax/10-minutes-to-pandas

Adapted from the official pandas documentation. See the original at: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html

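Introduction to pandas

This guide is adapted from the official documentation and is a short introduction to pandas (~~actually, it is not short at all~~). More complex recipes can be found in the official Cookbook.

Required modules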

import numpy as np
import pandas as pd

Module used later for plotting

import matplotlib.pyplot as plt

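Basic data structures in pandas

pandas provides two types of classes for handling data:

Series: a one-dimensional labeled array holding data of any type, such as integers, strings, or Python objects.
DataFrame: a two-dimensional data structure that holds data like a two-dimensional array or a table with rows and columns.

Object creation

See the Data Structure Intro section of the official documentation.

Create a Series by passing a list of values; pandas creates a default integer RangeIndex.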

s = pd.Series([1, 3, 5, np.nan, 6, 8])
s

Out[4]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Create a DataFrame by passing a NumPy array, with a datetime index built using date_range() and a list of column labels:

# In[6]:
dates = pd.date_range('20130101', periods=6)
dates

# Out[6]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

# In[7]
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df

# Out[7]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

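np.random.randn(6, 4) is a NumPy function that generates an array of shape (6, 4) whose elements are drawn from the standard normal distribution (mean 0, standard deviation 1). It is a convenience function for users porting code from Matlab, and it wraps standard_normal.

Create a DataFrame by passing a dict {key: value}, where each key is a column label (not a row label) and each value holds the values of that column.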

# In[8]:
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
df2

# Out[8]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

The columns of a DataFrame can have different dtypes:

# In [09]: 
df2.dtypes

# Out[09]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

If you are using IPython or a Jupyter notebook, pressing Tab after the dot completes all attributes as well as your own column names.

In [12]: df2.<TAB>  
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.columns
df2.align              df2.copy    
df2.all                df2.count
df2.any                df2.combine
df2.append             df2.D
df2.apply              df2.describe
df2.applymap           df2.diff
df2.B                  df2.duplicated
...

Viewing data

See the Basics section.

Use DataFrame.head() to view the first n rows of a DataFrame and DataFrame.tail() to view the last n rows:

# In [13]: 
df.head()

# Out[13]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

# In [14]: 
df.tail(3)

# Out[14]: 
                   A         B         C         D
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-05 -0.424972  0.567020  0.276232 -1.087401
2013-01-06 -0.673690  0.113648 -1.478427  0.524988

Use DataFrame.index and DataFrame.columns to display the index and the column names.

# In [15]: 
df.index

# Out[15]: 
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

# In [16]: 
df.columns

# Out[16]: 
Index(['A', 'B', 'C', 'D'], dtype='object')

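DataFrame.to_numpy() converts the underlying data to a NumPy array.

Note that this can be an expensive operation when the DataFrame has columns with different data types (int, str, ...). The main reason is that a NumPy array has one dtype for the entire array, while a pandas DataFrame has one dtype per column. When you call DataFrame.to_numpy(), pandas finds the NumPy dtype that can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a Python object.

For df, a DataFrame whose values are all floating-point, DataFrame.to_numpy() is fast.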

In [17]: df.to_numpy()

Out[17]: 
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
       [ 1.2121, -0.1732,  0.1192, -1.0442],
       [-0.8618, -2.1046, -0.4949,  1.0718],
       [ 0.7216, -0.7068, -1.0396,  0.2719],
       [-0.425 ,  0.567 ,  0.2762, -1.0874],
       [-0.6737,  0.1136, -1.4784,  0.525 ]])

For df2, a DataFrame with several dtypes, the conversion is relatively expensive:

In [18]: df2.to_numpy()
Out[18]: 
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
      dtype=object)

Note DataFrame.to_numpy() does not include the index or column labels in the output.
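
As a small illustration of the dtype behaviour described above, the sketch below converts one all-float frame and one mixed frame. The frames and the names `all_floats` and `mixed` are made up for this example and are not part of the original guide.

```python
import numpy as np
import pandas as pd

# Hypothetical frames, used only for illustration.
all_floats = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})
mixed = pd.DataFrame({"x": [1.0, 2.0], "label": ["a", "b"]})

float_array = all_floats.to_numpy()
mixed_array = mixed.to_numpy()

# One numeric dtype can hold every column of the first frame.
print(float_array.dtype)
# Floats and strings share no numeric dtype, so pandas falls back to object.
print(mixed_array.dtype)
```

Checking `.dtype` on the result is a cheap way to notice when a conversion has fallen back to object.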

Use describe() to get a quick statistical summary of the data:

In [19]: df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

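Series.std() defaults to ddof=1, which gives the sample standard deviation. To get the population standard deviation, as numpy.std does by default, use Series.std(ddof=0).

Use T to transpose the data (swap rows and columns):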
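
The difference between the two ddof settings can be checked with a short sketch. The sample values below are arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Arbitrary illustrative data.
s = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = s.std()            # pandas default: ddof=1 (divide by n - 1)
population_std = s.std(ddof=0)  # divide by n
numpy_std = np.std(s.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)
```

With the same ddof the two libraries agree; the sample estimate is always the larger of the two for non-constant data.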

In [20]: df.T

Out[20]: 
   2013-01-01  2013-01-02  2013-01-03  2013-01-04  2013-01-05  2013-01-06
A    0.469112    1.212112   -0.861849    0.721555   -0.424972   -0.673690
B   -0.282863   -0.173215   -2.104569   -0.706771    0.567020    0.113648
C   -1.509059    0.119209   -0.494929   -1.039575    0.276232   -1.478427
D   -1.135632   -1.044236    1.071804    0.271860   -1.087401    0.524988

Sort by an axis with sort_index(axis=1, ascending=False). Here the column labels are sorted in descending order:

In [21]: df.sort_index(axis=1, ascending=False)

Out[21]: 
                   D         C         B         A
2013-01-01 -1.135632 -1.509059 -0.282863  0.469112
2013-01-02 -1.044236  0.119209 -0.173215  1.212112
2013-01-03  1.071804 -0.494929 -2.104569 -0.861849
2013-01-04  0.271860 -1.039575 -0.706771  0.721555
2013-01-05 -1.087401  0.276232  0.567020 -0.424972
2013-01-06  0.524988 -1.478427  0.113648 -0.673690

DataFrame.sort_values() sorts by values:

In [22]: df.sort_values(by='B')

Out[22]: 
                   A         B         C         D
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-06 -0.673690  0.113648 -1.478427  0.524988
2013-01-05 -0.424972  0.567020  0.276232 -1.087401

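Selection

Note: although standard Python and NumPy expressions for selecting data are intuitive and work, the optimized pandas data access methods DataFrame.at(), DataFrame.iat(), DataFrame.loc() and DataFrame.iloc() are recommended.

See the documentation on Indexing and Selecting Data and MultiIndex / Advanced Indexing.

Getting

Selecting a single column from a DataFrame returns a Series; df['A'] is equivalent to df.A: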

In [23]: df['A']

Out[23]: 
2013-01-01    0.469112
2013-01-02    1.212112
2013-01-03   -0.861849
2013-01-04    0.721555
2013-01-05   -0.424972
2013-01-06   -0.673690
Freq: D, Name: A, dtype: float64

Use square brackets [] to slice the rows you want:

In [24]: df[0:3]

Out[24]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

In [25]: df['20130102':'20130104']

Out[25]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Selection by label

See more in Selection by Label using DataFrame.loc() or DataFrame.at().

Select the row matching a label:

# i.e., get the data for 2013-01-01
In [26]: df.loc[dates[0]]

Out[26]: 
A    0.469112
B   -0.282863
C   -1.509059
D   -1.135632
Name: 2013-01-01 00:00:00, dtype: float64

Select multiple columns by label:

In [27]: df.loc[:, ['A', 'B']]

Out[27]: 
                   A         B
2013-01-01  0.469112 -0.282863
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020
2013-01-06 -0.673690  0.113648

Slice by label on both axes (both endpoints are included):

In [28]: df.loc['20130102':'20130104', ['A', 'B']]

Out[28]: 
                   A         B
2013-01-02  1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04  0.721555 -0.706771

Selecting a single row label reduces the dimension of the returned object:

In [29]: df.loc['20130102', ['A', 'B']]

Out[29]: 
A    1.212112
B   -0.173215
Name: 2013-01-02 00:00:00, dtype: float64

Get a scalar value:

In [30]: df.loc[dates[0], 'A']
Out[30]: 0.4691122999071863

Get fast access to a scalar (equivalent to the prior method):

In [31]: df.at[dates[0], 'A']
Out[31]: 0.4691122999071863

Selection by position

loc selects data by label; iloc selects data by integer position.

See more in Selection by Position using DataFrame.iloc() or DataFrame.iat().
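
To make the label versus position distinction concrete, here is a minimal sketch with a Series whose integer labels differ from its positions. The data and variable names are invented for this example.

```python
import pandas as pd

# Labels 10, 20, 30 sit at positions 0, 1, 2.
s = pd.Series(["a", "b", "c"], index=[10, 20, 30])

by_label = s.loc[10]      # look up the label 10
by_position = s.iloc[0]   # look up the first position

label_slice = s.loc[10:20]     # label slices include both endpoints
position_slice = s.iloc[0:1]   # position slices exclude the end, like Python lists

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```

The slicing rule is the detail most likely to surprise: the same pair of bounds can return a different number of rows depending on whether loc or iloc is used.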

Select via the position of the passed integer:

In [32]: df.iloc[3]

Out[32]: 
A    0.721555
B   -0.706771
C   -1.039575
D    0.271860
Name: 2013-01-04 00:00:00, dtype: float64

Slice by integers, which works like NumPy and Python slicing:

In [33]: df.iloc[3:5, 0:2]

Out[33]: 
                   A         B
2013-01-04  0.721555 -0.706771
2013-01-05 -0.424972  0.567020

Select by lists of integer positions, similar to NumPy and Python style:

In [34]: df.iloc[[1, 2, 4], [0, 2]]

Out[34]: 
                   A         C
2013-01-02  1.212112  0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972  0.276232

Slice rows explicitly:

In [35]: df.iloc[1:3, :]

Out[35]: 
                   A         B         C         D
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804

Slice columns explicitly:

In [36]: df.iloc[:, 1:3]

Out[36]: 
                   B         C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215  0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05  0.567020  0.276232
2013-01-06  0.113648 -1.478427

Get a value explicitly:

In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858

Get fast access to a scalar (equivalent to the prior method):

In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858

Boolean indexing

Select rows using the values of a single column:

In [39]: df[df.A > 0]
Out[39]: 
                   A         B         C         D
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632
2013-01-02  1.212112 -0.173215  0.119209 -1.044236
2013-01-04  0.721555 -0.706771 -1.039575  0.271860

Select values from a DataFrame where a boolean condition is met; values that fail the condition become NaN:

In [40]: df[df > 0]
Out[40]: 
                   A         B         C         D
2013-01-01  0.469112       NaN       NaN       NaN
2013-01-02  1.212112       NaN  0.119209       NaN
2013-01-03       NaN       NaN       NaN  1.071804
2013-01-04  0.721555       NaN       NaN  0.271860
2013-01-05       NaN  0.567020  0.276232       NaN
2013-01-06       NaN  0.113648       NaN  0.524988
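
Indexing a DataFrame with a boolean DataFrame keeps the shape and masks the failing cells. A minimal sketch, assuming a small made-up frame, shows that this matches DataFrame.where():

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"A": [1.0, -2.0], "B": [-3.0, 4.0]})

masked = df[df > 0]         # cells failing the condition become NaN
same = df.where(df > 0)     # explicit spelling of the same operation

print(masked)
print(masked.equals(same))
```

Writing it with where() makes the intent explicit and also allows a replacement value other than NaN through its `other` argument.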

Use the isin() method for filtering:

In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]: 
                   A         B         C         D      E
2013-01-01  0.469112 -0.282863 -1.509059 -1.135632    one
2013-01-02  1.212112 -0.173215  0.119209 -1.044236    one
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804    two
2013-01-04  0.721555 -0.706771 -1.039575  0.271860  three
2013-01-05 -0.424972  0.567020  0.276232 -1.087401   four
2013-01-06 -0.673690  0.113648 -1.478427  0.524988  three

In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]: 
                   A         B         C         D     E
2013-01-03 -0.861849 -2.104569 -0.494929  1.071804   two
2013-01-05 -0.424972  0.567020  0.276232 -1.087401  four

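df.copy() defaults to df.copy(deep=True), meaning it makes a deep copy by default.

Deep copy: creates a new object and fully copies all the data of the original object. After a deep copy, the original and the copy do not affect each other; they are completely independent in memory.

Shallow copy: creates a new object but does not copy the data, only references to it. After a shallow copy, the original and the copy can affect each other because they share the same data in memory. (With Copy-on-Write enabled, which newer pandas versions are moving to, such changes no longer propagate between the two objects.)

df2 = df is not a copy at all: it only binds a second name to the same DataFrame object, so any change made through df2 also changes df.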
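
The difference between plain assignment and copy() can be verified with a short sketch. The frame and the names `original`, `alias`, and `deep` are illustrative only.

```python
import pandas as pd

original = pd.DataFrame({"a": [1, 2, 3]})

alias = original        # no copy: a second name for the same object
deep = original.copy()  # deep=True by default: independent data

# Modify the frame through the alias.
alias.loc[0, "a"] = 100

print(alias is original)      # same object
print(original.loc[0, "a"])   # the change is visible through both names
print(deep.loc[0, "a"])       # the deep copy keeps the old value
```

When a function should not modify its caller's data, calling copy() on the input first is the simplest safeguard.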

Setting

Setting a new column automatically aligns the data by index:

In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))

In [46]: s1
Out[46]: 
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64

In [47]: df['F'] = s1

Set a value by label:

In [48]: df.at[dates[0], 'A'] = 0

Set a value by position:
```
In [49]: df.iat[0, 1] = 0
```

Set by assigning a NumPy array:

In [50]: df.loc[:, 'D'] = np.array([5] * len(df))

The result of the preceding setting operations:

In [51]: df
Out[51]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059  5  NaN
2013-01-02  1.212112 -0.173215  0.119209  5  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0
2013-01-05 -0.424972  0.567020  0.276232  5  4.0
2013-01-06 -0.673690  0.113648 -1.478427  5  5.0

A where operation with setting:

In [52]: df2 = df.copy()

In [53]: df2[df2 > 0] = -df2

In [54]: df2
Out[54]: 
                   A         B         C  D    F
2013-01-01  0.000000  0.000000 -1.509059 -5  NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0

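Missing data

pandas uses np.nan to represent missing data. By default it is not included in computations. See the Missing Data section.

.reindex() lets you change, add, or delete the index on a specified axis. It returns a copy of the data: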

In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])

In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1

In [57]: df1
Out[57]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  NaN  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  NaN
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  NaN

Drop any rows that have missing data:

In [58]: df1.dropna(how='any')
Out[58]: 
                   A         B         C  D    F    E
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0

Fill missing data:

In [59]: df1.fillna(value=5)
Out[59]: 
                   A         B         C  D    F    E
2013-01-01  0.000000  0.000000 -1.509059  5  5.0  1.0
2013-01-02  1.212112 -0.173215  0.119209  5  1.0  1.0
2013-01-03 -0.861849 -2.104569 -0.494929  5  2.0  5.0
2013-01-04  0.721555 -0.706771 -1.039575  5  3.0  5.0

Use isna() to get a boolean mask of where values are missing:

In [60]: pd.isna(df1)
Out[60]: 
                A      B      C      D      F      E
2013-01-01  False  False  False  False   True  False
2013-01-02  False  False  False  False  False  False
2013-01-03  False  False  False  False  False   True
2013-01-04  False  False  False  False  False   True

Operations

See the Basic section on Binary Ops.

Stats

Operations in general exclude missing data; reductions such as mean() skip NaN values by default.
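
A minimal sketch of how missing values are skipped, using a made-up three-element Series. The `skipna` argument shown here is the standard switch on pandas reductions.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

mean_default = s.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = s.mean(skipna=False)  # NaN propagates into the result
count = s.count()                   # counts only the non-missing values

print(mean_default, mean_strict, count)
```

Passing skipna=False is a quick way to detect whether a column contains any missing values before trusting a summary statistic.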

Compute the mean of each column:

In [61]: df.mean()
Out[61]: 
A   -0.004474
B   -0.383981
C   -0.687758
D    5.000000
F    3.000000
dtype: float64

Compute the mean of each row (axis 1):

In [62]: df.mean(1)
Out[62]: 
2013-01-01    0.872735
2013-01-02    1.431621
2013-01-03    0.707731
2013-01-04    1.395042
2013-01-05    1.883656
2013-01-06    1.592306
Freq: D, dtype: float64

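Operating with another Series or DataFrame that has a different index or columns requires alignment. pandas automatically broadcasts along the specified dimension and fills unaligned labels with np.nan.

# align on the datetime index
# .shift(2) shifts the data by 2 positions along the axis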
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2) 
In [64]: s
Out[64]: 
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64

In [65]: df.sub(s, axis='index') 
Out[65]: 
                   A         B         C    D    F
2013-01-01       NaN       NaN       NaN  NaN  NaN
2013-01-02       NaN       NaN       NaN  NaN  NaN
2013-01-03 -1.861849 -3.104569 -1.494929  4.0  1.0
2013-01-04 -2.278445 -3.706771 -4.039575  2.0  0.0
2013-01-05 -5.424972 -4.432980 -4.723768  0.0 -1.0
2013-01-06       NaN       NaN       NaN  NaN  NaN
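
The alignment rule can be seen in isolation with two small Series whose indexes only partly overlap. The data below is invented for this sketch.

```python
import numpy as np
import pandas as pd

a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "z"])

# Labels are matched first; "x" has no partner in b, so the result is NaN there.
total = a + b

# add() with fill_value treats the missing side as 0 instead.
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

The method form (add, sub, mul, div) is useful whenever NaN for unmatched labels is not the desired outcome.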

User defined functions

DataFrame.agg() and DataFrame.transform() apply a user defined function that reduces or broadcasts its result, respectively.

In [66]: df.agg(lambda x: np.mean(x) * 5.6)
Out[66]: 
A    -0.025054
B    -2.150294
C    -3.851445
D    28.000000
F    16.800000
dtype: float64

In [67]: df.transform(lambda x: x * 101.2)
Out[67]: 
                     A           B           C      D      F
2013-01-01    0.000000    0.000000 -152.716721  506.0    NaN
2013-01-02  122.665737  -17.529322   12.063922  506.0  101.2
2013-01-03  -87.219115 -212.982405  -50.086843  506.0  202.4
2013-01-04   73.021382  -71.525239 -105.204988  506.0  303.6
2013-01-05  -43.007200   57.382459   27.954680  506.0  404.8
2013-01-06  -68.177398   11.501219 -149.616767  506.0  506.0
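
The shapes of the two results are the practical difference between agg() and transform(). A small sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

# agg reduces: one value per column.
reduced = df.agg(lambda col: col.max() - col.min())

# transform broadcasts: the result keeps the shape of the input.
centered = df.transform(lambda col: col - col.mean())

print(reduced)
print(centered)
```

Here `reduced` is a Series indexed by column name, while `centered` is a DataFrame with the same index and columns as `df`.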

Value counts

See more at Histogramming and Discretization.

In [68]: s = pd.Series(np.random.randint(0, 7, size=10))

In [69]: s
Out[69]: 
0    4
1    2
2    1
3    2
4    6
5    4
6    4
7    6
8    4
9    4
dtype: int64

In [70]: s.value_counts()
Out[70]: 
4    5
2    2
6    2
1    1
Name: count, dtype: int64

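String methods

Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. See Vectorized String Methods for more.

Note that pattern matching in str generally uses regular expressions by default (and in some cases always uses them).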

In [71]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])

In [72]: s.str.lower()
Out[72]: 
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
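
The note about regular expressions matters as soon as a pattern contains a metacharacter. A minimal sketch with str.contains(), using made-up strings:

```python
import numpy as np
import pandas as pd

s = pd.Series(["a.b", "axb", np.nan])

# Default: the pattern is a regular expression, so "." matches any character.
as_regex = s.str.contains("a.b", na=False)

# regex=False treats the pattern as a literal substring.
as_literal = s.str.contains("a.b", regex=False, na=False)

print(as_regex.tolist())
print(as_literal.tolist())
```

The `na=False` argument makes missing entries count as non-matches, so the result can be used directly as a boolean mask.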

Apply

Apply functions to the data:

In [66]: df.apply(np.cumsum)  # cumulative sum
Out[66]: 
                   A         B         C   D     F
2013-01-01  0.000000  0.000000 -1.509059   5   NaN
2013-01-02  1.212112 -0.173215 -1.389850  10   1.0
2013-01-03  0.350263 -2.277784 -1.884779  15   3.0
2013-01-04  1.071818 -2.984555 -2.924354  20   6.0
2013-01-05  0.646846 -2.417535 -2.648122  25  10.0
2013-01-06 -0.026844 -2.303886 -4.126549  30  15.0

In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]: 
A    2.073961
B    2.671590
C    1.785291
D    0.000000
F    4.000000
dtype: float64

Merge

Concat

pandas provides various facilities for easily combining Series and DataFrame objects. See the Merging section.

Concatenate pandas objects with concat():

In [73]: df = pd.DataFrame(np.random.randn(10, 4))

In [74]: df
Out[74]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]]

In [76]: pd.concat(pieces)
Out[76]: 
          0         1         2         3
0 -0.548702  1.467327 -1.015962 -0.483075
1  1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952  0.991460 -0.919069  0.266046
3 -0.709661  1.669052  1.037882 -1.705775
4 -0.919854 -0.042379  1.247642 -0.009920
5  0.290213  0.495767  0.362949  1.548106
6 -1.131345 -0.089329  0.337863 -0.945867
7 -0.932132  1.956030  0.017587 -0.016692
8 -0.575247  0.254161 -1.143704  0.215897
9  1.193555 -0.077118 -0.408530 -0.862495

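Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.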
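
The recommended pattern might look like the following sketch: collect plain Python records first, then build the DataFrame once. The record fields `id` and `square` are invented for this example.

```python
import pandas as pd

# Accumulate rows as ordinary dicts; appending to a list is cheap.
records = []
for i in range(5):
    records.append({"id": i, "square": i * i})

# Build the DataFrame in a single step at the end.
df = pd.DataFrame(records)

print(df)
```

This avoids copying the whole frame on every added row, which is what growing a DataFrame row by row would do.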

Join

merge() enables SQL-style joins. See the Database style joining section.

In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})

In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})

In [79]: left
Out[79]: 
   key  lval
0  foo     1
1  foo     2

In [80]: right
Out[80]: 
   key  rval
0  foo     4
1  foo     5

In [81]: pd.merge(left, right, on='key')
Out[81]: 
   key  lval  rval
0  foo     1     4
1  foo     1     5
2  foo     2     4
3  foo     2     5

merge() on unique keys:

In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})

In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})

In [84]: left
Out[84]: 
   key  lval
0  foo     1
1  bar     2

In [85]: right
Out[85]: 
   key  rval
0  foo     4
1  bar     5

In [86]: pd.merge(left, right, on='key')
Out[86]: 
   key  lval  rval
0  foo     1     4
1  bar     2     5
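
With duplicate keys the merge produces every left/right combination, which is easy to trigger by accident. The sketch below repeats the duplicate-key frames from above and uses the `validate` argument of pd.merge to turn that situation into an error. Treat it as one possible safeguard rather than a required step.

```python
import pandas as pd

left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})

# Two matching rows on each side give 2 x 2 = 4 output rows.
merged = pd.merge(left, right, on="key")

# validate="one_to_one" checks that keys are unique on both sides.
try:
    pd.merge(left, right, on="key", validate="one_to_one")
    raised = False
except pd.errors.MergeError:
    raised = True

print(len(merged), raised)
```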

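Grouping

By "group by" we mean a process that applies a function to each group of the data and combines the results, involving the following steps:
- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure

See the Grouping section.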
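
The three steps above can be sketched by hand and compared with the built-in call. The frame and the column names `key` and `value` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: iterating over a groupby yields (group name, sub-frame) pairs.
# Apply: sum the "value" column of each sub-frame.
pieces = {name: group["value"].sum() for name, group in df.groupby("key")}

# Combine: collect the per-group results into one Series.
manual = pd.Series(pieces)

# The built-in aggregation performs the same three steps in one call.
builtin = df.groupby("key")["value"].sum()

print(manual)
print(builtin)
```

In practice the built-in form is preferable; the manual version is only meant to show what happens conceptually.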

In [91]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                                  'foo', 'bar', 'foo', 'foo'],
                            'B': ['one', 'one', 'two', 'three',
                                  'two', 'two', 'one', 'three'],
                            'C': np.random.randn(8),
                            'D': np.random.randn(8)})

In [92]: df
Out[92]: 
     A      B         C         D
0  foo    one -1.202872 -0.055224
1  bar    one -1.814470  2.395985
2  foo    two  1.018601  1.552825
3  bar  three -0.595447  0.166599
4  foo    two  1.395433  0.047609
5  bar    two -0.392670 -0.136473
6  foo    one  0.007207 -0.561757
7  foo  three  1.928123 -1.623033

Group by a column label, select the numeric column labels, and then apply the DataFrameGroupBy.sum() function to the resulting groups:

In [93]: df.groupby('A')[['C', 'D']].sum()
Out[93]: 
            C        D
A                     
bar -2.802588  2.42611
foo  3.146492 -0.63958

Grouping by multiple columns forms a hierarchical index (MultiIndex); again we apply sum():

In [94]: df.groupby(['A', 'B']).sum()
Out[94]: 
                  C         D
A   B                        
bar one   -1.814470  2.395985
    three -0.595447  0.166599
    two   -0.392670 -0.136473
foo one   -1.195665 -0.616981
    three  1.928123 -1.623033
    two    2.414034  1.600434

Reshaping

See the sections on Hierarchical Indexing and Reshaping.

Stack

Build a DataFrame with a hierarchical index:

In [91]: arrays = [
   ....:    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
   ....:    ["one", "two", "one", "two", "one", "two", "one", "two"],
   ....: ]
   ....: 

In [92]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])

In [93]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])

In [94]: df2 = df[:4]

In [95]: df2
Out[95]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

The stack() method "compresses" a level in the DataFrame's columns into the index:

In [100]: stacked = df2.stack()

In [101]: stacked
Out[101]: 
first  second   
bar    one     A    0.029399
               B   -0.542108
       two     A    0.282696
               B   -0.087302
baz    one     A   -1.575170
               B    1.771208
       two     A    0.816482
               B    1.100230
dtype: float64

With a stacked DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is unstack(), which by default unstacks the last level:

In [102]: stacked.unstack()
Out[102]: 
                     A         B
first second                    
bar   one     0.029399 -0.542108
      two     0.282696 -0.087302
baz   one    -1.575170  1.771208
      two     0.816482  1.100230

In [103]: stacked.unstack(1)
Out[103]: 
     second        one       two
first                      
bar   A      0.029399  0.282696
      B     -0.542108 -0.087302
baz   A     -1.575170  0.816482
      B      1.771208  1.100230

In [104]: stacked.unstack(0)
Out[104]: 
    first        bar       baz
second                      
one    A      0.029399 -1.575170
       B     -0.542108  1.771208
two    A      0.282696  0.816482
       B     -0.087302  1.100230
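
A compact sketch of the stack/unstack round trip, using fixed integers instead of random numbers so the result is easy to follow. The index is built with MultiIndex.from_product here purely for brevity.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [5, 6, 7, 8]}, index=index)

# stack moves the column labels into a new innermost index level.
stacked = df.stack()

# unstack moves the innermost index level back out into columns.
roundtrip = stacked.unstack()

print(stacked)
print(roundtrip)
```

After stacking, each value is addressed by three labels (first, second, column name); unstacking restores the two-column layout.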

Pivot tables

See the section on Pivot Tables.

In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
                             'B': ['A', 'B', 'C'] * 4,
                             'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
                             'D': np.random.randn(12),
                             'E': np.random.randn(12)})

In [106]: df
Out[106]: 
        A  B    C         D         E
0     one  A  foo  1.418757 -0.179666
1     one  B  foo -1.879024  1.291836
2     two  C  foo  0.536826 -0.009614
3   three  A  bar  1.006160  0.392149
4     one  B  bar -0.029716  0.264599
5     one  C  bar -1.146178 -0.057409
6     two  A  foo  0.100900 -1.425638
7   three  B  foo -1.035018  1.024098
8     one  C  foo  0.314665 -0.106062
9     one  A  bar -0.773723  1.824375
10    two  B  bar -1.170653  0.595974
11  three  C  bar  0.648740  1.167115

We can produce pivot tables from this data very easily:

In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]: 
C             bar       foo
A     B                    
one   A -0.773723  1.418757
      B -0.029716 -1.879024
      C -1.146178  0.314665
three A  1.006160       NaN
      B       NaN -1.035018
      C  0.648740       NaN
two   A       NaN  0.100900
      B -1.170653       NaN
      C       NaN  0.536826

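Time series

pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (for example, converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.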

In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)

In [110]: ts.resample('5Min').sum()
Out[110]: 
2012-01-01    25083
Freq: 5T, dtype: int64
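
To see the binning more clearly, the sketch below resamples ten minutes of one-per-second data, so each 5-minute bin must contain exactly 300 observations. It uses the lowercase aliases "s" and "5min", which recent pandas versions prefer; the data is made up for this example.

```python
import pandas as pd

# 600 seconds of data, one observation per second, each equal to 1.
rng = pd.date_range("2012-01-01", periods=600, freq="s")
ts = pd.Series(1, index=rng)

# Group into 5-minute bins and sum the observations in each bin.
five_min = ts.resample("5min").sum()

print(five_min)
```

Replacing sum() with mean(), max(), or another reduction changes how the values inside each bin are combined.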

Time zone representation:

In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [113]: ts
Out[113]: 
2012-03-06    0.464000
2012-03-07    0.227371
2012-03-08   -0.496922
2012-03-09    0.306389
2012-03-10   -2.290613
Freq: D, dtype: float64

In [114]: ts_utc = ts.tz_localize('UTC')
In [115]: ts_utc
Out[115]: 
2012-03-06 00:00:00+00:00    0.464000
2012-03-07 00:00:00+00:00    0.227371
2012-03-08 00:00:00+00:00   -0.496922
2012-03-09 00:00:00+00:00    0.306389
2012-03-10 00:00:00+00:00   -2.290613
Freq: D, dtype: float64

Series.tz_convert() converts a time zone aware time series to another time zone:

In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]: 
2012-03-05 19:00:00-05:00    0.464000
2012-03-06 19:00:00-05:00    0.227371
2012-03-07 19:00:00-05:00   -0.496922
2012-03-08 19:00:00-05:00    0.306389
2012-03-09 19:00:00-05:00   -2.290613
Freq: D, dtype: float64

Add a non-fixed duration (BusinessDay) to a time series:

In [113]: rng
Out[113]: 
DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
               '2012-03-10'],
              dtype='datetime64[ns]', freq='D')

In [114]: rng + pd.offsets.BusinessDay(5)
Out[114]: 
DatetimeIndex(['2012-03-13', '2012-03-14', '2012-03-15', '2012-03-16',
               '2012-03-16'],
              dtype='datetime64[ns]', freq=None)

Categoricals

pandas can include categorical data in a DataFrame. For details, see the categorical introduction and the API documentation.

In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
                             "raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})

Convert the raw grades to a categorical data type:

In [128]: df["grade"] = df["raw_grade"].astype("category")

In [129]: df["grade"]
Out[129]: 
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): [a, b, e]

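Rename the categories to more meaningful names (using Series.cat.rename_categories(); assigning to Series.cat.categories directly is no longer supported in recent pandas versions):

In [130]: df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])

Reorder the categories and simultaneously add the missing categories (methods under Series.cat return a new Series by default):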

In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
                                                        "good", "very good"])

In [132]: df["grade"]
Out[132]: 
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]

Sorting with .sort_values() follows the order of the categories, not lexical order:

In [133]: df.sort_values(by="grade")
Out[133]: 
   id raw_grade      grade
5   6         e   very bad
1   2         b       good
2   3         b       good
0   1         a  very good
3   4         a  very good
4   5         a  very good
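
The category-order sorting can be contrasted with plain string sorting in a few lines. The size labels below are invented for this sketch, and CategoricalDtype is used to declare the order up front.

```python
import pandas as pd

sizes = pd.Series(["medium", "small", "large"])

# Declare the categories in their logical order.
size_type = pd.CategoricalDtype(["small", "medium", "large"], ordered=True)
as_category = sizes.astype(size_type)

lexical = sizes.sort_values().tolist()            # alphabetical order
by_category = as_category.sort_values().tolist()  # declared category order

print(lexical)
print(by_category)
```

Declaring the dtype once keeps the ordering attached to the data, so every later sort or comparison uses it.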

Grouping by a categorical column with observed=False also shows empty categories:

In [134]: df.groupby("grade", observed=False).size()
Out[134]: 
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64

Plotting

See the Plotting docs.

In [124]: import matplotlib.pyplot as plt

In [125]: plt.close("all")

The plt.close method is used to close a figure window:

In [135]: ts = pd.Series(np.random.randn(1000),
                         index=pd.date_range('1/1/2000', periods=1000))

In [136]: ts = ts.cumsum()

In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7f24a8b314d0>

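When using Jupyter, the plot appears when plot() is called. Otherwise, use matplotlib.pyplot.show to display it or matplotlib.pyplot.savefig to write it to a file.

On a DataFrame, the plot() method is a convenient way to plot all of the columns with labels: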

In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
                            columns=['A', 'B', 'C', 'D'])

In [139]: df = df.cumsum()

In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>

In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f24a8b13750>

In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f24a88250d0>

Getting data in/out

CSV

Writing to a csv file

df = pd.DataFrame(np.random.randint(0, 5, (10, 5)))
df.to_csv('foo.csv')

Reading from a csv file

In [136]: pd.read_csv("foo.csv")
Out[136]: 
   Unnamed: 0  0  1  2  3  4
0           0  4  3  1  1  2
1           1  1  0  2  3  2
2           2  1  4  2  1  2
3           3  0  4  0  2  2
4           4  4  2  2  3  4
5           5  4  0  4  3  1
6           6  2  1  2  0  3
7           7  4  0  4  4  4
8           8  4  4  1  0  1
9           9  0  4  3  0  3
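
The "Unnamed: 0" column in the output above is the index that to_csv() wrote out as an ordinary column. One way to avoid it, sketched here with an in-memory buffer instead of a file, is to tell read_csv() which column holds the index. The frame is made up for this example.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write to an in-memory text buffer rather than a file on disk.
buffer = io.StringIO()
df.to_csv(buffer)

# Default read: the saved index comes back as a column named "Unnamed: 0".
buffer.seek(0)
with_extra = pd.read_csv(buffer)

# index_col=0 uses the first column as the index again.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(with_extra.columns))
print(list(restored.columns))
```

Alternatively, passing index=False to to_csv() leaves the index out of the file altogether.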

Parquet

In [137]: df.to_parquet("foo.parquet")

In [138]: pd.read_parquet("foo.parquet")
Out[138]: 
   0  1  2  3  4
0  4  3  1  1  2
1  1  0  2  3  2
2  1  4  2  1  2
3  0  4  0  2  2
4  4  2  2  3  4
5  4  0  4  3  1
6  2  1  2  0  3
7  4  0  4  4  4
8  4  4  1  0  1
9  0  4  3  0  3

Excel

Reading and writing MS Excel files

Write to an Excel file using DataFrame.to_excel():

In [147]: df.to_excel('foo.xlsx', sheet_name='Sheet1')

Read from an Excel file using read_excel():

In [148]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[148]: 
   Unnamed: 0  0  1  2  3  4
0           0  4  3  1  1  2
1           1  1  0  2  3  2
2           2  1  4  2  1  2
3           3  0  4  0  2  2
4           4  4  2  2  3  4
5           5  4  0  4  3  1
6           6  2  1  2  0  3
7           7  4  0  4  4  4
8           8  4  4  1  0  1
9           9  0  4  3  0  3

Gotchas

If an operation raises an exception such as:

>>> if pd.Series([False, True, False]):
...     print("I was true")
Traceback
    ...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().

See Comparisons for how to handle this, or see Gotchas as well.
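
A short sketch of the usual ways around this error: state explicitly whether any element, every element, or the emptiness of the Series is what the condition should test. The variable names are illustrative.

```python
import pandas as pd

flags = pd.Series([False, True, False])

# Using the Series itself as a condition is ambiguous and raises ValueError.
try:
    bool(flags)
    ambiguous = False
except ValueError:
    ambiguous = True

any_true = bool(flags.any())  # is at least one element True?
all_true = bool(flags.all())  # is every element True?
is_empty = flags.empty        # does the Series have no elements?

print(ambiguous, any_true, all_true, is_empty)
```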