---
disqus: HackMD
---
# 10分鐘的Pandas入門-繁中版
###### tags: `Pandas`
本篇網址:https://hackmd.io/@wiimax/10-minutes-to-pandas
>來自Pandas官方文件,原文詳見: https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html
[![image](https://hackmd.io/_uploads/SJXWoCEJA.png)
](https://colab.research.google.com/github/willismax/MediaSystem-Python-Course/blob/main/01.Intro-Python/10%E5%88%86%E9%90%98Pandas.ipynb)
[TOC]
## Pandas介紹
- 此份介紹源自官方文件,是對Pandas的簡短介紹,~~其實一點也不短~~,可在官方[Cookbook](https://pandas.pydata.org/pandas-docs/stable/user_guide/cookbook.html#cookbook)看到更複雜的文件說明。
- 需要使用的模組
```python
import numpy as np
import pandas as pd
```
- 後續繪圖會使用的模組
```python
import matplotlib.pyplot as plt
```
## pandas 的基本資料結構
Pandas 提供了兩種類型的類別來處理資料:
1. [`Series`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series "貓熊系列"):保存任何類型資料的一維數值組合。例如整數、字串、Python 物件等。
2. [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame"):一種二維資料結構,用於保存數據,例如二維數組或具有行和列的表格。
## Object creation 創建物件
- 參閱官方文件[Data Structure Intro section](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html#dsintro)
- 通過傳入一個list創建[Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series),pandas預設會產生整數的[RangeIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.RangeIndex.html#pandas.RangeIndex)。
```python
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
```
- [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame")透過使用帶有標籤的list傳遞帶有日期時間索引的 NumPy 數組來建立[`date_range()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html#pandas.date_range "pandas.date_range") :
```python
# In[6]:
dates = pd.date_range('20130101', periods=6)
dates
# Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
```
```python
# In[7]
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
df
# Out[7]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
```
> `np.random.randn(6, 4)` 是一個 NumPy 函數,用於生成一個形狀為 (6, 4) 的數組,其中的元素來自於標準常態分佈(均值為 0,標準差為 1)。這個函數是為了方便從 Matlab 移植代碼而設計的,並且將標準常態分佈的生成封裝在了 `standard_normal` 函數中。
- [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame")以字典`dict`:`{Key:Value}`創建`DataFrame`,其中Key是欄(非列)標籤、Value是列之值。
```python
# In[8]:
df2 = pd.DataFrame({'A': 1.,
'B': pd.Timestamp('20130102'),
'C': pd.Series(1, index=list(range(4)), dtype='float32'),
'D': np.array([3] * 4, dtype='int32'),
'E': pd.Categorical(["test", "train", "test", "train"]),
'F': 'foo'})
df2
# Out[8]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
The columns of the resulting DataFrame have different dtypes.
```
[`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame")欄位可以有不同的資料結構 [dtypes](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes):
```python
# In [09]:
df2.dtypes
# Out[09]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
```
- 如果用IPython、Jupyter notebook等筆記本形式使用`Tab`可自動展示補全所有的屬性、自定義欄位。
```python
In [12]: df2.<TAB>
df2.A df2.bool
df2.abs df2.boxplot
df2.add df2.C
df2.add_prefix df2.clip
df2.add_suffix df2.columns
df2.align df2.copy
df2.all df2.count
df2.any df2.combine
df2.append df2.D
df2.apply df2.describe
df2.applymap df2.diff
df2.B df2.duplicated
...
```
## Viewing data 檢視資料
- 參閱[Basics section](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics)
- 以[`DataFrame.head()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html#pandas.DataFrame.head "pandas.DataFrame.head")查看DataFrame的前n筆資料,[`DataFrame.tail()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html#pandas.DataFrame.tail "pandas.DataFrame.tail")查看最後n筆資料:
```python
# In [13]:
df.head()
# Out[13]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
```
```python
# In [14]:
df.tail(3)
# Out[14]:
A B C D
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
```
- 以[`DataFrame.index`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.index.html#pandas.DataFrame.index "pandas.DataFrame.index")、[`DataFrame.columns`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html#pandas.DataFrame.columns "pandas.DataFrame.columns")
顯示索引及欄位名稱。
```python
# In [15]:
df.index
# Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
```
```python
# In [16]:
df.columns
# Out[16]:
Index(['A', 'B', 'C', 'D'], dtype='object')
```
- `DataFrame.to_numpy()`轉換為`NumPy`。
> 但請注意如果該`DataFrame`具有不同資料型態(int、str...),這可能是一項昂貴的操作,主因是NumPy數組對整個數組有一個dtype,而pandas DataFrames每列有一個dtype。當呼叫時 DataFrame.to_numpy(),pandas會找到可以容納 DataFrame中所有 dtypes 的NumPy dtype。這可能最終成為object,這需要將每個值都轉換為Python物件。
- 以下df的DataFrame值皆為浮點數,` DataFrame.to_numpy()`就相當快。
```python
In [17]: df.to_numpy()
Out[17]:
array([[ 0.4691, -0.2829, -1.5091, -1.1356],
[ 1.2121, -0.1732, 0.1192, -1.0442],
[-0.8618, -2.1046, -0.4949, 1.0718],
[ 0.7216, -0.7068, -1.0396, 0.2719],
[-0.425 , 0.567 , 0.2762, -1.0874],
[-0.6737, 0.1136, -1.4784, 0.525 ]])
```
- 以下df2的`DataFrame`有不同`dtypes`,運算代價高
```python
In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
```
>Note DataFrame.to_numpy() does not include the index or column labels in the output.
- 以`describe()`快速檢視數據統計摘要
```python
In [19]: df.describe()
Out[19]:
A B C D
count 6.000000 6.000000 6.000000 6.000000
mean 0.073711 -0.431125 -0.687758 -0.233103
std 0.843157 0.922818 0.779887 0.973118
min -0.861849 -2.104569 -1.509059 -1.135632
25% -0.611510 -0.600794 -1.368714 -1.076610
50% 0.022070 -0.228039 -0.767252 -0.386188
75% 0.658444 0.041933 -0.034326 0.461706
max 1.212112 0.567020 0.276232 1.071804
```
> `pd.series.std(ddof=1)`預設為樣本的標準差,如果要像`numpy.std`以母體為標準差,應改為`pd.series.std(ddof=0)`
- 以`T`轉置資料矩陣(列、欄互換)
```python
In [20]: df.T
Out[20]:
2013-01-01 2013-01-02 2013-01-03 2013-01-04 2013-01-05 2013-01-06
A 0.469112 1.212112 -0.861849 0.721555 -0.424972 -0.673690
B -0.282863 -0.173215 -2.104569 -0.706771 0.567020 0.113648
C -1.509059 0.119209 -0.494929 -1.039575 0.276232 -1.478427
D -1.135632 -1.044236 1.071804 0.271860 -1.087401 0.524988
```
- 依軸排序`sort_index(axis=1, ascending=False)`,結果為以ROW、遞增排序。
```python
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 -1.135632 -1.509059 -0.282863 0.469112
2013-01-02 -1.044236 0.119209 -0.173215 1.212112
2013-01-03 1.071804 -0.494929 -2.104569 -0.861849
2013-01-04 0.271860 -1.039575 -0.706771 0.721555
2013-01-05 -1.087401 0.276232 0.567020 -0.424972
2013-01-06 0.524988 -1.478427 0.113648 -0.673690
```
- [`DataFrame.sort_values()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html#pandas.DataFrame.sort_values "pandas.DataFrame.sort_values")按值排序:
```python
In [22]: df.sort_values(by='B')
Out[22]:
A B C D
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-06 -0.673690 0.113648 -1.478427 0.524988
2013-01-05 -0.424972 0.567020 0.276232 -1.087401
```
## Selection 選取
>注意,雖然標準的`Python`、`numpy`表達式直觀可用,但建議以`Pandas`優化的選擇方法,如[`DataFrame.at()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html#pandas.DataFrame.at "pandas.DataFrame.at")、[`DataFrame.iat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html#pandas.DataFrame.iat "pandas.DataFrame.iat")和 。[`DataFrame.loc()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc "pandas.DataFrame.loc") [`DataFrame.iloc()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc "pandas.DataFrame.iloc")
。
參閱文件[Indexing and Selecting Data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing) and [MultiIndex / Advanced Indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced).
### Getting 取得資料
- [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame "pandas.DataFrame")
選取單一欄位,將會回傳一個`Series`, `df['A']`相當於`df.A`:
```python
In [23]: df['A']
Out[23]:
2013-01-01 0.469112
2013-01-02 1.212112
2013-01-03 -0.861849
2013-01-04 0.721555
2013-01-05 -0.424972
2013-01-06 -0.673690
Freq: D, Name: A, dtype: float64
```
- 以中括號`[]`選擇想要的rows進行切片
```python
In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
```
```python
In [25]: df['20130102':'20130104']
Out[25]:
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
```
### 以標籤進行選擇
請參閱[按標籤選擇](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-label)了解更多內容。[`DataFrame.loc()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html#pandas.DataFrame.loc "pandas.DataFrame.loc") [`DataFrame.at()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.at.html#pandas.DataFrame.at "pandas.DataFrame.at")
- 選擇與標籤相符的行:
```python
#即取得2013-01-01的數據
In [26]: df.loc[dates[0]]
Out[26]:
A 0.469112
B -0.282863
C -1.509059
D -1.135632
Name: 2013-01-01 00:00:00, dtype: float64
```
- 以標籤取得多欄位數據
```python
In [27]: df.loc[:, ['A', 'B']]
Out[27]:
A B
2013-01-01 0.469112 -0.282863
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
2013-01-05 -0.424972 0.567020
2013-01-06 -0.673690 0.113648
```
- 以標籤組合切片:
```python
In [28]: df.loc['20130102':'20130104', ['A', 'B']]
Out[28]:
A B
2013-01-02 1.212112 -0.173215
2013-01-03 -0.861849 -2.104569
2013-01-04 0.721555 -0.706771
```
- 以標籤組合縮減顯示維度:
```python
In [29]: df.loc['20130102', ['A', 'B']]
Out[29]:
A 1.212112
B -0.173215
Name: 2013-01-02 00:00:00, dtype: float64
```
- 獲取單筆數值:
```python
In [30]: df.loc[dates[0], 'A']
Out[30]: 0.4691122999071863
```
- 快速存取標量(相當於先前的方法):
```python
In [31]: df.at[dates[0], 'A']
Out[31]: 0.4691122999071863
```
### Selection by position 以位置選擇
> `loc`以標籤取得Rows數據,`iloc`以行號取得數據。
- 在[Selection by Position](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-integer)查看更多內容。[`DataFrame.iloc()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html#pandas.DataFrame.iloc "pandas.DataFrame.iloc") [`DataFrame.iat()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html#pandas.DataFrame.iat "pandas.DataFrame.iat")
- 以整數數值選擇:
```python
In [32]: df.iloc[3]
Out[32]:
A 0.721555
B -0.706771
C -1.039575
D 0.271860
Name: 2013-01-04 00:00:00, dtype: float64
```
- 以整數切片,使用方式類似`numpy`、`python`風格:
```python
In [33]: df.iloc[3:5, 0:2]
Out[33]:
A B
2013-01-04 0.721555 -0.706771
20`
- 以list指定位置,使用方式類似`numpy`、`python`風格:
```python
In [34]: df.iloc[[1, 2, 4], [0, 2]]
Out[34]:
A C
2013-01-02 1.212112 0.119209
2013-01-03 -0.861849 -0.494929
2013-01-05 -0.424972 0.276232
```
- 對行rows切片:
```python
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804
```
- 對欄columns切片:
```python
In [36]: df.iloc[:, 1:3]
Out[36]:
B C
2013-01-01 -0.282863 -1.509059
2013-01-02 -0.173215 0.119209
2013-01-03 -2.104569 -0.494929
2013-01-04 -0.706771 -1.039575
2013-01-05 0.567020 0.276232
2013-01-06 0.113648 -1.478427
```
- 取得特定值:
```python
In [37]: df.iloc[1, 1]
Out[37]: -0.17321464905330858
```
- 快速取得特定值(相當於先前的方法):
```python
In [38]: df.iat[1, 1]
Out[38]: -0.17321464905330858
```
### Boolean indexing 布林索引
- 以單欄的值選取數據
```python
In [39]: df[df.A > 0]
Out[39]:
A B C D
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632
2013-01-02 1.212112 -0.173215 0.119209 -1.044236
2013-01-04 0.721555 -0.706771 -1.039575 0.271860
```
- 顯示DataFrame滿足布林條件的情形
```python
In [40]: df[df > 0]
Out[40]:
A B C D
2013-01-01 0.469112 NaN NaN NaN
2013-01-02 1.212112 NaN 0.119209 NaN
2013-01-03 NaN NaN NaN 1.071804
2013-01-04 0.721555 NaN NaN 0.271860
2013-01-05 NaN 0.567020 0.276232 NaN
2013-01-06 NaN 0.113648 NaN 0.524988
```
- 以`isin()`方法篩選數據:
```python
In [41]: df2 = df.copy()
In [42]: df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
In [43]: df2
Out[43]:
A B C D E
2013-01-01 0.469112 -0.282863 -1.509059 -1.135632 one
2013-01-02 1.212112 -0.173215 0.119209 -1.044236 one
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-04 0.721555 -0.706771 -1.039575 0.271860 three
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
2013-01-06 -0.673690 0.113648 -1.478427 0.524988 three
In [44]: df2[df2['E'].isin(['two', 'four'])]
Out[44]:
A B C D E
2013-01-03 -0.861849 -2.104569 -0.494929 1.071804 two
2013-01-05 -0.424972 0.567020 0.276232 -1.087401 four
```
> `df.copy()`預設為`df.copy(deep=True)`,意思是預設執行深Copy
> 深複製:創建一個新的物件,並且徹底複製原始物件中的所有數據。深複製後,原始數據和新複製的數據互不影響,它們在記憶體中是完全獨立的。
> 淺複製:創建一個新的物件,但是不會徹底複製數據,只是複製數據的引用。淺複製後,原始數據和新複製的數據會相互影響,因為它們共享同一塊記憶體中的數據。
> `df2 = df` 會是淺複製,`df2` 的任何改動也會變更 `df` ,因為是同一筆數據集。
### Setting 設置
- 設定新列會自動按索引對齊資料:
```python
In [45]: s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range('20130102', periods=6))
In [46]: s1
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [47]: df['F'] = s1
```
- 以標籤更新數值:
```python
In [48]: df.at[dates[0], 'A'] = 0
```
- 以位置更新數值:
```python
In [49]: df.iat[0, 1] = 0
```
- 以NumPy array更新
```python
In [50]: df.loc[:, 'D'] = np.array([5] * len(df))
```
- df依前述操作更新結果
```python
In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 0.119209 5 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0
2013-01-05 -0.424972 0.567020 0.276232 5 4.0
2013-01-06 -0.673690 0.113648 -1.478427 5 5.0
```
- 以`where`條件判斷運算子更新值
```python
In [52]: df2 = df.copy()
In [53]: df2[df2 > 0] = -df2
In [54]: df2
Out[54]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 -5 NaN
2013-01-02 -1.212112 -0.173215 -0.119209 -5 -1.0
2013-01-03 -0.861849 -2.104569 -0.494929 -5 -2.0
2013-01-04 -0.721555 -0.706771 -1.039575 -5 -3.0
2013-01-05 -0.424972 -0.567020 -0.276232 -5 -4.0
2013-01-06 -0.673690 -0.113648 -1.478427 -5 -5.0
```
## Missing data 缺失值處裡
- `pandas`以`np.nan`表示缺失值,預設情況不進行運算,參閱 [Missing Data section](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#missing-data)
- `.reindex()`可以修改/增加/刪除索引,將回傳一個數據的副本
```python
In [55]: df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
In [56]: df1.loc[dates[0]:dates[1], 'E'] = 1
In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 NaN 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 NaN
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 NaN
```
- 丟掉有區失值的行
```python
In [58]: df1.dropna(how='any')
Out[58]:
A B C D F E
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
```
- 對缺失值賦值
```python
In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -1.509059 5 5.0 1.0
2013-01-02 1.212112 -0.173215 0.119209 5 1.0 1.0
2013-01-03 -0.861849 -2.104569 -0.494929 5 2.0 5.0
2013-01-04 0.721555 -0.706771 -1.039575 5 3.0 5.0
```
- 以`.isna()`使用布林遮罩
```python
In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
```
## Operations 操作
- 參閱[Basic section on Binary Ops](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-binop)。
### Stats 統計
- 操作通常不包含缺失項(缺失要先預處理)
- 執行敘述統計-按列
```python
In [61]: df.mean()
Out[61]:
A -0.004474
B -0.383981
C -0.687758
D 5.000000
F 3.000000
dtype: float64
```
- 執行敘述統計-按欄
```python
In [62]: df.mean(1)
Out[62]:
2013-01-01 0.872735
2013-01-02 1.431621
2013-01-03 0.707731
2013-01-04 1.395042
2013-01-05 1.883656
2013-01-06 1.592306
Freq: D, dtype: float64
```
- Series或DataFrame如要操作不同維度需先對齊,Pandas會自動沿著指定維度廣播(broadcasting),並且會用`np.nan`填滿未對齊的標籤。
```python
#以時間為index對齊
#.shift(2)為資料沿軸順移2位
In [63]: s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
In [65]: df.sub(s, axis='index')
Out[65]:
A B C D F
2013-01-01 NaN NaN NaN NaN NaN
2013-01-02 NaN NaN NaN NaN NaN
2013-01-03 -1.861849 -3.104569 -1.494929 4.0 1.0
2013-01-04 -2.278445 -3.706771 -4.039575 2.0 0.0
2013-01-05 -5.424972 -4.432980 -4.723768 0.0 -1.0
2013-01-06 NaN NaN NaN NaN NaN
```
### 使用者定義的函數
- 應用`DataFrame.agg()`、`DataFrame.transform()`使用者定義的函數來分別減少或廣播其結果。
```python
In [66]: df.agg(lambda x: np.mean(x) * 5.6)
Out[66]:
A -0.025054
B -2.150294
C -3.851445
D 28.000000
F 16.800000
dtype: float64
In [67]: df.transform(lambda x: x * 101.2)
Out[67]:
A B C D F
2013-01-01 0.000000 0.000000 -152.716721 506.0 NaN
2013-01-02 122.665737 -17.529322 12.063922 506.0 101.2
2013-01-03 -87.219115 -212.982405 -50.086843 506.0 202.4
2013-01-04 73.021382 -71.525239 -105.204988 506.0 303.6
2013-01-05 -43.007200 57.382459 27.954680 506.0 404.8
2013-01-06 -68.177398 11.501219 -149.616767 506.0 506.0
```
### Value 很重要
- 更多資訊請參見[直方圖和離散化](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-discretization)。
```python
In [68]: s = pd.Series(np.random.randint(0, 7, size=10))
In [69]: s
Out[69]:
0 4
1 2
2 1
3 2
4 6
5 4
6 4
7 6
8 4
9 4
dtype: int64
In [70]: s.value_counts()
Out[70]:
4 5
2 2
6 2
1 1
Name: count, dtype: int64
```
### 字串方法
- [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html#pandas.Series)具有字串`str`的處理方法 ,可以方便地對數組的每個元素進行操作,如下面的程式碼片段所示。請參閱[向量化字串方法](https://pandas.pydata.org/pandas-docs/stable/user_guide/text.html#text-string-methods)以了解更多資訊。
- 請注意,str中的模式匹配通常默認使用[正則表達式](https://docs.python.org/3/library/re.html)(在某些情況下總是使用它們)
```python
In [71]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
```
### Apply 應用
- 以Applying functions進行資料處理:
```python
In [66]: df.apply(np.cumsum) #累加
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -1.509059 5 NaN
2013-01-02 1.212112 -0.173215 -1.389850 10 1.0
2013-01-03 0.350263 -2.277784 -1.884779 15 3.0
2013-01-04 1.071818 -2.984555 -2.924354 20 6.0
2013-01-05 0.646846 -2.417535 -2.648122 25 10.0
2013-01-06 -0.026844 -2.303886 -4.126549 30 15.0
In [67]: df.apply(lambda x: x.max() - x.min())
Out[67]:
A 2.073961
B 2.671590
C 1.785291
D 0.000000
F 4.000000
dtype: float64
```
## Merge合併
### Concat連接
- pandas提供各種簡易的合併Series及Dataframe物件操作方式,參閱[Merging section](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging)
- 以`concat()`連接pandas物件:
```python
In [73]: df = pd.DataFrame(np.random.randn(10, 4))
In [74]: df
Out[74]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
# break it into pieces
In [75]: pieces = [df[:3], df[3:7], df[7:]] #分段
In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 -0.548702 1.467327 -1.015962 -0.483075
1 1.637550 -1.217659 -0.291519 -1.745505
2 -0.263952 0.991460 -0.919069 0.266046
3 -0.709661 1.669052 1.037882 -1.705775
4 -0.919854 -0.042379 1.247642 -0.009920
5 0.290213 0.495767 0.362949 1.548106
6 -1.131345 -0.089329 0.337863 -0.945867
7 -0.932132 1.956030 0.017587 -0.016692
8 -0.575247 0.254161 -1.143704 0.215897
9 1.193555 -0.077118 -0.408530 -0.862495
```
> 向[DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html#pandas.DataFrame)新增"Column"相對較快。但是,添加"Row"需要copy副本,並且可能很昂貴。我們建議將預先建立的記錄列表傳遞給DataFrame建構函數,而不是DataFrame透過迭代地向其追加記錄來建構。
### Join
- [merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge) 可以採用SQL style合併。參閱[Database style joining](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#merging-join)章節。
```python
In [77]: left = pd.DataFrame({'key': ['foo', 'foo'], 'lval': [1, 2]})
In [78]: right = pd.DataFrame({'key': ['foo', 'foo'], 'rval': [4, 5]})
In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2
In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5
In [81]: pd.merge(left, right, on='key')
Out[81]:
key lval rval
0 foo 1 4
1 foo 1 5
2 foo 2 4
3 foo 2 5
```
- [merge()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge.html#pandas.merge)唯一值:
```python
In [82]: left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [1, 2]})
In [83]: right = pd.DataFrame({'key': ['foo', 'bar'], 'rval': [4, 5]})
In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2
In [85]: right
Out[85]:
key rval
0 foo 4
1 bar 5
In [86]: pd.merge(left, right, on='key')
Out[86]:
key lval rval
0 foo 1 4
1 bar 2 5
```
## Grouping
- 透過“group by”將數據對每個分組應用不同的function並結合展示成果,過程為:
- 依據某種標準將數據拆分(Splitting)為組
- 將設計好的功能(applying)對每個組獨立處理。
- 結合(Combining)成果至資料結構
- 參閱[Grouping](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#groupby)章節.
```python
In [91]: pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
'foo', 'bar', 'foo', 'foo'],
'B': ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'C': np.random.randn(8),
'D': np.random.randn(8)})
In [92]: df
Out[92]:
A B C D
0 foo one -1.202872 -0.055224
1 bar one -1.814470 2.395985
2 foo two 1.018601 1.552825
3 bar three -0.595447 0.166599
4 foo two 1.395433 0.047609
5 bar two -0.392670 -0.136473
6 foo one 0.007207 -0.561757
7 foo three 1.928123 -1.623033
```
- 按列標籤分組,選擇列標籤,然後將 [DataFrameGroupBy.sum()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.sum.html#pandas.core.groupby.DataFrameGroupBy.sum)函數套用至結果組:
```python
In [93]: df.groupby('A').sum()
Out[93]:
C D
A
bar -2.802588 2.42611
foo 3.146492 -0.63958
```
- 以多欄位分組形成分層索引,並應用`sum()`
```python
In [94]: df.groupby(['A', 'B']).sum()
Out[94]:
C D
A B
bar one -1.814470 2.395985
three -0.595447 0.166599
two -0.392670 -0.136473
foo one -1.195665 -0.616981
three 1.928123 -1.623033
two 2.414034 1.600434
```
## Reshaping 重塑
參閱[Hierarchical Indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html#advanced-hierarchical) and [Reshaping](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-stacking)章節內容
### Stack 堆疊
- 分層資料結構的形式
```python
In [91]: arrays = [
....: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
....: ["one", "two", "one", "two", "one", "two", "one", "two"],
....: ]
....:
In [92]: index = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
In [93]: df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
In [94]: df2 = df[:4]
In [95]: df2
Out[95]:
A B
first second
bar one -0.727965 -0.589346
two 0.339969 -0.693205
baz one -0.339355 0.593616
two 0.884345 1.591431
```
- 使用[stack()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.stack)方法將DataFrame壓縮(compresses) 為階層形式的欄位
```python
In [100]: stacked = df2.stack()
In [101]: stacked
Out[101]:
first second
bar one A 0.029399
B -0.542108
two A 0.282696
B -0.087302
baz one A -1.575170
B 1.771208
two A 0.816482
B 1.100230
dtype: float64
```
- 使用堆疊的DataFrame或Series(具有階層索引[MultiIndex](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.MultiIndex.html#pandas.MultiIndex)),與`stack()`相反的操作為`unstack()`,預設情況下為取消堆疊最後一級:
```python
In [102]: stacked.unstack()
Out[102]:
A B
first second
bar one 0.029399 -0.542108
two 0.282696 -0.087302
baz one -1.575170 1.771208
two 0.816482 1.100230
In [103]: stacked.unstack(1)
Out[103]:
second one two
first
bar A 0.029399 0.282696
B -0.542108 -0.087302
baz A -1.575170 0.816482
B 1.771208 1.100230
In [104]: stacked.unstack(0)
Out[104]:
first bar baz
second
one A 0.029399 -1.575170
B -0.542108 1.771208
two A 0.282696 0.816482
B -0.087302 1.100230
```
### Pivot tables
- 參閱[Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-pivot)
```python
In [105]: df = pd.DataFrame({'A': ['one', 'one', 'two', 'three'] * 3,
'B': ['A', 'B', 'C'] * 4,
'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
'D': np.random.randn(12),
'E': np.random.randn(12)})
In [106]: df
Out[106]:
A B C D E
0 one A foo 1.418757 -0.179666
1 one B foo -1.879024 1.291836
2 two C foo 0.536826 -0.009614
3 three A bar 1.006160 0.392149
4 one B bar -0.029716 0.264599
5 one C bar -1.146178 -0.057409
6 two A foo 0.100900 -1.425638
7 three B foo -1.035018 1.024098
8 one C foo 0.314665 -0.106062
9 one A bar -0.773723 1.824375
10 two B bar -1.170653 0.595974
11 three C bar 0.648740 1.167115
```
- 我們可以非常輕鬆地從這些數據生成數據透視表:
```python
In [107]: pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Out[107]:
C bar foo
A B
one A -0.773723 1.418757
B -0.029716 -1.879024
C -1.146178 0.314665
three A 1.006160 NaN
B NaN -1.035018
C 0.648740 NaN
two A NaN 0.100900
B -1.170653 NaN
C NaN 0.536826
```
## Time series 時間序列
- pandas具有簡單,強大且高效的功能,用於在頻率轉換期間執行重採樣操作(例如,將第二數據轉換為5分鐘數據)。這在財務應用程序中非常常見,但不僅限於此。請參閱[Time Series](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timeseries)章節
```python
In [108]: rng = pd.date_range('1/1/2012', periods=100, freq='S')
In [109]: ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
In [110]: ts.resample('5Min').sum()
Out[110]:
2012-01-01 25083
Freq: 5T, dtype: int64
```
- 時區呈現:
```python
In [111]: rng = pd.date_range('3/6/2012 00:00', periods=5, freq='D')
In [112]: ts = pd.Series(np.random.randn(len(rng)), rng)
In [113]: ts
Out[113]:
2012-03-06 0.464000
2012-03-07 0.227371
2012-03-08 -0.496922
2012-03-09 0.306389
2012-03-10 -2.290613
Freq: D, dtype: float64
In [114]: ts_utc = ts.tz_localize('UTC')
In [115]: ts_utc
Out[115]:
2012-03-06 00:00:00+00:00 0.464000
2012-03-07 00:00:00+00:00 0.227371
2012-03-08 00:00:00+00:00 -0.496922
2012-03-09 00:00:00+00:00 0.306389
2012-03-10 00:00:00+00:00 -2.290613
Freq: D, dtype: float64
```
- [Series.tz_convert()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.tz_convert.html#pandas.Series.tz_convert)將轉換為另一個時區:轉換為另一個時區:
```python
In [116]: ts_utc.tz_convert('US/Eastern')
Out[116]:
2012-03-05 19:00:00-05:00 0.464000
2012-03-06 19:00:00-05:00 0.227371
2012-03-07 19:00:00-05:00 -0.496922
2012-03-08 19:00:00-05:00 0.306389
2012-03-09 19:00:00-05:00 -2.290613
Freq: D, dtype: float64
```
- [BusinessDay](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.tseries.offsets.BusinessDay.html#pandas.tseries.offsets.BusinessDay)在不同時間跨度表示間轉換:
```python
In [113]: rng
Out[113]:
DatetimeIndex(['2012-03-06', '2012-03-07', '2012-03-08', '2012-03-09',
'2012-03-10'],
dtype='datetime64[ns]', freq='D')
In [114]: rng + pd.offsets.BusinessDay(5)
Out[114]:
DatetimeIndex(['2012-03-13', '2012-03-14', '2012-03-15', '2012-03-16',
'2012-03-16'],
dtype='datetime64[ns]', freq=None)
```
## Categoricals 分類
- 現在pandas可以在DataFrame中包含分類數據,詳情參閱[categorical introduction](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical) 及[API documentation](https://pandas.pydata.org/pandas-docs/stable/reference/arrays.html#api-arrays-categorical).
```python
In [127]: df = pd.DataFrame({"id": [1, 2, 3, 4, 5, 6],
"raw_grade": ['a', 'b', 'b', 'a', 'a', 'e']})
```
- 將原始成績轉換為分類數據
```python
In [128]: df["grade"] = df["raw_grade"].astype("category")
In [129]: df["grade"]
Out[129]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): [a, b, e]
```
- 重命名分類使其更有意義(使用` Series.cat.categories`轉換).
```python
In [130]: df["grade"].cat.categories = ["very good", "good", "very bad"]
```
- 重新整理類別,並添加缺少的類別(預設為回傳新 Series).
```python
In [131]: df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium",
"good", "very good"])
In [132]: df["grade"]
Out[132]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): [very bad, bad, medium, good, very good]
```
- 按整理後的類別排序`.sort_values()`
```python
In [133]: df.sort_values(by="grade")
Out[133]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
```
- 按類別分類也會包含具空值的類別
```python
In [134]: df.groupby("grade").size()
Out[134]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
```
## Plotting 繪圖
- 請參閱[Plotting](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html#visualization)文檔。
```python
In [124]: import matplotlib.pyplot as plt
In [125]: plt.close("all")
```
- 此plt.close方法用於關閉圖形視窗:
```python
In [135]: ts = pd.Series(np.random.randn(1000),
index=pd.date_range('1/1/2000', periods=1000))
In [136]: ts = ts.cumsum()
In [137]: ts.plot()
Out[137]: <matplotlib.axes._subplots.AxesSubplot at 0x7f24a8b314d0>
```
![](https://i.imgur.com/N5k8wXV.png)
> 使用 Jupyter 時,繪圖將使用 出現plot()。否則使用 matplotlib.pyplot.show顯示它或 matplotlib.pyplot.savefig將其寫入檔案。
- 在DataFrame上,該`plot()`方法可以方便地使用標籤繪製所有列:
```python
In [138]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index,
columns=['A', 'B', 'C', 'D'])
In [139]: df = df.cumsum()
In [140]: plt.figure()
Out[140]: <Figure size 640x480 with 0 Axes>
In [141]: df.plot()
Out[141]: <matplotlib.axes._subplots.AxesSubplot at 0x7f24a8b13750>
In [142]: plt.legend(loc='best')
Out[142]: <matplotlib.legend.Legend at 0x7f24a88250d0>
```
![](https://i.imgur.com/J2YfhDD.png)
## Getting data in/out 資料讀取、輸出
### CSV
- [Writing to a csv file](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-store-in-csv)
```python
df = pd.DataFrame(np.random.randint(0, 5, (10, 5)))
df.to_csv('foo.csv')
```
- [Reading from a csv file](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table)
```python
In [136]: pd.read_csv("foo.csv")
Out[136]:
Unnamed: 0 0 1 2 3 4
0 0 4 3 1 1 2
1 1 1 0 2 3 2
2 2 1 4 2 1 2
3 3 0 4 0 2 2
4 4 4 2 2 3 4
5 5 4 0 4 3 1
6 6 2 1 2 0 3
7 7 4 0 4 4 4
8 8 4 4 1 0 1
9 9 0 4 3 0 3
```
### Parquet
```python
In [137]: df.to_parquet("foo.parquet")
```
```python
In [138]: pd.read_parquet("foo.parquet")
Out[138]:
0 1 2 3 4
0 4 3 1 1 2
1 1 0 2 3 2
2 1 4 2 1 2
3 0 4 0 2 2
4 4 2 2 3 4
5 4 0 4 3 1
6 2 1 2 0 3
7 4 0 4 4 4
8 4 4 1 0 1
9 0 4 3 0 3
```
### Excel
讀寫[MS Excel](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-excel)
- 寫入excel檔案[DataFrame.to_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html#pandas.DataFrame.to_excel)
```python
In [147]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
```
- 讀取excel檔案[read_excel()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel)
```python
In [148]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[148]:
Unnamed: 0 A B C D
0 2000-01-01 0.266457 -0.399641 -0.219582 1.186860
1 2000-01-02 -1.170732 -0.345873 1.653061 -0.282953
2 2000-01-03 -1.734933 0.530468 2.060811 -0.515536
3 2000-01-04 -1.555121 1.452620 0.239859 -1.156896
4 2000-01-05 0.578117 0.511371 0.103552 -2.428202
.. ... ... ... ... ...
995 2002-09-22 -8.985362 -8.485624 -4.669462 31.367740
996 2002-09-23 -9.558560 -8.781216 -4.499815 30.518439
997 2002-09-24 -9.902058 -9.340490 -4.386639 30.105593
998 2002-09-25 -10.216020 -9.480682 -3.933802 29.758560
999 2002-09-26 -11.856774 -10.671012 -3.216025 29.369368
[1000 rows x 5 columns]
```
## Gotchas 小陷阱
- 如果操作時遇到異常,如:
```python
>>> if pd.Series([False, True, False]):
... print("I was true")
Traceback
...
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
```
- 請查看[Comparisons](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html#basics-compare)來處理異常,或查看[Gotchas](https://pandas.pydata.org/pandas-docs/stable/user_guide/gotchas.html#gotchas)也可以.