---
tags: Python, Pandas
---
# Python Pandas
[](https://hackmd.io/ReqCzXTHR7KtMYMWHUFJjA)
[TOC]
## Read Write Csv
```python
path = './test.csv'
df = pd.read_csv(path)
df = pd.read_csv(path, index=False, header=False)
# if file size == 0 will add header in csv
with open(path, 'w') as ff:
df.to_csv(ff, header=ff.tell() == 0, index=False)
```
```python
df = pd.DataFrame(
{
'Date':
['08/09/2018',
'10/09/2018',
'08/09/2018',
'10/09/2018'],
'Fruit':
['Apple',
'Apple',
'Banana',
'Banana'],
'Sale':
[34,
12,
22,
27]
})
print(df)
```
```
Date Fruit Sale
0 08/09/2018 Apple 34
1 10/09/2018 Apple 12
2 08/09/2018 Banana 22
3 10/09/2018 Banana 27
```
## Basic calculation
+ ### df.sum()
```python
# 以 columns 作為 index,對每個 columns 做總和,若是 str 則相連
print(df.sum())
```
```
Date 08/09/201810/09/201808/09/201810/09/2018
Fruit AppleAppleBananaBanana
Sale 95
dtype: object
```
+ ### df.max()
```python
# 以 columns 作為 index,對每個 columns 求 max
print(df.max())
```
```
Date 10/09/2018
Fruit Banana
Sale 34
dtype: object
```
+ ### df.min()
```python
# 以 columns 作為 index,對每個 columns 求 min
print(df.min())
```
```
Date 08/09/2018
Fruit Apple
Sale 12
```
+ ### df.mean()
```python
# 以 columns 作為 index,只對每個 columns 數字做 mean
print(df.mean())
```
```
Sale 23.75
dtype: float64
```
## Useful
+ ### Group by
```python
# 以 Fruit 作為 index,Sale 為第一 columns,提取所有 Fruit 為 apple 的 Sale 欄位
print(df.groupby('Fruit')['Sale'].get_group('Apple'))
```
```
0 34
1 12
Name: Sale, dtype: int64
```
+ ### Use df.groupby()
```python
# 若不用新增到 index 則使用 as_index=False
# 以 fruit 作為 index,取得所有 Fruit 為 Apple 的資料
print(df.groupby('Fruit').get_group('Apple'))
```
```
Date Fruit Sale
0 08/09/2018 Apple 34
1 10/09/2018 Apple 12
```
---
```python
train_df.groupby(['date_block_num', 'shop_id']).get_group((0, 0))
```
---
```python
# 以 Fruit 作為 index,計算 Fruit 相同的資料有多少
print(df.groupby('Fruit').size())
```
```
Fruit
Apple 2
Banana 2
dtype: int64
```
---
```python
# 以 Fruit 作為 index,取得 Sale 並且顯示在每個 Fruit 分類中最大的 Sale
print(df.groupby('Fruit')['Sale'].max())
```
```
Fruit
Apple 34
Banana 27
Name: Sale, dtype: int64
```
---
```python
# 以 Fruit 作為 index,取得 Sale 並且顯示在每個 Fruit 分類中最小的 Sale
print(df.groupby('Fruit')['Sale'].min())
```
```
Fruit
Apple 12
Banana 22
Name: Sale, dtype: int64
```
---
```python
# 以 Fruit 作為 index,計算並且顯示 Fruit 各自的分類中 Sale 的 min, max
a = df.groupby('Fruit')['Sale'].agg([min, max])
print(a)
```
```
min max
Fruit
Apple 12 34
Banana 22 27
```
---
```python
# 以 Fruit 作為 index,取得 Sale 並且顯示在每個 Fruit 分類中的 mean
print(df.groupby('Fruit')['Sale'].mean())
```
```
Fruit
Apple 23.0
Banana 24.5
Name: Sale, dtype: float64
```
---
```python
# 以 Date, Fruit 作為 index,且計算分別輸出現的次數
print(df.groupby(['Date', 'Fruit']).size())
```
+ ### df.describe()
```python
# 顯示所有資料中數字的相關訊息
print(df.describe())
```
```
Sale
count 4.000000
mean 23.750000
std 9.251126
min 12.000000
25% 19.500000
50% 24.500000
75% 28.750000
max 34.000000
```
+ ### info()
```python
print(df.info())
```
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Date 5 non-null object
1 Fruit 5 non-null object
2 Sale 5 non-null float64
dtypes: float64(1), object(2)
memory usage: 272.0+ bytes
```
+ ### sort_values()
```python
# 依照 min, max 為 columns 做反向排序,ascending=False 反向排序,由大到小
print(a.sort_values(by=['min', 'max'], ascending=False))
```
```
min max
Fruit
Banana 22 27
Apple 12 34
```
+ ### 選取 row 的資料
```python
# 提取第 0 列的資料,columns 作為 index
print(df.loc[0])
```
```
Date 08/09/2018
Fruit Apple
Sale 34
Name: 0, dtype: object
```
---
```python
# 使用下面更改過 index 的資料,取得 index 為 A 的資料
print(df.loc['A'])
```
```
Date 08/09/2018
Fruit Apple
Sale 34
Name: A, dtype: object
```
---
```python
# iloc 是輸入第幾個,loc 是輸入 index name
# 選取 row == 1 的資料
print(df.iloc[1])
```
```
Date 10/09/2018
Fruit Apple
Sale 12
Name: B, dtype: object
```
+ ### Change index
```python
df.index = ['A', 'B', 'C', 'D']
```
```
Date Fruit Sale
A 08/09/2018 Apple 34
B 10/09/2018 Apple 12
C 08/09/2018 Banana 22
D 10/09/2018 Banana 27
```
+ ### set_index()
```python
# 以 Fruit Sale 作為 index
print(df.set_index(['Fruit', 'Sale']))
```
```
Date
Fruit Sale
Apple 34 08/09/2018
12 10/09/2018
Banana 22 08/09/2018
27 10/09/2018
```
+ ### reset_index
```python
# inplace = True --> direct save in origin DataFrame
# drop = True --> don't save old index in DataFrame
df.reset_index(inplace=True, drop=True)
```
## Null space
```python
# 創建空值
df = pd.DataFrame(
{
'Date':
['08/09/2018',
'10/09/2018',
'08/09/2018',
'10/09/2018',
np.nan,
'08/09/2010'],
'Fruit':
['Apple',
'Apple',
'Banana',
np.nan,
'Orange',
'Apple'],
'Sale':
[34,
12,
22,
27,
12,
np.nan]
})
```
```
Date Fruit Sale
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
3 10/09/2018 NaN 27.0
4 NaN Orange 12.0
5 08/09/2010 Apple NaN
```
---
```python
# 取出 df 裡面 Fruit 為空值的所有資料
print(df[pd.isnull(df['Fruit'])])
```
```
Date Fruit Sale
3 10/09/2018 NaN 27.0
```
---
```python
# 取出 df 裡面 Fruit 不為空值的所有資料
print(df[pd.notnull(df['Fruit'])])
```
```
Date Fruit Sale
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
4 NaN Orange 12.0
5 08/09/2010 Apple NaN
```
---
```python
# 若有空值,刪除整列
print(df.dropna())
```
```
Date Fruit Sale
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
```
---
```python
# 將所有空值改為99
print(df.fillna(99))
```
```
Date Fruit Sale
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
3 10/09/2018 99 27.0
4 99 Orange 12.0
5 08/09/2010 Apple 99.0
```
## MultiIndex DataFrame Transfer
```python
# unstack(): MultiIndex->DataFrame
# stack(): DataFrame->MultiIndex
print(df.stack())
```
```
0 Date 08/09/2018
Fruit Apple
Sale 34
1 Date 10/09/2018
Fruit Apple
Sale 12
2 Date 08/09/2018
Fruit Banana
Sale 22
3 Date 10/09/2018
Sale 27
4 Fruit Orange
Sale 12
5 Date 08/09/2010
Fruit Apple
dtype: object
```
---
```python
# 等於沒變
print(df.stack().unstack())
```
```
Date Fruit Sale
0 08/09/2018 Apple 34
1 10/09/2018 Apple 12
2 08/09/2018 Banana 22
3 10/09/2018 NaN 27
4 NaN Orange 12
5 08/09/2010 Apple NaN
```
---
```python
# 給 MultiIndex 的 columns 取名稱
a = df.stack()
a.index.names = ['index', 'name']
print(a)
```
```
index name
0 Date 08/09/2018
Fruit Apple
Sale 34
1 Date 10/09/2018
Fruit Apple
Sale 12
2 Date 08/09/2018
Fruit Banana
Sale 22
3 Date 10/09/2018
Sale 27
4 Fruit Orange
Sale 12
5 Date 08/09/2010
Fruit Apple
dtype: object
```
+ ### replace()
```python
# 將 Fruit 的 Apple 換成 Pineapple
print(df.Fruit.replace('Apple', 'Pineapple'))
```
```
0 Pineapple
1 Pineapple
2 Banana
3 NaN
4 Orange
5 Pineapple
Name: Fruit, dtype: object
```
+ ### value_counts()
```python
# 計算 Fruit 裡面每個資料出現的次數
print(df.Fruit.value_counts())
```
```
Apple 3
Orange 1
Banana 1
Name: Fruit, dtype: int64
```
+ ### size
```python
# 計算 Fruit 的總數量
print(df.Fruit.size)
```
```
6
```
+ ### rename()
```python
# 將 Fruit 名稱改為 TestFood
print(df.rename(columns={'Fruit':'TestFood'}))
```
```
Date TestFood Sale
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
3 10/09/2018 NaN 27.0
4 NaN Orange 12.0
5 08/09/2010 Apple NaN
```
---
```python
# 將 index 為 0 改為 zero, 1 改為 one
print(df.rename(index={0:'zero', 1:'one'}))
```
```
Date Fruit Sale
zero 08/09/2018 Apple 34.0
one 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
3 10/09/2018 NaN 27.0
4 NaN Orange 12.0
5 08/09/2010 Apple NaN
```
---
```python
# 將 row 標籤取為 wines, columns 標籤取為 fields
print(df.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns'))
```
```
fields Date Fruit Sale
wines
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
3 10/09/2018 NaN 27.0
4 NaN Orange 12.0
5 08/09/2010 Apple NaN
```
+ ### concat()
```python
# 將多個資料串接在一起,若有相同的 columns 名稱則會合併
print(pd.concat([df, df]))
```
```
Date Fruit Sale
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
3 10/09/2018 NaN 27.0
4 NaN Orange 12.0
5 08/09/2010 Apple NaN
0 08/09/2018 Apple 34.0
1 10/09/2018 Apple 12.0
2 08/09/2018 Banana 22.0
3 10/09/2018 NaN 27.0
4 NaN Orange 12.0
5 08/09/2010 Apple NaN
```
---
```python
# 將多個資料串接在一起,若有不相同的 columns 名稱則會創新的 columns,缺資料的補 NaN
c = pd.DataFrame({
'test':[0, 1, 2]
})
print(pd.concat([df, c]))
```
```
Date Fruit Sale test
0 08/09/2018 Apple 34.0 NaN
1 10/09/2018 Apple 12.0 NaN
2 08/09/2018 Banana 22.0 NaN
3 10/09/2018 NaN 27.0 NaN
4 NaN Orange 12.0 NaN
5 08/09/2010 Apple NaN NaN
0 NaN NaN NaN 0.0
1 NaN NaN NaN 1.0
2 NaN NaN NaN 2.0
```
+ ### join()
```python
# 左右合併數據,若有相同的 columns,則使用 lsuffix='_before',替左邊的數據重新取名(原來的名字 + '_before', lsuffix='_after' 則是右邊
c = pd.DataFrame({
'test':[0, 1, 2],
'Fruit':['Apple', 'Banana', np.nan]
})
print(df.join(c, lsuffix='_before', rsuffix='_after'))
```
```
Date Fruit_before Sale test Fruit_after
0 08/09/2018 Apple 34.0 0.0 Apple
1 10/09/2018 Apple 12.0 1.0 Banana
2 08/09/2018 Banana 22.0 2.0 NaN
3 10/09/2018 NaN 27.0 NaN NaN
4 NaN Orange 12.0 NaN NaN
5 08/09/2010 Apple NaN NaN NaN
```
+ ### Transpose
```python
print(df.T)
```
```
0 1 2 3 4 5
Date 08/09/2018 10/09/2018 08/09/2018 10/09/2018 NaN 08/09/2010
Fruit Apple Apple Banana NaN Orange Apple
Sale 34 12 22 27 12 NaN
```
+ ### Read from ClipBoard
```python
df = pd.read_clipboard()
```
## Custom setting
```python
# display all columns data
df.set_option('display.max_columns', None)
```
+ ### Show float precision
```python
pd.set_option("display.precision", 1)
```
+ ### split()
```python
# 逗號分割
df.Fruit.str.split(',')
```
```
0 [Apple]
1 [Apple]
2 [Banana]
3 NaN
4 [Orange]
5 [Apple]
Name: Fruit, dtype: object
```
+ ### tolist()
```python
# 要在欄位為 list 時才可使用
df.Fruit.tolist()
```
+ ### remove repeat row
```python
repeat = df.duplicated()
out = df[~repeat]
```