Python Pandas - HackMD

--- tags: Python, Pandas --- # Python Pandas [![hackmd-github-sync-badge](https://hackmd.io/ReqCzXTHR7KtMYMWHUFJjA/badge)](https://hackmd.io/ReqCzXTHR7KtMYMWHUFJjA) [TOC] ## Read Write Csv ```python path = './test.csv' df = pd.read_csv(path) df = pd.read_csv(path, index=False, header=False) # if file size == 0 will add header in csv with open(path, 'w') as ff: df.to_csv(ff, header=ff.tell() == 0, index=False) ``` ```python df = pd.DataFrame( { 'Date': ['08/09/2018', '10/09/2018', '08/09/2018', '10/09/2018'], 'Fruit': ['Apple', 'Apple', 'Banana', 'Banana'], 'Sale': [34, 12, 22, 27] }) print(df) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34 1 10/09/2018 Apple 12 2 08/09/2018 Banana 22 3 10/09/2018 Banana 27 ``` ## Basic calculation + ### df.sum() ```python # 以 columns 作為 index，對每個 columns 做總和，若是 str 則相連 print(df.sum()) ``` ``` Date 08/09/201810/09/201808/09/201810/09/2018 Fruit AppleAppleBananaBanana Sale 95 dtype: object ``` + ### df.max() ```python # 以 columns 作為 index，對每個 columns 求 max print(df.max()) ``` ``` Date 10/09/2018 Fruit Banana Sale 34 dtype: object ``` + ### df.min() ```python # 以 columns 作為 index，對每個 columns 求 min print(df.min()) ``` ``` Date 08/09/2018 Fruit Apple Sale 12 ``` + ### df.mean() ```python # 以 columns 作為 index，只對每個 columns 數字做 mean print(df.mean()) ``` ``` Sale 23.75 dtype: float64 ``` ## Useful + ### Group by ```python # 以 Fruit 作為 index，Sale 為第一 columns，提取所有 Fruit 為 apple 的 Sale 欄位 print(df.groupby('Fruit')['Sale'].get_group('Apple')) ``` ``` 0 34 1 12 Name: Sale, dtype: int64 ``` + ### Use df.groupby() ```python # 若不用新增到 index 則使用 as_index=False # 以 fruit 作為 index，取得所有 Fruit 為 Apple 的資料 print(df.groupby('Fruit').get_group('Apple')) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34 1 10/09/2018 Apple 12 ``` --- ```python train_df.groupby(['date_block_num', 'shop_id']).get_group((0, 0)) ``` --- ```python # 以 Fruit 作為 index，計算 Fruit 相同的資料有多少 print(df.groupby('Fruit').size()) ``` ``` Fruit Apple 2 Banana 2 dtype: int64 ``` --- ```python # 以 Fruit 作為 index，取得 Sale 並且顯示在每個 Fruit 分類中最大的 Sale print(df.groupby('Fruit')['Sale'].max()) ``` ``` Fruit Apple 34 Banana 27 Name: Sale, dtype: int64 ``` --- ```python # 以 Fruit 作為 index，取得 Sale 並且顯示在每個 Fruit 分類中最小的 Sale print(df.groupby('Fruit')['Sale'].min()) ``` ``` Fruit Apple 12 Banana 22 Name: Sale, dtype: int64 ``` --- ```python # 以 Fruit 作為 index，計算並且顯示 Fruit 各自的分類中 Sale 的 min， max a = df.groupby('Fruit')['Sale'].agg([min, max]) print(a) ``` ``` min max Fruit Apple 12 34 Banana 22 27 ``` --- ```python # 以 Fruit 作為 index，取得 Sale 並且顯示在每個 Fruit 分類中的 mean print(df.groupby('Fruit')['Sale'].mean()) ``` ``` Fruit Apple 23.0 Banana 24.5 Name: Sale, dtype: float64 ``` --- ```python # 以 Date, Fruit 作為 index，且計算分別輸出現的次數 print(df.groupby(['Date', 'Fruit']).size()) ``` + ### df.describe() ```python # 顯示所有資料中數字的相關訊息 print(df.describe()) ``` ``` Sale count 4.000000 mean 23.750000 std 9.251126 min 12.000000 25% 19.500000 50% 24.500000 75% 28.750000 max 34.000000 ``` + ### info() ```python print(df.info()) ``` ``` <class 'pandas.core.frame.DataFrame'> RangeIndex: 6 entries, 0 to 5 Data columns (total 3 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Date 5 non-null object 1 Fruit 5 non-null object 2 Sale 5 non-null float64 dtypes: float64(1), object(2) memory usage: 272.0+ bytes ``` + ### sort_values() ```python # 依照 min, max 為 columns 做反向排序，ascending=False 反向排序，由大到小 print(a.sort_values(by=['min', 'max'], ascending=False)) ``` ``` min max Fruit Banana 22 27 Apple 12 34 ``` + ### 選取 row 的資料 ```python # 提取第 0 列的資料，columns 作為 index print(df.loc[0]) ``` ``` Date 08/09/2018 Fruit Apple Sale 34 Name: 0, dtype: object ``` --- ```python # 使用下面更改過 index 的資料，取得 index 為 A 的資料 print(df.loc['A']) ``` ``` Date 08/09/2018 Fruit Apple Sale 34 Name: A, dtype: object ``` --- ```python # iloc 是輸入第幾個，loc 是輸入 index name # 選取 row == 1 的資料 print(df.iloc[1]) ``` ``` Date 10/09/2018 Fruit Apple Sale 12 Name: B, dtype: object ``` + ### Change index ```python df.index = ['A', 'B', 'C', 'D'] ``` ``` Date Fruit Sale A 08/09/2018 Apple 34 B 10/09/2018 Apple 12 C 08/09/2018 Banana 22 D 10/09/2018 Banana 27 ``` + ### set_index() ```python # 以 Fruit Sale 作為 index print(df.set_index(['Fruit', 'Sale'])) ``` ``` Date Fruit Sale Apple 34 08/09/2018 12 10/09/2018 Banana 22 08/09/2018 27 10/09/2018 ``` + ### reset_index ```python # inplace = True --> direct save in origin DataFrame # drop = True --> don't save old index in DataFrame df.reset_index(inplace=True, drop=True) ``` ## Null space ```python # 創建空值 df = pd.DataFrame( { 'Date': ['08/09/2018', '10/09/2018', '08/09/2018', '10/09/2018', np.nan, '08/09/2010'], 'Fruit': ['Apple', 'Apple', 'Banana', np.nan, 'Orange', 'Apple'], 'Sale': [34, 12, 22, 27, 12, np.nan] }) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 3 10/09/2018 NaN 27.0 4 NaN Orange 12.0 5 08/09/2010 Apple NaN ``` --- ```python # 取出 df 裡面 Fruit 為空值的所有資料 print(df[pd.isnull(df['Fruit'])]) ``` ``` Date Fruit Sale 3 10/09/2018 NaN 27.0 ``` --- ```python # 取出 df 裡面 Fruit 不為空值的所有資料 print(df[pd.notnull(df['Fruit'])]) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 4 NaN Orange 12.0 5 08/09/2010 Apple NaN ``` --- ```python # 若有空值，刪除整列 print(df.dropna()) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 ``` --- ```python # 將所有空值改為99 print(df.fillna(99)) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 3 10/09/2018 99 27.0 4 99 Orange 12.0 5 08/09/2010 Apple 99.0 ``` ## MultiIndex DataFrame Transfer ```python # unstack(): MultiIndex->DataFrame # stack(): DataFrame->MultiIndex print(df.stack()) ``` ``` 0 Date 08/09/2018 Fruit Apple Sale 34 1 Date 10/09/2018 Fruit Apple Sale 12 2 Date 08/09/2018 Fruit Banana Sale 22 3 Date 10/09/2018 Sale 27 4 Fruit Orange Sale 12 5 Date 08/09/2010 Fruit Apple dtype: object ``` --- ```python # 等於沒變 print(df.stack().unstack()) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34 1 10/09/2018 Apple 12 2 08/09/2018 Banana 22 3 10/09/2018 NaN 27 4 NaN Orange 12 5 08/09/2010 Apple NaN ``` --- ```python # 給 MultiIndex 的 columns 取名稱 a = df.stack() a.index.names = ['index', 'name'] print(a) ``` ``` index name 0 Date 08/09/2018 Fruit Apple Sale 34 1 Date 10/09/2018 Fruit Apple Sale 12 2 Date 08/09/2018 Fruit Banana Sale 22 3 Date 10/09/2018 Sale 27 4 Fruit Orange Sale 12 5 Date 08/09/2010 Fruit Apple dtype: object ``` + ### replace() ```python # 將 Fruit 的 Apple 換成 Pineapple print(df.Fruit.replace('Apple', 'Pineapple')) ``` ``` 0 Pineapple 1 Pineapple 2 Banana 3 NaN 4 Orange 5 Pineapple Name: Fruit, dtype: object ``` + ### value_counts() ```python # 計算 Fruit 裡面每個資料出現的次數 print(df.Fruit.value_counts()) ``` ``` Apple 3 Orange 1 Banana 1 Name: Fruit, dtype: int64 ``` + ### size ```python # 計算 Fruit 的總數量 print(df.Fruit.size) ``` ``` 6 ``` + ### rename() ```python # 將 Fruit 名稱改為 TestFood print(df.rename(columns={'Fruit':'TestFood'})) ``` ``` Date TestFood Sale 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 3 10/09/2018 NaN 27.0 4 NaN Orange 12.0 5 08/09/2010 Apple NaN ``` --- ```python # 將 index 為 0 改為 zero， 1 改為 one print(df.rename(index={0:'zero', 1:'one'})) ``` ``` Date Fruit Sale zero 08/09/2018 Apple 34.0 one 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 3 10/09/2018 NaN 27.0 4 NaN Orange 12.0 5 08/09/2010 Apple NaN ``` --- ```python # 將 row 標籤取為 wines， columns 標籤取為 fields print(df.rename_axis("wines", axis='rows').rename_axis("fields", axis='columns')) ``` ``` fields Date Fruit Sale wines 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 3 10/09/2018 NaN 27.0 4 NaN Orange 12.0 5 08/09/2010 Apple NaN ``` + ### concat() ```python # 將多個資料串接在一起，若有相同的 columns 名稱則會合併 print(pd.concat([df, df])) ``` ``` Date Fruit Sale 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 3 10/09/2018 NaN 27.0 4 NaN Orange 12.0 5 08/09/2010 Apple NaN 0 08/09/2018 Apple 34.0 1 10/09/2018 Apple 12.0 2 08/09/2018 Banana 22.0 3 10/09/2018 NaN 27.0 4 NaN Orange 12.0 5 08/09/2010 Apple NaN ``` --- ```python # 將多個資料串接在一起，若有不相同的 columns 名稱則會創新的 columns，缺資料的補 NaN c = pd.DataFrame({ 'test':[0, 1, 2] }) print(pd.concat([df, c])) ``` ``` Date Fruit Sale test 0 08/09/2018 Apple 34.0 NaN 1 10/09/2018 Apple 12.0 NaN 2 08/09/2018 Banana 22.0 NaN 3 10/09/2018 NaN 27.0 NaN 4 NaN Orange 12.0 NaN 5 08/09/2010 Apple NaN NaN 0 NaN NaN NaN 0.0 1 NaN NaN NaN 1.0 2 NaN NaN NaN 2.0 ``` + ### join() ```python # 左右合併數據，若有相同的 columns，則使用 lsuffix='_before'，替左邊的數據重新取名(原來的名字 + '_before'， lsuffix='_after' 則是右邊 c = pd.DataFrame({ 'test':[0, 1, 2], 'Fruit':['Apple', 'Banana', np.nan] }) print(df.join(c, lsuffix='_before', rsuffix='_after')) ``` ``` Date Fruit_before Sale test Fruit_after 0 08/09/2018 Apple 34.0 0.0 Apple 1 10/09/2018 Apple 12.0 1.0 Banana 2 08/09/2018 Banana 22.0 2.0 NaN 3 10/09/2018 NaN 27.0 NaN NaN 4 NaN Orange 12.0 NaN NaN 5 08/09/2010 Apple NaN NaN NaN ``` + ### Transpose ```python print(df.T) ``` ``` 0 1 2 3 4 5 Date 08/09/2018 10/09/2018 08/09/2018 10/09/2018 NaN 08/09/2010 Fruit Apple Apple Banana NaN Orange Apple Sale 34 12 22 27 12 NaN ``` + ### Read from ClipBoard ```python df = pd.read_clipboard() ``` ## Custom setting ```python # display all columns data df.set_option('display.max_columns', None) ``` + ### Show float precision ```python pd.set_option("display.precision", 1) ``` + ### split() ```python # 逗號分割 df.Fruit.str.split(',') ``` ``` 0 [Apple] 1 [Apple] 2 [Banana] 3 NaN 4 [Orange] 5 [Apple] Name: Fruit, dtype: object ``` + ### tolist() ```python # 要在欄位為 list 時才可使用 df.Fruit.tolist() ``` + ### remove repeat row ```python repeat = df.duplicated() out = df[~repeat] ```