Week 2 - Pandas

---
title: Week 2 - Pandas
tags: class1000, Pandas
description: A basic introduction to a data-cleaning tool
---

# Week 2 - Pandas

## Table of Contents
- [Introduction](#Introduction)
- [Series](#Series)
- [DataFrame](#DataFrame)
- [Descriptive Statistics](#Descriptive-Statistics)
- [Loading and Saving Data](#Loading-and-Saving-Data)
- [Data Processing](#Data-Processing)
- [Groupby](#Groupby)
- [Date and Time](#Date-and-Time)
- [Text Processing](#Text-Processing)
- [Data Visualization](#Data-Visualization)
- [Comprehensive Exercise](#→Comprehensive-Exercise)

>Before class: why pandas and not Excel? Suppose you need to...
>1. Handle very large datasets:dizzy: e.g. 1 million:arrow_up: rows
>2. Aggregate data by condition, over an arbitrary number of rows
>3. Compute on data by condition, over an arbitrary number of rows
>4. Merge and split data
>
>References
>* [Official API](https://pandas.pydata.org/pandas-docs/stable/index.html):muscle::muscle::100:

---

## Introduction
### Getting Started
```python
# import the packages
import numpy as np
import pandas as pd
```
### Pandas Features
* A high-level package built on top of ```Numpy```
* Provides data structures that make data analysis faster and simpler. There are two main structures:
    * **Series**: an indexed ==one-dimensional array==
    * **DataFrame**: for structured (table-like) data; a ==two-dimensional dataset== with a row index and column labels, such as a relational database table or a CSV file
* Makes data preprocessing fast, for example
    * **Filling missing values** ```DataFrame.fillna()```
    * **Dropping missing values** ```DataFrame.dropna()```
* More **input sources** and better **output integration**, for example database integration

More background
> * Where the name pandas comes from: panel data, Python data analysis
> * pandas background: Wes McKinney, AQR
> * [Wikipedia article](https://en.wikipedia.org/wiki/Pandas_(software))

---

## Series
### Features
* Similar to a one-dimensional ndarray object
* The index is created automatically or defined by you (a Series consists of an index plus data)
* Aligns on the index automatically
* Detects missing data:
    * Check for null values ```s.isnull()```
    * Check for non-null values ```s.notnull()```
* Works like an ordered dictionary (many dictionary-style operations apply), and can be sorted, e.g.:
    * Sort by index: ```s.sort_index()```
    * Sort by value: ```s.sort_values()```

### Creating
```python
s = pd.Series(data, index=None, dtype=None, copy=False)
```
Commonly used input types
* list
* dictionary
* ndarray
> [See the documentation for details](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html)

### Key Attributes
* Index (keys): ```s.index```
* Values: ```s.values```
* Value type (dtype): ```s.dtype```

### Indexing and Slicing
There are three ways to select data:
* By **index label**: ```s['index_name']```
* By **index label**, explicitly: ```s.loc['index_name']```
* By **integer position**: ```s.iloc[row_number]```

```python=
# Example
obj = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])  # create a Series
obj['a']  # select by index label
# obj[4]  # you cannot select by value; this raises an error
```

### → Exercise
* Create an array, convert it to a Series, and inspect its index, dtype, and values
```python=
import pandas as pd
import numpy as np

# make some data
r1 = np.arange(2, 21, 2)
obj = pd.Series(r1)

# Solution
# This is what matters⬇️
print(obj.index)   # inspect the index
print(obj.dtype)   # inspect the data type
print(obj.values)  # inspect the values
```

---

## DataFrame
### Features
* A **tabular** data structure
* Ordered rows (row index) and columns (column index)
* Each column can hold a different value type (dtype)
* Can be seen as a dictionary of Series that share the same row index

### Creating
```python
# create directly
df = pd.DataFrame(data, index=None, columns=None, dtype=None, copy=False)
# create from a dict
df = pd.DataFrame.from_dict(data, orient='columns', dtype=None, columns=None)
```
Commonly used input types
* Nested lists ```[[v11,v12],[v21,v22],[v31,v32]...]```
* List of dictionaries ```[{k11:v11, k12:v12}, {k21:v21, k22:v22}, {k31:v31, k32:v32}...]```
* Nested dictionaries ```{k1:{k11:v11}, k2:{k21:v21}...}```
* [ndarrays](https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.html) ```array([v]...)```

### Key Attributes
* Rows (index): ```df.index```
* Columns: ```df.columns```
* Values: ```df.values```
* Value types (dtypes): ```df.dtypes```

### Indexing and Slicing
* Select a column by **column label**: ```df['column_name']```
* Select by **row label** (and column label): ```df.loc['row_name', 'column_name']```
* Select by **integer position** (must be integers): ```df.iloc[row_index, column_index]```

```python=
# Example
from IPython.display import display  # already available in Jupyter/Colab

# make some data
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002, 2003],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
# 'debt' is not in data, so that column is filled with NaN
frame = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                     index=['one', 'two', 'three', 'four', 'five', 'six'])

# This is what matters⬇️
print('The whole DataFrame:')
display(frame)                       # show the whole DataFrame
print('Row selection:')
display(frame.loc[['one', 'two']])   # select rows by label (several rows)
print('Column selection:')
display(frame['state'])              # select a single column
```
![](https://i.imgur.com/blfL9Rw.png)

### → Exercise
```python=
import numpy as np
import pandas as pd

# make some data
np.random.seed(50)
data = {
    'name': ['Owen', 'Kevin', 'Candy', 'Albie', 'Kai', 'Elliot', 'Louis', 'Austin', 'Marvin'],
    # scores of 10 or below are treated as missing
    'first_mid': [s if s > 10 else None for s in np.random.randint(-10, 50, size=9)],
    'second_mid': [s for s in np.random.randint(70, 100, size=9)],
    'final': [s for s in np.random.randint(60, 100, size=9)]
}
df = pd.DataFrame(data, columns=['name', 'first_mid', 'second_mid', 'final'])
df = df.set_index('name')  # use the names as the row index
```
Use the data above for the following exercise:
1. Inspect the data in different ways
    a. Call it directly
    b. Use \.info()
    c. Show the first seven rows
2. Fill missing values with a score of 0
3. Show the descriptive statistics
4. Plot everyone's scores

```python=
# Solution
from IPython.display import display  # already available in Jupyter/Colab

# Q1
print('-----Q1-----')
display(df)          # call it directly
df.info()            # .info() prints its report itself and returns None
display(df.head(7))  # show the first seven rows

# Q2
print('-----Q2-----')
df = df.fillna(0)    # fill missing values with 0
display(df)

# Q3
print('-----Q3-----')
display(df.describe())  # show the descriptive statistics

# Q4
print('-----Q4-----')
df.plot(marker='o')  # plot everyone's scores
```

---

## Descriptive Statistics
* Show basic information: ```df.info()```
* Compute descriptive statistics: ```df.describe()```
* Count non-NaN entries (```axis=0``` per column, ```axis=1``` per row): ```df.count(axis=0)```
* Sum: ```df.sum(axis=0)```
* Cumulative sum: ```df.cumsum(axis=0)```
* Maximum: ```df.max(axis=0)```
* Minimum: ```df.min(axis=0)```
* Mean: ```df.mean(axis=0)```
* Median: ```df.median(axis=0)```
* Quantile (use ```q=0.25```, ```0.5```, ```0.75``` for quartiles): ```df.quantile(q=0.5, axis=0)```
* Standard deviation: ```df.std(axis=0)```

> Advanced methods
> * [Series descriptive statistics](https://pandas.pydata.org/pandas-docs/stable/reference/series.html#computations-descriptive-stats)
> * [DataFrame descriptive statistics](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats)

---

## Loading and Saving Data
* [Read a .csv file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv): ```df = pd.read_csv(filepath, sep=',', header='infer', names=None, dtype=None, encoding='utf-8'...)```
* [Read an Excel file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel): ```df = pd.read_excel(filepath, sheet_name=0, header=0, names=None, dtype=None...)```
* [Read HTML tables](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html#pandas.read_html): ==rarely used==; returns a list of DataFrames, one per table found ```dfs = pd.read_html(url, match='.+', header=None, attrs=None, encoding=None...)```
* [Read a SAS file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sas.html#pandas.read_sas): ==rarely used== ```df = pd.read_sas(filepath, format=None, encoding=None...)```
* [Write a .csv file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv): ```df.to_csv(filepath_name, sep=',', index=True, header=True, columns=None, encoding='utf-8'...)```
* [Write an Excel file](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html#pandas.DataFrame.to_excel): ```df.to_excel(filepath_name, sheet_name='Sheet1', index=True, header=True, columns=None...)```

> Advanced methods
> * [DataFrame input](https://pandas.pydata.org/pandas-docs/stable/reference/io.html)
> * [DataFrame output](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#serialization-io-conversion)

---

## Data Processing
### Searching
* Count occurrences of each value in a Series: ==common== ```s.value_counts()```
* Find unique values: ```s.unique()```
* Test whether each value is in a given list (the argument must be list-like, not a bare string): ```s.isin(['value'])```
* Test for null values: ```df.isnull()```
* Test for non-null values: ```df.notnull()```
* Flag duplicated rows (```keep``` is one of ```'first'```, ```'last'```, ```False```): ==common== ```df.duplicated(subset=None, keep='first')```

### Sorting
* Sort by row index: ```df.sort_index(axis=0, ascending=True, inplace=False, na_position='last')```
* Sort by column values: ==common== ```df.sort_values(by, axis=0, ascending=True, inplace=False, na_position='last')```
* Pivot table: ==very similar to Excel's pivot table== ```df.pivot_table(values=None, index=None, columns=None, aggfunc='mean')```

### Removing
* Remove columns (use ```axis=0``` to remove rows instead): ```df.drop(labels, axis=1, inplace=False)```
* Remove missing values (```how``` is ```'any'``` or ```'all'```): ==common== ```df.dropna(axis=0, how='any', subset=None, inplace=False)```
* Remove duplicated rows (```keep``` is one of ```'first'```, ```'last'```, ```False```): ==common== ```df.drop_duplicates(subset=None, keep='first', inplace=False)```

### Filling
* Fill missing values: ==common== ```df.fillna(value, axis=0, inplace=False, limit=None)```

### Index Operations
* Reset the row index: ```df.reset_index(drop=False, inplace=False)```
* Rename index or column labels: ```df.rename(index=None, columns=None, inplace=False)```

### Viewing
* Show the first n rows: ```df.head(n=5)```
* Show the last n rows: ```df.tail(n=5)```

### Combining Data
* Append rows at the end: ```df.append(df)``` (deprecated in pandas 1.4 and removed in pandas 2.0; use ```pd.concat``` instead)
* [Concatenate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html): ```df = pd.concat([df1,df2], axis=0, join='outer', ignore_index=False, sort=False)```
* [Merge](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html?highlight=merge#pandas.DataFrame.merge) (```how``` is one of ```'left'```, ```'right'```, ```'outer'```, ```'inner'```): ```df.merge(right, how='inner', left_on=None, right_on=None, sort=False, suffixes=('_x', '_y'))```

> Advanced methods
> * [Series data processing](https://pandas.pydata.org/pandas-docs/stable/reference/series.html)
> * [DataFrame data processing](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html)
> * [Introduction to merge, join, and concatenate](https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)→build things up piece by piece, like playing with building blocks
> * What is data merging?-->[Introductory article](https://www.displayr.com/what-is-data-merging/)

---

## Groupby
> ==The most useful feature in pandas==
> Think about wanting to operate on specific values of a specific column...
>
> ==Very similar to Excel's subtotal feature==

### Overview
![](https://i.imgur.com/dIukih3.png)
* Split: divide a pandas object by one or more keys
    * Keys can be functions, arrays, or DataFrame column names
* Apply: run a function on each group separately
    * Compute summary statistics, e.g. count, mean, standard deviation
    * Other per-group operations are possible, e.g. **normalization**, **linear regression**, **sorting**, **selecting subsets**
* Combine: merge the results into a single table
    * Compute [pivot tables](#Sorting) and **cross tabulations**
    * [Quantile analysis](#Descriptive-Statistics), e.g. quartiles

### Split
* The groupby function: ```df.groupby(by, sort=True)```
```python
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C': np.random.randn(8),
                   'D': np.random.randn(8)})
group1 = df.groupby('A')         # single key
group2 = df.groupby(['A', 'B'])  # multiple keys, similar to Excel's subtotal feature
```
### Apply
* Aggregate with the [descriptive statistics functions](#Descriptive-Statistics)
* Compute with your own functions
```python
group1[['C', 'D']].sum()  # sum the numeric columns within each group of A
group2.mean()             # mean of C and D for each (A, B) pair
```
### Combine
* [Use apply to process each group](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.GroupBy.apply.html?highlight=apply#pandas.core.groupby.GroupBy.apply): ==advanced== ```GroupBy.apply(func, *args, **kwargs)```
* Use a for loop to process each group: ==inefficient but handy==
```python
# iterating over a GroupBy yields (group key, sub-DataFrame) pairs
for key, group in group1:
    # do something...
    print(key)
    display(group)
```

## Date and Time
### Overview
pandas offers the functionality of Python's built-in datetime module in vectorized form, so it can handle time-related objects efficiently. Time series data mainly comes in these forms:
* Timestamps (datetime): a specific moment in time
* Fixed periods (periods): e.g. 2019/12/24~2019/12/25
* Time intervals (timedelta): e.g. how long it is from 2018/11 to 2019/11

> [datetime module review](https://hackmd.io/@singlien/ByxOx6G5r#time套件與datetime套件)

### Common Functions
* [Datetime to text](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.dt.strftime.html#pandas.Series.dt.strftime): ```Series.dt.strftime(format)```
* [Text to datetime](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.to_datetime.html#pandas.to_datetime): ==important== ```pd.to_datetime(Series, format=None)```
* Convert a time interval (timedelta Series) to numbers
    * Days ```Series.dt.days```
    * Seconds ```Series.dt.seconds```

> Advanced methods
> https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-dt

---

## Text Processing
### Overview
pandas exposes Python's built-in string methods through the ```.str``` accessor, so it can process text column values efficiently.
> [String review](https://hackmd.io/@singlien/ByWewB-qB#%E5%AD%97%E4%B8%B2)

### Common Functions
* Find the position of a substring: ==important== ```Series.str.find(sub, start=0, end=None)```
* Check whether a pattern is contained: ==important== ```Series.str.contains(pat, case=True, regex=True)```
* Count occurrences of a pattern: ```Series.str.count(pat)```
* Join strings with a separator (for a Series whose elements are lists): ==handy== ```Series.str.join(sep)```

> Advanced methods
> https://pandas.pydata.org/pandas-docs/stable/reference/series.html#api-series-str

---

## Data Visualization
* [Plotting](https://blog.csdn.net/brucewong0516/article/details/80524442) (```layout=(rows, columns)``` takes effect together with ```subplots=True```): ```DataFrame.plot(x=None, y=None, kind='line', subplots=False, layout=None, title=None, xticks=None, yticks=None, legend=True)```

```python=
# Example
import pandas as pd
import numpy as np

# Make some fake data.
np.random.seed(50)
data = pd.DataFrame(
    np.random.randn(10, 4),
    index=np.arange(10),
    columns=list("ABCD")
)
data.plot()  # calling it directly draws a chart: a line chart by default, one line per column
```
![Example figure 3-1](https://i.imgur.com/znmtXE7.png)
![Example figure 3-2](https://i.imgur.com/qFcAMxC.png)

> Pandas can only draw simple charts. For fancier ones, try other packages...
> * [matplotlib](https://matplotlib.org/tutorials/index.html): a bit lower level; you have to think about each element of the chart
> * [seaborn](https://seaborn.pydata.org/api.html)

---

## →Comprehensive Exercise
### Description
Download the full-year 2017 "[Taipei Arena venue rental information](https://data.gov.tw/dataset/61869)" data from the [Government Open Data](https://data.gov.tw/) platform, and work out the following for the different types of events:
* The average number of rental days
* How often each vendor rents the venue
* Which events were held

### File Download
* [Taipei Arena venue rental information](https://drive.google.com/drive/folders/1NJgDhc4e1mjhr6_FOEGyKFMM6S4U1zIP?usp=sharing)

### Step by Step
* Read the files
    * **The provided files are encoded in big5**
    * Combine the monthly data into one large table (yearly data)
* Look at the overall shape of the data
* Convert the dates to datetime objects
* Compute the number of rental days
* Compute the rent
    * **Assume the rent is 500,000 per day**
    * Remember to convert the rental duration from ```timedelta64[ns]``` to ```int``` first
* Remove duplicates
* Compute the average rental time for each event type
* Compute each vendor's total rental time and rent
    * Use groupby
        * count
        * sum
        * mean
* Check which events the top vendor held
* Export to an Excel file

> [Reference solution](https://colab.research.google.com/drive/1BNwEVK7w5KReEe9Flg9N8caOrQ8RhuZn)

---

## Better to learn to fish than to be given a fish
What should you do when you run into a problem? Go check these references! (Or just Google it!)
> [Basic Python tutorial](https://www.runoob.com/python3/python3-tutorial.html)
> [Pandas tutorial](https://pandas.pydata.org/pandas-docs/stable/)
> [Numpy tutorial](https://docs.scipy.org/doc/numpy/reference/)
> [HTML tutorial](https://www.w3schools.com/html/)
>
> [Previous: Week 1 - Basic Python](https://hackmd.io/@singlien/ByWewB-qB)
> [Next: Week 3 - Other Common Packages](https://hackmd.io/@singlien/ByxOx6G5r)
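
As a recap of the steps listed in the comprehensive exercise above (the Taipei Arena rental data), here is a minimal sketch of the same pipeline on a tiny in-memory table. It is not the linked reference solution: the column names (`vendor`, `event_type`, `start`, `end`) and the rows are hypothetical stand-ins, the real files would be loaded with something like `pd.read_csv(path, encoding='big5')`, and counting both the first and last day as rental days is an assumption.

```python
import pandas as pd

DAILY_RENT = 500_000  # the exercise's stated assumption: 500,000 per day

# Hypothetical stand-ins for two monthly files (real column names will differ).
jan = pd.DataFrame({
    "vendor": ["A Corp", "B Ltd"],
    "event_type": ["concert", "sports"],
    "start": ["2017-01-06", "2017-01-20"],
    "end": ["2017-01-08", "2017-01-20"],
})
feb = pd.DataFrame({
    "vendor": ["A Corp", "A Corp"],
    "event_type": ["concert", "concert"],
    "start": ["2017-02-10", "2017-02-10"],  # second row duplicates the first
    "end": ["2017-02-11", "2017-02-11"],
})

# Combine the monthly tables into one yearly table.
year = pd.concat([jan, feb], ignore_index=True)

# Convert the text dates to datetime objects.
year["start"] = pd.to_datetime(year["start"], format="%Y-%m-%d")
year["end"] = pd.to_datetime(year["end"], format="%Y-%m-%d")

# Rental days: timedelta -> int via .dt.days (+1 assumes both end days count).
year["days"] = (year["end"] - year["start"]).dt.days + 1
year["rent"] = year["days"] * DAILY_RENT

# Remove duplicated bookings.
year = year.drop_duplicates().reset_index(drop=True)

# Average rental days per event type.
mean_days_by_type = year.groupby("event_type")["days"].mean()

# Per-vendor booking count, total days, and total rent.
vendor_summary = year.groupby("vendor").agg(
    bookings=("days", "count"),
    total_days=("days", "sum"),
    total_rent=("rent", "sum"),
)

# The vendor with the most rental days, and the bookings it made.
top_vendor = vendor_summary["total_days"].idxmax()
top_vendor_events = year[year["vendor"] == top_vendor]

print(mean_days_by_type)
print(vendor_summary)
print(top_vendor)
```

From here, `vendor_summary.to_excel(...)` would cover the final export step; it is left out so the sketch runs without an Excel writer installed.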