# Manipulating DataFrames with pandas
###### tags: `Datacamp` `python` `panda` `data science` `Data Manipulation with Python`
>**Author: 何彥南**
>Datacamp course: [Manipulating DataFrames with pandas](https://www.datacamp.com/courses/manipulating-dataframes-with-pandas)

**Note:**
1. Below, `df` refers to tabular data held in a pandas DataFrame.
2. `pd` is the usual alias for the pandas package.
3. Treat the official documentation, [pandas doc](https://pandas.pydata.org/pandas-docs/stable/), as the authoritative reference.
4. These notes were written against `pandas 0.25.0`; some features may not work in newer versions.
5. The related data is available under "dataset" in the lower-right corner of the Datacamp course page linked above.

[toc]

---

# [CH1]Extracting and transforming data
>In this chapter, you will learn all about how to index, slice, filter, and transform DataFrames, using a variety of datasets, ranging from 2012 US election data for the state of Pennsylvania to Pittsburgh weather data.

## 1. What you will learn
* Extracting, filtering, and transforming data from DataFrames
* Advanced indexing with multiple levels
* Tidying, rearranging and restructuring your data
* Pivoting, melting, and stacking DataFrames
* Identifying and splitting DataFrames by groups

## 2. Indexing DataFrames
![](https://i.imgur.com/SEzNvB8.png)
>Values can be read directly with square brackets. Below we give the column label `['salt']` first and then the row label `['jan']`.
>
![](https://i.imgur.com/eX0gFiS.png)
>Columns are also exposed as attributes, so `df.column_label` returns the named column. Below, `df.eggs` means the same as `df['eggs']`.
>
![](https://i.imgur.com/2BmF7fK.png)
>We can also use the .loc accessor to pick out the part of the DataFrame we want.

* loc [ ] [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)
    * On a DataFrame this is a very flexible accessor. Its form is `loc[row_label, column_label]`, where row_label and column_label can each be a range: `A:B` covers label A through label B, with both ends included.
    * It is used in much the same way as iloc [ ], except that `iloc takes integer positions` while `loc takes index labels (names, usually strings)`.

![](https://i.imgur.com/UFOYHxb.png)
>Double brackets `[[ ]]` select several columns from a DataFrame at once.
>
![](https://i.imgur.com/81g7ESc.png)

## 3. Slicing DataFrames
### sales DataFrame
![](https://i.imgur.com/TBj2iAF.png)
>Selecting a single column returns a Series.
>
![](https://i.imgur.com/bprfKPP.png)
>We can use `[ ]` to select part of its values.
>
![](https://i.imgur.com/vrNNBRw.png)

### Using .loc[ ] & .iloc
>Below are a few ways to combine `loc` with `:`.
>A bare `:` selects every index label.

![](https://i.imgur.com/nzVTD2P.png)
![](https://i.imgur.com/SDN6Bx2.png)
>Ranges can be given for rows and columns at the same time.

![](https://i.imgur.com/nMYP39u.png)
>iloc gives the same result, but note that it takes integer positions rather than labels.

![](https://i.imgur.com/q6sZwt1.png)

### Using list in .loc[ ], .iloc[ ]
>A list can also be used to select specific columns.

![](https://i.imgur.com/u9DweXH.png)
>Likewise, iloc accepts a list.

![](https://i.imgur.com/dXzwgo2.png)
>Note that selecting with a list returns a DataFrame even when the list names only one column.

![](https://i.imgur.com/jpJwNYf.png)

## 4. Filtering DataFrames
### Creating a Boolean Series
>Comparing a Series against a value produces a boolean Series of the same length.

![](https://i.imgur.com/lb7mKk5.png)
>We can put that boolean Series straight into `[ ]`, and pandas keeps only the rows where it is True.

![](https://i.imgur.com/gaz9QOp.png)
>Conditions can be combined with `&` (and) or `|` (or) to filter on several criteria.
>
![](https://i.imgur.com/OH5ZFAn.png)

### DataFrames with zeros and NaNs | copy()
>copy() makes an independent copy of a DataFrame. Note that the copy below contains both `0` values and missing values (`NaN`).

![](https://i.imgur.com/8PeC0OD.png)

### Select columns | all(), any(), isnull(), notnull()
>with all nonzeros
* all( ) [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.all.html#pandas.DataFrame.all)
    * For each column, tests whether `every value is truthy (not False and not 0)`; returns booleans.

![](https://i.imgur.com/bRaNE0Z.png)
>with any nonzeros
* any( ) [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.any.html#pandas.DataFrame.any)
    * The counterpart of all(): for each column, tests whether `at least one value is truthy (not 0 and not False)`; returns booleans.

![](https://i.imgur.com/TE8Bnv0.png)

### Deal with NaN
>with any NaNs
* isnull( ) [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isnull.html)
    * Tests `whether each value is missing`; returns booleans.

![](https://i.imgur.com/cUwag9B.png)
>without NaNs
* notnull( ) [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.notnull.html#pandas.DataFrame.notnull)
    * Tests `whether each value is not missing`, the opposite of isnull(); returns booleans.

![](https://i.imgur.com/IbarEmi.png)
>Drop rows with any NaNs
* dropna() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)
    * Drops the rows (the default) or the columns that contain missing values.
    * how: either `'any'` or `'all'`
        * any: drop the row or column if it contains any NaN.
        * all: drop the row or column only if all of its values are NaN.

![](https://i.imgur.com/37S9XcX.png)

### Filtering a column based on another
>We can also filter one column based on another column: `[ ] accepts any boolean Series of the same length`.

![](https://i.imgur.com/PyMDyWE.png)
>The same mechanism can be used to change values. Below, wherever `salt` is greater than 55, the `eggs` value in the same row is increased by 5.
>
![](https://i.imgur.com/KZbQa1s.png)

## 5. Transforming DataFrames
### DataFrame vectorized methods
>floordiv() divides every value by 12 and keeps only the whole-number part (floor division).
* floordiv( ) [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.floordiv.html)

![](https://i.imgur.com/bteRtlz.png)
>NumPy's floor_divide() gives the same result.
* floor_divide() [Official docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.floor_divide.html)

![](https://i.imgur.com/hmdTPEr.png)

### Plain Python functions
>apply() runs a function or a lambda over the DataFrame, passing it one column at a time by default.
* apply() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html)

>function

![](https://i.imgur.com/hdm8HJJ.png)
>lambda

![](https://i.imgur.com/REO7BlE.png)

### Storing a transformation
>Assigning with `df['column_name'] =` stores the transformed column.

![](https://i.imgur.com/wyxrsJT.png)

### The DataFrame index | df.index
>Returns the index (the row labels).

![](https://i.imgur.com/2HT6U2W.png)

### Working with string values | upper(), lower(), map()
>upper() converts letters to upper case; lower() converts them to lower case.

![](https://i.imgur.com/xMOk1gv.png)
>These can also be combined with map().
* map( ) [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.map.html)
    * Much like apply, it applies the same conversion to every value in a Series. It also accepts a dictionary as a lookup table, which makes it convenient for recoding text values.

![](https://i.imgur.com/6NdeKlc.png)

### Defining columns using other columns
>We can also take several columns, do basic arithmetic on them, and store the result as a new column.
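
The transformations described in this section (floor division, storing a result, upper-casing the index, and building a column from other columns) can be sketched together as follows. This is only a minimal illustration: the table is a made-up stand-in for the course's sales data, and the column names `dozens_of_eggs` and `salty_eggs` are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical stand-in for the course's sales table (values are made up).
df = pd.DataFrame(
    {"eggs": [47, 110, 221], "salt": [12.0, 50.0, 89.0], "spam": [17, 31, 72]},
    index=["jan", "feb", "mar"],
)

# Vectorized floor division over the whole table: whole dozens per cell.
dozens = df.floordiv(12)

# Store a transformation as a new column.
df["dozens_of_eggs"] = df["eggs"].floordiv(12)

# String methods are available on the index through the .str accessor.
df.index = df.index.str.upper()

# Define a new column from two existing columns.
df["salty_eggs"] = df["salt"] + df["dozens_of_eggs"]

print(df)
```

Vectorized methods such as `floordiv` are usually preferable to `apply` with a Python function, because they operate on whole columns at once.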
![](https://i.imgur.com/e1X0X5G.png)

---

# [CH2]Advanced indexing
>Having learned the fundamentals of working with DataFrames, you will now move on to more advanced indexing techniques. You will learn about MultiIndexes, or hierarchical indexes, and learn how to interact with and extract data from them.

## 1. Index objects and labeled data
### pandas Data Structures
* Key building blocks
    * Indexes: Sequence of labels
    * Series: 1D array with Index
    * DataFrames: 2D array with Series as columns
* Indexes
    * Immutable (Like dictionary keys)
    * Homogenous in data type (Like NumPy arrays)

### Creating a Series
>A Series is the basic building block of a DataFrame. As shown below, it `holds a sequence of values together with a data type`.

![](https://i.imgur.com/fzpjCvQ.png)
>We can also pass our own set of labels to use as the index.

![](https://i.imgur.com/NmznPmf.png)

### using .index
>As shown below, the index can be sliced freely, and index.name gives the name of the index.

![](https://i.imgur.com/ETFRXsX.png)
>The index can also be renamed.

![](https://i.imgur.com/BCo4hhq.png)
>Remember that individual entries of an index cannot be modified.

![](https://i.imgur.com/zJCXfxG.png)
![](https://i.imgur.com/nrECVRs.png)

### Unemployment data
>Here we use the unemployment data to demonstrate basic index operations.

![](https://i.imgur.com/2DsxVjK.png)
>info() also shows that the current index is a run of consecutive integers, called a `RangeIndex`.

![](https://i.imgur.com/nhS3ZE0.png)
>First we choose Zip as our new index.
>
![](https://i.imgur.com/jOO22uf.png)
>Then we remove the original column.

![](https://i.imgur.com/ex5RPid.png)
>Use type(), name and columns to inspect the index.
* df.columns: returns the column labels

![](https://i.imgur.com/4SnE0hS.png)
>When calling read_csv() we can also use index_col to set the index column directly.

![](https://i.imgur.com/i0M9dYk.png)

## 2. Hierarchical Indexing
>Next we look at hierarchical (multi-level) indexes.

### Stock data
![](https://i.imgur.com/UKrHrlF.png)

### Setting index | set_index()
>First we use set_index() to turn two columns into the index.
* set_index() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.set_index.html)

![](https://i.imgur.com/6ilcrkt.png)
>The result is a MultiIndex. It no longer has a single name; the level names are held in names instead.
>
![](https://i.imgur.com/m2tl1RT.png)

### Sorting index | sort_index()
>Next we use sort_index() to sort the data by its index labels, which brings together the rows that share the same outer label.
* sort_index( ) [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html)

![](https://i.imgur.com/Ik3sbB6.png)
>We can put `one label from Symbol and one from Date into a tuple` and pass it to loc[ ] to fetch the matching row. Adding a column label returns a single value.
>
![](https://i.imgur.com/MoHfTc4.png)

### Slicing (outermost index)
>With a sorted MultiIndex and loc[ ], it is easy to select data by category.

![](https://i.imgur.com/ZyFR71l.png)
![](https://i.imgur.com/brE0TeE.png)
>Better still, a tuple lets us select on different index levels at once. Remember that `:` on its own means "everything".
>
![](https://i.imgur.com/S2bLekA.png)
![](https://i.imgur.com/0F6a2zk.png)
>Using slice() [Official docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#why-does-assignment-fail-when-using-chained-indexing)

![](https://i.imgur.com/S9JrqUa.png)

# [CH3]Rearranging and reshaping
>Here, you will learn how to reshape your DataFrames using techniques such as pivoting, melting, stacking, and unstacking. These are powerful techniques that allow you to tidy and rearrange your data into the format that allows you to most easily analyze it for insights.

## 1. Pivoting DataFrames
### Reshaping by pivoting | pivot()
![](https://i.imgur.com/DDPFh6s.png)
>pivot() lets us choose which columns become the new index, the new columns, and the values. It is powerful but has one restriction to keep in mind: `each (index, column) pair may map to only one value`.

![](https://i.imgur.com/SSYijVu.png)
>When values is not specified, every column other than those used for index and columns is returned as values. Because it builds an entirely new DataFrame out of column contents, pivot() could be called a magician of columns.

![](https://i.imgur.com/JuKZYnW.png)

## 2. Stacking & unstacking DataFrames
>Next we look at stacked and unstacked DataFrames.

### Creating a multi-level index
![](https://i.imgur.com/7IS1sPK.png)

### Unstacking a multi-index | unstack()
>Using unstack()
* unstack() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.unstack.html)
    * It moves one level of a multi-level index into the columns, `chosen with the level argument`.

![](https://i.imgur.com/LqnMfcg.png)

### Stacking DataFrames | stack()
>stack() does the reverse and moves a column level into the index.
* stack() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.stack.html#pandas.DataFrame.stack)

![](https://i.imgur.com/TuI9XUr.png)

### Swapping levels
>swaplevel() swaps the order of the index levels.
* swaplevel() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.swaplevel.html)

![](https://i.imgur.com/wAXhNxG.png)
>Sorting rows
>
![](https://i.imgur.com/RuICMDJ.png)

## 3. Melting DataFrames
### Clinical trials data
![](https://i.imgur.com/nLa0zV4.png)
>melt() is the counterpart of pivot(), so we first build a pivoted DataFrame.

![](https://i.imgur.com/hmzAQLm.png)

### Melting DataFrame | melt()
![](https://i.imgur.com/qMwgbwF.png)
>Calling melt() with no arguments `turns every column into rows, with the column name placed in a new variable column`. Without any parameters the result is not very useful.
* melt() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.melt.html)

![](https://i.imgur.com/jS1hpoS.png)
>Setting id_vars keeps `treatment` as its own identifier column; each row then pairs a treatment with one variable.

![](https://i.imgur.com/XRdhmRY.png)
>Setting value_vars chooses which columns to melt back from the pivoted form. variable holds the original column name and value holds the value that was under that column.
>
![](https://i.imgur.com/GjsmLgg.png)
>Setting value_name renames the resulting value column.
>
![](https://i.imgur.com/TkYi11U.png)

## 4. Pivot tables
### Pivot table | pivot_table()
![](https://i.imgur.com/PASVpOV.png)
>In some cases pivot() cannot do the conversion, for example when there are duplicate entries as below.
>
![](https://i.imgur.com/JN8tqWD.png)
>In that case we use pivot_table(). By default it `takes the mean of the duplicated values`.

![](https://i.imgur.com/nOnD4p5.png)
>`aggfunc='count'` returns the number of entries for each pair.
>
![](https://i.imgur.com/yZ8JA27.png)

# [CH4]Grouping data
>In this chapter, you'll learn how to identify and split DataFrames by groups or categories for further aggregation or analysis. You'll also learn how to transform and filter your data, including how to detect outliers and impute missing values. Knowing how to effectively group data in pandas can be a seriously powerful addition to your data science toolbox.

## 1. Categoricals and groupby
### Using count | count()
* count() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.count.html)

![](https://i.imgur.com/DICNXL9.png)

### Boolean filter and count
>Here count() gives the number of rows that satisfy the condition `sales['weekday']=='Sun'`.

![](https://i.imgur.com/Byp8VOl.png)

### Groupby and count | groupby()
>groupby() splits the rows into groups according to the values of the chosen column. We can then do something with each group; count(), for example, tells us how many entries each group has.
* groupby [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html)
* sales.groupby('weekday').count()
    * split by ‘weekday’
    * apply count() function on each group
    * combine counts per group
* Some reducing functions
    * mean()
    * std()
    * sum()
    * first(), last()
    * min(), max()

![](https://i.imgur.com/ekVL8LB.png)

### Groupby and sum
![](https://i.imgur.com/5GUMPeE.png)
>Here we group by weekday and then pick out just the `bread` and `butter` columns.
>
![](https://i.imgur.com/rT3GLZm.png)

### Groupby and mean: multi-level index
>Here we group by two columns at once: first by `city`, then by `weekday`.

![](https://i.imgur.com/sEw0g8r.png)

### Groupby and sum: by series
>groupby() also accepts a Series: `pass the Series in place of a column name` and the rows are grouped by its values.

![](https://i.imgur.com/zY3sg6H.png)
>Here the `customers` Series only supplies the grouping keys; the data actually being grouped is still the sales DataFrame on the left of the call.

![](https://i.imgur.com/p3FVAG2.png)

### Categorical data | unique()
>unique() resembles groupby() in that it finds the distinct values, but it only returns them as an array, which is fast.
* unique() [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html#pandas.Series.unique)
* category [Official docs](https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)
    * Advantages
        * Uses less memory
        * Speeds up operations like groupby()

![](https://i.imgur.com/oKz32oD.png)

## 2. Groupby and aggregation
### Sales data
![](https://i.imgur.com/H5PNQQG.png)

### Multiple aggregations | agg()
>agg() performs several aggregations over the same columns. Below, `max` and `sum` are computed together for the two columns `bread` and `butter`.
* agg [Official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.agg.html)

![](https://i.imgur.com/pDPtA2z.png)
>Aggregation functions
* string names
    * ‘sum’
    * ‘mean’
    * ‘count’

### Custom aggregation
>We can also pass our own function to agg().

![](https://i.imgur.com/hZrEAP9.png)
![](https://i.imgur.com/Dgv6uRV.png)
>With a dictionary we can apply a different function to each column, which is very convenient.
>
![](https://i.imgur.com/BsPvo5Q.png)

## 3. Groupby and transformation
![](https://i.imgur.com/9qLL9ZY.png)

### The z-score
![](https://i.imgur.com/IZpnafE.png)
>Using our own `z-score` function.
>
![](https://i.imgur.com/h3iIMeh.png)
>Adding groupby(): group by `'yr'` (the year), then compute the z-score of `'mpg'` within each group.
>
![](https://i.imgur.com/rwU3386.png)

### Apply transformation and aggregation
>Use apply() with a function that summarises each group, collects the results in a dict, and converts the dict to a DataFrame.

![](https://i.imgur.com/o6dx4bu.png)
>Inside the function, group is the DataFrame for a single group, so we can select one of its columns directly with [column_name].

![](https://i.imgur.com/CmhEw4T.png)

## 4. Groupby and filtering
### groupby object
>[auto-mpg.csv](https://www.kaggle.com/uciml/autompg-dataset)
>
![](https://i.imgur.com/cC9arGf.png)
>First, group by year and take the mean of each group.

![](https://i.imgur.com/xPS3yWr.png)
>Create a `groupby object`. We can see that it has its own groupby type; groups returns a dict, and keys() returns that dict's keys, that is, all of the categories.

![](https://i.imgur.com/Ygb935D.png)
>We can iterate over a `groupby object`: `group_name` is the dict key and group is the DataFrame for that group.
>
![](https://i.imgur.com/o8zNSQr.png)
>Adding a contains() filter inside the for loop.
>
![](https://i.imgur.com/DTzeZ0K.png)
>The same thing as a comprehension.
>
![](https://i.imgur.com/2Wc3nw2.png)
>Adding a boolean Series as a second grouping key shows how many rows in each year do and do not meet the condition.
>
![](https://i.imgur.com/nK7SbMh.png)
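
The three filtering patterns above (a for loop over the groups, the equivalent comprehension, and grouping by an extra boolean Series) can be sketched as follows. This is only an illustration under stated assumptions: the tiny `auto` table is made up rather than taken from auto-mpg.csv, and filtering on the string "chevrolet" is a hypothetical condition chosen for the example.

```python
import pandas as pd

# Hypothetical miniature of the auto-mpg table; all values are made up.
auto = pd.DataFrame(
    {
        "yr": [70, 70, 71, 71, 71],
        "name": [
            "chevrolet chevelle",
            "ford torino",
            "chevrolet vega",
            "toyota corona",
            "chevrolet impala",
        ],
        "mpg": [18.0, 17.0, 28.0, 25.0, 14.0],
    }
)

splitting = auto.groupby("yr")

# 1. Iterate over the groupby object: each step yields (key, sub-DataFrame).
avg_by_year = {}
for group_name, group in splitting:
    is_chevy = group["name"].str.contains("chevrolet")
    avg_by_year[group_name] = group.loc[is_chevy, "mpg"].mean()

# 2. The same filter written as a dict comprehension.
chevy_means = {
    year: group.loc[group["name"].str.contains("chevrolet"), "mpg"].mean()
    for year, group in splitting
}

# 3. Group by year and by a boolean Series to count matches and non-matches.
chevy = auto["name"].str.contains("chevrolet")
counts = auto.groupby(["yr", chevy])["mpg"].count()

print(avg_by_year)
print(counts)
```

The boolean Series in step 3 works as a grouping key because it shares the DataFrame's index, so pandas can align it row by row.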