Merging DataFrames with pandas

# Merging DataFrames with pandas
###### tags: `Datacamp` `python` `panda` `data science` `Data Manipulation with Python`

>**Author: 何彥南**
>Datacamp course: [Merging DataFrames with pandas](https://www.datacamp.com/courses/merging-dataframes-with-pandas)

**Notes:**
1. Below, `df` refers to a table in the form of a pandas DataFrame.
2. `pd` is the usual alias for the pandas package.
3. Treat the official documentation, [pandas docs](https://pandas.pydata.org/pandas-docs/stable/), as the primary reference.
4. Mind the pandas version (0.25.0); some features may not be available in newer versions.

[toc]

---

# [CH1] Preparing Data
>In this chapter, you'll learn about different techniques you can use to import multiple files into DataFrames. Having imported your data into individual DataFrames, you'll then learn how to share information between DataFrames using their Indexes. Understanding how Indexes work is essential information that you'll need for merging DataFrames later in the course.

## 1. Reading multiple data files
### Tools for pandas data import | read_csv(), glob
> Reading data

* read_csv() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
* pandas also supports many other file types:
    * pd.read_excel()
    * pd.read_html()
    * pd.read_json()

![](https://i.imgur.com/YbCNqpC.png)

> Using a loop

![](https://i.imgur.com/wp2nJys.png)

> Using a one-line for loop (a list comprehension) to read files in bulk

![](https://i.imgur.com/Sd8BxhP.png)

> Using glob

* glob: a module for finding files by name pattern [official docs](https://docs.python.org/3/library/glob.html#module-glob)

![](https://i.imgur.com/0FPOsL9.png)

## 2. Reindexing DataFrames
> This section introduces the basic index operations in pandas.

### “Indexes” vs. “Indices”
* indices: the many labels inside a single Index
* indexes: many Index objects

![](https://i.imgur.com/tvG1ZFu.png)

### Importing weather data | index, reindex(), sort_index(), dropna()
> [Pittsburgh weather data](https://assets.datacamp.com/production/repositories/516/datasets/58c1ead59818b2451324e9e84239db7bda6b11d3/pittsburgh2013.csv) (from Datacamp)

![](https://i.imgur.com/52i57x5.png)

> Inspect the data with print

![](https://i.imgur.com/bs1D8eD.png)

> Look at the index

* df.index [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.html)

![](https://i.imgur.com/OBzW5BO.png)

> Reindex

* df.reindex() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reindex.html)

![](https://i.imgur.com/vxG3LXr.png)

* df.sort_index [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_index.html)
* Series.sort_index [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.sort_index.html)

![](https://i.imgur.com/FgaNYQQ.png)

> Set a `column` of the df as the index

![](https://i.imgur.com/BU4atsU.png)

> You can also reindex with labels you type in yourself; a label that does not exist in the index returns `NaN`

![](https://i.imgur.com/qhtB4Zi.png)

> Using the index of `w_mean3` (reindexed above) as the reference, pull the values of `w_max` at the matching labels, then drop the missing values with dropna()

* df.dropna() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

![](https://i.imgur.com/o7VU6d2.png)

> Here the values of `w_mean` and `w_max` are matched up by their shared index labels. Very convenient!

![](https://i.imgur.com/m3SjEM4.png)

## 3. Arithmetic with Series & DataFrames
### Loading weather data | loc[ ]
> [Pittsburgh weather data](https://assets.datacamp.com/production/repositories/516/datasets/58c1ead59818b2451324e9e84239db7bda6b11d3/pittsburgh2013.csv) (from Datacamp)

![](https://i.imgur.com/TzzqqcF.png)

> Scalar multiplication: multiply the `PrecipitationIn` column by 2.54 for the selected dates

* df.loc[`index_label`,`column_label`] [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html)
    * Selects the given `index_label` (row name) and `column_label` (column name).
    * `:` selects a range

![](https://i.imgur.com/YgtzBug.png)

> You can also select several columns at once with a list, which gives the absolute temperature range `week1_range`

![](https://i.imgur.com/mty2Q4x.png)

> Here we first get the mean temperature for these days, `week1_mean`

![](https://i.imgur.com/Q1VRRuW.png)

### Relative temperature range | divide(), pct_change()
> We want a relative ratio, but we hit an error QQ

* The error shows that pandas does not know how to operate on a Timestamp and a str: with a plain `/`, it tries to match the date index of the Series against the column names of the DataFrame, so it cannot tell which values in the DataFrame you want it to divide.

![](https://i.imgur.com/9DKHOBR.png)

> Here we use the built-in divide() to divide every cell of `week1_range` by `week1_mean`

* df.divide() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.divide.html)

![](https://i.imgur.com/XwyMpib.png)

> pct_change() gives the rate of change relative to the previous value; because it is relative to the previous value, the first value is always NaN

* Series.pct_change() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.pct_change.html)
* df.pct_change() [official docs](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.DataFrame.pct_change.html)

![](https://i.imgur.com/5qVWpe7.png)

### Olympic medals data
> [Summer Olympic medals](https://assets.datacamp.com/production/repositories/516/datasets/2d14df8d3c6a1773358fa000f203282c2e1107d6/Summer%20Olympic%20medals.zip) (from Datacamp)

> Bronze medals

![](https://i.imgur.com/Xwv9JbV.png)

> Silver medals

![](https://i.imgur.com/CelxtnI.png)

> Gold medals

![](https://i.imgur.com/nYTgnSU.png)

### Adding bronze, silver | add()
* df.add() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.add.html)

> Adding the two datasets together: `2247` is the sum of the bronze and silver values

![](https://i.imgur.com/DN6rXM3.png)

> add() works too. Here `Germany` and `Italy` come out as NaN because the value is missing (NaN) in one of the two dfs

![](https://i.imgur.com/fwbCklX.png)

> We can pass the `fill_value` parameter: as long as one of the two dfs has a value, the missing side is filled in and the result is kept.

![](https://i.imgur.com/KPwrtQJ.png)

### Chaining .add()
> What if we want to add more than two?

![](https://i.imgur.com/YWRiI3a.png)

> Chaining several add() calls saves a lot of trouble

![](https://i.imgur.com/UjQWQD4.png)

# [CH2] Concatenating data
>Having learned how to import multiple DataFrames and share information using Indexes, in this chapter you'll learn how to perform database-style operations to combine DataFrames. In particular, you'll learn about appending and concatenating DataFrames while working with a variety of real-world datasets.

## 1. Appending & concatenating Series
> append and concat are very similar; both can be used to combine data

* append() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html)
    * Series & DataFrame method
    * Usage: s1.append(s2)
    * Can only stack s2 below s1
* concat() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html)
    * pandas module function
    * Usage: pd.concat([s1, s2, s3])
    * Works along rows or along columns

### Series of US states data | Series
> Create the Series

![](https://i.imgur.com/l6prCk7.png)

### append() | reset_index()
![](https://i.imgur.com/Zz4rZpQ.png)

> After append, the data still keeps its original index

![](https://i.imgur.com/vu1uw7h.png)

> Use reset_index() to reset the index

![](https://i.imgur.com/KFF14oY.png)

### concat()
![](https://i.imgur.com/SBURSDO.png)

> In concat we can use ignore_index directly to discard the original index

![](https://i.imgur.com/rp167kf.png)

## 2. Appending & concatenating DataFrames
### Loading population data
![](https://i.imgur.com/3nfGH6Q.png)

> As before, let's first see what the data looks like

![](https://i.imgur.com/QuAWN6Q.png)

> Use append to combine the two DataFrames; note that both have the same column name, `2010 Census Population`.

![](https://i.imgur.com/JhWKzSy.png)

### Append()
> Here we switch to the two DataFrames `Population` and `unemployment`

![](https://i.imgur.com/S0mTAXa.png)

> We combine the two DataFrames, but this time it does not go as smoothly and many missing values appear. By default, append() lines rows up by `column name` when adding them below; these two datasets have different column names, so the gaps are filled with missing values automatically.

![](https://i.imgur.com/gtZKHzs.png)

> We can also see that the index label `2860` is duplicated

![](https://i.imgur.com/xpxwaTl.png)

### Concat()
> Depending on the direction in which the data is combined, there are two main cases

> concat rows (`axis=0`, the default): combine top to bottom by row, similar to the append() introduced above.

![](https://i.imgur.com/pBpR3gY.png)

> concat columns (`axis=1`): combine columns side by side

![](https://i.imgur.com/t1e0qRo.png)

* Note on `NaN`: with either approach, when an index label has no matching value or the two DataFrames differ in length, concat() fills in missing values automatically, so you still need to handle them afterwards

## 3. Concatenation, keys, & MultiIndexes
### Loading rainfall data | concat(keys=)
![](https://i.imgur.com/oTgCjSV.png)

> First, look at the data

![](https://i.imgur.com/bOXeQqR.png)

> Concatenate along rows
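
A minimal sketch of the concatenation options described so far, not the course's actual code: the two rainfall tables (`rain2013`, `rain2014`) and their values are made up for illustration, and only `pd.concat()` with its `axis`, `ignore_index`, and `keys` parameters comes from the notes.

```python
import pandas as pd

# Hypothetical rainfall tables (values invented for illustration).
rain2013 = pd.DataFrame({"Precipitation": [0.10, 0.27, 0.33]},
                        index=["Jan", "Feb", "Mar"])
rain2014 = pd.DataFrame({"Precipitation": [0.50, 0.20, 0.41, 0.15]},
                        index=["Jan", "Feb", "Mar", "Apr"])

# axis=0 (default): stack rows. The original labels are kept,
# so "Jan", "Feb" and "Mar" each appear twice in the result.
stacked = pd.concat([rain2013, rain2014])

# ignore_index=True throws the old labels away and renumbers from 0.
renumbered = pd.concat([rain2013, rain2014], ignore_index=True)

# keys= adds an outer index level, giving a (year, month) MultiIndex.
by_year = pd.concat([rain2013, rain2014], keys=[2013, 2014])

# axis=1: place the tables side by side, aligned on the index.
# "Apr" exists only in rain2014, so the 2013 column gets NaN there.
side_by_side = pd.concat([rain2013, rain2014], axis=1)

print(by_year.loc[2014])   # only the 2014 rows, still indexed by month
print(side_by_side)
```

Duplicate labels in `stacked` are not an error, but a label-based lookup such as `stacked.loc["Jan"]` then returns several rows, which is why `ignore_index` or `keys` is often worth using.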
![](https://i.imgur.com/Yyd16Pd.png)

> With keys, concat produces a composite index (MultiIndex) with `2013` and `2014` as the outer level

![](https://i.imgur.com/4U4btB5.png)

> We can also use loc[ ] to select `2014`; the result still carries the month index

![](https://i.imgur.com/PUWpnUA.png)

### Concatenating columns
> Here `axis='columns'` means the same as `axis=1`

![](https://i.imgur.com/XWuO4St.png)

> Of course, we can also use keys on columns to produce a composite column index

![](https://i.imgur.com/wSxbtnr.png)

### pd.concat() with dict
> We can also pass a dict to concat, but pay attention to the format.

![](https://i.imgur.com/HOLUki4.png)

## 4. Outer & inner joins
### Using with arrays | numpy
> Using arrays, a data type provided by numpy

* arange() [official docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.arange.html)
* reshape() [official docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.reshape.html)

![](https://i.imgur.com/jja8WU3.png)

> Arrays can be joined side by side

* hstack() [official docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html)
* vstack() [official docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.vstack.html#numpy.vstack)
* concatenate() [official docs](https://docs.scipy.org/doc/numpy/reference/generated/numpy.concatenate.html#numpy.concatenate)

![](https://i.imgur.com/H8zahcZ.png)

> They can also be stacked top to bottom

![](https://i.imgur.com/H2bB6x6.png)

> Here an error occurs because the array shapes do not match.

![](https://i.imgur.com/h4kqHuP.png)

### Population & unemployment data
![](https://i.imgur.com/FfoSWiG.png)

> Convert them to arrays

![](https://i.imgur.com/wt4ivNd.png)

> The number of rows matches, so we can join them side by side

![](https://i.imgur.com/d0SejqG.png)

### concat(join=)
* Combines the rows of multiple tables
* There are two kinds:
    * Outer join: takes the union and fills the gaps with `NaN`
    * Inner join: takes the intersection

> inner

![](https://i.imgur.com/7zUmLs8.png)

> outer

![](https://i.imgur.com/PEE6yLz.png)

> Next we concat downward, but because the columns have nothing in common, the result is empty

![](https://i.imgur.com/ftXnxBv.png)

# [CH3] Merging data
> Here, you'll learn all about merging pandas DataFrames. You'll explore different techniques for merging, and learn about left joins, right joins, inner joins, and outer joins, as well as when to use which. You'll also learn about ordered merging, which is useful when you want to merge DataFrames whose columns have natural orderings, like date-time columns.

## 1. Merging DataFrames
### Merging
* merge() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html)

> Below we use population data and city data as the example

![](https://i.imgur.com/xm5RMJG.png)

![](https://i.imgur.com/vFuOFvH.png)

> Here we call merge directly. By default it matches rows on the columns the two dfs have in common (an inner join) and attaches the matching values from the right df to the left df

![](https://i.imgur.com/SckKjXg.png)

### Merging on multiple columns | merge(on=)
![](https://i.imgur.com/ohjI5Ya.png)

> We call merge the same way as above, but the result is an empty df. Because the two dfs have exactly the same column names, merge uses all of them as join keys, and no row matches on every column.

![](https://i.imgur.com/O1wClF7.png)

> Here we add the `on=` parameter. merge uses the column you set in `on` as the key and combines the matching values from both dfs; because the remaining columns share the same names, `_x` and `_y` are appended automatically after the merge.

![](https://i.imgur.com/N9Zjsz0.png)

> `on` can also be set to two columns.

![](https://i.imgur.com/FeVj58V.png)

> The `suffixes=` parameter replaces the default `_x` and `_y` suffixes.

![](https://i.imgur.com/UMwe421.png)

### Specifying columns to merge
![](https://i.imgur.com/rpmGFwd.png)

> With `left_on=` and `right_on=` you can set which column each of the left and right tables uses as the key.

![](https://i.imgur.com/Svoea3N.png)

> Swapping left and right works too; just make sure the parameters still match the right tables.

![](https://i.imgur.com/AHkZ68l.png)

## 2. Joining DataFrames
### merge(how=)
![](https://i.imgur.com/JmyN3i6.png)

> There are four ways to combine, shown below

> inner: the intersection of the two

![](https://i.imgur.com/37UxKoY.png)

> left: keep the left df and bring over the matching values from the right

![](https://i.imgur.com/3DAHIbh.png)

> right: keep the right df and bring over the matching values from the left

![](https://i.imgur.com/1lqLTW6.png)

> outer: the union of the two dfs

![](https://i.imgur.com/g5IP2z6.png)

### Using .join()
* join() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.join.html)

![](https://i.imgur.com/67AwdGz.png)

> Calling join() directly defaults to a left join, the same idea as merge(how='left').

![](https://i.imgur.com/2qmifLn.png)

> join also accepts how; here we use `how='right'`, `how='outer'`, and `how='inner'` in turn

![](https://i.imgur.com/dSmo3Ba.png)

![](https://i.imgur.com/7CW1rfU.png)

![](https://i.imgur.com/bOnKRoA.png)

### Which should you use?
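
As a rough side-by-side of these tools, here is a minimal sketch on two tiny invented tables. The names `population` and `unemployment` echo the notes, but the zip codes, column names, and values are made up. `DataFrame.append()` was removed in pandas 2.0, so vertical stacking is shown with `pd.concat()` instead.

```python
import pandas as pd

# Hypothetical data indexed by zip code (values invented for illustration).
population = pd.DataFrame({"population": [100, 200, 300]},
                          index=pd.Index([1001, 1002, 1003], name="zip"))
unemployment = pd.DataFrame({"unemployed": [5, 7, 9]},
                            index=pd.Index([1002, 1003, 1004], name="zip"))

# Vertical stacking: different column names, so the gaps become NaN.
stacked = pd.concat([population, unemployment])

# pd.concat with axis=1: simple outer/inner joins on the index.
outer = pd.concat([population, unemployment], axis=1, join="outer")
inner = pd.concat([population, unemployment], axis=1, join="inner")

# DataFrame.join: index-based join, how="left" unless told otherwise.
left = population.join(unemployment)
right = population.join(unemployment, how="right")

# pd.merge: joins on columns, so "zip" is turned into a regular column first.
merged = pd.merge(population.reset_index(), unemployment.reset_index(),
                  on="zip", how="outer")

print(left)
print(merged)
```

In this sketch `left` keeps zips 1001 to 1003 and `right` keeps 1002 to 1004, while `outer` and `merged` cover all four zips.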
> append and concat combine data directly, while join and merge link two datasets according to a condition. Of these, merge is the most customizable.

* df1.append(df2): stacking vertically
* pd.concat([df1, df2]):
    * stacking many horizontally or vertically
    * simple inner/outer joins on Indexes
* df1.join(df2): inner/outer/left/right joins on Indexes
* pd.merge(df1, df2): many joins on multiple columns

## 3. Ordered merges
> Here we use sales data to demonstrate what merge_ordered() does

### Using merge_ordered()
* merge_ordered() [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_ordered.html)
    * A merge aimed at `timeseries` data

![](https://i.imgur.com/cnvyNcv.png)

![](https://i.imgur.com/rt3MVWd.png)

> First we use merge()

![](https://i.imgur.com/J3j3JQs.png)

> However, we found that `sorted_values()` does not work (the pandas method for sorting by a column is `sort_values()`)

![](https://i.imgur.com/ujhnp4K.png)

> This is where merge_ordered() helps: it sorts by date automatically.

![](https://i.imgur.com/aRkKV58.png)

> on and suffixes can be added as well

![](https://i.imgur.com/6qL2Hgh.png)

### Ordered merge with ffill
> Here we use stock and GDP data to show how to fill the missing values left after a merge.

![](https://i.imgur.com/C0d6GDX.png)

![](https://i.imgur.com/KLpnWxr.png)

> We use merge_ordered() to merge the two dfs on `Date`, but some days have no matching data.

![](https://i.imgur.com/y28T8bT.png)

> So we can use `merge_ordered(fill_method=)` to fill the `NaN` values. `ffill` here means forward fill: each gap is filled with the previous non-missing value.

![](https://i.imgur.com/kctS35l.png)
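
The ordered merge and fill described above can be sketched as follows. This is an illustration under assumptions, not the course's code: the `stocks` and `gdp` tables, the `Close` and `GDP` column names, and all values are invented.

```python
import pandas as pd

# Hypothetical stock closes and quarterly GDP (values invented).
stocks = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-04", "2016-04-01", "2016-05-02"]),
    "Close": [100.0, 105.0, 103.0],
})
gdp = pd.DataFrame({
    "Date": pd.to_datetime(["2016-01-01", "2016-04-01"]),
    "GDP": [18000.0, 18200.0],
})

# merge_ordered does an outer join by default and returns the rows
# sorted by the key column, so no separate sort step is needed.
ordered = pd.merge_ordered(stocks, gdp, on="Date")

# fill_method="ffill" fills each gap with the previous available value.
# The first row has no earlier stock price, so its Close stays NaN.
filled = pd.merge_ordered(stocks, gdp, on="Date", fill_method="ffill")

print(ordered)
print(filled)
```

Forward filling suits data such as quarterly GDP, where the last published figure remains the best available value until the next release.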