Combining Datasets: Concat and Append

# Combining Datasets: Concat and Append > Lee Tsung-Tang > ###### tags: `python` `pandas` `combining data` `Python Data Science Handbook` 引用整理自[Python Data Science Handbook CH3](https://jakevdp.github.io/PythonDataScienceHandbook/) [TOC] {%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %} pnadas有多種不同合併資料的方式，從簡單到複雜的操作 > Here we'll take a look at simple concatenation of Series and DataFrames with the `pd.concat` function; later we'll dive into more sophisticated in-memory `merges` and `joins` implemented in Pandas. ```python= import pandas as pd import numpy as np # useful way to make dataframe def make_df(cols, ind): """Quickly make a DataFrame""" data = {c: [str(c) + str(i) for i in ind] for c in cols} return pd.DataFrame(data, ind) # example DataFrame make_df('ABC', range(3)) # A B C #0 A0 B0 C0 #1 A1 B1 C1 #2 A2 B2 C2 ``` In addition, we'll create a quick class that allows us to display multiple `DataFrames` side by side. The code makes use of the special `_repr_html_` method, which IPython uses to implement its rich object display: ```python= class display(object): """Display HTML representation of multiple objects""" template = """<div style="float: left; padding: 10px;"> <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1} </div>""" def __init__(self, *args): self.args = args def _repr_html_(self): return '\n'.join(self.template.format(a, eval(a)._repr_html_()) for a in self.args) def __repr__(self): return '\n\n'.join(a + '\n' + repr(eval(a)) for a in self.args) ``` ## Recall: Concatenation of NumPy Arrays 參考[The Basics of NumPy Arrays](/cNcuLIl_QoWF4_Gz494HrQ) > Concatenation of `Series` and `DataFrame` objects is very similar to concatenation of Numpy `arrays` (`np.concatenate()`) 範例如下 ```python= x = [1, 2, 3] y = [4, 5, 6] z = [7, 8, 9] np.concatenate([x, y, z]) array([1, 2, 3, 4, 5, 6, 7, 8, 9]) ``` > `np.concatenate()`的第一個arg. 是`tuple`或`list` of `arrays`；第二個arg. `axis`是合併的方向 > ```python= x = [[1, 2], [3, 4]] np.concatenate([x, x], axis=1) # by row array([[1, 2, 1, 2], [3, 4, 3, 4]]) ``` ## Simple Concatenation with `pd.concat` pandas 的 `pd.concat()`與numpy的類似，但有更多的參數可以調整 ```python= # Signature in Pandas v0.18 pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True) ``` > 簡單的合併 `series` `DataFrame` ```python= ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3]) ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6]) pd.concat([ser1, ser2]) #1 A #2 B #3 C #4 D #5 E #6 F #dtype: object ``` > `DataFrame` > ```python= df1 = make_df('AB', [1, 2]) df2 = make_df('AB', [3, 4]) display('df1', 'df2', 'pd.concat([df1, df2])') ``` ![](https://i.imgur.com/QRLX6vz.png) :a: `pd.concat()`預設是以列合併(row-wise) > `pd.concat()`同樣也能用`axis`參數控制合併的方向 ```python= df3 = make_df('AB', [0, 1]) df4 = make_df('CD', [0, 1]) display('df3', 'df4', "pd.concat([df3, df4], axis='col')") ``` ![](https://i.imgur.com/3RhSueX.png) :a: `axis='col'`等價於`axis=1`；`axis='row'`等價於`axis=0` (用0、1容易混淆) ## Duplicate indices > `pd.concat`與`np.concatenate`最大的不同是pandas版本的在合併時即使會有**重複**的index發生也會==全部保留== ```python= x = make_df('AB', [0, 1]) y = make_df('AB', [2, 3]) y.index = x.index # make duplicate indices! display('x', 'y', 'pd.concat([x, y])') ``` ![](https://i.imgur.com/5Wmxege.png) :a: 重複的index在`DataFrame`會造成許多額外的問題，`pd.concat()`提供幾種方式處理 #### Catching the repeats as an error > 設定`verify_integrity` argument 為`True`，此時如果遇到重複的index就會有error ```python= try: pd.concat([x, y], verify_integrity=True) except ValueError as e: print("ValueError:", e) #ValueError: Indexes have overlapping values: [0, 1] ``` #### Ignoring the index > 忽略原本的index，直接給新一組index > `ignore_index=True` ```python= display('x', 'y', 'pd.concat([x, y], ignore_index=True)') ``` ![](https://i.imgur.com/lI0Gmu5.png) #### Adding MultiIndex keys > 使用hierarchically indexed > `keys` ```python= display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])") ``` ![](https://i.imgur.com/OgLpwx2.png) :a: 可以用[Hierarchical Indexing](/fX8rYDVrQLOTN2Q5QvOmzA)討論的方法進行操作 ### Concatenation with joins > 如果`DataFrame`欄位不一致，`pd.concate()`有幾種方式可以處理 ```python= df5 = make_df('ABC', [1, 2]) df6 = make_df('BCD', [3, 4]) display('df5', 'df6', 'pd.concat([df5, df6])') ``` ![](https://i.imgur.com/g0EabHZ.png) :a: 預設是保留所有欄位，如果的`DataFrame`在對應的欄位沒有資料就會是遺漏值(`NaN`) > `join_axes` `join` 可以控制合併的方式，預設的聯集合併方式為`join='outer'`，也可以調整為交集合併`join='inner'` ```python= display('df5', 'df6', "pd.concat([df5, df6], join='inner')") ``` ![](https://i.imgur.com/RMs301J.png) :a: 合併後的資料只保留兩邊都有的欄位 > 另一個方法是用`join_axes`直接放入要保留的list of index objects > 下例是放入其中一個`DataFrame`的columns，表示合併後的資料只保留這些columns ```python= display('df5', 'df6', "pd.concat([df5, df6], join_axes=[df5.columns])") ``` ![](https://i.imgur.com/RbiMysl.png) #### Concatenation mulitiple DataFrame(補充) 這邊是其他補充資料 > 可以直接input `list` of `DataFrame`合併多個資料 > ```python= pdList = [df1, df2, ...] # List of your dataframes new_df = pd.concat(pdList) ``` ### The append() method > `pd.concat([df1, df2])`等價於`df1.append(df2)` ```python= display('df1', 'df2', 'df1.append(df2)') ``` ![](https://i.imgur.com/386mLU7.png) :warning: 與`list`的`append()`、`extend()`不同，pandas的`append()`不會直接改變原本的資料，而是create新的object。這會額外消耗大量記憶體，所以如果要合併大量資料，建議使用`pd.concat()`

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.