owned this note
owned this note
Published
Linked with GitHub
# Combining Datasets: Concat and Append
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `combining data` `Python Data Science Handbook`
引用整理自[Python Data Science Handbook CH3](https://jakevdp.github.io/PythonDataScienceHandbook/)
[TOC]
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
pnadas有多種不同合併資料的方式,從簡單到複雜的操作
> Here we'll take a look at simple concatenation of Series and DataFrames with the `pd.concat` function; later we'll dive into more sophisticated in-memory `merges` and `joins` implemented in Pandas.
```python=
import pandas as pd
import numpy as np
# useful way to make dataframe
def make_df(cols, ind):
"""Quickly make a DataFrame"""
data = {c: [str(c) + str(i) for i in ind]
for c in cols}
return pd.DataFrame(data, ind)
# example DataFrame
make_df('ABC', range(3))
# A B C
#0 A0 B0 C0
#1 A1 B1 C1
#2 A2 B2 C2
```
In addition, we'll create a quick class that allows us to display multiple `DataFrames` side by side. The code makes use of the special `_repr_html_` method, which IPython uses to implement its rich object display:
```python=
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
```
## Recall: Concatenation of NumPy Arrays
參考[The Basics of NumPy Arrays](/cNcuLIl_QoWF4_Gz494HrQ)
> Concatenation of `Series` and `DataFrame` objects is very similar to concatenation of Numpy `arrays` (`np.concatenate()`) 範例如下
```python=
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
array([1, 2, 3, 4, 5, 6, 7, 8, 9])
```
> `np.concatenate()`的第一個arg. 是`tuple`或`list` of `arrays`; 第二個arg. `axis`是合併的方向
>
```python=
x = [[1, 2],
[3, 4]]
np.concatenate([x, x], axis=1) # by row
array([[1, 2, 1, 2],
[3, 4, 3, 4]])
```
## Simple Concatenation with `pd.concat`
pandas 的 `pd.concat()`與numpy的類似,但有更多的參數可以調整
```python=
# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
keys=None, levels=None, names=None, verify_integrity=False,
copy=True)
```
> 簡單的合併 `series` `DataFrame`
```python=
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])
#1 A
#2 B
#3 C
#4 D
#5 E
#6 F
#dtype: object
```
> `DataFrame`
>
```python=
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
```

:a: `pd.concat()`預設是以列合併(row-wise)
> `pd.concat()`同樣也能用`axis`參數控制合併的方向
```python=
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='col')")
```

:a: `axis='col'`等價於`axis=1`;`axis='row'`等價於`axis=0` (用0、1容易混淆)
## Duplicate indices
> `pd.concat`與`np.concatenate`最大的不同是pandas版本的在合併時即使會有**重複**的index發生也會==全部保留==
```python=
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
```

:a: 重複的index在`DataFrame`會造成許多額外的問題,`pd.concat()`提供幾種方式處理
#### Catching the repeats as an error
> 設定`verify_integrity` argument 為`True`,此時如果遇到重複的index就會有error
```python=
try:
pd.concat([x, y], verify_integrity=True)
except ValueError as e:
print("ValueError:", e)
#ValueError: Indexes have overlapping values: [0, 1]
```
#### Ignoring the index
> 忽略原本的index,直接給新一組index
> `ignore_index=True`
```python=
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
```

#### Adding MultiIndex keys
> 使用hierarchically indexed
> `keys`
```python=
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
```

:a: 可以用[Hierarchical Indexing](/fX8rYDVrQLOTN2Q5QvOmzA)討論的方法進行操作
### Concatenation with joins
> 如果`DataFrame`欄位不一致,`pd.concate()`有幾種方式可以處理
```python=
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
```

:a: 預設是保留所有欄位,如果的`DataFrame`在對應的欄位沒有資料就會是遺漏值(`NaN`)
> `join_axes` `join` 可以控制合併的方式,預設的聯集合併方式為`join='outer'`,也可以調整為交集合併`join='inner'`
```python=
display('df5', 'df6',
"pd.concat([df5, df6], join='inner')")
```

:a: 合併後的資料只保留兩邊都有的欄位
> 另一個方法是用`join_axes`直接放入要保留的list of index objects
> 下例是放入其中一個`DataFrame`的columns,表示合併後的資料只保留這些columns
```python=
display('df5', 'df6',
"pd.concat([df5, df6], join_axes=[df5.columns])")
```

#### Concatenation mulitiple DataFrame(補充)
這邊是其他補充資料
> 可以直接input `list` of `DataFrame`合併多個資料
>
```python=
pdList = [df1, df2, ...] # List of your dataframes
new_df = pd.concat(pdList)
```
### The append() method
> `pd.concat([df1, df2])`等價於`df1.append(df2)`
```python=
display('df1', 'df2', 'df1.append(df2)')
```

:warning: 與`list`的`append()`、`extend()`不同,pandas的`append()`不會直接改變原本的資料,而是create新的object。這會額外消耗大量記憶體,所以如果要合併大量資料,建議使用`pd.concat()`