owned this note
owned this note
Published
Linked with GitHub
# Aggregation and Grouping
> Lee Tsung-Tang
> ###### tags: `python` `pandas` `Python Data Science Handbook`
引用整理自[Python Data Science Handbook CH3](https://jakevdp.github.io/PythonDataScienceHandbook/)
[TOC]
{%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %}
An essential piece of analysis of large data is efficient summarization: computing aggregations like `sum()`, `mean()`, `median()`, `min()`, and `max()`, in which a single number gives insight into the nature of a potentially large dataset. In this section, we'll explore ==aggregations== in Pandas
For convenience, we'll use the same display magic function that we've seen in previous sections:
```python=
import numpy as np
import pandas as pd
class display(object):
"""Display HTML representation of multiple objects"""
template = """<div style="float: left; padding: 10px;">
<p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
</div>"""
def __init__(self, *args):
self.args = args
def _repr_html_(self):
return '\n'.join(self.template.format(a, eval(a)._repr_html_())
for a in self.args)
def __repr__(self):
return '\n\n'.join(a + '\n' + repr(eval(a))
for a in self.args)
```
## Planets Data
>Here we will use the `Planets dataset`, available via the [Seaborn](http://seaborn.pydata.org/) package
```python=
import seaborn as sns
planets = sns.load_dataset('planets')
planets.shape
#(1035, 6)
planets.head()
```

:waning_crescent_moon: 資料包含到2014年為止1000+多顆太陽系行星的資訊
## Simple Aggregation in Pandas
we explored some of the data aggregations available for NumPy arrays[Aggregations: Min, Max, and Everything In Between](/bxIdXGwyT9-iYBETNLUvIQ)
`pd.series`使aggregations會返回single value
```python=
rng = np.random.RandomState(42)
ser = pd.Series(rng.rand(5))
ser
#0 0.374540
#1 0.950714
#2 0.731994
#3 0.598658
#4 0.156019
#dtype: float64
ser.sum()
#2.8119254917081569
ser.mean()
#0.56238509834163142
```
> For a DataFrame, by default the aggregates return results within ==each column==:
```python=
df = pd.DataFrame({'A': rng.rand(5),
'B': rng.rand(5)})
```

```python=
df.mean()
#A 0.477888
#B 0.443420
#dtype: float64
```
>By specifying the `axis` argument, you can instead aggregate within ==each row==:
```python=
df.mean(axis='columns')
#0 0.088290
#1 0.513997
#2 0.849309
#3 0.406727
#4 0.444949
#dtype: float64
```
> Pandas `Series` and `DataFrames` include all of the common aggregates mentioned in [Aggregations: Min, Max, and Everything In Between](/bxIdXGwyT9-iYBETNLUvIQ)
> pandas 的 `DF.describe` method一次返回column的多個aggregates
```python=
planets.dropna().describe()
```

> The following table summarizes some other built-in Pandas aggregations:
|Aggregation |Description
|--|--
|`count()`|Total number of items
|`first()`, `last()`|First and last item
|`mean()`, `median()`| Mean and median
`min()`, `max()`|Minimum and maximum
`std()`, `var()`|Standard deviation and variance
`mad()`|Mean absolute deviation
`prod()`|Product of all items
`sum()`|Sum of all items
These are all methods of DataFrame and Series objects.
## GroupBy: Split, Apply, Combine
The name **"group by"** comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: ==split, apply, combine.==
### Split, apply, combine
> split-apply-combine operation 的示意圖如下,其中 "apply" 是 summation aggregation:

This makes clear what the groupby accomplishes:
- The ==split step== involves breaking up and grouping a DataFrame depending on the value of the specified key.
- The ==apply step==involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups.
- The ==combine step== merges the results of these operations into an output array.
> 範例
```python=
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data': range(6)}, columns=['key', 'data'])
```

> `groupby()` method of DataFrames, passing the name of the desired `key` column:
```python=
df.groupby('key')
#<pandas.core.groupby.DataFrameGroupBy object at 0x117272160>
```
:warning: 注意,返回的是`DataFrameGroupBy` object
#### 關於[DataFrameGroupBy object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)
A groupby operation involves some *combination of splitting the object, applying a function, and combining the results*. This can be used to group large amounts of data and compute operations on these groups.
例如:
```python=
df.groupby('key').sum()
data
```

:waning_gibbous_moon: 可以使用所有Pandas 或 NumPy aggregation function,以其它對DF的操作
### The GroupBy object
`GroupBy` object 最常運用的是[Aggregate, Filter, Transform, Appl](#Aggregate,-filter,-transform,-apply),此處先從基礎的操作介紹
#### Column indexing
The `GroupBy` object supports ==column indexing== in the same way as the DataFrame, and returns a modified GroupBy object. For example:
```python=
planets.groupby('method')
#<pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8>
planets.groupby('method')['orbital_period']
#<pandas.core.groupby.SeriesGroupBy object at 0x117272da0>
```
:waning_crescent_moon: 取出`orbital_period`為`pd.series`
> 在沒有進行aggregate前不會進行任何計算
```python=
planets.groupby('method')['orbital_period'].median()
```

> 返回各method的orbital periods (in days)中位數
#### Iteration over groups
`GroupBy` object 可以直接疊帶[python Iterator and generator](/@s710262101/rJpLVUc0V),因此可以直接調用迴圈,每個group都會返回一個`pd.series`或DF
```python=
for (method, group) in planets.groupby('method'):
print("{0:30s} shape={1}".format(method, group.shape))
```

:waning_crescent_moon: 雖然方便,但通常用bulil in function的`apply`會更快
#### Dispatch methods
> 有些方法可能不被預設用於`GroupBy` object,仍然可以將每個group視為獨立的DF/series各別執行method:
>
```python=
planets.groupby('method')['year'].describe().unstack()
```

:waning_crescent_moon: the vast majority of planets have been discovered by the Radial Velocity and Transit methods.
> `describe()` method applied於每個groups,接著再合併結果
>
> Again, ==any valid DataFrame/Series method== can be used on the corresponding `GroupBy` object.
### Aggregate, filter, transform, apply
對 `GroupBy` objects 最常見的操作是 `aggregate()`, `filter()`, `transform()`, and `apply()` methods
For the purpose of the following subsections, we'll use this `DataFrame`:
```python=
rng = np.random.RandomState(0)
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
'data1': range(6),
'data2': rng.randint(0, 10, 6)},
columns = ['key', 'data1', 'data2'])
df
```

#### Aggregation
`aggregate()` method 非常彈性
> It can take *a string, a function, or a list thereof*, and compute all the aggregates at once
```python=
df.groupby('key').aggregate(['min', np.median, max])
```

> 另一個更常見的方法是用`dict`操作,可以指定對不同column做不同操作
```python=
df.groupby('key').aggregate({'data1': 'min',
'data2': 'max'})
```

#### Filtering
> `filter()` method 可以drop data by groups
>For example, we might want to keep all groups in which the standard deviation is larger than some critical value:
```python=
def filter_func(x):
return x['data2'].std() > 4
display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)")
```

:waning_crescent_moon: filter function 會return **Boolean value** 判斷該組是否通過條件,例如上例中group A的 standard deviation 沒有大於 4,因此最終DF刪除group A
#### Transformation
While aggregation must return a ***reduced version*** of the data
> transformation 的結果會與完整的資料重新合併,回傳的ouptput與input有一模一樣的shape
> 組平減:
```python=
df.groupby('key').transform(lambda x: x - x.mean())
```

#### The `apply()` method
`apply` method 作用於DF時,是針對整個DF進行操作(逐行或逐列)
> For example, here is an `apply()` that normalizes the first column by the *sum of the second*:
```python=
def norm_by_data2(x):
# x is a DataFrame of group values
x['data1'] /= x['data2'].sum()
return x
display('df', "df.groupby('key').apply(norm_by_data2)")
```

:warning: `apply()` 回傳Pandas object or scalar
### Specifying the split key
#### A list, array, series, or index providing the grouping keys
The key can be any `series` or `list` with a *length matching that of the DataFrame*. For example:
> 這個list表示各列的組別
```python=
L = [0, 1, 0, 1, 2, 0]
display('df', 'df.groupby(L).sum()')
```

#### A dictionary or series mapping index to group
Another method is to provide a dictionary that maps index values to the group keys:
> 重新分組
```python=
df2 = df.set_index('key')
mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'}
display('df2', 'df2.groupby(mapping).sum()')
```

#### Any Python function
> 可以直接在input時對group column/index進行操作
```python=
display('df2', 'df2.groupby(str.lower).mean()')
```

#### A list of valid keys
Further, any of the preceding key choices can be combined to group on a multi-index:
```python=
df2.groupby([str.lower, mapping]).mean()
```

### Grouping example
> 每種`method`在各年代發現行星的數量
```python=
decade = 10 * (planets['year'] // 10)
decade = decade.astype(str) + 's'
decade.name = 'decade'
planets.groupby(['method', decade])['number'].sum().unstack().fillna(0)
```

:waning_crescent_moon: 說明:以'method', decade分組,在將各組的數量('number')加總(sum),翻轉(unstack),補0。