Aggregation and Grouping

# Aggregation and Grouping > Lee Tsung-Tang > ###### tags: `python` `pandas` `Python Data Science Handbook` 引用整理自[Python Data Science Handbook CH3](https://jakevdp.github.io/PythonDataScienceHandbook/) [TOC] {%hackmd @88u1wNUtQpyVz9FsQYeBRg/r1vSYkogS %} An essential piece of analysis of large data is efficient summarization: computing aggregations like `sum()`, `mean()`, `median()`, `min()`, and `max()`, in which a single number gives insight into the nature of a potentially large dataset. In this section, we'll explore ==aggregations== in Pandas For convenience, we'll use the same display magic function that we've seen in previous sections: ```python= import numpy as np import pandas as pd class display(object): """Display HTML representation of multiple objects""" template = """<div style="float: left; padding: 10px;"> <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1} </div>""" def __init__(self, *args): self.args = args def _repr_html_(self): return '\n'.join(self.template.format(a, eval(a)._repr_html_()) for a in self.args) def __repr__(self): return '\n\n'.join(a + '\n' + repr(eval(a)) for a in self.args) ``` ## Planets Data >Here we will use the `Planets dataset`, available via the [Seaborn](http://seaborn.pydata.org/) package ```python= import seaborn as sns planets = sns.load_dataset('planets') planets.shape #(1035, 6) planets.head() ``` ![](https://i.imgur.com/dEIHYVL.png =80%x) :waning_crescent_moon: 資料包含到2014年為止1000+多顆太陽系行星的資訊 ## Simple Aggregation in Pandas we explored some of the data aggregations available for NumPy arrays[Aggregations: Min, Max, and Everything In Between](/bxIdXGwyT9-iYBETNLUvIQ) `pd.series`使aggregations會返回single value ```python= rng = np.random.RandomState(42) ser = pd.Series(rng.rand(5)) ser #0 0.374540 #1 0.950714 #2 0.731994 #3 0.598658 #4 0.156019 #dtype: float64 ser.sum() #2.8119254917081569 ser.mean() #0.56238509834163142 ``` > For a DataFrame, by default the aggregates return results within ==each column==: ```python= df = pd.DataFrame({'A': rng.rand(5), 'B': rng.rand(5)}) ``` ![](https://i.imgur.com/GllYIVS.png) ```python= df.mean() #A 0.477888 #B 0.443420 #dtype: float64 ``` >By specifying the `axis` argument, you can instead aggregate within ==each row==: ```python= df.mean(axis='columns') #0 0.088290 #1 0.513997 #2 0.849309 #3 0.406727 #4 0.444949 #dtype: float64 ``` > Pandas `Series` and `DataFrames` include all of the common aggregates mentioned in [Aggregations: Min, Max, and Everything In Between](/bxIdXGwyT9-iYBETNLUvIQ) > pandas 的 `DF.describe` method一次返回column的多個aggregates ```python= planets.dropna().describe() ``` ![](https://i.imgur.com/J71ED4V.png =70%x) > The following table summarizes some other built-in Pandas aggregations: |Aggregation |Description |--|-- |`count()`|Total number of items |`first()`, `last()`|First and last item |`mean()`, `median()`| Mean and median `min()`, `max()`|Minimum and maximum `std()`, `var()`|Standard deviation and variance `mad()`|Mean absolute deviation `prod()`|Product of all items `sum()`|Sum of all items These are all methods of DataFrame and Series objects. ## GroupBy: Split, Apply, Combine The name **"group by"** comes from a command in the SQL database language, but it is perhaps more illuminative to think of it in the terms first coined by Hadley Wickham of Rstats fame: ==split, apply, combine.== ### Split, apply, combine > split-apply-combine operation 的示意圖如下，其中 "apply" 是 summation aggregation: ![](https://i.imgur.com/2BcEc4m.png) This makes clear what the groupby accomplishes: - The ==split step== involves breaking up and grouping a DataFrame depending on the value of the specified key. - The ==apply step==involves computing some function, usually an aggregate, transformation, or filtering, within the individual groups. - The ==combine step== merges the results of these operations into an output array. > 範例 ```python= df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data': range(6)}, columns=['key', 'data']) ``` ![](https://i.imgur.com/LMz9qpp.png) > `groupby()` method of DataFrames, passing the name of the desired `key` column: ```python= df.groupby('key') #<pandas.core.groupby.DataFrameGroupBy object at 0x117272160> ``` :warning: 注意，返回的是`DataFrameGroupBy` object #### 關於[DataFrameGroupBy object](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby) A groupby operation involves some *combination of splitting the object, applying a function, and combining the results*. This can be used to group large amounts of data and compute operations on these groups. 例如： ```python= df.groupby('key').sum() data ``` ![](https://i.imgur.com/5gmLfLf.png) :waning_gibbous_moon: 可以使用所有Pandas 或 NumPy aggregation function，以其它對DF的操作 ### The GroupBy object `GroupBy` object 最常運用的是[Aggregate, Filter, Transform, Appl](#Aggregate,-filter,-transform,-apply)，此處先從基礎的操作介紹 #### Column indexing The `GroupBy` object supports ==column indexing== in the same way as the DataFrame, and returns a modified GroupBy object. For example: ```python= planets.groupby('method') #<pandas.core.groupby.DataFrameGroupBy object at 0x1172727b8> planets.groupby('method')['orbital_period'] #<pandas.core.groupby.SeriesGroupBy object at 0x117272da0> ``` :waning_crescent_moon: 取出`orbital_period`為`pd.series` > 在沒有進行aggregate前不會進行任何計算 ```python= planets.groupby('method')['orbital_period'].median() ``` ![](https://i.imgur.com/roNcIxZ.png =70%x) > 返回各method的orbital periods (in days)中位數 #### Iteration over groups `GroupBy` object 可以直接疊帶[python Iterator and generator](/@s710262101/rJpLVUc0V)，因此可以直接調用迴圈，每個group都會返回一個`pd.series`或DF ```python= for (method, group) in planets.groupby('method'): print("{0:30s} shape={1}".format(method, group.shape)) ``` ![](https://i.imgur.com/ovo94sR.png) :waning_crescent_moon: 雖然方便，但通常用bulil in function的`apply`會更快 #### Dispatch methods > 有些方法可能不被預設用於`GroupBy` object，仍然可以將每個group視為獨立的DF/series各別執行method: > ```python= planets.groupby('method')['year'].describe().unstack() ``` ![](https://i.imgur.com/Igjw1cg.png) :waning_crescent_moon: the vast majority of planets have been discovered by the Radial Velocity and Transit methods. > `describe()` method applied於每個groups，接著再合併結果 > > Again, ==any valid DataFrame/Series method== can be used on the corresponding `GroupBy` object. ### Aggregate, filter, transform, apply 對 `GroupBy` objects 最常見的操作是 `aggregate()`, `filter()`, `transform()`, and `apply()` methods For the purpose of the following subsections, we'll use this `DataFrame`: ```python= rng = np.random.RandomState(0) df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 'data1': range(6), 'data2': rng.randint(0, 10, 6)}, columns = ['key', 'data1', 'data2']) df ``` ![](https://i.imgur.com/Lyse5GF.png =30%x) #### Aggregation `aggregate()` method 非常彈性 > It can take *a string, a function, or a list thereof*, and compute all the aggregates at once ```python= df.groupby('key').aggregate(['min', np.median, max]) ``` ![](https://i.imgur.com/FFUcDC7.png =50%x) > 另一個更常見的方法是用`dict`操作，可以指定對不同column做不同操作 ```python= df.groupby('key').aggregate({'data1': 'min', 'data2': 'max'}) ``` ![](https://i.imgur.com/nl01EUZ.png) #### Filtering > `filter()` method 可以drop data by groups >For example, we might want to keep all groups in which the standard deviation is larger than some critical value: ```python= def filter_func(x): return x['data2'].std() > 4 display('df', "df.groupby('key').std()", "df.groupby('key').filter(filter_func)") ``` ![](https://i.imgur.com/owJfpwd.png =70%x) :waning_crescent_moon: filter function 會return **Boolean value** 判斷該組是否通過條件，例如上例中group A的 standard deviation 沒有大於 4，因此最終DF刪除group A #### Transformation While aggregation must return a ***reduced version*** of the data > transformation 的結果會與完整的資料重新合併，回傳的ouptput與input有一模一樣的shape > 組平減: ```python= df.groupby('key').transform(lambda x: x - x.mean()) ``` ![](https://i.imgur.com/o1lgAqX.png =20%x) #### The `apply()` method `apply` method 作用於DF時，是針對整個DF進行操作(逐行或逐列) > For example, here is an `apply()` that normalizes the first column by the *sum of the second*: ```python= def norm_by_data2(x): # x is a DataFrame of group values x['data1'] /= x['data2'].sum() return x display('df', "df.groupby('key').apply(norm_by_data2)") ``` ![](https://i.imgur.com/aI108iQ.png) :warning: `apply()` 回傳Pandas object or scalar ### Specifying the split key #### A list, array, series, or index providing the grouping keys The key can be any `series` or `list` with a *length matching that of the DataFrame*. For example: > 這個list表示各列的組別 ```python= L = [0, 1, 0, 1, 2, 0] display('df', 'df.groupby(L).sum()') ``` ![](https://i.imgur.com/FHW3hUm.png) #### A dictionary or series mapping index to group Another method is to provide a dictionary that maps index values to the group keys: > 重新分組 ```python= df2 = df.set_index('key') mapping = {'A': 'vowel', 'B': 'consonant', 'C': 'consonant'} display('df2', 'df2.groupby(mapping).sum()') ``` ![](https://i.imgur.com/WKJRzZs.png =60%x) #### Any Python function > 可以直接在input時對group column/index進行操作 ```python= display('df2', 'df2.groupby(str.lower).mean()') ``` ![](https://i.imgur.com/u9WH0gG.png) #### A list of valid keys Further, any of the preceding key choices can be combined to group on a multi-index: ```python= df2.groupby([str.lower, mapping]).mean() ``` ![](https://i.imgur.com/dY1Pshc.png) ### Grouping example > 每種`method`在各年代發現行星的數量 ```python= decade = 10 * (planets['year'] // 10) decade = decade.astype(str) + 's' decade.name = 'decade' planets.groupby(['method', decade])['number'].sum().unstack().fillna(0) ``` ![](https://i.imgur.com/hZGyC96.png) :waning_crescent_moon: 說明：以'method', decade分組，在將各組的數量('number')加總(sum)，翻轉(unstack)，補0。

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.