--- title: Virgil - Intro To Pandas Seaborn - S61 Groupby tags: Virgil, LearnWorld, IntroPandasSeaborn --- <a target="_blank" href="https://colab.research.google.com/drive/1dmqNCOmQ8thWeKz_fW-dX30Hbo3D8waz"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a> ## 1. Groupby By `groupby` we are referring to a process involving one or more of the following steps: * **Splitting** the data into groups based on some criteria * **Applying** a function to each group independently * **Combining** the results into a data structure ```python df.groupby('day')['total_bill'].sum() ``` day Fri 325.88 Sat 1778.40 Sun 1627.16 Thur 1096.33 Name: total_bill, dtype: float64 ```python # Average tip theo ngày df.groupby('day')['tip'].mean() ``` day Fri 2.734737 Sat 2.993103 Sun 3.255132 Thur 2.771452 Name: tip, dtype: float64 ```python df.head() ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>total_bill</th> <th>tip</th> <th>sex</th> <th>smoker</th> <th>day</th> <th>time</th> <th>size</th> </tr> </thead> <tbody> <tr> <th>0</th> <td>16.99</td> <td>1.01</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr> <tr> <th>1</th> <td>10.34</td> <td>1.66</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr> <tr> <th>2</th> <td>21.01</td> <td>3.50</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr> <tr> <th>3</th> <td>23.68</td> <td>3.31</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr> <tr> <th>4</th> <td>24.59</td> <td>3.61</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>4</td> </tr> </tbody> </table> </div> ```python # trung bình table size theo sex df.groupby('sex')['size'].mean() ``` 2.6305732484076434 **13 Aggregation Functions in Pandas** `mean()`: Compute mean of groups `sum()`: Compute sum of group values `size()`: Compute group sizes `count()`: Compute count of group `std()`: Standard deviation of groups `var()`: Compute variance of groups `sem()`: Standard error of the mean of groups `describe()`: Generates descriptive statistics `first()`: Compute first of group values `last()`: Compute last of group values `nth()`: Take nth value, or a subset if n is a list `min()`: Compute min of group values `max()`: Compute max of group values``` ```python # Apply aggregation function on the groupby df.groupby('day').mean() ``` 🙋🏻‍♂️ **DISCUSSION ▸ What is the datatype of the above result?** A --- a DataFrame B --- a GroupBy Datatype C --- a Series D --- a SQL Database ```python # Continue using dataframe syntax df.groupby('day').mean()['total_bill'] ``` 🙋🏻‍♂️ **DISCUSSION ▸ What is the output of the above code cell?** A --- a DataFrame B --- a GroupBy Datatype C --- a Series D --- a SQL Database ```python # Continue using syntax df.groupby('day').mean()['total_bill'] ``` day Fri 17.151579 Sat 20.441379 Sun 21.410000 Thur 17.682742 Name: total_bill, dtype: float64 ***Group by multiple values, one method*** ```python # GROUP BY TWO COLUMNS: Groupby the dataset by day and sex # Có nhiều hơn 1 cột, sử dụng list [] df.groupby(['day', 'sex'])['tip'].mean() ``` 2.781111111111111 ```python # Ngoặc vuông: select, có nhiều hơn 1 cột # Ngoặt tròn: hàm -- groupby(), mean(), sum(), .info(), sort_values() # Ngoặt nhọn: (dictionary) ``` ```python # Continue using dataframe syntax to get a certain value df.groupby(['day', 'sex'])['tip'].mean().loc['Fri'].loc['Female'] ``` day sex Fri Female 2.781111 Male 2.693000 Sat Female 2.801786 Male 3.083898 Sun Female 3.367222 Male 3.220345 Thur Female 2.575625 Male 2.980333 Name: tip, dtype: float64 ***Groupby one value, multiple methods*** ```python # Groupby on one value, multiple methods. --> Hàm .agg df.groupby('day')['tip'].agg(['mean', 'sum']) ``` <div> <style scoped> .dataframe tbody tr th:only-of-type { vertical-align: middle; } .dataframe tbody tr th { vertical-align: top; } .dataframe thead th { text-align: right; } </style> <table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th></th> <th>mean</th> <th>sum</th> </tr> <tr> <th>day</th> <th></th> <th></th> </tr> </thead> <tbody> <tr> <th>Fri</th> <td>2.734737</td> <td>51.96</td> </tr> <tr> <th>Sat</th> <td>2.993103</td> <td>260.40</td> </tr> <tr> <th>Sun</th> <td>3.255132</td> <td>247.39</td> </tr> <tr> <th>Thur</th> <td>2.771452</td> <td>171.83</td> </tr> </tbody> </table> </div> ***Groupby multiple values, multiple methods*** ```python # Groupby on multiple values, multiple methods. df.groupby('day')[['tip', 'total_bill']].agg({'tip':'sum', 'total_bill':'mean'}) ```