Virgil - Descriptive Statistics - S31 Dispersion

---
title: Virgil - Descriptive Statistics - S31 Dispersion
tags: Virgil, LearnWorld, DescriptiveStatistics
---

<a target="_blank" href="https://colab.research.google.com/drive/1Pi32kRVz0rx6IVxSJNv_PtqzGcNOd3Fq#scrollTo=-0NdYm2Ej8qQ"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

# Dispersion

```python
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
```

Throughout this notebook, we will use the Restaurant dataset as an example. The dataset was collected from customers who visited a restaurant. Each row is one customer's bill.

```python
# Load example data
df = pd.read_csv('https://www.dropbox.com/s/ory0s3z89z3zkt4/restaurant.csv?dl=1')
```

### 2.2. Measure of Dispersion

A measure of Central Tendency alone is not enough to fully describe a variable. For example,

- SAMPLE A: `-100, -80, -60, 0, 60, 80, 100`
- SAMPLE B: `-3, -2, -1, 0, 1, 2, 3`

Both samples have a mean of 0, yet they are very different. In this case, we need a Measure of Dispersion to further describe each sample.

A Measure of Dispersion describes how **scattered** the values are and how much they differ from the mean value. Four of the most common measures are: **Range, Quantile, Variance, Standard Deviation.**

### 2.2.a/ Range

Range is the difference between Max and Min. In the example above, sample A has a range of 200 *(100 - (-100))*, while sample B has a smaller range of 6 *(3 - (-3))*.

- Data in sample A tends to lie further from the mean value and scatters over a large range.
- Data in sample B lies much closer to the mean value and varies over a smaller range.

```python
# Code Example: The range of total bill of the restaurant
df['total_bill'].max() - df['total_bill'].min()
```

    47.74

### 2.2.b/ Quantile

With the quantile method, the sorted data is divided into several parts, each containing an equal number of observations.
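
To make the equal-count idea concrete, here is a minimal sketch using `pd.qcut`. The `bills` values are made up for illustration and are not taken from the restaurant dataset; the part labels are also just an assumption for readability.

```python
import pandas as pd

# Hypothetical bill amounts, invented for this illustration only
bills = pd.Series([5, 8, 12, 15, 18, 22, 30, 45])

# Split the values into 4 parts that each hold the same number of observations
parts = pd.qcut(bills, q=4, labels=["part 1", "part 2", "part 3", "part 4"])

# Count how many observations fall into each part
counts = parts.value_counts().sort_index()
print(counts)
```

Each of the four parts ends up with two of the eight observations, which is the property the quantile method relies on.
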

For example, using the quantile method with 4 parts:

<img src='https://cdn.scribbr.com/wp-content/uploads/2020/09/iqr_quartiles.png'>

- The three marks that divide the data into 4 equal parts are called Quantile 1, Quantile 2 and Quantile 3 (also known as the quartiles Q1, Q2 and Q3).
- Quantile 2 is the Median, since it splits the observations 50/50.
- The span from Quantile 1 to Quantile 3 is called the Interquartile Range (IQR). It covers the middle 50% of the observations.

The easiest way to get these quantiles is to use `.describe()`

```python
# Example --- Quantile on total_bill
df['total_bill'].describe()
```

    count    244.000000
    mean      19.785943
    std        8.902412
    min        3.070000
    25%       13.347500
    50%       17.795000
    75%       24.127500
    max       50.810000
    Name: total_bill, dtype: float64

- 25% of the customers pay under about \$13.35 for a meal
- 50% of the customers pay under about \$17.80 for a meal
- 25% of the customers pay between about \$17.80 and \$24.13
- 25% of the customers pay more than about \$24.13 for their meal.

### 2.2.c/ Variance and Standard Deviation

Variance represents how scattered the data is around the mean value. It is calculated by taking the squared difference between each observation and the mean, then averaging those squared differences.

<img src="https://cdn1.coolgyan.org/wp-content/uploads/2019/08/variance-and-standard-deviation-formula.png">

**❊ EXAMPLE:**

## `sample_A = [-100, -80, -60, 0, 60, 80, 100]`

- The Variance is calculated as:

#### $σ_A^2 = \frac{(-100 - 0)^2 + (-80 - 0)^2 + (-60 - 0)^2 + (0 - 0)^2 + (60 - 0)^2 + (80 - 0)^2 + (100 - 0)^2}{7} \approx 5714.29$

- The Standard Deviation is calculated by taking the square root of the Variance:

#### $σ_A = \sqrt{σ_A^2} \approx 75.6$

## `sample_B = [-3, -2, -1, 0, 1, 2, 3]`

- #### $σ_B^2 = \frac{(-3 - 0)^2 + (-2 - 0)^2 + (-1 - 0)^2 + (0 - 0)^2 + (1 - 0)^2 + (2 - 0)^2 + (3 - 0)^2}{7} = 4$
- #### $σ_B = \sqrt{σ_B^2} = 2$

🙋🏻‍♂️ ***Observations:***

- When two distributions share the same mean, the larger the variance (and standard deviation), the further the data tends to spread from the mean.
- The smaller the variance (and standard deviation), the more closely the data tends to cluster around the mean.
- The standard deviation has the same unit as the data, while the variance is in squared units.

Luckily, we don't have to calculate Variance and Standard Deviation manually. Pandas provides `.var()`, `.std()` and `.describe()`. Note that pandas divides by $n - 1$ by default (the sample variance), whereas the worked example above divides by $n$.

```python
# Example: Getting var and std of total_bill
df['total_bill'].var()
```

    79.25293861397826

```python
df['total_bill'].std()
```

    8.902411954856856

```python
df['total_bill'].describe()
```

    count    244.000000
    mean      19.785943
    std        8.902412
    min        3.070000
    25%       13.347500
    50%       17.795000
    75%       24.127500
    max       50.810000
    Name: total_bill, dtype: float64

❊ ***Key notes about Standard Deviation***

- Standard Deviation measures spread around the mean. Because of this close link, it can be misleading when the mean is a poor measure of central tendency.
- Standard deviation is also influenced by outliers: a single extreme value can contribute heavily to the result. In that sense, the standard deviation is a **good indicator of the presence of outliers**, and it is most useful as a measure of spread for symmetrical distributions with no outliers.
- Standard deviation is also useful when comparing the spread of two separate data sets that have approximately the same mean.
- Three standard deviations from the mean is a common cut-off in practice for identifying outliers in a normal or normal-like distribution.

<img src="https://www.kdnuggets.com/wp-content/uploads/std-dev-normal.jpg" width=600>

## ★ Extra --- Looking at the distribution of bills on weekday compared to weekend.

```python
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type {
        vertical-align: middle;
    }
    .dataframe tbody tr th {
        vertical-align: top;
    }
    .dataframe thead th {
        text-align: right;
    }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>total_bill</th>
      <th>tip</th>
      <th>sex</th>
      <th>smoker</th>
      <th>day</th>
      <th>time</th>
      <th>size</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>16.99</td>
      <td>1.01</td>
      <td>Female</td>
      <td>No</td>
      <td>Sun</td>
      <td>Dinner</td>
      <td>2</td>
    </tr>
    <tr>
      <th>1</th>
      <td>10.34</td>
      <td>1.66</td>
      <td>Male</td>
      <td>No</td>
      <td>Sun</td>
      <td>Dinner</td>
      <td>3</td>
    </tr>
    <tr>
      <th>2</th>
      <td>21.01</td>
      <td>3.50</td>
      <td>Male</td>
      <td>No</td>
      <td>Sun</td>
      <td>Dinner</td>
      <td>3</td>
    </tr>
    <tr>
      <th>3</th>
      <td>23.68</td>
      <td>3.31</td>
      <td>Male</td>
      <td>No</td>
      <td>Sun</td>
      <td>Dinner</td>
      <td>2</td>
    </tr>
    <tr>
      <th>4</th>
      <td>24.59</td>
      <td>3.61</td>
      <td>Female</td>
      <td>No</td>
      <td>Sun</td>
      <td>Dinner</td>
      <td>4</td>
    </tr>
  </tbody>
</table>
</div>

```python
# Get data of weekday and weekend
weekday = df.loc[df['day'].isin(['Thur', 'Fri']), 'total_bill']
weekend = df.loc[df['day'].isin(['Sat', 'Sun']), 'total_bill']
```

```python
weekday.describe()
```

    count    81.000000
    mean     17.558148
    std       7.936160
    min       5.750000
    25%      12.260000
    50%      15.980000
    75%      20.530000
    max      43.110000
    Name: total_bill, dtype: float64

```python
weekend.describe()
```

    count    163.000000
    mean      20.893006
    std        9.168543
    min        3.070000
    25%       14.605000
    50%       18.350000
    75%       25.285000
    max       50.810000
    Name: total_bill, dtype: float64

Looking at the descriptive statistics, there is no large difference between the two periods.

- Central Tendency: the means are fairly close (17.56 vs 20.89)
- Dispersion: the standard deviations are also close (7.94 vs 9.17)

--> There is not much difference between bills on weekdays and on weekends.

```python
# Histogram of weekday bills
sns.histplot(weekday)
```

```python
# Histogram of weekend bills
sns.histplot(weekend)
```
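
As a recap, the measures covered in this notebook can be collected side by side. The sketch below applies them to samples A and B from the Measure of Dispersion section. The helper name `dispersion_summary` is hypothetical (it is not part of pandas), and `ddof=0` is an assumption chosen so that the variance divides by $n$, matching the hand calculation rather than the pandas default of $n - 1$.

```python
import pandas as pd


def dispersion_summary(values, ddof=0):
    """Hypothetical helper: range, IQR, variance and std of a sample.

    ddof=0 divides by n (as in the hand calculation);
    ddof=1 would reproduce the pandas default.
    """
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    return {
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "var": s.var(ddof=ddof),
        "std": s.std(ddof=ddof),
    }


# The two samples used earlier in the notebook
sample_A = [-100, -80, -60, 0, 60, 80, 100]
sample_B = [-3, -2, -1, 0, 1, 2, 3]

# One column per sample, one row per measure
summary = pd.DataFrame({
    "sample_A": dispersion_summary(sample_A),
    "sample_B": dispersion_summary(sample_B),
})
print(summary.round(2))
```

The same helper could be applied to `weekday` and `weekend` to compare the two groups on every measure at once, passing `ddof=1` if the results should agree with `.describe()`.
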