Virgil - Descriptive Statistics - S21 Central Tendency

---
title: Virgil - Descriptive Statistics - S21 Central Tendency
tags: Virgil, LearnWorld, DescriptiveStatistics
---

<a target="_blank" href="https://colab.research.google.com/drive/1Pi32kRVz0rx6IVxSJNv_PtqzGcNOd3Fq#scrollTo=P2YJKGTdOBAO"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

# **Central Tendency**

```python
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("white")
```

Throughout this notebook, we will use the Restaurant dataset as an example. The data was collected from customers who visited a restaurant. Each row is one customer's bill.

```python
# Load example data
df = pd.read_csv('https://www.dropbox.com/s/ory0s3z89z3zkt4/restaurant.csv?dl=1')
```

### 2.1/ Measure of Central Tendency

A **measure of central tendency** describes the central value around which quantitative data tend to cluster. The most common measures are the **mean, median, and mode**.

- **Mean (Arithmetic Mean)**: the sum of the values divided by the number of values.
- **Median**: the middle value in a sorted set of data. Be careful: to find the median, the data must first be sorted from least to greatest. If there are two values at the center (an even number of values), the median is the average of the two.

  <img src="https://i.imgur.com/gxCNOTt.png" width=200>

- **Mode**: the value or values that appear most often in a set of data.

*For example*: Given a sample of data `[0, 2, 2, 3, 4, 6, 8, 8, 10]`

- Mean is: (0+2+2+3+4+6+8+8+10) / 9 = 43 / 9 ≈ 4.78
- Median is: 4
- Mode is: 2 and 8 (both appear twice)

Comparing the mean, median, and mode gives an indication of the shape of the data distribution (for example, whether it is symmetric or skewed).

![](https://miro.medium.com/max/2000/0*s5edNywp_G5mw0CN.png)

Computing the mean, median, and mode in Pandas is straightforward. For example, let's describe the variable `total_bill` using the measures of central tendency.
```python
# Mean
df['total_bill'].mean()
```

    19.785942622950824

```python
# Median
df['total_bill'].median()
```

    17.795

```python
# Mode
df['total_bill'].mode()
```

    0    13.42
    dtype: float64

```python
#@title Distribution of total bill
plt.figure(figsize=(15, 10))
sns.histplot(df['total_bill'], kde=True)
plt.ylim(0, 70)

# Mark the mean, median and (first) mode with labelled vertical lines
plt.axvline(df['total_bill'].mean(), c='red')
plt.text(df['total_bill'].mean()-0.5, 60, 'MEAN', c='r', rotation='vertical')

plt.axvline(df['total_bill'].median(), c='blue')
plt.text(df['total_bill'].median()-0.5, 60, 'MEDIAN', c='blue', rotation='vertical')

plt.axvline(df['total_bill'].mode().values[0], c='green')
plt.text(df['total_bill'].mode().values[0]-0.5, 60, 'MODE', c='green', rotation='vertical')

plt.show()
```

***The Effect of Outliers on Measures of Central Tendency***

<img src="https://i.imgur.com/zAdTDBm.png" width=300>

Above is a sample of pizza prices taken from two cities, New York and LA. The prices are broadly similar, except for one rather large value in the NYC data (66.00). First, let's use the measures of central tendency to describe the data of the two cities.

```python
# New York City
nyc = pd.Series([1, 2, 3, 3, 5, 6, 7, 8, 9, 11, 66])
print('Mean of NYC: ', nyc.mean())
print('Median of NYC: ', nyc.median())
print('Mode of NYC: ', nyc.mode().values[0])
```

    Mean of NYC:  11.0
    Median of NYC:  6.0
    Mode of NYC:  3

```python
# LA
# Sample of pizza dataset in LA
la = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(f"Mean of LA : {la.mean()}")
print(f"Median of LA : {la.median()}")
print(f"Mode of LA :\n{la.mode()}")
```

    Mean of LA : 5.5
    Median of LA : 5.5
    Mode of LA :
    0     1
    1     2
    2     3
    3     4
    4     5
    5     6
    6     7
    7     8
    8     9
    9    10
    dtype: int64

🙋🏻‍♂️ ***Observations:***

- The mean of NYC is twice that of LA (11.0 vs 5.5): the single large value 66 pulls the mean up significantly. Data points that are far larger or far smaller than the rest of the dataset are called **outliers**. We'll discuss methods for detecting outliers later in the class.
- The median values are similar between the two cities (6.0 vs 5.5) and are barely affected by the outlier.
- There is no meaningful mode in the LA dataset because every value appears the same number of times (once), which is why `mode()` returns all of them.

***❊ Pros and Cons of Each Measure of Central Tendency***

| | MEAN | MEDIAN | MODE |
|:--:|:--:|:--:|:--:|
| **PROS** | Uses every data value, so it is usually a good representation of the data | Unaffected by outliers<br>Can be used with ordered categorical (ordinal) data | Unaffected by outliers<br>Can be used with categorical data |
| **CONS** | Sensitive to outliers<br>Cannot be calculated for qualitative data | Does not use all the available information in the data<br>Not suited to further algebraic treatment | Some data may have more than one mode, or no mode at all<br>Can be a poor representative of the data |

***❊ When to use which?***

<img src='https://i.imgur.com/UGvZDF5.png' width=600>
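
To see the sensitivity of each measure side by side, here is a minimal sketch that recomputes the worked example and the two pizza samples, then drops the value 66 from the NYC data. It uses Python's standard `statistics` module instead of Pandas only so that it runs without any dataset or extra install. The names `central_tendency` and `nyc_trimmed` are hypothetical helpers introduced for illustration, and removing 66 by hand is not an outlier-detection method.

```python
from statistics import mean, median, multimode


def central_tendency(values):
    """Return the mean, median and list of modes of a sequence of numbers."""
    return {
        "mean": mean(values),
        "median": median(values),
        # multimode returns every value tied for the highest count
        "modes": multimode(values),
    }


# Samples copied from the examples above
sample = [0, 2, 2, 3, 4, 6, 8, 8, 10]
nyc = [1, 2, 3, 3, 5, 6, 7, 8, 9, 11, 66]
la = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Illustration only: drop the single large value by hand
nyc_trimmed = [price for price in nyc if price != 66]

for name, data in [("sample", sample), ("NYC", nyc),
                   ("NYC without 66", nyc_trimmed), ("LA", la)]:
    stats = central_tendency(data)
    print(f"{name:>15}: mean={stats['mean']:.2f}, "
          f"median={stats['median']}, modes={stats['modes']}")

# Without 66, the NYC mean drops from 11 to 5.5,
# while the median only moves from 6 to 5.5.
```

The mean halves when one value is removed, while the median shifts by only half a unit, which matches the "sensitive to outliers" and "unaffected by outliers" entries in the table above. The LA sample also shows the "no mode" case: every value is returned as a mode because each appears exactly once.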