Virgil - Descriptive Statistics - S11 Descriptive Statistics

---
title: Virgil - Descriptive Statistics - S11 Descriptive Statistics
tags: Virgil, LearnWorld, DescriptiveStatistics
---

<a target="_blank" href="https://colab.research.google.com/drive/1Pi32kRVz0rx6IVxSJNv_PtqzGcNOd3Fq"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>

# **DESCRIPTIVE STATISTICS**

<img src='https://methods.sagepub.com/images/virtual/intermediate-statistics-using-spss/10.4135_9781071802625-img18.jpg'>

```python
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
```

## 1/ Population and Sample

A **population** is the entire group about which we want to learn something. A statistical population need not consist only of people. We can have a population of heights, weights, BMIs, hemoglobin levels, events, or outcomes, as long as the population is well defined with explicit inclusion and exclusion criteria. Examples: the population of Vietnamese customers, the population of teenagers' heights.

A **sample** is any part of the fully defined population. For example, 100 COVID-19 patients in a clinical study are a sample of the population of all COVID-19 patients, provided the sample is properly chosen and the inclusion and exclusion criteria are well defined.

To make accurate inferences, the sample has to be representative, meaning it reflects the characteristics of the population it was drawn from. A sample is typically made representative by random selection, in which every member of the population has an equal chance of being chosen.
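The difference between a representative and a non-representative sample can be sketched with simulated data. The population below (10,000 teenagers' heights drawn from a normal distribution with mean 160 cm) is an assumption made up for illustration, not real data; `population`, `random_sample`, and `biased_sample` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical population: heights (cm) of 10,000 teenagers, simulated for illustration only
rng = np.random.default_rng(seed=0)
population = pd.Series(rng.normal(loc=160, scale=8, size=10_000), name="height_cm")

# Simple random sample: every member has the same chance of being drawn
random_sample = population.sample(n=100, random_state=0)

# Biased sample: only the 100 tallest members, which is not representative
biased_sample = population.nlargest(100)

print(f"Population mean:    {population.mean():.2f}")
print(f"Random sample mean: {random_sample.mean():.2f}")
print(f"Biased sample mean: {biased_sample.mean():.2f}")
```

The random sample's mean should land close to the population mean, while the biased sample overstates it, which is why inferences drawn from a poorly chosen sample can be misleading.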
▸ [Read More about Population and Sample](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3105563/)

<img src='https://ashutoshtripathicom.files.wordpress.com/2019/04/populationvssample-e1556351520474.png' width=500>

## 2/ Descriptive Statistics

<img src='https://media.geeksforgeeks.org/wp-content/uploads/20200310224503/TypesOfStatistic-1.png' width=600>

A **descriptive statistic** is a summary statistic that quantitatively describes or summarizes features of a collection of information.

▸ Descriptive statistics is distinguished from inferential statistics by its aim: it summarizes a sample, rather than using the data to learn about the population that the sample is thought to represent.

▸ This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory and is frequently non-parametric.

▸ To describe the distribution of a sample, descriptive statistics uses two kinds of measurement:

* Measure of Central Tendency
* Measure of Dispersion

<img src='https://i.imgur.com/uKuyHoV.png' width=500>

Throughout this notebook, we will use the Restaurant dataset as an example. The data was collected from customers who visited a restaurant. Each row is one customer's bill.
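Before turning to the restaurant data, the two kinds of measurement listed above can be illustrated on a tiny sample. This is a minimal sketch: the six bill amounts are invented for the example, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical sample of six bill amounts, invented for this illustration
bills = pd.Series([10.0, 12.0, 12.0, 15.0, 18.0, 41.0])

# Measures of central tendency: where the "middle" of the data lies
mean = bills.mean()        # arithmetic average
median = bills.median()    # middle value of the sorted data
mode = bills.mode()[0]     # most frequent value

# Measures of dispersion: how spread out the data is
value_range = bills.max() - bills.min()             # largest minus smallest
std = bills.std()                                   # sample standard deviation (ddof=1)
iqr = bills.quantile(0.75) - bills.quantile(0.25)   # interquartile range

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, std={std:.2f}, IQR={iqr}")
```

Here the single large bill (41.0) pulls the mean (18.0) well above the median (13.5), a reminder that different measures of central tendency can tell different stories about the same sample.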
```python
# Load example data
df = pd.read_csv('https://www.dropbox.com/s/ory0s3z89z3zkt4/restaurant.csv?dl=1')
```

```python
df.head()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>total_bill</th> <th>tip</th> <th>sex</th> <th>smoker</th> <th>day</th> <th>time</th> <th>size</th> </tr>
  </thead>
  <tbody>
    <tr> <th>0</th> <td>16.99</td> <td>1.01</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr>
    <tr> <th>1</th> <td>10.34</td> <td>1.66</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr>
    <tr> <th>2</th> <td>21.01</td> <td>3.50</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>3</td> </tr>
    <tr> <th>3</th> <td>23.68</td> <td>3.31</td> <td>Male</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>2</td> </tr>
    <tr> <th>4</th> <td>24.59</td> <td>3.61</td> <td>Female</td> <td>No</td> <td>Sun</td> <td>Dinner</td> <td>4</td> </tr>
  </tbody>
</table>
</div>

In pandas, the `.describe()` method prints a descriptive statistics summary of the quantitative (numeric) variables in the data. For example, here is the descriptive statistics report for the restaurant dataset:
```python
df.describe()
```

<div>
<style scoped>
    .dataframe tbody tr th:only-of-type { vertical-align: middle; }
    .dataframe tbody tr th { vertical-align: top; }
    .dataframe thead th { text-align: right; }
</style>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;"> <th></th> <th>total_bill</th> <th>tip</th> <th>size</th> </tr>
  </thead>
  <tbody>
    <tr> <th>count</th> <td>244.000000</td> <td>244.000000</td> <td>244.000000</td> </tr>
    <tr> <th>mean</th> <td>19.785943</td> <td>2.998279</td> <td>2.569672</td> </tr>
    <tr> <th>std</th> <td>8.902412</td> <td>1.383638</td> <td>0.951100</td> </tr>
    <tr> <th>min</th> <td>3.070000</td> <td>1.000000</td> <td>1.000000</td> </tr>
    <tr> <th>25%</th> <td>13.347500</td> <td>2.000000</td> <td>2.000000</td> </tr>
    <tr> <th>50%</th> <td>17.795000</td> <td>2.900000</td> <td>2.000000</td> </tr>
    <tr> <th>75%</th> <td>24.127500</td> <td>3.562500</td> <td>3.000000</td> </tr>
    <tr> <th>max</th> <td>50.810000</td> <td>10.000000</td> <td>6.000000</td> </tr>
  </tbody>
</table>
</div>

To plot the distribution of a single variable (one column), we can use Seaborn's `histplot`:

```python
sns.histplot(df['total_bill']);
```
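To see where the numbers in a `.describe()` report come from, each row can be recomputed with an individual pandas method. The sketch below uses only the five rows shown by `df.head()` earlier, retyped by hand so it runs without downloading the file. Its values therefore differ from the full 244-row report, and `head` and `manual` are hypothetical names.

```python
import pandas as pd

# The five rows shown by df.head() above, retyped by hand so the example runs offline
head = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
    "size": [2, 3, 3, 2, 4],
})

# Recompute each row of the report for one column, one method at a time
col = head["total_bill"]
manual = pd.Series({
    "count": col.count(),        # number of non-missing values
    "mean": col.mean(),          # central tendency
    "std": col.std(),            # dispersion (sample standard deviation)
    "min": col.min(),
    "25%": col.quantile(0.25),   # first quartile
    "50%": col.quantile(0.50),   # median
    "75%": col.quantile(0.75),   # third quartile
    "max": col.max(),
})

# Compare with what describe() reports for the same column
summary = head.describe()
print(pd.DataFrame({"describe": summary["total_bill"], "manual": manual}))
```

The two columns should agree, which shows that the report combines measures of central tendency (`mean`, `50%`) with measures of dispersion (`std`, `min`, `max`, and the quartiles).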