Descriptive Statistics

---
title: Content
description:
duration: 5400
card_type: cue_card
---

### Content

- **Introduction**
- **Descriptive Statistics**
  - Measures of Central Tendency
    - Mean
    - Median
    - Mode
  - Measures of Variability
    - Range
    - Variance
    - Standard Deviation
- **Inferential Statistics**
- **Weighted Average**
- **Inter Quartile Range**
  - Quartile
  - Percentile
  - Box Plot
- **IQR implementation on real life dataset**
- **Random Variables**
  - Discrete RV
  - Continuous RV
- **Distribution Functions**
  - Histogram
  - Probability Mass Function (PMF)
  - Probability Density Function (PDF)
  - Cumulative Distribution Function (CDF)

---
title: Descriptive statistics and Inferential statistics
description:
duration: 5400
card_type: cue_card
---

### Introduction (2 mins)

Greetings everyone,

Until this lecture, we have only been talking about probability. We have not talked about statistics at all.

Probability and statistics go hand in hand. Statistics is a very broad subject in itself, but we are going to look at it in the context of data science only.

So let's start our lecture.

There are 2 types of statistics:

### **1. Descriptive Statistics** (3 mins)

The word descriptive means "**DESCRIBE**".

Descriptive statistics involve summarizing and presenting data in a meaningful way, providing a clear and concise overview of a dataset.

**Example:**

You are driving a car and you look at your dashboard.
- The speedometer shows that the speed of your car at the moment is 65 km/hr. So, it is simply describing the speed.
- The speedometer simply describes an event (a vehicle moving at a certain speed), so it is an example of descriptive statistics.

### **2. Inferential Statistics** (5 mins)

Inferential statistics, on the other hand, involve making predictions, inferences, or drawing conclusions about a larger population based on a sample of data.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/088/original/odo.jpeg?1699354146 width = "600">

Continuing the example:
- The car's speedometer displays the current speed but **doesn't predict your arrival time**, because that depends on various factors like distance and traffic.
- What Google Maps does here is estimate the arrival time based on data and assumptions, but the estimate is not always 100% accurate.
- This prediction is an example of inferential statistics, as it draws conclusions from real-world data.
- It is trying to **"infer"** something, that is, to draw a conclusion from the data. So, it is inferential statistics.

These predictions are not random; there are a lot of hypotheses involved here.

We are going to learn about this in detail in our next module, which is entirely about inferential statistics.

**Conclusion:**
- Descriptive statistics summarizes data.
- Inferential statistics draws conclusions based on the observations.

These two are the essential branches of statistics.

Let's explore descriptive statistics.

---
title: Measures of Central Tendency
description:
duration: 5400
card_type: cue_card
---

### **Measures of Central Tendency**

In statistics, we often use measures to understand and describe a set of data. Three common measures are:

1. **Mean** - The mean is the average of all data points.
2. **Median** - The median is the middle value when the data is sorted.
3. **Mode** - The mode is the observation with the highest frequency.

Let's go deeper with the help of an example now.

> Q. If you are looking for a new job and you want to check how much companies are paying for that role, where will you check?

At Glassdoor or levels.fyi.

Based on this, let's look at an example.

#### **Example-1: Data Scientist's Salaries** (7-8 mins)

```
Suppose you are looking for a data scientist job at FAANG.
The sample of salaries is taken and recorded as [30L, 30L, 35L, 40L, 40L].
What will be the salary you would be expecting?
```

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/089/original/sal.png?1699354428 height = "200" width = "600">

**Approach:**

### **Mean** (3 mins)

> What will be our thought process here, and on what basis will we negotiate?

Our initial approach would likely be to calculate the mean, or average, of these salaries.

This is how we can calculate it:
- Mean will be (30 + 30 + 35 + 40 + 40)/5 = **35 lakhs**
- **$\Large μ = \frac{∑X}{N}$**

Where,
- $μ$ = population mean
- $∑X$ = sum of each value in the population
- $N$ = number of values in the population

So, the mean salary in this sample is 35 lakhs. We might negotiate our expected salary around this figure.

Suppose a **new candidate enters** the picture and **his salary is 3 crores** (300 lakhs).
- **New mean** will become = (30 + 30 + 35 + 40 + 40 + 300)/6 = 475/6 ≈ **79 lakhs**

The mean salary increased dramatically to about 79 lakhs because of the new candidate's exceptionally high salary.

> Does this mean that you should now negotiate for 79L?

We can say that this **new candidate is an outlier** in the data, which is affecting the mean value.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/090/original/out.png?1699354715 height = "200" width = "500">

### **Median** (5 mins)

Here comes the concept of the **"Median"**, which we can use to measure central tendency instead of the mean.

Before the new candidate joined, the median was 35L. This means that 35L was the middle value when the salaries were sorted in ascending order:
- **Original Salaries:** [30L, 30L, 35L, 40L, 40L]
- **Sorted:** [30L, 30L, 35L, 40L, 40L]
- **Median** = $35L$ (the middle value)

After the new candidate with a significantly higher salary (300L) arrived, the new median became 37.5 lakhs:
- **New Salaries**: [30L, 30L, 35L, 40L, 40L, 300L]
- **Sorted**: [30L, 30L, 35L, 40L, 40L, 300L]
- **New Median** = (35L + 40L) / 2 = 75L / 2 = $37.5L$.

So, it would be more suitable to negotiate at around 37.5 lakhs.
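To make the comparison concrete, here is a minimal sketch that recomputes both measures for the salary list above, before and after the 300L outlier is added. It uses Python's standard `statistics` module; the variable names are illustrative assumptions and are not part of the lecture notebook.

```python
import statistics

# Salaries in lakhs, taken from the example above
salaries = [30, 30, 35, 40, 40]
# Same list with the 3-crore (300 lakh) candidate added
with_outlier = salaries + [300]

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("Before outlier: mean =", mean_before, "median =", median_before)
print("After outlier:  mean =", round(mean_after, 2), "median =", median_after)

# How far each measure moved because of a single extreme value
mean_shift = mean_after - mean_before
median_shift = median_after - median_before
print("Shift in mean:", round(mean_shift, 2), "| shift in median:", median_shift)
```

The mean moves by roughly 44 lakhs while the median moves by only 2.5 lakhs, which is the sense in which the median is called robust to outliers.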
There is a **huge difference between the new mean and the new median**.

**Conclusion:**
- Outliers dramatically affect the mean, but the median stays closer to the typical value of the dataset.
- We conclude that the **median is more robust to outliers**.

Let's look at some quick examples to get a better understanding of this.

#### **Example-2**: (2 mins)

```
What is the median of the following numbers: 10, 20, 30, 40, 50, 60, 70
```

- The number of observations is odd and the numbers are sorted.
- The median is the value of the $(\frac{n+1}{2})$th observation = (7+1)/2 = 4th observation.

Therefore, the median will be 40.

#### **Example-3:** (2 mins)

```
What is the median of these numbers: 10, 20, 30, 40, 50, 60, 70, 80
```

- The number of observations is even and the numbers are sorted.
- The median would be: $\frac{(\frac{n}{2})th \, observation \,+\, (\frac{n}{2}+1)th \, observation}{2}$ = (4th obs + 5th obs)/2 = (40+50)/2 = 45

There is one more term often used in statistics, the "mode". Let's look into it.

### **Mode** (2 mins)

The mode is the observation with the highest frequency.
- It is the **most frequently occurring data point** in the dataset.

Suppose the data points are recorded as - [90, 90, 90, 80, 90, 70, 95, 90]
- The mode will be **90**.
- Remember, if no data point repeats, then we say that there is no mode.

There can also be **more than one mode** in the dataset.

Suppose the data points are recorded as - [2, 2, 3, 3, 4]
- We can call this **bi-modal**, with 2 and 3 as the modes.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/092/original/mod.png?1699354981 height = "150" width = "600">

Now, let's solve some quizzes.

---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---

# Question

```
There are 4 people whose average age is 24.
We know the age of three people: 20, 22, and 28.
What is the median age of these 4 people?
```

# Choices

- [ ] 22
- [x] 24
- [ ] 25
- [ ] 26

---
title: Quiz 1 Explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 1

Let the age of the fourth person be "x".

(20 + 22 + 28 + x)/4 = 24

(70 + x)/4 = 24

70 + x = 96

x = 26

The age of the fourth person is 26.

Sort the data in ascending order: 20, 22, 26, 28
- Here n is even. So, the median will be $\frac{(\frac{n}{2})th \, observation \,+\, (\frac{n}{2}+1)th \, observation}{2}$
- (2nd obs + 3rd obs)/2

Median = (22 + 26)/2 = 24

**Conclusion**

The median age of these 4 people is 24 years.

---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---

# Question

```
In a survey about favourite animal:
30 people said cat, 40 people said dog, 20 people said cow.
What is the mode of favourite animals in this data?
```

# Choices

- [ ] Cat
- [x] Dog
- [ ] Cow

---
title: Quiz 2 Explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 2

The dog is liked by the maximum number of people. So, the frequency of "dog" in the dataset is the highest.

**Conclusion**: The mode in this data is Dog.

---
title: Weighted Average
description:
duration: 5400
card_type: cue_card
---

### **Weighted Average: Reflecting Importance**

In many situations, all data points are not equally important when calculating an average. In such cases, we use a concept called the weighted average.
- In a weighted average, each data point is assigned a weight that represents its **importance** or relevance.
- We **multiply each data point by its corresponding weight**, **sum these products**, and **then divide by the total weight**.
- It gives a more accurate representation of the "center" or "average" because it considers the significance of each data point.

Let's understand this with an example.

#### **Example: Calculating GPA** (3-5 mins)

In real life, a common application of weighted average is calculating Grade Point Average (GPA) for students.
Consider a student's course list for a semester:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/093/original/stu2.png?1699355669 height=200 width = 350>

- Each course has a specific credit value, which represents the weight of that course in the GPA calculation. For example, Math carries 3 credits, History carries 4 credits, and so on.
- Each course is assigned a grade, representing the student's performance in that course.

To calculate the GPA:
1. Calculate the weighted score for each course by multiplying the credit by the numerical grade.
2. Sum up all the weighted scores.
3. Divide the total weighted score by the total credits.

The weighted scores will be:
- **For Math:** $3(CREDIT) * 5(GRADE) = 15$
- **For History:** $4 * 4 = 16$
- **For Chemistry:** $3 * 5 = 15$
- **For English:** $2 * 3 = 6$

Total weighted score = 15 + 16 + 15 + 6 = 52, and total credits = 3 + 4 + 3 + 2 = 12.

$GPA = \frac{Total \ Weighted \ Score}{Total \ Credits}$ $= \frac{52}{12} \approx 4.33$

**Conclusion**: So, the student's GPA for this semester is about 4.33.

It's a weighted average that considers both the grades and the credit hours, reflecting the student's academic performance more accurately.

Let's solve some quizzes.

---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---

# Question

```
A survey of the number of pets in a town found that:
30% of people had 0 pets, 40% had 1 pet, 10% had 2 pets, 20% had 3 pets.
What is the average number of pets?
```

# Choices

- [ ] 1
- [x] 1.2
- [ ] 1.5
- [ ] 2.1

---
title: Quiz 3 explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 3

Suppose there are 100 people.
- 30 people have 0 pets.
- 40 people have 1 pet.
- 10 people have 2 pets.
- 20 people have 3 pets.

Average number of pets = $\frac{{(30 \cdot 0) + (40 \cdot 1) + (10 \cdot 2) + (20 \cdot 3)}}{{100}}$ = 1.2

This is a simple example of a "weighted average". Here we are not taking a plain average of the values, like (0 + 1 + 2 + 3)/4.
Instead, we multiply each value by its weight (how many times it occurs), add up the products, and then divide by the total weight, i.e. 100.

---
title: Quiz-4
description:
duration: 60
card_type: quiz_card
---

# Question

```
The mean weight of 2 children in a family is 40 Kgs.
If the weight of the mother is included, the mean becomes 45.
What is the weight of the mother?
```

# Choices

- [ ] 45
- [ ] 50
- [x] 55
- [ ] 60
- [ ] 65

---
title: Quiz 4 explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 4

- Let the weights of the children be x and y, in kgs.

The mean weight of the 2 children is 40 kgs, i.e.

$\frac{(x + y)}{2} = 40$

$x + y = 80$

- Let the weight of the mother be denoted by 'm'.

If we include the weight of the mother along with the weights of the children, then the new mean will be 45 kgs.

$\frac{(x + y + m)}{3} = 45$

$x + y + m = 135$

Substituting x + y = 80, we get m = 135 - 80 = 55.

The weight of the mother is 55 kg.

---
title: Measures of variability
description:
duration: 5400
card_type: cue_card
---

### **Measures of Variability**

Three common measures of variability are:
1. Range
2. Variance
3. Standard Deviation

Let's discuss the range.

#### **Range** (3 mins)

The range is simply **Maximum value - Minimum value**.

Suppose the salaries of some employees in a company are: [30, 30, 35, 40, 40]
- Here, the range of the salaries will be **40 - 30 = 10**

It describes the overall spread of the data: the difference between the maximum and minimum values is 10.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/096/original/rang.png?1699356360 height = "250" width = "500">

> Q. What will happen if there is an "Outlier" in the data?

- Let the salaries be: [30, 30, 35, 40, 40, 300]. The new range will be **300 - 30 = 270**
- As you can see, a single outlier can distort the range of the dataset.
We can conclude that the **range of the data, like the mean, is not robust to outliers**.

To solve this issue, statisticians came up with a metric called the "**Inter Quartile Range**". Let's look into it.

#### **Inter Quartile Range** (10 mins)

IQR is a metric that provides a robust way to measure the spread of a dataset.
- The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a dataset.
- That is, **$IQR = Q3 - Q1$**

> What are quartiles?

#### **Quartiles**

Quartiles are the values that divide the sorted dataset into four equal parts. There are three quartiles, **Q1, Q2, and Q3**.
- Q1 represents the 25th percentile, meaning that 25% of the data falls below this value.
- Q2 is the median and represents the 50th percentile, dividing the data into two equal halves.
- Q3 represents the 75th percentile, meaning that 75% of the data falls below this value.

Suppose we have this data with us,

> **What does each value represent here?**

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/098/original/gro.png?1699356596 height = "200" width = "500">

- 0.25 represents Q1 or the 25th percentile:
  - It means that 25% of people are shorter than this lady.
- 0.5 represents Q2 or the 50th percentile:
  - It means that 50%, or half, of the people in the dataset are shorter than this guy.
- 0.75 represents Q3 or the 75th percentile:
  - It means that 75% of people are shorter than this guy.
- Maximum or 1:
  - It means that everyone else in the dataset is shorter than this lady, i.e. she is the tallest person in the dataset.

> Q. What is a percentile?

#### **Percentile**

A percentile is a value that tells us that some **"p%" of the observations are less than that value**.
- Let's say the value at the 50th percentile is 68. We can say that 50% of the data is less than 68.

One more example:

> Suppose you scored 99 percentile in your 10th boards, what does this mean?

- It indicates that **99% of students scored lower marks than you**.
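The definitions above can be sketched in a few lines of NumPy, reusing the salary list from the range example, with and without the 300 outlier. This is an illustrative sketch, not part of the lecture notebook: the variable names are assumptions, and `np.percentile` is used with its default linear interpolation, so other quartile conventions may give slightly different Q1 and Q3 values on small samples.

```python
import numpy as np

# Salaries from the range example (in lakhs)
clean = np.array([30, 30, 35, 40, 40])
with_outlier = np.array([30, 30, 35, 40, 40, 300])

def spread_summary(values):
    """Return (range, IQR) for a 1-D array, using NumPy's default percentile method."""
    q1, q3 = np.percentile(values, [25, 75])
    data_range = values.max() - values.min()
    return data_range, q3 - q1

range_clean, iqr_clean = spread_summary(clean)
range_outlier, iqr_outlier = spread_summary(with_outlier)

# Quartiles of the list that contains the outlier
q1, q2, q3 = np.percentile(with_outlier, [25, 50, 75])

print("Without outlier: range =", range_clean, "IQR =", iqr_clean)
print("With outlier:    range =", range_outlier, "IQR =", iqr_outlier)
print("Q1, Q2, Q3 with outlier:", q1, q2, q3)
```

On this data the range jumps from 10 to 270 once the outlier is added, while the IQR stays close to its earlier value, which illustrates why the IQR is described as the robust measure of spread.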
If you want a graphical representation of a dataset's summary statistics, including the median, quartiles, and potential outliers, the box plot comes into the picture.

#### **Box Plot**

It provides a visual way to understand the distribution and spread of data.

**Box:**
- The box itself represents the interquartile range (IQR). Its lower (bottom) edge is the first quartile (Q1) and its upper (top) edge is the third quartile (Q3).
- The length of the box is determined by the range between Q1 and Q3.

**Line (Median):**
- Inside the box, a line or bar is drawn that represents the median, which is the middle value of the dataset when it's ordered.

**Whiskers:**
- Two lines, or "whiskers," extend from the box in both directions.
- The lower whisker limit is Q1 - 1.5(IQR). It is the minimum value of the expected range.
- The upper whisker limit is Q3 + 1.5(IQR). It is the maximum value of the expected range.

**Outliers:**
- Data points that fall outside the whiskers are considered outliers.

**Example**

We have sorted data; the box plot of the data will look like this:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/099/original/so34.png?1699356783 height = "300" width = "550">

---

***Interactive tool for Box Plot***

https://www.desmos.com/calculator/h9icuu58wn

How to use it: Enter the data in list A on the left-hand side, then run the box plot command, also on the left-hand side. It will generate a box plot based on the data you entered.

### **Variance** (5-7 mins)

Now let's have a look at another measure of variability, variance:

Variance measures the spread or dispersion of the values of a random variable around its mean. It quantifies how much **individual values deviate from the mean**.
- A **higher variance** indicates that the values are more spread out from the mean.
- A **lower variance** suggests that the values are closer to the mean.
We can plot a histogram to visualise the spread or distribution of the data.

Code:
```python=
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo -O weight-height.csv
```

>Output:
```
--2024-01-18 09:28:53-- https://drive.google.com/uc?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo
Resolving drive.google.com (drive.google.com)... 142.250.128.139, 142.250.128.102, 142.250.128.113, ...
Connecting to drive.google.com (drive.google.com)|142.250.128.139|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo [following]
--2024-01-18 09:28:53-- https://drive.usercontent.google.com/download?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 173.194.193.132, 2607:f8b0:4001:c0f::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|173.194.193.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 428120 (418K) [application/octet-stream]
Saving to: ‘weight-height.csv’

weight-height.csv 100%[===================>] 418.09K --.-KB/s in 0.003s

2024-01-18 09:28:53 (118 MB/s) - ‘weight-height.csv’ saved [428120/428120]
```

Code:
```python=
df_hw = pd.read_csv("weight-height.csv")
df_hw.head()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/108/original/hj8.png?1699361102 height = 200 width = 300>

Code:
```python=
df_hw.describe()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/109/original/go09.png?1699361185 height = 300 width = 280>

We are going to work on a single column for now.

Code:
```python=
df_height = df_hw["Height"]
df_height.head()
```

>Output:
```
0    73.847017
1    68.781904
2    74.110105
3    71.730978
4    69.881796
Name: Height, dtype: float64
```

Code:
```python=
sns.histplot(df_height)
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/117/original/nsi.png?1699363504 height = 300 width = 400>

#### **Example: The Height Guessing Game**

Let's turn learning about variance into a fun game.

> Imagine you're tasked with guessing the height of a randomly chosen person from a dataset, and you have to place your bet.

Here's how we will calculate and understand variance through this game.

**Round 1: Your First Guess**
- You make your first guess, **predicting that the person's height is 68 inches**. As it turns out, the **actual height is 64 inches**.
- Your error for this round is 4 inches.

**Round 2: A Different Guess**
- For the second round, you **guess 63 inches**, but it's revealed that the person's **actual height is 68 inches**.
- In this case, your error is 5 inches.

Now, here's the critical question:

> Q. When playing this game, would you ever choose an extreme guess like 55 inches or 75 inches?

**No**, it's highly unlikely that someone's height would be that far from the average.
**Minimizing Error:**

So, what's the best strategy to minimize your errors in this game?

It's pretty clear that we want to choose a height that's as close as possible to the average height of the people in the dataset.

**Understanding Variance:**

This notion of minimizing error by making guesses closer to the average is at the core of what we call "**variance**".
- Variance tells us how data points are spread out from the average.
- The **smaller the variance, the closer data points are to the mean**, making our guesses more accurate.

In this game, the smaller the variance of the heights, the closer a guess of the mean will be to the actual height.

Let's explore another way to measure error.

**Defining Error:**

$Error = (Actual \ Height - Guessed \ Height)^2$

> Q. Now, to minimize this error, what's the best approach?

In our Height Guessing Game, we've seen that aiming for the mean (μ) height is the key. That is, the guessed height should be the mean value.
- **$Error = (H_1 - μ)^2$** (guessing 1 time)
- This is the **squared error** of a single guess. Averaging the squared errors over many guesses gives the **Mean Squared Error**.

Imagine you're playing the game 10 times, guessing the mean height each time:

Error1 = $(H_1 - μ)^2$

Error2 = $(H_2 - μ)^2$

Error3 = $(H_3 - μ)^2$

...

Error10 = $(H_{10} - μ)^2$

To find the overall error, we can sum up these individual errors and then divide by the number of guesses, which gives us the variance:

**Variance Calculation:**

Variance = (Error1 + Error2 + Error3 + ... + Error10) / 10
- **$Variance = \Large\frac{(H_1 - μ)^2 + (H_2 - μ)^2 + (H_3 - μ)^2 + ..... + (H_{10} - μ)^2}{10}$**

> Q. So, if the variance is low, what does that mean?

It implies that most of our guesses are close to the actual heights.

In general, variance quantifies how spread out the data values are from the average (mean) value. It is the average squared difference between data points and the mean.
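The guessing game can be checked numerically. The sketch below uses a small set of hypothetical heights (made up for illustration, not taken from `weight-height.csv`) and a hypothetical helper `mean_squared_error`. It tries a grid of candidate guesses, and shows that the guess with the smallest average squared error is the mean, and that the error at the mean equals the variance.

```python
import numpy as np

# Hypothetical heights in inches, chosen only for illustration
heights = np.array([64.0, 68.0, 66.0, 70.0, 62.0])

def mean_squared_error(guess, actual):
    """Average squared difference between one fixed guess and every actual value."""
    return np.mean((actual - guess) ** 2)

mu = heights.mean()

# Variance computed as the average squared error when we always guess the mean
variance_by_hand = mean_squared_error(mu, heights)

# Try many candidate guesses, including extreme ones like 55 and 75 inches
candidates = np.arange(55.0, 75.5, 0.5)
errors = np.array([mean_squared_error(g, heights) for g in candidates])
best_guess = candidates[int(np.argmin(errors))]

print("Mean height:", mu)
print("Variance (by hand):", variance_by_hand)
print("Variance (np.var):", np.var(heights))
print("Guess with the smallest average squared error:", best_guess)
print("Error when guessing 55:", mean_squared_error(55.0, heights))
```

Note that `np.var` divides by n by default (population variance), which matches the formula used in this section.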
The formula for calculating variance for n data points is:

### **Variance Calculation Formula:**

**$variance$ = $\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(H_i - \mu)^2}{n}$**
- $\sigma^2$ is the population variance.
- $H_i$ is the ith data point.
- $\mu$ is the population mean.
- n is the number of data points in the population.

Now that we have a clear understanding of variance and how it measures the spread or dispersion of data points, let's look at another essential concept closely related to variance. It's called "Standard Deviation".

### **Standard Deviation** (3-5 mins)

Let's introduce an even more practical and commonly used statistic, the "Standard Deviation."

While variance quantifies the dispersion of data, standard deviation is derived from variance and offers a more interpretable measure, because it is expressed in the same units as the data.
- The standard deviation represents how much individual data points deviate from the mean or average value.
- It gives us a clear sense of the typical or expected amount of variation in our dataset.
- In simple words, it represents how far a typical data point is from the mean (μ).

**Standard Deviation Formula:**

The standard deviation can be calculated by taking the square root of the variance:

**$SD = \sqrt{variance}$**
- $\Large σ = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(H_i - \mu)^2}{n}}$

**Interpretation:**
- A lower standard deviation signifies that data points tend to be close to the mean, indicating **less variability**.
- Conversely, a higher standard deviation indicates greater data dispersion, suggesting **more variability** within the dataset.

---
title: IQR Implementation on real life dataset
description:
duration: 5400
card_type: cue_card
---

### **IQR implementation on a real-life dataset** (10-12 mins)

Now let's see how we can calculate the IQR and create a box plot using Python, by working on a real-life dataset.

#### **Problem Statement**:

When we talk about these two players:
1. Sehwag
2. Rahul Dravid

We all know that Sehwag has an **aggressive batting style**, while Rahul Dravid **plays patiently**, with no risk, and stands on the crease like a "Wall".
- Let's analyse both of their matches and try to find some insights about their range of scores.
- We will use the IQR here to measure the spread of their scores accurately, and we will also try to find out if they have any "outlier" scores in their careers.

We will then conclude which of these two batsmen is the more consistent.

Let's start with Sehwag's matches.

Code:
```python=
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1JYyGv7QSb_GkVGan5rnHNrtOg2ewJB_f -O sehwag.csv
```

Code:
```python=
sehwag = pd.read_csv("sehwag.csv")
sehwag.head()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/101/original/op220.png?1699358676 height = "200" width = "600">

Code:
```python=
sehwag["Runs"].describe()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/102/original/dfg34.png?1699358869 height = 200 width = 350>

We want to find the range of his scores.

> Let's find the quartiles first on the "Runs" column. So Q1, Q2 and Q3 will be

Code:
```python=
# 25th percentile or Q1
p_25 = np.percentile(sehwag["Runs"], 25)
p_25
```

>Output:
```
8.0
```

This value indicates that 25% of all the values present in the dataset for Sehwag's runs are less than 8.

We can also say: out of all the matches that Sehwag played, in 25% of those matches he scored less than 8 runs.
Code:
```python=
# 50th percentile or Q2, also "Median"
p_50 = np.percentile(sehwag["Runs"], 50)
p_50
```

>Output:
```
23.0
```

This indicates that in 50% of the matches, he scored less than 23 runs.

Code:
```python=
# 75th percentile or Q3
p_75 = np.percentile(sehwag["Runs"], 75)
p_75
```

>Output:
```
46.0
```

This indicates that in 75% of the matches, he scored less than 46 runs.

> So, IQR will be?

We know IQR = Q3 - Q1

Code:
```python=
# Inter Quartile Range
iqr_sehwag = p_75 - p_25
iqr_sehwag
```

>Output:
```
38.0
```

Code:
```python=
# Plain range: maximum minus minimum
normal_range = (sehwag["Runs"].max() - sehwag["Runs"].min())
normal_range
```

>Output:
```
219
```

We can observe the difference here: the **IQR is 38**, which means that the middle 50% of the data lies within a span of 38 runs. So in about half of his matches, Sehwag's score fell within a 38-run band (between 8 and 46 runs).
- On the other hand, the **normal range is very high i.e. 219**, which is certainly not a good range to consider.
- We can say that **there is an outlier present in the data**, meaning that in some matches he scored very heavily, more than 200 runs in a single match.

This is why the range is getting affected by the outlier.

Let's plot the box plot to visualise the spread of the data.

Code:
```python=
sns.boxplot(data=sehwag["Runs"], orient="h")
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/103/original/boxd23.png?1699359296 height = 300 width = 400>

We can see that the box spans Q1 to Q3, with Q2 (the median) marked inside it, and that there are whiskers on both sides of the box, which act as the limits.

We already saw how to calculate the lower whisker and the upper whisker.

All the values outside these limits are considered "outliers".

Code:
```python=
# upper limit = Q3 + 1.5 * IQR
upper = p_75 + 1.5 * iqr_sehwag
upper
```

>Output:
```
103.0
```

Here, we cannot have values on the left side of the lower whisker, as a batsman cannot score less than 0 runs.
So all the outliers will be present on the right side of the upper whisker.

Code:
```python=
# all the values greater than upper are outliers
outliers_sehwag = sehwag[sehwag["Runs"]>upper]
len(outliers_sehwag)
```

>Output:
```
14
```

Code:
```python=
14/245
```

>Output:
```
0.05714285714285714
```

**Conclusion**: Here we can say that **5.7% of the values in the dataset are outliers**. This means that in 5.7% (~6%) of his matches, Sehwag scored more than the upper limit of 103 runs.

Now let's repeat the same process for Dravid's stats.

**Cricket - Dravid** (10 mins)

Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1nrKmOYQNiTqFhMIoAwE00ULKhGhMEVMZ -O dravid.csv
```

Code:
```python=
dravid = pd.read_csv("dravid.csv")
dravid["Runs"].describe()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/105/original/op78.png?1699359727 height = 200 width = 300>

Code:
```python=
# 25th percentile or Q1
per_25 = np.percentile(dravid["Runs"], 25)
per_25
```

>Output:
```
10.0
```

This indicates that in 25% of the matches, he scored less than 10 runs.

Code:
```python=
# 50th percentile or Q2, also "Median"
per_50 = np.percentile(dravid["Runs"], 50)
per_50
```

>Output:
```
26.0
```

This indicates that in 50% of the matches, he scored less than 26 runs.

Code:
```python=
# 75th percentile or Q3
per_75 = np.percentile(dravid["Runs"], 75)
per_75
```

>Output:
```
54.0
```

This indicates that in 75% of the matches, he scored less than 54 runs.

Code:
```python=
# Inter Quartile Range
iqr_dravid = per_75 - per_25
iqr_dravid
```

>Output:
```
44.0
```

Code:
```python=
# Plain range: maximum minus minimum
normal_range = (dravid["Runs"].max() - dravid["Runs"].min())
normal_range
```

>Output:
```
153
```

Code:
```python=
sns.boxplot(data=dravid["Runs"], orient="h")
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/106/original/fj65.png?1699360087 height=300 width = 400>

Code:
```python=
# upper limit = Q3 + 1.5 * IQR
upper_dravid = per_75 + 1.5*(iqr_dravid)
upper_dravid
```

>Output:
```
120.0
```

Code:
```python=
# all the values greater than upper are outliers
outliers_dravid = dravid[dravid["Runs"]>upper_dravid]
len(outliers_dravid)/len(dravid)
```

>Output:
```
0.009433962264150943
```

Code:
```python=
outliers_dravid['Runs'].shape
```

>Output:
```
(3,)
```

Code:
```python=
dravid.shape
```

>Output:
```
(318, 14)
```

#### **Conclusion**

Here we can say that **0.9% of the values in the dataset are outliers**. This means that in 0.9% of his matches, Dravid scored more than the upper limit of 120 runs.

So we can say that in Sehwag's case there are about **6% outliers**, and in Dravid's case there are only **0.9% outliers**, which shows that "**Dravid was more consistent than Sehwag**".

Now let's calculate the standard deviation, through which we can measure the amount of variation or dispersion in the runs scored by Sehwag and Dravid.

> Code:
```python=
std_dev_sehwag = np.std(sehwag["Runs"])
print("The amount of variations in runs scored by sehwag is:",std_dev_sehwag)
```

> Output:
```
The amount of variations in runs scored by sehwag is: 34.73830672594385
```

> Code:
```python=
std_dev_dravid = np.std(dravid["Runs"])
print("The amount of variations in runs scored by dravid is:",std_dev_dravid)
```

> Output:
```
The amount of variations in runs scored by dravid is: 29.635116182506632
```

A lower standard deviation indicates less variability in the batsman's performance.

> **For Sehwag:**
- Standard Deviation of Runs: 34.74

> **For Dravid:**
- Standard Deviation of Runs: 29.64

**Conclusion:**

Dravid has a lower standard deviation compared to Sehwag. This suggests that **Dravid's run scores are more consistent or less variable than Sehwag's**.
- In other words, Dravid tends to have a more stable performance in terms of runs compared to Sehwag, who shows more variability in his run scores.
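The steps repeated for both players (quartiles, IQR, whisker limits, share of outliers, standard deviation) can be wrapped in one helper. The function name `iqr_outlier_summary` and the sample scores below are hypothetical, made up for illustration; they are not taken from `sehwag.csv` or `dravid.csv`. The sketch assumes NumPy's default percentile interpolation and the population standard deviation (`ddof=0`), which is also the default of `np.std`.

```python
import numpy as np

def iqr_outlier_summary(values):
    """Return IQR, whisker limits, outlier fraction and population SD for 1-D data."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr            # lower whisker limit
    upper = q3 + 1.5 * iqr            # upper whisker limit
    outside = (values < lower) | (values > upper)
    return {
        "iqr": iqr,
        "lower": lower,
        "upper": upper,
        "outlier_fraction": outside.mean(),   # share of points beyond the limits
        "std": values.std(),                  # ddof=0, same default as np.std
    }

# Hypothetical scores for one batsman (illustration only)
scores = [0, 4, 8, 12, 20, 23, 30, 41, 46, 52, 150]
summary = iqr_outlier_summary(scores)

for name, value in summary.items():
    print(name, "=", round(float(value), 3))
```

With the real data, the same helper could be called as `iqr_outlier_summary(sehwag["Runs"])` and `iqr_outlier_summary(dravid["Runs"])`, assuming the data frames loaded earlier are available.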
So, from both the **IQR method** and the **Standard Deviation**, we can conclude that Dravid is more consistent than Sehwag.

**Let's talk about a new concept**

---
title: Random variables and Distribution functions
description:
duration: 5400
card_type: cue_card
---

### **Random Variable (RV)** (5 - 7 mins):

Having explored measures of variability such as variance and standard deviation, which particularly deal with random variables, let's try to understand what exactly a random variable is.

Random variables are essential for modelling and understanding uncertainty in various scenarios. Let's have a look.

> **Can you think of situations in your daily life where uncertainty or randomness plays a significant role?** (Learner Participation)

- Weather forecasts
- Stock market predictions
- Even a coin toss has uncertainty

Because of this randomness or uncertainty, the concept of probability seeps into our daily life. For instance,
- When talking about weather forecasts, the reporter talks about the percentage chance (probability) of rain, storm, etc.
- We've even quantified that on a coin toss, the probability of getting heads is 0.5.

Therefore, when we look at things with a mathematical mindset, we need a way to account for such events that exhibit randomness.

For this, we use something known as a **random variable**.

A random variable describes a situation/event/experiment for which we are not certain about the outcome. It is a way to assign numbers to the outcomes of such events.

Random variables can be further divided into 2 types:
- Discrete RV
- Continuous RV

#### Examples of Discrete RV

Here, we can count the number of possible outcomes.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/107/original/rv5.png?1699360560 height = "300" width = "600">

> **Coin Toss**

Since we mentioned it earlier, let's consider a coin toss.

**What are its possible outcomes?** Heads and Tails.
- There is no other possible outcome.
Hence, we can represent its outcomes as a random variable that can take the values: $\left \{H, T \right \}$

> **Throw of a die**

- Let's assign a random variable, "X," to represent the outcome of the die roll.
- So, a throw of a die can be represented as: $X = \left \{ 1, 2, 3, 4, 5, 6 \right \}$, depending on the outcome of the roll.
- It cannot have an outcome less than 1 or greater than 6
- Nor can it take any decimal value, such as one between 1 and 2
- Hence it is also a discrete RV

> **Consider the contingency table provided.**

- Here, we have 2 random variables
    - X: Number of sandwiches bought
    - Y: Number of drinks bought

Note that they are both discrete RVs, as they can only take the following values:
- $X = \left \{1, 2 \right \}$
- $Y = \left \{1, 2, 3 \right \}$

#### Examples of Continuous RV

Here, we cannot count the number of possible outcomes. They are infinite.

> **Height of students in a class**

- Suppose the lowest student height in the class is: 4.5 feet
- Suppose the highest student height is: 5.9 feet

Now, we can have students whose height is
- 4.511 feet
- 4.92 feet
- 5.8555 feet

So, there is an infinite number of possible height values between 4.5 and 5.9 feet. We cannot count them, whereas we could count the number of possibilities in a coin toss or a die throw.

> **Other examples of Continuous RV can be:**

- Temperature of a room
- Time taken to complete a task
- Distance travelled ...etc

Now that we have a proper understanding of random variables, let's look at the different distribution functions that are used to describe the probability distribution of random variables.

We will work with a dataset shortly, but first let's see what a distribution function is.

### **Distribution Functions** (3 mins)

**Probability Density Function (PDF)**:
- The PDF is a function that describes the probability density of a continuous random variable over its range.
- The term "density" here is similar to how tightly data is packed around a specific point, like cars on a road.
**Probability Mass Function (PMF)**:
- The PMF is a function that describes the probability of a discrete random variable taking on a specific value.

**Cumulative Distribution Function (CDF)**:
- The CDF is a function that gives the probability that a random variable is less than or equal to a specified value.

Let's implement this using a height dataset.

For now, we are going to work with the height dataframe that we saw above.

Code:
```python=
df_height = df_hw["Height"]
df_height.head()
```

> Output:
```
0    73.847017
1    68.781904
2    74.110105
3    71.730978
4    69.881796
Name: Height, dtype: float64
```

Code:
```python=
# minimum height
min_height = df_height.min()
min_height
```

> Output:
```
54.2631333250971
```

Code:
```python=
# maximum height
max_height = df_height.max()
max_height
```

> Output:
```
78.9987423463896
```

Code:
```python=
total = len(df_height)
total
```

> Output:
```
10000
```

When we talk about probability, we try to construct distribution plots.

The Cumulative Distribution Function (CDF), Probability Mass Function (PMF), and Probability Density Function (PDF) are all related to random variables and are used to describe the probability distribution of random variables.

We have already seen what a random variable is. To plot its distribution, we generally use histograms or distribution plots.

#### **Histogram** (5 mins)

It is a graphical representation of a dataset's distribution, showing the frequency or probability of different values within the data.

Code:
```python=
sns.displot(df_height)
```

> Output:

<img src =https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/110/original/jwp0.png?1699361448 height = 400 width = 400>

> Q. What can we understand from this distribution?

- Each bar in the histogram represents one of the intervals or ranges.
- The height of the bar indicates the frequency, or number of data points, falling within that interval.
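
The bar heights described above are simply counts per interval, and they can be computed without plotting. The sketch below uses `np.histogram` on a small hypothetical array of heights in inches; it stands in for `df_height`, which is not loaded here, and the choice of three bins between 60 and 72 is an assumption made purely for illustration.

```python
import numpy as np

# Hypothetical heights in inches, standing in for df_height
heights = np.array([61.0, 62.5, 63.2, 64.8, 64.9, 66.1, 67.4, 70.3])

# Split the range 60 to 72 into three equal-width intervals and count values in each
counts, edges = np.histogram(heights, bins=3, range=(60, 72))

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} values")
```

Each printed count corresponds to the height of one bar, and the counts add up to the total number of observations.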
**Count**:
- It indicates the "**frequency**", that is, how many values fall in a particular bar or range of height.
- For example, around 500 people have a height in the range of 63 - 65 inches (that one bar).

This is what histograms or distribution plots tell us about the data.

Now let's have a look at some distribution functions.

### **Probability Mass Function (PMF)** (3 mins)

The PMF is a function that describes the probability of a discrete random variable taking on a specific value.

It associates each possible value of the random variable with its probability of occurrence.

**Example: Rolling a Fair Six-Sided Die**

- Let X be a discrete random variable representing the outcome of rolling a fair six-sided die. The possible outcomes are 1, 2, 3, 4, 5, 6, so X is a discrete random variable.
- The PMF might look like $P(X = 1) = \frac{1}{6}$, $P(X = 2) = \frac{1}{6}$, and so on.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/111/original/pmf.png?1699362755 height = "300" width = "500">

### **Probability Density Function (PDF)** (5 mins)

**PDF is used for continuous random variables**, as opposed to PMF, which is for discrete variables.

If you want the probability that a continuous random variable falls within a given range, you use the PDF.
- It doesn't provide the probability of a specific value but gives you the probability of the RV falling within a certain interval.
- For example, what are the chances that the next height you choose will fall between 62 and 65?

We can visualize a PDF by using distribution plots like histograms or KDE (Kernel Density Estimation) plots.
Code:
```python=
sns.kdeplot(df_height)
```

> Output:

<img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/112/original/cou.png?1699362935 height=300 width = 400>

**Example:**

If we have a continuous random variable Y representing the height of people in a population, the PDF can be used to find the probability that a randomly chosen person has a height within a certain range, such as between 65 and 70.
- We find the area under the curve over that interval to get the probability.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/113/original/bo56.png?1699363039 height = 300 width =400>

Next up we have:

### **Cumulative Distribution Function (CDF)** (5-7 mins)

The Cumulative Distribution Function (CDF) describes the probability that a random variable takes on a value less than or equal to a given value.

In the context of this dataset, the CDF gives the fraction of people whose height is less than or equal to a given height.
- Let's say you take 60 inches. What fraction of the people have a height less than or equal to this value? This fraction is calculated using the CDF.
- It gives you the cumulative probability up to a certain point.

**Example:**

If you have a random variable Z representing the number of heads in three coin tosses, the CDF would tell you the probability that Z is less than or equal to a certain number, like P(Z ≤ 2).

> **How to calculate CDF**?

The **CDF is calculated by accumulating the probabilities for each height value**.
- As you move along the X-axis (height values) on the CDF graph, you're essentially adding up the probabilities.
- It shows how likely it is to find someone with a height less than or equal to that value.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/114/original/cdf.png?1699363140 height = "200" width = "500">

- The CDF graph typically starts at 0% on the Y-axis (probability) when height is at its minimum (in our dataset).
- It ends at 100% when height is at its maximum.
- The curve starts at the left and gradually climbs towards the right.
- The steepness of the curve at a particular point represents how quickly the probability is accumulating.

**Conclusion**

So, the PDF shows the probability density at each height (probabilities come from the area under the curve over an interval), while the CDF shows the probability of heights up to a certain value in your dataset.

Let's plot the CDF graph for this dataset manually.

Code:
```python=
# CDF: Cumulative distribution function

# take 100 evenly spaced values between 50 and 80 (inclusive) using np.linspace
x_values = np.linspace(50, 80, 100)

# will contain the fraction of people shorter than each x
y_values = []

for x in x_values:
    # select the people whose height is less than or equal to x
    people_shorter_than_x = df_height[df_height <= x]

    # count such people
    num_people_shorter_than_x = len(people_shorter_than_x)

    # divide by the total to get the fraction of people shorter than x
    fraction_people_shorter_than_x = num_people_shorter_than_x / total

    # append to the y_values list
    y_values.append(fraction_people_shorter_than_x)

# plotting the CDF
plt.plot(x_values, y_values, c="b")
```

> Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/115/original/bj2.png?1699363286 height = 300 width = 400>

- This curve is called the "**Cumulative Distribution Function**".
- It is a function which takes the x value and returns the y value
    - $f(x) = y$

It is the inverse of the percentile:
- **Percentile takes 25 as input and gives 63.5 as output**, which means 25% of the people are shorter than 63.5.
- **CDF takes 63.5 as input and gives 0.25 as output**, which means the fraction of people with height less than or equal to 63.5 is 25%.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/116/original/lhfj.png?1699363348 height = 50 width = 600>

### **Conclusion :**

In summary, the relationships are as follows:
- The PMF is used for discrete random variables.
- The PDF is used for continuous random variables.
- The CDF is used for both discrete and continuous random variables to provide cumulative probabilities.

These functions are essential tools in probability and statistics for describing and understanding the behaviour of random variables.

Let's study the statistical measures that are related to random variables. They help describe the characteristics of a random variable and provide insights into its behaviour.

---
title: Conclusion
description:
duration: 5400
card_type: cue_card
---

### **Conclusion**

With this, we conclude today's lecture.

Today, we covered many different topics and laid the foundation for the next lectures.

Please keep revising this class, as it is the foundation for your statistics journey. From the next class, we will start learning about different probability distributions.

See you in the next class
