---
title: Content
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Content</font>
- **Central Limit Theorem**
- **Confidence Intervals**
- Using CLT
- Using Bootstrapping
- With replacement
---
title: Central Limit Theorem
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Introduction</font> (2 mins)
Greetings everyone,
In our previous lecture, we explored the normal distribution and discussed how many real-world phenomena tend to follow this bell-shaped curve.
In today's lecture, we are going to cover two very important topics in statistics: the Central Limit Theorem (CLT) and Confidence Intervals (CI).
Let's get started.
### <font color='blue'>Central Limit Theorem</font> (12-15 mins)
In the probability distributions-2 lecture, we covered everything related to samples, sample statistics, and sampling distributions.
The <font color='purple'>central limit theorem relies on the concept of a sampling distribution</font>, which is the probability distribution of a statistic for a large number of samples taken from a population.
> <font color='purple'>**Let's recall the sampling distribution**</font>
Draw random samples from a population, calculate means for each sample, and repeat.
The collection of these sample means forms a sampling distribution.
<br>
> <font color='purple'>Imagine a scenario</font>
```
Imagine you have a big jar filled with jellybeans.
```
Each jellybean represents a piece of data in your population.
The color and size of the jellybeans can be different, representing the diversity in your data.
Now, grab a handful of jellybeans from the jar and calculate the average size of those jellybeans.
Put those jellybeans back, shake the jar, and grab another handful, calculate the average again. Repeat this process many times.
According to the Central Limit Theorem:
1. **No matter how the jellybeans are distributed originally**, as you take more and more samples and calculate the average each time, the distribution of those averages will start to look like a bell curve.
2. **The more handfuls of jellybeans you take, the closer the distribution gets to a perfect bell curve**, even if the original distribution of jellybeans was not bell-shaped at all.
This is powerful because it allows us to make certain assumptions and predictions about the averages of large samples, even if we don't know much about the original population.
<br>
**So,**
The central limit theorem states that <font color='blue'>"the mean of a random sample will resemble the population mean ever more closely as the sample size increases, and it will approximate a normal distribution regardless of the shape of the population distribution"</font>
- Means, if you take sufficiently large samples from a population, the samples’ means **will be normally distributed**, even if the population isn’t normally distributed.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/395/original/clt1.png?1700474602 height = 300 width = 400>
Similarly,
- If we have a <font color='purple'>sufficiently large number of independent and identically distributed (i.i.d.) random variables and sum them up, the distribution of the sum will tend to be approximately normal, regardless of the shape of the original distribution.</font>
This is called the **CLT for sums of random variables**.
<br>
In summary, both versions essentially state that as you **sum or average a sufficiently large number of i.i.d. random variables**, the resulting distribution tends to approach normality.
The "30" rule of thumb is often mentioned as a guideline for the classical CLT, but the actual requirement may vary based on the characteristics of the population distribution.
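We can see this behavior in a quick simulation. Below is a minimal sketch (the exponential population and the seed are illustrative choices, not part of the lecture's dataset): even though the population is clearly right-skewed, the sample means pile up symmetrically around the population mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal (right-skewed) population: exponential with mean 10
population = rng.exponential(scale=10, size=100_000)

# Draw many samples of size n and record each sample's mean
n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(5000)]

# Despite the skewed population, the sample means center on the
# population mean, with spread close to sigma / sqrt(n)
print(np.mean(sample_means))   # close to 10
print(np.std(sample_means))    # close to population.std() / np.sqrt(n)
```

Plotting `sample_means` with a histogram (as we will do later in this lecture) shows the bell shape directly.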
Sample size (n) plays a vital role in the context of the Central Limit Theorem (CLT) and its impact can be summarized as follows:
### <font color='blue'>Sample Size and Normality:</font>
A **larger sample size leads** to a sampling distribution that closely resembles a **normal distribution**.
- The CLT <font color='purple'>tends to work well when the **sample size $(n)$ is sufficiently large**, typically considered as **$n≥30$**</font>. However, it is not a strict rule but a rough guideline.
For <font color='purple'>**moderately skewed distributions, even smaller sample sizes can sometimes be sufficient**</font>.
- For Small n $(n<30)$ the CLT can still be useful, especially if the population distribution is close to normal. However, <font color='purple'>for smaller sample sizes, the normality assumption becomes more critical</font>.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/396/original/clt2.png?1700474761 height = 330 width = 600>
Here,
- We started by drawing samples of size 2 **(n=2)** from the population, calculated their means, and repeated this many times to form a distribution. <font color='purple'>This distribution already tends to look somewhat like a normal distribution.</font>
- When we **increased the sample size, the distribution resembled a normal distribution even more closely**.
<br>
### <font color='purple'>**Sample Size and Standard Deviations**</font>:
As you can see in the image above, sample size also affects the spread of the sampling distribution.
- A **smaller n (n = 2) will have a high standard deviation** because sample means are less precise estimates of the population mean, resulting in more spread.
- A **larger n (n >= 30) will have a low standard deviation** since sample means become more precise estimates of the population mean, leading to less spread.
<br>
<font color='orange'>**Conclusion**</font>
The **sample size not only influences how closely the sampling distribution approximates a normal curve but also impacts the spread or precision of sample means**.
As the sample size increases, the CLT becomes more applicable, and the standard deviation decreases, making the estimates of population parameters more reliable.
If we summarize the CLT,
### <font color='purple'>**Conditions of the CLT will be**</font>
To apply the central limit theorem, three conditions must be met:
1. **Randomization**:
- Data should be randomly sampled, ensuring every population member has an equal chance of being included.
2. **Independence**:
- Each sample value should be independent, with one event's occurrence not affecting another.
- Commonly met in probability sampling methods, which independently select observations.
3. **Large Sample Condition**:
- A sample size of 30 or more is generally considered "sufficiently large."
- This threshold can vary slightly based on the population distribution's shape.
These conditions ensure the applicability of the central limit theorem.
<br>
---
title: Applications of CLT , Summarizing CLT and Example
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>**Application of CLT on real life dataset**</font> (12-15 mins)
Let's apply the central limit theorem to a real distribution to see whether the distribution of the sample means tends to follow a normal distribution.
We'll use the height dataset that we have been working with for the past few lectures.
As we know it is already normally distributed, so let's take the sample means and see if they follow normal distribution.
Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo -O weight-height.csv
```
> Output:
```
weight-height.csv 100%[===================>] 418.09K --.-KB/s in 0.003s
2024-01-18 10:03:08 (134 MB/s) - ‘weight-height.csv’ saved [428120/428120]
```
Code:
```python=
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
```
Code:
```python=
df_hw = pd.read_csv('weight-height.csv')
df_hw.head()
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/404/original/clt5.png?1700476492 height = 200 width = 300>
We are going to work on the Height column, so let's store it in a separate variable.
Code:
```python=
df_height = df_hw["Height"]
sns.histplot(df_height)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/405/original/clt6.png?1700476596 height = 300 width = 400>
Code:
```python=
# mean of the entire population
mu = df_height.mean()
mu
```
>Output:
```
66.36755975482124
```
Code:
```python=
sigma = df_height.std()
sigma
```
>Output:
```
3.8475281207732293
```
We will now randomly draw a sample of five values and determine the average height of that sample.
### <font color='purple'>Sample size = 5</font>
Code:
```python=
df_height.sample(5)
```
>Output:
```
5608 62.069334
2318 76.806344
5647 61.016914
1682 69.902756
882 68.890095
Name: Height, dtype: float64
```
Code:
```python=
np.mean(df_height.sample(5))
```
>Output:
```
67.8041732628697
```
<font color='orange'>**Observation**</font>
- We can notice that on running the above code, it generates a different sample of 5 values every time, and the sample mean also changes accordingly.
Let's repeat this process 10,000 times so we will get sample means of 10,000 unique samples (size = 5).
We will plot the distributions of these 10,000 sample means to see if they follow the normal distribution.
Code:
```python=
sample_5 = [np.mean(df_height.sample(5)) for i in range(10000) ]
sns.histplot(sample_5, kde=True)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/408/original/clt7.png?1700477436 height = 350 width = 500>
Code:
```python=
np.mean(sample_5)
```
>Output:
```
66.36291130707713
```
Code:
```python=
np.std(sample_5)
```
>Output:
```
1.7009715727280839
```
<font color='orange'>**Observation**</font>
- We can conclude that the distribution of those 10,000 sample means is normally distributed, and most of the values lie between 62 and 72.
- There might be some cases where a sample contains only short people, which is why we see some values between 60 and 62.
- Similarly, there might be some cases where a sample contains only tall people, which is why we see some values between 70 and 72.
### <font color='purple'>Sample size = 20</font>
> <font color='purple'>**Q. What would happen If we increase the size of our sample?**</font>
We studied earlier in the lecture that as we increase the sample size, the spread of the data decreases.
- This means that as we increase the sample size, the sample mean will come closer and closer to the population mean.
Let's try this out.
- Let's increase the sample size to 20.
- We will again perform 10,000 iterations and plot the distribution of the sample means.
Code:
```python=
sample_20 = [np.mean(df_height.sample(20)) for i in range(10000) ]
sns.histplot(sample_20, kde=True)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/409/original/clt8.png?1700477618 height = 350 width = 500>
Code:
```python=
np.mean(sample_20)
```
>Output:
```
66.36437874021179
```
Code:
```python=
np.std(sample_20)
```
>Output:
```
0.8745559612271192
```
<font color='orange'>**Observation**</font>
- We can clearly see that as <font color='purple'>**we increase the sample size from 5 to 20, the sample means come closer to the actual mean and the standard deviation decreases**</font>.
- Previously, the majority of the values were between 62 and 72. Now the spread of the data has decreased and the values lie between 64 and 69.
So we found that by <font color='purple'>increasing the sample size, the variability or SD of the sampling distribution decreases and the sample mean tends to be much closer to the population mean</font>.
### <font color='purple'>Comparison of statistics</font>
Let's compare the statistics of population data and sample data to observe some patterns
Code:
```python=
# population mean
mu = df_height.mean()
# population SD
sigma = df_height.std()
# mean of sample distributions having sample size = 5
mu_5 = np.mean(sample_5)
# SD of sample distributions having sample size = 5
sigma_5 = np.std(sample_5)
# mean of sample distributions having sample size = 20
mu_20 = np.mean(sample_20)
# SD of sample distributions having sample size = 20
sigma_20 = np.std(sample_20)
```
Code:
```python=
print(mu, mu_5, mu_20)
print(sigma, sigma_5, sigma_20)
```
>Output:
```
66.36755975482124 66.36291130707713 66.36437874021179
3.8475281207732293 1.7009715727280839 0.8745559612271192
```
<font color='orange'>**Observation**</font>
Here,
**Population Statistics:**
- $μ$ = population mean
- $σ$ = population standard deviation
**Sample Statistics:**
- $μ_5$ = mean of sample means (from samples of size 5)
- $σ_5$ = standard deviation of the sample means (from samples of size 5)
- $μ_{20}$ = mean of sample means (from samples of size 20)
- $σ_{20}$ = standard deviation of the sample means (from samples of size 20)
<br>
1. We can clearly observe that,
- <font color='purple'>As we increase the sample size, the SD of sample means decreases</font>.
- The **SD of sampling distribution ($σ_{\bar x}$) is less than the population SD ($σ$).**
$\Large σ > σ_5 > σ_{20}$
<br>
This aligns with the CLT, which states that the standard deviation of the sampling distribution ($σ_{\bar x}$) is the standard deviation of the population ($σ$) divided by the square root of the sample size.
- $\Large σ_{\bar x} = \frac{σ}{\sqrt n}$
We have already studied this; it is known as the **Standard Error**. It indicates how far the sample mean is from the actual mean.
By looking at the means, we can observe that,
2. The mean of the sampling distribution is equal to the mean of the population
- $\Large μ_{\bar x} = μ$
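As a quick check of the standard-error formula against the numbers printed above (a small sketch; the σ value is the population SD of the Height column computed earlier):

```python
import numpy as np

sigma = 3.8475281207732293   # population SD of the Height column (from above)

# Standard errors predicted by the CLT for n = 5 and n = 20
se_5 = sigma / np.sqrt(5)
se_20 = sigma / np.sqrt(20)
print(se_5, se_20)   # ~1.72 and ~0.86
```

These predictions line up closely with the simulated values 1.70 and 0.87 we obtained above.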
### <font color='blue'>**Summarizing CLT**</font> (3 mins)
> <font color='purple'>How can we mathematically represent it?</font>
We can describe the sampling distribution of the mean using this notation:
<font color='purple'>**$\Large {\bar X} \backsim \Large N(μ \ , \ \frac{σ}{\sqrt n})$**</font>
Where:
- $\bar X$ is the sampling distribution of the sample means
- $\backsim$ means “follows the distribution”
- $N$ is the normal distribution
- $µ$ is the mean of the population
- $σ$ is the standard deviation of the population
- $n$ is the sample size
Now, let's solve some examples
### <font color='purple'>Example 1:</font> (5 mins)
```
Systolic blood pressure of a group of people is known to have an average of 122 mmHg
and a standard deviation of 10 mmHg.
Calculate the probability that the average blood pressure of 16 people will be greater than 125 mmHg.
```
Given,
for the entire population
- $μ$ = 122
- $σ$ = 10
We need to calculate the probability that average BP of 16 people will be > 125
- Sample size n = 16
- and by the CLT, the mean of the sampling distribution $μ_{\bar x}$ = $μ$ = 122
- The standard deviation (standard error) will be
<br>
Code:
```python=
# SE = σ/sqrt(n)
sigma = 10/np.sqrt(16)
sigma
```
>Output:
```
2.5
```
So, the probability that the average BP of the 16 people is greater than 125 will be:
We know,
$P[X>125] = 1 - P[X < 125]$
> How can we find P[X < 125]?
by calculating "norm.cdf(zscore)"
Code:
```python=
# zscore = (X - mu)/σ
z_score = (125 - 122)/sigma
z_score
```
>Output:
```
1.2
```
Code:
```python=
# P[X>125]=1−P[X<125]
probability = 1 - norm.cdf(z_score)
probability
```
>Output:
```
0.11506967022170822
```
The probability that the average blood pressure of 16 people will be greater than 125 mmHg is 0.115
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/410/original/clt9.png?1700478231 height = 400 width = 400>
Let's solve some quizzes
---
title: Quiz 1
description:
duration: 60
card_type: quiz_card
---
# Question
Weekly toothpaste sales have a mean of 1000 and a standard deviation of 200. A sample of size 4 is taken.
What is the probability that the average weekly sales next month is more than 1110?
# Choices
- [ ] 0.29
- [x] 0.13
- [ ] 0.11
- [ ] 0.08
---
title: Quiz 1 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 1 explanation
Given,
For population data
- $μ$ = 1000
- $σ$ = 200
So, for samples,
- $\bar X$ = $μ$ = 1000
- The standard error will be
Code:
```python=
# SE = σ/sqrt(n)
sigma = 200/np.sqrt(4)
sigma
```
>Output:
```
100.0
```
So, the probability that the average weekly sales are greater than 1110 will be:
$P[X>1110] = 1 - P[X < 1110] $
> How can we find P[X < 1110]?
by calculating "norm.cdf(zscore)"
Code:
```python=
# zscore = (X - mu)/σ
z_score = (1110 - 1000)/sigma
z_score
```
>Output:
```
1.1
```
Code:
```python=
probability = 1 - norm.cdf(z_score)
probability
```
>Output:
```
0.13566606094638267
```
The probability that the average weekly sales next month is more than 1110 is 0.13.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/411/original/clt10.png?1700479311 height = 400 width = 450>
Next quiz
---
title: Quiz 2
description:
duration: 60
card_type: quiz_card
---
# Question
In an e-commerce website, the average purchase amount per customer is $80 with a standard deviation of $15.
If we randomly select a sample of 50 customers,
what is the probability that the average purchase amount in the sample will be less than $75?
# Choices
- [ ] 0.36
- [ ] 0.18
- [ ] 0.01
- [x] 0.009
---
title: Quiz 2 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 2 explanation
Given,
For population data
- $μ$ = 80
- $σ$ = 15
So, for samples with size 50
- $\bar X$ = $μ$ = 80
- The standard error will be
Code:
```python=
# SE = σ/sqrt(n)
sigma = 15/np.sqrt(50)
sigma
```
>Output:
```
2.1213203435596424
```
So, the probability that the average purchase amount will be less than $75 is
$P[X<75]$
> How can we find P[X < 75]?
by calculating "norm.cdf(zscore)"
Code:
```python=
# zscore = (X - mu)/σ
z_score = (75 - 80)/sigma
probability = norm.cdf(z_score)
probability
```
>Output:
```
0.009211062727049501
```
The probability that the average purchase amount in the sample will be less than $75 is 0.009
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/423/original/clt11.png?1700480332 height = 300 width = 500>
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/424/original/clt12.jpeg?1700480431 height = 400 width = 800>
Now, let's jump into the next important concept in statistics: the Confidence Interval.
---
title: Confidence Interval, Compute 95% Confidence interval, Examples
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>**Confidence Interval**</font> (5-7 mins)
In our probability distributions-2 lecture, we took an example of turtles to discuss point estimates. Let's recall it.
<font color='purple'>**Example:**</font>
We wanted to <font color='purple'>estimate the mean weight of a certain species of turtle in Florida by taking a single random sample **($S$)** of 50</font> turtles and using the sample mean to estimate the true population mean.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/425/original/clt13.png?1700480631 height = 200 width = 500>
But as we know there is one problem here,
**Problem:**
- The <font color='purple'>mean weight of turtles in the sample is not guaranteed to exactly match</font> the mean weight of turtles in the whole population.
**Solution:**
In order to capture this uncertainty, we will say that the population mean will lie in the <font color='purple'>**range of the sample mean (point estimates) $+/-$ some errors here and there**</font>.
So, whatever sample mean I will get from the sample, I will try to provide a range and the population mean will lie within that range.
This interval or range is called **Confidence Interval**.
<br>
In summary,
- A confidence interval <font color='purple'>**is the mean of your estimate plus and minus the variation in that estimate**</font>.
- This is the range of values you expect your estimate to fall within, at a certain level of confidence.
> <font color='purple'>**Q. What is confidence?**</font>
Confidence, in statistics, is another way to describe probability.
For example,
- If you construct a confidence interval with a 95% confidence level, <font color='purple'>you are confident that 95 out of 100 times the estimate will fall between the upper and lower values specified by the confidence interval</font>.
- Here, **$x_1$ is the lower bound and $x_2$ is the upper bound** of the interval.
<br>
### <font color='purple'>**Q. How to calculate the confidence interval?**</font>
There are 2 ways to find a confidence interval:
1. **Using CLT**
- This is for mean values. <font color='purple'>If we have mean as a statistic then we will calculate CI using the central limit theorem (CLT)</font>
2. **Using Bootstrapping**
- If you have a statistic other than the mean, like the <font color='purple'>median, then we are not allowed to use the CLT. We can use the bootstrap method to calculate confidence intervals in those scenarios</font>.
<br>
Let's calculate the confidence interval using CLT first,
- Using CLT, we will get these values, and the range between these values will be our confidence interval.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/058/006/original/sample.png?1701171647 height = 200 width = 400>
Here $\bar X_{S}$ is the mean of sample ($S$).
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/427/original/clt15.png?1700480759 height = 400 width = 400>
## <font color='blue'>Compute a 95% Confidence Interval</font> (7 mins)
To get a 95% confidence interval, we <font color='purple'>need to find the area that covers 95% of the data</font>
- Let's say we have 2 points around the mean, one on the left and one on the right. These points will be the lower and upper bounds.
- We can calculate <font color='purple'>the data points for these z scores (z1 and z2) which will be the confidence interval</font>.
- We know that 95% of the population lies between z1 and z2, so on the **left-hand side of z1 there is 2.5% of the population and on the right side of z2 there is 2.5% of the population**
<img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/743/original/Screenshot_2023-11-23_at_5.51.42_PM.png?1700742162 width=300>
So, to find the value below which a given percentage of the data falls, we will use the PPF.
<br>
> <font color='purple'>**Q. How do we find z1 and z2**</font>
- **$z1 = norm.ppf(0.025)$ as we have 2.5% data till z1**
Similarly, to calculate z2 we will use
- **$z2 = norm.ppf(1-0.025)$ as we have 2.5% data remaining after z2**
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/429/original/clt17.png?1700480912 height = 250 width = 400>
Code:
```python=
# z1 will be
z1 = norm.ppf(0.025)
z1
```
>Output:
```
-1.9599639845400545
```
Code:
```python=
# z2 will be
z2 = norm.ppf(1 - 0.025) # we can also use norm.ppf(0.975)
z2
```
>Output:
```
1.959963984540054
```
> <font color='purple'>Q. What is the formula for the z score?</font>
$\Large Z = \frac{{X} - μ}{σ}$
So,
<font color='purple'>$\Large X = μ + (Z * σ)$</font>
Where
- $X$ is individual data point
- $\mu$ is population mean
- $\sigma$ is population standard deviation
Note:
From this, we will get data points associated with the corresponding z score, which is Z in this case.
<br>
**Here we are dealing with Samples so**,
- We will use $\bar X$ which is sample mean.
- In the case of samples, we use the standard error, so $σ_{\bar x} = \frac {σ}{\sqrt n}$
The Z-score is a measure of how many standard deviations a data point (in this case, the sample mean) is from the population mean
- We need to find how many standard deviations away the sample mean is from the population mean
Z score will be:
$\Large Z = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}$
Where,
- $\bar X$ = sample mean
- $\mu$ = population mean
- $\sigma/\sqrt n$ = standard error
<br>
In the sample $S$ we are looking at above, <font color='purple'>**the standardized sample mean will lie between the z-scores z1 and z2, i.e. -1.96 and 1.96, with 95% probability**</font>.
Between these two points, we will have our 95% confidence interval.
$\Large Z_1 < \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}} < Z_2$ (equation 1)
- $\bar X_s$ represents mean of sample $S$
Now, if we take the left side: $Z_1 < \Large \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}}$
We will get: $\Large μ \ < \ \bar X_{s} - Z_1 * (\frac{σ}{\sqrt n})$ (equation 2)
Now, if we take the right side: $\Large \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}} < Z_2$
We will get: $\Large \bar X_{s} - Z_2 * (\frac{σ}{\sqrt n}) \ < \ μ$ (equation 3)
<br>
Now, combining equations 2 and 3, and noting that $Z_1 = -Z_2$ (so with $Z = Z_2 = 1.96$),
<font color='blue'>$\Large \bar X_{s} - Z * (\frac{σ}{\sqrt n}) \ < \ μ \ < \bar X_{s} + Z * (\frac{σ}{\sqrt n})$</font>
<br>
> **<font color='purple'>Q. So, what will be the range where original $μ$ is lying?</font>**
**Confidence Interval = <font color='purple'>$\bar X ± Z \left( \frac {σ}{\sqrt n}\right)$</font>**
OR
**<font color='purple'>$[\bar {X} - Z * (\frac{σ}{\sqrt n}) \ ,\ \bar {X} + Z * (\frac{σ}{\sqrt n})]$</font>**
where:
$\bar X$: sample mean
$Z$: is the z-score corresponding to the desired confidence level.
$σ$: population standard deviation
$n$: sample size
<br>
If you take **$(Z * \frac{σ}{\sqrt n})$**, it is referred to as the **margin of error** in the context of confidence intervals.
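As a quick illustration with the blood-pressure numbers from Example 1 earlier (σ = 10, n = 16), the 95% margin of error works out as follows (a small sketch):

```python
import numpy as np
from scipy.stats import norm

# Margin of error = Z * (sigma / sqrt(n))
z = norm.ppf(0.975)                       # ~1.96 for a 95% confidence level
margin_of_error = z * (10 / np.sqrt(16))  # sigma = 10, n = 16
print(margin_of_error)                    # ~4.9
```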
Now let's try to solve one example
### <font color='purple'>**Example on confidence interval**</font> (5 mins)
```
The mean height of a sample of 100 adults was found to be 65 inches,
with a standard deviation of 2.5 inches.
Compute 95% confidence interval
```
Solution:
Given,
Sample size "n" = 100
Sample mean "$\bar x$" = 65
SD = 2.5
First, let's calculate the standard error:
Code:
```python=
# sigma/sqrt(n)
std_error = 2.5/np.sqrt(100)
std_error
```
>Output:
```
0.25
```
Now, the values of Z for 95% confidence, which we calculated above, will be:
Code:
```python=
# z1 will be
z1 = norm.ppf(0.025)
z1
```
>Output:
```
-1.9599639845400545
```
Code:
```python=
# z2 will be
z2 = norm.ppf(1 - 0.025) # we can also use norm.ppf(0.975)
z2
```
>Output:
```
1.959963984540054
```
Now, how do we get the data points for z1 (on the left side) and z2 (on the right side)?
$x_1 = \bar x + z_1 * SE$ and,
$x_2 = \bar x + z_2 * SE$
Code:
```python=
x1 = 65 + z1 * std_error
x1
```
>Output:
```
64.51000900386498
```
Code:
```python=
x2 = 65 + z2 * std_error
x2
```
>Output:
```
65.48999099613502
```
So the range of 95% confidence interval --> [64.51, 65.48]
### <font color='orange'>Conclusion:</font>
We can claim that the population mean will lie between the value 64.51 and 65.48 with 95% confidence.
There is a function available that calculates this interval directly in a single call
- Using **norm.interval()**
You have to pass three arguments:
- **norm.interval(confidence, loc=0, scale=1)**
- confidence: how much confidence you want
- loc: pass the **mean** value here (by default it is 0)
- scale: pass the **standard error** here (by default it is 1)
Code:
```python=
norm.interval(0.95, loc=65, scale=std_error)
```
>Output:
```
(64.51000900386498, 65.48999099613502)
```
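To see what "95% confidence" actually means, here is a minimal simulation sketch (the population values 65 and 2.5 are assumed known here purely for illustration): if we repeatedly draw samples and build a 95% CI from each, about 95 out of 100 intervals should contain the true mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n = 65, 2.5, 100    # assumed "true" population values

trials, covered = 2000, 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    # 95% CI around this sample's mean, using the known population sigma
    lo, hi = norm.interval(0.95, loc=sample.mean(), scale=sigma / np.sqrt(n))
    if lo < mu < hi:
        covered += 1

print(covered / trials)   # close to 0.95
```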
Let's solve one more example to reiterate the concept.
## <font color='purple'>Example 2 on confidence interval:</font> (3 mins)
```
The sample mean recovery time of 100 patients after taking a drug was seen to be 10.5 days with a standard deviation of 2 days
Find the 95% confidence interval of the true mean.
```
**Approach 1**:
Given,
- sample size "n" = 100
- sample mean = 10.5
- standard deviation = 2
The approach will be the same.
Code:
```python=
std_error = 2/np.sqrt(100)
std_error
```
>Output:
```
0.2
```
Code:
```python=
z1 = norm.ppf(0.025)
x1 = 10.5 + z1 * std_error
x1
```
>Output:
```
10.108007203091988
```
Code:
```python=
z2 = norm.ppf(0.975)
x2 = 10.5 + z2 * std_error
x2
```
>Output:
```
10.89199279690801
```
**Approach 2:**
Code:
```python=
norm.interval(0.95, loc = 10.5, scale=0.2)
```
>Output:
```
(10.10800720309199, 10.89199279690801)
```
So the range of 95% confidence interval --> [10.10, 10.89]
Just for practice, let's solve one more example
<font color='red'>***Instructor note:***</font>
Solve example 3 if time permits otherwise ignore it.
## <font color='purple'>Example 3 on confidence interval:</font> (5 mins)
```
The mean Youtube watch time of a sample of 100 students was found to be 3.5 hours,
with a standard deviation of 1 hour.
Construct a 90% confidence interval for the true watch time.
```
Solution:
Given,
- sample size = 100
- sample mean = 3.5
- standard deviation = 1
- confidence interval = 90%
Now, let's try a different approach here.
We can simply define one function and calculate all the needed values inside it,
so that it'll be easier to compute intervals just by calling the function.
Code:
```python=
# define function calc_CI
def calc_CI(mean, std, N, confidence):
    # calculate the standard error
    std_err = std / np.sqrt(N)
    print("SE", std_err)
    # the fraction left outside the interval on each side
    # (90% confidence leaves 10%, i.e. 5% on each side -> 0.05)
    slice = (1 - (confidence / 100)) / 2
    print("Slice", slice)
    # calculate z1 and z2
    z1 = norm.ppf(slice)
    z2 = norm.ppf(1 - slice)
    print("z1 z2", z1, z2)
    # calculate the end points
    x1 = mean + (z1 * std_err)
    x2 = mean + (z2 * std_err)
    return x1, x2
```
Code:
```python=
calc_CI(3.5, 1, 100, 90)
```
>Output:
```
SE 0.1
Slice 0.04999999999999999
z1 z2 -1.6448536269514729 1.6448536269514722
(3.3355146373048528, 3.6644853626951472)
```
<font color='orange'>**Conclusion**</font>
So that was all about computing confidence intervals using the CLT. In general, for other statistics like the median or a specific percentile, you cannot use the CLT.
For that, we have a technique called "Bootstrapping"
Let's have a look at it.
---
title: Quiz 3
description:
duration: 60
card_type: quiz_card
---
# Question
From a sample of 80 endangered birds,
the average wingspan was found to be 45 cm, with a population standard deviation of 10 cm.
What is the correct confidence interval for the mean wingspan of the entire population with 90% confidence?
# Choices
- [x] [43.16, 46.83]
- [ ] [40.21, 43.45]
- [ ] [45.67, 48.92]
- [ ] [46.95, 50.01]
---
title: Quiz 3 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 3 explanation
Given,
- sample size "n" = 80
- sample mean = 45
- standard deviation = 10
Code:
```python=
std_error = 10/np.sqrt(80)
std_error
```
>Output:
```
1.118033988749895
```
Code:
```python=
z1 = norm.ppf(0.05)
x1 = 45 + z1 * std_error
x1
```
>Output:
```
43.16099773854971
```
Code:
```python=
z2 = norm.ppf(0.95)
x2 = 45 + z2 * std_error
x2
```
>Output:
```
46.83900226145029
```
Confidence Interval will be -> [43.16, 46.83]
**Approach 2:**
Code:
```python=
norm.interval(0.90, loc = 45, scale=std_error)
```
>Output:
```
(43.16099773854971, 46.83900226145029)
```
---
title: Quiz 4
description:
duration: 60
card_type: quiz_card
---
# Question
In a software project, the team estimates bug resolution time at an average of 6 hours
with a standard deviation of 2 hours.
To estimate the mean resolution time with 99% confidence, the project manager samples 25 resolved bugs.
What is the correct confidence interval?
# Choices
- [ ] [3.25, 6.45]
- [x] [4.96, 7.03]
- [ ] [5.55, 8.63]
- [ ] [6.74, 9.42]
---
title: Quiz 4 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 4 explanation
Given,
- sample size "n" = 25
- sample mean = 6
- standard deviation = 2
Code:
```python=
std_error = 2/np.sqrt(25)
std_error
```
>Output:
```
0.4
```
Code:
```python=
z1 = norm.ppf(0.005)
x1 = 6 + z1 * std_error
x1
```
>Output:
```
4.969668278580439
```
Code:
```python=
z2 = norm.ppf(0.995)
x2 = 6 + z2 * std_error
x2
```
>Output:
```
7.03033172141956
```
Confidence Interval will be -> [4.96, 7.03]
**Approach 2**:
Code:
```python=
norm.interval(0.99, loc = 6, scale=std_error)
```
>Output:
```
(4.96966827858044, 7.03033172141956)
```
---
title: Confidence interval using bootstrap
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Confidence interval using Bootstrap</font> (10-12 mins)
- Suppose you have very little data and you want to compute a confidence interval for some other statistic, like the median; the most common technique then is bootstrapping.
Let's start with one example:
### <font color='purple'>Example: Salary Survey</font>
Imagine we want to analyse and learn about data scientist salaries at Google.
We have 2 surveys,
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/433/original/clt19.png?1700482313 height = 270 width = 700>
We don't have a population mean here, but we have 2 samples, so we can calculate the sample means.
Code:
```python=
survey_1 = [35, 36, 33, 37, 34, 35]
np.mean(survey_1)
```
>Output:
```
35.0
```
Code:
```python=
survey_2 = [20, 37, 17, 50, 53, 33]
np.mean(survey_2)
```
>Output:
```
35.0
```
> <font color='purple'>Q. Which of the two surveys is better for estimating the population parameter or which survey is more reliable?</font>
By observing the samples, we can see that the values in survey 1 are much closer to the sample mean, so survey 1 will be more reliable for estimation.
> Now, can we simulate more and more sets of samples like the ones above?

For this, statisticians came up with a technique that has a reasonable amount of accuracy.
### <font color='purple'>Sample With Replacement</font>
The idea is to take your survey and create more samples from that same survey by sampling **with replacement**.
Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples.
Code:
```python=
n = 6
# sampling with replacement (the default for np.random.choice)
bootstrapped_samples = np.random.choice(survey_1, size=n, replace=True)
bootstrapped_samples
```
>Output:
```
array([37, 35, 35, 34, 35, 33])
```
Here we get an array of length 6, where each element is one of the original data points from survey 1, chosen at random.
Every time we run this code we get a different array, which means the mean of this newly constructed array will also differ from run to run.
Code:
```python=
np.mean(bootstrapped_samples)
```
>Output:
```
34.833333333333336
```
Code:
```python=
bootstrapped_samples = np.random.choice(survey_2, size=n)
np.mean(bootstrapped_samples)
```
>Output:
```
37.166666666666664
```
Let's observe the difference between survey_1 and survey_2 by running the code several times.
- In survey 1, the bootstrapped mean always stays close to 35.
- In survey 2, it sometimes comes out near 35, sometimes 40, sometimes 39, so there is much more variance.

So we will go with survey_1, as its lower variance gives us higher confidence in the estimate.
**Let's draw a histogram of this survey**
Code:
```python=
bootstrapped_means_survey_1 = []
for reps in range(10000):
    bootstrapped_samples = np.random.choice(survey_1, size=n)
    bootstrapped_mean = np.mean(bootstrapped_samples)
    bootstrapped_means_survey_1.append(bootstrapped_mean)
sns.histplot(bootstrapped_means_survey_1)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/435/original/clt20.png?1700482649 height = 400 width = 550>
Code:
```python=
bootstrapped_means_survey_2 = []
for reps in range(10000):
    bootstrapped_samples = np.random.choice(survey_2, size=n)
    bootstrapped_mean = np.mean(bootstrapped_samples)  # replace np.mean with any statistic (median, percentile, ...)
    bootstrapped_means_survey_2.append(bootstrapped_mean)
sns.histplot(bootstrapped_means_survey_2)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/436/original/clt21.png?1700482738 height = 400 width = 550>
> Let's compare these two histograms, what can we observe?
- **In survey_2, the bootstrapped means range roughly from 20 to 50.**
- **In survey_1, they range from about 33 to 36, which is very close to the sample mean.**

So, <font color='purple'>survey 1 is more accurate than survey 2</font>.
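The visual difference between the two histograms can also be quantified: the standard deviation of the bootstrapped means is far larger for survey 2. A quick check (a sketch; we use a fixed seed so the comparison is reproducible):

```python=
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
survey_1 = [35, 36, 33, 37, 34, 35]
survey_2 = [20, 37, 17, 50, 53, 33]

def bootstrap_means(data, n_reps=10000):
    # resample with replacement and record the mean each time
    return [np.mean(rng.choice(data, size=len(data), replace=True))
            for _ in range(n_reps)]

spread_1 = np.std(bootstrap_means(survey_1))
spread_2 = np.std(bootstrap_means(survey_2))
# spread_1 comes out far smaller than spread_2,
# confirming that survey 1 is the more reliable sample
```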
### <font color='purple'>How to compute the confidence interval?</font>
We can calculate percentiles of the **bootstrapped means**:
- The 2.5th percentile gives the lower bound (x1)
- The 97.5th percentile gives the upper bound (x2)

Then, the 95% confidence interval will be [x1, x2].
Code:
```python=
len(bootstrapped_means_survey_1)
```
>Output:
```
10000
```
Code:
```python=
x1 = np.percentile(bootstrapped_means_survey_1, 2.5)
x1
```
>Output:
```
34.0
```
Code:
```python=
x2 = np.percentile(bootstrapped_means_survey_1, 97.5)
x2
```
>Output:
```
36.0
```
95% of the bootstrapped means lie between 34 and 36, so
**Confidence Interval: $(x1, x2) = (34.0, 36.0)$**
As this process is random, the CI will change slightly every time we run it.
Code:
```python=
len(bootstrapped_means_survey_2)
```
>Output:
```
10000
```
Code:
```python=
x1 = np.percentile(bootstrapped_means_survey_2, 2.5)
x1
```
>Output:
```
24.0
```
Code:
```python=
x2 = np.percentile(bootstrapped_means_survey_2, 97.5)
x2
```
>Output:
```
46.0
```
Here too, the
**Confidence Interval = $(x1, x2) = (24.0, 46.0)$**, which is much wider than the one for survey 1.
This is how we can calculate a Confidence Interval using bootstrapping.
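The whole procedure above fits into one small function, and swapping in `np.median` (or any other statistic) is a one-argument change. A sketch; `bootstrap_ci` is our own name, not a library function:

```python=
import numpy as np

def bootstrap_ci(data, stat_fn=np.mean, n_reps=10000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for any statistic."""
    rng = np.random.default_rng(seed)
    # resample with replacement n_reps times, recording the statistic each time
    stats = [stat_fn(rng.choice(data, size=len(data), replace=True))
             for _ in range(n_reps)]
    alpha = (1 - confidence) / 2
    lower = np.percentile(stats, 100 * alpha)        # e.g. 2.5th percentile
    upper = np.percentile(stats, 100 * (1 - alpha))  # e.g. 97.5th percentile
    return lower, upper

survey_1 = [35, 36, 33, 37, 34, 35]
lo, hi = bootstrap_ci(survey_1, stat_fn=np.median)  # CI for the median
```

This is exactly why bootstrapping is so useful: unlike the CLT approach, it needs no formula for the standard error of the statistic.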
---
title: Conclusion
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Conclusion</font>
With this, we are done with today's lecture.
The CLT and confidence intervals are very important concepts in statistics, and we hope you now understand them well.
Keep revising the topics.