---
title: Content
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Content</font>
- **Central Limit Theorem**
- **Confidence Intervals**
- Using CLT
- Using Bootstrapping
- With replacement
---
title: Central Limit Theorem
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Introduction</font> (2 mins)
Greetings everyone,
In our previous lecture, we explored the normal distribution and discussed how many real-world phenomena tend to follow this bell-shaped curve.
In today's lecture, we are going to cover two very important topics in statistics: the Central Limit Theorem (CLT) and Confidence Intervals (CI).
Let's get started.
### <font color='blue'>Central Limit Theorem</font> (12-15 mins)
In the probability distributions-2 lecture, we covered everything related to samples, sample statistics, and sampling distributions.
The <font color='purple'>central limit theorem relies on the concept of a sampling distribution</font>, which is the probability distribution of a statistic for a large number of samples taken from a population.
> <font color='purple'>**Let's recall the sampling distribution**</font>
Draw random samples from a population, calculate means for each sample, and repeat.
The collection of these sample means forms a sampling distribution.
<br>
> <font color='purple'>Imagine a scenario</font>
```
Imagine you have a big jar filled with jellybeans.
```
Each jellybean represents a piece of data in your population.
The color and size of the jellybeans can be different, representing the diversity in your data.
Now, grab a handful of jellybeans from the jar and calculate the average size of those jellybeans.
Put those jellybeans back, shake the jar, and grab another handful, calculate the average again. Repeat this process many times.
According to the Central Limit Theorem:
1. **No matter how the jellybeans are distributed originally**, as you take more and more samples and calculate the average each time, the distribution of those averages will start to look like a bell curve.
2. **The more handfuls of jellybeans you take, the closer the distribution gets to a perfect bell curve**, even if the original distribution of jellybeans was not bell-shaped at all.
This is powerful because it allows us to make certain assumptions and predictions about the averages of large samples, even if we don't know much about the original population.
<br>
**So,**
The central limit theorem states that <font color='blue'>"the mean of a random sample will resemble the population mean ever more closely as the sample size increases, and it will approximate a normal distribution regardless of the shape of the population distribution"</font>
- Means, if you take sufficiently large samples from a population, the samples’ means **will be normally distributed**, even if the population isn’t normally distributed.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/395/original/clt1.png?1700474602 height = 300 width = 400>
Similarly,
- If we have a <font color='purple'>sufficiently large number of independent and identically distributed (i.i.d.) random variables and sum them up, the distribution of the sum will tend to be approximately normal, regardless of the shape of the original distribution.</font>
This is called the **CLT for sums of random variables**.
<br>
In summary, both versions essentially state that as you **sum or average a sufficiently large number of i.i.d. random variables**, the resulting distribution tends to approach normality.
The "30" rule of thumb is often mentioned as a guideline for the classical CLT, but the actual requirement may vary based on the characteristics of the population distribution.
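We can see this behavior in a quick simulation. Below is a minimal sketch (the exponential population and the seed are illustrative choices, not part of the lecture's dataset): even though the population is clearly right-skewed, the sample means pile up symmetrically around the population mean.

```python
import numpy as np

rng = np.random.default_rng(42)

# A clearly non-normal (right-skewed) population: exponential with mean 10
population = rng.exponential(scale=10, size=100_000)

# Draw many samples of size n and record each sample's mean
n = 30
sample_means = [rng.choice(population, size=n).mean() for _ in range(5000)]

# Despite the skewed population, the sample means center on the
# population mean, with spread close to sigma / sqrt(n)
print(np.mean(sample_means))   # close to 10
print(np.std(sample_means))    # close to population.std() / np.sqrt(n)
```

Plotting `sample_means` with a histogram (as we will do later in this lecture) shows the bell shape directly.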
Sample size (n) plays a vital role in the context of the Central Limit Theorem (CLT) and its impact can be summarized as follows:
### <font color='blue'>Sample Size and Normality:</font>
A **larger sample size leads** to a sampling distribution that closely resembles a **normal distribution**.
- The CLT <font color='purple'>tends to work well when the **sample size $(n)$ is sufficiently large**, typically considered as **$n≥30$**</font>. However, it is not a strict rule but a rough guideline.
For <font color='purple'>**moderately skewed distributions, even smaller sample sizes can sometimes be sufficient**</font>.
- For Small n $(n<30)$ the CLT can still be useful, especially if the population distribution is close to normal. However, <font color='purple'>for smaller sample sizes, the normality assumption becomes more critical</font>.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/396/original/clt2.png?1700474761 height = 330 width = 600>
Here,
- We started by drawing samples of size 2 **(n=2)** from the population, calculated their means, and repeated this many times to form a distribution. <font color='purple'>This distribution already tends to look somewhat like a normal distribution.</font>
- When we **increased the sample size, the distribution resembled a normal distribution even more closely**.
<br>
### <font color='purple'>**Sample Size and Standard Deviations**</font>:
As you can see in the image above, sample size also affects the spread of the sampling distribution.
- A **smaller n (n = 2) will have a high standard deviation** because sample means are less precise estimates of the population mean, resulting in more spread.
- A **larger n (n >= 30) will have a low standard deviation** since sample means become more precise estimates of the population mean, leading to less spread.
<br>
<font color='orange'>**Conclusion**</font>
The **sample size not only influences how closely the sampling distribution approximates a normal curve but also impacts the spread or precision of sample means**.
As the sample size increases, the CLT becomes more applicable, and the standard deviation decreases, making the estimates of population parameters more reliable.
If we summarize the CLT,
### <font color='purple'>**Conditions of the CLT will be**</font>
To apply the central limit theorem, three conditions must be met:
1. **Randomization**:
- Data should be randomly sampled, ensuring every population member has an equal chance of being included.
2. **Independence**:
- Each sample value should be independent, with one event's occurrence not affecting another.
- Commonly met in probability sampling methods, which independently select observations.
3. **Large Sample Condition**:
- A sample size of 30 or more is generally considered "sufficiently large."
- This threshold can vary slightly based on the population distribution's shape.
These conditions ensure the applicability of the central limit theorem.
<br>
---
title: Applications of CLT , Summarizing CLT and Example
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>**Application of CLT on real life dataset**</font> (12-15 mins)
Let's apply the central limit theorem to a real distribution to see whether the distribution of the sample means tends to follow a normal distribution.
We'll use the height dataset that we have been working with for the past few lectures.
As we know it is already normally distributed, so let's take the sample means and see if they follow normal distribution.
Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo -O weight-height.csv
```
> Output:
```
weight-height.csv 100%[===================>] 418.09K --.-KB/s in 0.003s
2024-01-18 10:03:08 (134 MB/s) - ‘weight-height.csv’ saved [428120/428120]
```
Code:
```python=
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
```
Code:
```python=
df_hw = pd.read_csv('weight-height.csv')
df_hw.head()
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/404/original/clt5.png?1700476492 height = 200 width = 300>
We are going to work on the Height column, so let's store it in a separate variable.
Code:
```python=
df_height = df_hw["Height"]
sns.histplot(df_height)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/405/original/clt6.png?1700476596 height = 300 width = 400>
Code:
```python=
# mean of the entire population
mu = df_height.mean()
mu
```
>Output:
```
66.36755975482124
```
Code:
```python=
sigma = df_height.std()
sigma
```
>Output:
```
3.8475281207732293
```
We will now randomly draw a sample of five values and determine the average height of that sample.
### <font color='purple'>Sample size = 5</font>
Code:
```python=
df_height.sample(5)
```
>Output:
```
5608 62.069334
2318 76.806344
5647 61.016914
1682 69.902756
882 68.890095
Name: Height, dtype: float64
```
Code:
```python=
np.mean(df_height.sample(5))
```
>Output:
```
67.8041732628697
```
<font color='orange'>**Observation**</font>
- We can notice that on running the above code, it generates a different sample of 5 values every time, and the sample mean also changes accordingly.
Let's repeat this process 10,000 times so we will get sample means of 10,000 unique samples (size = 5).
We will plot the distributions of these 10,000 sample means to see if they follow the normal distribution.
Code:
```python=
sample_5 = [np.mean(df_height.sample(5)) for i in range(10000) ]
sns.histplot(sample_5, kde=True)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/408/original/clt7.png?1700477436 height = 350 width = 500>
Code:
```python=
np.mean(sample_5)
```
>Output:
```
66.36291130707713
```
Code:
```python=
np.std(sample_5)
```
>Output:
```
1.7009715727280839
```
<font color='orange'>**Observation**</font>
- We can conclude that the distribution of those 10,000 sample means is normally distributed, and most of the values lie between 62 and 72.
- There might be some cases where a sample contains only short people, which is why we see some values between 60 and 62.
- Similarly, there might be some cases where a sample contains only tall people, which is why we see some values between 70 and 72.
### <font color='purple'>Sample size = 20</font>
> <font color='purple'>**Q. What would happen If we increase the size of our sample?**</font>
We studied earlier in the lecture that as we increase the sample size, the spread of the data decreases.
- This means that as we increase the sample size, the sample mean will come closer and closer to the population mean.
Let's try this out.
- Let's increase the sample size to 20.
- We will again perform 10,000 iterations and plot the distribution of the sample means.
Code:
```python=
sample_20 = [np.mean(df_height.sample(20)) for i in range(10000) ]
sns.histplot(sample_20, kde=True)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/409/original/clt8.png?1700477618 height = 350 width = 500>
Code:
```python=
np.mean(sample_20)
```
>Output:
```
66.36437874021179
```
Code:
```python=
np.std(sample_20)
```
>Output:
```
0.8745559612271192
```
<font color='orange'>**Observation**</font>
- We can clearly see that as <font color='purple'>**we increase the sample size from 5 to 20, the sample means come closer to the actual mean and the standard deviation decreases**</font>.
- Previously, the majority of the values were between 62 and 72. Now the spread of the data has decreased and the values lie between 64 and 69.
So we found that by <font color='purple'>increasing the sample size, the variability or SD of the sampling distribution decreases and the sample mean tends to be much closer to the population mean</font>.
### <font color='purple'>Comparison of statistics</font>
Let's compare the statistics of population data and sample data to observe some patterns
Code:
```python=
# population mean
mu = df_height.mean()
# population SD
sigma = df_height.std()
# mean of sample distributions having sample size = 5
mu_5 = np.mean(sample_5)
# SD of sample distributions having sample size = 5
sigma_5 = np.std(sample_5)
# mean of sample distributions having sample size = 20
mu_20 = np.mean(sample_20)
# SD of sample distributions having sample size = 20
sigma_20 = np.std(sample_20)
```
Code:
```python=
print(mu, mu_5, mu_20)
print(sigma, sigma_5, sigma_20)
```
>Output:
```
66.36755975482124 66.36291130707713 66.36437874021179
3.8475281207732293 1.7009715727280839 0.8745559612271192
```
<font color='orange'>**Observation**</font>
Here,
**Population Statistics:**
- $μ$ = population mean
- $σ$ = population standard deviation
**Sample Statistics:**
- $μ_5$ = mean of sample means (from samples of size 5)
- $σ_5$ = standard deviation of the sample means (from samples of size 5)
- $μ_{20}$ = mean of sample means (from samples of size 20)
- $σ_{20}$ = standard deviation of the sample means (from samples of size 20)
<br>
1. We can clearly observe that,
- <font color='purple'>As we increase the sample size, the SD of sample means decreases</font>.
- The **SD of sampling distribution ($σ_{\bar x}$) is less than the population SD ($σ$).**
$\Large σ > σ_5 > σ_{20}$
<br>
This aligns with the CLT, which states that the standard deviation of the sampling distribution ($σ_{\bar x}$) is the standard deviation of the population ($σ$) divided by the square root of the sample size.
- $\Large σ_{\bar x} = \frac{σ}{\sqrt n}$
We have already studied this; it is known as the **Standard Error**. It indicates how far the sample mean is from the actual mean.
By looking at the means, we can observe that,
2. The mean of the sampling distribution is equal to the mean of the population
- $\Large μ_{\bar x} = μ$
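As a quick check of the standard-error formula against the numbers printed above (a small sketch; the σ value is the population SD of the Height column computed earlier):

```python
import numpy as np

sigma = 3.8475281207732293   # population SD of the Height column (from above)

# Standard errors predicted by the CLT for n = 5 and n = 20
se_5 = sigma / np.sqrt(5)
se_20 = sigma / np.sqrt(20)
print(se_5, se_20)   # ~1.72 and ~0.86
```

These predictions line up closely with the simulated values 1.70 and 0.87 we obtained above.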
### <font color='blue'>**Summarizing CLT**</font> (3 mins)
> <font color='purple'>How can we mathematically represent it?</font>
We can describe the sampling distribution of the mean using this notation:
<font color='purple'>**$\Large {\bar X} \backsim \Large N(μ \ , \ \frac{σ}{\sqrt n})$**</font>
Where:
- $\bar X$ is the sampling distribution of the sample means
- $\backsim$ means “follows the distribution”
- $N$ is the normal distribution
- $µ$ is the mean of the population
- $σ$ is the standard deviation of the population
- $n$ is the sample size
Now, let's solve some examples
### <font color='purple'>Example 1:</font> (5 mins)
```
Systolic blood pressure of a group of people is known to have an average of 122 mmHg
and a standard deviation of 10 mmHg.
Calculate the probability that the average blood pressure of 16 people will be greater than 125 mmHg.
```
Given,
for the entire population
- $μ$ = 122
- $σ$ = 10
We need to calculate the probability that average BP of 16 people will be > 125
- Sample size n = 16
- and by the CLT, the mean of the sampling distribution $μ_{\bar x}$ = $μ$ = 122
- The standard deviation (standard error) will be
<br>
Code:
```python=
# SE = σ/sqrt(n)
sigma = 10/np.sqrt(16)
sigma
```
>Output:
```
2.5
```
So, the probability that the average BP of the 16 people is greater than 125 will be:
We know,
$P[X>125] = 1 - P[X < 125]$
> How can we find P[X < 125]?
by calculating "norm.cdf(zscore)"
Code:
```python=
# zscore = (X - mu)/σ
z_score = (125 - 122)/sigma
z_score
```
>Output:
```
1.2
```
Code:
```python=
# P[X>125]=1−P[X<125]
probability = 1 - norm.cdf(z_score)
probability
```
>Output:
```
0.11506967022170822
```
The probability that the average blood pressure of 16 people will be greater than 125 mmHg is 0.115
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/410/original/clt9.png?1700478231 height = 400 width = 400>
Let's solve some quizzes
---
title: Quiz 1
description:
duration: 60
card_type: quiz_card
---
# Question
Weekly toothpaste sales have a mean of 1000 and a standard deviation of 200. A sample of size 4 is taken.
What is the probability that the average weekly sales next month is more than 1110?
# Choices
- [ ] 0.29
- [x] 0.13
- [ ] 0.11
- [ ] 0.08
---
title: Quiz 1 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 1 explanation
Given,
For population data
- $μ$ = 1000
- $σ$ = 200
So, for samples,
- $\bar X$ = $μ$ = 1000
- The standard error will be
Code:
```python=
# SE = σ/sqrt(n)
sigma = 200/np.sqrt(4)
sigma
```
>Output:
```
100.0
```
So, the probability that the average weekly sales are greater than 1110 will be:
$P[X>1110] = 1 - P[X < 1110] $
> How can we find P[X < 1110]?
by calculating "norm.cdf(zscore)"
Code:
```python=
# zscore = (X - mu)/σ
z_score = (1110 - 1000)/sigma
z_score
```
>Output:
```
1.1
```
Code:
```python=
probability = 1 - norm.cdf(z_score)
probability
```
>Output:
```
0.13566606094638267
```
The probability that the average weekly sales next month is more than 1110 is 0.13.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/411/original/clt10.png?1700479311 height = 400 width = 450>
Next quiz
---
title: Quiz 2
description:
duration: 60
card_type: quiz_card
---
# Question
In an e-commerce website, the average purchase amount per customer is $80 with a standard deviation of $15.
If we randomly select a sample of 50 customers,
what is the probability that the average purchase amount in the sample will be less than $75?
# Choices
- [ ] 0.36
- [ ] 0.18
- [ ] 0.01
- [x] 0.009
---
title: Quiz 2 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 2 explanation
Given,
For population data
- $μ$ = 80
- $σ$ = 15
So, for samples with size 50
- $\bar X$ = $μ$ = 80
- The standard error will be
Code:
```python=
# SE = σ/sqrt(n)
sigma = 15/np.sqrt(50)
sigma
```
>Output:
```
2.1213203435596424
```
So, the probability that the average purchase amount will be less than $75 is
$P[X<75]$
> How can we find P[X < 75]?
by calculating "norm.cdf(zscore)"
Code:
```python=
# zscore = (X - mu)/σ
z_score = (75 - 80)/sigma
probability = norm.cdf(z_score)
probability
```
>Output:
```
0.009211062727049501
```
The probability that the average purchase amount in the sample will be less than $75 is 0.009
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/423/original/clt11.png?1700480332 height = 300 width = 500>
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/424/original/clt12.jpeg?1700480431 height = 400 width = 800>
Now, let's jump into the next important concept in statistics: the Confidence Interval.
---
title: Confidence Interval, Compute 95% Confidence interval, Examples
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>**Confidence Interval**</font> (5-7 mins)
In our probability distributions-2 lecture, we took an example of turtles to discuss point estimates. Let's recall it.
<font color='purple'>**Example:**</font>
We wanted to <font color='purple'>estimate the mean weight of a certain species of turtle in Florida by taking a single random sample **($S$)** of 50</font> turtles and using the sample mean to estimate the true population mean.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/425/original/clt13.png?1700480631 height = 200 width = 500>
But as we know there is one problem here,
**Problem:**
- The <font color='purple'>mean weight of turtles in the sample is not guaranteed to exactly match</font> the mean weight of turtles in the whole population.
**Solution:**
In order to capture this uncertainty, we will say that the population mean will lie in the <font color='purple'>**range of the sample mean (point estimates) $+/-$ some errors here and there**</font>.
So, whatever sample mean I will get from the sample, I will try to provide a range and the population mean will lie within that range.
This interval or range is called **Confidence Interval**.
<br>
In summary,
- A confidence interval <font color='purple'>**is the mean of your estimate plus and minus the variation in that estimate**</font>.
- This is the range of values you expect your estimate to fall within, at a certain level of confidence.
> <font color='purple'>**Q. What is confidence?**</font>
Confidence, in statistics, is another way to describe probability.
For example,
- If you construct a confidence interval with a 95% confidence level, <font color='purple'>you are confident that 95 out of 100 times the estimate will fall between the upper and lower values specified by the confidence interval</font>.
- Here, **$x_1$ is the lower bound and $x_2$ is the upper bound** of the interval.
<br>
### <font color='purple'>**Q. How to calculate the confidence interval?**</font>
There are 2 ways to find a confidence interval:
1. **Using CLT**
- This is for mean values. <font color='purple'>If we have mean as a statistic then we will calculate CI using the central limit theorem (CLT)</font>
2. **Using Bootstrapping**
- If you have a statistic other than the mean, like the <font color='purple'>median, then we are not allowed to use the CLT. We can use the bootstrap method to calculate confidence intervals in those scenarios</font>.
<br>
Let's calculate the confidence interval using CLT first,
- Using CLT, we will get these values, and the range between these values will be our confidence interval.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/058/006/original/sample.png?1701171647 height = 200 width = 400>
Here $\bar X_{S}$ is the mean of sample ($S$).
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/427/original/clt15.png?1700480759 height = 400 width = 400>
## <font color='blue'>Compute a 95% Confidence Interval</font> (7 mins)
To get a 95% confidence interval, we <font color='purple'>need to find the area that covers 95% of the data</font>
- Let's say we have 2 points around the mean, one on the left and one on the right. These points will be the lower and upper bounds.
- We can calculate <font color='purple'>the data points for these z scores (z1 and z2) which will be the confidence interval</font>.
- We know that 95% of the population lies between z1 and z2, so on the **left-hand side of z1 there is 2.5% of the population and on the right side of z2 there is 2.5% of the population**
<img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/743/original/Screenshot_2023-11-23_at_5.51.42_PM.png?1700742162 width=300>
So, to find the value below which a given percentage of the data falls, we will use the PPF.
<br>
> <font color='purple'>**Q. How do we find z1 and z2**</font>
- **$z1 = norm.ppf(0.025)$ as we have 2.5% data till z1**
Similarly, to calculate z2 we will use
- **$z2 = norm.ppf(1-0.025)$ as we have 2.5% data remaining after z2**
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/429/original/clt17.png?1700480912 height = 250 width = 400>
Code:
```python=
# z1 will be
z1 = norm.ppf(0.025)
z1
```
>Output:
```
-1.9599639845400545
```
Code:
```python=
# z2 will be
z2 = norm.ppf(1 - 0.025) # we can also use norm.ppf(0.975)
z2
```
>Output:
```
1.959963984540054
```
> <font color='purple'>Q. What is the formula for the z score?</font>
$\Large Z = \frac{{X} - μ}{σ}$
So,
<font color='purple'>$\Large X = μ + (Z * σ)$</font>
Where
- $X$ is individual data point
- $\mu$ is population mean
- $\sigma$ is population standard deviation
Note:
From this, we will get data points associated with the corresponding z score, which is Z in this case.
<br>
**Here we are dealing with Samples so**,
- We will use $\bar X$ which is sample mean.
- In the case of samples, we use the standard error, so $σ_{\bar x} = \frac {σ}{\sqrt n}$
The Z-score is a measure of how many standard deviations a data point (in this case, the sample mean) is from the population mean
- We need to find how many standard deviations away the sample mean is from the population mean
Z score will be:
$\Large Z = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}$
Where,
- $\bar X$ = sample mean
- $\mu$ = population mean
- $\sigma/\sqrt n$ = standard error
<br>
In the sample $S$ we are looking at above, <font color='purple'>**the standardized sample mean will lie between the z-scores z1 and z2, i.e. -1.96 and 1.96, with 95% probability**</font>.
Between these two points, we will have our 95% confidence interval.
$\Large Z_1 < \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}} < Z_2$ (equation 1)
- $\bar X_s$ represents mean of sample $S$
Now, if we take the left side: $Z_1 < \Large \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}}$
We will get: $\Large μ \ < \ \bar X_{s} - Z_1 * (\frac{σ}{\sqrt n})$ (equation 2)
Now, if we take the right side: $\Large \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}} < Z_2$
We will get: $\Large \bar X_{s} - Z_2 * (\frac{σ}{\sqrt n}) \ < \ μ$ (equation 3)
<br>
Now, combining equations 2 and 3, and noting that $Z_1 = -Z_2$ (so with $Z = Z_2 = 1.96$),
<font color='blue'>$\Large \bar X_{s} - Z * (\frac{σ}{\sqrt n}) \ < \ μ \ < \bar X_{s} + Z * (\frac{σ}{\sqrt n})$</font>
<br>
> **<font color='purple'>Q. So, what will be the range where original $μ$ is lying?</font>**
**Confidence Interval = <font color='purple'>$\bar X ± Z \left( \frac {σ}{\sqrt n}\right)$</font>**
OR
**<font color='purple'>$[\bar {X} - Z * (\frac{σ}{\sqrt n}) \ ,\ \bar {X} + Z * (\frac{σ}{\sqrt n})]$</font>**
where:
$\bar X$: sample mean
$Z$: is the z-score corresponding to the desired confidence level.
$σ$: population standard deviation
$n$: sample size
<br>
If you take **$(Z * \frac{σ}{\sqrt n})$**, it is referred to as the **margin of error** in the context of confidence intervals.
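As a quick illustration with the blood-pressure numbers from Example 1 earlier (σ = 10, n = 16), the 95% margin of error works out as follows (a small sketch):

```python
import numpy as np
from scipy.stats import norm

# Margin of error = Z * (sigma / sqrt(n))
z = norm.ppf(0.975)                       # ~1.96 for a 95% confidence level
margin_of_error = z * (10 / np.sqrt(16))  # sigma = 10, n = 16
print(margin_of_error)                    # ~4.9
```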
Now let's try to solve one example
### <font color='purple'>**Example on confidence interval**</font> (5 mins)
```
The mean height of a sample of 100 adults was found to be 65 inches,
with a standard deviation of 2.5 inches.
Compute 95% confidence interval
```
Solution:
Given,
Sample size "n" = 100
Sample mean "$\bar x$" = 65
SD = 2.5
First, let's calculate the standard error:
Code:
```python=
# sigma/sqrt(n)
std_error = 2.5/np.sqrt(100)
std_error
```
>Output:
```
0.25
```
Now, the values of Z for 95% confidence, which we calculated above, will be:
Code:
```python=
# z1 will be
z1 = norm.ppf(0.025)
z1
```
>Output:
```
-1.9599639845400545
```
Code:
```python=
# z2 will be
z2 = norm.ppf(1 - 0.025) # we can also use norm.ppf(0.975)
z2
```
>Output:
```
1.959963984540054
```
Now, how do we get the data points for z1 (on the left side) and z2 (on the right side)?
$x_1 = \bar x + z_1 * SE$ and,
$x_2 = \bar x + z_2 * SE$
Code:
```python=
x1 = 65 + z1 * std_error
x1
```
>Output:
```
64.51000900386498
```
Code:
```python=
x2 = 65 + z2 * std_error
x2
```
>Output:
```
65.48999099613502
```
So the range of 95% confidence interval --> [64.51, 65.48]
### <font color='orange'>Conclusion:</font>
We can claim that the population mean will lie between the value 64.51 and 65.48 with 95% confidence.
There is a function available that calculates this interval directly in a single call
- Using **norm.interval()**
You have to pass three arguments:
- **norm.interval(confidence, loc=0, scale=1)**
- confidence: how much confidence you want
- loc: pass the **mean** value here (by default it is 0)
- scale: pass the **standard error** here (by default it is 1)
Code:
```python=
norm.interval(0.95, loc=65, scale=std_error)
```
>Output:
```
(64.51000900386498, 65.48999099613502)
```
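To see what "95% confidence" actually means, here is a minimal simulation sketch (the population values 65 and 2.5 are assumed known here purely for illustration): if we repeatedly draw samples and build a 95% CI from each, about 95 out of 100 intervals should contain the true mean.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
mu, sigma, n = 65, 2.5, 100    # assumed "true" population values

trials, covered = 2000, 0
for _ in range(trials):
    sample = rng.normal(mu, sigma, size=n)
    # 95% CI around this sample's mean, using the known population sigma
    lo, hi = norm.interval(0.95, loc=sample.mean(), scale=sigma / np.sqrt(n))
    if lo < mu < hi:
        covered += 1

print(covered / trials)   # close to 0.95
```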
Let's solve one more example to reiterate the concept.
## <font color='purple'>Example 2 on confidence interval:</font> (3 mins)
```
The sample mean recovery time of 100 patients after taking a drug was seen to be 10.5 days with a standard deviation of 2 days
Find the 95% confidence interval of the true mean.
```
**Approach 1**:
Given,
- sample size "n" = 100
- sample mean = 10.5
- standard deviation = 2
The approach will be the same.
Code:
```python=
std_error = 2/np.sqrt(100)
std_error
```
>Output:
```
0.2
```
Code:
```python=
z1 = norm.ppf(0.025)
x1 = 10.5 + z1 * std_error
x1
```
>Output:
```
10.108007203091988
```
Code:
```python=
z2 = norm.ppf(0.975)
x2 = 10.5 + z2 * std_error
x2
```
>Output:
```
10.89199279690801
```
**Approach 2:**
Code:
```python=
norm.interval(0.95, loc = 10.5, scale=0.2)
```
>Output:
```
(10.10800720309199, 10.89199279690801)
```
So the range of 95% confidence interval --> [10.10, 10.89]
Just for practice, let's solve one more example
<font color='red'>***Instructor note:***</font>
Solve example 3 if time permits otherwise ignore it.
## <font color='purple'>Example 3 on confidence interval:</font> (5 mins)
```
The mean Youtube watch time of a sample of 100 students was found to be 3.5 hours,
with a standard deviation of 1 hour.
Construct a 90% confidence interval for the true watch time.
```
Solution:
Given,
- sample size = 100
- sample mean = 3.5
- standard deviation = 1
- confidence interval = 90%
Now, let's try a different approach here.
We can simply define one function and calculate all the needed values inside it,
so that it'll be easier to compute intervals just by calling the function.
Code:
```python=
# define function calc_CI
def calc_CI(mean, std, N, confidence):
    # calculate the standard error
    std_err = std / np.sqrt(N)
    print("SE", std_err)
    # the fraction left outside the interval on each side
    # (90% confidence leaves 10%, i.e. 5% on each side -> 0.05)
    slice = (1 - (confidence / 100)) / 2
    print("Slice", slice)
    # calculate z1 and z2
    z1 = norm.ppf(slice)
    z2 = norm.ppf(1 - slice)
    print("z1 z2", z1, z2)
    # calculate the end points
    x1 = mean + (z1 * std_err)
    x2 = mean + (z2 * std_err)
    return x1, x2
```
Code:
```python=
calc_CI(3.5, 1, 100, 90)
```
>Output:
```
SE 0.1
Slice 0.04999999999999999
z1 z2 -1.6448536269514729 1.6448536269514722
(3.3355146373048528, 3.6644853626951472)
```
<font color='orange'>**Conclusion**</font>
So that was all about computing confidence intervals using the CLT. In general, for other statistics like the median or a specific percentile, you cannot use the CLT.
For that, we have a technique called "Bootstrapping"
Let's have a look at it.
---
title: Quiz 3
description:
duration: 60
card_type: quiz_card
---
# Question
From a sample of 80 endangered birds,
the average wingspan was found to be 45 cm, with a population standard deviation of 10 cm.
What is the correct confidence interval for the mean wingspan of the entire population with 90% confidence?
# Choices
- [x] [43.16, 46.83]
- [ ] [40.21, 43.45]
- [ ] [45.67, 48.92]
- [ ] [46.95, 50.01]
---
title: Quiz 3 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 3 explanation
Given,
- sample size "n" = 80
- sample mean = 45
- standard deviation = 10
Code:
```python=
std_error = 10/np.sqrt(80)
std_error
```
>Output:
```
1.118033988749895
```
Code:
```python=
z1 = norm.ppf(0.05)
x1 = 45 + z1 * std_error
x1
```
>Output:
```
43.16099773854971
```
Code:
```python=
z2 = norm.ppf(0.95)
x2 = 45 + z2 * std_error
x2
```
>Output:
```
46.83900226145029
```
Confidence Interval will be -> [43.16, 46.83]
**Approach 2:**
Code:
```python=
norm.interval(0.90, loc = 45, scale=std_error)
```
>Output:
```
(43.16099773854971, 46.83900226145029)
```
---
title: Quiz 4
description:
duration: 60
card_type: quiz_card
---
# Question
In a software project, the team estimates bug resolution time at an average of 6 hours
with a standard deviation of 2 hours.
To estimate the mean resolution time with 99% confidence, the project manager samples 25 resolved bugs.
What is the correct confidence interval?
# Choices
- [ ] [3.25, 6.45]
- [x] [4.96, 7.03]
- [ ] [5.55, 8.63]
- [ ] [6.74, 9.42]
---
title: Quiz 4 explanation
description:
duration: 5400
card_type: cue_card
---
### Quiz 4 explanation
Given,
- sample size "n" = 25
- sample mean = 6
- standard deviation = 2
Code:
```python=
std_error = 2/np.sqrt(25)
std_error
```
>Output:
```
0.4
```
Code:
```python=
z1 = norm.ppf(0.005)
x1 = 6 + z1 * std_error
x1
```
>Output:
```
4.969668278580439
```
Code:
```python=
z2 = norm.ppf(0.995)
x2 = 6 + z2 * std_error
x2
```
>Output:
```
7.03033172141956
```
Confidence Interval will be -> [4.96, 7.03]
**Approach 2**:
Code:
```python=
norm.interval(0.99, loc = 6, scale=std_error)
```
>Output:
```
(4.96966827858044, 7.03033172141956)
```
---
title: Confidence interval using bootstrap
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Confidence interval using Bootstrap</font> (10-12 mins)
- Suppose you have very little data and you want to compute a confidence interval for some other statistic, like the median; the most common technique then is bootstrapping.
Let's start with one example:
### <font color='purple'>Example: Salary Survey</font>
Imagine we want to analyse and learn about data scientist salaries at Google.
We have 2 surveys,
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/433/original/clt19.png?1700482313 height = 270 width = 700>
We don't have a population mean here, but we have 2 samples, so we can calculate the sample means.
Code:
```python=
survey_1 = [35, 36, 33, 37, 34, 35]
np.mean(survey_1)
```
>Output:
```
35.0
```
Code:
```python=
survey_2 = [20, 37, 17, 50, 53, 33]
np.mean(survey_2)
```
>Output:
```
35.0
```
> <font color='purple'>Q. Which of the two surveys is better for estimating the population parameter or which survey is more reliable?</font>
By observing the samples, we can see that the values in survey 1 are much closer to the sample mean, so survey 1 will be more reliable for estimation.
> Now, can we simulate more and more sets of samples like the ones above?

For this, statisticians came up with a technique that has a reasonable amount of accuracy.
### <font color='purple'>Sample With Replacement</font>
The idea is to take your survey and create more samples from that same survey by sampling **with replacement**.
Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples.
Code:
```python=
n = 6
# sampling with replacement (the default for np.random.choice)
bootstrapped_samples = np.random.choice(survey_1, size=n, replace=True)
bootstrapped_samples
```
>Output:
```
array([37, 35, 35, 34, 35, 33])
```
Here we get an array of length 6, where each element is one of the original data points from survey 1, chosen at random.
Every time we run this code we get a different array, which means the mean of this newly constructed array will also differ from run to run.
Code:
```python=
np.mean(bootstrapped_samples)
```
>Output:
```
34.833333333333336
```
Code:
```python=
bootstrapped_samples = np.random.choice(survey_2, size=n)
np.mean(bootstrapped_samples)
```
>Output:
```
37.166666666666664
```
Let's observe the difference between survey_1 and survey_2 by running the code several times.
- In survey 1, the bootstrapped mean always stays close to 35.
- In survey 2, it sometimes comes out near 35, sometimes 40, sometimes 39, so there is much more variance.

So we will go with survey_1, as its lower variance gives us higher confidence in the estimate.
**Let's draw a histogram of this survey**
Code:
```python=
bootstrapped_means_survey_1 = []
for reps in range(10000):
    bootstrapped_samples = np.random.choice(survey_1, size=n)
    bootstrapped_mean = np.mean(bootstrapped_samples)
    bootstrapped_means_survey_1.append(bootstrapped_mean)
sns.histplot(bootstrapped_means_survey_1)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/435/original/clt20.png?1700482649 height = 400 width = 550>
Code:
```python=
bootstrapped_means_survey_2 = []
for reps in range(10000):
    bootstrapped_samples = np.random.choice(survey_2, size=n)
    bootstrapped_mean = np.mean(bootstrapped_samples)  # replace np.mean with any statistic (median, percentile, ...)
    bootstrapped_means_survey_2.append(bootstrapped_mean)
sns.histplot(bootstrapped_means_survey_2)
```
>Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/436/original/clt21.png?1700482738 height = 400 width = 550>
> Let's compare these two histograms, what can we observe?
- **In survey_2, the bootstrapped means range roughly from 20 to 50.**
- **In survey_1, they range from about 33 to 36, which is very close to the sample mean.**

So, <font color='purple'>survey 1 is more accurate than survey 2</font>.
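The visual difference between the two histograms can also be quantified: the standard deviation of the bootstrapped means is far larger for survey 2. A quick check (a sketch; we use a fixed seed so the comparison is reproducible):

```python=
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
survey_1 = [35, 36, 33, 37, 34, 35]
survey_2 = [20, 37, 17, 50, 53, 33]

def bootstrap_means(data, n_reps=10000):
    # resample with replacement and record the mean each time
    return [np.mean(rng.choice(data, size=len(data), replace=True))
            for _ in range(n_reps)]

spread_1 = np.std(bootstrap_means(survey_1))
spread_2 = np.std(bootstrap_means(survey_2))
# spread_1 comes out far smaller than spread_2,
# confirming that survey 1 is the more reliable sample
```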
### <font color='purple'>How to compute the confidence interval?</font>
We can calculate percentiles of the **bootstrapped means**:
- The 2.5th percentile gives the lower bound (x1)
- The 97.5th percentile gives the upper bound (x2)

Then, the 95% confidence interval will be [x1, x2].
Code:
```python=
len(bootstrapped_means_survey_1)
```
>Output:
```
10000
```
Code:
```python=
x1 = np.percentile(bootstrapped_means_survey_1, 2.5)
x1
```
>Output:
```
34.0
```
Code:
```python=
x2 = np.percentile(bootstrapped_means_survey_1, 97.5)
x2
```
>Output:
```
36.0
```
95% of the bootstrapped means lie between 34 and 36, so
**Confidence Interval: $(x1, x2) = (34.0, 36.0)$**
As this process is random, the CI will change slightly every time we run it.
Code:
```python=
len(bootstrapped_means_survey_2)
```
>Output:
```
10000
```
Code:
```python=
x1 = np.percentile(bootstrapped_means_survey_2, 2.5)
x1
```
>Output:
```
24.0
```
Code:
```python=
x2 = np.percentile(bootstrapped_means_survey_2, 97.5)
x2
```
>Output:
```
46.0
```
Here too, the
**Confidence Interval = $(x1, x2) = (24.0, 46.0)$**, which is much wider than the one for survey 1.
This is how we can calculate a Confidence Interval using bootstrapping.
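The whole procedure above fits into one small function, and swapping in `np.median` (or any other statistic) is a one-argument change. A sketch; `bootstrap_ci` is our own name, not a library function:

```python=
import numpy as np

def bootstrap_ci(data, stat_fn=np.mean, n_reps=10000, confidence=0.95, seed=0):
    """Percentile-bootstrap confidence interval for any statistic."""
    rng = np.random.default_rng(seed)
    # resample with replacement n_reps times, recording the statistic each time
    stats = [stat_fn(rng.choice(data, size=len(data), replace=True))
             for _ in range(n_reps)]
    alpha = (1 - confidence) / 2
    lower = np.percentile(stats, 100 * alpha)        # e.g. 2.5th percentile
    upper = np.percentile(stats, 100 * (1 - alpha))  # e.g. 97.5th percentile
    return lower, upper

survey_1 = [35, 36, 33, 37, 34, 35]
lo, hi = bootstrap_ci(survey_1, stat_fn=np.median)  # CI for the median
```

This is exactly why bootstrapping is so useful: unlike the CLT approach, it needs no formula for the standard error of the statistic.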
---
title: Conclusion
description:
duration: 5400
card_type: cue_card
---
### <font color='blue'>Conclusion</font>
With this, we are done with today's lecture.
The CLT and confidence intervals are very important concepts in statistics, and we hope you now understand them well.
Keep revising the topics.