Piyush Ranjan
---
title: Content
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>Content</font>

- **Central Limit Theorem**
- **Confidence Intervals**
    - Using CLT
    - Using Bootstrapping
        - With replacement

---
title: Central Limit Theorem
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>Introduction</font> (2 mins)

Greetings everyone,

In our previous lecture, we explored the normal distribution and discussed how many real-world phenomena tend to follow this bell-shaped curve.

In today's lecture, we are going to cover two very important topics in statistics: the Central Limit Theorem (CLT) and the Confidence Interval (CI).

Let's get started.

### <font color='blue'>Central Limit Theorem</font> (12-15 mins)

In the Probability Distributions-2 lecture, we covered everything related to samples, sample statistics, and sampling distributions. The <font color='purple'>central limit theorem relies on the concept of a sampling distribution</font>, which is the probability distribution of a statistic computed over a large number of samples taken from a population.

> <font color='purple'>**Let's recall the sampling distribution**</font>

Draw random samples from a population, calculate the mean of each sample, and repeat. The collection of these sample means forms a sampling distribution.

<br>

> <font color='purple'>Imagine a scenario</font>

```
Imagine you have a big jar filled with jellybeans.
```

Each jellybean represents a piece of data in your population. The colors and sizes of the jellybeans can differ, representing the diversity in your data.

Now, grab a handful of jellybeans from the jar and calculate the average size of those jellybeans. Put them back, shake the jar, grab another handful, and calculate the average again. Repeat this process many times.

According to the Central Limit Theorem:

1. **No matter how the jellybeans are distributed originally**, as you take more and more samples and calculate the average each time, the distribution of those averages will start to look like a bell curve.
2. **The more handfuls of jellybeans you take, the closer the distribution gets to a perfect bell curve**, even if the original distribution of jellybeans was not bell-shaped at all.

This is powerful because it allows us to make assumptions and predictions about the averages of large samples, even when we don't know much about the original population.

<br>

**So,** the central limit theorem states that <font color='blue'>"the mean of a random sample will resemble the population mean ever more closely as the sample size increases, and its distribution will approximate a normal distribution regardless of the shape of the population distribution"</font>

- This means that if you take sufficiently large samples from a population, the sample means **will be normally distributed**, even if the population isn't normally distributed.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/395/original/clt1.png?1700474602 height = 300 width = 400>

Similarly,

- If we have a <font color='purple'>sufficiently large number of independent and identically distributed (i.i.d.) random variables and sum them up, the distribution of the sum will tend to be approximately normal, regardless of the shape of the original distribution.</font> This is called the **CLT for sums of random variables**.

<br>

In summary, both versions state that as you **sum or average a sufficiently large number of i.i.d. random variables**, the resulting distribution tends to approach normality. The "n ≥ 30" rule of thumb is often mentioned as a guideline for the classical CLT, but the actual requirement varies with the characteristics of the population distribution.
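The jellybean experiment can be checked numerically. Below is a minimal sketch (not part of the lecture notebook; the exponential population and the seed are illustrative choices): even though the population is strongly skewed, the distribution of the sample means centers on the population mean with spread close to $σ/\sqrt n$.

```python=
import numpy as np

rng = np.random.default_rng(42)

# A strongly skewed, non-normal population: exponential with mean 1
population = rng.exponential(scale=1.0, size=100_000)

# Draw 5,000 samples of size 50 (with replacement) and record each sample's mean
sample_means = rng.choice(population, size=(5_000, 50)).mean(axis=1)

# The sampling distribution centers on the population mean ...
print(sample_means.mean())

# ... and its spread is close to sigma / sqrt(n)
print(sample_means.std(), population.std() / np.sqrt(50))
```

Plotting `sample_means` with a histogram shows the bell shape, even though a histogram of `population` itself is heavily right-skewed.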
Sample size ($n$) plays a vital role in the context of the Central Limit Theorem (CLT), and its impact can be summarized as follows:

### <font color='blue'>Sample Size and Normality:</font>

A **larger sample size leads** to a sampling distribution that more closely resembles a **normal distribution**.

- The CLT <font color='purple'>tends to work well when the **sample size $(n)$ is sufficiently large**, typically taken as **$n≥30$**</font>. However, this is not a strict rule but a rough guideline. For <font color='purple'>**moderately skewed distributions, even smaller sample sizes can sometimes be sufficient**</font>.
- For small $n$ $(n<30)$, the CLT can still be useful, especially if the population distribution is close to normal. However, <font color='purple'>for smaller sample sizes, the normality assumption becomes more critical</font>.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/396/original/clt2.png?1700474761 height = 330 width = 600>

Here,

- We started by taking samples of size 2 **(n=2)** from the population, calculated their means, and repeated this many times to form a distribution. <font color='purple'>This distribution already tends to look like a normal distribution.</font>
- When we **increased the sample size, the distribution resembled a normal distribution even more closely**.

<br>

### <font color='purple'>**Sample Size and Standard Deviation**</font>:

As you can see in the image above, sample size also affects the spread of the sampling distribution.

- A **smaller n (n = 2) gives a high standard deviation**, because sample means are less precise estimates of the population mean, resulting in more spread.
- A **larger n (n >= 30) gives a low standard deviation**, since sample means become more precise estimates of the population mean, leading to less spread.
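This relationship between sample size and spread can be verified directly. A small sketch (synthetic data; the standard-normal population and the seed are illustrative assumptions, not the lecture dataset) comparing the standard deviation of sample means at n = 2 and n = 30:

```python=
import numpy as np

rng = np.random.default_rng(7)

def sd_of_sample_means(n, trials=20_000):
    # std dev of the means of `trials` samples of size n,
    # drawn from a standard normal population (sigma = 1)
    means = rng.standard_normal((trials, n)).mean(axis=1)
    return means.std()

sd_2 = sd_of_sample_means(2)     # close to 1/sqrt(2)  ~ 0.71
sd_30 = sd_of_sample_means(30)   # close to 1/sqrt(30) ~ 0.18
print(sd_2, sd_30)
```

Both values track the standard-error formula $σ/\sqrt n$, and the n = 30 spread is markedly smaller.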
<br>

<font color='orange'>**Conclusion**</font>

The **sample size not only influences how closely the sampling distribution approximates a normal curve but also affects the spread, or precision, of the sample means**. As the sample size increases, the CLT becomes more applicable and the standard deviation decreases, making the estimates of population parameters more reliable.

To summarize the CLT,

### <font color='purple'>**Conditions of the CLT**</font>

To apply the central limit theorem, three conditions must be met:

1. **Randomization**:
    - Data should be randomly sampled, ensuring every population member has an equal chance of being included.
2. **Independence**:
    - Sample values should be independent: one observation's occurrence should not affect another.
    - This is commonly met in probability sampling methods, which select observations independently.
3. **Large Sample Condition**:
    - A sample size of 30 or more is generally considered "sufficiently large."
    - This threshold can vary slightly based on the shape of the population distribution.

These conditions ensure the applicability of the central limit theorem.

<br>

---
title: Applications of CLT, Summarizing CLT and Example
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>**Application of CLT on a real-life dataset**</font> (12-15 mins)

Let's apply the central limit theorem to a real distribution to see whether the distribution of the sample means follows a normal distribution.

We'll use the height dataset that we have been working with for the last few lectures. As we know, it is already normally distributed, so let's take the sample means and see if they follow a normal distribution.
Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo -O weight-height.csv
```

> Output:
```
weight-height.csv 100%[===================>] 418.09K --.-KB/s in 0.003s
2024-01-18 10:03:08 (134 MB/s) - ‘weight-height.csv’ saved [428120/428120]
```

Code:
```python=
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import norm
```

Code:
```python=
df_hw = pd.read_csv('weight-height.csv')
df_hw.head()
```

> Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/404/original/clt5.png?1700476492 height = 200 width = 300>

We are going to work with the Height column, so let's store it in a separate variable.

Code:
```python=
df_height = df_hw["Height"]
sns.histplot(df_height)
```

> Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/405/original/clt6.png?1700476596 height = 300 width = 400>

Code:
```python=
# mean of the entire population
mu = df_height.mean()
mu
```

> Output:
```
66.36755975482124
```

Code:
```python=
# standard deviation of the entire population
sigma = df_height.std()
sigma
```

> Output:
```
3.8475281207732293
```

We will now randomly select a sample of five heights and determine its average.

### <font color='purple'>Sample size = 5</font>

Code:
```python=
df_height.sample(5)
```

> Output:
```
5608    62.069334
2318    76.806344
5647    61.016914
1682    69.902756
882     68.890095
Name: Height, dtype: float64
```

Code:
```python=
np.mean(df_height.sample(5))
```

> Output:
```
67.8041732628697
```

<font color='orange'>**Observation**</font>
- Each run of the code above generates a different sample of 5 values, and the sample mean changes with it.

Let's repeat this process 10,000 times, giving us the sample means of 10,000 distinct samples (size = 5). We will plot the distribution of these 10,000 sample means to see whether they follow a normal distribution.
Code:
```python=
sample_5 = [np.mean(df_height.sample(5)) for i in range(10000)]
sns.histplot(sample_5, kde=True)
```

> Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/408/original/clt7.png?1700477436 height = 350 width = 500>

Code:
```python=
np.mean(sample_5)
```

> Output:
```
66.36291130707713
```

Code:
```python=
np.std(sample_5)
```

> Output:
```
1.7009715727280839
```

<font color='orange'>**Observation**</font>
- The distribution of these 10,000 sample means is approximately normal, and most of the values lie between 62 and 72.
- In some cases a sample contains only short people, which is why we see some values between 60 and 62.
- Similarly, in some cases a sample contains only tall people, which is why we see some values between 70 and 72.

### <font color='purple'>Sample size = 20</font>

> <font color='purple'>**Q. What would happen if we increase the size of our sample?**</font>

We saw earlier in the lecture that as we increase the sample size, the spread of the sample means decreases.
- This means that as we increase the sample size, the sample mean comes closer and closer to the population mean.

Let's try this out.
- Let's increase the sample size to 20.
- We will again perform 10,000 iterations and plot the distribution of the sample means.

Code:
```python=
sample_20 = [np.mean(df_height.sample(20)) for i in range(10000)]
sns.histplot(sample_20, kde=True)
```

> Output:
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/409/original/clt8.png?1700477618 height = 350 width = 500>

Code:
```python=
np.mean(sample_20)
```

> Output:
```
66.36437874021179
```

Code:
```python=
np.std(sample_20)
```

> Output:
```
0.8745559612271192
```

<font color='orange'>**Observation**</font>
- We can clearly see that as <font color='purple'>**we increase the sample size from 5 to 20, the sample means come closer to the actual mean and the standard deviation shrinks**</font>.
- Previously the majority of the values were between 62 and 72; now the spread of the data has decreased and most values lie between 64 and 69.

So we found that by <font color='purple'>increasing the sample size, the variability (SD) of the sampling distribution decreases and the sample means tend to be much closer to the population mean</font>.

### <font color='purple'>Comparison of statistics</font>
Let's compare the statistics of the population data and the sample data to observe some patterns.

Code:
```python=
# population mean
mu = df_height.mean()

# population SD
sigma = df_height.std()

# mean of the sampling distribution with sample size = 5
mu_5 = np.mean(sample_5)

# SD of the sampling distribution with sample size = 5
sigma_5 = np.std(sample_5)

# mean of the sampling distribution with sample size = 20
mu_20 = np.mean(sample_20)

# SD of the sampling distribution with sample size = 20
sigma_20 = np.std(sample_20)
```

Code:
```python=
print(mu, mu_5, mu_20)
print(sigma, sigma_5, sigma_20)
```

> Output:
```
66.36755975482124 66.36291130707713 66.36437874021179
3.8475281207732293 1.7009715727280839 0.8745559612271192
```

<font color='orange'>**Observation**</font>

Here,

**Population Statistics:**
- $μ$ = population mean
- $σ$ = population standard deviation

**Sample Statistics:**
- $μ_5$ = mean of sample means (from samples of size 5)
- $σ_5$ = standard deviation of the sample means (from samples of size 5)
- $μ_{20}$ = mean of sample means (from samples of size 20)
- $σ_{20}$ = standard deviation of the sample means (from samples of size 20)

<br>

We can clearly observe that:

1. <font color='purple'>As we increase the sample size, the SD of the sample means decreases</font>. The **SD of the sampling distribution ($σ_{\bar x}$) is less than the population SD ($σ$):**

    $\Large σ > σ_5 > σ_{20}$

    This aligns with the CLT, which states that the standard deviation of the sampling distribution ($σ_{\bar x}$) is the population standard deviation ($σ$) divided by the square root of the sample size:
    - $\Large σ_{\bar x} = \frac{σ}{\sqrt n}$

    We have already studied this quantity; it is known as the **Standard Error**. It indicates how far a sample mean typically is from the actual mean.

2. By looking at the means, we can observe that the mean of the sampling distribution is equal to the mean of the population:
    - $\Large μ_{\bar x} = μ$

### <font color='blue'>**Summarizing CLT**</font> (3 mins)

> <font color='purple'>How can we represent this mathematically?</font>

We can describe the sampling distribution of the mean using this notation:

<font color='purple'>**$\Large {\bar X} \backsim \Large N(μ \ , \ \frac{σ}{\sqrt n})$**</font>

Where:
- $\bar X$ is the sampling distribution of the sample means
- $\backsim$ means "follows the distribution"
- $N$ is the normal distribution
- $µ$ is the mean of the population
- $σ$ is the standard deviation of the population
- $n$ is the sample size

Now, let's solve some examples.

### <font color='purple'>Example 1:</font> (5 mins)

```
Systolic blood pressure of a group of people is known to have an average of 122 mmHg and a standard deviation of 10 mmHg. Calculate the probability that the average blood pressure of 16 people will be greater than 125 mmHg.
```

Given, for the entire population:
- $μ$ = 122
- $σ$ = 10

We need to calculate the probability that the average BP of 16 people will be > 125.
- Sample size n = 16
- By the CLT, the mean of $\bar X$ is $μ$ = 122
- The standard deviation (standard error) will be:

<br>

Code:
```python=
# SE = σ/sqrt(n)
sigma = 10/np.sqrt(16)
sigma
```

> Output:
```
2.5
```

So, the probability that the sample of 16 people has an average BP greater than 125 will be:

$P[X>125] = 1 - P[X < 125]$

> How can we find P[X < 125]?
by calculating `norm.cdf(z_score)`.

Code:
```python=
# zscore = (X - mu)/σ
z_score = (125 - 122)/sigma
z_score
```

> Output:
```
1.2
```

Code:
```python=
# P[X>125] = 1 − P[X<125]
probability = 1 - norm.cdf(z_score)
probability
```

> Output:
```
0.11506967022170822
```

The probability that the average blood pressure of 16 people will be greater than 125 mmHg is 0.115.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/410/original/clt9.png?1700478231 height = 400 width = 400>

Let's solve some quizzes.

---
title: Quiz 1
description:
duration: 60
card_type: quiz_card
---

# Question
Weekly toothpaste sales have a mean of 1000 and a standard deviation of 200. A sample of size 4 is taken. What is the probability that the average weekly sales next month is more than 1110?

# Choices
- [ ] 0.29
- [x] 0.13
- [ ] 0.11
- [ ] 0.08

---
title: Quiz 1 explanation
description:
duration: 5400
card_type: cue_card
---

### Quiz 1 explanation

Given, for the population data:
- $μ$ = 1000
- $σ$ = 200

So, for samples of size 4:
- $\bar X$ = $μ$ = 1000
- The standard error will be:

Code:
```python=
# SE = σ/sqrt(n)
sigma = 200/np.sqrt(4)
sigma
```

> Output:
```
100.0
```

So, the probability that the average weekly sales will be greater than 1110 will be:

$P[X>1110] = 1 - P[X < 1110]$

> How can we find P[X < 1110]?

by calculating `norm.cdf(z_score)`.

Code:
```python=
# zscore = (X - mu)/σ
z_score = (1110 - 1000)/sigma
z_score
```

> Output:
```
1.1
```

Code:
```python=
probability = 1 - norm.cdf(z_score)
probability
```

> Output:
```
0.13566606094638267
```

The probability that the average weekly sales next month is more than 1110 is 0.13.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/411/original/clt10.png?1700479311 height = 400 width = 450>

Next quiz

---
title: Quiz 2
description:
duration: 60
card_type: quiz_card
---

# Question
In an e-commerce website, the average purchase amount per customer is $80 with a standard deviation of $15.
If we randomly select a sample of 50 customers, what is the probability that the average purchase amount in the sample will be less than $75?

# Choices
- [ ] 0.36
- [ ] 0.18
- [ ] 0.01
- [x] 0.009

---
title: Quiz 2 explanation
description:
duration: 5400
card_type: cue_card
---

### Quiz 2 explanation

Given, for the population data:
- $μ$ = 80
- $σ$ = 15

So, for samples of size 50:
- $\bar X$ = $μ$ = 80
- The standard error will be:

Code:
```python=
# SE = σ/sqrt(n)
sigma = 15/np.sqrt(50)
sigma
```

> Output:
```
2.1213203435596424
```

So, the probability that the average purchase amount will be less than $75 will be $P[X<75]$.

> How can we find P[X < 75]?

by calculating `norm.cdf(z_score)`.

Code:
```python=
# zscore = (X - mu)/σ
z_score = (75 - 80)/sigma
probability = norm.cdf(z_score)
probability
```

> Output:
```
0.009211062727049501
```

The probability that the average purchase amount in the sample will be less than $75 is 0.009.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/423/original/clt11.png?1700480332 height = 300 width = 500>

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/424/original/clt12.jpeg?1700480431 height = 400 width = 800>

Now, let's jump into the next important concept in statistics: the Confidence Interval.

---
title: Confidence Interval, Compute 95% Confidence interval, Examples
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>**Confidence Interval**</font> (5-7 mins)

In our Probability Distributions-2 lecture, we took an example of turtles to discuss point estimates. Let's recall it.

<font color='purple'>**Example:**</font>

We wanted to <font color='purple'>estimate the mean weight of a certain species of turtle in Florida by taking a single random sample **($S$)** of 50</font> turtles and using the sample mean to estimate the true population mean.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/425/original/clt13.png?1700480631 height = 200 width = 500>

But as we know, there is one problem here.

**Problem:**
- The <font color='purple'>mean weight of the turtles in the sample is not guaranteed to exactly match</font> the mean weight of the turtles in the whole population.

**Solution:**

To capture this uncertainty, we say that the population mean will lie in the <font color='purple'>**range of the sample mean (point estimate) $+/-$ some error on either side**</font>.

So, whatever sample mean we get from the sample, we provide a range, and the population mean should lie within that range. This interval or range is called the **Confidence Interval**.

<br>

In summary,
- A confidence interval <font color='purple'>**is the mean of your estimate plus and minus the variation in that estimate</font>**.
- This is the range of values you expect your estimate to fall within, at a certain level of confidence.

> <font color='purple'>**Q. What is confidence?**</font>

Confidence, in statistics, is another way to describe probability. For example,
- If you construct a confidence interval with a 95% confidence level, <font color='purple'>you are confident that 95 out of 100 times the estimate will fall between the upper and lower values specified by the confidence interval</font>.
- **$x_1$ is the lower bound and $x_2$ is the upper bound**.

<br>

### <font color='purple'>**Q. How to calculate the confidence interval?**</font>

There are 2 ways to find a confidence interval:

1. **Using CLT**
    - This is for means. <font color='purple'>If our statistic is the mean, we calculate the CI using the central limit theorem (CLT)</font>.
2. **Using Bootstrapping**
    - If our statistic is something other than the mean, like the <font color='purple'>median, we cannot use the CLT. We can use the bootstrap method to calculate confidence intervals in those scenarios</font>.
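As a preview of the second approach, here is a minimal bootstrap sketch (an illustration, not the lecture code; the turtle weights are synthetic and the seed is arbitrary): resample the data with replacement many times, compute the statistic (here the median) on each resample, and take the 2.5th and 97.5th percentiles of those statistics as an approximate 95% CI.

```python=
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observed turtle weights (lbs)
weights = rng.normal(loc=300, scale=18, size=50)

# Bootstrap: resample WITH replacement, recompute the median each time
boot_medians = [np.median(rng.choice(weights, size=len(weights), replace=True))
                for _ in range(10_000)]

# Approximate 95% CI = middle 95% of the bootstrap distribution
ci_low, ci_high = np.percentile(boot_medians, [2.5, 97.5])
print(ci_low, ci_high)
```

Note that no normality assumption is needed here, which is why bootstrapping works for medians and other percentiles where the CLT-based formula does not apply.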
<br>

Let's calculate the confidence interval using the CLT first.
- Using the CLT, we will get these values, and the range between them will be our confidence interval.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/058/006/original/sample.png?1701171647 height = 200 width = 400>

Here $\bar X_{S}$ is the mean of sample ($S$).

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/427/original/clt15.png?1700480759 height = 400 width = 400>

## <font color='blue'>Compute a 95% Confidence Interval</font> (7 mins)

To get a 95% confidence interval, we <font color='purple'>need to find the range that covers the middle 95% of the data</font>.

- Say we have 2 points around the mean, one on the left and one on the right. These points will be the lower and upper bounds.
- We can calculate <font color='purple'>the data points corresponding to these z-scores (z1 and z2), which will give us the confidence interval</font>.
- We know that 95% of the population lies between z1 and z2, so on the **left-hand side of z1 there is 2.5% of the population, and on the right side of z2 there is 2.5% of the population**.

<img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/743/original/Screenshot_2023-11-23_at_5.51.42_PM.png?1700742162 width=300>

So, to find the value below which a given percentage of the data falls, we will use the PPF.

<br>

> <font color='purple'>**Q. How do we find z1 and z2?**</font>

- **$z1 = norm.ppf(0.025)$, as we have 2.5% of the data up to z1**

Similarly, to calculate z2 we will use:
- **$z2 = norm.ppf(1-0.025)$, as we have 2.5% of the data remaining after z2**

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/429/original/clt17.png?1700480912 height = 250 width = 400>

Code:
```python=
# z1 will be
z1 = norm.ppf(0.025)
z1
```

> Output:
```
-1.9599639845400545
```

Code:
```python=
# z2 will be
z2 = norm.ppf(1 - 0.025)  # we can also use norm.ppf(0.975)
z2
```

> Output:
```
1.959963984540054
```

> <font color='purple'>Q. What is the formula for the z-score?</font>

$\Large Z = \frac{{X} - μ}{σ}$

So, <font color='purple'>$\Large X = μ + (Z * σ)$</font>

Where
- $X$ is an individual data point
- $\mu$ is the population mean
- $\sigma$ is the population standard deviation

Note: From this, we get the data point associated with the corresponding z-score $Z$.

<br>

**Here we are dealing with samples, so:**
- We use $\bar X$, the sample mean.
- In the case of samples we use the standard error, so $σ$ is replaced by $\frac {σ}{\sqrt n}$.

The z-score is a measure of how many standard deviations a data point (in this case, the sample mean) is from the population mean.
- We need to find how many standard errors away the sample mean is from the population mean.

The z-score will be:

$\Large Z = \frac{\bar X - \mu}{\frac{\sigma}{\sqrt n}}$

Where,
- $\bar X$ = sample mean
- $\mu$ = population mean
- $\sigma/\sqrt n$ = standard error

<br>

For the sample $S$ we are looking at above, with 95% confidence <font color='purple'>**the z-score of the sample mean lies between z1 and z2, i.e. -1.96 and 1.96**</font>. Between these two points we will have our 95% confidence interval.

$\Large Z_1 < \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}} < Z_2$ (equation 1)

- $\bar X_s$ represents the mean of sample $S$

Now, rearranging the left side, $Z_1 < \Large \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}}$, we get:

$\Large μ \ < \ \bar X_{s} - Z_1 * (\frac{σ}{\sqrt n})$ (equation 2)

Rearranging the right side, $\Large \frac{\bar X_{s} - \mu}{\frac {\sigma}{\sqrt n}} < Z_2$, we get:

$\Large \bar X_{s} - Z_2 * (\frac{σ}{\sqrt n}) \ < \ μ$ (equation 3)

<br>

Now, combining equations 2 and 3, and using the symmetry $Z_1 = -Z_2$, the range for $\mu$ is:

<font color='blue'>$\Large \bar X_{s} - Z_2 * (\frac{σ}{\sqrt n}) \ < \ μ \ < \bar X_{s} + Z_2 * (\frac{σ}{\sqrt n})$</font>

<br>

> **<font color='purple'>Q. So, what will be the range in which the original $μ$ lies?</font>**

**Confidence Interval = <font color='purple'>$\bar X ± Z \left( \frac {σ}{\sqrt n}\right)$</font>**

OR

**<font color='purple'>$[\bar {X} - Z * (\frac{σ}{\sqrt n}) \ ,\ \bar {X} + Z * (\frac{σ}{\sqrt n})]$</font>**

where:
- $\bar X$: sample mean
- $Z$: the z-score corresponding to the desired confidence level
- $σ$: population standard deviation
- $n$: sample size

<br>

The quantity **$(Z * \frac{σ}{\sqrt n})$** is referred to as the **margin of error** in the context of confidence intervals.

Now let's try to solve one example.

### <font color='purple'>**Example on confidence interval**</font> (5 mins)

```
The mean height of a sample of 100 adults was found to be 65 inches, with a standard deviation of 2.5 inches. Compute the 95% confidence interval.
```

Solution:

Given,
- Sample size "n" = 100
- Sample mean "$\bar x$" = 65
- SD = 2.5

First, let's calculate the standard error:

Code:
```python=
# sigma/sqrt(n)
std_error = 2.5/np.sqrt(100)
std_error
```

> Output:
```
0.25
```

Now, the values of Z for 95% confidence, which we calculated above, are:

Code:
```python=
# z1 will be
z1 = norm.ppf(0.025)
z1
```

> Output:
```
-1.9599639845400545
```

Code:
```python=
# z2 will be
z2 = norm.ppf(1 - 0.025)  # we can also use norm.ppf(0.975)
z2
```

> Output:
```
1.959963984540054
```

Now, to get the data points for z1 (on the left side) and z2 (on the right side):

$X_1 = \bar x + Z_1 * \frac{σ}{\sqrt n}$ and $X_2 = \bar x + Z_2 * \frac{σ}{\sqrt n}$

Code:
```python=
x1 = 65 + z1 * std_error
x1
```

> Output:
```
64.51000900386498
```

Code:
```python=
x2 = 65 + z2 * std_error
x2
```

> Output:
```
65.48999099613502
```

So the 95% confidence interval is [64.51, 65.48].

### <font color='orange'>Conclusion:</font>

We can claim, with 95% confidence, that the population mean lies between 64.51 and 65.48.
There is a function that calculates this interval directly, in a single call:
- **norm.interval()**

You pass three arguments:
- **norm.interval(confidence, loc=0, scale=1)**
- confidence: the confidence level you want
- loc: the **mean** (by default 0)
- scale: the **standard error** (by default 1)

Code:
```python=
norm.interval(0.95, loc=65, scale=std_error)
```

> Output:
```
(64.51000900386498, 65.48999099613502)
```

Let's solve one more example to reinforce the concept.

## <font color='purple'>Example 2 on confidence interval:</font> (3 mins)

```
The sample mean recovery time of 100 patients after taking a drug was seen to be 10.5 days with a standard deviation of 2 days. Find the 95% confidence interval of the true mean.
```

**Approach 1**:

Given,
- sample size "n" = 100
- sample mean = 10.5
- standard deviation = 2

The approach is the same as before.

Code:
```python=
std_error = 2/np.sqrt(100)
std_error
```

> Output:
```
0.2
```

Code:
```python=
z1 = norm.ppf(0.025)
x1 = 10.5 + z1 * std_error
x1
```

> Output:
```
10.108007203091988
```

Code:
```python=
z2 = norm.ppf(0.975)
x2 = 10.5 + z2 * std_error
x2
```

> Output:
```
10.89199279690801
```

**Approach 2:**

Code:
```python=
norm.interval(0.95, loc = 10.5, scale=0.2)
```

> Output:
```
(10.10800720309199, 10.89199279690801)
```

So the 95% confidence interval is [10.10, 10.89].

Just for practice, let's solve one more example.

<font color='red'>***Instructor note:***</font> Solve example 3 if time permits; otherwise skip it.

## <font color='purple'>Example 3 on confidence interval:</font> (5 mins)

```
The mean YouTube watch time of a sample of 100 students was found to be 3.5 hours, with a standard deviation of 1 hour. Construct a 90% confidence interval for the true watch time.
```

Solution:

Given,
- sample size = 100
- sample mean = 3.5
- standard deviation = 1
- confidence level = 90%

Now, let's try a different approach here.
We can simply define one function that calculates all the needed values, so that computing an interval is just a matter of calling the function.

Code:
```python=
# define function calc_CI
def calc_CI(mean, std, N, confidence):
    # let's calculate the standard error
    std_err = std / np.sqrt(N)
    print("SE ", std_err)

    # calculate the remaining fraction beyond the interval
    # (for 90% confidence each tail holds 5%, so 0.05)
    slice = (1 - (confidence/100))/2
    print("Slice ", slice)

    # let's calculate z1 and z2
    z1 = norm.ppf(slice)
    z2 = norm.ppf(1-slice)
    print("z1 z2", z1, z2)

    # calculate end points
    x1 = mean + (z1 * std_err)
    x2 = mean + (z2 * std_err)
    return x1, x2
```

Code:
```python=
calc_CI(3.5, 1, 100, 90)
```
>Output:
```
SE  0.1
Slice  0.04999999999999999
z1 z2 -1.6448536269514729 1.6448536269514722
(3.3355146373048528, 3.6644853626951472)
```

<font color='orange'>**Conclusion**</font>

So that was all about computing confidence intervals using the CLT. In general, for other statistics like the median or a specific percentile, you cannot use the CLT. For that, we have a technique called "Bootstrapping". Let's have a look at it.

---
title: Quiz 3
description:
duration: 60
card_type: quiz_card
---

# Question

From a sample of 80 endangered birds, the average wingspan was found to be 45 cm, with a population standard deviation of 10 cm. What is the correct confidence interval for the mean wingspan of the entire population with 90% confidence?
# Choices
- [x] [43.16, 46.83]
- [ ] [40.21, 43.45]
- [ ] [45.67, 48.92]
- [ ] [46.95, 50.01]

---
title: Quiz 3 explanation
description:
duration: 5400
card_type: cue_card
---

### Quiz 3 explanation

Given,
- sample size "n" = 80
- sample mean = 45
- standard deviation = 10

For 90% confidence, each tail holds 5%, so we use the 5th and 95th percentiles.

Code:
```python=
std_error = 10/np.sqrt(80)
std_error
```
>Output:
```
1.118033988749895
```

Code:
```python=
z1 = norm.ppf(0.05)
x1 = 45 + z1 * std_error
x1
```
>Output:
```
43.16099773854971
```

Code:
```python=
z2 = norm.ppf(0.95)
x2 = 45 + z2 * std_error
x2
```
>Output:
```
46.83900226145029
```

The confidence interval will be -> [43.16, 46.83]

**Approach 2:**

Code:
```python=
norm.interval(0.90, loc=45, scale=std_error)
```
>Output:
```
(43.16099773854971, 46.83900226145029)
```

---
title: Quiz 4
description:
duration: 60
card_type: quiz_card
---

# Question

In a software project, the team estimates bug resolution time at an average of 6 hours with a standard deviation of 2 hours. To estimate the mean resolution time with 99% confidence, the project manager samples 25 resolved bugs. What is the correct confidence interval?
# Choices
- [ ] [3.25, 6.45]
- [x] [4.96, 7.03]
- [ ] [5.55, 8.63]
- [ ] [6.74, 9.42]

---
title: Quiz 4 explanation
description:
duration: 5400
card_type: cue_card
---

### Quiz 4 explanation

Given,
- sample size "n" = 25
- sample mean = 6
- standard deviation = 2

For 99% confidence, each tail holds 0.5%, so we use the 0.5th and 99.5th percentiles.

Code:
```python=
std_error = 2/np.sqrt(25)
std_error
```
>Output:
```
0.4
```

Code:
```python=
z1 = norm.ppf(0.005)
x1 = 6 + z1 * std_error
x1
```
>Output:
```
4.969668278580439
```

Code:
```python=
z2 = norm.ppf(0.995)
x2 = 6 + z2 * std_error
x2
```
>Output:
```
7.03033172141956
```

The confidence interval will be -> [4.96, 7.03]

**Approach 2**:

Code:
```python=
norm.interval(0.99, loc=6, scale=std_error)
```
>Output:
```
(4.96966827858044, 7.03033172141956)
```

---
title: Confidence interval using bootstrap
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>Confidence interval using Bootstrap</font> (10-12 mins)

- Suppose you have very little data and you want to compute a confidence interval for some other statistic, like the median. The most common technique for this is bootstrapping.

Let's start with one example:

### <font color='purple'>Example: Salary Survey</font>

Imagine we want to analyse and learn about data scientist salaries at Google. We have 2 surveys,

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/433/original/clt19.png?1700482313 height = 270 width = 700>

We don't have a population mean here, but we have 2 samples, so we can calculate the sample means.

Code:
```python=
survey_1 = [35, 36, 33, 37, 34, 35]
np.mean(survey_1)
```
>Output:
```
35.0
```

Code:
```python=
survey_2 = [20, 37, 17, 50, 53, 33]
np.mean(survey_2)
```
>Output:
```
35.0
```

> <font color='purple'>Q.
Which of the two surveys is better for estimating the population parameter, i.e. which survey is more reliable?</font>

By observing the samples, we find that the values in survey 1 are much closer to the mean, so survey 1 will be more accurate for estimation.

> Now, can we simulate more and more sets of samples like the ones above?

For this, statisticians came up with a technique that has a reasonable amount of accuracy.

### <font color='purple'>Sample With Replacement</font>

The idea is to take your survey and create more samples from that same survey using sampling **with replacement**. Bootstrapping is a statistical procedure that resamples a single dataset to create many simulated samples.

Code:
```python=
n = 6
bootstrapped_samples = np.random.choice(survey_1, size=n)  # replace=True by default
bootstrapped_samples
```
>Output:
```
array([37, 35, 35, 34, 35, 33])
```

Here we get an array of length 6 where each element is one of the original data points from survey 1, chosen at random. Every time we run this code we get a different array, which means that the mean of this newly constructed array will also be different.

Code:
```python=
np.mean(bootstrapped_samples)
```
>Output:
```
34.833333333333336
```

Code:
```python=
bootstrapped_samples = np.random.choice(survey_2, size=n)
np.mean(bootstrapped_samples)
```
>Output:
```
37.166666666666664
```

Let's observe the difference between survey_1 and survey_2 by running the code several times.

- We can observe that in survey 1, the mean value is always close to 35. But in survey 2, it sometimes comes out around 35, sometimes 40, sometimes 39, so there is more variance in survey 2.

So we will go with survey_1, as it gives us higher confidence because its variance is lower.
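For reference, SciPy (version 1.7+) also ships a ready-made helper, `scipy.stats.bootstrap`, which automates exactly this resample-with-replacement procedure; a minimal sketch on survey 1, where the `percentile` method corresponds to reading off percentiles of the bootstrapped means:

Code:
```python=
import numpy as np
from scipy.stats import bootstrap

survey_1 = [35, 36, 33, 37, 34, 35]

# resample survey_1 with replacement 10000 times, take the mean of
# each resample, and read off the 2.5th / 97.5th percentiles
res = bootstrap((survey_1,), np.mean,
                n_resamples=10000,
                confidence_level=0.95,
                method='percentile',
                random_state=42)

print(res.confidence_interval)  # roughly (34.0, 36.0)
```

The default `method='BCa'` applies a bias correction; `'percentile'` is chosen here because it matches the plain percentile-of-bootstrapped-means approach used in these notes.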
**Let's draw a histogram of this survey**

Code:
```python=
bootstrapped_means_survey_1 = []
for reps in range(10000):
    bootstrapped_samples = np.random.choice(survey_1, size=n)
    bootstrapped_mean = np.mean(bootstrapped_samples)
    bootstrapped_means_survey_1.append(bootstrapped_mean)

sns.histplot(bootstrapped_means_survey_1)
```
>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/435/original/clt20.png?1700482649 height = 400 width = 550>

Code:
```python=
bootstrapped_means_survey_2 = []
for reps in range(10000):
    bootstrapped_samples = np.random.choice(survey_2, size=n)
    bootstrapped_mean = np.mean(bootstrapped_samples)  # Replace by any statistic (median, percentile)
    bootstrapped_means_survey_2.append(bootstrapped_mean)

sns.histplot(bootstrapped_means_survey_2)
```
>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/057/436/original/clt21.png?1700482738 height = 400 width = 550>

> Let's compare these two histograms. What can we observe?
- **We can observe that in survey_2, the interval or range is somewhere between 20-50, while in survey_1 it is between 33-36, which is very close to the actual mean.**

So, <font color='purple'>survey 1 is more accurate than survey 2</font>.

### <font color='purple'>How to compute the confidence interval?</font>

We can calculate percentiles of the **bootstrapped means**:
- the 2.5th percentile gives the lower bound (x1)
- the 97.5th percentile gives the upper bound (x2)

Then, the confidence interval will be [x1, x2].

Code:
```python=
len(bootstrapped_means_survey_1)
```
>Output:
```
10000
```

Code:
```python=
x1 = np.percentile(bootstrapped_means_survey_1, 2.5)
x1
```
>Output:
```
34.0
```

Code:
```python=
x2 = np.percentile(bootstrapped_means_survey_1, 97.5)
x2
```
>Output:
```
36.0
```

95% of the bootstrapped means lie between 34 & 36, so the **Confidence Interval is $(x1, x2)$**.

As this process is random, there will be a slight change in the CI every time.

Code:
```python=
len(bootstrapped_means_survey_2)
```
>Output:
```
10000
```

Code:
```python=
x1 = np.percentile(bootstrapped_means_survey_2, 2.5)
x1
```
>Output:
```
24.0
```

Code:
```python=
x2 = np.percentile(bootstrapped_means_survey_2, 97.5)
x2
```
>Output:
```
46.0
```

Here also the CI will be **Confidence Interval = $(x1, x2)$**.

This is how we can calculate a Confidence Interval using bootstrap.

---
title: Conclusion
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>Conclusion</font>

With this, we are done with today's lecture. The CLT and confidence intervals are very important concepts in statistics, and we hope you understood these topics well. Keep revising them.
