Piyush Ranjan
---
title: Content
description:
duration: 5400
card_type: cue_card
---

### Content

- **Introduction**
- **Descriptive Statistics**
    - Measures of Central Tendency
        - Mean
        - Median
        - Mode
    - Measures of Variability
        - Range
        - Variance
        - Standard Deviation
- **Inferential Statistics**
- **Weighted Average**
- **Inter Quartile Range**
    - Quartile
    - Percentile
    - Box Plot
- **IQR implementation on real life dataset**
- **Random Variables**
    - Discrete RV
    - Continuous RV
- **Distribution Functions**
    - Histogram
    - Probability Mass Function (PMF)
    - Probability Density Function (PDF)
    - Cumulative Distribution Function (CDF)

---
title: Descriptive statistics and Inferential statistics
description:
duration: 5400
card_type: cue_card
---

### Introduction (2 mins)

Greetings, everyone.

Until this lecture, we have only been talking about probability. We have not talked about statistics at all.

Probability and statistics go hand in hand.

Statistics is a very broad subject in itself, but we are going to look at it in the context of data science only.

So let's start our lecture.

There are 2 types of statistics:

### <font color='blue'>**1. Descriptive Statistics** (3 mins)</font>

The word descriptive means "**DESCRIBE**".

Descriptive statistics involve summarizing and presenting data in a meaningful way, providing a clear and concise overview of a dataset.

**Example:**

<font color='purple'>You are driving a car and you look at your dashboard.</font>

- The speedometer shows that the speed of your car at the moment is 65 km/hr. It is simply describing the speed.
- The speedometer only describes an event, namely that a vehicle is moving at a certain speed, so it is an example of descriptive statistics.

### <font color='blue'>**2. Inferential Statistics** (5 mins)</font>

Inferential statistics, on the other hand, involve making predictions, inferences, or drawing conclusions about a larger population based on a sample of data.
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/088/original/odo.jpeg?1699354146 width = "600">

<font color='purple'>Continuing the example:</font>

- The car's speedometer displays the current speed but **doesn't predict your arrival time**, because that depends on various factors like distance and traffic.
- Google Maps, on the other hand, estimates the arrival time based on data and assumptions, but the estimate is not always 100% accurate.
- This prediction is an example of inferential statistics, as it draws conclusions from real-world scenarios.
- It is trying to **"infer"** something from the data, so it is inferential statistics.

These predictions are not random; there are a lot of hypotheses involved here.

We are going to learn about this in detail in our next module, which is entirely about inferential statistics.

<font color='orange'>**Conclusion:**</font>

- Descriptive statistics <font color='purple'>summarizes data</font>
- Inferential statistics <font color='purple'>draws conclusions based on the observations.</font>

These two are the essential branches of statistics.

Let's explore descriptive statistics.

---
title: Measures of Central Tendency
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>**Measures of Central Tendency**</font>

In statistics, we often use measures to understand and describe a set of data.

Three common measures are:

1. <font color='purple'>**Mean**</font>
    - The mean is the average of all data points.
2. <font color='purple'>**Median**</font>
    - The median is the middle value when the data is sorted.
3. <font color='purple'>**Mode**</font>
    - The mode is the observation with the highest frequency.

Let's go deeper with the help of an example now.

> <font color='purple'>Q.
If you are looking for a new job and you want to check how much companies are paying for that role, where will you check?</font>

At Glassdoor or levels.fyi.

Based on this, let's look at an example.

#### <font color='purple'>**Example-1: Data Scientist's Salaries** (7-8 mins)</font>

```
Suppose you are looking for a data scientist job at FAANG.
The sample of salaries is taken and recorded as [30L, 30L, 35L, 40L, 40L].
What will be the salary you would be expecting?
```

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/089/original/sal.png?1699354428 height = "200" width = "600">

**Approach:**

### <font color='purple'>**Mean** (3 mins)</font>

> <font color='purple'>What will be our thought process here, and on what basis will we negotiate?</font>

Our initial approach would likely be to calculate the mean, or average, of these salaries.

This is how we can calculate it:

- Mean will be <font color='purple'>(30 + 30 + 35 + 40 + 40)/5 = **35 lakhs**</font>
- **$\Large μ = \frac{∑X}{N}$**

Where,
- $μ$ = population mean
- $∑X$ = sum of each value in the population
- $N$ = number of values in the population

So, the mean salary in this sample is 35 lakhs. We might negotiate our expected salary around this figure.

Suppose a **new candidate enters** the picture and **his salary is 3 crores** (300 lakhs).

- **New mean** will become = <font color='purple'>(30 + 30 + 35 + 40 + 40 + 300)/6 ≈ **79 lakhs**</font>

The mean salary <font color='orange'>dramatically increased</font> to about 79 lakhs because of the new candidate's exceptionally high salary.

> Does this mean that you should now negotiate for 79L?

We can say that this **new candidate is an outlier** in the data, which is affecting the mean value.
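The effect of the outlier on the mean can be checked with a few lines of Python. This is only a minimal sketch: it uses the standard-library `statistics` module, and the variable names are illustrative rather than part of the lecture's notebook.

```python
from statistics import mean

# Salaries in lakhs, taken from Example-1
salaries = [30, 30, 35, 40, 40]
mean_before = mean(salaries)

# Add the new candidate earning 3 crores (300 lakhs)
salaries_with_outlier = salaries + [300]
mean_after = mean(salaries_with_outlier)

print(mean_before)           # 35
print(round(mean_after, 2))  # 79.17
```

A single extreme value more than doubles the mean, even though five of the six salaries are unchanged.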
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/090/original/out.png?1699354715 height = "200" width = "500">

### <font color='purple'>**Median** (5 mins)</font>

Here comes the concept of the **"Median"**, which we can use to measure central tendency instead of the mean.

<br>

Before the new candidate joined, the <font color='purple'>Median was 35L</font>.

This means that 35L was the center value when the salaries were sorted in ascending order:

- **Original Salaries:** [30L, 30L, 35L, 40L, 40L]
- **Sorted:** [30L, 30L, 35L, 40L, 40L]
- **Median** = $35L$ (the middle value)

<br>

After the new candidate with a significantly higher salary arrived (300L), the new Median became 37.5 lakhs:

- **New Salaries**: [30L, 30L, 35L, 40L, 40L, 300L]
- **Sorted**: [30L, 30L, 35L, 40L, 40L, 300L]
- **New Median** = (35L + 40L) / 2 = 75L / 2 = $37.5L$.

So, it would be suitable to negotiate at 37.5 lakhs.

There is a <font color='purple'>**huge difference between the new mean and the new median**.</font>

<font color='orange'>**Conclusion:**</font>

- Outliers dramatically affect the mean, but the median remains more robust and closer to the typical value of the dataset.
- This shows that the **median is more robust to outliers**.

Let's look at some quick examples to get a better understanding of this.

#### <font color='purple'>**Example-2**: (2 mins)</font>

```
What is the median of the following numbers:
10, 20, 30, 40, 50, 60, 70
```

- The number of observations is odd and the numbers are sorted.
- The median would be: $(\frac{n+1}{2})th$ observation's value = (7+1)/2 = 4th observation.

Therefore, the median will be 40.

#### <font color='purple'>**Example-3:** (2 mins)</font>

```
What is the median of these numbers:
10, 20, 30, 40, 50, 60, 70, 80
```

- The number of observations is even and the numbers are sorted.
- The median would be: $\frac{(\frac{n}{2})th \, observation \,+\, (\frac{n}{2}+1)th \, observation}{2}$ = (4th obs + 5th obs)/2 = (40+50)/2 = 45

There is one more term often used in statistics, i.e. the "MODE". Let's look into it.

### <font color='purple'>**Mode** (2 mins)</font>

It is the observation with the highest frequency.
- It is the **most frequently occurring data point** in the dataset.

<font color='purple'>Suppose the data points are recorded as</font>
- [90, 90, 90, 80, 90, 70, 95, 90]
- The mode will be **90**.
- Remember, if no data point repeats, then we can say that there is no mode.

There can also be **more than one mode** in the dataset.

<font color='purple'>Suppose the data points are recorded as</font>
- [2, 2, 3, 3, 4]
- We can call this **Bi-modal** with 2 and 3 as the modes

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/092/original/mod.png?1699354981 height = "150" width = "600">

Now, let's solve some quizzes.

---
title: Quiz-1
description:
duration: 60
card_type: quiz_card
---

# Question

```
There are 4 people whose average age is 24.
We know the age of three people: 20, 22, and 28.
What is the median age of these 4 people?
```

# Choices

- [ ] 22
- [x] 24
- [ ] 25
- [ ] 26

---
title: Quiz 1 Explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 1

Let the age of the fourth person be "x".

(20 + 22 + 28 + x)/4 = 24

(70 + x)/4 = 24

70 + x = 96

x = 26

The age of the fourth person is 26.

Sort the data in ascending order:

Numbers sorted in ascending order: 20, 22, 26, 28

- Here N is even. So, the median will be $\frac{(\frac{n}{2})th \, observation \,+\, (\frac{n}{2}+1)th \, observation}{2}$
    - (2nd obs + 3rd obs)/2

Median = (22 + 26)/2 = 24

**Conclusion**

The median age of these 4 people is 24 years.

---
title: Quiz-2
description:
duration: 60
card_type: quiz_card
---

# Question

```
In a survey about favourite animal:
30 people said cat, 40 people said dog, 20 people said cow.
What is the mode of favourite animals in this data?
```

# Choices

- [ ] Cat
- [x] Dog
- [ ] Cow

---
title: Quiz 2 Explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 2

The dog is liked by the maximum number of people. So, "dog" occurs with the highest frequency in the dataset.

**Conclusion**: The mode in this data is Dog.

---
title: Weighted Average
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>**Weighted Average: Reflecting Importance**</font>

In many situations, <font color='purple'>all data points are not equally important</font> when calculating an average.

In such cases, we use a concept called Weighted Average.

- In a weighted average, each data point is <font color='purple'>assigned a weight</font> that represents its **importance** or relevance.
- We **multiply each data point by its corresponding weight**, **sum these products**, and **then divide by the total weight**.
- It gives a more accurate representation of the "center" or "average" because it considers the significance of each data point.

Let's understand this with an example.

#### <font color='purple'>**Example: Calculating GPA** (3-5 mins)</font>

In real life, a common application of weighted average is calculating Grade Point Average (GPA) for students.

Consider a student's course list for a semester:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/093/original/stu2.png?1699355669 height=200 width = 350>

- Each course has a specific credit value, which represents the weight of that course in the GPA calculation. For example, Math carries 3 credits, History carries 4 credits, and so on.
- Each course is assigned a grade, representing the student's performance in that course.

<br>

<font color='purple'>To calculate the GPA:</font>

1. Calculate the weighted score for each course by multiplying the credit by the numerical grade.
2. Sum up all the weighted scores.
3. Divide the total weighted score by the total credits.

<font color='purple'>Weighted Average will be:</font>

- **For Math:** $3(CREDIT) * 5(GRADE) = 15$
- **For History:** $4 * 4 = 16$
- **For Chemistry:** $3 * 5 = 15$
- **For English:** $2 * 3 = 6$

Total weighted score = 15 + 16 + 15 + 6 = 52, and total credits = 3 + 4 + 3 + 2 = 12.

$GPA = \frac{Total \ Weighted \ Score}{Total \ Credits}$

$= \frac{52}{12} \approx 4.33$

<font color='orange'>**Conclusion**:</font>

So, the student's GPA for this semester is about 4.33.

It's a weighted average that considers both the grades and the credit hours, reflecting the student's academic performance more accurately.

Let's solve some quizzes.

---
title: Quiz-3
description:
duration: 60
card_type: quiz_card
---

# Question

```
A survey of number of pets in a town saw that - 30% people had 0 pets, 40% had 1 pet, 10% had 2 pets, 20% had 3 pets.
What is the average number of pets?
```

# Choices

- [ ] 1
- [x] 1.2
- [ ] 1.5
- [ ] 2.1

---
title: Quiz 3 explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 3

Suppose there are 100 people.
- 30 people have 0 pets.
- 40 people have 1 pet.
- 10 people have 2 pets.
- 20 people have 3 pets.

Average number of pets = $\frac{{(30 \cdot 0) + (40 \cdot 1) + (10 \cdot 2) + (20 \cdot 3)}}{{100}}$ = 1.2

This is a simple example of a "weighted average".

Here we are not taking the plain average of the values 0, 1, 2 and 3. Instead, we multiply each value by its weight (how many times it occurs) and then divide by the total weight, i.e. 100.

---
title: Quiz-4
description:
duration: 60
card_type: quiz_card
---

# Question

```
The mean weight of 2 children in a family is 40 Kgs.
If the weight of the mother is included, the mean becomes 45.
What is the weight of the mother?
```

# Choices

- [ ] 45
- [ ] 50
- [x] 55
- [ ] 60
- [ ] 65

---
title: Quiz 4 explanation
description:
duration: 5400
card_type: cue_card
---

### Explanation for Quiz 4

- Let the weights of the children be x and y, in kgs.

The mean weight of the 2 children is 40 kgs, i.e.

$\frac{(x + y)}{2} = 40$

$x + y = 80$

- Let the weight of the mother be denoted by 'm'.

If we include the weight of the mother along with the weights of the children, the new mean will be 45 kgs:

$\frac{(x + y + m)}{3} = 45$

$x + y + m = 135$

From this we get m = 135 - 80 = 55.

The weight of the mother is 55 kg.

---
title: Measures of variability
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>**Measures of Variability**</font>

Three common measures of variability are

1. Range
2. Variance
3. Standard Deviation

Let's discuss Range.

#### <font color='blue'>**Range** (3 mins)</font>

Range is nothing but <font color='purple'>**Maximum value - Minimum value**</font>

Suppose the salaries of some employees in a company are: [30, 30, 35, 40, 40]
- Here, the range of the salaries will be **40-30 = 10**

It describes the overall spread of the data: the difference between the maximum and minimum values is 10.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/096/original/rang.png?1699356360 height = "250" width = "500">

> <font color='purple'>Q. What will happen if there is an "Outlier" in the data?</font>

- Let the salaries be: [30, 30, 35, 40, 40, 300]

<font color='purple'>New range will be **300 - 30 = 270**</font>
- As you can see, one outlier can completely distort the range of the dataset.

We can conclude that, <font color='purple'>**like the mean, the range of the data is not robust to outliers**</font>

To solve this issue, statisticians came up with a metric called the "**Inter Quartile Range**". Let's look into it.

#### <font color='blue'>**Inter Quartile Range** (10 mins)</font>

IQR is a metric that provides a robust way to measure the spread of a dataset.
- The IQR is the range between the first quartile (Q1) and the third quartile (Q3) of a dataset.
- That is, <font color='purple'>**$IQR = Q3 - Q1$**</font>

> <font color='purple'>What are quartiles?</font>

#### <font color='purple'>**Quartiles**</font>

Quartiles are the values which <font color='purple'>divide the dataset into four equal parts.</font>

There are three quartiles, **Q1, Q2, and Q3**.

- <font color='purple'>Q1 represents the 25th percentile</font>, meaning that 25% of the data falls below this value.
- <font color='purple'>Q2 is the median and represents the 50th percentile</font>, dividing the data into two equal halves.
- <font color='purple'>Q3 represents the 75th percentile</font>, meaning that 75% of the data falls below this value.

Suppose we have this data with us,

> **What does each value represent here?**

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/098/original/gro.png?1699356596 height = "200" width = "500">

- 0.25 represents Q1 or the 25th percentile:
    - It means that 25% of people are shorter than this lady
- 0.5 represents Q2 or the 50th percentile:
    - It means that 50%, or half, of the people in the dataset are shorter than this guy
- 0.75 represents Q3 or the 75th percentile:
    - It means that 75% of people are shorter than this guy
- Maximum or 1:
    - It means that everyone else is shorter than this lady, i.e. she is the tallest person in the dataset.

> <font color='purple'>Q. What is a percentile?</font>

#### <font color='purple'>**Percentile**</font>

A percentile is a value that tells us that some **"p%" of the observations are less than that value**.
- Let's say the value occurring at the 50th percentile is 68. We can say that 50% of the data is less than 68.

One more example:

> <font color='purple'>Suppose you scored 99 percentile in your 10th boards, what does this mean?</font>

- It indicates that **99% of students scored fewer marks than you**.
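As a quick check of these definitions, the quartiles and the IQR can be computed for the sorted numbers from Example-3. This is a minimal sketch using the standard-library `statistics.quantiles`; the choice of `method="inclusive"` (linear interpolation between neighbouring observations) is an assumption made here, and the variable names are illustrative.

```python
from statistics import quantiles

# Sorted sample reused from Example-3
data = [10, 20, 30, 40, 50, 60, 70, 80]

# n=4 asks for the three cut points Q1, Q2 and Q3.
# method="inclusive" is an assumption: it interpolates linearly
# between neighbouring observations.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Inter Quartile Range: spread of the middle 50% of the data
iqr = q3 - q1

print(q1, q2, q3)  # 27.5 45.0 62.5
print(iqr)         # 35.0
```

Q2 matches the median of 45 found in Example-3. Different tools use slightly different interpolation rules, so quartile values for small datasets may differ a little from one library to another.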
Now, if you want a graphical representation of a dataset's summary statistics, including the median, quartiles, and potential outliers, the box plot comes into the picture.

#### <font color='purple'>**Box Plot**</font>

It provides a visual way to understand the distribution and spread of data.

**Box:**
- The box itself represents the interquartile range (IQR). It is bounded by the lower quartile (Q1) at one end and the upper quartile (Q3) at the other.
- The length of the box is determined by the range between Q1 and Q3.

**Line (Median):**
- Inside the box, a line or bar is drawn that represents the median, which is the middle value of the dataset when it's ordered.

**Whiskers:**
- Two lines, or "whiskers," extend from the box in both directions.
- The value of the lower whisker is determined by Q1 - 1.5(IQR). It is the minimum value of the range.
- The value of the upper whisker is determined by Q3 + 1.5(IQR). It is the maximum value of the range.

**Outliers:**
- Data points that fall outside the whiskers are considered outliers.

**Example**

We have sorted data; the box plot of the data will look like this:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/099/original/so34.png?1699356783 height = "300" width = "550">

---

***Interactive tool for Box Plot***

https://www.desmos.com/calculator/h9icuu58wn

How to use it: Just enter the data in the list A on the left-hand side, then run the Box plot command, also on the left-hand side. It will generate a box plot based on the data you entered.

### <font color='purple'>**Variance** (5-7 mins)</font>

Now let's have a look at another measure of variability, variance:

Variance <font color='purple'>measures the spread or dispersion of the values</font> of a random variable around its mean. It quantifies how much **individual values deviate from the mean**.

- A **higher variance** indicates that the <font color='purple'>values are more spread out from the mean</font>.
- While a **lower variance** suggests that the <font color='purple'>values are closer to the mean</font>.

We can plot a histogram to visualise the spread or distribution of the data.

Code:
```python=
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
```

Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo -O weight-height.csv
```

>Output:

```
--2024-01-18 09:28:53--  https://drive.google.com/uc?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo
Resolving drive.google.com (drive.google.com)... 142.250.128.139, 142.250.128.102, 142.250.128.113, ...
Connecting to drive.google.com (drive.google.com)|142.250.128.139|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://drive.usercontent.google.com/download?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo [following]
--2024-01-18 09:28:53--  https://drive.usercontent.google.com/download?id=1Mrt008vkE4nVb1zE4f06_rtq70QPfkIo
Resolving drive.usercontent.google.com (drive.usercontent.google.com)... 173.194.193.132, 2607:f8b0:4001:c0f::84
Connecting to drive.usercontent.google.com (drive.usercontent.google.com)|173.194.193.132|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 428120 (418K) [application/octet-stream]
Saving to: ‘weight-height.csv’

weight-height.csv   100%[===================>] 418.09K  --.-KB/s    in 0.003s

2024-01-18 09:28:53 (118 MB/s) - ‘weight-height.csv’ saved [428120/428120]
```

Code:
```python=
# Load the height/weight dataset and preview the first rows
df_hw = pd.read_csv("weight-height.csv")
df_hw.head()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/108/original/hj8.png?1699361102 height = 200 width = 300>

Code:
```python=
# Summary statistics (count, mean, std, min, quartiles, max) per column
df_hw.describe()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/109/original/go09.png?1699361185 height = 300 width = 280>

We are going to work on a single column for now.

Code:
```python=
df_height = df_hw["Height"]
df_height.head()
```

>Output:

```
0    73.847017
1    68.781904
2    74.110105
3    71.730978
4    69.881796
Name: Height, dtype: float64
```

Code:
```python=
# Histogram of heights to visualise their distribution
sns.histplot(df_height)
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/117/original/nsi.png?1699363504 height = 300 width = 400>

#### <font color='purple'>**Example: The Height Guessing Game**</font>

Let's turn learning about variance into a fun game.

> <font color='purple'>Imagine you're tasked with guessing the height of a randomly chosen person from a dataset, and you have to place your bet</font>.

Here's how we will calculate and understand variance through this game.

<br>

<font color='purple'>**Round 1: Your First Guess**</font>

- You make your first guess, **predicting that the person's height is 68 inches**. As it turns out, the **actual height is 64 inches**.
- Your <font color='purple'>error for this round is 4 inches</font>.

<font color='purple'>**Round 2: A Different Guess**</font>

- For the second round, you **guess 63 inches**, but it's revealed that the person's **actual height is 68 inches**.
- In this case, <font color='purple'>your error is 5 inches</font>.

<br>

Now, here's the critical question:

> <font color='purple'>Q.
When playing this game, would you ever choose an extreme guess like 55 inches or 75 inches?</font>

**No**. It's highly unlikely that someone's height would be that far from the average.

<font color='purple'>**Minimizing Error:**</font>

So, what's the best strategy to minimize your errors in this game?

It's pretty clear that we want to <font color='purple'>choose a height that's as close as possible to the average height</font> of the people in the dataset.

<br>

<font color='purple'>**Understanding Variance:**</font>

This notion of <font color='purple'>minimizing error by making guesses closer to the average</font> is at the core of what we call "**variance**".

- Variance tells us how data points are spread out from the average.
- The **smaller the variance, the closer data points are to the mean**, making our guesses more accurate.

In this game, the lower the variance of the heights, the more likely a guess near the mean is to be close to the actual height.

Let's now define the error precisely.

<font color='purple'>**Defining Error:**</font>

<font color='purple'>$Error = (Actual \ Height - Guessed \ Height)^2$</font>

> <font color='purple'>Q. Now, to minimize this error, what's the best approach?</font>

In our Height Guessing Game, we've seen that aiming for the mean (μ) height is the key.

That is, the guessed height should be the mean value.

- **$Error = (H_1 - μ)^2$** (guessing 1 time)
- This is the squared error of a single guess; averaging it over many guesses gives the **Mean Squared Error**

<br>

<font color='purple'>Imagine you're playing the game 10 times, guessing the mean height each time:</font>

Error1 = $(H_1 - μ)^2$

Error2 = $(H_2 - μ)^2$

Error3 = $(H_3 - μ)^2$

...

Error10 = $(H_{10} - μ)^2$

To find the overall error, we can sum up these individual errors and then divide by the number of guesses, which gives us the variance:

<font color='purple'>**Variance Calculation:**</font>

Variance = (Error1 + Error2 + Error3 + ... + Error10) / 10

- <font color='purple'>**$Variance = \Large\frac{(H_1 - μ)^2 + (H_2 - μ)^2 + (H_3 - μ)^2 + ..... + (H_{10} - μ)^2}{10}$**</font>

> <font color='purple'>Q. So, if the variance is low, what does that mean?</font>

It implies that most of our guesses are close to the actual heights.

In general, variance quantifies how spread out the data values are from the average (mean) value. It is the average squared difference between the data points and the mean.

<br>

The formula for calculating variance for n data points is:

### <font color='purple'>**Variance Calculation Formula:**</font>

**$variance$ = $\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(H_i - \mu)^2}{n}$**

- $\sigma^2$ is the population variance.
- $H_i$ is the ith data point.
- $\mu$ is the population mean.
- $n$ is the number of data points in the population.

Now that we have a clear understanding of variance and how it measures the spread or dispersion of data points, let's look into another essential concept closely related to variance. It's called "Standard Deviation".

### <font color='purple'>**Standard Deviation** (3-5 mins)</font>

Let's introduce an even more practical and commonly used statistic - the "Standard Deviation."

While variance quantifies the dispersion of data, standard deviation is derived from variance and offers a more interpretable measure.

- The <font color='purple'>standard deviation represents how much individual data points deviate from the mean or average value</font>.
- It gives us a clear sense of the typical or expected amount of variation in our dataset.
- In simple words, it represents <font color='purple'>how far, on average, our data points are from the mean ( μ )</font>

<br>

<font color='purple'>**Standard Deviation Formula:**</font>

The standard deviation can be calculated by taking the square root of the variance:

**$SD = \sqrt{variance}$**

- $\Large σ = \sqrt{\frac{\displaystyle\sum_{i=1}^{n}(H_i - \mu)^2}{n}}$

<font color='purple'>**Interpretation:**</font>

- A <font color='purple'>lower standard deviation signifies that data points tend to be close to the mean</font>, indicating **less variability**.
- Conversely, a <font color='purple'>higher standard deviation indicates greater data dispersion</font>, suggesting **more variability** within the dataset.

---
title: IQR Implementation on real life dataset
description:
duration: 5400
card_type: cue_card
---

### <font color='blue'>**IQR implementation on a real-life dataset** (10-12 mins)</font>

Let's apply this to a real-world dataset.

We will see how to calculate the IQR and create a box plot using Python, by working on a real-life dataset.

<br>

#### <font color='purple'>**Problem Statement**:</font>

Let's talk about these two players:

1. Sehwag
2. Rahul Dravid

We all know that Sehwag has an **aggressive batting style**,

while Rahul Dravid **plays patiently**, takes no risks and stands at the crease like a "Wall".

- Let's analyse the matches of both players and try to find some insights about their range of scores.
- We will use the IQR here to measure the spread of their scores robustly, and will also try to find whether they have any "Outlier" scores in their careers.
<font color='purple'>We will conclude that out of these two batsman, who is the more consistent batsman?.</font> Let's start with Sehwag's matches Code: ```python= import numpy as np import pandas as pd import seaborn as sns import matplotlib.pyplot as plt ``` Code: ```python= !wget --no-check-certificate https://drive.google.com/uc?id=1JYyGv7QSb_GkVGan5rnHNrtOg2ewJB_f -O sehwag.csv ``` Code: ```python= sehwag = pd.read_csv("sehwag.csv") sehwag.head() ``` >Output: <img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/101/original/op220.png?1699358676 height = "200" width = "600"> Code: ```python= sehwag["Runs"].describe() ``` >Output: <img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/102/original/dfg34.png?1699358869 height = 200 width = 350> We want to find the range of his scores > <font color='purple'>Let's find Quartiles first on the "Runs" column.</font> So Q1, Q2 and Q3 will be Code: ```python= # 25th percentile or Q1 p_25 = np.percentile(sehwag["Runs"], 25) p_25 ``` >Output: ``` 8.0 ``` This value indicates that 25% of all the values present in the dataset for Sehwag's run is less than 8 We can also say, <font color='purple'>Out of all the matches that Shewag played, in 25% of those matches, he scored less than 8 runs.</font> Code: ```python= #50th percentile or Q2, also "Median" p_50 = np.percentile(sehwag["Runs"], 50) p_50 ``` >Output: ``` 23.0 ``` This indicates that in 50% of the matches, he scored less than 23 runs Code: ```python= #75th percentile or Q3 p_75 = np.percentile(sehwag["Runs"], 75) p_75 ``` >Output: ``` 46.0 ``` This indicates that in <font color='purple'>75% of the matches, he scored less than 46 runs</font> > <font color='purple'>So, IQR will be?</font> We know IQR = Q3 - Q1 Code: ```python= # Inter Quartile Range iqr_sehwag = p_75 - p_25 iqr_sehwag ``` >Output: ``` 38.0 ``` Code: ```python= normal_range = (sehwag["Runs"].max() - sehwag["Runs"].min()) normal_range ``` >Output: ``` 219 
```

We can observe the difference here: the **IQR is 38**, which means that the middle 50% of his scores lie within a span of 38 runs (between Q1 = 8 and Q3 = 46).

<br>

- On the other hand, the **normal range is very high, i.e. 219**, which is not a reliable measure of spread here.
- This tells us that **there are outliers present in the data**: in some matches he scored far more than usual, going past 200 runs in a single match.

This is why the range is inflated by the outliers.

Let's plot the box plot to visualise the spread of the data.

Code:
```python=
sns.boxplot(data=sehwag["Runs"], orient="h")
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/103/original/boxd23.png?1699359296 height = 300 width = 400>

We can see that Q1 and Q3 form the edges of the box and Q2 (the median) is the line inside it. The whiskers on both sides of the box mark the limits.

We already saw how to calculate the lower whisker and upper whisker.

All the values outside these limits are considered "Outliers".

Code:
```python=
# upper limit = Q3 + 1.5 * IQR
upper = p_75 + 1.5*(iqr_sehwag)
upper
```

>Output:
```
103.0
```

Here, we cannot have values on the left side of the lower whisker, as a batsman cannot score less than 0 runs.

So all the outliers will be present on the right side of the upper whisker.

Code:
```python=
# all the values greater than upper are outliers
outliers_sehwag = sehwag[sehwag["Runs"]>upper]
len(outliers_sehwag)
```

>Output:
```
14
```

Code:
```python=
# 14 outliers out of 245 matches
14/245
```

>Output:
```
0.05714285714285714
```

<font color='orange'>**Conclusion**:</font>

Here we can say that **5.7% of the values in the dataset are outliers**.
This means that <font color='purple'>in about 5.7% (~6%) of his matches, Sehwag scored more than the upper limit of 103 runs</font>

Now let's repeat the same process for Dravid's stats

<font color='blue'>**Cricket - Dravid** (10 mins)</font>

Code:
```python=
!wget --no-check-certificate https://drive.google.com/uc?id=1nrKmOYQNiTqFhMIoAwE00ULKhGhMEVMZ -O dravid.csv
```

Code:
```python=
dravid = pd.read_csv("dravid.csv")
dravid["Runs"].describe()
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/105/original/op78.png?1699359727 height = 200 width = 300>

Code:
```python=
# 25th percentile or Q1
per_25 = np.percentile(dravid["Runs"], 25)
per_25
```

>Output:
```
10.0
```

This indicates that in <font color='purple'>25% of the matches, he scored less than 10 runs</font>

Code:
```python=
# 50th percentile or Q2, also "Median"
per_50 = np.percentile(dravid["Runs"], 50)
per_50
```

>Output:
```
26.0
```

This indicates that in <font color='purple'>50% of the matches, he scored less than 26 runs</font>

Code:
```python=
# 75th percentile or Q3
per_75 = np.percentile(dravid["Runs"], 75)
per_75
```

>Output:
```
54.0
```

This indicates that in <font color='purple'>75% of the matches, he scored less than 54 runs</font>

Code:
```python=
# Inter Quartile Range
iqr_dravid = per_75 - per_25
iqr_dravid
```

>Output:
```
44.0
```

Code:
```python=
normal_range = (dravid["Runs"].max() - dravid["Runs"].min())
normal_range
```

>Output:
```
153
```

Code:
```python=
sns.boxplot(data=dravid["Runs"], orient="h")
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/106/original/fj65.png?1699360087 height=300 width = 400>

Code:
```python=
# upper limit = Q3 + 1.5 * IQR
upper_dravid = per_75 + 1.5*(iqr_dravid)
upper_dravid
```

>Output:
```
120.0
```

Code:
```python=
# all the values greater than upper_dravid are outliers
outliers_dravid = dravid[dravid["Runs"]>upper_dravid]
len(outliers_dravid)/len(dravid)
```

>Output:
```
0.009433962264150943
```

Code:
```python=
outliers_dravid['Runs'].shape
```

>Output:
```
(3,)
```

Code:
```python=
dravid.shape
```

>Output:
```
(318, 14)
```

#### <font color='orange'>**Conclusion**</font>

Here we can say that **0.9% of the values in the dataset are outliers**.

This means that in only about 0.9% of his matches (3 out of 318), Dravid scored more than the upper limit of 120 runs.

<br>

So we can say that in <font color='purple'>Sehwag's case there are **6% outliers** and in Dravid's case there are only **0.9% outliers**</font>, which shows that "<font color='orange'>**Dravid was more consistent than Sehwag**"</font>

<br>

Now let's calculate the standard deviation, through which we can measure the amount of variation or dispersion in the runs scored by Sehwag and Dravid.

> Code:
```python=
std_dev_sehwag = np.std(sehwag["Runs"])
print("The amount of variations in runs scored by sehwag is:",std_dev_sehwag)
```

> Output:
```
The amount of variations in runs scored by sehwag is: 34.73830672594385
```

> Code:
```python=
std_dev_dravid = np.std(dravid["Runs"])
print("The amount of variations in runs scored by dravid is:",std_dev_dravid)
```

> Output:
```
The amount of variations in runs scored by dravid is: 29.635116182506632
```

A lower standard deviation indicates less variability in the batsman's performance.

> **For Sehwag:**
- Standard Deviation of Runs: 34.74

> **For Dravid:**
- Standard Deviation of Runs: 29.64

<br>

<font color='orange'>**Conclusion:**</font>

Dravid has a lower standard deviation compared to Sehwag. This suggests that **Dravid's run scores are more consistent or less variable than Sehwag's**.
- In other words, Dravid tends to have a more stable performance in terms of runs compared to Sehwag, who shows more variability in his run scores.
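The IQR and standard deviation steps used for both players can be folded into one helper. This is only a minimal sketch: the function name `summarize_scores` is hypothetical, and the two score lists are made up for illustration, not taken from `sehwag.csv` or `dravid.csv`.

```python
import numpy as np

def summarize_scores(runs):
    """Return IQR, upper outlier limit, outlier fraction and standard deviation."""
    runs = np.asarray(runs, dtype=float)
    q1, q3 = np.percentile(runs, [25, 75])
    iqr = q3 - q1
    upper = q3 + 1.5 * iqr                    # same rule as the upper whisker
    outlier_fraction = np.mean(runs > upper)  # share of scores above the limit
    return {
        "iqr": iqr,
        "upper": upper,
        "outlier_fraction": outlier_fraction,
        "std": np.std(runs),
    }

# Made-up scores (assumption): one steady batsman, one streaky batsman
steady = [20, 25, 30, 35, 40, 45, 50, 55]
streaky = [0, 2, 5, 8, 10, 12, 15, 200]

steady_summary = summarize_scores(steady)
streaky_summary = summarize_scores(streaky)
print(steady_summary)
print(streaky_summary)
```

In this toy data the streaky list has one score far above its upper limit, so both its outlier fraction and its standard deviation come out higher than for the steady list.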
<br>

So, from both the **IQR method** and the **Standard Deviation** we can conclude that <font color='purple'>Dravid is more consistent than Sehwag</font>

**Let's talk about a new concept**

---
title: Random variables and Distribution functions
description:
duration: 5400
card_type: cue_card
---

### <font color='purple'>**Random Variable (RV)** (5 - 7 mins)</font>:

Having explored measures of variability such as variance and standard deviation, which deal with random variables, let's try to understand what exactly a random variable is.

Random variables are essential for modelling and understanding uncertainty in various scenarios. Let's have a look.

> <font color='purple'>**Can you think of situations in your daily life where uncertainty or randomness plays a significant role?**</font>

(Learner Participation)
- Weather forecasts
- Stock market predictions
- Even a coin toss has uncertainty

Because of this randomness or uncertainty, the concept of probability seeps into our daily life.

For instance,
- When talking about weather forecasts, the reporter talks about the percentage chance (probability) of rain, storm, etc.
- We've even quantified that on a coin toss, the probability of getting heads is 0.5

Therefore, when we look at things with a mathematical mindset, we need a way to account for such events that exhibit randomness.

For this, we use something known as a <font color='purple'>**Random variable**</font>

A random variable describes a situation/event/experiment whose outcome we are not certain about. It is a way to assign numbers to the outcomes of such events.

Random variables can further be divided into 2 types:
- <font color='orange'>Discrete RV</font>
- <font color='orange'>Continuous RV</font>

#### <font color='purple'>Examples of Discrete RV</font>

Here, we can count the number of possible outcomes.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/107/original/rv5.png?1699360560 height = "300" width = "600">
> <font color='purple'>**Coin Toss**</font>

Since we mentioned it earlier, let's consider a coin toss.

<font color='purple'>**What are its possible outcomes?**</font>

Heads and Tails.
- There is no other possible outcome.

Hence, we can represent its outcomes as a random variable that can take the values: $\left \{H, T \right \}$

> <font color='purple'>**Throw of a dice**</font>

- Let's assign a random variable, "X," to represent the outcome of the die roll.
- So, a throw of dice can be represented as: <font color='purple'>$X = \left \{ 1, 2, 3, 4, 5, 6 \right \}$</font>, depending on the outcome of the roll.
- It cannot have an outcome less than 1, or greater than 6
- Nor can it take a decimal value, such as one between 1 and 2
- Hence it is also a discrete RV

> <font color='purple'>**Consider the contingency table provided.**</font>

- Here, we have 2 random variables
    - X: Number of sandwiches bought
    - Y: Number of drinks bought

Note that they are both discrete RVs, as they can only have the following values:
- $X = \left \{1, 2 \right \}$
- $Y = \left \{1, 2, 3 \right \}$

#### <font color='purple'>Examples of Continuous RV</font>

Here, we cannot count the number of possible outcomes. They are infinite.

> <font color='purple'>**Height of students in a class**</font>

- Suppose the lowest student height in the class is: 4.5 feet
- Suppose the highest student height is: 5.9 feet

Now, we can have students whose height is
- 4.511 feet
- 4.92 feet
- 5.8555 feet

So, we have an infinite number of possible height values between 4.5 and 5.9 feet. We cannot count them.

In contrast, we could count the number of possibilities in a coin toss or dice throw.
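The difference between countable and uncountable outcomes can be seen in a quick simulation. This is a sketch under assumptions: the heights are drawn uniformly between 4.5 and 5.9 feet purely for illustration (real heights are not uniformly distributed), and the seed is arbitrary.

```python
import random

random.seed(0)  # fixed seed so reruns give the same draws

# Discrete RV: a die roll can only land on one of six countable values
die_rolls = [random.randint(1, 6) for _ in range(1000)]
distinct_die_values = sorted(set(die_rolls))

# Continuous RV: a height anywhere between 4.5 and 5.9 feet
# (uniform draw is an assumption made only for this illustration)
heights = [random.uniform(4.5, 5.9) for _ in range(1000)]
distinct_heights = len(set(heights))

print("Distinct die values:", distinct_die_values)
print("Distinct height values among 1000 draws:", distinct_heights)
```

However many times the die is rolled, the set of distinct values never grows beyond six, while the simulated heights keep producing new values.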
> <font color='purple'>**Other examples of Continuous RV can be:**</font>

- Temperature of a room
- Time taken to complete a task
- Distance travelled, etc.

Now that we have a proper understanding of random variables, let's look at the different distribution functions which are used to describe the probability distribution of random variables.

We will look at a dataset shortly, but first let's see what a distribution function is.

### <font color='blue'>**Distribution Functions** (3 mins)</font>

**Probability Density Function (PDF)**:
- The PDF is a function that describes the probability density of a continuous random variable over its range.
- The term "density" here is similar to how tightly data is packed around a specific point, like cars on a road.

**Probability Mass Function (PMF)**:
- The PMF is a function that describes the probability of a discrete random variable taking on a specific value.

**Cumulative Distribution Function (CDF)**:
- The CDF is a function that gives the probability that a random variable is less than or equal to a specified value.

Let's implement this using a height dataset.

For now, we will work with the height dataframe (`df_hw`) that we saw above.

Code:
```python=
df_height = df_hw["Height"]
df_height.head()
```

>Output:
```
0    73.847017
1    68.781904
2    74.110105
3    71.730978
4    69.881796
Name: Height, dtype: float64
```

Code:
```python=
# minimum height
min_height = df_height.min()
min_height
```

>Output:
```
54.2631333250971
```

Code:
```python=
# maximum height
max_height = df_height.max()
max_height
```

>Output:
```
78.9987423463896
```

Code:
```python=
total = len(df_height)
total
```

>Output:
```
10000
```

When we talk about probability, we try to construct distribution plots.

The Cumulative Distribution Function (CDF), Probability Mass Function (PMF), and Probability Density Function (PDF) are all related to random variables and are used to describe the probability distribution of random variables.
To plot these distributions, we generally use histograms or distribution plots.

#### <font color='blue'>**Histogram** (5 mins)</font>

It is a graphical representation of a dataset's distribution, showing the frequency or probability of different values within the data.

Code:
```python=
sns.displot(df_height)
```

>Output:

<img src =https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/110/original/jwp0.png?1699361448 height = 400 width = 400>

> <font color='purple'>Q. What can we understand from this distribution?</font>

- Each bar in the histogram represents one of the intervals or ranges,
- The height of the bar indicates the frequency or number of data points falling within that interval.

<font color='purple'>**Count**</font>:
- It indicates the "**frequency**", which means how many values fall in that particular bar or range of height.
- For example, we can say that <font color='purple'>around 500 people have their height in the range of 63 - 65 (that one bar)</font>

This is what histograms or distribution plots tell us about the data.

Now let's have a look at some distribution functions.

### <font color='purple'>**Probability Mass Function (PMF)** (3 mins)</font>

The PMF is a function that describes the probability of a discrete random variable taking on a specific value.

It associates each possible value of the random variable with its probability of occurrence.

<font color='purple'>**Example: Rolling a Fair Six-Sided Die**</font>

- Suppose a discrete random variable X represents the outcome of rolling a fair six-sided die. The possible outcomes are 1, 2, 3, 4, 5, 6, so X is a discrete random variable.
- The PMF might look like <font color='purple'>$P(X = 1) = \frac{1}{6}$, $P(X = 2) = \frac{1}{6}$</font>, and so on.
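The die example can be written out directly. The sketch below builds the theoretical PMF with exact fractions and compares it with an empirical PMF from simulated rolls. The number of rolls and the seed are arbitrary choices made for illustration.

```python
from collections import Counter
from fractions import Fraction
import random

# Theoretical PMF of a fair six-sided die: every face has probability 1/6
pmf = {face: Fraction(1, 6) for face in range(1, 7)}
total_probability = sum(pmf.values())  # a valid PMF sums to 1

# Empirical PMF: relative frequency of each face in simulated rolls
random.seed(42)
n_rolls = 60_000
counts = Counter(random.randint(1, 6) for _ in range(n_rolls))
empirical_pmf = {face: counts[face] / n_rolls for face in range(1, 7)}

for face in range(1, 7):
    print(face, float(pmf[face]), round(empirical_pmf[face], 3))
```

With a large number of rolls, each empirical frequency should land close to 1/6, and the theoretical probabilities add up to exactly 1.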
<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/111/original/pmf.png?1699362755 height = "300" width = "500">

### <font color='purple'>**Probability Density Function (PDF)** (5 mins)</font>

**PDF is used for continuous random variables**, as opposed to PMF, which is for discrete variables.

If you want to find the <font color='purple'>probability of a continuous random variable falling within a given range</font>, you use the PDF.

- It doesn't provide the probability of a specific value (for a continuous RV, that probability is zero) but gives you the <font color='purple'>probability of the RV falling within a certain interval</font>.
- For example, what are the chances that the next height you choose will fall between 62 and 65?

We can visualize a PDF by using distribution plots like histograms or KDE (Kernel Density Estimation) plots.

Code:
```python=
sns.kdeplot(df_height)
```

>Output:

<img src=https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/112/original/cou.png?1699362935 height=300 width = 400>

<font color='purple'>**Example:**</font>

If we have a continuous random variable Y representing the height of people in a population,

the <font color='purple'>PDF can be used to find the probability that a randomly chosen person has a height within a certain range, such as between 65 and 70</font>.
- We find the area under the curve over that interval to get the probability

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/113/original/bo56.png?1699363039 height = 300 width =400>

Next up we have:

### <font color='purple'>**Cumulative Distribution Function (CDF)** (5-7 mins)</font>

The Cumulative Distribution Function (CDF) describes the <font color='purple'>probability that a random variable takes on a value less than or equal to a given value</font>.
In the context of this dataset, the CDF tells us the fraction of people whose height is less than or equal to a given height.

- Let's say you take <font color='purple'>60 inches: what fraction of the people have a height less than or equal to this value?</font> This fraction is calculated using the CDF
- It gives you the cumulative probability up to a certain point.

<font color='purple'>**Example:**</font>

If you have a random variable Z representing the number of heads in three coin tosses,

the CDF would tell you the <font color='purple'>probability that Z is less than or equal to a certain number, like P(Z ≤ 2)</font>.

> <font color='purple'>**How to calculate CDF**?</font>

The **CDF is calculated by accumulating the probabilities for each height value**.
- As you move along the X-axis (height values) on the CDF graph, you're essentially adding up the probabilities
- It shows how likely it is to find someone with a height less than or equal to that value.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/114/original/cdf.png?1699363140 height = "200" width = "500">

- The CDF graph typically starts at <font color='purple'>0% on the Y-axis (probability) when height is at its minimum</font> (in our dataset)
- It ends at <font color='purple'>100% when height is at its maximum</font>.
- The curve starts at the left and gradually climbs towards the right.
- The steepness of the curve at a particular point represents how quickly the probability is accumulating

<font color='orange'>**Conclusion**</font>

So, the PDF shows how densely probability is concentrated around each height (the area under it over an interval gives that interval's probability), while the CDF shows you the probability of heights up to a certain value in your dataset.
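Before moving to the height data, the coin-toss example above can be worked out exactly. The sketch below assumes a fair coin, lists all outcomes of three tosses, builds the PMF of Z (number of heads) and accumulates it into the CDF.

```python
from fractions import Fraction
from itertools import product

# All 8 equally likely outcomes of three fair coin tosses (fair coin is an assumption)
outcomes = list(product("HT", repeat=3))

# PMF of Z = number of heads
pmf = {k: Fraction(0) for k in range(4)}
for outcome in outcomes:
    pmf[outcome.count("H")] += Fraction(1, len(outcomes))

# CDF: running total of the PMF, i.e. P(Z <= k)
cdf = {}
running_total = Fraction(0)
for k in sorted(pmf):
    running_total += pmf[k]
    cdf[k] = running_total

print("P(Z <= 2) =", cdf[2])
```

Here P(Z ≤ 2) comes out as 7/8, because the only outcome excluded is three heads, which has probability 1/8.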
Let's plot the CDF graph for this dataset manually

Code:
```python=
# CDF: Cumulative distribution function

# take 100 values in the range 50 to 80 (inclusive) using np.linspace
x_values = np.linspace(50, 80, 100)

# will contain the fraction of people with height <= x
y_values = []

for x in x_values:
    # find the people with height less than or equal to x
    people_shorter_than_x = df_height[df_height <= x]
    # count such people
    num_people_shorter_than_x = len(people_shorter_than_x)
    # divide by the total to get the fraction of people with height <= x
    fraction_people_shorter_than_x = num_people_shorter_than_x / total
    # append to the y_values list
    y_values.append(fraction_people_shorter_than_x)

# plotting the CDF
plt.plot(x_values, y_values, c="b")
```

>Output:

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/115/original/bj2.png?1699363286 height = 300 width = 400>

- This curve is called the "**Cumulative Distribution Function**".
- It is a function which takes the x value and returns the y value
    - <font color='purple'>$f(x) = y$</font>

It is the inverse of the percentile:
- **Percentile takes 25 as input and gives 63.5 as output**, meaning
    - 25% of the people are shorter than 63.5
- While the **CDF takes 63.5 as input and gives 0.25 as output**, meaning
    - the fraction of people with height less than or equal to 63.5 is 0.25, i.e. 25% of people.

<img src = https://d2beiqkhq929f0.cloudfront.net/public_assets/assets/000/056/116/original/lhfj.png?1699363348 height = 50 width = 600>

### <font color='orange'>**Conclusion :**</font>

In summary, the relationships are as follows:
- The <font color='purple'>PMF is used for discrete random variables</font>.
- The <font color='purple'>PDF is used for continuous random variables</font>.
- The <font color='purple'>CDF is used for both</font> discrete and continuous random variables to provide cumulative probabilities.
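The percentile/CDF relationship described above can be checked numerically. This is a sketch on synthetic data: the heights are drawn from a normal distribution whose mean (66) and spread (4) are assumptions, not values from the lecture's dataset, and `empirical_cdf` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic heights in inches (assumed mean and spread, NOT the lecture's data)
heights = rng.normal(loc=66.0, scale=4.0, size=10_000)

def empirical_cdf(data, x):
    """Fraction of values in data that are less than or equal to x."""
    return np.mean(data <= x)

# Percentile: input 25 -> output a height
h_25 = np.percentile(heights, 25)

# CDF: input that height -> output (about) 0.25 again
fraction_at_h_25 = empirical_cdf(heights, h_25)

# Probability of an interval = difference of two CDF values
p_62_to_65 = empirical_cdf(heights, 65) - empirical_cdf(heights, 62)

print("25th percentile height:", h_25)
print("CDF at that height:", fraction_at_h_25)
print("P(62 < height <= 65):", p_62_to_65)
```

Feeding the 25th percentile back into the CDF returns roughly 0.25, which is what "inverse" means here. The last line also shows how the probability of an interval, the quantity the PDF section describes as an area, can be read off as a difference of two CDF values.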
These functions are essential tools in probability and statistics for describing and understanding the behaviour of random variables.

Next, let's study the statistical measures that are related to random variables. They help to describe the characteristics of a random variable and provide insights into its behaviour.

---
title: Conclusion
description:
duration: 5400
card_type: cue_card
---

### <font color='purple'>**Conclusion**</font>

With this, we conclude today's lecture.

Today, we covered a lot of different topics and laid the foundation for the next lectures.

Please keep revising this class, as it is the foundation class for your statistics journey. From the next class, we will start learning about different probability distributions.

See you in the next class
