# Probability
## Definition
Probability is a mathematical concept that quantifies the likelihood of an event occurring. Probabilities are often expressed as fractions, decimals, or percentages.
$$P(A) = \frac {n}{N}$$
$n$ is the number of favorable outcomes
$N$ is the number of possible outcomes in sample space $S$
The **classical probability** assumes that all outcomes are equally likely. For example, the probability of getting heads in a fair coin toss is $\frac {1} {2}$.
**Conditional probability**, by contrast, is the probability of one event occurring given that another event has already occurred. Denoted $P(A | B)$, it is given by the formula $P(A|B) = \frac {P(A \cap B)} {P(B)}$, provided $P(B)\neq 0$.
$\cap$ is intersection
- $P(A|B)$ is the conditional probability that $A$ occurs given that $B$ has occurred.
- $P(A \cap B)$ is the probability that both events $A$ and $B$ occur.
- $P(B)$ is the probability that event $B$ occurs.
To illustrate, let's say you have a standard deck of 52 playing cards:
- Event $A$ is drawing an Ace.
- Event $B$ is drawing a red card.
There are 4 Aces and 26 red cards in a standard deck. Two of the Aces are red.
$P(A) = \frac {4} {52}$
$P(B) = \frac {26} {52}$
$P(A \cap B) = \frac {2} {52}$
-> $P(A|B)=\frac {\frac {2} {52}}{\frac {26} {52}}=\frac {2}{26}=\frac {1}{13}$
So if you know you've drawn a red card, there's a $\frac {1} {13}$ chance that it is an Ace.
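This kind of conditional probability is easy to verify by brute-force enumeration. Below is a minimal sketch in Python (standard library only); the deck representation and helper names are just illustrative choices, not anything defined in these notes.

```python
from fractions import Fraction
from itertools import product

# Build a 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]  # hearts and diamonds are red
deck = list(product(ranks, suits))

def is_ace(card):
    return card[0] == "A"

def is_red(card):
    return card[1] in ("hearts", "diamonds")

p_b = Fraction(sum(is_red(c) for c in deck), len(deck))                      # P(B) = 26/52
p_a_and_b = Fraction(sum(is_ace(c) and is_red(c) for c in deck), len(deck))  # P(A and B) = 2/52

print(p_a_and_b / p_b)  # 1/13
```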
## Union of Events
The union of events $A$ and $B$, denoted as $A\cup B$, happens if either $A$, $B$, or both $A$ and $B$ happen.
$$P(A\cup B) = P(A) + P(B) - P(A \cap B)$$
![[2b9ac417d3f9501644f5bcc1cfdd8ebb.jpg]]
The reason for subtracting $P(A \cap B)$ is to account for double-counting the outcomes that are common to both $A$ and $B$.
For example, consider a standard deck of 52 playing cards. Let event $A$ be the drawing of a red card and let event $B$ be the drawing of a face card (Jack, Queen, King). The union of events $A$ and $B$ consists of drawing either a red card, a face card, or a red face card. In this case, the event $A\cup B$ includes 26 red cards and 12 face cards, but since 6 cards are both red and face cards, we have to account for that overlap:
$P(A) = \frac {26} {52} = \frac 1 2$
$P(B) = \frac {12} {52} = \frac {3} {13}$
$P(A\cap B) = \frac {6} {52} = \frac {3} {26}$
$P(A\cup B) = \frac {1}{2} + \frac {3} {13} - \frac {3} {26} = \frac {16} {26} = \frac {8} {13}$
In the example above, $P(A\cup B)$ gives the probability of drawing either a red card, a face card, or a card that is both red and a face card from a standard deck of 52 cards.
## Independent Events
Two events $A$ and $B$ are said to be independent if the occurrence of one event does not affect the occurrence of the other. In other words, knowing whether or not one event has occurred does not give you any information about whether the other event will occur.
$$P(A\cap B) = P(A) \times P(B)$$
![[c839ee4bf3bb6c5efce796861aee5142.jpg]]
The concepts of "independent events" and "intersection" are related but distinct in probability theory. The intersection of two events $A$ and $B$, denoted by $A\cap B$, refers to the set of outcomes that are common to both $A$ and $B$.
For example, let $A$ be the event that you draw an Ace from a standard deck of 52 cards, and $B$ be the event that you draw a King from the same deck. If you draw one card and then put it back into the deck (i.e., drawing with replacement), then the events are independent.
$P(A) = \frac {4} {52} = \frac {1} {13}$
$P(B) = \frac {4} {52} = \frac {1} {13}$
$P(A\cap B) = \frac {1} {13} \times \frac {1} {13} = \frac {1} {169}$
Since $P(A\cap B) = P(A) \times P(B)$, the events are independent when drawing with replacement.
However, if you draw without replacement, the events become dependent. If you draw an Ace and do not put it back into the deck, then the probability of drawing a King changes:
$P(A) = \frac {4} {52} = \frac {1} {13}$
$P(B|A) = \frac {4} {51}$
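The same Ace/King calculation can be written as a short sketch using exact fractions; everything here comes straight from the example above.

```python
from fractions import Fraction

p_ace = Fraction(4, 52)   # P(A): draw an Ace
p_king = Fraction(4, 52)  # P(B): draw a King

# With replacement: the draws are independent, so the probabilities multiply.
print(p_ace * p_king)     # 1/169

# Without replacement: condition on the first card being an Ace.
p_king_given_ace = Fraction(4, 51)   # 51 cards remain, all 4 Kings still in the deck
print(p_ace * p_king_given_ace)      # 4/663, i.e. (1/13) * (4/51)
```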
## Tree Diagrams
In a tree diagram, each branch represents a possible outcome of an event. The branches emerge from nodes, which signify the point at which a decision or random event takes place. The end points of the branches, sometimes called "leaves," represent the ultimate outcomes of a sequence of events.
***
### Tossing two coins
Suppose you want to find the probabilities of getting different combinations of heads (H) and tails (T) when tossing two coins. A tree diagram for this experiment would look like this:
![[4a.3a9550f.jpg]]
## Total Probability
The concept of total probability is a fundamental principle in probability theory and statistics that describes how to get the overall probability of an event $A$ occurring, given several mutually exclusive and exhaustive ways that $A$ can occur. The "mutually exclusive and exhaustive" part means that the different ways can't happen at the same time and cover all possible ways $A$ could happen.
Suppose $A$ can happen in one of several ways, each associated with a different event $B_1$,$B_2$,…,$B_n$. The total probability of $A$ is the sum of the probabilities of $A$ happening in each of these ways. Mathematically, this can be expressed as:
$$P(A) = P(A\cap B_1) + P(A\cap B_2) + ... + P(A\cap B_n)$$
Previously we had -> $P(A|B_n) = \frac {P(A\cap B_n)}{P(B_n)}$
Multiplying both sides by $P(B_n)$ -> $P(B_n).P(A|B_n) = P(A\cap B_n)$
thus -> $P(A\cap B_n) = P(B_n).P(A|B_n)$
so we got:
$$P(A) = P(B_1).P(A|B_1)+P(B_2).P(A|B_2)+...+P(B_n).P(A|B_n)$$
***
Let's say you have two bags of marbles. Bag 1 contains 3 red marbles and 2 green marbles. Bag 2 contains 1 red marble and 4 green marbles.
1. The probability of picking Bag 1 is $P(B_1)=0.5$ and the probability of picking Bag 2 is $P(B_2)=0.5$.
2. The probability of drawing a red marble from Bag 1 is $P(R|B_1)=\frac 3 5$ and from Bag 2 is $P(R|B_2)=\frac 1 5$.
The total probability of drawing a red marble would be:
$P(R) = P(B_1).P(R|B_1)+P(B_2).P(R|B_2) = 0.5 \times \frac 3 5 + 0.5 \times \frac 1 5 = \frac {3} {10} + \frac {1} {10} = \frac {4} {10} = \frac 2 5$
So, the total probability of drawing a red marble from either of the bags is $\frac 2 5$.
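The same bag-of-marbles computation as a minimal sketch with exact fractions; the bag labels and probabilities are the ones from the example.

```python
from fractions import Fraction

# (P(B_i), P(R | B_i)) for each bag, from the example above.
bags = {
    "bag1": (Fraction(1, 2), Fraction(3, 5)),
    "bag2": (Fraction(1, 2), Fraction(1, 5)),
}

# Law of total probability: P(R) = sum over bags of P(B_i) * P(R | B_i)
p_red = sum(p_bag * p_red_given_bag for p_bag, p_red_given_bag in bags.values())
print(p_red)  # 2/5
```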
## Permutation Count
A permutation is an arrangement of a set of objects in a particular order.
The number of possible permutations depends on the number of elements in the set and how many of those elements you are choosing to arrange.
The number of permutations of $n$ objects taken $r$ at a time (without repetition) is given by:
$$nP_r = \frac {n!} {(n-r)!}$$
$n$ is the total number of objects
$r$ is the number of objects you are selecting from the total
***
If you have 5 books and you want to know how many ways they can be arranged on a shelf, the number of permutations would be $5!=5×4×3×2×1=120$.
If you want to select 3 books out of 5 and find out the number of unique sequences (or arrangements) you can create, you would calculate:
$5P_3 = \frac {5!}{(5-3)!} = \frac {5 \times 4 \times 3 \times 2}{2 \times 1} = 60$
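Python's standard library exposes this directly; a quick check of the two book examples:

```python
import math

print(math.factorial(5))  # 120 ways to arrange all 5 books
print(math.perm(5, 3))    # 60 ordered selections of 3 books out of 5
```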
## Combination
The term "combination" refers to the selection of items from a larger set, where the order of selection does not matter.
$$
\begin{bmatrix}
n \\
k
\end{bmatrix}
= \frac {n!} {k! \times (n-k)!}
$$
For example, the number of ways to choose 2 items out of 5 is:
$$
\begin{bmatrix}
5 \\
2
\end{bmatrix}
= \frac {5!} {2! \times (5-2)!}
= \frac {5 \times 4} {2 \times 1}
= \frac {20} {2} = 10
$$
Combinations are often used when you are looking at the number of ways certain events can occur where the order of occurrence is not important.
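Likewise, `math.comb` computes the binomial coefficient; a short check of the example above:

```python
import math

print(math.comb(5, 2))                                                # 10
print(math.factorial(5) // (math.factorial(2) * math.factorial(3)))   # same value from the formula
```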
## Expected Value
The expected value of a random variable is a measure of the "central tendency" of the distribution of that variable. In simple terms, it provides a weighted average of all possible values a random variable can take, where each value is weighted according to its probability of occurrence.
For a discrete random variable $X$ with possible outcomes $x_1, x_2, …, x_n$ occurring with probabilities $p(x_1), p(x_2) ,…, p(x_n)$, the expected value $E[X]$ is given by:
$$E[X] = \sum^{n}_{i=1} x_i.p(x_i)$$
For a continuous random variable $X$ with probability density function $f(x)$, the expected value $E[X]$ is:
$$E[X] = \int^{\infty}_{-\infty} x.f(x)dx$$
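For the discrete case, the expected value is just a probability-weighted sum. A small sketch using a fair six-sided die as an illustrative random variable (the die is a made-up example, not something defined above):

```python
# Outcomes of a fair six-sided die, each with probability 1/6.
outcomes = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

expected_value = sum(x * p for x, p in zip(outcomes, probs))
print(expected_value)  # 3.5
```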
## Uniform Distribution
The uniform distribution is a type of probability distribution in which all outcomes are equally likely. The uniform distribution can be either discrete or continuous.
### Discrete Uniform Distribution
In a discrete uniform distribution, each of the $n$ values in the sample space has an equal probability of $\frac 1 n$.
The expected value $E[X]$ for a discrete uniform distribution in the interval $[a,b]$ is:
$$E[X]=\frac {a+b}{2}$$
The variance Var(X) is:
$$Var(X)=\frac {(b-a)(b-a+2)}{12}$$
### Continuous Uniform Distribution
In the continuous case, the distribution is described by two parameters, $a$ and $b$, which define the interval over which the variable can take values.
The probability density function(PDF) $f(x)$ for a continuous uniform distribution over the interval $[a, b]$ is given by:
$$
f(x) = \begin{cases}
\frac{1}{b-a} & \text{if } a \leq x \leq b, \\
0 & \text{otherwise}.
\end{cases}
$$
The expected value $E[X]$ for a continuous uniform distribution in the interval $[a,b]$ is:
$$E[X]=\frac {a+b}{2}$$
The variance Var(X) is:
$$Var(X)=\frac {(b-a)^2}{12}$$
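A quick numerical sanity check of these formulas, assuming NumPy is available: sample a uniform variable on an arbitrary interval and compare the empirical mean and variance with $\frac{a+b}{2}$ and $\frac{(b-a)^2}{12}$.

```python
import numpy as np

a, b = 2.0, 10.0                     # arbitrary interval, chosen only for illustration
rng = np.random.default_rng(0)
samples = rng.uniform(a, b, size=1_000_000)

print(samples.mean(), (a + b) / 2)       # both close to 6.0
print(samples.var(), (b - a) ** 2 / 12)  # both close to 64/12 ≈ 5.33
```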
## Normal Distribution
![[standard-normal-distribution-example.webp]]
The normal distribution, also known as the Gaussian distribution, is a probability distribution that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for values further away from the mean taper off equally in both directions.
The probability density function (PDF) of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is given by:
$$f(x)=\frac {1}{\sigma \sqrt{2\pi}}e^{-\frac 1 2 (\frac {x-\mu}{\sigma})^2}$$
$\mu$ is the mean (which is also the median and mode)
$\sigma$ is the standard deviation, a measure of the amount of variation or dispersion in the set of values.
$\sigma^2$ is the variance, the square of the standard deviation.
### Standard Normal Distribution
A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1. For a random variable $X$ with normal distribution $N(\mu,\sigma^2)$, the standardized variable $Z$ is defined as:
$$Z = \frac {X-\mu}{\sigma}$$
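Standardizing and looking up a normal probability can be done with the standard library alone, via the error function. A minimal sketch; the mean, standard deviation, and cut-off below are made-up numbers for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for X ~ N(mu, sigma^2), computed with the error function."""
    z = (x - mu) / sigma                 # standardize to Z ~ N(0, 1)
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example: X ~ N(100, 15^2); probability of observing a value below 130.
print(normal_cdf(130, mu=100, sigma=15))  # ≈ 0.977 (corresponds to z = 2)
```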
## Binomial Distribution
![[binomial.gif]]
Binomial Distribution describes the number of successes in a fixed number of independent Bernoulli trials, each with the same probability of success. A Bernoulli trial is a random experiment with exactly two possible outcomes: "success" and "failure". For example, tossing a coin is a Bernoulli trial where the coin landing heads up could be considered a "success" and tails a "failure" (or vice versa, depending on how you define success).
The probability mass function(PMF) of the binomial distribution is given by:
$$P(X = k) =
\begin{bmatrix}
n \\
k
\end{bmatrix}
\times p^k \times (1-p)^{(n-k)}
$$
$P(X=k)$ is the probability of observing $k$ successes in $n$ trials
$n$ is the number of trials
$p$ is the probability of success for each trial
$\begin{bmatrix}n \\ k \end{bmatrix}$ is the binomial coefficient, which represents the number of ways to choose $k$ successes out of $n$ trials and is defined by:
$$
\begin{bmatrix}n \\ k \end{bmatrix}
= \frac {n!} {k! \times (n-k)!}
$$
**note**: PMF is used for discrete random variables, whereas PDF is used for continuous random variables.
The mean ($\mu$) and variance ($\sigma^2$) of a binomial distribution are given by:
$$\mu = n \times p$$
$$\sigma^2 = n \times p \times (1-p)$$
***
Let's consider the example of flipping a coin 5 times and counting the number of heads that appear. In this situation, each flip of the coin is a Bernoulli trial with two possible outcomes: heads ("success") or tails ("failure"). Let's assume the coin is fair, so the probability of getting heads (success) is $p=0.5$, and the probability of getting tails (failure) is $1−p=0.5$.
We flip a fair coin 5 times ($n=5$). What is the probability of getting exactly 3 heads?
$P(X = k) = \begin{bmatrix} n \\ k \end{bmatrix}\times p^k \times (1-p)^{(n-k)}$
-> $P(X = 3) = \begin{bmatrix}5 \\ 3 \end{bmatrix} \times 0.5^3 \times (1-0.5)^{5-3} = \frac {5!} {3! \times (5-3)!} \times 0.5^3 \times 0.5^2 = 10 \times 0.125 \times 0.25 = 0.3125$
So, the probability of getting exactly 3 heads when flipping a fair coin 5 times is 0.3125, or 31.25%.
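The same coin-flip calculation as a minimal sketch using the standard library:

```python
import math

def binomial_pmf(k, n, p):
    """P(X = k) for a Binomial(n, p) random variable."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

print(binomial_pmf(3, 5, 0.5))  # 0.3125, matching the worked example
```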
## Geometric Distribution
![[geometric-distribution-probability-statistics.jpg.webp]]
The geometric distribution is a probability distribution that describes the number of Bernoulli trials needed for a success to occur. In other words, it models the number of trials that you must conduct until you get your first "success," assuming that each trial is independent and has the same probability $p$ of success.
For example, if you're flipping a fair coin and you're interested in knowing how many flips it will take until you get your first "heads," that's a situation modeled by a geometric distribution with $p=0.5$.
The probability mass function (PMF) of the geometric distribution is given by:
$$P(X = k) = (1-p)^{(k-1)} \times p$$
The mean ($\mu$) and variance ($\sigma^2$) of a geometrically-distributed random variable are given by:
$$\mu = \frac 1 p$$
$$\sigma^2 = \frac {1-p} {p^2}$$
so for the example above we got:
$P(X=k)=(1-0.5)^{k-1}\times 0.5$
Let's calculate it for 1, 2, 3:
$P(X=1)=(1-0.5)^{1-1}\times 0.5 = 0.5$
$P(X=2)=(1-0.5)^{2-1}\times 0.5 = 0.25$
$P(X=3)=(1-0.5)^{3-1}\times 0.5 = 0.125$
So:
- There is a 50% chance it will take exactly 1 flip to get the first "heads."
- There is a 25% chance it will take exactly 2 flips to get the first "heads."
- There is a 12.5% chance it will take exactly 3 flips to get the first "heads."
Using the formula for the mean of a geometric distribution, $\mu = \frac 1 p$, the expected number of trials (flips) it will take to get the first "heads" is $\frac 1 {0.5} = 2$.
So, on average, you would expect it to take 2 flips to get your first "heads" when flipping a fair coin.
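The same values as a short Python sketch:

```python
def geometric_pmf(k, p):
    """P(the first success occurs on trial k) for a geometric distribution."""
    return (1 - p) ** (k - 1) * p

p = 0.5
for k in (1, 2, 3):
    print(k, geometric_pmf(k, p))  # 0.5, 0.25, 0.125

print(1 / p)  # expected number of flips until the first heads: 2.0
```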
## Negative Binomial Distribution
![[1.gif]]
The Negative Binomial Distribution is an extension of the geometric distribution that describes the number of Bernoulli trials required for achieving $r$ successes.
In the geometric distribution, you're interested in the number of trials needed for the first success. In the negative binomial distribution, you're interested in the number of trials needed to achieve $r$ successes, where $r$ is a fixed positive integer.
The probability mass function (PMF) for the negative binomial distribution is given by:
$$
P(X=k) = \begin{bmatrix}k-1 \\ r-1\end{bmatrix}
\times p^r \times (1-p)^{(k-r)}
$$
$r$ is the number of successes we need
And, as in the binomial distribution, the binomial coefficient is:
$$
\begin{bmatrix}n \\ k \end{bmatrix}
= \frac {n!} {k! \times (n-k)!}
$$
The mean ($\mu$) and variance ($\sigma^2$) of a negative binomial distribution are given by:
$$\mu = \frac r p$$
$$\sigma^2 = \frac {r(1-p)} {p^2}$$
***
Suppose you are flipping a fair coin and are interested in finding out how many flips it will take until you get 3 heads. What is the probability that it will take exactly 5 flips?
$P(X=k) = \begin{bmatrix}k-1 \\ r-1\end{bmatrix}\times p^r \times (1-p)^{(k-r)}$
$P(X=5) = \begin{bmatrix}5-1 \\ 3-1\end{bmatrix}\times 0.5^3 \times (1-0.5)^{(5-3)} = \begin{bmatrix}4 \\ 2\end{bmatrix} \times 0.5^3 \times 0.5^2 = 6 \times 0.125 \times 0.25 = 0.1875$
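And the same negative binomial calculation as a short sketch:

```python
import math

def neg_binomial_pmf(k, r, p):
    """P(the r-th success occurs on trial k)."""
    return math.comb(k - 1, r - 1) * p**r * (1 - p) ** (k - r)

print(neg_binomial_pmf(5, 3, 0.5))  # 0.1875, matching the example
```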
## Sum up
### Binomial
What is the probability of getting exactly $k$ successes in $n$ trials?
### Geometric
What is the probability that the first success occurs on the $k$-th trial?
### Negative Binomial
What is the probability that the $r$-th success occurs on the $k$-th trial?
# Statistics
## Data Types
### Qualitative Data(Categorical Data)
1. **Nominal Data**: Data that can be divided into categories but cannot be ordered or measured.
- **Examples**: Male/Female, Yes/No, Types of fruits
2. **Ordinal Data**: Data that can be ordered but the intervals between the data points are not meaningful.
- **Examples**: Customer satisfaction ratings (Very Dissatisfied, Dissatisfied, Neutral, Satisfied, Very Satisfied)
### Quantitative Data(Numerical Data)
1. **Discrete Data**: Numeric values that are countable and have a finite number of values.
- **Examples**: Number of pets owned, Number of siblings
2. **Continuous Data**: Numeric values that are measurable and can take an infinite number of values within a specific range.
- **Examples**: Body weight, Temperature, Time taken to complete a task
### Special Types
1. **Interval Data**: Data that is both continuous and can be ordered, but the zero point is arbitrary, meaning it doesn't indicate the absence of the attribute being measured. This means that ratios between numbers are not meaningful. Temperature in Celsius or Fahrenheit is an example.
- **Examples**: IQ scores, SAT scores
2. **Ratio Data**: Similar to interval data but with a true zero point, which means that ratios between numbers are meaningful.
- **Examples**: Height in meters, Income in dollars
### Time-Series and Cross-Sectional Data
1. **Time-Series Data**: Data collected at different points in time. This is often used in finance for stock prices or in economics for data like GDP over time.
2. **Cross-Sectional Data**: Data collected at a single point in time. For example, a survey of the income of various individuals taken in the year 2021.
### Mixed Types
1. **Multilevel(or Hierarchical) Data**: Data that are nested within different groups. For example, students nested within classrooms nested within schools.
2. **Panel Data**: A combination of cross-sectional and time-series data, often used in econometrics. It involves tracking the same individuals or units over multiple time periods.
## Sampling
Sampling refers to the process of selecting a subset of individuals from a larger population or data set to estimate characteristics of the whole population. The idea is to analyze a small group that accurately represents the larger group to make inferences about the larger population.
### Random Sampling
Each member of the population has an equal chance of being selected.
### Stratified Sampling
Stratified sampling is a method of sampling that involves dividing the population into different subgroups or "strata" based on certain characteristics, and then taking a sample from each stratum. The characteristics used to create the strata could be anything relevant to the study, such as age, gender, income level, educational attainment, geographic location, and so on.
Stratified sampling is particularly useful when:
1. The population is not homogeneous and contains several distinct categories or strata.
2. The researcher wants to ensure that specific subgroups are adequately represented in the sample.
3. The objective of the research involves comparing different subgroups within the population.
### Cluster Sampling
The population is divided into clusters, often based on geographical location. A random sample of clusters is chosen, and then either all individuals in selected clusters are included, or a random sample of individuals within each selected cluster is taken.
### Systematic Sampling
A list is made of all members of the population, and every $n$th individual is selected from the list.
### Convenience Sampling
Samples are chosen in a way that is convenient for the researcher, often leading to biased results.
### Judgment Sampling
The researcher selects samples based on their expertise, also known as "expert sampling."
### Quota Sampling
The researcher aims to gather samples from specific subgroups until a predetermined quota for each subgroup is reached.
### Snowball Sampling
Used in social sciences for hard-to-reach populations. Existing study subjects recruit future subjects among their acquaintances.
## Descriptive Statistics
Descriptive statistics is the branch of statistics that focuses on collecting, summarizing, and interpreting data in a way that allows for immediate and straightforward conclusions about a dataset. Unlike inferential statistics, which aims to make predictions or inferences about a population based on a sample, descriptive statistics seeks only to describe the main aspects of the data at hand.
Descriptive statistics provide a way to understand the "shape" of the dataset and to identify patterns, trends, or anomalies that might exist. This type of statistical analysis can be performed on either a single variable or on multiple variables.
### Common Descriptive Statistics Measures
1. **Measures of Central Tendency**: These statistics give you a central point around which the set of values seems to cluster.
- **Mean**: The average value.
- **Median**: The middle value when the data are sorted.
- **Mode**: The value that occurs most frequently.
2. **Measures of Dispersion**: These statistics describe how spread out the values are.
- **Range**: The difference between the maximum and minimum values.
- **Variance**: The average of the squared differences from the mean.
- **Standard Deviation**: The square root of the variance.
- **Interquartile Range (IQR)**: The range within which the middle 50% of values fall.
3. **Measures of Shape**: These statistics describe the shape of the data distribution.
- **Skewness**: Indicates the direction and extent of skew(departure from horizontal symmetry) in the data.
- **Kurtosis**: Indicates how the peak and tails of the distribution differ from the normal distribution.
4. **Measures of Position**: These statistics identify where individual data points stand in relation to others in the data set.
- **Percentiles**: The value below which a given percentage of data points fall.
- **Quartiles**: Divide the data into four equal parts.
- **Z-scores**: Indicate how many standard deviations an individual data point is from the mean.
5. **Counts and Frequencies**: Simply tallying the number of occurrences of each unique value or range of values can be informative.
6. **Cross-tabulation**: Often used for categorical data to show the frequency with which certain events occur together.
7. **Correlation Coefficients**: These describe the strength and direction of a relationship between two variables.
## Inferential Statistics
Inferential statistics is a branch of statistics that allows you to make predictions ("inferences") about a population based on a sample of data from that population. Unlike descriptive statistics, which aims to summarize the data you have, inferential statistics helps you draw conclusions beyond your data. Essentially, it provides ways to generalize findings from a sample to a larger population.
Here are some of the key concepts and methods commonly used in inferential statistics:
### Hypothesis Testing
You often start with a null hypothesis (e.g., "there is no effect of the drug") and an alternative hypothesis (e.g., "the drug has an effect"). Based on your sample data, you use statistical tests (like t-tests, chi-square tests, etc.) to determine whether there is enough evidence to reject the null hypothesis.
### Confidence Intervals
Instead of just providing a single estimate based on the sample, a confidence interval gives an estimated range of values which is likely to include the population parameter. For instance, a 95% confidence interval implies that if the experiment were repeated many times, the confidence interval would contain the true population mean 95% of the time.
### Regression Analysis
Regression techniques allow you to examine the relationship between two or more variables. Simple linear regression involves predicting a quantitative outcome variable (dependent variable) based on one predictor variable (independent variable). Multiple regression uses two or more predictor variables.
### ANOVA (Analysis of Variance)
ANOVA allows you to compare the means of three or more groups to see if they are statistically different. For example, you might use ANOVA to see if diet affects weight loss in different age groups.
### Bayesian Methods
In contrast to traditional ("frequentist") methods, Bayesian methods provide a more direct way to represent uncertainty by using probability distributions for both observed data and unknown parameters.
### Correlation Analysis
This measures the strength and direction of the relationship between two variables. It doesn't imply causation but can suggest whether an association exists.
### Types of Data and Tests
Different types of data (nominal, ordinal, interval, ratio) require different tests. Nominal data might use a chi-square test, ordinal data might use a Mann-Whitney U test, and interval or ratio data might use a t-test or ANOVA.
### Sampling and Experimental Design
How you collect your sample and design your experiment has a significant impact on the validity of your inferences. Random sampling and random assignment are best practices for reducing bias.
### Significance Level (Alpha)
This is the probability of rejecting the null hypothesis when it is actually true. Common choices for alpha are 0.05, 0.01, and 0.001.
### P-value
The p-value is the probability of obtaining results at least as extreme as the ones observed, assuming the null hypothesis is true. Smaller p-values indicate stronger evidence against the null hypothesis.
## Measures of Centrality
Measures of centrality describe the center of a data distribution. Measures of centrality aim to provide a summary of a dataset with a single value that represents the 'middle' or 'center' of the data. Here are the primary measures of centrality in statistics:
### Mean (Arithmetic Average)
It is the sum of all data points divided by the number of data points. It provides a measure of the central location of the data.
$$\bar{x} = \frac {\sum x_i}{n}$$
### Median
The median is the value that separates the higher half from the lower half of a data sample. If there's an odd number of observations, the median is the middle number; if there's an even number of observations, it's the average of the two middle numbers.
- median for an odd number of observations:
$$median = x_{\frac {n+1} 2}$$
- median for an even number of observations:
$$median = \frac {x_{\frac n 2} + x_{\frac n 2 + 1}}{2}$$
### Mode
The mode is the value that appears most frequently in a data set. A data set may have one mode, more than one mode, or no mode at all.
### Geometric Mean
This is the nth root of the product of n numbers. It's especially useful for data that describe growth rates or multiplicative processes.
$$Gmean = \sqrt[n]{x_1 \times x_2 \times ... \times x_n}$$
### Harmonic Mean
It's the reciprocal of the arithmetic mean of the reciprocals of a set of observations. It is particularly useful when dealing with rates.
$$Hmean = \frac {n} {\frac 1 {x_1} + \frac 1 {x_2} + ... + \frac 1 {x_n}}$$
### Trimmed Mean (or Truncated Mean)
This involves removing the smallest and largest values (a certain percentage from each end) and then calculating the mean of the remaining data. It can help in reducing the impact of outliers.
### Weighted Mean
If different data points have different importance or weights, the weighted mean takes this into account. It's the sum of the product of each data point and its weight divided by the sum of the weights.
$$\bar{x}_w = \frac {\sum w_i x_i}{\sum w_i}$$
### Midrange
It's the average of the maximum and minimum values of a dataset.
$$midrange = \frac {max(x's) + min(x's)} 2$$
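Most of these measures are available in Python's `statistics` module. A minimal sketch on a small made-up sample:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # made-up sample

print(statistics.mean(data))            # arithmetic mean: 5.0
print(statistics.median(data))          # median: 4.0
print(statistics.mode(data))            # mode: 3
print(statistics.geometric_mean(data))  # geometric mean
print(statistics.harmonic_mean(data))   # harmonic mean
print((max(data) + min(data)) / 2)      # midrange: 6.0
```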
## Measures of Variability
Measures of variability (or dispersion) describe the extent to which data points in a dataset differ from the central tendency and from each other. These measures provide insights into the spread or distribution of data. Here are the primary measures of variability in statistics:
### Range
The range is the simplest measure of variability and is calculated as the difference between the maximum and minimum values in a dataset.
$$range = max(x's) - min(x's)$$
### Variance
This measure calculates the average of the squared differences from the mean.
- For a population, the formula is:
$$\sigma^2=\frac{\sum(x_i−\mu)^2} N$$
$\sigma^2$ is the population variance
$x_i$ is each data point
$\mu$ is the population mean
$N$ is the number of data points
- For a sample, the formula uses the sample mean $\bar x$ and divides by $n-1$:
$$s^2=\frac{\sum(x_i−\bar x)^2} {n-1}$$
### Standard Deviation
The standard deviation is the square root of the variance. It's the most common measure of dispersion and indicates the extent to which the data points deviate from the mean.
- For a population:
$$\sigma=\sqrt{\sigma^2}$$
- For a sample:
$$s = \sqrt{s^2}$$
### Mean Absolute Deviation (MAD)
It represents the average of the absolute differences between each data point and the mean.
$$MAD = \frac {\sum |x_i-\bar x|}{n}$$
Where $x_i$ is each data point
$\bar x$ is the sample mean
$n$ is the number of data points
### Interquartile Range (IQR)
IQR is the range between the first quartile (25th percentile) and the third quartile (75th percentile). It measures the statistical spread of the middle 50% of a dataset and is resistant to outliers.
### Coefficient of Variation (CV)
This is the ratio of the standard deviation to the mean, often multiplied by 100%. It's a normalized measure of dispersion, which allows for comparing variability across datasets with different units or scales.
- For a population:
$$CV = \frac \sigma \mu \times 100\%$$
- For a sample:
$$CV = \frac s {\bar x} \times 100\%$$
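A short sketch computing the main dispersion measures on the same kind of small made-up sample, using only the standard library:

```python
import statistics

data = [2, 3, 3, 5, 7, 10]  # made-up sample
mean = statistics.mean(data)

print(max(data) - min(data))                                   # range: 8
print(statistics.pvariance(data), statistics.variance(data))   # population vs. sample variance
print(statistics.pstdev(data), statistics.stdev(data))         # population vs. sample std dev
print(sum(abs(x - mean) for x in data) / len(data))            # mean absolute deviation
q1, q2, q3 = statistics.quantiles(data, n=4)                   # quartiles
print(q3 - q1)                                                 # interquartile range
print(statistics.stdev(data) / mean * 100)                     # coefficient of variation, in %
```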
### Range of a Sample
For datasets with a large number of observations, the range can be represented by splitting the data into smaller, equal parts (like deciles, quintiles, etc.), and then analyzing the range of each part.
### Skewness and Kurtosis
While these are not direct measures of variability, they describe the shape of the distribution. Skewness measures the asymmetry, and kurtosis measures the "tailedness" of the distribution.
## Quartile in Data
A quartile in data analysis and statistics is a type of quantile that divides a dataset into four defined intervals(so there are three quartiles). Quartiles are used to calculate the spread and center of a dataset, and they can be particularly helpful in identifying outliers or understanding the distribution of data points.
The three quartiles are defined as follows:
### First Quartile(Q1)
Also known as the **lower quartile**. This is the median of the first half of a dataset. 25% of data points fall below this value.
### Second Quartile(Q2)
Simply known as the **median**. This divides the dataset into two halves. 50% of data points fall below this value.
### Third Quartile(Q3)
Also known as the **upper quartile**. This is the median of the second half of a dataset. 75% of data points fall below this value.
### Interquartile Range(IQR)
It is a measure of variability calculated as Q3 − Q1. It describes the range of the middle 50% of values in a dataset, providing a measure of where the "bulk" of the values lie.
### Calculating Quartiles
1. Order the data in non-decreasing order.
2. Find the median. This is your second quartile (Q2).
3. Find the median of the first half of your dataset (not including Q2 if your number of data points is odd). This is Q1.
4. Find the median of the second half of your dataset (again, not including Q2 if your number of data points is odd). This is Q3.
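A minimal sketch that follows these steps literally (the median-of-halves method). Note that other conventions, such as interpolation-based quantile methods, can give slightly different quartile values.

```python
import statistics

def quartiles_by_halves(data):
    """Q1, Q2, Q3 using the median-of-halves method described above."""
    xs = sorted(data)
    n = len(xs)
    q2 = statistics.median(xs)
    lower = xs[: n // 2]          # first half, excluding the median when n is odd
    upper = xs[(n + 1) // 2:]     # second half, excluding the median when n is odd
    return statistics.median(lower), q2, statistics.median(upper)

print(quartiles_by_halves([1, 3, 3, 4, 5, 6, 6, 7, 8, 8]))  # (3, 5.5, 7)
```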
## Information Gain and Entropy
In the realm of decision trees, the goal is often to find splits in the data that result in the most homogeneous or "pure" subsets. Information Gain is a measure used to quantify the reduction in this uncertainty or disorder.
### Entropy
This is a measure of the randomness or disorder of a set. For a binary classification problem, the entropy $H(S)$ of a set $S$ with a proportion $p_+$ of positive cases and $p_-$ of negative cases is given by:
$$H(S) = -p_+ \log_2(p_+) - p_- \log_2(p_-)$$
It is highest when data is evenly split between classes and is lowest (0) when all data belongs to a single class.
### Information Gain
Given a dataset $S$ and a feature $F$, the Information Gain due to splitting on feature $F$ is:
$$IG(S,F) = H(S) - \sum_{v} \frac {|S_v|}{|S|} H(S_v)$$
$v$ ranges over all possible values of feature $F$
$S_v$ is the subset of $S$ for which feature $F$ has value $v$
$H(S_v)$ is the entropy of this subset
The idea is to compute the Information Gain for each feature and choose the feature for splitting that provides the maximum gain, i.e., the feature that most effectively reduces the uncertainty or disorder in the target variable.
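A compact sketch of both quantities on a tiny made-up dataset; the labels are binary and the "feature" is just a list of categorical values aligned with the labels.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, feature_values):
    """Entropy reduction from splitting the labels by a feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [lab for lab, f in zip(labels, feature_values) if f == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

labels = ["+", "+", "+", "-", "-", "-"]    # 3 positive, 3 negative -> H(S) = 1 bit
feature = ["a", "a", "a", "b", "b", "b"]   # this feature perfectly separates the classes

print(entropy(labels))                    # 1.0
print(information_gain(labels, feature))  # 1.0 (maximal gain)
```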
## Point Estimation
Point estimation involves using sample data to calculate a single value (known as a point estimate) which is to serve as a "best guess" or "best estimate" of an unknown population parameter (like the population mean or population proportion).
For example, the sample mean $\bar x$ is a point estimate of the population mean $\mu$.
## Interval Estimation
Interval estimation involves using sample data to calculate an interval (or range of values) which is likely to contain the population parameter of interest. This interval is called a confidence interval.
For example, for a population mean $\mu$ with known variance $\sigma^2$, the confidence interval is given by:
$$\bar x \pm z (\frac \sigma {\sqrt{n}})$$
$z$ is the z-score corresponding to the desired confidence level
$n$ is the sample size
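A minimal sketch of this interval with made-up numbers, using $z \approx 1.96$ for a 95% confidence level:

```python
from math import sqrt

x_bar = 52.3   # sample mean (made-up)
sigma = 4.0    # known population standard deviation (made-up)
n = 100        # sample size
z = 1.96       # z-score for a 95% confidence level

margin = z * sigma / sqrt(n)
print(x_bar - margin, x_bar + margin)  # ≈ (51.516, 53.084)
```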
## Margin of Error
The margin of error (often abbreviated as MOE) is a statistic expressing the amount of random sampling error in a survey's results. It represents the range within which the true population parameter is expected to fall with a certain level of confidence.
In other words, the margin of error gives an interval around a survey result and suggests that the real value in the population likely falls within that interval.
### Difference between MOE and CI
- **MOE**: If you state a sample proportion as 50% with an MOE of 5%, you're saying the true proportion in the population could be as low as 45% or as high as 55%. If you found that 60% of people like apples with an MOE of 5%, then the real percentage in the entire population could be anywhere from 55% to 65%.
- **CI**: If you provide a 95% confidence interval for a proportion as $[45\%, 55\%]$, you're saying you're 95% confident the true population proportion lies within that interval. Using the MOE from above, you'd say you're pretty sure (e.g., 95% confident) that the real percentage of people who like apples is between 55% and 65%.
## Central Limit Theorem
![[CentralLimitTheoremCLT-687bdb7ec28f44539d5eabc54070058c.jpg]]
Central Limit Theorem describes the distribution of sample means from a population with any shape of probability distribution. It states that, under certain conditions, the distribution of the sample means will approximate a normal distribution (also known as a Gaussian distribution) as the sample size increases, regardless of the shape of the original population distribution. In other words, as you take larger and larger random samples from a population and calculate their means, the distribution of these sample means will become more and more bell-shaped and approach a normal distribution.
Key points about the Central Limit Theorem:
1. **Random Sampling**: The samples must be drawn randomly from the population. Each sample should be independent of the others.
2. **Sample Size**: As the sample size increases, the sample means will tend to follow a normal distribution more closely. In practice, a sample size of 30 or more is often considered sufficient for the CLT to apply.
3. **Population Distribution**: The population from which the samples are drawn can have any shape of probability distribution. It does not need to be normally distributed.
4. **Sample Mean**: The distribution being approximated by the CLT is the distribution of sample means, not the distribution of individual data points.
5. **Independence**: The observations within each sample should be independent of each other.
6. **Finite Variance**: The population should have a finite variance. This means that extreme outliers or heavy tails in the population distribution can impact the applicability of the CLT.
The practical significance of the Central Limit Theorem is that it allows statisticians and researchers to make inferences about a population based on a sample. When you have a sufficiently large sample, you can use the properties of the normal distribution to calculate confidence intervals, conduct hypothesis tests, and make statistical predictions, even when you don't know the exact shape of the population distribution.
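A small simulation illustrating the theorem, assuming NumPy is available: the samples are drawn from a strongly skewed exponential distribution, yet the sample means cluster around the population mean with standard deviation close to $\sigma/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, num_samples = 50, 10_000

# Exponential population (strongly right-skewed), with mean 1 and standard deviation 1.
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

print(sample_means.mean())  # ≈ 1.0, the population mean
print(sample_means.std())   # ≈ 1/sqrt(50) ≈ 0.141, i.e. sigma / sqrt(n)
```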
## Data Distribution Pattern
Finding the distribution pattern of data involves analyzing the data to determine the shape and characteristics of its probability distribution.
### Histogram
Create a histogram of your data. A histogram is a graphical representation that divides the data into bins or intervals and shows the frequency or count of data points in each bin. By examining the shape of the histogram, you can get an initial sense of the data distribution. Common shapes include normal (bell-shaped), skewed (positively or negatively), bimodal (two distinct peaks), and uniform (flat).
### Probability Density Plot
A probability density plot, often generated using kernel density estimation (KDE), provides a smoothed estimate of the underlying data distribution. It can help you visualize the shape more clearly than a histogram.
### Quantile-Quantile (Q-Q) Plot
![[300px-Normal_normal_qq.svg.png]]
A Q-Q plot is a graphical tool for assessing how well your data matches a theoretical distribution, such as the normal distribution. If your data closely follows a theoretical distribution, the points in the Q-Q plot will form a straight line. Deviations from a straight line can indicate departures from the assumed distribution.
### Summary Statistics
Calculate summary statistics such as the mean, median, mode, standard deviation, and skewness of your data. These statistics can provide insights into the central tendency, spread, and symmetry of the distribution.
### Domain Knowledge
Consider your knowledge of the data-generating process or the context of the data. Sometimes, domain knowledge can help you make an educated guess about the likely distribution. For example, if you're dealing with ages of people, you might expect a distribution that is right-skewed because ages cannot be negative.
### Statistical Tests
There are statistical tests designed to assess the goodness-of-fit between your data and specific distribution models. For example, you can use the chi-square goodness-of-fit test to test if your data follows a particular distribution.
### Data Visualization
Use various data visualization techniques, such as box plots, violin plots, or cumulative distribution plots, to explore the data's distribution visually from different angles.
### Machine Learning Techniques
If you have a large dataset, you can also use machine learning algorithms, such as clustering or density estimation methods, to identify underlying patterns in the data.
### Empirical Cumulative Distribution Function (eCDF)
Plotting the empirical cumulative distribution function, which is a step function that represents the cumulative distribution of your data, can give you a visual sense of how your data is distributed.
### Simulation
If you have knowledge of the underlying process that generates the data, you can simulate data from that process and compare it to your actual data to assess the fit.
## Z-score
Z-scores are used to standardize and compare data points from different distributions, allowing for meaningful comparisons between them. When you have two different distributions and you want to compare individual data points or statistics (such as means or percentiles) from these distributions, z-scores provide a way to make those comparisons on a common scale.
$$Z = \frac {x-\mu}{\sigma}$$
- $Z$ is the z-score
- $x$ is the individual data point you want to standardize
- $\mu$ is the mean (average) of the distribution
- $\sigma$ is the standard deviation of the distribution
***
Let's consider an example of comparing the exam scores of two different classes, Class A and Class B. You want to determine if Class A performed significantly better or worse than Class B and quantify the difference using z-scores.
**Class A Scores:**
- Mean ($\mu_A$): 75
- Standard Deviation ($\sigma_A$): 10
- Number of Students ($n_A$): 50
- A specific student's score in Class A ($x_A$): 85
**Class B Scores:**
- Mean ($\mu_B$): 80
- Standard Deviation ($\sigma_B$): 12
- Number of Students ($n_B$): 60
- A specific student's score in Class B ($x_B$): 78
$Z_A = \frac {x_A - \mu_A}{\sigma_A} = \frac {85-75}{10} = 1.0$
This means that the student's score in Class A is 1 standard deviation above the mean of Class A.
$Z_B = \frac {x_B - \mu_B}{\sigma_B} = \frac {78-80}{12} = -0.167$
This means that the student's score in Class B is approximately 0.167 standard deviations below the mean of Class B.
In this example, the student in Class A performed relatively better compared to their classmates than the student in Class B did: the Class A score was 1 standard deviation above its class mean, while the Class B score was slightly below its class mean.
### Benefits
1. **Standardization**: Z-scores standardize data by transforming it into units of standard deviation from the mean. This transformation makes it easier to compare data points that come from distributions with different scales or units.
2. **Comparison of Data Points**: If you want to compare individual data points from two distributions, converting them to z-scores allows you to assess how far each data point is from its respective distribution's mean in terms of standard deviations. This comparison can help identify outliers or extreme values relative to their respective distributions.
3. **Comparison of Means**: If you want to compare the means of two distributions, calculating the z-scores for the means allows you to determine whether the means are significantly different from one another. You can use z-tests or other hypothesis tests to make this comparison.
4. **Quantile Comparisons**: Z-scores also facilitate comparisons of quantiles (e.g., percentiles or quartiles) between distributions. You can determine, for example, how a particular value in one distribution compares to the distribution of values in the other using z-scores.
5. **Statistical Testing**: When conducting statistical tests or hypothesis testing involving data from two distributions (e.g., comparing two groups in an experiment), z-scores can be used to assess the statistical significance of observed differences. Z-tests, t-tests, and other parametric tests often use z-scores for this purpose.
6. **Effect Size**: Z-scores can be used to calculate effect sizes, which measure the magnitude of differences or associations between variables. Effect size measures are often used to provide practical significance and context for statistical results.
7. **Normality Assumption**: In some statistical analyses, it is assumed that the data follows a normal distribution. By converting data points to z-scores, you can assess whether they deviate significantly from this normality assumption.
8. **Standard Practice**: Using z-scores is a standard and widely recognized method in statistics. It provides a clear and interpretable way to compare data points or statistics from different distributions, making it a valuable tool in data analysis.
## T-test
A t-test is a statistical hypothesis test used to determine whether there is a significant difference between the means of two groups.
There are several types of t-tests, but the most commonly used ones are:
1. **Independent Samples t-test**: This type of t-test is used when you have two independent groups, and you want to compare the means of a continuous variable between these two groups. For example, you might use an independent samples t-test to compare the average test scores of two different groups of students.
2. **Paired Samples t-test**: This t-test is used when you have paired or matched observations, and you want to compare the means of the differences between these pairs. For example, you might use a paired samples t-test to compare the before-and-after scores of the same group of individuals who received a treatment.
The t-test calculates a test statistic (t-value) based on the sample data and a specified significance level (alpha). The t-value is then compared to a critical value from the t-distribution or used to calculate a p-value. If the calculated t-value is greater than the critical value or the p-value is less than alpha, it suggests that there is a statistically significant difference between the means of the two groups. Commonly used significance levels are 0.05, 0.01, or 0.10.
- If the calculated t-value is greater than the critical t-value, you reject the null hypothesis.
- If the calculated t-value is less than the critical t-value, you fail to reject the null hypothesis.
$$t = \frac {\bar x_a - \bar x_b}{\sqrt{\frac{s^2_a}{n_a} + \frac {s^2_b}{n_b}}}$$
- $\bar x_a$ is the sample mean of group a
- $s_a$ is the sample standard deviation of group a
- $n_a$ is the sample size of group a
- and likewise for group b
$$t = \frac {\bar d}{\frac {s_d}{\sqrt n}}$$
- $\bar d$ is the mean of the differences
- $s_d$ is the standard deviation of the differences
- $n$ is the number of pairs
To calculate the critical t-value we need the degrees of freedom (df) and alpha.
Degrees of freedom (df) for an independent samples t-test:
$$df = n_a + n_b - 2$$
Degrees of freedom (df) for a paired samples t-test:
$$df = n - 1$$
- $n$ is the number of pairs
For many common cases, you can find critical t-values in t-tables provided in statistics textbooks or online resources. These tables list critical t-values for various degrees of freedom and significance levels. When using a t-table, you simply look up the intersection of your chosen significance level (alpha) and the degrees of freedom (df) to find the critical t-value.
***
**Scenario:** Imagine you are a researcher investigating whether a new weight-loss medication is effective compared to a placebo. You have two groups of participants: one group took the weight-loss medication for 12 weeks, and the other group took a placebo. At the end of the 12 weeks, you measured the weight loss (in pounds) for each participant in both groups.
**Hypotheses:**
- Null Hypothesis (H0): The mean weight loss in the medication group is equal to the mean weight loss in the placebo group.
- Alternative Hypothesis (Ha): The mean weight loss in the medication group is different from the mean weight loss in the placebo group.
**Data:**
Medication Group:
- Mean weight loss = 10 pounds
- Standard deviation = 3 pounds
- Sample size = 30
Placebo Group:
- Mean weight loss = 6 pounds
- Standard deviation = 2.5 pounds
- Sample size = 30
**Conduct the t-test:**
$t = \frac {10-6} {\sqrt {\frac {3^2}{30}+\frac {2.5^2}{30}}} \approx 5.61$
$df = 30 + 30 - 2 = 58$
Choose a significance level (alpha), typically 0.05
Critical t-value is approximately 2.00
Reject the null hypothesis because the t-value is greater than the critical t-value
This suggests that there is a statistically significant difference in weight loss between the medication group and the placebo group, indicating that the medication is likely effective in promoting weight loss.
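The same calculation as a short sketch (standard library only), reproducing $t \approx 5.61$ from the summary statistics above:

```python
from math import sqrt

# Summary statistics from the weight-loss example above.
mean_a, sd_a, n_a = 10.0, 3.0, 30  # medication group
mean_b, sd_b, n_b = 6.0, 2.5, 30   # placebo group

t = (mean_a - mean_b) / sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
df = n_a + n_b - 2

print(t, df)  # ≈ 5.61 with 58 degrees of freedom, well above the critical value of about 2.00
```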
***
**Scenario:** Imagine you are a researcher studying the effectiveness of a new study method on improving students' test scores. You have a group of 20 students, and you want to determine whether there is a statistically significant difference between their test scores before and after they received a special tutoring program.
**Hypotheses:**
- Null Hypothesis (H0): There is no significant difference in the mean test scores before and after the tutoring program.
- Alternative Hypothesis (Ha): There is a significant difference in the mean test scores before and after the tutoring program.
**Data:**
You collect test scores for each student before and after the tutoring program. Here's a simplified dataset of the changes in test scores (result = after - before):
-1, 3, 2, 0, 4, 5, 1, -2, 2, 3, 0, 1, 6, 2, 3, 1, 0, 2, 4, 5
**Conducting the paired samples t-test:**
$\bar d = \frac {41} {20} = 2.05$
$s_d \approx 2.11$
$t = \frac {2.05}{\frac {2.11}{\sqrt {20}}} \approx 4.34$
$df = 20-1=19$
Choose a significance level (alpha), typically 0.05
Critical t-value is approximately ±2.093
Since the t-statistic exceeds the critical t-value, you reject the null hypothesis. This suggests that, based on your sample data and the chosen significance level, the tutoring program had a statistically significant impact on the test scores.
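The same paired calculation as a sketch over the listed score differences (standard library only):

```python
import statistics
from math import sqrt

diffs = [-1, 3, 2, 0, 4, 5, 1, -2, 2, 3, 0, 1, 6, 2, 3, 1, 0, 2, 4, 5]

d_bar = statistics.mean(diffs)  # 2.05
s_d = statistics.stdev(diffs)   # ≈ 2.11 (sample standard deviation)
t = d_bar / (s_d / sqrt(len(diffs)))
df = len(diffs) - 1

print(d_bar, s_d, t, df)  # ≈ 2.05, 2.11, 4.34, 19
```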
### p-value
The p-value represents the probability of observing a t-statistic as extreme as, or more extreme than, the one calculated from your sample data, assuming that the null hypothesis is true.
If you have statistical software or a calculator that can directly compute the p-value, you can use that. You would input the t-statistic, degrees of freedom (df), and specify that it's a two-tailed test.
If you want to find the p-value manually using a t-table, you would follow these steps:
1. Determine the degrees of freedom (df)
2. Determine the direction of the test (two-tailed, left-tailed, or right-tailed)
3. Look up the critical t-value for the specified alpha level (0.05 for a 5% significance level) and degrees of freedom
### Tailed-Test
In hypothesis testing, the terms "two-tailed," "left-tailed," and "right-tailed" refer to different ways of specifying the direction of a statistical test and the associated alternative hypothesis.
**Two-Tailed Test:**
- In a two-tailed test, you are interested in whether there is a significant difference in either direction from the null hypothesis.
- The alternative hypothesis typically states that there is a significant difference, but it doesn't specify whether the difference is greater or smaller than what is expected under the null hypothesis.
Example: Testing if a new drug has an effect on blood pressure. The null hypothesis might be that the drug has no effect, and the alternative hypothesis is that the drug has an effect in either direction (increasing or decreasing blood pressure).
**Left-Tailed Test:**
- In a left-tailed test, you are interested in whether there is a significant difference in one specific direction, typically a decrease or a smaller value than what is expected under the null hypothesis.
- The alternative hypothesis specifically states that the parameter of interest is less than what is stated in the null hypothesis.
Example: Testing if a new manufacturing process produces parts that are smaller than the current standard. The null hypothesis might be that the new process produces parts of the same size, and the alternative hypothesis is that the new process produces smaller parts.
**Right-Tailed Test:**
- In a right-tailed test, you are interested in whether there is a significant difference in one specific direction, typically an increase or a larger value than what is expected under the null hypothesis.
- The alternative hypothesis specifically states that the parameter of interest is greater than what is stated in the null hypothesis.
Example: Testing if a new fertilizer increases the yield of a crop. The null hypothesis might be that the new fertilizer has no effect, and the alternative hypothesis is that the new fertilizer increases crop yield.
### Use T-test instead of Z-test when
1. **Sample Size is Small:** If your sample size is small (typically n < 30) and the population standard deviation (σ) is unknown or estimated from the sample, a t-test is more appropriate. In such cases, you should use a t-distribution to account for the added uncertainty due to the smaller sample size.
2. **Population Standard Deviation is Unknown:** When you don't know the population standard deviation and have a small sample size, a t-test is preferred because it uses the sample standard deviation (s) to estimate the population standard deviation, adjusting for the uncertainty in this estimate.
3. **Comparing Sample Means:** If you are comparing means between two groups (e.g., two different treatments) or conducting paired samples tests (e.g., before-and-after measurements), a t-test is typically used.
4. **Non-Normal Data:** While t-tests assume normality of the data, they can still be robust when the assumption of normality is approximately met, especially with larger sample sizes. If your data is clearly non-normal, you might consider non-parametric tests as an alternative to t-tests.
### Replacements of T-test
1. **Non-Parametric Tests:** Non-parametric tests are distribution-free tests that do not rely on the assumptions of normality or constant variance. Examples of non-parametric tests include the Mann-Whitney U test (an alternative to the independent samples t-test), the Wilcoxon signed-rank test (an alternative to the paired samples t-test), and the Kruskal-Wallis test (an alternative to one-way ANOVA). These tests are particularly useful when dealing with ordinal or non-normally distributed data.
2. **Bootstrapping:** Bootstrapping is a resampling technique that can be used to estimate the distribution of a statistic (e.g., mean or difference in means) without assuming a specific population distribution. It is especially helpful when you have a small sample size and want to calculate confidence intervals or conduct hypothesis tests.
3. **Transformations:** In some cases, applying data transformations (e.g., log transformation) to your data can help make it more normally distributed. After transformation, you can then perform a t-test or ANOVA on the transformed data.
4. **Robust Tests:** Robust statistical tests are designed to be less sensitive to violations of assumptions. For example, robust versions of t-tests, such as Welch's t-test, can be used when variances between groups are unequal or when data deviates from normality.
5. **Bayesian Methods:** Bayesian statistical methods provide an alternative framework for hypothesis testing and parameter estimation. Bayesian methods can be more robust to non-normality and allow for the incorporation of prior information into your analysis.
6. **Exact Tests:** Exact tests, such as Fisher's exact test or exact permutation tests, can be used when dealing with categorical data or small sample sizes. These tests provide exact p-values without relying on distributional assumptions.
7. **Machine Learning Methods:** Depending on your research question, you may consider machine learning approaches that do not assume normality, such as decision trees, random forests, or support vector machines. These methods can be used for classification or prediction tasks.
## Tests for Categorical Data
### Chi-Square Test of Independence
The chi-square test is used to determine whether there is a significant association between two categorical variables. It assesses whether the observed frequency counts in a contingency table are significantly different from what would be expected if the variables were independent.
- Chi-Square Test for Independence (2x2 table): Used for 2x2 contingency tables.
- Chi-Square Test of Independence (larger tables): Used for larger contingency tables.
### Fisher's Exact Test
Fisher's exact test is an alternative to the chi-square test, often used when dealing with small sample sizes or when the assumptions of the chi-square test are not met. It is particularly useful for 2x2 tables.
### McNemar's Test
McNemar's test is used to analyze paired categorical data, typically in a 2x2 table, to assess whether there is a significant difference between two related groups.
### Cochran's Q Test
Cochran's Q test is used to compare the proportions of a categorical outcome measured in three or more related groups. It is an extension of McNemar's test for more than two groups.
### Contingency Table Analysis
When you have multiple categorical variables and want to assess their associations simultaneously, you can perform a contingency table analysis. This may include methods like log-linear modeling, which extends chi-square analysis to larger tables, or correspondence analysis for visualizing relationships in multi-way contingency tables.
### G-test (Likelihood Ratio Test)
The G-test is another test for assessing the association between categorical variables. It is based on the likelihood ratio and can be used for contingency tables with larger dimensions.
### Logistic Regression
While logistic regression is often used for binary outcomes, it can also be extended to handle categorical outcomes with more than two categories. It's a powerful tool for modeling the relationship between one or more predictor variables and a categorical outcome variable.
### Multinomial Logistic Regression
This is an extension of logistic regression used when the outcome variable has more than two unordered categories.
### Ordinal Logistic Regression
When the outcome variable is ordinal (e.g., ordered categories like "low," "medium," "high"), ordinal logistic regression is used to model the relationship with one or more predictor variables.
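As a sketch of multinomial logistic regression in practice, the example below fits scikit-learn's `LogisticRegression` to the three-class Iris dataset; with the default lbfgs solver and a multi-class target, the model is fit as a multinomial logistic regression. The train/test split and settings are illustrative.
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Iris has a three-category outcome, so this is a multinomial logistic regression.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

print("class probabilities for one flower:", clf.predict_proba(X_test[:1]).round(3))
print("test accuracy:", round(clf.score(X_test, y_test), 3))
```
For a genuinely ordinal outcome, a dedicated ordinal regression model (e.g., `OrderedModel` in statsmodels) would be the more appropriate choice than treating the categories as unordered.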
## Covariance
Covariance measures the degree to which two random variables change together. In other words, it quantifies how two variables tend to move in relation to each other.
Mathematically, the covariance between two random variables $X$ and $Y$ is represented as $Cov(X, Y)$.
A positive covariance indicates that as one variable increases, the other tends to increase as well, and vice versa. A negative covariance suggests that as one variable increases, the other tends to decrease.
However, the magnitude of the covariance is not standardized, making it difficult to interpret the strength of the relationship. It's highly dependent on the scales of the variables being compared.
$$Cov(X, Y) = \frac 1 n \sum^n_{i=1}(x_i-\bar x)(y_i-\bar y)$$
- $n$ is the number of data points or observations
- $x_i$ and $y_i$ are individual data points of $X$ and $Y$, respectively
- $\bar x$ and $\bar y$ are the means (average) of $X$ and $Y$, respectively
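A small numerical check of the formula above (the values of $X$ and $Y$ are illustrative). Note that `np.cov` divides by $n-1$ by default (sample covariance); passing `bias=True` matches the $\frac{1}{n}$ form used here.
```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])   # illustrative data
y = np.array([1.0, 3.0, 2.0, 5.0])

# Covariance exactly as in the formula above (divide by n).
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# np.cov uses n - 1 by default; bias=True switches to the 1/n convention.
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_numpy)  # same value from both
```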
## Correlation and Covariance
Correlation is a standardized measure of the linear relationship between two variables. It provides a value between -1 and 1, where -1 indicates a perfect negative linear relationship, 1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.
The most common measure of correlation is the **Pearson correlation coefficient** ($r$), which is calculated as the covariance between two variables divided by the product of their standard deviations:
$$r = \frac {Cov(X, Y)}{\sigma_X \cdot \sigma_Y}$$
Pearson's correlation coefficient not only quantifies the direction of the relationship (positive or negative) but also provides information about its strength. A value close to -1 or 1 indicates a strong linear relationship, while a value close to 0 suggests a weak or no linear relationship.
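Continuing the same illustrative data as in the covariance sketch, Pearson's $r$ can be computed directly from the definition or with `scipy.stats.pearsonr`, which also returns a p-value for the null hypothesis of zero correlation.
```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([2.0, 4.0, 6.0, 8.0])   # illustrative data
y = np.array([1.0, 3.0, 2.0, 5.0])

# r = Cov(X, Y) / (sigma_X * sigma_Y); the 1/n factors cancel, so population std works here.
r_manual = np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())

r_scipy, p_value = pearsonr(x, y)
print(round(r_manual, 4), round(r_scipy, 4), round(p_value, 4))
```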
### Correlation doesn't imply causation
The phrase "correlation does not imply causation" is a fundamental principle in statistics and scientific research. It means that just because two variables are correlated (i.e., they have a statistical relationship) does not mean that one variable causes the other. There are several reasons for this:
1. **Third Variables (Confounding Factors):** Correlation may exist because both variables are independently influenced by a third variable, which is sometimes called a confounding factor. This third variable can create the appearance of a relationship between the two variables of interest. For example, there is a strong positive correlation between the number of ice cream sales and the number of drownings during the summer months. However, the underlying cause is not that eating ice cream causes drownings or vice versa; it's because both are influenced by warm weather.
2. **Reverse Causation:** Correlation alone cannot determine the direction of causation. It's possible that the relationship is reversed, meaning that the second variable causes changes in the first variable. For example, suppose there is a negative correlation between how much people exercise and their weight. It would be tempting to conclude that exercising less causes weight gain, but the causation may run the other way: people who weigh more may find exercise harder and therefore exercise less.
3. **Coincidence:** Sometimes, a correlation may occur by pure chance. This is especially true when dealing with small sample sizes or when analyzing a large number of variables. Finding a correlation doesn't mean it has any practical or meaningful significance.
4. **Nonlinear Relationships:** Correlation measures linear relationships. It may not capture complex, nonlinear relationships between variables. In cases where the relationship is nonlinear, correlation can be misleading.
5. **Spurious Correlations:** Some correlations may be coincidental and unrelated to any meaningful cause-and-effect relationship. Identifying causation requires a deeper understanding of the underlying mechanisms, experimentation, and rigorous research design.
## Causal Inference
Causal inference in statistics is the process of determining cause-and-effect relationships between variables. Establishing causation goes beyond observing correlations or associations; it involves making conclusions about whether changes in one variable actually lead to changes in another. Here are some key methods and considerations for conducting causal inference in statistics:
1. **Randomized Controlled Trials (RCTs):**
- RCTs are often considered the gold standard for establishing causation. In an RCT, subjects are randomly assigned to different groups, one of which receives the treatment (the independent variable) while the other serves as a control group. The randomization helps ensure that both known and unknown confounding variables are equally distributed between groups.
- By comparing the outcomes of the treatment group with those of the control group, researchers can infer causation if a significant difference is observed.
2. **Quasi-Experimental Designs:**
- In situations where it is not ethical or feasible to conduct RCTs, quasi-experimental designs can be used. These designs involve carefully selecting or matching subjects to create groups that resemble randomized groups as closely as possible.
- Common quasi-experimental designs include before-and-after studies, difference-in-differences analysis, and propensity score matching.
3. **Covariate Adjustment:**
- In observational studies where randomization is not possible, covariate adjustment is crucial. It involves controlling for potential confounding variables by including them as covariates in statistical models.
- Techniques such as multiple regression, propensity score matching, and propensity score weighting can help account for covariates.
4. **Instrumental Variables (IV):**
- IV analysis is a method used to address endogeneity (confounding) in observational data. It relies on the idea of finding a variable (the instrument) that affects the treatment variable but is unrelated to the outcome variable except through its impact on the treatment.
- IV methods are commonly used in econometrics and require careful selection of valid instruments.
5. **Natural Experiments:**
- Natural experiments occur when an external event or factor mimics randomization. Researchers can take advantage of these situations to study causal relationships.
- An example might be studying the impact of a policy change in one region while a neighboring region remains unchanged.
6. **Longitudinal Data Analysis:**
- Analyzing data collected over time can help establish causation. By tracking changes in variables and their timing, researchers can draw stronger conclusions about causality.
- Techniques such as fixed effects models and growth curve modeling can be applied to longitudinal data.
7. **Counterfactuals and Potential Outcomes:**
- The counterfactual framework involves comparing what actually happened (the observed outcome) with what would have happened in a hypothetical scenario (the unobserved or counterfactual outcome) had the treatment not been applied.
- This framework underlies many causal inference methods, including causal diagrams and causal mediation analysis.
8. **Causal Diagrams (Directed Acyclic Graphs, DAGs):**
- DAGs are graphical representations of causal relationships between variables. They help identify potential causal paths and confounding variables.
- DAGs can guide the selection of covariates to include in statistical models and clarify the assumptions underlying causal inferences.
9. **Sensitivity Analysis:**
- Sensitivity analysis involves testing the robustness of causal inferences to different assumptions. It helps assess how sensitive the results are to potential sources of bias or unmeasured confounding.
10. **Expert Knowledge and Domain Expertise:**
- Subject-matter expertise is crucial in designing studies, selecting appropriate methods, and interpreting results in the context of the specific field of study.
## Means
### Arithmetic Mean (Average)
The arithmetic mean is widely used and is appropriate for most situations where you want to find a typical or central value. It is commonly used for:
- Describing the central tendency of data.
- Calculating grades or test scores.
- Finding the average income or temperature.
- Assessing the average speed or distance in travel.
$$\mu = \frac {\sum^n_{i=1}x_i}{n}$$
### Geometric Mean
The geometric mean is particularly useful when dealing with values that are products of each other or when you want to understand the compounded growth or rate of change. Common use cases include:
- Calculating investment returns over multiple periods.
- Evaluating compound interest rates.
- Analyzing growth rates, such as population growth or bacterial growth.
- Assessing performance metrics like the geometric mean of returns in finance.
$$\text{Gmean} = \sqrt[n]{x_1 \cdot x_2 \cdots x_n}$$
### Harmonic Mean
The harmonic mean is primarily used when averaging rates and ratios. It gives relatively more weight to small values, so a few very large values cannot dominate the average. Common use cases include:
- **Speed and Rate Calculations:** The harmonic mean is used in physics and engineering to calculate average speeds or rates when dealing with distances or times. For example, it is used to find the average speed of a journey when the speed varies along the route.
- **Networks and Time Calculations:** In computer science and network theory, the harmonic mean is used to calculate average data transfer rates or network speeds. It helps determine the efficiency of data transmission.
- **Financial Metrics:** In finance, the harmonic mean is employed to calculate metrics such as the harmonic mean of price-to-earnings ratios for a set of stocks. It is used to find an average valuation that accounts for extreme values.
- **Environmental and Energy Calculations:** When analyzing environmental data, such as air quality indices or energy efficiency ratings, the harmonic mean can be used to calculate an overall index that considers different components.
- **Chemistry and Music:** In chemistry, the harmonic mean is used when averaging reaction rates. In music theory, it appears when averaging the frequencies of notes in certain intervals.
$$H=\frac {n} {\frac {1} {x_1}+\frac {1} {x_2}+\cdots+\frac {1} {x_n}}$$
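The standard library's `statistics` module (Python 3.8+) covers all three means; the growth factors and speeds below are made-up examples matching the typical use cases listed above.
```python
from statistics import mean, geometric_mean, harmonic_mean

growth_factors = [1.10, 1.20, 0.95, 1.05]   # illustrative yearly growth multipliers
speeds = [60, 40]                            # km/h over two legs of equal distance

print(mean(growth_factors))            # arithmetic mean: a typical value
print(geometric_mean(growth_factors))  # average compounded growth factor per year
print(harmonic_mean(speeds))           # average speed for the whole trip: 48.0, not 50
```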
## Outliers
Outliers are observations that differ significantly from the other values in a dataset.
Removing outliers is important in many statistical analyses because they can skew results and violate the assumptions of many statistical tests.
Here's a step-by-step guide on how to detect and remove outliers:
1. **Visualization**:
- **Boxplots**: Outliers are typically observations outside the whiskers of the boxplot.
- **Histograms and Scatter Plots**: A visual examination can also show data points that stand out from the rest.
- **Z-Score Plot**: For standardized data, points with z-scores greater than 3 or less than -3 are considered potential outliers.
2. **Statistical Methods**:
- **Z-Scores**: Calculate the z-score for each data point. Data points with a z-score absolute value greater than a certain threshold (commonly 2 or 3) are considered outliers.
- **IQR (Interquartile Range)**: For a dataset, calculate Q1 (25th percentile) and Q3 (75th percentile). The IQR is Q3 − Q1. Any data point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR can be considered an outlier (see the sketch after this list).
- **MAD (Median Absolute Deviation)**: This is a robust method to detect outliers when data is skewed.
3. **Domain Knowledge**: Sometimes what may appear as an outlier might be a genuine observation that is of interest, especially in fields like fraud detection. Consult with domain experts before deciding on outlier removal.
4. **Consider the Data Source**: If the data collection method has a chance of producing extreme values due to errors (e.g., sensor malfunctioning, human error), it can further justify the removal of outliers.
5. **Automated Methods**:
- **DBSCAN**: A density-based clustering algorithm that groups points into clusters and labels points that do not belong to any cluster as noise; those noise points can be treated as outliers.
- **Isolation Forest**: This is an ensemble method specifically for anomaly detection (outliers can be considered anomalies).
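As referenced in the IQR item above, here is a minimal pandas sketch combining the IQR rule and the z-score rule; the data and the thresholds (1.5 × IQR, |z| ≤ 3) are illustrative.
```python
import pandas as pd

# Illustrative data with one obvious extreme value (95).
df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 11, 95, 10, 12, 13]})

# IQR rule: keep points inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
mask_iqr = df["value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Z-score rule: keep points within 3 standard deviations of the mean.
z = (df["value"] - df["value"].mean()) / df["value"].std()
mask_z = z.abs() <= 3

cleaned = df[mask_iqr & mask_z]
print(cleaned["value"].tolist())   # the extreme value is dropped by the IQR rule
```
Note that with this small sample the z-score rule alone would not flag 95, because the outlier inflates the standard deviation; the IQR rule is more robust here.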
### Non-parametric Methods
"Non-parametric" in statistics refers to methods that do not make strong assumptions about the form or parameters of a population distribution. Non-parametric methods are often contrasted with parametric methods, which assume a specific distribution for the data, such as the normal distribution.
Non-parametric methods are used to detect outliers when data doesn't follow a normal distribution, or when you don't want to make any assumptions about the distribution of the data. Here's a guide on how to detect and remove non-parametric outliers:
1. **Visualization**:
- **Boxplots**: Even when not assuming normality, outliers can often be observed as points lying outside the whiskers of the boxplot.
- **Scatter Plots and Time-series plots**: Visual examination can also show data points that stand out from the others.
2. **Rank-based Methods**:
- **Modified Z-Score based on MAD (Median Absolute Deviation)**: This method is particularly robust against outliers. Here's how to compute it:
1. Compute the median of the data, let's call it `median`.
2. Compute the MAD:
$$MAD = \text{median}(|X_i - \text{median}(X)|)$$
3. Compute the modified z-score for each data point:
$$mZ_i = 0.6745 \times \frac {X_i - \text{median}(X)}{MAD}$$
4. Data points whose modified z-score has an absolute value greater than 3.5 (or another chosen threshold) are potential outliers (see the sketch after this list).
3. **Depth-based Methods**:
- Methods such as the "Tukey depth" or "simplicial depth" determine the centrality of a point by examining how central it is with respect to various subsets of the data. Points with very low depth can be considered outliers.
4. **Non-parametric Tests**:
- Grubbs' test assumes normality in its standard form, but it can be applied to rank-transformed data, which makes the procedure less dependent on distributional assumptions. It tests the hypothesis that there are no outliers in the dataset.
5. **Density-Based Methods**:
- **DBSCAN**: As previously mentioned, it's a clustering algorithm that can help identify clusters of data points and treat points that don't belong to any cluster as outliers.
6. **Removing the Outliers**: Once you've identified the outliers using non-parametric methods, you can remove them. If using Python, you can easily filter them out using pandas or numpy.
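As referenced in the modified z-score steps above, a compact NumPy sketch (the data and the 3.5 cut-off are illustrative):
```python
import numpy as np

x = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0, 10.0, 12.0, 13.0])  # illustrative data

median = np.median(x)
mad = np.median(np.abs(x - median))        # median absolute deviation
modified_z = 0.6745 * (x - median) / mad   # 0.6745 makes MAD comparable to a standard deviation

outliers = x[np.abs(modified_z) > 3.5]     # threshold from step 4 above
print(outliers)                            # -> [95.]
```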
### Multivariate Outliers
Multivariate outliers are observations with atypical combinations of values across multiple variables or dimensions. While univariate outliers are easily identifiable on one-dimensional scales, multivariate outliers require a bit more finesse due to the interactions between multiple variables.
Here's how to detect and remove multivariate outliers:
1. **Visualization**:
- **Scatterplot Matrices**: For datasets with a manageable number of variables, scatterplot matrices can be used to visualize pairwise relationships and potentially spot multivariate outliers.
- **3D Plots**: Useful for datasets with three variables.
2. **Statistical Methods**:
- **Mahalanobis Distance**: This is a measure of the distance from a point to a distribution, taking into account the covariance structure. A large Mahalanobis distance indicates that the data point might be an outlier. Typically, a chi-squared distribution is used to interpret this distance and decide whether a point is an outlier (see the sketch after this list).
3. **Machine Learning Techniques**:
- **PCA (Principal Component Analysis)**: Outliers can sometimes be detected more easily in the reduced dimension space of principal components.
- **Clustering (like DBSCAN or K-means)**: Observations that don't fit well into any cluster, or are part of very small and sparse clusters, might be considered outliers.
- **Isolation Forest**: This algorithm is specifically designed for anomaly detection. It isolates anomalies based on the idea that outliers are few and different, and thus they can be isolated quickly.
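As referenced in the Mahalanobis distance item above, here is a sketch that flags multivariate outliers by comparing squared Mahalanobis distances against a chi-squared quantile. The simulated data, the injected outlier, and the 0.999 cut-off are all illustrative choices.
```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)

# Strongly correlated 2-D data, plus one point with an atypical *combination* of values.
x = rng.normal(0, 1, 200)
y = 2 * x + rng.normal(0, 0.5, 200)
X = np.vstack([np.column_stack([x, y]), [2.0, -4.0]])   # last row breaks the joint pattern

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
diff = X - mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)       # squared Mahalanobis distances

threshold = chi2.ppf(0.999, df=X.shape[1])               # chi-squared cut-off, dof = number of variables
print(np.where(d2 > threshold)[0])                       # the injected point (index 200) should be flagged
```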
## Error Measurement
Error measurement formulas are crucial in statistics, machine learning, and various scientific domains to quantify the accuracy of predictions or estimates.
### Mean Squared Error (MSE)
- **When to Use**: Regression problems.
- **Pros**: Applies a quadratic penalty to errors, so larger errors are penalized much more heavily than smaller ones. Its square root, the root mean squared error (RMSE), is in the same units as the response variable, which helps interpretation.
- **Cons**: Sensitive to outliers since it squares the residuals. Might not be the best choice if you have a lot of outliers and don't want them to dominate the error metric.
$$MSE = \frac {1}{n}\sum^n_{i=1}(y_i-\hat y_i)^2$$
- $y_i$ is actual value
- $\hat y_i$ is predicted value
- $n$ is number of data points
### Mean Absolute Error (MAE)
- **When to Use**: Regression problems.
- **Pros**: Applies a linear penalty to errors, so every unit of error counts the same whether it comes from a small or a large residual. Useful when you do not want large errors to dominate the metric.
- **Cons**: Might not be sensitive to large errors if they are relatively rare.
$$MAE = \frac {1}{n}\sum^n_{i=1}|y_i-\hat y_i|$$
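Both metrics are available in scikit-learn (or can be written in one line of NumPy); the actual and predicted values below are illustrative.
```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # illustrative actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # illustrative predictions

mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

print(f"MSE  = {mse:.3f}")             # quadratic penalty: the largest residual dominates
print(f"MAE  = {mae:.3f}")             # linear penalty: each unit of error counts the same
print(f"RMSE = {np.sqrt(mse):.3f}")    # back in the units of the response variable
```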
## A/B Testing
A/B testing, also known as split testing, is a method of comparing two versions of a webpage or app against each other to determine which one performs better in terms of a specific metric (e.g., conversion rate). The goal is to identify changes that increase the likelihood of a user taking a desired action, such as signing up for a newsletter, purchasing a product, or clicking a button.
In the context of statistics, A/B testing is essentially a controlled experiment with two groups:
- **A (Control Group)**: Users exposed to the current version.
- **B (Treatment Group)**: Users exposed to the new version.
Here's a step-by-step approach to conducting an A/B test:
1. **Define the Objective**: Clearly state the goal of your test. It might be increasing the click-through rate, increasing sales, reducing bounce rate, etc.
2. **Random Assignment**: Users should be randomly assigned to either group A or B to ensure that there are no underlying biases in your test groups.
3. **Size Determination**: Before starting the test, you need to decide how many data points (users, sessions, etc.) you need. This often depends on expected effect size, desired power of the test, and acceptable significance level. Tools like online sample size calculators can help.
4. **Run the Test**: Expose your users to version A or B based on their group assignment and collect data on the metric of interest.
5. **Statistical Analysis**: After collecting the data:
- **Choose the Right Test**: For proportions (like conversion rates), a chi-squared test or two-proportion z-test is commonly used (see the sketch at the end of this section). For continuous data (like revenue), you might use a t-test.
- **Test the Null Hypothesis**: Usually, the null hypothesis states that there is no difference between groups A and B in terms of the metric of interest.
- **Check Assumptions**: Ensure the assumptions of your statistical test are met (e.g., normality, equal variances for a t-test).
6. **Interpret the Results**: Look at the p-value from your test. If it's below a pre-defined threshold (e.g., 0.05), you might reject the null hypothesis and conclude that there's a statistically significant difference between A and B.
7. **Consider Other Factors**:
- **Multiple Comparisons**: If you're running multiple A/B tests at once, consider the issue of multiple comparisons, which can increase the chance of finding a false positive.
- **Practical Significance**: Even if a result is statistically significant, it might not be practically significant. A tiny improvement might not justify the costs or risks of implementing a change.
8. **Make a Decision**: Based on your results and their practical implications, decide whether to implement the changes from version B or to keep version A.
9. **Iterate**: A/B testing is an ongoing process. You can continually refine and test new hypotheses to optimize performance.
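To illustrate step 5, here is a sketch of a two-proportion z-test with statsmodels; the conversion counts and visitor numbers are made up, and the 0.05 threshold is just the conventional choice.
```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions and visitors for control (A) and treatment (B).
conversions = np.array([120, 155])
visitors = np.array([2400, 2380])

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"conversion rates: A = {conversions[0] / visitors[0]:.2%}, B = {conversions[1] / visitors[1]:.2%}")
print(f"z = {z_stat:.3f}, p = {p_value:.4f}")   # compare p to the chosen significance level (e.g. 0.05)
```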