---
title: 'StatQuest'
disqus: hackmd
---

StatQuest
===

## Table of Contents

[TOC]

# Machine learning

Machine learning uses data to do prediction and classification. Use training data to train a model and test data to evaluate it. A model that fits the training data well but makes poor predictions is overfitting; balancing how well a model fits the training data against how well it predicts new data is the Bias-Variance Tradeoff.

## Cross-Validation

Cross-validation allows us to compare different machine learning methods and get a sense of how well they will work in practice. When given data, we should:

1. Estimate the parameters for the machine learning methods (train the algorithm using the data).
2. Evaluate how well the machine learning methods work (test the algorithm).

Cross-validation divides the data into blocks, then uses one block as the test data and the other blocks to train the model. In practice, it's common to divide the data into 10 blocks; this is called Ten-Fold Cross-Validation. Leave One Out (LOO): use one data point to test and all the other data points to train. Cross-validation can also be used to pick the best hyperparameter.

## Confusion matrix

The rows of the confusion matrix correspond to the machine learning method's predictions ($\hat{Y_i}$); the columns correspond to the truth ($Y_i$).

| | $Y_1$ | $Y_2$ | ... | $Y_n$ |
| ----------- | ---------------- | ---------------- | --- | ----- |
| $\hat{Y}_1$ | (True Positive) | (False Positive) | ... | |
| $\hat{Y}_2$ | (False Negative) | (True Negative) | ... | |
| ... | ... | ... | ... | ... |
| $\hat{Y}_n$ | | | ... | |

The numbers along the diagonal tell us how many times the samples were correctly classified. We can build a confusion matrix for each method and compare them.

## Sensitivity and Specificity

$Sensitivity = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$

$Specificity = \frac{\text{True Negatives}}{\text{True Negatives + False Positives}}$

$$Precision = \frac{\text{True Positives}}{\text{True Positives + False Positives}} $$

Precision is the proportion of positive results that were correctly classified. When the confusion matrix has more than 2 dimensions, compute sensitivity and specificity for each class with all the others collapsed.

## Bias and Variance

The inability of a machine learning method to capture the true relationship is called bias. For example, linear regression cannot capture a non-linear relationship. An overfitted model can fit the training data perfectly, so its bias will be near zero. We can compare the fits using the sum of squares. The difference in fits between datasets is called variance. An overfitted model does not fit the test data well, so its variance will be high. Ideally, the best model has low bias and low variance.

## ROC and AUC

### ROC

ROC evaluates the effectiveness of a model with a specific parameter setting across classification thresholds.

$$\begin{aligned} \text{Y-axis}&: \text{True Positive Rate} = \text{Power} = \text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}\\ \text{X-axis}&: \text{False Positive Rate} = \alpha = 1 - \text{Specificity} = \frac{\text{False Positives}}{\text{False Positives + True Negatives}} \end{aligned}$$

### AUC

ROC compares the different thresholds of one model. AUC, the area under the ROC curve, can compare different models; a model with a higher AUC is better.
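As a quick, concrete illustration of cross-validation, the confusion matrix, and AUC, here is a minimal Python sketch, assuming scikit-learn is installed; the synthetic dataset and the logistic regression model are placeholder choices, not the only options.

```python
# Minimal sketch: 10-fold cross-validation, a confusion matrix, and ROC AUC.
# Assumes scikit-learn is installed; the synthetic dataset is just a placeholder.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Ten-Fold Cross-Validation: each block takes a turn as the test data.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X, y, cv=StratifiedKFold(n_splits=10))
print("10-fold CV accuracy:", cv_scores.mean().round(3))

# Confusion matrix and AUC on a held-out test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
# Note: sklearn puts the truth on the rows and the predictions on the columns,
# the transpose of the table above; the diagonal still counts correct calls.
print(confusion_matrix(y_test, pred))
prob = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, prob).round(3))
```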
## Naive Bayes classifier

$$\begin{aligned} \hat y &= \arg\max_y P(Y = y|X = x) = \arg\max_y \frac{P(Y = y)\, P(X = x|Y = y)}{\underbrace{P(X = x)}_{\text{does not depend on } y\text{, so ignore it}}} \\ &= \arg\max_y P(Y = y)\, P(X = x|Y = y) \end{aligned}$$

## Hidden Markov Model

In a probabilistic setting, there are two groups of variables:

1. hidden states ($H$) - the variables we cannot see.
   - The probability of moving from one hidden state to another is called the transition probability.
2. observations ($O$) - the variables we observe.
   - The probability that an observation is emitted from a hidden state is called the emission probability.

### How do we get these probabilities?

For transition probabilities:

1. List the observed state sequences one by one and calculate the frequency of moving from one state to another.

### How to predict the next hidden state according to transition probabilities?

Assume our hidden variable $H$ has 2 hidden states, a and b. The probabilities of the next state are:

$$\begin{aligned} Pr(H_{next} = a) &= Pr(a \rightarrow a) \times Pr(H_{current} = a) + Pr(b \rightarrow a) \times Pr(H_{current} = b)\\ Pr(H_{next} = b) &= Pr(a \rightarrow b) \times Pr(H_{current} = a) + Pr(b \rightarrow b) \times Pr(H_{current} = b) \end{aligned}$$

### How to predict the hidden state according to an observation?

**Prior probability $P(H)$: the initial probability of each hidden state before we see any observations.**

**Posterior probability $P(H|O = \text{a value})$: the probability of each hidden state after we include the probability of the observations.**

### How to predict a sequence of hidden states from a sequence of observations?

Maximum likelihood! Pick the hidden-state sequence that makes the observations most likely to happen: the Viterbi algorithm (dynamic programming).

# Regularization

## Ridge Regression

Ridge regression introduces a small amount of bias into the regression, so it doesn't fit the training data perfectly, but its variance is better than that of a perfectly fitting line.

$$\begin{aligned} \text{Least Squares} &: \text{minimize the sum of the squared residuals}\\ \text{Ridge Regression} &: \text{minimize the sum of the squared residuals} + \lambda \times \text{the slope}^2 \end{aligned}$$

$\lambda \times \text{the slope}^2$ adds a penalty to the traditional Least Squares method. $\lambda \in [0, \infty)$ determines how severe the penalty is. When $\lambda = 0$, the slope is the same as least squares; as $\lambda \rightarrow \infty$, the slope shrinks toward zero. Use cross-validation to determine $\lambda$. When the slope of a line is steep, the prediction for $Y$ is very sensitive to relatively small changes in $X$.

- When we have categorical variables,
  $$\text{Ridge Regression} : \text{minimize the sum of the squared residuals} + \lambda \times \text{difference in means}^2 $$
- When we have multiple variables,
  $$\text{Ridge Regression} : \text{minimize the sum of the squared residuals} + \lambda \times (\text{slope}_1^2+\text{slope}_2^2+ ...) $$
- When we are doing logistic regression,
  $$\text{Ridge Regression} : \text{minimize the negative log-likelihood} + \lambda \times \text{slope}^2 $$

**When the sample sizes are relatively small, Ridge Regression can improve predictions made from new data by making the predictions less sensitive to the training data.**
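To see the penalty in action, here is a minimal sketch with scikit-learn (assumed installed); the tiny dataset and the grid of $\lambda$ values are made up for illustration, and scikit-learn calls the penalty `alpha` rather than $\lambda$.

```python
# Minimal sketch of how the ridge penalty shrinks the slope as lambda grows.
# Assumes scikit-learn is installed; the tiny dataset is invented for illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 1))                      # small sample: ridge helps most here
y = 1.5 * X[:, 0] + rng.normal(scale=0.5, size=20)

print("least squares slope:", LinearRegression().fit(X, y).coef_[0])
for lam in [0.1, 1.0, 10.0, 100.0]:
    slope = Ridge(alpha=lam).fit(X, y).coef_[0]   # alpha plays the role of lambda
    print(f"lambda={lam:>5}: slope={slope:.3f}")

# Cross-validation picks lambda for us.
cv_model = RidgeCV(alphas=[0.1, 1.0, 10.0, 100.0]).fit(X, y)
print("lambda chosen by cross-validation:", cv_model.alpha_)
```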
## Lasso Regression

$$\begin{aligned} \text{Ridge Regression} &: \text{minimize the sum of the squared residuals} + \lambda \times \text{the slope}^2\\ \text{Lasso Regression} &: \text{minimize the sum of the squared residuals} + \lambda \times |\text{the slope}| \end{aligned}$$

Differences between lasso and ridge:

1. **Ridge regression** can only shrink the slope **asymptotically** close to $0$, while lasso regression can shrink the slope **all the way to 0**. So lasso can exclude useless variables from the equation.

When there are useless variables, lasso can exclude them. When all the variables are useful, ridge does better.

## Elastic-Net Regression

When we have millions of variables, we use elastic-net regression!

$$\begin{aligned} \text{Ridge} &: \text{minimize the SS(residual)} + \lambda \sum_{i = 1}^N coefficient_i^2\\ \text{Lasso} &: \text{minimize the SS(residual)} + \lambda \sum_{i = 1}^N |coefficient_i|\\ \text{Elastic-Net} &: \text{minimize the SS(residual)} + \lambda_1 \sum_{i = 1}^N coefficient_i^2 + \lambda_2 \sum_{i = 1}^N |coefficient_i| \end{aligned}$$

Elastic-net is good when there are correlations between parameters: lasso alone tends to keep one of the correlated variables and eliminate the others, while ridge (and the ridge part of elastic-net) shrinks all of the correlated variables together.

## PCA Tips

Eigenvalue: the sum of squares of the distances between the points projected onto the line and the origin.

1. Scale your data! Scale your data by dividing by the standard deviation: $$\text{scaled data} = \frac{data}{\text{Standard deviation}}$$
2. Center your data! Center each dimension of your data.
3. How many PCs? Technically, there is a PC for each variable in the dataset. If there are fewer samples than variables, then the number of samples puts an upper bound on the number of PCs with eigenvalues greater than 0.

## Linear Discriminant Analysis

LDA focuses on maximizing the separability among known categories. LDA creates a new axis and projects the data onto it in a way that maximizes the separation of the two categories.

How LDA creates the new axis: $\max \frac{(\mu_1 - \mu_2)^2}{s_1^2 + s_2^2}$

1. Maximize the distance between the means.
2. Minimize the variation ("scatter", $s^2$) within each category.

When there are 3 or more categories, the distance is not between each pair of means, but between each category mean and the central point of all the data (the mean of all the data).

**So PCA just looks for the components with the largest variance, while LDA also tries to separate the known categories.**

The similarities between LDA and PCA:

- Both rank the new axes in order of importance.
  - PC1 accounts for the most variation in the data.
  - LD1 accounts for the most variation between the known categories.

![](https://i.imgur.com/woWkTnU.png =400x)

## Hierarchical clustering in the heatmap

Given a heatmap, we are going to reorder rows and columns to cluster similar variables.

1. We start by treating each row as a cluster. For each cluster, figure out which cluster is most similar to it, e.g. by Euclidean distance.
2. Merge the most similar pair of clusters into a larger cluster.
3. Repeat until only a single cluster containing all rows is left.

The dendrogram indicates both the similarity and the order in which the clusters were formed.

There are different ways of comparing the distance between one cluster and another:

1. The average point of each cluster (Centroid)
2. The closest point in each cluster (Single-Linkage)
3. The furthest point in each cluster (Complete-Linkage)
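Here is a minimal sketch of hierarchical clustering on the rows of a small matrix using scipy (assumed installed); the toy data matrix and the choice of linkage method are invented for illustration.

```python
# Minimal sketch of hierarchical clustering on the rows of a small matrix.
# Assumes scipy is installed; the toy data matrix is invented for illustration.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 1, size=(5, 4)),      # one group of rows
                  rng.normal(5, 1, size=(5, 4))])     # a clearly different group

dists = pdist(data, metric="euclidean")   # pairwise Euclidean distances between rows
Z = linkage(dists, method="average")      # also try "single", "complete", "centroid"

labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                             # which of the 2 clusters each row joined

# dendrogram(Z) would draw the merge order and similarity (needs matplotlib).
```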
## K-means clustering

### Steps:

1. Select the number of clusters $K$ you want to identify in your data.
2. Randomly select $K$ data points as the initial cluster representatives.
3. Measure the distance between each point and each of the $K$ initial clusters.
4. Assign each point to its nearest cluster.
5. Calculate the mean of each cluster and use it as the cluster's new representative.
6. Repeat until no point changes cluster.
7. Try different initial cluster representatives.

### How to decide K?

1. Try different values of $K$ and compare the total variation. Plot the number of clusters against the reduction in variation and you get an elbow plot.

![](https://i.imgur.com/3F8AOkG.png =400x)

## Classical Multidimensional Scaling (MDS), aka Principal Coordinate Analysis (PCoA)

Similar to PCA, except that instead of converting correlations into a 2-D graph, it converts distances among the samples into a 2-D graph.

1. Calculate the distance between each pair of data points (log fold change distance, Manhattan distance ($L_1$), Hamming distance, ...).
2. Reduce them to a 2-D graph.

If we use Euclidean distance, PCoA = PCA, so minimizing the linear distances is the same as maximizing the linear correlations. We can also use the average of the absolute values of the log fold changes as the distance.

### Difference between K-means and Hierarchical clustering:

K-means clustering specifically tries to put the data into the number of clusters you tell it to. Hierarchical clustering just tells you, pairwise, which two things are most similar.
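A minimal sketch of K-means and the elbow idea with scikit-learn (assumed installed); the blob data and the range of $K$ values tried are arbitrary placeholders.

```python
# Minimal sketch of K-means and an elbow plot. Assumes scikit-learn is installed;
# the blob data are synthetic and the K values tried are arbitrary.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Total within-cluster variation (inertia) for different K: look for the "elbow".
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"K={k}: total within-cluster variation = {km.inertia_:.1f}")

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster assignment of the first 10 points
```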
# Statistics Foundation

## Plots

### Histogram

Put the sample values into small bins.

### Boxplots

The box represents the middle 50% of the data, i.e. the [25%, 75%] interval. The line in the middle of the box is the median.

## Distributions

### Binomial distribution

Describes a process with two possible outcomes.

PMF: $$Pr(x|n, p) = \frac{n!}{x!(n-x)!}p^x (1-p)^{n-x}$$

### Normal Distribution

Has two parameters, the mean and the standard deviation. Its height and width depend on the standard deviation. About 95% of the samples fall within $[\mu - 2\sigma, \mu + 2\sigma]$.

PDF: $$Pr(x|\mu, \sigma) = \frac{1}{\sqrt{2 \pi \sigma ^2}} e^{\frac{-(x- \mu)^2}{2 \sigma^2}} $$

where $\mu$ is the mean and $\sigma$ is the standard deviation.

### Exponential distribution

Models the time between events.

PDF: $$Pr(x|\lambda) = \lambda e ^{-\lambda x}$$

## Regressions

### Linear regression (General Linear Models)

Fitting a line to data, aka least squares, aka linear regression.

#### 1. Steps:

1. Use least squares to fit a line to the data.
2. Calculate $R^2$.
3. Calculate a p-value for $R^2$.

#### 2. Concepts

- **Residual**: the vertical distance from a data point to the line, $y_i - f(x_i)$.
- **Sum of squared residuals**: $((ax_1+b)-y_1)^2+((ax_2+b)-y_2)^2+...$
- **Sum of squares around the mean**: $SS(mean) = \sum (y_i - \bar{y})^2$
- **Variation around the mean**: $Var(mean) = \frac{\sum (y_i - \bar{y})^2}{n} = \frac{SS(mean)}{n}$
- **Sum of squares around the fit**: $SS(fit) = \sum (y_i - f(x_i))^2$
- **Variation around the fit**: $Var(fit) = \frac{\sum (y_i - f(x_i))^2}{n} = \frac{SS(fit)}{n}$
- In general, $Variance(something) = \frac{\text{Sums of squares}}{\text{The number of those things}} = \text{Average sum of squares}$
- $R^2 = \frac{Var(mean) - Var(fit)}{Var(mean)}$

#### 3. Detailed steps

1. Linear regression starts by calculating the sum of squared residuals for the horizontal line $y = \bar{y}$, then rotates the line to find the least-squares fit. To do so, take the partial derivatives with respect to $a$ and $b$ and solve for the best-fit $a$ and $b$.
2. After getting the line, calculate $R^2 = \frac{Var(mean) - Var(fit)}{Var(mean)} = \frac{SS(mean) - SS(fit)}{SS(mean)}$.
3. Calculate the p-value for $R^2$ from F.

$$\begin{aligned} F &= \frac{\text{The variation in mouse size explained by weight}}{\text{The variation in mouse size not explained by weight}} \\ &= \frac{(SS(mean) - SS(fit))/(p_{fit} - p_{mean})}{SS(fit)/(n - p_{fit})}\\ &= \frac{R^2}{1 - R^2} \times \frac{n-p_{fit}}{p_{fit} - p_{mean}} \end{aligned}$$

> Here $p_{fit} - p_{mean}$ and $n - p_{fit}$ are the **degrees of freedom**.
> The numerator $(SS(mean) - SS(fit))/(p_{fit} - p_{mean})$ tells us the variance explained by the extra parameters. $p_{fit}$ is the number of parameters in the fitted model; $p_{mean}$ is the number of parameters of the mean model, which has only one parameter.
> The denominator $SS(fit)/(n - p_{fit})$ is the variance not explained by the model. Why $n - p_{fit}$? The more parameters you have, the more data you need to estimate them.
> F will be a large number if the line fits well. Originally people generated random data points to build an F distribution empirically, but now we have F-distributions on computers.

#### 4. $F$ for multiple regression

$$F = \frac{(SS(simple) - SS(multiple))/(p_{multiple} - p_{simple})}{SS(multiple)/(n - p_{multiple})}$$
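To make the $R^2$ and F calculation concrete, here is a minimal numpy/scipy sketch on made-up data (assuming numpy and scipy are installed); the "weight" and "size" variables are placeholders.

```python
# Minimal sketch: fit a line by least squares, then compute R^2, F, and a p-value.
# Assumes numpy and scipy are installed; the data are made up for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=30)                        # e.g. mouse weight
y = 2.0 * x + 3.0 + rng.normal(scale=2.0, size=30)     # e.g. mouse size

a, b = np.polyfit(x, y, deg=1)                         # least-squares slope, intercept
fit = a * x + b

n, p_fit, p_mean = len(y), 2, 1
ss_mean = np.sum((y - y.mean()) ** 2)                  # SS(mean)
ss_fit = np.sum((y - fit) ** 2)                        # SS(fit)

r_squared = (ss_mean - ss_fit) / ss_mean
F = ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))
p_value = stats.f.sf(F, p_fit - p_mean, n - p_fit)     # area to the right of F

print(f"R^2 = {r_squared:.3f}, F = {F:.1f}, p = {p_value:.2e}")
```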
### t-tests and ANOVA

#### t-test

![](https://i.imgur.com/sKpEdXD.png =400x)

#### ANOVA

![](https://i.imgur.com/Btvi6bj.png =400x)

### Logistic regression

Logistic regression is a specific type of Generalized Linear Model (GLM), which is a generalization of the concepts and abilities of regular linear models (linear regression). The y-axis values (probabilities) are between [0,1]; to make the y-axis values span $(-\infty, \infty)$, the y-axis in logistic regression is transformed from the $probability$ to the $log(odds) = log(\frac{p}{1-p})$. Just like in linear regression, the y-axis can then go from $-\infty$ to $\infty$ and the fitted line becomes a straight line.

#### Coefficient

##### For continuous variables and the intercept

The coefficients of logistic regression come from the log(odds) plot, so if the intercept ($b$) of $y = ax + b$ is $2$, it means that when $x = 0$, $log(\text{odds y}_{x = 0}) = 2$. The z-value is the estimated intercept divided by its standard error; in other words, it is the number of standard deviations the estimated intercept is away from 0 on a standard normal curve (**Wald test!**), so compare it with 2 standard deviations. The same reasoning applies to continuous variables.

##### For a binary variable

The steps are similar to applying a t-test in linear regression: build a design matrix and calculate $R^2$.

1. Fit a line to represent $log(\text{odds y}_{x = 0})$ for $x = 0$, then calculate $log(\text{odds y}_{x = 1})$ for $x = 1$.
2. Write the equation $y = log(\text{odds y}_{x = 0}) \times B_1 + log\left(\frac{\text{odds y}_{x = 1}}{\text{odds y}_{x = 0}}\right) \times B_2$, where $B_1$ and $B_2$ are the two columns of the design matrix, so the second coefficient is a log(odds ratio).

**So after the transformation to $log(\text{odds y})$, we can apply anything from linear regression to logistic regression**, such as t-tests, $R^2$, and ANOVA.

#### Fitting a line using maximum likelihood

In this step, we fit a line to the $\text{log(odds)}$-transformed data. Because the $\text{log(odds)}$ transformation pushes the data points out to $\pm\infty$, the residuals from the points to a fitted line are also infinite, so we cannot use least squares.

##### STEPS:

1. Select a candidate squiggle in the original probability graph, which corresponds to a candidate log-odds line in the log-odds graph.
2. Use the $x$ value of each data point to read off its $\text{log(odds)}$ value. In other words, project all data onto the candidate $\text{log(odds)}$ line, so each data point gets a fitted $\text{log(odds)}$ value. The reason for the projection is that the raw data would get infinite log(odds) values wherever the probability is 1 (or 0).
3. Transform the $\text{log(odds)}$ values back to $probabilities$ using the sigmoid function $\sigma = \frac{e^{\text{log(odds)}}}{1+e^{\text{log(odds)}}}$, which is the inverse of $\text{log(odds)} = log(\frac{p}{1-p})$. These $probabilities$ are the fitted values of the logistic regression. Predictions then use these $probabilities$: if $Pr(x_i) \geq 0.5$, predict $y_i = 1$, else $y_i = 0$.
4. Calculate the likelihood for each point. If the true label of a point is 1, $L(x_i) = Pr(x_i)$; if the true label of a point is 0, $L(x_i) = 1 - Pr(x_i)$.
5. Calculate the $likelihood$ or $\text{log likelihood}$ of the data from these per-point $probabilities$:
$$\begin{aligned} \text{likelihood of data} &= \prod_{i:y_i = 1} Pr(x_i) \times \prod_{j:y_j = 0} (1-Pr(x_j))\\ \text{log likelihood of data} &= \sum_{i:y_i = 1} \log Pr(x_i) + \sum_{j:y_j = 0} \log(1- Pr(x_j)) \end{aligned}$$
6. Iterate these steps for other candidate lines and keep the one with the maximum likelihood.

#### R-squared and p-value in logistic regression

##### SS(fit)

Because we cannot use SS(fit) and SS(mean) in logistic regression (they are all infinite), we use the maximum $\text{log likelihood}$, LL(fit), from the last step as the substitute for SS(fit).

##### SS(mean)

For the substitute of SS(mean), we calculate the overall log(odds), which is $\log\left(\frac{\text{# of data points with } y = 1}{\text{# of data points with } y = 0}\right)$, transform it back with the sigmoid function to get the overall probability (the horizontal "mean" line in the probability graph), and then use this single probability for every point to calculate LL(overall probability). Another way is to get the number directly as $Pr(y = 1) = \frac{\text{# of } y = 1}{\text{total number}}$.

**$R^2 = \frac{LL(\text{overall probability}) - LL(fit)}{LL(\text{overall probability}) - LL(\text{saturated model})}$**

In logistic regression, $LL(\text{saturated model}) = 0$ because the saturated model fits the data perfectly, assigning probability 1 to each observed outcome, so each likelihood is 1 and each log likelihood is 0. In other generalized linear models, $LL(\text{saturated model})$ is not always zero.

##### p-value

Notice that $2 \times (LL(fit) - LL(\text{overall probability}))$ is a $\chi ^2$ value with degrees of freedom equal to the difference in the number of parameters in the two models. Here the overall-probability model has only a single parameter, so for a single-predictor model the df is 1.
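Here is a minimal sketch of these calculations, assuming statsmodels and scipy are installed; the toy data are invented, and the fitted model's `llf` and `llnull` attributes are used as LL(fit) and LL(overall probability).

```python
# Minimal sketch: fit a logistic regression, then compute McFadden's R^2 and the
# chi-squared p-value from log-likelihoods. Assumes statsmodels and scipy are
# installed; the toy data are invented for illustration.
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(size=100)
p = 1 / (1 + np.exp(-(0.5 + 2.0 * x)))          # sigmoid turns log(odds) into Pr
y = rng.binomial(1, p)

model = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
ll_fit = model.llf          # LL(fit): maximum log-likelihood of the fitted line
ll_overall = model.llnull   # LL(overall probability): intercept-only model

r2 = (ll_overall - ll_fit) / (ll_overall - 0)   # LL(saturated model) = 0 here
chi2 = 2 * (ll_fit - ll_overall)
p_value = stats.chi2.sf(chi2, df=1)             # df = difference in # of parameters

print(f"McFadden R^2 = {r2:.3f}, p = {p_value:.2e}")
```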
## Concepts

### Saturated model and deviance statistics

#### Definitions

- **Null model**: has the fewest parameters (a single parameter). By calculating the likelihood of the $\textbf{Null Model}$, we get a sense of the worst-case scenario.
- **Saturated model**: appears in generalized linear models. It maxes out the number of parameters we can estimate, so the likelihood of the $\textbf{Saturated Model}$ is as high as it can be. In the logistic model, $LL(\textbf{Saturated Model}) = 0$.
- In a generalized linear model,
  $$R^2 = \frac{LL(\text{Null Model}) - LL(\text{Proposed Model})}{LL(\text{Null Model}) - LL(\textbf{Saturated Model})} $$

##### p-value calculation No.1

> - Residual Deviance:
> $$\text{Residual Deviance} = 2 \times (LL(\textbf{Saturated Model}) - LL(\text{Proposed Model})) =\text{a }\chi^2 \text{ value} $$
> The factor of 2 makes the difference in these log-likelihoods follow a $\chi^2$ distribution with degrees of freedom equal to the difference in the number of parameters:
> $$DF_{S-P} = \text{# of parameters}_{\text{Saturated Model}} - \text{# of parameters}_{\text{Proposed Model}} $$
> To get the p-value comparing the Saturated Model with the Proposed Model, we compare the Residual Deviance with the $\chi^2$ distribution with $DF_{S-P}$ degrees of freedom.
> **To make sure the $\chi ^2$ test works correctly, the Proposed Model has to be a simpler version of the Saturated Model (nested).**
> - Null Deviance:
> $$\text{Null Deviance} = 2 \times (LL(\textbf{Saturated Model}) - LL(\text{Null Model})) =\text{a }\chi^2 \text{ value} $$
> $$DF_{S-N} = \text{# of parameters}_{\text{Saturated Model}} - \text{# of parameters}_{\text{Null Model}} $$
> To get the p-value comparing the Saturated Model with the Null Model, we compare the Null Deviance with the $\chi^2$ distribution with $DF_{S-N}$ degrees of freedom.
> - p-value: To get the p-value comparing the Proposed Model with the Null Model, we compute
> $$\text{Null Deviance} - \text{Residual Deviance}$$
> $$DF_{P-N} = \text{# of parameters}_{\text{Proposed Model}} - \text{# of parameters}_{\text{Null Model}} $$
> and look up this difference in the $\chi^2$ distribution with $DF_{P-N}$ degrees of freedom.

##### p-value calculation No.2

> $$2 \times (LL(\text{Proposed Model}) - LL(\text{Null Model})) =\text{a }\chi^2 \text{ value} $$
> $$DF_{P-N} = \text{# of parameters}_{\text{Proposed Model}} - \text{# of parameters}_{\text{Null Model}} $$
> Look up this value in the $\chi^2$ distribution with $DF_{P-N}$ degrees of freedom to get the p-value.

#### How these pieces fit together

Ideally, we want the likelihood of the data given our Proposed Model to be larger than for the Null Model and close to the Saturated Model. We use these likelihoods to calculate $R^2$ and its p-value for the **Proposed Model**.

- In linear regression,
  $$R^2 = \frac{SS(\text{Null Model}) - SS(\text{Proposed Model})}{SS(\text{Null Model})}$$
  where $SS(\text{Null Model})$ sets the boundary for a bad fit. When the model fits the data perfectly, the residuals are all 0 and $SS(\text{Proposed Model}) = 0$, so $R^2 = 1$. In this formulation, $SS(\text{Null Model})$ provides an upper bound; $SS(\text{Proposed Model})$ will not be larger than $SS(\text{Null Model})$.
- In logistic regression,
  $$R^2 = \frac{LL(\text{Null Model}) - LL(\text{Proposed Model})}{LL(\text{Null Model}) - LL(\textbf{Saturated Model})} $$
  Since $LL(\textbf{Null Model})$ is the lower bound, we need an upper bound in the denominator to keep $R^2$ in $[0,1]$; that is where $LL(\textbf{Saturated Model})$ comes in. It is the best the model can be.

### Deviance Residuals

Deviance residuals represent the square root of the contribution that each data point makes to the overall Residual Deviance.

$$\text{Residual Deviance} = 2 \times (LL(\text{Saturated Model}) - LL(\text{Proposed Model})) $$
$$\text{Deviance residual of } x_i = \pm\sqrt{2\times(LL_{\text{Saturated}}(x_i) - LL_{\text{Proposed}}(x_i))} $$

so the squared deviance residuals sum to the Residual Deviance.

### Design matrix

#### Design matrix for t-test and ANOVA

![](https://i.imgur.com/wQ7Abt3.png =400x)

#### Design matrix for linear regression

![](https://i.imgur.com/MXezxDo.png =400x)

#### Design matrix for three variables

Compare whether the complex model is better than the simple model.

![](https://i.imgur.com/VfpjT3z.png =400x)
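To make the design-matrix idea concrete, here is a minimal numpy sketch that encodes a two-group comparison (the t-test design above) as a regression problem; the group labels and measurements are made up for illustration.

```python
# Minimal sketch: a two-group comparison (t-test style) written as a regression
# with a design matrix. Assumes numpy; the measurements are made up.
import numpy as np

control = np.array([2.1, 1.9, 2.4, 2.2, 2.0])
treated = np.array([3.0, 3.4, 2.9, 3.3, 3.1])
y = np.concatenate([control, treated])

# Design matrix: column B1 is the intercept (control mean),
# column B2 switches on only for treated samples (difference in means).
B1 = np.ones_like(y)
B2 = np.concatenate([np.zeros_like(control), np.ones_like(treated)])
X = np.column_stack([B1, B2])

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("control mean:       ", coef[0])     # intercept
print("difference in means:", coef[1])     # treated - control
```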
### Log

Logs are just isolated exponents. Fold change should be plotted on a log axis because raw fold change is bounded below by 0, and "no change" (FC = 1) lands at 0 on the log axis, making up- and down-regulation symmetric.

#### Geometric mean

The geometric mean is the mean of the logs (transformed back to the original scale). It is good for log-scale data and less sensitive to outliers.
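A tiny numpy sketch of the log / geometric-mean idea; the fold-change values (including the outlier) are invented for illustration.

```python
# Minimal sketch: geometric mean as the anti-log of the mean of the logs.
# Assumes numpy; the fold-change values are invented, including one outlier.
import numpy as np

fold_changes = np.array([0.5, 1.0, 2.0, 4.0, 100.0])    # 100.0 is an outlier

arithmetic_mean = fold_changes.mean()
geometric_mean = np.exp(np.mean(np.log(fold_changes)))

print("arithmetic mean:  ", round(arithmetic_mean, 2))   # dragged up by the outlier
print("geometric mean:   ", round(geometric_mean, 2))    # much less sensitive to it
print("log2 fold changes:", np.log2(fold_changes))       # FC = 1 maps to 0
```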
### Degrees of freedom

Definition: the number of independent pieces of information needed to make your calculation. If all pieces of information are independent, $DF = N - 1$.

#### 1 degree of freedom:

- When there are only two possible outcomes.
- A coin toss has 1 degree of freedom, because if you got heads, you automatically know you didn't get tails; one independent piece of information required = 1 degree of freedom.
- Counting how many heads and tails you get in 1000 coin tosses also has 1 degree of freedom, because once you know how many heads you got, you automatically know how many tails you got.

#### 2 degrees of freedom:

- When there are 3 possible outcomes. If we only know that one outcome didn't happen, that doesn't tell us the true outcome; we have to know two outcomes that didn't happen.
- For example, for a traffic light, knowing the color is not green doesn't tell us the true color; we also have to know that the color is not yellow to conclude that the true color is red.

#### Why $n-1$ in SD, but $n$ in $\bar{x}$?

When you calculate the mean, you don't need to know anything beforehand; knowing one value tells you nothing about another value, so you divide by $n$. When calculating the SD, you have to know the mean first, which means that if you know the first $n-1$ data values, you can work out the $n^{th}$ value from the mean. So the DF is $n-1$: we only need to know $n-1$ values to get the SD.

#### DF in test statistics

When calculating a statistic such as t, r, or $\chi ^ 2$, we have to calculate other statistics such as the mean or SD beforehand.

1. One-sample t-test: need to know the mean of the sample, so $DF = n-1$.
2. Two-sample t-test: need to know the means of the two samples, so $DF = (n_1 + n_2) - 2$.

### Population parameters

A population distribution is a distribution that fits all possible events. The parameters that determine how a distribution fits the population data are called population parameters; for example, the mean and sd are the two population parameters of a Gaussian distribution. We almost always estimate the population parameters using relatively small sample sizes. The reason to estimate population parameters is to ensure that the results drawn from our experiment are reproducible: the population parameters apply to any dataset, since all future data come from the same population distribution.

### Estimate mean, variance and standard deviation

#### If we have all possible measurements in hand, calculate them! These are called parameters

1. **Calculate** the population mean: $\mu = \frac{\text{Sum of the measurements}}{\text{The number of all possible measurements}} = \text{Average Measurement}$
2. **Calculate** the population variance: $\frac{\sum (x - \mu)^2}{n}$.
3. **Calculate** the population standard deviation: $\sqrt{\frac{\sum (x - \mu)^2}{n}} = \sqrt{\text{Population Variance}}$. The SD is in the original units we measured, so we can draw it on the graph.

#### However, we usually don't have all the measurements, so we estimate them! These are called statistics

1. **Estimated** population mean, aka the sample mean: $\bar{x} = \frac{\text{Sum of the measurements}}{\text{The number of the measurements}} = \text{Average Measurement}$
2. **Estimated** population variance: $\frac{\sum (x - \bar{x})^2}{\bf{n-1}}$
3. **Estimated** standard deviation: $\sqrt{\frac{\sum (x - \bar{x})^2}{\bf{n-1}}}$

##### Why n-1 instead of n?

> Here $n-1$ compensates for the difference between $\mu$ and $\bar{x}$; otherwise we would consistently underestimate the variance around $\mu$, because the differences between the data and the sample mean tend to be smaller than the differences between the data and the population mean: $\frac{\sum (x - \bar{x})^2}{n-1} \leq \frac{\sum (x - \mu)^2}{n-1}$

##### Why does dividing by **n** underestimate the variance?

> Notice the equation for the population variance: $\frac{\sum (x - \mu)^2}{n}$. Instead of $\mu$, let's use an unknown variable $v$ as the mean and take the derivative:
$$\begin{aligned} \frac{dVar}{dv} &=\frac{d}{dv}\frac{\sum(x-v)^2}{n}\\ &= \frac{\sum2(x-v)\times (-1)}{n}\\ &= \frac{-2}{n} \sum (x - v)\\ &= \frac{-2}{n} ((\sum x) - (n \times v)) \end{aligned}$$
Setting the derivative to 0,
$$\begin{aligned} \frac{dVar}{dv} &= 0\\ \frac{-2}{n} ((\sum x) - (n \times v)) &= 0\\ \sum x &= n \times v\\ v &= \frac{\sum x}{n} = \bar{x} \end{aligned}$$
This means the variance is minimal when $v = \bar{x}$. So when we plug $v = \bar{x}$ into a formula that should use $v = \mu$, we always get a slightly smaller variance than the true one, because the minimum occurs at $v = \bar{x}$. That is why we divide by $n-1$ instead of $n$.

##### Why exactly n-1?

Technically, the correction is multiplying the biased estimator by $\frac{n}{n-1}$ (see Bessel's correction - Wikipedia). One way to prove this uses the identity that the variance is also given by $\sigma^2=\frac{1}{2}E[(x_i-x_j)^2]$, where $x_i$ and $x_j$ are two independent samples. If we pick two samples at random, for $\frac{n}{n^2}=\frac{1}{n}$ of the pairs we will have $i=j$ (the same sample picked twice), which contributes $E[(x_i-x_j)^2] = 0$. For the other $1 - \frac{1}{n}$ of the pairs we get different samples, so the expected value in that case is $\sigma ^2$.

$$\begin{aligned} E[s^2_n] &= \frac{1}{n} \times 0+(1-\frac{1}{n}) \times \sigma^2 = \frac{n-1}{n} \sigma^2 \\ E\left[\frac{n}{n-1}s_n^2\right] &= \sigma^2 \end{aligned}$$

##### Alternative proof:

The expected discrepancy between the biased estimator and the true variance is
$$\begin{aligned} \operatorname{E} \left[ \sigma^2 - s_n^2 \right] &= \operatorname{E}\left[ \frac{1}{n} \sum_{i=1}^n(x_i - \mu)^2 - \frac{1}{n}\sum_{i=1}^n (x_i - \overline{x})^2 \right] \\ &= \operatorname{E}\left[ \frac{1}{n} \sum_{i=1}^n\left((x_i^2 - 2 x_i \mu + \mu^2) - (x_i^2 - 2 x_i \overline{x} + \overline{x}^2)\right) \right] \\ &= \operatorname{E}\left[ \frac{1}{n} \sum_{i=1}^n\left(\mu^2 - \overline{x}^2 + 2 x_i (\overline{x}-\mu) \right) \right] \\ &= \operatorname{E}\left[ \mu^2 - \overline{x}^2 + \frac{1}{n} \sum_{i=1}^n 2 x_i (\overline{x} - \mu) \right] \\ &= \operatorname{E}\left[ \mu^2 - \overline{x}^2 + 2(\overline{x} - \mu) \overline{x} \right] \\ &= \operatorname{E}\left[ \mu^2 - 2 \overline{x} \mu + \overline{x}^2 \right] \\ &= \operatorname{E}\left[ (\overline{x} - \mu)^2 \right] \\ &= \operatorname{Var} (\overline{x}) \\ &= \frac{\sigma^2}{n} \end{aligned}$$
So the expected value of the biased estimator is
$$ \operatorname{E} \left[ s^2_n \right] = \sigma^2 - \frac{\sigma^2}{n} = \frac{n-1}{n} \sigma^2 $$
and an unbiased estimator is given by
$$ s^2 = \frac{n}{n-1} s_n^2 $$
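A quick simulation of this bias with numpy (assumed installed); the population ($\mu = 0$, $\sigma = 5$), the sample size, and the number of repeats are arbitrary choices.

```python
# Minimal sketch: simulate many small samples and compare dividing by n vs. n-1.
# Assumes numpy; the population (mu=0, sigma=5) and sample size n=5 are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 25.0                                    # true population variance
samples = rng.normal(0, 5, size=(100_000, 5))    # 100,000 samples of size n=5

var_n = samples.var(axis=1, ddof=0)              # divide by n     (biased)
var_n_minus_1 = samples.var(axis=1, ddof=1)      # divide by n-1   (unbiased)

print("true variance:           ", sigma2)
print("average of /n estimates:  ", var_n.mean().round(2))          # ~ (n-1)/n * 25 = 20
print("average of /(n-1) estimates:", var_n_minus_1.mean().round(2))  # ~ 25
```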
###### Intuition

In the biased estimator, by using the sample mean instead of the true mean, you are underestimating each $x_i - \mu$ by $\bar{x} - \mu$. We know that the variance of a sum is the sum of the variances (for uncorrelated variables). So, to find the discrepancy between the biased estimator and the true variance, we just need to find the expected value of $(\bar{x} - \mu)^2$. This is just the variance of the sample mean, which is $\frac{\sigma^2}{n}$. So we expect the biased estimator to underestimate $\sigma^2$ by $\frac{\sigma^2}{n}$, and so the biased estimator $= (1 - \frac{1}{n}) \times \text{the unbiased estimator} = \frac{n-1}{n} \times \text{the unbiased estimator}$.

### Expected value

The expected value represents the average of what we would expect if an event happened a bunch of times: $E(X) = \sum x P(X=x)$, where $x$ is each outcome and $P(X=x)$ is the probability of observing the outcome $x$.

### Statistical Model

1. A model refers to a relationship; it is a way to explore the relationship between variables. A model can also be an equation that we fit to the data, i.e. an approximation of the data.
2. We use statistics to determine how useful and how reliable a model is.

### Sampling a distribution

- What does "take a sample from a distribution" mean? It means we pick a random number according to that distribution.
- Why would we want to take a sample from a distribution? We can plug samples into a statistical test to compare the model's expectation with reality, so we learn how general the model is.

### p-value

A number between 0 and 1 that quantifies how confident we are in a result from hypothesis testing. A p-value threshold of 0.05 means that only 5% of experiments in which nothing is really going on would result in the wrong decision (a false positive). The p-value tells us how confident we can be in rejecting $H_0$, but it doesn't tell us how different the groups are. Commonly, the more data we have, the smaller the p-value we will get.

#### How to calculate a p-value

There are one-sided and two-sided p-values. The p-value is determined by adding up probabilities, and it is composed of three parts:

1. The probability that random chance would produce the observation (e.g. HHT; for a continuous statistic, the tail up to and including $x$).
2. The probability of observing something else that is equally rare (the symmetric part on the other side of the distribution).
3. The probability of observing something rarer or more extreme (e.g. TTT, HHH).

Why add parts 2 and 3? If events as rare as (or rarer than) the observation happen often, then the observation is not that special. For continuous variables, use a statistical distribution (areas under the curve) to add up these probabilities. For a one-sided test, there is no "equally rare" event on the other side; we only add up one tail.

**For exploratory data analysis, you should demand a really tiny p-value before believing the result is real.**

### Variation

- Residual: the vertical distance from a data point to the line, $y_i - f(x_i)$.
- Sum of squared residuals: $((ax_1+b)-y_1)^2+((ax_2+b)-y_2)^2+...$
- Sum of squares around the mean: $SS(mean) = \sum (y_i - \bar{y})^2$
- Variation around the mean: $Var(mean) = \frac{\sum (y_i - \bar{y})^2}{n} = \frac{SS(mean)}{n}$
- Sum of squares around the fit: $SS(fit) = \sum (y_i - f(x_i))^2$
- Variation around the fit: $Var(fit) = \frac{\sum (y_i - f(x_i))^2}{n} = \frac{SS(fit)}{n}$
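Here is a minimal sketch of "adding up probabilities" for a two-sided coin-flip p-value, assuming scipy is installed; the observed count (8 heads out of 10 flips of a fair coin) is made up for illustration.

```python
# Minimal sketch: a two-sided binomial p-value by adding up the probabilities of
# the observation, anything equally rare, and anything rarer. Assumes scipy.
import numpy as np
from scipy import stats

n, observed_heads, p_fair = 10, 8, 0.5
pmf = stats.binom.pmf(np.arange(n + 1), n, p_fair)   # probability of 0..10 heads

# Add up every outcome that is as rare as, or rarer than, what we observed.
p_value = pmf[pmf <= pmf[observed_heads]].sum()
print("two-sided p-value (by hand):", round(p_value, 4))

# scipy's exact binomial test adds up small probabilities the same way,
# so it should agree with the manual sum.
print("two-sided p-value (scipy):  ", round(stats.binomtest(observed_heads, n, p_fair).pvalue, 4))
```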
### Covariance

1. Start from the variance of each of the two variables, $\frac{\sum (x - \bar{x})^2}{\bf{n-1}}$.
2. Covariance answers: "Do the measurements, taken as pairs, tell us something that the individual measurements do not?"
3. The main idea behind covariance is that it can classify three types of relationships:
   1. Positive covariance represents relationships with a positive slope.
   2. Negative covariance represents relationships with a negative slope.
   3. Zero covariance represents no relationship, because the slope is 0.
4. Covariance is the computational stepping stone to correlation.
5. $Covariance = \frac{\sum(x - \bar{x})(y - \bar{y})}{n-1}$
6. The covariance value only tells us the sign of the slope, not the value of the slope or how far the data are from the line.
7. If there is no relationship, the covariance is 0. If $x$ and $y$ are both less than, or both greater than, $\bar{x}$ and $\bar{y}$, the contribution to the covariance is positive; on the flip side, it is negative.

![](https://i.imgur.com/qdlQXJe.png =400x)

The covariance of a variable with itself is the variance of that variable. When we double all the data, we get a covariance 4 times bigger than the original, so covariance is sensitive to the scale of the data and hard to interpret.

### Correlation ($R$)

A trend in the data that can be used to make predictions and inferences (aka educated guesses) is called a relationship; correlation represents the strength of that relationship. Correlation doesn't depend on the scale of the data. $R=1$ when a straight line goes through all the data, regardless of how much data we have (a straight line always passes through two points, so $R = 1$ with only two points means little). The more data we have, the more confidence we can have in the correlation. For correlation, a p-value tells us the probability that randomly drawn dots would result in a similarly strong relationship, or stronger.

$$\begin{aligned} \text{Pearson's Correlation} &= \frac{Covariance(X,Y)}{\sqrt{Variance(X)}\sqrt{Variance(Y)}}\\ Covariance &= \frac{\sum(x - \bar{x})(y - \bar{y})}{n-1}\\ Variance &= \frac{\sum (x - \bar{x})^2}{\bf{n-1}} \end{aligned}$$

Correlations are hard to compare with each other; use $R^2$!

### $R^2$

$R^2$ values can be compared directly: $R^2 = 1$ is twice as good as $R^2 = 0.5$.

STEPS:

1. Calculate the variation around the mean, $\sum (y - \bar{y})^2$.
2. Fit a model to the data.
3. $R^2 = \frac{Var(mean)-Var(model)}{Var(mean)}$

If $R^2=0.8$, it means there is 80% less variation around the model than around the mean, or, put differently, the model accounts for 80% of the variation. $R^2$ is just the square of $R$.

### Central Limit Theorem

Given a distribution with a well-defined mean and variance (the Cauchy distribution doesn't have one), the means of samples drawn from it follow an approximately normal distribution. Using the central limit theorem, we can use the t-test ($H_0$: the means of two samples are equal) and ANOVA ($H_0$: the means of three or more samples are equal).

### Standard error

Commonly, "standard error" refers to the standard error of the mean, which is the standard deviation of the sample means. From a single sample, it can be estimated as $\frac{\text{Standard Deviation}}{\sqrt{n}}$, where $n$ is the sample size. There are other kinds of standard errors, like the standard error of the standard deviation, which is the standard deviation of the standard deviations of the samples. If we calculate the medians of a bunch of samples, the standard deviation of those sample medians is called the standard error of the median.

Three common error bars:

- Standard deviations
  - These tell you how **your data** are distributed around the mean.
  - A big SD tells you that some of the data points were pretty far from the mean.
  - In most cases, you want to use SD as the error bar because it tells you about **your data**.
- Standard errors
  - These tell you how the mean is distributed.
- Confidence intervals
  - These are related to standard errors.
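A minimal sketch of the standard error of the mean, estimated both from the formula and by bootstrapping; numpy is assumed, and the sample, its size, and the number of bootstrap resamples are arbitrary choices.

```python
# Minimal sketch: the standard error of the mean, estimated two ways.
# Assumes numpy; the single sample of 40 measurements is made up.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=10, scale=3, size=40)       # one sample of 40 measurements

# Formula: SD / sqrt(n)
se_formula = sample.std(ddof=1) / np.sqrt(len(sample))

# Bootstrap: resample with replacement, take each resample's mean,
# and report the standard deviation of those means.
boot_means = [rng.choice(sample, size=len(sample), replace=True).mean()
              for _ in range(2000)]
se_bootstrap = np.std(boot_means, ddof=1)

print("SE from formula:  ", round(se_formula, 3))
print("SE from bootstrap:", round(se_bootstrap, 3))   # should be close
```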
### Standard deviation vs. Standard error

Standard deviation: given a sample (a set of measurements), the mean of the sample is the sample mean, and the standard deviation measures how the individual measurements are spread around that sample mean. Standard error: given a bunch of samples, each sample has its own sample mean and standard deviation; the standard deviation of **all the sample means** is called the standard error.

### Technical and Biological Replicates

**Technical replicates**: every experiment is performed on the same sample; they are repetitions of the same measurement on the same individual.

- They give us an accurate measurement of this sample.
- They tell us how accurately we are measuring gene expression, based on how different the technical replicates are from each other. This is used when we want to show how good a new method is.
- We can take multiple samples from a single individual; they are still considered technical replicates, although they capture a bit more variance.
- They improve specificity.

**Biological replicates**: each measurement comes from a different sample taken from a different individual, i.e. from different biological sources.

- They improve generality.

### Sample size and effective sample size

If there is correlation between samples, the effective sample size is
$$\frac{\text{The number of samples}}{1+(\text{the number of samples}-1) \times \text{the correlation}}$$
Two perfectly correlated samples count as an effective sample size of 1 rather than 2 in the experiment.

### Confidence interval

A confidence interval is an interval that covers the means. A 95% confidence interval is the interval that covers 95% of the bootstrap means. It is useful because it amounts to a statistical test performed visually: because the interval covers 95% of the means, anything outside of it occurs less than 5% of the time, which means the p-value of anything outside the confidence interval is < 0.05. If the 95% intervals of two samples do not overlap, we know the p-value comparing them is less than 0.05 for sure.

### Which t-test should I use, paired or unpaired?

#### Paired t-test

Use a paired t-test when you have paired measurements for each subject.

#### Unpaired t-test

Use an unpaired t-test when you have two groups of measurements from two different groups. Using the unpaired t-test that does not assume equal variation in both groups is more conservative.

### One- or two-tailed p-values

#### Two-tailed p-value:

It tests whether the new treatment is better, worse, or not significantly different.

#### One-tailed p-value:

It tests whether the new treatment is better or not significantly better, so the one-tailed p-value is always smaller.

#### p-hacking

Decide which test and which p-value threshold you will use before you run the experiment; otherwise it is p-hacking.

### The binomial distribution and test

Binomial distribution: $$Pr(x|n,p) = \frac{n!}{x!(n-x)!} p^x (1 - p)^{n-x}$$

Binomial test: all samples are independent; use the binomial distribution to calculate the p-value. $H_0$: the two outcomes are equally likely (no preference). If the data show a strong preference, the null model fits poorly and we reject the idea that both outcomes have the same probability. The p-value is the probability that the observed outcome, and anything rarer, would happen under $H_0$.
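A minimal sketch of the paired and unpaired (Welch's) t-tests with scipy (assumed installed); the before/after measurements and the two groups are invented for illustration.

```python
# Minimal sketch: paired vs. unpaired t-tests with scipy. Assumes scipy;
# the before/after and group measurements are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Paired: the same 12 subjects measured before and after a treatment.
before = rng.normal(50, 5, size=12)
after = before + rng.normal(2, 2, size=12)
print("paired:  ", stats.ttest_rel(before, after))

# Unpaired: two independent groups; equal_var=False gives Welch's t-test,
# which does not assume equal variation in both groups.
group_a = rng.normal(50, 5, size=12)
group_b = rng.normal(53, 8, size=15)
print("unpaired:", stats.ttest_ind(group_a, group_b, equal_var=False))
```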
### Quantiles and percentiles

1. The median is a quantile because it splits the data into two equal-sized groups. It is called the 0.5 quantile or 50% quantile.
2. Quantiles are lines that divide the data into equally sized groups.
3. Percentiles are quantiles that divide the data into 100 equally sized groups.
4. The 7% quantile = the $7^{th}$ percentile.

### Quantile-Quantile plot (QQ plot)

Tests whether data are normally distributed (or follow some other named distribution).

1. Give each data point its own quantile.
2. Take a named distribution curve (any normal curve will do) and get the values at the same quantiles as your data points' quantiles. Then plot the paired values (the same quantile from the two distributions) against each other. The normal distribution is the most commonly used one.

### Quantile Normalization

Assume we have an expression value for each gene, and different samples have different expression values for each gene. Use quantile normalization to make the samples comparable:

1. Sort the expression values within each sample.
2. For each rank $i$, calculate the mean of the $i^{th}$ highest values across all samples and use that mean as the quantile-normalized value for the $i^{th}$ highest value in every sample.

By doing this, genes that have the same ranking will have the same expression value, so we can compare different samples.

### Probability vs. Likelihood

Probabilities are areas under a **fixed distribution**: $pr(\text{data}|\text{fixed distribution})$. Likelihoods are the y-axis values for **fixed data points** under **distributions that can be moved**: $L(\text{distribution}|\text{fixed data})$.

### Maximum likelihood

The goal of maximum likelihood is to find the optimal way to fit a distribution to the data. The reason you want to fit a distribution to your data is that it can be easier to work with and it is more general - it applies to every experiment of the same type.

Normally distributed data:

1. We expect most of the measurements to be close to the mean.
2. We expect the measurements to be relatively symmetric around the mean.

Likelihood plot for the mean: x - the location of the center of the distribution, y - the likelihood of observing the data. The likelihood plot for the standard deviation works the same way.

![](https://i.imgur.com/brKDekA.png =350x)

When there is more than one observation,
$$L(\lambda | x_1, x_2,..., x_n) = L(\lambda | x_1) \times L(\lambda | x_2) \times ... \times L(\lambda | x_n) $$

The likelihood is used when trying to find the optimal value for the mean or standard deviation of a distribution given a bunch of observed measurements. $L(X|Y) = Pr(Y|X)$.

#### Calculate maximum likelihood:

1. Write the likelihood for all samples:
$$\begin{aligned} L(parameters|x_1, x_2, ...,x_n)&=\prod_{i = 1}^n L(parameters|x_i)\\ \log L(parameters|x_1, x_2, ...,x_n)&=\sum_{i = 1}^n \log L(parameters|x_i)\\ &=\sum_{i = 1}^n \log Pr(x_i|parameters) \end{aligned}$$
2. Take the derivative of $\log L(parameters|x_1, x_2, ...,x_n)$.
3. Set the derivative to 0 and solve.

#### Maximum likelihood for the exponential distribution

![](https://i.imgur.com/Jt8phl5.png =350x)
![](https://i.imgur.com/1XkHat9.png =250x)

#### Maximum likelihood for the binomial distribution

![](https://i.imgur.com/X2WEE1t.png =350x)
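To connect the recipe above to the binomial case, here is a minimal numpy/scipy sketch that scans the binomial log-likelihood over a grid of $p$ values and compares the maximum with the closed-form answer $x/n$; the counts (7 successes in 10 trials) are made up.

```python
# Minimal sketch: maximum likelihood for a binomial. We scan the log-likelihood
# over a grid of p values and compare with the closed-form MLE p = x/n.
# Assumes numpy and scipy; the counts (7 successes in 10 trials) are made up.
import numpy as np
from scipy import stats

n, x = 10, 7
p_grid = np.linspace(0.001, 0.999, 999)

log_likelihood = stats.binom.logpmf(x, n, p_grid)   # log L(p | x, n)
p_hat_grid = p_grid[np.argmax(log_likelihood)]

print("MLE from the grid:  ", round(p_hat_grid, 3))
print("closed-form MLE x/n:", x / n)
```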
#### Maximum likelihood for the normal distribution

Solve the partial derivatives for $\mu$ and $\sigma$ and you get:

$$\begin{aligned} \mu &= \frac{(x_1 + ...+ x_n)}{n}\\ \sigma &= \sqrt{\frac{(x_1- \mu)^2 + ...+(x_n- \mu)^2 }{\textbf{n}}} \end{aligned}$$

### Odds and log(odds)

The odds are the ratio of something happening to something not happening:

$$\begin{aligned} probability &= \frac{\text{something happening}}{\text{everything that could happen}}\\ odds &= \frac{\text{something happening}}{\text{something not happening}}\\ odds & = \frac{Pr(\text{something happening})}{Pr(\text{something not happening})} \end{aligned}$$

$\text{The odds} \in[0,\infty)$, with 1 as the dividing point, so the scale is asymmetric. The log(odds) is cleaner, because $log(odds) \in(-\infty,\infty)$ and 0 is the dividing point.

<span style="display:block;text-align:center">![](https://i.imgur.com/4x1igjt.png =350x)</span>

The log(odds) are approximately normally distributed for a system with two outcomes, like yes/no or T/F.

### Logit function

The log of the ratio of the probabilities is called the logit function and forms the basis for logistic regression.

### Odds ratio and log(odds ratio)

The odds ratio compares two odds:
$$\text{odds ratio} =\frac{\text{odds of something}}{\text{odds of another thing}} $$

<span style="display:block;text-align:center">![](https://i.imgur.com/VuYzdhx.png =305x)</span>

When we have two variables, we can write a table indicating how many samples fall in each cell; this can indicate the relationship between the two variables. We can use three methods to test the significance!

| | Yes | No |
| ----- | --- | --- |
| True | a | b |
| False | c | d |

#### 1. Fisher's exact test and enrichment analysis (p-value)

When we have an expected distribution and a sample, we use the sample to calculate a p-value with respect to the expected distribution to see whether the sample could plausibly have come from that distribution.

#### 2. Chi-squared test (p-value)

Compares the observed values (the sample) to the expected values (the expected distribution). $H_0$ is that there is no relationship between the two variables. First, calculate the probability by row; then use this probability to get the expected value for each column cell; then compute the $\chi^2$ statistic and its p-value.

#### 3. The Wald test (p-value and confidence interval)

Takes advantage of the fact that the log(odds ratio) is approximately normally distributed. Check how many estimated standard deviations the log(odds ratio) is away from 0 to get a p-value and a confidence interval. The estimated standard deviation is
$$sd = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$
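Here is a minimal sketch of all three tests on one 2x2 table, assuming numpy and scipy are installed; the counts a, b, c, d are invented for illustration.

```python
# Minimal sketch: odds ratio, Wald test, Fisher's exact test, and chi-squared test
# for a 2x2 table. Assumes numpy and scipy; the counts a, b, c, d are invented.
import numpy as np
from scipy import stats

a, b, c, d = 23, 117, 6, 210          # made-up 2x2 table
table = np.array([[a, b], [c, d]])

log_or = np.log((a / b) / (c / d))    # log(odds ratio)
se = np.sqrt(1/a + 1/b + 1/c + 1/d)   # estimated standard deviation of log(OR)
z = log_or / se                       # Wald statistic: # of sd's away from 0
wald_p = 2 * stats.norm.sf(abs(z))    # two-sided p-value
lo, hi = np.exp(log_or - 1.96 * se), np.exp(log_or + 1.96 * se)
print(f"odds ratio = {np.exp(log_or):.2f}, Wald p = {wald_p:.4f}, "
      f"95% CI = ({lo:.2f}, {hi:.2f})")

print("Fisher exact p:", stats.fisher_exact(table)[1])
chi2, chi2_p, dof, expected = stats.chi2_contingency(table)
print("chi-squared p: ", chi2_p)
```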
## Appendix and FAQ

:::info
**Find this document incomplete?** Leave a comment!
:::

###### tags: `Statistics` `Documentation`