###### tags: `statistics` `bias` `hypothesis` `R` `exp` `odds ratio` `relative risk` `confidence interval` `beta` `standard error` `glm(family=binomial(link="logit"))` `logistic regression` `pchisq()` `scale` `mean` `sd` `p-value` `alpha`
# Statistical concepts, calculations, interpretations
### Bias in statistics
**Bias** is the tendency of a statistic to overestimate or underestimate a parameter.
For a point estimator, **statistical bias** is defined as the difference between the expected value of the estimator and the true value of the parameter being estimated: `Bias(estimator) = E(estimator) - parameter`.
### Null hypothesis and alternative hypothesis
**Null hypothesis (often symbolized H0 and read as “H-naught”)** This is the idea that there is no relationship in the population and that the relationship in the sample reflects only sampling error. Informally, the null hypothesis is that the sample relationship “occurred by chance.”
**Alternative hypothesis (often symbolized as H1)** This is the idea that there is a relationship in the population and that the relationship in the sample reflects this relationship in the population.
When running a hypothesis test/experiment, the null hypothesis says that there is no difference or no change between the two conditions. The alternative hypothesis is the opposite of the null hypothesis and states that there is a difference between the two conditions. The goal of the experiment is usually to find evidence against the null hypothesis in favour of the alternative hypothesis. If you are trying to show that a new drug lowers cholesterol, the null hypothesis states that there is no difference in cholesterol between the group with the drug and the group without, while the alternative hypothesis states that the new drug does have an effect on cholesterol levels.
### p-value and alpha
**p-value** The p stands for probability. **The p-value is the probability of obtaining a test result at least as extreme as the one observed, under the assumption of no effect or no difference (i.e., that the null hypothesis is true)**. In the example where we are trying to test whether a new drug lowers cholesterol, the p-value is the probability of seeing a difference in cholesterol between the treatment group and the control group at least as large as the one observed, if in reality there is no difference. If the test result has a p-value of 0.04, then there is a 4% probability of obtaining such a difference purely by chance when the null hypothesis is true; it is not the probability that the null hypothesis itself is true.
**Alpha**
So what do p-values really tell us? p-values tell us how compatible the observed data are with the null hypothesis of no effect: a small p-value means the result would be unlikely to arise from random variation alone. In order to declare a test result significant we want the p-value to be low. How low? That depends on what standard you want to set/follow. In most fields a p-value under 0.05 is accepted, while in other fields a p-value under 0.01 is required. This cut-off is known in statistics as alpha, and results from experiments with p-values below this threshold are considered statistically significant. When the significance threshold is set to p < 0.05, the probability of falsely rejecting a true null hypothesis (a Type I error) is kept below 0.05 (i.e., 5%).
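As a concrete illustration with made-up data, the sketch below simulates cholesterol levels for a drug group and a placebo group (the group sizes, means and SDs are invented), runs a two-sample t-test and compares the resulting p-value with alpha = 0.05:
```r!
# Hypothetical data: cholesterol levels in a drug group and a placebo group
set.seed(1)
drug    <- rnorm(50, mean = 190, sd = 20)
placebo <- rnorm(50, mean = 200, sd = 20)
# Two-sample t-test; extract the p-value and compare it with alpha
p <- t.test(drug, placebo)$p.value
p < 0.05
```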
### Type I error, Type II error, alpha, beta, power
**Type I error** is wrongly rejecting a null hypothesis when it is actually true. The Type I error rate is not affected by sample size because it is set in advance; however, the chance of making at least one Type I error increases with the number of tests or end points.
**Alpha** is the maximum probability of Type I error set in advance.
**Type II error** is wrongly accepting (failing to reject) a null hypothesis when it is actually false.
**Beta** is the probability of a Type II error. Beta depends on the sample size and alpha. Beta gets smaller as the sample size gets larger, and as the number of tests or end points increases.
**Statistical power** is the probability of avoiding a Type II error: power = 1 - beta (see the R sketch after the table below).
| Truth | Decision: accept H0 | Decision: reject H0 |
| -------- | -------- | -------- |
| H0 is true | Correct decision (probability 1 - alpha) | Type I error (probability alpha) |
| H0 is false | Type II error (probability beta) | Correct decision (probability 1 - beta; power) |
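A minimal sketch with base R's `power.t.test()`, using a made-up effect size and SD, showing that power (1 - beta) increases with sample size at a fixed alpha:
```r!
# Power of a two-sample t-test for a hypothetical effect size (delta) and SD
power.t.test(n = 50,  delta = 5, sd = 20, sig.level = 0.05)$power
# Larger sample size -> smaller beta -> higher power
power.t.test(n = 200, delta = 5, sd = 20, sig.level = 0.05)$power
```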
### Mean, Median, Mode, Range, Q1, Q3, IQR
**Mean** is the arithmetic average: the sum of the values divided by the number of values.
**Median** is usually the middle value in the ordered list of numbers. If the number of data points is odd, the median is the middle data point. If the number of data points is even, the median is the average of the middle two data points.
**Mode** is the number that is repeated more often than any other.
**Range** is the difference between the largest number and the smallest number.
**First quartile (Q1)** is the median of the data points to the left of the median in the ordered list.
**Third quartile (Q3)** is the median of the data points to the right of the median in the ordered list.
**Interquartile range (IQR)** is the amount of spread in the middle 50% of a dataset. It is the distance between the first quartile (Q1) and the third quartile (Q3). `IQR= Q3- Q1`
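In R these can be computed as below (note that `quantile()` interpolates by default, so its Q1/Q3 may differ slightly from the "median of each half" definition above):
```r!
x <- c(2, 4, 4, 5, 7, 9, 11)
mean(x); median(x); range(x)
quantile(x, probs = c(0.25, 0.75))   # Q1 and Q3
IQR(x)                               # Q3 - Q1
# Base R has no built-in mode for data; the most frequent value can be found with:
names(sort(table(x), decreasing = TRUE))[1]
```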
### Positive skewness, negative skewness, symmetrical distribution
**Positively skewed distribution (right-skewed)** is characterized by many outliers in the upper region, or right tail. A positively skewed distribution is said to be skewed right because of its relatively long upper (right) tail. The mean is larger than the median.

**Negatively skewed distribution (left-skewed)** has a disproportionately large number of outliers in its lower (left) tail. A negatively skewed distribution is said to be skewed left because of its long lower tail. The mean is smaller than the median.

**Symmetrical distribution** In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal), and the mode is the same as the mean and median.
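A quick way to see the mean/median relationship is to simulate a right-skewed variable; for exponential draws, the mean exceeds the median:
```r!
set.seed(1)
x <- rexp(10000, rate = 1)
mean(x)    # close to 1
median(x)  # close to log(2) = 0.69, smaller than the mean (right skew)
```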

### Convert -log10(p-value) back to a p-value
* [Question: How can I convert -log10 (p-value) to p-value?](https://www.biostars.org/p/358046/)
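Since -log10(p) is just a log transformation, converting back means raising 10 to the negative of that value:
```r!
neg.log10.p <- 7.3
10^(-neg.log10.p)   # p-value = 5.011872e-08
```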
### How to interpret output from generalized linear model in R?
#### **Variable standardisation**. If you get a very large effect size for a predictor in a logistic or linear regression, either the predictor is highly correlated with the outcome, or the predictor needs to be standardised. Pack years of smoking, for instance, is derived from ever smoked; when you predict pack years of smoking (PYOS) with ever smoked, you will get a very large effect size. Standardise or rescale data using `scale(x, center=TRUE, scale=TRUE)`. With these defaults, `scale()` calculates the mean and standard deviation (SD) of the entire vector, scales each element by subtracting the mean and dividing by the SD, and produces z-scores (standardised scores). This scaling is useful for reducing the range of a widely distributed variable when making plots. Standardisation does not change the shape of the underlying distribution; it only changes the units of measurement. Also, regression makes no assumption about the distribution of your independent variables; the requirement is only that the residuals of your model be normally distributed.
* [Understanding `scale` in R](https://stackoverflow.com/questions/20256028/understanding-scale-in-r)
* [R Tutorial Series: Centering Variables and Generating Z-Scores with the Scale() Function](https://www.r-bloggers.com/r-tutorial-series-centering-variables-and-generating-z-scores-with-the-scale-function/)
* [standardising non normally distributed predictors for regression](https://stats.stackexchange.com/questions/223432/standardising-non-normally-distributed-predictors-for-regression)
```r!
set.seed(1)
x <- runif(7)
# Manually scaling/standardising x
(x - mean(x)) / sd(x)
# Scale/standardise x using scale()
scale(x)
```
```r!
# Compare the range before and after variable scaling/standardisation
> range(ukb.phenos.covar.nonwhite.related.ID2.removed$merged_pack_years_20161, na.rm = T)
[1] -0.035 254.950
> range(ukb.phenos.covar.nonwhite.related.ID2.removed$merged.pack.years.20161.z.scores, na.rm = T)
[1] -0.5245432 15.8371049
```
---
#### **Deviance Residuals:** This is a non-parametric (five-number) summary of the distribution of the deviance residuals, i.e. each observation's contribution to the residual deviance.
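As a check, assuming `fit` is a fitted glm object, the numbers printed in that block are the quantiles of the deviance residuals:
```r!
# The "Deviance Residuals" block of summary(fit) is the five-number summary of:
quantile(residuals(fit, type = "deviance"))
```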
#### **null deviance** and **residual deviance** To evaluate the overall performance of the model, look at the null deviance and residual deviance. **Null deviance** shows how well the outcome variable is predicted by a model with nothing but an intercept (grand mean). It is essentially a chi-square value on 13916 degrees of freedom. Adding the 19 predictors (caffeine.per.day through rs4410790) to the model decreases the deviance by 3464 (18557 - 15093) points on 19 (13916 - 13897) degrees of freedom. The **residual deviance** is also a chi-square value, 15093. A chi-square of 15093 on 13897 df yields a p-value of 1.456817e-12 (`pchisq(15093, df=13897, lower.tail=FALSE)`), so the null hypothesis is rejected. Notice that the degrees of freedom associated with the null deviance and the residual deviance differ by 19, the number of covariates/predictors added to the model.
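The same calculations can be reproduced from the deviances and degrees of freedom reported in the output below; the first `pchisq()` call tests the reduction in deviance achieved by the 19 predictors:
```r!
# Deviance reduction from adding the 19 predictors (values from the output below)
pchisq(18557 - 15093, df = 13916 - 13897, lower.tail = FALSE)
# p-value for the residual deviance itself, as quoted in the text above
pchisq(15093, df = 13897, lower.tail = FALSE)
```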
#### **X observations deleted due to missingness** This line appears because there were 347366 observations for which the outcome, the predictors, or both were missing.
* [Interpreting Residual and Null Deviance in GLM R](https://github.com/bnamatherdhala/mint/wiki/Interpreting-Residual-and-Null-Deviance-in-GLM-R)
* [R: calculate p-value given Chi Squared and Degrees of Freedom](https://stats.stackexchange.com/questions/250819/r-calculate-p-value-given-chi-squared-and-degrees-of-freedom)
* [Interpretation of R's output for binomial regression](https://stats.stackexchange.com/questions/86351/interpretation-of-rs-output-for-binomial-regression)
```r!
Call:
glm(formula = formula, family = binomial(link = "logit"), data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7414 -0.8439 -0.5069 0.9681 2.7273
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 6.900e+00 2.524e-01 27.341 < 2e-16 ***
caffeine.per.day -4.823e-05 1.379e-04 -0.350 0.72665
merged_pack_years_20161 1.051e-02 1.403e-03 7.489 6.95e-14 ***
complete_alcohol_unitsweekly 6.130e-03 1.213e-03 5.054 4.33e-07 ***
age -1.248e-01 3.067e-03 -40.678 < 2e-16 ***
PC1 1.654e-02 1.285e-02 1.287 0.19803
PC2 1.596e-02 1.348e-02 1.184 0.23644
PC3 2.455e-02 1.276e-02 1.923 0.05447 .
PC4 -2.103e-02 8.043e-03 -2.615 0.00892 **
PC5 3.837e-03 3.594e-03 1.068 0.28573
PC6 1.129e-02 1.179e-02 0.958 0.33789
PC7 1.737e-02 9.253e-03 1.877 0.06054 .
PC8 1.420e-02 9.098e-03 1.560 0.11869
TDI 1.257e-01 7.195e-03 17.468 < 2e-16 ***
overall_health_rating -6.002e-02 3.000e-02 -2.000 0.04546 *
factor(inferred.sex)1 -2.246e-01 4.304e-02 -5.220 1.79e-07 ***
quali.edu.6138.recoded.z.score 6.331e-01 2.359e-02 26.835 < 2e-16 ***
factor(X20116_recodeFinal)2 4.098e-01 9.434e-02 4.344 1.40e-05 ***
rs2472297 -5.597e-02 3.176e-02 -1.762 0.07801 .
rs4410790 -1.725e-02 2.932e-02 -0.588 0.55627
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 18557 on 13916 degrees of freedom
Residual deviance: 15093 on 13897 degrees of freedom
(347366 observations deleted due to missingness)
AIC: 15133
Number of Fisher Scoring iterations: 4
```
```r!
# Calculate the number of observations used in the model
numb.obs.raw.data <- nrow(data)
# na.action on the fitted glm object stores the indices of rows dropped due to missingness
numb.obs.deleted <- length(logi.summ.outc.CI.expo.CCPD.predictor.covar.2SNPs[["na.action"]])
numb.obs.analysed <- numb.obs.raw.data - numb.obs.deleted
```
---
#### Standardise, center, normalise
* [What does “normalization” mean and how to verify that a sample or a distribution is normalized?](https://stats.stackexchange.com/questions/70553/what-does-normalization-mean-and-how-to-verify-that-a-sample-or-a-distribution/70555#70555)
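A small sketch of the three operations on one vector; min-max normalisation is shown as one common meaning of "normalise" (the term is used in several ways, as the link above discusses):
```r!
x <- c(2, 4, 6, 8)
scale(x, center = TRUE, scale = FALSE)   # centering only: subtract the mean
scale(x, center = TRUE, scale = TRUE)    # standardising: z-scores (mean 0, SD 1)
(x - min(x)) / (max(x) - min(x))         # min-max normalisation to [0, 1]
```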
#### Estimate R^2^ (proportion of trait variance explained by a predictor) for continuous and binary traits
* [How does the correlation coefficient differ from regression slope?](https://stats.stackexchange.com/questions/32464/how-does-the-correlation-coefficient-differ-from-regression-slope)
* [How to calculate the amount of phenotype variance explained by a few SNPs?](https://www.researchgate.net/post/How_to_calculate_the_amount_of_phenotype_variance_explained_by_a_few_SNPs)
* [How to calculate pseudo- R2 from R's logistic regression?](https://stats.stackexchange.com/questions/8511/how-to-calculate-pseudo-r2-from-rs-logistic-regression)
* [How different is Beta computation using Covariance and Linear Regression?](https://math.stackexchange.com/questions/38644/how-different-is-beta-computation-using-covariance-and-linear-regression)
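One simple option discussed in the links above is McFadden's pseudo-R^2^, computed from the null and residual deviances of a fitted model; a minimal sketch, assuming `fit` is a glm object from a logistic regression:
```r!
# McFadden's pseudo-R^2: 1 - residual deviance / null deviance
1 - fit$deviance / fit$null.deviance
# For a continuous trait modelled with lm(), summary(fit.lm)$r.squared gives R^2
```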
---
#### What is odds ratio?
* [Explaining Odds Ratios](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/)
An odds ratio (OR) is a measure of association between an exposure and an outcome. The OR represents the odds that an outcome will occur given a particular exposure, compared to the odds of the outcome occurring in the absence of that exposure. Odds ratios are most commonly used in case-control studies, but they can also be used in cross-sectional and cohort study designs (with some modifications and/or assumptions).
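A small worked example with made-up counts, computing the OR directly from a 2x2 exposure-by-outcome table:
```r!
# Hypothetical 2x2 table: exposed vs unexposed (rows), diseased vs healthy (columns)
#             diseased  healthy
# exposed           20       80
# unexposed         10      190
odds.exposed   <- 20 / 80      # odds of disease among the exposed
odds.unexposed <- 10 / 190     # odds of disease among the unexposed
odds.exposed / odds.unexposed  # OR = 4.75
```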
#### Calculate odds ratios and 95% confidence interval from regression coefficient and standard error
* [Calculate odds ratio confidence intervals from plink output?](https://stats.stackexchange.com/questions/66760/calculate-odds-ratio-confidence-intervals-from-plink-output)
```r!
#-----------------------------------------------------------------------------
# Calculate odds ratio and 95% confidence intervals from GSMR beta and se
#-----------------------------------------------------------------------------
## Calculate 95% confidence interval for the beta as beta+/-1.96*se.
gsmr_result$bxy_lbound <- with(gsmr_result,bxy-1.96*se)
gsmr_result$bxy_ubound <- with(gsmr_result,bxy+1.96*se)
## Exponentiate beta to get odds ratios
gsmr_result$OR <-with(gsmr_result,exp(bxy))
## Convert 95% CI of the beta to 95%CI of the odds ratio.
gsmr_result$or_lbound <- with(gsmr_result,exp(bxy_lbound))
gsmr_result$or_ubound <-with(gsmr_result,exp(bxy_ubound))
```
#### Interpret odds ratios (OR), relative risk (RR)
* [Explaining Odds Ratios](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2938757/)
* Suppose an exposure trait has a standard deviation (SD) of 2.181 in its natural units, and the estimated OR~exposure-->outcome~ per unit of exposure is 0.85. To interpret the OR in terms of **standard deviation changes**, raise it to the power of the SD: `OR^SD = 0.85^2.181` is about 0.70, i.e. a 1 SD increase in the exposure is associated with a 29.8% decrease in the odds of having the outcome/disease.
```r!
> SD=2.181
> OR=0.85
> OR^SD
[1] 0.7015565
> 1-OR^SD/1
[1] 0.2984435
```
* If the trait is not standardised but measured in its natural unit (e.g. cups of coffee consumed per day), the interpretation of OR~exposure-->outcome~=0.85 is that each additional cup of coffee consumed is associated with a 15% decrease (`1-0.85`) in the odds of having the outcome/disease.
* OR is usually used to test association in case-control studies. The outcome should be binary (i.e. case = 1 or control = 0). The OR is the odds of having the exposure in cases divided by the odds of having the exposure in controls. Suppose your exposure is smoking and your outcome is lung cancer (binary):
* when OR < 1 and the 95% CI does not include 1 or any value above it (e.g. 0.87-0.93), smoking reduces the risk of lung cancer
* when OR > 1 and the 95% CI does not include 1 or any value below it (e.g. 1.1-1.3), smoking increases the risk of lung cancer
* The interpretation of OR~(BMI→T2D)~ = 3.29 is that people whose BMI is 1 SD above the population mean (SD = 3.98 for BMI in European men, corresponding to ~12 kg of weight for men of 175 cm stature; see Supplementary Table 6 for the SD of the risk factors) will have a 3.29-fold increase in risk of T2D compared with the population prevalence (~8% in the US) [Causal associations between risk factors and common diseases inferred from GWAS summary data](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5768719/)
* when OR=1, there is no association between smoking and lung cancer.
* when OR < 1 but the 95% CI includes 1 or values above it (e.g. 0.87-1.13), the result is equivalent to having OR = 1
* Interpretation of the OR depends on how you calculate it. Usually, OR = odds of the event in cases / odds of the event in controls. The odds are `P/(1-P)`, where P = probability of the event in a group (cases or controls). From the formula, you can deduce that if OR > 1, the odds of the event are higher in the cases. Of course, you should also calculate the confidence interval (CI) of the OR to determine whether the result is statistically significant: if the CI contains 1, then the OR, although > 1, is not statistically significant. [How to interpret odds ratios that are smaller than 1?](https://www.researchgate.net/post/How_to_interpret_odds_ratios_that_are_smaller_than_1)
* If OR = 1, then the odds of the event in the cases equal the odds of the event in the controls, so that event makes no difference; it is not a risk factor.
* If OR < 1, then the odds of the event in cases are lower than the odds of the event in controls. So, **if OR = 0.4**, the easiest thing to do is to invert it: 1/0.4 = 2.5. Now you have odds of the event in controls / odds of the event in cases = 2.5, and you would interpret: **the cases have 2.5 times lower odds of the event than the controls**, and the result is/is not statistically significant (depending on whether the CI includes 1). You can also interpret the result as the odds of the event being 60% (= (1 - 0.4) * 100) lower in cases than in controls, and the result is/is not statistically significant (insert the CI).
* Relative risk (RR) is used in cohort studies and RCTs. RR is the probability of having the disease in the exposed divided by the probability of having the disease in the non-exposed. You need to know the difference between OR and RR, and also the difference between probability and odds. The two are nevertheless often treated interchangeably in practice, and the OR approximates the RR when the outcome is rare, as the sketch below illustrates.
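A minimal sketch contrasting RR and OR on the same made-up 2x2 counts as the earlier odds ratio example; the two agree closely only when the outcome is rare:
```r!
# Made-up cohort counts: exposed vs unexposed (rows), diseased vs healthy (columns)
#             diseased  healthy
# exposed           20       80
# unexposed         10      190
risk.exposed   <- 20 / (20 + 80)    # 0.20
risk.unexposed <- 10 / (10 + 190)   # 0.05
risk.exposed / risk.unexposed       # RR = 4
(20 / 80) / (10 / 190)              # OR = 4.75, further from 1 than the RR
```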
#### Run binary logistic regression of a binary outcome (y) on predictors (x) in R and interpret the coefficient estimate
* [Generalized Linear Models (GLMs) in R, Part 4: Options, Link Functions, and Interpretation](https://www.theanalysisfactor.com/generalized-linear-models-glm-r-part4/)
```r!
# Set up a formula for a logistic model
## y: binary outcome coded 0/1 (or a two-level factor)
## x1, x2, x3: predictors (continuous or binary)
equation <- as.formula("y ~ x1 + x2 + x3")
fit <- glm(formula = equation, data = data, family = binomial(link = "logit"))
summary(fit)
```
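The coefficient estimates of a logistic regression are on the log-odds scale, so exponentiating them gives odds ratios. A minimal sketch, assuming `fit` is the glm object fitted above:
```r!
# Coefficients are log odds ratios; exponentiate to get odds ratios
exp(coef(fit))
# Approximate 95% CI for each OR from beta +/- 1.96*SE (Wald intervals)
beta <- coef(summary(fit))[, "Estimate"]
se   <- coef(summary(fit))[, "Std. Error"]
exp(cbind(OR = beta, lower = beta - 1.96 * se, upper = beta + 1.96 * se))
```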