Refresher Course for Statistics and Epidemiology

###### tags: `datathon2020`

---
slideOptions:
  transition: slide
---

# Refresher Course for Statistics and Epidemiology

Ryo Uchimido MD, MPH, Megu Baden MD, PhD, Toru Shirakawa MD

Workshop 01, Tokyo Datathon 2020

---

> This refresher course covers fundamental but important ideas in biostatistics and clinical epidemiology. We will begin with core concepts in statistics such as statistical distributions, statistical tests, p-values, and confidence intervals. Next we will go through fundamental concepts of clinical epidemiology such as measures of association, biases, DAGs (directed acyclic graphs), and effect modification. Then we will review linear regression models and logistic regression models with and without interaction terms. We will emphasize not only the statistical evaluation of regression models but also their interpretation from an epidemiological perspective. This course uses R, so fundamental skills in and knowledge of R are required. Of course, we’re happy to help if you have trouble during the course.

---

## Table of Contents

[ToC]

---

## 1. Review of Clinical Study Design

A clinical study is an experiment or investigation designed to answer a clinical question. A typical clinical question is ...

- Interventional Study
    - **Randomized Controlled Trial (RCT)**
    - Non-randomized Controlled Study
- Observational Study
    - Descriptive Study (without control group)
    - **Analytical Observational Study** (with control group)
        - Cross-sectional Study
        - Longitudinal Study
            - Case-control Study
            - Cohort Study

### Reference

- Clinical Epidemiology: The Essentials, Grant S. Fletcher, 2020.

---

## 2. Levels of Evidence: a Practical Perspective

---

### 2-1. Effectiveness of New Drug or Treatment

- **RCT**
    - Accepted hierarchy of evidence
        - Systematic Review / Meta-analysis of RCTs > RCT > Observational Study
    - **However, quality of design matters**
    - Tools for checking your RCT design
        - CONSORT: http://www.consort-statement.org/
            - checklist: https://www.equator-network.org/wp-content/uploads/2013/09/CONSORT-2010-Checklist-MS-Word.doc
            - flow diagram: https://www.equator-network.org/wp-content/uploads/2013/09/CONSORT-2010-Flow-Diagram-MS-Word.doc

---

### 2-2. Questions for which RCT is difficult or expensive

- **Observational Study**
    - Prospective Cohort Study > Longitudinal Study > Case-control Study
    - Tools for assessing the design of your observational study
        - STROBE Statement: https://www.strobe-statement.org/index.php?id=strobe-home
            - checklists: https://www.strobe-statement.org/index.php?id=available-checklists
    - Examples of high-quality observational studies
        - Nurses' Health Study: https://www.nurseshealthstudy.org/
        - Health Professionals Follow-up Study: https://sites.sph.harvard.edu/hpfs/

### Reference

- https://www.cebm.net/wp-content/uploads/2014/06/CEBM-Levels-of-Evidence-2.1.pdf

---

## Notation

- Random variables
    - $X$ : general random variable
    - $X_i$ : random variable for individual $i$
    - $Y$ : outcome
        - dichotomous outcome
            - $Y=1$ if dead
            - $Y=0$ if alive
        - continuous outcome
            - systolic blood pressure
            - $Y=128$ (mmHg)
    - $A$ : treatment assignment (sometimes denoted by $T$ or $W$)
        - $A=0$ control group
        - $A=1$ intervention group
        - can be extended to multiple treatments ($A=0,1,2,\ldots$)
    - $L$ : confounder(s)
        - age (continuous), sex (dichotomous), BMI (continuous)
        - type of insurance (categorical)
    - $U$ : unobserved confounder(s)
        - annual income

---

- Counterfactual
    - $Y^{0}, Y^{1}$
        - potential outcomes under a 'what if' assumption
        - can never be observed at the same time (fundamental problem of causal inference)
- Probability
    - $\mathbb{P}(X)$ : probability distribution of a random variable $X$
    - $\mathbb{E}(X)$ : expected value of a random variable $X$
    - $\mathrm{Var}(X) := \mathbb{E}\left[(X-\mathbb{E}[X])^2\right]$ : variance of $X$
- Estimators
    - $\hat\mu(X) := N^{-1}\sum_iX_i$ : sample mean
    - $\hat\pi(a|\ell)$ : estimated assignment probability (or propensity score)

---

## 3. New Drug and Placebo: a Motivational Example

- Does administration of a new drug improve the outcome for a patient $i$?
    - $Y^{1}_i - Y^{0}_i > 0$ ?
    - Problem: $Y^{a=1}_i$ and $Y^{a=0}_i$ cannot both be observed in the real world.
    - Instead, estimate the effect from a **population**.

---

- Do patients who received the new drug have, **on average**, a better outcome than those who did not?
    - $\mathbb{E}[Y^{a=1}-Y^{a=0}] > 0$ ?
    - $\mathbb{E}[Y|A=1] - \mathbb{E}[Y|A=0] > 0$ ?

Values in parentheses are counterfactual outcomes that are not observed.

|$L\ (\text{age})$|$A$|$Y^0$|$Y^1$|$Y$|
|:--:|:--:|:--:|:--:|:--:|
|40|0|1|(0)|1|
|40|0|1|(0)|1|
|60|0|1|(0)|1|
|60|0|1|(1)|1|
|60|0|1|(1)|1|
|60|0|0|(1)|0|
|40|1|(1)|0|0|
|40|1|(1)|0|0|
|40|1|(0)|0|0|
|40|1|(0)|0|0|
|40|1|(1)|0|0|
|60|1|(1)|1|1|
|60|1|(0)|1|1|

- Average treatment effect (ATE)
    - $\text{ATE} = \mathbb{E}[Y^{a=1} - Y^{a=0}]$, which equals $\mathbb{E}[Y|A=1] - \mathbb{E}[Y|A=0]$ only when treatment assignment is independent of the potential outcomes.
    - Problem: how treatment is assigned determines whether this inference is valid.

---

### R code

```
# Toy data: Y is the observed outcome, A is the treatment indicator
y <- c(0,1,1,0,0,1,0,1,1,0)
a <- c(0,1,0,0,1,1,0,0,1,0)
df <- data.frame(Y=y, A=a)
# Difference in mean outcome between treated (A=1) and control (A=0)
mean(df[df$A==1,]$Y) - mean(df[df$A==0,]$Y)
```

#### Output

`[1] 0.4166667`

---

## 4. Statistics for Randomized Controlled Trial

((Explanation of what an RCT is))

---

### 4-1. Random assignment

- $\mathbb{P}(A=1) = 1/2$.
- $A \perp Y^a$.
- "Table 1" in an RCT paper checks $A\perp L$ empirically. Randomization itself also ensures $A\perp U$ in expectation, hence $Y^a\perp A$.

---

### 4-2. Estimation

- continuous outcome
- dichotomous outcome

---

### $t$-test and $\chi^2$-test

- continuous outcome
    - Assumption:
        - Given $A=a$, $Y = \theta_a + \epsilon_a$ with $\epsilon_a \sim \mathcal{N}(0, \sigma_a^2)$, so that $\mathbb{E}[Y|A=a] = \theta_a$ for $a=0, 1$.
    - The sample mean $\mu_{a}(Y) = \frac{1}{N_a}\sum_{i:A_i=a}Y_i$ is an unbiased and consistent estimator of $\theta_a$ for $a=0, 1$.
        - $\mu_{a}(Y) \to \theta_a$ in probability as $N_a\to\infty$ (consistency)
        - $\mathbb{E}_{\text{data}}[\mu_{a}(Y)] = \theta_a$ (unbiasedness)
    - Null hypothesis $H_0$: $\theta_1 = \theta_0$.
        - $T = \left(\mu_1(Y) - \mu_0(Y)\right) / \widehat{\mathrm{SE}}$, where $\widehat{\mathrm{SE}}$ is the estimated standard error of the difference, follows a $t$-distribution under $H_0$ (approximately so when $\sigma_0 \neq \sigma_1$).
- dichotomous outcome

---

## 5. Measures of Treatment Effect

- measure of effect on continuous outcome
    - difference
        - $\mathbb{E}[Y^{a=1} - Y^{a=0}]$
        - $\mathbb{E}[Y|A=1] -\mathbb{E}[Y|A=0]$

---

- measure of effect on dichotomous outcome
    - Note that $\mathbb{E}[Y] = \mathbb{P}(Y=1)$, and $\mathbb{E}[Y|A=a] = \mathbb{P}(Y=1|A=a)$.
    - risk difference (RD)
        - causal RD: $\mathbb{P}[Y^{1}=1] - \mathbb{P}[Y^{0}=1]$
        - crude RD: $\mathbb{P}[Y=1|A=1] - \mathbb{P}[Y=1|A=0]$
    - risk ratio (RR)
        - causal RR: $\mathbb{P}[Y^{1}=1] / \mathbb{P}[Y^{0}=1]$
        - crude RR: $\mathbb{P}[Y=1|A=1] / \mathbb{P}[Y=1|A=0]$
    - odds ratio (OR)
        - causal OR: $O[Y^1=1] / O[Y^0=1]$
        - crude OR: $O[Y=1|A=1] / O[Y=1|A=0]$
        - where $O[Y=1|A=a]$ denotes the odds of $Y=1$ given $A=a$, defined as ${\mathbb{P}[Y=1|A=a]}/\left({1-\mathbb{P}[Y=1|A=a]}\right)$.

---

- outcome distribution
    - normal: pdf / cdf, p-value
    - binomial: pmf / cdf
- distribution of the sample mean
    - central limit theorem
        - $\bar{X} = \frac{1}{n}\sum_{i=1}^nX_i$, where the $X_i$ are i.i.d. with finite variance.
        - $\sqrt{n}\left(\bar{X} - \mathbb{E}(X)\right) \to \mathcal{N}(0, \mathrm{Var}(X))$ in distribution as $n\to\infty$
    - p-value

---

## 6. Hypothesis Testing

- p-value
    - The p-value is the probability, under the null hypothesis, that random chance alone would generate the observed data or something equally rare or rarer. Computing it takes two steps: first calculate the probability that random chance generates the observed data, then add the probability of every other outcome that is equally rare or rarer. If you know the pdf (or pmf) of the data under the null hypothesis, the p-value is easy to calculate.

---

- When $X \sim \mathcal{N}(\mu, \sigma^2)$, the (two-sided) p-value for an observed value $X = a$ is the probability that random chance generates an $X$ such that $|X-\mu| \ge |a-\mu|$.
    - ![](https://i.imgur.com/iwc8vdn.jpg)
- The p-value is the probability that the experiment would yield the observed result or a more extreme one under the null hypothesis $H_0$: $\mathbb{P}[\text{data at least as extreme as observed}|H_0]$.
- confidence interval

---

## 7. From RCT to Observational research

---

### Confounding

A confounding factor is a random variable ...

---

### DAG and confounding adjustment

- Directed Acyclic Graph (DAG)
    - A DAG is a graphical representation of a structural causal model
        - Nodes = random variables
        - Edge from $X\to Y$ denotes...
    - Example
- Methods for confounding adjustment
    - Regression
        - Conditional treatment effect
            - $\mathbb{E}[Y|A=1, L=\ell] - \mathbb{E}[Y|A=0, L=\ell]$
        - Linear regression: $\mathbb{E}[Y|A=a, L=\ell] = a\theta + g(\ell)$
    - Standardization
        - Marginal treatment effect
            - $\displaystyle{\int_\mathcal{L}\left(\mathbb{E}[Y|A=1, L=\ell] - \mathbb{E}[Y|A=0, L=\ell]\right)\,d\mathbb{P}(\ell)}$
    - Inverse probability weighted estimator (IPWE)
        - $\displaystyle{\hat{\mu}^{\text{IPWE}}_{a,n} = \frac{1}{n}\sum_{i=1}^{n}\frac{\mathbb{1}(A_i=a)\,Y_i}{\hat\pi(a|L_i)}}$
        - $\hat\pi$ is an estimate of $\pi(a|\ell) = \mathbb{P}(A=a|L=\ell)$ from some propensity model (logistic regression, random forest, etc.)

---

### References

- Robins JM, Hernán MA, Brumback B. Marginal structural models and causal inference in epidemiology. Epidemiology. 2000 Sep;11:550-60.

---

## 8. Linear Regression

- Generalized linear regression
    - linear regression
        - $\mathbb{P}[Y^a=1] = \psi_0 + \psi_1a$
            - $\text{causal RD} = \psi_1$
        - $\mathbb{P}[Y=1|A=a] = \psi'_0 + \psi'_1a$
            - $\text{crude RD} = \psi'_1$
    - log-linear model
        - $\mathrm{log}(\mathbb{P}[Y^a=1]) = \theta_0 + \theta_1a$
            - $\text{causal RR} = \exp(\theta_1)$
        - $\mathrm{log}(\mathbb{P}[Y=1|A=a]) = \theta'_0 + \theta'_1a$
            - $\text{crude RR} = \exp(\theta'_1)$
    - logistic regression
        - link function: $\mathrm{logit}(p) = \log(p/(1-p))$
        - $\mathrm{logit}(\mathbb{P}[Y^a=1]) = \beta_0 + \beta_1a$
            - $\text{causal OR} = \exp(\beta_1)$
        - $\mathrm{logit}(\mathbb{P}[Y=1|A=a]) = \beta'_0 + \beta'_1a$
            - $\text{crude OR} = \exp(\beta'_1)$
- statistical model
    - estimating model function
        - (What does this mean?)
- Linear regression: $R^2$ and least squares
    - minimizing the squared error
        - $L(y, \hat y) = \sum_i (y_i-\hat y_i)^2$
        - $\hat\beta = \mathrm{arg\ min}_\beta L(y, X\beta)$
- Logistic regression: maximum likelihood estimation
    - The likelihood is the probability of the observed data $O$ given a specified model $\mathcal{M}(\theta)$ (with a parameter $\theta$) of the true distribution: $L(O|\mathcal{M}(\theta))=\mathbb{P}(O|\mathcal{M}(\theta))$.
        - For independent observations $O_i$: $L(O|\mathcal{M}(\theta)) = \prod_iL(O_i|\mathcal{M}(\theta))$
        - $\log L(O|\mathcal{M}(\theta)) = \sum_i\log L(O_i|\mathcal{M}(\theta))$
    - Maximum likelihood estimation (MLE) chooses the estimate of the parameter that maximizes the likelihood
        - $\hat\theta = \mathrm{arg\ max}_\theta\mathbb{P}(O|\mathcal{M}(\theta))$
    - The MLE is consistent under regularity conditions; it is not unbiased in general in finite samples.
        - $\hat\theta \to \theta$ in probability as $n\to\infty$
- effect modification
    - The effect of $A$ differs across levels of $X$: $\mathbb{E}[Y^1 - Y^0|X=x_1] \neq \mathbb{E}[Y^1 - Y^0|X=x_0]$
    - interpretation of the coefficients of regression models with and without an interaction term that reflects possible effect modification
        - without interaction: $\mathbb{E}[Y^a|X=x] = \psi_0 + \psi_1 a$
            - $\mathbb{E}[Y^1|X=x]-\mathbb{E}[Y^0|X=x] = \psi_1$ (the same for every $x$)
        - with interaction: $\mathbb{E}[Y^a|X=x] = \psi_0 + \psi_1 a + \psi_2 a x$
            - $\mathbb{E}[Y^1|X=x]-\mathbb{E}[Y^0|X=x] = \psi_1 + \psi_2 x$ (varies with $x$ when $\psi_2 \neq 0$)

---

## 9. Machine Learning

- The estimators of the assignment probability $\pi$ in the previous sections can be replaced by more complex models such as machine learning models.
- benefits of machine learning
    - missing values
    - large datasets
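
---

### R code

The following is a minimal sketch, on simulated data, of how an estimated assignment probability $\hat\pi$ plugs into an inverse probability weighted estimate of the risk difference. The data-generating values (sample size, probabilities, a simulated risk difference of 0.2) and all variable names are illustrative assumptions, not part of the course material. The propensity model here is a plain logistic regression fitted with `glm`; a machine learning model could be substituted at that step, provided it returns an estimate of $\mathbb{P}(A=1|L)$.

```r
# Simulated data; every number below is an illustrative assumption.
set.seed(2020)
n <- 5000
L <- rbinom(n, 1, 0.5)                       # confounder, e.g. 1 = older age group
A <- rbinom(n, 1, ifelse(L == 1, 0.3, 0.7))  # treatment is less likely when L = 1
Y <- rbinom(n, 1, 0.2 + 0.2 * A + 0.4 * L)   # outcome; simulated causal RD is 0.2
df <- data.frame(L = L, A = A, Y = Y)

# Crude risk difference: confounded by L
crude_rd <- mean(df$Y[df$A == 1]) - mean(df$Y[df$A == 0])

# Propensity model: estimate P(A = 1 | L) with logistic regression.
# A more flexible learner could replace glm() at this step.
ps_fit <- glm(A ~ L, family = binomial(), data = df)
pi1_hat <- predict(ps_fit, type = "response")

# Estimated probability of the treatment each subject actually received
pi_obs <- ifelse(df$A == 1, pi1_hat, 1 - pi1_hat)

# IPW estimate of P(Y^a = 1): weight subjects with A = a by 1 / pi_hat
ipw_risk <- function(a) sum((df$A == a) * df$Y / pi_obs) / nrow(df)
ipw_rd <- ipw_risk(1) - ipw_risk(0)

c(crude = crude_rd, ipw = ipw_rd)
```

Under these assumed parameters the crude difference is pulled toward zero by confounding through $L$, while the weighted estimate should land near the simulated value of 0.2, up to sampling error.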