# Paper summary: Treatment for Mild Chronic Hypertension during Pregnancy

This note is written for Airiel during the final phases of my PhD thesis writing. As shown below, when you do not want to write your thesis, writing anything else is far more enticing. In any case, this note covers the statistical analysis section of [Treatment for Mild Chronic Hypertension during Pregnancy](https://www.nejm.org/doi/full/10.1056/NEJMoa2201295) by Tita et al. The goal is to develop a broad understanding of the statistical methods applied in this work. Definitions are highlighted in blue. The information in this note is largely gathered during an hour-long train ride reading sources on [a tutorial on standard notions in medical statistics](https://pubmed.ncbi.nlm.nih.gov/34070675/), [alpha spending functions](https://online.stat.psu.edu/stat509/lesson/9/9.6), [O'Brien-Fleming boundaries](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4024106/), [multiplicity issues](https://www.quantics.co.uk/blog/multiplicity-what-is-it-and-why-is-it-important/), [sample size determination](https://www.cebm.net/wp-content/uploads/2014/12/Was-the-study-big-enough-Cafe-Rules.pdf), [analysis approaches](https://www.clinfo.eu/itt-vs-pp/) etc.

## A quick refresh on hypothesis testing

The key idea is that we can never be fully sure about a statistical result. Therefore, we work with probabilities instead. We need four things: a hypothesis, a statistic (a numerical value computed from the data), a probability distribution, and a cut-off value $\alpha$ (often called the level of significance). The hypothesis usually states something conservative, e.g. the drug has no effect. For this reason we also call it the _null_ hypothesis. We then calculate the statistic of choice from the collected data and check the probability ($p$-value) of observing a statistic at least as extreme as ours, assuming the null hypothesis is true. If the $p$-value is very small, it means it is very unlikely for us to arrive at our results if the drug has no effect. We take that as evidence that the drug indeed has an effect, and we say that we _reject_ the null hypothesis, i.e. the result is significant. Otherwise, we say that we fail to reject the null hypothesis.

How small of a $p$-value is small enough? The level of significance $\alpha$ is a cut-off that is more or less arbitrarily decided and agreed upon by everyone, and is usually taken as 5%. Different types of hypothesis tests are obtained simply by choosing different test statistics and their probability distributions. In practice, these can be done easily with some software or code, as sketched below. A subtle but salient point is that rejecting the null is not the same as _proving_ outright that the drug has an effect.
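As a concrete illustration, here is a minimal sketch of a two-proportion $z$-test in Python. The numbers are made up and are not from the paper; the point is just to show the four ingredients at work: a null hypothesis (both groups share the same event probability), a statistic ($z$), a reference distribution (the standard normal), and the cut-off $\alpha$.

```python
from scipy.stats import norm

# Hypothetical counts (not from the paper): 120 events out of 1000 patients
# in the control group vs 90 events out of 1000 in the treated group.
x_control, n_control = 120, 1000
x_treated, n_treated = 90, 1000

p_control = x_control / n_control
p_treated = x_treated / n_treated

# Test statistic: two-proportion z-statistic, using the pooled event
# probability under the null hypothesis of "no difference".
p_pooled = (x_control + x_treated) / (n_control + n_treated)
se = (p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_treated)) ** 0.5
z = (p_control - p_treated) / se

# p-value: probability of a statistic at least this extreme if the null
# hypothesis is true (two-sided, standard normal reference distribution).
p_value = 2 * norm.sf(abs(z))

alpha = 0.05
print(f"z = {z:.3f}, p = {p_value:.4f}, significant at 5%: {p_value < alpha}")
```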
## A walkthrough

> The data and safety monitoring board approved a final sample size of 2404 (1202 per group), which was reduced from the originally planned enrollment of 4700 patients, as sufficient to detect a relative reduction of 33% in the incidence of the composite primary-outcome events. In these calculations, we assumed a baseline incidence of primary-outcome events of 16% in the control group, 10% nonadherence to the trial regimen or crossover, and 5% loss to follow-up, with 85% power and a two-sided alpha level of 0.05.

:::info
__Primary outcomes__: They are the main effects we wish to study and are listed in the Methods section of the paper. Unsurprisingly, the trial is specifically designed to allow us to properly measure the primary outcomes. The crucial point is that any other supporting findings (commonly called secondary outcomes) from the trial must be scrutinized and doubted, as the trial is not specifically designed for them.
:::

:::info
__Relative reduction__: Relative reduction (in risk) compares two values: the risk (probability) of the primary outcome in the control group and the risk of the primary outcome in the exposed group. E.g. if the control group has a 20% incidence of the primary outcome and the exposed group has a 10% incidence, then the relative reduction is $(20-10)/20 = 50\%$. Importantly, the relative reduction alone tells us neither risk probability, just how they compare: a drop from 5% to 2.5% yields a relative reduction of 50% as well. A responsible study should report both relative and absolute values.
:::

To detect some degree of relative reduction, we need enough people (a large enough sample size) so that the results are unlikely to arise from statistical fluctuations (chance). Intuitively, to detect a smaller difference, we need a larger sample size --- a small difference is more likely to happen by chance, so we need more observations to be sure. A general rule of thumb states that [to detect a difference half the size, we need to quadruple the sample size](https://www.cebm.net/wp-content/uploads/2014/12/Was-the-study-big-enough-Cafe-Rules.pdf). The baseline incidence is the rate of primary-outcome events assumed for the control group (16% here); it is sometimes loosely called the _prevalence_.

The statistical power of 85% means there is an 85% chance the hypothesis test correctly rejects the null when the treatment truly has an effect of the assumed size. Another way to express this is that there is a 15% chance that the treatment really works but the test fails to detect it (a false negative, or type II error). The alpha of 5% is a convention, and the fact that it is two-sided is a detail.

> A blinded reassessment of the sample size that was performed after 800 patients had completed the trial revealed that the incidence of the primary outcome was at least 30%. Thus, we determined that the enrollment of 2404 patients would suffice to detect relative effect sizes of 25% or more. This sample size would provide more than 80% power to detect a relative difference of 35% or more in the incidence of small-for-gestational age birth weight, assuming a baseline incidence as low as 10%.

In other words, the blinded reassessment showed that the observed incidence of the primary outcome (at least 30%) was much higher than the 16% originally assumed, so the reduced sample size of 2404 is still sufficient, in fact sufficient to detect even smaller relative effects (25% or more).

> The primary analyses were performed in the intention-to-treat population. When the primary composite or birth-weight outcomes were undetermined (e.g., withdrawal from the trial before delivery), multiple imputation methods with five replicates were used.

:::info
__Intention-to-treat (ITT) population__: Basically everyone randomized in the trial is accounted for in the results, despite some common hurdles, e.g. some patients fail to follow up, quit the study, or die halfway. The key idea is that excluding this "faulty data" changes the outcome itself. For example, if the drug unfortunately kills some of the patients, then excluding these dead (ex-)patients from the analysis skews the results. This is in contrast with per-protocol (PP) analysis, where we only consider the subjects who fully adhere to the procedures of the trial from start to finish. ITT analysis unsurprisingly tends to underestimate the strength of an effect, which is conservative: it makes false positives (type I errors) less likely.
:::

This means the results are analysed even for subjects who fail to follow up or fail to adhere to the procedures in other ways. In those cases, they do their best to estimate (impute) the missing data using the data that was collected. E.g. if we are collecting some patient information and one person left the weight blank, a simple solution is to replace that blank with the average weight computed from the overall dataset, i.e. if we don't know, we guess the average. Imputation is a complex area in itself; the simplest version is sketched below.
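As a toy illustration of the simplest kind of imputation (not the multiple-imputation procedure used in the paper, and with made-up data), a missing value can be filled in with the column mean:

```python
import numpy as np
import pandas as pd

# Hypothetical patient table with one missing weight (made-up numbers).
df = pd.DataFrame({
    "age":    [24, 31, 28, 35, 29],
    "weight": [68.0, 75.0, np.nan, 82.0, 71.0],
})

# Single mean imputation: if we don't know, guess the average.
df_imputed = df.fillna({"weight": df["weight"].mean()})
print(df_imputed)

# Multiple imputation (as in the paper, with five replicates) instead draws
# several plausible values for each missing entry, analyses each completed
# dataset separately, and pools the estimates, so the extra uncertainty from
# the missing data is reflected in the final confidence intervals.
```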
> Details regarding these analyses are provided in the Supplementary Appendix. Multivariable log-binomial models were applied to each replicated set, and assessments of treatment effect were pooled. Adjusted risk ratios, 95% confidence intervals, and tests of statistical significance were calculated. Complete-case analyses were also conducted among all the patients with available data regarding the primary outcome and small-for-gestational-age birth weight; risk ratios and 95% confidence intervals were calculated. We also determined the number of patients who would need to be treated to prevent one primary-outcome event and the 95% confidence interval.

:::info
__Complete-case analysis__: Analysis using only the patient records with no missing values, not to be confused with per-protocol (PP) analysis.
:::

Remember that for hypothesis testing we need to compute a statistic? In this case, the statistic comes from the multivariable log-binomial models. In essence, we look at the data and make a guess that the data follow some formula (a model). The log-binomial model is a standard choice for applications involving binary (yes/no) outcomes. The model has several parameters, just like how the straight line $y = mx + c$ can take on different values of $m$ and $c$ for different slopes and intercepts respectively. From the data collected, we make our best guess of the parameters so that the resulting model matches the data. This is called fitting the model. The fitted model then serves as an approximation of the data, just like how a straight line $y = mx + c$ might be used to represent a bunch of data points that roughly follow a linear trend. In this case, the parameters of the log-binomial model (after exponentiation) represent the adjusted risk ratios! Long story short, we infer the adjusted risk ratios from the data, based on the assumption that the data follow a log-binomial model. More on adjusted risk/odds ratios [here](https://www.statology.org/adjusted-odds-ratio/). A minimal sketch of fitting such a model is given below, after the definition.

The straight line example highlights an obvious caveat: if the data points exhibit, say, a wavy trend, then the straight line model is inappropriate! Similarly, this study assumes that the data satisfy a series of statistical assumptions under which the log-binomial model is suitable. This is standard practice, but that does not mean it is guaranteed to be safe. Recall the saying "all models are wrong, but some are useful".

:::info
__Adjusted risk ratio__: Let's say we want to investigate the effect of variable $A$ on outcome $Y$. We can take a bunch of $(A, Y)$ samples, fit them to a model, and check whether the relationship (quantified by the model parameter) is significant. However, in real life there are other variables $B, C, D$ that affect $Y$ too. If we take them all into account, we should fit a (multivariable) model of $Y$ against $A, B, C, D$, and each factor $A, B, C, D$ will have its own parameter. Compare the parameter for $A$ in the simpler model and in the more complex model: the two values will generally be different. Specifically, the parameter in the multivariable model is calculated with the other factors $B, C, D$ taken into account. Therefore, we say that the parameter value is now _adjusted_; in this case, it represents the _adjusted_ risk ratio. Simply put, the _adjusted_ parameter generally gives a more comprehensive analysis as it accounts for more of the relevant factors.
:::
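To make the model-fitting step concrete, here is a minimal sketch of fitting a log-binomial model with `statsmodels` on simulated data. The covariate names and the simulated numbers are hypothetical and chosen purely for illustration; this is not the paper's actual analysis, which is also combined with the multiple imputation and pooling across five replicates described above.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000

# Hypothetical covariates: a treatment indicator plus two made-up confounders.
treatment = rng.integers(0, 2, size=n)
age = rng.normal(30, 5, size=n)
bmi = rng.normal(28, 4, size=n)

# Simulate a binary outcome whose log-risk is linear in the covariates;
# this is exactly the assumption a log-binomial model makes.
log_risk = -2.0 - 0.4 * treatment + 0.01 * (age - 30) + 0.02 * (bmi - 28)
outcome = rng.binomial(1, np.exp(log_risk))

X = sm.add_constant(pd.DataFrame({"treatment": treatment, "age": age, "bmi": bmi}))

# Log-binomial model = binomial family with a log link, so the exponentiated
# coefficients are (adjusted) risk ratios rather than odds ratios.
model = sm.GLM(outcome, X, family=sm.families.Binomial(link=sm.families.links.Log()))
result = model.fit()

print(np.exp(result.params))      # adjusted risk ratios
print(np.exp(result.conf_int()))  # 95% confidence intervals
```

The key design choice is the log link: swapping it for a logit link gives ordinary logistic regression, in which case the exponentiated coefficients become odds ratios instead, which is the replication analysis the authors mention next.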
Now, we can compare these fitted parameters to the parameters that would have resulted under the null hypothesis. If they are sufficiently larger or smaller, i.e. the corresponding $p$-value falls below $\alpha$, then we say that the result is significant, i.e. the difference in risk is significant, supporting the claim that the treatment does make a difference. The 95% confidence interval simply comes from $1-\alpha = 1-0.05 = 0.95$. Intuitively, there is a 5% chance that the result happened by luck, so I am only 95% confident in my findings.

> We replicated the primary-outcome analyses using logistic regression to estimate odds ratios according to the prespecified statistical plan. In addition, we conducted per-protocol analyses (in which crossovers were included in the group as treated) and survival analyses to account for the time that patients had been enrolled in the trial; both analyses included patients who had been lost to follow-up.

Just some details on the statistical approach, which tell us what they do when patients stop following up, die, and so on.

> We performed one planned interim analysis of the primary outcome using a Lan–DeMets alpha spending function that approximated O’Brien–Fleming boundaries. The alpha level for the final primary analysis was therefore 0.0492;

:::info
__Interim analysis__: Analyzing the results and drawing conclusions from the data collected so far, before the study is completed. Naturally, we should apply common sense: e.g. if we test pregnant women aged 20-30 first and those aged 30-40 at a later time, doing an interim analysis right after collecting data on the 20-30 year olds will give unreliable results, as we have missed out the rest of the population. A safe bet is to truly randomize the tests across time, so that at any interim point the sample is a reasonable estimate of the population.
:::

Interim analysis is useful because if the partial results are enough to make or break the claim, we can stop and everyone can go home early. However, the earlier we stop the trial, the less data we have to learn from, and the higher the chance that the conclusion is a statistical fluke. Therefore, the earlier we stop, the more stringent we must make our hypothesis testing --- the smaller the $\alpha$ must be! It is not at all obvious how to adjust the hypothesis test based on the timing of the interim analysis. O'Brien-Fleming boundaries offer a mathematical recipe for adjusting the alpha value, and the authors use the Lan-DeMets spending function to approximate them, which gives a final alpha of 0.0492 (surely, we already know this must be lower than the preset 0.05). The classical O'Brien-Fleming boundaries require the number and timing of the analyses to be fixed in advance, so the Lan-DeMets spending function is used as a flexible approximation of them, a very common strategy in mathematical computations. Again, this is standard practice for early stopping of experiments. One can in principle perform multiple interim analyses, and the details of how to choose the interim timings and how to adjust the hypothesis test to accommodate the early stopping are an interesting subject; a sketch of the spending-function idea is given below.
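For a flavour of what an alpha spending function looks like, here is a small sketch of one common form of the O'Brien-Fleming-type Lan-DeMets spending function. The information fractions below are made up (the paper does not say when its single interim analysis took place), and turning the "spent" alpha into the actual interim and final boundaries (such as the 0.0492) additionally requires accounting for the correlation between the interim and final test statistics, which is normally left to dedicated group-sequential software.

```python
from scipy.stats import norm

def obf_alpha_spent(t, alpha=0.05):
    """Cumulative two-sided type I error 'spent' by information fraction t
    (0 < t <= 1) under an O'Brien-Fleming-type Lan-DeMets spending function:
    alpha*(t) = 2 * (1 - Phi(z_{alpha/2} / sqrt(t)))."""
    z = norm.ppf(1 - alpha / 2)
    return 2 * (1 - norm.cdf(z / t ** 0.5))

# Hypothetical information fractions (fraction of the planned data observed).
for t in (0.25, 0.50, 0.75, 1.00):
    print(f"t = {t:.2f}: cumulative alpha spent = {obf_alpha_spent(t):.5f}")
# Very little alpha is spent early on, so stopping early requires an extreme
# result; at t = 1 the full 0.05 has been spent, as it must be.
```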
> the safety outcome was evaluated at a 0.05 significance level. There was no prespecified plan to adjust for multiple testing. Results for secondary outcomes are reported with 95% confidence intervals without adjustment for multiplicity and thus should not be used to infer definitive effects.

The secondary outcomes of the study are there for consideration but should not be taken too seriously. This is because of multiplicity --- from the same data we can compute many quantities, and some of those measurements are intended to find out similar or even the same things. So if you measure a specific effect in 10 different ways, it becomes quite likely that at least one of them will give you a significant result purely by chance. In that case, you may end up rejecting the null erroneously. This highlights the importance of careful experimental design, and how data can be manipulated to say pretty much whatever you want it to say.
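To put a rough number on this, suppose (optimistically) that the 10 measurements are independent and each is tested at $\alpha = 0.05$. The chance that at least one of them comes out significant even when there is no real effect is

$$
1 - (1 - 0.05)^{10} \approx 0.40,
$$

i.e. roughly a 40% chance of at least one false positive, which is exactly why unadjusted secondary outcomes "should not be used to infer definitive effects".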