## Types of Missingness in RWD
Source: <https://sentinelinitiative.org/sites/default/files/documents/smdi_r_pharma2023_vF.pdf>
---
[toc]
---
### Glossary of Acronyms / Nodes
- **C1**: Covariate 1, an observed confounder (e.g., age, sex, comorbidity).
- **COI**: Covariate of Interest, a key variable needed for adjustment (e.g., disease severity).
- **COI_obs**: Observed portion of the Covariate of Interest, i.e., the part of COI that is actually recorded.
- **E**: Exposure, the treatment or intervention (e.g., drug or therapy).
- **Y**: Outcome, the endpoint of interest (e.g., survival, response to treatment).
- **M**: Missingness indicator for the COI (M = 1 if COI is missing, M = 0 if observed).
- **U**: Unmeasured variable, a factor that affects exposure and/or outcome as well as the missingness of COI, but is not captured in the dataset (e.g., socioeconomic status, patient preferences, or provider-level factors).
> **No arrow from E to Y** in any panel.
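
The relationship between COI, M, and COI_obs in the glossary can be written as a one-line rule. The following is a minimal Python sketch; the function name `observe` and the sample values are hypothetical and are used only for illustration.

```python
from typing import List, Optional

def observe(coi: float, m: int) -> Optional[float]:
    """Return COI_obs: the true COI when M = 0, or None (missing) when M = 1."""
    if m not in (0, 1):
        raise ValueError("M must be 0 (observed) or 1 (missing)")
    return None if m == 1 else coi

# Hypothetical true severity values and missingness indicators
coi_true: List[float] = [2.5, 4.0, 1.2]
m_flags: List[int] = [0, 1, 0]

coi_obs = [observe(c, m) for c, m in zip(coi_true, m_flags)]
print(coi_obs)  # [2.5, None, 1.2]
```
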
---
### (a) Missing Completely at Random (MCAR)
```mermaid
flowchart LR
C1 --> E
C1 --> Y
COI --> E
COI --> Y
COI --> COI_obs
M --> COI_obs
```
**Explanation**: Missingness (M) occurs by pure chance, unrelated to any measured or unmeasured variable.
Complete case analysis yields unbiased estimates, though with some efficiency loss.
#### Walkthrough
| Connection | Interpretation |
| ------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **C1 → E** | Patients’ fully observed baseline covariates (age, sex, etc.) can influence which treatment they receive. |
| **C1 → Y** | Those same covariates also affect the outcome (e.g., older patients have higher mortality). |
| **COI → E** | Even though COI (disease severity) may be incompletely recorded, its true value can still sway treatment choice (e.g., more severe patients get the new therapy). |
| **COI → Y** | Severity logically drives prognosis. |
| **COI → COI\_obs** | We observe a value only if it is measured; thus the true COI deterministically generates its recorded counterpart. |
| **M → COI\_obs** | If M = 1 (missing), COI\_obs must be NA; if M = 0, COI\_obs gets the measured value. The arrow clarifies that COI\_obs is created *after* the missingness decision. |
**What is *not* connected—and why**
*No variable points into **M***.
That expresses the essence of MCAR: the probability of missingness is independent of C1, COI, E, Y, or any unmeasured factors. In practice this means any record can be missing purely by chance (e.g., random database outage).
**Implications for analysis**
* **Bias-free complete-case analysis**: because missingness is unrelated to study variables, simply deleting rows with missing COI does not distort causal estimates—it only reduces precision.
* **Efficiency loss**: discarding data still lowers statistical power; imputation can recover efficiency but is not required for unbiasedness.
* **Contrast with other panels**: panels (b)–(d) introduce arrows into **M**, making missingness systematic (MAR or MNAR) and therefore potentially biasing naïve analyses.
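
The MCAR claim can be checked with a small simulation. The sketch below is illustrative only: it assumes a standard-normal COI and a 30% missingness rate (both arbitrary choices, not from the source), and it compares the mean of COI rather than a causal estimate to keep the example short. Under MCAR the complete-case mean should sit close to the full-data mean, while the number of usable records drops.

```python
import random
import statistics

random.seed(0)
N = 20_000

# True severity (COI); the distribution is an arbitrary assumption
coi_true = [random.gauss(0.0, 1.0) for _ in range(N)]

# MCAR: every record has the same 30% chance of being missing,
# regardless of COI or any other variable
missing = [random.random() < 0.3 for _ in range(N)]

complete_cases = [c for c, m in zip(coi_true, missing) if not m]

full_mean = statistics.fmean(coi_true)
cc_mean = statistics.fmean(complete_cases)

print(f"full-data mean     = {full_mean:.3f} (n = {N})")
print(f"complete-case mean = {cc_mean:.3f} (n = {len(complete_cases)})")
```
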
---
### (b) Missing at Random (MAR)
```mermaid
flowchart LR
C1 --> E
C1 --> Y
C1 --> M
COI --> E
COI --> Y
Y --> M
COI --> COI_obs
M --> COI_obs
```
#### Walkthrough
| Connection | Interpretation |
| ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **C1 → E** | Observed baseline covariates (e.g., age, sex) influence treatment choice. |
| **C1 → Y** | Those covariates also influence outcome (prognosis). |
| **COI → E** | The true severity still steers treatment selection. |
| **COI → Y** | Severity affects outcome. |
| **C1 → M** | Certain patient characteristics (older age, frailty) make a COI measurement more or less likely to be missing (e.g., clinicians skip a lab in frail patients). |
| **Y → M** | Early events (e.g., rapid death or hospitalization) can pre-empt measurement, so the outcome influences whether COI is ever measured. |
| **COI → COI\_obs** | Observed value is produced from the true value once measured. |
| **M → COI\_obs** | If M = 1, COI\_obs is forced to missing; if M = 0, COI\_obs holds the measured value. |
**Key MAR logic**
*All arrows into **M** originate from variables that are **observed** (C1 and Y).
COI itself does **not** point to M; nor do unmeasured variables.*
That is the MAR assumption: once you condition on variables we have in the data set, missingness no longer depends on unobserved information.
**Implications for analysis**
* **Correctable bias**: because the missingness mechanism is fully explained by observed variables, bias can be removed with multiple imputation, inverse probability weighting (IPW), or full-information maximum likelihood, each conditioning on C1 and Y.
* **Why E is not connected to M**: treatment is assigned before missingness is realized, so in many EHR contexts the *exposure itself* does not drive whether a baseline covariate is recorded. (If it did, an E → M arrow would appear and MAR would still hold because E is observed; this panel assumes that only C1 and Y matter.)
* **Contrast to MCAR**: MCAR had *no* arrows into M; here we have systematic missingness but only via observed nodes, making it MAR.
* **Contrast to MNAR panels**: panels (c) and (d) add arrows from an unmeasured factor **U** or from COI itself into M, violating the MAR assumption.
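
A small simulation can illustrate why MAR bias is correctable. The sketch below makes several simplifying assumptions that are not in the source: C1 is binary, missingness depends on C1 only (the Y → M arrow is left out for brevity), and the target is the mean of COI rather than a treatment effect. It estimates P(M = 0 | C1) from the observed data and reweights complete cases by the inverse of that probability, which is one simple form of IPW.

```python
import random
import statistics

random.seed(1)
N = 50_000

rows = []
for _ in range(N):
    c1 = int(random.random() < 0.5)              # observed covariate (e.g., frail yes/no)
    coi = random.gauss(1.0 if c1 else 0.0, 1.0)  # severity is higher when c1 = 1
    p_miss = 0.6 if c1 else 0.1                  # MAR: missingness depends on c1 only
    m = int(random.random() < p_miss)
    rows.append((c1, coi, m))

true_mean = statistics.fmean([coi for _, coi, _ in rows])
cc_mean = statistics.fmean([coi for _, coi, m in rows if m == 0])

# Estimate P(M = 0 | C1) within each level of C1, using observed data only
p_observed = {}
for level in (0, 1):
    flags = [m for c1, _, m in rows if c1 == level]
    p_observed[level] = 1 - sum(flags) / len(flags)

# IPW: weight each complete case by 1 / P(M = 0 | C1), then normalise
weighted_sum = sum(coi / p_observed[c1] for c1, coi, m in rows if m == 0)
weight_total = sum(1 / p_observed[c1] for c1, _, m in rows if m == 0)
ipw_mean = weighted_sum / weight_total

print(f"true mean          = {true_mean:.3f}")
print(f"complete-case mean = {cc_mean:.3f}")
print(f"IPW mean           = {ipw_mean:.3f}")
```
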
---
### (c) Missing Not at Random (MNAR) via **U**
```mermaid
flowchart LR
C1 --> E
C1 --> Y
COI --> E
COI --> Y
U --> E
U --> Y
U --> M
COI --> COI_obs
M --> COI_obs
```
#### Walkthrough
| Connection | Interpretation |
| ------------------ | -------------------------------------------------------------------------------------------------------------------------------------------- |
| **C1 → E** | Baseline confounders (age, sex) affect treatment choice. |
| **C1 → Y** | Same confounders influence outcome. |
| **COI → E** | True severity still helps determine the therapy a clinician selects. |
| **COI → Y** | Severity affects prognosis. |
| **U → E** | An unmeasured factor (e.g., socioeconomic status, cognitive decline, provider preference) shapes treatment choice beyond what is observed. |
| **U → Y** | That same factor also impacts outcome (e.g., low SES worsens survival). |
| **U → M** | The unmeasured factor alters the chance that severity gets recorded—poor patients might miss follow-up labs, or certain clinics omit a test. |
| **COI → COI\_obs** | The observed value is generated from the true severity when measured. |
| **M → COI\_obs** | If M = 1, COI\_obs is forced missing; if M = 0, it stores the measured value. |
**How this structure differs from (a) and (b)**
* Unlike MCAR (panel a), **M is no longer random**: it depends systematically on a variable (U) that also affects exposure and outcome.
* Unlike MAR (panel b), the driver of missingness is **unobserved** (U), so conditioning on observed variables cannot make M independent of the data-generating process.
* No arrows run from C1 or Y into M; only U drives the missingness.
**Statistical implications**
* **Residual confounding**: U also affects E and Y, so naively adjusting for C1 and COI\_obs cannot block all backdoor paths; estimates remain biased.
* **MAR violation**: Because U is unmeasured, the MAR assumption is broken; multiple imputation or IPW based only on observed covariates will not recover the truth.
* **Analytic options**:
    * Sensitivity analyses (delta-adjusted imputation, pattern-mixture, selection models) to quantify how strong U must be to explain away results.
    * Use external data or a validation subset where U is observed.
    * Look for an instrumental variable that affects E but not Y or M except through E.
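
The residual-confounding point can be illustrated with a simulation that follows the arrows of panel (c). This is a sketch under stated assumptions, not a recommended analysis: all variables are binary, C1 is omitted to keep the code short, every probability is an arbitrary choice, and the helper `stratified_risk_difference` is hypothetical. Because no arrow runs from E to Y, the true risk difference is zero. Adjusting complete cases for COI alone leaves a spurious effect, while additionally adjusting for U (as would be possible in a validation subset where U is recorded) brings the estimate back toward zero.

```python
import random

random.seed(2)
N = 200_000

records = []
for _ in range(N):
    u = int(random.random() < 0.5)                        # unmeasured factor U
    coi = int(random.random() < 0.5)                      # true severity (binary here)
    e = int(random.random() < 0.2 + 0.3 * u + 0.3 * coi)  # U -> E, COI -> E
    y = int(random.random() < 0.1 + 0.3 * u + 0.3 * coi)  # U -> Y, COI -> Y, no E -> Y
    m = int(random.random() < (0.6 if u else 0.1))        # U -> M
    records.append({"u": u, "coi": coi, "e": e, "y": y, "m": m})

def stratified_risk_difference(rows, adjust_for):
    """Size-weighted average of within-stratum P(Y=1 | E=1) - P(Y=1 | E=0)."""
    strata = {}
    for r in rows:
        key = tuple(r[name] for name in adjust_for)
        cell = strata.setdefault(key, {0: [0, 0], 1: [0, 0]})  # e -> [count, events]
        cell[r["e"]][0] += 1
        cell[r["e"]][1] += r["y"]
    total = sum(c[0][0] + c[1][0] for c in strata.values())
    estimate = 0.0
    for c in strata.values():
        size = c[0][0] + c[1][0]
        risk_diff = c[1][1] / c[1][0] - c[0][1] / c[0][0]
        estimate += risk_diff * size / total
    return estimate

complete_cases = [r for r in records if r["m"] == 0]
rd_observed_only = stratified_risk_difference(complete_cases, ["coi"])
rd_with_u = stratified_risk_difference(complete_cases, ["coi", "u"])

print(f"adjusted for COI only : {rd_observed_only:.3f}")
print(f"adjusted for COI and U: {rd_with_u:.3f}")
```
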
---
### (d) Missing Not at Random (MNAR) via **COI**
```mermaid
flowchart LR
C1 --> E
C1 --> Y
COI --> E
COI --> Y
COI --> M
COI --> COI_obs
M --> COI_obs
```
#### Walkthrough
| Connection | Interpretation |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **C1 → E** | Baseline covariates (e.g., age, sex, comorbidities) influence which therapy a patient receives. |
| **C1 → Y** | Those same covariates affect clinical outcome. |
| **COI → E** | True disease severity steers treatment choice (e.g., more advanced cancer → aggressive regimen). |
| **COI → Y** | Severity directly impacts prognosis. |
| **COI → M** | **Hallmark of this diagram:** the unobserved value itself governs its chance of being missing—for instance, severely ill patients may skip a functional-status test, or mild cases might not have a biomarker ordered. |
| **COI → COI\_obs** | When measured, the observed value records the true severity. |
| **M → COI\_obs** | When M = 1, COI\_obs is forced to missing; when M = 0, it records the measurement. |
**How this structure differs from other panels**
* **Self-informative missingness:** Unlike panel (b) (MAR), missingness hinges on an **unobserved** driver—here the very value we are trying to measure.
* **No unmeasured confounder U:** In contrast to panel (c), an explicit hidden factor is not needed; the bias arises because COI both informs outcome/treatment **and** governs its own absence.
* **Not random:** Unlike MCAR (panel a), M is far from random; it is directly informative about the missing value.
**Statistical implications**
* **Bias if ignored:** Complete-case analysis or MAR-based multiple imputation will generally be biased, because records with certain COI values (e.g., the most severe) are systematically under-represented among the observed data.
* **Model-based solutions:**
    * *Selection models* (e.g., Heckman type) that jointly model outcome and missingness given a parametric link between COI and M.
    * *Pattern-mixture approaches* that stratify on M and specify an assumed distribution for the unobserved COI values in the missing stratum.
* **Sensitivity analysis necessity:** Because the mechanism is unidentifiable from observed data alone, analysts report how conclusions shift under plausible MNAR assumptions (e.g., tipping-point, delta-adjustment).
* **External information helps:** Validation sub-studies or external registries where COI is fully observed can anchor the MNAR model.
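
The delta-adjustment idea can be sketched in a few lines. The example below is a deliberately simplified, deterministic version: instead of running multiple imputation, it assumes the missing values average `cc_mean + delta` and recomputes the overall mean for a grid of `delta` values. The logistic link between COI and M, the sample size, and the delta grid are all arbitrary assumptions chosen for illustration, and the target is the mean of COI rather than a treatment effect.

```python
import math
import random
import statistics

random.seed(3)
N = 20_000

coi_true = [random.gauss(0.0, 1.0) for _ in range(N)]

# MNAR via COI: higher severity makes missingness more likely (assumed logistic link)
missing = [random.random() < 1.0 / (1.0 + math.exp(-1.5 * c)) for c in coi_true]

observed = [c for c, m in zip(coi_true, missing) if not m]
true_mean = statistics.fmean(coi_true)
cc_mean = statistics.fmean(observed)
p_missing = 1 - len(observed) / N

def delta_adjusted_mean(delta: float) -> float:
    """Overall mean of COI if the missing values are assumed to average cc_mean + delta."""
    return (1 - p_missing) * cc_mean + p_missing * (cc_mean + delta)

# Report how the estimate moves as the assumed shift grows
deltas = [0.0, 0.5, 1.0, 1.5]
estimates = {d: delta_adjusted_mean(d) for d in deltas}
for d, est in estimates.items():
    print(f"delta = {d:.1f}  estimated mean = {est:.3f}")
print(f"complete-case mean = {cc_mean:.3f}  true mean = {true_mean:.3f}")
```

With `delta = 0` the estimate equals the complete-case mean, which is what an analysis ignoring the MNAR mechanism would report. In real data the true mean is unknown, so the table of estimates across `delta` is the result: it shows how large the assumed shift must be before conclusions change.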
## References
- https://sentinelinitiative.org/sites/default/files/documents/smdi_r_pharma2023_vF.pdf
- https://academic.oup.com/jamiaopen/article/7/1/ooae008/7595634?login=false