# GLACIAL Paper Notes
By Arya Anantula
## Introduction and Context
[GLACIAL Paper](https://arxiv.org/pdf/2210.07416.pdf)
The Granger framework is used to find causal relationships in data, usually under these conditions:
- Time-varying signals: any quantity or set of data points that changes over time. Their values are dependent on the specific point in time at which they are observed.
- Densely sampled time series: data collected at very frequent intervals over a period of time.
:bangbang: However, in population health applications, longitudinal studies (involving information about an individual or group gathered over a period of time) track multiple individuals and collect data at sparse intervals, a limited number of times. They usually track many variables per individual, and they often suffer from missing data. Because of these characteristics, standard Granger causality methods are not well suited to longitudinal data.
**GC (Granger Causality) Framework:** a statistical method used to determine if one time series can predict another
:thumbsdown: GC frameworks are currently only suited for densely and uniformly sampled (measurements taken at regular, equal intervals of time) time series data, whereas longitudinal studies often have sparsely and irregularly sampled data. GC frameworks also model one system at a time, so they cannot directly handle multiple systems (individuals, in this case).
:thumbsdown: Another problem is that the relationship between variables can be non-linear, which can be difficult to detect. Also, the original form of GC struggles to tell the difference between direct and indirect causes. In theory, if you had unlimited past data, GC could better differentiate between direct and indirect causes. This is because with more data, it's easier to see the patterns and the steps between causes and effects. However, in real-life studies, there is only a limited amount of data available. As a result, it's common to mistakenly identify indirect causes as direct ones (false positive = think there is a direct cause when there isn't).
- Direct Cause: when one factor directly leads to an effect, with no intermediate steps or factors in between.
- Indirect Cause: involves one or more intermediate steps or factors. The initial factor sets off a chain of events that eventually leads to the outcome.
- Ex) Consider a scenario where a new advertising campaign leads to increased website traffic, and this increase in traffic then leads to higher product sales (Advertising Campaign → Increased Website Traffic → Higher Sales). The advertising campaign is the direct cause of increased website traffic and the increased website traffic is the direct cause of higher sales. The advertising campaign is an indirect cause of higher sales.
The authors of this paper believe that there is a lack of methods for uncovering causal relationships in longitudinal studies consisting of several individuals with sparse observations.
## GLACIAL
GLACIAL stands for “Granger and LeArning-based CausalIty Analysis for Longitudinal studies.”
- It combines GC with ML approaches to find causality between multiple variables in a longitudinal study
- Treat each individual as an independent sample, but governed by shared causal mechanisms
- Even though each person's data is treated separately, GLACIAL assumes there's a common underlying rule or pattern that affects everyone's data. This rule is about how different factors cause changes over time. Some people's data are kept aside (not used in the training phase). After the model is trained with the rest of the data, it's tested on these hold-out individuals to see if it can accurately predict their outcomes. By using this train-test method, GLACIAL can check if the patterns it learned can actually predict what happens in the hold-out individuals' data (determine causal relationships).
- Utilizes a neural network that is trained using input feature drop-out to learn non-linear relationships
- This allows GLACIAL to efficiently test for causal relationships when there is a large number of variables, and to handle irregular sampling and missing data.
Background Info:
- Pairwise GC is efficient but can mistakenly indicate a causal relationship when there isn't one (false positives), especially if it doesn't consider other variables in the system that might be influencing the results. Multivariate GC considers multiple variables at once and is more accurate, but also more computationally expensive. Recently, neural networks have been more effective than traditional linear GC methods, but they scale poorly as the number of variables grows.
- When some data is missing in studies, researchers have different ways to deal with it. They can try to guess (impute) the missing values, remove all data sets with any missing values, or use more refined methods that try to keep as much data as possible without getting misleading results. However, in longitudinal studies, where the same people are observed over time, dealing with missing data is trickier, and there aren’t well-established methods for this yet.
## Method
The longitudinal time series data is 3-dimensional (several individuals each with multiple variables that are tracked at timesteps).
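One way to picture this 3-D layout is as nested lists (a hypothetical encoding for illustration; the paper does not prescribe a data format), with `None` marking missing observations:

```python
# Hypothetical layout: data[individual][timestep][variable].
# None marks a missing measurement; individuals may have different
# numbers of visits (irregular sampling).
data = [
    # individual 0: three visits, two tracked variables
    [[1.2, 0.5], [1.4, None], [1.5, 0.7]],
    # individual 1: two visits
    [[0.9, 0.3], [1.1, 0.4]],
]
```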
**General GC Hypothesis:**
- Let's say we have 2 time series data variables X and Y. The history of a variable (like X) up to a certain time point t includes all its past values.
- Check if knowing the history of Y helps in predicting X better than just knowing the history of X alone.
- If the probability of X being in a certain state, given its history AND the history of Y, is the same as the probability of X being in that state given only its history, then Y is not causing X.
Instead of probabilities, we look at MSE (Mean Square Error) using predictions. The GC test compares the MSE when predicting X with and without including Y's history. If including Y's history results in a significantly lower MSE, it suggests Y is causing X (MSE(X_t, Pred without Y) - MSE(X_t, Pred with Y) > 0).
:brain: Neural Networks are used in GLACIAL to approximate the MSE optimal predictors for GC. Neural Network F is trained to predict a variable X without considering the influence of another variable Y, while Neural Network G is trained to predict X considering all available past data including Y. The difference in MSE when using F and G is calculated. If the MSE is lower when Y's history is included in the prediction (using G), it suggests that Y has a causal influence on X.
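The ΔMSE decision rule above can be sketched in a few lines. The function names (`mse`, `granger_causes`) are illustrative, and in GLACIAL the two prediction sequences would come from the trained networks F (without Y's history) and G (with it):

```python
def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def granger_causes(x_true, pred_without_y, pred_with_y, eps=0.0):
    """Y Granger-causes X when adding Y's history lowers the error:
    MSE(X, pred without Y) - MSE(X, pred with Y) > eps."""
    delta_mse = mse(x_true, pred_without_y) - mse(x_true, pred_with_y)
    return delta_mse > eps

# Toy example: the predictor that sees Y's history tracks X more closely.
x_true = [1.0, 2.0, 3.0, 4.0]
pred_without_y = [1.5, 2.5, 2.5, 3.5]   # MSE = 0.25
pred_with_y = [1.1, 2.1, 2.9, 3.9]      # MSE = 0.01
```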
**Training and Testing:**
The neural network is trained on a training set and then tested on a test set (hold-out test individuals). This process helps determine if the findings are consistent across different data samples. A single RNN is used in place of all predictors.
- An RNN (Recurrent Neural Network) is effective for time-series data because it can remember past information (like previous values of variables) and use it to make future predictions. This 'memory' makes RNNs well suited to sequences of data where the order and history matter.
- While the RNN in GLACIAL is trained on data from all individuals to learn broad patterns and relationships, the analysis and predictions are done on an individual basis. This allows GLACIAL to leverage the strengths of both aggregated learning (for pattern recognition across the population) and individual-specific analysis.
Missing values can be imputed from the RNN model that was used.
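One way to picture this imputation (a sketch only; the paper's exact imputation scheme may differ): `model` stands in for the trained RNN as a callable that maps an observed history to a predicted next visit.

```python
def impute_with_model(model, sequence):
    """Walk through one individual's sequence, replacing None entries
    with the model's one-step-ahead prediction from the (already
    filled) history. The first visit is assumed fully observed here."""
    filled = [list(sequence[0])]
    for t in range(1, len(sequence)):
        pred = model(filled)
        step = [pred[i] if value is None else value
                for i, value in enumerate(sequence[t])]
        filled.append(step)
    return filled

# Toy "persistence" model: predict that the next visit repeats the last one.
persistence = lambda history: history[-1]
```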
Input feature dropout: a multi-task RNN is used to handle multiple outputs, meaning it can make predictions about several variables at once (more efficient than training a network for every variable pair). The RNN is trained to approximate two types of expectations (mathematical predictions based on past data): one where a specific variable Y is considered (E[Xt|Ωt]), and one where it's not (E[Xt|Ωt \ Yt]). Here, Xt is all predicted variables. When variable Y is 'masked out' or ignored in the input data, the RNN focuses on the latter type of expectation. When the full data (including Y) is used, it focuses on the former. During training, each mini-batch is augmented by dropping out individual variables from the input features.
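The mini-batch augmentation can be sketched as follows. The names and the zero `mask_value` are assumptions for illustration; the paper's exact masking scheme may differ.

```python
def augment_with_feature_dropout(batch, n_vars, mask_value=0.0):
    """Return the original sequences plus, for each sequence and each
    variable index v, a copy with variable v masked out at every
    timestep. Training on full copies approximates E[X_t | Omega_t];
    training on masked copies approximates E[X_t | Omega_t minus Y_t]."""
    augmented = [(seq, None) for seq in batch]          # full-input copies
    for seq in batch:
        for v in range(n_vars):
            masked = [[mask_value if i == v else value
                       for i, value in enumerate(step)]
                      for step in seq]
            augmented.append((masked, v))               # v = dropped variable
    return augmented
```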
1. Split the data into multiple training and testing sets based on individuals to perform cross-validation
2. For each split, train the RNN on the training set as such: predict the next timestep's value based on the previous timesteps (takes the variables' values at timestep 1 to predict the values at timestep 2, then it takes the actual values from timestep 1 and 2 to predict the values at timestep 3, and so on). After each prediction, the RNN calculates the error (such as L2) between its prediction and the actual next timestep's values. Using a method like backpropagation through time (BPTT), the RNN updates its internal weights to minimize the prediction error. The RNN typically goes through the training data many times (each pass is called an epoch).
3. In the testing phase, use the trained RNN model on the test set to predict the variables at each timestep for the hold-out individuals. For instance, you use data from timesteps 1 to predict timestep 2, then use timesteps 1 and 2 to predict timestep 3, and so on, up until the last timestep. The model is used to make predictions one step ahead at a time, utilizing the actual observed data up to that point for the test individuals.
- During this phase, you calculate the MSE for the predictions. To test for Granger causality, you calculate the MSE for predictions with all variables included and compare it with the MSE for predictions with the potential causal variable omitted (input feature dropout at test time). Without input feature dropout, a separate model would have to be trained for each omitted variable.
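The rolling one-step-ahead evaluation in steps 2–3 can be sketched as below; `model` is a stand-in callable from history to predicted next visit, not the paper's RNN.

```python
def one_step_ahead_errors(model, sequence):
    """For a hold-out individual, predict each visit from the observed
    history (visits 1..t-1 predict visit t) and return the squared
    error of each prediction, averaged over variables."""
    errors = []
    for t in range(1, len(sequence)):
        pred = model(sequence[:t])
        actual = sequence[t]
        errors.append(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))
    return errors

# Toy persistence model: predict the previous visit's values.
persistence = lambda history: history[-1]
```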
## Causal Graph
**T-test:** a statistical test used to compare two groups and determine if they are different from each other in a statistically significant way. In this context, it is used to determine whether the difference in prediction errors (ΔMSE) between two variables is significant or just due to random chance.
- Outputs the p-value and t-statistic. The t-statistic is calculated as the difference between the sample mean and the null hypothesis mean, divided by the standard error of the mean. A larger absolute value of the t-statistic suggests a greater deviation from the null hypothesis, implying stronger evidence for a significant difference between groups.
**p-value:** a number between 0 and 1 that tells you how likely it is to observe your data (or something more extreme) if the null hypothesis is true. Null Hypothesis: This is a default assumption that there is no effect or no difference. In this case, the null hypothesis would be that one variable does not cause changes in another.
- A small p-value (typically less than 0.05) means that the data observed is very unlikely under the null hypothesis. This suggests that the observed data (like a significant ΔMSE) is not just a coincidence.
- A large p-value means that the observed data is more likely under the null hypothesis, suggesting that any observed differences could just be due to random chance.
For each variable pair, conduct a T-test on the ΔMSE calculations. If the p-value generated is less than the threshold, then you add the directed edge in the graph (indicating Granger causality) with weight of the t-statistic.
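A sketch of this edge-adding rule using only the standard library: the fixed critical value `t_crit` stands in for the p-value threshold (in practice a library routine such as `scipy.stats.ttest_1samp` would give the p-value directly), and the function names are illustrative.

```python
import math
import statistics

def one_sample_t_statistic(deltas):
    """t = (sample mean - 0) / (standard error of the mean),
    for H0: the mean delta-MSE is zero (no causal effect)."""
    sem = statistics.stdev(deltas) / math.sqrt(len(deltas))
    return statistics.mean(deltas) / sem

def add_edge_if_significant(graph, src, dst, deltas, t_crit=2.0):
    """Add src -> dst weighted by the t-statistic when it clears the
    critical value (a stand-in for the p-value < threshold check)."""
    t = one_sample_t_statistic(deltas)
    if t > t_crit:
        graph[(src, dst)] = t
    return graph
```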
:thumbsdown: One problem is that with limited history, GC may detect edges between variables in both directions when there shouldn't be. Also, GC will detect edges for indirect causes in both directions. These are false positives.
**Post-processing steps:**
1. Orient bidirectional edge: If E(X|Y) (the effect of Y on X) is smaller than E(Y|X) (the effect of X on Y), remove Y → X; otherwise remove X → Y. The direction with the bigger effect (measured by the t-statistic) is regarded as the causal direction.
2. Remove indirect edge: If the effect of X → Y is smaller than that of any edge on an alternative path, then X is likely an indirect cause of Y, so remove X → Y.
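The two pruning steps can be sketched over a dict of edge weights (t-statistics). This is a simplification: only length-two alternative paths are checked here, and the representation is hypothetical.

```python
def prune_graph(edges):
    """edges: {(src, dst): t_statistic}. Step 1 keeps only the stronger
    direction of each bidirectional pair; step 2 drops a direct edge
    X -> Y when some path X -> Z -> Y has every edge stronger than it
    (X is then likely only an indirect cause of Y)."""
    pruned = dict(edges)
    # Step 1: orient bidirectional edges.
    for (x, y) in list(pruned):
        if (x, y) in pruned and (y, x) in pruned:
            weaker = (x, y) if pruned[(x, y)] < pruned[(y, x)] else (y, x)
            del pruned[weaker]
    # Step 2: remove likely indirect edges.
    for (x, y) in list(pruned):
        if (x, y) not in pruned:
            continue
        for z in {d for (s, d) in pruned if s == x}:
            if z != y and (z, y) in pruned:
                if min(pruned[(x, z)], pruned[(z, y)]) > pruned[(x, y)]:
                    del pruned[(x, y)]
                    break
    return pruned
```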
:thumbsdown: The problem with the post-processing steps is that we still have to find a proper DAG. A proper DAG is one that matches the DAG that the data was generated from. A good model will take in that data and reconstruct the DAG properly with the same edge connections. We need to find quantitative measures of evaluating and computing the accuracy of the causal graphs.
:question: We are looking at reinforcement learning approaches to learn the best pruning rules for creating accurate DAGs. We need to come up with a test suite of DAGs large enough to provide the statistical variation needed for a confident t-test (p-value).
## Useful Links
- Granger Causality Video - https://youtu.be/b8hzDzGWyGM?si=Un2jWnSernkYmf3r
- Recurrent Neural Networks Intro Video - https://youtu.be/AsNTP8Kwu80?si=oT9mudB5gyLzEGTv