# GLACIAL: Granger Causality
>A more in-depth hackmd.
## Key Points
### Background:
>### Granger Causality
* Widely used to detect causal relationships in time series data.
* Works well with densely sampled data from single systems but is not well-suited for longitudinal studies with multiple individuals and outliers.
## Challenges in Longitudinal Studies
* Sparse observations: longitudinal studies track individuals over only a few timepoints, making it difficult to infer accurate causal relationships.
* Nonlinear dynamics: relationships between variables are often nonlinear, and traditional Granger causality approaches may be less effective at identifying them.
* Missing Data: real-world longitudinal data often has missing values, which can distort causal analysis.
* Direct vs. Indirect Causes: GC methods traditionally do not distinguish well between direct and indirect causal effects.
:::spoiler More Info Here
## GLACIAL Approach
### Objective
> Enhance the accuracy of causal discovery in longitudinal studies by combining GC with machine learning techniques. GLACIAL was developed to handle complexities such as individual-level changes and missing data; traditional GC struggles in longitudinal analyses because of sparse sampling, outliers, missing data, etc.
## Methods
* Multi-Task Neural Network
  * Uses a neural network to handle nonlinear dynamics and large numbers of variables.
* Input Feature Dropout
  * Manages missing values and tests causal relationships by simulating the effect of missing data (see the sketch after this list).
* Train-Test Setup
  * Treats each individual's trajectory as an independent sample and uses average prediction on hold-out individuals to infer causal relationships.
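As a rough, hypothetical sketch of the input-feature-dropout idea (not GLACIAL's actual implementation), the example below randomly masks input features during training and then zeroes out one candidate cause at evaluation time to measure how much predictive accuracy is lost. The toy data and linear "model" are stand-ins for the multi-task neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_mask(X, p=0.3):
    """Randomly zero out input features so the model learns to predict
    from partial inputs (mimicking missing observations)."""
    return X * (rng.random(X.shape) > p)

# Toy data: feature 0 truly drives the target, feature 1 is pure noise.
n = 500
X = rng.normal(size=(n, 2))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=n)

# Toy stand-in "model": a least-squares fit on dropout-augmented inputs.
w, *_ = np.linalg.lstsq(dropout_mask(X), y, rcond=None)

def mse(X_eval):
    return float(np.mean((X_eval @ w - y) ** 2))

# Probe each candidate cause by zeroing it out and measuring the rise in MSE.
for j in range(X.shape[1]):
    X_drop = X.copy()
    X_drop[:, j] = 0.0
    print(f"feature {j}: delta MSE = {mse(X_drop) - mse(X):.3f}")
```

In GLACIAL itself the predictor is a multi-task neural network rather than a linear fit, but the zero-out-and-compare logic is the same basic idea.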
## Methodological Aspects
### Granger Causality with MSE Test
* Adapts GC for longitudinal data by comparing the mean squared error (MSE) of predictions made with and without the candidate cause as an input.
* Uses average prediction accuracy across multiple hold-out test individuals for a robust causality estimate.
### Handling Limited Observation History
* Traditional GC assumes infinite observation history, while GLACIAL adapts this for limited, irregular data.
### Statistical Testing for Robust Causality
* Employs statistical tests on ∆MSE to validate causal relationships.
* Uses repeated cross-validation splits for robustness (see the sketch below).
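To make the statistical step concrete, here is a minimal sketch assuming a one-sided t-test on per-split ∆MSE values (the paper's exact test and thresholds may differ); the numbers are placeholders, not real results.

```python
import numpy as np
from scipy import stats

# Placeholder delta-MSE values: MSE(model without candidate cause) - MSE(full model),
# one per hold-out split of a repeated cross-validation (illustrative only).
delta_mse = np.array([0.042, 0.055, 0.031, 0.060, 0.048, 0.037, 0.052, 0.044])

# One-sided test: is the mean delta-MSE significantly greater than zero?
# A significantly positive delta suggests the dropped variable is a Granger cause.
t_stat, p_two_sided = stats.ttest_1samp(delta_mse, popmean=0.0)
p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.4g}")
```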
## Experimental Findings
### Synthetic Data Analysis
* Outperforms baselines such as PCMCI+, SVAR_GFCI, and DYNOTEARS.
* Shows resilience against high measurement noise and missing data.
### Real-World Application (ADNI Data)
* Successfully infers complex causal relationships in Alzheimer's Disease biomarkers.
* Identifies meaningful links, such as between hippocampal atrophy and cognitive decline.
## Potential Applications
### Healthcare
* Identifies causal relationships in patient health records.
* Uncovers links between health biomarkers and outcomes in longitudinal studies.
### Social and Behavioral Science
* Analyzes effects of interventions over time in populations.
* Investigates causal relationships between socioeconomic factors and health.
:::
:::success
## Advantages
* Handling Missing Values: GLACIAL's use of input dropout addresses missing-data issues effectively.
* Scalability: Can manage a large number of variables and nonlinear relationships better than traditional GC methods.
* Direct vs. Indirect Causes: Includes post-processing steps that resolve ambiguities by distinguishing direct from indirect causal effects and clarifying edge directions.
:::
:::danger
## Challenges
### Computation
* Computationally intensive method due to the neural network model.
* More efficient approaches to scaling may be possible.
### Handling Missing Data
* Though GLACIAL is robust to missing data, different missing-data mechanisms may require enhancements.
* Future work could develop strategies tailored to specific missingness mechanisms.
### Interpreting Nonlinear Relationships
* GLACIAL detects nonlinear causal relationships, but interpreting these can be challenging.
* Adding interpretability tools is a possible direction for future work.
:::
:::warning
## Running GLACIAL - Breaking Down the Repository
### Directories:
#### data - contains PPMI curated datasets as well as the GLACIAL-readable data.
> #### data_organizer.py - takes the PPMI curated data and converts it into a formatted spreadsheet that the GLACIAL codebase can read and run its model on.
> #### data1.csv - contains the data (organized into columns) that the GLACIAL code runs on.
#### genData - output folder containing the manually curated causal graphs generated from the data1.csv file and the gen_sim_data.py code (which simulates data).
> #### 4nodes_v1... - contains the JSON version of the causal graph with 4 nodes.
> #### 4nodes_v1(image) - contains the timestamps and residual representation along with the causal graph composed of 4 nodes.
> #### 7nodes_v1 - contains the JSON version of the causal graph with 7 nodes.
> #### 7nodes_v1(image) - contains the timestamps and residual representation along with the causal graph composed of 7 nodes.
#### graphs - contains the JSON files with the manual causal graph representations; gen_sim_data.py needs these graphs to generate data based on them (a hypothetical sketch of such a file follows this list).
> #### 4nodes_v1.json - contains json data for the 4 nodes causal graph.
> #### 7nodes_v1.json - contains json data for the 7 node causal graph.
> #### path_atts.json - essentially the key for GLACIAL; GLACIAL references this file to understand how to govern the synthetic data so that it simulates the causal relationships.
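I have not copied the schema of these JSON files here; purely as a hypothetical illustration (the real 4nodes_v1.json and path_atts.json may be laid out differently), a small directed graph could be stored as a node list plus cause → effect edge pairs:

```python
import json

# Hypothetical layout only -- consult graphs/4nodes_v1.json in the repo for the real schema.
example_graph = {
    "nodes": ["A", "B", "C", "D"],
    "edges": [["A", "B"], ["B", "C"], ["A", "D"]],  # directed cause -> effect pairs
}

with open("graphs/4nodes_example.json", "w") as f:
    json.dump(example_graph, f, indent=2)
```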
:::
:::info
## GLACIAL Experiments
#### 1: Running the GLACIAL Model without a manually specified causal graph
> #### My first attempt at running the model was on the raw data in data/data1.csv. This data was not fitted to any sort of causal graph; it was just the raw PPMI data formatted into a CSV.
> #### Upon running this, I noticed that the model began overfitting very often, around 200-250 iterations, which led to inaccuracies. I discovered that GLACIAL is equipped with early stopping to keep the validation error from increasing; a generic sketch of that idea follows.
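As a generic sketch of that early-stopping idea (not GLACIAL's actual training loop), training halts once the validation error has failed to improve for a set number of epochs; the callables and the validation curve below are dummies for illustration.

```python
def train_with_early_stopping(train_step, val_error, max_epochs=2000, patience=20):
    """Generic early stopping: halt once the validation error has not
    improved for `patience` consecutive epochs.

    train_step(): runs one epoch of training.
    val_error(): returns the current validation error (e.g., MSE).
    """
    best_err = float("inf")
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_step()
        err = val_error()
        if err < best_err:
            best_err, epochs_since_best = err, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                print(f"early stop at epoch {epoch} (best val error {best_err:.4f})")
                break
    return best_err

# Dummy usage: a validation curve that bottoms out and then rises (overfitting).
errs = iter([1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77])
train_with_early_stopping(lambda: None, lambda: next(errs), patience=3)
```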
#### 2: Running the GLACIAL Model with a manually fitted/created causal graph
> #### To create a manually derived causal graph from the PPMI data, I used the gen_sim_data.py file. It has many different parameters that can be changed to tailor the simulated causal graph to our needs. I created a list describing each one:
> seed - Random seed for reproducibility of the causal graph
> noise_std - Standard Deviation of noise added to the data
> nb_subj - Number of subjects for which to generate data
> lag - Time lag between causality
> graph - Path to the JSON file specifying the causal graph
> timepoints - Number of timepoints for each subject's data
> kind - Type of function to use for data generation (usually sigmoid)
> gap - Time interval between timepoints
> outdir - Output directory (genData)
> #### The command I used to create a 7-node causal graph is shown below:
> python gen_sim_data.py -g graphs/7nodes_v1.json -k sigmoid -o genData
> #### Here is the graph I got:

> #### Upon running GLACIAL with the simulated data, I found that overfitting happened less often, around 500-700 iterations. I learned that basing the model on simulated data from a known causal graph made it more accurate.
#### GLACIAL Model:
glacial.py is the actual file needed to run the GLACIAL model. The input is a JSON config file containing the features list, plus a CSV file with the simulated data. It is important to note that the features in the JSON config and the columns in the CSV must have exactly the same names. Furthermore, when formatting the CSV, the leading columns must always appear in this order: AGE, RID (unique identifier), Year (timesteps). A hedged sketch of this layout is shown below.
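Here is a minimal sketch of that layout, using hypothetical feature names (FEAT1, FEAT2) and an assumed config schema (just the features list); check the repo's conf/ examples for the authoritative format.

```python
import json
import pandas as pd

# Hypothetical feature names -- the real names come from your data,
# and must match between the config and the CSV columns exactly.
features = ["FEAT1", "FEAT2"]

# Config JSON listing the features GLACIAL should model (schema assumed here).
with open("conf/example.json", "w") as f:
    json.dump({"features": features}, f, indent=2)

# The CSV leads with AGE, RID (unique subject identifier), and Year (timestep),
# followed by one column per feature, named exactly as in the config.
df = pd.DataFrame({
    "AGE":  [70.1, 71.1, 65.3, 66.3],
    "RID":  [1, 1, 2, 2],
    "Year": [0, 1, 0, 1],
    "FEAT1": [2.3, 2.1, 3.0, 2.8],
    "FEAT2": [28, 27, 30, 29],
})
df.to_csv("data/example.csv", index=False)
```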
These are the arguments for the GLACIAL script:
--seed/-s Seed for the random number generator (default=999)
--lr/-l Learning rate used in the Adam optimizer (default=3e-4)
--steps Epochs for training (default=2000)
--reps The number of times cross-validation is repeated (default=4)
--config/-f (REQUIRED arg) The path to the JSON config file describing the data
--outpath/-o (REQUIRED arg) The output directory in which to store all the results of running GLACIAL on the provided data
Below is the command I used to run the GLACIAL code:
> python glacial.py --config conf/test0.json --lr 1e-5 --steps 500 --reps 4 --outpath /out
:::danger
It is worth noting that I could not generate any sort of DAG graph, because I do not have access to the NVIDIA GPUs needed for the model to run. I was, however, able to get the overfitting data by running it locally on my CPU.
:::