# GLACIAL experiments and performance analysis
By Arya Anantula
## GLACIAL experiments
### 1/12 runs
Today, Aaron gave me PPMI data in the .npy format (a binary format storing a numpy array). There were 315 individuals, 5 timepoints, and 11 variables; the array had dimensions num individuals x num timepoints x num variables. I converted it into a csv file with the column ordering GLACIAL expects (all variables, then the individual's id, then the timepoint). The following is the code I used for the conversion:
```python
import numpy
import pandas as pd

# Array has shape (num individuals, num timepoints, num variables)
data = numpy.load('PPMI_test_data.npy')
print(data.shape)
numPatients, numTimepoints, numVariables = data.shape

# Flatten into one row per (patient, timepoint) pair
reshaped_data = []
for patient_id in range(numPatients):
    for timepoint in range(numTimepoints):
        row = []
        for variable in range(numVariables):
            value = data[patient_id, timepoint, variable]
            row.append(value)
        row.append(patient_id)
        row.append(timepoint)
        reshaped_data.append(row)

# Columns: all variables, then the individual id (RID), then the timepoint (Year)
cols = ["Var" + str(var) for var in range(numVariables)]
cols.extend(["RID", "Year"])

df = pd.DataFrame(reshaped_data, columns=cols)
df.to_csv('output.csv', index=False)
```
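As a side note, the same reshaping can be done without the explicit loops. Here is a minimal sketch using numpy's reshape, assuming the same input file and column naming as above:

```python
import numpy as np
import pandas as pd

data = np.load('PPMI_test_data.npy')            # (patients, timepoints, variables)
numPatients, numTimepoints, numVariables = data.shape

# Flatten to one row per (patient, timepoint) pair; C order matches the loop above
df = pd.DataFrame(data.reshape(-1, numVariables),
                  columns=["Var" + str(v) for v in range(numVariables)])
df["RID"] = np.repeat(np.arange(numPatients), numTimepoints)   # individual id
df["Year"] = np.tile(np.arange(numTimepoints), numPatients)    # timepoint
df.to_csv('output.csv', index=False)
```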
I then ran GLACIAL on the data in the csv. The arguments were the following:
- seed = 999 (seed for random number generation)
- learning rate = 3e-4 (used by Adam optimizer)
- steps = 2000 (number of epochs)
- reps = 4 (how many times cross-validation is repeated)
This was the causal graph output:

This was the training and validation loss:

GLACIAL is implemented with early stopping: if the validation error (which is checked every 50th iteration/epoch) has been increasing on average, the algorithm stops early to avoid overfitting. I noticed that with this data, early stopping happened very early (around iteration 200-250).
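To make the behavior concrete, here is a minimal sketch of that kind of check (my own illustration of the rule described above, not GLACIAL's actual code, and the window size of 5 checks is an assumption): the validation loss is recorded every 50 epochs, and training stops once the recent losses are trending upward on average.

```python
def should_stop_early(val_losses, window=5):
    """val_losses: validation losses recorded every 50 epochs.
    Stop if, averaged over the last `window` checks, the loss is increasing."""
    if len(val_losses) < window + 1:
        return False
    recent = val_losses[-(window + 1):]
    diffs = [b - a for a, b in zip(recent, recent[1:])]
    return sum(diffs) / len(diffs) > 0   # average change is positive -> increasing

# e.g. inside the training loop:
# if epoch % 50 == 0:
#     history.append(validation_loss)
#     if should_stop_early(history):
#         break
```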
I then ran GLACIAL with early stopping removed. This was the result:

This was the training and validation error:

As you can see from the above image, the model began to overfit: training error kept decreasing while validation error increased.
### 1/10 runs
I used gen_sim_data.py to generate a dataset that corresponds to a manually created causal graph. We specify the variable connections/edges of the causal graph in a JSON file, which the Python program uses to create the data. I used the following arguments to generate the data (the full command is sketched after this list):
1. random number generator seed = 0
2. noise_std = 0.02
3. nb_subj = 300 (300 individuals/subjects)
4. lag = 1
5. graph = 7 node causal graph (7 features/variables)
6. number of timepoints = 6
7. kind = sigmoid (I don't think this really mattered; see the note on --kind below)
8. gap = 1
9. outdir = genData ("genData" is just a directory I created to store generated data)
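Put together (using the flag names documented in the "Running GLACIAL" section below, and with the JSON filename as a placeholder for my 7-node graph file), the command looked roughly like: `gen_sim_data.py -s 0 -n 0.02 --nb_subj 300 -l 1 -g graphs/7nodes.json -t 6 -k sigmoid --gap 1 -o genData`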
This was the input causal graph (manually created) with 7 nodes. Data was generated based off of this graph.

I ran GLACIAL on the generated data. It took about 15 minutes to run. Early stopping would happen around 1500 - 2000 iterations (epochs). Keep in mind that validation error is checked every 50th iteration for early stopping. These were the results that I got:

Both training and validation losses decreased. Early stopping kicked in as soon as validation loss was detected to be increasing. The x-axis is the number of times validation error was checked (once every 50 iterations).

This is the first causal graph that GLACIAL generated. It is called the S1 graph. A blue arrow/edge shows the causal direction between two variables; a pink edge means the edge is bidirectional. The edge weights are t-statistics. This is before any edge pruning.

This is the causal graph after the first step of post-processing. This is called the S2 graph. This step orients bidirectional edges. It chooses the edge direction with the higher weight and removes the other direction. That is why the pink edge got changed to a particular direction.
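A minimal sketch of this orientation step, assuming the S1 graph is stored as a dict mapping directed edges (u, v) to their weights (my own illustration, not the authors' code):

```python
def orient_bidirectional(edges):
    """For every pair connected in both directions, keep only the direction
    with the higher weight; one-directional edges pass through unchanged."""
    oriented = {}
    for (u, v), w in edges.items():
        reverse = edges.get((v, u))
        if reverse is None or w >= reverse:   # keep the heavier direction (ties keep both)
            oriented[(u, v)] = w
    return oriented
```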

This is the causal graph after the second and last step of post-processing. This is called the S3 graph. The last step is to remove indirect edges. For an edge X→Y, remove X→Y if either:
1. There exists edge U→V on an alternative path from X to Y such that weight of X→Y < weight of U→V
2. There exists edge U→V on an alternative path from Y to X such that weight of X→Y < weight of U→V
Intuitively, if the effect of X→Y is weaker than the effect of some edge on an alternative path, then X is likely only an indirect cause of Y. Nothing changed from the previous graph here because no indirect edges were found to remove.
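The rule above can be sketched like so, assuming the graph is a networkx DiGraph with the t-statistics stored in the "weight" edge attribute (again, my own illustration of the described rule rather than the authors' implementation):

```python
import networkx as nx

def prune_indirect_edges(G):
    """Remove an edge X -> Y if some edge on an alternative path between
    X and Y (in either direction) has a larger weight."""
    pruned = G.copy()
    for x, y, attrs in G.edges(data=True):
        w_xy = attrs["weight"]
        for src, dst in ((x, y), (y, x)):
            for path in nx.all_simple_paths(G, src, dst):
                if (src, dst) == (x, y) and len(path) == 2:
                    continue  # skip the direct X -> Y edge itself
                # If any edge on the alternative path outweighs X -> Y, drop X -> Y
                if any(G[u][v]["weight"] > w_xy for u, v in zip(path, path[1:])):
                    if pruned.has_edge(x, y):
                        pruned.remove_edge(x, y)
    return pruned
```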
| Graph | TP | Precision | Recall | F1 |
| -------- | -------- | -------- | -------- | -------- |
| S1 | 6 | 0.857 | 0.857 | 0.857 |
| S2 | 7 | 1.0 | 1.0 | 1.0 |
| S3 | 7 | 1.0 | 1.0 | 1.0 |
If the true causal graph is provided in the JSON config file, we can evaluate the accuracy of the GLACIAL-generated graphs by comparing them with the true graph.
- TP (true positive) = the number of edges (including direction) correctly predicted.
- Precision = TP / Total # of predicted edges. Of all the predicted edges, how many were correctly identified.
- Recall = TP / Total # of true edges. Of all the actual/true edges, how many were correctly identified.
- F1 = harmonic mean of precision and recall.
*Bidirectional edges don't count as 2 different edges.
These metrics will be extremely useful in validating the performance of GLACIAL by comparing GLACIAL's results against the truth.
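As a rough sketch (my own code, assuming edges are stored as (source, target) tuples and ignoring the bidirectional-edge subtlety noted above), the metrics can be computed from the predicted and true edge sets like this:

```python
def graph_metrics(pred_edges, true_edges):
    """Compare predicted directed edges against the ground-truth edges."""
    pred, true = set(pred_edges), set(true_edges)
    tp = len(pred & true)                              # correctly predicted edges
    precision = tp / len(pred) if pred else 0.0        # TP / # predicted edges
    recall = tp / len(true) if true else 0.0           # TP / # true edges
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)              # harmonic mean
    return tp, precision, recall, f1

# e.g. graph_metrics([("A", "B"), ("B", "C")], [("A", "B"), ("C", "B")])
# -> (1, 0.5, 0.5, 0.5)
```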
## Running GLACIAL
The following guide to running the GLACIAL code is for the original authors' code: https://github.com/mnhng/GLACIAL
I have forked their repository to design and run experiments: https://github.com/Arya333/GLACIAL
The authors have created a program (**gen_sim_data.py**) to generate data based on a manually created causal graph. It reads a JSON file (in the graphs folder) with the manual causal graph representation and outputs a csv file containing the generated data.
Run it like so (this is just an example): `gen_sim_data.py -g graphs\4nodes_v1.json -k sigmoid -o genData`
There are several arguments to change how the data is generated:
1. **--seed/-s** Seed for random number generator (default=0)
2. **--noise_std/-n** Standard deviation of the noise that is added to the time-series data (default=0.02)
3. **--nb_subj** The number of individuals/subjects to be in the data (default=2)
4. **--lag/-l** Time delay between the effect of a cause and its observable impact in a time-series data sequence (default=1)
5. **--graph/-g** (REQUIRED arg) The path to the JSON causal graph used to generate the data
6. **--timepoints/-t** The number of timepoints (default=6)
7. **--kind/-k** (REQUIRED arg) Choose between sigmoid or brownian for generating the data of variables without parents (I don't know exactly what this is for, because the argument doesn't seem to be used in the code currently)
8. **--gap** The gap between timepoints (default=1)
9. **--outdir/-o** (REQUIRED arg) The output directory to save the following: a csv file with the generated data; a JSON file that contains the link to the csv, the list of features, and the ground-truth causal graph; and a png of the first 2 individuals' data and the causal graph
**glacial.py** has the code that actually runs GLACIAL. The input is a JSON config file which contains the path to the csv data file and the list of features. The names in the features list and the feature column names in the csv have to match exactly. When formatting the csv data, the columns should be in this order: all the features, then RID (unique identifier per individual/subject), then Year (timestep).
Arguments for this file:
1. **--seed/-s** Seed for random number generator (default=999)
2. **--lr/-l** Learning Rate used in Adam optimizer (default=3e-4)
3. **--steps** Epochs for training (default=2000)
4. **--reps** The number of times cross-validation is repeated (default=4)
5. **--config/-f** (REQUIRED arg) The path to the JSON config file, which points to the CSV data and lists the features
6. **--outpath/-o** (REQUIRED arg) The output directory to store all the results of running GLACIAL on the provided data
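For example, after generating data into genData, a run might look like the following (the JSON filename is just a placeholder for whichever config file gen_sim_data.py wrote): `glacial.py -f genData/config.json -o results -s 999 --steps 2000 --reps 4`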
When I initially tested and ran the code, I would get an error when there were only a few individuals (<10) in the dataset. Then, I tried 10 individuals. There weren't any errors, but the results weren't good: the output causal graph had no edges. I got good results when I tested with 300 individuals on the generated 7-node/variable dataset. I assume GLACIAL needs enough sample data to discover causal relationships and generate meaningful results.
## Validating GLACIAL
We aim to quantitatively evaluate the accuracy and reliability of the causal graphs produced by GLACIAL; in other words, we are looking for ways to measure the fidelity of the graphs it produces.
Currently, we say that a good causal graph / DAG (directed acyclic graph) generated by GLACIAL is one that matches the DAG that the input data was generated from. We generate data from a sample DAG, so the data will represent the causal relationships of the sample DAG. A good model will take in that data and reconstruct the DAG properly with the same edge connections.
However, testing with only a few sample DAGs isn't enough. To be confident in GLACIAL's reliability, we need to test it across a wide variety of different DAGs. This will ensure that GLACIAL works well in many diverse scenarios. We would have to create a large set of DAGs (test suite) that are different from each other in various ways (can vary in number of nodes, structure of connections, or the way data is generated from them). This will challenge the GLACIAL model with a range of different situations that it might encounter with real-world data.
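One simple way to sample such a test suite, as a sketch (my own code, not part of GLACIAL): generate random DAGs by only allowing "forward" edges with respect to a fixed node ordering, and vary the node count and edge density.

```python
import random
import networkx as nx

def random_dag(num_nodes, edge_prob, seed=None):
    """Sample a random DAG: only edges u -> v with u < v are allowed,
    so the integer ordering of the nodes is a valid topological order."""
    rng = random.Random(seed)
    g = nx.DiGraph()
    g.add_nodes_from(range(num_nodes))
    for u in range(num_nodes):
        for v in range(u + 1, num_nodes):
            if rng.random() < edge_prob:
                g.add_edge(u, v)
    return g

# A small suite varying node count and edge density
suite = [random_dag(n, p, seed=i)
         for i, (n, p) in enumerate([(5, 0.3), (7, 0.3), (7, 0.5), (10, 0.2)])]
```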
We can use statistical tests (like t-tests and their associated p-values) to determine whether the results we get from reconstructing a DAG using GLACIAL are actually indicative of its performance. A statistically significant result means that we can be more certain that GLACIAL's performance is consistently good, rather than good just by accident or random chance.
**Model fidelity** refers to the accuracy and reliability of the model in capturing the underlying relationships in the data it's trained on. High-fidelity models closely mimic the true data generation process, making accurate predictions even in complex scenarios. Low-fidelity models may not capture all the complexities of the data, leading to less accurate or less reliable predictions. However, these models are usually simpler, require less data, and are computationally less expensive. **Generalization** is the ability of a model to perform well on new, unseen data. A model that generalizes well can make accurate predictions or decisions across a variety of scenarios, not just the ones it was trained on.
The goal is to find a balance between fidelity and generalization. This means creating a model that has enough fidelity to accurately represent the underlying process, but not so much that it becomes overly complex (overfit) and fails to generalize. Certain techniques (like cross-validation) can help us achieve this.
## Finding the optimal epoch size (OLD)
The following hyperparameter tuning was done on our GLACIAL code. However, after receiving the original authors' code, we switched to their codebase. This is old and not relevant to our current experiments. I have just included it as part of documentation.
Originally, we used 1000 epochs for training, but this increased the runtime of training a lot, leading to 20+ minute runs of the program. This wouldn't be feasible when doing multiple tests or inputting a larger dataset. Therefore, I used the elbow method to find the best epoch size to use. I plotted epoch size against average minimum validation loss when using that epoch size. There would be a point in the graph after which the validation loss wouldn't decrease by much (diminishing returns). This would be the elbow of the graph, and we would use this as the epoch size going forward.

From the above image, 100 epochs is the elbow, so it is the epoch size we will use. Using more than 100 epochs doesn't result in much decrease of validation loss. These are the exact numbers we got:
- Epoch sizes: [5, 10, 50, 100, 250, 500, 750, 1000]
- Validation loss: [27.8, 19.7, 2.5, 0.8, 0.53, 0.48, 0.47, 0.47]
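For reference, the numbers above can be re-plotted with a few lines of matplotlib (a log scale on the x-axis makes the elbow easier to see):

```python
import matplotlib.pyplot as plt

epochs = [5, 10, 50, 100, 250, 500, 750, 1000]
val_loss = [27.8, 19.7, 2.5, 0.8, 0.53, 0.48, 0.47, 0.47]

plt.plot(epochs, val_loss, marker="o")
plt.xscale("log")                       # spreads out the small epoch counts
plt.xlabel("Epochs")
plt.ylabel("Average minimum validation loss")
plt.title("Elbow method for choosing the epoch count")
plt.show()
```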