# Evaluating GLACIAL & other causal inferencing methods

By Arya Anantula

## Purpose

After running the GLACIAL model on curated PPMI data, I was able to generate causal graphs. However, it is important to ensure that the causal graphs generated by the model are reliable and generalizable to new, unseen data.

- **Quantitative confidence in DAGs:** This means establishing how trustworthy and accurate the causal relationships depicted by the DAGs are. Simply seeing these relationships appear consistently during training is not enough to trust their validity.
- **Repeatability on Training Sample DAGs is Not Sufficient:** Although the model may produce consistent causal graphs during training, this doesn't guarantee that these graphs are correct or will be the same with different sets of data. It's similar to a model memorizing answers rather than truly understanding the questions; it performs well in a familiar setting but may fail in unfamiliar ones.
- **Cross-Validation Helps:** Cross-validation is a technique where you split your data into multiple parts, train your model on some of these parts, and test it on others. This helps in checking how well your model performs on unseen data, but it alone isn't enough to establish the reliability of the causal inferences.
- **Need Multiple Separate Independent Metrics to Bound Generalizability of the Model:** Various independent metrics can measure how well the model's causal inferences hold up across different situations and datasets. These metrics could involve different statistical tests or validation techniques that aren't based only on the model's performance on held-out data, but also on how well the causality holds under various conditions or assumptions.

The main goal is to ensure that the causal relationships the model identifies are not just artifacts of a specific dataset or of overfitting, but that they truly reflect underlying mechanisms that would be observable in general populations or in different scenarios. This is crucial for making our research and findings robust and reliable for practical applications.

## Metrics used in the GLACIAL paper

In the [GLACIAL paper](https://arxiv.org/pdf/2210.07416), the authors used precision, recall, and F1 scores as evaluation metrics for the model when running experiments. They assumed that there was a ground-truth (directed) graph describing the causal relationships and compared the generated causal graph to that ground-truth graph.

- TP (true positives) = the number of edges correctly predicted, including the correct direction.
- Precision = TP / total # of predicted edges. Of all the predicted edges, how many were correctly identified.
- Recall = TP / total # of true edges. Of all the actual/true edges, how many were correctly identified.
- F1 = harmonic mean of precision and recall.
  - F1 = (2 x Precision x Recall) / (Precision + Recall)
  - The harmonic mean emphasizes the impact of smaller values more than larger values, so if either precision or recall is low, F1 will be lower than a standard average of precision and recall.
  - F1 therefore more accurately gauges the effectiveness of the model in terms of both capturing all relevant instances (high recall) and maintaining a low count of false alarms (high precision).
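To make these edge-level metrics concrete, here is a minimal Python sketch (my own illustration, not the GLACIAL authors' code) that computes TP, precision, recall, and F1 for a predicted graph against a ground-truth graph; both graphs are assumed to be given as sets of directed edges, and the node names in the example are hypothetical.

```python
# Minimal sketch: edge-level precision/recall/F1 for a predicted causal graph.
# Graphs are represented as sets of directed edges (parent, child).

def edge_metrics(predicted_edges: set, true_edges: set) -> dict:
    """Compare a predicted DAG to a ground-truth DAG at the edge level."""
    tp = len(predicted_edges & true_edges)          # edges predicted with the correct direction
    precision = tp / len(predicted_edges) if predicted_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) > 0 else 0.0)     # harmonic mean of precision and recall
    return {"TP": tp, "precision": precision, "recall": recall, "F1": f1}

# Hypothetical example: three true edges, two predicted correctly, one reversed.
true_graph = {("A", "B"), ("B", "C"), ("A", "C")}
predicted_graph = {("A", "B"), ("B", "C"), ("C", "A")}   # ("C", "A") is wrongly oriented
print(edge_metrics(predicted_graph, true_graph))
# precision = 2/3, recall = 2/3, F1 = 2/3
```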
## Researching the Generalizability and Fidelity of Causal Inference Models

- **Generalizability:** This refers to the ability of a model to perform well on new, unseen data that was not used during the training of the model. It's about how well the model can apply what it has learned to different scenarios outside the training set.
- **Model Fidelity:** This means the accuracy with which a model represents the real-world processes or relationships it is supposed to simulate. High-fidelity models closely mimic the actual dynamics they represent. In this case, it refers to how accurately the causal inference model captures the true causal relationships. A good model is one that can reliably predict or infer causality in a variety of different conditions.

Just like how F1 & AUC are metrics used to evaluate models in classification or clustering, similar quantitative metrics are needed for evaluating the fidelity and generalizability of causal inference models.

**Multiple Codes Consensus:** By running similar experiments through different implementations (or "codes") that aim to perform the same task, and then taking a consensus (i.e., the majority or average result) from these various runs, you can more confidently validate the findings. This approach reduces the risk that the results are dependent on a specific algorithm or implementation detail.

**[XGBoost](https://www.nvidia.com/en-us/glossary/xgboost/)** stands for "eXtreme Gradient Boosting," which is a machine learning technique that builds and improves models incrementally through an ensemble of simpler models (usually decision trees). It's known for its high performance and ability to generalize well from training data to unseen data.
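To make the "multiple codes consensus" idea concrete, below is a minimal sketch of a majority vote over binary adjacency matrices produced by several runs; the matrices, the threshold, and the `consensus_graph` helper are hypothetical and are not taken from any of the papers cited here.

```python
import numpy as np

# Minimal sketch of "multiple codes consensus": each implementation returns a binary
# adjacency matrix A where A[i, j] = 1 means an inferred edge i -> j. The matrices
# below are hypothetical placeholders for runs of different causal-discovery codes.
run_1 = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [0, 0, 0]])
run_2 = np.array([[0, 1, 0],
                  [0, 0, 1],
                  [0, 0, 0]])
run_3 = np.array([[0, 1, 1],
                  [0, 0, 1],
                  [1, 0, 0]])

def consensus_graph(adjacency_matrices, threshold=0.5):
    """Keep an edge only if more than `threshold` of the runs agree on it."""
    stacked = np.stack(adjacency_matrices)          # shape: (n_runs, n_vars, n_vars)
    vote_fraction = stacked.mean(axis=0)            # fraction of runs containing each edge
    return (vote_fraction > threshold).astype(int)  # majority-vote adjacency matrix

print(consensus_graph([run_1, run_2, run_3]))
# Edges kept: 0->1 (3/3), 0->2 (2/3), 1->2 (3/3); 2->0 (1/3 of runs) is dropped.
```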
### Relevant Research Papers

[LLMs and Causal Inferencing](https://ar5iv.labs.arxiv.org/html/2403.09606) - A comprehensive survey on LLMs that discusses the application of causal inference to enhance model robustness, fairness, and explainability.

[Robust agents learn causal world models](https://ar5iv.labs.arxiv.org/html/2402.10877) - This paper discusses how models that can adapt to changes in data distributions or environments (through interventions) show improved generalizability and fidelity.

[Evaluation Methods and Measures for Causal Learning Algorithms](https://ar5iv.labs.arxiv.org/html/2202.02896) - This paper presents a comprehensive framework for evaluating causal effect estimation, introducing several metrics that could help determine the fidelity and generalizability of causal models. The metrics discussed are instrumental for benchmarking causal models against ground-truth data.

- "Due to the lack of ground-truth data, one of the biggest challenges in current causal learning research is algorithm evaluations. This largely impedes the cross-pollination of AI and causal inference, and hinders the two fields to benefit from the advances of the other. To bridge from conventional causal inference (i.e., based on statistical methods) to causal learning with big data (i.e., the intersection of causal inference and machine learning), in this survey, we review commonly-used datasets, evaluation methods, and measures for causal learning using an evaluation pipeline similar to conventional machine learning."
- "**Causal structure learning** refers to the task of identifying the causal relations for a given set of variables V = {X1, ..., Xn}. The goal is to generate a causal graph G = {V, E} that represents the causal relations over the set of variables in V. E represents the set of directed edges between the variables in V."
- **Evaluation metrics for causal structure learning** (a small sketch of the adjacency-matrix-based metrics follows this list):
  - **Structural Hamming Distance (SHD):** Given two causal graphs, one being the ground-truth partially directed DAG and the other being a predicted partially directed DAG, SHD is defined as the number of edits (adding, removing, or reversing an edge) that have to be made to the learned graph for it to become the ground-truth graph. SHD = A + D + I, where A is the number of edges that must be added, D the number of edges that must be deleted, and I the number of wrongly oriented edges.
  - **Frobenius Norm:** Measures the difference between the adjacency matrices of two causal graphs, providing a single value that quantifies the overall difference between them. A smaller Frobenius norm indicates that the graphs are more similar, meaning their causal structures are more closely aligned; a larger Frobenius norm suggests a greater discrepancy between the graphs.
  - **Structural Intervention Distance (SID):** For causal structure learning methods, it is important to understand the causal interpretation of a graph, since that is what allows predicting the result of interventions. Given a true DAG and an estimated DAG, SID counts the number of falsely inferred intervention distributions. Interventions are deliberate changes made to a variable within a graph to observe how these changes affect other variables. SID quantifies the difference between two causal graphs by considering the potential outcomes of interventions: it counts the number of instances where an intervention on a variable in the estimated graph leads to a different set of variables being affected compared to an intervention on the same variable in the true graph. A lower SID score means the estimated graph more accurately represents the true causal relationships; a higher score indicates greater divergence. Unlike metrics that focus only on the presence or absence of specific edges, SID considers the broader implications of these inaccuracies on the effectiveness of interventions, which are often the ultimate goal of causal analysis.
  - **Precision, Recall, F1, False Positive Rate (FPR), True Positive Rate (TPR), MSE, Area under the ROC curve (AUC):** These metrics are based on the intuition that directional adjacency relations can be treated as a binary classification problem.
    - **MSE:** the sum of squared differences between the predicted and ground-truth causal graphs (as adjacency matrices), divided by the total number of nodes.
- **Causal Inference Tools:**
  - **[TETRAD](https://www.cmu.edu/dietrich/philosophy/tetrad/)** is a drag-and-drop suite for causal structure learning. It can take datasets containing both continuous and categorical variables, including time series data, and supports algorithms to search for structural relations. It also supports modeling unmeasured confounders; simulating data from a statistical model; predicting the effects on other variables of interventions or perturbations on one or more variables; and computing the probability distribution of any variable conditional on specified values of any other set of variables. It is developed in Java.
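As a concrete illustration of the adjacency-matrix-based metrics above, here is a minimal sketch of SHD and the Frobenius-norm distance, assuming both graphs are fully directed DAGs encoded as binary adjacency matrices (the survey also covers partially directed graphs, which this sketch does not handle); the example matrices are hypothetical.

```python
import numpy as np

def shd(true_adj: np.ndarray, pred_adj: np.ndarray) -> int:
    """Structural Hamming Distance between two DAG adjacency matrices.

    Counts, over each pair of nodes, whether the learned graph needs an edge
    added (A), deleted (D), or reversed (I) to match the ground truth,
    following SHD = A + D + I with a reversal counted as a single edit.
    """
    additions = deletions = reversals = 0
    n = true_adj.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            true_edge = (true_adj[i, j], true_adj[j, i])   # (i->j, j->i) in the ground truth
            pred_edge = (pred_adj[i, j], pred_adj[j, i])   # (i->j, j->i) in the learned graph
            if true_edge == pred_edge:
                continue
            if pred_edge == (0, 0):
                additions += 1        # edge missing from the learned graph
            elif true_edge == (0, 0):
                deletions += 1        # extra edge in the learned graph
            else:
                reversals += 1        # edge present but wrongly oriented
    return additions + deletions + reversals

def frobenius_distance(true_adj: np.ndarray, pred_adj: np.ndarray) -> float:
    """Frobenius norm of the difference between the two adjacency matrices."""
    return float(np.linalg.norm(true_adj - pred_adj, ord="fro"))

# Hypothetical 3-node example: one correct edge, one missing edge, one reversed edge.
true_adj = np.array([[0, 1, 1],
                     [0, 0, 1],
                     [0, 0, 0]])
pred_adj = np.array([[0, 1, 0],
                     [0, 0, 0],
                     [0, 1, 0]])
print(shd(true_adj, pred_adj))                 # 2 (one addition, one reversal)
print(frobenius_distance(true_adj, pred_adj))  # ~1.73
```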
[An evaluation framework for comparing causal inference models](https://ar5iv.labs.arxiv.org/html/2209.00115) - Provides an extensive framework for evaluating causal inference models. It discusses performance profiles and statistical methods like the Friedman test for comparing multiple models, which can be crucial for assessing the fidelity and generalizability of causal inference models. This approach helps determine the effectiveness of different causal inference models under various simulation scenarios, which can be particularly useful in understanding how well these models perform across different datasets and conditions.

[C-XGBoost: A tree boosting model for causal effect estimation](https://arxiv.org/pdf/2404.00751) - Introduces a model that integrates XGBoost with causal inference to estimate the Average Treatment Effect (ATE) and the Conditional Average Treatment Effect (CATE) from observational data. The model leverages the strengths of tree-based models for handling tabular data and includes new regularization techniques to minimize overfitting and bias. It aims to improve the prediction of potential outcomes by using a novel loss function designed specifically for causal effect estimation.
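As a rough illustration of how tree boosting can be used for treatment effect estimation, here is a simple two-model ("T-learner") sketch built on the off-the-shelf `XGBRegressor`; this is not the C-XGBoost architecture or its custom loss function from the paper, and the data below is synthetic.

```python
import numpy as np
from xgboost import XGBRegressor

# Synthetic observational data: covariates X, binary treatment t, outcome y.
# The true treatment effect is +2.0, so the ATE estimate should land near 2.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
t = rng.binomial(1, p=1 / (1 + np.exp(-X[:, 0])))            # treatment depends on X[:, 0]
y = X[:, 0] + 0.5 * X[:, 1] + 2.0 * t + rng.normal(scale=0.1, size=n)

# T-learner: fit one outcome model per treatment arm.
model_treated = XGBRegressor(n_estimators=200, max_depth=3).fit(X[t == 1], y[t == 1])
model_control = XGBRegressor(n_estimators=200, max_depth=3).fit(X[t == 0], y[t == 0])

# Predict both potential outcomes for every unit, then average their difference.
cate = model_treated.predict(X) - model_control.predict(X)   # per-unit effect estimates (CATE)
ate = cate.mean()                                            # average treatment effect (ATE)
print(f"Estimated ATE: {ate:.2f}  (true effect: 2.0)")
```

A T-learner fits the two treatment arms completely independently, whereas approaches like C-XGBoost instead modify the model or loss to target causal effect estimation directly.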