# PCS Papers Summaries

## Veridical Data Science [(Yu and Kumbier, 2020)](https://www.pnas.org/doi/epdf/10.1073/pnas.1901326117)

#### What does PCS do for science?

PCS provides guidelines for properly evaluating how well a model performs. PCS defines the properties a "good" model should satisfy. Some of these properties are quantifiable (e.g. a measurable loss function) and others form a preliminary checklist (e.g. constraints the analyst should think about).

#### What do researchers gain by using PCS?

Researchers who use PCS have a sound framework for evaluating models, methods, and processes for their specific application.

## Stable Discovery of Interpretable Subgroups... [(Dwivedi et al., 2020)](https://arxiv.org/pdf/2008.10109.pdf)

#### Setting: Causal inference with heterogeneous treatment effects

#### Goal: Find the best CATE (conditional average treatment effect) estimator for analysis on a chosen dataset

Predictability
- predictive reality check based on calibration

Stability
- run their method multiple times using different perturbations of the data to check whether the same cell cover is found
- for each cell they define a stability score and rank cells according to their stability scores

P & S
- run their method on a new (but similar) dataset to show that conclusions obtained by their method on the VIGOR study can generalize to the APPROVe study

#### What the Setting Gains By Using PCS:
- Predictability gives trust in the selected CATE estimator because it passes an out-of-sample prediction accuracy check
- Stability shows that the selected CATE estimator is robust to data perturbations
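The perturb-and-recount idea behind the stability scores can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: `find_cells` and `toy_find_cells` are hypothetical stand-ins for a subgroup-discovery method, perturbations are plain bootstrap resamples, and a cell's score is simply the fraction of perturbed runs in which it reappears.

```python
import numpy as np

def stability_scores(X, find_cells, n_perturb=50, seed=0):
    """Rank discovered cells by how often they reappear under bootstrap
    perturbations of the data (a sketch of the stability check above)."""
    rng = np.random.default_rng(seed)
    base_cells = find_cells(X)                 # cells from the original data
    counts = {cell: 0 for cell in base_cells}
    for _ in range(n_perturb):
        Xb = X[rng.integers(0, len(X), size=len(X))]   # bootstrap resample
        found = set(find_cells(Xb))
        for cell in counts:
            counts[cell] += cell in found
    # stability score = fraction of perturbed runs recovering the cell
    return {cell: c / n_perturb
            for cell, c in sorted(counts.items(), key=lambda kv: -kv[1])}

def toy_find_cells(X):
    """Hypothetical cell finder: split each feature at its median and keep
    the (feature, 'high') cells whose mean response beats the overall mean.
    The last column of X is treated as the response."""
    cells = []
    for j in range(X.shape[1] - 1):
        mask = X[:, j] > np.median(X[:, j])
        if X[mask, -1].mean() > X[:, -1].mean():
            cells.append((j, "high"))
    return cells
```

With a strong, real subgroup effect in the data, its cell should score near 1, while cells driven by noise drop toward 0 under resampling, which is exactly the ranking the stability check is after.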
#### What are the Unique Challenges of PCS In This Setting:
- CATE is a function and is challenging to estimate
- the fundamental problem of missing data in causal inference means it cannot be solved directly with conventional supervised learning techniques
- there is no test error for CATE models, because the missing potential outcomes mean there is no plug-in estimate for any risk function

## Veridical Causal Inference Using Propensity Score Methods... [(Ross et al., 2021)](https://pubmed.ncbi.nlm.nih.gov/34040495/)

Note: This paper is not written very clearly. They mention PCS once in the intro and then never again. Connections to PCS must be made by the reader.

#### Setting: Causal inference with propensity score methods for various outcomes (e.g. binary, count, etc.) in medical insurance claims data

#### Goal: A proper method for answering scientific inquiries using medical claims datasets that is also transparent and reproducible.

Predictability
- predict a propensity score

Stability
- stability of weights away from extreme values (trimming) for inverse probability of treatment weighting

#### What the Setting Gains By Using PCS:
- this paper mentions PCS but, in my opinion, does not actually explain how PCS is used, nor does it offer anything novel

#### What are the Unique Challenges of PCS In This Setting:
- healthcare datasets have selection bias, heterogeneity, missing values, duplicate records, and misclassification of diseases

## Next Waves in Veridical Network Embedding [(Ward et al., 2021)](https://onlinelibrary.wiley.com/doi/epdf/10.1002/sam.11486)

#### Setting: Evaluating network embedding algorithms

#### Goal: Study network embedding algorithms systematically and point to new directions for future research.

Predictability
- construct a metric to evaluate how well a model represents relationships in the original data

Computability
- approximations or use of simpler low-dimensional representations of large networks (e.g. limiting the number of vertices in the network)

Stability
- how stable are the results when the data or model is perturbed?
- examples
  - choice of representation space
  - choice of embedding dimension
  - preserving features of the network
  - data changes: removing a small number of edges from the network

#### What the Setting Gains By Using PCS:
- gives their method a template and validity
- gives them a way to comprehensively evaluate embedding algorithms that are very different

#### What are the Unique Challenges of PCS In This Setting:
- there are many network embedding methods, and they are often dependent on downstream tasks
- for predictability, we cannot do "cross validation" or sample the data as we do for other supervised ML problems (i.e. how do you sample from a network?)

## A New Method to Compare the Interpretability of Rule-Based Algorithms [(Margot et al., 2021)](https://www.mdpi.com/2673-2688/2/4/37/htm)

#### Setting: Evaluating interpretable, rule-based algorithms

#### Goal: Define a score that is a weighted sum of Predictability, Stability, and Simplicity.

Predictability
- accuracy of predictive algorithms

Simplicity
- sum of the lengths of the rules derived from the model

Stability
- Dice-Sørensen index for comparing two rule sets generated by an algorithm using two independent samples

#### What the Setting Gains By Using PCS:
- able to create a well-defined scoring system for evaluating interpretable, rule-based algorithms
- gives a structured framework for quantitatively evaluating interpretability

#### What are the Unique Challenges of PCS In This Setting:
- generating a stable set of rules is challenging
- they need to consider how concise / simple a produced rule is
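The Stability and Simplicity components above are easy to make concrete. The Dice-Sørensen index between two sets A and B is the standard 2|A ∩ B| / (|A| + |B|); the rule representation below (a rule as a frozenset of conditions) is my own illustrative assumption, not the paper's encoding.

```python
def dice_sorensen(rules_a, rules_b):
    """Dice-Sørensen index between two rule sets: 2|A ∩ B| / (|A| + |B|).
    1.0 when the sets are identical (and, by convention, when both are
    empty); 0.0 when they share no rules."""
    a, b = set(rules_a), set(rules_b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))

def simplicity(rules):
    """Simplicity as the sum of rule lengths, where a rule's length is its
    number of conditions; smaller means simpler."""
    return sum(len(rule) for rule in rules)

# a rule as a frozenset of (feature, operator, threshold) conditions
r1 = {frozenset({("age", ">", 50)}),
      frozenset({("age", ">", 50), ("bmi", "<", 30)})}
r2 = {frozenset({("age", ">", 50)}),
      frozenset({("bmi", "<", 25)})}
print(dice_sorensen(r1, r2))  # 0.5: one shared rule, two rules per set
```

In the paper's setup the two rule sets come from running the same algorithm on two independent samples, so a high index means the algorithm rediscovers the same rules, which is exactly the stability the weighted score rewards.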