(1) Machine Learning Overview
---
- Machine learning is a set of methods for performing tasks without relying on explicit instructions
- On the right we see a series of handwritten digits (part of the MNIST dataset). Classifying large numbers of handwritten digits is a tedious task for a human, yet it is far from trivial to solve with explicit rules. If one were to try to write down a set of instructions enabling a computer to classify handwritten digits, what would that person write? How do we account for variations in handwriting, like the tail of a 2 or the hat of a 1? What do we do if the digits are written at different angles or at different sizes? Machine learning helps us resolve precisely these kinds of tasks.
- Most often, machine learning algorithms perform a constrained optimization of some objective function in order to achieve a desired result. For example, we might try to minimize the number of incorrect digit labellings in the problem above, or we might try to minimize the cross entropy of the true label probabilities with respect to the predicted ones. In other scenarios the objective is more complex, as when we try to maximize the expected value of a sequence of moves in a game, where the value of a move is determined by the proportion of observed scenarios in which that move led to a win or a loss.
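- To make the objective-function idea concrete, here is a minimal numpy sketch (the function and the toy numbers are illustrative, not from the slides) of the cross-entropy objective a digit classifier might minimize:

```python
import numpy as np

def cross_entropy(true_probs, predicted_probs, eps=1e-12):
    """Average cross entropy between true label distributions and predicted
    ones -- the kind of objective a digit classifier would minimize."""
    predicted_probs = np.clip(predicted_probs, eps, 1.0)
    return -np.mean(np.sum(true_probs * np.log(predicted_probs), axis=1))

# Toy check with hypothetical numbers: a one-hot "true" label for the digit 3
# against a fairly confident prediction over the 10 digit classes.
true = np.zeros((1, 10)); true[0, 3] = 1.0
pred = np.full((1, 10), 0.01); pred[0, 3] = 0.91
print(cross_entropy(true, pred))  # ~0.094: a small loss for a good prediction
```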
----
(2)
---
But machine learning systems don't always succeed -- can anyone suggest a reason for why one of these systems might fail?
----
(3) Machine Learning Shortcomings
---
Right, so broadly many of these issues can be broken down into two types of failures: failures due to uncertainty and failures due to incompleteness.
- Uncertainty is quantifiable. It represents the extent to which a model is unsure about a particular conclusion. Conversely, incompleteness is not quantifiable -- it results from an inability to incorporate key aspects of the problem into the objective being optimized.
- While uncertainty arises from often-resolvable issues like a lack of data, noisy data, or poor model selection, incompleteness may arise from a desire for some outcome that is difficult to express mathematically (like general safety or fairness), from an objective that is a proxy for, or not precisely equivalent to, the desired solution, or from a situation in which there are multiple objectives with non-general trade-offs.
- How would you all characterize this example (a model is reasonably confident an image is a panda, but once noise is added it becomes almost certain the image is a gibbon)? Is this an example of incompleteness or uncertainty?
- So what's going on here? Originally the model is reasonably confident that the provided image is a panda, which would be the correct label. We add a bit of noise to the image (which doesn't alter it from a human perspective) and now the model is almost certain that the image is a gibbon. So is this an issue of incompleteness or uncertainty?
- This is an example of model incompleteness. There is some aspect of this problem that we have failed to incorporate into our objective: we would like the model not just to give the label a human would provide, but to do the labelling as a human would, so as to prevent issues like the panda-plus-noise one. But this isn't an easy thing to optimize for. We could have humans spend a great deal of time labelling data, but even then we'd be optimizing for correct labellings of images, not for a model that makes identifications as humans do (by recognizing shapes and attributes).
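- As a toy analogue of the noise example (not the actual panda/gibbon attack, and using a hypothetical linear classifier rather than an image network), the sketch below shows how a targeted nudge just past the decision boundary flips a classifier's prediction even though the input changes only modestly:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy illustration: for a linear classifier, stepping a sample just past the
# decision boundary (along the weight vector) flips its predicted label.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

x = X[0]
w = clf.coef_[0]
score = clf.decision_function(x.reshape(1, -1))[0]
step = (abs(score) / np.linalg.norm(w) + 0.1) * np.sign(score)
x_nudged = x - step * w / np.linalg.norm(w)   # move along -w, just past the boundary

print("before:", clf.predict_proba(x.reshape(1, -1))[0])
print("after: ", clf.predict_proba(x_nudged.reshape(1, -1))[0])
```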
----
(4) Addressing Incompleteness
---
- So to solve this problem we'd need systems that are optimized for poorly defined criteria which may be unquantifiable or infeasible to test. (What does it really mean to make identifications as a human would?)
- Instead of trying to optimize for any particular criterion individually, we'll optimize for interpretability -- if a system can "explain" its reasoning, then we can verify whether that reasoning is sound with respect to our additional criteria. For example, if a system can tell us why it thought that image was a gibbon, then we can alter it to make its reasoning more human-like.
----
(5) Model Interpretability
---
- We'll define interpretability as "the ability to explain or to present in understandable terms to humans." Even if a model gave us the relative importance of 1 million features, it wouldn't be truly interpretable unless it distilled those features down to the 10 most important, or to some other number easily graspable by humans.
- If we could do this well, we could overcome perhaps the greatest issue in machine learning: the fact that it yields results without reasoning. In the future, interpretable methods will allow us to develop models that are safe and fair, models that tell us about correlations they've unearthed or possible avenues for research and discovery. They could uncover causal mechanisms, reduce technical debt, and even dynamically update their optimization constraints in light of their interpretations. They could also allow us to build models faster, more efficiently, and with far greater accuracy and generalizability than is possible today.
- Interpretability will be a big leap forward in ML and the field is only just emerging.
----
(6) Interpretability Overview
---
There are a few key dimensions of interpretability:
- Intrinsic or Post Hoc?
- Some models are rendered intrinsically interpretable by their simple structure. For example, a short decision tree or a sparse linear model may be interpretable as is. But many of these models lack complexity and may therefore be undesirable in many scenarios.
- Post hoc interpretability refers to interpretation methods that can be applied to models after training. For example, my original suggestion for explanation on ADDER was to perform permutation feature importance, a common measure of feature importance obtained by measuring the effect on model performance of randomly permuting each feature (see the sketch at the end of this slide).
- Model-Specific or Model-Agnostic?
- Interpretable methods can be particular to a model (e.g. the weights of a linear regression model have a specific interpretation as the slopes of a fitted line)
- Conversely, we can design interpretation methods that are general and can be applied after any model's training. Most of these methods analyze the relationship between feature inputs and labels. By definition, these methods do not use any model internals like weights or structural information
- Local or Global?
- Does the interpretation method explain the entire model's behavior (i.e. it chooses panda whenever (blank)), or does it explain an individual prediction (i.e. the model chose the label panda in this scenario because (blank))?
- Clearly it is necessary to choose which of these properties is preferable based on the task and model, but I'll make a few broad generalizations:
- Most models are not intrinsically interpretable so most interpretability methods must be post hoc.
- If you are able to get similar results using a model-agnostic method, use it. Why? Because it's modular (if the model changes, the interpretation method doesn't need to). Because it allows for comparison (we can compare two models to see which better meets some external criterion). And because you may not always have access to model-specific elements like weights, or it may be infeasible to gain access to them and analyze them.
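- Since permutation feature importance came up above as a canonical post hoc, model-agnostic method, here is a minimal sketch of the idea (my own illustrative implementation with a generic `model`/`metric` interface assumed; sklearn.inspection.permutation_importance provides a maintained version):

```python
import numpy as np

def permutation_importance(model, X, y, metric, n_repeats=5, rng=None):
    """Post hoc, model-agnostic importance: how much does the score drop when
    a single feature column is shuffled? `model` maps an (n, d) array to
    predictions and `metric(y_true, y_pred)` returns a higher-is-better score."""
    rng = np.random.default_rng() if rng is None else rng
    baseline = metric(y, model(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature j's link to y
            drops.append(baseline - metric(y, model(X_perm)))
        importances[j] = np.mean(drops)                   # average score drop
    return importances
```
----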
(7) Evaluating Interpretability
---
- We can evaluate these kinds of interpretable methods through application level, human level, and functional level evaluations.
- Application Level: Create an experiment in which experts test the technology in its real application and measure outcomes with and without interpretations. Alternatively, experiments can attempt to measure the difference between human interpretations and system interpretations.
- Human Level: Create an experiment in which non-experts carry out a simpler task that maintains the essence of the target application. This can be good when we want to test more general notions of the quality of the explanation.
- Functional Level: Experiments carried out algorithmically, using a predefined proxy measure of explanation quality rather than human judgment
- Issues with these evaluation approaches:
- They do not provide a standardized way to evaluate these methods.
----
(8) Model Explanation
---
- There are many methods of rendering a model interpretable. I only want to focus on one subset of interpretability methods, called explanation. In explanation we algorithmically generate relations between a sample's feature values and its label in a human-understandable way.
- Here I present a mathematical framework for explanation that, to my knowledge, has yet to be enumerated in the literature in this general way.
- I define a model-agnostic explainer as $\Phi(\mathcal{E}, \delta, f, X)$, where $\mathcal{E}$ is some noise (perturbation) function subject to a set of constraints, $\delta$ is a metric, $f$ is the original model we are trying to explain, and $X$ is a set of data. We then solve the constrained optimization $\Phi^* = \arg\min_{\mathcal{E}} \delta(f(\mathcal{E}(X)), f(X))$.
- We are therefore trying to find the noise function that, subject to the predefined perturbation constraints, minimizes the distance (under the predefined metric) between the model's predictions on the perturbed data and its predictions on the original data. We then use the resulting perturbation as a proxy for feature importance (a small sketch of this optimization follows these notes).
- Now I'm going to make the claim that there is a good choice of distance function and a bad choice. Most of the methods I will describe going forward will opt for the bad choice, but I want to quickly touch on the good choice and why it makes so much sense.
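- As a concrete (if naive) instance of the $\Phi^*$ optimization above, the sketch below exhaustively searches, for a single sample, over perturbations that keep exactly $k$ features and zero the rest, and returns the mask under which the model's output moves the least. The `model` interface and the brute-force search are illustrative assumptions; this is only feasible for small $d$:

```python
import itertools
import numpy as np

def best_k_feature_mask(model, x, k, delta=None):
    """Brute-force Phi* for one sample: over all masks that keep exactly k
    features (zeroing the rest), pick the mask minimizing delta(f(E(x)), f(x)).
    `model` maps an (n, d) array to scores."""
    if delta is None:
        delta = lambda a, b: float(np.abs(a - b).sum())    # simple default metric
    base = model(x[None, :])[0]                            # f(x)
    best_mask, best_dist = None, np.inf
    for keep in itertools.combinations(range(x.shape[0]), k):
        mask = np.zeros_like(x, dtype=float)
        mask[list(keep)] = 1.0
        dist = delta(model((mask * x)[None, :])[0], base)  # delta(f(E(x)), f(x))
        if dist < best_dist:
            best_mask, best_dist = mask, dist
    return best_mask                                        # 1s mark the retained features
```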
----
(9) An Ideal Metric
---
- The best choice of distance function is mutual information. The mutual information between two random variables X and Y measures the amount of information obtained about one variable by observing the other. Entropy is the number of bits required to encode a random variable, i.e. items drawn from some distribution. Mutual information is then the entropy of one variable minus the conditional entropy of that variable given the other -- concretely, it measures how much our uncertainty about Y is reduced once we know X. It is well defined mathematically as $I(X;Y) = KL(P(X,Y) \,\|\, P(X)P(Y)) = H(Y) - H(Y \mid X)$. In the case of explanation, we'd like to select the perturbation function $\mathcal{E}$ that maximizes the mutual information between the perturbed features and the labels. This makes a ton of sense -- we want the features we keep to tell us as much about the labels as possible, subject to the perturbation constraints.
- Observe that we can use mutual information either to find the most relevant features (by maximizing the mutual information between the retained features and the labels) or to find the least relevant features (by minimizing it). The choice is generally immaterial, but there are cases for doing one over the other.
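- As a quick illustration of mutual information as a relevance measure (using sklearn's 8x8 digits data and its mutual_info_classif estimator; the example is illustrative and not part of the explainer framework itself):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.feature_selection import mutual_info_classif

# Rank features (pixels of the 8x8 digits data) by their estimated mutual
# information with the labels; an MI-based explainer would aim to retain
# exactly these high-MI features.
X, y = load_digits(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)
print("10 most informative pixel indices:", np.argsort(mi)[::-1][:10])
```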
----
(10) Properties of Explanations
---
- Now explanations have many important properties that we might want to think about:
- Expressive Power: The language of the explanations that the method generates, e.g. are they if/then rules? decision trees?
- Translucency: The extent to which the method relies on internals of the model, like its parameters (more translucency == more information available to generate explanations; less translucency == increased portability)
- Portability: The extent to which the explainer can be used generally (i.e. is it black-box or model-specific)
- Algorithmic Complexity: How difficult is it to generate explanations -- how much time does it take?
- Individual Explanations also have a series of key properties:
- Accuracy: How well does an explanation predict unseen data?
- Fidelity: Related to accuracy -- how well does the explanation approximate the predictions of the underlying model?
- Consistency: How much do explanations differ between two models that are trained on the same task and produce similar predictions?
- Stability: How similar are the explanations for similar instances?
- Comprehensibility: How well do humans understand the explanations?
- Certainty: Does the explanation factor in the model's certainty? <<-- maybe remove?
- Degree of Importance: Is it clear which features are most important?
- Representativeness: How many instances does the explanation cover?
(discuss why novelty was dropped)
----
(11) Blackbox Explainers
---
- In my work I focused solely on blackbox explainers -- explainers that can interpret arbitrary models whose parameters and internals may not be available. This portability is key because models will continue to change and we don't want to have to swap out the explainer every time.
- I also focused on explainers which provide at least some local explanations -- we want to be able to tell clients why a model made a decision for a particular sample, not just give generalizations about trends demonstrated by the model.
- So I now want to discuss the explainers that I built (I'll also mention a couple that I didn't build, but which nevertheless could be useful)
----
(12) LIME
---
- LIME or Local Interpretable Model-Agnostic Explanations is a method of generating explanations for individual samples
- LIME essentially does the following:
- For a particular sample $x$ with $d$ features, we uniformly draw an integer $k$ in the interval $[0, d]$. We then sample $k$ of the features of $x$ uniformly without replacement. This yields a new, perturbed sample that keeps the sampled features of $x$ and replaces the others with zeros. We repeat this perturbed-sample generation $N$ times, so we acquire a dataset of elements in the vicinity of $x$.
- We then apply our model to each of the $N$ perturbed samples to generate $N$ corresponding labels, and we weight the samples by their proximity to the sample being explained.
- This allows us to train a sparse linear regressor on the weighted data. The weight parameters of this surrogate model can then be used as a proxy for feature importance and can explain the contribution of each feature to the predicted label (a minimal sketch of this loop appears at the end of this slide).
- Advantages: LIME has become an industry standard in explanation, and there exists a relatively robust package that simplifies the LIME development process. LIME also makes intuitive sense and uses simple models to generate explanations
- Disadvantages: LIME will be inherently unfaithful for models that are highly non-linear in the locality of the prediction. Furthermore, LIME will miss feature correlations that are not evident from a single sample/label pair
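- Below is a minimal numpy/sklearn sketch of the perturb-and-fit loop described above (my own illustration, not the lime package; Ridge stands in for the sparse linear model, and `model` is assumed to return one score per row, e.g. the probability of the class being explained):

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_explanation(model, x, n_samples=1000, kernel_width=0.75, rng=None):
    """Perturb x by zeroing random feature subsets, query the black box,
    weight samples by proximity to x, and fit a weighted linear surrogate
    whose coefficients proxy per-feature contributions."""
    rng = np.random.default_rng() if rng is None else rng
    d = x.shape[0]
    masks = rng.integers(0, 2, size=(n_samples, d))   # which features of x to keep
    labels = model(masks * x)                          # black-box scores per perturbation
    # exponential kernel on the distance between each mask and the all-ones mask
    dist = np.sqrt(((1 - masks) ** 2).sum(axis=1) / d)
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2))
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, labels, sample_weight=weights)
    return surrogate.coef_                             # per-feature contributions
```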
----
(13) ANCHORS
---
- If one were to use LIME to explain a model on a number of samples one might find that the same feature has one effect on the resulting label in one case and another effect in another case. Take for example the instance presented here. We see the contributions of particular features to the label for a given sample. However if you look closely, you'll notice that feature 2 contributes to the anomaly label in this sample, but to the nominal label in the other. Why might this be the case?
- These explanations are designed to be locally accurate, but applying an explanation from one sample to another would lead to false results. We really want an answer to the question "When does feature 2 contribute to an anomaly?"
- ANCHORS seeks to find "anchors": if-then rules that sufficiently "anchor" a prediction such that changes to the sample's other features won't impact the label. For example, we might find that whenever feature 2 is smaller than 0.5 and feature 1 is larger than 1.0, the sample is labelled an anomaly.
- ANCHORS essentially does the following:
- Initialize an anchor A with an empty rule (i.e. one that applies to the sample as well as all of the perturbed versions of the sample)
- In every iteration, extend the anchor A by adding an additional rule so that the anchor will eventually be a long chain of rules that must all hold for a sample to be "anchored"
- But how do we choose which rule to add? We would like to add the candidate rule with the highest estimated precision, where precision is defined as the expectation, over perturbed versions of the sample that satisfy the anchor, of the indicator that the model assigns them the same label as the original sample.
- It would be infeasible to compute the true precision, so instead we endeavour to find the minimal set of calls to f that allow us to estimate which of the candidate rules has the highest true precision
- This problem can be formulated as a pure-exploration multi-armed bandit problem which we can solve using a greedy algorithm or using beam search.
- We proceed until some constraint is met, typically that the anchor's estimated precision exceeds a chosen threshold (a simplified sketch of this procedure appears at the end of this slide).
- Advantages: Provides intuition about the model's global behavior. In a study carried out by Ribeiro, the vast majority of users preferred the explanations given by ANCHORS to those of LIME; furthermore, 24 of 26 users thought they would be more precise with ANCHORS, and the 2 users who thought they would be more precise with LIME were actually more precise with ANCHORS (it is not clear what percentage of users were actually more precise with ANCHORS)
- Disadvantages: Computation time may be even worse than LIME's on select problems, depending on the hyperparameters. It is also theoretically possible for anchors to conflict, or for anchors to be so specific as to be unusable.
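- A simplified sketch of the anchor-building loop (my own illustration under heavy simplifying assumptions: anchors fix features at the sample's exact values rather than the paper's discretized predicates, precision is estimated by plain Monte Carlo over a background dataset instead of the bandit procedure, and the search is purely greedy with no beam):

```python
import numpy as np

def estimate_precision(model, x, anchored, background, n_samples=500, rng=None):
    """Fraction of perturbed samples (anchored features held at x's values,
    the rest resampled from rows of `background`) that keep x's label."""
    rng = np.random.default_rng() if rng is None else rng
    perturbed = background[rng.integers(0, len(background), size=n_samples)].copy()
    cols = sorted(anchored)
    perturbed[:, cols] = x[cols]                     # enforce the anchor's rules
    return np.mean(model(perturbed) == model(x[None, :])[0])

def greedy_anchor(model, x, background, target_precision=0.95, n_samples=500):
    """Greedily grow the set of anchored features until the estimated
    precision of the resulting rule exceeds the target."""
    anchored, d = set(), x.shape[0]
    while len(anchored) < d:
        best_j, best_prec = None, -1.0
        for j in range(d):
            if j in anchored:
                continue
            prec = estimate_precision(model, x, anchored | {j}, background, n_samples)
            if prec > best_prec:
                best_j, best_prec = j, prec
        anchored.add(best_j)
        if best_prec >= target_precision:
            break
    return sorted(anchored)                          # indices of the anchored features
```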
----
(14) Other Methods
---
There are other popular methods that you all might be interested in, like SHAP or Kernel SHAP, which use Shapley values to quantify feature importance, or DeepLIFT, which is designed specifically for neural nets. However, I think the best explainer is L2X.
----
(15) L2X
---
- L2X improves upon LIME and ANCHORS by globally learning a local explainer i.e. by taking into account the distribution of inputs as well as the local behavior of the model.
- In L2X we train a network to learn a perturbation function which maximizes the mutual information between the perturbed samples and the correct labelings, subject to the constraint that the perturbation consists in choosing $k$ features to retain and zeroing out the rest ($k$ is chosen by the user but can also be optimized).
- In order to find this function directly, it would be necessary to compute an expectation over $\binom{d}{k}$ possible feature subsets, which would be intractable. Instead we use the Gumbel-Softmax (Concrete) relaxation to sample from the subset distribution in a differentiable way (a sketch of this sampling step follows these notes).
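- A small numpy sketch of the relaxed subset-sampling step (following the idea of taking the elementwise max of $k$ independent Gumbel-Softmax/Concrete samples to approximate a $k$-hot feature mask; the function name and defaults are my own):

```python
import numpy as np

def relaxed_k_hot(logits, k, temperature=0.5, rng=None):
    """Approximate 'choose k of d features' differentiably: draw k independent
    Gumbel-Softmax (Concrete) samples over the d feature logits and take their
    elementwise maximum, giving a relaxed k-hot mask in [0, 1]^d."""
    rng = np.random.default_rng() if rng is None else rng
    d = logits.shape[-1]
    gumbel = -np.log(-np.log(rng.uniform(size=(k, d))))    # Gumbel(0, 1) noise
    scores = (logits + gumbel) / temperature
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs = probs / probs.sum(axis=-1, keepdims=True)      # k Concrete samples
    return probs.max(axis=0)                                # relaxed k-hot mask
```
- Multiplying a sample elementwise by this mask gives the "perturbed" input whose mutual information with the label the explainer network is trained to maximize; in practice the logits come from that network and gradients flow through the softmax.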
----