In summary, this paper applies a form of mediation analysis to the study of LLM performance on Natural Logic queries. The idea is to understand to what degree an LLM appears to be influenced by the "right" aspects of a problem versus surface-level artifacts that should not influence its predictions. The mediation analysis is performed to estimate what percentage of the total influence of the prompt on the response is explained by the "desired" inferential pathway.
## A toy dataset
To illustrate the methodology of the paper, let's consider the simplest possible imaginary dataset it could be applied to:
* There are two contexts; for simplicity, let's call them $C=0$ and $C=1$.
* The first context implies $\downarrow$ monotonicity, so $\mathbb{P}[M=0\vert C=0]=1$, while the second context implies $\uparrow$ monotonicity, so $\mathbb{P}[M=1\vert C=1]=1$.
* There are also only two word pairs; let's say $W=0$ and $W=1$.
* The two word pairs have opposite relations: $W=0$ has the $\sqsubseteq$ relation and $W=1$ the opposite. Thus $\mathbb{P}[R=0\vert W=0]=1$ and $\mathbb{P}[R=1\vert W=1]=1$.
* The two contexts and two word pairs are sampled uniformly and combined independently, thus creating 4 possible inputs which have equal probability.
* Finally, let's say that the LM makes perfect predictions for any $(C, W)$ combination, thus $\mathbb{P}[Y=1\vert C=0, W=0] = \mathbb{P}[Y=1\vert C=1, W=1] = 1$ and $\mathbb{P}[Y=0\vert C=1, W=0] = \mathbb{P}[Y=0\vert C=0, W=1] = 1$.
This dataset fits the idealised causal diagram of the task in Figure 2 (which is a subset of the one in Figure 3).
The whole dataset can be summarised in a table:
| C | M | W | R | Y |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 |
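For concreteness, here is a minimal Python sketch of this data-generating process (the code and function names are my own, not the paper's); it reproduces the table above:
```python
from itertools import product

def monotonicity(c):
    # The context deterministically fixes the monotonicity: C=0 -> down (M=0), C=1 -> up (M=1).
    return c

def relation(w):
    # The word pair deterministically fixes the lexical relation: W=0 -> R=0, W=1 -> R=1.
    return w

def lm_prediction(m, r):
    # The idealised LM is always correct; per the stipulation above, Y=1 exactly when M and R agree.
    return int(m == r)

# Enumerate the four equiprobable (C, W) combinations.
dataset = []
for c, w in product([0, 1], repeat=2):
    m, r = monotonicity(c), relation(w)
    dataset.append((c, m, w, r, lm_prediction(m, r)))

for row in dataset:
    print(row)  # (C, M, W, R, Y), one row per line
```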
## TCE(C on Y)
Now let's see what we get if we calculate TCE(C on Y) as described in the paper.
Without applying any intervention, the expected value of $Y$ is $1/2$: each row in the table above has the same probability, so $\mathbb{E}[Y]$ is simply the average of the $Y$ column, which is $1/2$.
Now, let's consider the intervention. As per Table 5, for each datapoint in the non-intervention dataset, we sample an intervened datapoint such that $C$ and $M$ are different, but $W$ and $R$ are the same. So if our non-intervention datapoint was $(0, 0, 0, 0)$, then we sample $(1, 1, 0, 0)$ as the intervention.
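As a quick sketch of this pairing rule (my own code, relying on the fact that in the toy setting $M$ is a deterministic function of $C$ and that the idealised LM re-labels the modified input):
```python
def intervene(row):
    # Flip the context; since monotonicity is fully determined by the context,
    # M flips together with C, while the word pair W and its relation R stay fixed.
    c, m, w, r, _ = row
    c_new, m_new = 1 - c, 1 - m
    y_new = int(m_new == r)  # the idealised LM labels the intervened input
    return (c_new, m_new, w, r, y_new)

# Example: intervening on the first row of the table above.
print(intervene((0, 0, 0, 0, 1)))  # -> (1, 1, 0, 0, 0)
```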
So the intervention dataset will look as follows:
| C | M | W | R | Y |
|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 |
| 0 | 0 | 1 | 1 | 0 |
You'll notice that, as a set of rows, this is exactly the same dataset as before.
Therefore the average of $Y$ is the same with and without the intervention, and the calculated TCE(C on Y) would actually be $0$.
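To make this explicit, here is the same computation in code (my own sketch, not the paper's implementation), using the toy dataset and the intervention rule from above:
```python
from itertools import product

# Observational rows (C, M, W, R, Y): M = C, R = W, Y = 1 iff M == R.
dataset = [(c, c, w, w, int(c == w)) for c, w in product([0, 1], repeat=2)]

# Intervened rows: flip C (and hence M), keep W and R, re-label with the idealised LM.
intervened = [(1 - c, 1 - m, w, r, int((1 - m) == r)) for c, m, w, r, _ in dataset]

mean_y = sum(row[-1] for row in dataset) / len(dataset)            # 0.5
mean_y_int = sum(row[-1] for row in intervened) / len(intervened)  # 0.5

print(mean_y_int - mean_y)  # 0.0 -- the difference of means misses the effect entirely
```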
And this problem doesn't go away if we use the potential outcomes framework and a better definition of the average treatment effect:
$$
\text{ATE}(C \text{ on } Y) = \mathbb{E}\left[Y^{int+} - Y^{int-}\right]
$$
What's the issue here? The problem is that $Y$ is a binary label, and the intervention randomizes the context, which has a non-monotonic influence on the label. Sometimes the causal effect is in the positive direction (the label switches from 0 to 1), and sometimes it is in the negative direction (it switches from 1 to 0). If we therefore simply subtract the label without intervention from the label with intervention, these two directions of causal influence average out.
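To spell this out on the toy dataset: pairing each row of the observational table with its intervened counterpart, the four differences $Y^{int+} - Y^{int-}$ are $-1, +1, +1, -1$, so
$$
\mathbb{E}[Y^{int+} - Y^{int-}] = \tfrac{1}{4}\left[(-1) + (+1) + (+1) + (-1)\right] = 0,
$$
even though every single intervention flips the label.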
Solution: one could instead calculate the following:
$$
\mathbb{E}[\vert Y^{int+} - Y^{int-} \vert] = \mathbb{P}[Y^{int+} \neq Y^{int-}]
$$
as the total causal effect of the intervention. In fact, the paper this work builds on (Stolfo et al., 2022) used a similar metric to quantify causal influence.
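On the toy dataset this alternative metric behaves as one would hope; a quick check (my own sketch, same conventions as above):
```python
from itertools import product

# Observational rows (C, M, W, R, Y) and their intervened counterparts.
dataset = [(c, c, w, w, int(c == w)) for c, w in product([0, 1], repeat=2)]
intervened = [(1 - c, 1 - m, w, r, int((1 - m) == r)) for c, m, w, r, _ in dataset]

# For binary labels, E[|Y_int+ - Y_int-|] equals P[Y_int+ != Y_int-].
flip_rate = sum(abs(a[-1] - b[-1]) for a, b in zip(intervened, dataset)) / len(dataset)
print(flip_rate)  # 1.0 -- every intervention on the context flips the label
```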
One thing that jumps out is that, in at least one of the figures, the Total Causal Effect is higher than the Direct Causal Effect of the surface features.

If these quantities were properly defined, this should never happen.
In summary, I think there's a lot of room for improvement here. The overall causal diagram is interesting, but proper mediation analysis on top of it would result in much more meaningful quantities.