# Language model self-knowledge
## Motivations / framing
- Multi-step / long-term planning or reasoning can be seen as a DAG. Given an initial prompt that requires multiple implicit steps, the model has to identify some correct graph of computations, walk the graph in topological order, and compute the correct solution at each graph node (which requires two sub-steps: having stored the solutions of the parent nodes, and correctly combining those solutions).
- It's unclear that LMs will ever be able to *fully* attain this ability, but we can still analyze the degree to which current LMs can do this, since it's still an important and helpful skill. It also gives us a better scientific understanding of what skills LMs would require (in the limit) to achieve fully consistent multi-step reasoning.
- One potential concern: we're assuming that this is a desirable ability for LMs to have, but as Sam has mentioned, there could be an alternative view for what the desired behavior in response to a complex/multi-step prompt should be: "the correct response, when a fixed-computation-time autoregressive model like a current LLM is presented with a question that’s too hard for it to solve in one step, is to start doing something like CoT." (The implication being that it's fine if the LLM's responses are inconsistent between the direct and the CoT prompt, so long as the response to the CoT prompt is "correct". Am I understanding this correctly...?)
- If I'm understanding correctly, we're assuming here that there exists some set of multi-step tasks for which an LLM gives inconsistent responses depending on whether the task is phrased as a direct prompt or as a CoT prompt. In addition, these are tasks for which the inconsistency is caused by the LLM not having enough computational capacity for the direct prompt (but having sufficient computational capacity for the CoT prompt).
- If this were to be the case, how do we determine at test-time when to use a CoT prompt instead of a direct prompt? (Or do we just use CoT prompts always?) Does this mean we have to have a model that can always identify whether it has enough computational capacity to answer a particular prompt? (Isn't this the partial halting problem?)
- It's also not clear to me that CoT prompting is strictly computationally cheaper than direct prompting, in current or future models. CoT prompts are generally longer than direct prompts, and you could imagine that an LLM could have learned representations of the direct and CoT prompts that are "equivalent," but the latter requires a far longer context to process. This is a fairly nebulous thing to reason about.
- In general, my current approach to this particular concern would be to say that we're merely analyzing the requirements necessary for being consistent in multi-step reasoning, which is a prerequisite for fully correct multi-step reasoning (approaches like CoT can sometimes narrow the gap, but they are more a patch that only sometimes helps, not a general solution). We're not saying that models can fully attain this kind of consistency - it's more of a theoretical upper bound, and we're trying to see how far current models are from it.
<!-- - One could imagine that a future model could achieve parity in performance between standard prompting and CoT, but *still* fail at many multi-step reasoning problems. One concrete example of why this might happen is that the model has learned "equivalent" representations (whatever that means for this particular model - maybe the embeddings have a cosine similarity of 1, for instance) for the standard prompt and its corresponding CoT prompt, -->
- We can see **self-knowledge** as the “storing the solutions of the nodes” portion, wherein the model should be able to answer any prompt that asks for the value of that node, even if it’s asked indirectly.
- Put another way, given some initial prompt $p$, the model must be *consistent* in its responses to any other prompt that asks for the model's completion for $p$.
- *i.e.* Consistency over **identity prompt transformations**
- *i.e.* "Ability to model its own local/logit predictions... in some counterfactual context" (But "counterfactual" is a very overloaded term in ML, and if we want to use this terminology we should be more precise about what we mean here.)
- We can see compositional self-consistency as being able to correctly combine the solutions from the parent nodes (which is dependent on the self-knowledge portion, because the value of the parent nodes should already have been computed, stored, and easily accessed.)
- i.e. "**consistency under compositional replacement**" over a narrow set of prompts that it makes sense to compose (as specified in the current draft)
- Of course, this all assumes that we can measure these abilities via direct prompts that simply ask for the answer to any particular subgraph. I think we kind of *have* to make this assumption? It seems to me that a lot of LLM research currently does this anyway, and we have no straightforward way to disentangle the internal calculation from the ability to verbalize said internal calculation.
### Other questions/concerns:
- I think our discussions have used "counterfactual" in many different ways, and if we want to somehow use this as a motivation, it should be more precisely defined.
- There also seems to be this ongoing question of whether consistency matters when the model's responses are *incorrect* for an intermediate step. One could argue that it's desirable for a model to be able to "recover" from the incorrect intermediate step, which might make it inconsistent. But then how can we trust such a model if we have no way of knowing when it does or doesn't recover from a mistake? Is it better to have a model that is consistently incorrect?
- (Last resort) two short papers instead?
### Notes
- Checking consistency over different kinds of transformations. But different kinds of transformations matter more for certain tasks
- Indirect prompts (i.e. "self-knowledge")
- Language modeling
- TODO: Jason - how to frame the switch to multiple choice?
- Re-ordering
- Arithmetic order of operations / associativity
- LM task with operands of "and" or "or" flipped?
- Try sampling from Wikipedia and flipping automatically, see if examples make sense
- Compositional
- Nested arithmetic
- Semantic parsing
- **Note**: Be sure to specify that prompt paraphrasing / minimal perturbations are out of scope
- Choose tasks for testing consistency where the correctness is already reasonably high (including even more nested ones)
- Or perhaps compare how the consistency varies depending on the base accuracy?
- Run arithmetic experiments on a few different levels of difficulty
- Looking at other kinds of arithmetic semantic equivalence - changing association order / order of operations
- Definition of semantic equivalence? ("Intensional" - two statements mean the same thing for mechanistic reasons vs. "extensional")
- Alternative - say that the definition is "context dependent" and define it separately for each task
- But would need to discuss in discussion section whether there needs to be a universal definition of semantic equivalence
- And what kinds of semantic equivalence do models respect?
- Check if there's a way to make the few-shot performance on semantic parsing better
- (preferred) Maybe remove the "answer( ... )" part from the FunQL
- Check whether the presence of a "?" is a predictor of whether "answer()" is in the gold parse
- Maybe not ACL? Due to high influx of prompting/ICL work
- Arxiv -> ICML
- Consistency vs. correctness
- High consistency across different transformations of the same prompt would imply some consistency of internal reasoning (even when the model is not always correct)
- Mention correctness only for tasks that *do* have single correct answers - if the system is perfectly correct, then consistency does not matter (because the system has 100% consistency then)
## Arithmetic as a DAG
Suppose a problem requiring multiple steps of logical reasoning can be represented as a DAG $G=(V,E)$ where the leaf nodes contain the initial input values, interior nodes contain operators, and the root stores the final output value.
```graphviz
digraph dg{
"+" -> "*" ;
"+" -> "/" ;
"*" -> 2;
"*" -> 3;
"/" -> 16;
"/" -> 4;
label="A DAG for the expression (2*3)+(16/4)"
}
```
Each interior node $v$ applies its operator to its children's outputs and outputs a mathematical expression, while leaf nodes output their numeric values. I.e., the "$*$" node above outputs the expression "$2*3$", the "$/$" node outputs the expression "$16/4$", and the "$+$" node outputs the expression "$(2*3)+(16/4)$".
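To make this concrete, here is a minimal sketch (not the actual experiment code) of how such an expression tree could be represented so that each node yields both its expression string and its value; the `Node` class and its field names are illustrative.
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    """A node in an arithmetic expression DAG (illustrative representation)."""
    op: Optional[str] = None       # "+", "-", "*", "/" for interior nodes
    value: Optional[int] = None    # numeric operand for leaf nodes
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def expression(self) -> str:
        """The expression string this node outputs, e.g. "(2*3)+(16/4)"."""
        if self.op is None:
            return str(self.value)
        left, right = self.left.expression(), self.right.expression()
        if self.left.op is not None:
            left = f"({left})"
        if self.right.op is not None:
            right = f"({right})"
        return f"{left}{self.op}{right}"

    def evaluate(self) -> float:
        """The value of the subtree rooted at this node."""
        if self.op is None:
            return self.value
        a, b = self.left.evaluate(), self.right.evaluate()
        return {"+": a + b, "-": a - b, "*": a * b, "/": a / b}[self.op]

# The DAG from the figure above:
root = Node(op="+",
            left=Node(op="*", left=Node(value=2), right=Node(value=3)),
            right=Node(op="/", left=Node(value=16), right=Node(value=4)))
print(root.expression())  # (2*3)+(16/4)
print(root.evaluate())    # 10.0
```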
## Formalization
**Definitions**:
Let vocabulary $\mathcal{V}$ be a finite set of tokens, $\mathcal{P}=\bigcup_{n\in\mathbb{N}}\mathcal{V}^n$ be the set of all possible finite-length sequences formed from tokens in $\mathcal{V}$, and $p_\theta$ be an autoregressive language model that defines a conditional probability distribution $p_\theta(\cdot\,|\,C):\mathcal{V}\to[0,1]$ over next tokens $v\in\mathcal{V}$ given a context sequence $C\in\mathcal{P}$. $p_\theta$ greedily decodes the next token in a sequence as follows:
$$\tilde{y}_t=\arg\max_{v\in\mathcal{V}}\,\log p_\theta (y_t=v|C;\tilde{y}_{<t})$$
given some context sequence $C=(c_1,\cdots, c_{|C|})\in\mathcal{P}$, until some time step $T$ for which $\tilde{y}_T=[EOS]$.
Then for some sequence $Y=(y_1,\cdots, y_T)$, the probability of $Y$ given context $C$ is
$$ p_\theta(Y|C) = \prod_{t=1}^T p_\theta(y_t|C;y_{<t}). $$
For ease of notation, we denote the greedy decoding output of a model $p_\theta$ as:
$$ g_{p_\theta}(C) = (\arg\max_{v\in\mathcal{V}} p_\theta(y=v|C),\cdots,\arg\max_{v\in\mathcal{V}} p_\theta(y=v|C;\tilde{y}_{<T})) $$
where $T$ is the first time step $t$ for which $\arg\max_{v\in\mathcal{V}} p_\theta(y=v|C;\tilde{y}_{<t})=[EOS]$.
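As a sanity check on the notation, here is a minimal sketch of $g_{p_\theta}$ as a greedy decoding loop; `next_token_logprobs` is a hypothetical stand-in for whatever interface exposes $\log p_\theta(\cdot\,|\,C;\tilde{y}_{<t})$, not an actual API.
```python
from typing import Callable, Dict, List

EOS = "[EOS]"

def greedy_decode(next_token_logprobs: Callable[[List[str]], Dict[str, float]],
                  context: List[str],
                  max_steps: int = 256) -> List[str]:
    """Compute g_{p_theta}(C): repeatedly take the argmax next token until [EOS].

    `next_token_logprobs` is a hypothetical function mapping the sequence so far
    (context plus previously decoded tokens) to log-probabilities over the vocabulary.
    """
    decoded: List[str] = []
    for _ in range(max_steps):
        logprobs = next_token_logprobs(context + decoded)
        y_t = max(logprobs, key=logprobs.get)  # argmax over v in V
        if y_t == EOS:
            break
        decoded.append(y_t)
    return decoded
```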
In order to describe the desired behavior for language models, we define a gold distribution $p^*$, of the same form as $p_\theta$, that represents a target distribution that we would like our language models to approximate. We also define a relation $\sim$ that indicates when two strings are *semantically equivalent*. In other words, given a pair of strings $s_1,s_2\in\mathcal{P}$, $s_1$ and $s_2$ can be used interchangeably in an utterance without changing its meaning if and only if $s_1\sim s_2$.
Reasoning with language often also involves composing prompts - for instance, we might ask "what is the answer to 2\*3+4", which can be seen as the composition of a *prompt template* "what is the answer to \_+4" with the prompt "2\*3", where the "\_" symbol in the former string is substituted with the latter string. This corresponds to a chain of thought where the model might first answer the prompt "2\*3" (yielding $g_{p_\theta}(\text{"2*3"})$), substitute that completion into the template, and then answer the filled-in template. To denote such prompt templates, we define $\mathcal{P}'$, the set of prompts $p\in\mathcal{P}$ that contain exactly one "\_" symbol. Additionally, the function $f:\mathcal{P}'\times\mathcal{P}\to\mathcal{P}$, written $f(p',p)$, denotes substitution of $p$ for the "\_" symbol in $p'$. For example, our previous example prompt "what is the answer to 2\*3+4" can be denoted by $f(p',p)$ for $p'=\text{"what is the answer to _+4"}$ and $p=\text{"2*3"}$. For prompts that do not make sense to decompose, we can denote them trivially as $p=f(\text{"_"}, p)$ for the identity prompt template "_".
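Operationally, $f$ is just string substitution of the single "\_" placeholder; a minimal, illustrative sketch:
```python
def f(template: str, prompt: str) -> str:
    """Substitute `prompt` for the single "_" placeholder in `template`."""
    assert template.count("_") == 1, "templates in P' contain exactly one '_'"
    return template.replace("_", prompt, 1)

assert f("what is the answer to _+4", "2*3") == "what is the answer to 2*3+4"
assert f("_", "2*3") == "2*3"  # the identity prompt template
```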
We claim that the ability to reason in a multi-step and compositional way requires two notions of model consistency:
1. **Self-invariance**: In multi-step reasoning, a model must have knowledge of what its answers would be for intermediate steps. In the above example, the model must know its answer to "2\*3" in order to correctly reason about the prompt "what is the answer to 2\*3+4". While chain-of-thought prompting makes this prerequisite explicit, any non-chain-of-thought prompts that involve multi-step reasoning require that the model has *implicit* knowledge of these intermediate responses. More generally, the model's responses for a given prompt should be invariant to both how directly it is asked (e.g. if prompted with "Given prompt _, what would your response be?") and how the prompt is worded. Interestingly, the former requirement also implies a kind of self-knowledge, in the sense that the model can predict its own responses and distinguish those responses from those of other models. Taken together, self-invariance implies that a model gives semantically equivalent responses to semantically equivalent prompts.
- TODO: Note here that we only analyze the self-knowledge portion of this!
2. **Compositional self-consistency**: Once a model has computed its responses to intermediate steps or sub-problems, it must consistently use those intermediate outputs in downstream computations that depend on those intermediate steps. As an example, if a model answers "8" for the prompt "2\*3", then its answers for the prompts "2\*3+4" and "8+4" should be equivalent. (Notably, this definition does not require *correctness* of the intermediate outputs, since language models are often used for multi-step tasks that do not have unique ground truths.)
### Self-Invariance
$p_\theta$ is **self-invariant** if its completions for semantically equivalent prompt compositions are also semantically equivalent, or
$$ p\sim q \implies g_{p_\theta}(p)\sim g_{p_\theta}(q) $$
for all $p,q\in\mathcal{P}$.
However, in this paper we only study LM self-invariance in the context of a specific subset of prompt compositions - **self-knowledge prompts**. These are compositions of prompt templates $p'\in\mathcal{P}'$ for which $f(p',p)\sim f(\_,p)=p$ for all $p\in\mathcal{P}$. We refer to this set of prompt templates as $\mathcal{P}'_{SK}=\{p'\in\mathcal{P}' \,|\, f(p',p)\sim f(\_,p)\,\forall p\in\mathcal{P}\}$. Then we say $p_\theta$ has **self-knowledge** if
$$ g_{p_\theta}(f(p',p))\sim g_{p_\theta}(f(\_,p)) $$
for all $p'\in\mathcal{P}'_{SK}$, $p\in\mathcal{P}$. It is easy to see that if a model is self-invariant, then it must also have self-knowledge.
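In code, the self-knowledge check amounts to comparing the model's completion for a prompt against its completion when that prompt is wrapped in a self-knowledge template. The sketch below re-uses the `f` sketch above and assumes a hypothetical `generate` callable (the greedy decoder applied to a prompt string) and a task-specific `semantically_equivalent` stand-in for $\sim$; the example templates are only illustrative members of $\mathcal{P}'_{SK}$.
```python
from typing import Callable, List

# Illustrative self-knowledge templates (members of P'_SK).
SK_TEMPLATES: List[str] = [
    "Given the prompt _, my next words would be",
    "If asked _, I would say",
]

def has_self_knowledge(generate: Callable[[str], str],
                       semantically_equivalent: Callable[[str, str], bool],
                       prompt: str) -> bool:
    """Check g(f(p', p)) ~ g(p) for every self-knowledge template p'."""
    direct = generate(prompt)  # g_{p_theta}(f("_", p)) = g_{p_theta}(p)
    return all(
        semantically_equivalent(generate(f(template, prompt)), direct)
        for template in SK_TEMPLATES
    )
```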
<!-- More precisely, if $\mathcal{P}'_{eq}(p')=\{s\in\mathcal{P}' | g_{p^*}(f(s,p))\sim g_{p^*}(f(p',p))\}$ (the set of all prompt templates that result in semantically equivalent completions by gold model $p^*$ as template $p'$ when substituted with any prompt $p$), then $p_\theta$ is self-invariant if, given any pair $(p',p)\in\mathcal{P}'\times\mathcal{P}$, $g_{p_\theta}(f(q_1,p))\sim g_{p_\theta}(f(q_2,p))\,\forall q_1,q_2\in\mathcal{P}'_{eq}(p')$.
- In chain-of-thought prompting, a language model must be able to provide a consistent answer to an intermediate step, even when other tokens unrelated to this step precede or follow it.
- A trivial example includes the case where $p'=``\_"$ (the identity prompt template). Then some elements in $\mathcal{P}'_{eq}(p')$ might include templates such as "Given the prompt \_, my next words would be" and "If asked \_, I would say". -->
### Compositional self-consistency
$p_\theta$ is **compositionally self-consistent** if it re-uses its outputs to intermediate prompts $p$ in downstream prompts that are composed with $p$. For example, if $g_{p_\theta}(``2*3")=``8"$, then $g_{p_\theta}(``2*3+4") \sim g_{p_\theta}(``8+4")$ for some $p_\theta$ that is compositionally self-consistent.
<!-- In practice, self-knowledge is a useful property because a model must store and re-use its own outputs to intermediate reasoning steps in downstream computations. We generally expect that a model's response to a compositional task makes use of its own responses to the relevant intermediate computations. Denoting $\mathcal{P}'\subset\mathcal{P}$ as the set of sequences containing a single "\_" character, we consider the function $f:\mathcal{P}'\times\mathcal{P}\to\mathcal{P}$ that substitutes prompts $p\in\mathcal{P}$ into prompt templates $p' \in\mathcal{P}'$ by replacing the "\_" character in $p'$ with $p$. For example, if prompt template $p'=\text{"_+2"}$ and $p=\text{"4"}$, then $f(p', p)=\text{"4+2"}$. -->
However, in practice we wish only to consider template-substitution pairs $(p',p)$ for which the composition $f(p',p)$ is *semantically meaningful* in some way. To formalize this constraint, we require that $g_{p^*}(f(p', p))\sim f(p', p)$. That is, the greedy output under the gold distribution $p^*$ for the prompt $f(p', p)$ is a string that is semantically equivalent to the original prompt $f(p', p)$ itself, and can therefore replace it in natural language expressions. For example, suppose $f(p',p)=``2*3"$ and $g_{p^*}(``2*3")=``6"$. Then this pair of $(p',p)$ fulfills the constraint because $``2*3"\sim``6"$.
We then say that a model $p_\theta$ is *compositionally self-consistent* if
$$
g_{p_\theta}(f(p', p)) \sim g_{p_\theta}(f(p', g_{p_\theta}(p))) \\
\forall p'\in\mathcal{P}', p\in\mathcal{P}\, \text{ such that } g_{p^* }(f(p', p))\sim f(p', p)
$$
That is, $p_\theta$'s greedy output for the composed prompt $f(p', p)$ is the same as its greedy output if the prompt $p$ is replaced with $p_\theta$'s greedy output for it, $g_{p_\theta}(p)$.
As a concrete example, if $p=``2*3"$, $p'=``\_+4"$, and $g_{p_\theta}(p)=``8"$, then $p_\theta$ is *compositionally self-consistent* only if $g_{p_\theta}(``2*3+4")\sim g_{p_\theta}(``8+4")$.
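The corresponding check for compositional self-consistency, again as an illustrative sketch with hypothetical `generate` and `semantically_equivalent` callables and the `f` substitution sketch from above:
```python
def is_compositionally_self_consistent(generate, semantically_equivalent,
                                       template: str, prompt: str) -> bool:
    """Check g(f(p', p)) ~ g(f(p', g(p))) for a single (p', p) pair."""
    intermediate = generate(prompt)                  # g_{p_theta}(p), e.g. "8" for "2*3"
    composed = generate(f(template, prompt))         # answer to f(p', p), e.g. "2*3+4"
    collapsed = generate(f(template, intermediate))  # answer to f(p', g(p)), e.g. "8+4"
    return semantically_equivalent(composed, collapsed)

# For the concrete example above:
# is_compositionally_self_consistent(generate, semantically_equivalent, "_+4", "2*3")
```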
<!-- ### Self-knowledge
Set of self-knowledge prompt templates: $\mathcal{P}'_{SK}=\{p'\in\mathcal{P}' \,|\, g_{p^*}(f(p',p))\sim g_{p^*}(p)\,\forall p\in\mathcal{P}\}$
Self-knowledge prompt templates also have the added property of being *identity templates* - i.e., $g_{p^*}(f(p',p))
\sim g_{p^*}(f(``\_",p))\,\forall p'\in\mathcal{P}'_{SK}$.
A concrete example of a $p'\in\mathcal{P}'_{SK}$ might be the template "Given the prompt _, my next words would be". Then for any prompt $p$ that we wish to substitute into $p'$, we expect that the gold distribution $p^*$'s completion for $p$ should be the same as its completion for $f(p',p)$.
We say that a model distribution $p_\theta$ has *self-knowledge* if
$$ \begin{align}
g_{p_\theta}(f(p',p)) &\sim g_{p_\theta}(p) \\
&\sim g_{p_\theta}(f(``\_",p)) \\
\forall\, p\in\mathcal{P} &,\, p'\in\mathcal{P}'_{SK}
\end{align}
$$
-->
## Synthetic arithmetic experimental set-up
* We randomly generate a set of 500 arithmetic DAGs with maximum height of 5, nesting probability of 0.5, and numeric operands having at most 3 digits.
* For every non-leaf node of each DAG, we prompt the model for the answer to the expression represented by the subtree rooted at that node. For the example above, the model is prompted with:
* Initial prompt 1: "$2*3$", model completion: "$8$"
* Initial prompt 2: "$16/4$", model completion: "$4$"
* Initial prompt 3: "$(2*3)+(16/4)$", model completion: "$10$"
* Using the model completions from the previous step, we **replace** the mathematical expressions of each subtree with the model's initial completion to generate new *self-knowledge* prompts:
* SK prompt 1: "$8+(16/4)$"
* Obtained by replacing the substring "(2*3)" in initial prompt 3 with the model's response to initial prompt 1.
* Expected completion: "10" (i.e. the same as the model's completion for initial prompt 3)
* SK prompt 2: "$(2*3)+4$"
* Obtained by replacing the substring "(16/4)" in initial prompt 3 with the model's response to initial prompt 2.
* Expected completion: "10" (i.e. the same as the model's completion for initial prompt 3)
* In theory, a model with *self-knowledge* should output the same response to initial prompt 3, SK prompt 1, and SK prompt 2.
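A minimal sketch of the substitution step described above; the `generate` callable and the plain string replacement of parenthesized subtrees are illustrative simplifications, not the actual experiment code.
```python
from typing import Callable, List

def build_sk_prompts(generate: Callable[[str], str],
                     root_expr: str,
                     subtree_exprs: List[str]) -> List[str]:
    """Replace each parenthesized subtree in the root prompt with the model's
    completion for that subtree, yielding one self-knowledge prompt per subtree."""
    sk_prompts = []
    for sub in subtree_exprs:
        completion = generate(sub)  # e.g. "8" for "2*3" in the example above
        sk_prompts.append(root_expr.replace(f"({sub})", completion, 1))
    return sk_prompts

# For the worked example (with the model's hypothetical completions):
# build_sk_prompts(generate, "(2*3)+(16/4)", ["2*3", "16/4"])
# -> ["8+(16/4)", "(2*3)+4"]
```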
## Initial results (synthetic arithmetic)

- All model sizes do poorly on the self-consistency task (i.e., whether the answer to the self-knowledge prompt matches the answer to the equivalent un-collapsed initial arithmetic prompt)
- Even `davinci-002` performs at barely above 15\% accuracy on this task.
- `davinci-002` outputs significantly more arithmetically correct answers on the self-knowledge prompts than the initial prompts.

- Depth of the arithmetic expression does not significantly affect the rate at which each model is self-consistent (except maybe for depth = 3??)
## Can language models predict their own completions?
### Synthetic arithmetic

- Errors propagate significantly less for `davinci-002` than for smaller model sizes.
- `davinci-002` is also the only model size for which accuracy increases non-trivially when a sub-tree is replaced with the model's original completion for that sub-tree.
## Semantic parsing experimental set-up
- Fan out GeoQuery FunQL expressions into all possible subtrees.
- Initial question: "what state has the largest city ?", initial gold program: `answer ( state ( loc_1 ( largest ( city ( all ) ) ) ) )`
- Sub-question 1: "state has the largest city ?"
- Gold sub-program 1: `state ( loc_1 ( largest ( city ( all ) ) ) )`
- Sub-question 2: "has the largest city ?"
- Gold sub-program 2: `loc_1 ( largest ( city ( all ) ) )`
- Sub-question 3: "largest city ?"
- Gold sub-program 3: `largest ( city ( all ) )`
- Sub-question 4: "city"
- Gold sub-program 4: `city ( all )`
- Generate model responses for all initial question-program pairs, as well as all sub-question-program pairs.
- Evaluate whether the model's output programs for each sub-question are sub-programs of the model's outputs for the corresponding parent questions.
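Below is a minimal sketch of this evaluation, under the simplifying assumption that "is a sub-program of" can be checked as whitespace-normalized substring containment of the FunQL strings; `model_programs` (a mapping from questions to predicted parses) and the pair list are hypothetical.
```python
from typing import Dict, List, Tuple

def normalize(funql: str) -> str:
    """Collapse whitespace so formatting differences do not affect matching."""
    return " ".join(funql.split())

def is_subprogram(sub: str, parent: str) -> bool:
    """Simplifying assumption: sub-program == substring of the parent parse."""
    return normalize(sub) in normalize(parent)

def self_consistency_rate(model_programs: Dict[str, str],
                          pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (sub-question, parent-question) pairs where the predicted
    sub-program appears inside the predicted parse for the parent question."""
    hits = sum(is_subprogram(model_programs[sub_q], model_programs[parent_q])
               for sub_q, parent_q in pairs)
    return hits / len(pairs)
```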
## Initial semantic parsing results

- Even though greater model size and more in-context examples lead to higher self-consistency rates, all self-consistency rates remain low (<20\%). As in the synthetic arithmetic task, all models demonstrate a poor ability to incorporate self-knowledge compositionally in this task.
## Lingering questions
- Why are language models so bad at both predicting their own completions and at using this self-knowledge compositionally?