In this post we explore how interactive proofs, a concept of interest in theoretical computer science and cryptography, can be useful for testing the capabilities of language models. We consider how a setup composed of multiple language models could be used in place of prompt tuning for this purpose. Our goal is to understand the role of knowledge in language models: can they be viewed (at least by one another) as possessing knowledge? We use tools from interactive proofs in theoretical computer science to try to formalize this and to provide a mathematical lens on the question, "do large language models actually possess knowledge?"
There’s a lot hidden in the word “knowledge,” as epistemologists (i.e. those who study knowledge within philosophy) will be quick to inform you. One of the earliest epistemologists in the West, Plato, thought that all knowledge was recollection – a matter of drawing out what one's immortal soul has always known but unfortunately forgotten. This (bizarre) claim is given some support in a famous scene in the Meno, where an enslaved boy manages to prove the Pythagorean theorem on the basis of Socrates' keen questioning. In the last few centuries, epistemologists have realized that defining knowledge isn't so easy, and an exception-less set of necessary and sufficient conditions remains elusive. Although much of the debate has moved away from quibbling over competing sets of necessary and sufficient conditions, the question of whether knowledge is even open to conceptual analysis remains unsettled.
The natural pessimistic inference one might draw from philosophy’s paradoxes and impossibility theorems is that LLMs cannot possess knowledge either, since the very concept resists precise definition.
Theoretical computer science has long been enamored with the concept of interactive proofs. These proofs allow one entity (a prover) to convince another entity (a verifier) that it has knowledge of a particular computation without revealing that knowledge directly. Interactive proofs can be viewed as a modern version of the Socratic method: the veracity of a statement is assessed through a rigorous back-and-forth in which the verifier (the teacher) questions the prover (analogous to a Socratic student). The verifier is satisfied if the prover can answer sufficiently many questions correctly.
How does the verifier choose how many questions count as “sufficiently many”? The goal of an interactive proof system is to construct a set of random challenges such that an honest prover who really has the claimed knowledge answers essentially all of them correctly, while a dishonest prover can only do so with negligible probability; “sufficiently many” is however many challenges it takes to drive that cheating probability below an acceptable threshold.
Interactive proofs have the prover “prove” knowledge of an algorithm by answering a sequence of random input instances constructed by the verifier. Suppose that the prover wins $1 for every query they answer correctly and loses $1 for every query they answer incorrectly. The randomness of the inputs provided by the verifier makes “guessing” answers an unprofitable, high-variance strategy. Moreover, certain types of interactive proofs have a property known as knowledge soundness, which captures the idea that if a prover can generate a valid proof, then there is a way of reverse engineering (in polynomial time) their algorithmic details.
You might find it counterintuitive that cryptographic systems, such as zero-knowledge proofs, depend on knowledge soundness, as it appears to risk exposing the prover's private information. However, this concept is really about ensuring the prover actually has knowledge, as opposed to a merely duplicated result/answer/data. For instance, consider a straightforward scenario involving two provers: one who holds a large lookup table of memorized input–output pairs, and one who actually knows the underlying algorithm. On random challenges drawn from a large input space, the lookup-table prover will eventually be caught out, while the prover who knows the algorithm can answer every challenge.
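To make this concrete, here is a minimal simulation of that scenario; the "secret algorithm," the two provers, and the scoring are all illustrative choices of ours rather than components of any real proof system. The verifier draws random challenges and applies the $1 win/loss scoring from above to a prover who knows the algorithm and to one who only memorized a lookup table.

```python
import random

# Toy stand-in for the "secret algorithm" the prover claims to know.
def secret_algorithm(x: int) -> int:
    return x**3 - 2 * x + 7

# Prover A actually knows the algorithm and can answer any challenge.
def algorithmic_prover(x: int) -> int:
    return secret_algorithm(x)

# Prover B only memorized a lookup table of input-output pairs for small inputs.
LOOKUP_TABLE = {x: secret_algorithm(x) for x in range(100)}

def lookup_table_prover(x: int) -> int:
    # Outside the memorized range, the best it can do is guess.
    return LOOKUP_TABLE.get(x, random.randint(-10**6, 10**6))

def run_interactive_test(prover, num_challenges: int = 50, seed: int = 0) -> int:
    """Verifier draws random challenges; the prover wins $1 per correct answer, loses $1 otherwise."""
    rng = random.Random(seed)
    balance = 0
    for _ in range(num_challenges):
        x = rng.randint(-10**4, 10**4)  # random instance chosen by the verifier
        balance += 1 if prover(x) == secret_algorithm(x) else -1
    return balance

print("algorithmic prover: ", run_interactive_test(algorithmic_prover))   # always +50
print("lookup-table prover:", run_interactive_test(lookup_table_prover))  # close to -50
```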
With a foundational understanding of knowledge in interactive proof systems, let's explore what knowledge we can extract about transformer-based LLMs. Key questions include: can an LLM be said to "possess" knowledge in the interactive-proof sense, and can one LLM convince another that it does?
To address these questions, we first need to examine how language models perform on specific tasks. Current approaches for evaluating the capabilities of language models include the BIG-Bench suite (Srivastava et al., 2022), which comprises over 200 curated tasks designed to challenge language models. One task, called Object Counting, requires models to identify and count items, as shown in the example below:
Q: I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape,
a nectarine, and three plums. How many fruits do I have?
Asking this question directly may produce inconsistent results, depending on the model. However, recent work on prompting, such as the chain-of-thought method (Wei et al., 2022), has demonstrated that breaking the question down into smaller tasks can significantly improve model performance. This method provides the model with examples of similar problems and a reasoning process to work through each one, raising questions about the nature of "knowledge" and "capabilities" in the context of LLMs.
Here's an example of chain-of-thought applied to the Object Counting task from Suzgun et al., 2022:
Q: I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape, a nectarine, and three plums. How many fruits do I have?
A: Let's think step by step.
We first identify the fruits on the list and include their quantity in parentheses:
- orange (1)
- raspberry (1)
- peaches (2)
- blackberry (1)
- apple (1)
- grape (1)
- nectarine (1)
- plums (3)
Now, let's add the numbers in parentheses: 1 + 1 + 2 + 1 + 1 + 1 + 1 + 3 = 11. So the answer is 11.
This approach underscores the importance of understanding how LLMs "acquire knowledge" and how this relates to knowledge soundness, at least heuristically.
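To make the two-step structure of that chain of thought explicit (identify each item with its quantity, then sum the quantities), here is a deterministic sketch of the same decomposition; the parsing rules are simplified assumptions of ours, not part of BIG-Bench or of any model.

```python
import re

# Map the number words used in the task to integers; "a"/"an" count as one.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def count_objects(question: str) -> int:
    """Mirror the chain of thought: (1) identify each item with its quantity, (2) add the quantities."""
    listing = re.search(r"I have (.*?)\. How many", question, flags=re.S).group(1)
    total = 0
    for phrase in re.split(r",\s*(?:and\s+)?|\s+and\s+", listing):
        quantity_word = phrase.strip().split()[0].lower()  # the leading word carries the quantity
        total += NUMBER_WORDS.get(quantity_word, 0)
    return total

question = ("Q: I have an orange, a raspberry, two peaches, a blackberry, an apple, "
            "a grape, a nectarine, and three plums. How many fruits do I have?")
print(count_objects(question))  # 11
```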
Effectively, training an LLM can be viewed as a machine that eats a corpus (or corpora, really) of text[1] and an architecture, and then outputs a set of parameters for a model. These parameters carry some notion of “recall and/or recollection of previous tokens/text,” “precision,” and “attention,” but we don’t really need to get into those details. Instead, we should try to understand how an LLM is trained and fine-tuned:
This is a gross oversimplification, but it should be enough for our discussion.
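For intuition only, here is a toy rendering of "corpus plus architecture in, parameters out": the "architecture" is a character bigram model rather than a transformer, and the "fine-tuning" step is a crude blend of old and new parameters, so every detail below is an illustrative stand-in rather than a description of how real LLM training works.

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus: str):
    """Toy 'training': count character bigrams, then normalize the counts into parameters."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    # The conditional next-character probabilities are this model's "parameters".
    return {prev: {c: n / sum(nxt.values()) for c, n in nxt.items()}
            for prev, nxt in counts.items()}

def fine_tune(params, new_corpus: str, weight: float = 0.5):
    """Crude 'fine-tuning': blend the pretrained parameters with parameters fit on new data."""
    new_params = train_bigram_lm(new_corpus)
    merged = {}
    for prev in set(params) | set(new_params):
        chars = set(params.get(prev, {})) | set(new_params.get(prev, {}))
        merged[prev] = {c: (1 - weight) * params.get(prev, {}).get(c, 0.0)
                           + weight * new_params.get(prev, {}).get(c, 0.0)
                        for c in chars}
    return merged

pretrained = train_bigram_lm("i have an orange and an apple and three plums")
tuned = fine_tune(pretrained, "count the fruits then add the counts")
print(sorted(tuned["a"].items(), key=lambda kv: -kv[1])[:3])  # most likely characters after 'a'
```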
Over the course of this training process, the LLM learns some form of a prior distribution over particular sequences of words that represent a query, and is able to associate other, perhaps disparate, sequences with this query. In a sense, it possesses “knowledge” of the queries that it has seen — it knows how to prove that it owns the "witness" in a manner that is utterly convincing.[2]
But for “off-policy” queries (queries not in some “little” Sobolev ball around the training data), it instead relies on the geometry of the set it is trained on. This is almost a nearest-neighbor type of interpolation, except that you need to be guided to a neighborhood. The sequence of queries (versus the batch) acts as a guide that can’t be extracted from “one big blob” of text. For instance, if much of the corpora talks about counting objects, then queries that interlace “identify object type” with “count the number of objects in a type” are much more likely to be answerable.
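As a cartoon of this "guided to a neighborhood" picture, one can imagine a query being routed to the closest patch of training text; the bag-of-words similarity below is our own simplification and stands in for whatever representation a real model actually learns.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real LLMs use learned representations, of course."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# A miniature "training corpus" of solved query patterns.
corpus = [
    "identify the object types in the list",
    "count the number of objects of each type",
    "add the counts to get the total number of objects",
]

def nearest_neighborhood(query: str) -> str:
    """Route a query to its closest training neighborhood, as in the guided-interpolation picture."""
    return max(corpus, key=lambda doc: cosine(embed(query), embed(doc)))

print(nearest_neighborhood("how many objects of each type are there"))
```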
Mathematically, one can view this as follows: the LLM learns a function $f_\theta$ that maps a context (a sequence of tokens) to a distribution over completions. For queries close to the training distribution, $f_\theta$ concentrates its mass on a small set of outputs; for queries far from it, the model effectively interpolates from the nearest training neighborhoods.
So what does this have to do with knowledge soundness? In some sense, an LLM is in “possession” of knowledge of how to reply to queries that lie close to its training distribution: it can consistently reproduce convincing answers to them, much as a prover who holds a witness can consistently answer a verifier's challenges.
But what about our earlier fruit-filled example? This is really a sequence of queries: first identify the object types, then count the objects of each type, then add up the counts. Chain-of-Thought prompting supplies exactly this intermediate structure, guiding the model from one sub-query to the next.
We've seen that Chain-of-Thought prompting can make LLMs appear to be knowledge sound. Furthermore, we've noted that the geometry of the queries involved matters to both LLMs and ZKPs. Is there a way for us to prove, analogously to a ZKP, that LLMs can achieve knowledge soundness?
Definitionally, this seems quite difficult. In order to create such a proof, we need to figure out how to solve an optimization problem over extractors: we would have to exhibit a probabilistic polynomial-time extractor that, given oracle access to the LLM, recovers a "witness" whenever the model answers convincingly, and it isn't even clear what the witness or the underlying relation should be for a language model.
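For reference, knowledge soundness is usually phrased as the existence of such an extractor, roughly as follows (this is the standard template from the interactive-proof literature, e.g. Thaler's notes cited below, not something specific to LLMs); the open question is what the relation $R$ and witness $w$ should even mean for a language model.

\[
\forall \,\text{PPT provers } \mathcal{P}^{*} \;\; \exists \,\text{PPT extractor } \mathcal{E} \;\; \forall x:\quad
\Pr\!\left[\langle \mathcal{P}^{*}, \mathcal{V} \rangle(x) = \mathsf{accept}\right] \ge \epsilon
\;\Longrightarrow\;
\Pr\!\left[w \leftarrow \mathcal{E}^{\mathcal{P}^{*}}(x) \,:\, (x, w) \in R\right] \ge \epsilon - \mathsf{negl}(|x|).
\]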
However, suppose that we restrict ourselves to provers and verifiers that are both transformer-based LLMs. Can we figure out whether we achieve knowledge soundness empirically? One way of viewing such tests for knowledge soundness is as a sort of 'adversarial network' of two LLMs, one of which challenges the other to prove that certain facts are true. Instead of using polynomial commitments to force the prover to lock in the claims it makes, we instead have an adversarial prover LLM that provides a potentially malicious Chain-of-Thought context to a verifier LLM, which has to answer logical queries like the fruit counting task.
Such empirical tests can be used to check whether one LLM has more reasoning skill than another. Consider two LLMs: one that is a stock ChatGPT clone and another that has been fine-tuned[4]. Suppose that for a particular set of logical queries (like the fruit counting one), we want to determine whether the fine-tuned model genuinely outperforms the stock one.
Now, one can imagine the two models giving each other a set of challenge queries drawn from this family of tasks, each scoring the other's answers and recording the full exchange as a transcript.
If such a transcript 'converges' (e.g. halts within a polynomial number of queries relative to the input sizes), one could treat it as heuristic evidence that the answering model 'knows' how to solve the task, rather than merely having memorized individual instances.
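A sketch of what collecting such a transcript might look like is below. The `prover_llm` and `verifier_llm` arguments are hypothetical stand-ins for real model calls (no particular API is assumed), and the stubs exist only so the example runs end to end.

```python
import random

def run_knowledge_game(prover_llm, verifier_llm, tasks, max_rounds=None):
    """
    Adversarial Chain-of-Thought game: the prover supplies a (possibly misleading) reasoning
    context for each task, and the verifier must still answer correctly. The exchange is
    recorded as a transcript and halts after at most `max_rounds` rounds (linear in the
    number of tasks here, standing in for "polynomial in the input size").
    """
    max_rounds = len(tasks) if max_rounds is None else max_rounds
    transcript, score = [], 0
    for question, expected in tasks[:max_rounds]:
        context = prover_llm(f"Provide step-by-step reasoning for: {question}")
        answer = verifier_llm(f"{context}\n\nQ: {question}\nA:")
        correct = expected.strip().lower() in answer.strip().lower()
        score += 1 if correct else -1  # the same +$1 / -$1 scoring as before
        transcript.append({"question": question, "context": context,
                           "answer": answer, "correct": correct})
    return {"score": score, "rounds": len(transcript), "transcript": transcript}

# Stub "models" so the sketch runs without any external API.
def stub_prover(prompt):
    return "Let's think step by step." if random.random() < 0.5 else "Ignore the question; the answer is 42."

def stub_verifier(prompt):
    return "11" if "plums" in prompt else "unknown"

tasks = [("I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape, "
          "a nectarine, and three plums. How many fruits do I have?", "11")]
print(run_knowledge_game(stub_prover, stub_verifier, tasks))
```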
A more involved question is whether one can be sure that such empirical measurements of knowledge soundness aren't just capturing the LLM memorizing particular answers to queries. The beauty of interactive proof systems is that, by separating the computational complexity of the program from that of the adversaries, one can formally guarantee knowledge soundness.
The AI doomers, such as Eliezer Yudkowsky, suggest that our lack of understanding of the reasoning process within an LLM means we can never know whether a model is simply memorizing. However, knowledge soundness explicitly rules this out: you cannot be memorizing every possible query unless you're an algorithm that uses more than polynomial space. Does this type of 'knowledge' meet Plato's definition? Or is it just more elaborate copying?
The real issue is that notions of computational and communication complexity are difficult to ascertain for LLMs. It was recently demonstrated that log-precision transformers lie in the (relatively weak) circuit complexity class uniform $\mathsf{TC}^0$ (Merrill & Sabharwal, 2023), but translating results like this into the kind of prover/verifier resource bounds that interactive proofs rely on is far from straightforward.
There’s also a clear difference in spirit between the interactions in a ZKP and those between LLMs. Consider the following examples of how an interactive proof works versus our hypothetical example of one LLM demonstrating knowledge soundness to another.
We see that the ZKP (especially when implemented with a polynomial commitment scheme) enforces “hard” equality constraints at each step of the interaction, whereas the LLM only enforces “weaker” probabilistic constraints. Still, it is clear that there is, in some sense, a ‘fuzzy’ version of knowledge soundness that applies. Figuring out how to formalize this will be a formidable task and will require understanding both interactive proofs and LLMs in high resolution.
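The contrast can be made concrete with a toy pair of verifier checks; both functions below are purely illustrative and not drawn from any real proof system or benchmark.

```python
def hard_verifier(claimed_evaluations, reference_evaluations):
    """ZKP-style check: every claimed value must match exactly, or the proof is rejected outright."""
    return all(c == r for c, r in zip(claimed_evaluations, reference_evaluations))

def fuzzy_verifier(answers, expected, threshold=0.9):
    """LLM-style check: accept if a large enough fraction of sampled answers agree with the target."""
    agreement = sum(a == e for a, e in zip(answers, expected)) / len(expected)
    return agreement >= threshold

print(hard_verifier([3, 5, 7], [3, 5, 8]))               # False: a single mismatch kills the proof
print(fuzzy_verifier(["11"] * 9 + ["10"], ["11"] * 10))  # True: 90% agreement clears the threshold
```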
In some sense, both Plato and Yudkowsky are correct. While these systems don't have unbounded notions of knowledge (as Plato imagined), it is clear that they have semblances of bounded notions of knowledge. In our minds, the more important task is to figure out what computationally bounded form of knowledge is acceptable to users, and then have LLMs construct the requisite interactive proofs for that form of knowledge. This would allow users to understand what an LLM can and can't do, much like a human's CV or resume allows for quick comparisons of skills and/or knowledge.
So why even juxtapose Chain-of-Thought prompting and interactive proof systems? There’s a naïve sense in which the two have exactly opposite goals: a ZKP aims to hide a query while proving that the answer is correct and corresponds to a real query the prover actually made. Chain-of-Thought prompting, on the other hand, aims to improve our understanding of the model's knowledge by providing a sequence of improved contexts such that the volume of the set of outputs shrinks quickly. But this idea of a “small” geometric set representing the knowledge gleaned from splitting a query into contextualized sub-queries is essentially the same as showing that the set of things extractable via a probabilistic polynomial-time extractor is “small.”
Our view is that an understanding of the sequential nature of both interactive proofs and Chain-of-Thought reasoning could lead to an improved understanding of LLMs for users. This is one interaction between cryptography and AI that we believe is deeply underserved and will (hopefully) be a fruitful area of collaboration between the two seemingly antipathetic fields in the years to come.
The authors would like to thank the following people for their comments and suggestions: Kshitij Kulkarni, Shannon B, Elena B, 0xEmperor, Henry de Valence, Dave White, Jon Charbonneau, Hart Lambur, Alex Evans, JD Maturen, Brandon Curtis, Eric Wall, Ben Fisch
Yann's argument only works if you assume the probability of error is independent for every token, which is very unlikely to be true. One explanation is that the training process correlates tokens at different length scales, and the attention mechanism is in fact learning to separate these length scales (what matters at the sentence, paragraph, and idea levels). ↩︎ ↩︎
A witness is the 'secret' information that a prover has and uses to generate a valid ZK argument of knowledge. Technically, the interaction between a prover and verifier aims to solve what is known as the Circuit-SAT problem. This problem involves a prover showing that something associated with a function (e.g. its execution trace or the state of the RAM it touches) is valid, instead of showing validity of the function itself. This was one of the big breakthroughs that led to the Moore's-Law-like progress in ZKPs that we've seen over the last decade, as the witness data can be significantly more compressed than the raw function. See Chapters 6 and 7 of Justin Thaler's book. ↩︎
The Fiat-Shamir transformation is a way to use a random oracle (such as a collision-resistant hash function) to convert an interactive protocol over many rounds into a non-interactive, 'single-shot' mechanism. It effectively says we can 'bundle' all of the interactions made into one round of communication – the opposite of the Chain-of-Thought prompting mechanism. See Chapter 5 of Justin Thaler's excellent notes on ZK proving systems for more details. ↩︎
Fine-tuning is the process of iteratively improving a large language model using a combination of reinforcement learning, new data sources, and modified constraints and/or objective functions. Fine-tuning is the reason GPT-4 is a lot more accurate than LLaMa, for instance. See this HuggingFace article for some details of how the process works. ↩︎