In this post we explore how interactive proofs, a concept of interest in theoretical computer science and cryptography, can be useful for testing the capabilities of language models. We consider how a setup composed of multiple language models could be used in place of prompt tuning for this purpose. Our goal is to understand the role of knowledge in language models: can they be viewed (at least by one another) as possessing knowledge? We use tools from interactive proofs in theoretical computer science to try to formalize this and to provide a mathematical lens on the question, "do large language models actually possess knowledge?"
There’s a lot hidden in the word “knowledge,” as epistemologists (i.e. those who study knowledge within philosophy) will be quick to inform you. One of the earliest epistemologists in the West, Plato, thought that all knowledge was recollection – a matter of drawing out what one's immortal soul has always known but unfortunately forgotten. This (bizarre) claim is given some support in a famous scene in the Meno, where an enslaved boy manages to prove the Pythagorean theorem on the basis of Socrates' keen questioning. In the last few centuries, epistemologists have realized that defining knowledge isn't so easy, and an exception-less set of necessary and sufficient conditions remains elusive. Although much of the debate has moved away from quibbling over competing sets of necessary and sufficient conditions, the question of whether knowledge is even open to conceptual analysis remains unsettled.
The natural pessimistic inference one might draw from philosophy’s paradoxes and impossibility theorems is that LLMs cannot possess knowledge either, since the very concept resists precise definition.
Theoretical computer science has long been enamored with the concept of interactive proofs. These proofs allow one entity (a prover) to convince another entity (a verifier) that it has knowledge of a particular computation without revealing that knowledge directly. Interactive proofs can be viewed as a modern version of the Socratic method: the veracity of a statement is assessed through a rigorous back-and-forth in which the verifier (the teacher) questions the prover (analogous to a Socratic student). The verifier is satisfied if the prover can answer sufficiently many questions correctly.
How does the verifier choose how many questions count as “sufficiently many”? The goal of an interactive proof system is to construct a set of random challenges such that an honest prover who really has the claimed knowledge answers essentially all of them correctly, while a dishonest prover can only do so with negligible probability; “sufficiently many” is however many challenges it takes to drive that cheating probability below an acceptable threshold.
Interactive proofs have the prover “prove” knowledge of an algorithm by answering a sequence of random input instances constructed by the verifier. Suppose that the prover wins $1 for every query they answer correctly and loses $1 for every query they answer incorrectly. The randomness of the inputs provided by the verifier makes “guessing” answers an unprofitable, high-variance strategy. Moreover, certain types of interactive proofs have a property known as knowledge soundness, which captures the idea that if a prover can generate a valid proof, then there is a way of reverse engineering (in polynomial time) their algorithmic details.
You might find it counterintuitive that cryptographic systems, such as zero-knowledge proofs, depend on knowledge soundness, as it appears to risk exposing the prover's private information. However, this concept is really about ensuring the prover actually has knowledge, as opposed to a merely duplicated result/answer/data. For instance, consider a straightforward scenario involving two provers: one who holds a large lookup table of memorized input–output pairs, and one who actually knows the underlying algorithm. On random challenges drawn from a large input space, the lookup-table prover will eventually be caught out, while the prover who knows the algorithm can answer every challenge.
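To make this concrete, here is a minimal simulation of that scenario; the "secret algorithm," the two provers, and the scoring are all illustrative choices of ours rather than components of any real proof system. The verifier draws random challenges and applies the $1 win/loss scoring from above to a prover who knows the algorithm and to one who only memorized a lookup table.

```python
import random

# Toy stand-in for the "secret algorithm" the prover claims to know.
def secret_algorithm(x: int) -> int:
    return x**3 - 2 * x + 7

# Prover A actually knows the algorithm and can answer any challenge.
def algorithmic_prover(x: int) -> int:
    return secret_algorithm(x)

# Prover B only memorized a lookup table of input-output pairs for small inputs.
LOOKUP_TABLE = {x: secret_algorithm(x) for x in range(100)}

def lookup_table_prover(x: int) -> int:
    # Outside the memorized range, the best it can do is guess.
    return LOOKUP_TABLE.get(x, random.randint(-10**6, 10**6))

def run_interactive_test(prover, num_challenges: int = 50, seed: int = 0) -> int:
    """Verifier draws random challenges; the prover wins $1 per correct answer, loses $1 otherwise."""
    rng = random.Random(seed)
    balance = 0
    for _ in range(num_challenges):
        x = rng.randint(-10**4, 10**4)  # random instance chosen by the verifier
        balance += 1 if prover(x) == secret_algorithm(x) else -1
    return balance

print("algorithmic prover: ", run_interactive_test(algorithmic_prover))   # always +50
print("lookup-table prover:", run_interactive_test(lookup_table_prover))  # close to -50
```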
With a foundational understanding of knowledge in interactive proof systems, let's explore what knowledge we can extract about transformer-based LLMs. Key questions include: can an LLM be said to "possess" knowledge in the interactive-proof sense, and can one LLM convince another that it does?
To address these questions, we first need to examine how language models perform on specific tasks. Current approaches for evaluating the capabilities of language models include the BIG-Bench suite (Srivastava et al., 2022), which comprises over 200 curated tasks designed to challenge language models. One task, called Object Counting, requires models to identify and count items, as shown in the example below:
Q: I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape,
a nectarine, and three plums. How many fruits do I have?
Asking this question directly may produce inconsistent results, depending on the model. However, recent work on prompting, such as the chain-of-thought method (Wei et al., 2022), has demonstrated that breaking the question down into smaller tasks can significantly improve model performance. This method provides the model with examples of similar problems and a reasoning process to work through each one, raising questions about the nature of "knowledge" and "capabilities" in the context of LLMs.
Here's an example of chain-of-thought applied to the Object Counting task from Suzgun et al., 2022:
Q: I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape, a nectarine, and three plums. How many fruits do I have?
A: Let's think step by step.
We first identify the fruits on the list and include their quantity in parentheses:
- orange (1)
- raspberry (1)
- peaches (2)
- blackberry (1)
- apple (1)
- grape (1)
- nectarine (1)
- plums (3)
Now, let's add the numbers in parentheses: 1 + 1 + 2 + 1 + 1 + 1 + 1 + 3 = 11. So the answer is 11.
This approach underscores the importance of understanding how LLMs "acquire knowledge" and how this relates to knowledge soundness, at least heuristically.
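To make the two-step structure of that chain of thought explicit (identify each item with its quantity, then sum the quantities), here is a deterministic sketch of the same decomposition; the parsing rules are simplified assumptions of ours, not part of BIG-Bench or of any model.

```python
import re

# Map the number words used in the task to integers; "a"/"an" count as one.
NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5}

def count_objects(question: str) -> int:
    """Mirror the chain of thought: (1) identify each item with its quantity, (2) add the quantities."""
    listing = re.search(r"I have (.*?)\. How many", question, flags=re.S).group(1)
    total = 0
    for phrase in re.split(r",\s*(?:and\s+)?|\s+and\s+", listing):
        quantity_word = phrase.strip().split()[0].lower()  # the leading word carries the quantity
        total += NUMBER_WORDS.get(quantity_word, 0)
    return total

question = ("Q: I have an orange, a raspberry, two peaches, a blackberry, an apple, "
            "a grape, a nectarine, and three plums. How many fruits do I have?")
print(count_objects(question))  # 11
```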
Effectively, training an LLM can be viewed as a machine that eats a corpus (or corpora, really) of text[1] and an architecture, and then outputs a set of parameters for a model. These parameters carry some notion of “recall and/or recollection of previous tokens/text,” “precision,” and “attention,” but we don’t really need to get into those details. Instead, we should try to understand how an LLM is trained and fine-tuned:
This is a gross oversimplification, but it should be enough for our discussion.
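For intuition only, here is a toy rendering of "corpus plus architecture in, parameters out": the "architecture" is a character bigram model rather than a transformer, and the "fine-tuning" step is a crude blend of old and new parameters, so every detail below is an illustrative stand-in rather than a description of how real LLM training works.

```python
from collections import Counter, defaultdict

def train_bigram_lm(corpus: str):
    """Toy 'training': count character bigrams, then normalize the counts into parameters."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    # The conditional next-character probabilities are this model's "parameters".
    return {prev: {c: n / sum(nxt.values()) for c, n in nxt.items()}
            for prev, nxt in counts.items()}

def fine_tune(params, new_corpus: str, weight: float = 0.5):
    """Crude 'fine-tuning': blend the pretrained parameters with parameters fit on new data."""
    new_params = train_bigram_lm(new_corpus)
    merged = {}
    for prev in set(params) | set(new_params):
        chars = set(params.get(prev, {})) | set(new_params.get(prev, {}))
        merged[prev] = {c: (1 - weight) * params.get(prev, {}).get(c, 0.0)
                           + weight * new_params.get(prev, {}).get(c, 0.0)
                        for c in chars}
    return merged

pretrained = train_bigram_lm("i have an orange and an apple and three plums")
tuned = fine_tune(pretrained, "count the fruits then add the counts")
print(sorted(tuned["a"].items(), key=lambda kv: -kv[1])[:3])  # most likely characters after 'a'
```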
Over the course of this training process, the LLM learns some form of a prior distribution over particular sequences of words that represent a query, and is able to associate other, perhaps disparate, sequences with this query. In a sense, it possesses “knowledge” of the queries that it has seen — it knows how to prove that it owns the "witness" in a manner that is utterly convincing.[2]
But for “off-policy” queries (queries not in some “little” Sobolev ball around the training data), it instead relies on the geometry of the set it is trained on. This is almost a nearest-neighbor type of interpolation, except that you need to be guided to a neighborhood. The sequence of queries (versus the batch) acts as a guide that can’t be extracted from “one big blob” of text. For instance, if much of the corpora talks about counting objects, then queries that interlace “identify object type” with “count the number of objects in a type” are much more likely to be answerable.
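As a cartoon of this "guided to a neighborhood" picture, one can imagine a query being routed to the closest patch of training text; the bag-of-words similarity below is our own simplification and stands in for whatever representation a real model actually learns.

```python
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; real LLMs use learned representations, of course."""
    return Counter(text.lower().split())

def cosine(u: Counter, v: Counter) -> float:
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# A miniature "training corpus" of solved query patterns.
corpus = [
    "identify the object types in the list",
    "count the number of objects of each type",
    "add the counts to get the total number of objects",
]

def nearest_neighborhood(query: str) -> str:
    """Route a query to its closest training neighborhood, as in the guided-interpolation picture."""
    return max(corpus, key=lambda doc: cosine(embed(query), embed(doc)))

print(nearest_neighborhood("how many objects of each type are there"))
```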
Mathematically, one can view this as follows: the LLM learns a function $f_\theta$ that maps a context (a sequence of tokens) to a distribution over completions. For queries close to the training distribution, $f_\theta$ concentrates its mass on a small set of outputs; for queries far from it, the model effectively interpolates from the nearest training neighborhoods.
So what does this have to do with knowledge soundness? In some sense, an LLM is in “possession” of knowledge of how to reply to queries that lie close to its training distribution: it can consistently reproduce convincing answers to them, much as a prover who holds a witness can consistently answer a verifier's challenges.
But what about our earlier fruit-filled example? This is really a sequence of queries: first identify the object types, then count the objects of each type, then add up the counts. Chain-of-Thought prompting supplies exactly this intermediate structure, guiding the model from one sub-query to the next.
We've seen that Chain-of-Thought prompting can make LLMs appear to be knowledge sound. Furthermore, we've noted that the geometry of the queries involved matters to both LLMs and ZKPs. Is there a way for us to prove, analogously to a ZKP, that LLMs can achieve knowledge soundness?
Definitionally, this seems quite difficult. In order to create such a proof, we need to figure out how to solve an optimization problem over extractors: we would have to exhibit a probabilistic polynomial-time extractor that, given oracle access to the LLM, recovers a "witness" whenever the model answers convincingly, and it isn't even clear what the witness or the underlying relation should be for a language model.
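For reference, knowledge soundness is usually phrased as the existence of such an extractor, roughly as follows (this is the standard template from the interactive-proof literature, e.g. Thaler's notes cited below, not something specific to LLMs); the open question is what the relation $R$ and witness $w$ should even mean for a language model.

\[
\forall \,\text{PPT provers } \mathcal{P}^{*} \;\; \exists \,\text{PPT extractor } \mathcal{E} \;\; \forall x:\quad
\Pr\!\left[\langle \mathcal{P}^{*}, \mathcal{V} \rangle(x) = \mathsf{accept}\right] \ge \epsilon
\;\Longrightarrow\;
\Pr\!\left[w \leftarrow \mathcal{E}^{\mathcal{P}^{*}}(x) \,:\, (x, w) \in R\right] \ge \epsilon - \mathsf{negl}(|x|).
\]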
However, suppose that we restrict ourselves to provers and verifiers that are both transformer-based LLMs. Can we figure out whether we achieve knowledge soundness empirically? One way of viewing such tests for knowledge soundness is as a sort of 'adversarial network' of two LLMs, one of which challenges the other to prove that certain facts are true. Instead of using polynomial commitments to force the prover to lock in the claims it makes, we instead have an adversarial prover LLM that provides a potentially malicious Chain-of-Thought context to a verifier LLM, which has to answer logical queries like the fruit counting task.
Such empirical tests can be used to check whether one LLM has more reasoning skill than another. Consider two LLMs: one that is a stock ChatGPT clone and another that has been fine-tuned[4]. Suppose that for a particular set of logical queries (like the fruit counting one), we want to determine whether the fine-tuned model genuinely outperforms the stock one.
Now, one can imagine the two models giving each other a set of challenge queries drawn from this family of tasks, each scoring the other's answers and recording the full exchange as a transcript.
If such a transcript 'converges' (e.g. halts within a polynomial number of queries relative to the input sizes), one could treat it as heuristic evidence that the answering model 'knows' how to solve the task, rather than merely having memorized individual instances.
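A sketch of what collecting such a transcript might look like is below. The `prover_llm` and `verifier_llm` arguments are hypothetical stand-ins for real model calls (no particular API is assumed), and the stubs exist only so the example runs end to end.

```python
import random

def run_knowledge_game(prover_llm, verifier_llm, tasks, max_rounds=None):
    """
    Adversarial Chain-of-Thought game: the prover supplies a (possibly misleading) reasoning
    context for each task, and the verifier must still answer correctly. The exchange is
    recorded as a transcript and halts after at most `max_rounds` rounds (linear in the
    number of tasks here, standing in for "polynomial in the input size").
    """
    max_rounds = len(tasks) if max_rounds is None else max_rounds
    transcript, score = [], 0
    for question, expected in tasks[:max_rounds]:
        context = prover_llm(f"Provide step-by-step reasoning for: {question}")
        answer = verifier_llm(f"{context}\n\nQ: {question}\nA:")
        correct = expected.strip().lower() in answer.strip().lower()
        score += 1 if correct else -1  # the same +$1 / -$1 scoring as before
        transcript.append({"question": question, "context": context,
                           "answer": answer, "correct": correct})
    return {"score": score, "rounds": len(transcript), "transcript": transcript}

# Stub "models" so the sketch runs without any external API.
def stub_prover(prompt):
    return "Let's think step by step." if random.random() < 0.5 else "Ignore the question; the answer is 42."

def stub_verifier(prompt):
    return "11" if "plums" in prompt else "unknown"

tasks = [("I have an orange, a raspberry, two peaches, a blackberry, an apple, a grape, "
          "a nectarine, and three plums. How many fruits do I have?", "11")]
print(run_knowledge_game(stub_prover, stub_verifier, tasks))
```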
A more involved question is whether one can be sure that such empirical measurements of knowledge soundness aren't just capturing the LLM memorizing particular answers to queries. The beauty of interactive proof systems is that, by separating the computational complexity of the program from that of the adversaries, one can formally guarantee knowledge soundness.
The AI doomers, such as Eliezer Yudkowsky, suggest that our lack of understanding of the reasoning process within an LLM means we can never know whether a model is simply memorizing. However, knowledge soundness explicitly rules this out: you cannot be memorizing every possible query unless you're an algorithm that uses more than polynomial space. Does this type of 'knowledge' meet Plato's definition? Or is it just more elaborate copying?
The real issue is that notions of computational and communication complexity are difficult to ascertain for LLMs. It was recently demonstrated that log-precision transformers lie in the (relatively weak) circuit complexity class uniform $\mathsf{TC}^0$ (Merrill & Sabharwal, 2023), but translating results like this into the kind of prover/verifier resource bounds that interactive proofs rely on is far from straightforward.
There’s also a clear difference in spirit between the interactions in a ZKP and those between LLMs. Consider the following examples of how an interactive proof works versus our hypothetical example of one LLM demonstrating knowledge soundness to another.
We see that the ZKP (especially when implemented with a polynomial commitment scheme) enforces “hard” equality constraints at each step of the interaction, whereas the LLM only enforces “weaker” probabilistic constraints. Still, it is clear that there is, in some sense, a ‘fuzzy’ version of knowledge soundness that applies. Figuring out how to formalize this will be a formidable task and will require understanding both interactive proofs and LLMs in high resolution.
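The contrast can be made concrete with a toy pair of verifier checks; both functions below are purely illustrative and not drawn from any real proof system or benchmark.

```python
def hard_verifier(claimed_evaluations, reference_evaluations):
    """ZKP-style check: every claimed value must match exactly, or the proof is rejected outright."""
    return all(c == r for c, r in zip(claimed_evaluations, reference_evaluations))

def fuzzy_verifier(answers, expected, threshold=0.9):
    """LLM-style check: accept if a large enough fraction of sampled answers agree with the target."""
    agreement = sum(a == e for a, e in zip(answers, expected)) / len(expected)
    return agreement >= threshold

print(hard_verifier([3, 5, 7], [3, 5, 8]))               # False: a single mismatch kills the proof
print(fuzzy_verifier(["11"] * 9 + ["10"], ["11"] * 10))  # True: 90% agreement clears the threshold
```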
In some sense, both Plato and Yudkowsky are correct. While these systems don't have unbounded notions of knowledge (as Plato imagined), it is clear that they have semblances of bounded notions of knowledge. In our minds, the more important task is to figure out what computationally bounded form of knowledge is acceptable to users, and then have LLMs construct the requisite interactive proofs for that form of knowledge. This would allow users to understand what an LLM can and can't do, much like a human's CV or resume allows for quick comparisons of skills and/or knowledge.
So why even juxtapose Chain-of-Thought prompting and interactive proof systems? There’s a naïve sense in which the two have exactly opposite goals: a ZKP aims to hide a query while proving that the answer is correct and corresponds to a real query the prover actually made. Chain-of-Thought prompting, on the other hand, aims to improve our understanding of the model's knowledge by providing a sequence of improved contexts such that the volume of the set of outputs shrinks quickly. But this idea of a “small” geometric set representing the knowledge gleaned from splitting a query into contextualized sub-queries is essentially the same as showing that the set of things extractable via a probabilistic polynomial-time extractor is “small.”
Our view is that an understanding of the sequential nature of both interactive proofs and Chain-of-Thought reasoning could lead to an improved understanding of LLMs for users. This is one interaction between cryptography and AI that we believe is deeply underserved and will (hopefully) be a fruitful area of collaboration between the two seemingly antipathetic fields in the years to come.
The authors would like to thank the following people for their comments and suggestions: Kshitij Kulkarni, Shannon B, Elena B, 0xEmperor, Henry de Valence, Dave White, Jon Charbonneau, Hart Lambur, Alex Evans, JD Maturen, Brandon Curtis, Eric Wall, Ben Fisch
Yann's argument only works if you assume the probability of error is independent for every token, which is very unlikely to be true. One explanation is that the training process correlates tokens at different length scales, and the attention mechanism is in fact learning to separate these length scales (what matters at the sentence, paragraph, and idea levels). ↩︎ ↩︎
A witness is the 'secret' information that a prover has and uses to generate a valid ZK argument of knowledge. Technically, the interaction between a prover and verifier aims to solve what is known as the Circuit-SAT problem. This problem involves a prover showing that something associated with a function (e.g. its execution trace or the state of the RAM it touches) is valid, instead of showing validity of the function itself. This was one of the big breakthroughs that led to the Moore's-Law-like progress in ZKPs that we've seen over the last decade, as the witness data can be significantly more compressed than the raw function. See Chapters 6 and 7 of Justin Thaler's book. ↩︎
The Fiat-Shamir transformation is a way to use a random oracle (such as a collision-resistant hash function) to convert an interactive protocol over many rounds into a non-interactive, 'single-shot' mechanism. It effectively says we can 'bundle' all of the interactions made into one round of communication – the opposite of the Chain-of-Thought prompting mechanism. See Chapter 5 of Justin Thaler's excellent notes on ZK proving systems for more details. ↩︎
Fine-tuning is the process of iteratively improving a large language model using a combination of reinforcement learning, new data sources, and modified constraints and/or objective functions. Fine-tuning is the reason GPT-4 is a lot more accurate than LLaMa, for instance. See this HuggingFace article for some details of how the process works. ↩︎