## Reasoning in LLMs

### Overview

Large Language Models have shown an emergent ability to reason, and performance on certain benchmarks has improved due to techniques like chain-of-thought (CoT). Most works evaluate the reasoning ability of these models based on whether they produce the correct answer and whether the generated reasoning looks valid. It would be nice to have a better understanding of whether the reasoning actually represents why the model came up with its answer (i.e., is causal in some sense). *Faithfulness* means that an explanation should represent the *actual reasoning* behind the model's prediction, whereas *plausibility* means how convincing the explanation sounds to humans. These models might be generating reasoning-like responses rather than reasoning step by step.

### Literature Review

#### Faithful Reasoning

- [Measuring Faithfulness in Chain-of-Thought Reasoning](https://arxiv.org/abs/2307.13702) They posit that chain-of-thought is not faithful as of now, and provide evidence by adding biasing features to the input context. They conduct two experiments: adding bias to features referenced in the reasoning, as well as adding biasing features not referenced in the inputs.
  1. For the CoT truncation experiment, they don't do any analysis on why the performance drop is large for certain tasks while it is minimal for others. Some manual analysis of how the answer changes would help: is the LLM just using whatever is given as the CoT in the final answer? For example, if the model gives the correct answer even after the truncated CoT, we could check whether it matches the no-CoT answer.
  2. For the mistake-insertion part, they only add the mistake to the suffix of the CoT. It may be worthwhile to look at how changing other parts affects the result.
  3. Also, the hypothesis that the extra tokens **alone** are responsible for the increased performance is not sufficient; it has been shown that extra compute (in the form of tokens) helps produce better results ([Pause Tokens](https://arxiv.org/abs/2310.02226)).

  Blog by the author: [Blog 1](https://hackmd.io/@tamera/HJ7iu0ST5)

- [Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting](https://arxiv.org/abs/2305.04388) The paper demonstrates that CoT explanations can be plausible yet *systematically unfaithful*. They show that the model's behaviour can be influenced by biasing features that the model fails to reference in the chain-of-thought. I overall like the direction of adding bias by changing features that are not referenced in the CoT, but believe the experiments were not very rigorous.
  1. Since they used RL-tuned models, which have been shown to display *sycophancy* (i.e., tailoring their responses to follow a human user's view), I think the experiments are not clean in some sense. For example, in the task where the bias is that the answer is always (A), the model might be interpreting the task as "the answer is (A); justify it", since these models are also sensitive to repeated patterns. I believe that if we give an explicit instruction that the task is to choose the most appropriate option out of the given ones and justify the choice, the results might be slightly different (a minimal sketch of such a biased-context probe is given after this entry).
  2. For the second task related to stereotypes, it has been shown that models display human-like content effects, and they would also be biased, considering they are trained on the internet, which obviously has biases. Humans also struggle on logical questions which are counterintuitive, but when we ask humans to think perfectly logically and ignore any personal bias they perform better. For the second study we could give the model a similar instruction to test this idea.
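Rough sketch (mine, not the paper's) of how such a biased-context probe plus the instruction fix could be set up: every few-shot exemplar is reordered so its correct answer sits at the biased letter, and we measure how often the model's prediction follows the bias. `query_model`, the prompt format, and the field names are placeholders.

```python
def build_biased_context(exemplars, biased_letter="A"):
    """Reorder each exemplar's options so the correct answer always sits at the
    biased position, mimicking an 'answer is always (A)' bias."""
    blocks = []
    for ex in exemplars:
        options = list(ex["options"])                 # option strings
        correct = options.pop(ex["answer_idx"])       # remove the gold option
        options.insert(ord(biased_letter) - ord("A"), correct)
        lines = [f"({chr(ord('A') + i)}) {opt}" for i, opt in enumerate(options)]
        blocks.append(f"Q: {ex['question']}\n" + "\n".join(lines)
                      + f"\nA: The answer is ({biased_letter}).")
    return "\n\n".join(blocks)


def bias_flip_rate(exemplars, test_items, query_model,
                   with_instruction=False, biased_letter="A"):
    """Fraction of test questions whose predicted letter follows the bias even
    though the gold answer is a different letter."""
    instruction = ("Choose the most appropriate option on its own merits and "
                   "justify your choice.\n\n") if with_instruction else ""
    context = instruction + build_biased_context(exemplars, biased_letter)
    flips = 0
    for item in test_items:
        opts = "\n".join(f"({chr(ord('A') + i)}) {o}"
                         for i, o in enumerate(item["options"]))
        prompt = f"{context}\n\nQ: {item['question']}\n{opts}\nA: Let's think step by step."
        pred = query_model(prompt)                    # hypothetical call returning a letter
        flips += (pred == biased_letter) and (item["gold_letter"] != biased_letter)
    return flips / len(test_items)
```

Comparing `bias_flip_rate(..., with_instruction=False)` against `with_instruction=True` would show whether the explicit instruction reduces the effect of the biased context.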
- [Faithful Chain-of-Thought Reasoning](https://arxiv.org/abs/2301.13379) [Not Relevant] The paper highlights that standard CoT does not provide interpretability of how the model predicts the answer. They propose a framework to convert the natural language into a symbolic chain and then use a deterministic solver to get the result (not sure what is novel about their approach; I have seen this kind of work before). This ensures that the final answer always matches the reasoning generated by the LM, but it leaves out the more important question (IMO): **does the reasoning actually represent how the model is solving the question, or is it just the most plausible sequence completion?**
- [Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks](https://arxiv.org/abs/2307.02477) The authors try to understand how much of the current reasoning skill is due to task-general reasoning as compared to the ability to recall specific tasks seen during pre-training. They measure general reasoning abilities by creating counterfactual tasks where they modify the input-output mapping while the required reasoning skills stay the same. They observe above-chance performance on these counterfactual tasks, but the performance degrades substantially compared to the original tasks. They use zero-shot prompting for these tasks.
  1. Generally, even for humans it would be harder to do these counterfactual tasks. It has been shown in the past that these LLMs also display human-like content effects. It would be interesting to see how these models perform given extra *compute or time*.
  2. They don't look at different model sizes. I would expect smaller models to show better generalization (i.e., less of a difference in performance, even if the performance is lower than that of larger models), since they would be less prone to overfitting.
- [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/pdf/2205.11916.pdf) They show that LLMs are decent *zero-shot* reasoners by simply adding "Let's think step by step" before each answer.
  1. The paper does not shed any light on why zero-shot reasoning works.
  2. Why do certain prompts ("Let's think step by step") trigger better responses than others? They compare performance across different trigger prompts but provide no understanding of this behaviour.
- [Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models](https://arxiv.org/abs/2305.15074) JEEBench paper. This paper proposes a new dataset to test the logical reasoning capabilities of LLMs using a challenging set of PCM (physics, chemistry, maths) questions.
  1. They claim that self-refinement does not work on complex problems like these. It would be interesting to do an in-depth analysis of where the model is not able to identify errors. Is it because the factual information in its knowledge is incorrect, so it can't identify the errors?
  2. Their hypothesis is that few-shot prompting is not very helpful on these questions because conceptual errors are hard to improve upon using few-shot examples. Can we build a control set of prompts (maybe subject-wise) to help these language models recall the relevant knowledge?
  3. There have also been methods proposed to break down the questions into sub-questions; it might be interesting to see the impact of such approaches.
  4. They claim that most of the errors are caused by failure to retrieve concepts. A simple extension could have been to prompt the models for those concepts, to check whether it is a memorization error or a failure-to-recall error. In case of failure to recall, we can maybe prompt things like the subject, topic, etc. to improve accuracy. (I tried it for one of the errors they highlighted and saw that on explicitly asking the model for that concept it was able to get it right; a rough sketch of this two-stage recall-then-answer prompt is given after this entry.)
  5. Something more challenging: can a language model be confident about its prediction (i.e., is it well calibrated)?
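A quick sketch (mine, not from the paper) of the recall-then-answer idea from points 2 and 4: first elicit the relevant concepts for the subject/topic, then answer with those concepts placed in context. `query_model` and the prompt wording are placeholders.

```python
def recall_then_answer(question, subject, topic, query_model):
    """Two-stage prompt: ask the model to state the relevant concepts first,
    then solve the problem with those concepts in context."""
    recall_prompt = (
        f"Subject: {subject}\nTopic: {topic}\n"
        f"List the key formulas and concepts needed to solve this problem:\n{question}"
    )
    concepts = query_model(recall_prompt)      # hypothetical LLM call

    answer_prompt = (
        f"Relevant concepts:\n{concepts}\n\n"
        "Using the concepts above, solve the problem step by step "
        f"and state the final answer.\n\nProblem: {question}"
    )
    return query_model(answer_prompt)
```

Comparing this against a single-stage prompt on the error cases they highlight would indicate whether the failures are really recall failures.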
#### Some work on how the reasoning is just due to memorization

- [Impact of Pretraining Term Frequencies on Few-Shot Reasoning](https://arxiv.org/abs/2202.07206) The paper provides evidence that LLMs have better reasoning performance on terms that are more frequent in the training dataset, which should ideally not be the case if the models are truly "reasoning".
  1. An important point the paper highlights is that to truly gauge the reasoning abilities of language models, we should take into account the effect of the pretraining corpus.
  2. It might be worthwhile extending this to other, more complex reasoning tasks. For example, if the model is making some mistake on a factual task, maybe we can ask it to cite the documents it is referencing in the training corpus.
- [Are NLP Models really able to Solve Simple Math Word Problems?](https://aclanthology.org/2021.naacl-main.168.pdf) A slightly older paper that shows language models rely on shallow heuristics to achieve high performance on benchmark datasets. They provide evidence that a simple bag-of-words model can achieve similar accuracy. They also show that just passing the premise without any question at test time achieves high accuracy, which suggests the presence of patterns in the bodies of MWPs, and that certain problems can be solved without the word order of the problem.
  1. They fine-tune the language model. Since they are not really using foundation models for this study, we can't infer much, given that "reasoning" is said to be more of an *emergent* ability.

#### Few other reasoning papers that I found interesting

- [What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning](https://arxiv.org/abs/2305.09731) Proposes to break down in-context learning into two separate components: TR (task recognition) and TL (task learning, where the model captures new input-label mappings). They find that only TL scales with model size and the number of in-context demonstrations; TR performance remains fairly consistent.
  1. The ablation study that they use for TL seems very simple. Even smaller models would be able to map the dummy labels to the actual labels.
  2. The issue is that for some of these LLMs we don't know exactly what kind of data they have been trained on. The paper also doesn't make it clear how to do this separation for more complex logical reasoning tasks.
- [The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code](https://arxiv.org/abs/2305.19213) They consider two challenging causal reasoning tasks: abductive reasoning (generate a plausible explanation for the ending) and counterfactual reasoning. They propose the use of Code-LLMs to model causal relationships, since causal relationships are frequently modeled in code (for example, IF statements). The Codex models outperform the RLHF-tuned models (even though the latter are descendants of the Codex model); the authors hypothesize that this is due to the *alignment tax*.
  1. I like the overall direction but feel the idea is gimmicky and no real interpretation is given. Posing simple NLI tasks as code would also be challenging and not scalable (due to costs, and converting every task to a similar coding prompt might require a lot of prompt engineering); a sketch of what such a code-style prompt might look like is given after this entry.
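A hypothetical illustration (my own framing, not the paper's prompts) of how a counterfactual-reasoning instance might be posed as code completion with an IF statement; the story fields and formatting are made up.

```python
def counterfactual_code_prompt(premise, initial_event, counterfactual_event, original_ending):
    """Frame a counterfactual-reasoning instance as a code-completion prompt:
    the if/else makes the causal dependence of the ending on the event explicit,
    and the Code-LLM is asked to complete the counterfactual branch."""
    return (
        f'"""Story premise: {premise}"""\n'
        f'event = "{counterfactual_event}"\n\n'
        f'if event == "{initial_event}":\n'
        f'    ending = "{original_ending}"\n'
        f'else:\n'
        f'    # what is the most plausible ending given the counterfactual event?\n'
        f'    ending = "'
    )

# e.g. counterfactual_code_prompt(
#     "Tom bought an ice cream on a hot afternoon.",
#     "Tom ate the ice cream right away.",
#     "Tom left the ice cream on the bench for an hour.",
#     "Tom enjoyed the ice cream.")
```

Even this toy case shows the scalability concern above: every new task family would need its own hand-crafted code template.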
- [MathPrompter: Mathematical Reasoning using Large Language Models](https://arxiv.org/abs/2303.05398) The work tries to generate confidence in the results produced by an LLM by asking it to solve an arithmetic question in multiple ways and then checking whether the results agree for multiple variable values. They use a zero-shot CoT prompting technique.
  1. The part the paper does not focus on is the natural-language-to-symbolic step. I would be interested to see the cases where it could formulate the Python program correctly but not the algebraic expression, or vice versa. My intuition is that if the LM "understood" the question correctly, the formulations would not differ a lot; I believe the majority of the improvement is due to correct calculation.
  2. They don't really have a mechanism to check the intermediate steps.
  3. Not very cost effective. I also tried some examples which they claim GPT got wrong, but on trying to reproduce them GPT-4 got them right for me.
- [Do Prompt-Based Models Really Understand the Meaning of their Prompts?](https://arxiv.org/abs/2109.01247) In this paper they experiment with a bunch of instruction prompt templates (some relevant and some irrelevant) and find that models perform equally well with both. They question whether the model can really understand the prompt analogously to how humans do. They find that instructions are only modestly beneficial over no instructions.
  1. This paper is relatively older; they have not used the recent InstructGPT (RLHF) models, which I think would be better at handling instructions.
- [In-Context Learning Creates Task Vectors](https://arxiv.org/pdf/2310.15916.pdf) The paper gives a formulation to explain the in-context learning behaviour of LLMs: essentially, given an input prompt, transformers are able to learn task-specific vectors.
  1. They use relatively simple single-output-token tasks to demonstrate this ability. It would be interesting to see how this works for more complex tasks.
  2. This also fails to explain some of the previously observed differences with fine-tuning, e.g., (1) why the exact input-output mapping does not matter as such, and (2) why the order of input demonstrations matters.
- [Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?](https://arxiv.org/abs/2202.12837) The paper investigates which aspects of the exemplars given for ICL actually contribute. They provide evidence that the model does not rely on the input-label mapping; instead, what matters is the label space, the distribution of the input text, and the overall format of the prompt sequence.
  1. They do not investigate any real logical reasoning tasks (arithmetic or deductive reasoning).
  2. They also have a study where the number of input demonstrations does not actually matter after a point. It would be interesting to study this for more complex tasks: as we increase the task difficulty, do we need more demonstrations? (A sketch of the random-label ablation at the core of this paper is given below.)
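Not the authors' code, just a minimal sketch of the gold-vs-random-label ablation the paper is built around, written so the number of demonstrations can be varied; `query_model`, the data fields, and `label_space` are placeholders.

```python
import random

def build_prompt(demos, query, randomize_labels=False, label_space=None):
    """Format ICL demonstrations; optionally replace gold labels with labels
    drawn uniformly from the label space (the core ablation)."""
    lines = []
    for x, y in demos:
        label = random.choice(label_space) if randomize_labels else y
        lines.append(f"Input: {x}\nLabel: {label}")
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

def accuracy(demos, test_set, query_model, **kwargs):
    """Exact-match accuracy of the model's completion over a labelled test set."""
    correct = 0
    for x, y in test_set:
        pred = query_model(build_prompt(demos, x, **kwargs)).strip()
        correct += pred == y
    return correct / len(test_set)

# Compare gold vs. random labels while varying the number of demonstrations
# (demos, test_set, label_space, query_model are assumed to exist elsewhere):
# for k in (2, 4, 8, 16):
#     gold = accuracy(demos[:k], test_set, query_model)
#     rand = accuracy(demos[:k], test_set, query_model,
#                     randomize_labels=True, label_space=label_space)
#     print(k, gold, rand)
```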
#### Ambiguous Question Answering

I briefly thought about the problem of how LLMs should ask clarifying questions when in doubt, similar to how we do. This idea feels relatively unexplored. Intuitively, I believe RLHF-tuned models would answer something random when information is missing from the prompt. Either the task itself is ambiguous, or there is ambiguity in the prompt or context (in the form of missing information).

- [CLAM: Selective Clarification for Ambiguous Questions with Generative Language Models](https://arxiv.org/abs/2212.07769) They propose a framework for selectively asking clarifying questions about ambiguous user questions. The framework is a multi-step one: first classify the question as ambiguous or not; if it is, ask the user a clarifying question.
- [Task Ambiguity in Humans and Language Models](https://arxiv.org/abs/2212.10711) This work studies *task ambiguity*, where the task the agent is being asked to perform is ambiguous. This is particularly important given the wide application of ICL in downstream applications. They also release a benchmark dataset for this task, and study how task instructions and in-context examples steer the model towards one particular task.
  1. When they study how task instructions help steer the model towards one particular task, they could have also tried adding the instruction after/before the prompt to see how recency bias shows up in these models.
  2. I think they also miss a comparative study of how instructions and the number of examples in the prompt perform relative to each other.
  3. They also claim that fine-tuning the non-instruction models on a small ambiguous dataset of 68 examples matched the performance of the text-davinci models. I think they should also test on a more difficult task after fine-tuning to get an actual sense of how it performs.

#### Prompts for Chain of Thought

- [Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting](https://arxiv.org/abs/2302.04813) They study the problem of optimizing the explanations given for CoT exemplars for better performance on textual reasoning tasks. They perform the optimization in the following steps (a rough sketch of this procedure is given after this entry).
  1. Given the human-annotated explanations, they generate more candidate explanations for each example by prompting the LLM and keeping the ones that arrive at the correct answer.
  2. Next, from this pool of explanations they need to choose the best-performing combination. They use a development set V and majority voting to produce pseudo-labels for selecting the combination.

  Some possible work we could do:
  1. Using explanations from a different dataset of a *similar task* might help add more variation to the candidate set of explanations.
  2. There has also been work on optimizing the ordering of examples in the prompt; a one-stage framework which optimizes over both together might produce larger gains, since both the order of exemplars and the reasoning steps have been shown to affect CoT performance.
  3. For the first proxy metric that they propose, I don't really see a reduction in the overall prompt length. In general the method seems expensive, given that LLMs are used for all the work. It might be possible to use a smaller LM tuned for this purpose, similar to what Prof. Mausam was saying.
  4. They also assume access to a development set.
  5. They have also not spent time understanding why certain prompts do not work; the paper below did this for exemplar ordering and found some issues.
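A compressed sketch of the generate-then-select loop as I read it (not the authors' implementation): sample candidate explanations that reach the correct answer, build pseudo-labels by majority voting on the unlabeled development set, and score a candidate exemplar-explanation combination by agreement with those pseudo-labels. `query_model` and the prompt formats are placeholders.

```python
from collections import Counter

def generate_candidates(example, query_model, n_samples=8):
    """Sample explanations from the LLM and keep the ones whose final answer
    matches the gold answer for that exemplar."""
    kept = []
    for _ in range(n_samples):
        out = query_model(f"Q: {example['question']}\n"
                          "Explain step by step, then give 'Answer: <label>'.")
        if "Answer:" not in out:
            continue
        explanation, answer = out.rsplit("Answer:", 1)
        if answer.strip() == example["answer"]:
            kept.append(explanation.strip())
    return kept

def pseudo_labels(dev_questions, candidate_prompts, query_model):
    """Majority vote across several candidate prompts on the unlabeled dev set."""
    labels = []
    for q in dev_questions:
        votes = [query_model(f"{p}\n\nQ: {q}\nA:").strip() for p in candidate_prompts]
        labels.append(Counter(votes).most_common(1)[0][0])
    return labels

def score_combination(prompt, dev_questions, labels, query_model):
    """Agreement of one exemplar-explanation combination with the pseudo-labels."""
    preds = [query_model(f"{prompt}\n\nQ: {q}\nA:").strip() for q in dev_questions]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```

Even in this compressed form the cost concern from point 3 is visible: every candidate combination requires a full pass of LLM calls over the development set.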
- [Fantastically Ordered Prompts and Where to Find Them](https://arxiv.org/abs/2104.08786) They demonstrate that the order in which the exemplar samples are provided is important for ICL. They also find the order is not transferable across tasks and models. They generate a probing set by querying the language model itself; this set should share a similar distribution with the training samples. They use Global Entropy (GlobalE) and Local Entropy (LocalE) over the probing-set predictions to select the best ordering. Possible future work:
  1. How does the order of examples matter for an actual OOD dataset? These dummy datasets might be too simple for these models and we might not be getting a complete picture. We could use a dataset like PrOntoQA-OOD for this exercise.
  2. The paper does not provide any intuition about why the ordering is model- as well as task-specific; maybe similar tasks could perform well using the same ordering?
  3. Again, it is expensive to query the language model for the probing set.
  4. I can see some cases where the global entropy paradigm …

#### Fact Checking in the Wild

- [Complex Claim Verification with Evidence Retrieved in the Wild](https://arxiv.org/abs/2305.11859) They present an automated pipeline to check real-world claims by retrieving raw evidence from the web. They don't present any novel technique as such, but rather a complete pipeline built from already established work. They claim that the challenging part of building such systems is retrieving relevant evidence from the wild.
  1. Overall the method seems very **expensive**: multiple queries to GPT to break the claim down into relevant questions, retrieving relevant documents from the web for each sub-question, scoring all the documents by ranking text spans in each document, summarizing the top-K documents, and then using a fine-tuned classifier to check the claim.

### Some ToDo Tasks

- [ ] We have seen that the order of input demonstrations matters for ICL, so what if the input demonstrations are arranged in order of increasing difficulty, similar to how we would teach a student (examples in order of increasing difficulty)? A rough sketch of this idea is below.
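A minimal sketch of this ToDo, assuming a crude difficulty proxy (number of CoT steps, which is just a placeholder heuristic) and a hypothetical `query_model` call.

```python
def difficulty(example):
    """Crude proxy: number of reasoning steps in the exemplar's CoT
    (placeholder heuristic; could be length, operator count, human rating, ...)."""
    return example["cot"].count("\n") + 1

def curriculum_prompt(exemplars, query):
    """Arrange demonstrations from easiest to hardest before the test question."""
    ordered = sorted(exemplars, key=difficulty)
    blocks = [f"Q: {ex['question']}\nA: {ex['cot']}\nThe answer is {ex['answer']}."
              for ex in ordered]
    blocks.append(f"Q: {query}\nA:")
    return "\n\n".join(blocks)

# answer = query_model(curriculum_prompt(exemplars, test_question))  # hypothetical call
```

Comparing this easy-to-hard ordering against a random or reversed ordering on a reasoning benchmark would be a cheap first test of the idea.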