<style> img { display: block; margin-left: auto; margin-right: auto; } </style>

> [Paper link](https://arxiv.org/abs/2307.10169) | arXiv 2023

:::success
**Thoughts**
Regarding hallucinations: we need to combine Retrieval Augmentation (RA) with multi-turn prompting, or use refined decoding strategies.
:::

## Abstract

This paper aims to establish a systematic set of open problems and application successes so that ML researchers can comprehend the field’s current state more quickly and become productive.

## Introduction

![](https://hackmd.io/_uploads/r1FI78Qsn.png)

## Challenges

Here I only focus ==on the topics== that may relate to the dialogue system our team will build. Some of the topics are skipped.

<!--
### Unfathomable Datasets
:::info
The size of modern pre-training datasets renders it impractical for any individual to read or conduct quality assessments on the encompassed documents thoroughly.
:::

### Tokenizer-Reliance
:::info
Tokenizers introduce several challenges, e.g., computational overhead, language dependence, handling of novel words, fixed vocabulary size, information loss, and low human interpretability.
:::

### ==High Pre-Training Costs==
:::info
- Fine-tuning entire LLMs requires the same amount of memory as pre-training, rendering it infeasible for many practitioners.
- Performance increases through larger compute budgets but at a decreasing rate if the model or dataset size is fixed, reflecting a power law with diminishing returns.
- Parameter-efficient fine-tuning of LLMs still requires computing full forward/backward passes throughout the whole network.
:::

### ==Fine-Tuning Overhead==
:::info
When adapting an LLM via full-model finetuning, an individual copy of the model must be stored (consuming data storage) and loaded (expending memory allocation, etc.) for each task.
:::

### Limited Context Length
:::info
Limited context lengths are a barrier for handling long inputs well to facilitate applications like novel or textbook writing or summarizing.
:::
-->

### ==Prompt Brittleness==

:::info
Variations of the prompt syntax, often occurring in ways unintuitive to humans, can result in dramatic output changes.
:::

The order of examples included in a prompt has been found to influence the model’s behavior significantly. Designing natural language queries that steer the model’s outputs toward desired outcomes is often referred to as *prompt engineering*.

**Single-Turn Prompting** methods improve the input prompt in various ways to get a better answer in a single shot.
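As a minimal illustration of what a single-turn method manipulates, the sketch below builds a few-shot (in-context) prompt from demonstrations and sends it to the model in one call. The `demos` format and the `call_llm` helper are placeholder assumptions, not any particular library's API.

```python
# Minimal sketch of single-turn, few-shot prompting (model API is a placeholder).
from typing import Callable


def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    """Format demonstrations and the new query into one single-turn prompt."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in demos]
    blocks.append(f"Q: {query}\nA:")  # the model is expected to complete the answer
    return "\n\n".join(blocks)


def answer(query: str, demos: list[tuple[str, str]], call_llm: Callable[[str], str]) -> str:
    """Single shot: the whole task is packed into one prompt, no follow-up turns."""
    return call_llm(build_few_shot_prompt(demos, query)).strip()


if __name__ == "__main__":
    demos = [("What is 2 + 2?", "4"), ("What is 10 - 3?", "7")]
    # `call_llm` would wrap whatever LLM you use; here we just inspect the prompt.
    print(build_few_shot_prompt(demos, "What is 6 + 5?"))
```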
![](https://hackmd.io/_uploads/SykQGhXo2.png)

**In-Context Learning (ICL)** Why ICL shows such competitive results across NLP tasks:
- ICL emulates gradient-based meta-learning
- Task recognition: the ability to recognize a task through demonstrations
- The ICL phenomenon has been studied through the lenses of
  - Bayesian inference
  - Sparse linear regression
  - Structure induction
  - Maintaining coherence
  - Kernel regression
  - Clone-structured causal graphs

![](https://hackmd.io/_uploads/HJyQG3ms2.png)

**Chain of Thought**
- [Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems](https://aclanthology.org/P17-1015/)
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://openreview.net/forum?id=_VjQlMeSB_J)
- [Large Language Models are Zero-Shot Reasoners](https://openreview.net/forum?id=e2TBb5y0yFf)
- [Automatic Chain of Thought Prompting in Large Language Models](https://arxiv.org/abs/2210.03493)

![](https://hackmd.io/_uploads/ByJmf2Xsn.png)

**[Ask Me Anything](https://arxiv.org/abs/2210.02441)** It uses multiple prompt templates (called prompt chains) to reformat few-shot example inputs into an open-ended question-answering format. The final output is obtained by aggregating the LLM's predictions for each reformatted input via a majority vote.

**[Self-consistency](https://arxiv.org/abs/2203.11171)** It extends chain-of-thought prompting by sampling multiple reasoning paths and selecting the most consistent answer via a majority vote.

![](https://hackmd.io/_uploads/H1CMG2Qo3.png)

**[Least-to-Most](https://arxiv.org/abs/2205.10625)** It uses a set of constant prompts to have the LLM decompose a given complex problem into a series of subproblems. The LLM solves the subproblems sequentially, with prompts for later-stage subproblems containing the previously produced solutions, iteratively building the final output.

![](https://hackmd.io/_uploads/ByGOM27s2.png)

**[Self-refine](http://arxiv.org/abs/2303.17651)** It is based on the notion of iterative refinement: a single LLM generates an initial output, then iteratively provides feedback on the previous output, followed by a refinement step in which the feedback is incorporated into a revised output.

![](https://hackmd.io/_uploads/rkGuz2Xoh.png)

**[Tree of Thoughts](http://arxiv.org/abs/2305.10601)** It generalizes CoT to maintain a tree of thoughts (with multiple different paths), where each thought is a language sequence that serves as an intermediate step. Doing so enables the LLM to self-evaluate the progress intermediate thoughts make towards solving the problem and to incorporate search algorithms, such as breadth-first or depth-first search, allowing systematic exploration of the tree with lookahead and backtracking.
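The sketch below is a minimal breadth-first variant of this idea, not the paper's implementation: the hypothetical `propose` and `score` helpers stand in for the two LLM calls (extend a partial solution with candidate thoughts, then self-evaluate them), and weak branches are pruned at each depth.

```python
# Minimal breadth-first Tree-of-Thoughts-style search; LLM calls are stubbed out.
from typing import Callable


def tree_of_thoughts(
    question: str,
    propose: Callable[[str, str], list[str]],  # (question, thoughts so far) -> candidate next thoughts
    score: Callable[[str, str], float],        # (question, thoughts so far) -> self-evaluated progress
    depth: int = 3,
    beam_width: int = 2,
) -> str:
    """Keep only the `beam_width` most promising partial solutions at each depth."""
    frontier = [""]  # each state is the concatenation of thoughts generated so far
    for _ in range(depth):
        # Expand every surviving state with its candidate next thoughts.
        candidates = [
            (state + "\n" + thought).strip()
            for state in frontier
            for thought in propose(question, state)
        ]
        # Backtracking happens implicitly: low-scoring branches drop out of the frontier.
        candidates.sort(key=lambda s: score(question, s), reverse=True)
        frontier = candidates[:beam_width]
    return frontier[0]


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a model.
    def propose(question: str, state: str) -> list[str]:
        return ["consider a smaller case first", "write down the governing equation"]

    def score(question: str, state: str) -> float:
        return float(len(state))

    print(tree_of_thoughts("toy question", propose, score))
```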
### ==Hallucinations==

:::info
Generated text that is fluent and natural but unfaithful to the source content (intrinsic) and/or under-determined (extrinsic).
:::

![](https://hackmd.io/_uploads/H1lAr_mj2.png)

To distinguish between different types of hallucinations, they consider the provided *source content* of the model, e.g., the prompt, possibly including **examples** or **retrieved context**. Based on this, they distinguish between *intrinsic* and *extrinsic* hallucinations: in the former, the generated text logically contradicts the source content; in the latter, we cannot verify the output's correctness from the provided source.

![](https://hackmd.io/_uploads/rJvJ0qQs3.png)

**How to Measure Hallucinations**

The *[FactualityPrompts dataset](https://arxiv.org/abs/2206.04624)* consists of factual and nonfactual input prompts, which allows one to isolate the effect of the prompt's factuality on the model's continuation. Further, they measure hallucinations using named-entity-based and textual-entailment-based metrics.

[Another framework](https://arxiv.org/abs/2305.14251) first breaks generations into atomic facts and then computes the percentage of atomic facts supported by an external knowledge source like Wikipedia.

[Yet another approach](https://arxiv.org/abs/2305.13534) detects the behavior of *hallucination snowballing*, where the LLM over-commits to early mistakes in its generation (made before outputting the explanation), which it otherwise would not make.

**Retrieval Augmentation (RA)** It decouples (i) the memory storage of knowledge (e.g., databases or [search indexes](https://arxiv.org/abs/2203.05115)) from (ii) the processing of that knowledge, arriving at a more modular architecture. For (i), a *retriever* module retrieves the top-$k$ relevant documents (or passages) for a query from a large corpus of text. Then, for (ii), we feed these retrieved documents to the language model together with the initial prompt.
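A minimal sketch of this retrieve-then-read setup, assuming hypothetical `retrieve` and `call_llm` helpers rather than any specific RA system's API:

```python
# Retrieve-then-read sketch: fetch the top-k passages, prepend them to the prompt.
from typing import Callable


def retrieval_augmented_answer(
    question: str,
    retrieve: Callable[[str, int], list[str]],  # (query, k) -> top-k passages from your corpus
    call_llm: Callable[[str], str],             # prompt -> model continuation
    k: int = 3,
) -> str:
    passages = retrieve(question, k)
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return call_llm(prompt).strip()
```

Everything beyond the retrieve/read split here (the prompt wording, the choice of $k$, any reranking) is an implementation choice for illustration, not something prescribed by the paper.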
Here are some papers related to RA:
- [Retrieval-augmented language model pre-training (REALM)](https://arxiv.org/abs/2002.08909)
- [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401)
- [Adaptive Semiparametric Language Models](https://arxiv.org/abs/2102.02557)
- [Atlas: Few-shot Learning with Retrieval Augmented Language Models](https://arxiv.org/abs/2208.03299)
- [Improving language models by retrieving from trillions of tokens](https://arxiv.org/abs/2112.04426)
- [Task-aware Retrieval with Instructions](https://arxiv.org/abs/2211.09260)

But RA can still produce hallucinations.

![](https://hackmd.io/_uploads/HJEFjsmo2.png)

Another failure mode of RA is illustrated by [Khattab et al.](https://arxiv.org/abs/2212.14024), who find that **sometimes the retriever cannot find passages that directly answer the question.** Hence, they propose a framework that unifies techniques from RA and multi-turn prompting to solve more complex questions programmatically.

**Decoding Strategies** Another mitigation is refining the decoding strategy at inference time. [Zhang et al.](https://aclanthology.org/2021.humeval-1.3/) phrase this challenge as a trade-off between diversity and quality. While this challenge remains largely unsolved, several approaches such as [diverse beam search](https://ojs.aaai.org/index.php/AAAI/article/view/12340) and [confident decoding](https://arxiv.org/abs/1910.08684) try to reduce the induced hallucinations at the decoding level.

- [On Hallucination and Predictive Uncertainty in Conditional Language Generation](https://arxiv.org/abs/2103.15025)

### ==Misaligned Behavior==

:::info
LLMs often generate outputs that are not well-aligned with human values or intentions, which can have unintended or negative consequences.
:::

**Pre-Training With Human Feedback** Here, human feedback is incorporated during the pre-training stage rather than during fine-tuning. Conditional training leads to the best trade-off between alignment and capabilities. Conditional training is a simple technique that prepends a control token $c$ (e.g., <|good|> or <|bad|>) before each training example $x$, depending on the outcome of a thresholded reward function $R(x) \ge t$. During inference, the model generations are conditioned on $c$ = <|good|>. Conditional training results in significantly better alignment with human preferences than standard LM pre-training followed by fine-tuning with human feedback, without hurting downstream task performance.

**Instruction Fine-Tuning**

**Reinforcement Learning From Human Feedback (RLHF)** RLHF works by using a pre-trained LM to generate text, which is then evaluated by humans by, for example, ranking two model generations for the same prompt. This data is then collected to learn a reward model that predicts a scalar reward given any generated text. The reward captures human preferences when judging model output. Finally, we optimize the LM against such a reward model using RL policy gradient algorithms like PPO.

**Self-improvement** Fine-tuning an LLM on self-generated data.

![](https://hackmd.io/_uploads/SkuxU-Loh.png)

<!--
### Outdated Knowledge
:::info
Updating isolated model behavior or factual knowledge can be expensive and untargeted, which might cause unintended side-effects.
:::

### Brittle Evaluations
:::info
Slight modifications of the benchmark prompt or evaluation protocol can give drastically different results.
:::

### Evaluations Based on Static, Human-Written Ground Truth
:::info
Static benchmarks become less useful over time due to changing capabilities, while updating them often relies on human-written ground truth.
:::

### Indistinguishability between Generated and Human-Written Text
:::info
- The difficulty in classifying whether a text is LLM-generated or written by a human.
- Another LLM can rewrite LLM-generated text to preserve approximately the same meaning but change the words or sentence structure.
:::

### Tasks Not Solvable By Scale
:::info
Tasks *seemingly* not solvable by further data/model scaling.
:::

### Lacking Experimental Designs
:::info
- Papers presenting novel LLMs often lack controlled experiments, likely due to the prohibitive costs of training enough models.
- Common design spaces of LLM experiments are high-dimensional.
:::

### Lack of Reproducibility
:::info
- Parallelism strategies designed to distribute the training process across many accelerators are typically non-deterministic, rendering LLM training irreproducible.
- API-served models are often irreproducible.
:::
-->

## Applications

### Chatbots

:::info
Multi-turn interactions make chatbots easily “forget” earlier parts of the conversation or repeat themselves.
:::

## Conclusion

In this work, they identify several unsolved challenges of large language models, provide an overview of their current applications, and discuss how the former constrain the latter.