<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/pdf/2302.12813.pdf) | [Note link](https://zhuanlan.zhihu.com/p/633301642) | [Code link](https://github.com/pengbaolin/LLM-Augmenter) | arXiv 2023
:::success
**Thoughts**
:::
## Abstract
This paper proposes LLM-AUGMENTER, a system that augments a black-box LLM with a set of plug-and-play modules.
It makes the LLM **generate responses grounded in external knowledge**, e.g., stored in task-specific databases.
It also **iteratively revises LLM prompts** to improve model responses using feedback generated by utility functions, e.g., the factuality score of a LLM-generated response.
LLM-AUGMENTER significantly reduces ChatGPT’s hallucinations without sacrificing the fluency and informativeness of its responses.
## Introduction
The knowledge encoded in LLMs is lossy, and generalizing over that knowledge can lead to “memory distortion”.
Thus, it is highly desirable to augment a *fixed* LLM with **plug-and-play (PnP)** modules for mission-critical tasks.
In this paper, they present LLM-AUGMENTER to improve LLMs with external knowledge and automated feedback using PnP modules.

The study shows that LLM-AUGMENTER significantly reduces ChatGPT’s hallucinations without sacrificing the fluency and informativeness of its generated responses.
## LLM-AUGMENTER

LLM-AUGMENTER improves a fixed LLM (e.g., ChatGPT) with external knowledge and automated feedback to mitigate generation problems such as hallucination. It consists of four PnP modules:
- Working Memory
- Policy
- Action Executor
- Utility
---
The control flow of LLM-AUGMENTER is formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$; a rollout sketch in code follows the list:
- $\mathcal{S}$ is an infinite set of dialog states, encoding the information stored in Working Memory
- $\mathcal{A}$ is the set of actions that Policy picks to execute, e.g., calling the
  - Knowledge Consolidator
  - Prompt Engine
- $P(s' \mid s, a)$ is the state transition probability
- $R(s, a)$ is the external reward, e.g., from users or simulators
- $\gamma \in (0, 1]$ is a discount factor
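As a rough illustration (not the authors' implementation), the interaction between the modules can be sketched as the rollout below. All module interfaces, action names, the iteration cap, and the threshold are hypothetical stand-ins:
```python
# A minimal sketch of the LLM-AUGMENTER loop over the MDP above.
# `policy`, `consolidator`, `prompt_engine`, `llm`, and `utility` are
# hypothetical objects standing in for the PnP modules; `max_steps`
# and `pass_threshold` are illustrative parameters.

def respond(q, h_q, policy, consolidator, prompt_engine, llm, utility,
            max_steps=5, pass_threshold=0.7):
    # Working Memory: the dialog state (q, e, o, u, f, h_q)
    state = {"q": q, "h_q": h_q, "e": None, "o": [], "u": [], "f": None}

    for _ in range(max_steps):
        action = policy.select(state)                  # Policy picks the next action
        if action == "acquire_evidence":
            state["e"] = consolidator.run(q, h_q)      # Knowledge Consolidator
        elif action == "call_llm":
            prompt = prompt_engine.build(state)        # instruction + q + h_q + e + f
            candidate = llm.generate(prompt)
            score, feedback = utility.evaluate(candidate, state)  # Utility module
            state["o"].append(candidate)
            state["u"].append(score)
            state["f"] = feedback
        elif action == "send_response" and state["o"]:
            best = max(range(len(state["o"])), key=state["u"].__getitem__)
            if state["u"][best] >= pass_threshold:     # candidate passes verification
                return state["o"][best]

    # Give up after max_steps: return the best candidate seen so far, if any.
    return max(zip(state["u"], state["o"]))[1] if state["o"] else None
```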
### Working Memory
This module tracks the dialog state that captures all essential information in the conversation.
The state is represented as a six-tuple $(q, e, o, u, f, h_q)$; a dataclass sketch follows the list:
- $q$: the current user query
- $e$: the evidence consolidated from external knowledge by the Knowledge Consolidator
- $o$: a set of LLM-generated candidate responses for $q$
- $u$: the utility score of each element of $o$, assigned by the Utility module
- $f$: verbalized feedback from the Utility module to guide the LLM in revising its responses
- $h_q$: the dialog history before $q$
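As a concrete (hypothetical) rendering of this six-tuple, the state could be held in a small dataclass like the one below; the field names follow the notation above:
```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DialogState:
    """Working Memory: the dialog state (q, e, o, u, f, h_q)."""
    q: str                                          # current user query
    h_q: List[str] = field(default_factory=list)    # dialog history before q
    e: Optional[str] = None                         # consolidated evidence
    o: List[str] = field(default_factory=list)      # candidate responses for q
    u: List[float] = field(default_factory=list)    # utility score per candidate
    f: Optional[str] = None                         # verbalized feedback for the LLM
```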
### Policy
This module selects the next system action that leads to the best expected reward $R$. Candidate actions include:
- Acquiring **evidence** $e$ for $q$ from external knowledge
- Calling the LLM to generate a **candidate response**
- Sending a **response to users** if it passes the verification by the Utility module
They use REINFORCE to optimize the policy parameters $\theta$ (a minimal update sketch follows the formula):
$$
\arg \max_\theta \mathbb{E}_{s \sim \mathcal{S},\, a \sim \pi_\theta} [R(s, a)]
$$
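A bare-bones REINFORCE update for a categorical policy over the actions above might look like this. It is a generic PyTorch sketch, not the authors' training code; the network size and the state featurization are placeholders:
```python
import torch
import torch.nn as nn

# Generic REINFORCE sketch: ascend the gradient of log pi_theta(a|s) * R(s, a)
# to maximize E_{s, a ~ pi_theta}[R(s, a)].

class PolicyNet(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, state_features):               # featurized Working Memory state
        return torch.distributions.Categorical(logits=self.net(state_features))

def reinforce_step(policy, optimizer, trajectory):
    """trajectory: list of (state_features, action_index, reward) tuples."""
    loss = torch.tensor(0.0)
    for state_features, action, reward in trajectory:
        log_prob = policy(state_features).log_prob(torch.tensor(action))
        loss = loss - log_prob * reward               # minimizing -log_prob * R maximizes reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```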
Policy learning can be done in three stages:
- Bootstrapping from a rule-based policy: domain experts encode task-specific knowledge and business logic into IF-THEN rules (a toy example is sketched after this list)
- Learning with user simulators: a language model simulates how human users interact with LLM-AUGMENTER
- Learning with human users: LLM-AUGMENTER interacts with real users to further refine its policy
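A toy version of such an IF-THEN bootstrapping policy, reusing the dict-style state from the earlier loop sketch; the threshold and action names are illustrative:
```python
# Toy IF-THEN rules for bootstrapping the Policy (illustrative only; the real
# rules encode task-specific knowledge and business logic from domain experts).

def rule_based_policy(state, pass_threshold=0.7):
    if state["e"] is None:                        # no evidence yet
        return "acquire_evidence"
    if not state["o"] or state["u"][-1] < pass_threshold:
        return "call_llm"                         # no candidate, or last one failed verification
    return "send_response"                        # best candidate passed -> reply to the user
```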
### Action Executor
#### Knowledge Consolidator

The **retriever** first generates a set of search queries based on $q$ and $h_q$, and then calls a set of APIs to retrieve raw evidence from various external knowledge sources.
The entity **linker** then enriches the raw evidence with related context to form evidence graphs.
The **chainer** prunes irrelevant evidence from the graphs and forms a shortlist of evidence chains that are most relevant to queries.
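A schematic of the retriever, entity linker, and chainer stages; `search_api` and `knowledge_graph` are hypothetical callables, and the query-rewriting and ranking steps are trivial stand-ins for the real sub-components:
```python
# Schematic Knowledge Consolidator: retriever -> entity linker -> chainer.

def generate_search_queries(q, h_q):
    # Retriever, step 1: derive search queries from the query and recent history.
    return [q] + [f"{turn} {q}" for turn in h_q[-1:]]

def consolidate(q, h_q, search_api, knowledge_graph, top_k=3):
    queries = generate_search_queries(q, h_q)
    raw = [doc for sq in queries for doc in search_api(sq)]       # retriever: call external APIs
    graphs = [knowledge_graph.expand(doc) for doc in raw]         # linker: add related context
    chains = sorted(graphs, key=lambda g: g.relevance(q), reverse=True)  # chainer: prune + rank
    return chains[:top_k]                                         # shortlist of evidence chains
```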
#### Prompt Engine
The Prompt Engine generates a prompt, composed of the task instruction, user query $q$, dialog history $h_q$, evidence $e$, and feedback $f$, to query the LLM for a candidate response $o$ to $q$.
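A plausible template showing how these fields might be assembled into a prompt; the wording and layout are invented, only the field composition follows the paper:
```python
# Hypothetical prompt template assembling the Working Memory fields.
PROMPT_TEMPLATE = """\
Instruction: Answer the user's query using only the evidence below.
If feedback on a previous attempt is given, revise the answer accordingly.

Dialog history:
{h_q}

Evidence:
{e}

Feedback on previous response:
{f}

User query: {q}
Response:"""

def build_prompt(state):
    return PROMPT_TEMPLATE.format(
        h_q="\n".join(state["h_q"]) or "(none)",
        e=state["e"] or "(none)",
        f=state["f"] or "(none)",
        q=state["q"],
    )
```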
### Utility
Given a candidate response $o$, the Utility module generates a utility score $u$ and corresponding feedback $f$ using a set of task-specific utility functions (a toy combination is sketched after the list):
- Model-based utility functions assign preference scores to different dimensions of a response, such as fluency, informativeness and factuality. These functions are trained on pre-collected human preference data or annotated log data.
- Rule-based utility functions, implemented using heuristics or programmed functions, measure whether a response complies with a specific rule.
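A toy combination of the two kinds of utility functions: the factuality scorer below is a simple token-overlap stand-in for a trained model-based scorer, the length check is a rule-based one, and both thresholds are invented:
```python
# Toy Utility module: a "model-based" factuality score (approximated here by
# token overlap with the evidence) plus a rule-based length check, turned
# into a score u and verbalized feedback f.

def overlap_f1(response, evidence):
    r, e = set(response.lower().split()), set(evidence.lower().split())
    common = len(r & e)
    if not common:
        return 0.0
    precision, recall = common / len(r), common / len(e)
    return 2 * precision * recall / (precision + recall)

def evaluate(response, state, min_factuality=0.3, max_words=120):
    feedback = []
    factuality = overlap_f1(response, state["e"] or "")     # model-based stand-in
    if factuality < min_factuality:
        feedback.append("The response is not grounded in the evidence; use only the evidence.")
    if len(response.split()) > max_words:                   # rule-based check
        feedback.append(f"The response is too long; keep it under {max_words} words.")
    u = factuality if not feedback else factuality / 2      # penalize failed checks (toy choice)
    f = " ".join(feedback) if feedback else None
    return u, f
```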
## Information Seeking Dialog
## Wiki QA
## Limitations and Future Directions
## Conclusions
This paper introduced LLM-AUGMENTER, a framework for augmenting black-box LLMs (e.g., ChatGPT) with external knowledge and automated feedback.
The automated feedback elicits the “follow-up correction” abilities of models such as ChatGPT and InstructGPT in order to produce revised responses that rank higher according to some given utility functions.