<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/pdf/2302.12813.pdf) | [Note link](https://zhuanlan.zhihu.com/p/633301642) | [Code link](https://github.com/pengbaolin/LLM-Augmenter) | arXiv 2023
:::success
**Thoughts**
:::
## Abstract
This paper proposes LLM-AUGMENTER, a system that augments a black-box LLM with a set of plug-and-play modules.
It makes the LLM **generate responses grounded in external knowledge**, e.g., stored in task-specific databases.
It also **iteratively revises LLM prompts** to improve model responses using feedback generated by utility functions, e.g., the factuality score of a LLM-generated response.
LLM-AUGMENTER significantly reduces ChatGPT’s hallucinations without sacrificing the fluency and informativeness of its responses.
## Introduction
The knowledge encoded in LLMs is lossy, and generalizing over that knowledge can lead to “memory distortion”.
Thus, it is highly desirable to augment a *fixed* LLM with **plug-and-play (PnP)** modules for mission-critical tasks.
In this paper, they present LLM-AUGMENTER to improve LLMs with external knowledge and automated feedback using PnP modules.

The study shows that LLM-AUGMENTER significantly reduces ChatGPT’s hallucinations without sacrificing the fluency and informativeness of its generated responses.
## LLM-AUGMENTER

LLM-AUGMENTER improves a fixed LLM (e.g., ChatGPT) with external knowledge and automated feedback to mitigate generation problems such as hallucination. It consists of four PnP modules:
- Working Memory
- Policy
- Action Executor
- Utility
---
The control flow of LLM-AUGMENTER is formulated as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$; a rollout sketch in code follows the list:
- $\mathcal{S}$ is an infinite set of dialog states, encoding the information stored in Working Memory
- $\mathcal{A}$ is the set of actions that Policy picks to execute, e.g., calling the
  - Knowledge Consolidator
  - Prompt Engine
- $P(s' \mid s, a)$ is the state transition probability
- $R(s, a)$ is the external reward, e.g., from users or simulators
- $\gamma \in (0, 1]$ is a discount factor
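As a rough illustration (not the authors' implementation), the interaction between the modules can be sketched as the rollout below. All module interfaces, action names, the iteration cap, and the threshold are hypothetical stand-ins:
```python
# A minimal sketch of the LLM-AUGMENTER loop over the MDP above.
# `policy`, `consolidator`, `prompt_engine`, `llm`, and `utility` are
# hypothetical objects standing in for the PnP modules; `max_steps`
# and `pass_threshold` are illustrative parameters.

def respond(q, h_q, policy, consolidator, prompt_engine, llm, utility,
            max_steps=5, pass_threshold=0.7):
    # Working Memory: the dialog state (q, e, o, u, f, h_q)
    state = {"q": q, "h_q": h_q, "e": None, "o": [], "u": [], "f": None}

    for _ in range(max_steps):
        action = policy.select(state)                  # Policy picks the next action
        if action == "acquire_evidence":
            state["e"] = consolidator.run(q, h_q)      # Knowledge Consolidator
        elif action == "call_llm":
            prompt = prompt_engine.build(state)        # instruction + q + h_q + e + f
            candidate = llm.generate(prompt)
            score, feedback = utility.evaluate(candidate, state)  # Utility module
            state["o"].append(candidate)
            state["u"].append(score)
            state["f"] = feedback
        elif action == "send_response" and state["o"]:
            best = max(range(len(state["o"])), key=state["u"].__getitem__)
            if state["u"][best] >= pass_threshold:     # candidate passes verification
                return state["o"][best]

    # Give up after max_steps: return the best candidate seen so far, if any.
    return max(zip(state["u"], state["o"]))[1] if state["o"] else None
```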
### Working Memory
This module tracks the dialog state that captures all essential information in the conversation.
The state is represented as a six-tuple $(q, e, o, u, f, h_q)$; a dataclass sketch follows the list:
- $q$: the current user query
- $e$: the evidence consolidated from external knowledge by the Knowledge Consolidator
- $o$: a set of LLM-generated candidate responses for $q$
- $u$: the utility score of each element of $o$, assigned by the Utility module
- $f$: verbalized feedback from the Utility module to guide the LLM in revising its responses
- $h_q$: the dialog history before $q$
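As a concrete (hypothetical) rendering of this six-tuple, the state could be held in a small dataclass like the one below; the field names follow the notation above:
```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DialogState:
    """Working Memory: the dialog state (q, e, o, u, f, h_q)."""
    q: str                                          # current user query
    h_q: List[str] = field(default_factory=list)    # dialog history before q
    e: Optional[str] = None                         # consolidated evidence
    o: List[str] = field(default_factory=list)      # candidate responses for q
    u: List[float] = field(default_factory=list)    # utility score per candidate
    f: Optional[str] = None                         # verbalized feedback for the LLM
```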
### Policy
This module selects the next system action that leads to the best expected reward $R$. Candidate actions include:
- Acquiring **evidence** $e$ for $q$ from external knowledge
- Calling the LLM to generate a **candidate response**
- Sending a **response to users** if it passes the verification by the Utility module
They use REINFORCE to optimize the policy parameters $\theta$ (a minimal update sketch follows the formula):
$$
\arg \max_\theta \mathbb{E}_{s \sim \mathcal{S},\, a \sim \pi_\theta} [R(s, a)]
$$
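A bare-bones REINFORCE update for a categorical policy over the actions above might look like this. It is a generic PyTorch sketch, not the authors' training code; the network size and the state featurization are placeholders:
```python
import torch
import torch.nn as nn

# Generic REINFORCE sketch: ascend the gradient of log pi_theta(a|s) * R(s, a)
# to maximize E_{s, a ~ pi_theta}[R(s, a)].

class PolicyNet(nn.Module):
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, num_actions))

    def forward(self, state_features):               # featurized Working Memory state
        return torch.distributions.Categorical(logits=self.net(state_features))

def reinforce_step(policy, optimizer, trajectory):
    """trajectory: list of (state_features, action_index, reward) tuples."""
    loss = torch.tensor(0.0)
    for state_features, action, reward in trajectory:
        log_prob = policy(state_features).log_prob(torch.tensor(action))
        loss = loss - log_prob * reward               # minimizing -log_prob * R maximizes reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```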
Policy learning can be done in three stages:
- Bootstrapping from a rule-based policy: domain experts encode task-specific knowledge and business logic into IF-THEN rules (a toy example is sketched after this list)
- Learning with user simulators: a language model simulates how human users interact with LLM-AUGMENTER
- Learning with human users: LLM-AUGMENTER interacts with real users to further refine its policy
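A toy version of such an IF-THEN bootstrapping policy, reusing the dict-style state from the earlier loop sketch; the threshold and action names are illustrative:
```python
# Toy IF-THEN rules for bootstrapping the Policy (illustrative only; the real
# rules encode task-specific knowledge and business logic from domain experts).

def rule_based_policy(state, pass_threshold=0.7):
    if state["e"] is None:                        # no evidence yet
        return "acquire_evidence"
    if not state["o"] or state["u"][-1] < pass_threshold:
        return "call_llm"                         # no candidate, or last one failed verification
    return "send_response"                        # best candidate passed -> reply to the user
```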
### Action Executor
#### Knowledge Consolidator

The **retriever** first generates a set of search queries based on $q$ and $h_q$, and then calls a set of APIs to retrieve raw evidence from various external knowledge sources.
The entity **linker** then enriches the raw evidence with related context to form evidence graphs.
The **chainer** prunes irrelevant evidence from the graphs and forms a shortlist of evidence chains that are most relevant to queries.
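A schematic of the retriever, entity linker, and chainer stages; `search_api` and `knowledge_graph` are hypothetical callables, and the query-rewriting and ranking steps are trivial stand-ins for the real sub-components:
```python
# Schematic Knowledge Consolidator: retriever -> entity linker -> chainer.

def generate_search_queries(q, h_q):
    # Retriever, step 1: derive search queries from the query and recent history.
    return [q] + [f"{turn} {q}" for turn in h_q[-1:]]

def consolidate(q, h_q, search_api, knowledge_graph, top_k=3):
    queries = generate_search_queries(q, h_q)
    raw = [doc for sq in queries for doc in search_api(sq)]       # retriever: call external APIs
    graphs = [knowledge_graph.expand(doc) for doc in raw]         # linker: add related context
    chains = sorted(graphs, key=lambda g: g.relevance(q), reverse=True)  # chainer: prune + rank
    return chains[:top_k]                                         # shortlist of evidence chains
```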
#### Prompt Engine
The Prompt Engine generates a prompt, composed of the task instruction, user query $q$, dialog history $h_q$, evidence $e$, and feedback $f$, to query the LLM for a candidate response $o$ to $q$.
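A plausible template showing how these fields might be assembled into a prompt; the wording and layout are invented, only the field composition follows the paper:
```python
# Hypothetical prompt template assembling the Working Memory fields.
PROMPT_TEMPLATE = """\
Instruction: Answer the user's query using only the evidence below.
If feedback on a previous attempt is given, revise the answer accordingly.

Dialog history:
{h_q}

Evidence:
{e}

Feedback on previous response:
{f}

User query: {q}
Response:"""

def build_prompt(state):
    return PROMPT_TEMPLATE.format(
        h_q="\n".join(state["h_q"]) or "(none)",
        e=state["e"] or "(none)",
        f=state["f"] or "(none)",
        q=state["q"],
    )
```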
### Utility
Given a candidate response $o$, the Utility module generates a utility score $u$ and corresponding feedback $f$ using a set of task-specific utility functions (a toy combination is sketched after the list):
- Model-based utility functions assign preference scores to different dimensions of a response, such as fluency, informativeness and factuality. These functions are trained on pre-collected human preference data or annotated log data.
- Rule-based utility functions, implemented using heuristics or programmed functions, measure whether a response complies with a specific rule.
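A toy combination of the two kinds of utility functions: the factuality scorer below is a simple token-overlap stand-in for a trained model-based scorer, the length check is a rule-based one, and both thresholds are invented:
```python
# Toy Utility module: a "model-based" factuality score (approximated here by
# token overlap with the evidence) plus a rule-based length check, turned
# into a score u and verbalized feedback f.

def overlap_f1(response, evidence):
    r, e = set(response.lower().split()), set(evidence.lower().split())
    common = len(r & e)
    if not common:
        return 0.0
    precision, recall = common / len(r), common / len(e)
    return 2 * precision * recall / (precision + recall)

def evaluate(response, state, min_factuality=0.3, max_words=120):
    feedback = []
    factuality = overlap_f1(response, state["e"] or "")     # model-based stand-in
    if factuality < min_factuality:
        feedback.append("The response is not grounded in the evidence; use only the evidence.")
    if len(response.split()) > max_words:                   # rule-based check
        feedback.append(f"The response is too long; keep it under {max_words} words.")
    u = factuality if not feedback else factuality / 2      # penalize failed checks (toy choice)
    f = " ".join(feedback) if feedback else None
    return u, f
```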
## Information Seeking Dialog
## Wiki QA
## Limitations and Future Directions
## Conclusions
This paper introduced LLM-AUGMENTER, a framework for augmenting black-box LLMs (e.g., ChatGPT) with external knowledge and automated feedback.
The automated feedback elicits the “follow-up correction” abilities of models such as ChatGPT and InstructGPT in order to produce revised responses that rank higher according to some given utility functions.