# WebGPT: Browser-assisted question-answering with human feedback

## 1 Introduction

- Most prior work on long-form question answering (LFQA) focuses on either information retrieval or synthesis.
- WebGPT is a GPT-3-style model, trained at three sizes: 760M, 13B, and 175B parameters.
- The WebGPT models are obtained with behavior cloning (BC), followed by rejection sampling against a reward model trained to predict human preferences.

## 2 Environment Design

![image](https://hackmd.io/_uploads/r10L1s4q6.png)

- A simple user interface was built in which participants answer questions from the ELI5 dataset (Reddit's "Explain Like I'm 5").
- Two additional kinds of data are collected: *demonstrations* of humans using the web-browsing environment to answer questions, and *comparisons* between two model-generated answers to the same question (each with its own set of references).
- This data is used in four main ways:
  - behavior cloning (i.e., supervised fine-tuning) on the demonstrations,
  - reward modeling on the comparisons,
  - reinforcement learning against the reward model, and
  - rejection sampling against the reward model.

![image](https://hackmd.io/_uploads/BJ5WejNqa.png)

- Answers produced in the environment come with references, which help human judges check factual accuracy.
- Refer to Figure 1b for the text the model is prompted with.
- One of the actions, "quote", records an excerpt of the current page together with its metadata as a reference.
- Browsing continues until the model issues a command to end browsing, the maximum number of actions has been reached, or the maximum total length of references has been reached.

## 3 Methods

### 3.1 Data Collection

- Around 6,000 demonstrations were collected, 92% of which were for questions from ELI5, and around 21,500 comparisons, 98% of which were for questions from ELI5 (the rest from TruthfulQA).
- A GUI was designed for searching, with web search backed by Bing.

### 3.2 Training

- **Behavior cloning (BC)**: a pretrained GPT-3 is fine-tuned on the ~6k demonstrations using supervised learning, with the labelers' demonstrated answers as the ground truth.
- **Reward modeling (RM)**: starting from the BC model (with its final unembedding layer removed), a model is trained that takes a question and an answer and outputs a scalar reward.
  - The reward represents an Elo score, scaled so that the difference between two scores is the logit of the probability that one answer will be preferred to the other by the human labelers.
  - The reward model is trained with a cross-entropy loss on the comparisons (see the first sketch after this list).
- **Reinforcement learning (RL)**: the BC model is fine-tuned with PPO against the reward model, with a KL-divergence penalty to guard against over-optimizing the reward model.
- **Rejection sampling (best-of-n)**: sample a fixed number of answers to a question (4, 16, or 64) from the BC or RL model and select the answer ranked highest by the reward model; this is an inference-time method (see the second sketch after this list).
- Their best model overall was BC + reward model + rejection sampling (no RL).
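A minimal sketch of the comparison loss described above (not the paper's actual code; `reward_model_loss` and the assumed `rm` scoring function are illustrative names). Because the score difference is interpreted as the logit of the preference probability, cross-entropy training with hard binary preference labels reduces to `-log sigmoid(r_preferred - r_rejected)`:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(score_preferred: torch.Tensor,
                      score_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise cross-entropy loss over answer comparisons.

    The difference between the two scalar rewards is treated as the logit
    of the probability that the first answer is preferred, so with hard
    binary labels the loss is -log sigmoid(r_preferred - r_rejected).
    """
    logits = score_preferred - score_rejected   # Elo-style score difference
    return -F.logsigmoid(logits).mean()         # cross-entropy against the "preferred" label

# Illustrative usage (rm is a hypothetical (question, answer) -> scalar reward model):
#   r_a, r_b = rm(question, answer_a), rm(question, answer_b)
#   loss = reward_model_loss(r_a, r_b)   # assuming labelers preferred answer_a
```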
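Likewise, a minimal sketch of best-of-n rejection sampling at inference time, assuming hypothetical `generate_answer` and `reward_model` callables (names are illustrative, not from the paper): sample n candidate answers and keep the one the reward model scores highest.

```python
from typing import Callable, List

def best_of_n(question: str,
              generate_answer: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              n: int = 64) -> str:
    """Rejection sampling (best-of-n): draw n candidate answers from the
    policy and return the one the reward model ranks highest."""
    candidates: List[str] = [generate_answer(question) for _ in range(n)]
    scores = [reward_model(question, answer) for answer in candidates]
    return candidates[scores.index(max(scores))]
```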
## 4 Evaluation

![image](https://hackmd.io/_uploads/B1DJ9s496.png)

![image](https://hackmd.io/_uploads/r1SxqoEq6.png)

- TL;DR: larger models give better results.
- TL;DR: human feedback is vital for better performance and alignment with user preferences.
- TL;DR: WebGPT 175B best-of-64 reaches or surpasses 50% human preference against both the ELI5 reference answers and the human demonstrations.
- TL;DR: WebGPT is considerably more truthful and informative, but the gap between the fraction of truthful answers and the fraction of truthful-and-informative answers increased.

## 5 Experiments

![image](https://hackmd.io/_uploads/rJJZqjV5p.png)

- TL;DR: RL gives only minimal improvement, though it can help slightly when inference-time compute is limited (best-of-1).
- TL;DR: rejection sampling yields a large performance improvement.
- Possible reasons rejection sampling >> RL:
  - It may help to have many answering attempts, simply to make use of more inference-time compute.
  - The environment is unpredictable: with rejection sampling, the model can try visiting many more websites, and then evaluate the information it finds with the benefit of hindsight.
  - The reward model was trained primarily on data collected from the BC and rejection sampling policies, which may have made it more robust to over-optimization by rejection sampling than by RL.
  - RL requires hyperparameter tuning, whereas rejection sampling does not.
- TL;DR: tuning the number of BC epochs and the sampling temperature closed much of the gap between the BC and RL models (further evidence that RL adds little).

![image](https://hackmd.io/_uploads/B1J6sjEcp.png)

- The models for the main evaluations come from the Pareto frontier of this trade-off: the 760M best-of-4 model, the 13B best-of-16 model, and the 175B best-of-64 model.

## 6 Discussion