# Language Grounding and World Representations

## Tuesdays 10:00AM ET, Asimov, ServiceNow Research and [zoom](https://servicenow.zoom.us/j/99846565987?from=addon)

The suggested schedule and papers are below. Each week, one person will lead and present the discussion on a paper. We are kicking off the discussion with some background on Reasoning and Grounding using LLMs (see below). I prefer to keep each session short so that the discussions with the guest speaker stay focused and productive.

#### Keywords: Language grounding, world models, Generalization vs. Understanding

# Week 1 (03/10/2023)

```
@inproceedings{generalpatternmachines2023,
  author    = {Mirchandani, Suvir and Xia, Fei and Florence, Pete and Ichter, Brian and Driess, Danny and Arenas, Montserrat Gonzalez and Rao, Kanishka and Sadigh, Dorsa and Zeng, Andy},
  title     = {Large Language Models as General Pattern Machines},
  booktitle = {Proceedings of the 7th Conference on Robot Learning (CoRL)},
  year      = {2023},
}
```

***Presenter:*** Suvir Mirchandani

***Summary:*** Pre-trained large language models have been applied to a variety of settings in robotics — proposing high-level task plans, synthesizing control programs, designing reward functions, and more — driven by their ability to perform in-context learning. In this talk, I will discuss work investigating the potential of LLMs to represent and extrapolate more abstract non-linguistic patterns, and the extent to which LLMs may serve as "general pattern machines." I will cover how these pattern-manipulation capabilities may be connected to robotics, from extrapolating sequences that represent simple motions, to least-to-most prompting of reward-conditioned trajectories that can represent simple closed-loop policies. I will also discuss the significant bottlenecks of using current LLMs as general pattern machines for robotics, as well as the downstream applications that could become possible as these limitations are addressed.

Presenter Slides.

[View Recording](https://servicenow.zoom.us/rec/share/9NGI43jNMSAGvaQ4bkCOPgl9nilWGrglJauiTSroWUoYPNeKe6y14cPTnhI80Jaw.ZTSWGlc_ffGG94ro)

![](https://hackmd.io/_uploads/r1a5kTFZp.png)

# Week 2 (10/10/2023)

```
@inproceedings{li2023emergent,
  title     = {Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task},
  author    = {Kenneth Li and Aspen K Hopkins and David Bau and Fernanda Vi{\'e}gas and Hanspeter Pfister and Martin Wattenberg},
  booktitle = {The Eleventh International Conference on Learning Representations},
  year      = {2023},
  url       = {https://openreview.net/forum?id=DeG07_TcZvT}
}
```

***Presenter:*** Kenneth Li

***Summary:*** Kenneth Li presents two works on understanding and controlling LLMs. In "Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task", he uncovers an interpretable and controllable world model of the game board inside a sequence model trained on Othello moves. In the later work "Inference-Time Intervention: Eliciting Truthful Answers from a Language Model", he compels a language model to tell the truth it knows but otherwise hides by manipulating its activations.
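As a companion to the discussion, here is a minimal probing sketch in the spirit of the paper, not its actual code: given hidden states cached from a sequence model and per-square board labels, it checks whether the board state can be decoded from the activations. The features and labels below are synthetic stand-ins, and a simple linear probe stands in for the probes studied in the paper (which also include nonlinear ones).

```python
# Minimal probing sketch (illustrative, not the paper's code).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

n_positions, d_model = 2000, 512
hidden_states = rng.normal(size=(n_positions, d_model))  # stand-in for cached LM activations
square_state = rng.integers(0, 3, size=n_positions)      # stand-in labels for one board square: 0=empty, 1=black, 2=white

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, square_state, test_size=0.2, random_state=0
)

# One probe per board square in the actual setup; one square shown here.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# With random stand-in features the probe should sit near chance (~0.33);
# the interesting question is how far above chance real activations get.
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
```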
[View Recording](https://servicenow.zoom.us/rec/share/d8Wp0HU_KJCliwRfIKVaxFmTI-5_l0DM5NS-zah3FXNNvVrJP4Iya04D_Sn1yL8x.fLF0wRHROitk69yQ)

![](https://hackmd.io/_uploads/SkFCy6tW6.png)

# Week 3 (17/10/2023)

```
@inproceedings{patel2022mapping,
  title     = {Mapping Language Models to Grounded Conceptual Spaces},
  author    = {Roma Patel and Ellie Pavlick},
  booktitle = {International Conference on Learning Representations},
  year      = {2022},
  url       = {https://openreview.net/forum?id=gJcEM8sxHK}
}
```

A fundamental criticism of text-only language models (LMs) is their lack of grounding---that is, the ability to tie a word for which they have learned a representation, to its actual use in the world. However, despite this limitation, large pre-trained LMs have been shown to have a remarkable grasp of the conceptual structure of language, as demonstrated by their ability to answer questions, generate fluent text, or make inferences about entities, objects, and properties that they have never physically observed. In this work we investigate the extent to which the rich conceptual structure that LMs learn indeed reflects the conceptual structure of the non-linguistic world---which is something that LMs have never observed. We do this by testing whether the LMs can learn to map an entire conceptual domain (e.g., direction or colour) onto a grounded world representation given only a small number of examples. For example, we show a model what the word "left" means using a textual depiction of a grid world, and assess how well it can generalise to related concepts, for example, the word "right", in a similar grid world.

[View Recording](https://servicenow.zoom.us/rec/share/x1htx4mhVSGGhdT7-wQPe02eWZaN2HWKH-3onVFreOpxXSIM94S6aDv1cO1a-KhL.ct6cKmxImaP2FLAJ)

![](https://hackmd.io/_uploads/S1BHxatW6.png)

# Week 4 (23/10/2023)

```
@InProceedings{pmlr-v202-von-oswald23a,
  title     = {Transformers Learn In-Context by Gradient Descent},
  author    = {Von Oswald, Johannes and Niklasson, Eyvind and Randazzo, Ettore and Sacramento, Joao and Mordvintsev, Alexander and Zhmoginov, Andrey and Vladymyrov, Max},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  pdf       = {https://proceedings.mlr.press/v202/von-oswald23a/von-oswald23a.pdf},
}
```

The algorithms implemented by task-optimized neural networks are usually unknown to their designers. In this talk, I will show results presented in two recent papers ([arXiv:2212.07677](https://arxiv.org/abs/2212.07677), [arXiv:2309.05858](https://arxiv.org/pdf/2309.05858.pdf)) where we aimed to reverse engineer transformers trained to solve small-scale few-shot learning and sequential prediction tasks. It turns out that these trained neural networks often approach their tasks by constructing appropriate objective functions and then optimizing them using gradient-based methods within their forward dynamics. I will discuss how our findings might help understand in-context learning in language models and generally how Transformers might form their predictions.

[View Recording](https://servicenow.zoom.us/rec/share/pdlNmNpEYkZpXIJfPC-Eav78HDKOqGcWgdRa2DlWHeG_Tym2ZQH8_E6yAshO3dPY.Eh4yUraImTuHLnNv)

![](https://hackmd.io/_uploads/rkP_gTKWT.png)

# Week 5 (31/10/2023)

```
@inproceedings{li-etal-2021-implicit,
  title     = "Implicit Representations of Meaning in Neural Language Models",
  author    = "Li, Belinda Z.
               and Nye, Maxwell and Andreas, Jacob",
  booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
  year      = "2021",
  publisher = "Association for Computational Linguistics",
}
```

The extent to which language modeling induces representations of the world described by text—and the broader question of what can be learned about meaning from text alone—have remained a subject of ongoing debate across NLP and the cognitive sciences. I'll discuss a few pieces of recent work aimed at understanding whether (and how) representations in transformer LMs linearly encode interpretable and controllable representations of facts and situations. I'll begin by presenting evidence from probing experiments suggesting that LM representations encode (rudimentary) information about entities' properties and dynamic state, and that these representations are causally implicated in downstream language generation. Despite this, even today's largest LMs are prone to glaring semantic errors: they hallucinate facts, contradict input text, or even contradict their own previous outputs. Building on our understanding of how LM representations influence behavior, I'll describe a "representation editing" model called REMEDI that can correct these errors by intervening directly in LM activations. I'll close with some recent experiments that complicate this story: much of LMs' "knowledge" remains inaccessible to readout or manipulation with simple probes. A great deal of work is still needed to build language generation systems with fully transparent and controllable models of the world.

[View Recording](https://servicenow.zoom.us/rec/share/3vpH2fk29-7zNKPUQ-Ulsvt_BDk7ZlWYiSi-R2mdCfgOKZaZC13xsozCawRmsmbo.wbmrwLbR0Qzop4Rs)

![](https://hackmd.io/_uploads/SkxzlpKWp.png)

# Week 6 (07/11/2023)

```
@article{hao2023reasoning,
  title   = {Reasoning with language model is planning with world model},
  author  = {Hao, Shibo and Gu, Yi and Ma, Haodi and Hong, Joshua Jiahua and Wang, Zhen and Wang, Daisy Zhe and Hu, Zhiting},
  journal = {arXiv preprint arXiv:2305.14992},
  year    = {2023}
}
```

Large language models (LLMs) have shown remarkable reasoning capabilities, especially when prompted to generate intermediate reasoning steps (e.g., Chain-of-Thought, CoT). However, LLMs still struggle with problems that are easy for humans, such as generating action plans for executing tasks in a given environment, or performing complex math, logical, and commonsense reasoning. This shortfall can be attributed to two key factors: the architectural limitations that hinder LLMs from performing symbolic reasoning and extending their capabilities beyond language-based output, and the inherent autoregressive nature of their reasoning structure, which restricts their capacity for deliberate, human-like reasoning. In this talk, I will introduce two of our recent works. First, we'll discuss how to augment LLMs with external tools to overcome their architectural limitations, and second, we'll delve into the development of a novel reasoning framework that integrates a world model and MCTS planning to enable more advanced and versatile reasoning.
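To make the "LLM as world model" idea concrete, here is a schematic Python sketch, not the authors' implementation: one prompt proposes candidate actions, a second prompt predicts the resulting state (the LLM acting as a world model), and a third scores progress toward the goal. The paper's framework runs full Monte Carlo Tree Search over such an interface; the greedy lookahead and the `call_llm` placeholder below are simplifying assumptions you would replace with a real search procedure and completion API.

```python
# Schematic planning loop with an LLM-as-world-model (illustrative sketch only).
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in; wire this to a real chat/completion API."""
    raise NotImplementedError


@dataclass
class Node:
    state: str        # textual description of the current world state
    trace: list[str]  # actions taken so far


def propose_actions(state: str, k: int = 3) -> list[str]:
    out = call_llm(f"State:\n{state}\nList {k} possible next actions, one per line.")
    return out.strip().splitlines()[:k]


def predict_next_state(state: str, action: str) -> str:
    # The LLM as world model: predict the state after taking `action`.
    return call_llm(f"State:\n{state}\nAction: {action}\nPredicted next state:")


def score_state(state: str, goal: str) -> float:
    # Crude reward signal: ask for a 0-10 progress estimate toward the goal.
    out = call_llm(f"Goal: {goal}\nState:\n{state}\nRate progress from 0 to 10:")
    try:
        return float(out.strip().split()[0])
    except (ValueError, IndexError):
        return 0.0


def plan(initial_state: str, goal: str, depth: int = 4) -> list[str]:
    node = Node(state=initial_state, trace=[])
    for _ in range(depth):
        candidates = []
        for action in propose_actions(node.state):
            nxt = predict_next_state(node.state, action)
            candidates.append((score_state(nxt, goal), action, nxt))
        if not candidates:
            break
        _, action, nxt = max(candidates)      # greedy; the paper uses MCTS instead
        node = Node(state=nxt, trace=node.trace + [action])
    return node.trace
```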
[View Recording](https://servicenow.zoom.us/rec/share/u4oetDmqegoDuUhQFvM34S2nYfd0RUwcD3LfNG5WdrHZoLBd-_Zm9tOzZK5zmCzv.K-8BYOAi-rQHmyc6)

![](https://hackmd.io/_uploads/Bk_-rwpf6.png)

# Week 7 (14/11/2023)

```
@article{lu2023chameleon,
  title   = {Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models},
  author  = {Lu, Pan and Peng, Baolin and Cheng, Hao and Galley, Michel and Chang, Kai-Wei and Wu, Ying Nian and Zhu, Song-Chun and Gao, Jianfeng},
  journal = {arXiv preprint arXiv:2304.09842},
  year    = {2023}
}
```

Foundation models such as Large Language Models (LLMs) and Large Multimodal Models (LMMs) exhibit impressive problem-solving skills in many tasks and domains. In this talk, I will share two distinct studies aimed at facilitating the progress of these models. The first study explores how we can enhance LLMs by augmenting them with various external tools in a flexible and efficient manner. Our proposed Chameleon approach is a plug-and-play compositional reasoning framework. It synthesizes natural-language-like programs to compose various tools, including LLMs, vision models, web search engines, Python functions, and rule-based modules. The adaptability and effectiveness of Chameleon are demonstrated on two multimodal knowledge-intensive reasoning tasks: ScienceQA and TabMWP. The second study presents a new benchmark for probing mathematical reasoning in visual contexts and offers the first comprehensive quantitative and qualitative evaluation of 12 foundation models in this field. The best-performing model, GPT-4V, achieves 49.9% accuracy on MathVista but falls short of human performance by 10.4%. We further explore GPT-4V's newly introduced ability of self-verification, its application of self-consistency, and its interactive chatbot capabilities, highlighting its promising potential for future research.

[View Recording](https://servicenow.zoom.us/rec/share/SFzujxcej3Ra3MmeU0-SiwSaFiO2bpgeQVkMZgMAS7UiSxdVc62dcEoNWXqoKgRO.LezHrs8rhqKQRZjw)

![Screenshot 2023-11-12 at 2.09.46 PM-min](https://hackmd.io/_uploads/BJi7BiA7p.png)

# Week 8 (21/11/2023)

```
@misc{liu2023improvedllava,
  author    = {Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
  title     = {Improved Baselines with Visual Instruction Tuning},
  publisher = {arXiv:2310.03744},
  year      = {2023},
}

@inproceedings{liu2023llava,
  author    = {Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
  title     = {Visual Instruction Tuning},
  booktitle = {NeurIPS},
  year      = {2023}
}
```

Recognizing and understanding visual content, as well as reasoning about the visual world based on human instructions, has long been a challenging problem. Recently, OpenAI's GPT-4V has showcased impressive capabilities in both NLP tasks and complex visual understanding challenges, thanks to large-scale pretraining and extensive instruction tuning. In this talk, I will introduce LLaVA, the first open-source project to demonstrate multimodal GPT-4V-level capabilities in image understanding and reasoning. We demonstrate that this approach offers a promising path for building customizable large multimodal models that follow human intent at an affordable cost. First, I will introduce how we approach this by creating a multimodal instruction-following dataset without extensive manual annotation and by leveraging existing pretrained LLMs and large vision encoders without training from scratch.
Additionally, I will present LLaVA-1.5, which achieves SoTA on 11 benchmarks with only simple modifications to the original LLaVA. It uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Finally, I will present some intriguing capabilities and limitations of LLaVA and outline a few future directions that we are eager to explore.

[View Recording](https://servicenow.zoom.us/rec/share/2H6fwlri8AJxKtVy1GBXzoz9OxeF-CiokX2noX3ryXfvF4rDA4Kbl8kL9igPtu4B.dRp47QJOKZXF4Qe0)

![Screenshot 2023-11-15 at 10.00.03 PM](https://hackmd.io/_uploads/ByeTSZQET.png)
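To make the LLaVA-style wiring from the Week 8 abstract concrete, here is a minimal PyTorch sketch, not the released implementation: features from a frozen vision encoder are mapped by a small projector into the LLM's embedding space and prepended to the text token embeddings. The module name `VisionProjector` and the dimensions below are illustrative assumptions (roughly matching a CLIP ViT-L feature size and a 7B-scale LLM hidden size).

```python
# Minimal sketch of a LLaVA-style vision-to-LLM projector (illustrative only).
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA-1.5 swaps the original linear projection for a two-layer MLP.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen vision encoder
        return self.proj(patch_features)


# Toy usage: project image patches into "image tokens" and concatenate them
# with text token embeddings before feeding the sequence to the language model.
batch, num_patches, seq_len = 2, 576, 32
vision_feats = torch.randn(batch, num_patches, 1024)  # stand-in for CLIP ViT patch features
text_embeds = torch.randn(batch, seq_len, 4096)       # stand-in for LLM token embeddings

image_tokens = VisionProjector()(vision_feats)
llm_inputs = torch.cat([image_tokens, text_embeds], dim=1)
print(llm_inputs.shape)  # (2, 576 + 32, 4096)
```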