<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2303.03378) | [Note link](https://blog.csdn.net/baidu_41617231/article/details/135475315) | [Code link](https://github.com/kyegomez/PALM-E) | ICML 2023
:::success
**Thoughts**
They attempt to develop an embodied language model by integrating multimodal information, including images.
:::
## Abstract
This study proposes embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts.
## Background
Although large language models (LLMs) exhibit strong reasoning capabilities across various domains, they often fall short in addressing grounded real-world problems in areas like computer vision and robotics.
By leveraging cross-modality capabilities, vision-language models can offer richer information for learning robotic policies and affordance functions.
## Method
This study proposes embodied language models that integrate continuous inputs from an embodied agent's sensor modalities, enabling the language model to make more grounded inferences for sequential decision-making in real-world scenarios.
### PaLM-E: An Embodied Multimodal Language Model

The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model.
The inputs to PaLM-E consist of text and (multiple) continuous observations.
The multimodal tokens corresponding to these observations are interleaved with the text to form multi-modal sentences.
The output of PaLM-E is text generated auto-regressively by the model; it can be an answer to a question, or a sequence of decisions in textual form to be executed by a robot.
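Below is a minimal sketch (not the official implementation) of this idea: continuous observations are encoded, projected into the language model's embedding space, and interleaved with the text token embeddings to form a "multimodal sentence". The module names, dimensions, and the toy backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProjectedImageEncoder(nn.Module):
    """Hypothetical encoder: maps an image to a few tokens in the LM embedding space."""
    def __init__(self, img_feat_dim=768, lm_dim=1024, num_tokens=4):
        super().__init__()
        # Stand-in for a ViT / OSRT backbone that would produce patch or object features.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_tokens * img_feat_dim))
        self.proj = nn.Linear(img_feat_dim, lm_dim)  # affine projection into the LM space
        self.num_tokens = num_tokens
        self.img_feat_dim = img_feat_dim

    def forward(self, image):
        feats = self.backbone(image).view(-1, self.num_tokens, self.img_feat_dim)
        return self.proj(feats)  # (B, num_tokens, lm_dim)

def build_multimodal_sentence(text_embeds, image_embeds, insert_pos):
    """Interleave continuous-observation tokens with text token embeddings."""
    return torch.cat([text_embeds[:, :insert_pos],
                      image_embeds,
                      text_embeds[:, insert_pos:]], dim=1)

# Toy usage: the resulting sequence would be fed to a decoder-only LM for
# autoregressive text generation (answers or textual robot plans).
lm_dim, vocab = 1024, 32000
text_tokens = torch.randint(0, vocab, (1, 12))            # e.g. "Q: What is in <img>? A:"
text_embeds = nn.Embedding(vocab, lm_dim)(text_tokens)     # the LM's own token embedder
image = torch.randn(1, 3, 224, 224)
img_tokens = ProjectedImageEncoder(lm_dim=lm_dim)(image)
mm_sentence = build_multimodal_sentence(text_embeds, img_tokens, insert_pos=6)
print(mm_sentence.shape)  # (1, 12 + 4, 1024)
```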
### Input & Scene Representations for Different Sensor Modalities
In this section, they describe the individual modalities incorporated into PaLM-E and how the corresponding encoders are set up.
They investigate state estimation vectors and Vision Transformers (ViTs) for 2D image features, and use the Object Scene Representation Transformer (OSRT) for 3D-aware scene representations.
In addition to encoders that represent the input scene globally, they consider object-centric representations that factor observations into tokens that represent individual objects in the scene.
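A minimal sketch of this setup is shown below: each sensor modality gets its own encoder that maps raw observations into the same LM embedding dimension, so they can all appear as tokens in a multimodal sentence. The class names, state dimension, and slot dimension are assumptions for illustration, not the paper's exact encoders.

```python
import torch
import torch.nn as nn

LM_DIM = 1024  # assumed LM embedding size

class StateEncoder(nn.Module):
    """State-estimation vector (e.g. object poses) -> a single token."""
    def __init__(self, state_dim=21):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, LM_DIM))

    def forward(self, state):                      # (B, state_dim)
        return self.mlp(state).unsqueeze(1)        # (B, 1, LM_DIM)

class ObjectCentricEncoder(nn.Module):
    """Object-centric features (e.g. OSRT-style slots) -> one token per object."""
    def __init__(self, slot_dim=128):
        super().__init__()
        self.proj = nn.Linear(slot_dim, LM_DIM)

    def forward(self, slots):                      # (B, num_objects, slot_dim)
        return self.proj(slots)                    # (B, num_objects, LM_DIM)

# Both outputs live in the LM embedding space and can be interleaved with text.
state_tok = StateEncoder()(torch.randn(1, 21))
object_toks = ObjectCentricEncoder()(torch.randn(1, 5, 128))
print(state_tok.shape, object_toks.shape)
```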
## Experiment
They consider diverse robotic (mobile) manipulation tasks across three different robot embodiments, in simulation and with two different real robots.
Also, they evaluate PaLM-E on general vision-language tasks such as visual-question-answering (VQA), image captioning, and established language modeling tasks.
Below is a figure showing that PaLM-E-562B can perform zero-shot multimodal chain-of-thought reasoning.

Below is a sample case from the robotic tasks.

Below are the results on planning tasks in the simulated environment.

Below are the results on general vision-language tasks.
