<style>
img {
display: block;
margin-left: auto;
margin-right: auto;
}
</style>
> [Paper link](https://arxiv.org/abs/2303.03378) | [Note link](https://blog.csdn.net/baidu_41617231/article/details/135475315) | [Code link](https://github.com/kyegomez/PALM-E) | ICML 2023
:::success
**Thoughts**
They attempt to develop an embodied language model by integrating multimodal information, including images.
:::
## Abstract
This study proposes embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts.
## Background
Although large language models (LLMs) exhibit strong reasoning capabilities across various domains, they often fall short in addressing grounded real-world problems in areas like computer vision and robotics.
By leveraging cross-modality capabilities, vision-language models can offer richer information for learning robotic policies and affordance functions.
## Method
This study proposes embodied language models that integrate continuous inputs from an embodied agent's sensor modalities, enabling the language model to make more grounded inferences for sequential decision-making in real-world scenarios.
### PaLM-E: An Embodied Multimodal Language Model

The main architectural idea of PaLM-E is to inject continuous, embodied observations such as images, state estimates, or other sensor modalities into the language embedding space of a pre-trained language model.
The inputs to PaLM-E consist of text and (multiple) continuous observations.
The multimodal tokens corresponding to these observations are interleaved with the text to form multi-modal sentences.
The output of PaLM-E is text generated auto-regressively by the model; it can be an answer to a question, or a sequence of decisions in textual form to be executed by a robot.
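Below is a minimal sketch (not the official implementation) of this idea: continuous observations are encoded, projected into the language model's embedding space, and interleaved with the text token embeddings to form a "multimodal sentence". The module names, dimensions, and the toy backbone are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ProjectedImageEncoder(nn.Module):
    """Hypothetical encoder: maps an image to a few tokens in the LM embedding space."""
    def __init__(self, img_feat_dim=768, lm_dim=1024, num_tokens=4):
        super().__init__()
        # Stand-in for a ViT / OSRT backbone that would produce patch or object features.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(num_tokens * img_feat_dim))
        self.proj = nn.Linear(img_feat_dim, lm_dim)  # affine projection into the LM space
        self.num_tokens = num_tokens
        self.img_feat_dim = img_feat_dim

    def forward(self, image):
        feats = self.backbone(image).view(-1, self.num_tokens, self.img_feat_dim)
        return self.proj(feats)  # (B, num_tokens, lm_dim)

def build_multimodal_sentence(text_embeds, image_embeds, insert_pos):
    """Interleave continuous-observation tokens with text token embeddings."""
    return torch.cat([text_embeds[:, :insert_pos],
                      image_embeds,
                      text_embeds[:, insert_pos:]], dim=1)

# Toy usage: the resulting sequence would be fed to a decoder-only LM for
# autoregressive text generation (answers or textual robot plans).
lm_dim, vocab = 1024, 32000
text_tokens = torch.randint(0, vocab, (1, 12))            # e.g. "Q: What is in <img>? A:"
text_embeds = nn.Embedding(vocab, lm_dim)(text_tokens)     # the LM's own token embedder
image = torch.randn(1, 3, 224, 224)
img_tokens = ProjectedImageEncoder(lm_dim=lm_dim)(image)
mm_sentence = build_multimodal_sentence(text_embeds, img_tokens, insert_pos=6)
print(mm_sentence.shape)  # (1, 12 + 4, 1024)
```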
### Input & Scene Representations for Different Sensor Modalities
In this section, they describe the individual modalities incorporated into PaLM-E and how the corresponding encoders are set up.
They investigate state estimation vectors and Vision Transformers (ViTs) for 2D image features, and use the Object Scene Representation Transformer (OSRT) for 3D-aware scene representations.
In addition to encoders that represent the input scene globally, they consider object-centric representations that factor observations into tokens that represent individual objects in the scene.
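A minimal sketch of this setup is shown below: each sensor modality gets its own encoder that maps raw observations into the same LM embedding dimension, so they can all appear as tokens in a multimodal sentence. The class names, state dimension, and slot dimension are assumptions for illustration, not the paper's exact encoders.

```python
import torch
import torch.nn as nn

LM_DIM = 1024  # assumed LM embedding size

class StateEncoder(nn.Module):
    """State-estimation vector (e.g. object poses) -> a single token."""
    def __init__(self, state_dim=21):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, LM_DIM))

    def forward(self, state):                      # (B, state_dim)
        return self.mlp(state).unsqueeze(1)        # (B, 1, LM_DIM)

class ObjectCentricEncoder(nn.Module):
    """Object-centric features (e.g. OSRT-style slots) -> one token per object."""
    def __init__(self, slot_dim=128):
        super().__init__()
        self.proj = nn.Linear(slot_dim, LM_DIM)

    def forward(self, slots):                      # (B, num_objects, slot_dim)
        return self.proj(slots)                    # (B, num_objects, LM_DIM)

# Both outputs live in the LM embedding space and can be interleaved with text.
state_tok = StateEncoder()(torch.randn(1, 21))
object_toks = ObjectCentricEncoder()(torch.randn(1, 5, 128))
print(state_tok.shape, object_toks.shape)
```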
## Experiment
They consider diverse robotic (mobile) manipulation tasks across three different robot embodiments, in simulation and with two different real robots.
Also, they evaluate PaLM-E on general vision-language tasks such as visual-question-answering (VQA), image captioning, and established language modeling tasks.
Below is a figure showing that PaLM-E-562B can perform zero-shot multimodal chain-of-thought reasoning.

Below is a sample case from the robotic tasks.

Below are the results on planning tasks in the simulated environment.

Below are the results on general vision-language tasks.
