paper title: Inner Monologue: Embodied Reasoning
through Planning with Language Models

## introduction
- LLMs can decompose abstract, high-level instructions into a sequence of low-level steps
- LLMs are typically trained only on text data, so they cannot perceive the environment or adapt their task planning to what actually happens during execution
- prior work has investigated using language models as planners or incorporating multimodal-informed perception through language
- but no work has studied the critical link of not only planning with language, but also closing the loop by *informing the planner with embodied feedback expressed in language*
- in this work, they combine LLMs with various sources of textual feedback to close the agent-environment loop
- studies show that language helps humans internalize knowledge and perform complex relational reasoning by thinking in language
- Inspired by the human thought process, they propose that such an inner monologue is a natural framework for incorporating feedback for LLMs.
## method
### problem statement
- embodied robotic agent
- it attempts to perform a high-level natural language instruction
- it is only capable of executing short-horizon skills that were trained beforehand (e.g., with reinforcement learning or behavioral cloning)
- the planner
- which is a pretrained LLM, attempts to find a sequence of skills to accomplish the instruction
- the planner has access to textual feedback from the environment that can be appended to the instruction or requested by the planner
- the feedback may take the form of success detection, object detection, scene description, visual question answering, or even human-provided responses
### inner monologue
We formulate an “inner monologue” by continually injecting information from the various sources of feedback into the LLM planning language prompts as the robot interacts with the environment.
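A minimal sketch of this closed loop, assuming hypothetical helpers `llm_plan_next_step`, `execute_skill`, and `collect_feedback` (the paper's actual prompts and skill interfaces differ in their details):

```python
# Sketch of the inner-monologue planning loop (hypothetical helper names,
# not the paper's implementation).

def llm_plan_next_step(prompt: str) -> str:
    """Placeholder: query a frozen LLM for the next skill, e.g. 'pick up the sponge'."""
    raise NotImplementedError

def execute_skill(skill: str) -> None:
    """Placeholder: run a pre-trained short-horizon skill on the robot."""
    raise NotImplementedError

def collect_feedback() -> str:
    """Placeholder: success detection / scene description, rendered as text."""
    raise NotImplementedError

def inner_monologue(instruction: str, max_steps: int = 20) -> str:
    prompt = f"Human: {instruction}\n"
    for _ in range(max_steps):
        skill = llm_plan_next_step(prompt)    # LLM proposes the next skill
        prompt += f"Robot action: {skill}\n"  # record the action in the monologue
        if skill.strip().lower() == "done":
            break
        execute_skill(skill)
        feedback = collect_feedback()         # e.g. "Success: False" or "Scene: ..."
        prompt += feedback + "\n"             # close the loop: feedback goes back into the prompt
    return prompt
```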
### sources of feedback

#### Success Detection
- it's a binary classification problem of whether the low-level skill has succeeded
- Engineered success detectors can operate on ground-truth state in simulation
- learned success detectors can be trained on examples of successes and failures collected in the real world
- ***Success*** feedback: the output of the success detector is expressed in language and injected into the prompt
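As an illustration, the binary success signal can be rendered as a single line of the monologue; the detector itself might be an engineered check on simulator state, as in this hypothetical sketch:

```python
def success_feedback(succeeded: bool) -> str:
    """Render the binary success signal as one monologue line, e.g. 'Success: False'."""
    return f"Success: {succeeded}"

# Hypothetical engineered detector in simulation: check whether the moved
# object ended up within a small tolerance of its goal position.
def sim_pick_place_success(obj_pos, goal_pos, tol: float = 0.02) -> bool:
    dist = sum((a - b) ** 2 for a, b in zip(obj_pos, goal_pos)) ** 0.5
    return dist < tol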
#### Passive Scene Description
- all sources of environment grounding feedback that are automatically provided and injected into the LLM prompt without any active prompting or querying by the LLM planner
- this feedback is provided consistently and follows a fixed structure
- ***Object*** feedback: the textual output of an object recognizer, listing the objects present in the scene
- ***Scene*** feedback: a task-progress scene description listing the semantic sub-goals (inferred by the LLM from the high-level instruction) that the agent has achieved so far
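A rough sketch of how these two passive feedback strings could be assembled before being appended to the prompt (the format is illustrative, not the paper's exact wording):

```python
def object_feedback(detected_objects: list[str]) -> str:
    """Object feedback: list the objects an object recognizer currently sees."""
    return "Scene: objects are " + ", ".join(detected_objects)

def scene_feedback(achieved_subgoals: list[str]) -> str:
    """Scene feedback: summarize the sub-goals achieved so far."""
    done = ", ".join(achieved_subgoals) if achieved_subgoals else "nothing yet"
    return "Completed: " + done

# e.g. object_feedback(["red block", "blue bowl"])
#      scene_feedback(["red block is in the blue bowl"])
```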
#### Active Scene Description
- sources of feedback that are provided directly in response to active queries by the LLM planner
- In this case, the LLM can directly ask a question about the scene, and this question can be answered either by a person, or by another pretrained model, such as a Visual Question Answering (VQA) model.
- While Passive Scene Descriptions are strictly structured and narrow in scope, in the Active Scene Description setting the LLM can receive unstructured answers to open-ended questions, allowing it to actively gather information relevant to the scene, the task, or even the preferences of the user (in the case of human-provided responses).
- The combined output we send to the LLM planner includes both the LLM-generated question and the response.
- we only consider human-provided responses in this work, which we refer to as ***Human*** feedback
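A sketch of the active setting, where `llm_ask` generates a clarifying question and `answer_source` is either a person or a VQA model (both are hypothetical callables, not the paper's API):

```python
from typing import Callable

def active_scene_query(prompt: str,
                       llm_ask: Callable[[str], str],
                       answer_source: Callable[[str], str]) -> str:
    """Let the planner pose a question and append question plus answer to the prompt."""
    question = llm_ask(prompt)          # e.g. "Robot: which block do you want me to move?"
    answer = answer_source(question)    # e.g. "Human: the green one"
    return prompt + f"{question}\n{answer}\n"
```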
## experiment
Experiments are conducted in three environments; each environment's experimental setup is described below.

### results
##### i. Simulated Tabletop Rearrangement

:question: Why does *Object+Scene* perform the best?
-> because it can keep track of all goal conditions and the goals achieved so far
-> in the presence of many objects and test-time disturbances, the complex combinatorial state space requires the planner to additionally reason about overall task progress (e.g., if the goal is to stack multiple blocks, the unfinished tower of blocks may be knocked over by the robot)
-> *Success* feedback only indicates whether the currently executed skill succeeded
##### ii. Real-World Tabletop Rearrangement

##### iii. Real-World Mobile Manipulator in a Kitchen Setting

## conclusion
- Inner Monologue, without requiring additional training beyond a frozen language model and pre-trained robotic skills, can accomplish complex, long-horizon, and unseen tasks
- it can efficiently retry under observed stochastic failures, replan under systematic infeasibility, or request human feedback for ambiguous queries, resulting in significantly improved performance in dynamic environments.
- Additionally, we allow the language model to generate chain-of-thought summarization following the achieved sub-goals (i.e., “Robot thought: ...”), which we find to be useful empirically.
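An illustrative monologue excerpt with such a summarization line (hypothetical wording; the paper's prompts differ in detail):

```python
EXAMPLE_MONOLOGUE = """\
Human: put all the blocks in the bowl
Robot thought: I need to move the red block and the blue block into the bowl.
Robot action: pick up the red block and place it in the bowl
Success: True
Robot thought: The red block is in the bowl; the blue block is still on the table.
Robot action: pick up the blue block and place it in the bowl
"""
```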
### limitations
- As for failure modes, Inner Monologue may fail due to several sources of errors: (1) success detections, (2) LLM planning errors, and (3) control errors.
- In some instances, we found that the LLM planners ignored the environment feedback and still proposed policy skills involving objects not present in the scene
- The performance of low-level control policies limits not only overall high-level instruction completion performance, but also the scope of tasks that the LLM is able to reason over: no matter how much the LLM reasoning improves, it can still be bottlenecked by what low-level control policies are able to achieve. -> (this is probably not a major issue in practice, since what a robot has to do in a factory is relatively fixed)
- improvements can be made on how to aggregate potentially inaccurate sources of information, such as using text to describe the uncertainty of the feedback modules, or including additional feedback modules for safety and ethics for the proposed plans