# Inner Monologue: Embodied Reasoning through Planning with Language Models

## 1 Introduction

- LLMs not only generate fluent text but also carry real-world knowledge useful for embodied tasks
- combines LLMs with various sources of textual feedback, using only few-shot prompting and no additional training

![image](https://hackmd.io/_uploads/BJ1Qja49a.png)

## 2 Related Work

- prior work uses symbolic reasoning, learned representations, or learned task primitives; this paper uses LLMs
- SayCan and Codex-based approaches assume actions always execute successfully
- CLIP fuses vision and language

## 3 Leveraging Embodied Language Feedback with Inner Monologue

![image](https://hackmd.io/_uploads/HkvAnpEc6.png)

- The observation $o$ may be:
    - success detection: binary yes/no on whether the low-level skill succeeded
    - object detection: object recognition
    - scene description: passive or active; passive descriptions can come from object recognition, while active ones have the LLM ask questions about the environment that a VQA model or human answers
    - visual question answering
    - even human feedback
- these forms of feedback are continually injected into the LLM planning prompt

## 4 Experimental Results

- The instantiation of Inner Monologue utilizes:
    - `InstructGPT` as the language model (LLM) for multi-step planning `[9, 91]`
    - scripted modules for language feedback:
        - object recognition (`Object`): reports the objects present in the scene
        - success detection (`Success`): reports whether the most recent action succeeded or failed
        - task-progress scene description (`Scene`): describes the semantic sub-goals, inferred by the LLM from the high-level instruction, that the agent has completed so far
    - a pre-trained language-conditioned pick-and-place primitive, similar to `CLIPort [76]` and `Transporter Nets [75]`
- **Object feedback** informs the LLM planner about the objects present, with a variant similar to that demonstrated in `[19]`.
- **Success feedback** notifies the planner about the success or failure of the most recent action. In complex scenarios with multiple objects and disturbances, it helps the planner reason about overall task progress (e.g., the risk of a stack of blocks being toppled).
- **Scene feedback** describes the semantic sub-goals, deduced by the LLM from the high-level instruction, that the agent has achieved so far.
- Combined **Object + Scene feedback** handles the additional reasoning complexity of using both signals together.
- Adding **chain-of-thought** reasoning `[10, 12, 13]` improves the alignment between the inferred goals and the goals actually achieved by the agent.

![image](https://hackmd.io/_uploads/S1jv4A4q6.png)
![image](https://hackmd.io/_uploads/Bkm7v0456.png)
![image](https://hackmd.io/_uploads/S1v4v0V9p.png)
![image](https://hackmd.io/_uploads/r1xIDCN9T.png)

- results indicate that Success and Object feedback effectively reduce LLM planning failures and thus the overall failure rate
- **Continued Adaptation to New Instructions**
    - The LLM planner can modify its strategy when the goal changes mid-task.
    - It can switch tasks in response to human commands like "finish the previous task" or "please stop", and concludes the task as "done" accordingly.
- **Self-Proposing Goals under Infeasibility**
    - Inner Monologue can autonomously propose alternative goals when the original task is infeasible.
    - For example, after failing a task because a block is too heavy, it changes the goal to finding a lighter block.
- **Multilingual Interaction**
    - Pre-trained LLMs can understand and execute instructions in different languages without specific training.
    - Demonstrated when a Chinese instruction is correctly reinterpreted in English by the LLM planner.
    - This multilingual capacity extends to interpreting symbols and emojis.
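The feedback channels above (`Object`, `Success`, `Scene`) all enter the planner the same way: as lines of text appended to a running prompt. A minimal sketch of that prompt assembly, assuming a hypothetical `Episode` container (this is illustrative, not the paper's code):

```python
from dataclasses import dataclass, field


@dataclass
class Episode:
    """Running 'inner monologue' transcript for one task (illustrative)."""
    instruction: str
    history: list = field(default_factory=list)  # (source, text) pairs

    def add(self, source: str, text: str) -> None:
        self.history.append((source, text))

    def build_prompt(self) -> str:
        # Human instruction first, then every feedback/action entry in order,
        # ending with a cue asking the LLM for the next action.
        lines = [f"Human: {self.instruction}"]
        lines += [f"{source}: {text}" for source, text in self.history]
        lines.append("Robot action:")
        return "\n".join(lines)


ep = Episode("Put all the blocks in the bowl.")
ep.add("Scene", "There is a red block, a blue block, and a bowl.")  # object feedback
ep.add("Robot action", "pick up the red block and place it in the bowl")
ep.add("Success", "False")  # success feedback: the grasp failed
prompt = ep.build_prompt()
```

Because the failed `Success: False` line sits in the prompt, the LLM's next completion can naturally retry or re-plan, which is what makes the disturbance-handling behaviors above possible.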
- **Interactive Scene Understanding**
    - Inner Monologue draws on past actions and feedback to interactively understand the scene.
    - It can accurately answer queries about the scene that go beyond the initial prompt.
- **Robustness to Feedback Order**
    - The LLM planner handles feedback that deviates from the expected sequence.
    - Showcased when it successfully adapts to a new instruction inserted mid-execution.
- **Robustness to Typos**
    - The system remains effective despite typos in instructions, correcting them and realigning its goal state accordingly.

## 5 Limitations

- planning errors, control errors, and success-detection false positives still occur
- limitations of the low-level policies/skills constrain the high-level LLM's reasoning and overall instruction-completion performance
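The overall method can be summarized as a closed loop: the LLM proposes the next skill, the robot executes it, and success/scene feedback is injected back into the prompt before the next planning step. A minimal sketch, where `llm`, `execute_skill`, `detect_success`, and `describe_scene` are hypothetical stand-ins for InstructGPT, the pick-and-place primitive, and the scripted feedback modules:

```python
def run_inner_monologue(instruction, llm, execute_skill,
                        detect_success, describe_scene, max_steps=10):
    """Closed-loop planning sketch: plan, act, inject feedback, repeat."""
    prompt = f"Human: {instruction}\nScene: {describe_scene()}\n"
    for _ in range(max_steps):
        action = llm(prompt + "Robot action:").strip()
        if action == "done":  # planner decides the instruction is complete
            break
        execute_skill(action)
        # Continually inject feedback into the planning prompt (Section 3).
        prompt += (f"Robot action: {action}\n"
                   f"Success: {detect_success(action)}\n"
                   f"Scene: {describe_scene()}\n")
    return prompt


# Toy usage with canned responses standing in for the real models.
canned_plan = iter(["pick up the red block and place it in the bowl", "done"])
executed = []
final_prompt = run_inner_monologue(
    "Put all the blocks in the bowl.",
    llm=lambda _prompt: next(canned_plan),
    execute_skill=executed.append,
    detect_success=lambda _action: True,
    describe_scene=lambda: "There is a red block and a bowl.",
)
```

Note how the limitations above map onto this loop: a false positive from `detect_success` corrupts the prompt history, and the loop can never outperform the skills `execute_skill` actually provides.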