# Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory

## 1 Introduction

> "What if a cyber brain could possibly generate its own ghost, create a soul all by itself? And if it did, just what would be the importance of being human then?" (*Ghost in the Shell*)

- Generally Capable Agent (GCA): prior work targets specific Minecraft tasks like ObtainDiamond; this paper pursues the broader goal of exploring Minecraft
- RL agents for Minecraft are heavily limited: they need millions of training steps and still perform poorly
  - poor generalization and scalability
  - struggle to map long-horizon tasks to specific key presses
- Ghost in the Minecraft (GITM), their GCA, is composed of an LLM Decomposer, an LLM Planner, and an LLM Interface
  - Decomposer: decomposes the task goal into well-defined sub-goals
  - Planner: plans a sequence of structured actions for each sub-goal
  - Interface: executes actions, interacts with the environment, and receives observations
- specifically, VPT [2] needs 6,480 GPU days of training and DreamerV3 [7] needs 17 GPU days, while GITM requires no GPUs and can be trained in just 2 days on a single CPU node with 32 CPU cores

![image](https://hackmd.io/_uploads/HyXyF0gqT.png)

## 2 Related Work

- prior work uses RL (imitation learning, hierarchical RL)
- VPT builds a foundation model for Minecraft by training on videos
- some works adopt knowledge distillation and curriculum learning
- some combine RL with LLMs, while this paper uses LLMs alone

![image](https://hackmd.io/_uploads/HJyMKReqT.png)

## 3 Method

![image](https://hackmd.io/_uploads/SJOdebb9a.png)

### 3.1 LLM Decomposer

- decomposes a task goal into sub-goals
- a goal is a 5-tuple (Object, Count, Material, Tool, Info)
  - Object: the target object itself
  - Count: the required quantity of the object
  - Material and Tool: the prerequisite materials/tools needed to obtain the object
  - Info: text-based knowledge related to the goal
- given a specific goal, a sentence embedding is extracted from a pre-trained LLM and used to retrieve the most relevant text-based knowledge from an external knowledge base
- the LLM identifies the required materials/tools/related info from the retrieved knowledge
- all prerequisite materials/tools can themselves be listed as sub-goals, allowing recursive decomposition
- the external knowledge base is built from the Minecraft Wiki

![image](https://hackmd.io/_uploads/Hk_mx-Zca.png)
![image](https://hackmd.io/_uploads/SyyNx--56.png)

### 3.2 LLM Planner

- a structured action is a 3-tuple (Name, Arguments, Description)

![image](https://hackmd.io/_uploads/BJuz1bb9p.png)

- 3,141 tasks from the MineDojo dataset are decomposed into action sequences using a pre-trained LLM

![image](https://hackmd.io/_uploads/BJUAbb-cp.png)

- Action Interface: functional descriptions of structured actions and their parameters
- Query Illustration: clarifies the structure and meaning of user queries
- Response Format: requires the LLM to return responses in a specific format
- Interaction Guideline: guides the LLM to correct failed actions based on feedback messages
- the User Query contains the goal plus external info (the 5-tuple from the Decomposer), feedback from the previous action, and a reference plan to follow
- the agent maintains a working memory, storing the entire action sequence once a task goal is achieved

### 3.3 LLM Interface

## 4 Experiments

![image](https://hackmd.io/_uploads/B1dUSb-cp.png)
![image](https://hackmd.io/_uploads/HySsS--56.png)
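The recursive decomposition in §3.1 can be sketched in code. This is a minimal illustration, not the paper's implementation: the `Goal` dataclass, `decompose` function, and the toy `KNOWLEDGE` table are all assumptions standing in for what GITM actually does (retrieving Minecraft Wiki text via sentence embeddings and having the LLM extract Material/Tool/Info from it), and the recipes are simplified.

```python
from dataclasses import dataclass

# Goal 5-tuple from the LLM Decomposer: (Object, Count, Material, Tool, Info).
@dataclass
class Goal:
    obj: str
    count: int
    material: dict        # prerequisite materials: {name: count}
    tool: str             # prerequisite tool, or "" if none
    info: str             # retrieved text-based knowledge

# Toy stand-in for the Minecraft-Wiki knowledge base; in GITM these fields
# are extracted by an LLM from text retrieved with sentence embeddings.
KNOWLEDGE = {
    "wooden_pickaxe": {"material": {"planks": 3, "stick": 2},
                       "tool": "crafting_table",
                       "info": "Crafted from planks and sticks on a crafting table."},
    "crafting_table": {"material": {"planks": 4}, "tool": "",
                       "info": "Crafted from planks."},
    "stick":          {"material": {"planks": 2}, "tool": "",
                       "info": "Crafted from planks."},
    "planks":         {"material": {"log": 1}, "tool": "",
                       "info": "Crafted from logs."},
    "log":            {"material": {}, "tool": "",
                       "info": "Collected by chopping trees."},
}

def decompose(obj: str, count: int = 1) -> list:
    """Recursively expand a goal into a dependency-ordered list of sub-goals."""
    entry = KNOWLEDGE[obj]
    subgoals = []
    # Prerequisite materials (and the tool) become sub-goals themselves.
    for mat, n in entry["material"].items():
        subgoals += decompose(mat, n * count)
    if entry["tool"]:
        subgoals += decompose(entry["tool"], 1)
    subgoals.append(Goal(obj, count, entry["material"], entry["tool"], entry["info"]))
    return subgoals
```

Calling `decompose("wooden_pickaxe")` yields a list in which every prerequisite (`log`, `planks`, `stick`, `crafting_table`) appears before the goal that needs it, with the pickaxe itself last; a real planner would additionally merge duplicate sub-goals and subtract items already in the inventory.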