Omniverse Scene Understanding Notes
====
## Table of Contents
### Data and Semantic Planning:
- [ ] AI2-THOR - main dataset
- [ ] ALFRED - AI2-THOR wrapper with task descriptions and an intro to semantic planning (SP)
- [x] [Episodic Transformer for Vision-and-Language Navigation - SP with transformer model](#Episodic-Transformer)
- [ ] Alexa Arena
### Graph
- [x] GCN, Graph Attention and GraphSAGE
- https://blog.csdn.net/miaorunqing/article/details/106625461
- [ ] GNN problems
### Something with CLIP
- [ ] [ConceptFusion](#Concept-Fusion)
### LLM + Planning
- [x] [Scene description for LLM + Planning](#ChatGPT-Long-Step-Robot-Control)
- [x] [Language Models as Zero-Shot Planners](#Language-Models-Zero-Shot-Planners)
- [x] [SayCan](#SayCan)
- [ ] [ReAct](#ReAct)
- [ ] ProgPrompt
- [x] [Interactive Task Planning with Language Models](#Interactive-Task-Planning)
- [ ] [Language Agent Tree Search Unifies Reasoning, Acting and Planning in Language Models - unites MCTS with LLMs](#Language-Agent-Tree-Search)
- [ ] TOOLCHAIN - haven't read carefully, but looks like the same topic as (7)
#### LLM + Scene Graph (New)
- [x] [SayPlan](#SayPlan)
- [ ] [SayNav](#SayNav)
- [ ] Scene-aware Activity Program Generation with Language Guidance
### Affordance
- [ ] Visual Affordance and Function Understanding: A Survey
- [ ] PartAfford: Part-level Affordance Discovery from 3D Objects
- [ ] AdaAfford: Learning to Adapt Manipulation Affordance for 3D Articulated Objects via Few-shot Interactions
#### New Survey
- [ ] Affordances in Robotic Tasks - A Survey
- [ ] CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
- [ ] VoxPoser https://voxposer.github.io/
- [ ] DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Manipulation
- [ ] 3D Implicit Transporter for Temporally Consistent Keypoint Discovery
- [ ] Learning Agent-Aware Affordances for Closed-Loop Interaction with Articulated Objects.
- [ ] RLAfford: End-to-End Affordance Learning for Robotic Manipulation
- [ ] PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations
- [ ] Grounding 3D Object Affordance from 2D Interactions in Images
### Human Affordance
- [ ] From 3D Scene Geometry to Human Workspace
- [ ] Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments
- [ ] Binge Watching: Scaling Affordance Learning from Sitcoms
- [ ] PiGraphs: Learning Interaction Snapshots from Observations
### Open vocabulary affordance
- [ ] Open-Vocabulary Affordance Detection in 3D Point Clouds
- [ ] Open-Vocabulary Affordance Detection using Knowledge Distillation and Text-Point Correlation
- [ ] Language-Conditioned Affordance-Pose Detection in 3D Point Clouds
#### VLM as RL Reward
- [ ] Zero-Shot Reward Specification via Grounded Natural Language
- [ ] Can foundation models perform zero-shot task specification for robot manipulation?
- [ ] Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
- [ ] Guiding pretraining in reinforcement learning with large language models
#### Open-vocab Detection/Segmentation
https://github.com/Hedlen/awesome-segment-anything/blob/main/README.md
### Other Related Work
- Generative Agents: Interactive Simulacra of Human Behavior
- OpenGVLab/GITM https://github.com/OpenGVLab/GITM/blob/main/GITM.pdf
- Voyager
- Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
- https://github.com/GT-RIPL/Awesome-LLM-Robotics
- https://zhuanlan.zhihu.com/p/541492104
- MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
- Cognitive Architectures for Language Agents
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback https://choiszt.github.io/Octopus/
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought https://embodiedgpt.github.io/
- Otter: A multi-modal model with in-context instruction tuning.
- JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
- A Survey of Graph Meets Large Language Model: Progress and Future Directions
---
## Episodic-Transformer
- Task: Given instructions and observations, output actions and objects.
- Uses a causal transformer to fuse information from the different modalities (text / images / actions); a minimal sketch follows this section.
- Trained with behavior cloning on expert trajectories.
- Pretrains the language encoder by translating natural language into a synthetic language.
- Synthetic language, e.g. "Put Apple Table" / "Goto Bed".
- The expert path in the ALFRED environment is defined with the Planning Domain Definition Language (PDDL), which is taken as the synthetic language.
- Model Architecture

- Performance

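A minimal PyTorch sketch of the core idea (not the official ET implementation): embed language, visual, and action tokens separately, concatenate them into one episode sequence, and run a causally masked transformer that predicts the next action and interacted object. All layer sizes, and the use of a single full causal mask, are simplifying assumptions.

```python
import torch
import torch.nn as nn

class EpisodicTransformerSketch(nn.Module):
    """Toy multimodal causal transformer in the spirit of ET (sizes are illustrative)."""
    def __init__(self, vocab_size=1000, n_actions=12, n_objects=80, d=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d)   # instruction tokens
        self.frame_proj = nn.Linear(512, d)           # pre-extracted visual features
        self.action_emb = nn.Embedding(n_actions, d)  # previous actions
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d, n_actions)    # next action class
        self.object_head = nn.Linear(d, n_objects)    # interacted object class

    def forward(self, text_ids, frame_feats, prev_actions):
        # Concatenate modalities into one episode sequence: [text | frames | actions]
        x = torch.cat([self.text_emb(text_ids),
                       self.frame_proj(frame_feats),
                       self.action_emb(prev_actions)], dim=1)
        # Causal mask: each position attends only to earlier positions.
        # (ET's actual masking is modality-aware; a full causal mask is a simplification.)
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.encoder(x, mask=mask)
        # Predict from the hidden states aligned with the action positions
        h_act = h[:, -prev_actions.size(1):]
        return self.action_head(h_act), self.object_head(h_act)

model = EpisodicTransformerSketch()
logits_a, logits_o = model(torch.randint(0, 1000, (1, 8)),   # 8 instruction tokens
                           torch.randn(1, 5, 512),           # 5 frame features
                           torch.randint(0, 12, (1, 5)))     # 5 previous actions
```
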
## ChatGPT-Long-Step-Robot-Control
- ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application
- Task: Given multi-step instructions, output a readable sequence of object-manipulation steps.
- Complex prompt engineering with few-shot examples (an illustrative prompt skeleton follows this section).
- Useful as a reference for how to describe the environment and tasks in prompts.
- Basic flow 
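
An illustrative prompt skeleton in the spirit of the paper's few-shot prompting; the wording, skill names, and output format below are assumptions, not the paper's exact prompt.

```python
# Hypothetical prompt template for LLM-based long-step manipulation planning.
PROMPT_TEMPLATE = """You are a robot action planner.
Environment objects: {objects}
Available skills: {skills}

Example:
Instruction: "Put the apple in the fridge."
Output: ["grab(apple)", "open(fridge)", "put(apple, fridge)", "close(fridge)"]

Instruction: "{instruction}"
Output (JSON list of skill calls only):"""

def build_prompt(objects, skills, instruction):
    """Fill the template with the current scene and instruction."""
    return PROMPT_TEMPLATE.format(
        objects=", ".join(objects),
        skills=", ".join(skills),
        instruction=instruction,
    )

print(build_prompt(["apple", "fridge", "table"],
                   ["grab(obj)", "open(obj)", "put(obj, recep)", "close(obj)"],
                   "Move the apple from the table to the fridge."))
```
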
## Language-Models-Zero-Shot-Planners
- Task: Given task descriptions, output actionable steps.
- Evaluated on executability / correctness.
- No fine-tuning required.
- Basic Idea

- Two Models: Planning LM (Causal LLM) / Translation LM (Masked LLM)
- Planning LM: decompose high-level tasks into mid-level action plans
- Translation LM: translate each step into admissible action
- The Translation LM embeds both the generated steps and the admissible actions, then maps each generated step to the most similar admissible action by embedding similarity (see the matching sketch after this section).

- Algorithm

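A minimal sketch of the Translation-LM step described above: embed the admissible actions once, then map each free-form step from the Planning LM to its nearest admissible action by cosine similarity. The sentence-transformers model name and the action list are illustrative assumptions (the paper uses a Sentence-BERT-style translation model).

```python
from sentence_transformers import SentenceTransformer, util

translation_lm = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
admissible_actions = ["walk to kitchen", "open fridge", "grab apple", "close fridge"]
action_embs = translation_lm.encode(admissible_actions, convert_to_tensor=True)

def translate_step(generated_step: str) -> str:
    """Return the admissible action closest to the Planning LM's free-form step."""
    step_emb = translation_lm.encode(generated_step, convert_to_tensor=True)
    scores = util.cos_sim(step_emb, action_embs)[0]
    return admissible_actions[int(scores.argmax())]

print(translate_step("go over to the refrigerator and open it"))  # -> "open fridge"
```
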
## SayCan
- Task: Given a task description, output actionable instructions that are executed by the robot.
- Applies value functions (trained with TD methods) as affordance functions to ensure executability (see the scoring sketch after this section).
- Low-level skills are trained with both BC and RL procedures to obtain language-conditioned policies and value functions.
- RL core: MT-Opt / BC core: BC-Z
- Basic Idea

- Algorithm

- Model architecture
- RL Policy 
- BC Policy 
- Performance

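A minimal sketch of the SayCan decision rule described above: each candidate skill is scored by the product of the LLM's probability that the skill is a useful next step and the value function's affordance estimate. `llm_logprob` and `affordance_value` are placeholder callables, not the paper's models.

```python
import math

def select_skill(instruction, history, state, skills, llm_logprob, affordance_value):
    """Pick the skill maximizing p_LLM(skill | instruction, history) * value(state, skill)."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        # How useful the LLM thinks this skill is as the next step
        p_task = math.exp(llm_logprob(instruction, history, skill))
        # How likely the skill is to succeed from the current state (affordance)
        p_can = affordance_value(state, skill)
        score = p_task * p_can
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```

In the paper this selection is repeated, with the chosen skill appended to the LLM context, until a termination skill is selected.
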
## Interactive-Task-Planning
- Task: Given a user instruction, plan/replan the low-level execution code.
- Hierarchical planning: first generate a high-level plan with an LLM, then combine it with scene information extracted by VLMs to generate low-level execution code (see the skill-API sketch after this section).
- Three Modules
- Visual Scene Grounding
- Converts visual inputs into language using a Vision-Language Model (VLM).
- LLMs for Planning and Execution
- Generates high-level plans and executes lower-level robot skills.
- Robotic Skill Grounding
- Translates robot skills into a functional API, enabling LLMs to dictate robot actions.
- Basic Idea

- Detailed Diagram

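A hedged sketch of the "Robotic Skill Grounding" idea: expose low-level skills as a small functional API that the LLM's plan can call into. The skill names, decorator registry, and plan format are illustrative assumptions, not the paper's actual API.

```python
SKILLS = {}

def skill(fn):
    """Register a robot skill under its function name so plans can reference it."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def pick(obj: str):
    print(f"[robot] picking {obj}")

@skill
def place(obj: str, receptacle: str):
    print(f"[robot] placing {obj} on {receptacle}")

def execute_plan(plan):
    """Run a plan expressed as (skill_name, args) tuples produced by the LLM."""
    for name, args in plan:
        if name not in SKILLS:
            raise ValueError(f"unknown skill: {name}")  # would trigger replanning upstream
        SKILLS[name](*args)

execute_plan([("pick", ("cup",)), ("place", ("cup", "saucer"))])
```
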
## Language-Agent-Tree-Search
## Concept-Fusion
## ReAct
## SayPlan
- Task: Given a scene graph and instructions, perform semantic planning in large-scale scenes.
- Hierarchical Scene Graph

- Two Stages
- Semantic Search (on scene graph)
- Iterative Replanning
- Semantic Search
- Start from the collapsed subgraph that contains only high-level nodes.
- Interact with the LLM to expand/contract nodes until the subgraph is sufficient for the task (see the sketch after this section).
- Chain of Thought (CoT) prompting
- Scene Graph simulator commands:
- collapse(G)
- expand(node_name)
- contract(node_name)
- verify_plan(plan) (used for iterative replanning)
- Iterative Replanning
- The LLM plans only over high-level targets; a classical path planner (Dijkstra) handles pose-level path planning.
- Iteratively corrects the generated plan using simulator feedback, which is appended to the next prompt.
- e.g. pick(banana) -> feedback: "cannot pick banana"
- Algorithm

- Performance

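A hedged sketch of the two stages against a toy scene-graph interface; `llm` and `sim` are stubbed objects whose methods mirror the simulator commands listed above, not SayPlan's actual implementation.

```python
def semantic_search(llm, sim, task):
    """Stage 1: let the LLM grow a task-relevant subgraph from the collapsed graph."""
    graph = sim.collapse()                                 # high-level nodes only
    while True:
        cmd, node = llm.next_graph_command(task, graph)    # e.g. ("expand", "kitchen")
        if cmd == "expand":
            graph = sim.expand(node)
        elif cmd == "contract":
            graph = sim.contract(node)                     # prune nodes judged irrelevant
        else:                                              # LLM signals the subgraph is sufficient
            return graph

def iterative_replanning(llm, sim, task, graph, max_iters=5):
    """Stage 2: plan over high-level targets, repairing the plan from feedback."""
    feedback = ""
    for _ in range(max_iters):
        plan = llm.generate_plan(task, graph, feedback)    # list of skill calls
        ok, feedback = sim.verify_plan(plan)               # e.g. "cannot pick banana"
        if ok:
            return plan                                    # pose-level paths go to Dijkstra
    raise RuntimeError("no executable plan found")
```
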
## SayNav