Omniverse Scene Understanding Notes
====
## Table of Contents
### Data and Semantic Planning:
- [ ] AI2-THOR - main dataset
- [ ] ALFRED - AI2-THOR wrapper with task descriptions and an intro to semantic planning (SP)
- [x] [Episodic Transformer for Vision-and-Language Navigation - SP with transformer model](#Episodic-Transformer)
- [ ] Alexa Arena
### Graph
- [x] GCN, Graph Attention and GraphSAGE
- https://blog.csdn.net/miaorunqing/article/details/106625461
- [ ] GNN problems
### Something with CLIP
- [ ] [ConceptFusion](#Concept-Fusion)
### LLM + Planning
- [x] [Scene description for LLM + Planning](#ChatGPT-Long-Step-Robot-Control)
- [x] [Language Models as Zero-Shot Planners](#Language-Models-Zero-Shot-Planners)
- [x] [SayCan](#SayCan)
- [ ] [ReAct](#ReAct)
- [ ] ProgPrompt
- [x] [Interactive Task Planning with Language Models](#Interactive-Task-Planning)
- [ ] [Language Agent Tree Search Unifies Reasoning, Acting and Planning in Language Models - unites MCTS with LLMs](#Language-Agent-Tree-Search)
- [ ] TOOLCHAIN - haven't read carefully, but looks like the same topic as (7)
#### LLM + Scene Graph (New)
- [x] [SayPlan](#SayPlan)
- [ ] [SayNav](#SayNav)
- [ ] Scene-aware Activity Program Generation with Language Guidance
### Affordance
- [ ] Visual Affordance and Function Understanding: A Survey
- [ ] PartAfford: Part-level Affordance Discovery from 3D Objects
- [ ] AdaAfford: Learning to Adapt Manipulation Affordance for 3D Articulated Objects via Few-shot Interactions
#### New Survey
- [ ] Affordances in Robotic Tasks - A Survey
- [ ] CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection
- [ ] VoxPoser https://voxposer.github.io/
- [ ] DualAfford: Learning Collaborative Visual Affordance for Dual-gripper Manipulation
- [ ] 3D Implicit Transporter for Temporally Consistent Keypoint Discovery
- [ ] Learning Agent-Aware Affordances for Closed-Loop Interaction with Articulated Objects.
- [ ] RLAfford: End-to-End Affordance Learning for Robotic Manipulation
- [ ] PartManip: Learning Cross-Category Generalizable Part Manipulation Policy from Point Cloud Observations
- [ ] Grounding 3D Object Affordance from 2D Interactions in Images
### Human Affordance
- [ ] From 3D Scene Geometry to Human Workspace
- [ ] Putting Humans in a Scene: Learning Affordance in 3D Indoor Environments
- [ ] Binge Watching: Scaling Affordance Learning from Sitcoms
- [ ] PiGraphs: Learning Interaction Snapshots from Observations
### Open vocabulary affordance
- [ ] Open-Vocabulary Affordance Detection in 3D Point Clouds
- [ ] Open-Vocabulary Affordance Detection using Knowledge Distillation and Text-Point Correlation
- [ ] Language-Conditioned Affordance-Pose Detection in 3D Point Clouds
#### VLM as RL Reward
- [ ] Zero-Shot Reward Specification via Grounded Natural Language
- [ ] Can foundation models perform zero-shot task specification for robot manipulation?
- [ ] Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
- [ ] Guiding pretraining in reinforcement learning with large language models
#### Open-vocab Detection/Segmentation
https://github.com/Hedlen/awesome-segment-anything/blob/main/README.md
### Other Related Work
- Generative Agents: Interactive Simulacra of Human Behavior
- OpenGVLab/GITM https://github.com/OpenGVLab/GITM/blob/main/GITM.pdf
- Voyager
- Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds
- https://github.com/GT-RIPL/Awesome-LLM-Robotics
- https://zhuanlan.zhihu.com/p/541492104
- MINEDOJO: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
- Cognitive Architectures for Language Agents
- Octopus: Embodied Vision-Language Programmer from Environmental Feedback https://choiszt.github.io/Octopus/
- EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought https://embodiedgpt.github.io/
- Otter: A multi-modal model with in-context instruction tuning.
- JARVIS-1: Open-world Multi-task Agents with Memory-Augmented Multimodal Language Models
- A Survey of Graph Meets Large Language Model: Progress and Future Directions
---
## Episodic-Transformer
- Task: Given instructions and observations, output actions and objects.
- Uses a causal transformer to fuse information from the different modalities (text / images / actions); a minimal sketch follows this section.
- Trained with behavior cloning on expert trajectories.
- Pretrains the language encoder by translating natural language into a synthetic language.
- Synthetic language, e.g. "Put Apple Table" / "Goto Bed".
- The expert path in the ALFRED environment is defined with the Planning Domain Definition Language (PDDL), which is taken as the synthetic language.
- Model Architecture

- Performance

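A minimal PyTorch sketch of the core idea (not the official ET implementation): embed language, visual, and action tokens separately, concatenate them into one episode sequence, and run a causally masked transformer that predicts the next action and interacted object. All layer sizes, and the use of a single full causal mask, are simplifying assumptions.

```python
import torch
import torch.nn as nn

class EpisodicTransformerSketch(nn.Module):
    """Toy multimodal causal transformer in the spirit of ET (sizes are illustrative)."""
    def __init__(self, vocab_size=1000, n_actions=12, n_objects=80, d=256):
        super().__init__()
        self.text_emb = nn.Embedding(vocab_size, d)   # instruction tokens
        self.frame_proj = nn.Linear(512, d)           # pre-extracted visual features
        self.action_emb = nn.Embedding(n_actions, d)  # previous actions
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d, n_actions)    # next action class
        self.object_head = nn.Linear(d, n_objects)    # interacted object class

    def forward(self, text_ids, frame_feats, prev_actions):
        # Concatenate modalities into one episode sequence: [text | frames | actions]
        x = torch.cat([self.text_emb(text_ids),
                       self.frame_proj(frame_feats),
                       self.action_emb(prev_actions)], dim=1)
        # Causal mask: each position attends only to earlier positions.
        # (ET's actual masking is modality-aware; a full causal mask is a simplification.)
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.encoder(x, mask=mask)
        # Predict from the hidden states aligned with the action positions
        h_act = h[:, -prev_actions.size(1):]
        return self.action_head(h_act), self.object_head(h_act)

model = EpisodicTransformerSketch()
logits_a, logits_o = model(torch.randint(0, 1000, (1, 8)),   # 8 instruction tokens
                           torch.randn(1, 5, 512),           # 5 frame features
                           torch.randint(0, 12, (1, 5)))     # 5 previous actions
```
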
## ChatGPT-Long-Step-Robot-Control
- ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application
- Task: Given multi-step instructions, output a readable sequence of object-manipulation steps.
- Complex prompt engineering with few-shot examples (an illustrative prompt skeleton follows this section).
- Useful as a reference for how to describe the environment and tasks in prompts.
- Basic flow 
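
An illustrative prompt skeleton in the spirit of the paper's few-shot prompting; the wording, skill names, and output format below are assumptions, not the paper's exact prompt.

```python
# Hypothetical prompt template for LLM-based long-step manipulation planning.
PROMPT_TEMPLATE = """You are a robot action planner.
Environment objects: {objects}
Available skills: {skills}

Example:
Instruction: "Put the apple in the fridge."
Output: ["grab(apple)", "open(fridge)", "put(apple, fridge)", "close(fridge)"]

Instruction: "{instruction}"
Output (JSON list of skill calls only):"""

def build_prompt(objects, skills, instruction):
    """Fill the template with the current scene and instruction."""
    return PROMPT_TEMPLATE.format(
        objects=", ".join(objects),
        skills=", ".join(skills),
        instruction=instruction,
    )

print(build_prompt(["apple", "fridge", "table"],
                   ["grab(obj)", "open(obj)", "put(obj, recep)", "close(obj)"],
                   "Move the apple from the table to the fridge."))
```
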
## Language-Models-Zero-Shot-Planners
- Task: Given task descriptions, output actionable steps.
- Evaluated on executability / correctness.
- No fine-tuning required.
- Basic Idea

- Two Models: Planning LM (Causal LLM) / Translation LM (Masked LLM)
- Planning LM: decompose high-level tasks into mid-level action plans
- Translation LM: translate each step into admissible action
- The Translation LM embeds both the generated steps and the admissible actions, then maps each generated step to the most similar admissible action by embedding similarity (see the matching sketch after this section).

- Algorithm

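A minimal sketch of the Translation-LM step described above: embed the admissible actions once, then map each free-form step from the Planning LM to its nearest admissible action by cosine similarity. The sentence-transformers model name and the action list are illustrative assumptions (the paper uses a Sentence-BERT-style translation model).

```python
from sentence_transformers import SentenceTransformer, util

translation_lm = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
admissible_actions = ["walk to kitchen", "open fridge", "grab apple", "close fridge"]
action_embs = translation_lm.encode(admissible_actions, convert_to_tensor=True)

def translate_step(generated_step: str) -> str:
    """Return the admissible action closest to the Planning LM's free-form step."""
    step_emb = translation_lm.encode(generated_step, convert_to_tensor=True)
    scores = util.cos_sim(step_emb, action_embs)[0]
    return admissible_actions[int(scores.argmax())]

print(translate_step("go over to the refrigerator and open it"))  # -> "open fridge"
```
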
## SayCan
- Task: Given a task description, output actionable instructions that are executed by the robot.
- Applies value functions (trained with TD methods) as affordance functions to ensure executability (see the scoring sketch after this section).
- Low-level skills are trained with both BC and RL procedures to obtain language-conditioned policies and value functions.
- RL core: MT-Opt / BC core: BC-Z
- Basic Idea

- Algorithm

- Model architecture
- RL Policy 
- BC Policy 
- Performance

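A minimal sketch of the SayCan decision rule described above: each candidate skill is scored by the product of the LLM's probability that the skill is a useful next step and the value function's affordance estimate. `llm_logprob` and `affordance_value` are placeholder callables, not the paper's models.

```python
import math

def select_skill(instruction, history, state, skills, llm_logprob, affordance_value):
    """Pick the skill maximizing p_LLM(skill | instruction, history) * value(state, skill)."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        # How useful the LLM thinks this skill is as the next step
        p_task = math.exp(llm_logprob(instruction, history, skill))
        # How likely the skill is to succeed from the current state (affordance)
        p_can = affordance_value(state, skill)
        score = p_task * p_can
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```

In the paper this selection is repeated, with the chosen skill appended to the LLM context, until a termination skill is selected.
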
## Interactive-Task-Planning
- Task: Given a user instruction, plan/replan the low-level execution code.
- Hierarchical planning: first generate a high-level plan with an LLM, then combine it with scene information extracted by VLMs to generate low-level execution code (see the skill-API sketch after this section).
- Three Modules
- Visual Scene Grounding
- Converts visual inputs into language using a Vision-Language Model (VLM).
- LLMs for Planning and Execution
- Generates high-level plans and executes lower-level robot skills.
- Robotic Skill Grounding
- Translates robot skills into a functional API, enabling LLMs to dictate robot actions.
- Basic Idea

- Detailed Diagram

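A hedged sketch of the "Robotic Skill Grounding" idea: expose low-level skills as a small functional API that the LLM's plan can call into. The skill names, decorator registry, and plan format are illustrative assumptions, not the paper's actual API.

```python
SKILLS = {}

def skill(fn):
    """Register a robot skill under its function name so plans can reference it."""
    SKILLS[fn.__name__] = fn
    return fn

@skill
def pick(obj: str):
    print(f"[robot] picking {obj}")

@skill
def place(obj: str, receptacle: str):
    print(f"[robot] placing {obj} on {receptacle}")

def execute_plan(plan):
    """Run a plan expressed as (skill_name, args) tuples produced by the LLM."""
    for name, args in plan:
        if name not in SKILLS:
            raise ValueError(f"unknown skill: {name}")  # would trigger replanning upstream
        SKILLS[name](*args)

execute_plan([("pick", ("cup",)), ("place", ("cup", "saucer"))])
```
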
## Language-Agent-Tree-Search
## Concept-Fusion
## ReAct
## SayPlan
- Task: Given a scene graph and instructions, perform semantic planning in large-scale scenes.
- Hierarchical Scene Graph

- Two Stages
- Semantic Search (on scene graph)
- Iterative Replanning
- Semantic Search
- Start from the collapsed subgraph that contains only high-level nodes.
- Interact with the LLM to expand/contract nodes until the subgraph is sufficient for the task (see the sketch after this section).
- Chain of Thought (CoT) prompting
- Scene Graph simulator commands:
- collapse(G)
- expand(node_name)
- contract(node_name)
- verify_plan(plan) (used for iterative replanning)
- Iterative Replanning
- The LLM plans only over high-level targets; a classical path planner (Dijkstra) handles pose-level path planning.
- Iteratively corrects the generated plan using simulator feedback, which is appended to the next prompt.
- e.g. pick(banana) -> feedback: "cannot pick banana"
- Algorithm

- Performance

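A hedged sketch of the two stages against a toy scene-graph interface; `llm` and `sim` are stubbed objects whose methods mirror the simulator commands listed above, not SayPlan's actual implementation.

```python
def semantic_search(llm, sim, task):
    """Stage 1: let the LLM grow a task-relevant subgraph from the collapsed graph."""
    graph = sim.collapse()                                 # high-level nodes only
    while True:
        cmd, node = llm.next_graph_command(task, graph)    # e.g. ("expand", "kitchen")
        if cmd == "expand":
            graph = sim.expand(node)
        elif cmd == "contract":
            graph = sim.contract(node)                     # prune nodes judged irrelevant
        else:                                              # LLM signals the subgraph is sufficient
            return graph

def iterative_replanning(llm, sim, task, graph, max_iters=5):
    """Stage 2: plan over high-level targets, repairing the plan from feedback."""
    feedback = ""
    for _ in range(max_iters):
        plan = llm.generate_plan(task, graph, feedback)    # list of skill calls
        ok, feedback = sim.verify_plan(plan)               # e.g. "cannot pick banana"
        if ok:
            return plan                                    # pose-level paths go to Dijkstra
    raise RuntimeError("no executable plan found")
```
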
## SayNav