# ICLR 2024 Rebuttal
## Reviewer 2 final stage
Dear reviewer,
Thank you for your comments. We would like to respond to your concern about the insufficient use of the LLM in both the training and deployment stages.
(a) In the training stage, the reviewer is concerned that the use of human demonstrations dilutes the contribution of the LLM. While we agree that an LLM is a good human proxy, we do not think it is as good a proxy for the continuous aspects of a task as it is for the discrete aspects. In particular, we do not think that prompting an LLM to generate control signals is a principled way of doing continuous control. Our paper advocates the view that for hybrid systems (which is typical of most long-horizon manipulation tasks), the best use of an LLM is in the discrete decision-making domain. Translating the LLM's discrete decisions to the continuous domain inevitably requires a grounding classifier that maps the continuous domain to the discrete domain. To address the reviewer's concern that the effort required for humans to generate demonstrations justifies asking humans to design the feasibility matrix along the way, we want to stress that we only use a few human demonstrations (fewer than 20), which is not a significant burden, and the majority of the trajectories are generated through synthetic perturbations without human involvement. Without our method of using human demonstrations, manual engineering of the classifiers would be required to leverage the LLM's discrete plan, which might demand even more human effort. Lastly, designing a feasibility matrix based on manipulation modes is non-trivial for complex tasks, and it cannot be assumed that any human, especially one without knowledge of mode families, can design the correct feasibility matrix.
(b) In the deployment stage, the reviewer is concerned that the LLM is not being used. In fact, the LLM is used to plan and replan at the discrete level. This is shown both in Figure 7(f) and by the fact that our simulation and real-world robot systems can replan on the fly when perturbations derail the original LLM plan. The learned classifier continuously monitors the continuous states of the system and communicates the corresponding discrete modes the system is undergoing to the LLM for any possible replanning on the fly. We will make these points clearer in the final draft and discuss relevant TAMP literature. Lastly, to address the reviewer's suggestion of using search-based planning, we want to stress that this is precisely the benefit and motivation of recovering mode boundaries: we can do closed-loop planning that guarantees robustness to perturbations. We also show the suggested planning performance with our learned modes in the 2d polygon domain, as detailed on our website. However, for tasks without a known dynamics model to plan with (e.g., the robosuite tasks and our real-world tasks), we can only perform closed-loop planning at the discrete level via the LLM; the continuous execution requires imitating demonstrations in this case.
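As a concrete illustration of this deployment loop, below is a minimal sketch (purely illustrative; the helper names `classifier`, `llm_replan`, and `mode_policies` are hypothetical placeholders, not our released API) of how the learned grounding classifier can monitor the continuous state and trigger LLM replanning when a perturbation pushes the system off the planned mode sequence:

```python
def closed_loop_execute(env, classifier, mode_policies, llm_replan, plan):
    """Sketch of deployment: the classifier maps continuous states to modes,
    and the LLM replans at the discrete level when a perturbation derails the plan."""
    state = env.reset()
    idx = 0                                       # position in the planned mode sequence
    while not env.task_done():
        mode = classifier(state)                  # ground continuous state -> discrete mode
        if mode == plan[min(idx + 1, len(plan) - 1)]:
            idx += 1                              # reached the next planned mode
        elif mode != plan[idx]:
            # Perturbation pushed the system into an unplanned mode:
            # ask the LLM for a new discrete plan starting from the current mode.
            plan, idx = llm_replan(current_mode=mode, goal_mode=plan[-1]), 0
        action = mode_policies[plan[idx]](state)  # continuous execution via imitation
        state = env.step(action)
    return state
```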
## Reviewer 2 second nudge
Dear reviewer,
**[Real-world robot experiments]** We have added real-world robot experiments that showcase how our grounding classifier can be implemented on real-world setups and can take pixel inputs in addition to trajectory inputs. Please find these videos on our website at https://sites.google.com/view/grounding-plans/home#h.apnj3kgovccj. Thank you for taking the time to review our response!
## Reviewer 1
Dear reviewer Rhgd,
Thanks for your detailed review and feedback! Please find our response below:
**[Comparison with LLM-based planning methods such as text2motion/VIMA]** Both text2motion and VIMA solve a different grounding problem than ours and thus would not be the best baselines. Specifically, there are at least three kinds of grounding mentioned in the literature: **1. Task grounding** - using language [1] or multi-modal tokens [2] as inputs to an imitation policy to specify tasks/goals. **2. Symbolic grounding** - predicting the boolean values of symbolic states (e.g., In(can, gripper)=True, On(marbles, spoon)=False, etc.) [3, 4, 5] **3. Action grounding** - mapping an LLM plan to predefined primitive actions [6, 7, 8]. Text2Motion performs action grounding and VIMA performs task grounding. In contrast, our work solves symbolic grounding, where we learn classifiers that map continuous states/observations to discrete modes proposed by the LLM. Since we assume each mode is associated with a single policy learned from segmented demonstrations, action grounding is also achieved as a by-product of learning the classifier, which maps the LLM-planned mode sequence to a sequence of policy rollouts. That being said, we agree with the reviewer's point about adding more baselines, so we are adding a baseline that achieves symbolic grounding through clustering in the 2d polygon domain (see the ablation studies for 2D polygon, linked [here on the website](https://sites.google.com/view/grounding-plans/home#h.8wpp2pzbin0p)) and similarity-based trajectory segmentation in the robosuite domain (see the mode classification comparison table and examples for the new baseline, [here on the website](https://sites.google.com/view/grounding-plans/home#h.7l0jx2g9td4d)). To clarify the grounding, we have created a **new figure: Fig 7** that we include in the revised paper and that is currently on the website.
**[Visualization of loss function and state $s$ for classifier]**
We have created a **new figure -- Fig 8** that we include in the revised paper and is currently on the website in [this section](https://sites.google.com/view/grounding-plans/home#h.pmzm9p2g4j7v) that illustrates both how we use LLMs to generate the classifier state and how this information is used downstream to compute the classifier loss.
**[Does our approach require a simulator to generate success labels?]** A simulator may be required for methods that need per-step dense labeling. Since our approach only requires a sparse label at the end of each trajectory, engineering a task success classifier for real-world tasks is feasible, and hence simulators are not required. For example, for a scooping task where the goal is to transport marbles from one bowl to another, one can engineer a classifier that detects whether the perturbed execution still manages to transport at least one marble to the goal location (i.e., the second bowl).
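As an illustration, here is a minimal sketch of such an end-of-trajectory success classifier for the scooping example (purely illustrative; the marble positions and goal-bowl geometry are assumed to come from some tracking system, and all names here are hypothetical):

```python
import numpy as np

def scooping_success(marble_positions, goal_bowl_center, goal_bowl_radius=0.06):
    """Sparse end-of-trajectory label: success if at least one marble
    ends up inside the goal bowl (a disk around its center in the x-y plane)."""
    marbles = np.asarray(marble_positions)   # (N, 3) final marble positions
    center = np.asarray(goal_bowl_center)    # (3,) goal bowl center
    dists = np.linalg.norm(marbles[:, :2] - center[:2], axis=1)
    return bool(np.any(dists < goal_bowl_radius))
```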
**[How does MMLP work in the real-world setting?]** We are preparing real-world experiments and will show results on the website: https://sites.google.com/view/grounding-plans/home.
**[Open-sourcing plans?]** Yes, we will open-source the code soon!
**[Will the current perturbations work for more complex tasks?]** While figuring out the best perturbation strategies for tasks of varying complexity is not the main focus of this paper, we agree that designing perturbations sufficient to generate counterfactual outcomes is a requirement for our method to work. In future work, we will investigate how to prompt the LLM to generate different perturbation strategies for each task.
[1] Language Conditioned Imitation Learning over Unstructured Data
[2] VIMA: General Robot Manipulation with Multimodal Prompts
[3] Grounding Predicates through Actions
[4] From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning
[5] Learning Temporal Logic Formulas from Suboptimal Demonstrations: Theory and Experiments
[6] SayCan: Grounding Language in Robotic Affordances
[7] Skill induction and planning with latent language
[8] Text2Motion: From Natural Language Instructions to Feasible Plans
## Reviewer 2
Dear reviewer Mzap,
Thanks for your detailed review and feedback! Please find our response below:
**[Relevance to LLM]** The major concern of the reviewer is whether the LLM is integral to our framework. The reviewer acknowledges that we sufficiently demonstrate the robustness of our mode-based policy to external perturbations, but argues that we do not show enough evidence that we solve the grounding problem. We respectfully disagree, because solving the grounding problem is a prerequisite for our approach to robustify the policy. In the following, we first clarify what we mean by grounding, then explain the LLM's relevance in terms of both the intent and the implementation of this work. Lastly, we present more evidence on how not using the LLM adversely affects how well the grounding can be learned.
**[What grounding problem are we solving?]** There are at least three kinds of grounding mentioned in the literature: **1. Task grounding** - using language [1] or multi-modal tokens [2] as inputs to an imitation policy to specify tasks/goals. **2. Symbolic grounding** - predicting the boolean values of symbolic states (e.g., In(can, gripper)=True, On(marbles, spoon)=False, etc.) [3, 4, 5] **3. Action grounding** - mapping an LLM plan to predefined primitive actions [6, 7, 8]. In our work, by grounding we do not mean task grounding but rather symbolic grounding, where we learn classifiers that map continuous states/observations to discrete modes proposed by the LLM. Since we assume each mode is associated with a single policy learned from segmented demonstrations, action grounding is also achieved as a by-product of learning the classifier, which maps the LLM-planned mode sequence to a sequence of policy rollouts. We have created a **new figure -- Fig 7** that we include in the revised paper and that is currently on the website, [linked here](https://sites.google.com/view/grounding-plans/home#h.xly3b8ysna28), which illustrates our overall method, including a visual description of how we ground the LLM knowledge in modes.
**[The intent is to enable LLM-based discrete planning]** In order to use an LLM for high-level planning, existing work [6, 8] typically assumes that symbolic classifiers and/or action primitives are given, i.e., the symbolic/action grounding has been manually engineered. What remains for this top-down approach is to search for a sequence of actions (grounded in language) with a high overall feasibility/success rate when applying these actions in the physical space. In contrast, **the intent of our work is to reduce human involvement in engineering the classifiers and the grounded action primitives when using an LLM for discrete planning**. Specifically, we take a bottom-up approach to discover action primitives grounded in demonstrations by learning a classifier that maps continuous demonstrations to the discrete LLM-proposed mode sequence. Since these skills are segmented from successful demonstrations and grounded in modes/physical space, our framework does not need humans to define them a priori or to calculate the feasibility of executing them in the physical space, as required in the top-down approach. Note that we do not claim to completely remove humans from the loop, as our framework does require humans to provide a few demonstrations as well as to prompt the LLM to generate task-relevant features and the feasibility matrix. But we do believe that the LLM is best used for discrete planning in the symbolic space and that there is a role for humans in providing the continuous signals grounded in the physical space (rather than prompting the LLM to produce continuous components such as control signals). Our work enables this view by learning the mapping from the continuous physical space to the discrete symbolic space.
**[The intent requires solving the robust policy learning task to evaluate grounding]** The reviewer might be wondering why we tackle the problem of learning a robust mode-conditioned policy from a few demonstrations if the intent is to learn a grounding classifier. To evaluate the utility of grounding in the physical space, we need to probe the boundary of the learned classifier. While visualizing boundaries is easy in the 2d polygon domain, it is difficult in the high-dimensional manipulation space. For manipulation, mode families are a useful construct for achieving planning success [9, 10]. Mode families have boundaries precisely because, without them, motion planning cannot guarantee success. Therefore, **evaluating the effectiveness of the classifier's prediction boundary can be proxied by evaluating whether the classified modes increase the execution success rate**, especially under external perturbations. The consideration of external perturbations is distinct from prior work that uses pre-defined high-level actions such as "walk to \<PLACE\>" [11], "(pick ball)" [12], or "open(obj)" [7]. These high-level actions are sufficient if there are no external perturbations to derail the execution. Otherwise, decomposing these actions further into manipulation modes grounded in the physical space is necessary for (1) replanning/robustifying the actions against adversarial perturbations as well as (2) explaining why some but not all perturbations will cause execution failures. Inspired by this idea, we devise a fully differentiable end-to-end explanation pipeline that predicts whether a perturbed trajectory is successful or not. Only when the grounding classifier in the pipeline has learned the correct mode partitions can the overall pipeline differentiate all successful trajectories from failed ones. Our explanation-based learning approach is similar to analysis-by-synthesis in other domains: in NeRF, for example, only when an accurate 3d representation has been learned can the fully differentiable volumetric rendering pipeline generate images that match the ground truth from all views.
**[The implementation requires the LLM's common sense knowledge]** The main novel contribution is the method with which we learn the grounding classifier rather than how we implement the mode-based policy that improves robustness to perturbations. Our implementation requires the LLM to provide common sense knowledge about the discrete task structure that is complementary to the low-level continuous demonstrations. Specifically, (1) the LLM informs how many modes there are and generates a matrix describing the feasibility of transitioning from one mode to another. Without knowing the number of modes, trajectory clustering or segmentation is an NP-hard problem [13, 14]. (2) The LLM reduces the dimensionality of the feature space and improves data efficiency, as separately investigated in [15, 16]. For example, the state of distractor objects is not useful for learning a classifier that detects different modes in the demonstrations of picking up a can. Including distractor objects' states as inputs requires significantly more counterfactual data to learn a classifier that ignores them. (3) The LLM is integral to replanning at test time when there are perturbations. The utility of the learned grounding operator lies in its ability to explain when a perturbation derails a plan and in its capability to map the replanned discrete mode sequence from the LLM to a continuous policy rollout. We have created a **new figure -- Fig 8** that we include in the revised paper and that is currently on the website, [linked here](https://sites.google.com/view/grounding-plans/home#h.7l0jx2g9td4d), that illustrates both how we use LLMs to generate the classifier state and how this information is used downstream to compute the classifier loss.
**[Evidence that we use LLM knowledge to successfully learn grounding]** We present the following results on an anonymous project website https://sites.google.com/view/grounding-plans/home.
1. To show we have successfully learned grounding in the 2d polygon domain, we compare our learned mode classifications with the ground truth on demonstration trajectories (the reported scores are therefore trajectory segmentation accuracies). The following table shows the comparison with an ablated model (no counterfactual data) and a simple trajectory segmentation baseline based on KMeans++ clustering.
| Mode Classifier | 3-Mode | 4-Mode | 5-Mode |
| ----------- | ----------- | ----------- | ----------- |
| Ours | **0.990** | **0.967** | **0.970** |
| No Counterfactual Data | 0.604 | 0.464 | 0.831 |
| Trajectory Segmentation Baseline | 0.644 | 0.554 | 0.641 |
2. In Table 1 of the paper, the MMLP-Stable\(p\) and MMLP-Stable rows show that we can achieve a near 100% success rate of reaching the final mode, with or without perturbations, by planning in the learned mode partitions.
3. To show we have successfully learned grounding in the robosuite environments, we show figures on the website of segmenting the trajectories into modes similar to the ground truth that we manually designed. We also report (below and on the website) the average mode classification accuracy (compared to ground truth) for our method. Additionally, for each robosuite task, we show videos of the BC baseline's performance with and without perturbations and of how our mode-based imitation policy better recovers from perturbations, as indirect evidence that we have learned a useful mode classification.
| Mode Classifier | Can | Lift | Square Peg |
| ----------- | ----------- | ----------- | ----------- |
| Ours (LLM-reduced State Space) | **0.83** | **0.83** | **0.67** |
| Full State Space | 0.55 | 0.70 | 0.57 |
| Trajectory Segmentation Baseline | 0.66 | 0.56 | 0.54 |
4. To show the importance of prompting the LLM for the correct feasibility matrix, we run 2d polygon experiments where the 3-mode and 4-mode tasks are given a generic sequential 5-mode feasibility matrix $F^5$ and the 5-mode task is given a 3-mode feasibility matrix $F^3$. We show that the resulting learned mode boundaries do not match the ground truth. Taking the second mode in the 3-mode task (the first polygon) as an example, while our model recovers 0.946 of that mode's region (measured by F1 score), the baseline completely mis-identifies it (F1 = 0). This will cause issues when the robot follows mode sequences to recover from perturbations. See also [this section](https://sites.google.com/view/grounding-plans/home#h.12fce10e77be29fa_3) on our website for qualitative evaluations.
5. To show the importance of prompting the LLM to reduce the feature set, we use the default full set of features as the state representation to learn the grounding classifier for the robosuite tasks. In the table above, we also provide this method's mode classification accuracy compared to ground truth and show that for all three robosuite tasks it is lower than our method with the LLM-generated features.
6. We generate feasibility matrices by querying the LLM for the pairwise connectivity between modes (see our response to "How is the feasibility matrix generated from LLMs and is it being kept fixed for each task" below for details). Similarly, we prompt the LLM to select a subset of features for training the mode classifiers and policies. We include the prompts and LLM responses for the robosuite tasks on our website.
We hope these results on the website are sufficient evidence that our approach has learned grounding and that the LLM is integral to this process. One could argue that our method does not need an LLM because a human can easily generate the feasibility matrix or the reduced feature set for a task. But this argument exactly supports why the LLM is a good proxy for a human model, as they can do the same task interchangeably. And just like other LLM-based embodied AI works [6, 7, 8, 11, 12, 15, 16], we are using the LLM to reduce the work humans are good at, rather than the other way around.
**[Whether we use manually labelled modes and features to learn the grounding]** There is a misunderstanding: the reviewer thinks we "use manually-defined modes instead of LLM-generated modes and use manually-labeled features instead of relying on automatic mechanism for grounding". In fact, we only use manually-defined modes and features to construct a ground-truth mode classifier, which is used to evaluate our mode classifier *learned* using the LLM-generated feasibility matrix and feature subset, as described above and in Figures 7 and 8.
**[Improve figure readability]**
We have made several changes regarding figures in our updated manuscript. First, we introduced two new figures (Fig 7/8) that provide a clearer high-level explanation of the overall method as well as of how the LLM is used in the grounding. You can view these figures at the top of our new website (https://sites.google.com/view/grounding-plans/home). They are intended to replace the existing Figure 1 and Figure 3, about whose interpretability the reviewer had concerns. For Figures 4/5, we have updated the figure captions per the reviewer's request in the new paper draft. We have included versions of the figures with updated captions on the website (very bottom) and plan to make further visual changes to Figure 4 to improve interpretability.
**[How are the keypoints and features being grounded in the demonstrations]** In the robosuite environments, the demonstration state consists of predefined object states corresponding to the keypoints shown in Figure 8(a). We add a full list of available keypoints to the prompt when querying the LLM to find a subset of features relevant to a task. More prompting examples can be found on the website in [this section](https://sites.google.com/view/grounding-plans/home#h.743770ss0gzs).
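For illustration, a minimal sketch of how such a feature-selection prompt could be assembled and parsed (the `query_llm` call and the prompt wording are hypothetical placeholders, not our actual prompts, which are listed on the website):

```python
def select_features(task_description, keypoints, query_llm):
    """Ask the LLM to pick the task-relevant subset of available keypoint features."""
    prompt = (
        f"Task: {task_description}\n"
        f"Available keypoint features: {', '.join(keypoints)}\n"
        "Return a comma-separated list of only the features relevant to this task."
    )
    response = query_llm(prompt)  # e.g., "gripper_pos, can_pos"
    return [f.strip() for f in response.split(",") if f.strip() in keypoints]

# Hypothetical usage for a robosuite-style Can task:
# select_features("pick up the can and place it in the bin",
#                 ["gripper_pos", "can_pos", "bin_pos", "distractor_pos"], query_llm)
```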
**[How is the feasibility matrix generated from LLMs and is it being kept fixed for each task?]** To generate the feasibility matrix, we first prompt the LLM to generate the connectivity between each pair of modes. Next, we define the feasibility matrix entry $F[i, j]$ as zero if a direct transition from mode $i$ to mode $j$ is feasible, and otherwise as a negative value whose magnitude is the number of intermediate modes skipped (computed from the shortest path in the mode connectivity graph). In the simplest case where all mode transitions form a chain (which is true for our 2d polygon and robosuite tasks), $F[i, j] = 0$ for all $j \le i + 1$ (i.e., at mode $i$ we can transition to the next mode $i+1$ or to any previous mode $j < i$), and $F[i, j] < 0$ for all transitions that skip modes (e.g., a direct transition from mode 1 to mode 3). After generating the feasibility matrix based on the task description (e.g., lift a block from the table), we fix it for the task. The feasibility matrix is interpretable; therefore, human experts can also modify this matrix manually.
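To make the chain case concrete, here is a minimal sketch of how such a feasibility matrix could be constructed (a sketch under the assumptions above, not our exact implementation; the penalty convention follows the chain description, with magnitude equal to the number of skipped modes):

```python
import numpy as np

def chain_feasibility_matrix(num_modes):
    """Feasibility matrix for a sequential (chain) task:
    F[i, j] = 0 for self-transitions, the next mode, and any previous mode;
    F[i, j] < 0 when the transition skips modes, with magnitude = #modes skipped."""
    F = np.zeros((num_modes, num_modes))
    for i in range(num_modes):
        for j in range(num_modes):
            if j > i + 1:                # forward transition that skips modes
                F[i, j] = -(j - i - 1)
    return F

print(chain_feasibility_matrix(4))
# [[ 0.  0. -1. -2.]
#  [ 0.  0.  0. -1.]
#  [ 0.  0.  0.  0.]
#  [ 0.  0.  0.  0.]]
```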
[1] Language Conditioned Imitation Learning over Unstructured Data
[2] VIMA: General Robot Manipulation with Multimodal Prompts
[3] Grounding Predicates through Actions
[4] From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning
[5] Learning Temporal Logic Formulas from Suboptimal Demonstrations: Theory and Experiments
[6] SayCan: Grounding Language in Robotic Affordances
[7] Skill induction and planning with latent language
[8] Text2Motion: From Natural Language Instructions to Feasible Plans
[9] Multi-Modal Motion Planning in Non-Expansive Spaces
[10] Integrated Task and Motion Planning
[11] Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
[12] PDDL PLANNING WITH PRETRAINED LARGE LANGUAGE MODELS
[13] NP-hardness of Euclidean sum-of-squares clustering
[14] Segmentation of Trajectories on Nonmonotone Criteria
[15] ELLA: Exploration through Learned Language Abstraction
[16] Learning with Language-Guided State Abstractions
## Reviewer 3
Dear reviewer i3As,
Thanks for your detailed review and feedback! We will first elaborate on our framework and method using two new figures and then respond to your individual questions.
**[What grounding problem are we solving?]** There are at least three kinds of grounding mentioned in the literature: **1. Task grounding** - using language [1] or multi-modal tokens [2] as inputs to an imitation policy to specify tasks/goals. **2. Symbolic grounding** - predicting the boolean values of symbolic states (e.g., In(can, gripper)=True, On(marbles, spoon)=False, etc.) [3, 4, 5] **3. Action grounding** - mapping an LLM plan to predefined primitive actions [6, 7, 8]. In our work, by grounding we do not mean task grounding but rather symbolic grounding, where we learn classifiers that map continuous states/observations to discrete modes proposed by the LLM. Since we assume each mode is associated with a single policy learned from segmented demonstrations, action grounding is also achieved as a by-product of learning the classifier, which maps the LLM-planned mode sequence to a sequence of policy rollouts.
We have created a **new figure -- Fig 7** that we include in the revised paper and is currently on the website in [this section](https://sites.google.com/view/grounding-plans/home#h.xly3b8ysna28) that illustrates our overall method including a visual description of how we ground the LLM knowledge in modes.
**[Clarification of our feasibility matrix and transition loss]** We appreciate the reviewer spotting some inconsistency in our original writing and figure 3. In response, we made a new figure to clarify the feasibility matrix and explain the transition loss in terms of success loss and failure loss.
We have created a **new figure -- Fig 8** that we include in the revised paper and is currently on the website in [this section](https://sites.google.com/view/grounding-plans/home#h.pmzm9p2g4j7v) that illustrates both how we use LLMs to generate the classifier state and how this information is used downstream to compute the classifier loss.
**[How do successful trajectories contribute to the loss term?]** As shown in Figure 8\(c\), when the classifier is not yet well trained, it might predict invalid transitions for successful trajectories and consequently incur a success loss. We plan to update the loss function in the manuscript and upload a new version soon.
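For concreteness, here is a minimal sketch of how the success and failure losses could be computed from the classifier's mode distributions and the feasibility matrix (a sketch based on the bilinear form $\phi(s_i)\,F\,\phi(s_{i+1})$ illustrated in Figure 8; the exact weighting and implementation details in the paper may differ):

```python
import torch

def transition_costs(mode_probs, F):
    """mode_probs: (T, K) per-step mode distributions phi(s_t); F: (K, K) feasibility matrix.
    Returns the expected transition cost phi(s_t)^T F phi(s_{t+1}) for each consecutive pair;
    feasible transitions cost 0, infeasible ones are negative."""
    return torch.einsum("tk,kl,tl->t", mode_probs[:-1], F, mode_probs[1:])

def success_loss(mode_probs, F):
    """Successful trajectories: every consecutive transition should be feasible,
    so any negative (infeasible) expected cost is penalized."""
    return (-transition_costs(mode_probs, F)).clamp(min=0).sum()

def failure_loss(mode_probs, F, margin=0.5):
    """Failure trajectories: at least one transition should be infeasible,
    so the most negative transition cost is pushed below a margin."""
    return (margin + transition_costs(mode_probs, F).min()).clamp(min=0)
```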
**[Is this work merely an alternative to unsupervised trajectory segmentation?]** While we agree with the reviewer that the word "groundbreaking" might be overclaiming, we respectfully disagree that our method is merely a trajectory segmentation method. In our work, the capability to segment trajectories is a by-product of having learned the grounding classifier. The typical goal of trajectory segmentation in the LfD literature is to discover reusable skills that can be activated open-loop [9, 10, 11]; that goal is not concerned with performance degradation under external perturbations. In contrast, our goal is to learn the boundaries of the mode abstractions that define the valid domains of the discovered skills, as shown in the new Figure 7(d), so that the learned skills can be planned in a closed-loop fashion robust to external perturbations. Given the differences in motivation, our framework has two significant technical differences:
1. First, since our grounding operator needs to classify/segment not only the demonstrated trajectories but also regions of the state space that have not been demonstrated (i.e., it needs to find mode boundaries), we need to generate additional data covering the state space beyond the demonstration regions. Additionally, in order to learn the boundaries we need executions that succeed by crossing feasible boundaries as well as executions that fail by crossing infeasible ones. This additional data generation stage is not considered in the typical trajectory segmentation setting, which works only with successful demonstrations and not failures. The significantly larger scale of counterfactual data might render non-end-to-end systems, such as those using HMMs or probabilistic inference, impractical [12].
2. Second, the aforementioned segmentation methods are not driven by the need to predict terminal task failures/successes and hence do not necessarily break demonstrations down into the minimal abstractions with which planning success can be guaranteed despite perturbations. The key insight is that mode families are a useful construct for achieving planning success guarantees. The boundaries between consecutive mode families have to separate the configuration space in a particular way, because otherwise motion planning cannot guarantee success. Consequently, we can use mode partitions to explain why some but not all perturbations will cause execution failures. Inspired by this idea, we devise a fully differentiable end-to-end explanation pipeline that predicts whether a perturbed trajectory is successful or not. Only when the grounding classifier in the pipeline has learned the correct mode partitions can the overall pipeline differentiate all successful trajectories from failed ones. Our explanation-based learning approach is similar to analysis-by-synthesis in other domains: in NeRF, for example, only when an accurate 3d representation has been learned can the fully differentiable volumetric rendering pipeline generate images that match the ground truth from all views. To operationalize this idea, our transition loss (both the success loss and the failure loss) enforces a correct explanation of why some perturbations do not fail a successful demonstration while others do, which in turn forces our learned classifier to ground continuous boundary states to the discrete mode families provided by the LLM, leading to the segmentation of atomic skills useful for replanning under perturbations. To show the importance of our transition loss, we show (1) in the 2d polygon domain that a clustering-based trajectory segmentation baseline using a similarity metric does not recover correct mode boundaries; and (2) in the robosuite domains that ablating the transition loss (effectively an unsupervised trajectory segmentation based only on motion similarity [11]) does not lead to accurate grounding. These new results are documented on an anonymous project website: https://sites.google.com/view/grounding-plans/home.
In particular, in the 2d polygon domain, we compare our learned mode classifications with the ground truth on demonstration trajectories (the reported scores are therefore trajectory segmentation accuracies). The following table shows the comparison with an ablated model (no counterfactual data) and a simple unsupervised trajectory segmentation baseline based on KMeans++ clustering.
| Mode Classifier | 3-Mode | 4-Mode | 5-Mode |
| ----------- | ----------- | ----------- | ----------- |
| Ours | **0.990** | **0.967** | **0.970** |
| No Counterfactual Data | 0.604 | 0.464 | 0.831 |
| Trajectory Segmentation Baseline | 0.644 | 0.554 | 0.641 |
For robosuite, we also report the average trajectory segmentation accuracy (compared to ground truth) for each method.
| Mode Classifier | Can | Lift | Square Peg |
| ----------- | ----------- | ----------- | ----------- |
| Ours (LLM-reduced State Space) | **0.83** | **0.83** | **0.67** |
| Full State Space | 0.55 | 0.70 | 0.57 |
| Trajectory Segmentation Baseline | 0.66 | 0.56 | 0.54 |
**[Correspondence between the generated textual description and the continuous positions]** In the robosuite environments, the demonstration state consists of predefined object states corresponding to the keypoints shown in Figure 8(a). We add a full list of available keypoints to the prompt when querying the LLM to find a subset of features relevant to a task. More prompting examples can be found on the website in [this section](https://sites.google.com/view/grounding-plans/home#h.743770ss0gzs).
**[What if the demos do not perfectly follow the LLM generated plan?]** In this work, we give the LLM-generated plan to the humans who provide demonstrations, so we assume these successful demonstrations can be mapped to the same discrete plan even though they might come from a multimodal distribution in the continuous configuration space. In future work we will investigate how to map demonstrations with non-unique discrete structure to partial or nonlinear LLM plans.
**[How important is LLM-based feature selection?]** LLM-based feature selection helps improve data efficiency, as also shown independently by [13, 14]. For example, the state of distractor objects is not useful for learning a classifier that detects different modes in the demonstrations of picking up a can. Including distractor objects' states as inputs requires significantly more counterfactual data to learn a classifier that ignores them. To corroborate this claim, we use the default full set of features as the state representation to learn the grounding classifier for the robosuite tasks. We show that the resulting segmentation does not align well with the ground truth (see [table](https://sites.google.com/view/grounding-plans/home#h.7l0jx2g9td4d)) and plan to add qualitative results on the website in [this section](https://sites.google.com/view/grounding-plans/home#h.7l0jx2g9td4d).
**[How MMLP-conditional BC work in the absence of pseudo-attractor?]**
We apologize for the lack of clarity in the original description. The mode-conditioned policy is conditioned in the sense that the pseudo-attractor varies depending on the mode predicted by the system. However, the imitation policy itself is not conditioned on the mode (i.e., without the pseudo-attractor it is identical to BC). We also tried conditioning the imitation learning on the mode (i.e., learning a different BC network for the state-action pairs classified into each mode); however, we found that this did not substantially impact the performance of the policy.
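Purely as an illustration of this distinction (not our exact implementation; how the pseudo-attractor enters the action, and that states and actions share a space, are assumptions here), one plausible wiring looks like:

```python
def mode_conditioned_action(state, predicted_mode, bc_policy, pseudo_attractors, gain=0.1):
    """The shared BC network is mode-agnostic; mode conditioning enters only through
    the pseudo-attractor, which biases the action toward a mode-specific target.
    With gain = 0 this reduces to plain BC (assumed wiring, for illustration only)."""
    base_action = bc_policy(state)                 # shared, mode-agnostic BC policy
    attractor = pseudo_attractors[predicted_mode]  # mode-specific target point
    return base_action + gain * (attractor - state)
```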
[1] Language Conditioned Imitation Learning over Unstructured Data
[2] VIMA: General Robot Manipulation with Multimodal Prompts
[3] Grounding Predicates through Actions
[4] From Skills to Symbols: Learning Symbolic Representations for Abstract High-Level Planning
[5] Learning Temporal Logic Formulas from Suboptimal Demonstrations: Theory and Experiments
[6] SayCan: Grounding Language in Robotic Affordances
[7] Skill induction and planning with latent language
[8] Text2Motion: From Natural Language Instructions to Feasible Plans
[9] TACO: Learning Task Decomposition via Temporal Alignment for Control
[10] LEAGUE: Guided Skill Learning and Abstraction for Long-Horizon Manipulation
[11] Learning Rational Subgoals from Demonstrations and Instructions
[12] Learning grounded finite-state representations from unstructured demonstrations
[13] ELLA: Exploration through Learned Language Abstraction
[14] Learning with Language-Guided State Abstractions
<!--
<div style="text-align: center;">
<img src="https://hackmd.io/_uploads/BJIryZtVT.png" width="79%">
</div>
**Figure 8:** Method sketch **(a)** First, given a generic set of features (including their descriptions), e.g. keypoints in an environment, we first prompt LLM to select a subset of relevant features as the state representation $s_i$ for a task using the task description. The state $s_i$ is used as the input to the classifier $\phi(\cdot)$ that predicts a categorical distribution of modes. **(b)** Second, we prompt the LLM to generate a feasibility matrix $F^\#$ with $\#$ modes. Entries $F_{ij}$ with 0 mean the transition from mode $i$ to mode $j$ is valid and incur zero costs. Entries $F_{ij}$ with negative value mean direct transition from mode $i$ to mode $j$ is not feasible and the magnitude denote the number of missing modes required to transition from $i$ to $j$. For tasks with sequential nature, diagonal entries $F_{ii}$ are self-transitions and are always feasible; entries $F_{i,i+1}$ are demonstrated valid mode transitions towards the goal; entries $F_{ij}$ in the lower left triangular part of the matrix are feasible as we assume reversing from a later mode to a previous mode is allowed. **\(c\)** The success loss (only applied to the success trajectories $[s^+_0, s^+_1, s^+_2, ...]$) dictate all transitions between consecutive states must be feasible. Given a fixed feasibility matrix, the loss optimizes parameters in the classifier such that the predicted categorical distribution between consecutive states only index 0 entries of the feasibility matrix (denoted by dotted green line), i.e. $\phi(s^+_i)F^\#\phi(s^+_{i+1})=0$. Since there are more than one 0 entries in the matrix, there is no guarantee that the correct number of configuration space partition will be learned. For example, a classifier could classify all states into a single mode for successful trajectories in a degenerate case as self-transitions incur 0 costs. **(d)** To learn the correct number of partitions, we use the init and final loss to specify that the starting and ending continuous states for all trajectories must be in the first and last mode being demonstrated respectively. For successful trajectories, adding init and final loss to the success loss optimizes the classifier to divide the state space such that mode switches connecting the first and last mode (denoted by pink dotted line) must occur somewhere in the trajectories. The gray, blue and yellow region in (d) represent a possible classifer partition of the state space that can validate successful trajectories without necessarily the correct boundaries yet. **(e)** To refine the boundaries, failure loss (only applied to failure trajectories $[s^-_0, s^-_1, s^-_2, ...]$) dictate there exists at least one invalid mode transition, i.e. $\phi(s^-_i)F^\#\phi(s^-_{i+1})<0$. Contrasting the failure trajectory (red) with the successful trajectory (black) optimizes the classifier to refine partition such that the invalid mode transition only occur in regions where the two trajectories diverge (denoted by the dark blue dotted line). With more and more successful and failure trajectories, the learned mode partitions eventually converge to one that can differentiate all successful trajectories from failure trajectories based on the given feasibility matrix, such as the one in Fig 7 (d).
-->