Problem: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work leverages LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance, because they are applied in different domains, for different purposes, and with different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint which abilities LLMs lack and where the problems lie, which in turn prevents embodied agents from leveraging LLMs effectively and selectively.
Method: To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break down evaluation into different error types, such as hallucination errors, affordance errors, and various planning errors.
Conclusion: Overall, our benchmark offers a comprehensive and systematic assessment of LLMs' performance for different subtasks, pinpointing the strengths and weaknesses in LLM-powered embodied AI systems, and providing insights for effective and selective use of LLMs in embodied decision making.
In our Embodied Agent Interface, we propose a set of ability modules to evaluate LLMs for embodied decision making. The four ability modules are: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling. We provide a detailed description of each module below.
Goal Interpretation aims to ground the natural language instruction into the environment's representations of objects, states, relations, and actions. For example, the task instruction "Use the rag to clean the trays, the bowl, and the refrigerator. When you are done, leave the rag next to the sink..." can be grounded to specific objects with IDs, such as fridge (ID: 97), tray (ID: 1), bowl (ID: 1), rag (ID: 0), and sink (ID: 82). Note that a single natural language description can be grounded into a set of multiple goal conditions (object states and relations).
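To make the expected output format concrete, the sketch below represents the grounded goal for this instruction as a set of predicate conditions over object instances; the predicate and object names mirror the example above, but the data layout itself is illustrative rather than the benchmark's exact schema.

```python
# Illustrative grounded goal for the instruction above: a set of object-state
# and relation conditions over specific object instances (name.ID).
grounded_goal = [
    ("not_stained", "tray.1"),        # object-state condition
    ("not_stained", "bowl.1"),        # object-state condition
    ("not_stained", "fridge.97"),     # object-state condition
    ("next_to", "rag.0", "sink.82"),  # relation condition
]

for predicate, *args in grounded_goal:
    print(f"{predicate}({', '.join(args)})")
```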
Subgoal Decomposition generates a sequence of intermediate states (subgoals), where each subgoal is a set of object states and relations. For this task, the important subgoal states are: next_to(rag.0, sink.82), toggled_on(sink.82), soaked(rag.0), toggled_off(sink.82), open(fridge.97), not_stained(fridge.97). To achieve these state transitions, a high-level planner such as BFS can search for an action sequence that realizes them, yielding: RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97). Note that multiple actions may be required to achieve a single state transition. For example, to reach the state next_to(rag.0, sink.82), two actions are needed: RIGHT_GRASP(rag.0) and RIGHT_PLACE_NEXTTO(sink.82). See Figure 2 for the input and output formulation.
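The paragraph above mentions using a high-level planner such as BFS to find an action sequence that reaches each subgoal. The following is a minimal sketch of such a search, assuming hypothetical `successors(state)` and `satisfies(state, subgoal)` interfaces provided by the simulator; it is not the benchmark's actual planner implementation.

```python
from collections import deque

def bfs_plan(initial_state, subgoal, successors, satisfies):
    """Breadth-first search for a shortest action sequence reaching `subgoal`.

    Assumed (hypothetical) interfaces:
      successors(state) -> iterable of (action, next_state) pairs from the simulator
      satisfies(state, subgoal) -> bool goal test
    States are assumed hashable so visited states can be tracked.
    """
    frontier = deque([(initial_state, [])])  # (state, actions taken so far)
    visited = {initial_state}
    while frontier:
        state, plan = frontier.popleft()
        if satisfies(state, subgoal):
            return plan  # e.g., [RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82)]
        for action, next_state in successors(state):
            if next_state not in visited:
                visited.add(next_state)
                frontier.append((next_state, plan + [action]))
    return None  # subgoal not reachable
```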
Action Sequencing generates the actions required to achieve the state transitions identified in Subgoal Decomposition. For example, a successful execution of the action sequence RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97) is shown in Figure 3.
Transition Modeling serves as the low-level controller that guides the simulator in performing state transitions from preconditions to post-effects. For example, in the cleaning task, the input is the operator name soak, and the preconditions are three states: holding(?obj1), next_to(?sink, ?agent), and toggled_on(?sink). The post-effect after executing SOAK is soaked(?obj1).
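As a concrete illustration, the soak operator described above can be encoded as a simple precondition/effect structure; the sketch below uses plain Python data and is illustrative rather than the simulator's actual operator format.

```python
# Illustrative encoding of the SOAK operator: all preconditions must hold before
# execution, and the effects describe the resulting state change.
soak_operator = {
    "name": "soak",
    "parameters": ["?obj1", "?sink", "?agent"],
    "preconditions": [
        ("holding", "?obj1"),
        ("next_to", "?sink", "?agent"),
        ("toggled_on", "?sink"),
    ],
    "effects": [
        ("soaked", "?obj1"),
    ],
}
```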
We evaluate the performance of LLMs for embodied decision making using the Embodied Agent Interface. Below is a detailed description of the evaluation setup.
Focusing on complex long-horizon tasks, we select VirtualHome (V) and BEHAVIOR (B) as our evaluation simulators based on their task length and scene complexity. Table 1 shows our annotations. Apart from the goal and trajectory annotations, we introduce the Goal Action annotation to capture necessary actions that have no post-effects, such as the goal action touch in the task “pet the cat”. In the subset of VirtualHome tasks we use, \(80.7\%\) of task categories include instructions with more than \(10\) action steps, and \(33\%\) of the instructions have more than \(10\) steps.
We select BEHAVIOR as another simulator for our evaluation due to its task complexity. BEHAVIOR BDDL goals may contain quantifiers, such as (forpairs (?jar ?apple) (inside ?apple ?jar)), which need to be translated into grounded goals containing only atomic propositions, e.g., (and (inside apple_1 jar_1) (inside apple_2 jar_2)). Different grounded goals can satisfy the same BDDL goal, such as (and (inside apple_2 jar_1) (inside apple_1 jar_2)); we call them goal options. In general, one BDDL goal corresponds to many goal options. The average number of grounded goals for each task is \(6.7\), and there are \(4,164.4\) goal options for each task on average.
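To illustrate how a quantified BDDL goal expands into goal options, the sketch below enumerates the groundings of the forpairs clause above for two apples and two jars; it is a simplified stand-in for the actual BDDL grounding procedure.

```python
from itertools import permutations

apples = ["apple_1", "apple_2"]
jars = ["jar_1", "jar_2"]

# (forpairs (?jar ?apple) (inside ?apple ?jar)) pairs each jar with a distinct
# apple, so each permutation of apples assigned to the jars is one goal option.
goal_options = [
    [("inside", apple, jar) for apple, jar in zip(assignment, jars)]
    for assignment in permutations(apples, len(jars))
]

for option in goal_options:
    print(option)
# [('inside', 'apple_1', 'jar_1'), ('inside', 'apple_2', 'jar_2')]
# [('inside', 'apple_2', 'jar_1'), ('inside', 'apple_1', 'jar_2')]
```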
Annotation | VirtualHome | BEHAVIOR
---|---|---
#task name | 26 | 100 |
#task instruction | 338 | 100 |
#goal | 801 | 673 |
- #state | 340 | 153 |
- #relation | 299 | 520 |
- #action | 162 | - |
#trajectory | 338 | 100 |
- #step | 2960 | 1460 |
- avg. step | 8.76 | 14.6 |
#transition model | 33 | 30 |
- #precondition | 99 | 84 |
- #effect | 57 | 51 |
Each instance in the dataset represents a task goal. Specifically, each task contains the annotation types summarized in Table 1: the task name and instruction, goal conditions (object states, relations, and goal actions), the ground-truth action trajectory, and the relevant transition models (operator preconditions and effects).
For tasks in the BEHAVIOR environment, the dataset also includes accompanying VR human demonstration videos that showcase the execution of the ground truth action trajectories.
Our JSON data format is documented at this link: Dataset JSON Format
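To give a rough sense of the data layout, the sketch below shows a hypothetical task instance assembled from the annotation types in Table 1; the field names are illustrative, and the linked JSON format above is authoritative.

```python
# Hypothetical task instance assembled from the annotation types in Table 1.
# Field names are illustrative; the linked JSON format is authoritative.
example_task = {
    "task_name": "clean_the_refrigerator",
    "task_instruction": "Use the rag to clean the trays, the bowl, and the refrigerator...",
    "goal": {
        "states": [("not_stained", "fridge.97")],
        "relations": [("next_to", "rag.0", "sink.82")],
        "actions": [],  # goal actions with no post-effects, e.g., TOUCH in "pet the cat"
    },
    "trajectory": [
        "RIGHT_GRASP(rag.0)", "RIGHT_PLACE_NEXTTO(sink.82)", "TOGGLE_ON(sink.82)",
        "SOAK(rag.0)", "TOGGLE_OFF(sink.82)", "OPEN(fridge.97)", "CLEAN(fridge.97)",
    ],
    "transition_models": [
        {
            "name": "soak",
            "preconditions": [("holding", "?obj1"), ("next_to", "?sink", "?agent"), ("toggled_on", "?sink")],
            "effects": [("soaked", "?obj1")],
        },
    ],
}
```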
We integrated our evaluation pipeline into the HELM code base for easy and reproducible LLM inference. Users can set up their environment following the instructions here. We standardized decoding parameters across all models, using temperature zero for \(\operatorname*{arg\,max}\) (greedy) sampling. Evaluating all models on our benchmark required \(180\) runs. Detailed model information is provided in the table below.
Model Name | Creator | Complete Model ID | Release | Hosting |
---|---|---|---|---|
Claude-3 Haiku | Anthropic | claude-3-haiku-20240307 | 03/07/24 | Anthropic |
Claude-3 Sonnet | Anthropic | claude-3-sonnet-20240229 | 02/29/24 | Anthropic |
Claude-3 Opus | Anthropic | claude-3-opus-20240229 | 02/29/24 | Anthropic |
Claude-3.5 Sonnet | Anthropic | claude-3-5-sonnet-20240620 | 06/20/24 | Anthropic |
Cohere Command R | Cohere | command-r | 03/11/24 | Cohere |
Cohere Command R+ | Cohere | command-r-plus | 04/04/24 | Cohere |
Gemini 1.0 Pro | Google | gemini-pro | 12/13/23 | GCP Vertex
Gemini 1.5 Flash | Google | gemini-1.5-flash-preview-0514 | 05/14/24 | GCP Vertex
Gemini 1.5 Pro | Google | gemini-1.5-pro-preview-0409 | 04/09/24 | GCP Vertex
GPT-3.5-turbo | OpenAI | gpt-3.5-turbo-0125 | 01/25/24 | OpenAI |
GPT-4-turbo | OpenAI | gpt-4-turbo-2024-04-09 | 04/09/24 | OpenAI |
GPT-4o | OpenAI | gpt-4o-2024-05-13 | 05/13/24 | OpenAI |
Llama3 8B Instruct | Meta | meta-llama-3-8b-instruct | 04/18/24 | TogetherAI |
Llama3 70B Instruct | Meta | meta-llama-3-70b-instruct | 04/18/24 | TogetherAI |
Mistral Large | MistralAI | mistral-large-2402 | 02/26/24 | MistralAI |
Mixtral 8x22B MoE | MistralAI | mixtral-8x22b-instruct-v0.1 | 04/17/24 | TogetherAI |
o1-mini | OpenAI | o1-mini-2024-09-12 | 09/12/24 | OpenAI |
o1-preview | OpenAI | o1-preview-2024-09-12 | 09/12/24 | OpenAI |
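As noted above, all models are queried with temperature zero for greedy decoding. The sketch below shows how one such query could be issued directly with the OpenAI Python client and a model ID from the table; it is illustrative only, since our actual inference runs through the HELM pipeline.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Temperature-zero (greedy) decoding, matching the standardized setup above.
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",  # model ID taken from the table above
    temperature=0,
    messages=[
        {"role": "user",
         "content": "Decompose the task 'clean the refrigerator' into subgoals."},
    ],
)
print(response.choices[0].message.content)
```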