
Embodied Agent Interface: A Single Line to Evaluate LLMs for Embodied Decision Making

Manling Li1,2†, Shiyu Zhao1†, Qineng Wang1,2†, Kangrui Wang1,2†, Yu Zhou1†,
1Stanford University, 2Northwestern University, 3Amazon, 4MIT
†Equal contribution

Embodied Agent Interface aims to tackle the following challenges in evaluating LLMs for building embodied decision-making agents: (1) Standardization of goal specifications. (2) Standardization of modules and interfaces. (3) Broad coverage of evaluation and fine-grained metrics.



Empirical Findings

  1. Goal Interpretation:
    • LLMs struggle to translate natural language instructions into grounded states.
    • Common errors include generating intermediate goals and omitting spatial relationship goals.
    • Gemini 1.5 Pro has the highest goal interpretation performance, while Claude-3 Opus excels in goal retrieval rate.
    • Proprietary LLMs make fewer grammar errors compared to open-source LLMs.
    Table: All goal evaluation results (%) for goal interpretation (V: VirtualHome, B: BEHAVIOR)
    Model Name Goal Interpretation
    State Spatial Action Overall
    Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1
    V B V B V B V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 21.8 22.8 58.9 93.5 31.8 36.7 24.2 64.5 50.8 64.6 32.8 64.6 12.2 - 95.7 - 21.6 - 18.0 41.5 63.2 71.2 28.0 52.5
    Claude-3 Sonnet 23.3 36.8 57.1 88.9 33.1 52.0 26.6 76.2 53.0 79.8 35.5 77.9 12.4 - 85.8 - 21.7 - 19.3 60.2 61.5 81.9 29.4 69.4
    Claude-3 Opus 27.0 72.6 66.9 93.5 38.5 81.7 22.6 75.2 46.8 79.2 30.5 77.1 14.5 - 92.6 - 25.1 - 20.7 72.2 65.0 82.5 31.4 77.0
    Claude-3.5 Sonnet 25.3 74.0 60.9 94.8 35.8 83.1 31.1 84.4 63.8 81.3 41.8 82.9 14.0 - 98.8 - 24.5 - 21.7 81.1 69.6 84.4 33.0 82.7
    Cohere Command R 51.1 7.7 69.6 31.4 58.9 12.4 34.5 56.8 21.3 55.0 26.3 55.9 3.6 - 38.9 - 6.5 - 27.4 28.2 55.7 49.6 36.7 36.0
    Cohere Command R+ 20.9 23.3 52.0 79.1 29.8 36.0 17.9 66.7 15.2 61.5 16.4 64.0 10.4 - 82.6 - 18.5 - 14.9 42.0 44.5 65.5 22.4 51.2
    Gemini 1.0 Pro 25.3 27.4 57.9 81.1 34.9 41.0 17.0 75.2 20.6 70.4 18.6 72.7 9.9 - 68.7 - 17.2 - 16.2 51.0 45.2 72.8 23.8 60.0
    Gemini 1.5 Flash 23.6 55.8 57.9 94.1 33.5 70.1 19.8 76.6 21.1 76.7 20.5 76.7 13.5 - 90.1 - 23.5 - 18.2 69.7 50.8 80.7 26.8 74.8
    Gemini 1.5 Pro 45.4 94.0 49.1 92.8 47.2 93.4 40.0 74.4 9.7 76.7 15.6 75.6 26.8 - 80.9 - 40.3 - 35.2 78.8 41.1 80.4 37.9 79.6
    GPT-3.5-turbo 22.4 52.0 50.0 66.7 30.9 58.5 8.5 51.5 18.8 46.9 11.7 49.1 15.2 - 60.5 - 24.4 - 15.7 49.5 40.5 51.4 22.7 50.4
    GPT-4-turbo 28.6 70.4 58.5 86.9 38.4 77.8 24.7 77.5 32.9 76.4 28.2 76.9 19.0 - 82.1 - 30.9 - 24.0 75.6 53.8 78.8 33.2 77.2
    GPT-4o 29.0 67.1 60.0 94.8 39.1 78.6 31.5 81.1 43.6 78.5 36.6 79.8 20.5 - 85.8 - 33.1 - 26.4 76.5 59.1 82.2 36.5 79.2
    Llama 3 8B Instruct 21.7 17.3 54.4 80.4 31.0 28.4 14.0 51.4 7.4 20.8 9.7 29.6 11.1 - 79.4 - 19.4 - 15.5 24.1 41.9 34.3 22.6 28.3
    Llama 3 70B Instruct 23.9 69.5 61.2 95.4 34.3 80.4 22.6 70.0 37.5 73.3 28.2 71.6 11.2 - 88.8 - 19.8 - 17.5 64.7 58.0 78.3 26.9 70.9
    Mistral Large 23.6 63.5 59.1 92.2 32.8 75.2 23.7 75.1 40.3 76.2 29.8 75.6 11.2 - 84.0 - 19.7 - 17.5 69.6 57.1 79.8 26.8 74.3
    Mixtral 8x22B MoE 23.6 22.9 56.9 83.7 33.4 36.0 22.2 70.7 36.3 67.7 27.5 69.2 11.2 - 94.8 - 20.0 - 17.4 44.4 56.2 71.3 26.6 54.7
    o1-mini 26.3 63.8 58.6 90.8 36.3 74.9 30.4 77.3 39.9 76.5 34.5 76.9 13.5 - 56.8 - 21.8 - 22.4 73.3 51.3 79.8 31.2 76.4
    o1-preview 28.2 66.8 60.3 94.8 38.5 78.4 44.9 82.9 62.4 82.7 52.2 82.8 26.0 - 81.5 - 39.5 - 31.8 78.1 65.4 85.4 42.7 81.6
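
    The precision, recall, and F1 scores above are set-level metrics between the goal conditions predicted by an LLM and our annotated goal conditions. Below is a minimal sketch of such a computation, assuming each goal condition is encoded as a plain string (an illustrative assumption, not our exact implementation):

```python
def goal_prf(predicted, annotated):
    """Set-level precision/recall/F1 between predicted and annotated goal conditions."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)                      # correctly predicted goal conditions
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# goal_prf(["soaked(rag.0)", "open(fridge.97)"],
#          ["soaked(rag.0)", "not_stained(fridge.97)"])  -> (0.5, 0.5, 0.5)
```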

  2. Action Sequencing:
    • Reasoning ability is crucial for LLMs; trajectory feasibility errors are common (41.2%).
    • o1-preview has the highest task success rate (81.0%) and execution success rate (91.0%) in BEHAVIOR, while Mistral Large (73.4%) and Gemini 1.5 Pro (73.1%) outperform it in VirtualHome.
    • SOTA LLMs make fewer grammar errors. For example, Claude-3 Opus makes no errors, while GPT-3.5-turbo has a 4.0% error rate in BEHAVIOR.
    • Common runtime errors include missing steps and wrong order. In BEHAVIOR, GPT-4o encounters 36.0% missing step errors and 9.0% wrong order errors.
    • LLMs perform better with state goals than relation goals, but struggle with complex action goals. GPT-4o achieves 82.0% success in state goals and 67.8% in relation goals in VirtualHome.
    • Task complexity, such as the number of goals and action sequence length, lowers success rates. In BEHAVIOR, tasks with more than 10 goals have a success rate below 40%.
    Table: Trajectory evaluation results (%) for action sequencing.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 43.3 26.0 48.5 32.0 0.0 0.0 4.9 6.0 0.3 0.0 1.6 7.0 43.3 54.0 1.3 1.0 3.3 1.0
    Claude-3 Sonnet 62.9 44.0 67.2 57.0 0.0 0.0 5.6 1.0 0.7 7.9 2.3 11.0 22.9 19.0 1.3 11.0 3.6 2.0
    Claude-3 Opus 66.2 51.0 70.8 59.0 0.0 0.0 14.1 0.0 0.0 0.0 0.7 3.0 14.1 35.0 0.3 3.0 6.2 2.0
    Claude-3.5 Sonnet 72.8 60.0 75.4 69.0 0.0 0.0 2.3 0.0 0.0 0.0 1.0 5.0 19.7 25.0 1.6 1.0 5.2 2.0
    Gemini 1.0 Pro 34.4 27.0 45.9 32.0 0.3 7.0 9.2 3.0 2.0 6.0 1.3 13.0 38.7 35.0 2.6 4.0 7.2 4.0
    Gemini 1.5 Flash 61.9 40.0 67.2 52.0 0.0 0.0 2.0 0.0 0.3 0.0 0.3 5.0 29.8 42.0 0.3 1.0 4.3 2.0
    Gemini 1.5 Pro 73.1 42.0 83.3 54.0 0.0 0.0 1.6 0.0 0.3 0.0 0.3 6.0 13.1 39.0 1.3 1.0 5.6 2.0
    GPT-3.5-turbo 14.7 16.0 31.8 20.0 35.1 4.0 1.6 7.0 1.3 23.0 0.3 1.0 28.2 36.0 1.6 8.0 2.0 1.3
    GPT-4-turbo 57.0 38.0 65.6 45.0 0.0 0.0 1.6 0.0 0.3 0.0 0.0 7.0 32.1 47.0 0.3 1.0 3.6 0.0
    GPT-4o 61.6 47.0 71.1 53.0 0.3 0.0 1.3 1.0 0.3 0.0 0.3 9.0 25.2 36.0 1.3 1.0 4.9 0.0
    Cohere Command R 24.6 16.0 37.7 19.0 0.7 5.0 29.8 13.0 2.0 0.0 3.0 8.0 25.2 43.0 2.0 12.0 4.3 4.0
    Cohere Command R+ 63.3 27.0 70.2 35.0 0.0 0.0 5.6 1.0 0.7 15.0 0.3 10.0 22.6 39.0 0.7 0.0 5.9 15.0
    Mistral Large 73.4 33.0 83.6 50.0 0.0 0.0 2.6 0.0 0.3 0.0 0.3 8.0 12.8 35.0 0.3 6.0 4.9 7.0
    Mixtral 8x22B MoE 46.2 30.0 49.5 40.0 0.0 3.0 13.1 6.0 0.7 0.0 0.7 10.0 34.7 32.0 1.3 9.0 3.0 2.0
    Llama 3 8B 21.6 10.0 25.9 16.0 0.0 0.0 41.6 15.0 1.0 9.0 0.3 6.0 31.1 44.0 0.0 9.0 0.3 5.0
    Llama 3 70B 55.7 34.0 63.0 42.0 0.0 0.0 23.3 2.0 1.0 0.0 2.0 15.0 7.9 38.0 3.0 3.0 7.9 6.0
    o1-mini 65.9 56.0 68.9 65.0 0.3 0.0 5.2 3.0 3.3 0.0 0.3 7.0 21.6 17.0 0.3 6.0 5.9 5.0
    o1-preview 71.1 81.0 78.4 91.0 2.0 0.0 8.2 0.0 0.0 0.0 0.3 0.0 34.1 6.0 0.3 2.0 8.9 3.0

    Table: All goal success results (%) for action sequencing and subgoal decomposition.
    Model Name Action Sequencing Subgoal Decomposition
    State Goal Relation Goal Action Goal Total State Goal Relation Goal Action Goal Total
    V B V B V B V B V B V B V B V B
    Claude-3 Haiku 58.6 27.0 47.2 38.7 33.1 - 49.0 35.5 89.4 26.0 82.2 34.8 71.6 - 83.1 32.4
    Claude-3 Sonnet 80.9 41.0 73.3 59.8 48.6 - 70.8 54.6 89.1 37.0 89.3 49.8 83.3 - 88.0 46.3
    Claude-3 Opus 64.7 45.0 79.4 53.0 57.4 - 67.3 50.8 92.4 43.0 88.6 41.6 83.3 - 89.1 42.0
    Claude-3.5 Sonnet 81.3 63.0 79.4 62.4 57.4 - 74.9 62.6 92.9 41.0 88.6 39.5 87.0 - 90.1 39.9
    Gemini 1.0 Pro 52.2 28.0 36.1 32.0 42.6 - 45.0 30.9 84.4 26.0 61.5 31.1 72.8 - 73.5 29.7
    Gemini 1.5 Flash 79.5 34.0 65.5 50.0 48.0 - 67.7 45.6 93.5 44.0 88.3 36.0 92.0 - 91.3 38.2
    Gemini 1.5 Pro 81.7 41.0 77.2 43.2 68.2 - 77.1 42.6 91.2 31.0 72.5 37.1 89.5 - 83.9 35.4
    GPT-3.5-turbo 29.5 20.0 18.3 22.6 23.6 - 24.8 21.9 84.7 28.0 54.4 28.5 64.8 - 69.4 28.3
    GPT-4-turbo 74.1 39.0 73.3 39.5 47.3 - 67.3 39.3 93.5 45.0 84.2 46.1 90.7 - 89.5 45.8
    GPT-4o 82.0 49.0 67.8 45.5 57.4 - 71.8 46.5 92.1 50.0 84.2 53.2 93.2 - 89.4 52.3
    Cohere Command R 24.1 20.0 40.0 25.9 37.1 - 32.0 24.3 85.3 20.0 67.4 21.4 60.5 - 73.6 21.0
    Cohere Command R+ 71.2 28.0 63.9 32.0 60.2 - 66.3 30.9 89.4 34.0 66.8 29.6 75.9 - 78.3 30.8
    Mistral Large 81.3 38.5 77.8 41.2 75.0 - 78.7 40.4 92.9 33.0 71.5 35.6 90.1 - 84.4 34.9
    Mixtral 8x22B MoE 48.9 30.0 56.1 36.8 37.2 - 48.2 35.0 92.1 30.0 74.8 34.1 87.7 - 84.8 33.0
    Llama 3 8B 26.3 16.0 26.1 23.7 10.1 - 22.2 21.6 68.8 21.0 54.7 23.6 50.0 - 59.8 22.9
    Llama 3 70B 42.8 31.0 64.4 45.5 53.4 - 51.8 41.5 93.2 25.0 63.4 27.7 82.7 - 80.0 27.0
    o1-mini 75.2 64.0 68.3 66.9 51.4 - 67.3 66.1 89.7 28.0 68.8 38.0 81.5 - 80.3 35.3
    o1-preview 86.0 89.5 71.1 84.4 56.1 - 74.3 85.8 91.8 56.5 88.3 69.4 92.6 - 90.6 65.9

  3. Subgoal Decomposition:
    • Subgoal decomposition is not easier than action sequencing in abstract action spaces.
    • o1-preview shows superior performance in VirtualHome (89.4%) and BEHAVIOR (57.0%). Gemini 1.5 Flash also performs well in VirtualHome (89.1%).
    • SOTA models avoid grammar errors but can hallucinate actions (e.g., GPT-4o adds "POUR" in VirtualHome).
    • Common runtime errors: extra steps in VirtualHome, missing steps in BEHAVIOR.
    • LLMs like o1-preview are more accurate in action goals in VirtualHome; state and relation goals in BEHAVIOR are more difficult due to stricter precondition checks.
    • Performance is lower in BEHAVIOR due to complex task representations with quantifiers like "forall" and "forpairs."
    Table: All trajectory evaluation results (%) for subgoal decomposition.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 78.4 30.0 82.8 35.0 0.3 0.0 2.4 1.0 1.8 0.0 1.8 3.0 2.7 58.0 8.3 3.0 20.4 3.0
    Claude-3 Sonnet 83.1 39.0 86.4 43.0 0.0 0.0 1.8 2.0 0.0 2.0 0.6 3.0 2.7 51.0 8.6 1.0 33.7 3.0
    Claude-3 Opus 87.0 41.0 90.0 47.0 0.3 0.0 3.6 3.0 0.0 0.0 1.2 5.0 3.0 45.0 2.4 0.0 16.0 6.0
    Claude-3.5 Sonnet 89.1 39.0 92.0 44.0 0.0 0.0 1.8 1.0 0.0 0.0 1.5 11.0 2.7 44.0 2.1 0.0 24.6 4.0
    Gemini 1.0 Pro 70.4 24.0 84.6 33.0 0.6 2.0 3.3 4.0 2.4 0.0 1.2 3.0 2.7 51.0 5.3 7.0 10.4 3.0
    Gemini 1.5 Flash 89.1 34.0 94.1 42.0 0.0 2.0 1.5 1.0 0.0 0.0 0.6 2.0 3.9 53.0 0.0 0.0 13.3 3.0
    Gemini 1.5 Pro 87.0 31.0 91.1 37.0 0.0 1.0 1.5 0.0 1.8 1.0 0.0 3.0 5.6 59.0 0.0 0.0 16.0 2.0
    GPT-3.5-turbo 69.2 24.0 81.4 36.0 1.5 2.0 0.0 3.0 0.6 0.0 1.5 4.0 11.8 51.0 3.3 4.0 20.4 3.0
    GPT-4-turbo 85.5 38.0 94.1 47.0 0.0 0.0 1.8 3.0 0.0 0.0 1.5 9.0 2.4 40.0 0.3 1.0 22.2 6.0
    GPT-4o 88.8 49.0 90.2 55.0 0.0 0.0 6.2 3.0 0.0 0.0 1.2 6.0 2.4 36.0 0.0 0.0 15.7 5.0
    Cohere Command R 71.3 15.0 79.6 25.0 2.1 23.0 3.9 10.0 0.9 0.0 1.5 0.0 6.2 37.0 5.9 5.0 14.5 4.0
    Cohere Command R+ 79.0 25.0 83.7 37.0 1.5 2.0 4.5 4.0 2.1 0.0 0.9 4.0 7.7 52.0 2.7 1.0 16.0 6.0
    Mistral Large 84.3 31.0 92.0 38.0 0.3 1.0 1.8 3.0 0.3 0.0 2.1 4.0 3.3 52.0 0.3 2.0 11.0 1.0
    Mixtral 8x22B MoE 80.5 28.0 90.2 33.0 0.3 0.0 2.4 4.0 0.0 0.0 3.0 2.0 3.9 59.0 0.3 2.0 11.2 0.0
    Llama 3 8B 48.8 21.0 58.0 29.0 0.6 2.0 2.4 11.0 0.6 0.0 6.8 6.0 5.0 44.0 26.6 8.0 18.3 7.0
    Llama 3 70B 78.4 20.0 87.3 30.0 0.0 1.0 2.4 5.0 0.9 1.0 2.4 8.0 5.3 51.0 1.8 4.0 20.4 4.0
    o1-mini 79.3 31.0 84.6 39.0 0.0 0.0 1.5 3.0 0.6 3.0 0.3 7.0 8.9 46.0 4.1 2.0 21.9 1.0
    o1-preview 89.4 57.0 93.2 62.0 0.0 2.0 1.5 3.0 0.0 0.0 0.3 5.0 2.7 25.0 2.4 3.0 12.1 7.0

  4. Transition Modeling:
    • Models excel in specific categories like object states and orientation.
    • Non-spatial relations consistently pose a challenge.
    • Planning effectiveness relies on consistency in the predicted action space.
    Table: Full results of logic form accuracy for transition modeling in VirtualHome
    Model Object States Object Orientation Object Affordance Spatial Relations Non-Spatial Relations
    Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1
    Claude-3 Haiku 76.0 40.1 52.5 19.0 34.4 24.4 67.8 73.9 70.7 37.7 38.7 38.2 2.0 1.5 1.7
    Claude-3 Opus 87.4 49.2 63.0 46.3 96.9 62.6 76.8 74.3 75.5 37.6 39.9 38.7 10.4 5.2 7.0
    Claude-3 Sonnet 76.6 37.4 50.3 48.1 78.1 59.5 60.7 74.3 66.8 32.3 39.9 35.7 6.2 4.1 4.9
    Claude-3.5 Sonnet 86.1 46.7 60.5 93.9 96.9 95.3 77.7 75.5 76.6 45.3 39.8 42.4 7.1 5.1 5.9
    Cohere Command R 18.0 6.8 9.9 38.7 90.6 54.2 40.2 23.0 29.2 12.6 6.7 8.8 3.3 0.9 1.4
    Cohere Command R+ 44.9 19.0 26.3 34.6 68.8 45.9 51.0 62.1 56.0 30.1 34.8 32.4 7.6 3.1 4.4
    Gemini 1.0 Pro 68.4 12.3 20.4 16.3 62.5 27.9 55.3 20.1 29.6 45.0 16.5 24.3 7.7 2.5 3.8
    Gemini 1.5 Flash 82.3 37.6 51.6 2.0 3.1 2.5 54.4 74.7 62.9 47.4 42.9 45.0 16.3 5.2 7.9
    Gemini 1.5 Pro 45.3 11.9 18.8 88.2 93.8 90.9 79.9 75.5 77.7 42.2 35.8 38.7 15.5 5.2 7.8
    GPT-3.5-turbo 63.5 21.9 32.5 11.4 15.6 13.2 57.2 53.1 54.9 35.2 21.7 26.8 1.7 0.3 0.6
    GPT-4-turbo 79.3 44.2 56.7 10.1 31.3 15.3 65.9 71.0 68.4 31.8 34.2 32.9 3.8 1.0 1.6
    GPT-4o 80.2 41.5 54.6 48.0 59.4 52.8 76.2 73.7 74.9 40.8 40.7 40.8 14.8 5.1 7.5
    Llama 3 8b 30.8 13.7 18.9 0.0 0.0 0.0 1.6 3.2 2.1 15.5 18.2 16.8 0.0 0.0 0.0
    Llama 3 70b 63.5 21.9 32.5 49.0 66.3 56.6 65.0 50.0 57.0 27.0 27.0 27.0 5.0 2.0 3.0
    Mistral Large 30.0 8.0 13.0 48.0 88.0 62.0 72.0 29.0 41.0 35.0 18.0 24.0 3.0 1.0 1.0
    Mixtral 8x22B MoE 72.0 33.0 45.0 43.0 83.0 57.0 64.0 74.0 69.0 40.0 38.0 39.0 12.0 4.0 6.0
    o1-mini 82.5 45.9 59.0 51.3 62.5 56.3 59.8 57.1 58.5 32.1 32.8 32.5 5.0 4.1 4.5
    o1-preview 83.0 45.1 58.5 69.0 90.6 78.4 84.7 71.4 77.5 39.8 37.8 38.8 17.1 9.0 11.8
    Table: Full results of logic form accuracy for transition modeling in BEHAVIOR
    Model Object States Spatial Relations Non-Spatial Relations
    Precision Recall F1 Precision Recall F1 Precision Recall F1
    Claude-3.5 Sonnet 83.3 74.8 78.8 73.3 48.8 58.6 82.9 66.2 73.6
    Claude-3 Haiku 64.1 55.2 59.3 54.7 37.4 44.4 63.3 51.4 56.7
    Claude-3 Opus 74.6 69.4 71.9 70.4 44.6 54.6 68.5 69.1 68.8
    Claude-3 Sonnet 66.2 68.7 67.5 62.8 39.8 48.7 68.8 52.0 59.2
    Cohere Command R 59.7 43.9 50.6 29.1 11.6 16.6 27.2 15.3 19.6
    Cohere Command R+ 58.0 58.4 58.2 54.2 33.6 41.5 53.0 56.6 54.7
    Gemini 1.0 Pro 67.2 55.2 60.6 47.5 35.3 40.5 43.8 48.3 45.9
    Gemini 1.5 Flash 73.9 57.2 64.5 54.5 40.7 46.6 60.7 53.8 57.0
    Gemini 1.5 Pro 69.6 46.7 55.9 52.9 27.2 35.9 59.6 47.4 52.8
    GPT-3.5-turbo 67.1 46.1 54.6 57.6 31.6 40.9 40.8 36.1 38.3
    GPT-4-turbo 58.2 59.4 58.8 50.3 27.8 35.8 58.5 38.4 46.4
    GPT-4o 73.1 69.6 71.3 63.9 35.8 45.9 84.7 64.2 73.0
    Llama 3 70b 68.1 64.6 66.3 60.3 38.8 47.2 65.1 53.8 58.9
    Llama 3 8b 40.3 32.4 35.9 29.6 22.7 25.7 48.9 43.9 46.2
    Mistral Large 67.5 66.5 67.0 54.9 32.3 40.7 59.7 44.6 51.1
    Mixtral 8x22B MoE 60.2 60.0 60.1 53.2 39.9 45.6 57.9 55.8 56.8
    o1-mini 46.3 37.2 41.3 71.1 42.3 53.1 80.1 58.3 67.5
    o1-preview 85.5 72.3 78.3 72.4 46.1 56.3 88.0 79.5 83.5

    Table: Full results of planner success rate for transition modeling (%)
    Model Object States Object Orientation Object Affordance Spatial Relations Non-Spatial Relations
    V B V B V B V B V B
    Claude-3 Haiku 13.5 68.9 3.6 - 19.8 - 46.9 62.8 73.0 62.3
    Claude-3 Opus 63.5 84.4 71.4 - 58.7 - 64.8 80.9 55.4 82.0
    Claude-3 Sonnet 11.2 80.0 3.6 - 10.8 - 20.0 79.8 13.5 80.3
    Claude-3.5 Sonnet 67.4 86.7 96.4 - 67.8 - 96.6 80.8 91.9 80.3
    Cohere Command R 44.6 48.9 82.1 - 40.1 - 62.6 38.3 58.3 39.3
    Cohere Command R+ 36.5 77.8 46.4 - 35.3 - 40.7 57.4 31.1 47.5
    Gemini 1.0 Pro 10.7 22.2 0.0 - 10.2 - 14.5 13.8 2.7 14.8
    Gemini 1.5 Flash 34.8 55.6 7.1 - 46.7 - 61.4 68.1 60.8 70.5
    Gemini 1.5 Pro 94.4 35.6 89.3 - 95.8 - 89.0 40.4 83.8 39.3
    GPT-3.5-turbo 1.1 26.7 25.0 - 1.2 - 0.0 39.4 0.0 54.1
    GPT-4-turbo 51.7 40.0 50.0 - 47.9 - 67.6 44.7 64.9 52.5
    GPT-4o 71.9 68.9 78.6 - 63.5 - 66.9 64.9 68.9 68.9
    Llama 3 8b 27.0 35.6 0.0 - 26.4 - 37.9 27.7 31.1 26.2
    Llama 3 70b 10.1 68.9 3.6 - 6.6 - 15.2 77.7 18.9 85.2
    Mistral Large 15.7 73.3 7.1 - 14.4 - 17.9 76.6 8.1 80.3
    Mixtral 8x22B MoE 36.5 57.8 50.0 - 28.1 - 44.1 52.1 43.2 57.4
    o1-mini 63.5 77.8 82.1 - 59.3 - 75.9 77.7 71.6 75.4
    o1-preview 69.1 86.7 100.0 - 67.1 - 76.6 89.4 78.4 90.2

  5. Sensitivity Analysis:
    • Actions like "plug_in" and "walk_towards" show low success rates.
    • Complex interactions like "slice_carvingknife" and "place_inside" present challenges.
    • Training regimens may not fully capture real-world interaction diversity.
  6. Pipeline-Based vs. Modularized:
    • Both methods achieve similar trajectory executability rates.
    • Pipeline-based methods suffer from error accumulation.
    • SOTA LLMs avoid grammar errors; less advanced models do not.
    • All LLMs are prone to runtime errors, missing necessary steps.
    Table: Pipeline-based evaluation results for (1) \(\mathcal{G}+\mathcal{Q}\) and (2) \(\mathcal{G}+\Phi\) in BEHAVIOR. \(\mathcal{G}\): Goal Interpretation. \(\mathcal{Q}\): Action Sequencing. \(\Phi\): Subgoal Decomposition. In this table, M means 'modularized' and P means 'pipeline-based'.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    M P M P M P M P M P M P M P M P M P
    Goal Interpretation + Action Sequencing
    Claude-3 Haiku 26.0 21.0 32.0 29.0 0.0 0.0 6.0 6.0 0.0 0.0 7.0 6.0 54.0 52.0 1.0 7.0 1.0 17.0
    Claude-3 Sonnet 44.0 41.0 57.0 53.0 0.0 0.0 1.0 3.0 0.0 0.0 11.0 14.0 19.0 21.0 11.0 9.0 2.0 12.0
    Claude-3 Opus 51.0 46.0 59.0 54.0 0.0 1.0 0.0 1.0 0.0 0.0 3.0 6.0 35.0 35.0 3.0 3.0 2.0 4.0
    Gemini 1.0 Pro 27.0 26.0 32.0 35.0 7.0 5.0 3.0 3.0 6.0 6.0 13.0 14.0 35.0 38.0 4.0 2.0 4.0 11.0
    Gemini 1.5 Flash 40.0 35.0 52.0 49.0 0.0 0.0 0.0 2.0 0.0 0.0 5.0 10.0 42.0 41.0 1.0 0.0 2.0 7.0
    Gemini 1.5 Pro 42.0 37.0 54.0 55.0 0.0 1.0 0.0 1.0 0.0 0.0 6.0 7.0 39.0 35.0 1.0 1.0 2.0 0.0
    GPT-3.5-turbo 16.0 14.0 20.0 32.0 4.0 1.0 7.0 3.0 23.0 15.0 1.0 5.0 36.0 39.0 8.0 6.0 1.0 3.0
    GPT-4-turbo 38.0 32.0 45.0 47.0 0.0 1.0 0.0 1.0 0.0 0.0 7.0 9.0 47.0 41.0 1.0 1.0 0.0 0.0
    GPT-4o 47.0 42.0 53.0 55.0 0.0 0.0 1.0 3.0 0.0 0.0 9.0 6.0 36.0 35.0 1.0 1.0 0.0 4.0
    Cohere Command R 16.0 5.0 19.0 9.0 5.0 3.0 13.0 38.0 0.0 1.0 8.0 8.0 43.0 31.0 12.0 12.0 4.0 8.0
    Cohere Command R+ 27.0 15.0 35.0 29.0 0.0 0.0 1.0 8.0 15.0 14.0 10.0 30.0 39.0 31.0 0.0 2.0 15.0 22.0
    Mistral Large 33.0 31.0 50.0 38.0 0.0 0.0 0.0 3.0 0.0 0.0 8.0 14.0 35.0 37.0 6.0 8.0 7.0 5.0
    Mixtral 8x22B MoE 30.0 26.0 40.0 36.0 3.0 3.0 6.0 13.0 0.0 0.0 10.0 14.0 32.0 21.0 9.0 13.0 2.0 15.0
    Llama3 8B 10.0 0.0 16.0 5.0 0.0 2.0 15.0 25.0 9.0 6.0 6.0 11.0 44.0 34.0 9.0 17.0 5.0 14.0
    Llama3 70B 34.0 26.0 42.0 40.0 0.0 1.0 2.0 3.0 0.0 0.0 15.0 18.0 38.0 35.0 3.0 5.0 6.0 9.0
    Goal Interpretation + Subgoal Decomposition
    Claude-3 Haiku 29.0 21.0 35.0 40.0 0.0 0.0 1.0 5.0 0.0 0.0 2.0 2.0 59.0 46.0 3.0 7.0 3.0 16.0
    Claude-3 Sonnet 38.0 31.0 43.0 45.0 0.0 0.0 2.0 3.0 0.0 0.0 3.0 2.0 51.0 47.0 1.0 3.0 3.0 18.0
    Claude-3 Opus 39.0 35.0 47.0 45.0 0.0 0.0 3.0 8.0 0.0 0.0 5.0 4.0 45.0 42.0 0.0 1.0 5.0 7.0
    Gemini 1.0 Pro 23.0 14.0 33.0 30.0 2.0 0.0 4.0 10.0 0.0 1.0 3.0 1.0 51.0 45.0 7.0 13.0 3.0 17.0
    Gemini 1.5 Flash 34.0 32.0 42.0 44.0 2.0 1.0 1.0 3.0 0.0 0.0 2.0 2.0 53.0 48.0 0.0 2.0 3.0 7.0
    Gemini 1.5 Pro 31.0 26.0 37.0 38.0 0.0 1.0 1.0 3.0 0.0 0.0 3.0 2.0 59.0 56.0 0.0 0.0 2.0 1.0
    GPT-3.5-turbo 24.0 14.0 36.0 27.0 2.0 0.0 3.0 12.0 0.0 22.0 3.0 1.0 52.0 32.0 4.0 6.0 3.0 5.0
    GPT-4-turbo 37.0 37.0 47.0 49.0 0.0 0.0 3.0 4.0 0.0 0.0 9.0 8.0 40.0 37.0 1.0 2.0 6.0 6.0
    GPT-4o 48.0 38.0 55.0 52.0 0.0 0.0 3.0 4.0 0.0 0.0 5.0 6.0 37.0 35.0 0.0 3.0 5.0 9.0
    Cohere Command R 15.0 8.0 25.0 15.0 21.0 13.0 11.0 32.0 0.0 1.0 0.0 1.0 38.0 32.0 4.0 6.0 4.0 12.0
    Cohere Command R+ 24.0 17.0 37.0 31.0 2.0 6.0 4.0 10.0 0.0 2.0 5.0 7.0 51.0 40.0 1.0 4.0 6.0 14.0
    Mistral Large 30.0 22.0 38.0 29.0 1.0 1.0 3.0 12.0 0.0 1.0 4.0 5.0 52.0 50.0 2.0 2.0 1.0 5.0
    Mixtral 8x22B MoE 27.0 22.0 33.0 29.0 0.0 0.0 4.0 9.0 0.0 2.0 2.0 2.0 59.0 45.0 2.0 13.0 0.0 17.0
    Llama3 8B 21.0 3.0 29.0 14.0 2.0 7.0 11.0 29.0 0.0 2.0 6.0 3.0 44.0 30.0 8.0 15.0 7.0 7.0
    Llama3 70B 20.0 19.0 30.0 31.0 1.0 1.0 5.0 22.0 1.0 1.0 8.0 7.0 51.0 35.0 4.0 3.0 4.0 7.0

  7. Replanning and Feedback:
    • Replanning based on feedback significantly improves performance.
    • Replanning can result in over-generation of actions. A minimal sketch of such a feedback-driven replanning loop is shown after the table below.
    Table: Replanning evaluation results (%) for action sequencing.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    GPT-4o 65.2 71.8 0.0 1.3 0.7 0.0 25.3 1.0 0.3
    GPT-4o w/ replanning 77.4 83.3 0.0 1.3 0.0 0.0 14.1 0.3 0.7
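
    The feedback-driven replanning loop can be sketched as follows. This is illustrative only: llm and execute stand in for an LLM query function and a symbolic executor (for instance, the executability check sketched earlier), and the exact replanning prompt used in our experiments may differ.

```python
def replan_until_success(llm, task_prompt, execute, max_rounds=3):
    """Feedback-driven replanning loop (illustrative; `llm` and `execute` are assumed callables).

    `llm(prompt)` returns a list of grounded action strings; `execute(plan)` returns
    (success, error_message) from a symbolic executor.
    """
    prompt = task_prompt
    plan = []
    for _ in range(max_rounds):
        plan = llm(prompt)
        success, error = execute(plan)
        if success:
            return plan
        # Feed the execution error back to the model and ask for a revised plan.
        prompt = (f"{task_prompt}\n\nPrevious plan: {plan}\n"
                  f"Execution error: {error}\nPlease output a corrected plan.")
    return plan  # best effort after max_rounds
```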

Abstract

Problem: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has leveraged LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance, because they are usually applied in different domains for different purposes and built on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint which ability is missing in LLMs and where the problem lies, which in turn prevents embodied agents from leveraging LLMs effectively and selectively.

Method: To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break down evaluation into various error types, such as hallucination errors, affordance errors, and planning errors.

Conclusion: Overall, our benchmark offers a comprehensive and systematic assessment of LLMs' performance on different subtasks, pinpointing the strengths and weaknesses of LLM-powered embodied AI systems and providing insights for the effective and selective use of LLMs in embodied decision making.

Embodied agent interface overview.
Figure 1: Embodied Agent Interface unifies a broad set of tasks involving both state and temporally extended goals and four LLM-based modules for decision making.

Embodied Agent Interface


In our Embodied Agent Interface, we propose a set of ability modules to evaluate LLMs for embodied decision making. The four ability modules are: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling. We provide a detailed description of each module below.

Ability Module 1: Goal Interpretation

Goal Interpretation aims to ground the natural language instruction into the environment's representation of objects, states, relations, and actions. For example, the task instruction "Use the rag to clean the trays, the bowl, and the refrigerator. When you are done, leave the rag next to the sink..." can be grounded to specific objects with IDs, such as fridge (ID: 97), tray (ID: 1), bowl (ID: 1), rag (ID: 0), and sink (ID: 82). Note that a single natural language description can be grounded into a set of multiple goal conditions (object states and relations).
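
As a concrete illustration, such a grounded goal can be viewed as a small set of symbolic goal conditions over these object IDs. The sketch below is illustrative only; the predicate names follow the running example rather than an official schema.

```python
# Illustrative grounded goal for the "clean with the rag" instruction
# (predicate and object names follow the running example, not the benchmark's exact format).
grounded_goal = {
    "object_states": ["not_stained(tray.1)", "not_stained(bowl.1)", "not_stained(fridge.97)"],
    "relations": ["next_to(rag.0, sink.82)"],
}
```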

Ability Module 2: Subgoal Decomposition

Subgoal Decomposition generates a sequence of subgoal states, where each state is a set of object states and relations. Here, we highlight the important intermediate states, such as the sequence next_to(rag.0, sink.82), toggled_on(sink.82), soaked(rag.0), toggled_off(sink.82), open(fridge.97), not_stained(fridge.97). To achieve these state transitions, a high-level planner such as BFS can search for an action sequence that realizes them, yielding: RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97). Note that multiple actions may be required to achieve a single one-step state transition. For example, reaching the first subgoal state next_to(rag.0, sink.82) requires two actions: RIGHT_GRASP(rag.0) and RIGHT_PLACE_NEXTTO(sink.82). See Figure 2 for the input and output formulation.

Embodied agent interface taxonomy example.
Figure 2: The input and output formulation of four ability modules for Embodied Agent Interface.
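
To make the planner call in Subgoal Decomposition concrete, below is a minimal BFS sketch over symbolic states. It is illustrative only: states are frozensets of grounded predicate strings, and actions are (name, preconditions, add effects, delete effects) tuples, which is an assumption rather than the benchmark's actual planner interface.

```python
from collections import deque

def bfs_plan(initial_state, goal, actions):
    """Breadth-first search for the shortest action sequence whose final state satisfies `goal`."""
    frontier = deque([(initial_state, [])])
    visited = {initial_state}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                      # every goal predicate holds
            return plan
        for name, pre, add, delete in actions:
            if pre <= state:                   # action is applicable in this state
                nxt = frozenset((state - delete) | add)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None                                # no plan within the given action set

# Toy usage with three actions from the running example:
actions = [
    ("RIGHT_GRASP(rag.0)", frozenset(), frozenset({"holding(rag.0)"}), frozenset()),
    ("RIGHT_PLACE_NEXTTO(sink.82)", frozenset({"holding(rag.0)"}),
     frozenset({"next_to(rag.0, sink.82)"}), frozenset()),
    ("TOGGLE_ON(sink.82)", frozenset({"next_to(rag.0, sink.82)"}),
     frozenset({"toggled_on(sink.82)"}), frozenset()),
]
print(bfs_plan(frozenset(), frozenset({"toggled_on(sink.82)"}), actions))
# -> ['RIGHT_GRASP(rag.0)', 'RIGHT_PLACE_NEXTTO(sink.82)', 'TOGGLE_ON(sink.82)']
```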

Ability Module 3: Action Sequencing

Action Sequences are essential to achieve the state transitions identified in Subgoal Decomposition. For example, a successful execution of the action sequence RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97) is shown in Figure 3.
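
Evaluating a predicted action sequence amounts to executing it step by step against the transition models and then checking the goal conditions. The sketch below is a simplified executability check; the error labels loosely mirror the categories reported in the tables above, but our actual evaluator is more fine-grained.

```python
def execute_plan(initial_state, plan, operators, goal):
    """Symbolically execute a grounded plan and report the first failure, if any.

    `operators` maps a grounded action string to (preconditions, add_effects, delete_effects),
    each a set of grounded predicate strings.
    """
    state = set(initial_state)
    for action in plan:
        if action not in operators:
            return False, f"hallucinated action: {action}"               # not in the action space
        pre, add, delete = operators[action]
        if not pre <= state:
            return False, f"unsatisfied precondition before {action}"    # missing step / wrong order
        state = (state - delete) | add
    if goal <= state:
        return True, None
    return False, "executable, but goal conditions not satisfied"

# Toy usage with the SOAK step from the running example:
operators = {
    "SOAK(rag.0)": (
        {"holding(rag.0)", "next_to(rag.0, sink.82)", "toggled_on(sink.82)"},
        {"soaked(rag.0)"},
        set(),
    ),
}
print(execute_plan(
    {"holding(rag.0)", "next_to(rag.0, sink.82)", "toggled_on(sink.82)"},
    ["SOAK(rag.0)"], operators, {"soaked(rag.0)"},
))
# -> (True, None)
```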

Ability Module 4: Transition Modeling

Transition Modeling serves as the low-level controller that guides the simulator in performing state transitions from preconditions to post-effects. For example, in the cleaning task, the input is the operator name soak, and the preconditions are three states: holding(?obj1), next_to(?sink ?agent), and toggled_on(?sink). The post-effect after executing SOAK is soaked(?obj1).
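
A transition model can thus be viewed as an operator with parameters, preconditions, and effects that is grounded with concrete object IDs. The encoding below is a hand-written illustration of the soak operator following the example above, not our ground-truth operator definition.

```python
# Illustrative encoding of the soak operator (names follow the example in the text).
SOAK = {
    "name": "soak",
    "parameters": ["?obj1", "?sink", "?agent"],
    "preconditions": ["holding(?obj1)", "next_to(?sink, ?agent)", "toggled_on(?sink)"],
    "effects": ["soaked(?obj1)"],
}

def ground(predicates, binding):
    """Substitute concrete object IDs for operator variables."""
    grounded = []
    for pred in predicates:
        for var, obj in binding.items():
            pred = pred.replace(var, obj)
        grounded.append(pred)
    return grounded

binding = {"?obj1": "rag.0", "?sink": "sink.82", "?agent": "agent.0"}
print(ground(SOAK["preconditions"], binding))
# -> ['holding(rag.0)', 'next_to(sink.82, agent.0)', 'toggled_on(sink.82)']
```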

Example of successful execution in Embodied Agent Interface.
Figure 3: An example of successful execution in Embodied Agent Interface.

Evaluation Setup


We evaluate the performance of LLMs for embodied decision making using the Embodied Agent Interface. Below is a detailed description of the evaluation setup.

Dataset Description

Focusing on complex long-horizon tasks, we select VirtualHome (V) and BEHAVIOR (B) as our evaluation simulators based on their task length and scene complexity. Table 1 shows our annotations. Apart from the goal and trajectory annotations, we introduce the Goal Action annotation to capture necessary actions that have no post-effects, such as the goal action touch in the task “pet the cat”. In the subset of VirtualHome tasks we work on, \(80.7\%\) of task categories include instructions with more than \(10\) action steps, and \(33\%\) of the instructions are longer than \(10\) steps.

We select BEHAVIOR as the second simulator due to its task complexity. BEHAVIOR BDDL goals may contain quantifiers, such as (forpairs (?jar ?apple) (inside ?apple ?jar)), which need to be translated into grounded goals containing only atomic propositions, e.g., (and (inside apple_1 jar_1) (inside apple_2 jar_2)). Different grounded goals can satisfy the same BDDL goal, such as (and (inside apple_2 jar_1) (inside apple_1 jar_2)); we call them goal options. In general, one BDDL goal corresponds to a number of goal options. The average number of grounded goals per task is \(6.7\), and there are \(4,164.4\) goal options per task on average.
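
For intuition, the sketch below enumerates the goal options of the forpairs example above by pairing each jar with a distinct apple; this is an illustrative expansion, not our grounding code.

```python
from itertools import permutations

def forpairs_goal_options(jars, apples):
    """Enumerate grounded goal options for (forpairs (?jar ?apple) (inside ?apple ?jar))."""
    options = []
    for perm in permutations(apples, len(jars)):
        options.append([f"(inside {apple} {jar})" for jar, apple in zip(jars, perm)])
    return options

print(forpairs_goal_options(["jar_1", "jar_2"], ["apple_1", "apple_2"]))
# -> [['(inside apple_1 jar_1)', '(inside apple_2 jar_2)'],
#     ['(inside apple_2 jar_1)', '(inside apple_1 jar_2)']]
```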

Table 1: Simulator dataset statistics. New annotations collected in this paper are highlighted in color.
VirtualHome BEHAVIOR
#task name 26 100
#task instruction 338 100
#goal 801 673
   - #state 340 153
   - #relation 299 520
   - #action 162 -
#trajectory 338 100
   - #step 2960 1460
   - avg. step 8.76 14.6
#transition model 33 30
   - #precondition 99 84
   - #effect 57 51

Each instance in the dataset represents a task goal. Specifically, each task contains the following data:

  • Natural language task name
  • Natural language task instruction
  • Symbolic goal definition (including its LTL form)
  • Symbolic action trajectory
  • The transition models involved in the task

For tasks in the BEHAVIOR environment, the dataset also includes accompanying VR human demonstration videos that showcase the execution of the ground truth action trajectories.

VirtualHome dataset structure example
Figure 4: VirtualHome dataset structure example.
BEHAVIOR dataset structure example
Figure 5: BEHAVIOR dataset structure example.

Please find our JSON data format in this link: Dataset JSON Format
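
For orientation only, a single instance roughly mirrors the fields listed above. The key names in the sketch below are hypothetical; please refer to the linked JSON format for the authoritative schema.

```python
# Hypothetical illustration of one task instance (key names are made up for clarity;
# see the linked "Dataset JSON Format" for the real schema).
example_task = {
    "task_name": "clean the refrigerator",
    "task_instruction": "Use the rag to clean the trays, the bowl, and the refrigerator...",
    "goal": {
        "state_goals": ["not_stained(fridge.97)"],
        "relation_goals": ["next_to(rag.0, sink.82)"],
        "ltl_form": "F(not_stained(fridge.97) & next_to(rag.0, sink.82))",
    },
    "action_trajectory": [
        "RIGHT_GRASP(rag.0)", "RIGHT_PLACE_NEXTTO(sink.82)", "TOGGLE_ON(sink.82)",
        "SOAK(rag.0)", "TOGGLE_OFF(sink.82)", "OPEN(fridge.97)", "CLEAN(fridge.97)",
    ],
    "transition_models": ["soak", "clean", "open"],
}
```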

LLMs Implementations

We integrated our evaluation pipeline into the HELM code base for easy and reproducible LLM inference; users can set up their environment using the instructions here. We standardized decoding parameters across all models, using a temperature of zero for \(\operatorname*{arg\,max}\) (greedy) decoding. Evaluating all models on our benchmark required \(180\) runs. Detailed model information is provided in the table below.
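
Concretely, the only decoding setting that matters for reproducing our runs is greedy decoding. The snippet below shows what this looks like with the OpenAI Python client as one example backend; HELM exposes the same setting through its own run configuration, and the model ID follows Table 2.

```python
from openai import OpenAI  # one example backend; HELM provides a common interface across providers

client = OpenAI()
prompt = "..."  # an Embodied Agent Interface prompt for one of the four ability modules
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # greedy (arg-max) decoding, matching the standardized setup
)
print(response.choices[0].message.content)
```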

Table 2: Model Cards for All Evaluated Large Language Models
Model Name Creator Complete Model ID Release Hosting
Claude-3 Haiku Anthropic claude-3-haiku-20240307 03/07/24 Anthropic
Claude-3 Sonnet Anthropic claude-3-sonnet-20240229 02/29/24 Anthropic
Claude-3 Opus Anthropic claude-3-opus-20240229 02/29/24 Anthropic
Claude-3.5 Sonnet Anthropic claude-3-5-sonnet-20240620 06/20/24 Anthropic
Cohere Command R Cohere command-r 03/11/24 Cohere
Cohere Command R+ Cohere command-r-plus 04/04/24 Cohere
Gemini 1.0 Pro Google gemini-pro 12/13/23 GCP Vertex
Gemini 1.5 Flash Google gemini-1.5-flash-preview-0514 05/14/24 GCP Vertex
Gemini 1.5 Pro Google gemini-1.5-pro-preview-0409 04/09/24 GCP Vertex
GPT-3.5-turbo OpenAI gpt-3.5-turbo-0125 01/25/24 OpenAI
GPT-4-turbo OpenAI gpt-4-turbo-2024-04-09 04/09/24 OpenAI
GPT-4o OpenAI gpt-4o-2024-05-13 05/13/24 OpenAI
Llama3 8B Instruct Meta meta-llama-3-8b-instruct 04/18/24 TogetherAI
Llama3 70B Instruct Meta meta-llama-3-70b-instruct 04/18/24 TogetherAI
Mistral Large MistralAI mistral-large-2402 02/26/24 MistralAI
Mixtral 8x22B MoE MistralAI mixtral-8x22b-instruct-v0.1 04/17/24 TogetherAI
o1-mini OpenAI o1-mini-2024-09-12 09/12/24 OpenAI
o1-preview OpenAI o1-preview-2024-09-12 09/12/24 OpenAI