
Embodied Agent Interface: A Single Line to Evaluate LLMs for Embodied Decision Making

Manling Li1,2†, Shiyu Zhao1†, Qineng Wang1,2†, Kangrui Wang1,2†, Yu Zhou1†,
1Stanford University, 2Northwestern University, 3Amazon, 4MIT
†Equal contribution

Embodied Agent Interface aims to tackle the following challenges in evaluating LLMs for building embodied decision-making agents: (1) Standardization of goal specifications. (2) Standardization of modules and interfaces. (3) Broad coverage of evaluation and fine-grained metrics.



Empirical Findings

  1. Goal Interpretation:
    • LLMs struggle to translate natural language instructions into grounded states.
    • Common errors include generating intermediate goals and omitting spatial relationship goals.
    • Gemini 1.5 Pro has the highest goal interpretation performance, while Claude-3 Opus excels in goal retrieval rate.
    • Proprietary LLMs make fewer grammar errors compared to open-source LLMs.
    Table: All goal evaluation results (%) for goal interpretation (V: VirtualHome, B: BEHAVIOR)
    Model Name Goal Interpretation
    State Spatial Action Overall
    Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1
    V B V B V B V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 21.8 22.8 58.9 93.5 31.8 36.7 24.2 64.5 50.8 64.6 32.8 64.6 12.2 - 95.7 - 21.6 - 18.0 41.5 63.2 71.2 28.0 52.5
    Claude-3 Sonnet 23.3 36.8 57.1 88.9 33.1 52.0 26.6 76.2 53.0 79.8 35.5 77.9 12.4 - 85.8 - 21.7 - 19.3 60.2 61.5 81.9 29.4 69.4
    Claude-3 Opus 27.0 72.6 66.9 93.5 38.5 81.7 22.6 75.2 46.8 79.2 30.5 77.1 14.5 - 92.6 - 25.1 - 20.7 72.2 65.0 82.5 31.4 77.0
    Claude-3.5 Sonnet 25.3 74.0 60.9 94.8 35.8 83.1 31.1 84.4 63.8 81.3 41.8 82.9 14.0 - 98.8 - 24.5 - 21.7 81.1 69.6 84.4 33.0 82.7
    Cohere Command R 51.1 7.7 69.6 31.4 58.9 12.4 34.5 56.8 21.3 55.0 26.3 55.9 3.6 - 38.9 - 6.5 - 27.4 28.2 55.7 49.6 36.7 36.0
    Cohere Command R+ 20.9 23.3 52.0 79.1 29.8 36.0 17.9 66.7 15.2 61.5 16.4 64.0 10.4 - 82.6 - 18.5 - 14.9 42.0 44.5 65.5 22.4 51.2
    Gemini 1.0 Pro 25.3 27.4 57.9 81.1 34.9 41.0 17.0 75.2 20.6 70.4 18.6 72.7 9.9 - 68.7 - 17.2 - 16.2 51.0 45.2 72.8 23.8 60.0
    Gemini 1.5 Flash 23.6 55.8 57.9 94.1 33.5 70.1 19.8 76.6 21.1 76.7 20.5 76.7 13.5 - 90.1 - 23.5 - 18.2 69.7 50.8 80.7 26.8 74.8
    Gemini 1.5 Pro 45.4 94.0 49.1 92.8 47.2 93.4 40.0 74.4 9.7 76.7 15.6 75.6 26.8 - 80.9 - 40.3 - 35.2 78.8 41.1 80.4 37.9 79.6
    GPT-3.5-turbo 22.4 52.0 50.0 66.7 30.9 58.5 8.5 51.5 18.8 46.9 11.7 49.1 15.2 - 60.5 - 24.4 - 15.7 49.5 40.5 51.4 22.7 50.4
    GPT-4-turbo 28.6 70.4 58.5 86.9 38.4 77.8 24.7 77.5 32.9 76.4 28.2 76.9 19.0 - 82.1 - 30.9 - 24.0 75.6 53.8 78.8 33.2 77.2
    GPT-4o 29.0 67.1 60.0 94.8 39.1 78.6 31.5 81.1 43.6 78.5 36.6 79.8 20.5 - 85.8 - 33.1 - 26.4 76.5 59.1 82.2 36.5 79.2
    Llama 3 8B Instruct 21.7 17.3 54.4 80.4 31.0 28.4 14.0 51.4 7.4 20.8 9.7 29.6 11.1 - 79.4 - 19.4 - 15.5 24.1 41.9 34.3 22.6 28.3
    Llama 3 70B Instruct 23.9 69.5 61.2 95.4 34.3 80.4 22.6 70.0 37.5 73.3 28.2 71.6 11.2 - 88.8 - 19.8 - 17.5 64.7 58.0 78.3 26.9 70.9
    Mistral Large 23.6 63.5 59.1 92.2 32.8 75.2 23.7 75.1 40.3 76.2 29.8 75.6 11.2 - 84.0 - 19.7 - 17.5 69.6 57.1 79.8 26.8 74.3
    Mixtral 8x22B MoE 23.6 22.9 56.9 83.7 33.4 36.0 22.2 70.7 36.3 67.7 27.5 69.2 11.2 - 94.8 - 20.0 - 17.4 44.4 56.2 71.3 26.6 54.7
    o1-mini 26.3 63.8 58.6 90.8 36.3 74.9 30.4 77.3 39.9 76.5 34.5 76.9 13.5 - 56.8 - 21.8 - 22.4 73.3 51.3 79.8 31.2 76.4
    o1-preview 28.2 66.8 60.3 94.8 38.5 78.4 44.9 82.9 62.4 82.7 52.2 82.8 26.0 - 81.5 - 39.5 - 31.8 78.1 65.4 85.4 42.7 81.6
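
    The precision, recall, and F1 scores above are set-level metrics between the goal conditions predicted by an LLM and our annotated goal conditions. Below is a minimal sketch of such a computation, assuming each goal condition is encoded as a plain string (an illustrative assumption, not our exact implementation):

```python
def goal_prf(predicted, annotated):
    """Set-level precision/recall/F1 between predicted and annotated goal conditions."""
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)                      # correctly predicted goal conditions
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# goal_prf(["soaked(rag.0)", "open(fridge.97)"],
#          ["soaked(rag.0)", "not_stained(fridge.97)"])  -> (0.5, 0.5, 0.5)
```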

  2. Action Sequencing:
    • Reasoning ability is crucial for LLMs; trajectory feasibility errors are common (41.2%).
    • o1-preview has the highest task success rate (81.0%) and execution success rate (91.0%) in BEHAVIOR, while Mistral Large (73.4%) and Gemini 1.5 Pro (73.1%) outperform it in VirtualHome.
    • SOTA LLMs make fewer grammar errors. For example, Claude-3 Opus makes no errors, while GPT-3.5-turbo has a 4.0% error rate in BEHAVIOR.
    • Common runtime errors include missing steps and wrong order. In BEHAVIOR, GPT-4o encounters 36.0% missing step errors and 9.0% wrong order errors.
    • LLMs perform better with state goals than relation goals, but struggle with complex action goals. GPT-4o achieves 82.0% success in state goals and 67.8% in relation goals in VirtualHome.
    • Task complexity, such as the number of goals and action sequence length, lowers success rates. In BEHAVIOR, tasks with more than 10 goals have a success rate below 40%.
    Table: Trajectory evaluation results (%) for action sequencing.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 43.3 26.0 48.5 32.0 0.0 0.0 4.9 6.0 0.3 0.0 1.6 7.0 43.3 54.0 1.3 1.0 3.3 1.0
    Claude-3 Sonnet 62.9 44.0 67.2 57.0 0.0 0.0 5.6 1.0 0.7 7.9 2.3 11.0 22.9 19.0 1.3 11.0 3.6 2.0
    Claude-3 Opus 66.2 51.0 70.8 59.0 0.0 0.0 14.1 0.0 0.0 0.0 0.7 3.0 14.1 35.0 0.3 3.0 6.2 2.0
    Claude-3.5 Sonnet 72.8 60.0 75.4 69.0 0.0 0.0 2.3 0.0 0.0 0.0 1.0 5.0 19.7 25.0 1.6 1.0 5.2 2.0
    Gemini 1.0 Pro 34.4 27.0 45.9 32.0 0.3 7.0 9.2 3.0 2.0 6.0 1.3 13.0 38.7 35.0 2.6 4.0 7.2 4.0
    Gemini 1.5 Flash 61.9 40.0 67.2 52.0 0.0 0.0 2.0 0.0 0.3 0.0 0.3 5.0 29.8 42.0 0.3 1.0 4.3 2.0
    Gemini 1.5 Pro 73.1 42.0 83.3 54.0 0.0 0.0 1.6 0.0 0.3 0.0 0.3 6.0 13.1 39.0 1.3 1.0 5.6 2.0
    GPT-3.5-turbo 14.7 16.0 31.8 20.0 35.1 4.0 1.6 7.0 1.3 23.0 0.3 1.0 28.2 36.0 1.6 8.0 2.0 1.3
    GPT-4-turbo 57.0 38.0 65.6 45.0 0.0 0.0 1.6 0.0 0.3 0.0 0.0 7.0 32.1 47.0 0.3 1.0 3.6 0.0
    GPT-4o 61.6 47.0 71.1 53.0 0.3 0.0 1.3 1.0 0.3 0.0 0.3 9.0 25.2 36.0 1.3 1.0 4.9 0.0
    Cohere Command R 24.6 16.0 37.7 19.0 0.7 5.0 29.8 13.0 2.0 0.0 3.0 8.0 25.2 43.0 2.0 12.0 4.3 4.0
    Cohere Command R+ 63.3 27.0 70.2 35.0 0.0 0.0 5.6 1.0 0.7 15.0 0.3 10.0 22.6 39.0 0.7 0.0 5.9 15.0
    Mistral Large 73.4 33.0 83.6 50.0 0.0 0.0 2.6 0.0 0.3 0.0 0.3 8.0 12.8 35.0 0.3 6.0 4.9 7.0
    Mixtral 8x22B MoE 46.2 30.0 49.5 40.0 0.0 3.0 13.1 6.0 0.7 0.0 0.7 10.0 34.7 32.0 1.3 9.0 3.0 2.0
    Llama 3 8B 21.6 10.0 25.9 16.0 0.0 0.0 41.6 15.0 1.0 9.0 0.3 6.0 31.1 44.0 0.0 9.0 0.3 5.0
    Llama 3 70B 55.7 34.0 63.0 42.0 0.0 0.0 23.3 2.0 1.0 0.0 2.0 15.0 7.9 38.0 3.0 3.0 7.9 6.0
    o1-mini 65.9 56.0 68.9 65.0 0.3 0.0 5.2 3.0 3.3 0.0 0.3 7.0 21.6 17.0 0.3 6.0 5.9 5.0
    o1-preview 71.1 81.0 78.4 91.0 2.0 0.0 8.2 0.0 0.0 0.0 0.3 0.0 34.1 6.0 0.3 2.0 8.9 3.0

    Table: All goal success results (%) for action sequencing and subgoal decomposition.
    Model Name Action Sequencing Subgoal Decomposition
    State Goal Relation Goal Action Goal Total State Goal Relation Goal Action Goal Total
    V B V B V B V B V B V B V B V B
    Claude-3 Haiku 58.6 27.0 47.2 38.7 33.1 - 49.0 35.5 89.4 26.0 82.2 34.8 71.6 - 83.1 32.4
    Claude-3 Sonnet 80.9 41.0 73.3 59.8 48.6 - 70.8 54.6 89.1 37.0 89.3 49.8 83.3 - 88.0 46.3
    Claude-3 Opus 64.7 45.0 79.4 53.0 57.4 - 67.3 50.8 92.4 43.0 88.6 41.6 83.3 - 89.1 42.0
    Claude-3.5 Sonnet 81.3 63.0 79.4 62.4 57.4 - 74.9 62.6 92.9 41.0 88.6 39.5 87.0 - 90.1 39.9
    Gemini 1.0 Pro 52.2 28.0 36.1 32.0 42.6 - 45.0 30.9 84.4 26.0 61.5 31.1 72.8 - 73.5 29.7
    Gemini 1.5 Flash 79.5 34.0 65.5 50.0 48.0 - 67.7 45.6 93.5 44.0 88.3 36.0 92.0 - 91.3 38.2
    Gemini 1.5 Pro 81.7 41.0 77.2 43.2 68.2 - 77.1 42.6 91.2 31.0 72.5 37.1 89.5 - 83.9 35.4
    GPT-3.5-turbo 29.5 20.0 18.3 22.6 23.6 - 24.8 21.9 84.7 28.0 54.4 28.5 64.8 - 69.4 28.3
    GPT-4-turbo 74.1 39.0 73.3 39.5 47.3 - 67.3 39.3 93.5 45.0 84.2 46.1 90.7 - 89.5 45.8
    GPT-4o 82.0 49.0 67.8 45.5 57.4 - 71.8 46.5 92.1 50.0 84.2 53.2 93.2 - 89.4 52.3
    Cohere Command R 24.1 20.0 40.0 25.9 37.1 - 32.0 24.3 85.3 20.0 67.4 21.4 60.5 - 73.6 21.0
    Cohere Command R+ 71.2 28.0 63.9 32.0 60.2 - 66.3 30.9 89.4 34.0 66.8 29.6 75.9 - 78.3 30.8
    Mistral Large 81.3 38.5 77.8 41.2 75.0 - 78.7 40.4 92.9 33.0 71.5 35.6 90.1 - 84.4 34.9
    Mixtral 8x22B MoE 48.9 30.0 56.1 36.8 37.2 - 48.2 35.0 92.1 30.0 74.8 34.1 87.7 - 84.8 33.0
    Llama 3 8B 26.3 16.0 26.1 23.7 10.1 - 22.2 21.6 68.8 21.0 54.7 23.6 50.0 - 59.8 22.9
    Llama 3 70B 42.8 31.0 64.4 45.5 53.4 - 51.8 41.5 93.2 25.0 63.4 27.7 82.7 - 80.0 27.0
    o1-mini 75.2 64.0 68.3 66.9 51.4 - 67.3 66.1 89.7 28.0 68.8 38.0 81.5 - 80.3 35.3
    o1-preview 86.0 89.5 71.1 84.4 56.1 - 74.3 85.8 91.8 56.5 88.3 69.4 92.6 - 90.6 65.9

  3. Subgoal Decomposition:
    • Subgoal decomposition is not easier than action sequencing in abstract action spaces.
    • o1-preview shows superior performance in VirtualHome (89.4%) and BEHAVIOR (57.0%). Gemini 1.5 Flash also performs well in VirtualHome (89.1%).
    • SOTA models avoid grammar errors but can hallucinate actions (e.g., GPT-4o adds "POUR" in VirtualHome).
    • Common runtime errors: extra steps in VirtualHome, missing steps in BEHAVIOR.
    • LLMs like o1-preview are more accurate in action goals in VirtualHome; state and relation goals in BEHAVIOR are more difficult due to stricter precondition checks.
    • Performance is lower in BEHAVIOR due to complex task representations with quantifiers like "forall" and "forpairs."
    Table: All trajectory evaluation results (%) for subgoal decomposition.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    V B V B V B V B V B V B V B V B V B
    Claude-3 Haiku 78.4 30.0 82.8 35.0 0.3 0.0 2.4 1.0 1.8 0.0 1.8 3.0 2.7 58.0 8.3 3.0 20.4 3.0
    Claude-3 Sonnet 83.1 39.0 86.4 43.0 0.0 0.0 1.8 2.0 0.0 2.0 0.6 3.0 2.7 51.0 8.6 1.0 33.7 3.0
    Claude-3 Opus 87.0 41.0 90.0 47.0 0.3 0.0 3.6 3.0 0.0 0.0 1.2 5.0 3.0 45.0 2.4 0.0 16.0 6.0
    Claude-3.5 Sonnet 89.1 39.0 92.0 44.0 0.0 0.0 1.8 1.0 0.0 0.0 1.5 11.0 2.7 44.0 2.1 0.0 24.6 4.0
    Gemini 1.0 Pro 70.4 24.0 84.6 33.0 0.6 2.0 3.3 4.0 2.4 0.0 1.2 3.0 2.7 51.0 5.3 7.0 10.4 3.0
    Gemini 1.5 Flash 89.1 34.0 94.1 42.0 0.0 2.0 1.5 1.0 0.0 0.0 0.6 2.0 3.9 53.0 0.0 0.0 13.3 3.0
    Gemini 1.5 Pro 87.0 31.0 91.1 37.0 0.0 1.0 1.5 0.0 1.8 1.0 0.0 3.0 5.6 59.0 0.0 0.0 16.0 2.0
    GPT-3.5-turbo 69.2 24.0 81.4 36.0 1.5 2.0 0.0 3.0 0.6 0.0 1.5 4.0 11.8 51.0 3.3 4.0 20.4 3.0
    GPT-4-turbo 85.5 38.0 94.1 47.0 0.0 0.0 1.8 3.0 0.0 0.0 1.5 9.0 2.4 40.0 0.3 1.0 22.2 6.0
    GPT-4o 88.8 49.0 90.2 55.0 0.0 0.0 6.2 3.0 0.0 0.0 1.2 6.0 2.4 36.0 0.0 0.0 15.7 5.0
    Cohere Command R 71.3 15.0 79.6 25.0 2.1 23.0 3.9 10.0 0.9 0.0 1.5 0.0 6.2 37.0 5.9 5.0 14.5 4.0
    Cohere Command R+ 79.0 25.0 83.7 37.0 1.5 2.0 4.5 4.0 2.1 0.0 0.9 4.0 7.7 52.0 2.7 1.0 16.0 6.0
    Mistral Large 84.3 31.0 92.0 38.0 0.3 1.0 1.8 3.0 0.3 0.0 2.1 4.0 3.3 52.0 0.3 2.0 11.0 1.0
    Mixtral 8x22B MoE 80.5 28.0 90.2 33.0 0.3 0.0 2.4 4.0 0.0 0.0 3.0 2.0 3.9 59.0 0.3 2.0 11.2 0.0
    Llama 3 8B 48.8 21.0 58.0 29.0 0.6 2.0 2.4 11.0 0.6 0.0 6.8 6.0 5.0 44.0 26.6 8.0 18.3 7.0
    Llama 3 70B 78.4 20.0 87.3 30.0 0.0 1.0 2.4 5.0 0.9 1.0 2.4 8.0 5.3 51.0 1.8 4.0 20.4 4.0
    o1-mini 79.3 31.0 84.6 39.0 0.0 0.0 1.5 3.0 0.6 3.0 0.3 7.0 8.9 46.0 4.1 2.0 21.9 1.0
    o1-preview 89.4 57.0 93.2 62.0 0.0 2.0 1.5 3.0 0.0 0.0 0.3 5.0 2.7 25.0 2.4 3.0 12.1 7.0

  4. Transition Modeling:
    • Models excel in specific categories like object states and orientation.
    • Non-spatial relations consistently pose a challenge.
    • Planning effectiveness relies on consistency in the predicted action space.
    Table: Full results of logic form accuracy for transition modeling in VirtualHome
    Model Object States Object Orientation Object Affordance Spatial Relations Non-Spatial Relations
    Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1 Precision Recall F1
    Claude-3 Haiku 76.0 40.1 52.5 19.0 34.4 24.4 67.8 73.9 70.7 37.7 38.7 38.2 2.0 1.5 1.7
    Claude-3 Opus 87.4 49.2 63.0 46.3 96.9 62.6 76.8 74.3 75.5 37.6 39.9 38.7 10.4 5.2 7.0
    Claude-3 Sonnet 76.6 37.4 50.3 48.1 78.1 59.5 60.7 74.3 66.8 32.3 39.9 35.7 6.2 4.1 4.9
    Claude-3.5 Sonnet 86.1 46.7 60.5 93.9 96.9 95.3 77.7 75.5 76.6 45.3 39.8 42.4 7.1 5.1 5.9
    Cohere Command R 18.0 6.8 9.9 38.7 90.6 54.2 40.2 23.0 29.2 12.6 6.7 8.8 3.3 0.9 1.4
    Cohere Command R+ 44.9 19.0 26.3 34.6 68.8 45.9 51.0 62.1 56.0 30.1 34.8 32.4 7.6 3.1 4.4
    Gemini 1.0 Pro 68.4 12.3 20.4 16.3 62.5 27.9 55.3 20.1 29.6 45.0 16.5 24.3 7.7 2.5 3.8
    Gemini 1.5 Flash 82.3 37.6 51.6 2.0 3.1 2.5 54.4 74.7 62.9 47.4 42.9 45.0 16.3 5.2 7.9
    Gemini 1.5 Pro 45.3 11.9 18.8 88.2 93.8 90.9 79.9 75.5 77.7 42.2 35.8 38.7 15.5 5.2 7.8
    GPT-3.5-turbo 63.5 21.9 32.5 11.4 15.6 13.2 57.2 53.1 54.9 35.2 21.7 26.8 1.7 0.3 0.6
    GPT-4-turbo 79.3 44.2 56.7 10.1 31.3 15.3 65.9 71.0 68.4 31.8 34.2 32.9 3.8 1.0 1.6
    GPT-4o 80.2 41.5 54.6 48.0 59.4 52.8 76.2 73.7 74.9 40.8 40.7 40.8 14.8 5.1 7.5
    Llama 3 8b 30.8 13.7 18.9 0.0 0.0 0.0 1.6 3.2 2.1 15.5 18.2 16.8 0.0 0.0 0.0
    Llama 3 70b 63.5 21.9 32.5 49.0 66.3 56.6 65.0 50.0 57.0 27.0 27.0 27.0 5.0 2.0 3.0
    Mistral Large 30.0 8.0 13.0 48.0 88.0 62.0 72.0 29.0 41.0 35.0 18.0 24.0 3.0 1.0 1.0
    Mixtral 8x22B MoE 72.0 33.0 45.0 43.0 83.0 57.0 64.0 74.0 69.0 40.0 38.0 39.0 12.0 4.0 6.0
    o1-mini 82.5 45.9 59.0 51.3 62.5 56.3 59.8 57.1 58.5 32.1 32.8 32.5 5.0 4.1 4.5
    o1-preview 83.0 45.1 58.5 69.0 90.6 78.4 84.7 71.4 77.5 39.8 37.8 38.8 17.1 9.0 11.8
    Table: Full results of logic form accuracy for transition modeling in BEHAVIOR
    Model Object States Spatial Relations Non-Spatial Relations
    Precision Recall F1 Precision Recall F1 Precision Recall F1
    Claude-3.5 Sonnet 83.3 74.8 78.8 73.3 48.8 58.6 82.9 66.2 73.6
    Claude-3 Haiku 64.1 55.2 59.3 54.7 37.4 44.4 63.3 51.4 56.7
    Claude-3 Opus 74.6 69.4 71.9 70.4 44.6 54.6 68.5 69.1 68.8
    Claude-3 Sonnet 66.2 68.7 67.5 62.8 39.8 48.7 68.8 52.0 59.2
    Cohere Command R 59.7 43.9 50.6 29.1 11.6 16.6 27.2 15.3 19.6
    Cohere Command R+ 58.0 58.4 58.2 54.2 33.6 41.5 53.0 56.6 54.7
    Gemini 1.0 Pro 67.2 55.2 60.6 47.5 35.3 40.5 43.8 48.3 45.9
    Gemini 1.5 Flash 73.9 57.2 64.5 54.5 40.7 46.6 60.7 53.8 57.0
    Gemini 1.5 Pro 69.6 46.7 55.9 52.9 27.2 35.9 59.6 47.4 52.8
    GPT-3.5-turbo 67.1 46.1 54.6 57.6 31.6 40.9 40.8 36.1 38.3
    GPT-4-turbo 58.2 59.4 58.8 50.3 27.8 35.8 58.5 38.4 46.4
    GPT-4o 73.1 69.6 71.3 63.9 35.8 45.9 84.7 64.2 73.0
    Llama 3 70b 68.1 64.6 66.3 60.3 38.8 47.2 65.1 53.8 58.9
    Llama 3 8b 40.3 32.4 35.9 29.6 22.7 25.7 48.9 43.9 46.2
    Mistral Large 67.5 66.5 67.0 54.9 32.3 40.7 59.7 44.6 51.1
    Mixtral 8x22B MoE 60.2 60.0 60.1 53.2 39.9 45.6 57.9 55.8 56.8
    o1-mini 46.3 37.2 41.3 71.1 42.3 53.1 80.1 58.3 67.5
    o1-preview 85.5 72.3 78.3 72.4 46.1 56.3 88.0 79.5 83.5

    Table: Full results of planner success rate for transition modeling (%)
    Model Object States Object Orientation Object Affordance Spatial Relations Non-Spatial Relations
    V B V B V B V B V B
    Claude-3 Haiku 13.5 68.9 3.6 - 19.8 - 46.9 62.8 73.0 62.3
    Claude-3 Opus 63.5 84.4 71.4 - 58.7 - 64.8 80.9 55.4 82.0
    Claude-3 Sonnet 11.2 80.0 3.6 - 10.8 - 20.0 79.8 13.5 80.3
    Claude-3.5 Sonnet 67.4 86.7 96.4 - 67.8 - 96.6 80.8 91.9 80.3
    Cohere Command R 44.6 48.9 82.1 - 40.1 - 62.6 38.3 58.3 39.3
    Cohere Command R+ 36.5 77.8 46.4 - 35.3 - 40.7 57.4 31.1 47.5
    Gemini 1.0 Pro 10.7 22.2 0.0 - 10.2 - 14.5 13.8 2.7 14.8
    Gemini 1.5 Flash 34.8 55.6 7.1 - 46.7 - 61.4 68.1 60.8 70.5
    Gemini 1.5 Pro 94.4 35.6 89.3 - 95.8 - 89.0 40.4 83.8 39.3
    GPT-3.5-turbo 1.1 26.7 25.0 - 1.2 - 0.0 39.4 0.0 54.1
    GPT-4-turbo 51.7 40.0 50.0 - 47.9 - 67.6 44.7 64.9 52.5
    GPT-4o 71.9 68.9 78.6 - 63.5 - 66.9 64.9 68.9 68.9
    Llama 3 8b 27.0 35.6 0.0 - 26.4 - 37.9 27.7 31.1 26.2
    Llama 3 70b 10.1 68.9 3.6 - 6.6 - 15.2 77.7 18.9 85.2
    Mistral Large 15.7 73.3 7.1 - 14.4 - 17.9 76.6 8.1 80.3
    Mixtral 8x22B MoE 36.5 57.8 50.0 - 28.1 - 44.1 52.1 43.2 57.4
    o1-mini 63.5 77.8 82.1 - 59.3 - 75.9 77.7 71.6 75.4
    o1-preview 69.1 86.7 100.0 - 67.1 - 76.6 89.4 78.4 90.2

  5. Sensitivity Analysis:
    • Actions like "plug_in" and "walk_towards" show low success rates.
    • Complex interactions like "slice_carvingknife" and "place_inside" present challenges.
    • Training regimens may not fully capture real-world interaction diversity.
  6. Pipeline-Based vs. Modularized:
    • Both methods achieve similar trajectory executability rates.
    • Pipeline-based methods suffer from error accumulation.
    • SOTA LLMs avoid grammar errors; less advanced models do not.
    • All LLMs are prone to runtime errors, missing necessary steps.
    Table: Pipeline-based evaluation results for (1) \(\mathcal{G}+\mathcal{Q}\) and (2) \(\mathcal{G}+\Phi\) in BEHAVIOR. \(\mathcal{G}\): Goal Interpretation. \(\mathcal{Q}\): Action Sequencing. \(\Phi\): Subgoal Decomposition. In this table, M means 'modularized' and P means 'pipeline-based'.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    M P M P M P M P M P M P M P M P M P
    Goal Interpretation + Action Sequencing
    Claude-3 Haiku 26.0 21.0 32.0 29.0 0.0 0.0 6.0 6.0 0.0 0.0 7.0 6.0 54.0 52.0 1.0 7.0 1.0 17.0
    Claude-3 Sonnet 44.0 41.0 57.0 53.0 0.0 0.0 1.0 3.0 0.0 0.0 11.0 14.0 19.0 21.0 11.0 9.0 2.0 12.0
    Claude-3 Opus 51.0 46.0 59.0 54.0 0.0 1.0 0.0 1.0 0.0 0.0 3.0 6.0 35.0 35.0 3.0 3.0 2.0 4.0
    Gemini 1.0 Pro 27.0 26.0 32.0 35.0 7.0 5.0 3.0 3.0 6.0 6.0 13.0 14.0 35.0 38.0 4.0 2.0 4.0 11.0
    Gemini 1.5 Flash 40.0 35.0 52.0 49.0 0.0 0.0 0.0 2.0 0.0 0.0 5.0 10.0 42.0 41.0 1.0 0.0 2.0 7.0
    Gemini 1.5 Pro 42.0 37.0 54.0 55.0 0.0 1.0 0.0 1.0 0.0 0.0 6.0 7.0 39.0 35.0 1.0 1.0 2.0 0.0
    GPT-3.5-turbo 16.0 14.0 20.0 32.0 4.0 1.0 7.0 3.0 23.0 15.0 1.0 5.0 36.0 39.0 8.0 6.0 1.0 3.0
    GPT-4-turbo 38.0 32.0 45.0 47.0 0.0 1.0 0.0 1.0 0.0 0.0 7.0 9.0 47.0 41.0 1.0 1.0 0.0 0.0
    GPT-4o 47.0 42.0 53.0 55.0 0.0 0.0 1.0 3.0 0.0 0.0 9.0 6.0 36.0 35.0 1.0 1.0 0.0 4.0
    Cohere Command R 16.0 5.0 19.0 9.0 5.0 3.0 13.0 38.0 0.0 1.0 8.0 8.0 43.0 31.0 12.0 12.0 4.0 8.0
    Cohere Command R+ 27.0 15.0 35.0 29.0 0.0 0.0 1.0 8.0 15.0 14.0 10.0 30.0 39.0 31.0 0.0 2.0 15.0 22.0
    Mistral Large 33.0 31.0 50.0 38.0 0.0 0.0 0.0 3.0 0.0 0.0 8.0 14.0 35.0 37.0 6.0 8.0 7.0 5.0
    Mixtral 8x22B MoE 30.0 26.0 40.0 36.0 3.0 3.0 6.0 13.0 0.0 0.0 10.0 14.0 32.0 21.0 9.0 13.0 2.0 15.0
    Llama3 8B 10.0 0.0 16.0 5.0 0.0 2.0 15.0 25.0 9.0 6.0 6.0 11.0 44.0 34.0 9.0 17.0 5.0 14.0
    Llama3 70B 34.0 26.0 42.0 40.0 0.0 1.0 2.0 3.0 0.0 0.0 15.0 18.0 38.0 35.0 3.0 5.0 6.0 9.0
    Goal Interpretation + Subgoal Decomposition
    Claude-3 Haiku 29.0 21.0 35.0 40.0 0.0 0.0 1.0 5.0 0.0 0.0 2.0 2.0 59.0 46.0 3.0 7.0 3.0 16.0
    Claude-3 Sonnet 38.0 31.0 43.0 45.0 0.0 0.0 2.0 3.0 0.0 0.0 3.0 2.0 51.0 47.0 1.0 3.0 3.0 18.0
    Claude-3 Opus 39.0 35.0 47.0 45.0 0.0 0.0 3.0 8.0 0.0 0.0 5.0 4.0 45.0 42.0 0.0 1.0 5.0 7.0
    Gemini 1.0 Pro 23.0 14.0 33.0 30.0 2.0 0.0 4.0 10.0 0.0 1.0 3.0 1.0 51.0 45.0 7.0 13.0 3.0 17.0
    Gemini 1.5 Flash 34.0 32.0 42.0 44.0 2.0 1.0 1.0 3.0 0.0 0.0 2.0 2.0 53.0 48.0 0.0 2.0 3.0 7.0
    Gemini 1.5 Pro 31.0 26.0 37.0 38.0 0.0 1.0 1.0 3.0 0.0 0.0 3.0 2.0 59.0 56.0 0.0 0.0 2.0 1.0
    GPT-3.5-turbo 24.0 14.0 36.0 27.0 2.0 0.0 3.0 12.0 0.0 22.0 3.0 1.0 52.0 32.0 4.0 6.0 3.0 5.0
    GPT-4-turbo 37.0 37.0 47.0 49.0 0.0 0.0 3.0 4.0 0.0 0.0 9.0 8.0 40.0 37.0 1.0 2.0 6.0 6.0
    GPT-4o 48.0 38.0 55.0 52.0 0.0 0.0 3.0 4.0 0.0 0.0 5.0 6.0 37.0 35.0 0.0 3.0 5.0 9.0
    Cohere Command R 15.0 8.0 25.0 15.0 21.0 13.0 11.0 32.0 0.0 1.0 0.0 1.0 38.0 32.0 4.0 6.0 4.0 12.0
    Cohere Command R+ 24.0 17.0 37.0 31.0 2.0 6.0 4.0 10.0 0.0 2.0 5.0 7.0 51.0 40.0 1.0 4.0 6.0 14.0
    Mistral Large 30.0 22.0 38.0 29.0 1.0 1.0 3.0 12.0 0.0 1.0 4.0 5.0 52.0 50.0 2.0 2.0 1.0 5.0
    Mixtral 8x22B MoE 27.0 22.0 33.0 29.0 0.0 0.0 4.0 9.0 0.0 2.0 2.0 2.0 59.0 45.0 2.0 13.0 0.0 17.0
    Llama3 8B 21.0 3.0 29.0 14.0 2.0 7.0 11.0 29.0 0.0 2.0 6.0 3.0 44.0 30.0 8.0 15.0 7.0 7.0
    Llama3 70B 20.0 19.0 30.0 31.0 1.0 1.0 5.0 22.0 1.0 1.0 8.0 7.0 51.0 35.0 4.0 3.0 4.0 7.0

  7. Replanning and Feedback:
    • Replanning based on feedback significantly improves performance.
    • Replanning can result in over-generation of actions. A minimal sketch of such a feedback-driven replanning loop is shown after the table below.
    Table: Replanning evaluation results (%) for action sequencing.
    Model Name Goal Evaluation Trajectory Evaluation
    Goal SR Execution SR Grammar Error (↓) Runtime Error (↓)
    Parsing Hallucination Action-Arg Num Wrong Order Missing Step Affordance Additional Step
    GPT-4o 65.2 71.8 0.0 1.3 0.7 0.0 25.3 1.0 0.3
    GPT-4o w/ replanning 77.4 83.3 0.0 1.3 0.0 0.0 14.1 0.3 0.7
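
    The feedback-driven replanning loop can be sketched as follows. This is illustrative only: llm and execute stand in for an LLM query function and a symbolic executor (for instance, the executability check sketched earlier), and the exact replanning prompt used in our experiments may differ.

```python
def replan_until_success(llm, task_prompt, execute, max_rounds=3):
    """Feedback-driven replanning loop (illustrative; `llm` and `execute` are assumed callables).

    `llm(prompt)` returns a list of grounded action strings; `execute(plan)` returns
    (success, error_message) from a symbolic executor.
    """
    prompt = task_prompt
    plan = []
    for _ in range(max_rounds):
        plan = llm(prompt)
        success, error = execute(plan)
        if success:
            return plan
        # Feed the execution error back to the model and ask for a revised plan.
        prompt = (f"{task_prompt}\n\nPrevious plan: {plan}\n"
                  f"Execution error: {error}\nPlease output a corrected plan.")
    return plan  # best effort after max_rounds
```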

Abstract

Problem: We aim to evaluate Large Language Models (LLMs) for embodied decision making. While a significant body of work has leveraged LLMs for decision making in embodied environments, we still lack a systematic understanding of their performance, because they are usually applied in different domains for different purposes and built on different inputs and outputs. Furthermore, existing evaluations tend to rely solely on a final success rate, making it difficult to pinpoint which ability is missing in LLMs and where the problem lies, which in turn prevents embodied agents from leveraging LLMs effectively and selectively.

Method: To address these limitations, we propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks and input-output specifications of LLM-based modules. Specifically, it allows us to unify 1) a broad set of embodied decision-making tasks involving both state and temporally extended goals, 2) four commonly used LLM-based modules for decision making: goal interpretation, subgoal decomposition, action sequencing, and transition modeling, and 3) a collection of fine-grained metrics that break down evaluation into various error types, such as hallucination errors, affordance errors, and planning errors.

Conclusion: Overall, our benchmark offers a comprehensive and systematic assessment of LLMs' performance on different subtasks, pinpointing the strengths and weaknesses of LLM-powered embodied AI systems and providing insights for the effective and selective use of LLMs in embodied decision making.

Embodied agent interface overview.
Figure 1: Embodied Agent Interface unifies a broad set of tasks involving both state and temporally extended goals and four LLM-based modules for decision making.

Embodied Agent Interface


In our Embodied Agent Interface, we propose a set of ability modules to evaluate LLMs for embodied decision making. The four ability modules are: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling. We provide a detailed description of each module below.

Ability Module 1: Goal Interpretation

Goal Interpretation aims to ground the natural language instruction into the environment's representation of objects, states, relations, and actions. For example, the task instruction "Use the rag to clean the trays, the bowl, and the refrigerator. When you are done, leave the rag next to the sink..." can be grounded to specific objects with IDs, such as fridge (ID: 97), tray (ID: 1), bowl (ID: 1), rag (ID: 0), and sink (ID: 82). Note that a single natural language description can be grounded into a set of multiple goal conditions (object states and relations).
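
As a concrete illustration, such a grounded goal can be viewed as a small set of symbolic goal conditions over these object IDs. The sketch below is illustrative only; the predicate names follow the running example rather than an official schema.

```python
# Illustrative grounded goal for the "clean with the rag" instruction
# (predicate and object names follow the running example, not the benchmark's exact format).
grounded_goal = {
    "object_states": ["not_stained(tray.1)", "not_stained(bowl.1)", "not_stained(fridge.97)"],
    "relations": ["next_to(rag.0, sink.82)"],
}
```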

Ability Module 2: Subgoal Decomposition

Subgoal Decomposition generates a sequence of subgoal states, where each state is a set of object states and relations. Here, we highlight the important intermediate states, such as the sequence next_to(rag.0, sink.82), toggled_on(sink.82), soaked(rag.0), toggled_off(sink.82), open(fridge.97), not_stained(fridge.97). To achieve these state transitions, a high-level planner such as BFS can search for an action sequence that realizes them, yielding: RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97). Note that multiple actions may be required to achieve a single one-step state transition. For example, reaching the first subgoal state next_to(rag.0, sink.82) requires two actions: RIGHT_GRASP(rag.0) and RIGHT_PLACE_NEXTTO(sink.82). See Figure 2 for the input and output formulation.

Embodied agent interface taxonomy example.
Figure 2: The input and output formulation of four ability modules for Embodied Agent Interface.
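
To make the planner call in Subgoal Decomposition concrete, below is a minimal BFS sketch over symbolic states. It is illustrative only: states are frozensets of grounded predicate strings, and actions are (name, preconditions, add effects, delete effects) tuples, which is an assumption rather than the benchmark's actual planner interface.

```python
from collections import deque

def bfs_plan(initial_state, goal, actions):
    """Breadth-first search for the shortest action sequence whose final state satisfies `goal`."""
    frontier = deque([(initial_state, [])])
    visited = {initial_state}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                      # every goal predicate holds
            return plan
        for name, pre, add, delete in actions:
            if pre <= state:                   # action is applicable in this state
                nxt = frozenset((state - delete) | add)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [name]))
    return None                                # no plan within the given action set

# Toy usage with three actions from the running example:
actions = [
    ("RIGHT_GRASP(rag.0)", frozenset(), frozenset({"holding(rag.0)"}), frozenset()),
    ("RIGHT_PLACE_NEXTTO(sink.82)", frozenset({"holding(rag.0)"}),
     frozenset({"next_to(rag.0, sink.82)"}), frozenset()),
    ("TOGGLE_ON(sink.82)", frozenset({"next_to(rag.0, sink.82)"}),
     frozenset({"toggled_on(sink.82)"}), frozenset()),
]
print(bfs_plan(frozenset(), frozenset({"toggled_on(sink.82)"}), actions))
# -> ['RIGHT_GRASP(rag.0)', 'RIGHT_PLACE_NEXTTO(sink.82)', 'TOGGLE_ON(sink.82)']
```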

Ability Module 3: Action Sequencing

Action Sequences are essential to achieve the state transitions identified in Subgoal Decomposition. For example, a successful execution of the action sequence RIGHT_GRASP(rag.0), RIGHT_PLACE_NEXTTO(sink.82), TOGGLE_ON(sink.82), SOAK(rag.0), TOGGLE_OFF(sink.82), OPEN(fridge.97), CLEAN(fridge.97) is shown in Figure 3.
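
Evaluating a predicted action sequence amounts to executing it step by step against the transition models and then checking the goal conditions. The sketch below is a simplified executability check; the error labels loosely mirror the categories reported in the tables above, but our actual evaluator is more fine-grained.

```python
def execute_plan(initial_state, plan, operators, goal):
    """Symbolically execute a grounded plan and report the first failure, if any.

    `operators` maps a grounded action string to (preconditions, add_effects, delete_effects),
    each a set of grounded predicate strings.
    """
    state = set(initial_state)
    for action in plan:
        if action not in operators:
            return False, f"hallucinated action: {action}"               # not in the action space
        pre, add, delete = operators[action]
        if not pre <= state:
            return False, f"unsatisfied precondition before {action}"    # missing step / wrong order
        state = (state - delete) | add
    if goal <= state:
        return True, None
    return False, "executable, but goal conditions not satisfied"

# Toy usage with the SOAK step from the running example:
operators = {
    "SOAK(rag.0)": (
        {"holding(rag.0)", "next_to(rag.0, sink.82)", "toggled_on(sink.82)"},
        {"soaked(rag.0)"},
        set(),
    ),
}
print(execute_plan(
    {"holding(rag.0)", "next_to(rag.0, sink.82)", "toggled_on(sink.82)"},
    ["SOAK(rag.0)"], operators, {"soaked(rag.0)"},
))
# -> (True, None)
```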

Ability Module 4: Transition Modeling

Transition Modeling serves as the low-level controller that guides the simulator in performing state transitions from preconditions to post-effects. For example, in the cleaning task, the input is the operator name soak, and the preconditions are three states: holding(?obj1), next_to(?sink ?agent), and toggled_on(?sink). The post-effect after executing SOAK is soaked(?obj1).
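
A transition model can thus be viewed as an operator with parameters, preconditions, and effects that is grounded with concrete object IDs. The encoding below is a hand-written illustration of the soak operator following the example above, not our ground-truth operator definition.

```python
# Illustrative encoding of the soak operator (names follow the example in the text).
SOAK = {
    "name": "soak",
    "parameters": ["?obj1", "?sink", "?agent"],
    "preconditions": ["holding(?obj1)", "next_to(?sink, ?agent)", "toggled_on(?sink)"],
    "effects": ["soaked(?obj1)"],
}

def ground(predicates, binding):
    """Substitute concrete object IDs for operator variables."""
    grounded = []
    for pred in predicates:
        for var, obj in binding.items():
            pred = pred.replace(var, obj)
        grounded.append(pred)
    return grounded

binding = {"?obj1": "rag.0", "?sink": "sink.82", "?agent": "agent.0"}
print(ground(SOAK["preconditions"], binding))
# -> ['holding(rag.0)', 'next_to(sink.82, agent.0)', 'toggled_on(sink.82)']
```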

Example of successful execution in Embodied Agent Interface.
Figure 3: An example of successful execution in Embodied Agent Interface.

Evaluation Setup


We evaluate the performance of LLMs for embodied decision making using the Embodied Agent Interface. Below is a detailed description of the evaluation setup.

Dataset Description

Focusing on complex long-horizon tasks, we select VirtualHome (V) and BEHAVIOR (B) as our evaluation simulators based on their task length and scene complexity. Table 1 shows our annotations. Apart from the goal and trajectory annotations, we introduce the Goal Action annotation to capture necessary actions that have no post-effects, such as the goal action touch in the task “pet the cat”. In the subset of VirtualHome tasks we work on, \(80.7\%\) of task categories include instructions with more than \(10\) action steps, and \(33\%\) of the instructions are longer than \(10\) steps.

We select BEHAVIOR as the second simulator due to its task complexity. BEHAVIOR BDDL goals may contain quantifiers, such as (forpairs (?jar ?apple) (inside ?apple ?jar)), which need to be translated into grounded goals containing only atomic propositions, e.g., (and (inside apple_1 jar_1) (inside apple_2 jar_2)). Different grounded goals can satisfy the same BDDL goal, such as (and (inside apple_2 jar_1) (inside apple_1 jar_2)); we call them goal options. In general, one BDDL goal corresponds to a number of goal options. The average number of grounded goals per task is \(6.7\), and there are \(4,164.4\) goal options per task on average.
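
For intuition, the sketch below enumerates the goal options of the forpairs example above by pairing each jar with a distinct apple; this is an illustrative expansion, not our grounding code.

```python
from itertools import permutations

def forpairs_goal_options(jars, apples):
    """Enumerate grounded goal options for (forpairs (?jar ?apple) (inside ?apple ?jar))."""
    options = []
    for perm in permutations(apples, len(jars)):
        options.append([f"(inside {apple} {jar})" for jar, apple in zip(jars, perm)])
    return options

print(forpairs_goal_options(["jar_1", "jar_2"], ["apple_1", "apple_2"]))
# -> [['(inside apple_1 jar_1)', '(inside apple_2 jar_2)'],
#     ['(inside apple_2 jar_1)', '(inside apple_1 jar_2)']]
```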

Table 1: Simulator dataset statistics. New annotations collected in this paper are highlighted in color.
VirtualHome BEHAVIOR
#task name 26 100
#task instruction 338 100
#goal 801 673
   - #state 340 153
   - #relation 299 520
   - #action 162 -
#trajectory 338 100
   - #step 2960 1460
   - avg. step 8.76 14.6
#transition model 33 30
   - #precondition 99 84
   - #effect 57 51

Each instance in the dataset represents a task goal. Specifically, each task contains the following data:

  • Natural language task name
  • Natural language task instruction
  • Symbolic goal definition (including its LTL form)
  • Symbolic action trajectory
  • The transition models involved in the task

For tasks in the BEHAVIOR environment, the dataset also includes accompanying VR human demonstration videos that showcase the execution of the ground truth action trajectories.

VirtualHome dataset structure example
Figure 4: VirtualHome dataset structure example.
BEHAVIOR dataset structure example
Figure 5: BEHAVIOR dataset structure example.

Please find our JSON data format in this link: Dataset JSON Format
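
For orientation only, a single instance roughly mirrors the fields listed above. The key names in the sketch below are hypothetical; please refer to the linked JSON format for the authoritative schema.

```python
# Hypothetical illustration of one task instance (key names are made up for clarity;
# see the linked "Dataset JSON Format" for the real schema).
example_task = {
    "task_name": "clean the refrigerator",
    "task_instruction": "Use the rag to clean the trays, the bowl, and the refrigerator...",
    "goal": {
        "state_goals": ["not_stained(fridge.97)"],
        "relation_goals": ["next_to(rag.0, sink.82)"],
        "ltl_form": "F(not_stained(fridge.97) & next_to(rag.0, sink.82))",
    },
    "action_trajectory": [
        "RIGHT_GRASP(rag.0)", "RIGHT_PLACE_NEXTTO(sink.82)", "TOGGLE_ON(sink.82)",
        "SOAK(rag.0)", "TOGGLE_OFF(sink.82)", "OPEN(fridge.97)", "CLEAN(fridge.97)",
    ],
    "transition_models": ["soak", "clean", "open"],
}
```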

LLMs Implementations

We integrated our evaluation pipeline into the HELM code base for easy and reproducible LLM inference; users can set up their environment using the instructions here. We standardized decoding parameters across all models, using a temperature of zero for \(\operatorname*{arg\,max}\) (greedy) decoding. Evaluating all models on our benchmark required \(180\) runs. Detailed model information is provided in the table below.
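
Concretely, the only decoding setting that matters for reproducing our runs is greedy decoding. The snippet below shows what this looks like with the OpenAI Python client as one example backend; HELM exposes the same setting through its own run configuration, and the model ID follows Table 2.

```python
from openai import OpenAI  # one example backend; HELM provides a common interface across providers

client = OpenAI()
prompt = "..."  # an Embodied Agent Interface prompt for one of the four ability modules
response = client.chat.completions.create(
    model="gpt-4o-2024-05-13",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # greedy (arg-max) decoding, matching the standardized setup
)
print(response.choices[0].message.content)
```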

Table 2: Model Cards for All Evaluated Large Language Models
Model Name Creator Complete Model ID Release Hosting
Claude-3 Haiku Anthropic claude-3-haiku-20240307 03/07/24 Anthropic
Claude-3 Sonnet Anthropic claude-3-sonnet-20240229 02/29/24 Anthropic
Claude-3 Opus Anthropic claude-3-opus-20240229 02/29/24 Anthropic
Claude-3.5 Sonnet Anthropic claude-3-5-sonnet-20240620 06/20/24 Anthropic
Cohere Command R Cohere command-r 03/11/24 Cohere
Cohere Command R+ Cohere command-r-plus 04/04/24 Cohere
Gemini 1.0 Pro Google gemini-pro 12/13/23 GCP Vertex
Gemini 1.5 Flash Google gemini-1.5-flash-preview-0514 05/14/24 GCP Vertex
Gemini 1.5 Pro Google gemini-1.5-pro-preview-0409 04/09/24 GCP Vertex
GPT-3.5-turbo OpenAI gpt-3.5-turbo-0125 01/25/24 OpenAI
GPT-4-turbo OpenAI gpt-4-turbo-2024-04-09 04/09/24 OpenAI
GPT-4o OpenAI gpt-4o-2024-05-13 05/13/24 OpenAI
Llama3 8B Instruct Meta meta-llama-3-8b-instruct 04/18/24 TogetherAI
Llama3 70B Instruct Meta meta-llama-3-70b-instruct 04/18/24 TogetherAI
Mistral Large MistralAI mistral-large-2402 02/26/24 MistralAI
Mixtral 8x22B MoE MistralAI mixtral-8x22b-instruct-v0.1 04/17/24 TogetherAI
o1-mini OpenAI o1-mini-2024-09-12 09/12/24 OpenAI
o1-preview OpenAI o1-preview-2024-09-12 09/12/24 OpenAI