Embodied Agent Interface: Evaluating LLMs for Embodied Decision Making
VirtualHome Track
The Embodied Agent Interface (EAI) Challenge invites participants to develop and evaluate Large Language Models (LLMs) for embodied reasoning through our standardized evaluation protocol. This challenge is part of the FMEA Workshop at CVPR 2026.
Unlike typical evaluations that only report end-to-end success rates, our framework provides fine-grained metrics that examine both whether proposed actions are actually executable in the environment and whether they truly accomplish the intended goals. The challenge uses the VirtualHome simulator with annotations that include Linear Temporal Logic (LTL) goal specifications, and the evaluation pipeline supports comprehensive error analysis.
The challenge assesses four critical capabilities for embodied reasoning (illustrated by the sketch after this list):

- **Goal Interpretation:** understanding and interpreting high-level task objectives in embodied environments
- **Subgoal Decomposition:** breaking down complex goals into manageable intermediate subgoals
- **Action Sequencing:** planning and ordering executable actions to accomplish each subgoal
- **Transition Modeling:** understanding how actions change the world state in the environment
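To make the four tasks concrete, the sketch below shows the kind of input/output pairing each one involves. The field names, the LTL goal string, and the action format are illustrative assumptions for exposition, not the dataset's confirmed schema.

```python
# Illustrative (hypothetical) input/output shapes for the four EAI tasks.
# Names and formats are assumptions for exposition, not the real schema.

task_instruction = "Put the book on the living room table."

# 1. Goal Interpretation: natural language -> formal goal conditions,
#    e.g. an LTL-style specification (F = "eventually").
interpreted_goal = "F (ontop(book, table))"

# 2. Subgoal Decomposition: goal -> ordered intermediate world states.
subgoals = [
    "holding(agent, book)",
    "next_to(agent, table)",
    "ontop(book, table)",
]

# 3. Action Sequencing: subgoals -> executable simulator actions.
action_sequence = [
    "[WALK] <book>",
    "[GRAB] <book>",
    "[WALK] <table>",
    "[PUTBACK] <book> <table>",
]

# 4. Transition Modeling: action -> preconditions and effects on the state.
transition_model = {
    "[GRAB] <book>": {
        "preconditions": ["next_to(agent, book)", "free_hand(agent)"],
        "effects": ["holding(agent, book)", "not free_hand(agent)"],
    },
}
```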
All data and evaluation are conducted within the VirtualHome simulator. We provide datasets and starter code through our GitHub repository.
| Split | Description | Status |
|---|---|---|
| Training | Training data with ground-truth annotations | Coming Soon |
| Validation | Validation set for development and tuning | Coming Soon |
| Test | Held-out test set for final evaluation | Coming Soon |
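Once the splits are released, loading an episode file could look like the sketch below. The file name and field names are placeholders until the official schema ships with the data.

```python
import json

# Hypothetical loading sketch: assumes each split ships as a JSON file with
# per-episode annotations. Path and keys are placeholders, not the real schema.
with open("virtualhome_train.json") as f:
    episodes = json.load(f)

for ep in episodes[:3]:
    print(ep["task_name"], "->", ep["natural_language_goal"])
```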
```bash
pip install embodied-agent-interface
```

Our evaluation framework goes beyond simple success rates. We employ fine-grained metrics across all four task dimensions, measuring both the executability and the correctness of agent outputs.
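As a rough illustration of the executability/correctness distinction, the toy sketch below (not the toolkit's actual evaluator) walks a predicted plan through a symbolic state, checking preconditions at every step and goal satisfaction at the end.

```python
# Toy illustration of executability vs. correctness. This is NOT the
# official evaluator; the transition model here is deliberately simplified.

def evaluate(actions, transition_model, initial_state, goal_conditions):
    """Return (executability, goal_satisfaction) for one predicted plan."""
    state = set(initial_state)
    executed = 0
    for action in actions:
        spec = transition_model.get(action)
        # An action is executable only if all of its preconditions hold.
        if spec is None or not set(spec["preconditions"]) <= state:
            break
        state -= set(spec.get("removes", []))
        state |= set(spec.get("adds", []))
        executed += 1
    executability = executed / len(actions) if actions else 0.0
    # Correctness: does the final state satisfy every goal condition?
    satisfied = sum(c in state for c in goal_conditions)
    correctness = satisfied / len(goal_conditions) if goal_conditions else 0.0
    return executability, correctness

model = {
    "[GRAB] <book>": {
        "preconditions": ["next_to(agent, book)"],
        "adds": ["holding(agent, book)"],
        "removes": [],
    },
    "[PUTBACK] <book> <table>": {
        "preconditions": ["holding(agent, book)", "next_to(agent, table)"],
        "adds": ["ontop(book, table)"],
        "removes": ["holding(agent, book)"],
    },
}
plan = ["[GRAB] <book>", "[PUTBACK] <book> <table>"]
init = ["next_to(agent, book)", "next_to(agent, table)"]
print(evaluate(plan, model, init, ["ontop(book, table)"]))  # (1.0, 1.0)
```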
Teams are ranked by a weighted overall score combining metrics across all four tasks. Detailed evaluation criteria and scoring rubrics will be released with the data.
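For intuition only, an overall score could be aggregated as below. The weights are placeholders (the official rubric has not been published); per-task scores are assumed to lie in [0, 1].

```python
# Hypothetical aggregation with placeholder (uniform) weights.
# The official weights and rubric will be released with the data.
WEIGHTS = {
    "goal_interpretation": 0.25,
    "subgoal_decomposition": 0.25,
    "action_sequencing": 0.25,
    "transition_modeling": 0.25,
}

def overall_score(task_scores: dict) -> float:
    return sum(WEIGHTS[t] * task_scores[t] for t in WEIGHTS)

print(overall_score({
    "goal_interpretation": 0.82,
    "subgoal_decomposition": 0.74,
    "action_sequencing": 0.61,
    "transition_modeling": 0.58,
}))  # 0.6875
```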
* Exact dates will be announced soon. Stay tuned!
Honorable mentions will be awarded for top performance in individual tasks: Goal Interpretation, Subgoal Decomposition, Action Sequencing, and Transition Modeling.
Participants should submit their model outputs following the format specified in our evaluation toolkit. Detailed submission instructions will be provided when the challenge officially opens.
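While the authoritative format specification is pending, a submission will presumably map episode IDs to per-task predictions. The layout below is purely a placeholder illustration, not the official spec.

```python
import json

# Placeholder submission layout (NOT the official spec): one JSON file
# mapping episode ids to the model's predictions for each task.
submission = {
    "episode_0001": {
        "goal_interpretation": "F (ontop(book, table))",
        "subgoal_decomposition": ["holding(agent, book)", "ontop(book, table)"],
        "action_sequencing": ["[GRAB] <book>", "[PUTBACK] <book> <table>"],
        "transition_modeling": {
            "[GRAB] <book>": {"preconditions": ["next_to(agent, book)"],
                              "effects": ["holding(agent, book)"]},
        },
    },
}
with open("submission.json", "w") as f:
    json.dump(submission, f, indent=2)
```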
The submission portal will be available soon.
The leaderboard will be updated once the challenge officially opens and submissions begin.
| Rank | Team / Method | Overall | Goal Interp. | Subgoal Decomp. | Action Seq. | Transition |
|---|---|---|---|---|---|---|

*Coming soon: the challenge has not yet started.*
Our previous challenge was held at NeurIPS 2025 (December 7, 2025, San Diego) and covered both the VirtualHome and BEHAVIOR environments. See the results and details here.
Winners:
For questions about the challenge, please reach out to us: