World model benchmarks & evals

Indexed benchmarks for world models — WorldScore, VBench, Physics-IQ, embodied and driving evals — with maintainers, measurement scope, and primary sources for every claim.

Entries are curated and vendor-verified as of 2026-07-05; placement cannot be bought. Corrections: see contact.

Video & World Generation Benchmarks

WorldScore

Stanford University (Stanford Vision and Learning Lab)

A unified benchmark that evaluates world generation by decomposing it into sequences of next-scene generation tasks along explicit camera trajectories, allowing 3D, 4D, and video generation models to be compared on a common protocol.

Contains 3,000 curated test examples spanning static and dynamic, indoor and outdoor, photorealistic and stylized worlds. [haoyi-duan.github.io]
Scores models along three aspects — controllability, quality, and dynamics — using metrics such as camera control, 3D consistency, photometric consistency, and motion smoothness. [haoyi-duan.github.io]
Developed at Stanford by Haoyi Duan, Hong-Xing Yu, Fei-Fei Li, Jiajun Wu and colleagues, and published at ICCV 2025 (arXiv:2504.00983). [arxiv.org]

VBench

S-Lab (NTU) & Shanghai AI Laboratory

A comprehensive benchmark suite that dissects video generation quality into hierarchical, disentangled evaluation dimensions, each validated against human preference annotations.

Comprises 16 evaluation dimensions, including subject identity consistency, motion smoothness, temporal flickering, and spatial relationships. [arxiv.org]
First submitted to arXiv in November 2023, with all prompts, evaluation methods, generated videos, and human preference annotations open-sourced. [arxiv.org]

VBench-2.0

Shanghai AI Laboratory & S-Lab (NTU) and collaborators

A successor benchmark to VBench that targets 'intrinsic faithfulness' of generated video — whether videos obey physical laws, commonsense, anatomy, and compositional logic rather than just looking convincing.

Evaluates 18 fine-grained dimensions organized into five categories: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. [vchitect.github.io]
Released in March 2025 (arXiv:2503.21755) and maintained by researchers from Shanghai AI Laboratory, S-Lab at NTU, Sun Yat-Sen University, and CUHK. [vchitect.github.io]
Combines generalist evaluators (state-of-the-art VLMs and LLMs) with specialist tools such as video anomaly detection methods. [vchitect.github.io]

WorldModelBench

UC Berkeley, MIT, NVIDIA, UCSD (WorldModelBench Team)

A benchmark for judging video generation models as world models, scoring generated videos on instruction following, commonsense, and adherence to physical laws across application-driven domains.

Covers 7 domains (Robotics, Driving, Industry, Human Activities, Gaming, Animation, Natural) with 56 subdomains and 350 prompts. [worldmodelbench-team.github.io]
Physical adherence is scored against five physical laws: Newton's first law, solid mechanics, fluid mechanics, impenetrability, and gravity. [worldmodelbench-team.github.io]
Collects 67,000 human labels, which were used to fine-tune a multimodal judge model; the benchmark was announced in February 2025 (arXiv:2502.20694). [worldmodelbench-team.github.io]

WorldSimBench

Qin et al. (multi-institution research team)

A dual-track benchmark for video generation models used as world simulators, combining human-preference-based perceptual evaluation with action-level evaluation of whether generated video translates into correct control signals.

Uses two evaluation components: Explicit Perceptual Evaluation (visual fidelity via a trained human-preference evaluator) and Implicit Manipulative Evaluation (video-action consistency in dynamic environments). [arxiv.org]
Covers three embodied scenarios: open-ended embodied environments, autonomous driving, and robot manipulation. [arxiv.org]
Introduces the HF-Embodied dataset of video assessments with fine-grained human feedback; first posted to arXiv in October 2024. [arxiv.org]

Interactive World Model Benchmarks

WorldMark

Xu et al. (academic research team)

A unified benchmark suite for interactive image-to-video world models that standardizes scenes, action sequences, and a shared control interface so heterogeneous models (Genie-class, Matrix-Game, and others) can be compared directly.

Provides 500 test cases spanning first- and third-person viewpoints, photorealistic and stylized environments, and three difficulty levels, with clip durations of 20 to 60 seconds. [arxiv.org]
Introduces a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format. [arxiv.org]
Evaluates interactive world models including Genie, YUME, HY-World, and Matrix-Game on visual quality, control alignment, and world consistency; submitted to arXiv in April 2026. [arxiv.org]
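
The action-mapping idea described above can be illustrated with a minimal sketch. The adapter names (`model_a`, `model_b`) and their native formats are hypothetical and are not taken from WorldMark's code; the sketch only shows how one shared WASD-style vocabulary might be translated into different per-model control formats.

```python
from typing import Callable, Dict, List

# Shared WASD-style vocabulary (illustrative; the benchmark's real action set may differ).
SHARED_ACTIONS = ("W", "A", "S", "D")


def to_keyboard_string(action: str) -> str:
    """Hypothetical model A: expects a named movement command."""
    return {"W": "forward", "S": "backward", "A": "left", "D": "right"}[action]


def to_displacement(action: str) -> tuple:
    """Hypothetical model B: expects an (x, z) unit displacement per step."""
    return {"W": (0, 1), "S": (0, -1), "A": (-1, 0), "D": (1, 0)}[action]


# One adapter per model; adding a model means adding one entry here.
ADAPTERS: Dict[str, Callable] = {
    "model_a": to_keyboard_string,
    "model_b": to_displacement,
}


def map_actions(model: str, actions: List[str]) -> list:
    """Translate a shared action sequence into one model's native format."""
    unknown = [a for a in actions if a not in SHARED_ACTIONS]
    if unknown:
        raise ValueError(f"actions outside the shared vocabulary: {unknown}")
    adapter = ADAPTERS[model]
    return [adapter(a) for a in actions]


if __name__ == "__main__":
    sequence = ["W", "W", "D"]
    print(map_actions("model_a", sequence))
    print(map_actions("model_b", sequence))
```

Keeping the test sequences in the shared vocabulary means every model is driven by the same intended trajectory, and only the thin adapter differs per model.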

iWorld-Bench

Fang et al. (academic research team)

A benchmark for interactive world models with a unified action generation framework, testing perception, reasoning, and interaction abilities including visual generation, trajectory following, and memory.

Evaluates 14 representative world models across six task types, with a publicly available model leaderboard. [arxiv.org]
Built from 330,000 video clips with 2,100 curated samples covering diverse perspectives, weather conditions, and scenes; the action generation framework produced 4,900 test samples. [arxiv.org]
Submitted to arXiv in May 2026 and accepted at ICML 2026. [arxiv.org]

WBench

Ying et al. (research team; GitHub under meituan-longcat)

A multi-turn benchmark for interactive video world models that measures how models handle sequential user interactions such as navigation, subject actions, event editing, and perspective switching.

Comprises 289 test cases and 1,058 interaction turns across four interaction types: navigation, subject action, event editing, and perspective switching. [arxiv.org]
Scores five areas — video quality, setting adherence, interaction adherence, consistency, and physics compliance — using 22 automatic sub-metrics that combine specialized vision models with large multimodal models. [arxiv.org]
An evaluation of 20 state-of-the-art models found that no single model performs strongly across all dimensions; submitted to arXiv in May 2026. [arxiv.org]

1X World Model Challenge

1X Technologies

An open challenge to build world models that predict future first-person observations of the EVE humanoid robot, with tracks for token compression, pixel-space sampling, and (planned) policy evaluation inside a learned world model.

Provides over 100 hours of vector-quantized image tokens and raw actions collected from EVE humanoid robots operating at 1X offices, tokenized by a MAGVIT2 autoencoder into 16x16 tokens per 256x256 frame. [github.com]
The Compression and Sampling tracks each carry a $10,000 prize; compression targets a temporally teacher-forced loss below 8.0, with baseline GENIE-style models at 8.79 and 8.99. [github.com]
A third Evaluation Challenge is planned, asking participants to rank robot policies by evaluating them inside a learned world model. [github.com]
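
The tokenization figures above imply a fixed token budget per frame. The short sketch below works through that arithmetic, assuming a square frame and a square token grid; the function name is illustrative and is not part of the challenge code.

```python
def token_budget(frame_px: int = 256, grid: int = 16) -> dict:
    """Arithmetic for a square frame encoded into a square grid of tokens."""
    if frame_px % grid != 0:
        raise ValueError("frame size must be a multiple of the token grid size")
    patch = frame_px // grid  # pixels per token along one axis
    return {
        "tokens_per_frame": grid * grid,
        "spatial_downsample": patch,
        "pixels_per_token": patch * patch,
    }


if __name__ == "__main__":
    # 256x256 frame, 16x16 tokens: 256 tokens per frame, 16x downsampling per axis.
    print(token_budget())
```

Under these assumptions each token summarizes a 16x16 pixel patch, so a frame is represented by 256 discrete tokens.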

Physics & Physical Reasoning Evals

Physics-IQ

Google DeepMind & INSAIT (Sofia University)

A benchmark of real filmed videos that tests whether generative video models understand physical principles — asking models to continue a scene and scoring whether the predicted dynamics match reality.

Tests physical understanding across five domains: solid mechanics, fluid dynamics, optics, thermodynamics, and magnetism, using real high-quality camera footage rather than rendered scenes. [physics-iq.github.io]
Developed by researchers from INSAIT (Sofia University) and Google DeepMind; accepted to WACV 2026 with an associated ICCV 2025 challenge. [physics-iq.github.io]
Evaluations of models including Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet found severely limited physical understanding despite visual realism. [physics-iq.github.io]

VideoPhy-2

Bansal et al. (UCLA-led research team)

An action-centric evaluation of physical commonsense in generated video, pairing human ratings of semantic adherence and physical plausibility with an automatic evaluator for fast scoring.

Curates 200 diverse real-world actions with detailed prompts, and judges videos on semantic adherence, physical commonsense, and grounding of physical rules. [arxiv.org]
The best evaluated model achieved only 22% joint performance on the hard subset, with particular failures on conservation laws such as mass and momentum. [arxiv.org]
Ships with VideoPhy-AutoEval, an automatic evaluator for the dataset; submitted to arXiv in March 2025. [arxiv.org]
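
A joint score of this kind can be sketched as follows, assuming that "joint" means a video counts only if it passes both the semantic-adherence and physical-commonsense checks. The 1 to 5 rating scale and the pass threshold of 4 are assumptions made for illustration, not values confirmed by this listing.

```python
from typing import List, Tuple


def joint_performance(ratings: List[Tuple[int, int]], threshold: int = 4) -> float:
    """Fraction of videos whose (semantic, physical) ratings both meet the threshold.

    `ratings` holds one (semantic_adherence, physical_commonsense) pair per video.
    The threshold is an assumed pass mark on an assumed 1-5 scale.
    """
    if not ratings:
        raise ValueError("need at least one rated video")
    passed = sum(1 for sa, pc in ratings if sa >= threshold and pc >= threshold)
    return passed / len(ratings)


if __name__ == "__main__":
    example = [(5, 4), (4, 2), (3, 5), (5, 5)]
    # Two of four videos pass both checks.
    print(joint_performance(example))  # 0.5
```

Because both conditions must hold for the same video, the joint score can be far lower than either individual pass rate.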

PhysBench

USC, UC Berkeley & Toyota Research Institute

A benchmark for vision-language models' physical world understanding, spanning object properties, object relationships, scene understanding, and physics-driven dynamics over interleaved video-image-text data.

Contains 10,002 test entries organized into four task classes, 19 subclasses, and 10 capability dimensions. [physbench.github.io]
Experiments on 75 representative VLMs revealed significant gaps in physical world understanding; the paper was published at ICLR 2025. [physbench.github.io]
Maintains an active leaderboard ranking general, image, and video VLMs by accuracy on the test subset. [physbench.github.io]

WM-ABench

Maitrix.org (UCSD, JHU, Cornell Tech, EPFL, UMich)

An atomic evaluation of whether vision-language models have internal world models, using controlled counterfactual simulations to separately probe perception and prediction abilities.

Covers 23 fine-grained world-modeling dimensions built from 6 simulated environments with controlled counterfactual scenarios. [wm-abench.maitrix.org]
A total of 660 experiments on 15 state-of-the-art VLMs found that nearly all models perform at near-random accuracy when distinguishing motion trajectories. [wm-abench.maitrix.org]
Uses a two-stage framework separating perception (visual, spatial, temporal, quantitative, motion) from prediction (mechanistic, transitive, compositional inference); paper is arXiv:2506.21876. [arxiv.org]

Embodied & Robotics World Model Evals

EmbodiedBench

UIUC, Northwestern, University of Toronto & TTIC

A benchmark for evaluating multimodal LLMs as vision-driven embodied agents, spanning high-level planning and low-level navigation and manipulation across four simulated environments.

Includes four environments — EB-ALFRED, EB-Habitat, EB-Navigation, and EB-Manipulation — covering both high-level task decomposition and low-level atomic actions. [embodiedbench.github.io]
Assesses six agent capabilities: basic task solving, commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-horizon planning. [embodiedbench.github.io]
Features 1,128 testing tasks; the paper was accepted as an oral presentation at ICML 2025, and a related challenge ran at the CVPR 2026 Workshop on Foundation Models Meet Embodied Agents. [embodiedbench.github.io]

EWMBench

AgiBot (AgibotTech)

A benchmark for embodied world models — text-driven video generation models for robot manipulation — evaluating scene consistency, motion/trajectory correctness, semantic alignment, and diversity beyond general perceptual metrics.

Evaluates embodied world models along scene consistency, trajectory consistency, semantics, and diversity, emphasizing physical plausibility and task coherence. [github.com]
Maintained by AgibotTech with a dataset hosted on Hugging Face (agibot-world/EWMBench), including ground truth and sample generated results. [github.com]
The accompanying paper (arXiv:2505.09694) was released in May 2025. [arxiv.org]

AgiBot World Challenge 2026 (World Model track)

AgiBot

A competition advancing embodied AI evaluation from simulation to real-robot testing, with a dedicated World Model track focused on how AI systems predict physical-world changes.

Held alongside ICRA 2026 in Vienna with finals in June 2026, featuring two tracks: Reasoning to Action (R2A) and World Model (WM). [agibot.com]
Drew 526 research and enterprise teams from 27 countries, with over 100 teams surpassing the official baseline. [agibot.com]
Used closed-loop testing on real AGIBOT G2 humanoid robots, prioritizing robot stability, physical-world adaptability, and long-horizon task reliability over simulation scores. [agibot.com]

Autonomous Driving Benchmarks

NAVSIM

Autonomous Vision Group (University of Tübingen) and collaborators

A data-driven, non-reactive simulation framework for benchmarking vision-based end-to-end driving policies, computing simulation-based metrics such as progress and time-to-collision from unrolled bird's-eye-view abstractions.

Uses non-reactive simulation where the evaluated policy and environment do not influence each other, enabling large-scale real-world benchmarking that aligns better with closed-loop evaluation than displacement errors. [arxiv.org]
Accepted at NeurIPS 2024 (Datasets and Benchmarks track); its CVPR 2024 competition drew 143 teams submitting 463 entries. [arxiv.org]
NAVSIM v2 (Pseudo-Simulation for Autonomous Driving, CoRL 2025) augments real data with synthetic observations generated via 3D Gaussian Splatting; its scores correlate with closed-loop simulation at R^2 = 0.8. [arxiv.org]
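
To make the idea of a simulation-based metric over an unrolled bird's-eye-view scene concrete, here is a minimal time-to-collision sketch. It is not NAVSIM's implementation: it assumes constant-velocity motion, circular agent footprints of equal radius, and a fixed horizon and step size, all of which are illustrative choices.

```python
import math
from typing import Optional, Tuple

Vec = Tuple[float, float]


def time_to_collision(
    ego_pos: Vec,
    ego_vel: Vec,
    other_pos: Vec,
    other_vel: Vec,
    radius: float = 1.0,   # assumed circular footprint per agent, in meters
    horizon: float = 4.0,  # assumed unroll horizon, in seconds
    dt: float = 0.1,       # assumed simulation step, in seconds
) -> Optional[float]:
    """Unroll both agents at constant velocity and return the first overlap time."""
    steps = int(round(horizon / dt))
    for k in range(steps + 1):
        t = k * dt
        ex, ey = ego_pos[0] + ego_vel[0] * t, ego_pos[1] + ego_vel[1] * t
        ox, oy = other_pos[0] + other_vel[0] * t, other_pos[1] + other_vel[1] * t
        if math.hypot(ex - ox, ey - oy) <= 2 * radius:
            return t
    return None  # no overlap within the horizon


if __name__ == "__main__":
    # Ego drives at 10 m/s toward a stopped vehicle 21.5 m ahead.
    print(time_to_collision((0.0, 0.0), (10.0, 0.0), (21.5, 0.0), (0.0, 0.0)))
```

In a non-reactive setting the other agent's trajectory is fixed in advance (here, by the constant-velocity assumption), so the metric depends only on the plan produced by the evaluated policy.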

Bench2Drive

Thinklab, Shanghai Jiao Tong University

A closed-loop benchmark for end-to-end autonomous driving in the CARLA simulator, disentangling driving ability across interactive scenarios rather than relying on open-loop displacement metrics.

Provides 2 million fully annotated training frames from 13,638 short clips across 44 interactive scenarios, 23 weather conditions, and 12 towns in CARLA v2. [arxiv.org]
Evaluation requires models to pass 44 interactive scenarios totaling 220 routes under varied locations and weathers. [arxiv.org]
Accepted at the NeurIPS 2024 Datasets and Benchmarks track. [arxiv.org]

Waymo Open Sim Agents Challenge (WOSAC)

Waymo

A public benchmark and challenge evaluating the realism of simulated traffic agents for autonomous-vehicle development, built on the Waymo Open Motion Dataset with a standardized metric suite.

Described by its authors as the first public challenge for designing simulators to train and evaluate autonomous driving behavior models, with an ongoing evaluation server. [arxiv.org]
Introduced by Waymo researchers (Montali, Lambert, et al.) and accepted at the NeurIPS 2023 Datasets and Benchmarks track. [arxiv.org]
The challenge has run annually since 2023, with technical reports from later editions (e.g. the 2025 challenge) published by participating teams. [arxiv.org]

OpenScene

OpenDriveLab

A large-scale 3D occupancy prediction benchmark for autonomous driving, redistributed from the nuPlan dataset; it also underpins predictive world model and end-to-end driving challenge tracks.

Covers over 120 hours of occupancy labels across Boston, Pittsburgh, Las Vegas, and Singapore, as a compact redistribution of nuPlan (more than 10x smaller) sampled at 2 Hz. [github.com]
Served as the base dataset for the 'Predictive World Model' and 'End-to-End Driving' tracks of the CVPR 2024 Autonomous Grand Challenge and the NAVSIM-v2 track at CVPR 2025. [github.com]
Includes occupancy flow annotations (direction and velocity per occupancy grid) to bridge scene representation and decision-making. [github.com]
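
One way to picture an occupancy flow annotation is as an occupancy flag plus a velocity vector per grid cell. The sketch below is a hypothetical data structure for illustration only; the field names and 2D layout are assumptions and do not reflect OpenScene's actual file format.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class OccupancyCell:
    """Hypothetical grid cell: occupancy flag plus a planar flow vector in m/s."""
    occupied: bool
    vx: float = 0.0
    vy: float = 0.0

    def speed(self) -> float:
        """Magnitude of the flow vector."""
        return math.hypot(self.vx, self.vy)

    def heading_deg(self) -> float:
        """Direction of motion in degrees, counterclockwise from the +x axis."""
        return math.degrees(math.atan2(self.vy, self.vx))


def moving_cells(cells: List[OccupancyCell], min_speed: float = 0.5) -> List[OccupancyCell]:
    """Occupied cells moving faster than an assumed speed threshold."""
    return [c for c in cells if c.occupied and c.speed() > min_speed]


if __name__ == "__main__":
    grid = [OccupancyCell(True, 3.0, 4.0), OccupancyCell(True), OccupancyCell(False)]
    for cell in moving_cells(grid):
        print(cell.speed(), cell.heading_deg())
```

Storing motion alongside occupancy lets a planner distinguish a parked car from one about to cross its path, which is the link between scene representation and decision-making that the annotation targets.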

WorldLens

Liang, Kong et al. (multi-institution research team)

A full-spectrum evaluation of driving world models measuring how well a model builds, understands, and behaves within its generated world, spanning visual realism, geometric consistency, physical plausibility, and functional reliability.

Evaluates five aspects: Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference. [arxiv.org]
Releases WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, plus WorldLens-Agent, an evaluation model trained on the annotations. [arxiv.org]
First submitted in December 2025 and accepted as a CVPR 2026 oral presentation. [arxiv.org]

ReactSim-Bench

Zhang et al. (academic research team)

A benchmark for the reactive capability of behavior world models in autonomous driving simulation, testing whether simulated agents respond appropriately when the autonomous vehicle deviates from logged behavior.

Comprises 2,636 test scenarios and decouples AV and agent control so simulated agents must react to non-logged vehicle behaviors. [arxiv.org]
Measures safety and compliance via collision metrics, map-based metrics, and kinematic feasibility metrics, and analyzes the effect of replan frequency across Transformer-based, diffusion-based, and next-token-prediction models. [arxiv.org]
Submitted to arXiv in June 2026. [arxiv.org]
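
A kinematic feasibility check of the kind mentioned above can be sketched with finite differences over a sampled trajectory. This is an illustrative approximation rather than ReactSim-Bench's metric; the speed and acceleration limits are assumed values.

```python
import math
from typing import List, Tuple


def is_kinematically_feasible(
    xy: List[Tuple[float, float]],
    dt: float,
    max_speed: float = 40.0,  # assumed limit, m/s
    max_accel: float = 8.0,   # assumed limit, m/s^2
) -> bool:
    """Check finite-difference speed and acceleration against simple limits."""
    if len(xy) < 2:
        return True  # nothing to differentiate
    # Speed between consecutive samples.
    speeds = [
        math.hypot(x1 - x0, y1 - y0) / dt
        for (x0, y0), (x1, y1) in zip(xy, xy[1:])
    ]
    # Longitudinal acceleration between consecutive speed estimates.
    accels = [(v1 - v0) / dt for v0, v1 in zip(speeds, speeds[1:])]
    return all(v <= max_speed for v in speeds) and all(abs(a) <= max_accel for a in accels)


if __name__ == "__main__":
    smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
    jumpy = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
    print(is_kinematically_feasible(smooth, dt=0.1))  # True
    print(is_kinematically_feasible(jumpy, dt=0.1))   # False
```

A trajectory that passes collision and map checks can still be unrealistic if it requires impossible accelerations, which is why a feasibility test sits alongside the other two metric families.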

3D & Spatial Understanding Benchmarks

VSI-Bench

New York University (Saining Xie's group) & Stanford (Fei-Fei Li)

A video-based visual-spatial intelligence benchmark measuring how multimodal LLMs perceive, remember, and reason about 3D spaces from sequential video observations of real indoor environments.

Contains over 5,000 question-answer pairs built on real indoor scene videos, introduced in the paper 'Thinking in Space'. [arxiv.org]
Authored by Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Fei-Fei Li, and Saining Xie; first submitted in December 2024. [arxiv.org]
The paper finds that spatial reasoning remains the primary bottleneck for MLLMs, and that generating explicit 'cognitive maps' during question answering improves performance on spatial distance questions. [arxiv.org]

Spatial4D-Bench

Wang et al. (multi-institution research team)

A 4D spatial intelligence benchmark testing multimodal LLMs on spatial and spatiotemporal understanding and reasoning, including route planning, action recognition, and physical plausibility assessment.

Contains approximately 40,000 question-answer pairs organized into 18 tasks across six cognitive categories, from object understanding to spatiotemporal reasoning. [arxiv.org]
Finds that state-of-the-art open-source and proprietary MLLMs show substantial limitations across 4D spatial reasoning tasks. [arxiv.org]
First submitted to arXiv in December 2025, with a revised version in March 2026. [arxiv.org]