World Model Benchmark

World model benchmarks & evals

Indexed benchmarks for world models — WorldScore, VBench, Physics-IQ, embodied and driving evals — with maintainers, measurement scope, and primary sources for every claim.

Entries are curated and vendor-verified (as of 2026-07-05); placement cannot be bought. For corrections, see the contact page.

Video & World Generation Benchmarks

WorldScore

Stanford University (Stanford Vision and Learning Lab)

A unified benchmark that evaluates world generation by decomposing it into sequences of next-scene generation tasks along explicit camera trajectories, allowing 3D, 4D, and video generation models to be compared on a common protocol.

  • Contains 3,000 curated test examples spanning static and dynamic, indoor and outdoor, photorealistic and stylized worlds. [haoyi-duan.github.io]
  • Scores models along three aspects — controllability, quality, and dynamics — using metrics such as camera control, 3D consistency, photometric consistency, and motion smoothness. [haoyi-duan.github.io]
  • Developed at Stanford by Haoyi Duan, Hong-Xing Yu, Fei-Fei Li, Jiajun Wu and colleagues, and published at ICCV 2025 (arXiv:2504.00983). [arxiv.org]

VBench

S-Lab (NTU) & Shanghai AI Laboratory

A comprehensive benchmark suite that dissects video generation quality into hierarchical, disentangled evaluation dimensions, each validated against human preference annotations.

  • Comprises 16 evaluation dimensions, including subject identity consistency, motion smoothness, temporal flickering, and spatial relationships. [arxiv.org]
  • First submitted to arXiv in November 2023, with all prompts, evaluation methods, generated videos, and human preference annotations open-sourced. [arxiv.org]

VBench-2.0

Shanghai AI Laboratory & S-Lab (NTU) and collaborators

A successor benchmark to VBench that targets 'intrinsic faithfulness' of generated video — whether videos obey physical laws, commonsense, anatomy, and compositional logic rather than just looking convincing.

  • Evaluates 18 fine-grained dimensions organized into five categories: Human Fidelity, Controllability, Creativity, Physics, and Commonsense. [vchitect.github.io]
  • Released in March 2025 (arXiv:2503.21755) and maintained by researchers from Shanghai AI Laboratory, S-Lab at NTU, Sun Yat-Sen University, and CUHK. [vchitect.github.io]
  • Combines generalist evaluators (state-of-the-art VLMs and LLMs) with specialist tools such as video anomaly detection methods. [vchitect.github.io]

WorldModelBench

UC Berkeley, MIT, NVIDIA, UCSD (WorldModelBench Team)

A benchmark for judging video generation models as world models, scoring generated videos on instruction following, commonsense, and adherence to physical laws across application-driven domains.

  • Covers 7 domains (Robotics, Driving, Industry, Human Activities, Gaming, Animation, Natural) with 56 subdomains and 350 prompts. [worldmodelbench-team.github.io]
  • Physical adherence is scored against five physical laws: Newton's first law, solid mechanics, fluid mechanics, impenetrability, and gravity. [worldmodelbench-team.github.io]
  • Collected 67,000 human labels used to fine-tune a multimodal judge model, and was announced in February 2025 (arXiv:2502.20694). [worldmodelbench-team.github.io]

WorldSimBench

Qin et al. (multi-institution research team)

A dual-track benchmark for video generation models used as world simulators, combining human-preference-based perceptual evaluation with action-level evaluation of whether generated video translates into correct control signals.

  • Uses two evaluation components: Explicit Perceptual Evaluation (visual fidelity via a trained human-preference evaluator) and Implicit Manipulative Evaluation (video-action consistency in dynamic environments). [arxiv.org]
  • Covers three embodied scenarios: open-ended embodied environments, autonomous driving, and robot manipulation. [arxiv.org]
  • Introduces the HF-Embodied dataset of video assessments with fine-grained human feedback; first posted to arXiv in October 2024. [arxiv.org]

Interactive World Model Benchmarks

WorldMark

Xu et al. (academic research team)

A unified benchmark suite for interactive image-to-video world models that standardizes scenes, action sequences, and a shared control interface so heterogeneous models (Genie-class, Matrix-Game, and others) can be compared directly.

  • Provides 500 test cases spanning first- and third-person viewpoints, photorealistic and stylized environments, and three difficulty levels with 20-60 second durations. [arxiv.org]
  • Introduces a unified action-mapping layer that translates a shared WASD-style action vocabulary into each model's native control format. [arxiv.org]
  • Evaluates interactive world models including Genie, YUME, HY-World, and Matrix-Game on visual quality, control alignment, and world consistency; submitted to arXiv in April 2026. [arxiv.org]
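
The idea of a unified action-mapping layer can be pictured with a minimal sketch. Everything below is hypothetical: the shared vocabulary, the two imagined native formats, and all function names are illustrative assumptions, not WorldMark's actual interface.

```python
from typing import Callable, Dict, List

# Hypothetical shared WASD-style vocabulary (an assumption for illustration).
SHARED_ACTIONS = ("W", "A", "S", "D")


def to_keyboard_model(action: str) -> Dict[str, str]:
    """Adapter for an imagined model that expects named key presses."""
    names = {"W": "forward", "A": "left", "S": "backward", "D": "right"}
    return {"key": names[action]}


def to_vector_model(action: str) -> List[float]:
    """Adapter for an imagined model that expects a (dx, dz) motion vector."""
    vectors = {
        "W": [0.0, 1.0],
        "A": [-1.0, 0.0],
        "S": [0.0, -1.0],
        "D": [1.0, 0.0],
    }
    return vectors[action]


# One adapter per model; adding a model means adding one entry here.
ADAPTERS: Dict[str, Callable[[str], object]] = {
    "keyboard_model": to_keyboard_model,
    "vector_model": to_vector_model,
}


def translate(sequence: List[str], model_name: str) -> List[object]:
    """Map one shared action sequence into a model's native control format."""
    unknown = [a for a in sequence if a not in SHARED_ACTIONS]
    if unknown:
        raise ValueError(f"unsupported actions: {unknown}")
    adapter = ADAPTERS[model_name]
    return [adapter(a) for a in sequence]
```

With a layer of this shape, the same test-case action sequence can be replayed against every model, so differences in output reflect the model rather than the input encoding.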

iWorld-Bench

Fang et al. (academic research team)

A benchmark for interactive world models with a unified action generation framework, testing perception, reasoning, and interaction abilities including visual generation, trajectory following, and memory.

  • Evaluates 14 representative world models across six task types, with a publicly available model leaderboard. [arxiv.org]
  • Built from 330,000 video clips with 2,100 curated samples covering diverse perspectives, weather conditions, and scenes; the action generation framework produced 4,900 test samples. [arxiv.org]
  • Submitted to arXiv in May 2026 and accepted at ICML 2026. [arxiv.org]

WBench

Ying et al. (research team; GitHub under meituan-longcat)

A multi-turn benchmark for interactive video world models that measures how models handle sequential user interactions such as navigation, subject actions, event editing, and perspective switching.

  • Comprises 289 test cases and 1,058 interaction turns across four interaction types: navigation, subject action, event editing, and perspective switching. [arxiv.org]
  • Scores five areas — video quality, setting adherence, interaction adherence, consistency, and physics compliance — using 22 automatic sub-metrics that combine specialized vision models with large multimodal models. [arxiv.org]
  • An evaluation of 20 state-of-the-art models found no single model performs strongly across all dimensions; submitted to arXiv in May 2026. [arxiv.org]

1X World Model Challenge

1X Technologies

An open challenge to build world models that predict future first-person observations of the EVE humanoid robot, with tracks for token compression, pixel-space sampling, and (planned) policy evaluation inside a learned world model.

  • Provides over 100 hours of vector-quantized image tokens and raw actions collected from EVE humanoid robots operating at 1X offices, tokenized by a MAGVIT2 autoencoder into 16x16 tokens per 256x256 frame. [github.com]
  • The Compression and Sampling tracks each carry a $10,000 prize; compression targets a temporally teacher-forced loss below 8.0, with baseline GENIE-style models at 8.79 and 8.99. [github.com]
  • A third Evaluation Challenge is planned, asking participants to rank robot policies by evaluating them inside a learned world model. [github.com]
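
The tokenization figures above imply some simple arithmetic, sketched here. Only the 256x256 frame size and the 16x16 token grid come from the listing; the helper name is hypothetical, and treating each token as covering a fixed pixel patch is a simplifying assumption about the autoencoder.

```python
FRAME_SIZE = 256  # pixels per side, from the dataset description
TOKEN_GRID = 16   # tokens per side, from the dataset description

# Number of discrete tokens that represent one frame.
tokens_per_frame = TOKEN_GRID * TOKEN_GRID

# Nominal pixel footprint of one token (assumes an even spatial split).
pixels_per_token_side = FRAME_SIZE // TOKEN_GRID
pixels_per_token = pixels_per_token_side ** 2


def tokens_for_clip(num_frames: int) -> int:
    """Total tokens a model must predict for a clip of num_frames frames."""
    if num_frames < 0:
        raise ValueError("num_frames must be non-negative")
    return num_frames * tokens_per_frame
```

Under these assumptions each frame is 256 tokens, and each token nominally stands in for a 16x16 pixel patch.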

Physics & Physical Reasoning Evals

Physics-IQ

Google DeepMind & INSAIT (Sofia University)

A benchmark of real filmed videos that tests whether generative video models understand physical principles — asking models to continue a scene and scoring whether the predicted dynamics match reality.

  • Tests physical understanding across five domains: solid mechanics, fluid dynamics, optics, thermodynamics, and magnetism, using real high-quality camera footage rather than rendered scenes. [physics-iq.github.io]
  • Developed by researchers from INSAIT (Sofia University) and Google DeepMind; accepted to WACV 2026 with an associated ICCV 2025 challenge. [physics-iq.github.io]
  • Evaluations of models including Sora, Runway, Pika, Lumiere, Stable Video Diffusion, and VideoPoet found severely limited physical understanding despite visual realism. [physics-iq.github.io]

VideoPhy-2

Bansal et al. (UCLA-led research team)

An action-centric evaluation of physical commonsense in generated video, pairing human ratings of semantic adherence and physical plausibility with an automatic evaluator for fast scoring.

  • Curates 200 diverse real-world actions with detailed prompts, and judges videos on semantic adherence, physical commonsense, and grounding of physical rules. [arxiv.org]
  • The best evaluated model achieved only 22% joint performance on the hard subset, with particular failures on conservation laws such as mass and momentum. [arxiv.org]
  • Ships with VideoPhy-AutoEval, an automatic evaluator for the dataset; submitted to arXiv in March 2025. [arxiv.org]
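
A joint score of this kind can be read as the fraction of videos that pass on both criteria at once. The sketch below is a hedged illustration: the 1-to-5 rating scale, the pass threshold of 4, and the function name are assumptions, not details confirmed by the listing.

```python
from typing import List, Tuple


def joint_performance(ratings: List[Tuple[int, int]], threshold: int = 4) -> float:
    """Fraction of videos whose semantic-adherence AND physical-commonsense
    ratings both reach the threshold (threshold value is an assumption)."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    passed = sum(
        1 for semantic, physical in ratings
        if semantic >= threshold and physical >= threshold
    )
    return passed / len(ratings)


# Hypothetical ratings as (semantic adherence, physical commonsense) pairs.
example = [(5, 4), (5, 2), (3, 5), (4, 4)]
```

Because a video must clear both bars, a joint score is never higher than either individual pass rate, which is one reason joint numbers can look low even for visually strong models.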

PhysBench

USC, UC Berkeley & Toyota Research Institute

A benchmark for vision-language models' physical world understanding, spanning object properties, object relationships, scene understanding, and physics-driven dynamics over interleaved video-image-text data.

  • Contains 10,002 test entries organized into four task classes, 19 subclasses, and 10 capability dimensions. [physbench.github.io]
  • Experiments on 75 representative VLMs revealed significant gaps in physical world understanding; published at ICLR 2025. [physbench.github.io]
  • Maintains an active leaderboard ranking general, image, and video VLMs by accuracy on the test subset. [physbench.github.io]

WM-ABench

Maitrix.org (UCSD, JHU, Cornell Tech, EPFL, UMich)

An atomic evaluation of whether vision-language models have internal world models, using controlled counterfactual simulations to separately probe perception and prediction abilities.

  • Covers 23 fine-grained world-modeling dimensions built from 6 simulated environments with controlled counterfactual scenarios. [wm-abench.maitrix.org]
  • 660 experiments on 15 state-of-the-art VLMs found nearly all models perform at near-random accuracy when distinguishing motion trajectories. [wm-abench.maitrix.org]
  • Uses a two-stage framework separating perception (visual, spatial, temporal, quantitative, motion) from prediction (mechanistic, transitive, compositional inference); paper is arXiv:2506.21876. [arxiv.org]

Embodied & Robotics World Model Evals

EmbodiedBench

UIUC, Northwestern, University of Toronto & TTIC

A benchmark for evaluating multimodal LLMs as vision-driven embodied agents, spanning high-level planning and low-level navigation and manipulation across four simulated environments.

  • Includes four environments — EB-ALFRED, EB-Habitat, EB-Navigation, and EB-Manipulation — covering both high-level task decomposition and low-level atomic actions. [embodiedbench.github.io]
  • Assesses six agent capabilities: basic task solving, commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-horizon planning. [embodiedbench.github.io]
  • Features 1,128 testing tasks; accepted as an oral presentation at ICML 2025, with a related challenge at the CVPR 2026 Workshop on Foundation Models Meet Embodied Agents. [embodiedbench.github.io]

EWMBench

AgiBot (AgibotTech)

A benchmark for embodied world models — text-driven video generation models for robot manipulation — evaluating scene consistency, motion/trajectory correctness, semantic alignment, and diversity beyond general perceptual metrics.

  • Evaluates embodied world models along scene consistency, trajectory consistency, semantics, and diversity, emphasizing physical plausibility and task coherence. [github.com]
  • Maintained by AgibotTech with a dataset hosted on Hugging Face (agibot-world/EWMBench), including ground truth and sample generated results. [github.com]
  • The accompanying paper (arXiv:2505.09694) was released in May 2025. [arxiv.org]

AgiBot World Challenge 2026 (World Model track)

AgiBot

A competition advancing embodied AI evaluation from simulation to real-robot testing, with a dedicated World Model track focused on how AI systems predict physical-world changes.

  • Held alongside ICRA 2026 in Vienna with finals in June 2026, featuring two tracks: Reasoning to Action (R2A) and World Model (WM). [agibot.com]
  • Drew 526 research and enterprise teams from 27 countries, with over 100 teams surpassing the official baseline. [agibot.com]
  • Used closed-loop testing on real AGIBOT G2 humanoid robots, prioritizing robot stability, physical-world adaptability, and long-horizon task reliability over simulation scores. [agibot.com]

Autonomous Driving Benchmarks

NAVSIM

Autonomous Vision Group (University of Tübingen) and collaborators

A data-driven, non-reactive simulation framework for benchmarking vision-based end-to-end driving policies, computing simulation-based metrics such as progress and time-to-collision from unrolled bird's-eye-view abstractions.

  • Uses non-reactive simulation, in which the evaluated policy and the environment do not influence each other; this enables large-scale benchmarking on real-world data, with metrics that align better with closed-loop evaluation than open-loop displacement errors do. [arxiv.org]
  • Accepted at NeurIPS 2024 (Datasets and Benchmarks track); its CVPR 2024 competition drew 143 teams submitting 463 entries. [arxiv.org]
  • NAVSIM v2 (Pseudo-Simulation for Autonomous Driving, CoRL 2025) augments real data with synthetic observations generated via 3D Gaussian Splatting; its scores correlate with closed-loop simulation at R^2 = 0.8. [arxiv.org]
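
An R^2 figure like the one above summarizes how closely one set of scores tracks another. As a minimal sketch, assuming R^2 is taken as the squared Pearson correlation between per-policy scores from two evaluation methods (the exact procedure used by the authors is not stated here, and the function name is hypothetical):

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between two equally long score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("scores must not be constant")
    return (cov * cov) / (var_x * var_y)
```

A value near 1 means that ranking policies by the cheaper metric would closely match ranking them by full closed-loop simulation.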

Bench2Drive

Thinklab, Shanghai Jiao Tong University

A closed-loop benchmark for end-to-end autonomous driving in the CARLA simulator, disentangling driving ability across interactive scenarios rather than relying on open-loop displacement metrics.

  • Provides 2 million fully annotated training frames from 13,638 short clips across 44 interactive scenarios, 23 weather conditions, and 12 towns in CARLA v2. [arxiv.org]
  • Evaluation requires models to pass 44 interactive scenarios totaling 220 routes under varied locations and weather conditions. [arxiv.org]
  • Accepted at the NeurIPS 2024 Datasets and Benchmarks track. [arxiv.org]
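
The route counts above work out to five routes per scenario on average, which makes per-scenario reporting natural. The sketch below shows one way such results could be grouped; the input format and function name are hypothetical, and this is not Bench2Drive's official scoring code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

NUM_SCENARIOS = 44  # from the benchmark description
NUM_ROUTES = 220    # from the benchmark description

# Average number of evaluation routes per scenario.
routes_per_scenario = NUM_ROUTES / NUM_SCENARIOS


def success_rate_by_scenario(
    results: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Group (scenario_name, route_passed) pairs and return the pass rate
    for each scenario."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for scenario, ok in results:
        totals[scenario] += 1
        if ok:
            passed[scenario] += 1
    return {name: passed[name] / totals[name] for name in totals}
```

Reporting by scenario rather than as one overall number is what lets a closed-loop benchmark show which driving skills a model lacks.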

Waymo Open Sim Agents Challenge (WOSAC)

Waymo

A public benchmark and challenge evaluating the realism of simulated traffic agents for autonomous-vehicle development, built on the Waymo Open Motion Dataset with a standardized metric suite.

  • Described by its authors as the first public challenge for designing simulators to train and evaluate autonomous driving behavior models, with an ongoing evaluation server. [arxiv.org]
  • Introduced by Waymo researchers (Montali, Lambert, et al.) and accepted at the NeurIPS 2023 Datasets and Benchmarks track. [arxiv.org]
  • The challenge has run annually since 2023, with technical reports from later editions (e.g. the 2025 challenge) published by participating teams. [arxiv.org]

OpenScene

OpenDriveLab

A large-scale 3D occupancy prediction benchmark for autonomous driving, redistributed from the nuPlan dataset, which also underpins predictive world model and end-to-end driving challenge tracks.

  • Covers over 120 hours of occupancy labels across Boston, Pittsburgh, Las Vegas, and Singapore, as a compact (>10x reduced) redistribution of nuPlan at 2 Hz. [github.com]
  • Served as the base dataset for the 'Predictive World Model' and 'End-to-End Driving' tracks of the CVPR 2024 Autonomous Grand Challenge and the NAVSIM-v2 track at CVPR 2025. [github.com]
  • Includes occupancy flow annotations (direction and velocity per occupancy grid) to bridge scene representation and decision-making. [github.com]
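
An occupancy flow annotation can be thought of as a velocity vector attached to each occupied grid cell. The sketch below is illustrative only: the class, field names, 2D simplification, and metric units are assumptions and do not reflect OpenScene's actual file format; only the 2 Hz rate comes from the listing.

```python
from dataclasses import dataclass
from math import atan2, degrees, hypot


@dataclass
class FlowCell:
    """One grid cell with a hypothetical 2D flow vector (units assumed m/s)."""
    occupied: bool
    vx: float
    vy: float

    def speed(self) -> float:
        """Magnitude of the flow vector."""
        return hypot(self.vx, self.vy)

    def heading_deg(self) -> float:
        """Direction of motion in degrees, measured from the +x axis."""
        return degrees(atan2(self.vy, self.vx))


def displacement_per_frame(cell: FlowCell, hz: float = 2.0) -> float:
    """Distance the cell's content moves between consecutive frames."""
    if hz <= 0:
        raise ValueError("hz must be positive")
    return cell.speed() / hz
```

For example, a cell moving at 5 m/s would shift 2.5 m between frames at 2 Hz, which is the kind of signal a planner can use to anticipate where occupied space will be next.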

WorldLens

Liang, Kong et al. (multi-institution research team)

A full-spectrum evaluation of driving world models measuring how well a model builds, understands, and behaves within its generated world, spanning visual realism, geometric consistency, physical plausibility, and functional reliability.

  • Evaluates five aspects: Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference. [arxiv.org]
  • Releases WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, plus WorldLens-Agent, an evaluation model trained on the annotations. [arxiv.org]
  • First submitted in December 2025 and accepted as a CVPR 2026 oral presentation. [arxiv.org]

ReactSim-Bench

Zhang et al. (academic research team)

A benchmark for the reactive capability of behavior world models in autonomous driving simulation, testing whether simulated agents respond appropriately when the autonomous vehicle deviates from logged behavior.

  • Comprises 2,636 test scenarios and decouples AV and agent control so simulated agents must react to non-logged vehicle behaviors. [arxiv.org]
  • Measures safety and compliance via collision metrics, map-based metrics, and kinematic feasibility metrics, and analyzes the effect of replan frequency across Transformer-based, diffusion-based, and next-token-prediction models. [arxiv.org]
  • Submitted to arXiv in June 2026. [arxiv.org]

3D & Spatial Understanding Benchmarks

VSI-Bench

New York University (Saining Xie's group) & Stanford (Fei-Fei Li)

A video-based visual-spatial intelligence benchmark measuring how multimodal LLMs perceive, remember, and reason about 3D spaces from sequential video observations of real indoor environments.

  • Contains over 5,000 question-answer pairs built on real indoor scene videos, introduced in the paper 'Thinking in Space'. [arxiv.org]
  • Authored by Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie; first submitted in December 2024. [arxiv.org]
  • The paper finds spatial reasoning remains the primary bottleneck for MLLMs, and that generating explicit 'cognitive maps' during question answering improves performance on spatial distance questions. [arxiv.org]

Spatial4D-Bench

Wang et al. (multi-institution research team)

A 4D spatial intelligence benchmark testing multimodal LLMs on spatial and spatiotemporal understanding and reasoning, including route planning, action recognition, and physical plausibility assessment.

  • Contains approximately 40,000 question-answer pairs organized into 18 tasks across six cognitive categories, from object understanding to spatiotemporal reasoning. [arxiv.org]
  • Finds that state-of-the-art open-source and proprietary MLLMs show substantial limitations across 4D spatial reasoning tasks. [arxiv.org]
  • First submitted to arXiv in December 2025, with a revised version in March 2026. [arxiv.org]