How the field decides whether a world model works
A source-linked tracker of benchmarks and evaluation suites for AI world models
Reviewed 2026-07-04
World models — systems that learn to simulate environments, predict consequences, and stay coherent while you act inside them — are graded on axes that ordinary generative-video metrics never had to consider: object permanence, physical plausibility, controllability, closed-loop behavior. A benchmark ecosystem has grown up around each of those axes, and it moves fast enough that last year's evaluation table is already incomplete.
This site indexes that ecosystem: generation-quality suites, interactive and agentic evals, physics probes, embodied-robotics tests, driving benchmarks, and 3D spatial-understanding datasets. Every entry states what the suite actually measures and links to its paper, code, and leaderboard.
The benchmark index
Two dozen evaluation suites across six categories, from video-generation quality to closed-loop driving, each listed with its maintainer, scope, and sourced facts.
Evaluation vocabulary
Rollout, sim2real gap, closed-loop evaluation, physical plausibility — the terms benchmark papers assume you already know.
How entries are vetted
Primary sources only, no rankings from us, disclosed affiliations, dated retrievals.
Disclosure: this tracker is maintained by QUVISS, whose AI production tooling builds on the world-model research measured by these benchmarks.