World Model Benchmark

How the field decides whether a world model works

A source-linked tracker of benchmarks and evaluation suites for AI world models

Reviewed 2026-07-04

World models — systems that learn to simulate environments, predict consequences, and stay coherent while you act inside them — are graded on axes that ordinary generative-video metrics never had to consider: object permanence, physical plausibility, controllability, closed-loop behavior. A benchmark ecosystem has grown up around each of those axes, and it moves fast enough that last year's evaluation table is already incomplete.

This site indexes that ecosystem: generation-quality suites, interactive and agentic evals, physics probes, embodied-robotics tests, driving benchmarks, and 3D spatial-understanding datasets. Every entry states what the suite actually measures and links to its paper, code, and leaderboard.

The benchmark index

Two dozen evaluation suites across six categories, from video-generation quality to closed-loop driving — each with maintainer, scope, and sourced facts.

Evaluation vocabulary

Rollout, sim2real gap, closed-loop evaluation, physical plausibility — the terms benchmark papers assume you already know.

How entries are vetted

Primary sources only, no rankings from us, disclosed affiliations, dated retrievals.

Disclosure: this tracker is maintained by QUVISS, whose AI production tooling builds on the world-model research measured by these benchmarks.