How the field decides whether a world model works

A source-linked tracker of benchmarks and evaluation suites for AI world models

Reviewed 2026-07-04

World models — systems that learn to simulate environments, predict consequences, and stay coherent while you act inside them — are graded on axes that ordinary generative-video metrics never had to consider: object permanence, physical plausibility, controllability, closed-loop behavior. A benchmark ecosystem has grown up around each of those axes, and it moves fast enough that last year's evaluation table is already incomplete.

This site indexes that ecosystem: generation-quality suites, interactive and agentic evals, physics probes, embodied-robotics tests, driving benchmarks, and 3D spatial-understanding datasets. Every entry states what the suite actually measures and links to its paper, code, and leaderboard.

The benchmark index

Evaluation vocabulary

How entries are vetted