Why track world-model benchmarks
Ask three labs whether their world model is "good" and you'll get three incompatible tables. Evaluation of world models is scattered across paper appendices, leaderboard sites, and GitHub READMEs, with new suites appearing every few months. This site keeps a single, maintained index of those benchmarks — what each one measures, who runs it, and where its numbers live — with every fact linked to the paper, project page, or repository it came from.
The maintainers are the team at QUVISS. We build AI production tools that sit downstream of the model families measured here, which gives us a working reason to keep this index current — and a conflict of interest you should know about. Our rule is simple: nothing appears on these pages without a primary source attached, and we editorialize about coverage, not about winners.
Inclusion rules
- Primary sources or nothing. A benchmark enters the index only after we've read its paper, project page, or repository; each fact links to the source where it was checked.
- No rankings by us. We describe what each suite measures and point to its own leaderboard. We do not declare which model is best.
- Unaffiliated by default. No benchmark maintainer has paid to appear, and inclusion cannot be bought. Our own affiliations are disclosed in the footer of every page.
- Staleness is a bug. Each entry carries a retrieval date. If a benchmark has been superseded or abandoned, tell us: the entry gets updated to say so, not quietly dropped.
- AI-drafted, source-gated. Research and drafting involve AI systems; publication requires a verified source for every claim.
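For readers who think in code, the source gate and retrieval-date rules above could be checked mechanically along these lines. This is a minimal Python sketch under stated assumptions: the names Claim, Entry, and gate, the field layout, and the sample data are all hypothetical, chosen for illustration, and do not describe the tooling behind this site.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Claim:
    """One factual statement about a benchmark (hypothetical structure)."""
    text: str
    source_url: Optional[str] = None  # paper, project page, or repository


@dataclass
class Entry:
    """One benchmark entry in the index (hypothetical structure)."""
    name: str
    retrieved: Optional[date] = None  # when the sources were last checked
    claims: list[Claim] = field(default_factory=list)


def gate(entry: Entry) -> list[str]:
    """Return the reasons an entry cannot be published; empty means it passes."""
    problems = []
    if entry.retrieved is None:
        problems.append("missing retrieval date")
    if not entry.claims:
        problems.append("no sourced claims")
    for claim in entry.claims:
        if not claim.source_url:
            problems.append(f"unsourced claim: {claim.text}")
    return problems


# Made-up sample data: one sourced claim, one claim with no source attached.
draft = Entry(
    name="ExampleBench",
    retrieved=date(2025, 1, 15),
    claims=[
        Claim("Measures long-horizon video prediction", "https://example.org/paper"),
        Claim("Leaderboard is updated weekly"),
    ],
)
print(gate(draft))
```

Returning a list of reasons rather than a single pass/fail flag means a rejected draft says exactly which claim still needs a source before it can be published.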