World model evaluation glossary

Plain-language definitions of world-model evaluation terms: rollout, closed-loop evaluation, sim2real gap, controllability, temporal consistency, and more.

World model: A model that learns how an environment evolves, predicting future states (often future video frames or latent states) from past observations and actions. Video generation models are increasingly evaluated as world models rather than just content generators. [source]
Rollout: A predicted sequence of future states or frames produced by unrolling a model forward in time from an initial observation, optionally conditioned on a sequence of actions. Benchmarks score rollouts for accuracy, consistency, and physical plausibility. [source]
Closed-loop evaluation: An evaluation where the model's outputs feed back into the environment, so its own decisions change what it sees next — as when a driving policy must complete routes in a simulator. It surfaces compounding errors that single-step metrics miss. [source]
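As a minimal sketch of the closed-loop idea, the toy loop below lets a policy steer a one-dimensional lateral offset, and each action changes the observation the policy receives next. The simulator, the drift value, and both policies are hypothetical and chosen only for illustration; none of them comes from a real benchmark.

```python
def run_closed_loop(policy, initial_offset, steps=50,
                    lane_half_width=1.0, drift=0.1):
    """Run a policy in a toy 1D lane-keeping simulator (hypothetical).

    Returns (completed, steps_survived). The policy's own actions
    determine the offset it observes on the next step.
    """
    offset = initial_offset
    for step in range(steps):
        action = policy(offset)            # policy sees the state it helped create
        offset = offset + drift + action   # environment reacts to the action
        if abs(offset) > lane_half_width:  # left the lane: route failed
            return False, step + 1
    return True, steps


def corrective_policy(offset):
    # Cancels the drift and pulls the offset halfway back to the centre.
    return -0.5 * offset - 0.1


def weak_policy(offset):
    # Under-corrects the drift by a small constant amount every step.
    return -0.05


print(run_closed_loop(corrective_policy, initial_offset=0.5))
print(run_closed_loop(weak_policy, initial_offset=0.0))
```

The weak policy is off by only 0.05 per step, yet the error accumulates until the vehicle leaves the lane. A single-step metric would not flag that kind of failure.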
Open-loop / non-reactive evaluation: An evaluation where the model is queried once (or on fixed logged data) and the environment does not react to its outputs. NAVSIM popularized non-reactive simulation as a middle ground whose scores correlate better with closed-loop results than open-loop displacement errors do. [source]
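The displacement errors mentioned above are usually reported as an average and a final value over a predicted trajectory. The sketch below computes both against a logged trajectory, assuming 2D waypoints sampled at matching timesteps. The function name and the sample waypoints are illustrative assumptions.

```python
import math


def displacement_errors(predicted, logged):
    """Return (average, final) displacement error between two trajectories.

    Both inputs are equal-length sequences of (x, y) waypoints.
    """
    if not predicted or len(predicted) != len(logged):
        raise ValueError("trajectories must be non-empty and equal length")
    distances = [math.dist(p, q) for p, q in zip(predicted, logged)]
    average_error = sum(distances) / len(distances)
    final_error = distances[-1]
    return average_error, final_error


predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
logged = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
print(displacement_errors(predicted, logged))  # (1.0, 2.0)
```

Because the logged trajectory never reacts to the prediction, this score says nothing about what would have happened had the model's plan actually been executed.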
Physical plausibility: Whether generated video or predicted futures obey real-world physics such as gravity, solid mechanics, fluid behavior, and conservation laws — as opposed to merely looking visually convincing. [source]
Intrinsic faithfulness: A term introduced with VBench-2.0 for whether generated videos adhere to physical laws, commonsense reasoning, anatomical correctness, and compositional integrity, beyond superficial per-frame visual quality. [source]
Temporal consistency: The degree to which objects, subjects, and scenes remain stable and coherent across frames of a generated video — measured through dimensions like subject identity consistency and temporal flickering. [source]
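One simple proxy for temporal flickering is the mean absolute pixel difference between consecutive frames. The sketch below assumes grayscale frames stored as nested lists of equal shape. It is a simplified illustration rather than the exact metric of any particular benchmark, and it is only meaningful on content that is supposed to be static.

```python
def mean_frame_difference(frames):
    """Mean absolute pixel change between consecutive grayscale frames.

    frames: list of 2D lists (rows of pixel intensities), all the same shape.
    Lower values mean less frame-to-frame change.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    total = 0.0
    count = 0
    for previous, current in zip(frames, frames[1:]):
        for row_prev, row_curr in zip(previous, current):
            for a, b in zip(row_prev, row_curr):
                total += abs(a - b)
                count += 1
    return total / count


static_clip = [[[5, 5]], [[5, 5]], [[5, 5]]]
flickering_clip = [[[0, 0]], [[10, 20]]]
print(mean_frame_difference(static_clip))      # 0.0
print(mean_frame_difference(flickering_clip))  # 15.0
```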
Controllability: How precisely a generative world model follows specified conditions — camera trajectories, object placements, or text instructions — when producing new scenes or video. [source]
Action conditioning: Feeding user or agent actions (e.g. movement commands, robot controls) into a world model so its predicted future depends on those actions. Interactive world model benchmarks standardize action vocabularies so that models can be compared fairly. [source]
Teacher forcing: An evaluation (or training) setup where a model predicts the next step while being given the ground-truth history at each step, isolating one-step prediction quality from compounding rollout errors. [source]
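The difference between teacher-forced and free-running (rollout) evaluation can be sketched with a toy scalar system. Here the "true" dynamics and the slightly wrong "learned" model are both hypothetical stand-ins: the model's coefficient is 0.85 where the truth is 0.9.

```python
def true_step(state, action):
    # Hypothetical ground-truth dynamics.
    return 0.9 * state + action


def model_step(state, action):
    # Hypothetical learned model with a small coefficient error.
    return 0.85 * state + action


def simulate_truth(initial_state, actions):
    states = [initial_state]
    for action in actions:
        states.append(true_step(states[-1], action))
    return states


def teacher_forced_errors(truth, actions):
    # Each prediction starts from the ground-truth state at that step.
    return [abs(model_step(truth[t], a) - truth[t + 1])
            for t, a in enumerate(actions)]


def free_running_errors(truth, actions):
    # Each prediction starts from the model's own previous prediction.
    state = truth[0]
    errors = []
    for t, a in enumerate(actions):
        state = model_step(state, a)
        errors.append(abs(state - truth[t + 1]))
    return errors


actions = [1.0] * 20
truth = simulate_truth(0.0, actions)
print(teacher_forced_errors(truth, actions)[-1])
print(free_running_errors(truth, actions)[-1])
```

In this toy setup the two errors agree on the first step, after which the free-running error pulls ahead because each prediction inherits the error of the previous one.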
Sim2real gap: The performance drop that occurs when systems developed or evaluated in simulation are deployed in the real world. Recent embodied AI competitions have shifted toward closed-loop testing on physical robots so that results reflect this gap rather than hide it. [source]
Occupancy forecasting: Predicting which 3D voxels of a scene will be occupied (and by what semantic class) at future timesteps — a common world-model task in autonomous driving built on datasets like nuPlan and nuScenes. [source]
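Occupancy forecasts are commonly scored with intersection-over-union (IoU) between predicted and actual occupied voxels at each future timestep. The sketch below assumes each frame is a set of occupied (x, y, z) voxel indices and ignores semantic classes. The function names and sample voxels are illustrative only.

```python
def occupancy_iou(predicted, actual):
    """IoU between two sets of occupied voxel indices."""
    union = predicted | actual
    if not union:
        return 1.0  # both empty: treat as perfect agreement
    return len(predicted & actual) / len(union)


def forecast_iou(predicted_frames, actual_frames):
    """Per-timestep IoU for a forecast of several future frames."""
    if len(predicted_frames) != len(actual_frames):
        raise ValueError("forecast and ground truth must cover the same timesteps")
    return [occupancy_iou(p, a) for p, a in zip(predicted_frames, actual_frames)]


predicted = [{(0, 0, 0), (1, 0, 0), (2, 0, 0)}]
actual = [{(1, 0, 0), (2, 0, 0), (3, 0, 0)}]
print(forecast_iou(predicted, actual))  # [0.5]
```

A class-aware variant would compute the same ratio separately for each semantic class and average the results.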
Learned evaluator (VLM-as-judge): Using a trained model — often a vision-language model fine-tuned on human ratings — to score generated videos automatically, so benchmarks can scale beyond manual human evaluation. [source]