Evidence infrastructure

Benchmark Atlas

A benchmark is useful only when it states which claim it supports. This atlas records each benchmark's domain, operational construct, source, limitation, and current WM Arena integration status.
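The fields listed above can be pictured as one record per benchmark. The following is a minimal sketch, not WM Arena's actual schema: the class name `AtlasEntry`, its field names, and the `supports` helper are assumptions made for illustration, while the values are copied from the Atari 100k entry in this atlas.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AtlasEntry:
    """Hypothetical record for one atlas row (field names are assumptions)."""
    benchmark_id: str
    name: str
    domain: str
    claims: Tuple[str, ...]      # operational constructs the benchmark can support
    status: str                  # integration status label, e.g. "Ranking", "Pilot"
    ranking_policy: str          # e.g. "ranking_allowed_when_self_measured"
    validity_argument: str
    limitation: str

    def supports(self, claim: str) -> bool:
        """Return True only if the entry explicitly lists this claim."""
        return claim in self.claims


# Values taken from the Atari 100k entry of this atlas.
atari = AtlasEntry(
    benchmark_id="atari_100k",
    name="Atari 100k",
    domain="rl",
    claims=("RL proxy environment", "Long-horizon coherence", "Action conditioning"),
    status="Ranking",
    ranking_policy="ranking_allowed_when_self_measured",
    validity_argument="Policy trained with limited environment data on a standardized game suite.",
    limitation="Does not test robotics executability",
)

print(atari.supports("Action conditioning"))
print(atari.supports("Robotics executability"))
```

Keeping the claims as an explicit tuple means a result can only be cited for a construct the entry names, which mirrors the atlas rule that a benchmark must say what claim it supports.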


Benchmarks

Each row states what the benchmark can support, why that claim is valid, and what still prevents stronger WM Arena ranking evidence.
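The three policy labels used in the entries (`ranking_allowed_when_self_measured`, `external_until_reproduced`, `not_a_ranking_benchmark`) suggest a simple gate on which results may enter a ranking. The sketch below is one plausible reading of those labels, not WM Arena's documented logic: the function `can_enter_ranking` and its `self_measured` and `reproduced` flags are hypothetical.

```python
from enum import Enum


class RankingPolicy(str, Enum):
    """Policy labels exactly as they appear in the atlas entries."""
    SELF_MEASURED = "ranking_allowed_when_self_measured"
    EXTERNAL = "external_until_reproduced"
    NEVER = "not_a_ranking_benchmark"


def can_enter_ranking(policy: str, self_measured: bool, reproduced: bool = False) -> bool:
    """Assumed interpretation of the labels (hypothetical helper).

    - not_a_ranking_benchmark: never ranked.
    - external_until_reproduced: ranked only after the result has been
      reproduced and measured locally.
    - ranking_allowed_when_self_measured: ranked when measured locally.
    """
    parsed = RankingPolicy(policy)  # raises ValueError for an unknown label
    if parsed is RankingPolicy.NEVER:
        return False
    if parsed is RankingPolicy.EXTERNAL:
        return reproduced and self_measured
    return self_measured


print(can_enter_ranking("ranking_allowed_when_self_measured", self_measured=True))
print(can_enter_ranking("external_until_reproduced", self_measured=False))
```

Under this reading, a leaderboard number copied from an external project page never mixes with self-measured scores, which matches the limitation stated for several external benchmarks.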


Atari 100k summary

atari_100k

Ranking

rl · RL proxy environment · Long-horizon coherence · Action conditioning

Atari is valid for RL-proxy claims because the score ultimately asks whether a policy trained with limited environment data performs well in a standardized game suite.

Limitation: Does not test robotics executability

ranking_allowed_when_self_measured

Crafter summary

crafter

Pilot

rl · RL proxy environment · Long-horizon coherence

Crafter adds a longer-horizon survival and crafting domain where policy progress depends on coherent state and resource dynamics.

Limitation: WM Arena currently treats it as a pilot surface

ranking_allowed_when_self_measured

VBench-2.0 summary

vbench_2_0

External

video · Intrinsic video faithfulness · Video physical reasoning · Action conditioning

VBench-2.0 directly operationalizes intrinsic video faithfulness across physical, commonsense, controllability, human, and composition dimensions.

Limitation: Does not prove embodied closed-loop utility

external_until_reproduced

Physics-IQ summary

physics_iq

Pilot

video · Video physical reasoning

Physics-IQ is valid for physical reasoning claims because it checks whether video models preserve spatial and temporal object dynamics rather than only visual appeal.

Limitation: Some WM Arena pilot cells currently fail due to local model availability

ranking_allowed_when_self_measured

Physion v2 summary

physion_v2

Pilot

video · Video physical reasoning

Object-contact and violation-of-expectation probes isolate intuitive physics failures that broad video quality scores can miss.

Limitation: Probe-style evidence does not imply downstream policy utility

ranking_allowed_when_self_measured

WorldModelBench summary

worldmodelbench

Catalog

video · Video physical reasoning · Intrinsic video faithfulness

WorldModelBench is relevant because it judges video generation as world modeling in application-driven domains, not only as image aesthetics.

Limitation: Requires reproduction and judge audit before WM Arena ranking

external_until_reproduced

Morpheus summary

morpheus

Catalog

video · Video physical reasoning

Conservation-law probes test a stronger form of physical reasoning than generic video realism.

Limitation: Not yet integrated

external_until_reproduced

PhyWorldBench summary

phyworldbench

Catalog

video · Video physical reasoning

Text-to-video physical realism benchmarks directly target whether generated scenes obey physical constraints implied by the prompt.

Limitation: Not a closed-loop benchmark

external_until_reproduced

PhyGenBench summary

phygenbench

Catalog

video · Video physical reasoning

Physical commonsense generation probes expose impossible-object and impossible-motion failures that aggregate video scores often hide.

Limitation: External code must be cloned and smoke-tested before adapter integration

external_until_reproduced

World-in-World summary

world_in_world

Smoke passed

interactive · Closed-loop embodied utility · Action conditioning · Long-horizon coherence

World-in-World is valid for embodied utility because it evaluates whether world-model rollouts improve closed-loop task success rather than only visual quality.

IGNav · formal

Success rate: 40.28% · SPL: 28.05%

n=144 · 2026-05-05

AEQA · formal

Mean score: 20.17 · Mean efficiency: 17.86

n=181 · 2026-05-08

AR · blocked

Requires licensed MP3D / Matterport3D Habitat-ready assets before formal evaluation can run.

Manipulation · substrate

RLBench/CoppeliaSim substrate smoke passed; full VLM planner and world-model scoring still needs to run.

Limitation: IGNav and AEQA formal evaluations are recorded; AR still requires licensed MP3D Habitat-ready assets

external_until_reproduced
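The IGNav figures above report a success rate and SPL over n=144 episodes. As a reading aid, the sketch below computes both from per-episode data using the standard definition of SPL (success weighted by path length: each successful episode contributes shortest-path length divided by the larger of the agent's path length and the shortest-path length), and adds a Wilson score interval to show how much uncertainty a success rate carries at this sample size. The `Episode` type and function names are hypothetical, the World-in-World evaluation code may compute these differently, and the count of 58 successes is inferred from 40.28% of 144, not stated in the source.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Tuple


@dataclass
class Episode:
    """Hypothetical per-episode record."""
    success: bool
    shortest_path: float  # geodesic start-to-goal distance, assumed > 0
    agent_path: float     # length of the path the agent actually took


def success_rate(episodes: List[Episode]) -> float:
    return sum(e.success for e in episodes) / len(episodes)


def spl(episodes: List[Episode]) -> float:
    """Success weighted by Path Length, averaged over all episodes."""
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.shortest_path / max(e.agent_path, e.shortest_path)
    return total / len(episodes)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Toy episodes, not benchmark data: an optimal success, a 2x-detour success, a failure.
toy = [Episode(True, 10.0, 10.0), Episode(True, 10.0, 20.0), Episode(False, 10.0, 5.0)]
print(success_rate(toy), spl(toy))

# 58 successes is an inference from the reported 40.28% at n=144.
low, high = wilson_interval(58, 144)
print(f"{low:.3f} to {high:.3f}")
```

Because SPL only credits successful episodes and discounts detours, it can never exceed the success rate, which is consistent with the reported 28.05% sitting below 40.28%.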

WorldArena summary

worldarena

External

interactive · Closed-loop embodied utility · Intrinsic video faithfulness · Robotics executability

WorldArena bridges perception and functional utility by evaluating video metrics plus data-engine, policy-evaluator, and action-planner roles.

Limitation: EWMScore is useful but should not be mixed with WM Arena self-measured rankings until reproduced

external_until_reproduced

RoboWM-Bench summary

robowm_bench

Catalog

robotics · Robotics executability · Action conditioning · Forward dynamics

RoboWM-Bench tests whether generated manipulation behavior can be translated into executable robot actions, which directly matches the robotics world-model construct.

Limitation: Does not replace broad policy transfer benchmarks

external_until_reproduced

LIBERO-X summary

libero_x

Catalog

robotics · Closed-loop embodied utility · Robotics executability

LIBERO-X is relevant when robotics claims involve robust task execution under realistic perturbations rather than only in-distribution demonstrations.

Limitation: Need official artifact and smoke test before adapter

external_until_reproduced

RoboCasa365 summary

robocasa365

External

robotics · Robotics executability · Closed-loop embodied utility

RoboCasa365 is a large standardized household manipulation benchmark, so it is useful evidence for broad robotics task success and generalization claims.

Limitation: Policy leaderboard is not automatically a world-model evaluator

external_until_reproduced

ManiSkill3 summary

maniskill3

Catalog

robotics · Robotics executability · Forward dynamics

ManiSkill3 is not itself a world-model score, but it is a practical substrate for repeatable contact-rich embodied evaluation.

Limitation: Needs WM-specific protocol layered on top

not_a_ranking_benchmark

DrivingGen summary

drivinggen

Catalog

driving · Driving world simulation

DrivingGen is domain-valid for driving world models because it targets scene generation under driving-specific conditions and metrics.

Limitation: Driving simulation evidence should be separated from robotics and video-generation evidence

external_until_reproduced

WorldLens summary

worldlens

Catalog

driving · Driving world simulation · 3D/4D world consistency

WorldLens is relevant because it evaluates driving world models from appearance fidelity through behavioral and 4D consistency.

Limitation: Public repo must be cloned and smoke-tested before adapter integration

external_until_reproduced

Waymo World Model reference summary

waymo_world_model_reference

Catalog

driving · Driving world simulation

The Waymo reference is useful market evidence that driving world models are being used for simulation, but it is not an open benchmark.

Limitation: Not open for WM Arena scoring

not_a_ranking_benchmark

WorldScore summary

worldscore

Catalog

3D/4D · 3D/4D world consistency · Intrinsic video faithfulness

WorldScore decomposes world generation into next-scene generation under camera trajectory specifications, directly testing spatial-temporal consistency.

Limitation: Not a closed-loop control benchmark

external_until_reproduced

4DWorldBench summary

four_d_worldbench

Catalog

3D/4D · 3D/4D world consistency · Video physical reasoning

4DWorldBench broadens 3D/4D evaluation across input modalities and physics-centric generation prompts.

Limitation: Not yet integrated

external_until_reproduced

InSpatio-World summary

inspatio_world

Catalog

3D/4D · 3D/4D world consistency

InSpatio-World is relevant as an open 4D world model reference tied to WorldScore-style dynamic consistency claims.

Limitation: Model repo must be cloned and smoke-tested before any adapter

external_until_reproduced

MIND summary

mind

Catalog

interactive · Long-horizon coherence · Action conditioning

MIND directly measures two core interactive world-model constructs: remembering the world across revisits and following action controls.

Limitation: Not yet integrated

external_until_reproduced

Benchmark	Domain	Claims	Status	Validity Argument	Source
Atari 100k (atari_100k)	rl	RL proxy environment; Long-horizon coherence; Action conditioning	Ranking (ranking_allowed_when_self_measured)	Atari is valid for RL-proxy claims because the score ultimately asks whether a policy trained with limited environment data performs well in a standardized game suite. Limitation: Does not test robotics executability	WM Arena Atari manifest (checked 2026-05-04)
Crafter (crafter)	rl	RL proxy environment; Long-horizon coherence	Pilot (ranking_allowed_when_self_measured)	Crafter adds a longer-horizon survival and crafting domain where policy progress depends on coherent state and resource dynamics. Limitation: WM Arena currently treats it as a pilot surface	WM Arena Crafter manifest (checked 2026-05-04)
VBench-2.0 (vbench_2_0)	video	Intrinsic video faithfulness; Video physical reasoning; Action conditioning	External (external_until_reproduced)	VBench-2.0 directly operationalizes intrinsic video faithfulness across physical, commonsense, controllability, human, and composition dimensions. Limitation: Does not prove embodied closed-loop utility	VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness; Vchitect VBench Leaderboard (checked 2026-05-04)
Physics-IQ (physics_iq)	video	Video physical reasoning	Pilot (ranking_allowed_when_self_measured)	Physics-IQ is valid for physical reasoning claims because it checks whether video models preserve spatial and temporal object dynamics rather than only visual appeal. Limitation: Some WM Arena pilot cells currently fail due to local model availability	Physics-IQ benchmark (checked 2026-05-04)
Physion v2 (physion_v2)	video	Video physical reasoning	Pilot (ranking_allowed_when_self_measured)	Object-contact and violation-of-expectation probes isolate intuitive physics failures that broad video quality scores can miss. Limitation: Probe-style evidence does not imply downstream policy utility	Physion: Evaluating Physical Prediction from Vision in Humans and Machines (checked 2026-05-04)
WorldModelBench (worldmodelbench)	video	Video physical reasoning; Intrinsic video faithfulness	Catalog (external_until_reproduced)	WorldModelBench is relevant because it judges video generation as world modeling in application-driven domains, not only as image aesthetics. Limitation: Requires reproduction and judge audit before WM Arena ranking	WorldModelBench: Judging Video Generation Models As World Models (checked 2026-05-04)
Morpheus (morpheus)	video	Video physical reasoning	Catalog (external_until_reproduced)	Conservation-law probes test a stronger form of physical reasoning than generic video realism. Limitation: Not yet integrated	Morpheus physical reasoning benchmark (checked 2026-05-04)
PhyWorldBench (phyworldbench)	video	Video physical reasoning	Catalog (external_until_reproduced)	Text-to-video physical realism benchmarks directly target whether generated scenes obey physical constraints implied by the prompt. Limitation: Not a closed-loop benchmark	PhyWorldBench (checked 2026-05-04)
PhyGenBench (phygenbench)	video	Video physical reasoning	Catalog (external_until_reproduced)	Physical commonsense generation probes expose impossible-object and impossible-motion failures that aggregate video scores often hide. Limitation: External code must be cloned and smoke-tested before adapter integration	PhyGenBench project (checked 2026-05-04)
World-in-World (world_in_world)	interactive	Closed-loop embodied utility; Action conditioning; Long-horizon coherence	Smoke passed (external_until_reproduced)	World-in-World is valid for embodied utility because it evaluates whether world-model rollouts improve closed-loop task success rather than only visual quality. IGNav (formal): Success rate 40.28%, SPL 28.05%, Mean trajectory length 46.90 steps; n=144 · 2026-05-05. AEQA (formal): Mean score 20.17, Mean efficiency 17.86, Blind mean score 45.44; n=181 · 2026-05-08. AR (blocked): Requires licensed MP3D / Matterport3D Habitat-ready assets before formal evaluation can run. Manipulation (substrate): RLBench/CoppeliaSim substrate smoke passed; full VLM planner and world-model scoring still needs to run. Limitation: IGNav and AEQA formal evaluations are recorded; AR still requires licensed MP3D Habitat-ready assets	World-in-World project page; World-in-World code (checked 2026-05-04)
WorldArena (worldarena)	interactive	Closed-loop embodied utility; Intrinsic video faithfulness; Robotics executability	External (external_until_reproduced)	WorldArena bridges perception and functional utility by evaluating video metrics plus data-engine, policy-evaluator, and action-planner roles. Limitation: EWMScore is useful but should not be mixed with WM Arena self-measured rankings until reproduced	WorldArena project page; WorldArena leaderboard (checked 2026-05-04)
RoboWM-Bench (robowm_bench)	robotics	Robotics executability; Action conditioning; Forward dynamics	Catalog (external_until_reproduced)	RoboWM-Bench tests whether generated manipulation behavior can be translated into executable robot actions, which directly matches the robotics world-model construct. Limitation: Does not replace broad policy transfer benchmarks	RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation (checked 2026-05-04)
LIBERO-X (libero_x)	robotics	Closed-loop embodied utility; Robotics executability	Catalog (external_until_reproduced)	LIBERO-X is relevant when robotics claims involve robust task execution under realistic perturbations rather than only in-distribution demonstrations. Limitation: Need official artifact and smoke test before adapter	LIBERO-X robustness benchmark (checked 2026-05-04)
RoboCasa365 (robocasa365)	robotics	Robotics executability; Closed-loop embodied utility	External (external_until_reproduced)	RoboCasa365 is a large standardized household manipulation benchmark, so it is useful evidence for broad robotics task success and generalization claims. Limitation: Policy leaderboard is not automatically a world-model evaluator	RoboCasa365; RoboCasa Leaderboard (checked 2026-05-04)
ManiSkill3 (maniskill3)	robotics	Robotics executability; Forward dynamics	Catalog (not_a_ranking_benchmark)	ManiSkill3 is not itself a world-model score, but it is a practical substrate for repeatable contact-rich embodied evaluation. Limitation: Needs WM-specific protocol layered on top	ManiSkill3 (checked 2026-05-04)
DrivingGen (drivinggen)	driving	Driving world simulation	Catalog (external_until_reproduced)	DrivingGen is domain-valid for driving world models because it targets scene generation under driving-specific conditions and metrics. Limitation: Driving simulation evidence should be separated from robotics and video-generation evidence	DrivingGen; DrivingGen project page (checked 2026-05-04)
WorldLens (worldlens)	driving	Driving world simulation; 3D/4D world consistency	Catalog (external_until_reproduced)	WorldLens is relevant because it evaluates driving world models from appearance fidelity through behavioral and 4D consistency. Limitation: Public repo must be cloned and smoke-tested before adapter integration	WorldLens repository (checked 2026-05-04)
Waymo World Model reference (waymo_world_model_reference)	driving	Driving world simulation	Catalog (not_a_ranking_benchmark)	The Waymo reference is useful market evidence that driving world models are being used for simulation, but it is not an open benchmark. Limitation: Not open for WM Arena scoring	Waymo World Model blog (checked 2026-05-04)
WorldScore (worldscore)	3D/4D	3D/4D world consistency; Intrinsic video faithfulness	Catalog (external_until_reproduced)	WorldScore decomposes world generation into next-scene generation under camera trajectory specifications, directly testing spatial-temporal consistency. Limitation: Not a closed-loop control benchmark	WorldScore (checked 2026-05-04)
4DWorldBench (four_d_worldbench)	3D/4D	3D/4D world consistency; Video physical reasoning	Catalog (external_until_reproduced)	4DWorldBench broadens 3D/4D evaluation across input modalities and physics-centric generation prompts. Limitation: Not yet integrated	4DWorldBench (checked 2026-05-04)
InSpatio-World (inspatio_world)	3D/4D	3D/4D world consistency	Catalog (external_until_reproduced)	InSpatio-World is relevant as an open 4D world model reference tied to WorldScore-style dynamic consistency claims. Limitation: Model repo must be cloned and smoke-tested before any adapter	InSpatio-World (checked 2026-05-04)
MIND (mind)	interactive	Long-horizon coherence; Action conditioning	Catalog (external_until_reproduced)	MIND directly measures two core interactive world-model constructs: remembering the world across revisits and following action controls. Limitation: Not yet integrated	MIND; MIND project page (checked 2026-05-04)
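Because the table above is tab-separated, it can be read with plain string splitting and grouped by domain, which keeps driving, robotics, video, and RL evidence apart as several limitations recommend. This is a rough sketch under stated assumptions: the helper names are hypothetical, the sample lines reuse only the Benchmark, Domain, Claims, and Status cells of four rows (the last two columns are replaced by a placeholder), and the identifier and policy label are assumed to be the last whitespace-separated token of their cells.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

PLACEHOLDER = "(omitted)"  # stands in for the Validity Argument and Source cells

# Four abbreviated rows in the table's column order:
# Benchmark, Domain, Claims, Status, Validity Argument, Source.
SAMPLE_LINES = [
    "\t".join(cells)
    for cells in [
        ("Atari 100k atari_100k", "rl", "RL proxy environment",
         "Ranking ranking_allowed_when_self_measured", PLACEHOLDER, PLACEHOLDER),
        ("Crafter crafter", "rl", "RL proxy environment",
         "Pilot ranking_allowed_when_self_measured", PLACEHOLDER, PLACEHOLDER),
        ("DrivingGen drivinggen", "driving", "Driving world simulation",
         "Catalog external_until_reproduced", PLACEHOLDER, PLACEHOLDER),
        ("ManiSkill3 maniskill3", "robotics", "Robotics executability",
         "Catalog not_a_ranking_benchmark", PLACEHOLDER, PLACEHOLDER),
    ]
]


def last_token(cell: str) -> str:
    """Return the trailing identifier of a cell, tolerating parentheses."""
    return cell.split()[-1].strip("()")


def parse_row(line: str) -> Tuple[str, str, str]:
    """Extract (benchmark_id, domain, ranking_policy) from one table line."""
    cells = line.rstrip("\n").split("\t")
    return last_token(cells[0]), cells[1], last_token(cells[3])


def group_by_domain(lines: List[str]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each domain to its (benchmark_id, ranking_policy) pairs, in table order."""
    groups: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for line in lines:
        benchmark_id, domain, policy = parse_row(line)
        groups[domain].append((benchmark_id, policy))
    return dict(groups)


groups = group_by_domain(SAMPLE_LINES)
for domain, entries in groups.items():
    print(domain, entries)
```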