A benchmark is useful only when it states which claim it supports. This atlas records each benchmark's domain, operational construct, source, limitation, and current WM Arena integration status.
Each row states what the benchmark can support, why that claim is valid, and what still prevents stronger WM Arena ranking evidence.
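One way to picture a row of this atlas is as a small record. The sketch below is hypothetical: the `AtlasEntry` class and its field names are assumptions made for illustration, not WM Arena's actual schema. The values are copied from the Atari 100k row.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AtlasEntry:
    """Hypothetical record for one atlas row (not WM Arena's real schema)."""
    benchmark_id: str            # machine-readable ID, e.g. "atari_100k"
    name: str                    # display name
    domain: str                  # e.g. "rl", "video", "robotics"
    claims: Tuple[str, ...]      # constructs the benchmark can support
    status: str                  # integration status label
    ranking_policy: str          # policy tag shown next to the status
    validity_argument: str       # why the claim is valid
    limitation: str              # what blocks stronger ranking evidence
    source: Optional[str] = None  # some rows list no source


atari = AtlasEntry(
    benchmark_id="atari_100k",
    name="Atari 100k",
    domain="rl",
    claims=(
        "RL proxy environment",
        "Long-horizon coherence",
        "Action conditioning",
    ),
    status="Ranking",
    ranking_policy="ranking_allowed_when_self_measured",
    validity_argument=(
        "The score asks whether a policy trained with limited "
        "environment data performs well in a standardized game suite."
    ),
    limitation="Does not test robotics executability",
    source="WM Arena Atari manifest (checked 2026-05-04)",
)
```

Keeping the limitation as a required field, rather than an optional note, mirrors the atlas rule that every benchmark must say what it does not support.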
| Benchmark | Domain | Claims | Status | Validity Argument | Source |
|---|---|---|---|---|---|
| Atari 100k (`atari_100k`) | rl | RL proxy environment; Long-horizon coherence; Action conditioning | Ranking (`ranking_allowed_when_self_measured`) | Atari is valid for RL-proxy claims because the score ultimately asks whether a policy trained with limited environment data performs well in a standardized game suite. Limitation: Does not test robotics executability. | WM Arena Atari manifest (checked 2026-05-04) |
| Crafter (`crafter`) | rl | RL proxy environment; Long-horizon coherence | Pilot (`ranking_allowed_when_self_measured`) | Crafter adds a longer-horizon survival and crafting domain where policy progress depends on coherent state and resource dynamics. Limitation: WM Arena currently treats it as a pilot surface. | WM Arena Crafter manifest (checked 2026-05-04) |
| VBench-2.0 (`vbench_2_0`) | video | Intrinsic video faithfulness; Video physical reasoning; Action conditioning | External (`external_until_reproduced`) | VBench-2.0 directly operationalizes intrinsic video faithfulness across physical, commonsense, controllability, human, and composition dimensions. Limitation: Does not prove embodied closed-loop utility. | |
| Physics-IQ (`physics_iq`) | video | Video physical reasoning | Pilot (`ranking_allowed_when_self_measured`) | Physics-IQ is valid for physical reasoning claims because it checks whether video models preserve spatial and temporal object dynamics rather than only visual appeal. Limitation: Some WM Arena pilot cells currently fail due to local model availability. | Physics-IQ benchmark (checked 2026-05-04) |
| Physion v2 (`physion_v2`) | video | Video physical reasoning | Pilot (`ranking_allowed_when_self_measured`) | Object-contact and violation-of-expectation probes isolate intuitive physics failures that broad video quality scores can miss. Limitation: Probe-style evidence does not imply downstream policy utility. | |
| WorldModelBench (`worldmodelbench`) | video | Video physical reasoning; Intrinsic video faithfulness | Catalog (`external_until_reproduced`) | WorldModelBench is relevant because it judges video generation as world modeling in application-driven domains, not only as image aesthetics. Limitation: Requires reproduction and judge audit before WM Arena ranking. | WorldModelBench: Judging Video Generation Models As World Models (checked 2026-05-04) |
| Morpheus (`morpheus`) | video | Video physical reasoning | Catalog (`external_until_reproduced`) | Conservation-law probes test a stronger form of physical reasoning than generic video realism. Limitation: Not yet integrated. | Morpheus physical reasoning benchmark (checked 2026-05-04) |
| PhyWorldBench (`phyworldbench`) | video | Video physical reasoning | Catalog (`external_until_reproduced`) | Text-to-video physical realism benchmarks directly target whether generated scenes obey physical constraints implied by the prompt. Limitation: Not a closed-loop benchmark. | PhyWorldBench (checked 2026-05-04) |
| PhyGenBench (`phygenbench`) | video | Video physical reasoning | Catalog (`external_until_reproduced`) | Physical commonsense generation probes expose impossible-object and impossible-motion failures that aggregate video scores often hide. Limitation: External code must be cloned and smoke-tested before adapter integration. | PhyGenBench project (checked 2026-05-04) |
| World-in-World (`world_in_world`) | interactive | Closed-loop embodied utility; Action conditioning; Long-horizon coherence | Smoke passed (`external_until_reproduced`) | World-in-World is valid for embodied utility because it evaluates whether world-model rollouts improve closed-loop task success rather than only visual quality. IGNav (formal): success rate 40.28%, SPL 28.05%, mean trajectory length 46.90 steps (n=144, 2026-05-05). AEQA (formal): mean score 20.17, mean efficiency 17.86, blind mean score 45.44 (n=181, 2026-05-08). AR (blocked): Requires licensed MP3D / Matterport3D Habitat-ready assets before formal evaluation can run. Manipulation (substrate): RLBench/CoppeliaSim substrate smoke passed; full VLM planner and world-model scoring still needs to run. Limitation: IGNav and AEQA formal evaluations are recorded; AR still requires licensed MP3D Habitat-ready assets. | |
| WorldArena (`worldarena`) | interactive | Closed-loop embodied utility; Intrinsic video faithfulness; Robotics executability | External (`external_until_reproduced`) | WorldArena bridges perception and functional utility by evaluating video metrics plus data-engine, policy-evaluator, and action-planner roles. Limitation: EWMScore is useful but should not be mixed with WM Arena self-measured rankings until reproduced. | |
| RoboWM-Bench (`robowm_bench`) | robotics | Robotics executability; Action conditioning; Forward dynamics | Catalog (`external_until_reproduced`) | RoboWM-Bench tests whether generated manipulation behavior can be translated into executable robot actions, which directly matches the robotics world-model construct. Limitation: Does not replace broad policy transfer benchmarks. | |
| LIBERO-X (`libero_x`) | robotics | Closed-loop embodied utility; Robotics executability | Catalog (`external_until_reproduced`) | LIBERO-X is relevant when robotics claims involve robust task execution under realistic perturbations rather than only in-distribution demonstrations. Limitation: Need official artifact and smoke test before adapter. | LIBERO-X robustness benchmark (checked 2026-05-04) |
| RoboCasa365 (`robocasa365`) | robotics | Robotics executability; Closed-loop embodied utility | External (`external_until_reproduced`) | RoboCasa365 is a large standardized household manipulation benchmark, so it is useful evidence for broad robotics task success and generalization claims. Limitation: Policy leaderboard is not automatically a world-model evaluator. | |
| ManiSkill3 (`maniskill3`) | robotics | Robotics executability; Forward dynamics | Catalog (`not_a_ranking_benchmark`) | ManiSkill3 is not itself a world-model score, but it is a practical substrate for repeatable contact-rich embodied evaluation. Limitation: Needs WM-specific protocol layered on top. | ManiSkill3 (checked 2026-05-04) |
| DrivingGen (`drivinggen`) | driving | Driving world simulation | Catalog (`external_until_reproduced`) | DrivingGen is domain-valid for driving world models because it targets scene generation under driving-specific conditions and metrics. Limitation: Driving simulation evidence should be separated from robotics and video-generation evidence. | |
| WorldLens (`worldlens`) | driving | Driving world simulation; 3D/4D world consistency | Catalog (`external_until_reproduced`) | WorldLens is relevant because it evaluates driving world models from appearance fidelity through behavioral and 4D consistency. Limitation: Public repo must be cloned and smoke-tested before adapter integration. | WorldLens repository (checked 2026-05-04) |
| Waymo World Model reference (`waymo_world_model_reference`) | driving | Driving world simulation | Catalog (`not_a_ranking_benchmark`) | The Waymo reference is useful market evidence that driving world models are being used for simulation, but it is not an open benchmark. Limitation: Not open for WM Arena scoring. | Waymo World Model blog (checked 2026-05-04) |
| WorldScore (`worldscore`) | 3D/4D | 3D/4D world consistency; Intrinsic video faithfulness | Catalog (`external_until_reproduced`) | WorldScore decomposes world generation into next-scene generation under camera trajectory specifications, directly testing spatial-temporal consistency. Limitation: Not a closed-loop control benchmark. | WorldScore (checked 2026-05-04) |
| 4DWorldBench (`four_d_worldbench`) | 3D/4D | 3D/4D world consistency; Video physical reasoning | Catalog (`external_until_reproduced`) | 4DWorldBench broadens 3D/4D evaluation across input modalities and physics-centric generation prompts. Limitation: Not yet integrated. | 4DWorldBench (checked 2026-05-04) |
| InSpatio-World (`inspatio_world`) | 3D/4D | 3D/4D world consistency | Catalog (`external_until_reproduced`) | InSpatio-World is relevant as an open 4D world model reference tied to WorldScore-style dynamic consistency claims. Limitation: Model repo must be cloned and smoke-tested before any adapter. | InSpatio-World (checked 2026-05-04) |
| MIND (`mind`) | interactive | Long-horizon coherence; Action conditioning | Catalog (`external_until_reproduced`) | MIND directly measures two core interactive world-model constructs: remembering the world across revisits and following action controls. Limitation: Not yet integrated. | |
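
The Status column pairs each label with one of three policy tags. A minimal sketch of how those tags could gate ranking eligibility follows. The semantics are an assumption read off the tag names and the limitations above, and `ranking_eligible` with its two flags is a hypothetical helper, not WM Arena code.

```python
def ranking_eligible(policy: str,
                     self_measured: bool = False,
                     reproduced: bool = False) -> bool:
    """Assumed reading of the atlas policy tags (hypothetical helper)."""
    if policy == "ranking_allowed_when_self_measured":
        # Counts only when WM Arena measured the score itself.
        return self_measured
    if policy == "external_until_reproduced":
        # External numbers stay out of rankings until reproduced.
        return reproduced
    if policy == "not_a_ranking_benchmark":
        # Substrates and references never feed rankings.
        return False
    raise ValueError(f"unknown policy tag: {policy!r}")


# Policy tags copied from three rows of the table.
POLICIES = {
    "atari_100k": "ranking_allowed_when_self_measured",
    "worldscore": "external_until_reproduced",
    "maniskill3": "not_a_ranking_benchmark",
}

# Self-measured scores exist, but nothing external has been reproduced.
eligible = [
    bench for bench, policy in POLICIES.items()
    if ranking_eligible(policy, self_measured=True)
]
print(eligible)
```

Raising on an unknown tag, instead of returning `False`, makes a misspelled policy visible rather than silently excluding a benchmark.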