Evidence infrastructure

Benchmark Atlas

A benchmark is useful only when it says what claim it supports. This atlas records each benchmark's domain, operational construct, source, limitation, and current WM Arena integration status.
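The record described above can be pictured as a small typed structure. The following is only a sketch: the class name `AtlasEntry`, its field names, and the `Status` enum are hypothetical and not taken from WM Arena; the values are copied from the Atari 100k card, except `rationale`, which paraphrases that card's validity statement.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    # Integration status labels as they appear on the cards in this atlas.
    RANKING = "Ranking"
    PILOT = "Pilot"
    EXTERNAL = "External"
    CATALOG = "Catalog"
    SMOKE_PASSED = "Smoke passed"


@dataclass(frozen=True)
class AtlasEntry:
    # Hypothetical field names; the atlas only lists what is recorded.
    benchmark_id: str
    domain: str
    constructs: tuple[str, ...]
    rationale: str
    limitation: str
    status: Status
    ranking_policy: str
    source: str


atari = AtlasEntry(
    benchmark_id="atari_100k",
    domain="rl",
    constructs=(
        "RL proxy environment",
        "Long-horizon coherence",
        "Action conditioning",
    ),
    # Paraphrase of the card's validity statement, not a quotation.
    rationale="Policy performance under limited environment data in a standardized game suite.",
    limitation="Does not test robotics executability",
    status=Status.RANKING,
    ranking_policy="ranking_allowed_when_self_measured",
    source="Open source",
)

print(atari.benchmark_id, atari.status.value)
```

Freezing the dataclass keeps an entry from being edited after it is recorded, which suits a catalog meant to be audited.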

Benchmarks

Each row states what the benchmark can support, why that claim is valid, and what still prevents stronger WM Arena ranking evidence.
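Each card below carries one of three policy labels: `ranking_allowed_when_self_measured`, `external_until_reproduced`, or `not_a_ranking_benchmark`. A minimal sketch of how such labels could gate a ranking follows. The function name `may_enter_ranking`, its parameters, and the exact semantics are assumptions inferred from the label names alone; the atlas does not define them.

```python
def may_enter_ranking(policy: str, *, self_measured: bool = False,
                      reproduced: bool = False) -> bool:
    """Decide whether a score may appear in a ranking (assumed semantics)."""
    if policy == "ranking_allowed_when_self_measured":
        # Assumption: only scores measured in-house qualify.
        return self_measured
    if policy == "external_until_reproduced":
        # Assumption: externally reported scores stay out until reproduced.
        return reproduced
    if policy == "not_a_ranking_benchmark":
        # Substrates and references never contribute ranking evidence.
        return False
    raise ValueError(f"unknown ranking policy: {policy!r}")


print(may_enter_ranking("ranking_allowed_when_self_measured", self_measured=True))
print(may_enter_ranking("external_until_reproduced"))
```

Raising on an unknown label, rather than returning `False`, makes a mistyped policy visible instead of silently excluding a benchmark.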

Atari 100k summary

atari_100k

Ranking
Domain: rl · Constructs: RL proxy environment, Long-horizon coherence, Action conditioning

Atari is valid for RL-proxy claims because the score ultimately asks whether a policy trained with limited environment data performs well in a standardized game suite.

Limitation: Does not test robotics executability

ranking_allowed_when_self_measured · Open source

Crafter summary

crafter

Pilot
Domain: rl · Constructs: RL proxy environment, Long-horizon coherence

Crafter adds a longer-horizon survival and crafting domain where policy progress depends on coherent state and resource dynamics.

Limitation: WM Arena currently treats it as a pilot surface

ranking_allowed_when_self_measured · Open source

VBench-2.0 summary

vbench_2_0

External
Domain: video · Constructs: Intrinsic video faithfulness, Video physical reasoning, Action conditioning

VBench-2.0 directly operationalizes intrinsic video faithfulness across physical, commonsense, controllability, human, and composition dimensions.

Limitation: Does not prove embodied closed-loop utility

external_until_reproduced · Open source

Physics-IQ summary

physics_iq

Pilot
Domain: video · Construct: Video physical reasoning

Physics-IQ is valid for physical reasoning claims because it checks whether video models preserve spatial and temporal object dynamics rather than only visual appeal.

Limitation: Some WM Arena pilot cells currently fail because the required models are not available locally

ranking_allowed_when_self_measured · Open source

Physion v2 summary

physion_v2

Pilot
Domain: video · Construct: Video physical reasoning

Object-contact and violation-of-expectation probes isolate intuitive physics failures that broad video quality scores can miss.

Limitation: Probe-style evidence does not imply downstream policy utility

ranking_allowed_when_self_measured · Open source

WorldModelBench summary

worldmodelbench

Catalog
Domain: video · Constructs: Video physical reasoning, Intrinsic video faithfulness

WorldModelBench is relevant because it judges video generation as world modeling in application-driven domains, not only as image aesthetics.

Limitation: Requires reproduction and judge audit before WM Arena ranking

external_until_reproduced · Open source

Morpheus summary

morpheus

Catalog
Domain: video · Construct: Video physical reasoning

Conservation-law probes test a stronger form of physical reasoning than generic video realism.

Limitation: Not yet integrated

external_until_reproduced · Open source

PhyWorldBench summary

phyworldbench

Catalog
Domain: video · Construct: Video physical reasoning

Text-to-video physical realism benchmarks directly target whether generated scenes obey physical constraints implied by the prompt.

Limitation: Not a closed-loop benchmark

external_until_reproduced · Open source

PhyGenBench summary

phygenbench

Catalog
Domain: video · Construct: Video physical reasoning

Physical commonsense generation probes expose impossible-object and impossible-motion failures that aggregate video scores often hide.

Limitation: External code must be cloned and smoke-tested before adapter integration

external_until_reproduced · Open source

World-in-World summary

world_in_world

Smoke passed
Domain: interactive · Constructs: Closed-loop embodied utility, Action conditioning, Long-horizon coherence

World-in-World is valid for embodied utility because it evaluates whether world-model rollouts improve closed-loop task success rather than only visual quality.

IGNav (formal)
Success rate: 40.28% · SPL: 28.05%
n=144 · 2026-05-05

AEQA (formal)
Mean score: 20.17 · Mean efficiency: 17.86
n=181 · 2026-05-08

AR (blocked)
Requires licensed Matterport3D (MP3D) Habitat-ready assets before formal evaluation can run.

Manipulation (substrate)
RLBench/CoppeliaSim substrate smoke passed; full VLM planner and world-model scoring still need to run.

Limitation: IGNav and AEQA formal evaluations are recorded; AR still requires licensed MP3D Habitat-ready assets

external_until_reproduced · Open source
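The IGNav result reports SPL alongside success rate. The atlas does not say how SPL is computed; the sketch below assumes the standard definition, Success weighted by Path Length, where each successful episode contributes the ratio of the shortest path length to the longer of the shortest and the path actually taken. The function name `spl` and the episode tuple layout are hypothetical, and the numbers are made up for illustration.

```python
def spl(episodes):
    """Success weighted by Path Length over an iterable of episodes.

    Each episode is (success, shortest_path_length, agent_path_length).
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if shortest <= 0:
            raise ValueError("shortest path length must be positive")
        if success:
            # A successful episode scores 1.0 only if it took the shortest path.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Illustrative episodes, not WM Arena data.
demo = [
    (True, 10.0, 10.0),   # success on the shortest path: contributes 1.0
    (True, 10.0, 20.0),   # success on a path twice as long: contributes 0.5
    (False, 10.0, 5.0),   # failure: contributes 0.0
]
print(spl(demo))  # 0.5
```

Under this definition each episode contributes at most its success indicator, so SPL can never exceed the success rate; the recorded 28.05% against 40.28% is consistent with that bound.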

WorldArena summary

worldarena

External
Domain: interactive · Constructs: Closed-loop embodied utility, Intrinsic video faithfulness, Robotics executability

WorldArena bridges perception and functional utility by evaluating video metrics plus data-engine, policy-evaluator, and action-planner roles.

Limitation: EWMScore is useful but should not be mixed with WM Arena self-measured rankings until reproduced

external_until_reproduced · Open source

RoboWM-Bench summary

robowm_bench

Catalog
Domain: robotics · Constructs: Robotics executability, Action conditioning, Forward dynamics

RoboWM-Bench tests whether generated manipulation behavior can be translated into executable robot actions, which directly matches the robotics world-model construct.

Limitation: Does not replace broad policy transfer benchmarks

external_until_reproduced · Open source

LIBERO-X summary

libero_x

Catalog
Domain: robotics · Constructs: Closed-loop embodied utility, Robotics executability

LIBERO-X is relevant when robotics claims involve robust task execution under realistic perturbations rather than only in-distribution demonstrations.

Limitation: Need official artifact and smoke test before adapter

external_until_reproduced · Open source

RoboCasa365 summary

robocasa365

External
Domain: robotics · Constructs: Robotics executability, Closed-loop embodied utility

RoboCasa365 is a large standardized household manipulation benchmark, so it is useful evidence for broad robotics task success and generalization claims.

Limitation: Policy leaderboard is not automatically a world-model evaluator

external_until_reproduced · Open source

ManiSkill3 summary

maniskill3

Catalog
Domain: robotics · Constructs: Robotics executability, Forward dynamics

ManiSkill3 is not itself a world-model score, but it is a practical substrate for repeatable contact-rich embodied evaluation.

Limitation: Needs WM-specific protocol layered on top

not_a_ranking_benchmark · Open source

DrivingGen summary

drivinggen

Catalog
Domain: driving · Construct: Driving world simulation

DrivingGen is domain-valid for driving world models because it targets scene generation under driving-specific conditions and metrics.

Limitation: Driving simulation evidence should be separated from robotics and video-generation evidence

external_until_reproduced · Open source

WorldLens summary

worldlens

Catalog
Domain: driving · Constructs: Driving world simulation, 3D/4D world consistency

WorldLens is relevant because it evaluates driving world models from appearance fidelity through behavioral and 4D consistency.

Limitation: Public repo must be cloned and smoke-tested before adapter integration

external_until_reproduced · Open source

Waymo World Model reference summary

waymo_world_model_reference

Catalog
Domain: driving · Construct: Driving world simulation

The Waymo reference is useful market evidence that driving world models are being used for simulation, but it is not an open benchmark.

Limitation: Not open for WM Arena scoring

not_a_ranking_benchmark · Open source

WorldScore summary

worldscore

Catalog
Domain: 3D/4D · Constructs: 3D/4D world consistency, Intrinsic video faithfulness

WorldScore decomposes world generation into next-scene generation under camera trajectory specifications, directly testing spatial-temporal consistency.

Limitation: Not a closed-loop control benchmark

external_until_reproduced · Open source

4DWorldBench summary

four_d_worldbench

Catalog
Domain: 3D/4D · Constructs: 3D/4D world consistency, Video physical reasoning

4DWorldBench broadens 3D/4D evaluation across input modalities and physics-centric generation prompts.

Limitation: Not yet integrated

external_until_reproduced · Open source

InSpatio-World summary

inspatio_world

Catalog
Domain: 3D/4D · Construct: 3D/4D world consistency

InSpatio-World is relevant as an open 4D world model reference tied to WorldScore-style dynamic consistency claims.

Limitation: Model repo must be cloned and smoke-tested before any adapter

external_until_reproduced · Open source

MIND summary

mind

Catalog
Domain: interactive · Constructs: Long-horizon coherence, Action conditioning

MIND directly measures two core interactive world-model constructs: remembering the world across revisits and following action controls.

Limitation: Not yet integrated

external_until_reproduced · Open source