Claim-level evidence

Evidence Matrix

This is not a global leaderboard. Each cell answers one narrower question: what evidence supports this model making this capability claim, and how strong is that evidence?


Model × Capability Claim

A cell is evidence for one capability claim. It is intentionally not a global ranking.
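One way to read the matrix is as a collection of records keyed by (model, claim). The sketch below is a hypothetical representation, not WM Arena's actual schema: the names `EvidenceCell`, `Status`, and `lookup`, and all field names, are assumptions for illustration. The status labels and example values are taken from the entries on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional


class Status(Enum):
    """Evidence levels as they appear on this page."""
    RANKING = "Ranking"
    SELF_MEASURED = "Self measured"
    EXTERNAL = "External"
    PAPER = "Paper"
    PENDING = "Pending"
    NOT_CLAIMED = "Not claimed"


@dataclass(frozen=True)
class EvidenceCell:
    """One (model, claim) cell; field names are hypothetical."""
    model: str
    claim: str
    status: Status
    metric: Optional[str] = None   # e.g. "Physion v2 OCP accuracy"
    value: Optional[float] = None  # None when no score is reported
    why_not_stronger: str = ""


# Three cells copied from the entries on this page.
cells = [
    EvidenceCell(
        "Wan 2.2 T2V-A14B", "Physics", Status.RANKING,
        "Physion v2 OCP accuracy", 0.610,
        "Physics-IQ cells are still failed in the pilot manifest.",
    ),
    EvidenceCell(
        "Wan 2.2 T2V-A14B", "Faithful video", Status.EXTERNAL,
        "VBench-2.0 overall", 0.847,
        "Cited from the external Vchitect leaderboard.",
    ),
    EvidenceCell("Open-Sora v2", "Faithful video", Status.PENDING),
]


def lookup(cells: Iterable[EvidenceCell], model: str, claim: str) -> Optional[EvidenceCell]:
    """Return the cell for (model, claim), or None if the model has no such cell."""
    for cell in cells:
        if cell.model == model and cell.claim == claim:
            return cell
    return None
```

A missing cell (`None`) corresponds to a "—" in the matrix: the model has no entry for that claim.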

Open Oasis 500M evidence

Action-conditioned interactive world model

Action claim

Action conditioning

Self measured

Interactive keyboard rollout

Open Oasis declares use as an action-conditioned interactive world model, so action responsiveness is the relevant construct.

Why not stronger: WM Arena currently has pilot interactive evidence but not a ranking-eligible MIND or World-in-World reproduction.

Closed loop claim

Closed-loop embodied utility

Pending

Closed-loop task success is the right standard for saying an interactive world model helps embodied decisions.

Why not stronger: The World-in-World IGNav and AEQA formal evaluations have completed on AWS with official datasets and model servers. However, AR remains blocked on licensed, Habitat-ready MP3D assets, and manipulation still needs full VLM/world-model scoring.

Wan 2.2 T2V-A14B evidence

Diffusion MoE T2V

Physics claim

Video physical reasoning

Ranking

Physion v2 OCP accuracy: 0.610

Physion v2 object-contact prediction is a direct physical-reasoning probe for generated futures.

Why not stronger: Physics-IQ cells are still marked as failed in the pilot manifest, so the physical-reasoning claim is only partially supported.

Faithful video claim

Intrinsic video faithfulness

External

VBench-2.0 overall: 0.847

VBench-2.0 directly operationalizes intrinsic faithfulness across physics, commonsense, controllability, human fidelity, and creativity.

Why not stronger: The value is cited from the external Vchitect leaderboard, not reproduced by WM Arena.

Robot exec claim

Robotics executability

Pending

Robotics executability checks whether generated manipulation video can become successful robot action, a stronger claim than video quality.

Why not stronger: The RoboWM-Bench and WorldArena adapters have not yet been cloned locally, smoke-tested, or wired into WM Arena.

Open-Sora v2 evidence

Open text-to-video transformer

Faithful video claim

Intrinsic video faithfulness

Pending

Open text-to-video models should be compared on intrinsic video faithfulness before any stronger world-model claim.

Why not stronger: No VBench-2.0 score reproduced by WM Arena is currently checked into the manifest.

TWISTER evidence

Transformer + CPC

RL proxy claim

RL proxy environment

Ranking

Atari 100k HNS mean: 1.620

Atari HNS is direct evidence for the RL proxy claim because the learned model is judged through game-policy performance.

Why not stronger: Atari remains a pixel-game domain and does not prove robotics or driving utility.
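For readers unfamiliar with the metric: human-normalized score (HNS) is conventionally computed per game as (agent - random) / (human - random), and the reported mean averages that ratio across games. The sketch below assumes this standard definition. The game names and raw scores are made-up placeholders, not values from Atari 100k or from WM Arena.

```python
from statistics import mean
from typing import Dict, Tuple


def human_normalized_score(agent: float, random_score: float, human: float) -> float:
    """HNS for one game: 0.0 matches a random policy, 1.0 matches the human reference."""
    if human == random_score:
        raise ValueError("human and random reference scores must differ")
    return (agent - random_score) / (human - random_score)


def hns_mean(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Average per-game HNS; each value is (agent, random, human)."""
    return mean(human_normalized_score(a, r, h) for a, r, h in scores.values())


# Placeholder games and scores, for illustration only.
example = {
    "game_a": (60.0, 10.0, 110.0),  # HNS = 50 / 100 = 0.5
    "game_b": (30.0, 0.0, 20.0),    # HNS = 30 / 20  = 1.5
}

print(hns_mean(example))  # 1.0
```

Under this definition, a mean above 1.0 indicates that the agent exceeds the human reference on average across the games scored, although a few high-scoring games can dominate the mean.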

Action claim

Action conditioning

Pending

TWISTER is action-conditioned, so counterfactual Atari rollouts are the right next evidence for controllability.

Why not stronger: The dedicated counterfactual lane is not ranking-eligible yet.

DIAMOND evidence

Diffusion (EDM)

RL proxy claim

RL proxy environment

Ranking

Atari 100k HNS mean: 1.460

Atari HNS is direct evidence for the RL proxy claim because the model is evaluated through standardized game performance.

Why not stronger: It does not establish physical execution, closed-loop robotics, or driving simulation claims.

Physics claim

Video physical reasoning

Pending

Physical reasoning probes would test whether DIAMOND's visually strong predictions preserve object dynamics.

Why not stronger: WM Arena has not run these video-physics evaluators on DIAMOND yet.

3D/4D claim

3D/4D world consistency

Not claimed

3D/4D consistency is a separate claim from Atari pixel prediction.

Why not stronger: DIAMOND's WM Arena passport/category does not claim 3D or 4D world generation.

EfficientZero V2 evidence

MuZero + MCTS

RL proxy claim

RL proxy environment

Paper

Paper-reported discrete/continuous control

EfficientZero V2 extends the model-based RL evidence family beyond discrete Atari-style settings.

Why not stronger: The paper-reported results have not yet been self-measured by WM Arena.

EfficientZero evidence

MuZero + Self-Supervised

RL proxy claim

RL proxy environment

Paper

Paper-reported Atari 100k score

EfficientZero is a canonical limited-data Atari model-based RL baseline.

Why not stronger: It is a paper-only entry in WM Arena until the evaluation is reproduced locally.

M³ evidence

Modular Transformer

RL proxy claim

RL proxy environment

Paper

Paper-reported Atari result

The paper positions M³ as a modular world model over token streams, making Atari-style RL proxy evidence relevant.

Why not stronger: WM Arena has not yet reproduced a ranking-eligible local run for this model.

STORM evidence

Stochastic Transformer

RL proxy claim

RL proxy environment

Ranking

Atari 100k HNS mean: 1.270

Atari HNS is the primary WM Arena score for model-based RL proxy environments.

Why not stronger: The evidence is game-domain specific and should not be generalized to embodied robotics.

IRIS evidence

Transformer + dVAE

RL proxy claim

RL proxy environment

Ranking

Atari 100k HNS mean: 1.046

Atari HNS connects the learned world model to policy performance under a standard data budget.

Why not stronger: This is not evidence for physical or 3D/4D consistency.

Model	RL proxy	Dynamics	Action	Physics	Faithful video	Closed loop	Robot exec	Driving	3D/4D	Horizon
Open Oasis 500M Action-conditioned interactive world model	—	—	Self measured	—	—	Pending	—	—	—	—
Wan 2.2 T2V-A14B Diffusion MoE T2V	—	—	—	Ranking	External	—	Pending	—	—	—
Open-Sora v2 Open text-to-video transformer	—	—	—	—	Pending	—	—	—	—	—
TWISTER Transformer + CPC	Ranking	—	Pending	—	—	—	—	—	—	—
DIAMOND Diffusion (EDM)	Ranking	—	—	Pending	—	—	—	—	Not claimed	—
EfficientZero V2 MuZero + MCTS	Paper	—	—	—	—	—	—	—	—	—
EfficientZero MuZero + Self-Supervised	Paper	—	—	—	—	—	—	—	—	—
M³ Modular Transformer	Paper	—	—	—	—	—	—	—	—	—
STORM Stochastic Transformer	Ranking	—	—	—	—	—	—	—	—	—
IRIS Transformer + dVAE	Ranking	—	—	—	—	—	—	—	—	—
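Because the matrix is not a global leaderboard, an ordering only makes sense within one claim column, and only among cells whose evidence level is Ranking. The sketch below illustrates that rule using the scores listed in the entries above. The tuple layout and the function name `rank_within_claim` are illustrative assumptions, not WM Arena code.

```python
from typing import List, Optional, Tuple

# (model, claim, evidence level, score); values copied from the entries on this page.
Cell = Tuple[str, str, str, Optional[float]]

cells: List[Cell] = [
    ("TWISTER", "RL proxy", "Ranking", 1.620),
    ("DIAMOND", "RL proxy", "Ranking", 1.460),
    ("STORM", "RL proxy", "Ranking", 1.270),
    ("IRIS", "RL proxy", "Ranking", 1.046),
    ("EfficientZero V2", "RL proxy", "Paper", None),
    ("EfficientZero", "RL proxy", "Paper", None),
    ("M³", "RL proxy", "Paper", None),
    ("Wan 2.2 T2V-A14B", "Physics", "Ranking", 0.610),
    ("Wan 2.2 T2V-A14B", "Faithful video", "External", 0.847),
    ("DIAMOND", "Physics", "Pending", None),
]


def rank_within_claim(cells: List[Cell], claim: str) -> List[Tuple[str, float]]:
    """Order models for one claim, keeping only ranking-eligible cells with a score."""
    eligible = [
        (model, score)
        for model, cell_claim, level, score in cells
        if cell_claim == claim and level == "Ranking" and score is not None
    ]
    # Higher is better for every metric used in this example.
    return sorted(eligible, key=lambda pair: pair[1], reverse=True)


for model, score in rank_within_claim(cells, "RL proxy"):
    print(f"{model}: {score:.3f}")
```

Paper, External, and Pending cells are deliberately left out of the ordering, and scores from different claim columns (for example, an HNS mean and a VBench-2.0 score) are never compared with each other.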