Claim-level evidence

Evidence Matrix

This is not a global leaderboard. Each cell answers one narrower question: what evidence supports this model making this capability claim, and how strong is that evidence?

Model × Capability Claim

A cell is evidence for one capability claim. It is intentionally not a global ranking.
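One way to picture a cell is as a small record: a model, a capability claim, a status label, an optional measured value, a rationale, and a "why not stronger" note. The sketch below is illustrative only. `EvidenceStatus`, `EvidenceCell`, and `ranking_cells` are hypothetical names, not WM Arena's actual schema, and the status members simply mirror the labels visible in the matrix.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class EvidenceStatus(Enum):
    # Hypothetical identifiers mirroring the status labels shown in the matrix.
    RANKING = "ranking"
    SELF_MEASURED = "self_measured"
    EXTERNAL = "external"
    PAPER = "paper"
    PENDING = "pending"
    NOT_CLAIMED = "not_claimed"


@dataclass(frozen=True)
class EvidenceCell:
    model: str
    claim: str
    status: EvidenceStatus
    evidence: Optional[str]  # None where the matrix shows "-"
    why_not_stronger: str


def ranking_cells(cells: List[EvidenceCell], claim: str) -> List[EvidenceCell]:
    """Return the cells for one claim whose status is RANKING.

    Filtering by claim first keeps comparisons inside a single column,
    which is the point of not treating the matrix as a global leaderboard.
    """
    return [c for c in cells if c.claim == claim and c.status is EvidenceStatus.RANKING]


# Three cells copied from the matrix on this page.
cells = [
    EvidenceCell("TWISTER", "RL proxy environment", EvidenceStatus.RANKING,
                 "Atari 100k HNS mean: 1.620",
                 "Atari remains a pixel-game domain."),
    EvidenceCell("DIAMOND", "RL proxy environment", EvidenceStatus.RANKING,
                 "Atari 100k HNS mean: 1.460",
                 "Does not establish physical execution claims."),
    EvidenceCell("DIAMOND", "3D/4D world consistency", EvidenceStatus.NOT_CLAIMED,
                 None,
                 "The passport/category does not claim 3D or 4D generation."),
]

print([c.model for c in ranking_cells(cells, "RL proxy environment")])
# ['TWISTER', 'DIAMOND']
```

Keeping the status as an explicit field means a paper-reported or external number can sit in the matrix without being silently mixed into a ranking.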

Open Oasis 500M evidence

Action-conditioned interactive world model

Action claim
Action conditioning
Self-measured

Interactive keyboard rollout

Open Oasis declares itself an action-conditioned interactive world model, so action responsiveness is the relevant construct.

Why not stronger: WM Arena currently has pilot interactive evidence but not a ranking-eligible MIND or World-in-World reproduction.

Closed loop claim
Closed-loop embodied utility
Pending

-

Closed-loop task success is the right standard for saying an interactive world model helps embodied decisions.

Why not stronger: World-in-World IGNav and AEQA formal evaluations have completed on AWS with official datasets and model servers. However, AR remains blocked on licensed, Habitat-ready MP3D assets, and manipulation still needs full VLM/world-model scoring.

Wan 2.2 T2V-A14B evidence

Diffusion MoE T2V

Physics claim
Video physical reasoning
Ranking

Physion v2 OCP accuracy: 0.610

Physion v2 object-contact prediction is a direct physical-reasoning probe for generated futures.

Why not stronger: Physics-IQ cells are still marked as failed in the pilot manifest, so the physical reasoning claim is only partially supported.

Faithful video claim
Intrinsic video faithfulness
External

VBench-2.0 overall: 0.847

VBench-2.0 directly operationalizes intrinsic faithfulness across physics, commonsense, controllability, human fidelity, and creativity.

Why not stronger: The value is cited from the external Vchitect leaderboard, not reproduced by WM Arena.

Robot exec claim
Robotics executability
Pending

-

Robotics executability checks whether generated manipulation video can become successful robot action, a stronger claim than video quality.

Why not stronger: RoboWM-Bench and WorldArena adapters have not yet been locally cloned, smoke-tested, or wired into WM Arena.

Open-Sora v2 evidence

Open text-to-video transformer

Faithful video claim
Intrinsic video faithfulness
Pending

-

Open text-to-video models should be compared on intrinsic video faithfulness before any stronger world-model claim.

Why not stronger: No VBench-2.0 score reproduced by WM Arena is currently checked into the manifest.

TWISTER evidence

Transformer + CPC

RL proxy claim
RL proxy environment
Ranking

Atari 100k HNS mean: 1.620

Atari HNS is direct evidence for the RL proxy claim because the learned model is judged through game-policy performance.
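For readers unfamiliar with the metric, a human-normalized score (HNS) is conventionally computed per game as (agent − random) / (human − random) and then averaged over games, so 0 matches a random policy and 1 matches the human reference. The sketch below assumes that conventional definition. The function names are hypothetical, and the per-game numbers are placeholders rather than real Atari reference scores or WM Arena results.

```python
from statistics import mean
from typing import Dict, Tuple


def human_normalized_score(agent: float, random: float, human: float) -> float:
    """HNS for one game: 0.0 matches the random policy, 1.0 the human reference."""
    if human == random:
        raise ValueError("human and random reference scores must differ")
    return (agent - random) / (human - random)


def hns_mean(scores: Dict[str, Tuple[float, float, float]]) -> float:
    """Mean HNS over games; `scores` maps game -> (agent, random, human)."""
    return mean(human_normalized_score(*triple) for triple in scores.values())


# Placeholder numbers for illustration only.
toy_scores = {
    "game_a": (150.0, 50.0, 100.0),  # (150 - 50) / (100 - 50) = 2.0
    "game_b": (30.0, 10.0, 50.0),    # (30 - 10) / (50 - 10) = 0.5
}
print(hns_mean(toy_scores))  # 1.25
```

Because the mean is unbounded above, a few games with very high scores can dominate it, which is one reason a single HNS mean supports only the RL proxy claim and nothing broader.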

Why not stronger: Atari remains a pixel-game domain and does not prove robotics or driving utility.

Action claim
Action conditioning
Pending

-

TWISTER is action-conditioned, so counterfactual Atari rollouts are the right next evidence for controllability.

Why not stronger: The dedicated counterfactual lane is not ranking-eligible yet.

DIAMOND evidence

Diffusion (EDM)

RL proxy claim
RL proxy environment
Ranking

Atari 100k HNS mean: 1.460

Atari HNS is direct evidence for the RL proxy claim because the model is evaluated through standardized game performance.

Why not stronger: It does not establish physical execution, closed-loop robotics, or driving simulation claims.

Physics claim
Video physical reasoning
Pending

-

Physical reasoning probes would test whether DIAMOND's visually strong predictions preserve object dynamics.

Why not stronger: WM Arena has not yet run the video-physics evaluators on DIAMOND.

3D/4D claim
3D/4D world consistency
Not claimed

-

3D/4D consistency is a separate claim from Atari pixel prediction.

Why not stronger: DIAMOND's WM Arena passport/category does not claim 3D or 4D world generation.

EfficientZero V2 evidence

MuZero + MCTS

RL proxy claim
RL proxy environment
Paper

Paper-reported discrete/continuous control

EfficientZero V2 extends the model-based RL evidence family beyond discrete Atari-style settings.

Why not stronger: The paper-reported result has not yet been self-measured by WM Arena.

EfficientZero evidence

MuZero + Self-Supervised

RL proxy claim
RL proxy environment
Paper

Paper-reported Atari 100k score

EfficientZero is a canonical limited-data Atari model-based RL baseline.

Why not stronger: It is a paper-only entry in WM Arena until the evaluation is reproduced locally.

M3 evidence

Modular Transformer

RL proxy claim
RL proxy environment
Paper

Paper-reported Atari result

The paper positions M3 as a modular world model over token streams, making Atari-style RL proxy evidence relevant.

Why not stronger: WM Arena has not yet reproduced a ranking-eligible local run for this model.

STORM evidence

Stochastic Transformer

RL proxy claim
RL proxy environment
Ranking

Atari 100k HNS mean: 1.270

Atari HNS is the primary WM Arena score for model-based RL proxy environments.

Why not stronger: The evidence is game-domain specific and should not be generalized to embodied robotics.

IRIS evidence

Transformer + dVAE

RL proxy claim
RL proxy environment
Ranking

Atari 100k HNS mean: 1.046

Atari HNS connects the learned world model to policy performance under a standard data budget.

Why not stronger: This is not evidence for physical or 3D/4D consistency.