Evidence Matrix
This is not a global leaderboard. Each cell answers one narrower question: what evidence supports this model making this capability claim, and how strong is that evidence?
Model × Capability Claim
A cell is evidence for one capability claim. It is intentionally not a global ranking.
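One way to picture a cell is as a small record keyed by model and capability claim. The sketch below is illustrative only: `EvidenceCell`, its field names, and `find_cell` are hypothetical and are not WM Arena's actual schema. The two example cells copy text from the cards on this page, the claim labels come from the matrix column headers, and placing Open-Sora v2 under "Faithful video" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvidenceCell:
    """One (model, capability claim) cell. All field names are hypothetical."""
    model: str
    claim: str
    headline: Optional[str]  # headline evidence; None when the card shows "-"
    why_not_stronger: str

    @property
    def has_headline(self) -> bool:
        return self.headline is not None


def find_cell(cells, model, claim):
    """Return the cell for (model, claim), or None if no such cell exists."""
    for cell in cells:
        if cell.model == model and cell.claim == claim:
            return cell
    return None


CELLS = [
    EvidenceCell(
        model="TWISTER",
        claim="RL proxy",
        headline="Atari 100k HNS mean: 1.620",
        why_not_stronger="Atari remains a pixel-game domain and does not "
                         "prove robotics or driving utility.",
    ),
    EvidenceCell(
        model="Open-Sora v2",
        claim="Faithful video",  # assumed column for this card
        headline=None,
        why_not_stronger="No WM Arena reproduced VBench-2.0 score is "
                         "currently checked into the manifest.",
    ),
]

# A lookup answers one narrow question about one model and one claim.
cell = find_cell(CELLS, "TWISTER", "RL proxy")
print(cell.headline, "|", cell.why_not_stronger)
```

Keying on the (model, claim) pair rather than on the model alone mirrors the point that a cell is not a global ranking: a lookup for a pair with no card simply returns `None`.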
Open Oasis 500M evidence
Action-conditioned interactive world model
Interactive keyboard rollout
Open Oasis is declared as an action-conditioned interactive world model, so action responsiveness is the relevant construct.
Why not stronger: WM Arena currently has pilot interactive evidence but not a ranking-eligible MIND or World-in-World reproduction.
-
Closed-loop task success is the right standard for saying an interactive world model helps embodied decisions.
Why not stronger: World-in-World IGNav and AEQA formal evaluations have completed on AWS with official datasets and model servers, but AR remains blocked on licensed, Habitat-ready MP3D assets, and manipulation still needs full VLM/world-model scoring.
Wan 2.2 T2V-A14B evidence
Diffusion MoE T2V
Physion v2 OCP accuracy: 0.610
Physion v2 object-contact prediction is a direct physical-reasoning probe for generated futures.
Why not stronger: Physics-IQ cells are still marked as failed in the pilot manifest, so the physical reasoning claim is only partially supported.
VBench-2.0 overall: 0.847
VBench-2.0 directly operationalizes intrinsic faithfulness across physics, commonsense, controllability, human fidelity, and creativity.
Why not stronger: The value is cited from the external Vchitect leaderboard, not reproduced by WM Arena.
-
Robotics executability checks whether generated manipulation video can become successful robot action, a stronger claim than video quality.
Why not stronger: RoboWM-Bench and WorldArena adapters have not yet been cloned locally, smoke-tested, or wired into WM Arena.
Open-Sora v2 evidence
Open text-to-video transformer
-
Open text-to-video models should be compared on intrinsic video faithfulness before any stronger world-model claim.
Why not stronger: No WM Arena reproduced VBench-2.0 score is currently checked into the manifest.
TWISTER evidence
Transformer + CPC
Atari 100k HNS mean: 1.620
Atari HNS is direct evidence for the RL proxy claim because the learned model is judged through game-policy performance.
Why not stronger: Atari remains a pixel-game domain and does not prove robotics or driving utility.
-
TWISTER is action-conditioned, so counterfactual Atari rollouts are the right next evidence for controllability.
Why not stronger: The dedicated counterfactual lane is not ranking-eligible yet.
DIAMOND evidence
Diffusion (EDM)
Atari 100k HNS mean: 1.460
Atari HNS is direct evidence for the RL proxy claim because the model is evaluated through standardized game performance.
Why not stronger: It does not establish physical execution, closed-loop robotics, or driving simulation claims.
-
Physical reasoning probes would test whether DIAMOND's visually strong predictions preserve object dynamics.
Why not stronger: WM Arena has not run these video-physics evaluators on DIAMOND yet.
-
3D/4D consistency is a separate claim from Atari pixel prediction.
Why not stronger: DIAMOND's WM Arena passport/category does not claim 3D or 4D world generation.
EfficientZero V2 evidence
MuZero + MCTS
Paper-reported discrete/continuous control
EfficientZero V2 extends the model-based RL evidence family beyond discrete Atari-style settings.
Why not stronger: The number is not yet a WM Arena self-measured result.
EfficientZero evidence
MuZero + Self-Supervised
Paper-reported Atari 100k score
EfficientZero is a canonical limited-data Atari model-based RL baseline.
Why not stronger: It is a paper-only entry in WM Arena until the evaluation is reproduced locally.
M³ evidence
Modular Transformer
Paper-reported Atari result
The paper positions M³ as a modular world model over token streams, making Atari-style RL proxy evidence relevant.
Why not stronger: WM Arena has not yet reproduced a ranking-eligible local run for this model.
STORM evidence
Stochastic Transformer
Atari 100k HNS mean: 1.270
Atari HNS is the primary WM Arena score for model-based RL proxy environments.
Why not stronger: The evidence is game-domain specific and should not be generalized to embodied robotics.
IRIS evidence
Transformer + dVAE
Atari 100k HNS mean: 1.046
Atari HNS connects the learned world model to policy performance under a standard data budget.
Why not stronger: This is not evidence for physical or 3D/4D consistency.
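Several cards above report an Atari 100k HNS mean. Assuming HNS here is the human-normalized score in its conventional form, (agent - random) / (human - random) per game, averaged across games, the computation can be sketched as follows. The function names and the per-game numbers are made up for illustration; they are not WM Arena code and not real Atari reference scores.

```python
from statistics import mean


def human_normalized_score(agent: float, random: float, human: float) -> float:
    """Normalize a raw game score so that random play is 0.0 and human is 1.0."""
    if human == random:
        raise ValueError("human and random reference scores must differ")
    return (agent - random) / (human - random)


def hns_mean(per_game: dict) -> float:
    """Average the per-game normalized scores.

    per_game maps a game name to a tuple (agent, random, human).
    """
    return mean(
        human_normalized_score(agent, random, human)
        for agent, random, human in per_game.values()
    )


# Placeholder numbers only (hypothetical games, not real reference scores).
scores = {
    "game_a": (60.0, 10.0, 110.0),    # halfway between random and human
    "game_b": (300.0, 100.0, 200.0),  # twice the human margin over random
}

print(hns_mean(scores))
```

Under this convention a per-game value of 1.0 matches the human reference and 0.0 matches random play, so a mean above 1.0 can be driven by a few games with very high normalized scores.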
| Model | RL proxy | Dynamics | Action | Physics | Faithful video | Closed loop | Robot exec | Driving | 3D/4D | Horizon |
|---|---|---|---|---|---|---|---|---|---|---|
| Open Oasis 500M: Action-conditioned interactive world model | — | — | Interactive keyboard rollout | — | — | no score | — | — | — | — |
| Wan 2.2 T2V-A14B: Diffusion MoE T2V | — | — | — | Physion v2 OCP accuracy: 0.610 | VBench-2.0 overall: 0.847 | — | no score | — | — | — |
| Open-Sora v2: Open text-to-video transformer | — | — | — | — | no score | — | — | — | — | — |
| TWISTER: Transformer + CPC | Atari 100k HNS mean: 1.620 | — | no score | — | — | — | — | — | — | — |
| DIAMOND: Diffusion (EDM) | Atari 100k HNS mean: 1.460 | — | — | no score | — | — | — | — | no score | — |
| EfficientZero V2: MuZero + MCTS | Paper-reported discrete/continuous control | — | — | — | — | — | — | — | — | — |
| EfficientZero: MuZero + Self-Supervised | Paper-reported Atari 100k score | — | — | — | — | — | — | — | — | — |
| M³: Modular Transformer | Paper-reported Atari result | — | — | — | — | — | — | — | — | — |
| STORM: Stochastic Transformer | Atari 100k HNS mean: 1.270 | — | — | — | — | — | — | — | — | — |
| IRIS: Transformer + dVAE | Atari 100k HNS mean: 1.046 | — | — | — | — | — | — | — | — | — |