Infinipaint: World Modeling a Trivially Simple World

Infinipaint banner

Premise Setup

World models are useful when simulating complex, high-entropy environments.

CSGO world model scenario
Game world model example.
Waymo world model driving scenario
Waymo world model example.

What happens if we world-model a trivially simple environment with only a few rules?

Project Setup + Data Collection

We want to understand:
  1. How simple can model architecture get?
  2. How effective is that architecture?
Minipaint environment in action
Minipaint used as a controlled world-model benchmark.
Action heatmap over Minipaint frames
Frame-action pairs used for supervised world-model training.

Architecture

A VAE compresses frames; a stateless CNN predicts latent deltas

Why a stateless CNN?
  • Main UI variables are visible on-screen.
  • Conditioning on both a_t and a_{t+1} makes the last cursor position explicit.
  • No hidden state → recurrence is unnecessary.
  • Deterministic dynamics → direct prediction is preferred over diffusion.
Infinipaint VAE and LatentCNN architecture
Infinipaint pipeline from action-conditioned latent prediction to rendered frame.
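The stateless update described above can be sketched as a single function: the next latent is the current latent plus a predicted delta, conditioned on the previous and next actions. This is a minimal sketch, not the project's implementation. The names `predict_next_latent` and `toy_delta_model`, the flat-list latent, and the one-integer action encoding are all assumptions made for illustration; the real model operates on VAE latents with a CNN.

```python
from typing import Callable, List, Tuple

Latent = List[float]        # flattened latent; stand-in for a real tensor (assumption)
Action = Tuple[int, ...]    # hypothetical action encoding: (cursor_cell_index,)


def predict_next_latent(
    delta_model: Callable[[Latent, Action, Action], Latent],
    z_t: Latent,
    a_t: Action,
    a_next: Action,
) -> Latent:
    """Stateless step: z_{t+1} = z_t + delta_model(z_t, a_t, a_{t+1})."""
    delta = delta_model(z_t, a_t, a_next)
    if len(delta) != len(z_t):
        raise ValueError("delta must have the same length as the latent")
    # Build a new list so the input latent is left untouched.
    return [z + d for z, d in zip(z_t, delta)]


def toy_delta_model(z_t: Latent, a_t: Action, a_next: Action) -> Latent:
    """Toy stand-in for the CNN: paint only the cell under the next cursor."""
    delta = [0.0] * len(z_t)
    cell = a_next[0]
    delta[cell] = 1.0 - z_t[cell]   # sparse, localized update
    return delta


z0 = [0.0, 0.0, 0.0, 0.0]
z1 = predict_next_latent(toy_delta_model, z0, (0,), (2,))
print(z1)
```

Because the function takes no hidden state, every call depends only on the current latent and the two actions, which is what makes recurrence unnecessary in this setup.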

Key Engineering Decisions

Tuning latent compression
  • 8x compression is too aggressive; 4x preserves fine structure.
  • Latent-diff analysis shows sparse, localized update patterns.
VAE reconstructions at eight times compression
8x compression.
VAE reconstructions at four times compression
4x compression.
Latent difference analysis visualization
Latent diff analysis.
Sparse-delta loss balancing
  • Unchanged regions dominate the loss, so LatentCNN learns to predict zero change.
  • Fix: compute mean L1 separately over changed and unchanged regions, then average the two means.
Without balancing (single mean):
  Changed (1%) + Unchanged (99%) → avg L1 = 0.02
With balancing (average of means):
  Changed (1%)    → avg L1 = 1.00
  Unchanged (99%) → avg L1 = 0.01
  Balanced loss   = (1.00 + 0.01) / 2 = 0.505
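The arithmetic behind the 1% / 99% example can be reproduced with a short sketch. It assumes that a position counts as "changed" when its target delta is nonzero (beyond a small `eps`), and it uses flat Python lists in place of latent tensors. The function names and the threshold are illustrative assumptions, not the project's actual loss code.

```python
from typing import List, Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


def single_mean_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Plain L1: one mean over every position."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return _mean([abs(p - t) for p, t in zip(pred, target)])


def balanced_l1(pred: Sequence[float], target: Sequence[float], eps: float = 1e-6) -> float:
    """Mean L1 over changed and unchanged positions separately, then averaged."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    changed: List[float] = []
    unchanged: List[float] = []
    for p, t in zip(pred, target):
        # Assumption: "changed" means the target delta is nonzero.
        (changed if abs(t) > eps else unchanged).append(abs(p - t))
    # Skip an empty group so a frame with no changes still yields a loss.
    groups = [g for g in (changed, unchanged) if g]
    return _mean([_mean(g) for g in groups])


# 100 positions: 1 changed (target delta 1.0, predicted 0.0),
# 99 unchanged (target delta 0.0, predicted 0.01).
target = [1.0] + [0.0] * 99
pred = [0.0] + [0.01] * 99

plain = single_mean_l1(pred, target)     # about 0.02
balanced = balanced_l1(pred, target)     # about 0.505
print(round(plain, 4), round(balanced, 4))
```

With the single mean, missing the one changed position costs almost nothing, so predicting zero everywhere is a cheap optimum. With the balanced version the same miss accounts for roughly half of the loss.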

Core Results

Step 1 quality: SSIM 0.995 +/- 0.006 and PSNR 36.0 +/- 6.0 dB.
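For readers less familiar with PSNR, the standard definition is 10 * log10(MAX^2 / MSE). The sketch below assumes pixel values scaled to [0, 1] (so MAX = 1.0) and flat lists of pixels. The scaling and the function name are assumptions, since the slides do not state how frames were normalized.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for two equally sized pixel sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf   # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.01 on a [0, 1] scale gives about 40 dB.
example_db = psnr([0.50, 0.50, 0.50, 0.50], [0.51, 0.51, 0.51, 0.51])
print(round(example_db, 2))
```

Under the same [0, 1] assumption, 36 dB corresponds to a root-mean-square pixel error of about 0.016.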

Three rollout phases
  1. Steps 1 to 150: stable (SSIM stays above 0.95).
  2. Steps 150 to 500: steady degradation (SSIM ~0.92 to ~0.85).
  3. Steps 500+: collapse (SSIM below 0.75, ~0.62 by step 800).
SSIM and PSNR degradation curves across rollouts
Held-out autoregressive evaluation across 20 episodes.
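An autoregressive rollout feeds each prediction back in as the next input, so small per-step errors are never corrected. The sketch below illustrates that loop with a deliberately simplified toy: scalar "latents", integer actions, and a model that is off by a constant 0.01 per step. All names and the toy dynamics are assumptions for illustration. The reported evaluation compares frames with SSIM and PSNR, and real errors do not grow this uniformly.

```python
from typing import Callable, List, Sequence

StepFn = Callable[[float, int, int], float]


def rollout(step_fn: StepFn, z0: float, actions: Sequence[int]) -> List[float]:
    """Apply step_fn repeatedly, feeding each output back in as the next input."""
    z = z0
    states: List[float] = []
    for a_t, a_next in zip(actions, actions[1:]):
        z = step_fn(z, a_t, a_next)
        states.append(z)
    return states


def true_step(z: float, a_t: int, a_next: int) -> float:
    # Toy ground-truth dynamics: the state grows by the next action.
    return z + a_next


def model_step(z: float, a_t: int, a_next: int) -> float:
    # Toy learned model: same dynamics plus a small constant bias.
    return z + a_next + 0.01


actions = [1] * 101                      # 101 actions give 100 transitions
truth = rollout(true_step, 0.0, actions)
preds = rollout(model_step, 0.0, actions)
errors = [abs(p - t) for p, t in zip(preds, truth)]
print(round(errors[0], 3), round(errors[-1], 3))
```

In this toy the error after n steps is roughly 0.01 * n: negligible at step 1 and large by step 100, even though each individual step is nearly perfect.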

Behavioral Findings

  • Lines curve and thin out; colors develop fringing.
  • In very late stages (1k+ steps), the world model collapses into eraser-only behavior.
Rollout progression grid across episodes
2x4 rollout grid showing degradation over time.
Late rollout degradation in final moments
Late-stage collapse in the interactive viewer.

Limitations + Conclusion

Main Limitations
  1. Up to 280x slower than deterministic simulation (MLX).
  2. Autoregressive errors compound over long rollouts.
Concluding Thoughts
  • A simple stateless CNN is surprisingly effective.
  • Rollouts introduce long-horizon failure modes.
Interactive viewer rollout
Viewer rollout behavior.

Source code: github.com/shuklabhay/infinipaint
Full writeup: shuklabhay.github.io/blog/infinipaint
