Infinipaint: World Modeling a Trivially Simple World
Premise Setup
World models are useful when simulating complex, high-entropy environments.
What happens if we world-model a trivially simple environment with only a few rules?
Project Setup + Data Collection
We want to understand:
- How simple can the model architecture get?
- How effective is that architecture?
Architecture
A VAE compresses frames into latents; a stateless CNN predicts the latent delta for the next frame.
Why a stateless CNN?
- Main UI variables are visible on-screen.
- Conditioning on both a_t and a_{t+1} makes the last cursor position explicit.
- No hidden state → recurrence is unnecessary.
- Deterministic dynamics → direct prediction instead of diffusion.
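The stateless design above can be sketched as a single pure function of the current latent and two actions. This is a minimal NumPy illustration under stated assumptions, not the project's implementation: `delta_fn` is a hypothetical stand-in for the trained CNN, and the channels-first `(C, H, W)` layout and the idea of broadcasting each action vector into constant feature planes are assumptions made for the example.

```python
import numpy as np

def action_planes(action, height, width):
    """Broadcast a flat action vector into constant planes of shape (A, H, W)."""
    action = np.asarray(action, dtype=np.float32)
    return np.broadcast_to(action[:, None, None], (action.size, height, width))

def predict_next_latent(delta_fn, z_t, a_t, a_next):
    """One stateless step: z_{t+1} = z_t + delta_fn([z_t, a_t, a_{t+1}]).

    delta_fn is a hypothetical placeholder for the CNN. It receives the
    channel-wise concatenation of the latent and both action planes and
    returns a delta with the same shape as z_t. No hidden state is carried.
    """
    _, h, w = z_t.shape
    x = np.concatenate(
        [z_t, action_planes(a_t, h, w), action_planes(a_next, h, w)], axis=0
    )
    return z_t + delta_fn(x)

# Usage with a stub that predicts "nothing changes".
z = np.ones((4, 8, 8), dtype=np.float32)
zero_delta = lambda x: np.zeros((4, 8, 8), dtype=np.float32)
z_next = predict_next_latent(zero_delta, z, [0.1, 0.2, 1.0], [0.3, 0.4, 1.0])
```

Because the function takes everything it needs as arguments, the same call works at any point in a rollout without carrying recurrent state.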
Key Engineering Decisions
Tuning latent compression
- 8x compression is too aggressive; 4x preserves fine structure.
- Latent diffs show sparse, localized update patterns.
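One way to check how sparse and localized a latent diff is: threshold the per-position change and report the changed fraction plus its bounding box. The sketch below is illustrative only; the function name, the `eps` threshold, and the `(C, H, W)` latent layout are assumptions rather than details from the project.

```python
import numpy as np

def latent_diff_stats(z_prev, z_next, eps=1e-3):
    """Return (changed_fraction, bbox) for two latents of shape (C, H, W).

    A spatial position counts as changed if any channel moves by more than
    eps. bbox is (row_min, row_max, col_min, col_max), or None if nothing
    changed.
    """
    diff = np.abs(z_next - z_prev).max(axis=0)  # (H, W), max over channels
    mask = diff > eps
    frac = float(mask.mean())
    if not mask.any():
        return frac, None
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return frac, (int(rows[0]), int(rows[-1]), int(cols[0]), int(cols[-1]))

# Toy example: a 2x2 patch changes inside a 10x10 latent.
z_prev = np.zeros((4, 10, 10))
z_next = z_prev.copy()
z_next[:, 2:4, 5:7] = 1.0
frac, bbox = latent_diff_stats(z_prev, z_next)
```

A small changed fraction with a tight bounding box is the pattern that motivates weighting changed and unchanged regions separately in the loss.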
Sparse-delta loss balancing
- Unchanged regions dominate the loss, so LatentCNN learns to predict zero change.
- Fix: compute mean L1 separately over changed and unchanged regions, then average the two means.
Without balancing (single mean)
- Changed: 1%
- Unchanged: 99%
- → avg L1 0.02
With balancing (average of means)
- Changed: 1% → avg L1 1.00
- Unchanged: 99% → avg L1 0.01
- Average of the two means → 0.505
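The balancing fix can be sketched in NumPy as follows. This is a minimal illustration rather than the project's code: the function names, the `eps` threshold that decides which elements count as changed, and the toy tensors are all assumptions. The toy data is chosen so that 1% of elements are changed with error 1.00 and 99% are unchanged with error 0.01, matching the figures above.

```python
import numpy as np

def plain_l1(pred, target):
    """Single mean over all elements; unchanged regions dominate."""
    return float(np.mean(np.abs(pred - target)))

def balanced_l1(pred, target, eps=1e-6):
    """Average of the mean L1 over changed and unchanged elements.

    An element counts as changed when the target delta exceeds eps in
    magnitude (an assumed criterion). Empty groups are skipped.
    """
    err = np.abs(pred - target)
    changed = np.abs(target) > eps
    group_means = []
    if changed.any():
        group_means.append(err[changed].mean())
    if (~changed).any():
        group_means.append(err[~changed].mean())
    return float(np.mean(group_means)) if group_means else 0.0

# Toy delta target: 1 of 100 elements changes.
target = np.zeros(100)
target[0] = 1.0
# A near-zero prediction: misses the change, small noise elsewhere.
pred = np.full(100, 0.01)
pred[0] = 0.0

single = plain_l1(pred, target)       # about 0.02
balanced = balanced_l1(pred, target)  # about 0.505
```

Under the single mean, predicting almost no change already looks good (about 0.02); under the balanced loss the same prediction scores about 0.505, so missing the changed region is no longer cheap.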
Core Results
Step 1 quality: SSIM 0.995 ± 0.006, PSNR 36.0 ± 6.0 dB.
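For readers who want to interpret the PSNR figure, the standard definition is 10 · log10(MAX² / MSE). The sketch below assumes frames scaled to [0, 1] (so `max_val=1.0`); that scaling and the function name are assumptions, and the project may compute the metric differently.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB; returns inf for identical inputs."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = float(np.mean((pred - target) ** 2))
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

# A uniform error of 0.1 on a [0, 1] frame gives MSE 0.01, i.e. 20 dB.
frame = np.zeros((16, 16))
noisy = np.full((16, 16), 0.1)
value = psnr(noisy, frame)
```

Under the same scaling assumption, a single frame at 36 dB has a root-mean-square error of roughly 1.6% of the peak value.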
Three rollout phases
- Steps 1 to 150: stable (SSIM stays above 0.95).
- Steps 150 to 500: steady degradation (SSIM ~0.92 to ~0.85).
- Steps 500+: collapse (SSIM below 0.75, ~0.62 by step 800).
Behavioral Findings
- Lines curve and thin; colors fringe.
- In very late stages (1k+ steps), the world model collapses into eraser-only behavior.
Limitations + Conclusion
Main Limitations
- Up to 280x slower than deterministic simulation (MLX).
- Autoregressive errors compound over long rollouts.
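The compounding effect can be illustrated with a toy autoregressive loop in which each prediction is fed back as the next input. Everything here is hypothetical: `rollout`, `biased_step`, and the constant per-step bias are stand-ins chosen so that the drift is easy to verify, and a real model's errors need not grow linearly.

```python
import numpy as np

def rollout(step_fn, z0, actions):
    """Feed each predicted latent back in as the next input.

    step_fn(z, a_t, a_next) is any one-step predictor; actions is a
    sequence, and consecutive pairs are consumed, giving len(actions) - 1
    steps. Returns the list of latents, including z0.
    """
    z = z0
    trajectory = [z0]
    for a_t, a_next in zip(actions[:-1], actions[1:]):
        z = step_fn(z, a_t, a_next)
        trajectory.append(z)
    return trajectory

def biased_step(z, a_t, a_next, bias=1e-3):
    """Toy predictor: the true dynamics leave z unchanged, so the added
    bias is pure model error."""
    return z + bias

z0 = np.zeros((4, 8, 8))
actions = [None] * 101  # 100 steps; the toy predictor ignores actions
traj = rollout(biased_step, z0, actions)
drift = float(np.mean(np.abs(traj[-1] - z0)))  # about 100 * 1e-3
```

A per-step error too small to notice at step 1 becomes a hundredfold larger after 100 steps in this toy setting, because nothing in the loop pulls the latent back toward the true state.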
Concluding Thoughts
- A simple stateless CNN is surprisingly effective.
- Rollouts introduce long-horizon failure modes.
Source code: github.com/shuklabhay/infinipaint
Full writeup: shuklabhay.github.io/blog/infinipaint