Infinipaint: World Modeling a Trivially Simple World

Infinipaint banner

Premise Setup

World models are useful when simulating complex, high-entropy environments.

CSGO world model scenario — Game world model example.

Waymo world model driving scenario — Waymo world model example.

What happens if we world-model a trivially simple environment with only a few rules?

Project Setup + Data Collection

We want to understand:

How simple can model architecture get?
How effective is that architecture?

Minipaint environment in action — Minipaint used as a controlled world-model benchmark.

Action heatmap over Minipaint frames — Frame-action pairs used for supervised world-model training.

Architecture

A VAE compresses frames; a stateless CNN predicts latent deltas

Why a stateless CNN?

Main UI variables are visible on-screen.
Conditioning on both a_t and a_{t+1} makes the last cursor position explicit.
No hidden state → recurrence is unnecessary.
Deterministic dynamics → direct prediction over diffusion
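A minimal sketch of what a stateless, delta-predicting rollout looks like. Everything here is illustrative: `rollout`, `toy_delta`, the flat-list latent, and the `(x, y, pen_down)` action encoding are hypothetical stand-ins, not the project's actual LatentCNN or data format. The sketch also assumes the predictor sees the current latent alongside the two actions. The point is only that each step depends on the current latent and the action pair, with nothing carried between steps.

```python
from typing import Callable, List, Sequence, Tuple

Latent = List[float]
Action = Tuple[int, int, int]  # hypothetical (x, y, pen_down) encoding


def rollout(
    z0: Latent,
    actions: Sequence[Action],
    predict_delta: Callable[[Latent, Action, Action], Latent],
) -> List[Latent]:
    """Autoregressive rollout with no hidden state.

    Each step computes z_next = z + predict_delta(z, a_t, a_next).
    """
    latents = [list(z0)]
    for a_t, a_next in zip(actions, actions[1:]):
        z = latents[-1]
        delta = predict_delta(z, a_t, a_next)
        latents.append([zi + di for zi, di in zip(z, delta)])
    return latents


def toy_delta(z: Latent, a_t: Action, a_next: Action) -> Latent:
    """Stand-in for a learned predictor: mark one latent cell when the pen is down."""
    delta = [0.0] * len(z)
    x, _, pen_down = a_next
    if pen_down:
        delta[x % len(z)] = 1.0
    return delta


actions = [(0, 0, 0), (1, 0, 1), (2, 0, 1), (2, 0, 0)]
trajectory = rollout([0.0] * 4, actions, toy_delta)
print(trajectory[-1])
```

Because the predicted latent is fed back as the next input, any per-step error is also fed back, which is where long-rollout drift comes from.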

Infinipaint VAE and LatentCNN architecture — Infinipaint pipeline from action-conditioned latent prediction to rendered frame.

Key Engineering Decisions

Tuning latent compression

8x compression is too aggressive; 4x preserves fine structure.
Latent diffs show sparse, localized update patterns.

VAE reconstructions at eight times compression — 8x compression.

VAE reconstructions at four times compression — 4x compression.

Latent difference analysis visualization — Latent diff analysis.
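One way to quantify "sparse, localized" is to measure what fraction of latent cells change between consecutive frames. The sketch below assumes a latent stored as a 2D grid of floats and uses a hypothetical `threshold` to decide what counts as a change. Neither is taken from the project.

```python
from typing import List

Grid = List[List[float]]


def changed_fraction(z_prev: Grid, z_next: Grid, threshold: float = 1e-3) -> float:
    """Fraction of latent cells whose absolute change exceeds the threshold."""
    total = 0
    changed = 0
    for row_prev, row_next in zip(z_prev, z_next):
        for a, b in zip(row_prev, row_next):
            total += 1
            if abs(b - a) > threshold:
                changed += 1
    return changed / total


# Toy 4x4 latent where a single cell changes between frames.
z_prev = [[0.0] * 4 for _ in range(4)]
z_next = [row[:] for row in z_prev]
z_next[2][1] = 0.8

print(changed_fraction(z_prev, z_next))  # 1 of 16 cells changed -> 0.0625
```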

Sparse-delta loss balancing

Unchanged regions dominate the loss, so LatentCNN learns to predict zero change.
Fix: compute the mean L1 separately over changed and unchanged regions, then average the two means.

Without balancing (single mean)

Changed (1%) + Unchanged (99%) → avg L1 0.02

With balancing (average of means)

Changed (1%) → avg L1 1.00
Unchanged (99%) → avg L1 0.01
Average of the two means → 0.505
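The arithmetic in the diagram can be reproduced with a small sketch. The function names, the flat-list representation, and the rule "a cell is changed when its target delta is non-zero" are assumptions for illustration, not the project's implementation.

```python
from typing import List, Sequence


def mean_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Plain mean absolute error over all cells."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def balanced_l1(
    pred: Sequence[float], target: Sequence[float], threshold: float = 1e-6
) -> float:
    """Average of the mean L1 over changed cells and the mean L1 over unchanged cells."""
    changed: List[float] = []
    unchanged: List[float] = []
    for p, t in zip(pred, target):
        # Assumption: a cell counts as changed when its target delta is non-zero.
        (changed if abs(t) > threshold else unchanged).append(abs(p - t))
    group_means = [sum(g) / len(g) for g in (changed, unchanged) if g]
    return sum(group_means) / len(group_means)


# 100 cells: 1 changed (target delta 1.0, predicted 0.0 -> error 1.00),
# 99 unchanged (target delta 0.0, predicted 0.01 -> error 0.01).
target = [1.0] + [0.0] * 99
pred = [0.0] + [0.01] * 99

print(round(mean_l1(pred, target), 4))      # about 0.0199, the diagram's 0.02
print(round(balanced_l1(pred, target), 4))  # about 0.505
```

With the single mean, completely missing the changed cell costs only about 0.02, so predicting zero change is nearly free. With balancing, the same miss contributes half of the loss.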

Core Results

Step 1 quality: SSIM 0.995 ± 0.006 and PSNR 36.0 ± 6.0 dB.
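For reference, PSNR is a log-scaled mean squared error. The sketch below uses the standard definition and assumes pixel values normalized to [0, 1] in a flat list. The evaluation code behind the numbers above may use a different range or layout.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


# Every pixel off by 0.1 on a [0, 1] scale gives MSE 0.01, i.e. about 20 dB.
target_px = [0.5] * 16
pred_px = [0.6] * 16
print(round(psnr(pred_px, target_px), 2))
```

Under the same [0, 1] assumption, 36 dB corresponds to a root-mean-square pixel error of roughly 0.016, since 10^(-36/20) ≈ 0.0158.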

Three rollout phases

Steps 1 to 150: stable (SSIM stays above 0.95).
Steps 150 to 500: steady degradation (SSIM ~0.92 to ~0.85).
Steps 500+: collapse (SSIM below 0.75, ~0.62 by step 800).

SSIM and PSNR degradation curves across rollouts — Held-out autoregressive evaluation across 20 episodes.

Behavioral Findings

Lines curve and thin; colors develop fringes.
In very late stages (1k+ steps), the world model collapses into eraser-only behavior.

Rollout progression grid across episodes — 2x4 rollout grid showing degradation over time.

Late rollout degradation in final moments — Late-stage collapse in the interactive viewer.

Limitations + Conclusion

Main Limitations

Up to 280x slower than deterministic simulation (MLX).
Autoregressive errors compound over long rollouts.

Concluding Thoughts

A simple stateless CNN is surprisingly effective.
Rollouts introduce long-horizon failure modes.

Interactive viewer rollout — Viewer rollout behavior.

Source code: github.com/shuklabhay/infinipaint
Full writeup: shuklabhay.github.io/blog/infinipaint
