How to Reduce FID by 30% When Training Text-to-Image AI

Key Takeaways: 200K Step Secret, Muon Optimizer, Token Routing

  • REPA alignment is only an early-training accelerator; it should be removed after roughly 200K steps.
  • Switching to the Muon optimizer alone improved FID from 18.2 to 15.55 (about 15%).
  • TREAD token routing brings FID down to 14.10 at 1024×1024.

What Happened?

The Photoroom team has released Part 2 of the training optimization guide for PRX, its text-to-image generation model.[Hugging Face] Part 1 covered the architecture; this installment lays out concrete ablation results on what to do, and how, when actually training the model.

Frankly, most technical write-ups of this kind end with "our model is great," but this one is different: the team also discloses failed experiments and quantifies the trade-offs of each technique.

Why is it Important?

Training a text-to-image model from scratch is incredibly expensive, and a single wrong setting can waste thousands of GPU hours. The data Photoroom has released cuts down that trial and error.

Personally, the most notable finding concerns REPA (representation alignment). REPA with DINOv3 drops FID from 18.2 to 14.64. But there's a catch: throughput falls by 13%, and after 200K steps REPA actually hinders training. Put simply, it's an early-phase booster and nothing more.
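To make the "early booster" idea concrete, here is a minimal sketch of how an auxiliary REPA-style alignment loss might be gated off after 200K steps. The helper names (denoising_loss, frozen_encoder, intermediate_features) and the loss weight are illustrative assumptions, not Photoroom's actual code.

```python
import torch

REPA_CUTOFF_STEP = 200_000  # per Photoroom's ablation, REPA helps only early in training

def training_loss(model, batch, step, repa_weight=0.5):
    """Denoising loss plus an optional REPA-style alignment term (hypothetical helpers)."""
    # Main denoising objective
    loss = model.denoising_loss(batch)

    # REPA: align intermediate activations with a frozen vision encoder (e.g. DINOv3),
    # but only while it still accelerates convergence.
    if step < REPA_CUTOFF_STEP:
        with torch.no_grad():
            target_feats = model.frozen_encoder(batch["images"])   # frozen DINOv3 features
        student_feats = model.intermediate_features(batch)          # projected DiT activations
        align = 1.0 - torch.nn.functional.cosine_similarity(
            student_feats, target_feats, dim=-1
        ).mean()
        loss = loss + repa_weight * align
    return loss
```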

Then there's the BF16 weight-saving bug. Save checkpoints in BF16 instead of FP32 without realizing it, and FID jumps from 18.2 to 21.87, a regression of 3.67 points. Surprisingly many teams fall into this trap.
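A simple way to sidestep the bug is to always cast weights back to FP32 before writing a checkpoint, even when the training compute runs in BF16. The sketch below shows the idea; the checkpoint layout and file path are illustrative, not taken from PRX.

```python
import torch

def save_checkpoint(model, optimizer, step, path="checkpoint.pt"):
    """Save FP32 master weights; writing BF16 weights to disk loses precision,
    which is where the FID regression described above comes from."""
    state = {
        "step": step,
        # Cast parameters up to FP32 before saving, even if the forward/backward
        # pass runs under BF16 autocast.
        "model": {k: v.detach().float().cpu() for k, v in model.state_dict().items()},
        "optimizer": optimizer.state_dict(),
    }
    torch.save(state, path)

# What to avoid: writing BF16 tensors directly, e.g.
#   torch.save({k: v.bfloat16() for k, v in model.state_dict().items()}, path)
# Evaluating or resuming from such a checkpoint is where the FID jump shows up.
```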

Practical Guide: Resolution-Specific Strategies

Technique        256×256 FID   1024×1024 FID   Throughput
Baseline         18.20         —               3.95 b/s
REPA-E-VAE       12.08         —               3.39 b/s
TREAD            21.61 (↑)     14.10 (↓)       1.64 b/s
Muon Optimizer   15.55         —               —

At 256×256, TREAD actually degrades quality, but at 1024×1024 the result flips entirely. In other words, the higher the resolution, the more token routing pays off.
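For readers unfamiliar with TREAD, here is a rough sketch of the routing idea: only a randomly kept subset of tokens passes through a span of transformer blocks, and the skipped tokens are reinserted afterward. At 1024×1024 the latent sequence is much longer and more redundant, which is why the compute savings cost less quality. The module structure and keep ratio below are illustrative assumptions, not Photoroom's exact configuration.

```python
import torch
import torch.nn as nn

class TokenRoutedBlocks(nn.Module):
    """Rough sketch of TREAD-style routing: a kept subset of tokens passes through
    the inner blocks; the skipped tokens are reinserted unchanged."""

    def __init__(self, blocks: nn.ModuleList, keep_ratio: float = 0.5):
        super().__init__()
        self.blocks = blocks
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        b, n, d = x.shape
        n_keep = max(1, int(n * self.keep_ratio))

        # Random routing per sample: which tokens go through the blocks.
        idx = torch.rand(b, n, device=x.device).argsort(dim=1)
        keep_idx = idx[:, :n_keep]

        routed = torch.gather(x, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
        for block in self.blocks:
            routed = block(routed)

        # Scatter the processed tokens back into the full sequence.
        out = x.clone()
        out.scatter_(1, keep_idx.unsqueeze(-1).expand(-1, -1, d), routed)
        return out
```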

What Will Happen Next?

In Part 3, Photoroom plans to release the full training code and run a 24-hour "speedrun" to show how quickly a decent model can be trained.

Personally, I think this release will have a significant impact on the open-source image generation model ecosystem. This is the first time since Stable Diffusion that training know-how has been disclosed so specifically.

Frequently Asked Questions (FAQ)

Q: When should REPA be removed?

A: After about 200K steps. REPA accelerates training early on, but beyond that point it actually hinders convergence, as the Photoroom experiments show clearly. Miss the timing and the final model quality suffers.

Q: Should I use synthetic data or real images?

A: Use both. Train on synthetic images first to learn global structure, then switch to real images to capture high-frequency detail. Training only on synthetic images can yield a good FID, but the outputs don't look like photographs.
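As an illustration only, a two-phase data schedule can be as simple as swapping the dataset at a step threshold. The switch step and loader settings below are assumptions for the sketch, not values published by Photoroom.

```python
from torch.utils.data import DataLoader

# Illustrative two-phase schedule; the switch step is an assumed value.
SYNTHETIC_PHASE_STEPS = 300_000

def get_dataloader(step, synthetic_ds, real_ds, batch_size=256):
    """Phase 1: synthetic images to learn global composition.
    Phase 2: real photographs to recover high-frequency, photographic detail."""
    dataset = synthetic_ds if step < SYNTHETIC_PHASE_STEPS else real_ds
    return DataLoader(dataset, batch_size=batch_size, shuffle=True, num_workers=8)
```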

Q: How much better is the Muon optimizer than AdamW?

A: About a 15% improvement in FID, from 18.2 to 15.55. The computational cost is similar to AdamW's, so there's little reason not to use it, though hyperparameter tuning is a bit tricky.
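Muon's core trick is momentum SGD whose 2D weight-matrix updates are approximately orthogonalized with a Newton-Schulz iteration. The sketch below shows that orthogonalization step with the commonly used quintic coefficients from the public reference implementation; how it is wired into PRX training (which parameter groups, learning rates) is not detailed in the post, so the usage note is an assumption.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D gradient/momentum matrix, the core step of Muon.
    Coefficients follow the widely used quintic iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g.bfloat16()
    x = x / (x.norm() + 1e-7)          # scale so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:                      # work on the wide orientation
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    if transposed:
        x = x.T
    return x.to(g.dtype)

# Typical usage (assumption): apply Muon only to the 2D weight matrices of the
# transformer blocks and keep AdamW for embeddings, norms, and output heads.
```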


If you found this article useful, please subscribe to AI Digester.
