Three key points: the 200K-step secret, the Muon optimizer, token routing
- REPA-style alignment is only an early-training accelerator and should be removed after 200K steps
- Muon optimizer alone achieves FID 18.2 → 15.55 (15% improvement)
- At 1024×1024 high resolution, TREAD token routing reduces FID to 14.10
What happened?
The Photoroom team released Part 2 of its optimization guide for the text-to-image model PRX. [Hugging Face] While Part 1 covered the architecture, this installment shares concrete ablation results on what to do during actual training.
Honestly, most technical write-ups of this kind end with “our model is the best,” but this one is different: they also disclosed failed experiments and put numbers on each technique’s trade-offs.
Why is it important?
The cost of training a text-to-image model from scratch is enormous; a single wrong setting can waste thousands of GPU hours. The data released by Photoroom cuts down on that trial and error.
Personally, the most notable finding concerns REPA (Representation Alignment). Adding REPA with DINOv3 features drops FID from 18.2 to 14.64. But there is a catch: throughput decreases by 13%, and the alignment term actually hinders training after about 200K steps. Simply put, it is only an early-stage booster.
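For intuition, here is a minimal sketch of how a REPA-style alignment term is typically added to the diffusion loss and then scheduled away. The projection head, loss weight, and cutoff handling are illustrative assumptions, not Photoroom’s actual implementation.

```python
import torch.nn.functional as F

# Illustrative sketch of a REPA-style alignment term (not Photoroom's code).
# dit_hidden    : intermediate tokens from the diffusion transformer, (B, N, D_model)
# dino_features : frozen DINOv3 patch features of the clean image,    (B, N, D_dino)
# proj          : small trainable head mapping D_model -> D_dino

def repa_loss(dit_hidden, dino_features, proj):
    pred = F.normalize(proj(dit_hidden), dim=-1)
    target = F.normalize(dino_features, dim=-1)
    # Negative cosine similarity, averaged over tokens and batch.
    return -(pred * target).sum(dim=-1).mean()

def total_loss(diffusion_loss, dit_hidden, dino_features, proj, step,
               repa_weight=0.5, cutoff_step=200_000):
    # REPA is an early-training booster only: drop the term after ~200K steps.
    w = repa_weight if step < cutoff_step else 0.0
    return diffusion_loss + w * repa_loss(dit_hidden, dino_features, proj)
```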
Another pitfall is BF16 weight storage. If you unknowingly keep the weights in BF16 instead of FP32, FID jumps from 18.2 to 21.87, a 3.67-point regression. Surprisingly, many teams fall into this trap.
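The standard way to avoid this is BF16 compute with FP32 master weights: run the forward and backward pass under autocast, but keep (and checkpoint) the parameters in FP32. A minimal PyTorch sketch, with a toy model standing in for the real network:

```python
import torch
import torch.nn as nn

# Toy stand-in for the real diffusion transformer; the point is the precision policy.
model = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512)).cuda()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

def train_step(batch):
    # Forward/backward run in BF16 via autocast, while the parameters and
    # optimizer state remain FP32 master copies.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(batch).pow(2).mean()   # placeholder loss
    loss.backward()
    opt.step()
    opt.zero_grad()

train_step(torch.randn(8, 512, device="cuda"))

# Checkpoint the FP32 state dict as-is; casting the weights to BF16 at this
# point is the mistake that costs ~3.7 FID in the ablation.
torch.save(model.state_dict(), "ckpt_fp32.pt")
```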
Practical Guide: Strategies by Resolution
| Technique | 256×256 FID | 1024×1024 FID | Throughput |
|---|---|---|---|
| Baseline | 18.20 | – | 3.95 b/s |
| REPA-E-VAE | 12.08 | – | 3.39 b/s |
| TREAD | 21.61 ↑ | 14.10 ↓ | 1.64 b/s |
| Muon Optimizer | 15.55 | – | – |
At 256×256, TREAD actually degrades quality (FID rises from 18.20 to 21.61). At 1024×1024, the result flips completely: FID drops to 14.10. The higher the resolution, the bigger the payoff from token routing.
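For readers unfamiliar with TREAD, the core idea is token routing: only a random subset of tokens passes through the expensive middle blocks, and the rest are reinserted afterwards. The sketch below is a simplified illustration of that pattern, not the paper’s exact routing rule.

```python
import torch

def route_tokens(tokens, keep_ratio=0.5):
    """Randomly keep a subset of tokens; return the subset and its indices."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, idx

def unroute_tokens(kept, idx, skipped_tokens):
    """Scatter the processed subset back into the full sequence."""
    B, N, D = skipped_tokens.shape
    out = skipped_tokens.clone()
    out.scatter_(1, idx.unsqueeze(-1).expand(-1, -1, D), kept)
    return out

# In a DiT-style model: early blocks see all tokens, the middle blocks only the
# routed subset, and the late blocks see the full sequence again. At 1024×1024
# the token sequence is ~16x longer than at 256×256, which is why the savings
# (and the FID gain) show up mainly at high resolution.
```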
What comes next?
Photoroom plans to release the complete training code in Part 3, along with a 24-hour “speed run” to show how quickly a good model can be trained.
Personally, I think this release will have a significant impact on the open-source image-generation ecosystem. Not since Stable Diffusion has training know-how been disclosed in this much detail.
Frequently Asked Questions (FAQ)
Q: When should REPA be removed?
A: After roughly 200K steps. It accelerates learning early on but then actually hinders convergence, as Photoroom’s experiments clearly show. Miss the timing and the final model’s quality suffers.
Q: Should I use synthetic data or real images?
A: Use both. Start with synthetic images so the model learns global structure, then switch to real images in the later stages to capture high-frequency detail. Training on synthetic data alone yields a good FID, but the outputs do not look photorealistic.
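A minimal sketch of such a two-stage schedule; the switch step and iterator names are placeholders, not values from the article.

```python
# Hypothetical two-stage data curriculum: synthetic data first, real photos later.
# `synthetic_iter` and `real_iter` are placeholder iterators over batches.

def pick_batch(step, synthetic_iter, real_iter, switch_step=300_000):
    # Early training: synthetic images teach global structure and composition.
    # Late training: real photos supply high-frequency, photorealistic detail.
    return next(synthetic_iter) if step < switch_step else next(real_iter)
```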
Q: How much better is Muon optimizer than AdamW?
A: Roughly a 15% improvement in FID: it drops from 18.2 to 15.55. Since the computational cost is similar, there is little reason not to use it, though hyperparameter tuning is a bit tricky.
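For reference, Muon is essentially momentum SGD whose 2D weight updates are orthogonalized with a few Newton-Schulz iterations. The simplified sketch below follows the publicly described algorithm; the coefficients and defaults are the commonly cited ones, and Photoroom’s exact settings are not stated in the article.

```python
import torch

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D update via a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315   # commonly used coefficients
    X = G / (G.norm() + eps)            # normalize by Frobenius norm
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    """One simplified Muon update for a 2D weight matrix (no Nesterov, no extra scaling)."""
    momentum_buf.mul_(momentum).add_(grad)
    param.data.add_(newton_schulz_orthogonalize(momentum_buf), alpha=-lr)
```

In practice Muon is typically applied only to the 2D hidden weight matrices, while embeddings, norms, and the output head stay on AdamW.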
If you found this article useful, please subscribe to AI Digester.
References
- Training Design for Text-to-Image Models: Lessons from Ablations – Hugging Face (2026-02-03)