The gist in three lines: the 200K-step secret, the Muon optimizer, and token routing
- REPA-style representation alignment is only an early-training accelerator and should be removed after roughly 200K steps
- The Muon optimizer alone improved FID from 18.2 to 15.55, a roughly 15% improvement
- TREAD token routing brings FID down to 14.10 at 1024×1024 resolution
What happened?
The Photoroom team has released Part 2 of its guide to training the text-to-image model PRX, published on Hugging Face. Where Part 1 covered the architecture, this installment lays out concrete ablation results on what actually works during training.
Frankly, most technical write-ups of this kind end with "our model is the best." This one is different: it discloses failed experiments too, and puts numbers on the trade-offs of each technique.
Why is it important?
Training a text-to-image model from scratch is enormously expensive; a single wrong setting can waste thousands of GPU-hours. The data Photoroom has released cuts down on that trial and error.
Personally, the most notable finding concerns REPA (Representation Alignment). Using REPA-DINOv3 drops FID from 18.2 to 14.64. But there is a catch: throughput decreases by 13%, and past 200K steps REPA actually hinders convergence. In short, it is only an early-training booster.
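This "booster, then brake" behavior suggests gating the REPA loss on the step counter. Here is a minimal sketch of such a schedule in plain Python; only the 200K cutoff comes from the Photoroom write-up, while the function names and the `base_weight` value are illustrative assumptions:

```python
# Hypothetical REPA loss schedule: full weight before the cutoff,
# zero after. Only the 200K-step cutoff is from the source article;
# the 0.5 base weight is an illustrative placeholder.
REPA_CUTOFF = 200_000  # steps

def repa_weight(step: int, base_weight: float = 0.5) -> float:
    """Coefficient for the auxiliary REPA alignment loss at a given step."""
    return base_weight if step < REPA_CUTOFF else 0.0

def total_loss(diffusion_loss: float, repa_loss: float, step: int) -> float:
    """Combine the main diffusion loss with the (scheduled) REPA term."""
    return diffusion_loss + repa_weight(step) * repa_loss
```

A hard cutoff is the simplest reading of the finding; a short linear ramp-down near 200K steps would be a reasonable variant.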
Then there is the BF16 weight-storage pitfall. If you save checkpoints in BF16 instead of FP32 without realizing the consequences, FID balloons from 18.2 to 21.87, a 3.67-point degradation. Surprisingly many teams fall into this trap.
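The underlying failure mode is precision loss: small optimizer updates round away when weights are round-tripped through a low-precision format. A NumPy sketch of the effect, using float16 as a stand-in since NumPy has no native bfloat16 (BF16 actually has fewer mantissa bits than float16, 8 vs. 10, so its rounding is even coarser):

```python
import numpy as np

# Accumulate a small update in FP32 vs. casting back down to a
# low-precision format after every step, as happens when weights are
# stored in BF16. float16 stands in for BF16 here.
update = np.float32(1e-4)

w_fp32 = np.float32(1.0)
w_lowp = np.float16(1.0)
for _ in range(100):
    w_fp32 = np.float32(w_fp32 + update)
    # the increment is smaller than the float16 spacing near 1.0
    # (~9.8e-4), so it rounds away entirely
    w_lowp = np.float16(np.float32(w_lowp) + update)

print(w_fp32)  # ~1.01: all 100 updates accumulated
print(w_lowp)  # 1.0: every update was lost to rounding
```

Keeping an FP32 master copy of the weights (and saving checkpoints from it) avoids this, whatever precision the forward/backward pass uses.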
Practical Guide: Strategies by Resolution
| Technique | 256×256 FID | 1024×1024 FID | Throughput |
|---|---|---|---|
| Baseline | 18.20 | – | 3.95 b/s |
| REPA-E-VAE | 12.08 | – | 3.39 b/s |
| TREAD | 21.61 (worse) | 14.10 (better) | 1.64 b/s |
| Muon Optimizer | 15.55 | – | – |
At 256×256, TREAD actually degrades quality, but at 1024×1024 the picture flips entirely. The higher the resolution, the larger the payoff from token routing.
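The intuition behind token routing can be sketched in a few lines: during training, only a random subset of tokens passes through the expensive blocks, while the rest skip ahead unchanged and are merged back afterwards. This is our simplified reading of the TREAD idea, not Photoroom's implementation; the function names and the 50% keep ratio are made up for illustration:

```python
import numpy as np

def route_and_merge(tokens, block_fn, keep_ratio=0.5, seed=0):
    """Illustrative TREAD-style routing (simplified).

    tokens:   (num_tokens, dim) array for one sample
    block_fn: the transform the routed tokens pass through
    Only a random `keep_ratio` fraction of tokens is processed; the
    rest bypass the block unchanged and are merged back in place.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    keep = rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)
    out = tokens.copy()                 # skipped tokens pass through as-is
    out[keep] = block_fn(tokens[keep])  # routed tokens are processed
    return out
```

At high resolution there are many more (and more redundant) tokens per image, which is one plausible reason the quality cost of skipping some of them shrinks while the compute savings grow.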
What will happen in the future?
In Part 3, Photoroom plans to release the complete training code and run a 24-hour "speed run" to show how quickly a good model can be trained.
Personally, I think this release will have a big impact on the open-source image-generation ecosystem. It is the first time since Stable Diffusion that training know-how has been disclosed in this much detail.
Frequently Asked Questions (FAQ)
Q: When should REPA be removed?
A: After about 200K steps. It accelerates training at first but then actually hinders convergence, which the Photoroom experiments show clearly. Miss the timing and the final model's quality suffers.
Q: Should I use synthetic data or real images?
A: Use both. Start with synthetic images to learn global structure, then switch to real images later in training to capture high-frequency detail. With synthetic data alone, outputs never quite look photographic even when the FID is good.
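That two-phase curriculum is easy to express as a step-dependent data selector. A minimal sketch; the 60% switchover point is an illustrative assumption, not a number from the source:

```python
# Hypothetical two-phase data schedule: synthetic images early
# (global structure), real photos later (high-frequency detail).
# The 60% switchover fraction is a placeholder, not from the article.
def data_source(step: int, total_steps: int, switch_frac: float = 0.6) -> str:
    """Return which data pool to sample from at this training step."""
    return "synthetic" if step < switch_frac * total_steps else "real"
```

In practice a gradual mixing ratio (e.g. ramping the real-image probability up over a window of steps) is a common softer alternative to a hard switch.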
Q: How much better is the Muon optimizer than AdamW?
A: About 15% by FID, dropping from 18.2 to 15.55. Since the computational cost is similar, there is little reason not to use it, though hyperparameter tuning is somewhat trickier.
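For readers unfamiliar with Muon: its core idea is to approximately orthogonalize the momentum matrix with a quintic Newton-Schulz iteration before applying it as the weight update. Below is a NumPy sketch of just that orthogonalization step, using the coefficients from the public Muon write-up; it is not a complete optimizer, and the Photoroom article does not detail their exact implementation:

```python
import numpy as np

def newton_schulz(G, steps=5):
    """Approximately orthogonalize a 2D matrix via the quintic
    Newton-Schulz iteration used by Muon (coefficients from the
    public Muon write-up). Update-direction step only, not a full
    optimizer."""
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + 1e-7)   # Frobenius normalization
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

After five iterations the singular values of a reasonably conditioned input land near 1, i.e. the matrix is approximately orthogonal; Muon then scales this and applies it as the update in place of the raw momentum.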
If you found this article helpful, please subscribe to AI Digester.
References
- Training Design for Text-to-Image Models: Lessons from Ablations – Hugging Face (2026-02-03)