Newton-Schulz Orthogonalization, 3.4x FLOPs Reduction
- Collapsed Newton-Schulz iterations into a single operation, reducing FLOPs by 3.4x
- Reduced orthogonalization error by 96x (3.838 → 0.040)
- Stabilized convergence via learnable coefficients
What happened?
A Chinese research team has announced UNSO, a new unified framework for Newton-Schulz orthogonalization. [arXiv] It replaces the usual NS iterations with a single polynomial operation. The key, per the authors, is to "remove meaningless terms and introduce learnable coefficients."
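For context, here is a minimal sketch of the iterative baseline that UNSO replaces. The quintic coefficients below are the fixed ones popularized by the Muon reference implementation; they are not UNSO's learned coefficients.

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Iterative Newton-Schulz orthogonalization (the baseline UNSO replaces).

    Quintic coefficients as popularized by the Muon reference code.
    Each step costs several matmuls, and total cost grows linearly
    with `steps`; UNSO collapses this loop into a single polynomial pass.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)        # scale so the spectral norm is <= ~1
    transposed = G.size(0) > G.size(1)
    if transposed:                   # iterate in the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```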
On a 128×512 test matrix, FLOPs fell from 2.533×10^8 to 8.831×10^7. [arXiv]
Why is it important?
Muon is gaining attention as a replacement for AdamW. It trained GPT-2 XL for $175 and is used in Kimi K2. [Keller Jordan] But its Newton-Schulz iteration is a computational bottleneck.
UNSO removes this bottleneck. It doesn't just reduce the number of iterations; it eliminates the loop entirely, replacing it with a single operation. Orthogonalization error also drops by 96x. Error accumulation is a known cause of training instability, and UNSO addresses it directly.
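To make the 96x figure concrete: for a semi-orthogonal matrix, X Xᵀ should equal the identity, so a natural error metric is the Frobenius norm of X Xᵀ − I. The snippet below uses that metric; whether the paper uses exactly this norm is our assumption.

```python
import torch

def orthogonality_error(X: torch.Tensor) -> float:
    """Frobenius distance of X @ X.T from the identity (0 = semi-orthogonal)."""
    I = torch.eye(X.size(0), dtype=X.dtype, device=X.device)
    return torch.linalg.norm(X @ X.T - I).item()

G = torch.randn(128, 512)            # same shape as the paper's benchmark
X = newton_schulz5(G)                # iterative baseline from the sketch above
print(orthogonality_error(X))        # residual after 5 iterations
```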
What will happen in the future?
Muon is officially included in PyTorch 2.10, [PyTorch] and NVIDIA NeMo supports it as well. [NVIDIA] UNSO is likely to be adopted quickly.
Frequently Asked Questions (FAQ)
Q: Can I use UNSO now?
A: Yes. The authors have released the code on GitHub, and it can replace the existing Muon orthogonalization step in a PyTorch environment. That said, benchmark it on your own workload before using it in production. A hypothetical sketch of the shape of the change follows.
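We have not inspected the released code, so the following is only a hypothetical sketch of what a single-pass, learnable-coefficient replacement could look like; the function name, polynomial degree, and coefficient parameterization are all placeholders, not UNSO's actual design.

```python
import torch

def unso_orthogonalize(G: torch.Tensor, coeffs: torch.Tensor) -> torch.Tensor:
    """Hypothetical single-pass polynomial orthogonalization.

    One polynomial in A = X @ X.T applied to X, with `coeffs` as the
    learnable coefficients (e.g. an nn.Parameter trained alongside the
    model). Degree and parameterization here are placeholders.
    """
    X = G / (G.norm() + 1e-7)
    A = X @ X.T
    a, b, c = coeffs                 # learnable, unlike fixed NS constants
    return a * X + b * (A @ X) + c * (A @ (A @ X))
```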
Q: Should I use Muon instead of AdamW?
A: It depends on the setup. Muon applies only to hidden-layer weight matrices; embedding and output layers still need AdamW. Running the two optimizers side by side is the standard practice, as sketched below.
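A minimal sketch of that split on a toy model; the hyperparameters are illustrative, and `Muon` stands in for whichever implementation you use (reference repo, PyTorch, or NeMo), whose exact signatures differ.

```python
import torch
import torch.nn as nn

model = nn.Sequential(               # toy stand-in for a transformer
    nn.Embedding(1000, 64),          # "0." -> embedding
    nn.Linear(64, 64),               # "1." -> hidden layer
    nn.Linear(64, 1000),             # "2." -> output head
)

# Muon gets 2D hidden-layer weights; AdamW gets embeddings, the output
# head, and biases. The name-based filter below is illustrative.
muon_params, adamw_params = [], []
for name, p in model.named_parameters():
    if p.ndim == 2 and not name.startswith(("0.", "2.")):
        muon_params.append(p)
    else:
        adamw_params.append(p)

# opt_muon = Muon(muon_params, lr=0.02, momentum=0.95)   # impl-specific
opt_adamw = torch.optim.AdamW(adamw_params, lr=3e-4, weight_decay=0.01)
# Per step: loss.backward(); opt_muon.step(); opt_adamw.step()
```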
Q: How much does the actual training time decrease?
A: The orthogonalization stage itself becomes 3.4x faster. The end-to-end gain depends on what fraction of a training step orthogonalization occupies: the larger that fraction, the bigger UNSO's benefit. A back-of-the-envelope estimate follows.
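Amdahl's law gives a quick estimate; the 20% share assumed below is our own illustrative number, not from the paper.

```python
# If orthogonalization is fraction f of a step and UNSO speeds it up by s,
# the end-to-end speedup is 1 / (1 - f + f / s).
f, s = 0.20, 3.4                     # f is assumed; s = 3.4x from the paper
speedup = 1 / (1 - f + f / s)
print(f"{speedup:.2f}x")             # ~1.16x end-to-end at a 20% share
```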
If this article was helpful, please subscribe to AI Digester.
Reference Materials
- UNSO: Unified Newton Schulz Orthogonalization – arXiv (2026-02-04)
- Muon: An optimizer for hidden layers – Keller Jordan Blog (2025-01-15)
- Muon Optimizer – PyTorch Documentation (2026-01-20)