235B Parameter Model Revolutionizes UI Automation
- Achieves SOTA with 78.5% on ScreenSpot-Pro benchmark
- Agentic localization improves accuracy by 10-20%
- Accurately locates small UI elements even on 4K high-resolution interfaces
What Happened?
H Company released Holo2-235B-A22B, a specialized model for UI Localization (identifying the position of user interface elements). [Hugging Face] This 235B-parameter model finds the exact location of UI elements like buttons, text fields, and links in screenshots.
The key is its Agentic Localization technique. Instead of committing to a coordinate in a single pass, the model refines its prediction across multiple steps, which lets it pinpoint even small UI elements on 4K high-resolution screens. [Hugging Face]
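To make that concrete, below is a minimal sketch of one way such a refinement loop can work: predict a click point, zoom into a crop around that guess, and predict again at a larger effective scale. The `predict_click` callable, crop sizes, and zoom schedule are illustrative assumptions, not the published Holo2 recipe.

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class Box:
    """A crop region in full-image pixel coordinates."""
    x: int
    y: int
    w: int
    h: int

def agentic_localize(image, instruction, predict_click, steps=3, zoom=0.4):
    """Refine a click prediction by repeatedly zooming into the region
    around the previous guess. `predict_click(crop, instruction)` stands
    in for a single model call returning (x, y) within the crop."""
    full_w, full_h = image.size
    region = Box(0, 0, full_w, full_h)
    for _ in range(steps):
        crop = image.crop((region.x, region.y,
                           region.x + region.w, region.y + region.h))
        cx, cy = predict_click(crop, instruction)
        # Map crop-local coordinates back to full-image coordinates.
        gx, gy = region.x + cx, region.y + cy
        # Next crop: a smaller window centered on the current guess, so the
        # model sees the target at a larger effective scale each step.
        nw = max(int(region.w * zoom), 64)
        nh = max(int(region.h * zoom), 64)
        region = Box(min(max(gx - nw // 2, 0), max(full_w - nw, 0)),
                     min(max(gy - nh // 2, 0), max(full_h - nh, 0)),
                     nw, nh)
    return gx, gy

if __name__ == "__main__":
    img = Image.new("RGB", (3840, 2160))  # stand-in for a 4K screenshot
    stub = lambda crop, _: (crop.size[0] // 2, crop.size[1] // 2)  # stub model
    print(agentic_localize(img, "click the save icon", stub))
```

The payoff of this structure is that the vision encoder never has to resolve a tiny widget against a full 4K frame; each zoom step spends the model's input resolution on a smaller region.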
Why Is It Important?
The GUI agent field is heating up. Big tech is racing to ship UI automation features such as Anthropic's Claude Computer Use and OpenAI's Operator. Yet H Company, a small startup, has taken the top spot on this benchmark.
What I personally find noteworthy is the agentic approach. Traditional models often failed when trying to pinpoint a position in a single pass, but refining the prediction over multiple attempts proved effective. The 10-20% performance improvement bears this out.
Frankly, 235B parameters is quite heavy, even if the "A22B" in the name suggests a mixture-of-experts design with roughly 22B parameters active per token. How fast it runs in actual production environments remains to be seen.
What Will Happen in the Future?
As GUI agent competition intensifies, UI localization accuracy is expected to become a key differentiating factor. Since H Company released the model as open source, it is likely to be integrated into other agent frameworks.
It could also impact the RPA (robotic process automation) market. Traditional RPA tools were rule-based, but now vision-based UI understanding could become the standard.
Frequently Asked Questions (FAQ)
Q: What exactly is UI Localization?
A: It is a technology that looks at a screenshot and finds the exact coordinates of specific UI elements (buttons, input fields, etc.). Simply put, it is AI knowing where to click when looking at a screen. This is a core technology for GUI automation agents.
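In code terms, the contract is simple: a screenshot plus a natural-language instruction go in, pixel coordinates come out. The `locate_element` function below is a hypothetical stand-in for a model call, included only to show that input/output shape.

```python
from PIL import Image

def locate_element(screenshot, instruction):
    """Hypothetical stand-in for a UI localization model call;
    a real model returns the pixel coordinates of the element
    described by `instruction`."""
    return (1912, 84)  # placeholder coordinates

screenshot = Image.new("RGB", (3840, 2160))  # stand-in for a real capture
x, y = locate_element(screenshot, "Click the 'Export as PDF' button")
print(f"Click target: ({x}, {y})")
# A GUI agent would then dispatch the click, e.g. pyautogui.click(x, y).
```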
Q: What is different from existing models?
A: Agentic localization is the key difference. Instead of trying to get the answer right in one shot, the model refines its prediction over multiple steps, much like a person scanning a screen and homing in on a target. This approach yielded the 10-20% performance improvement.
Q: Can I use the model directly?
A: It is publicly available on Hugging Face for research. However, as a 235B-parameter model it requires substantial GPU resources, so it is better suited to research and benchmarking than to production deployment.
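If you do want to experiment, the sketch below shows one way to load it, assuming the checkpoint exposes the standard transformers vision-language interface with a chat template; the repo id, prompt wording, and message format are assumptions, so check the actual model card. Running a 235B model this way takes multiple high-memory GPUs or a quantized variant.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

MODEL_ID = "Hcompany/Holo2-235B-A22B"  # assumed repo id; verify on Hugging Face

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # shards the weights across available GPUs
)

image = Image.open("screenshot.png")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Return the click coordinates of the 'Save' button."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output[0], skip_special_tokens=True))
```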
If you found this article useful, please subscribe to AI Digester.
References
- Introducing Holo2-235B-A22B: State-of-the-Art UI Localization – Hugging Face (2026-02-03)