A 235B-Parameter Model Has Upended the UI Automation Landscape
- Achieved SOTA with 78.5% on the ScreenSpot-Pro benchmark
- 10-20% performance improvement with Agentic Localization
- Accurately identifies small UI elements even in 4K high-resolution interfaces
What Happened?
H Company has released Holo2-235B-A22B, a model specializing in UI Localization (identifying the on-screen position of user interface elements).[Hugging Face] This 235B-parameter model accurately locates UI elements such as buttons, text fields, and links in screenshots.
The key is Agentic Localization technology. Instead of providing an answer in one go, it refines predictions over multiple steps. This allows it to accurately pinpoint small UI elements even on 4K high-resolution screens.[Hugging Face]
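For intuition, here is a minimal sketch of what such a coarse-to-fine loop could look like. This is my reconstruction, not H Company's published implementation: `predict_point` is a hypothetical stand-in for the actual model call, and the crop-and-zoom strategy is an assumption.

```python
from dataclasses import dataclass
from PIL import Image

@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int

def predict_point(region: Image.Image, instruction: str) -> tuple[int, int]:
    """Hypothetical model call: returns (x, y) relative to the given crop."""
    raise NotImplementedError  # stand-in for the actual Holo2 inference call

def agentic_localize(screen: Image.Image, instruction: str,
                     steps: int = 3, zoom: float = 0.25) -> tuple[int, int]:
    """Coarse-to-fine sketch: each step re-crops around the previous guess,
    so small elements occupy a larger share of the model's input."""
    region = Box(0, 0, screen.width, screen.height)
    x, y = screen.width // 2, screen.height // 2
    for _ in range(steps):
        crop = screen.crop((region.left, region.top, region.right, region.bottom))
        dx, dy = predict_point(crop, instruction)
        # Map the crop-relative prediction back to full-screen coordinates.
        x, y = region.left + dx, region.top + dy
        # Shrink the search window around the current guess for the next pass.
        half_w = max(1, int((region.right - region.left) * zoom))
        half_h = max(1, int((region.bottom - region.top) * zoom))
        region = Box(max(0, x - half_w), max(0, y - half_h),
                     min(screen.width, x + half_w), min(screen.height, y + half_h))
    return x, y
```

Each pass effectively zooms in, which is why the approach holds up on 4K screens where a small button might only span a few dozen pixels.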
Why is it Important?
The GUI agent field is heating up. Big tech players are shipping UI automation features in rapid succession, with products like Anthropic's Claude Computer Use and OpenAI's Operator. Yet a small startup, H Company, has taken the top spot on this field's benchmark.
Personally, I’m paying attention to the agentic approach. Existing models often failed because they tried to pinpoint the location in one go; refining predictions across multiple attempts has proven more effective, and the reported 10-20% improvement bears that out.
Frankly, 235B parameters is quite heavy. We’ll have to see how quickly it operates in a real production environment.
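For a rough sense of scale, some back-of-envelope arithmetic (my numbers, not official figures). The "A22B" suffix follows the common mixture-of-experts naming convention, which would suggest roughly 22B active parameters per token, but all 235B weights still have to sit in GPU memory:

```python
# Weights-only memory estimate for a 235B-parameter model.
# Ignores the KV cache and activations, which add substantially more.
params = 235e9
for precision, bytes_per_param in [("bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 2**30
    print(f"{precision}: ~{gib:,.0f} GiB of weights")
# bf16: ~438 GiB, int8: ~219 GiB, int4: ~109 GiB
# -> multiple 80 GB GPUs even with aggressive quantization
```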
What Will Happen in the Future?
As GUI agent competition intensifies, UI Localization accuracy is expected to become a key differentiator. Since the H Company model has been released as open source, other agent frameworks are likely to integrate it.
It could also impact the RPA (Robotic Process Automation) market. Existing RPA tools are largely rule-based; vision-based UI understanding could now become the standard.
Frequently Asked Questions (FAQ)
Q: What exactly is UI Localization?
A: It is the task of finding the exact pixel coordinates of a specific UI element (a button, input field, etc.) in a screenshot. Simply put, it’s the AI knowing where to click on the screen. It is a core technology for GUI automation agents.
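To make that concrete, here is what "knowing where to click" looks like as code. The `locate` function is a hypothetical wrapper around a localization model; `pyautogui` is a real library for programmatic mouse control:

```python
import pyautogui  # real library for screen capture and mouse control

def locate(screenshot_path: str, instruction: str) -> tuple[int, int]:
    """Hypothetical wrapper around a UI-localization model: screenshot plus
    a natural-language instruction in, pixel coordinates out."""
    raise NotImplementedError

screenshot = pyautogui.screenshot()  # capture the current screen
screenshot.save("screen.png")
x, y = locate("screen.png", "the blue 'Submit' button")
pyautogui.click(x, y)                # act on the model's prediction
```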
Q: How is it different from existing models?
A: Agentic Localization is key. Instead of trying to get it right in one go, it refines predictions over multiple steps. It’s similar to how a person scans the screen to find a target. This method has achieved a 10-20% performance improvement.
Q: Can I try the model myself?
A: It is publicly available on Hugging Face for research purposes. However, since it is a 235B parameter model, it requires significant GPU resources. It is more suitable for research or benchmarking purposes than for actual production applications.
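If you do want to experiment, loading it via the transformers pipeline should look roughly like this. The model id and chat format are my assumptions; check the actual model card on Hugging Face:

```python
from transformers import pipeline

# Repo name assumed from the announcement; verify on the model card.
pipe = pipeline(
    "image-text-to-text",              # vision-language task: image + prompt -> text
    model="Hcompany/Holo2-235B-A22B",  # hypothetical model id
    device_map="auto",                 # shard weights across available GPUs
    torch_dtype="auto",
)
messages = [{"role": "user", "content": [
    {"type": "image", "url": "screen.png"},
    {"type": "text", "text": "Where is the Submit button?"},
]}]
print(pipe(text=messages))
```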
If you found this article useful, please subscribe to AI Digester.
Reference Materials
- Introducing Holo2-235B-A22B: State-of-the-Art UI Localization – Hugging Face (2026-02-03)