UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction as Reasoning
Abstract
GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3\% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a 76\% relative performance improvement. In this paper, we introduce the "Instruction as Reasoning" paradigm, which treats instructions as dynamic analytical pathways that offer distinct perspectives, enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill multi-perspective reasoning, followed by reinforcement learning (RL) to optimize pathway selection and composition. Our resulting models, UI-Ins-7B and UI-Ins-32B, achieve state-of-the-art results on five challenging benchmarks and exhibit emergent reasoning, selectively composing and synthesizing novel instruction pathways at inference. In particular, UI-Ins-32B attains the best grounding accuracy, reaching 87.3\% on UI-I2E-Bench and 84.9\% on MMBench-GUI L2. Moreover, UI-Ins-7B delivers superior agent performance, achieving a 66.1\% success rate on AndroidWorld. All code, data, and models will be publicly released.