SeekUI: Predicting Visual Search Behavior on Graphical User Interfaces with a Reward-Augmented Vision Language Model

Abstract

Visual search is key to understanding and improving interaction with graphical user interfaces (GUIs), yet predicting scanpaths on real GUIs remains an open challenge. Unlike free-viewing, visual search is goal-driven and shaped by both linguistic and visual features of the GUI. State-of-the-art models of visual search, trained on natural images, fail on GUIs because they cannot capture the effects of grouping and semantics on search strategies. We present SeekUI, a reward-augmented Vision Language Model (VLM) that predicts scanpaths directly from a GUI screenshot and a text cue describing the desired target. Our model extends the capability of VLMs to reproduce human-like visual search behavior on GUIs and outperforms baseline models across different types of GUIs. Importantly, it reproduces key empirical phenomena established in eye-tracking studies of visual search, including the Guess–Scan–Confirm strategy. In sum, SeekUI provides a foundation for predicting visual search behavior and has potential for informing GUI evaluation and optimization.

Authors
Zixin Guo
Aalto University, Espoo, Finland
Yue Jiang
Aalto University, Espoo, Finland
Luis A. Leiva
University of Luxembourg, Esch-sur-Alzette, Luxembourg
Antti Oulasvirta
Aalto University, Helsinki, Finland

Conference: CHI 2026

ACM CHI Conference on Human Factors in Computing Systems

Session: Interactive Visualization for Model Inspection and Debugging

P1 - Room 131
7 presentations
2026-04-14, 20:15–21:45