Visual search is key to understanding and improving interaction with graphical user interfaces (GUIs), yet predicting scanpaths on real GUIs remains an open challenge. Unlike free-viewing, visual search is goal-driven and shaped by both linguistic and visual features of the GUI. State-of-the-art models of visual search, trained on natural images, fail on GUIs because they cannot capture the effects of grouping and semantics on search strategies. We present \textsc{SeekUI}, a reward-augmented Vision Language Model (VLM) that predicts scanpaths directly from a GUI screenshot and a text cue describing the desired target. Our model extends the capability of VLMs to reproduce human-like visual search behavior on GUIs and outperforms baseline models across different types of GUIs. Importantly, it reproduces key empirical phenomena established in eye-tracking studies of visual search, including the Guess–Scan–Confirm strategy. In sum, \textsc{SeekUI} provides a foundation for predicting visual search behavior and has the potential to inform GUI evaluation and optimization.