Gaze as Input

Conference Name
CHI 2026
GazeZoom: Exploration of Gaze-Assisted Multimodal Techniques for Panning and Zooming
Abstract

Zooming and panning are fundamental input actions for exploring complex 2D and 3D scenes and data such as images, maps, and designs. Multi-touch zoom/pan interactions have proven effective on mobile devices and have been directly ported to HMDs, where they are typically accomplished by analogous but relatively large-scale movements of both hands. We argue that such motions are inefficient and induce fatigue, and we explore how the eye-tracking features of HMDs can be leveraged to improve on them. We evaluated three interaction techniques that combine gaze with two-handed, one-handed, and head-based input in a study (N=24) that contrasts them against a baseline two-handed technique. The results indicate that gaze-assisted two- and one-handed techniques outperform the baseline (17%-36% faster), while our head-based technique achieves similar performance to the baseline but leaves the hands free for other tasks. We further developed a VR application demonstrating these techniques and validating their practical applicability.
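
To make the idea of gaze-assisted zooming concrete, below is a minimal, hypothetical sketch (not the paper's implementation) of zooming a 2D viewport about the gaze point so that the fixated content stays stationary while a hand or head gesture supplies only the zoom magnitude; the viewport model and all names are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation): zooming a 2D viewport about
# the user's gaze point so the fixated content stays put while another input
# supplies only the zoom magnitude. All names here are illustrative.

def zoom_about_gaze(view_center, scale, gaze_screen, screen_size, zoom_factor):
    """view_center: (x, y) of the viewport center in content coordinates.
    scale: content units per screen pixel.
    gaze_screen: (x, y) gaze position in screen pixels.
    zoom_factor: >1 zooms in, <1 zooms out."""
    # Content point currently under the gaze.
    gx = view_center[0] + (gaze_screen[0] - screen_size[0] / 2) * scale
    gy = view_center[1] + (gaze_screen[1] - screen_size[1] / 2) * scale
    new_scale = scale / zoom_factor
    # Re-center so the same content point stays under the gaze after zooming.
    cx = gx - (gaze_screen[0] - screen_size[0] / 2) * new_scale
    cy = gy - (gaze_screen[1] - screen_size[1] / 2) * new_scale
    return (cx, cy), new_scale

# Example: zoom in 2x while looking at the upper-left quadrant of a 1920x1080 view.
print(zoom_about_gaze((0.0, 0.0), 1.0, (480, 270), (1920, 1080), 2.0))
# -> ((-240.0, -135.0), 0.5)
```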

Authors
Yilong Lin
Southern University of Science and Technology, Shenzhen, China
Mingyu Han
KAIST, Daejeon, Korea, Republic of
Weitao Jiang
Southern University of Science and Technology, Shenzhen, China
Seungwoo Je
Southern University of Science and Technology, Shenzhen, China
Ian Oakley
KAIST, Daejeon, Korea, Republic of
The People's Gaze: Co-Designing and Refining Gaze Gestures with Users and Experts
Abstract

As eye-tracking becomes increasingly common in modern mobile devices, the potential for hands-free, gaze-based interaction grows, but current gesture sets are largely expert-designed and often misaligned with how users naturally move their eyes. To address this gap, we introduce a two-phase methodology for developing intuitive gaze gestures. First, four co-design workshops with 20 non-expert participants generated 102 initial concepts. Next, four gaze interaction experts reviewed and refined these into a set of 32 gestures. We found that non-experts, after a brief introduction, intuitively anchor gestures in familiar metaphors and develop a compositional grammar, i.e., activation (dwell) + action (gaze gesture or blink), to ensure intentionality and mitigate the classic Midas Touch problem. Experts prioritized gestures that are ergonomically sound, aligned with natural saccades, and reliably distinguishable. The resulting user-grounded, expert-validated gesture set, along with actionable design principles, provides a foundation for developing intuitive, hands-free interfaces for gaze-enabled devices.
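
As an illustration of this activation-plus-action grammar, the following hypothetical sketch (thresholds, state names, and the gesture vocabulary are assumptions, not the paper's gesture set) shows a dwell-armed state machine that only interprets a saccade or blink once the dwell completes, one common way of avoiding the Midas Touch problem.

```python
# Minimal sketch (illustrative placeholder values, not the paper's gesture
# set) of the activation + action grammar: a dwell arms gesture input, after
# which one directional saccade or a blink is read as the action; anything
# else times out, so free viewing never triggers commands.

DWELL_MS = 500          # dwell time required to arm gesture input
ACTION_WINDOW_MS = 800  # time allowed for the action once armed
DWELL_RADIUS = 1.0      # gaze jitter tolerated during the dwell (degrees)
SACCADE_DEG = 3.0       # minimum displacement counted as a gesture saccade

def _dist(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5

class GazeGestureFSM:
    def __init__(self):
        self.state, self.anchor = "IDLE", None
        self.dwell_start = self.armed_at = None

    def update(self, t_ms, gaze_xy, blink=False):
        """Feed one gaze sample; returns an action string or None."""
        if self.state == "IDLE":
            if self.anchor is None or _dist(gaze_xy, self.anchor) > DWELL_RADIUS:
                self.anchor, self.dwell_start = gaze_xy, t_ms  # gaze moved: restart dwell
            elif t_ms - self.dwell_start >= DWELL_MS:
                self.state, self.armed_at = "ARMED", t_ms      # activation complete
        elif self.state == "ARMED":
            if t_ms - self.armed_at > ACTION_WINDOW_MS:
                self.state = "IDLE"                            # timed out, no action
            elif blink:
                self.state = "IDLE"
                return "blink"
            elif _dist(gaze_xy, self.anchor) >= SACCADE_DEG:
                self.state = "IDLE"
                dx, dy = gaze_xy[0] - self.anchor[0], gaze_xy[1] - self.anchor[1]
                if abs(dx) > abs(dy):
                    return "right" if dx > 0 else "left"
                return "down" if dy > 0 else "up"
        return None
```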

Award
Honorable Mention
Authors
Yaxiong Lei
University of St Andrews, St Andrews, United Kingdom
Xinya Gong
University of St Andrews, Fife, United Kingdom
Shijing He
King's College London, London, United Kingdom
Yafei Wang
Dalian Maritime University, Dalian, Liaoning, China
Mohamed Khamis
University of Glasgow, Glasgow, United Kingdom
Juan Ye
University of St Andrews, St Andrews, United Kingdom
HiFiGaze: Improving Eye Tracking Accuracy Using Screen Content Knowledge
Abstract

We present a new and accurate approach for gaze estimation on consumer computing devices. We take advantage of continued strides in the quality of user-facing cameras found in, e.g., smartphones, laptops, and desktops (4K or greater in high-end devices), such that it is now possible to capture the 2D reflection of a device's screen in the user's eyes. This alone is insufficient for accurate gaze tracking due to the near-infinite variety of screen content. Crucially, however, the device knows what is being displayed on its own screen; in this work, we show that this information allows for robust segmentation of the reflection, the location and size of which encode the user's screen-relative gaze target. We explore several strategies to leverage this useful signal, quantifying performance in a user study. Our best-performing model reduces mean tracking error by ~18% compared to a baseline appearance-based model. A supplemental study reveals an additional 10-20% improvement if the gaze-tracking camera is located at the bottom of the device.
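
To convey the intuition in rough form, here is a deliberately simplified, hypothetical sketch: it assumes the screen reflection has already been segmented in the eye image and maps the pupil's position within that reflection to normalized screen coordinates. The linear mapping and the mirroring assumption are illustrative only; the paper's actual model is not reproduced here.

```python
# Highly simplified sketch of the intuition, not the paper's model: once the
# screen's reflection is segmented in the eye image, the pupil center's
# position within that reflection roughly indexes the screen-relative gaze
# target. A real system would calibrate this mapping per user and camera.

def gaze_from_reflection(reflection_box, pupil_xy):
    """reflection_box: (x, y, w, h) of the segmented screen reflection in
    eye-image pixels; pupil_xy: pupil center in the same image.
    Returns normalized (u, v) screen coordinates in [0, 1]."""
    x, y, w, h = reflection_box
    u = (pupil_xy[0] - x) / w
    v = (pupil_xy[1] - y) / h
    u = 1.0 - u  # assumption: the corneal reflection is horizontally mirrored
    clamp = lambda a: max(0.0, min(1.0, a))
    return clamp(u), clamp(v)

# Example: pupil slightly left of the reflection's center.
print(gaze_from_reflection((100, 80, 40, 25), (118, 92)))  # -> (0.55, 0.48)
```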

Award
Honorable Mention
Authors
Taejun Kim
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Vimal Mollyn
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Riku Arakawa
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Chris Harrison
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Gaze and Speech in Multimodal Human-Computer Interaction: A Scoping Review
Abstract

Multimodal interaction has long promised to make interfaces more intuitive and effective by combining complementary inputs. Among these, gaze and speech form a compelling pairing: gaze provides rapid spatial grounding, while speech conveys rich semantic information. Together, they offer rich cues for understanding user behaviour and intent. Yet despite decades of exploration, the research remains fragmented, making this synthesis timely as these inputs mature and are integrated into consumer-ready devices. This scoping review examined 103 studies published between 1991 and 2025, organised into "explicit", where users intentionally provide gaze and speech, and "implicit", where systems leverage users' natural behaviours to support interaction. Across both, we identified recurring ways of combining gaze and speech to resolve ambiguity, ground references, and support adaptivity. We contribute a synthesis of research on their combined use while highlighting challenges of temporal alignment, fusion and privacy, offering guidance for future research toward richer multimodal human-computer interaction.
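
One recurring explicit combination, resolving a spoken deictic reference with temporally aligned gaze, can be sketched as follows; the window size, data shapes, and function names are illustrative assumptions rather than any surveyed system's design.

```python
# Illustrative sketch of resolving a spoken deictic reference ("delete that")
# by looking up which object was fixated in a short window around the word's
# timestamp. Window size and data shapes are assumptions.

from collections import Counter

def resolve_deixis(word_time, gaze_trace, window=0.4):
    """gaze_trace: list of (timestamp_sec, object_id_or_None) fixation samples.
    Returns the object fixated most often within +/- window of word_time."""
    hits = Counter(obj for t, obj in gaze_trace
                   if obj is not None and abs(t - word_time) <= window)
    return hits.most_common(1)[0][0] if hits else None

# Example: "that" uttered at t=2.1 s while the gaze dwelt mostly on "lamp".
trace = [(1.9, "lamp"), (2.0, "lamp"), (2.1, "lamp"), (2.3, "table"), (2.6, None)]
print(resolve_deixis(2.1, trace))  # -> "lamp"
```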

Authors
Anam Ahmad Khan
KAIST, Daejeon, Korea, Republic of
Florian Weidner
Glasgow University, Glasgow, United Kingdom
Jungwoo Rhee
KAIST, Daejeon, Korea, Republic of
Yasmeen Abdrabou
Technical University of Munich, München, Germany
Andrea Bianchi
KAIST, Daejeon, Korea, Republic of
Eduardo Velloso
The University of Sydney, Sydney, New South Wales, Australia
Hans Gellersen
Lancaster University, Lancaster, United Kingdom
Joshua Newn
RMIT University, Melbourne, VIC, Australia
Eyes on Many: Evaluating Gaze, Hand, and Voice for Multi-Object Selection in Extended Reality
Abstract

Interacting with multiple objects simultaneously makes us fast. A pre-step to this interaction is to select the objects, i.e., multi-object selection, which is enabled through two steps: (1) toggling multi-selection mode (mode-switching) and then (2) selecting all the intended objects (subselection). In extended reality (XR), each step can be performed with the eyes, hands, and voice. To examine how design choices affect user performance, we evaluated four mode-switching (Semi-Pinch, Full-Pinch, Double-Pinch, and Voice) and three subselection techniques (Gaze+Dwell, Gaze+Pinch, and Gaze+Voice) in a user study. Results revealed that while Double-Pinch paired with Gaze+Pinch yielded the highest overall performance, Semi-Pinch achieved the lowest performance. Although Voice-based mode-switching showed benefits, Gaze+Voice subselection was less favored, as the required repetitive vocal commands were perceived as tedious. Overall, these findings provide empirical insights and inform design recommendations for multi-selection techniques in XR.
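
The two-step structure (mode-switching, then subselection) can be sketched as a small, hypothetical state machine; the event names and the confirm-on-exit behaviour below are assumptions for illustration, not the study's implementation.

```python
# Minimal sketch (not the study's implementation): a mode switch arms
# multi-selection, then each gaze+pinch adds the currently fixated object
# to the selection set. Event names are illustrative placeholders.

class MultiSelect:
    def __init__(self):
        self.multi_mode = False
        self.selection = set()

    def on_event(self, event, gazed_object=None):
        if event == "double_pinch":            # mode-switching
            self.multi_mode = not self.multi_mode
            if not self.multi_mode:
                done, self.selection = self.selection, set()
                return done                    # exiting the mode confirms the set
        elif event == "pinch":                 # subselection via Gaze+Pinch
            if self.multi_mode and gazed_object is not None:
                self.selection.add(gazed_object)
            elif gazed_object is not None:
                return {gazed_object}          # ordinary single selection
        return None

# Example: select two objects, then confirm by toggling the mode off.
ms = MultiSelect()
ms.on_event("double_pinch")
ms.on_event("pinch", "cube_1")
ms.on_event("pinch", "sphere_2")
print(ms.on_event("double_pinch"))  # -> {'cube_1', 'sphere_2'}
```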

Authors
Mohammad Raihanul Bashar
Concordia University, Montreal, Quebec, Canada
Aunnoy K Mutasim
Simon Fraser University, Vancouver, British Columbia, Canada
Ken Pfeuffer
Aarhus University, Aarhus, Denmark
Anil Ufuk Batmaz
Concordia University, Montreal, Quebec, Canada
Understanding Gaze-Based Identification in VR Through Preattentive Processing and Binocular Rivalry
Abstract

Stimulus-evoked gaze dynamics offer a secure and hands-free signal in virtual reality (VR), yet the underlying design space of effective visual stimuli remains poorly understood. This work examines how preattentive processing and binocular rivalry can inform stimulus design for gaze-based identification in VR. We conducted a two-part study: (1) a feasibility assessment of closed-set identification performance with 26 participants and 44,928 gaze samples collected by using a commercial headset (Meta Quest Pro), and (2) a usability study with 16 participants comparing the same interaction in a login context to PIN and out-of-band methods as a potential authentication technique. Our findings confirm the feasibility of personal identification, highlight usability advantages, and reveal participants’ desire for greater transparency to understand individual variations in login results. Together, these results offer conceptual insights into the perceptual mechanisms shaping stimulus-evoked gaze behavior, and outline design implications for future VR authentication workflows.
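
As a rough illustration of closed-set identification from stimulus-evoked gaze features, here is a hypothetical nearest-centroid sketch; the features, toy data, and classifier are placeholders and not the paper's pipeline.

```python
# Illustrative sketch only: closed-set identification from stimulus-evoked
# gaze features via a nearest-centroid classifier. Feature extraction, data,
# and distance metric are placeholders; the paper's pipeline may differ.

import numpy as np

def enroll(samples_by_user):
    """samples_by_user: {user_id: array of shape (n_trials, n_features)}.
    Returns one mean feature vector (centroid) per enrolled user."""
    return {uid: feats.mean(axis=0) for uid, feats in samples_by_user.items()}

def identify(centroids, query):
    """Return the enrolled user whose centroid is closest to the query trial."""
    return min(centroids, key=lambda uid: np.linalg.norm(centroids[uid] - query))

# Toy example with 3-dimensional gaze features (e.g., latency, amplitude, dwell).
rng = np.random.default_rng(0)
enrolled = {u: rng.normal(loc=i, scale=0.1, size=(10, 3))
            for i, u in enumerate(["A", "B", "C"])}
centroids = enroll(enrolled)
print(identify(centroids, rng.normal(loc=1, scale=0.1, size=3)))  # likely "B"
```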

Authors
Junryeol Jeon
Gwangju Institute of Science and Technology, Gwangju, Korea, Republic of
Yeo-Gyeong Noh
Gwangju Institute of Science and Technology, Gwangju, Korea, Republic of
JinYoung Yoo
Gwangju Institute of Science and Technology, Gwangju, Korea, Republic of
Jin-Hyuk Hong
Gwangju Institute of Science and Technology, Gwangju, Korea, Republic of