Multimodal interaction has long promised to make interfaces more intuitive and effective by combining complementary inputs. Among these, gaze and speech form a compelling pairing: gaze provides rapid spatial grounding, while speech conveys rich semantic information. Together, they offer strong cues for understanding user behaviour and intent. Yet despite decades of exploration, research on their combined use remains fragmented; as both inputs mature and are integrated into consumer-ready devices, a synthesis is timely. This scoping review examined 103 studies published between 1991 and 2025, organised into \emph{explicit} interactions, where users intentionally provide gaze and speech as input, and \emph{implicit} interactions, where systems leverage users' natural gaze and speech behaviour to support interaction. Across both, we identified recurring strategies for combining gaze and speech to resolve ambiguity, ground references, and support adaptivity. We contribute a synthesis of research on their combined use, highlight open challenges of temporal alignment, fusion, and privacy, and offer guidance for future research toward richer multimodal human-computer interaction.