Contemporary voice assistants require that objects of interest be specified in spoken commands. Of course, users are often looking directly at the object or place of interest – fine-grained, contextual information that is currently unused. We present WorldGaze, a software-only method for smartphones that provides the real-world gaze location of a user that voice agents can utilize for rapid, natural, and precise interactions. We achieve this by simultaneously opening the front and rear cameras of a smartphone. The front-facing camera is used to track the head in 3D, including estimating its direction vector. As the geometry of the front and back cameras is fixed and known, we can raycast the head vector into the 3D world scene as captured by the rear-facing camera. This allows the user to intuitively define an object or region of interest using their head gaze. We started our investigations with a qualitative exploration of competing methods, before developing a functional, real-time implementation. We conclude with an evaluation that shows WorldGaze can be quick and accurate, opening new multimodal gaze+voice interactions for mobile voice agents.
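The raycasting step described in the WorldGaze abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a pinhole rear camera, a rear camera frame that is the front camera frame rotated 180 degrees about the vertical axis, and a scene approximated by a single plane at a known depth. The names `Pinhole`, `front_to_rear`, `gaze_pixel`, `scene_depth`, and `rear_offset` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Pinhole:
    """Assumed pinhole intrinsics of the rear camera (pixels)."""
    fx: float
    fy: float
    cx: float
    cy: float


def front_to_rear(v: Vec3, offset: Vec3 = (0.0, 0.0, 0.0),
                  is_point: bool = True) -> Vec3:
    """Map a front-camera vector into the rear-camera frame.

    Assumption: the rear camera sits at `offset` (front-frame metres) and
    its axes are the front axes rotated 180 degrees about y, so x and z flip.
    Directions are rotated only; points are translated first.
    """
    x, y, z = v
    if is_point:
        x, y, z = x - offset[0], y - offset[1], z - offset[2]
    return (-x, y, -z)


def gaze_pixel(head_pos: Vec3, head_dir: Vec3, cam: Pinhole,
               scene_depth: float,
               rear_offset: Vec3 = (0.0, 0.0, 0.0)) -> Optional[Tuple[float, float]]:
    """Cast the head ray onto the plane z = scene_depth in the rear frame
    and project the hit point to rear-camera pixel coordinates."""
    p = front_to_rear(head_pos, rear_offset, is_point=True)
    d = front_to_rear(head_dir, is_point=False)
    if d[2] <= 0:
        return None  # the ray never travels into the rear camera's view
    t = (scene_depth - p[2]) / d[2]
    if t <= 0:
        return None  # the plane lies behind the head
    hit_x = p[0] + t * d[0]
    hit_y = p[1] + t * d[1]
    u = cam.fx * hit_x / scene_depth + cam.cx
    v = cam.fy * hit_y / scene_depth + cam.cy
    return (u, v)


if __name__ == "__main__":
    cam = Pinhole(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
    # Head 0.4 m in front of the screen, looking straight through the phone.
    print(gaze_pixel((0.0, 0.0, 0.4), (0.0, 0.0, -1.0), cam, scene_depth=2.0))
```

A real scene has varying depth, so a fuller version would intersect the ray with per-pixel depth or detected objects rather than one plane; the fixed plane only keeps the geometry easy to follow.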
We present GazeConduits, a calibration-free ad-hoc mobile interaction concept that enables users to collaboratively interact with tablets, other users, and content in a cross-device setting using gaze and touch input. GazeConduits leverages recently introduced smartphone capabilities to detect facial features and estimate users' gaze directions. To join a collaborative setting, users place one or more tablets onto a shared table and position their phone in the center, which then tracks users present as well as their gaze direction to determine the tablets they look at. We present a series of techniques using GazeConduits for collaborative interaction across mobile devices for content selection and manipulation. Our evaluation with 20 simultaneous tablets on a table shows that GazeConduits can reliably identify which tablet or collaborator a user is looking at.
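One way to picture how a centrally placed phone could map a gaze direction to a tablet or collaborator is a nearest-angle test in the table plane. The sketch below is an illustration under assumptions, not the GazeConduits method: positions are 2D coordinates relative to the phone, the gaze is already estimated as a 2D direction, and the 15-degree tolerance, `gaze_target`, and `angle_between` are hypothetical choices.

```python
import math
from typing import Dict, Optional, Tuple

Vec2 = Tuple[float, float]


def angle_between(a: Vec2, b: Vec2) -> float:
    """Unsigned angle in radians between two 2D vectors."""
    norm = math.hypot(*a) * math.hypot(*b)
    if norm == 0:
        return math.pi  # treat a degenerate vector as "not aligned"
    cos_angle = (a[0] * b[0] + a[1] * b[1]) / norm
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def gaze_target(user_pos: Vec2, gaze_dir: Vec2, targets: Dict[str, Vec2],
                max_angle_deg: float = 15.0) -> Optional[str]:
    """Return the target whose bearing from the user is closest to the
    gaze direction, or None if nothing lies within the tolerance."""
    best_name: Optional[str] = None
    best_angle = math.radians(max_angle_deg)
    for name, pos in targets.items():
        to_target = (pos[0] - user_pos[0], pos[1] - user_pos[1])
        angle = angle_between(gaze_dir, to_target)
        if angle <= best_angle:
            best_name, best_angle = name, angle
    return best_name


if __name__ == "__main__":
    targets = {"tablet_a": (-0.5, 0.0), "tablet_b": (0.5, 0.0)}
    print(gaze_target((0.0, -1.0), (0.5, 1.0), targets))
```

The tolerance trades off missed selections against accidental ones; with many tablets close together, a smaller value would be needed to keep neighbouring targets distinguishable.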
Eye movements provide insight into what parts of an image a viewer finds most salient, interesting, or relevant to the task at hand. Unfortunately, eye tracking data, a commonly-used proxy for attention, is cumbersome to collect. Here we explore an alternative: a comprehensive web-based toolbox for crowdsourcing visual attention. We draw from four main classes of attention-capturing methodologies in the literature. ZoomMaps is a novel zoom-based interface that captures viewing on a mobile phone. CodeCharts is a self-reporting methodology that records points of interest at precise viewing durations. ImportAnnots is an "annotation" tool for selecting important image regions, and cursor-based BubbleView lets viewers click to deblur a small area. We compare these methodologies using a common analysis framework in order to develop appropriate use cases for each interface. This toolbox and our analyses provide a blueprint for how to gather attention data at scale without an eye tracker.
Listening to text using read-aloud applications is a popular way for people to consume content when their visual attention is situationally impaired (e.g., commuting, walking, tired eyes). However, due to the linear nature of audio, such apps do not support skimming (a non-linear, rapid form of reading), which is essential for quickly grasping the gist and organization of difficult texts, like academic or professional documents. To support auditory skimming for situational impairments, we (1) identified the user needs and challenges in auditory skimming through a formative study (N=20), (2) derived the concept of "eyes-reduced" skimming that blends auditory and visual modes of reading, inspired by how participants mixed visual and non-visual interactions, (3) generated a set of design guidelines for eyes-reduced skimming, and (4) designed and evaluated a novel audio skimming app that embodies the guidelines. Our in-situ preliminary observation study (N=6) suggested that participants were positive about our design and were able to auditorily skim documents. We discuss design implications for eyes-reduced reading, read-aloud apps, and text-to-speech engines.
Subtitles play a crucial role in cross-lingual distribution of multimedia content and help communicate information where auditory content is not feasible (loud environments, hearing impairments, unknown languages). Established methods utilize text at the bottom of the screen, which may distract from the video. Alternative techniques place captions closer to related content (e.g., faces) but are not applicable to arbitrary videos such as documentaries. Hence, we propose to leverage live gaze as an indirect input method to adapt captions to individual viewing behavior. We implemented two gaze-adaptive methods and compared them in a user study (n=54) to traditional captions and audio-only videos. The results show that viewers with less experience with captions prefer our gaze-adaptive methods as they assist them in reading. Furthermore, gaze distributions resulting from our methods are closer to natural viewing behavior compared to the traditional approach. Based on these results, we provide design implications for gaze-adaptive captions.