Eye and Face

Conference Name
CHI 2024
EyeEcho: Continuous and Low-power Facial Expression Tracking on Glasses
Abstract

In this paper, we introduce EyeEcho, a minimally obtrusive acoustic sensing system designed to enable glasses to continuously monitor facial expressions. It utilizes two pairs of speakers and microphones mounted on glasses to emit encoded inaudible acoustic signals directed towards the face, capturing subtle skin deformations associated with facial expressions. The reflected signals are processed through a customized machine-learning pipeline to estimate full facial movements. EyeEcho samples at 83.3 Hz with a relatively low power consumption of 167 mW. Our user study involving 12 participants demonstrates that, with just four minutes of training data, EyeEcho achieves highly accurate tracking performance across different real-world scenarios, including sitting, walking, and after remounting the devices. Additionally, a semi-in-the-wild study involving 10 participants further validates EyeEcho's performance in naturalistic scenarios while participants engage in various daily activities. Finally, we showcase EyeEcho's potential to be deployed on a commercial off-the-shelf (COTS) smartphone, offering real-time facial expression tracking.
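
As a rough illustration of the kind of signal processing such a pipeline involves, the sketch below correlates received microphone frames with a transmitted inaudible chirp to form an echo profile whose frame-to-frame changes would reflect skin deformation. The chirp parameters, frame length, and function names are illustrative assumptions, not EyeEcho's actual encoding or machine-learning pipeline.

```python
import numpy as np

def echo_profile(tx_chirp: np.ndarray, rx_frames: np.ndarray) -> np.ndarray:
    """Correlate each received frame with the transmitted chirp.

    tx_chirp:  (L,) encoded inaudible excitation signal (illustrative).
    rx_frames: (T, N) microphone samples, one row per frame.
    Returns an echo profile of shape (T, N - L + 1).
    """
    return np.stack([np.correlate(frame, tx_chirp, mode="valid")
                     for frame in rx_frames])

# Illustrative usage: a synthetic 18-20 kHz chirp at 50 kHz sampling with
# 600-sample (12 ms) frames, i.e. roughly the 83.3 Hz frame rate cited above.
fs, chirp_len, frame_len = 50_000, 256, 600
n = np.arange(chirp_len)
f0, f1 = 18_000, 20_000
tx = np.sin(2 * np.pi * (f0 * n + (f1 - f0) * n**2 / (2 * chirp_len)) / fs)
rx = np.random.randn(10, frame_len)   # stand-in for real microphone data
print(echo_profile(tx, rx).shape)     # (10, 345)
```

A learned regression model, such as the customized pipeline the abstract describes, would then map features of this kind to full facial movements.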

Authors
Ke Li
Cornell University, Ithaca, New York, United States
Ruidong Zhang
Cornell University, Ithaca, New York, United States
Siyuan Chen
Cornell University, Ithaca, New York, United States
Boao Chen
Cornell University, Ithaca, New York, United States
Mose Sakashita
Cornell University, Ithaca, New York, United States
Francois Guimbretiere
Cornell University, Ithaca, New York, United States
Cheng Zhang
Cornell University, Ithaca, New York, United States
Paper URL

doi.org/10.1145/3613904.3642613

Video
Uncovering and Addressing Blink-Related Challenges in Using Eye Tracking for Interactive Systems
Abstract

Currently, interactive systems use physiological sensing to enable advanced functionalities. While eye tracking is a promising means to understand the user, eye tracking data inherently suffers from missing data due to blinks, which may result in reduced system performance. We conducted a literature review to understand how researchers deal with this issue. We found that researchers often implement use-case-specific pipelines to overcome it, ranging from ignoring missing data to artificial interpolation. With these first insights, we ran a large-scale analysis on 11 publicly available datasets to understand the impact of the various approaches on data quality and accuracy. In doing so, we highlight the pitfalls in data processing and identify which methods work best. Based on our results, we provide guidelines for handling eye tracking data for interactive systems. Further, we propose a standard data processing pipeline that allows researchers and practitioners to pre-process and standardize their data efficiently.
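
As one concrete example of the kind of pre-processing choice the review surveys, the sketch below interpolates only short, blink-length gaps in a gaze signal and leaves longer dropouts as missing. The 0.5-second threshold, sampling rate, and function name are illustrative assumptions rather than the guidelines' recommended values.

```python
import numpy as np
import pandas as pd

def fill_blink_gaps(gaze: pd.Series, fs: float, max_gap_s: float = 0.5) -> pd.Series:
    """Linearly interpolate NaN runs no longer than max_gap_s seconds.

    Longer dropouts (e.g., track loss rather than blinks) stay as NaN.
    """
    max_gap = int(round(max_gap_s * fs))
    is_nan = gaze.isna()
    run_id = (is_nan != is_nan.shift()).cumsum()        # label contiguous runs
    run_len = is_nan.groupby(run_id).transform("sum")   # length of each NaN run
    fillable = is_nan & (run_len <= max_gap)
    filled = gaze.interpolate(method="linear", limit_direction="both")
    return gaze.where(~fillable, filled)

# Illustrative usage: a 120 Hz horizontal gaze coordinate with a short blink gap.
x = pd.Series([0.40, 0.41, np.nan, np.nan, np.nan, 0.44, 0.45])
print(fill_blink_gaps(x, fs=120).round(3).tolist())
```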

Authors
Jesse W. Grootjen
LMU Munich, Munich, Germany
Henrike Weingärtner
LMU Munich, Munich, Germany
Sven Mayer
LMU Munich, Munich, Germany
Paper URL

doi.org/10.1145/3613904.3642086

Video
MELDER: The Design and Evaluation of a Real-time Silent Speech Recognizer for Mobile Devices
Abstract

Silent speech is unaffected by ambient noise, increases accessibility, and enhances privacy and security. Yet current silent speech recognizers operate in a phrase-in/phrase-out manner and are thus slow, error-prone, and impractical for mobile devices. We present MELDER, a Mobile Lip Reader that operates in real time by splitting the input video into smaller temporal segments and processing them individually. An experiment revealed that this substantially improves computation time, making it suitable for mobile devices. We further optimize the model for everyday use by exploiting knowledge from a high-resource vocabulary with a transfer-learning model. We then compare MELDER in both stationary and mobile settings with two state-of-the-art silent speech recognizers, where MELDER demonstrated superior overall performance. Finally, we compare two visual feedback methods of MELDER with the visual feedback method of Google Assistant. The outcomes shed light on how these proposed feedback methods influence users' perceptions of the model's performance.
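
The sketch below illustrates the general idea of decoding overlapping temporal segments as they become available instead of waiting for a full phrase; the segment length, stride, and recognizer interface are hypothetical placeholders, not MELDER's actual segmentation policy or model.

```python
from typing import Callable, Iterator, Sequence

def temporal_segments(frames: Sequence, seg_len: int = 20,
                      stride: int = 10) -> Iterator[Sequence]:
    """Yield overlapping windows of lip frames for incremental decoding."""
    for start in range(0, max(len(frames) - seg_len + 1, 1), stride):
        yield frames[start:start + seg_len]

def decode_streaming(frames: Sequence, recognizer: Callable) -> str:
    """Run a (hypothetical) segment-level recognizer on each window
    instead of a single pass over the whole utterance."""
    partials = [recognizer(segment) for segment in temporal_segments(frames)]
    return " ".join(p for p in partials if p)

# Illustrative usage with a dummy recognizer standing in for the lip-reading model.
dummy_frames = list(range(55))
print(decode_streaming(dummy_frames, lambda seg: f"<{len(seg)} frames>"))
```

Because each window is processed as soon as it is filled, partial transcripts can be surfaced while the user is still speaking, which is the property that makes the approach attractive on mobile hardware.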

Authors
Laxmi Pandey
University of California, Merced, Merced, California, United States
Ahmed Sabbir Arif
University of California, Merced, Merced, California, United States
Paper URL

doi.org/10.1145/3613904.3642348

Video
ReHEarSSE: Recognizing Hidden-in-the-Ear Silently Spelled Expressions
Abstract

Silent speech interaction (SSI) allows users to discreetly input text without using their hands. Existing wearable SSI systems typically require custom devices and support only a small lexicon, restricting their utility to a small set of command words. This work proposes ReHEarSSE, an earbud-based ultrasonic SSI system capable of generalizing to words that do not appear in its training dataset, providing support for nearly an entire dictionary's worth of words. As a user silently spells words, ReHEarSSE uses autoregressive features to identify subtle changes in ear canal shape. ReHEarSSE infers words using a deep learning model trained to optimize connectionist temporal classification (CTC) loss with an intermediate embedding that accounts for different letters and transitions between them. We find that ReHEarSSE recognizes 100 unseen words with an accuracy of 89.3%.
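
The sketch below shows a minimal letter-level CTC training setup of the kind the abstract describes, using a generic recurrent backbone over pre-extracted ultrasonic features. The feature dimensions and architecture are illustrative assumptions, and the paper's intermediate embedding for letters and their transitions is not reproduced here.

```python
import torch
import torch.nn as nn

NUM_LETTERS = 26   # a-z; spelled words are sequences of letter indices 1..26
BLANK = 0          # CTC blank index

class SpellingModel(nn.Module):
    """Toy per-frame letter classifier over ultrasonic features (illustrative)."""
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, NUM_LETTERS + 1)

    def forward(self, x):                        # x: (batch, time, feat_dim)
        h, _ = self.rnn(x)
        return self.head(h).log_softmax(dim=-1)  # (batch, time, classes)

model = SpellingModel()
ctc = nn.CTCLoss(blank=BLANK, zero_infinity=True)

# Dummy batch: 2 recordings, 120 feature frames each, spelling 5-letter words.
feats = torch.randn(2, 120, 64)
targets = torch.randint(1, NUM_LETTERS + 1, (2, 5))
log_probs = model(feats).permute(1, 0, 2)        # CTC expects (time, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((2,), 120),
           target_lengths=torch.full((2,), 5))
loss.backward()
print(float(loss))
```

Because CTC marginalizes over alignments between frames and letters, a decoder trained this way can emit letter sequences it never saw as whole words, which is what enables the open-vocabulary spelling behaviour described above.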

Authors
Xuefu Dong
The University of Tokyo, Tokyo, Japan
Yifei Chen
Tsinghua University, Beijing, China
Yuuki Nishiyama
The University of Tokyo, Tokyo, Japan
Kaoru Sezaki
The University of Tokyo, Tokyo, Japan
Yuntao Wang
Tsinghua University, Beijing, China
Ken Christofferson
University of Toronto, Toronto, Ontario, Canada
Alex Mariakakis
University of Toronto, Toronto, Ontario, Canada
Paper URL

doi.org/10.1145/3613904.3642095

Video
Watch Your Mouth: Silent Speech Recognition with Depth Sensing
Abstract

Silent speech recognition is a promising technology that decodes human speech without requiring audio signals, enabling private human-computer interactions. In this paper, we propose Watch Your Mouth, a novel method that leverages depth sensing to enable accurate silent speech recognition. By leveraging depth information, our method provides unique resilience against environmental factors such as variations in lighting and device orientations, while further addressing privacy concerns by eliminating the need for sensitive RGB data. We started by building a deep-learning model that locates lips using depth data. We then designed a deep-learning pipeline to efficiently learn from point clouds and translate lip movements into commands and sentences. We evaluated our technique and found it effective across diverse sensor locations: On-Head, On-Wrist, and In-Environment. Watch Your Mouth outperformed the state-of-the-art RGB-based method, demonstrating its potential as an accurate and reliable input technique.
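
For context, the sketch below back-projects a depth patch into the kind of 3D point cloud such a pipeline consumes, using a standard pinhole camera model. The intrinsics and patch size are illustrative assumptions, and the paper's lip-localization and recognition models are not reproduced here.

```python
import numpy as np

def depth_to_point_cloud(depth: np.ndarray, fx: float, fy: float,
                         cx: float, cy: float) -> np.ndarray:
    """Back-project a depth frame (in meters) to an (N, 3) point cloud."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]   # drop invalid zero-depth pixels

# Illustrative usage: a synthetic 8x8 depth patch around the lips, ~35 cm away.
patch = np.full((8, 8), 0.35)
cloud = depth_to_point_cloud(patch, fx=365.0, fy=365.0, cx=4.0, cy=4.0)
print(cloud.shape)                    # (64, 3)
```

No RGB data is involved at any stage, which is the privacy property the abstract highlights.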

Award
Honorable Mention
Authors
Xue Wang
University of California, Los Angeles, Los Angeles, California, United States
Zixiong Su
The University of Tokyo, Tokyo, Japan
Jun Rekimoto
The University of Tokyo, Tokyo, Japan
Yang Zhang
University of California, Los Angeles, Los Angeles, California, United States
Paper URL

doi.org/10.1145/3613904.3642092

Video