Accessing auditory information remains challenging for DHH individuals in real-world situations and multiplayer VR interactions. To improve this, we investigated caption designs tailored to the needs of DHH users in multiplayer VR settings. First, we conducted three co-design workshops with DHH participants, social workers, and designers to gather insights into the specific needs and design directions for DHH users in the context of a VR room escape game. We then refined our designs with 13 DHH users to determine the most preferred features. Based on this, we developed VRCaptions, a caption prototype that helps DHH users better experience multiplayer conversations in VR. Finally, we invited two mixed-hearing groups to play the VR room escape game with VRCaptions to validate the design. The results demonstrate that VRCaptions can enhance DHH participants' access to information and reduce barriers to communication in VR.
We introduce M^2Silent, which enables multi-user silent speech interactions in shared spaces using multi-directional speakers. Ensuring privacy during interactions with voice-controlled systems presents significant challenges, particularly in environments with multiple individuals, such as libraries, offices, or vehicles. M^2Silent addresses this by allowing users to communicate silently, without producing audible speech, using acoustic sensing integrated into directional speakers. We leverage FMCW signals as audio carriers, simultaneously playing audio and sensing each user's silent speech. To handle the challenge of multiple users interacting simultaneously, we propose time-shifted FMCW signals and blind source separation algorithms, which help isolate and accurately recognize the speech features of each user. We also present a deep-learning model for real-time silent speech recognition. M^2Silent achieves a Word Error Rate (WER) of 6.5% and a Sequence Error Rate (SER) of 12.8% in multi-user silent speech recognition while maintaining high audio quality, offering a novel solution for privacy-preserving, multi-user silent interactions in shared spaces.
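To make the carrier design more concrete, the sketch below illustrates one way time-shifted FMCW chirps and per-user dechirping could work in principle. The frequency band, shift, and echo model are illustrative assumptions, not the M^2Silent implementation, which additionally applies blind source separation and a deep recognition model.

```python
import numpy as np

def fmcw_chirp(f0=18_000, bandwidth=4_000, duration=0.04, fs=48_000):
    """Generate one linear FMCW chirp (hypothetical near-ultrasonic band)."""
    t = np.arange(int(duration * fs)) / fs
    k = bandwidth / duration                       # chirp slope in Hz/s
    return np.cos(2 * np.pi * (f0 * t + 0.5 * k * t**2))

def time_shifted_carriers(n_users, shift_s=0.01, fs=48_000):
    """Give each user's speaker a circularly time-shifted copy of the chirp,
    so reflections from different users fall into distinct delay bins."""
    base = fmcw_chirp(fs=fs)
    shift = int(shift_s * fs)
    return [np.roll(base, u * shift) for u in range(n_users)]

def dechirp(received, transmitted):
    """Mix the echo with its own transmitted chirp; the beat spectrum encodes
    articulator motion, which a downstream model can use as speech features."""
    beat = received * transmitted
    return np.abs(np.fft.rfft(beat * np.hanning(len(beat))))

# Toy usage: two users, echoes simulated as delayed, attenuated copies.
carriers = time_shifted_carriers(n_users=2)
echoes = sum(0.3 * np.roll(c, 40) for c in carriers)   # mixed reflections
features = [dechirp(echoes, c) for c in carriers]      # per-user beat spectra
```

In this toy setup, each user's feature vector is recovered by correlating the mixed echo against that user's own time-shifted carrier; the actual system further separates residual cross-user interference before recognition.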
This article examined how time and task management information widgets of different modalities affect time perception. In mentally demanding office environments, effective countdown representations are crucial for enhancing temporal awareness and productivity. We developed TickSens, a set of information widgets spanning different modalities, and conducted a within-subjects experiment with 30 participants to evaluate five time perception modes: visual, auditory, and haptic, as well as blank and timer baselines. Our assessment focused on technology acceptance, cognitive performance, and emotional responses. Results indicated that, compared to the blank and timer modes, the modality-based widgets significantly improved cognitive performance and positive emotional responses and were better received by participants. The visual mode yielded the best task performance, the auditory feedback was effective in boosting focus, and the haptic mode significantly enhanced user acceptance. The study revealed varied user preferences that can inform the integration of these widgets into office environments.
The recent surge in artificial intelligence, particularly in multimodal processing technology, has advanced human-computer interaction by altering how intelligent systems perceive, understand, and respond to contextual information (i.e., context awareness). Despite such advancements, there is a significant gap in comprehensive reviews examining these advances, especially from a multimodal data perspective, which is crucial for refining system design. This paper addresses a key aspect of this gap by conducting a systematic survey of data modality-driven Vision-based Multimodal Interfaces (VMIs). VMIs are essential for integrating multimodal data, enabling more precise interpretation of user intentions and complex interactions across physical and digital environments. Unlike previous task- or scenario-driven surveys, this study highlights the critical role of the visual modality in processing contextual information and facilitating multimodal interaction. Adopting a design framework that moves from the whole to the details and back, it classifies VMIs across dimensions, providing insights for developing effective, context-aware systems.
Identifying objective markers of attentional states is critical, particularly in real-world scenarios where attentional lapses have serious consequences. In this study, we identified gaze-based indices of attentional lapses and validated them by examining their impact on the performance of classification models. We designed a virtual reality visual search task that encouraged active eye movements to define dynamic gaze-based metrics of different attentional states (zone in/out). The results revealed significant differences in both reactive ocular features, such as first fixation and saccade onset latency, and global ocular features, such as saccade amplitude, depending on the attentional state. Moreover, the performance of the classification models improved significantly when trained only on the validated gaze-based and behavioral indices rather than on all available features, reaching a highest prediction accuracy of 79.3%. We highlight the importance of preliminary studies before model training and provide generalizable gaze-based indices of attentional states for practical applications.
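As an illustration of the reported workflow (validating a reduced set of gaze indices before training), the sketch below compares cross-validated classification accuracy when using all extracted features versus a validated subset. The data, feature indices, and classifier here are placeholders, not the study's dataset or model.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical per-trial features; real features would include first-fixation
# latency, saccade onset latency, saccade amplitude, and behavioral measures.
n_trials = 500
X_all = rng.normal(size=(n_trials, 12))        # all extracted features
proven_idx = [0, 1, 2, 3]                      # indices of validated features
y = rng.integers(0, 2, size=n_trials)          # zone-in (0) vs zone-out (1)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
acc_all = cross_val_score(clf, X_all, y, cv=5).mean()
acc_proven = cross_val_score(clf, X_all[:, proven_idx], y, cv=5).mean()
print(f"all features: {acc_all:.3f}   validated subset: {acc_proven:.3f}")
```

With real gaze data, the comparison of these two scores is what motivates restricting the model to the validated indices; with the random placeholder data above, the two accuracies are of course similar.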
Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many creators struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances that matched their personal standards, without requiring significantly more effort than leading industry interfaces.
People with disabilities that affect their speech may use speech-generating devices (SGD), commonly referred to as Augmentative and Alternative Communication (AAC) technology. This technology enables practical conversation; however, delivering expressive and timely comments remains challenging. This paper explores how to extend AAC technology to support a subset of humorous expressions: delivering timely humorous comments, or witty remarks, through AI-powered interfaces. To understand the role of humor in AAC and the challenges and experiences of delivering humor with AAC, we conducted seven qualitative interviews with AAC users. Based on these insights and the lead author's firsthand experience as an AAC user, we designed four AI-powered interfaces to assist in delivering well-timed humorous comments during ongoing conversations.
Our user study with five AAC users found that when timing is critical (e.g., delivering a humorous comment), AAC users are willing to trade agency for efficiency, in contrast to prior research in which they hesitated to delegate decision-making to AI. We conclude by discussing the trade-off between agency and efficiency in AI-powered interfaces and how AI can shape user intentions, and we offer design recommendations for AI-powered AAC interfaces.
See our project and demo at: https://tobiwg.github.io/research/why_so_serious