Multimodal Interaction

Conference
CHI 2025
VRCaptions: Design Captions for DHH Users in Multiplayer Communication in VR
Abstract

Accessing auditory information remains challenging for deaf and hard-of-hearing (DHH) individuals, both in real-world situations and in multiplayer VR interactions. To address this, we investigated caption designs tailored to the needs of DHH users in multiplayer VR settings. First, we conducted three co-design workshops with DHH participants, social workers, and designers to gather insights into the specific needs of DHH users and corresponding design directions, in the context of a VR room escape game. We then refined our designs with 13 DHH users to determine the most preferred features. Based on this, we developed VRCaptions, a caption prototype that helps DHH users better experience multiplayer conversations in VR. Finally, we invited two mixed-hearing groups to play the VR room escape game with VRCaptions to validate the design. The results demonstrate that VRCaptions can enhance DHH participants' access to information and reduce the barrier to communication in VR.

Authors
Tianze Xie
Southern University of Science and Technology, Shenzhen, China
Xuesong Zhang
Southern University of Science and Technology, Shenzhen, China
Feiyu Huang
Southern University of Science and Technology, Shenzhen, China
Di Liu
Southern University of Science and Technology, Shenzhen, China
Pengcheng An
Southern University of Science and Technology, Shenzhen, China
Seungwoo Je
Southern University of Science and Technology, Shenzhen, China
DOI

10.1145/3706598.3714186

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714186

M^2Silent: Enabling Multi-user Silent Speech Interactions via Multi-directional Speakers in Shared Spaces
Abstract

We introduce M^2Silent, which enables multi-user silent speech interactions in shared spaces using multi-directional speakers. Ensuring privacy during interactions with voice-controlled systems is challenging, particularly in environments with multiple individuals, such as libraries, offices, or vehicles. M^2Silent addresses this by allowing users to communicate silently, without producing audible speech, using acoustic sensing integrated into directional speakers. We leverage frequency-modulated continuous wave (FMCW) signals as audio carriers, simultaneously playing audio and sensing the user's silent speech. To handle multiple users interacting at the same time, we propose time-shifted FMCW signals and blind source separation algorithms, which isolate and accurately recognize the speech features of each user. We also present a deep-learning model for real-time silent speech recognition. M^2Silent achieves a Word Error Rate (WER) of 6.5% and a Sequence Error Rate (SER) of 12.8% in multi-user silent speech recognition while maintaining high audio quality, offering a novel solution for privacy-preserving, multi-user silent interactions in shared spaces.
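For readers unfamiliar with FMCW carriers, the sketch below illustrates the general idea of time-shifted FMCW chirps described in the abstract. It is not the authors' implementation; the sample rate, start frequency, bandwidth, chirp duration, and per-user shift are all hypothetical placeholders chosen for illustration.

```python
# Minimal sketch (assumed parameters, not from the paper): generate a linear
# FMCW up-chirp and give each directional speaker a cyclically time-shifted
# copy, so each user's reflections land in a separable range after dechirping.
import numpy as np

FS = 48_000       # sample rate in Hz (assumed)
F0 = 18_000       # chirp start frequency in Hz, near-inaudible band (assumed)
BW = 4_000        # sweep bandwidth in Hz (assumed)
T_CHIRP = 0.02    # chirp duration in seconds (assumed)

def fmcw_chirp(fs=FS, f0=F0, bw=BW, duration=T_CHIRP):
    """One linear up-chirp whose instantaneous frequency sweeps f0 -> f0 + bw."""
    t = np.arange(int(fs * duration)) / fs
    # phase(t) = 2*pi*(f0*t + (bw / (2*duration)) * t**2)
    return np.cos(2 * np.pi * (f0 * t + 0.5 * (bw / duration) * t ** 2))

def time_shifted_carriers(n_users, shift_s=0.005, fs=FS):
    """Same chirp per speaker, delayed by a distinct per-user offset
    (the 5 ms shift here is made up for illustration)."""
    base = fmcw_chirp(fs=fs)
    shift = int(shift_s * fs)
    return [np.roll(base, k * shift) for k in range(n_users)]

carriers = time_shifted_carriers(n_users=2)
```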

Authors
Juntao Zhou
Shanghai Jiao Tong University, Shanghai, China
Dian Ding
Shanghai Jiao Tong University, Shanghai, China
Yijie Li
National University of Singapore, Singapore, Singapore
Yu Lu
Shanghai Jiao Tong University, Shanghai, China
Yida Wang
Shanghai Jiao Tong University, Shanghai, China
Yongzhao Zhang
University of Electronic Science and Technology of China, Chengdu, Sichuan, China
Yi-Chao Chen
Shanghai Jiao Tong University, Shanghai, China
Guangtao Xue
Shanghai Jiao Tong University, Shanghai, China
DOI

10.1145/3706598.3714174

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714174

Effects of Information Widgets on Time Perception during Mentally Demanding Tasks
Abstract

This paper examines how different time and task management information widgets affect time perception across modalities. In mentally demanding office environments, effective countdown representations are crucial for enhancing temporal awareness and productivity. We developed TickSens, a set of information widgets spanning different modalities, and conducted a within-subjects experiment with 30 participants to evaluate five time-perception modes: visual, auditory, and haptic, along with a blank mode and a timer mode. Our assessment focused on technology acceptance, cognitive performance, and emotional responses. Results indicated that, compared with the blank and timer modes, the modality-based widgets significantly improved cognitive performance and positive emotional responses and were better received by participants. The visual mode yielded the best task performance, the auditory feedback was effective in boosting focus, and the haptic mode significantly enhanced user acceptance. The study revealed varied user preferences that inform the integration of such widgets into office settings.

Authors
Zengrui Li
Beijing Institute of Technology, Beijing, China
Di Shi
Beijing Institute of Technology, Beijing, China
Qijun Gao
Beijing Institute of Technology, Beijing, China
Yichen Chen
Beijing Institute of Technology, Beijing, China
Nanyi Wang
Beijing Institute of Technology, Beijing, China
Xipei Ren
Beijing Institute of Technology, Beijing, China
DOI

10.1145/3706598.3713270

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713270

Vision-Based Multimodal Interfaces: A Survey and Taxonomy for Enhanced Context-Aware System Design
Abstract

The recent surge in artificial intelligence, particularly in multimodal processing technology, has advanced human-computer interaction by altering how intelligent systems perceive, understand, and respond to contextual information (i.e., context awareness). Despite these advancements, there is a significant gap in comprehensive reviews examining them, especially from a multimodal data perspective, which is crucial for refining system design. This paper addresses a key aspect of this gap by conducting a systematic survey of data modality-driven Vision-based Multimodal Interfaces (VMIs). VMIs are essential for integrating multimodal data, enabling more precise interpretation of user intentions and complex interactions across physical and digital environments. Unlike previous task- or scenario-driven surveys, this study highlights the critical role of the visual modality in processing contextual information and facilitating multimodal interaction. Adopting a design framework that moves from the whole to the details and back, it classifies VMIs across multiple dimensions, providing insights for developing effective, context-aware systems.

Authors
Yongquan 'Owen' Hu
University of New South Wales, Sydney, NSW, Australia
Jingyu Tang
Huazhong University of Science and Technology, Wuhan, China
Xinya Gong
Southern University of Science and Technology, Shenzhen, China
Zhongyi Zhou
RIKEN AIP, Tokyo, Japan
Shuning Zhang
Tsinghua University, Beijing, China
Don Samitha Elvitigala
Monash University, Melbourne, Australia
Florian 'Floyd' Mueller
Monash University, Melbourne, VIC, Australia
Wen Hu
UNSW, Sydney, New South Wales, Australia
Aaron Quigley
CSIRO's Data61, Sydney, NSW, Australia
DOI

10.1145/3706598.3714161

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714161

Looking but Not Focusing: Defining Gaze-Based Indices of Attention Lapses and Classifying Attentional States
Abstract

Identifying objective markers of attentional states is critical, particularly in real-world scenarios where attentional lapses have serious consequences. In this study, we identified gaze-based indices of attentional lapses and validated them by examining their impact on the performance of classification models. We designed a virtual reality visual search task that encouraged active eye movements to define dynamic gaze-based metrics of different attentional states (zone in/out). The results revealed significant differences in both reactive ocular features, such as first-fixation and saccade onset latency, and global ocular features, such as saccade amplitude, depending on the attentional state. Moreover, the performance of the classification models improved significantly when they were trained only on the validated gaze-based and behavioral indices rather than all available features, reaching a peak prediction accuracy of 79.3%. We highlight the importance of preliminary studies before model training and provide generalizable gaze-based indices of attentional states for practical applications.
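As a rough illustration of the pipeline the abstract describes (training a classifier on a small set of validated gaze indices rather than all available features), the sketch below trains a model on synthetic data with scikit-learn. The feature set, values, and model choice are assumptions for illustration, not the authors' method.

```python
# Minimal sketch (synthetic data, assumed features): classify attentional
# state (zone in vs. zone out) from a few gaze-based indices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials = 400

# Hypothetical per-trial indices: first-fixation latency (ms), saccade onset
# latency (ms), saccade amplitude (deg), and reaction time (s).
X = np.column_stack([
    rng.normal(250, 60, n_trials),
    rng.normal(180, 40, n_trials),
    rng.normal(5.0, 1.5, n_trials),
    rng.normal(0.9, 0.2, n_trials),
])
y = rng.integers(0, 2, n_trials)  # 0 = zone in, 1 = zone out (random labels here)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")  # near chance on synthetic labels
```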

Authors
Eugene Hwang
KAIST, Daejeon, Republic of Korea
Jeongmi Lee
KAIST, Daejeon, Republic of Korea
DOI

10.1145/3706598.3714269

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714269

SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation
Abstract

Novice content creators often invest significant time recording expressive speech for social media videos. While recent text-to-speech (TTS) systems can generate highly realistic speech in various languages and accents, many creators struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement through high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.

Authors
Stephen Brade
Massachusetts Institute of Technology, Cambridge, Massachusetts, United States
Sam Anderson
Adobe Research, New York, New York, United States
Rithesh Kumar
Adobe Research, Toronto, Ontario, Canada
Zeyu Jin
Adobe Research, San Francisco, California, United States
Anh Truong
Adobe Research, New York, New York, United States
DOI

10.1145/3706598.3714263

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714263

Why So Serious? Exploring Timely Humorous Comments in AAC Through AI-Powered Interfaces
Abstract

People with disabilities that affect their speech may use speech-generating devices (SGD), commonly referred to as Augmentative and Alternative Communication (AAC) technology. This technology enables practical conversation; however, delivering expressive and timely comments remains challenging. This paper explores how to extend AAC technology to support a subset of humorous expressions: delivering timely humorous comments (witty remarks) through AI-powered interfaces. To understand the role of humor in AAC and the challenges and experiences of delivering humor with AAC, we conducted seven qualitative interviews with AAC users. Based on these insights and the lead author's firsthand experience as an AAC user, we designed four AI-powered interfaces to assist in delivering well-timed humorous comments during ongoing conversations. Our user study with five AAC users found that when timing is critical (e.g., delivering a humorous comment), AAC users are willing to trade agency for efficiency, in contrast to prior research in which they hesitated to delegate decision-making to AI. We conclude by discussing the trade-off between agency and efficiency in AI-powered interfaces and how AI can shape user intentions, and we offer design recommendations for AI-powered AAC interfaces. See our project and demo at: https://tobiwg.github.io/research/why_so_serious

Award
Honorable Mention
Authors
Tobias M. Weinberg
Cornell Tech, New York, New York, United States
Kowe Kadoma
Cornell University, New York, New York, United States
Ricardo E. Gonzalez Penuela
Cornell Tech, Cornell University, New York, New York, United States
Stephanie Valencia
University of Maryland College Park, College Park, Maryland, United States
Thijs Roumen
Cornell Tech, New York, New York, United States
DOI

10.1145/3706598.3714102

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714102
