Auditory UI

Conference Name
CHI 2025
AppAgent: Multimodal Agents as Smartphone Users
Abstract

Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework allows the agent to mimic human-like interactions such as tapping and swiping through a simplified action space, eliminating the need for system back-end access and enhancing its versatility across various apps. Central to the agent's functionality is an innovative in-context learning method, where it either autonomously explores or learns from human demonstrations, creating a knowledge base used to execute complex tasks across diverse applications. We conducted extensive testing with our agent on over 50 tasks spanning 10 applications, ranging from social media to sophisticated image editing tools. Additionally, a user study confirmed the agent's superior performance and practicality in handling a diverse array of high-level tasks, demonstrating its effectiveness in real-world settings. Our project page is available at https://appagent-official.github.io/.
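
To make the "simplified action space" concrete, the following minimal Python sketch shows how an agent loop might restrict a model to tap/swipe actions on numerically labeled UI elements and parse its replies. The prompt format, the action grammar, and the ask_llm callable are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, Union
import re

@dataclass
class Tap:
    element_id: int                # numeric label of an on-screen interactive element

@dataclass
class Swipe:
    element_id: int
    direction: str                 # "up" | "down" | "left" | "right"

Action = Union[Tap, Swipe]

def parse_action(reply: str) -> Action:
    """Parse a model reply such as 'tap(3)' or 'swipe(5, up)' into an action object."""
    m = re.match(r"\s*tap\((\d+)\)", reply)
    if m:
        return Tap(int(m.group(1)))
    m = re.match(r"\s*swipe\((\d+),\s*(\w+)\)", reply)
    if m:
        return Swipe(int(m.group(1)), m.group(2))
    raise ValueError(f"unrecognized action: {reply!r}")

def step(task: str, notes: str, elements: list, ask_llm: Callable[[str], str]) -> Action:
    """One observe-think-act iteration: build a prompt, query the model, parse the action."""
    prompt = (
        f"Task: {task}\n"
        f"Notes from prior exploration: {notes}\n"
        f"Interactive elements: {elements}\n"
        "Reply with exactly one action: tap(<id>) or swipe(<id>, <direction>)."
    )
    return parse_action(ask_llm(prompt))

# Example with a stub in place of a real multimodal LLM call:
print(step("Open settings", "", ["1: gear icon", "2: search bar"], lambda p: "tap(1)"))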

Authors
Chi Zhang
Westlake University, Hangzhou, Zhejiang, China
Zhao Yang
Shanghai Supwisdom Information Technology Co., Ltd., Shanghai, Shanghai, China
Jiaxuan Liu
Tencent, Shanghai, China
Yanda Li
University of Technology Sydney, Sydney, NSW, Australia
Yucheng Han
Nanyang Technological University, Singapore, Singapore
Xin Chen
ShanghaiTech University, Shanghai, China
Zebiao Huang
Tencent, Shanghai, China
Bin Fu
Tencent, Shanghai, China
Gang Yu
StepFun, Shanghai, China
DOI

10.1145/3706598.3713600

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713600

Video
Weaving Sound Information to Support Real-Time Sensemaking of Auditory Environments: Co-Designing with a DHH User
Abstract

Current AI sound awareness systems can provide deaf and hard of hearing (DHH) people with information about sounds, including discrete sound sources and transcriptions. However, synthesizing AI outputs based on DHH people’s ever-changing intents in complex auditory environments remains a challenge. In this paper, we describe the co-design process of SoundWeaver, a sound awareness system prototype that dynamically weaves AI outputs from different AI models based on users’ intents and presents synthesized information through a heads-up display. Adopting a Research through Design perspective, we created SoundWeaver with one DHH co-designer, adapting it to his personal contexts and goals (e.g., cooking at home and chatting in a game store). Through this process, we present design implications for the future of “intent-driven” AI systems for sound accessibility.
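
As a rough illustration of what "weaving" outputs from separate sound models by user intent could look like, here is a minimal Python sketch. The intents, event fields, and priority rules are invented for illustration only and do not reproduce SoundWeaver's design.

from dataclasses import dataclass

@dataclass
class SoundEvent:
    label: str         # e.g. "timer beep", "speech"
    confidence: float
    direction: str     # coarse direction such as "left", "right", "behind"

def weave(intent: str, events: list, transcript: str) -> str:
    """Return one short heads-up-display line synthesized from the available model outputs."""
    if intent == "conversation":
        # Foreground the speech transcript; mention confident non-speech events only briefly.
        alerts = [e.label for e in events if e.label != "speech" and e.confidence > 0.8]
        return transcript + (f"  [also: {', '.join(alerts)}]" if alerts else "")
    if intent == "cooking":
        # Foreground the most confident environmental sound; drop the transcript.
        salient = max(events, key=lambda e: e.confidence, default=None)
        return f"{salient.label} ({salient.direction})" if salient else ""
    # Default: list the most confident events.
    return ", ".join(e.label for e in sorted(events, key=lambda e: -e.confidence)[:3])

print(weave("cooking",
            [SoundEvent("timer beep", 0.95, "left"), SoundEvent("speech", 0.70, "right")],
            "Did you preheat the oven?"))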

Authors
Jeremy Zhengqi Huang
University of Michigan, Ann Arbor, Michigan, United States
Jaylin Herskovitz
University of Michigan, Ann Arbor, Michigan, United States
Liang-Yuan Wu
University of Michigan, Ann Arbor, Michigan, United States
Cecily Morrison
Microsoft Research, Cambridge, United Kingdom
Dhruv Jain
University of Michigan, Ann Arbor, Michigan, United States
DOI

10.1145/3706598.3714268

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714268

Video
Sprayable Sound: Exploring the Experiential and Design Potential of Physically Spraying Sound Interaction
Abstract

Perfume and fragrance have captivated people for centuries across different cultures. Inspired by the ephemeral nature of sprayable olfactory interactions and experiences, we explore the potential of applying a similar interaction principle to the auditory modality. In this paper, we present SoundMist, a sonic interaction method that enables users to generate ephemeral auditory presences by physically dispersing a liquid into the air, much like the fading phenomenon of fragrance. We conducted a study to understand the experiential factors inherent in sprayable sound interaction and held an ideation workshop to identify potential design spaces or opportunities that this interaction could shape. Our findings, derived from thematic analysis, suggest that physically sprayable sound interaction can induce experiences related to four key factors—materiality of sound produced by dispersed liquid particles, different sounds entangled with each liquid, illusive perception of temporally floating sound, and enjoyment derived from blending different sounds—and can be applied to artistic practices, safety indications, multisensory approaches, and emotional interfaces.

Authors
Jongik Jeon
KAIST, Daejeon, Korea, Republic of
Chang Hee Lee
KAIST (Korea Advanced Institute of Science and Technology), Daejeon, Korea, Republic of
DOI

10.1145/3706598.3713786

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713786

Video
OnomaCap: Making Non-speech Sound Captions Accessible and Enjoyable through Onomatopoeic Sound Representation
Abstract

Non-speech sounds play an important role in setting the mood of a video and aiding comprehension. However, current non-speech sound captioning practices focus primarily on sound categories, which fails to provide a rich sound experience for d/Deaf and hard-of-hearing (DHH) viewers. Onomatopoeia, which succinctly captures expressive sound information, offers a potential solution but remains underutilized in non-speech sound captioning. This paper investigates how onomatopoeia benefits DHH audiences in non-speech sound captioning. We collected 7,962 sound-onomatopoeia pairs from listeners and developed a sound-onomatopoeia model that automatically transcribes sounds into onomatopoeic descriptions indistinguishable from human-generated ones. A user evaluation of 25 DHH participants using the model-generated onomatopoeia demonstrated that onomatopoeia significantly improved their video viewing experience. Participants most favored captions with onomatopoeia and category, and expressed a desire to see such captions across genres. We discuss the benefits and challenges of using onomatopoeia in non-speech sound captions, offering insights for future practices.
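
As a small, hedged sketch of the caption style the study reports participants favored (sound category paired with onomatopoeia), the Python snippet below formats timed caption lines. The event fields and formatting are illustrative assumptions; the paper's sound-onomatopoeia model is not reproduced here.

from dataclasses import dataclass

@dataclass
class SoundEvent:
    start_s: float       # onset time in the video, in seconds
    category: str        # e.g. "door knock"
    onomatopoeia: str    # model-generated rendering, e.g. "knock knock"

def caption_line(event: SoundEvent) -> str:
    minutes, seconds = divmod(int(event.start_s), 60)
    return f"{minutes:02d}:{seconds:02d} [{event.category}: {event.onomatopoeia}]"

print(caption_line(SoundEvent(75.0, "door knock", "knock knock")))  # 01:15 [door knock: knock knock]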

Authors
JooYeong Kim
Gwangju Institute of Science and Technology, Gwangju, Korea, Republic of
Jin-Hyuk Hong
Gwangju Institute of Science and Technology, Gwangju, Korea, Republic of
DOI

10.1145/3706598.3713911

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713911

Video
SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization
Abstract

Speech-to-text capabilities on mobile devices have proven helpful for hearing and speech accessibility, language translation, note-taking, and meeting transcripts. However, our foundational large-scale survey (n=263) shows that the inability to distinguish and indicate speaker direction makes them challenging in group conversations. SpeechCompass addresses this limitation through real-time, multi-microphone speech localization, where the direction of speech allows visual separation and guidance (e.g., arrows) in the user interface. We introduce efficient real-time audio localization algorithms and custom sound perception hardware, running on a low-power microcontroller with four integrated microphones, which we characterize in technical evaluations. Informed by a large-scale survey (n=494), we conducted an in-person study of group conversations with eight frequent users of mobile speech-to-text, who provided feedback on five visualization styles. The value of diarization and visualizing localization was consistent across participants, with everyone agreeing on the value and potential of directional guidance for group conversations.
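
For readers curious about the signal-processing core such a system builds on, below is a minimal NumPy sketch of one standard building block: estimating the time-difference of arrival (TDOA) between two microphones with GCC-PHAT and converting it to a bearing. The four-microphone microcontroller implementation described in the abstract is not shown; the microphone spacing, sample rate, and far-field assumption here are illustrative.

import numpy as np

def gcc_phat_tdoa(sig: np.ndarray, ref: np.ndarray, fs: int) -> float:
    """Delay (seconds) of `sig` relative to `ref`, estimated with GCC-PHAT."""
    n = len(sig) + len(ref)
    S, R = np.fft.rfft(sig, n=n), np.fft.rfft(ref, n=n)
    cross = S * np.conj(R)
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)   # phase-transform weighting
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / fs

def bearing_deg(tdoa: float, mic_spacing_m: float = 0.08, c: float = 343.0) -> float:
    """Angle off the microphone axis implied by a TDOA (far-field assumption)."""
    return float(np.degrees(np.arcsin(np.clip(tdoa * c / mic_spacing_m, -1.0, 1.0))))

# Example: a synthetic 3-sample delay at 16 kHz maps to roughly 53 degrees off-axis.
fs = 16000
rng = np.random.default_rng(0)
ref = rng.standard_normal(2048)
sig = np.concatenate([np.zeros(3), ref[:-3]])
print(bearing_deg(gcc_phat_tdoa(sig, ref, fs)))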

Award
Best Paper
Authors
Artem Dementyev
Google Inc., Mountain View, California, United States
Dimitri Kanevsky
Google, Mountain View, California, United States
Samuel Yang
Google, Mountain View, California, United States
Mathieu Parvaix
Google Research, Mountain View, California, United States
Chiong Lai
Google, Mountain View, California, United States
Alex Olwal
Google Inc., Mountain View, California, United States
DOI

10.1145/3706598.3713631

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713631

Video
SPECTRA: Personalizable Sound Recognition for Deaf and Hard of Hearing Users through Interactive Machine Learning
Abstract

We introduce SPECTRA, a novel pipeline for personalizable sound recognition designed to understand DHH users' needs when collecting audio data, creating a training dataset, and reasoning about the quality of a model. To evaluate the prototype, we recruited 12 DHH participants who trained personalized models for their homes. We investigated waveforms, spectrograms, interactive clustering, and data annotating to support DHH users throughout this workflow, and we explored the impact of a hands-on training session on their experience and attitudes toward sound recognition tools. Our findings reveal the potential for clustering visualizations and waveforms to enrich users' understanding of audio data and refinement of training datasets, along with data annotations to promote varied data collection. We provide insights into DHH users' experiences and perspectives on personalizing a sound recognition pipeline. Finally, we share design considerations for future interactive systems to support this population.
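
As a loose sketch of what an interactive, personalizable sound-recognition loop can look like (record a few labeled clips, inspect clusters of the data, retrain a small classifier on demand), here is a short Python example. The feature choice, clustering, and classifier below are assumptions for illustration and are not SPECTRA's actual pipeline.

import numpy as np
import librosa
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def embed(path: str, sr: int = 16000) -> np.ndarray:
    """Clip-level feature: time-averaged log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return librosa.power_to_db(mel).mean(axis=1)

def inspect_clusters(features: np.ndarray, n_clusters: int = 4) -> np.ndarray:
    """Cluster assignments a UI could visualize so users can spot odd or mislabeled clips."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)

def train_personal_model(features: np.ndarray, labels: list) -> KNeighborsClassifier:
    """Tiny classifier retrained whenever the user edits their personal dataset."""
    return KNeighborsClassifier(n_neighbors=3).fit(features, labels)

# Usage (file names and labels are placeholders):
# X = np.stack([embed(p) for p in ["doorbell_1.wav", "kettle_1.wav", "kettle_2.wav"]])
# print(inspect_clusters(X, n_clusters=2))                    # shown to the user for review
# model = train_personal_model(X, ["doorbell", "kettle", "kettle"])
# print(model.predict(embed("new_clip.wav").reshape(1, -1)))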

Authors
Steven M. Goodman
University of Washington, Seattle, Washington, United States
Emma J. McDonnell
University of Washington, Seattle, Washington, United States
Jon E. Froehlich
University of Washington, Seattle, Washington, United States
Leah Findlater
University of Washington, Seattle, Washington, United States
DOI

10.1145/3706598.3713294

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713294

Video
A Longitudinal Study on the Effects of Circadian Fatigue on Sound Source Identification and Localization using a Heads-Up Display
Abstract

Circadian fatigue, largely caused by sleep deprivation, significantly diminishes alertness and situational awareness. This issue becomes critical in environments where auditory awareness, such as responding to verbal instructions or localizing alarms, is essential for performance and safety. While head-mounted displays have demonstrated potential in enhancing situational awareness through visual cues, their effectiveness in supporting sound localization under the influence of circadian fatigue remains under-explored. This study addresses this knowledge gap through a longitudinal study (N=19) conducted over 2–4 months, tracking participants’ fatigue levels through daily assessments. Participants were called in to perform non-line-of-sight sound source identification and localization tasks in a virtual environment under high- and low-fatigue conditions, both with and without heads-up display (HUD) assistance. The results show task-dependent effects of circadian fatigue. Unexpectedly, reaction times were shorter across all tasks under high-fatigue conditions. Yet, in sound localization, where precision is key, the HUD offered the greatest performance enhancement by reducing pointing error. The results suggest that the auditory channel is a robust means of enhancing situational awareness and support incorporating spatial audio cues and HUDs as standard features in augmented reality platforms for fatigue-prone scenarios.

Authors
Alexander G. Minton
University of Technology Sydney, Sydney, New South Wales, Australia
Howe Yuan Zhu
University of Sydney, Camperdown, New South Wales, Australia
Hsiang-Ting Chen
University of Adelaide, Adelaide, South Australia, Australia
Yu-Kai Wang
University of Technology Sydney, Sydney, NSW, Australia
Zhuoli Zhuang
University of Technology Sydney, Sydney, NSW, Australia
Gina Notaro
Lockheed Martin, Cherry Hill, New Jersey, United States
Raquel Galvan
Lockheed Martin, Arlington, Virginia, United States
James Allen
Lockheed Martin, Cherry Hill, New Jersey, United States
Matthias D. Ziegler
Lockheed Martin, Arlington, Virginia, United States
Chin-Teng Lin
Australian AI Institute, School of Computer Science, Faculty of Engineering and Information Technology, University of Technology Sydney, Ultimo, Australia
DOI

10.1145/3706598.3713402

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713402

Video