Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework allows the agent to mimic human-like interactions such as tapping and swiping through a simplified action space, eliminating the need for system back-end access and enhancing its versatility across various apps. Central to the agent's functionality is an innovative in-context learning method, where it either autonomously explores or learns from human demonstrations, creating a knowledge base used to execute complex tasks across diverse applications. We conducted extensive testing with our agent on over 50 tasks spanning 10 applications, ranging from social media to sophisticated image editing tools. Additionally, a user study confirmed the agent's superior performance and practicality in handling a diverse array of high-level tasks, demonstrating its effectiveness in real-world settings. Our project page is available at https://appagent-official.github.io/.
https://dl.acm.org/doi/10.1145/3706598.3713600
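To make the "simplified action space" idea in the abstract above more concrete, the following minimal Python sketch shows one way an agent loop over tap/swipe actions and a retrieved knowledge base could be structured; the action names, the knowledge-base format, and the stubbed LLM call are illustrative assumptions, not the paper's actual interface.

# Illustrative sketch of a simplified, UI-level action space for an app-operating
# agent, in the spirit of the AppAgent abstract above. Action names, the format of
# the knowledge base, and the LLM call are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Tap:
    element_id: int          # index of the UI element to tap


@dataclass
class Swipe:
    element_id: int
    direction: str           # "up", "down", "left", or "right"


def retrieve_notes(knowledge_base: dict, app: str) -> str:
    """Look up exploration/demonstration notes for the current app."""
    return knowledge_base.get(app, "")


def decide_action(task: str, screen_description: str, notes: str):
    """Placeholder for the LLM call that picks the next UI action.

    A real agent would send the task, a textual/visual description of the
    current screen, and the retrieved notes to a multimodal LLM and parse
    its reply into one of the actions above.
    """
    return Tap(element_id=0)  # stubbed decision


if __name__ == "__main__":
    kb = {"notes_app": "Element 0 opens the compose view."}
    action = decide_action(
        task="Write a note saying 'hello'",
        screen_description="Home screen of notes_app with a compose button (element 0)",
        notes=retrieve_notes(kb, "notes_app"),
    )
    print(action)  # Tap(element_id=0)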
Current AI sound awareness systems can provide deaf and hard of hearing (DHH) people with information about sounds, including discrete sound sources and transcriptions. However, synthesizing AI outputs based on DHH people’s ever-changing intents in complex auditory environments remains a challenge. In this paper, we describe the co-design process of SoundWeaver, a sound awareness system prototype that dynamically weaves AI outputs from different AI models based on users’ intents and presents synthesized information through a heads-up display. Adopting a Research through Design perspective, we created SoundWeaver with one DHH co-designer, adapting it to his personal contexts and goals (e.g., cooking at home and chatting in a game store). Through this process, we present design implications for the future of “intent-driven” AI systems for sound accessibility.
https://dl.acm.org/doi/10.1145/3706598.3714268
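As a rough illustration of the "weaving" idea in the abstract above, the following Python sketch combines stubbed outputs from a sound-recognition model and a speech-to-text model according to a declared user intent; the intent labels, confidence thresholds, and prioritization rules are assumptions for illustration, not SoundWeaver's implementation.

# Illustrative sketch of intent-driven synthesis of outputs from multiple sound-AI
# models for a heads-up display. The intent labels, stubbed model outputs, and
# prioritization rules are assumptions, not the system's actual logic.

def detected_sound_events() -> list[dict]:
    """Stand-in for a sound-recognition model's output."""
    return [
        {"label": "timer beep", "confidence": 0.92},
        {"label": "sizzling", "confidence": 0.71},
        {"label": "speech", "confidence": 0.88},
    ]


def speech_transcript() -> str:
    """Stand-in for a speech-to-text model's output."""
    return "Dinner is almost ready."


def weave(intent: str) -> str:
    """Select and order AI outputs for the display based on the user's intent."""
    events = detected_sound_events()
    if intent == "cooking":
        # Prioritize task-critical, non-speech sounds while cooking.
        alerts = [e["label"] for e in events
                  if e["label"] != "speech" and e["confidence"] > 0.8]
        return " | ".join(alerts) if alerts else "(no alerts)"
    if intent == "conversation":
        # Prioritize the transcript, but keep very high-confidence alerts visible.
        alerts = [e["label"] for e in events
                  if e["label"] != "speech" and e["confidence"] > 0.9]
        suffix = f"  [{', '.join(alerts)}]" if alerts else ""
        return speech_transcript() + suffix
    return "; ".join(e["label"] for e in events)


if __name__ == "__main__":
    print(weave("cooking"))        # timer beep
    print(weave("conversation"))   # Dinner is almost ready.  [timer beep]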
Perfume and fragrance have captivated people for centuries across different cultures. Inspired by the ephemeral nature of sprayable olfactory interactions and experiences, we explore the potential of applying a similar interaction principle to the auditory modality. In this paper, we present SoundMist, a sonic interaction method that enables users to generate ephemeral auditory presences by physically dispersing a liquid into the air, much like the fading phenomenon of fragrance. We conducted a study to understand the experiential factors inherent in sprayable sound interaction and held an ideation workshop to identify potential design spaces or opportunities that this interaction could shape. Our findings, derived from thematic analysis, suggest that physically sprayable sound interaction can induce experiences related to four key factors—materiality of sound produced by dispersed liquid particles, different sounds entangled with each liquid, illusive perception of temporally floating sound, and enjoyment derived from blending different sounds—and can be applied to artistic practices, safety indications, multisensory approaches, and emotional interfaces.
https://dl.acm.org/doi/10.1145/3706598.3713786
Non-speech sounds play an important role in setting the mood of a video and aiding comprehension. However, current non-speech sound captioning practices focus primarily on sound categories, an approach that fails to provide a rich sound experience for d/Deaf and hard-of-hearing (DHH) viewers. Onomatopoeia, which succinctly captures expressive sound information, offers a potential solution but remains underutilized in non-speech sound captioning. This paper investigates how onomatopoeia benefits DHH audiences in non-speech sound captioning. We collected 7,962 sound-onomatopoeia pairs from listeners and developed a sound-onomatopoeia model that automatically transcribes sounds into onomatopoeic descriptions indistinguishable from human-generated ones. A user evaluation with 25 DHH participants using the model-generated onomatopoeia demonstrated that onomatopoeia significantly improved their video viewing experience. Participants most favored captions that combined onomatopoeia with sound category, and expressed a desire to see such captions across genres. We discuss the benefits and challenges of using onomatopoeia in non-speech sound captions, offering insights for future practices.
https://dl.acm.org/doi/10.1145/3706598.3713911
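A small Python sketch of the caption styles compared in the abstract above (category only, onomatopoeia only, and the combination participants favored most); the example labels and bracketed formatting are assumptions for illustration, not the study's actual rendering.

# Illustrative sketch of composing non-speech sound captions in different styles.
# The example category/onomatopoeia strings and the formatting are assumptions.

def compose_caption(category: str, onomatopoeia: str, style: str) -> str:
    """Render a non-speech sound caption in the requested style."""
    if style == "category":
        return f"[{category}]"
    if style == "onomatopoeia":
        return f"[{onomatopoeia}]"
    if style == "both":
        # The style participants favored most: onomatopoeia plus category.
        return f"[{onomatopoeia} ({category})]"
    raise ValueError(f"unknown style: {style!r}")


if __name__ == "__main__":
    # The onomatopoeia string would come from a sound-onomatopoeia model.
    for style in ("category", "onomatopoeia", "both"):
        print(compose_caption("door creaking", "creeeak", style))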
Speech-to-text capabilities on mobile devices have proven helpful for hearing and speech accessibility, language translation, note-taking, and meeting transcripts. However, our foundational large-scale survey (n=263) shows that the inability to distinguish and indicate speaker direction makes them challenging to use in group conversations. SpeechCompass addresses this limitation through real-time, multi-microphone speech localization, where the direction of speech allows visual separation and guidance (e.g., arrows) in the user interface. We introduce efficient real-time audio localization algorithms and custom sound perception hardware, running on a low-power microcontroller with four integrated microphones, which we characterize in technical evaluations. Informed by a large-scale survey (n=494), we conducted an in-person study of group conversations with eight frequent users of mobile speech-to-text, who provided feedback on five visualization styles. Participants consistently valued diarization and localization visualization, and all agreed on the potential of directional guidance for group conversations.
https://dl.acm.org/doi/10.1145/3706598.3713631
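The abstract above centers on multi-microphone speech localization. The following Python sketch shows one classic building block of such systems: estimating direction of arrival from the time difference of arrival between two microphones via cross-correlation. SpeechCompass uses four microphones and its own efficient algorithms, so the two-microphone geometry, sample rate, and spacing here are assumptions for illustration only.

# Illustrative two-microphone direction-of-arrival estimate via cross-correlation.
# Sample rate, mic spacing, and the two-mic setup are assumptions for illustration.
import numpy as np

SAMPLE_RATE = 16_000      # Hz (assumed)
MIC_SPACING = 0.08        # meters between the two microphones (assumed)
SPEED_OF_SOUND = 343.0    # m/s


def estimate_angle(mic_a: np.ndarray, mic_b: np.ndarray) -> float:
    """Return the estimated arrival angle in degrees (0 = broadside)."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    # Lag (in samples) by which mic_a lags mic_b; positive means mic_b hears it first.
    lag = np.argmax(corr) - (len(mic_b) - 1)
    tdoa = lag / SAMPLE_RATE                      # seconds
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(SPEED_OF_SOUND * tdoa / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))


if __name__ == "__main__":
    # Synthetic test: the same burst of noise reaches mic_b first and mic_a
    # three samples later, so the estimated angle is positive under this convention.
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(1024)
    delay = 3
    mic_b = np.concatenate([signal, np.zeros(delay)])
    mic_a = np.concatenate([np.zeros(delay), signal])
    print(f"estimated angle: {estimate_angle(mic_a, mic_b):.1f} degrees")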
We introduce SPECTRA, a novel pipeline for personalizable sound recognition designed to understand DHH users' needs when collecting audio data, creating a training dataset, and reasoning about the quality of a model. To evaluate the prototype, we recruited 12 DHH participants who trained personalized models for their homes. We investigated waveforms, spectrograms, interactive clustering, and data annotating to support DHH users throughout this workflow, and we explored the impact of a hands-on training session on their experience and attitudes toward sound recognition tools. Our findings reveal the potential for clustering visualizations and waveforms to enrich users' understanding of audio data and refinement of training datasets, along with data annotations to promote varied data collection. We provide insights into DHH users' experiences and perspectives on personalizing a sound recognition pipeline. Finally, we share design considerations for future interactive systems to support this population.
https://dl.acm.org/doi/10.1145/3706598.3713294
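To illustrate the interactive-clustering idea in the abstract above, here is a minimal Python sketch that groups a user's recorded clips by a simple spectral feature so that similar sounds can be reviewed and labeled together; the feature choice and the use of k-means are assumptions for illustration, not SPECTRA's actual pipeline.

# Illustrative clustering of audio clips to support reviewing and labeling a
# personal sound-recognition dataset. Feature and clustering choices are assumed.
import numpy as np
from sklearn.cluster import KMeans


def clip_features(clip: np.ndarray, frame: int = 512) -> np.ndarray:
    """Average log-magnitude spectrum over fixed-size frames of a mono clip."""
    n_frames = len(clip) // frame
    frames = clip[: n_frames * frame].reshape(n_frames, frame)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spectra).mean(axis=0)


def cluster_clips(clips: list[np.ndarray], n_clusters: int = 2) -> np.ndarray:
    """Group clips into clusters the user can inspect and annotate."""
    features = np.stack([clip_features(c) for c in clips])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)


if __name__ == "__main__":
    # Synthetic stand-ins for home recordings: low-frequency hums vs. high-pitched beeps.
    rng = np.random.default_rng(0)
    t = np.arange(16_000) / 16_000
    hums = [np.sin(2 * np.pi * 60 * t) + 0.1 * rng.standard_normal(t.size) for _ in range(3)]
    beeps = [np.sin(2 * np.pi * 2_000 * t) + 0.1 * rng.standard_normal(t.size) for _ in range(3)]
    print(cluster_clips(hums + beeps))   # e.g., [0 0 0 1 1 1] (cluster ids are arbitrary)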
Circadian fatigue, largely caused by sleep deprivation, significantly diminishes alertness and situational awareness. This issue becomes critical in environments where auditory awareness, such as responding to verbal instructions or localizing alarms, is essential for performance and safety. While head-mounted displays have demonstrated potential in enhancing situational awareness through visual cues, their effectiveness in supporting sound localization under the influence of circadian fatigue remains under-explored. This study addresses this knowledge gap through a longitudinal study (N=19) conducted over 2–4 months, tracking participants’ fatigue levels through daily assessments. Participants were called in to perform non-line-of-sight sound source identification and localization tasks in a virtual environment under high- and low-fatigue conditions, both with and without head-up display (HUD) assistance. The results show task-dependent effects of circadian fatigue. Unexpectedly, reaction times were shorter across all tasks under high-fatigue conditions. Yet, in sound localization, where precision is key, the HUD offered the greatest performance enhancement by reducing pointing error. The results suggest that the auditory channel is a robust means of enhancing situational awareness and provide support for incorporating spatial audio cues and HUDs as standard features in augmented reality platforms for fatigue-prone scenarios.
https://dl.acm.org/doi/10.1145/3706598.3713402