d/Deaf and Hard of Hearing (DHH) individuals often engage with music multimodally, drawing on visual channels rather than relying on sound alone.
While tools like captions and visualizers offer partial support, they often fail to capture the emotional depth and structural nuances of music.
To explore new possibilities, we adopted an iterative, probe-based approach.
Through a formative study with 9 DHH participants, we identified key design requirements for visualizing rhythm, emotion, and lyrics.
We developed FAME (Facial Avatar for Musical Expression), a design probe that conveys music through expressive facial animation, instrument highlights, and synchronized captions, lip-syncing to lyrics or scat-singing to melodies.
Through a two-phase exploratory study with 12 DHH users, we examined FAME’s efficacy, applicability, and requirements for representing musical elements.
Our findings refine design requirements for avatar-based systems and highlight the potential of avatars as expressive and socially meaningful tools for music accessibility.
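To make the idea concrete, the following is a minimal, hypothetical sketch of how a FAME-style probe might derive an animation signal from audio. It assumes the librosa library for feature extraction; the names (mouth_openness, beat_times) and the mapping itself are illustrative, not FAME's actual implementation, which the abstract does not specify.

```python
# Hypothetical sketch: deriving a per-frame "mouth openness" signal for an
# avatar from a song's onset strength. All mappings here are illustrative.
import librosa
import numpy as np

# Placeholder path; any short song clip works.
y, sr = librosa.load("song.wav")

# Onset strength loosely tracks rhythmic energy; normalized, it could drive
# a mouth-opening blendshape for scat-style animation.
onset_env = librosa.onset.onset_strength(y=y, sr=sr)
mouth_openness = onset_env / (onset_env.max() + 1e-9)  # scaled to [0, 1]

# Beat times could trigger instrument highlights or caption emphasis.
tempo, beat_frames = librosa.beat.beat_track(onset_envelope=onset_env, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)

# Inspect the first few animation keyframes.
frame_times = librosa.frames_to_time(np.arange(len(onset_env)), sr=sr)
for t, m in zip(frame_times[:5], mouth_openness[:5]):
    print(f"t={t:.2f}s  mouth_openness={m:.2f}")
print(f"{len(beat_times)} beats detected")
```

In a full system, signals like these would drive avatar blendshapes and instrument-highlight cues in sync with the captions.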
Songwriting has long served as a powerful medium for expressing unconscious emotions and fostering self-awareness in psychotherapy. Yet the auditory-centric nature of traditional approaches has often excluded Deaf and Hard-of-Hearing (DHH) individuals from music’s therapeutic benefits. In response, this study presents a music psychotherapy tool co-designed with therapists that integrates conversational agents (CAs) and generative music AI as symbolic and therapeutic media. Through a usage study with 23 DHH individuals, we found that collaborative songwriting with the CA enabled emotional release, re-interpretation, and deeper self-understanding. In particular, the CA’s strategies of supportive empathy, example response options, and visually grounded metaphors effectively facilitated musical dialogue for DHH individuals. These findings contribute to inclusive AI design by demonstrating the potential of human–AI collaboration to bridge therapeutic and artistic practices.
Audio media -- radio, podcasts, audiobooks -- structures everyday life: we keep up, wind down, and share moments through long-form listening. Yet for people living with aphasia -- a communication disability that affects audio comprehension -- unsupported audio often means losing the thread and a diminished experience. While accessibility advances have focused on print, web, and audiovisual content, audio-only media remains largely overlooked and is often optimised for marketisation rather than sustained understanding. We report a three-week in-situ deployment of the Re-Connect app, an audio media player that meets people at the moment of comprehension difficulty. With ten adults living with aphasia, we show how people assemble personal repertoires of small, co-present communication cues that repair understanding in the moment and support recall. Grounded in lived experience, we argue for personal, source-proximate scaffolds that help make long-form audio more understandable and enjoyable.
Whispered and dysarthric speech hinder effective communication and undermine the reliability of voice-enabled systems. We present CLARIS, a compact speech-to-speech restoration system that turns such atypical input into clear, expressive speech. CLARIS requires no disorder-specific architectural tuning, generalizes across languages, and adapts quickly to new accents and speakers, enabling practical personalization. On whispered English, Hindi, and clinically challenging dysarthric speech, CLARIS delivers state-of-the-art intelligibility and naturalness, with listener studies confirming gains in quality, intelligibility, naturalness, and prosody. The system runs in real time, converting one second of input in about 30 ms, and enables inclusive, private, and personalized voice interaction. Audio samples are available at https://claris-w2s.github.io/CLARIS/
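As a back-of-the-envelope check on the reported speed, the real-time factor implied by the figures above can be computed directly; the snippet below uses only the numbers stated in the abstract.

```python
# Real-time factor (RTF): processing time divided by audio duration.
# Values below 1.0 mean faster than real time.
processing_ms = 30.0   # reported: ~30 ms to convert one second of input
audio_ms = 1000.0      # one second of audio

rtf = processing_ms / audio_ms
print(f"RTF = {rtf:.2f} (~{1 / rtf:.0f}x faster than real time)")
# -> RTF = 0.03 (~33x faster than real time)
```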
Videos make exercise instruction widely available, but they rely on visual demonstrations that blind and low vision (BLV) learners cannot see. While audio descriptions (AD) can make videos accessible, describing movement remains challenging, as the AD must convey both what to do (mechanics, location, orientation) and how to do it (speed, fluidity, timing). Prior work thus used multimodal instruction to support BLV learners with individual simple movements. However, it is unclear how these approaches scale to dance instruction, with its unique, complex movements and precise timing constraints. To inform accessible remote dance instruction systems, we conducted three co-design workshops (N=28) with BLV dancers, instructors, and experts in sound, haptics, and AD. Participants designed 8 systems, revealing common themes: staged learning to dissect routines, crafting vocabularies for movements, and selectively using modalities—narration for movement structure, sound for expression, and haptics for spatial cues. We conclude with design implications for making dance learning accessible.
Esports highlight both the potential for social inclusion and the accessibility challenges faced by individuals with physical disabilities. This study introduces a novel paradigm for inclusive esports by shifting game control from traditional kinematic inputs to kinetic inputs: an EMG-based control interface enables gameplay through force regulation (e.g., muscle activation and inhibition). The aim is to explore how this interface enables common gameplay mechanics among players with and without physical disabilities. In User Study 1, 20 able-bodied participants performed competitive esports tasks to examine how EMG-based control accuracy is influenced by movement range, such as wrist and elbow motion. User Study 2 extended the investigation to eight participants with physical disabilities to compare control accuracy between disabled and able-bodied users. The findings suggest that the interface enables common gameplay mechanics for individuals who can separately control the activation and inhibition of each muscle corresponding to each EMG sensor via calibration adjustment; however, disability-related involuntary muscle activity and unintended co-contraction remain a major challenge for the interface.
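For illustration, a minimal sketch of the activation/inhibition control scheme described above follows, assuming each EMG sensor's smoothed amplitude has been calibrated to per-user rest and maximum levels. The thresholds, names, and action mapping are hypothetical; the study's actual interface is not specified at this level of detail.

```python
# Hypothetical sketch of per-muscle EMG control: one calibrated sensor is
# mapped to discrete ACTIVATE / INHIBIT / NEUTRAL states. Thresholds are
# illustrative, not taken from the study.
from dataclasses import dataclass

@dataclass
class SensorCalibration:
    rest: float  # baseline amplitude with the muscle relaxed
    mvc: float   # amplitude at (comfortable) maximum voluntary contraction

def classify(sample: float, cal: SensorCalibration,
             activate_frac: float = 0.4, inhibit_frac: float = 0.1) -> str:
    """Map one sensor reading to a control state.

    Fractions are of the rest-to-MVC range; per-user calibration adjustment
    (as in the study) would tune these per sensor.
    """
    span = max(cal.mvc - cal.rest, 1e-9)
    level = (sample - cal.rest) / span
    if level >= activate_frac:
        return "ACTIVATE"   # e.g., press/hold a game input
    if level <= inhibit_frac:
        return "INHIBIT"    # e.g., release the input
    return "NEUTRAL"        # dead zone guards against co-contraction noise

# Example: a wrist-flexor sensor calibrated to rest=0.05, mvc=0.80
# (arbitrary amplitude units).
cal = SensorCalibration(rest=0.05, mvc=0.80)
for reading in (0.06, 0.20, 0.55):
    print(reading, "->", classify(reading, cal))
```

A dead zone between the two thresholds is one simple way to reduce spurious toggling, though it would not by itself address the involuntary activity the study identifies.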
People who are blind or have low vision (BLV) may hesitate to travel independently in unfamiliar environments due to uncertainty about the physical landscape. While most tools focus on in-situ navigation assistance, those supporting pre-travel assistance typically provide only landmark information and turn-by-turn instructions, lacking detailed visual context. Street-level imagery, which contains rich visual information and has the potential to reveal environmental details, remains inaccessible to BLV people. In this work, we present SceneScout, a multimodal large language model (MLLM)-driven prototype that enables accessible interactions with street-level imagery. SceneScout supports two modes: (1) Route Preview, which lets users familiarize themselves with visual details along a route, and (2) Virtual Exploration, which enables user-driven movement within street-level imagery. Our user study demonstrates that SceneScout helps BLV users uncover visual information otherwise unavailable through existing means. An initial analysis of the AI-generated descriptions suggests that the majority are accurate and describe stable visual elements even in older imagery, though occasional subtle and plausible errors make them difficult to verify without sight. We discuss future opportunities and challenges of navigation experiences based on street-level imagery.
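To sketch how a Route Preview pipeline of this kind might be structured, the snippet below wires placeholder imagery and MLLM calls together. Both helper functions and the prompt are stand-ins; the paper does not expose SceneScout's API, providers, or prompts.

```python
# Hypothetical sketch of a Route Preview loop: fetch street-level images
# along a route and ask an MLLM for pedestrian-relevant descriptions.
from typing import List, Tuple

def fetch_street_image(lat: float, lon: float) -> bytes:
    # Placeholder: a real system would call a street-level imagery provider.
    return b"<image bytes>"

def query_mllm(image: bytes, prompt: str) -> str:
    # Placeholder: swap in any multimodal LLM client here.
    return "stub description"

def route_preview(waypoints: List[Tuple[float, float]]) -> List[str]:
    """Describe stable, navigation-relevant visual details at each waypoint."""
    prompt = ("Describe permanent visual landmarks, sidewalk conditions, and "
              "intersections useful to a blind pedestrian. Avoid transient "
              "details such as parked cars or people.")
    return [query_mllm(fetch_street_image(lat, lon), prompt)
            for lat, lon in waypoints]

# Example: preview two points along a hypothetical route.
print(route_preview([(47.6205, -122.3493), (47.6219, -122.3517)]))
```

Prompting for stable elements over transient ones mirrors the finding above that descriptions of stable visual features remain useful even in older imagery.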