Captioning, Description, and Media Interaction

Conference Name
CHI 2026
Beyond Descriptions: A Generative Scene2Audio Framework for Blind and Low-Vision Users to Experience Vista Landscapes
Abstract

Current scene perception tools for Blind and Low Vision (BLV) individuals rely on spoken descriptions but lack engaging representations of visually pleasing distant environmental landscapes (Vista spaces). Our proposed Scene2Audio framework generates comprehensible and enjoyable nonverbal audio using generative models informed by psychoacoustics and principles of scene audio composition. Through a user study with 11 BLV participants, we found that combining the Scene2Audio sounds with speech creates a better experience than speech alone, as the sound effects complement the speech, making the scene easier to imagine. An “in-the-wild” mobile app study with 7 BLV users over more than a week further showed the potential of Scene2Audio in enhancing outdoor scene experiences. Our work bridges the gap between visual and auditory scene perception by moving beyond purely descriptive aids, addressing the aesthetic needs of BLV users.
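
For readers curious how such a pipeline might be structured, here is a minimal Python sketch in the spirit of the abstract; every identifier below (SceneElement, compose_audio_prompt, the commented-out model call) is a hypothetical illustration, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class SceneElement:
    label: str        # e.g. "ocean waves", "seagulls"
    salience: float   # 0..1, drives mixing prominence
    distance: float   # metres, drives near/distant rendering

def compose_audio_prompt(elements: list[SceneElement]) -> str:
    """Turn detected vista-scene elements into a text prompt for a
    generative text-to-audio model, ordered by salience."""
    ordered = sorted(elements, key=lambda e: e.salience, reverse=True)
    parts = [f"{e.label} ({'distant' if e.distance > 50 else 'near'})"
             for e in ordered]
    return "Ambient soundscape with " + ", ".join(parts)

scene = [SceneElement("ocean waves", 0.9, 120.0),
         SceneElement("seagulls", 0.6, 60.0)]
prompt = compose_audio_prompt(scene)
# audio = text_to_audio_model.generate(prompt)  # hypothetical model call
print(prompt)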

Authors
Chitralekha Gupta
National University of Singapore, Singapore, Singapore
Jing Peng
National University of Singapore, Singapore, Singapore
Ashwin Ram
Saarland University, Saarland Informatics Campus, Saarbrücken, Germany
Shreyas Sridhar
National University of Singapore, Singapore, Singapore
Christophe Jouffrais
CNRS, Toulouse, France
Suranga Nanayakkara
School of Computing, National University of Singapore, Singapore, Singapore
Fuzzy Feelings: Arousal’s Interpretive Noise and the Case for Acoustic-Based Haptics
Abstract

Captions rarely convey emotional nuances in speech, leaving Deaf and Hard-of-Hearing (DHH) viewers without access to tonal and affective information. We present a two-part mixed-methods study on how haptic feedback can communicate vocal emotion without adding visual load. In Part 1, we replicated an arousal-driven captioning approach using speech-emotion-recognition to modulate typographic weight and vibration intensity. Participants showed divergent mental models and often mapped “more vibration” to loudness rather than emotional arousal, underscoring the construct’s conceptual fuzziness. In Part 2, we evaluated five acoustic-to-haptic mappings that bypass affective inference and translate pitch, rhythm, and waveform cues into vibration patterns. No single pattern dominated, but participants associated options such as ‘pulse’ or ‘sawtooth’ with high-arousal emotions, and ‘pitch-normalized’ signals with calmer states. We derive design guidelines emphasizing contrastive, acoustically grounded mappings and user control for integrating emotional haptics into short-form, captioned media.
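
As a rough illustration of what an acoustically grounded mapping can look like, the Python sketch below converts a waveform's frame-level energy into vibration intensities; the 50 ms frame size, the normalization, and the synthetic test signal are assumptions for illustration, not the study's parameters.

import numpy as np

def envelope_to_vibration(samples: np.ndarray, sr: int,
                          frame_ms: int = 50) -> np.ndarray:
    """Map a speech waveform's per-frame RMS energy to vibration
    intensities in [0, 1]; a crude loudness-driven baseline."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    return rms / (rms.max() + 1e-9)  # normalize to 0..1

# One second of synthetic "speech" at 16 kHz for demonstration.
sr = 16_000
t = np.linspace(0, 1, sr, endpoint=False)
speech = np.sin(2 * np.pi * 180 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
print(envelope_to_vibration(speech, sr)[:5])

A pitch-normalized variant of the kind participants associated with calmer states would divide out the fundamental-frequency contour before mapping to vibration intensity.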

Authors
Caluã de Lacerda Pataca
Birmingham City University, Birmingham, United Kingdom
Stephanie Patterson
Rochester Institute of Technology, Rochester, New York, United States
Roshan L. Peiris
Rochester Institute of Technology, Rochester, New York, United States
Matt Huenerfauth
Rochester Institute of Technology, Rochester, New York, United States
Video
Like, Comment & Caption: A Decade of Social Media Video Caption Research (2015–2025)
Abstract

As video has become the dominant mode of content on platforms such as YouTube, TikTok, and Instagram, captioning has emerged as a critical factor for accessibility, engagement, and visibility. While prior studies have examined different types of social media video captions or communities' captioning usage, a systematic synthesis has not been undertaken, leading to the risk of proposing interventions that overlook core platform constraints or miss critical accessibility needs. This paper reviews 36 peer-reviewed papers published between 2015 and 2025 across fields such as Human-Computer Interaction (HCI), accessibility, media studies, education, and language learning. We note that captions operate as collective infrastructure co-produced by viewers, creators, and platforms. Deaf and Hard of Hearing (DHH), neurodivergent, and multilingual viewers depend on captions and increasingly expect mechanisms for feedback, while creators face inadequate tool support. Building on these insights, we propose the framework of Participatory Captioning and suggest design implications, highlighting future directions for social media video caption research.

Award
Honorable Mention
Authors
Huong Nguyen
New Jersey Institute of Technology, Newark, New Jersey, United States
Emma J. McDonnell
University of Washington, Seattle, Washington, United States
Lloyd May
Stanford University, Palo Alto, California, United States
Alexander Druzenko
Epic Systems, Madison, Wisconsin, United States
Zoobia Saifullah Syeda
New Jersey Institute of Technology, Newark, New Jersey, United States
Mark Cartwright
New Jersey Institute of Technology, Newark, New Jersey, United States
Sooyeon Lee
New Jersey Institute of Technology, Newark, New Jersey, United States
Enhancing Subtitle Features in Mobile Apps: Analyzing User Reviews for Accessibility and Usability Insights
Abstract

While the benefits of subtitles for comprehension are widely acknowledged, empirical research has largely neglected the insights offered by user reviews of mobile applications. This study presents a comprehensive analysis of user feedback on subtitle features in the top 230 mobile applications across various categories. We curated a dataset of 48,872 user reviews specifically related to subtitles, manually categorizing them based on user-provided information. To understand sentiment and accessibility concerns, we conducted sentiment analysis and extracted accessibility-related reviews. Our findings offer valuable insights into user challenges and expectations regarding subtitles in mobile applications, providing a foundation for enhancing their design and implementation. These insights include the need for improved customization options, better accessibility features, and more responsive subtitle performance.
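
A minimal sketch of the filter-then-score step such an analysis involves, using NLTK's off-the-shelf VADER analyzer in Python; the keyword list and example reviews are illustrative, not the paper's dataset or pipeline.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

SUBTITLE_TERMS = ("subtitle", "caption")  # substring match covers plurals

def subtitle_reviews(reviews: list[str]) -> list[str]:
    """Keep only reviews that mention subtitle/caption features."""
    return [r for r in reviews
            if any(term in r.lower() for term in SUBTITLE_TERMS)]

sia = SentimentIntensityAnalyzer()
reviews = ["Subtitles lag behind the audio constantly.",
           "Love the new caption size options!",
           "Great app overall."]
for review in subtitle_reviews(reviews):
    print(sia.polarity_scores(review)["compound"], review)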

Authors
Wajdi M. Aljedaani
Saudi Data and Artificial Intelligence Authority, Riyadh, Saudi Arabia
Matheus Souza
University of North Texas, Denton, Texas, United States
Marcelo Medeiros Eler
University of São Paulo, São Paulo, Brazil
ADCanvas: Accessible and Conversational Audio Description Authoring for Blind and Low Vision Creators
Abstract

Audio Description (AD) provides essential access to visual media for blind and low vision (BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually-driven interfaces. We present ADCanvas, a multimodal authoring system that supports non-visual control over AD creation. ADCanvas combines conversational interaction, keyboard-based playback control, and a plain-text, screen reader–accessible editor to support end-to-end AD authoring. Pairing these screen-reader-friendly controls with a multimodal LLM agent, ADCanvas supports live visual question answering (VQA), script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human-AI collaboration.
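
One way to picture the conversational routing such a system needs is a simple intent dispatch; all names below (the intent labels, the stub handlers, the classifier) are hypothetical placeholders, not ADCanvas's actual architecture.

from typing import Callable

def answer_visual_question(q: str) -> str:
    return f"[VQA answer for: {q}]"           # would query a multimodal LLM

def draft_ad_script(instruction: str) -> str:
    return f"[draft AD for: {instruction}]"   # would generate a script

def edit_ad_segment(instruction: str) -> str:
    return f"[edited AD per: {instruction}]"  # would modify an existing cue

INTENTS: dict[str, Callable[[str], str]] = {
    "vqa": answer_visual_question,   # "What is on screen at 0:42?"
    "draft": draft_ad_script,        # "Describe this scene for me."
    "edit": edit_ad_segment,         # "Shorten the second description."
}

def handle_turn(utterance: str, classify: Callable[[str], str]) -> str:
    """Classify the creator's request and dispatch it; the creator then
    verifies and edits the result in the screen-reader-accessible editor."""
    return INTENTS[classify(utterance)](utterance)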

Authors
Franklin Mingzhe Li
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Michael Xieyang Liu
Google DeepMind, Pittsburgh, Pennsylvania, United States
Cynthia L. Bennett
Google, New York, New York, United States
Shaun K. Kane
Google Research, Boulder, Colorado, United States
Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface
Abstract

We investigate the accessibility of intelligent personal assistants (IPAs) for deaf and hard of hearing (DHH) people who can use their voice in everyday communication. The inability of IPAs to understand diverse accents, including deaf speech, renders them largely inaccessible to non-signing, speaking DHH individuals. Using an Echo Show, we compared the usability of natural language input via two spoken English methods against that of a large language model (LLM)-assisted touch interface in a mixed-methods study. The two spoken English methods consisted of Alexa's built-in automatic speech recognition and a Wizard-of-Oz setting with a trained facilitator re-speaking commands. The touch method was navigated through an LLM-powered ‘task prompter,’ which integrated the user's history and smart environment to suggest contextually appropriate commands. Quantitative results showed no significant differences between the two spoken English conditions and LLM-assisted touch. Qualitative results showed variability in opinions on the usability of each method. Ultimately, IPAs will need to recognize deaf-accented speech robustly and natively.
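
To make the ‘task prompter’ idea concrete, here is a hedged Python sketch of how interaction history and smart-environment state might be folded into a suggestion prompt; llm_complete is a hypothetical stand-in for a real model endpoint, and the prompt wording is invented for illustration.

import json

def suggest_commands(history: list[str], devices: dict[str, str],
                     llm_complete) -> list[str]:
    """Build a context-rich prompt and return tappable command suggestions."""
    prompt = (
        "Recent commands: " + json.dumps(history) + "\n"
        "Device states: " + json.dumps(devices) + "\n"
        "Suggest 3 short smart-home commands the user is likely to "
        "want next, one per line."
    )
    return llm_complete(prompt).strip().splitlines()

# Canned stand-in for the model, just to show the call shape.
fake_llm = lambda p: "Turn off the living room lights\nSet a 10 minute timer"
print(suggest_commands(["lights on"], {"living_room_light": "on"}, fake_llm))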

Authors
Paige S. DeVries
Gallaudet University, Washington, District of Columbia, United States
Michaela Okosi
Gallaudet University, Washington, District of Columbia, United States
Ming Li
Gallaudet University, Washington, District of Columbia, United States
Nora Dunphy
University of California Berkeley, Berkeley, California, United States
Gidey Gezae
Pennsylvania State University, State College, Pennsylvania, United States
Dante Conway
Gallaudet University, Washington, District of Columbia, United States
Abraham Glasser
Gallaudet University, Washington, District of Columbia, United States
Raja Kushalnagar
Gallaudet University, Washington, District of Columbia, United States
Christian Vogler
Gallaudet University, Washington, District of Columbia, United States
Reimagining Sign Language Technologies: Analyzing Translation Work of Chinese Deaf Online Content Creators
Abstract

While sign language translation systems promise to enhance deaf people's access to information and communication, they have been met with strong skepticism from deaf communities due to risks of misrepresenting and oversimplifying the richness of signed communication in technologies. This article provides empirical evidence of the complexity of translation work involved in deaf communication through interviews with 13 deaf Chinese content creators who actively produce and share sign language content on video sharing platforms with both deaf and hearing audiences. By studying this unique group of content creators, our findings highlight the nuances of sign language translation, showing how deaf creators create content with multilingualism and multiculturalism in mind, support meaning making across languages and cultures, and navigate politics involved in their translation work. Grounded in these deaf-led translation practices, we draw on the sociolinguistic concept of (trans)languaging to re-conceptualize and reimagine the design of sign language translation systems.

Award
Honorable Mention
Authors
Xinru Tang
University of California, Irvine, Irvine, California, United States
Anne Marie Piper
University of California, Irvine, Irvine, California, United States