Captioning, Description, and Media Interaction

Conference Name
CHI 2026
Beyond Descriptions: A Generative Scene2Audio Framework for Blind and Low-Vision Users to Experience Vista Landscapes
Abstract

Current scene perception tools for Blind and Low Vision (BLV) individuals rely on spoken descriptions but lack engaging representations of visually pleasing distant environmental landscapes (Vista spaces). Our proposed Scene2Audio framework generates comprehensible and enjoyable nonverbal audio using generative models informed by psychoacoustics and principles of scene audio composition. Through a user study with 11 BLV participants, we found that combining the Scene2Audio sounds with speech creates a better experience than speech alone, as the sound effects complement the speech, making the scene easier to imagine. An “in-the-wild” mobile app study with 7 BLV users over more than a week further showed the potential of Scene2Audio in enhancing outdoor scene experiences. Our work bridges the gap between visual and auditory scene perception by moving beyond purely descriptive aids, addressing the aesthetic needs of BLV users.
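
For readers curious how such a pipeline might be structured, here is a minimal Python sketch in the spirit of the abstract; every identifier below (SceneElement, compose_audio_prompt, the commented-out model call) is a hypothetical illustration, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class SceneElement:
    label: str        # e.g. "ocean waves", "seagulls"
    salience: float   # 0..1, drives mixing prominence
    distance: float   # metres, drives near/distant rendering

def compose_audio_prompt(elements: list[SceneElement]) -> str:
    """Turn detected vista-scene elements into a text prompt for a
    generative text-to-audio model, ordered by salience."""
    ordered = sorted(elements, key=lambda e: e.salience, reverse=True)
    parts = [f"{e.label} ({'distant' if e.distance > 50 else 'near'})"
             for e in ordered]
    return "Ambient soundscape with " + ", ".join(parts)

scene = [SceneElement("ocean waves", 0.9, 120.0),
         SceneElement("seagulls", 0.6, 60.0)]
prompt = compose_audio_prompt(scene)
# audio = text_to_audio_model.generate(prompt)  # hypothetical model call
print(prompt)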

Authors
Chitralekha Gupta
National University of Singapore, Singapore, Singapore
Jing Peng
National University of Singapore, Singapore, Singapore
Ashwin Ram
Saarland University, Saarland Informatics Campus, Saarbrücken, Germany
Shreyas Sridhar
National University of Singapore, Singapore, Singapore
Christophe Jouffrais
CNRS, Toulouse, France
Suranga Nanayakkara
School of Computing, National University of Singapore, Singapore, Singapore
Fuzzy Feelings: Arousal’s Interpretive Noise and the Case for Acoustic-Based Haptics
Abstract

Captions rarely convey emotional nuances in speech, leaving Deaf and Hard-of-Hearing (DHH) viewers without access to tonal and affective information. We present a two-part mixed-methods study on how haptic feedback can communicate vocal emotion without adding visual load. In Part 1, we replicated an arousal-driven captioning approach using speech-emotion-recognition to modulate typographic weight and vibration intensity. Participants showed divergent mental models and often mapped “more vibration” to loudness rather than emotional arousal, underscoring the construct’s conceptual fuzziness. In Part 2, we evaluated five acoustic-to-haptic mappings that bypass affective inference and translate pitch, rhythm, and waveform cues into vibration patterns. No single pattern dominated, but participants associated options such as ‘pulse’ or ‘sawtooth’ with high-arousal emotions, and ‘pitch-normalized’ signals with calmer states. We derive design guidelines emphasizing contrastive, acoustically grounded mappings and user control for integrating emotional haptics into short-form, captioned media.
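
As a rough illustration of what an acoustically grounded mapping can look like, the Python sketch below converts a waveform's frame-level energy into vibration intensities; the 50 ms frame size, the normalization, and the synthetic test signal are assumptions for illustration, not the study's parameters.

import numpy as np

def envelope_to_vibration(samples: np.ndarray, sr: int,
                          frame_ms: int = 50) -> np.ndarray:
    """Map a speech waveform's per-frame RMS energy to vibration
    intensities in [0, 1]; a crude loudness-driven baseline."""
    frame = int(sr * frame_ms / 1000)
    n = len(samples) // frame
    rms = np.sqrt(np.mean(samples[:n * frame].reshape(n, frame) ** 2, axis=1))
    return rms / (rms.max() + 1e-9)  # normalize to 0..1

# One second of synthetic "speech" at 16 kHz for demonstration.
sr = 16_000
t = np.linspace(0, 1, sr, endpoint=False)
speech = np.sin(2 * np.pi * 180 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
print(envelope_to_vibration(speech, sr)[:5])

A pitch-normalized variant of the kind participants associated with calmer states would divide out the fundamental-frequency contour before mapping to vibration intensity.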

Authors
Caluã de Lacerda Pataca
Birmingham City University, Birmingham, United Kingdom
Stephanie Patterson
Rochester Institute of Technology, Rochester, New York, United States
Roshan L. Peiris
Rochester Institute of Technology, Rochester, New York, United States
Matt Huenerfauth
Rochester Institute of Technology, Rochester, New York, United States
Video
Like, Comment & Caption: A Decade of Social Media Video Caption Research (2015–2025)
Abstract

As video has become the dominant mode of content on platforms such as YouTube, TikTok, and Instagram, captioning has emerged as a critical factor for accessibility, engagement, and visibility. While prior studies have examined different types of social media video captions or communities' captioning usage, a systematic synthesis has not been undertaken, leading to the risk of proposing interventions that overlook core platform constraints or miss critical accessibility needs. This paper reviews 36 peer-reviewed papers published between 2015 and 2025 across fields such as Human-Computer Interaction (HCI), accessibility, media studies, education, and language learning. We note that captions operate as collective infrastructure co-produced by viewers, creators, and platforms. Deaf and Hard of Hearing (DHH), neurodivergent, and multilingual viewers depend on captions and increasingly expect mechanisms for feedback, while creators face inadequate tool support. Building on these insights, we propose the framework of Participatory Captioning and suggest design implications, highlighting future directions for social media video caption research.

Award
Honorable Mention
Authors
Huong Nguyen
New Jersey Institute of Technology, Newark, New Jersey, United States
Emma J. McDonnell
University of Washington, Seattle, Washington, United States
Lloyd May
Stanford University, Palo Alto, California, United States
Alexander Druzenko
Epic Systems, Madison, Wisconsin, United States
Zoobia Saifullah Syeda
New Jersey Institute of Technology, Newark, New Jersey, United States
Mark Cartwright
New Jersey Institute of Technology, Newark, New Jersey, United States
Sooyeon Lee
New Jersey Institute of Technology, Newark, New Jersey, United States
Enhancing Subtitle Features in Mobile Apps: Analyzing User Reviews for Accessibility and Usability Insights
Abstract

While the benefits of subtitles for comprehension are widely acknowledged, empirical research has largely neglected the insights offered by user reviews of mobile applications. This study presents a comprehensive analysis of user feedback on subtitle features in the top 230 mobile applications across various categories. We curated a dataset of 48,872 user reviews specifically related to subtitles, manually categorizing them based on user-provided information. To understand sentiment and accessibility concerns, we conducted sentiment analysis and extracted accessibility-related reviews. Our findings offer valuable insights into user challenges and expectations regarding subtitles in mobile applications, providing a foundation for enhancing their design and implementation. These insights include the need for improved customization options, better accessibility features, and more responsive subtitle performance.
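
A minimal sketch of the filter-then-score step such an analysis involves, using NLTK's off-the-shelf VADER analyzer in Python; the keyword list and example reviews are illustrative, not the paper's dataset or pipeline.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

SUBTITLE_TERMS = ("subtitle", "caption")  # substring match covers plurals

def subtitle_reviews(reviews: list[str]) -> list[str]:
    """Keep only reviews that mention subtitle/caption features."""
    return [r for r in reviews
            if any(term in r.lower() for term in SUBTITLE_TERMS)]

sia = SentimentIntensityAnalyzer()
reviews = ["Subtitles lag behind the audio constantly.",
           "Love the new caption size options!",
           "Great app overall."]
for review in subtitle_reviews(reviews):
    print(sia.polarity_scores(review)["compound"], review)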

Authors
Wajdi M. Aljedaani
Saudi Data and Artificial Intelligence Authority, Riyadh, Saudi Arabia
Matheus Souza
University of North Texas, Denton, Texas, United States
Marcelo Medeiros Eler
University of São Paulo, São Paulo, Brazil
ADCanvas: Accessible and Conversational Audio Description Authoring for Blind and Low Vision Creators
Abstract

Audio Description (AD) provides essential access to visual media for blind and low vision (BLV) audiences. Yet current AD production tools remain largely inaccessible to BLV video creators, who possess valuable expertise but face barriers due to visually-driven interfaces. We present ADCanvas, a multimodal authoring system that supports non-visual control over AD creation. ADCanvas combines conversational interaction, keyboard-based playback control, and a plain-text, screen reader–accessible editor to support end-to-end AD authoring. Pairing these screen-reader-friendly controls with a multimodal LLM agent, ADCanvas supports live visual question answering (VQA), script generation, and AD modification. Through a user study with 12 BLV video creators, we find that users adopt the conversational agent as an informational aide and drafting assistant, while maintaining agency through verification and editing. For example, participants saw themselves as curators who received information from the model and filtered it down for their audience. Our findings offer design implications for accessible media tools, including precise editing controls, accessibility support for creative ideation, and configurable rules for human-AI collaboration.
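
One way to picture the conversational routing such a system needs is a simple intent dispatch; all names below (the intent labels, the stub handlers, the classifier) are hypothetical placeholders, not ADCanvas's actual architecture.

from typing import Callable

def answer_visual_question(q: str) -> str:
    return f"[VQA answer for: {q}]"           # would query a multimodal LLM

def draft_ad_script(instruction: str) -> str:
    return f"[draft AD for: {instruction}]"   # would generate a script

def edit_ad_segment(instruction: str) -> str:
    return f"[edited AD per: {instruction}]"  # would modify an existing cue

INTENTS: dict[str, Callable[[str], str]] = {
    "vqa": answer_visual_question,   # "What is on screen at 0:42?"
    "draft": draft_ad_script,        # "Describe this scene for me."
    "edit": edit_ad_segment,         # "Shorten the second description."
}

def handle_turn(utterance: str, classify: Callable[[str], str]) -> str:
    """Classify the creator's request and dispatch it; the creator then
    verifies and edits the result in the screen-reader-accessible editor."""
    return INTENTS[classify(utterance)](utterance)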

Authors
Franklin Mingzhe Li
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Michael Xieyang Liu
Google DeepMind, Pittsburgh, Pennsylvania, United States
Cynthia L. Bennett
Google, New York, New York, United States
Shaun K. Kane
Google Research, Boulder, Colorado, United States
Deaf and Hard of Hearing Access to Intelligent Personal Assistants: Comparison of Voice-Based Options with an LLM-Powered Touch Interface
Abstract

We investigate the accessibility of intelligent personal assistants (IPAs) for deaf and hard of hearing (DHH) people who can use their voice in everyday communication. The inability of IPAs to understand diverse accents, including deaf speech, renders them largely inaccessible to non-signing, speaking DHH individuals. Using an Echo Show, we compared the usability of natural language input via two spoken English methods against that of a large language model (LLM)-assisted touch interface in a mixed-methods study. The two spoken English methods consisted of Alexa's built-in automatic speech recognition and a Wizard-of-Oz setting with a trained facilitator re-speaking commands. The touch method was navigated through an LLM-powered ‘task prompter,’ which integrated the user's history and smart environment to suggest contextually appropriate commands. Quantitative results showed no significant differences between the two spoken English conditions and LLM-assisted touch. Qualitative results showed variability in opinions on the usability of each method. Ultimately, IPAs will need to recognize deaf-accented speech robustly and natively.
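
To make the ‘task prompter’ idea concrete, here is a hedged Python sketch of how interaction history and smart-environment state might be folded into a suggestion prompt; llm_complete is a hypothetical stand-in for a real model endpoint, and the prompt wording is invented for illustration.

import json

def suggest_commands(history: list[str], devices: dict[str, str],
                     llm_complete) -> list[str]:
    """Build a context-rich prompt and return tappable command suggestions."""
    prompt = (
        "Recent commands: " + json.dumps(history) + "\n"
        "Device states: " + json.dumps(devices) + "\n"
        "Suggest 3 short smart-home commands the user is likely to "
        "want next, one per line."
    )
    return llm_complete(prompt).strip().splitlines()

# Canned stand-in for the model, just to show the call shape.
fake_llm = lambda p: "Turn off the living room lights\nSet a 10 minute timer"
print(suggest_commands(["lights on"], {"living_room_light": "on"}, fake_llm))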

Authors
Paige S. DeVries
Gallaudet University, Washington, District of Columbia, United States
Michaela Okosi
Gallaudet University, Washington, District of Columbia, United States
Ming Li
Gallaudet University, Washington, District of Columbia, United States
Nora Dunphy
University of California Berkeley, Berkeley, California, United States
Gidey Gezae
Pennsylvania State University, State College, Pennsylvania, United States
Dante Conway
Gallaudet University, Washington, District of Columbia, United States
Abraham Glasser
Gallaudet University, Washington, District of Columbia, United States
Raja Kushalnagar
Gallaudet University, Washington, District of Columbia, United States
Christian Vogler
Gallaudet University, Washington, District of Columbia, United States
Reimagining Sign Language Technologies: Analyzing Translation Work of Chinese Deaf Online Content Creators
Abstract

While sign language translation systems promise to enhance deaf people's access to information and communication, they have been met with strong skepticism from deaf communities due to risks of misrepresenting and oversimplifying the richness of signed communication in technologies. This article provides empirical evidence of the complexity of translation work involved in deaf communication through interviews with 13 deaf Chinese content creators who actively produce and share sign language content on video sharing platforms with both deaf and hearing audiences. By studying this unique group of content creators, our findings highlight the nuances of sign language translation, showing how deaf creators create content with multilingualism and multiculturalism in mind, support meaning making across languages and cultures, and navigate politics involved in their translation work. Grounded in these deaf-led translation practices, we draw on the sociolinguistic concept of (trans)languaging to re-conceptualize and reimagine the design of sign language translation systems.

Award
Honorable Mention
Authors
Xinru Tang
University of California, Irvine, Irvine, California, United States
Anne Marie Piper
University of California, Irvine, Irvine, California, United States