Supporting Accessibility of Text, Image and Video B

Conference Name
CHI 2024
Caption Royale: Exploring the Design Space of Affective Captions from the Perspective of Deaf and Hard-of-Hearing Individuals
Abstract

Affective captions employ visual typographic modulations to convey a speaker's emotions, improving speech accessibility for Deaf and Hard-of-Hearing (DHH) individuals. However, the most effective visual modulations for expressing emotions remain uncertain. Bridging this gap, we ran three studies with 39 DHH participants, exploring the design space of affective captions, which include parameters like text color, boldness, size, and so on. Study 1 assessed preferences for nine of these styles, each conveying either valence or arousal separately. Study 2 combined Study 1's top-performing styles and measured preferences for captions depicting both valence and arousal simultaneously. Participants outlined readability, minimal distraction, intuitiveness, and emotional clarity as key factors behind their choices. In Study 3, these factors and an emotion-recognition task were used to compare how Study 2's winning styles performed versus a non-styled baseline. Based on our findings, we present the two best-performing styles as design recommendations for applications employing affective captions.
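
For readers unfamiliar with affective captioning, the sketch below shows, in Python, one way the general idea could be wired up: a speaker's valence and arousal are mapped to typographic parameters such as color, size, and weight. The specific mappings and value ranges are illustrative assumptions, not the nine styles evaluated in the paper.

```python
# Hypothetical affective-caption styler: maps valence and arousal (both
# roughly in [-1, 1]) to typographic parameters. Mappings are illustrative
# assumptions, not the styles studied in the paper.
from dataclasses import dataclass


@dataclass
class CaptionStyle:
    color: str         # hex color applied to the caption text
    font_size_px: int
    bold: bool


def style_for_emotion(valence: float, arousal: float) -> CaptionStyle:
    """Map valence (negative..positive) to a cool..warm color and arousal
    (calm..excited) to size and weight. Purely illustrative mappings."""
    v = max(-1.0, min(1.0, valence))
    a = max(0.0, min(1.0, arousal))
    red = int(round(127.5 + 127.5 * v))   # -1 -> blue-ish, +1 -> orange-ish
    blue = 255 - red
    color = f"#{red:02x}80{blue:02x}"
    font_size = int(16 + 8 * a)           # higher arousal -> larger text
    return CaptionStyle(color=color, font_size_px=font_size, bold=a > 0.5)


print(style_for_emotion(valence=0.8, arousal=0.9))   # excited, positive speech
print(style_for_emotion(valence=-0.6, arousal=0.2))  # subdued, negative speech
```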

Award
Honorable Mention
Authors
Caluã de Lacerda Pataca
Rochester Institute of Technology, Rochester, New York, United States
Saad Hassan
Tulane University, New Orleans, Louisiana, United States
Nathan Tinker
Rochester Institute of Technology, Rochester, New York, United States
Roshan L. Peiris
Rochester Institute of Technology, Rochester, New York, United States
Matt Huenerfauth
Rochester Institute of Technology, Rochester, New York, United States
Paper URL

doi.org/10.1145/3613904.3642258

Video
SPICA: Interactive Video Content Exploration through Augmented Audio Descriptions for Blind or Low-Vision Viewers
Abstract

Blind or Low-Vision (BLV) users often rely on audio descriptions (AD) to access video content. However, conventional static ADs can leave out detailed information in videos, impose a high mental load, neglect the diverse needs and preferences of BLV users, and lack immersion. To tackle these challenges, we introduce SPICA, an AI-powered system that enables BLV users to interactively explore video content. Informed by prior empirical studies on BLV video consumption, SPICA offers novel interactive mechanisms for supporting temporal navigation of frame captions and spatial exploration of objects within key frames. Leveraging an audio-visual machine learning pipeline, SPICA augments existing ADs by adding interactivity, spatial sound effects, and individual object descriptions without requiring additional human annotation. Through a user study with 14 BLV participants, we evaluated the usability and usefulness of SPICA and explored user behaviors, preferences, and mental models when interacting with augmented ADs.
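
As a rough illustration of one element the abstract mentions, the Python sketch below pans a short audio cue left or right according to an object's horizontal position in a video frame, a minimal form of spatial sound feedback. It is an assumption-laden stand-in, not SPICA's actual audio-visual pipeline; object detection and description generation are left out.

```python
# Illustrative only: constant-power stereo panning keyed to an object's
# horizontal position in the frame. Not the authors' pipeline.
import numpy as np


def pan_stereo(mono: np.ndarray, x_center: float, frame_width: float) -> np.ndarray:
    """Objects near the left edge of the frame play mostly in the left
    channel, objects near the right edge in the right channel."""
    position = float(np.clip(x_center / frame_width, 0.0, 1.0))  # 0 = left, 1 = right
    angle = position * np.pi / 2
    left_gain, right_gain = np.cos(angle), np.sin(angle)
    return np.stack([mono * left_gain, mono * right_gain], axis=1)


# Example: a 0.2 s, 440 Hz cue for an object detected around x = 400 in a 1920 px frame.
sample_rate = 44_100
t = np.linspace(0.0, 0.2, int(0.2 * sample_rate), endpoint=False)
cue = 0.3 * np.sin(2 * np.pi * 440 * t)
stereo_cue = pan_stereo(cue, x_center=400, frame_width=1920)
print(stereo_cue.shape)  # (8820, 2) -- ready to hand to an audio playback library
```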

Authors
Zheng Ning
University of Notre Dame, Notre Dame, Indiana, United States
Brianna L. Wimer
University of Notre Dame, South Bend, Indiana, United States
Kaiwen Jiang
Beijing Jiaotong University, Beijing, China
Keyi Chen
University of California San Diego, San Diego, California, United States
Jerrick Ban
University of Notre Dame, Notre Dame, Indiana, United States
Yapeng Tian
University of Texas at Dallas, Richardson, Texas, United States
Yuhang Zhao
University of Wisconsin-Madison, Madison, Wisconsin, United States
Toby Jia-Jun Li
University of Notre Dame, Notre Dame, Indiana, United States
Paper URL

doi.org/10.1145/3613904.3642632

Video
An AI-Resilient Text Rendering Technique for Reading and Skimming Documents
Abstract

Readers find text difficult to consume for many reasons. Summarization can address some of these difficulties, but introduce others, such as omitting, misrepresenting, or hallucinating information, which can be hard for a reader to notice. One approach to addressing this problem is to instead modify how the original text is rendered to make important information more salient. We introduce Grammar-Preserving Text Saliency Modulation (GP-TSM), a text rendering method with a novel means of identifying what to de-emphasize. Specifically, GP-TSM uses a recursive sentence compression method to identify successive levels of detail beyond the core meaning of a passage, which are de-emphasized by rendering words in successively lighter but still legible gray text. In a lab study (n=18), participants preferred GP-TSM over pre-existing word-level text rendering methods and were able to answer GRE reading comprehension questions more efficiently.
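
The rendering step described above lends itself to a small sketch: given per-word levels of detail (hand-written here as stand-ins for the recursive sentence compression model), words are emitted as HTML spans in successively lighter, still-legible shades of gray. The exact gray values are illustrative assumptions rather than GP-TSM's published parameters.

```python
# Sketch of the graded-gray rendering idea; level assignments and gray
# values are assumptions, not GP-TSM's actual model or parameters.
from html import escape


def render_graded_gray(words: list[tuple[str, int]], max_level: int = 3) -> str:
    """Render level-0 words in black and deeper levels in progressively
    lighter gray, capped so de-emphasized text stays legible."""
    spans = []
    for word, level in words:
        level = min(level, max_level)
        gray = int(160 * level / max_level)  # 0 (black) .. 160 (light but legible)
        color = f"#{gray:02x}{gray:02x}{gray:02x}"
        spans.append(f'<span style="color:{color}">{escape(word)}</span>')
    return " ".join(spans)


# Levels here are hand-labeled; in the paper they come from recursive sentence compression.
sentence = [("Readers", 0), ("often", 2), ("find", 0), ("dense", 1),
            ("text", 0), ("difficult", 0), ("to", 0), ("consume", 0)]
print(render_graded_gray(sentence))
```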

Authors
Ziwei Gu
Harvard University, Cambridge, Massachusetts, United States
Ian Arawjo
Harvard University, Cambridge, Massachusetts, United States
Kenneth Li
Harvard University, Cambridge, Massachusetts, United States
Jonathan K. Kummerfeld
The University of Sydney, Sydney, NSW, Australia
Elena L. Glassman
Harvard University, Allston, Massachusetts, United States
Paper URL

doi.org/10.1145/3613904.3642699

Video
Investigating Use Cases of AI-Powered Scene Description Applications for Blind and Low Vision People
Abstract

“Scene description” applications that describe visual content in a photo are useful daily tools for blind and low vision (BLV) people. Researchers have studied their use, but they have only explored those that leverage remote sighted assistants; little is known about applications that use AI to generate their descriptions. Thus, to investigate their use cases, we conducted a two-week diary study where 16 BLV participants used an AI-powered scene description application we designed. Through their diary entries and follow-up interviews, users shared their information goals and assessments of the visual descriptions they received. We analyzed the entries and found frequent use cases, such as identifying visual features of known objects, and surprising ones, such as avoiding contact with dangerous objects. We also found users scored the descriptions relatively low on average, 2.7 out of 5 (SD=1.5) for satisfaction and 2.4 out of 4 (SD=1.2) for trust, showing that descriptions still need significant improvements to deliver satisfying and trustworthy experiences. We discuss future opportunities for AI as it becomes a more powerful accessibility tool for BLV users.

Authors
Ricardo E. Gonzalez Penuela
Cornell Tech, Cornell University, New York, New York, United States
Jazmin Collins
Cornell University, Ithaca, New York, United States
Cynthia L. Bennett
Google, New York, New York, United States
Shiri Azenkot
Cornell Tech, New York, New York, United States
Paper URL

doi.org/10.1145/3613904.3642211

Video
From Provenance to Aberrations: Image Creator and Screen Reader User Perspectives on Alt Text for AI-Generated Images
Abstract

AI-generated images are proliferating as a new visual medium. However, state-of-the-art image generation models do not output alternative (alt) text with their images, rendering them largely inaccessible to screen reader users (SRUs). Moreover, less is known about what information would be most desirable to SRUs in this new medium. To address this, we invited AI image creators and SRUs to evaluate alt text prepared from various sources and write their own alt text for AI images. Our mixed-methods analysis makes three contributions. First, we highlight creators’ perspectives on alt text, as creators are well-positioned to write descriptions of their images. Second, we illustrate SRUs’ alt text needs particular to the emerging medium of AI images. Finally, we discuss the promises and pitfalls of utilizing text prompts written as input for AI models in alt text generation, and areas where broader digital accessibility guidelines could expand to account for AI images.

Authors
Maitraye Das
Northeastern University, Boston, Massachusetts, United States
Alexander J. Fiannaca
Google, Seattle, Washington, United States
Meredith Ringel Morris
Google DeepMind, Seattle, Washington, United States
Shaun K.. Kane
Google Research, Boulder, Colorado, United States
Cynthia L. Bennett
Google, New York, New York, United States
Paper URL

doi.org/10.1145/3613904.3642325

Video