Visual Captions: Augmenting Verbal Communication with On-the-fly Visuals

Abstract

Video conferencing solutions like Zoom, Google Meet, and Microsoft Teams are becoming increasingly popular for facilitating conversations, and recent advancements such as live captioning help people better understand each other. We believe that the addition of visuals based on the context of conversations could further improve comprehension of complex or unfamiliar concepts. To explore the potential of such capabilities, we conducted a formative study through remote interviews (N=10) and crowdsourced a dataset of over 1500 sentence-visual pairs across a wide range of contexts. These insights informed Visual Captions, a real-time system that integrates with a videoconferencing platform to enrich verbal communication. Visual Captions leverages a fine-tuned large language model to proactively suggest relevant visuals in open-vocabulary conversations. We present the findings from a lab study (N=26) and an in-the-wild case study (N=10), demonstrating how Visual Captions can help improve communication through visual augmentation in various scenarios.

Authors
Xingyu "Bruce" Liu
UCLA, Los Angeles, California, United States
Vladimir Kirilyuk
Google, Mountain View, California, United States
Xiuxiu Yuan
Google, Mountain View, California, United States
Alex Olwal
Google Inc., Mountain View, California, United States
Peggy Chi
Google Research, Mountain View, California, United States
Xiang 'Anthony' Chen
UCLA, Los Angeles, California, United States
Ruofei Du
Google, San Francisco, California, United States
Paper URL

https://doi.org/10.1145/3544548.3581566

Video

Conference: CHI 2023

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2023.acm.org/)

Session: Communication and Social Good

Hall G2
6 presentations
2023-04-26 23:30:00 to 2023-04-27 00:55:00