PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data

Abstract

Audio-visual learning seeks to enhance the computer's multi-modal perception by leveraging the correlation between the auditory and visual modalities. Despite their usefulness in many downstream tasks, such as video retrieval, AR/VR, and accessibility, the performance and adoption of existing audio-visual models have been impeded by the scarcity of high-quality datasets. Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we designed and developed an efficient audio-visual annotation tool called Peanut. Peanut's human-AI collaborative pipeline separates the multi-modal task into two single-modal tasks, and utilizes state-of-the-art object detection and sound-tagging models to reduce the annotators' effort to process each frame and the number of manually-annotated frames needed. A within-subject user study with 20 participants found that Peanut can significantly accelerate the audio-visual data annotation process while maintaining high annotation accuracy.
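The paper itself details the pipeline; purely as an illustration of the idea described in the abstract, the sketch below shows one way a two-model pre-annotation step could be wired together: run an object detector on a frame and a sound tagger on the accompanying audio, then cross-reference the two single-modal outputs to propose candidate "sounding object" boxes for a human annotator to confirm. The `Detection` class, the `detect_objects` and `tag_sounds` stubs, and the label-matching rule are hypothetical placeholders, not Peanut's actual models or interfaces.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the two single-modal models mentioned in the
# abstract (an object detector and a sound tagger); a real system would plug
# in pretrained models here.
@dataclass
class Detection:
    label: str            # e.g. "dog"
    box: tuple            # (x1, y1, x2, y2) in pixels
    score: float

def detect_objects(frame) -> list[Detection]:
    """Placeholder visual model: return labeled boxes for one video frame."""
    return [Detection("dog", (40, 60, 210, 300), 0.92),
            Detection("person", (250, 30, 400, 310), 0.88)]

def tag_sounds(audio_clip) -> list[tuple[str, float]]:
    """Placeholder audio model: return (sound label, confidence) pairs."""
    return [("dog bark", 0.81), ("speech", 0.40)]

def propose_annotations(frame, audio_clip, min_score=0.5):
    """Cross-reference the two single-modal outputs: a detected box is
    proposed as a sounding object if its label appears in a confident
    sound tag. Proposals are flagged for human review, not auto-accepted."""
    detections = detect_objects(frame)
    tags = [label for label, conf in tag_sounds(audio_clip) if conf >= min_score]
    proposals = []
    for det in detections:
        if det.score >= min_score and any(det.label in tag for tag in tags):
            proposals.append({"box": det.box, "label": det.label,
                              "needs_human_review": True})
    return proposals

if __name__ == "__main__":
    # An annotator would confirm or correct these proposals instead of
    # drawing every box from scratch, reducing per-frame effort.
    print(propose_annotations(frame=None, audio_clip=None))
```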

Authors
Zheng Zhang
University of Notre Dame, Notre Dame, Indiana, United States
Zheng Ning
University of Notre Dame, Notre Dame, Indiana, United States
Chenliang Xu
University of Rochester, Rochester, New York, United States
Yapeng Tian
University of Texas at Dallas, Richardson, Texas, United States
Toby Jia-Jun Li
University of Notre Dame, Notre Dame, Indiana, United States
Paper URL

https://doi.org/10.1145/3586183.3606776

Video

Conference: UIST 2023

ACM Symposium on User Interface Software and Technology

Session: Masterful Media: Audio and Video Authoring Tools

Gold Room
6 presentations
2023-10-30 23:20:00 – 2023-10-31 00:40:00