Masterful Media: Audio and Video Authoring Tools

Conference Name
UIST 2023
Automated Conversion of Music Videos into Lyric Videos
Abstract

Musicians and fans often produce lyric videos, a form of music video that showcases the song's lyrics, for their favorite songs. However, making such videos can be challenging and time-consuming, as the lyrics need to be added in synchrony and visual harmony with the video. Informed by prior work and close examination of existing lyric videos, we propose a set of design guidelines to help creators make such videos. Our guidelines ensure the readability of the lyric text while maintaining a unified focus of attention. We instantiate these guidelines in a fully automated pipeline that converts an input music video into a lyric video. We demonstrate the robustness of our pipeline by generating lyric videos from a diverse range of input sources. A user study shows that lyric videos generated by our pipeline are effective in maintaining text readability and unifying the focus of attention.
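To make the kind of conversion the abstract describes concrete, the sketch below shows only the most basic step of such a pipeline: burning pre-timed lyric lines onto video frames with OpenCV. It is an illustrative sketch, not the authors' pipeline; the file names, lyric timings, and fixed placement are assumptions, and the paper's guidelines for readability and focus of attention are exactly what this naive version omits.

    # Illustrative sketch only (not the paper's pipeline): overlay pre-timed
    # lyric lines onto a music video with OpenCV. File names and timings are
    # hypothetical; output is video-only (audio would be muxed back separately).
    import cv2

    LYRICS = [(12.0, 15.5, "first lyric line"),   # (start_sec, end_sec, text)
              (15.5, 19.0, "second lyric line")]

    def overlay_lyrics(in_path="music_video.mp4", out_path="lyric_video.mp4"):
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        frame_idx = 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            t = frame_idx / fps
            for start, end, text in LYRICS:
                if start <= t < end:
                    # Fixed lower-left placement; the paper's guidelines instead
                    # choose placement and styling to keep the text readable and
                    # the focus of attention unified.
                    cv2.putText(frame, text, (w // 10, int(h * 0.9)),
                                cv2.FONT_HERSHEY_SIMPLEX, 1.2, (255, 255, 255), 2)
            out.write(frame)
            frame_idx += 1
        cap.release()
        out.release()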

Authors
Jiaju Ma
Stanford University, Stanford, California, United States
Anyi Rao
Stanford University, Stanford, California, United States
Li-Yi Wei
Adobe Research, San Jose, California, United States
Rubaiat Habib Kazi
Adobe Research, Seattle, Washington, United States
Hijung Valentina Shin
Adobe Research, Cambridge, Massachusetts, United States
Maneesh Agrawala
Stanford University, Stanford, California, United States
Paper URL

https://doi.org/10.1145/3586183.3606757

Video
Mirrorverse: Live Tailoring of Video Conferencing Interfaces
Abstract

How can we let users adapt video-based meetings as easily as they rearrange furniture in a physical meeting room? We describe a design space for video conferencing systems that includes a five-step "ladder of tailorability," from minor adjustments to live reprogramming of the interface. We then present Mirrorverse and show how it applies the principles of computational media to support live tailoring of video conferencing interfaces to accommodate highly diverse meeting situations. We present multiple use scenarios, including a virtual workshop, an online yoga class, and a stand-up team meeting, to evaluate the approach and demonstrate its potential for new, remote meetings with fluid transitions across activities.

Authors
Jens Emil Sloth Grønbæk
Aarhus University, Aarhus, Denmark
Marcel Borowski
Aarhus University, Aarhus, Denmark
Eve Hoggan
Computer Science, Aarhus University, Aarhus, Denmark
Wendy E. Mackay
Inria, Paris, France
Michel Beaudouin-Lafon
Université Paris-Saclay, Orsay, France
Clemens Nylandsted Klokmose
Aarhus University, Aarhus, Denmark
Paper URL

https://doi.org/10.1145/3586183.3606767

Video
Papeos: Augmenting Research Papers with Talk Videos
Abstract

Research consumption has been traditionally limited to the reading of academic papers: a static, dense, and formally written format. Alternatively, pre-recorded conference presentation videos, which are more dynamic, concise, and colloquial, have recently become more widely available but remain potentially under-utilized. In this work, we explore the design space and benefits of combining academic papers and talk videos, leveraging their complementary nature to provide a rich and fluid research consumption experience. Based on formative and co-design studies, we present Papeos, a novel reading and authoring interface that allows authors to augment their papers by segmenting and localizing talk videos alongside relevant paper passages, with automatically generated suggestions. With Papeos, readers can visually skim a paper through clip thumbnails and fluidly switch between consuming dense text in the paper or visual summaries in the video. In a comparative lab study (n=16), Papeos reduced mental load, scaffolded navigation, and facilitated more comprehensive reading of papers.
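As a rough illustration of how talk-video segments might be localized alongside relevant paper passages, the sketch below links each transcript segment to its most similar paragraph using TF-IDF cosine similarity. This is a hedged stand-in, not Papeos' suggestion algorithm; the paragraphs, transcript segments, and timestamps are invented placeholders.

    # Illustrative sketch only (not Papeos' suggestion algorithm): anchor each
    # talk transcript segment to the most similar paper paragraph.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    paper_paragraphs = ["We propose a new reading interface ...",
                        "Our study had sixteen participants ...",
                        "Results show reduced mental load ..."]
    video_segments = [(0, 45, "in this talk we present a new reading interface"),
                      (45, 90, "we ran a lab study with sixteen participants")]

    vectorizer = TfidfVectorizer(stop_words="english")
    para_vecs = vectorizer.fit_transform(paper_paragraphs)

    links = []
    for start, end, transcript in video_segments:
        scores = cosine_similarity(vectorizer.transform([transcript]), para_vecs)[0]
        best = int(scores.argmax())
        links.append({"clip": (start, end), "paragraph": best, "score": float(scores[best])})

    print(links)  # each clip anchored to its best-matching paragraph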

Authors
Tae Soo Kim
KAIST, Daejeon, Korea, Republic of
Matt Latzke
Allen Institute for AI, Seattle, Washington, United States
Jonathan Bragg
Allen Institute for Artificial Intelligence, Seattle, Washington, United States
Amy X. Zhang
University of Washington, Seattle, Washington, United States
Joseph Chee Chang
Allen Institute for AI, Seattle, Washington, United States
Paper URL

https://doi.org/10.1145/3586183.3606770

Video
Video2Action: Reducing Human Interactions in Action Annotation of App Tutorial Videos
Abstract

Tutorial videos of mobile apps have become a popular and compelling way for users to learn unfamiliar app features. To make these videos accessible to users, video creators need to annotate the actions in the video, including what actions are performed and where to tap. However, this process can be time-consuming and labor-intensive. In this paper, we introduce a lightweight approach, Video2Action, that automatically generates the action scenes and predicts the action locations in the video using image-processing and deep-learning methods. Automated experiments demonstrate the good performance of Video2Action in acquiring actions from videos, and a user study shows the usefulness of the generated action cues in assisting video creators with action annotation.
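For flavor, the sketch below shows one simple image-processing signal that could feed such a pipeline: frame differencing between consecutive frames of a screen recording to localize the region an action changed. It is an illustration under that assumption, not Video2Action's method, which combines image processing with deep-learning models.

    # Minimal sketch, not Video2Action itself: find the screen region that changed
    # between two consecutive tutorial-video frames, as a crude proxy for where an
    # action took effect.
    import cv2

    def candidate_action_region(prev_frame, next_frame, thresh=25):
        """Return the bounding box (x, y, w, h) of the largest changed region, or None."""
        prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
        next_gray = cv2.cvtColor(next_frame, cv2.COLOR_BGR2GRAY)
        diff = cv2.absdiff(prev_gray, next_gray)
        _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        return cv2.boundingRect(max(contours, key=cv2.contourArea))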

Authors
Sidong Feng
Monash University, Melbourne, Victoria, Australia
Chunyang Chen
Monash University, Melbourne, Victoria, Australia
Zhenchang Xing
CSIRO's Data61 and Australian National University, Acton, ACT, Australia
Paper URL

https://doi.org/10.1145/3586183.3606778

Video
PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data
Abstract

Audio-visual learning seeks to enhance the computer’s multi-modal perception by leveraging the correlation between the auditory and visual modalities. Despite many useful downstream tasks, such as video retrieval, AR/VR, and accessibility, the performance and adoption of existing audio-visual models have been impeded by the limited availability of high-quality datasets. Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we designed and developed an efficient audio-visual annotation tool called Peanut. Peanut’s human-AI collaborative pipeline separates the multi-modal task into two single-modal tasks, and utilizes state-of-the-art object detection and sound-tagging models to reduce the annotators’ effort to process each frame and the number of manually annotated frames needed. A within-subject user study with 20 participants found that Peanut can significantly accelerate the audio-visual data annotation process while maintaining high annotation accuracy.
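The sketch below illustrates the general shape of that decomposition: pairing the output of a single-modal object detector with that of a single-modal sound tagger to propose audio-visual annotations for human review. It is a hedged illustration, not Peanut's pipeline; the detection and tag structures are invented placeholders.

    # Hedged sketch of the decomposition idea (not Peanut's actual pipeline):
    # merge object-detection and sound-tagging results into candidate
    # audio-visual annotations for a human annotator to confirm or correct.
    def propose_annotations(object_detections, sound_tags):
        """object_detections: [{"label": str, "box": (x, y, w, h)}] for one frame.
        sound_tags: [{"label": str, "confidence": float}] for the same time window.
        Returns candidate (sounding object, box) pairs for human review."""
        candidates = []
        for tag in sound_tags:
            for det in object_detections:
                if tag["label"] == det["label"]:
                    candidates.append({"label": tag["label"],
                                       "box": det["box"],
                                       "confidence": tag["confidence"]})
        return candidates

    # Example: a dog detected visually and also tagged acoustically.
    print(propose_annotations([{"label": "dog", "box": (120, 80, 200, 150)}],
                              [{"label": "dog", "confidence": 0.91}]))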

Authors
Zheng Zhang
University of Notre Dame, Notre Dame, Indiana, United States
Zheng Ning
University of Notre Dame, Notre Dame, Indiana, United States
Chenliang Xu
University of Rochester, Rochester, New York, United States
Yapeng Tian
University of Texas at Dallas, Richardson, Texas, United States
Toby Jia-Jun Li
University of Notre Dame, Notre Dame, Indiana, United States
Paper URL

https://doi.org/10.1145/3586183.3606776

Video
Soundify: Matching Sound Effects to Video
Abstract

In the art of video editing, sound helps add character to an object and immerse the viewer within a space. Through formative interviews with professional editors (N=10), we found that the task of adding sounds to video can be challenging. This paper presents Soundify, a system that assists editors in matching sounds to video. Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio. In a human evaluation study (N=889), we show that Soundify is capable of matching sounds to video out-of-the-box for a diverse range of audio categories. In a within-subjects expert study (N=12), we demonstrate the usefulness of Soundify in helping video editors match sounds to video with lighter workload, reduced task completion time, and improved usability.
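To give a feel for the "dynamically adjusts panning and volume" part of that description, the sketch below applies a constant-power pan law driven by an object's horizontal position on screen. It is a minimal illustration under assumed inputs, not Soundify's implementation.

    # Minimal sketch, not Soundify's implementation: pan a mono sound effect in
    # stereo according to the matched object's horizontal screen position,
    # using a constant-power pan law. Inputs are hypothetical.
    import numpy as np

    def pan_gains(x_norm):
        """x_norm: horizontal position, 0.0 = left edge of frame, 1.0 = right edge."""
        angle = x_norm * np.pi / 2
        return np.cos(angle), np.sin(angle)   # (left gain, right gain)

    def spatialize(mono, positions, hop=1024):
        """Apply per-block panning; positions[i] is x_norm for the i-th hop-sized block."""
        out = np.zeros((len(mono), 2))
        for i, x in enumerate(positions):
            left, right = pan_gains(x)
            block = slice(i * hop, min((i + 1) * hop, len(mono)))
            out[block, 0] = mono[block] * left
            out[block, 1] = mono[block] * right
        return out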

Authors
David Chuan-En Lin
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Anastasis Germanidis
Runway, New York, New York, United States
Cristóbal Valenzuela
Runway, New York, New York, United States
Yining Shi
Runway, New York, New York, United States
Nikolas Martelaro
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Paper URL

https://doi.org/10.1145/3586183.3606823

Video