Masterful Media: Audio and Video Authoring Tools

https://doi.org/10.1145/3586183.3606767

How can we let users adapt video-based meetings as easily as they rearrange furniture in a physical meeting room? We describe a design space for video conferencing systems that includes a five-step "ladder of tailorability," from minor adjustments to live reprogramming of the interface. We then present Mirrorverse and show how it applies the principles of computational media to support live tailoring of video conferencing interfaces to accommodate highly diverse meeting situations. We present multiple use scenarios, including a virtual workshop, an online yoga class, and a stand-up team meeting, to evaluate the approach and demonstrate its potential for new, remote meetings with fluid transitions across activities.

Aarhus University, Aarhus, Denmark

Computer Science, Aarhus University, Aarhus, Denmark

Inria, Paris, France

Université Paris-Saclay, Orsay, France

Aarhus University, Aarhus, Denmark

https://doi.org/10.1145/3586183.3606770

Research consumption has traditionally been limited to the reading of academic papers, a format that is static, dense, and formally written. Pre-recorded conference presentation videos, which are more dynamic, concise, and colloquial, have recently become more widely available but are potentially under-utilized. In this work, we explore the design space and benefits of combining academic papers and talk videos, leveraging their complementary nature to provide a rich and fluid research consumption experience. Based on formative and co-design studies, we present Papeos, a novel reading and authoring interface that allows authors to augment their papers by segmenting and localizing talk videos alongside relevant paper passages with automatically generated suggestions. With Papeos, readers can visually skim a paper through clip thumbnails, and fluidly switch between consuming dense text in the paper or visual summaries in the video. In a comparative lab study (n=16), Papeos reduced mental load, scaffolded navigation, and facilitated more comprehensive reading of papers.

KAIST, Daejeon, Republic of Korea

Allen Institute for AI, Seattle, Washington, United States

Allen Institute for Artificial Intelligence, Seattle, Washington, United States

University of Washington, Seattle, Washington, United States

Allen Institute for AI, Seattle, Washington, United States

https://doi.org/10.1145/3586183.3606778

Tutorial videos of mobile apps have become a popular and compelling way for users to learn unfamiliar app features. To make a video accessible to users, video creators need to annotate the actions in the video, including what actions are performed and where to tap. However, this process can be time-consuming and labor-intensive. In this paper, we introduce a lightweight approach, Video2Action, that automatically generates the action scenes and predicts the action locations from the video using image-processing and deep-learning methods. Automated experiments demonstrate that Video2Action performs well in acquiring actions from videos, and a user study shows the usefulness of our generated action cues in assisting video creators with action annotation.

Monash University, Melbourne, Victoria, Australia

CSIRO's Data61 and Australian National University, Acton, ACT, Australia

https://doi.org/10.1145/3586183.3606776

Audio-visual learning seeks to enhance the computer’s multi-modal perception by leveraging the correlation between the auditory and visual modalities. Despite many useful downstream tasks, such as video retrieval, AR/VR, and accessibility, the performance and adoption of existing audio-visual models have been impeded by the limited availability of high-quality datasets. Annotating audio-visual datasets is laborious, expensive, and time-consuming. To address this challenge, we designed and developed an efficient audio-visual annotation tool called Peanut. Peanut’s human-AI collaborative pipeline separates the multi-modal task into two single-modal tasks, and utilizes state-of-the-art object detection and sound-tagging models to reduce both the annotators’ effort to process each frame and the number of manually annotated frames needed. A within-subject user study with 20 participants found that Peanut can significantly accelerate the audio-visual data annotation process while maintaining high annotation accuracy.

University of Notre Dame, Notre Dame, Indiana, United States

University of Rochester, Rochester, New York, United States

University of Texas at Dallas, Richardson, Texas, United States

University of Notre Dame, Notre Dame, Indiana, United States

https://doi.org/10.1145/3586183.3606823

In the art of video editing, sound helps add character to an object and immerse the viewer within a space. Through formative interviews with professional editors (N=10), we found that the task of adding sounds to video can be challenging. This paper presents Soundify, a system that assists editors in matching sounds to video. Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio. In a human evaluation study (N=889), we show that Soundify is capable of matching sounds to video out-of-the-box for a diverse range of audio categories. In a within-subjects expert study (N=12), we demonstrate the usefulness of Soundify in helping video editors match sounds to video with lighter workload, reduced task completion time, and improved usability.

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

Runway, New York, New York, United States

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States