Storytelling and Presentation

https://doi.org/10.1145/3526113.3545680

Following the prevalence of short-form video, short-form voice content has emerged on social media platforms like Twitter and Facebook. A challenge that creators face is hard constraints on the content length. If the initial recording is not short enough, they need to re-record or edit their content. Both are time-consuming, and the latter, if supported, can have a learning curve. Moreover, creators need to manually create multiple versions to publish content on platforms with different length constraints. To simplify this process, we present ROPE (Record Once, Post Everywhere). Creators can record voice content once, and our system will automatically shorten it to all length limits by removing parts of the recording for each target. We formulate this as a combinatorial optimization problem and propose a novel algorithm that automatically selects optimal sentence combinations from the original content to comply with each length constraint. Creators can customize the algorithmically shortened content by specifying sentences to include or exclude. Our system can also use the user-specified constraints to recompute and provides a new version. We conducted a user study comparing ROPE with a sentence-based manual editing baseline. The results show that ROPE can generate high-quality edits, alleviating the cognitive loads of creators for shortening content. While our system and user study address short-form voice content specifically, we believe that the same concept can also be applied to other media such as video with narration and dialog.

University of Toronto, Toronto, Ontario, Canada

Adobe Research, San Francisco, California, United States

https://doi.org/10.1145/3526113.3545613

Blind people typically access videos via audio descriptions (AD) crafted by sighted describers who comprehend, select, and describe crucial visual content in the videos. 360° video is an emerging storytelling medium that enables immersive experiences that people may not possibly reach in everyday life. However, the omnidirectional nature of 360° videos makes it challenging for describers to perceive the holistic visual content and interpret spatial information that is essential to create immersive ADs for blind people. Through a formative study with a professional describer, we identified key challenges in describing 360° videos and iteratively designed OmniScribe, a system that supports the authoring of immersive ADs for 360° videos. OmniScribe uses AI-generated content-awareness overlays for describers to better grasp 360° video content. Furthermore, OmniScribe enables describers to author spatial AD and immersive labels for blind users to consume the videos immersively with our mobile prototype. In a study with 11 professional and novice describers, we demonstrated the value of OmniScribe in the authoring workflow; and a study with 8 blind participants revealed the promise of immersive AD over standard AD for 360° videos. Finally, we discuss the implications of promoting 360° video accessibility.

National Taiwan University, Taipei, Taiwan

National Taiwan University , Taipei, Taiwan

National Taiwan University, Taipei, Taiwan

National Taiwan University , Taipei , Taiwan

National Taiwan University, Taipei, Taiwan

Audio Description Development Assoication, Taipie, Taiwan

National Taiwan University, Taipei, Taiwan

University of Michigan, Ann Arbor, Michigan, United States

https://doi.org/10.1145/3526113.3545676

Video productions commonly start with a script, especially for talking head videos that feature a speaker narrating to the camera. When the source materials come from a written document -- such as a web tutorial, it takes iterations to refine content from a text article to a spoken dialogue, while considering visual compositions in each scene. We propose Doc2Video, a video prototyping approach that converts a document to interactive scripting with a preview of synthetic talking head videos. Our pipeline decomposes a source document into a series of scenes, each automatically creating a synthesized video of a virtual instructor. Designed for a specific domain -- programming cookbooks, we apply visual elements from the source document, such as a keyword, a code snippet or a screenshot, in suitable layouts. Users edit narration sentences, break or combine sections, and modify visuals to prototype a video in our Editing UI. We evaluated our pipeline with public programming cookbooks. Feedback from professional creators shows that our method provided a reasonable starting point to engage them in interactive scripting for a narrated instructional video.

Google Research, Mountain View, California, United States

Google, Mountain View, California, United States

Google Research, Mountain View, California, United States

Google Research, Pittsburgh, Pennsylvania, United States

Google Research, Mountain View, California, United States

Google, Atlanta, Georgia, United States

https://doi.org/10.1145/3526113.3545702

We present RealityTalk, a system that augments real-time live presentations with speech-driven interactive virtual elements. Augmented presentations leverage embedded visuals and animation for engaging and expressive storytelling. However, existing tools for live presentations often lack interactivity and improvisation, while creating such effects in video editing tools require significant time and expertise. RealityTalk enables users to create live augmented presentations with real-time speech-driven interactions. The user can interactively prompt, move, and manipulate graphical elements through real-time speech and supporting modalities. Based on our analysis of 177 existing video-edited augmented presentations, we propose a novel set of interaction techniques and then incorporated them into RealityTalk. We evaluate our tool from a presenter’s perspective to demonstrate the effectiveness of our system.

University of Calgary, Calgary, Alberta, Canada

Adobe Research, Seattle, Washington, United States

University of Calgary, Calgary, Alberta, Canada

https://doi.org/10.1145/3526113.3545614

To facilitate engaging and nuanced conversations around data, we contribute a touchless approach to interacting directly with visualization in remote presentations. We combine dynamic charts overlaid on a presenter's webcam feed with continuous bimanual hand tracking, demonstrating interactions that highlight and manipulate chart elements appearing in the foreground. These interactions are simultaneously functional and deictic, and some allow for the addition of "rhetorical flourish", or expressive movement used when speaking about quantities, categories, and time intervals. We evaluated our approach in two studies with professionals who routinely deliver and attend presentations about data. The first study considered the presenter perspective, where 12 participants delivered presentations to a remote audience using a presentation environment incorporating our approach. The second study considered the audience experience of 17 participants who attended presentations supported by our environment. Finally, we reflect on observations from these studies and discuss related implications for engaging remote audiences in conversations about data.

University of Michigan, Ann Arbor, Michigan, United States

Simon Fraser University, Surrey, British Columbia, Canada

Tableau Research, Seattle, Washington, United States