ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization
Description

Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, training TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method for creating training labels with a series of human-defined labeling functions, but applying it to TAL is difficult because complex actions must be defined over temporal sequences of video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain their relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming frameworks.
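
To make the key-event idea concrete, here is a minimal sketch of what a ProTAL-style labeling function could look like, assuming per-frame pose keypoints and object detections from off-the-shelf detectors. The node names (right_wrist, nose, cup), the thresholds, and the example "drink" event are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a drag-and-link key-event definition turned into a
# labeling function; data format and thresholds are assumptions.
import math

def key_event_drink(frame):
    """True if the 'hand near cup, cup below face' key event holds in this frame."""
    hand = frame["keypoints"].get("right_wrist")   # (x, y)
    face = frame["keypoints"].get("nose")          # (x, y)
    cup = frame["objects"].get("cup")              # (x, y) box centre
    if hand is None or face is None or cup is None:
        return False
    close = math.dist(hand, cup) < 40              # link constraint: distance
    below = cup[1] > face[1]                       # link constraint: direction
    return close and below

def label_video(frames, min_len=8):
    """Turn per-frame key-event hits into (start, end) pseudo-label spans."""
    spans, start = [], None
    for i, frame in enumerate(frames):
        if key_event_drink(frame):
            start = i if start is None else start
        elif start is not None:
            if i - start >= min_len:
                spans.append((start, i))
            start = None
    if start is not None and len(frames) - start >= min_len:
        spans.append((start, len(frames)))
    return spans
```

Spans produced this way would serve as the noisy action labels consumed by the semi-supervised training step described in the abstract.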

Underground AI? Critical Approaches to Generative Cinema through Amateur Filmmaking
Description

Amateurism (e.g., hobbyist and do-it-yourself making) has long helped human-computer interaction (HCI) scholars map alternatives to status quo technology developments, cultures, and practices. Following the 2023 Hollywood film worker strikes, many scholars, artists, and activists alike have called for alternative approaches to AI that reclaim the apparatus for co-creative and resistant means. Towards this end, we conduct an 11-week diary study with 20 amateur filmmakers producing 15 AI-infused films, investigating the emerging space of generative cinema as a critical technical practice. Our close reading of the films and the filmmakers' reflections on their processes reveals four critical approaches to negotiating AI use in filmmaking: minimization, maximization, compartmentalization, and revitalization. We discuss how these approaches suggest the potential for underground filmmaking cultures to form around AI with critical amateurs reclaiming social control over the creative possibilities.

VidSTR: Automatic Spatiotemporal Retargeting of Speech-Driven Video Compositions
Description

Video editors often record multiple versions of a performance with minor differences. When they add graphics atop one video, they may wish to transfer those assets to another recording, but differences in performance, wording, and timing can cause assets to no longer be aligned with the video content. Fixing this is a time-consuming, manual task. We present a technique that preserves the temporal and spatial alignment of the original composition when automatically retargeting speech-driven video compositions. It can transfer graphics between both similar and dissimilar performances, including those varying in speech and gesture. We use a large language model for transcript-based temporal alignment and integer programming for spatial alignment. Results from retargeting between 51 pairs of performances show that we achieve a temporal alignment success rate of 90% compared to hand-generated ground truth compositions. We demonstrate challenging scenarios, retargeting video compositions across different people, aspect ratios, and languages.
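
As an illustration of the temporal step, the sketch below remaps a graphic's timing from one take to another by matching its anchor phrase against a word-level transcript. VidSTR uses a large language model for this alignment; the fuzzy string match and the data structures here are stand-in assumptions used only to show the timestamp remapping.

```python
# Hypothetical transcript-driven temporal retargeting: word-level transcripts
# with timestamps are assumed (e.g. from a speech recognizer).
from difflib import SequenceMatcher

def best_window(anchor_words, transcript, window):
    """Find the window of `window` words in the transcript most similar to the anchor phrase."""
    anchor = " ".join(anchor_words).lower()
    best_i, best_score = 0, -1.0
    for i in range(max(1, len(transcript) - window + 1)):
        candidate = " ".join(w["text"] for w in transcript[i:i + window]).lower()
        score = SequenceMatcher(None, anchor, candidate).ratio()
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score

def retarget_graphic(graphic, new_transcript):
    """Map a graphic anchored to a phrase in the old take onto the new take's timeline."""
    n = len(graphic["anchor_words"])
    i, score = best_window(graphic["anchor_words"], new_transcript, n)
    if score < 0.6:                     # no confident match: flag for manual fixing
        return None
    start = new_transcript[i]["start"]
    end = new_transcript[min(i + n, len(new_transcript)) - 1]["end"]
    return {**graphic, "start": start, "end": end}
```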

Generating Highlight Videos of a User-Specified Length using Most Replayed Data
Description

A highlight is a short edit of the original video that includes the most engaging moments. Given the rigid timing of TV commercial slots and the length limits of social media uploads, generating highlights of specific lengths is crucial. Previous research on automatic highlight generation often overlooked control over the duration of the final video, producing highlights of arbitrary lengths. We propose a novel system that automatically generates highlights of any user-specified length. Our system leverages Most Replayed Data (MRD), which indicates how frequently each part of a video has been replayed, to gauge the most engaging moments. It then optimizes the final editing path by adjusting internal segment durations. We evaluated the quality of our system's outputs through two user studies, including a comparison with highlights created by human editors. Results show that our system can automatically produce highlights that are indistinguishable in viewing experience from those created by humans.
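
The duration-constrained selection can be viewed as a knapsack problem over candidate segments scored by Most Replayed Data. The sketch below maximises total MRD score under a user-specified duration budget; the paper's optimisation additionally adjusts internal segment durations, which is omitted here, and the segment format is an assumption.

```python
# Hypothetical length-constrained highlight selection (0/1 knapsack over segments).
def select_segments(segments, target_seconds):
    """segments: list of dicts with integer 'duration' (seconds) and 'mrd_score'."""
    budget = int(target_seconds)
    # best[t] = (score, chosen segment indices) using at most t seconds
    best = [(0.0, [])] * (budget + 1)
    for idx, seg in enumerate(segments):
        d, s = int(seg["duration"]), seg["mrd_score"]
        for t in range(budget, d - 1, -1):   # reverse order keeps each segment used at most once
            cand_score = best[t - d][0] + s
            if cand_score > best[t][0]:
                best[t] = (cand_score, best[t - d][1] + [idx])
    return sorted(best[budget][1])           # keep segments in original video order
```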

Where is the Boundary? Understanding How People Recognize and Evaluate Generative AI-extended Videos
Description

The rise of video generative models that produce high-quality content has made it increasingly difficult to discern video authenticity. AI-extended videos, which mix real-world footage with generative content, pose new challenges in distinguishing real from manipulated segments. AI-extended videos might be utilized to deceive humans, but they also have the capacity to assist video creators and offer people novel video experiences.

Despite these concerns, research on how people recognize and evaluate AI-extended videos remains limited. To address this, we conducted a user study in which participants interacted with AI-extended videos on a web-based system, identifying boundaries between raw and generated content, followed by a survey and one-on-one interviews. Our quantitative and qualitative analyses revealed how individuals perceive these videos and the factors influencing their perceptions, evaluations, and attitudes. We believe that these insights will aid the future development of AI-extended video technologies and ecosystems.

VideoDiff: Human-AI Video Co-Creation with Alternatives
Description

To make an engaging video, people sequence interesting moments and add visuals such as B-rolls or text. While video editing requires time and effort, AI has recently shown strong potential to make editing easier through suggestions and automation. A key strength of generative models is their ability to quickly generate multiple variations, but when provided with many alternatives, creators struggle to compare them to find the best fit. We propose VideoDiff, an AI video editing tool designed for editing with alternatives. With VideoDiff, creators can generate and review multiple AI recommendations for each editing process: creating a rough cut, inserting B-rolls, and adding text effects. VideoDiff simplifies comparisons by aligning videos and highlighting differences through timelines, transcripts, and video previews. Creators have the flexibility to regenerate and refine AI suggestions as they compare alternatives. Our study participants (N=12) could easily compare and customize alternatives, creating more satisfying results.
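
As a rough illustration of the transcript-level comparison, the sketch below aligns the transcripts of two alternatives and reports shared, inserted, and removed words; the real system also aligns timelines and video previews, and the data format here is an assumption.

```python
# Hypothetical transcript alignment between two edit alternatives.
from difflib import SequenceMatcher

def diff_transcripts(words_a, words_b):
    """Return (tag, span_a, span_b) tuples with tags 'equal', 'replace', 'delete', 'insert'."""
    matcher = SequenceMatcher(None, words_a, words_b)
    report = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        report.append((tag, " ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return report

# Example: comparing two rough-cut alternatives of the same source footage.
cut_a = "we start with the product demo then show the interview".split()
cut_b = "we start with the interview then show the product demo".split()
for tag, a, b in diff_transcripts(cut_a, cut_b):
    print(f"{tag:8s} | {a!r:40s} | {b!r}")
```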

More Than ‘ticking-a-box’: The Affordances of Short-form Video for Community Reporting to Government
Description

Communication between government agencies and not-for-profits (NFPs) within the local funding sector typically requires the writing and submission of long-form text-based reports. These processes are time- and resource-intensive and require skill in written communication, placing a significant administrative burden on the small, already under-resourced organisations that interact with these programmes. NFPs now have the technical literacy to create rich video content, but little is understood about how video could be used instead of, or alongside, traditional written reports. We present findings from a novel funding acquittal (final report) process that we designed for a government grant programme to explore the affordances of video from the perspective of the grantee. We discuss the affordances of structured short-form video for overcoming the barriers faced by organisations during these reporting processes. We present design considerations for digitally mediated processes that could support the media augmentation of these established workflows.
