VidSTR: Automatic Spatiotemporal Retargeting of Speech-Driven Video Compositions

Abstract

Video editors often record multiple versions of a performance with minor differences. When they add graphics atop one video, they may wish to transfer those assets to another recording, but differences in performance, wording, and timing can cause the assets to fall out of alignment with the video content. Fixing this is a time-consuming, manual task. We present a technique that automatically retargets speech-driven video compositions while preserving the temporal and spatial alignment of the original composition. It can transfer graphics between both similar and dissimilar performances, including those that vary in speech and gesture. We use a large language model for transcript-based temporal alignment and integer programming for spatial alignment. Results from retargeting between 51 pairs of performances show a temporal alignment success rate of 90% compared to hand-generated ground-truth compositions. We demonstrate challenging scenarios, retargeting video compositions across different people, aspect ratios, and languages.
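The abstract describes the temporal stage only at a high level. As a rough illustration of the idea, the sketch below remaps a graphic's time interval by matching its anchor words in the source transcript to words in the target transcript; it uses Python's difflib as a toy stand-in for the paper's LLM-based alignment, and the word lists, timestamps, and asset structure are hypothetical examples, not the authors' data or formulation.

```python
# Illustrative sketch of transcript-based temporal retargeting.
# difflib stands in for the paper's LLM alignment; all data below is made up.
from difflib import SequenceMatcher

def align_words(src_words, tgt_words):
    """Map source word indices to target word indices via matching blocks."""
    sm = SequenceMatcher(a=[w.lower() for w in src_words],
                         b=[w.lower() for w in tgt_words])
    mapping = {}
    for block in sm.get_matching_blocks():
        for k in range(block.size):
            mapping[block.a + k] = block.b + k
    return mapping

def retime_asset(asset, mapping, tgt_times):
    """Move an asset's [start, end] to the matched words' times in the target."""
    start_idx = mapping.get(asset["start_word"])
    end_idx = mapping.get(asset["end_word"])
    if start_idx is None or end_idx is None:
        return None  # no confident match; would fall back to manual placement
    return {**asset, "start": tgt_times[start_idx][0], "end": tgt_times[end_idx][1]}

# Hypothetical transcripts and word-level (start, end) timestamps for the target.
src_words = "today we launch our new product".split()
tgt_words = "so today we are launching our new product".split()
tgt_times = [(0.2, 0.5), (0.5, 0.8), (0.8, 1.0), (1.0, 1.3),
             (1.3, 1.9), (1.9, 2.1), (2.1, 2.4), (2.4, 3.0)]
asset = {"label": "lower third", "start_word": 4, "end_word": 5}  # "new product"

print(retime_asset(asset, align_words(src_words, tgt_words), tgt_times))
```

In this toy example the graphic anchored to "new product" is retimed to that phrase's timestamps in the re-recorded take; the paper's spatial stage (integer programming over graphic placements) is not shown here.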

Authors
Joshua Kong Yang
Brown University, Providence, Rhode Island, United States
Mackenzie Leake
Adobe Research, San Francisco, California, United States
Jeff Huang
Brown University, Providence, Rhode Island, United States
Stephen DiVerdi
Adobe Research, San Francisco, California, United States
DOI

10.1145/3706598.3713857

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713857

Conference: CHI 2025

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2025.acm.org/)

Session: Video Making

G303
7 presentations
2025-04-29 23:10:00 – 2025-04-30 00:40:00