SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation

Abstract

Novice content creators often invest significant time recording expressive speech for social media videos. While recent advancements in text-to-speech (TTS) technology can generate highly realistic speech in various languages and accents, many struggle with unintuitive or overly granular TTS interfaces. We propose simplifying TTS generation by allowing users to specify high-level context alongside their script. Our Wizard-of-Oz system, SpeakEasy, leverages user-provided context to inform and influence TTS output, enabling iterative refinement with high-level feedback. This approach was informed by two 8-subject formative studies: one examining content creators' experiences with TTS, and the other drawing on effective strategies from voice actors. Our evaluation shows that participants using SpeakEasy were more successful in generating performances matching their personal standards, without requiring significantly more effort than leading industry interfaces.

Authors
Stephen Brade
Massachusetts Institute of Technology, Cambridge, Massachusetts, United States
Sam Anderson
Adobe Research, New York, New York, United States
Rithesh Kumar
Adobe Research, Toronto, Ontario, Canada
Zeyu Jin
Adobe Research, San Francisco, California, United States
Anh Truong
Adobe Research, New York, New York, United States
DOI

10.1145/3706598.3714263

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3714263

Video

Conference: CHI 2025

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2025.acm.org/)

Session: Multimodal Interaction

Room G302
7 presentations
2025-04-30 18:00:00
2025-04-30 19:30:00