Choice of Voices: A Large-Scale Evaluation of Text-to-Speech Voice Quality for Long-Form Content

Abstract

The advancement of text-to-speech (TTS) voices and the rise of commercial TTS platforms allow people to easily experience TTS voices across a variety of technologies, applications, and form factors. As such, we evaluated TTS voices for long-form content: not individual words or sentences, but voices that are pleasant to listen to for several minutes at a time. We introduce a method using a crowdsourcing platform and an online survey to evaluate voices based on listening experience, perception of clarity and quality, and comprehension. We evaluated 18 TTS voices, three human voices, and a text-only control condition. We found that TTS voices are close to rivaling human voices, yet no single voice outperforms the others across all evaluation dimensions. We conclude with considerations for selecting text-to-speech voices for long-form content.

Keywords
voice quality
text-to-speech
TTS
voice interface
synthesized speech
long-form
listening experience
Authors
Julia Cambre
Carnegie Mellon University, Pittsburgh, PA, USA
Jessica Colnago
Carnegie Mellon University, Pittsburgh, PA, USA
Jim Maddock
Northwestern University, Evanston, IL, USA
Janice Tsai
Mozilla Corporation, Mountain View, CA, USA
Jofish Kaye
Mozilla Corporation, Mountain View, CA, USA
DOI

10.1145/3313831.3376789

Paper URL

https://doi.org/10.1145/3313831.3376789

Conference: CHI 2020

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2020.acm.org/)

Session: Voice & speech interaction

Paper session
306AB
5 presentations
2020-04-29 01:00:00
2020-04-29 02:15:00