Inspired by the increasing prevalence of digital voice assistants, we demonstrate the feasibility of using voice interfaces to deploy and complete crowd tasks. We have developed Crowd Tasker, a novel system that delivers crowd tasks through a digital voice assistant. In a lab study, we validate our proof-of-concept and show that, for native English speakers, performance on voice-compatible and voice-based crowd tasks completed through a voice assistant is comparable to performance through a web interface. We also report on a field study in which participants used our system in their homes. We find that crowdsourcing through voice can provide greater flexibility to crowd workers by allowing them to work in brief sessions, enabling multi-tasking, and reducing the time and effort required to initiate tasks. We conclude by proposing a set of design guidelines for the creation of crowd tasks for voice and the development of future voice-based crowdsourcing systems.
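To make the interaction pattern concrete, the sketch below shows one way a voice-delivered crowd work session could be structured: fetch a voice-compatible task, read its prompt aloud, capture the spoken answer, and submit it before offering the next task. All names here (CrowdTask, fetch_next_task, submit_answer, and the speak/listen placeholders) are hypothetical illustrations, not the Crowd Tasker implementation, which the abstract does not describe.

```python
# Hypothetical sketch of a voice-delivered crowd work session.
# A real system would sit behind a voice assistant's skill/intent framework
# and a crowdsourcing platform API; these placeholders only show the flow.

from dataclasses import dataclass
from typing import Optional


@dataclass
class CrowdTask:
    task_id: str
    prompt: str        # instruction read aloud to the worker
    answer_type: str   # e.g. "yes_no", "rating", "short_answer"


def fetch_next_task(worker_id: str) -> Optional[CrowdTask]:
    """Placeholder: pull the next voice-compatible task from a task queue."""
    return CrowdTask("t-001", "Is the following review positive or negative? ...", "yes_no")


def speak(text: str) -> None:
    """Placeholder for the assistant's text-to-speech output."""
    print(f"[assistant] {text}")


def listen() -> str:
    """Placeholder for the assistant's speech-recognition result."""
    return input("[worker] ")


def submit_answer(task_id: str, worker_id: str, answer: str) -> None:
    """Placeholder: post the worker's spoken answer back to the platform."""
    print(f"[submit] task={task_id} worker={worker_id} answer={answer!r}")


def run_session(worker_id: str, max_tasks: int = 3) -> None:
    """Run a brief voice work session: prompt, capture, submit, repeat."""
    for _ in range(max_tasks):
        task = fetch_next_task(worker_id)
        if task is None:
            speak("No more tasks right now. Thanks for your work!")
            return
        speak(task.prompt)
        answer = listen()
        submit_answer(task.task_id, worker_id, answer)
        speak("Got it. Here is the next one.")
```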
Voice user interfaces (VUIs) are rapidly increasing in popularity in the consumer space. This has led to a concurrent explosion of available applications for such devices, with many industries rushing to offer voice interactions for their products. This pressure is then transferred to interface designers; however, the large majority of designers have been trained only to handle the usability challenges specific to Graphical User Interfaces (GUIs). Since VUIs differ significantly in design and usability from GUIs, in this paper we investigate the extent to which current educational resources prepare designers to handle the specific challenges of VUI design. To this end, we conducted a preliminary scoping scan and syllabus meta-review of HCI curricula at more than twenty top international HCI departments, revealing that the current offering of VUI design training within HCI education is rather limited. Based on this, we advocate updating HCI curricula to incorporate VUI design and developing VUI-specific pedagogical artifacts to be included in new curricula.
The advancement of text-to-speech (TTS) voices and the rise of commercial TTS platforms allow people to easily experience TTS voices across a variety of technologies, applications, and form factors. As such, we evaluated TTS voices for long-form content: not individual words or sentences, but voices that are pleasant to listen to for several minutes at a time. We introduce a method that uses a crowdsourcing platform and an online survey to evaluate voices based on listening experience, perceived clarity and quality, and comprehension. We evaluated 18 TTS voices, three human voices, and a text-only control condition. We found that TTS voices come close to rivaling human voices, yet no single voice outperforms the others across all evaluation dimensions. We conclude with considerations for selecting text-to-speech voices for long-form content.
We present the first systematic analysis of personality dimensions developed specifically to describe the personality of speech-based conversational agents. Following the psycholexical approach from psychology, we first report on a new multi-method approach to collect potentially descriptive adjectives from 1) a free description task in an online survey (228 unique descriptors), 2) an interaction task in the lab (176 unique descriptors), and 3) a text analysis of 30,000 online reviews of conversational agents (Alexa, Google Assistant, Cortana) (383 unique descriptors). We aggregate the results into a set of 349 adjectives, which are then rated by 744 people in an online survey. A factor analysis reveals that the commonly used Big Five model for human personality does not adequately describe agent personality. As an initial step to developing a personality model, we propose alternative dimensions and discuss implications for the design of agent personalities, personality-aware personalisation, and future research.
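For readers unfamiliar with the analysis step, the snippet below is a minimal sketch of an exploratory factor analysis over adjective ratings, assuming a respondents-by-adjectives ratings matrix. It uses scikit-learn and synthetic data purely for illustration; it is not the authors' analysis pipeline, and the adjective labels are stood in for by column indices.

```python
# Minimal sketch of exploratory factor analysis over adjective ratings
# (synthetic data; not the authors' pipeline).

import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Stand-in for the real survey data: 744 respondents x 349 adjectives, rated 1-5.
ratings = rng.integers(1, 6, size=(744, 349)).astype(float)

# Standardise each adjective's ratings, then fit a factor model.
# The number of factors would normally be chosen via a scree plot or parallel analysis.
X = StandardScaler().fit_transform(ratings)
fa = FactorAnalysis(n_components=5, rotation="varimax")  # rotation needs scikit-learn >= 0.24
fa.fit(X)

# Loadings: how strongly each adjective is associated with each factor.
loadings = fa.components_.T                      # shape: (n_adjectives, n_factors)
top = np.argsort(-np.abs(loadings[:, 0]))[:10]   # adjectives loading most on factor 1
print("Column indices of adjectives loading most on factor 1:", top)
```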
There is widespread concern over the ways speech assistant providers currently use humans to listen to users' queries without their knowledge. We report on two iterations of the TalkBack smart speaker, which transparently combines machine and human assistance. In the first, we created a prototype to investigate whether people would choose to forward their questions to a human answerer if the machine was unable to help. A longitudinal deployment revealed that most users would do so when given the explicit choice. In the second iteration, we extended the prototype to draw upon spoken answers from previous deployments, combining machine efficiency with human richness. Deployment of this second iteration shows that this corpus can help provide relevant, human-created instant responses. We distil lessons learned for those developing conversational agents or other AI-infused systems about how to appropriately enlist human-in-the-loop information services to benefit users, task workers, and system performance.
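As an illustration of the routing behaviour described above, the following sketch tries a machine answer first, then looks for a reusable human answer from previous deployments, and otherwise forwards the question to a human answerer. All names are hypothetical and the similarity lookup is deliberately crude; this is not the TalkBack implementation.

```python
# Illustrative sketch of machine-first, human-in-the-loop query routing
# (hypothetical names; not the TalkBack implementation).

from typing import Dict, Optional


def machine_answer(question: str) -> Optional[str]:
    """Placeholder for the built-in assistant's answer; None if it cannot help."""
    return None


def find_previous_human_answer(question: str, corpus: Dict[str, str],
                               threshold: float = 0.6) -> Optional[str]:
    """Very rough word-overlap lookup over previously recorded spoken answers.
    A real system would use semantic matching rather than Jaccard overlap."""
    q_words = set(question.lower().split())
    best, best_score = None, 0.0
    for past_question, answer in corpus.items():
        p_words = set(past_question.lower().split())
        score = len(q_words & p_words) / max(len(q_words | p_words), 1)
        if score > best_score:
            best, best_score = answer, score
    return best if best_score >= threshold else None


def forward_to_human(question: str) -> str:
    """Placeholder: enqueue the question for a human answerer and acknowledge."""
    return "I don't know, but I've sent your question to a person who can help."


def handle_query(question: str, corpus: Dict[str, str]) -> str:
    answer = machine_answer(question)
    if answer is not None:
        return answer                          # machine handled it
    reused = find_previous_human_answer(question, corpus)
    if reused is not None:
        return reused                          # instant response reusing an earlier human answer
    return forward_to_human(question)          # explicit, transparent hand-off to a human
```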