Voice and Speech

Conference Name
CSCW2021
KinVoices: Using Voices of Friends and Family in Voice Interfaces
Abstract

With voice user interfaces (VUIs) becoming ubiquitous and speech synthesis technology maturing, it is possible to synthesise voices that resemble our friends and relatives (which we collectively call 'kin') and use them in VUIs. However, designing such interfaces and investigating how the familiarity of kin voices affects user perceptions remain under-explored. Our surveys and interviews with 25 users revealed that VUIs using kin voices were perceived as more engaging, persuasive, and safer, yet eerier, than VUIs using common virtual assistant voices. We then developed a technology probe, KinVoice, an Alexa-based VUI that was deployed in 3 households over 2 weeks. Users set reminders using KinVoice, which, in turn, gave the reminders in synthesised kin voices. This allowed us to explore users' needs, uncover the challenges involved, and inspire new applications. We discuss design guidelines for integrating familiar kin voices into VUIs, applications that benefit from their usage, and implications for balancing voice realism and usability with security and diversification.

Authors
Sam W. T. Chan
Auckland Bioengineering Institute, The University of Auckland, Auckland, New Zealand
Tamil Selvan Gunasekaran
The University of Auckland, Auckland, New Zealand
Yun Suen Pai
Auckland Bioengineering Institute, The University of Auckland, Auckland, New Zealand
Haimo Zhang
Auckland Bioengineering Institute, The University of Auckland, Auckland, New Zealand
Suranga Nanayakkara
Auckland Bioengineering Institute, The University of Auckland, Auckland, New Zealand
Paper URL

https://doi.org/10.1145/3479590

Video
Social Media through Voice: Synthesized Voice Qualities and Self-presentation
Abstract

With advances in expressive speech synthesis and conversational understanding, an ever-increasing amount of digital content, including social and personal content, can be consumed through voice. Voice has long been known to convey personal characteristics and emotional states, both of which are prominent aspects of social media. Yet, no study has investigated voice design requirements for social media platforms. We interviewed 15 active social media users about their preferences for using synthesized voices to represent their profiles. Our findings show that participants want to have control over how a voice delivers their content, such as the personality and emotion with which the voice speaks, because these prosodic variations can impact users' online persona and interfere with impression management. We report motivations behind customizing or not customizing voice characteristics in different scenarios, and uncover key challenges around usability and the potential for stereotyping. We argue that synthesized speech for social media should be evaluated not only on listening experience and voice quality but also on its expressivity, degree of customizability, and ability to adapt to contexts (e.g., social media platforms, groups, individual posts). We discuss how our contribution confirms and extends knowledge of voice technology design and online self-presentation, and offer design considerations for voice personalization related to social interactions.

Authors
Lotus Zhang
University of Washington, Seattle, Washington, United States
Lucy Jiang
University of Washington, Seattle, Washington, United States
Nicole Washington
Human Centered Design and Engineering, University of Washington, Seattle, Washington, United States
Augustina Ao Liu
University of Washington, Seattle, Washington, United States
Jingyao Shao
Human Centered Design and Engineering, University of Washington, Seattle, Washington, United States
Adam Fourney
Microsoft Research, Redmond, Washington, United States
Meredith Ringel Morris
Microsoft Research, Redmond, Washington, United States
Leah Findlater
University of Washington, Seattle, Washington, United States
Paper URL

https://doi.org/10.1145/3449235

Video
Speaking from Experience: Trans/Non-Binary Requirements for Voice-Activated AI
Abstract

Voice-Activated Artificial Intelligence (VAI) is increasingly ubiquitous, whether appearing as context-specific conversational assistants or as more personalised and generalised personal assistants such as Alexa or Siri. CSCW and other researchers have regularly studied the (positive and negative) social consequences of their design and deployment. One particular focus has been questions of gender, and the implications that the (often-feminine) gendering of VAIs has for societal norms and user experiences. Studies in this area have largely elided transgender (trans) existences; the few exceptions operate largely from an external and predetermined idea of trans and/or non-binary user needs, centered on representation. In this study, we undertook a series of qualitative interviews with trans and/or non-binary users of VAIs to explore their experiences and needs. Our results show that these needs go far beyond questions of representation, and instead have implications for how VAI designers frame gender as a concept, their approach to user privacy, the wider feature set supported, and the structures and contexts in which VAIs are designed. We provide both immediate recommendations for designers and researchers seeking to create trans-inclusive VAIs, and wider, critical proposals for how we as researchers assess technological systems and identify appropriate points of intervention.

Authors
Cami Rincon
Goldsmiths, University of London, London, United Kingdom
Os Keyes
University of Washington, Seattle, Washington, United States
Corinne Cath
University of Oxford, Oxford, United Kingdom
Paper URL

https://doi.org/10.1145/3449206

Video
Owning and Sharing: Privacy Perceptions of Smart Speaker Users
Abstract

Intelligent personal assistants (IPAs), such as Amazon Alexa and Google Assistant, are becoming ever more present in multi-user households. This raises questions of privacy and consent, particularly for those who do not directly own the device they interact with. When these devices are placed in shared spaces, every visitor and cohabitant becomes an indirect user, potentially leading to discomfort, misuse of services, or unintentional sharing of personal data. To better understand how owners and visitors perceive IPAs, we interviewed 10 in-house users (account owners and cohabitants) and 9 visitors, drawn from a sample of students and young professionals who have interacted with such devices on various occasions. We find that cohabitants in shared households with regular IPA interactions see themselves as owners of the device, even though they lack the controls available to the account owner. Further, we determine the existence of a smart speaker etiquette that doubles as trust-based boundary management. Both in-house users and visitors demonstrate similar attitudes and concerns around data use, constant monitoring by the device, and the lack of transparency around device operations. We discuss interviewees' system understanding, concerns, and protection strategies, and make recommendations to avoid tensions around shared devices.

Authors
Nicole Meng
University of Edinburgh, Edinburgh, United Kingdom
Dilara Kekulluoglu
University of Edinburgh, Edinburgh, United Kingdom
Kami Vaniea
University of Edinburgh, Edinburgh, United Kingdom
Paper URL

https://doi.org/10.1145/3449119

Video
My Bad! Repairing Intelligent Voice Assistant Errors Improves Interaction
Abstract

One key technique people use in conversation and collaboration is conversational repair. Self-repair is the recognition and attempted correction of one's own mistakes. We investigate how the self-repair of errors by intelligent voice assistants affects user interaction. In a controlled human-participant study with N = 101 participants, participants asked Amazon Alexa to perform four tasks, and we manipulated whether Alexa would "make a mistake" understanding the participant (for example, playing heavy metal in response to a request for relaxing music) and whether Alexa would perform a correction (for example, stating, "You don't seem pleased. Did I get that wrong?"). We measured the impact of self-repair on the participant's perception of the interaction in four conditions: correction (mistakes made and repair performed), undercorrection (mistakes made, no repair performed), overcorrection (no mistakes made, but repair performed), and control (no mistakes made, and no repair performed). Subsequently, we conducted free-response interviews with each participant about their interactions. This study finds that self-repair greatly improves people's assessment of an intelligent voice assistant if a mistake has been made, but can degrade assessment if no correction is needed. However, we find that the positive impact of self-repair in the wake of an error outweighs the negative impact of overcorrection. In addition, participants who recently experienced an error saw increased value in self-repair as a feature, regardless of whether they experienced a repair themselves.

Authors
Andrea Cuadra
Cornell Tech, New York, New York, United States
Shuran Li
Cornell University, Ithaca, New York, United States
Hansol Lee
Cornell University, Ithaca, New York, United States
Jason Cho
Cornell University, Ithaca, New York, United States
Wendy Ju
Cornell Tech, New York, New York, United States
Paper URL

https://doi.org/10.1145/3449101

Video
Let’s Talk It Out: A Chatbot for Effective Study Habit Behavioral Change
Abstract

Research has shown study habits and skills to be correlated with academic success, calling for a deeper comprehension of these behaviors and processes to design effective interventions for struggling students. Chatbots have recently been used as a persuasive technology to help support behavioral change, making them an intriguing design space for students' study habits and skills. This paper investigated the feasibility of using chatbots to promote behavioral change among college students majoring in Computer Science (CS). We conducted semi-structured interviews with CS peer tutors and surveyed university freshmen to understand issues with students' study habits and identify opportunities for technical intervention. Informed by these findings, we designed StudyBuddy, a chatbot prototype deployed in Slack that periodically sends tips, assesses students' study habits via surveys, helps students break down assignments, recommends academic resources, and sends reminders. We evaluated the usability of the prototype in depth with 8 students (both first-year and senior) and 5 course instructors, followed by a large-scale evaluative survey (n=117) using video of the prototype. Our research identified important design challenges, such as building trust and preserving privacy, limiting interaction costs, and providing both immediate and long-term sustainable support. We also propose design recommendations: demonstrate context awareness, personalize the experience based on user preferences, and adapt over time as students mature and grow.

Authors
Xiaoyi Tian
University of Florida, Gainesville, Florida, United States
Zak Risha
University of Pittsburgh, Pittsburgh, Pennsylvania, United States
Ishrat Ahmed
University of Pittsburgh, Pittsburgh, Pennsylvania, United States
Arun Balajiee Lekshmi Narayanan
University of Pittsburgh, Pittsburgh, Pennsylvania, United States
Jacob Biehl
University of Pittsburgh, Pittsburgh, Pennsylvania, United States
Paper URL

https://doi.org/10.1145/3449171

Video
Exploring Interactions Between Trust, Anthropomorphism, and Relationship Development in Voice Assistants
Abstract

Modern conversational agents such as Alexa and Google Assistant represent significant progress in speech recognition, natural language processing, and speech synthesis. But as these agents have grown more realistic, concerns have been raised over how their social nature might unconsciously shape our interactions with them. Through a survey of 500 voice assistant users, we explore whether users' relationships with their voice assistants can be quantified using the same metrics as social, interpersonal relationships, and whether this correlates with how much they trust their devices and the extent to which they anthropomorphise them. Using Knapp's staircase model of human relationships, we find not only that human-device interactions can be modelled in this way, but also that relationship development with voice assistants correlates with increased trust and anthropomorphism.

Authors
William Seymour
University of Oxford, Oxford, United Kingdom
Max Van Kleek
University of Oxford, Oxford, Oxfordshire, United Kingdom
Paper URL

https://doi.org/10.1145/3479515

Video
Let Me Ask You This: How Can a Voice Assistant Elicit Explicit User Feedback?
Abstract

Voice assistants offer users access to an increasing variety of personalized functionalities. The researchers and engineers who build these experiences rely on various signals from users to create the machine learning models powering them. One type of signal is explicit in situ feedback. While collecting explicit in situ user feedback via voice assistants would help improve and inspect the underlying models, from a user's perspective it can be disruptive to the overall experience, and the user might not feel compelled to respond. However, careful design can help alleviate this friction. In this paper, we explore the opportunities and the design space for voice assistant feedback elicitation. First, we present four usage categories of explicit in situ feedback for model evaluation and improvement, derived from interviews with machine learning practitioners. Then, using realistic scenarios generated for each category and based on examples from the interviews, we conducted an online study to evaluate multiple voice assistant designs. Our results reveal that when the voice assistant was framed as a learner or a collaborator, users were more willing to respond to its request for feedback and felt that the experience was less disruptive. In addition, giving users instructions on how to initiate feedback themselves can reduce the perceived disruption to the experience compared to asking users for feedback directly in the form of a question. Based on our findings, we discuss the implications and potential future directions for designing voice assistants to elicit user feedback for personalized voice experiences.

Authors
Ziang Xiao
University of Illinois at Urbana-Champaign, Urbana, Illinois, United States
Sarah Mennicken
Spotify, San Francisco, California, United States
Bernd Huber
Spotify, Boston, Massachusetts, United States
Adam Shonkoff
Spotify, Boston, Massachusetts, United States
Jennifer Thom
Spotify, Boston, Massachusetts, United States
Paper URL

https://doi.org/10.1145/3479532

Video