Alerting customers to suspected online-payment fraud and persuading them to terminate the transaction is an increasingly common need as digital finance grows rapidly worldwide. We explored the feasibility of using a conversational agent (CA) to fulfill this need. Shing, a voice-based CA, proactively initiates and repairs the conversation with empathetic communication skills in order to alert customers when suspected online-payment fraud is detected, collects important information for fraud scrutiny, and persuades customers to terminate the transaction once the fraud is confirmed. We evaluated our system by comparing it with a rule-based CA with regard to customer responses and perceptions in a real-world context in which our systems handled 144,795 phone calls in total, 83,019 (57.3%) of which involved natural conversation breakdowns. Results showed that more customers stopped risky transactions after conversing with Shing. They also seemed more willing to converse with Shing for more dialogue turns and to provide transaction details. Our work presents practical implications for the design of proactive CAs.
https://doi.org/10.1145/3411764.3445129
Building conversational agents that can conduct natural and prolonged conversations has been a major technical and design challenge, especially for community-facing conversational agents. We posit Mutual Theory of Mind as a theoretical framework to design for natural long-term human-AI interactions. From this perspective, we explore a community's perception of a question-answering conversational agent through self-reported surveys and a computational linguistic approach in the context of online education. We first examine long-term temporal changes in students' perception of Jill Watson (JW), a virtual teaching assistant deployed in an online class discussion forum. We then explore the feasibility of inferring students' perceptions of JW through linguistic features extracted from student-JW dialogues. We find that students' perception of JW's anthropomorphism and intelligence changed significantly over time. Regression analyses reveal that linguistic verbosity, readability, sentiment, diversity, and adaptability reflect student perception of JW. We discuss implications for building adaptive community-facing conversational agents as long-term companions and designing towards Mutual Theory of Mind in human-AI interaction.
https://doi.org/10.1145/3411764.3445645
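As a rough illustration of the regression analyses mentioned above, the sketch below extracts a few dialogue-level linguistic features and regresses self-reported perception scores on them. The feature definitions (word-count verbosity, a word-length proxy for readability, a toy sentiment lexicon, type-token diversity) are simplified stand-ins, not the paper's operationalizations.

# Minimal sketch: regress self-reported perception scores on dialogue-level
# linguistic features (simplified stand-ins for verbosity, readability,
# sentiment, and diversity; not the paper's exact operationalizations).
import numpy as np
from sklearn.linear_model import LinearRegression

def dialogue_features(messages):
    """messages: list of a student's utterances to the agent (strings)."""
    words = [w.lower() for m in messages for w in m.split()]
    n_words = len(words)
    verbosity = n_words / max(len(messages), 1)            # mean words per message
    readability = float(np.mean([len(w) for w in words])) if words else 0.0  # crude proxy
    diversity = len(set(words)) / max(n_words, 1)          # type-token ratio
    positive = {"thanks", "great", "helpful", "good"}      # toy sentiment lexicon
    sentiment = sum(w in positive for w in words) / max(n_words, 1)
    return [verbosity, readability, diversity, sentiment]

def fit_perception_model(dialogues, perception_scores):
    """dialogues: list of message lists; perception_scores: survey ratings."""
    X = np.array([dialogue_features(d) for d in dialogues])
    y = np.array(perception_scores)
    model = LinearRegression().fit(X, y)
    return model.coef_  # which features track perceived anthropomorphism or intelligence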
Designing reliable Speech Emotion Recognition systems is a complex task that inevitably requires sufficient data for training purposes. Such extensive datasets are currently available in only a few languages, including English, German, and Italian. In this paper, we present SEMOUR, the first scripted database of emotion-tagged speech in the Urdu language, to support the design of an Urdu speech emotion recognition system. Our gender-balanced dataset contains 15,040 unique instances recorded by eight professional actors performing a syntactically complex script. The dataset is phonetically balanced and reliably exhibits a varied set of emotions, as indicated by the high agreement scores among human raters in our experiments. We also provide various baseline speech emotion prediction scores on the database, which could be used for applications such as personalized robot assistants, diagnosis of psychological disorders, and gathering feedback from low-tech-enabled populations. On a random test sample, our model correctly predicts an emotion with a state-of-the-art 92% accuracy.
https://doi.org/10.1145/3411764.3445171
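The abstract does not describe the model behind the 92% figure; the sketch below is a generic baseline pipeline of the kind used for baseline speech emotion prediction scores (MFCC features plus an off-the-shelf classifier). The file layout and label-naming scheme are hypothetical, not SEMOUR's actual format.

# Minimal sketch of a baseline speech-emotion classifier: averaged MFCC
# features plus an SVM. Paths and label extraction are assumptions.
import glob
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def mfcc_features(path, sr=16000, n_mfcc=40):
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # average over time -> fixed-length vector

paths = glob.glob("semour_wavs/*.wav")                       # hypothetical layout
labels = [p.split("_")[-1].replace(".wav", "") for p in paths]  # hypothetical naming
X = np.array([mfcc_features(p) for p in paths])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, stratify=labels)
clf = SVC(kernel="rbf").fit(X_tr, y_tr)
print("emotion accuracy:", clf.score(X_te, y_te))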
Prototyping AI user experiences is challenging, in part because the probabilistic nature of AI models makes it difficult to anticipate, test, and mitigate AI failures before deployment. In this work, we set out to support practitioners with early AI prototyping, with a focus on natural language (NL)-based technologies. Our interviews with 12 NL practitioners from a large technology company revealed that, in addition to challenges prototyping AI, prototyping was often not happening at all or focused only on idealized scenarios due to a lack of tools and tight timelines. These findings informed our design of the AI Playbook, an interactive and low-cost tool we developed to encourage proactive and systematic consideration of AI errors before deployment. Our evaluation of the AI Playbook demonstrates its potential to 1) encourage product teams to prioritize both ideal and failure scenarios, 2) standardize the articulation of AI failures from a user experience perspective, and 3) act as a boundary object between user experience designers, data scientists, and engineers.
https://doi.org/10.1145/3411764.3445735
In recent years, mobile accessibility has become an important concern, with the goal of allowing all users to use any app without undue limitations. User reviews include insights that are useful for app evolution. However, as the number of received reviews grows, manually analyzing them is tedious and time-consuming, especially when searching for accessibility reviews. The goal of this paper is to support the automated identification of accessibility-related user reviews, to help technology professionals prioritize their handling and thus create more inclusive apps. Specifically, we design a model that takes user reviews as input and learns keyword-based features in order to make a binary decision on whether a given review is about accessibility. The model is evaluated using a total of 5,326 mobile app reviews. The findings show that (1) our model can accurately identify accessibility reviews, outperforming two baselines, namely a keyword-based detector and a random classifier; and (2) our model achieves an accuracy of 85% with a relatively small training dataset, and its accuracy improves as the size of the training dataset increases.
https://doi.org/10.1145/3411764.3445281
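As an illustration of the binary task described above, the sketch below trains a simple review classifier and includes a keyword-based detector of the kind used as a baseline. The keyword list, features, and model choice are illustrative assumptions rather than the paper's configuration.

# Minimal sketch: decide whether a user review is accessibility-related.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ACCESSIBILITY_KEYWORDS = {"accessibility", "screen reader", "voiceover",
                          "talkback", "contrast", "font size", "blind", "deaf"}

def keyword_baseline(review):
    """Simple detector of the kind the learned model is compared against."""
    text = review.lower()
    return int(any(k in text for k in ACCESSIBILITY_KEYWORDS))

def train_classifier(reviews, labels):
    """reviews: raw review texts; labels: 1 = accessibility-related, 0 = not."""
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2),
                          LogisticRegression(max_iter=1000))
    return model.fit(reviews, labels)

# clf = train_classifier(train_reviews, train_labels)
# clf.predict(["VoiceOver skips the checkout button"])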
Machine learning classifiers for human-facing tasks such as comment toxicity and misinformation often score highly on metrics such as ROC AUC but are received poorly in practice. Why this gap? Today, metrics such as ROC AUC, precision, and recall are used to measure technical performance; however, human-computer interaction observes that evaluation of human-facing systems should account for people's reactions to the system. In this paper, we introduce a transformation that more closely aligns machine learning classification metrics with the values and methods of user-facing performance measures. The disagreement deconvolution takes in any multi-annotator (e.g., crowdsourced) dataset, disentangles stable opinions from noise by estimating intra-annotator consistency, and compares each test set prediction to the individual stable opinions from each annotator. Applying the disagreement deconvolution to existing social computing datasets, we find that current metrics dramatically overstate the performance of many human-facing machine learning tasks: for example, performance on a comment toxicity task is corrected from .95 to .73 ROC AUC.
https://doi.org/10.1145/3411764.3445423
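The sketch below illustrates the core idea of the deconvolution for binary labels, assuming a per-annotator flip rate estimated from repeated annotations; it is a simplified reading of the method, not the paper's full procedure.

# Simplified sketch of disagreement deconvolution for binary labels.
import numpy as np

def estimate_flip_rate(repeated_labels):
    """repeated_labels: (first_label, second_label) pairs where the same
    annotator labeled the same item twice. For small flip rates, the chance
    that two repeats disagree is roughly twice the flip rate."""
    pairs = np.array(repeated_labels)
    return 0.5 * float(np.mean(pairs[:, 0] != pairs[:, 1]))

def deconvolved_positive_rate(observed_labels, flip_rate):
    """Recover the share of annotators whose *stable* opinion is positive
    from the observed (noisy) label distribution for one item."""
    p_obs = float(np.mean(observed_labels))
    p_primary = (p_obs - flip_rate) / max(1.0 - 2.0 * flip_rate, 1e-6)
    return float(np.clip(p_primary, 0.0, 1.0))

def deconvolved_accuracy(predictions, labels_per_item, flip_rate):
    """Score each prediction against the estimated distribution of stable opinions."""
    scores = []
    for pred, labels in zip(predictions, labels_per_item):
        p_pos = deconvolved_positive_rate(labels, flip_rate)
        scores.append(p_pos if pred == 1 else 1.0 - p_pos)
    return float(np.mean(scores))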
Recent studies show the effectiveness of interview chatbots in information elicitation. However, designing an effective interview chatbot is non-trivial, and few tools exist to help designers design, evaluate, and improve an interview chatbot iteratively. Based on a formative study and a literature review, we propose a computational framework for quantifying the performance of interview chatbots. Incorporating the framework, we have developed iChatProfile, an assistive design tool that can automatically generate a profile of an interview chatbot with quantified performance metrics and offer design suggestions for improving the chatbot based on those metrics. To validate the effectiveness of iChatProfile, we designed and conducted a between-subjects study that compared the performance of 10 interview chatbots designed with or without using iChatProfile. Based on the live chats between the 10 chatbots and 1349 users, our results show that iChatProfile helped the designers build significantly more effective interview chatbots, improving both interview quality and user experience.
https://doi.org/10.1145/3411764.3445569
Recent work in fair machine learning has proposed dozens of technical definitions of algorithmic fairness and methods for enforcing these definitions. However, we still lack an understanding of how to develop machine learning systems with fairness criteria that reflect relevant stakeholders' nuanced viewpoints in real-world contexts. To address this gap, we propose a framework for eliciting stakeholders' subjective fairness notions. Combining a user interface that allows stakeholders to examine the data and the algorithm's predictions with an interview protocol to probe stakeholders' thoughts while they are interacting with the interface, we can identify stakeholders' fairness beliefs and principles. We conduct a user study to evaluate our framework in the setting of a child maltreatment predictive system. Our evaluations show that the framework allows stakeholders to comprehensively convey their fairness viewpoints. We also discuss how our results can inform the design of predictive systems.
https://doi.org/10.1145/3411764.3445308
In single-target video object tracking, an initial bounding box is drawn around a target object and propagated through a video. When this bounding box is provided by a careful human expert, it is expected to yield strong overall tracking performance that can be mimicked at scale by novice crowd workers with the help of advanced quality control methods. However, we show through an investigation of 900 crowdsourced initializations that such quality control strategies are inadequate for this task in two major ways: first, the high level of redundancy in these methods (e.g., averaging multiple responses to reduce error) is unnecessary, as 23% of crowdsourced initializations perform just as well as the gold-standard initialization. Second, even nearly perfect initializations can lead to degraded long-term performance due to the complexity of object tracking. Considering these findings, we evaluate novel approaches for automatically selecting bounding boxes to re-query, and introduce Smart Replacement, an efficient method that decides whether to use the crowdsourced replacement initialization.
https://doi.org/10.1145/3411764.3445181
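The comparison between crowdsourced and gold-standard initializations rests on bounding-box overlap; the sketch below computes intersection-over-union for that offline analysis, with an illustrative acceptance threshold. The actual re-query and Smart Replacement criteria are not specified in the abstract.

# Minimal sketch: IoU between a crowdsourced and a gold-standard box.
def iou(box_a, box_b):
    """Boxes as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def low_quality_initialization(crowd_box, gold_box, threshold=0.8):
    """Flag a crowdsourced initialization whose overlap with the expert box is low
    (illustrative threshold; usable only in offline analysis where gold boxes exist)."""
    return iou(crowd_box, gold_box) < threshold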
Advances in artificial intelligence (AI) have made it increasingly feasible to supplement experts' decision-making with decision support systems across various tasks. For instance, an AI-based system can provide therapists with quantitative analysis of a patient's status to improve rehabilitation assessment practices. However, there is limited knowledge about the potential of these systems in practice. In this paper, we present the development and evaluation of an interactive AI-based system that supports collaborative decision making with therapists for rehabilitation assessment. The system automatically identifies salient features of assessment to generate patient-specific analysis for therapists, and is tuned with their feedback. In two evaluations with therapists, we found that our system supports significantly higher therapist agreement on assessment (0.71 average F1-score) than a traditional system without such analysis (0.66 average F1-score, p < 0.05). After tuning with therapists' feedback, our system significantly improves its performance from 0.8377 to 0.9116 average F1-score (p < 0.01). This work discusses the potential of human-AI collaborative systems to support more accurate decision making while humans and AI learn from each other's strengths.
https://doi.org/10.1145/3411764.3445472
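As a loose illustration of the two system behaviors described above (surfacing salient assessment features and tuning with therapist feedback), the sketch below uses a generic feature-importance ranking and a simple retrain-on-feedback update; both are assumptions standing in for the paper's patient-specific method.

# Minimal sketch: rank salient assessment features, then retrain with feedback.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_and_rank_features(X, y, feature_names, top_k=5):
    """X: patient exercise features; y: assessment labels from therapists."""
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    order = np.argsort(model.feature_importances_)[::-1][:top_k]
    return model, [feature_names[i] for i in order]  # most salient features overall

def tune_with_feedback(model, X, y, X_feedback, y_feedback):
    """Fold therapist-corrected labels back in and retrain (one simple update strategy)."""
    X_new = np.vstack([X, X_feedback])
    y_new = np.concatenate([y, y_feedback])
    return model.fit(X_new, y_new)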
Crowdsourcing can collect many diverse ideas by prompting ideators individually, but it can also generate redundant ideas. Prior methods reduce redundancy by presenting peers' ideas or peer-proposed prompts, but these require much human coordination. We introduce Directed Diversity, an automatic prompt selection approach that leverages language model embedding distances to maximize diversity. Ideators can be directed towards diverse prompts and away from prior ideas, thus improving their collective creativity. Since there are many possible metrics of diversity, we present a Diversity Prompting Evaluation Framework that consolidates metrics from several research disciplines to analyze the ideation chain: prompt selection, prompt creativity, prompt-ideation mediation, and ideation creativity. Using this framework, we evaluated Directed Diversity in a simulation study and four user studies for the use case of crowdsourcing motivational messages to encourage physical activity. We show that automated diverse prompting can variously improve collective creativity across many nuanced metrics of diversity.
https://doi.org/10.1145/3411764.3445782
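The sketch below shows one way to realize embedding-based diverse prompting in the spirit of Directed Diversity: greedily selecting prompts that are far, in embedding space, from prior ideas and from prompts already chosen. The greedy farthest-point strategy and the source of the embeddings are assumptions, not the paper's exact algorithm.

# Minimal sketch: greedy farthest-point prompt selection over embeddings.
import numpy as np

def select_diverse_prompts(candidate_embs, prior_embs, k):
    """candidate_embs: (n, d) embeddings of candidate prompts;
    prior_embs: (m, d) embeddings of prior ideas to steer away from."""
    selected = []
    anchors = list(prior_embs)
    for _ in range(k):
        # distance of each candidate to its nearest anchor (prior idea or prior pick)
        dists = np.array([
            min(np.linalg.norm(c - a) for a in anchors) if anchors else np.inf
            for c in candidate_embs
        ])
        dists[selected] = -np.inf          # never re-pick a chosen prompt
        best = int(np.argmax(dists))
        selected.append(best)
        anchors.append(candidate_embs[best])
    return selected  # indices of maximally spread prompts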
Qualitative research can produce a rich understanding of a phenomenon but requires an essential and strenuous data annotation process known as coding. Coding can be repetitive and time-consuming, particularly for large datasets. Existing AI-based approaches for partially automating coding, like supervised machine learning (ML) or explicit knowledge represented in code rules, require high technical literacy and lack transparency. Further, little is known about the interaction of researchers with AI-based coding assistance. We introduce Cody, an AI-based system that semi-automates coding through code rules and supervised ML. Cody supports researchers in interactively (re)defining code rules and uses ML to extend coding to unseen data. In two studies with qualitative researchers, we found that (1) code rules provide structure and transparency, (2) explanations are commonly desired but rarely used, and (3) suggestions benefit coding quality rather than coding speed, increasing intercoder reliability (Krippendorff's Alpha) from 0.085 (MAXQDA) to 0.33 (Cody).
https://doi.org/10.1145/3411764.3445591
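For reference, the sketch below computes the reliability measure cited above, Krippendorff's alpha for nominal codes, from a coders-by-units table with missing entries; the data layout is an assumption.

# Minimal sketch: Krippendorff's alpha for nominal codes.
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(data):
    """data: list of per-coder code lists (None where a unit was not coded)."""
    units = list(zip(*data))                      # one tuple of codes per unit
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue                              # units coded once carry no information
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1.0 / (m - 1)
    n_c = Counter()
    for (c, _), weight in coincidences.items():
        n_c[c] += weight
    n = sum(n_c.values())
    if n <= 1:
        return float("nan")
    d_o = sum(w for (c, k), w in coincidences.items() if c != k) / n
    d_e = sum(n_c[c] * n_c[k] for c, k in permutations(n_c, 2)) / (n * (n - 1))
    return 1.0 - d_o / d_e if d_e > 0 else 1.0

# Example: two coders, five segments, one segment uncoded by coder B.
# krippendorff_alpha_nominal([["a", "a", "b", "b", "a"], ["a", "b", "b", "b", None]])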
Developing human-like conversational agents is a prime area of HCI research and subsumes many tasks. Predicting listener backchannels is one such actively researched task. While many studies have used different approaches for backchannel prediction, they have all depended on manual annotations for large datasets, which is a bottleneck limiting the scalability of development. To address this, we propose using semi-supervised techniques to automate the process of identifying backchannels, thereby easing the annotation process. To analyze the feasibility of our identification module, we compared backchannel prediction models trained on (a) manually annotated and (b) semi-supervised labels. Quantitative analysis revealed that the proposed semi-supervised approach can attain 95% of the former's performance. Our user-study findings revealed that almost 60% of the participants found the backchannel responses predicted by the proposed model more natural. Finally, we also analyzed the impact of personality on the type of backchannel signals and validated our findings in the user study.
https://doi.org/10.1145/3411764.3445449
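The sketch below illustrates one common semi-supervised labeling strategy, self-training with confidence-thresholded pseudo-labels, which matches the spirit of the identification module described above; the feature representation, classifier, and threshold are assumptions rather than the paper's design.

# Minimal sketch: self-training for backchannel identification.
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_labeled, y_labeled, X_unlabeled, threshold=0.9, rounds=3):
    """X_labeled: feature vectors for labeled segments; y_labeled: 0/1 backchannel
    labels; X_unlabeled: feature vectors for unlabeled candidate segments."""
    X, y = np.array(X_labeled), np.array(y_labeled)
    pool = np.array(X_unlabeled)
    clf = LogisticRegression(max_iter=1000)
    for _ in range(rounds):
        clf.fit(X, y)
        if len(pool) == 0:
            break
        probs = clf.predict_proba(pool)
        confident = probs.max(axis=1) >= threshold   # keep only confident pseudo-labels
        if not confident.any():
            break
        X = np.vstack([X, pool[confident]])
        y = np.concatenate([y, probs[confident].argmax(axis=1)])
        pool = pool[~confident]
    return clf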
AI technologies are often used to aid people in performing discrete tasks with well-defined goals (e.g., recognising faces in images). Emerging technologies that provide continuous, real-time information enable more open-ended AI experiences. In partnership with a blind child, we explore the challenges and opportunities of designing human-AI interaction for a system intended to support social sensemaking. Adopting a research-through-design perspective, we reflect upon working with the uncertain capabilities of AI systems in the design of this experience. We contribute: (i) a concrete example of an open-ended AI system that enabled a blind child to extend his own capabilities; (ii) an illustration of the delta between imagined and actual use, highlighting how capabilities derive from the human-AI interaction and not the AI system alone; and (iii) a discussion of design choices to craft an ongoing human-AI interaction that addresses the challenge of uncertain outputs of AI systems.
https://doi.org/10.1145/3411764.3445290