Computational Human-AI Conversation

[A] Paper Room 02, 2021-05-11 17:00:00~2021-05-11 19:00:00 / [B] Paper Room 02, 2021-05-12 01:00:00~2021-05-12 03:00:00 / [C] Paper Room 02, 2021-05-12 09:00:00~2021-05-12 11:00:00

Conference Name
CHI 2021
Shing: A Conversational Agent to Alert Customers of Suspected Online-payment Fraud with Empathetical Communication Skills
Abstract

Alerting customers to suspected online-payment fraud and persuading them to terminate transactions is increasingly requested with the rapid growth of digital finance worldwide. We explored the feasibility of using a conversational agent (CA) to fulfill this request. Shing, a voice-based CA, proactively initiates and repairs conversations with empathetic communication skills in order to alert customers when a suspected online-payment fraud is detected, collects important information for fraud scrutiny, and persuades customers to terminate the transaction once the fraud is confirmed. We evaluated our system by comparing it with a rule-based CA with regard to customer responses and perceptions in a real-world context, where our systems took 144,795 phone calls in total and 83,019 (57.3%) natural breakdowns occurred. Results showed that more customers stopped risky transactions after conversing with Shing. They also seemed more willing to converse with Shing for more dialogue turns and to provide transaction details. Our work presents practical implications for the design of proactive CAs.

Authors
Jingya Guo
Alibaba Group, Hangzhou, Zhejiang, China
Jiajing Guo
Cornell University, Ithaca, New York, United States
Changyuan Yang
Alibaba Group, Hangzhou, Zhejiang, China
Yanjing Wu
Alibaba Group, Hangzhou, Zhejiang, China
Wenbo Yang
Alibaba Group, Hangzhou, Zhejiang, China
Lingyun Sun
Zhejiang University, Hangzhou, China
DOI

10.1145/3411764.3445129

Paper URL

https://doi.org/10.1145/3411764.3445129

Video
Towards Mutual Theory of Mind in Human-AI Interaction: How Language Reflects What Students Perceive About a Virtual Teaching Assistant
Abstract

Building conversational agents that can conduct natural and prolonged conversations has been a major technical and design challenge, especially for community-facing conversational agents. We posit Mutual Theory of Mind as a theoretical framework to design for natural long-term human-AI interactions. From this perspective, we explore a community's perception of a question-answering conversational agent through self-reported surveys and a computational linguistic approach in the context of online education. We first examine long-term temporal changes in students' perception of Jill Watson (JW), a virtual teaching assistant deployed in an online class discussion forum. We then explore the feasibility of inferring students' perceptions of JW through linguistic features extracted from student-JW dialogues. We find that students' perception of JW's anthropomorphism and intelligence changed significantly over time. Regression analyses reveal that linguistic verbosity, readability, sentiment, diversity, and adaptability reflect student perception of JW. We discuss implications for building adaptive community-facing conversational agents as long-term companions and designing towards Mutual Theory of Mind in human-AI interaction.
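
The abstract does not spell out how the linguistic features were computed or which regression model was used, so the following is only a minimal sketch of the pipeline it describes: derive simple per-student linguistic features from dialogue text and regress them on a self-reported perception score. The feature definitions (word count for verbosity, type-token ratio for diversity, a toy word-list sentiment score) and the example data are illustrative stand-ins, not the paper's measures.

```python
# Illustrative sketch only: simple stand-in linguistic features regressed on a
# self-reported perception rating (e.g., perceived anthropomorphism of JW).
import numpy as np
from sklearn.linear_model import LinearRegression

POSITIVE = {"thanks", "great", "helpful", "good"}
NEGATIVE = {"confusing", "wrong", "useless", "bad"}

def features(text):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    verbosity = len(tokens)                                   # word count
    diversity = len(set(tokens)) / max(len(tokens), 1)        # type-token ratio
    sentiment = sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)  # toy lexicon score
    return [verbosity, diversity, sentiment]

# Hypothetical data: one concatenated utterance per student and their survey rating.
utterances = [
    "Thanks Jill, that was a really helpful and clear explanation!",
    "when is assignment three due",
    "This answer is confusing and seems wrong to me",
    "Good bot. Thanks!",
    "Can you post the rubric for project two please",
]
ratings = [6.0, 4.0, 2.5, 5.5, 4.5]   # made-up 1-7 Likert perception scores

X = np.array([features(u) for u in utterances])
model = LinearRegression().fit(X, np.array(ratings))
print(dict(zip(["verbosity", "diversity", "sentiment"], model.coef_.round(3))))
```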

Authors
Qiaosi Wang
Georgia Institute of Technology, Atlanta, Georgia, United States
Koustuv Saha
Georgia Institute of Technology, Atlanta, Georgia, United States
Eric Gregori
Georgia Tech, Atlanta, Georgia, United States
David Joyner
Georgia Tech, Atlanta, Georgia, United States
Ashok Goel
Georgia Institute of Technology, Atlanta, Georgia, United States
DOI

10.1145/3411764.3445645

Paper URL

https://doi.org/10.1145/3411764.3445645

Video
SEMOUR: Scripted EMOtional speech repository for URdu
Abstract

Designing reliable Speech Emotion Recognition systems is a complex task that inevitably requires sufficient data for training purposes. Such extensive datasets are currently available in only a few languages, including English, German, and Italian. In this paper, we present SEMOUR, the first scripted database of emotion-tagged speech in the Urdu language, to support the design of an Urdu Speech Emotion Recognition system. Our gender-balanced dataset contains 15,040 unique instances recorded by eight professional actors performing a syntactically complex script. The dataset is phonetically balanced and reliably exhibits a varied set of emotions, as marked by the high agreement scores among human raters in our experiments. We also provide various baseline speech emotion prediction scores on the database, which could be used for applications such as personalized robot assistants, diagnosis of psychological disorders, and gathering feedback from a low-tech-enabled population. On a random test sample, our model correctly predicts an emotion with a state-of-the-art 92% accuracy.
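
The abstract reports baseline emotion prediction scores but does not describe the baseline model here, so the sketch below is a generic speech emotion recognition baseline (time-averaged MFCC features via librosa feeding a support-vector classifier), not SEMOUR's reported system. The directory layout and the filename convention used to read the emotion tag are assumptions for illustration.

```python
# Generic SER baseline sketch (not SEMOUR's reported model): average MFCCs per
# clip + an RBF-kernel SVM. Assumes wav files named "<actor>_<emotion>_<id>.wav".
import glob
import os
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def clip_features(path, n_mfcc=40):
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)                      # time-averaged, fixed-length vector

paths = sorted(glob.glob("semour_wavs/*.wav"))    # hypothetical dataset location
X = np.stack([clip_features(p) for p in paths])
labels = [os.path.basename(p).split("_")[1] for p in paths]   # emotion tag from filename

X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2,
                                          stratify=labels, random_state=0)
clf = SVC(kernel="rbf", C=10).fit(X_tr, y_tr)
print("held-out accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```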

Authors
Nimra Zaheer
Information Technology University, Lahore, Punjab, Pakistan
Obaid Ullah Ahmad
Information Technology University, Lahore, Punjab, Pakistan
Ammar Ahmed
Information Technology University, Lahore, Pakistan
Muhammad Shehryar Khan
Information Technology University, Lahore, Punjab, Pakistan
Mudassir Shabbir
Information Technology University, Lahore, Punjab, Pakistan
DOI

10.1145/3411764.3445171

Paper URL

https://doi.org/10.1145/3411764.3445171

Video
Planning for Natural Language Failures with the AI Playbook
Abstract

Prototyping AI user experiences is challenging, due in part to probabilistic AI models that make it difficult to anticipate, test, and mitigate AI failures before deployment. In this work, we set out to support practitioners with early AI prototyping, with a focus on natural language (NL)-based technologies. Our interviews with 12 NL practitioners from a large technology company revealed that, in addition to challenges prototyping AI, prototyping was often not happening at all, or focused only on idealized scenarios, due to a lack of tools and tight timelines. These findings informed our design of the AI Playbook, an interactive and low-cost tool we developed to encourage proactive and systematic consideration of AI errors before deployment. Our evaluation of the AI Playbook demonstrates its potential to (1) encourage product teams to prioritize both ideal and failure scenarios, (2) standardize the articulation of AI failures from a user experience perspective, and (3) act as a boundary object between user experience designers, data scientists, and engineers.

Authors
Matthew K. Hong
University of Washington, Seattle, Washington, United States
Adam Fourney
Microsoft, Redmond, Washington, United States
Derek DeBellis
Microsoft, Redmond, Washington, United States
Saleema Amershi
Microsoft, Redmond, Washington, United States
DOI

10.1145/3411764.3445735

Paper URL

https://doi.org/10.1145/3411764.3445735

Video
Finding the Needle in a Haystack: On the Automatic Identification of Accessibility User Reviews
Abstract

In recent years, mobile accessibility has become an important trend, with the goal of allowing all users to use any app without major limitations. User reviews include insights that are useful for app evolution. However, as the number of received reviews grows, manually analyzing them is tedious and time-consuming, especially when searching for accessibility reviews. The goal of this paper is to support the automated identification of accessibility reviews, to help technology professionals prioritize their handling and thus create more inclusive apps. In particular, we design a model that takes user reviews as input and learns keyword-based features in order to make a binary decision on whether a given review is about accessibility or not. The model is evaluated using a total of 5,326 mobile app reviews. The findings show that (1) our model can accurately identify accessibility reviews, outperforming two baselines, namely a keyword-based detector and a random classifier; and (2) our model achieves an accuracy of 85% with a relatively small training dataset, and the accuracy improves as we increase the size of the training dataset.
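
The abstract describes a binary classifier over keyword-based review features but does not name the learner here, so this is a minimal sketch of one plausible realization (bag-of-words features with logistic regression); the example reviews and labels are made up for illustration and the paper's exact model may differ.

```python
# Illustrative keyword-feature binary classifier: bag-of-words features plus
# logistic regression over labeled app reviews (1 = accessibility, 0 = other).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples.
reviews = [
    "The font is too small and there is no way to enlarge it",
    "TalkBack skips the checkout button entirely",
    "Love the new dark theme, great update",
    "App crashes every time I open a chat",
]
labels = [1, 1, 0, 0]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=1),
                    LogisticRegression(max_iter=1000))
clf.fit(reviews, labels)

# Classify a new, unseen review.
print(clf.predict(["the font is too small to read on my phone"]))
```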

Authors
Eman AlOmar
Rochester Institute of Technology, Rochester, New York, United States
Wajdi M. Aljedaani
University of North Texas, Denton, Texas, United States
Murtaza Tamjeed
Rochester Institute of Technology, Rochester, New York, United States
Mohamed Wiem Mkaouer
Rochester Institute of Technology, Rochester, New York, United States
Yasmine N. Elglaly
Western Washington University, Bellingham, Washington, United States
DOI

10.1145/3411764.3445281

Paper URL

https://doi.org/10.1145/3411764.3445281

Video
The Disagreement Deconvolution: Bringing Machine Learning Performance Metrics In Line With Reality
Abstract

Machine learning classifiers for human-facing tasks such as comment toxicity and misinformation often score highly on metrics such as ROC AUC but are received poorly in practice. Why this gap? Today, metrics such as ROC AUC, precision, and recall are used to measure technical performance; however, human-computer interaction observes that evaluation of human-facing systems should account for people's reactions to the system. In this paper, we introduce a transformation that more closely aligns machine learning classification metrics with the values and methods of user-facing performance measures. The disagreement deconvolution takes in any multi-annotator (e.g., crowdsourced) dataset, disentangles stable opinions from noise by estimating intra-annotator consistency, and compares each test set prediction to the individual stable opinions from each annotator. Applying the disagreement deconvolution to existing social computing datasets, we find that current metrics dramatically overstate the performance of many human-facing machine learning tasks: for example, performance on a comment toxicity task is corrected from .95 to .73 ROC AUC.
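
As a rough illustration of the scoring idea described above, the sketch below compares each prediction against every annotator's stable opinion rather than against one aggregated label. It makes a strong simplification, treating an annotator's modal label over repeated annotations of an item as their stable opinion, whereas the paper's deconvolution additionally models within-annotator noise; the annotation data here are hypothetical.

```python
# Simplified illustration of disagreement-deconvolution-style scoring (not the
# paper's exact estimator): score each prediction against every annotator's
# stable opinion instead of against a single aggregated label.
from collections import Counter, defaultdict

# Hypothetical repeated annotations: (item_id, annotator_id, label)
annotations = [
    ("c1", "a1", 1), ("c1", "a1", 1), ("c1", "a2", 0), ("c1", "a2", 0),
    ("c2", "a1", 1), ("c2", "a1", 0), ("c2", "a2", 1), ("c2", "a2", 1),
]
predictions = {"c1": 1, "c2": 1}   # classifier output per item

# Step 1: estimate each annotator's stable opinion per item (modal label here;
# the paper instead corrects for annotation noise explicitly).
votes = defaultdict(list)
for item, annotator, label in annotations:
    votes[(item, annotator)].append(label)
stable = {key: Counter(labels).most_common(1)[0][0] for key, labels in votes.items()}

# Step 2: compare the prediction to each stable opinion individually.
hits = [predictions[item] == opinion for (item, _), opinion in stable.items()]
print("deconvolved accuracy:", sum(hits) / len(hits))
```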

Authors
Mitchell L. Gordon
Stanford University, Stanford, California, United States
Kaitlyn Zhou
Stanford University, Stanford, California, United States
Kayur Patel
Apple Inc, Seattle, Washington, United States
Tatsunori Hashimoto
Stanford University, Stanford, California, United States
Michael S. Bernstein
Stanford University, Stanford, California, United States
DOI

10.1145/3411764.3445423

Paper URL

https://doi.org/10.1145/3411764.3445423

Video
Designing Effective Interview Chatbots: Automatic Chatbot Profiling and Design Suggestion Generation for Chatbot Debugging
Abstract

Recent studies show the effectiveness of interview chatbots in information elicitation. However, designing an effective interview chatbot is non-trivial. Few tools exist to help designers design, evaluate, and improve an interview chatbot iteratively. Based on a formative study and literature reviews, we propose a computational framework for quantifying the performance of interview chatbots. Incorporating the framework, we have developed iChatProfile, an assistive design tool that can automatically generate a profile of an interview chatbot with quantified performance metrics and offer design suggestions for improving the chatbot based on such metrics. To validate the effectiveness of iChatProfile, we designed and conducted a between-subject study that compared the performance of 10 interview chatbots designed with or without using iChatProfile. Based on the live chats between the 10 chatbots and 1349 users, our results show that iChatProfile helped the designers build significantly more effective interview chatbots, improving both interview quality and user experience.

Authors
Xu Han
University of Colorado Boulder, Boulder, Colorado, United States
Michelle Zhou
Juji, Inc., San Jose, California, United States
Matthew J. Turner
University of Colorado at Boulder, Boulder, Colorado, United States
Tom Yeh
University of Colorado Boulder, Boulder, Colorado, United States
DOI

10.1145/3411764.3445569

Paper URL

https://doi.org/10.1145/3411764.3445569

Video
Soliciting Stakeholders’ Fairness Notions in Child Maltreatment Predictive Systems
Abstract

Recent work in fair machine learning has proposed dozens of technical definitions of algorithmic fairness and methods for enforcing these definitions. However, we still lack an understanding of how to develop machine learning systems with fairness criteria that reflect relevant stakeholders' nuanced viewpoints in real-world contexts. To address this gap, we propose a framework for eliciting stakeholders' subjective fairness notions. Combining a user interface that allows stakeholders to examine the data and the algorithm's predictions with an interview protocol to probe stakeholders' thoughts while they are interacting with the interface, we can identify stakeholders' fairness beliefs and principles. We conduct a user study to evaluate our framework in the setting of a child maltreatment predictive system. Our evaluations show that the framework allows stakeholders to comprehensively convey their fairness viewpoints. We also discuss how our results can inform the design of predictive systems.

Authors
Hao-Fei Cheng
University of Minnesota, Minneapolis, Minnesota, United States
Logan Stapleton
University of Minnesota, Minneapolis, Minnesota, United States
Ruiqi Wang
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Paige Bullock
Kenyon College, Gambier, Ohio, United States
Alexandra Chouldechova
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Zhiwei Steven Wu
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Haiyi Zhu
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
DOI

10.1145/3411764.3445308

Paper URL

https://doi.org/10.1145/3411764.3445308

Video
Crowdsourcing More Effective Initializations for Single-Target Trackers Through Automatic Re-querying
Abstract

In single-target video object tracking, an initial bounding box is drawn around a target object and propagated through a video. When this bounding box is provided by a careful human expert, it is expected to yield strong overall tracking performance that can be mimicked at scale by novice crowd workers with the help of advanced quality control methods. However, we show through an investigation of 900 crowdsourced initializations that such quality control strategies are inadequate for this task in two major ways: first, the high level of redundancy in these methods (e.g., averaging multiple responses to reduce error) is unnecessary, as 23% of crowdsourced initializations perform just as well as the gold-standard initialization. Second, even nearly perfect initializations can lead to degraded long-term performance due to the complexity of object tracking. Considering these findings, we evaluate novel approaches for automatically selecting bounding boxes to re-query, and introduce Smart Replacement, an efficient method that decides whether to use the crowdsourced replacement initialization.
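
The abstract does not define the criterion Smart Replacement uses, so the sketch below shows only the standard intersection-over-union measure typically used to judge how far a crowdsourced initialization deviates from a reference box, plus a purely hypothetical threshold rule for deciding when to re-query; it is not the paper's method.

```python
# Standard IoU between two boxes given as (x, y, w, h), plus a hypothetical
# threshold rule for flagging a crowdsourced box for re-querying. This is NOT
# the paper's Smart Replacement criterion.
def iou(a, b):
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def should_requery(crowd_box, reference_box, threshold=0.7):
    # Re-query when the worker's box deviates too far from a reference box.
    return iou(crowd_box, reference_box) < threshold

print(iou((10, 10, 50, 80), (12, 8, 50, 80)))             # close boxes, IoU ~0.88
print(should_requery((10, 10, 50, 80), (100, 120, 40, 40)))  # no overlap -> True
```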

Authors
Stephan J. Lemmer
University of Michigan, Ann Arbor, Michigan, United States
Jean Y. Song
KAIST, Daejeon, Korea, Republic of
Jason J. Corso
Stevens Institute for Artificial Intelligence, Hoboken, New Jersey, United States
DOI

10.1145/3411764.3445181

Paper URL

https://doi.org/10.1145/3411764.3445181

Video
A Human-AI Collaborative Approach for Clinical Decision Making on Rehabilitation Assessment
Abstract

Advances in artificial intelligence (AI) have made it increasingly applicable to supplementing experts' decision-making in the form of decision support systems for various tasks. For instance, an AI-based system can provide therapists with quantitative analysis of a patient's status to improve the practice of rehabilitation assessment. However, there is limited knowledge on the potential of these systems. In this paper, we present the development and evaluation of an interactive AI-based system that supports collaborative decision making with therapists for rehabilitation assessment. This system automatically identifies salient features of assessment to generate patient-specific analysis for therapists, and tunes itself with their feedback. In two evaluations with therapists, we found that our system led to significantly higher agreement on assessment (0.71 average F1-score) than a traditional system without analysis (0.66 average F1-score, p < 0.05). After tuning with therapists' feedback, our system significantly improved its performance from 0.8377 to 0.9116 average F1-score (p < 0.01). This work discusses the potential of a human-AI collaborative system to support more accurate decision making while humans and AI learn from each other's strengths.

Authors
Min Hun Lee
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Daniel P. Siewiorek
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Asim Smailagic
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Alexandre Bernardino
Instituto Superior Tecnico, University of Lisbon, Lisbon, Lisbon, Portugal
Sergi Bermúdez i Badia
Universidade da Madeira, Funchal, Portugal
DOI

10.1145/3411764.3445472

Paper URL

https://doi.org/10.1145/3411764.3445472

Video
Directed Diversity: Leveraging Language Embedding Distances for Collective Creativity in Crowd Ideation
Abstract

Crowdsourcing can collect many diverse ideas by prompting ideators individually, but it can also generate redundant ideas. Prior methods reduce redundancy by presenting peers' ideas or peer-proposed prompts, but these require much human coordination. We introduce Directed Diversity, an automatic prompt selection approach that leverages language model embedding distances to maximize diversity. Ideators can be directed towards diverse prompts and away from prior ideas, thus improving their collective creativity. Since there are diverse metrics of diversity, we present a Diversity Prompting Evaluation Framework that consolidates metrics from several research disciplines to analyze stages along the ideation chain: prompt selection, prompt creativity, prompt-ideation mediation, and ideation creativity. Using this framework, we evaluated Directed Diversity in a simulation study and four user studies for the use case of crowdsourcing motivational messages to encourage physical activity. We show that automated diverse prompting can variously improve collective creativity across many nuanced metrics of diversity.
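
The abstract states that prompts are selected using language-model embedding distances but does not give the exact objective, so the following sketch shows one common instantiation: greedy max-min (farthest-point) selection in embedding space, away from prior ideas. The embed() placeholder, the candidate prompts, and the prior idea are all assumptions standing in for a real sentence encoder and a real idea pool.

```python
# One plausible instantiation (not necessarily the paper's objective): greedily
# pick the candidate prompt whose nearest neighbour among already-chosen prompts
# and prior ideas is farthest away in embedding space.
import numpy as np
from scipy.spatial.distance import cdist

def embed(texts):
    # Placeholder for any sentence encoder; pseudo-random vectors keyed on the
    # text are used here only so the sketch runs end to end.
    return np.stack([np.random.default_rng(abs(hash(t)) % (2 ** 32)).normal(size=32)
                     for t in texts])

def select_diverse(candidates, prior_ideas, k):
    cand_vecs = embed(candidates)
    anchors = embed(prior_ideas) if prior_ideas else np.zeros((0, cand_vecs.shape[1]))
    remaining, selected = list(range(len(candidates))), []
    for _ in range(min(k, len(candidates))):
        if anchors.shape[0] == 0:
            pick = remaining[0]                       # arbitrary seed prompt
        else:
            dists = cdist(cand_vecs[remaining], anchors, metric="cosine").min(axis=1)
            pick = remaining[int(np.argmax(dists))]   # farthest from everything chosen
        selected.append(pick)
        anchors = np.vstack([anchors, cand_vecs[pick]])
        remaining.remove(pick)
    return [candidates[i] for i in selected]

prompts = ["walk after dinner", "take the stairs today", "stretch at your desk",
           "join a weekend hike", "park farther from the entrance"]
print(select_diverse(prompts, prior_ideas=["go for a morning run"], k=3))
```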

Authors
Samuel Rhys Cox
National University of Singapore, Singapore, Singapore
Yunlong Wang
National University of Singapore, Singapore, Singapore
Ashraf Abdul
National University of Singapore, Singapore, Singapore
Christian von der Weth
National University of Singapore, Singapore, Singapore
Brian Y. Lim
National University of Singapore, Singapore, Singapore
DOI

10.1145/3411764.3445782

Paper URL

https://doi.org/10.1145/3411764.3445782

Video
Cody: An AI-Based System to Semi-Automate Coding for Qualitative Research
Abstract

Qualitative research can produce a rich understanding of a phenomenon but requires an essential and strenuous data annotation process known as coding. Coding can be repetitive and time-consuming, particularly for large datasets. Existing AI-based approaches for partially automating coding, such as supervised machine learning (ML) or explicit knowledge represented in code rules, require high technical literacy and lack transparency. Further, little is known about how researchers interact with AI-based coding assistance. We introduce Cody, an AI-based system that semi-automates coding through code rules and supervised ML. Cody supports researchers in interactively (re)defining code rules and uses ML to extend coding to unseen data. In two studies with qualitative researchers, we found that (1) code rules provide structure and transparency, (2) explanations are commonly desired but rarely used, and (3) suggestions benefit coding quality rather than coding speed, increasing intercoder reliability, calculated with Krippendorff's alpha, from 0.085 (MAXQDA) to 0.33 (Cody).
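
To make the rule-plus-ML combination concrete, here is a miniature sketch under hypothetical assumptions: a couple of made-up keyword rules seed code labels on some interview segments, and a simple text classifier then suggests codes for the segments no rule covers. It does not reproduce Cody's actual rule syntax, model, or interface.

```python
# Miniature illustration of rule-seeded coding plus ML extension (not Cody's
# actual rule syntax or model): keyword rules label some segments, and a
# classifier then suggests codes for the remaining, uncovered segments.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

code_rules = {                       # hypothetical researcher-defined rules
    "price": ["expensive", "cost", "afford"],
    "usability": ["confusing", "menu", "navigate"],
}

def apply_rules(segment):
    for code, keywords in code_rules.items():
        if any(k in segment.lower() for k in keywords):
            return code
    return None

segments = [
    "It is just too expensive for what it offers",
    "I could not afford the subscription",
    "The menu is confusing to navigate",
    "I stopped paying because of the cost",
    "Honestly the settings screen makes no sense to me",   # no rule fires
]

seeded = [(s, apply_rules(s)) for s in segments]
train = [(s, c) for s, c in seeded if c is not None]
unseen = [s for s, c in seeded if c is None]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit([s for s, _ in train], [c for _, c in train])
for s in unseen:
    print(clf.predict([s])[0], "<-", s)   # suggested code for uncoded segments
```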

Authors
Tim Rietz
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Alexander Maedche
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
DOI

10.1145/3411764.3445591

Paper URL

https://doi.org/10.1145/3411764.3445591

Video
Exploring Semi-Supervised Learning for Predicting Listener Backchannels
Abstract

Developing human-like conversational agents is a prime area in HCI research and subsumes many tasks. Predicting listener backchannels is one such actively researched task. While many studies have used different approaches for backchannel prediction, they have all depended on manual annotation of large datasets, a bottleneck that limits the scalability of development. To this end, we propose using semi-supervised techniques to automate the process of identifying backchannels, thereby easing the annotation process. To analyze the feasibility of our identification module, we compared backchannel prediction models trained on (a) manually annotated and (b) semi-supervised labels. Quantitative analysis revealed that the proposed semi-supervised approach attains 95% of the former's performance. Our user-study findings revealed that almost 60% of the participants found the backchannel responses predicted by the proposed model more natural. Finally, we also analyzed the impact of personality on the type of backchannel signals and validated our findings in the user study.
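
The abstract does not identify which semi-supervised technique is used, so the sketch below shows one generic option: a self-training (pseudo-labeling) loop that grows the labeled set from a small annotated seed. The synthetic feature vectors merely stand in for whatever listener features (e.g., prosodic or visual cues) the models actually consume.

```python
# Generic self-training (pseudo-labeling) loop, shown as one way to grow
# backchannel labels from a small annotated seed set; not the paper's
# specific semi-supervised technique.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
# Synthetic stand-ins for per-frame listener features (e.g., prosody, head pose).
X_labeled = rng.normal(size=(40, 8))
y_labeled = rng.integers(0, 2, size=40)          # 1 = backchannel, 0 = no backchannel
X_unlabeled = rng.normal(size=(500, 8))

X_train, y_train = X_labeled.copy(), y_labeled.copy()
for _ in range(5):                               # a few self-training rounds
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = clf.predict_proba(X_unlabeled)
    confident = proba.max(axis=1) >= 0.9         # keep only confident pseudo-labels
    if not confident.any():
        break
    X_train = np.vstack([X_train, X_unlabeled[confident]])
    y_train = np.concatenate([y_train, proba[confident].argmax(axis=1)])
    X_unlabeled = X_unlabeled[~confident]

print("final training-set size:", len(y_train))
```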

Authors
Vidit Jain
Indraprastha Institute of Information Technology (IIIT), Delhi, Delhi, India
Maitree Leekha
DTU, Delhi, Delhi, India
Rajiv Ratn Shah
IIITD, Delhi, Delhi, India
Jainendra Shukla
Indraprastha Institute of Information Technology Delhi, New Delhi, Delhi, India
DOI

10.1145/3411764.3445449

Paper URL

https://doi.org/10.1145/3411764.3445449

Video
Social Sense-making with AI: Designing an Open-ended AI experience with a Blind Child
Abstract

AI technologies are often used to aid people in performing discrete tasks with well-defined goals (e.g., recognising faces in images). Emerging technologies that provide continuous, real-time information enable more open-ended AI experiences. In partnership with a blind child, we explore the challenges and opportunities of designing human-AI interaction for a system intended to support social sensemaking. Adopting a research-through-design perspective, we reflect upon working with the uncertain capabilities of AI systems in the design of this experience. We contribute: (i) a concrete example of an open-ended AI system that enabled a blind child to extend his own capabilities; (ii) an illustration of the delta between imagined and actual use, highlighting how capabilities derive from the human-AI interaction and not the AI system alone; and (iii) a discussion of design choices to craft an ongoing human-AI interaction that addresses the challenge of uncertain outputs of AI systems.

Authors
Cecily Morrison
Microsoft Research, Cambridge, United Kingdom
Edward Cutrell
Microsoft Research, Redmond, Washington, United States
Martin Grayson
Microsoft Research, Cambridge, United Kingdom
Anja Thieme
Microsoft Research, Cambridge, United Kingdom
Alex S. Taylor
City, University of London, London, United Kingdom
Geert Roumen
Microsoft Research, Cambridge, United Kingdom
Camilla Longden
Microsoft Research, Cambridge, United Kingdom
Sebastian Tschiatschek
Microsoft Research, Cambridge, United Kingdom
Rita Faia Marques
Microsoft Research, Cambridge, United Kingdom
Abigail Sellen
Microsoft Research, Cambridge, United Kingdom
DOI

10.1145/3411764.3445290

Paper URL

https://doi.org/10.1145/3411764.3445290

Video