Evaluating AI Technologies A

Conference Name
CHI 2024
Are We Asking the Right Questions?: Designing for Community Stakeholders’ Interactions with AI in Policing
Abstract

Research into recidivism risk prediction in the criminal justice system has garnered significant attention from HCI, critical algorithm studies, and the emerging field of human-AI decision-making. This study focuses on algorithmic crime mapping, a prevalent yet underexplored form of algorithmic decision support (ADS) in this context. We conducted experiments and follow-up interviews with 60 participants, including community members, technical experts, and law enforcement agents (LEAs), to explore how lived experiences, technical knowledge, and domain expertise shape interactions with the ADS, impacting human-AI decision-making. Surprisingly, we found that domain experts (LEAs) often exhibited anchoring bias, readily accepting and engaging with the first crime map presented to them. Conversely, community members and technical experts were more inclined to engage with the tool, adjust controls, and generate different maps. Our findings highlight that all three stakeholders were able to provide critical feedback regarding AI design and use - community members questioned the core motivation of the tool, technical experts drew attention to the elastic nature of data science practice, and LEAs suggested redesign pathways such that the tool could complement their domain expertise.
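
For readers unfamiliar with algorithmic crime mapping, the sketch below shows one common form such a tool can take: a kernel-density "hotspot" surface over reported incident locations, with the bandwidth standing in for the kind of control participants could adjust to generate different maps. This is a generic illustration, not the specific ADS used in the study.

```python
# Generic illustration of an algorithmic crime map: a kernel-density "hotspot"
# surface over reported incident coordinates. Not the specific ADS used in the
# study; the bandwidth stands in for a user-adjustable control.

import numpy as np


def hotspot_grid(incidents: np.ndarray, bandwidth: float, grid_size: int = 50) -> np.ndarray:
    """incidents: (n, 2) array of (x, y) locations; returns a grid_size x grid_size density map."""
    xs = np.linspace(incidents[:, 0].min(), incidents[:, 0].max(), grid_size)
    ys = np.linspace(incidents[:, 1].min(), incidents[:, 1].max(), grid_size)
    gx, gy = np.meshgrid(xs, ys)
    cells = np.stack([gx.ravel(), gy.ravel()], axis=1)                   # grid cell centers
    d2 = ((cells[:, None, :] - incidents[None, :, :]) ** 2).sum(-1)      # squared distances to incidents
    density = np.exp(-d2 / (2 * bandwidth ** 2)).sum(axis=1)             # Gaussian kernel sum per cell
    return density.reshape(grid_size, grid_size)


# Toy usage: different bandwidths yield visibly different "hotspots" from the same data.
rng = np.random.default_rng(1)
incidents = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(200, 2))
print(hotspot_grid(incidents, bandwidth=0.2).max(), hotspot_grid(incidents, bandwidth=1.0).max())
```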

Authors
Md Romael Haque
Marquette University, Milwaukee, Wisconsin, United States
Devansh Saxena
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Katy Weathington
University of Colorado Boulder, Boulder, Colorado, United States
Joseph Chudzik
University of Chicago, Chicago, Illinois, United States
Shion Guha
University of Toronto, Toronto, Ontario, Canada
Paper URL

https://doi.org/10.1145/3613904.3642738

Human-LLM Collaborative Annotation Through Effective Verification of LLM Labels
Abstract

Large language models (LLMs) have shown remarkable performance across various natural language processing (NLP) tasks, indicating their significant potential as data annotators. Although LLM-generated annotations are more cost-effective and efficient to obtain, they are often erroneous for complex or domain-specific tasks and may introduce bias when compared to human annotations. Therefore, instead of completely replacing human annotators with LLMs, we need to leverage the strengths of both LLMs and humans to ensure the accuracy and reliability of annotations. This paper presents a multi-step human-LLM collaborative approach where (1) LLMs generate labels and provide explanations, (2) a verifier assesses the quality of LLM-generated labels, and (3) human annotators re-annotate a subset of labels with lower verification scores. To facilitate human-LLM collaboration, we make use of the LLM's ability to rationalize its decisions. LLM-generated explanations can provide additional information to the verifier model as well as help humans better understand LLM labels. We demonstrate that our verifier is able to identify potentially incorrect LLM labels for human re-annotation. Furthermore, we investigate the impact of presenting LLM labels and explanations on human re-annotation through crowdsourced studies.
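
The three-step pipeline above can be summarized in a short sketch. The function bodies, the names (`llm_label_with_explanation`, `verifier_score`), and the score-based routing rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the three-step human-LLM annotation loop described above.
# Function names, bodies, and the budget-based routing rule are illustrative
# assumptions, not the authors' implementation.

from dataclasses import dataclass


@dataclass
class Annotation:
    text: str
    label: str
    explanation: str
    verifier_score: float  # estimated quality of the LLM label
    needs_human: bool = False


def llm_label_with_explanation(text: str) -> tuple[str, str]:
    """Step 1: an LLM assigns a label and rationalizes its decision (replace with a real LLM call)."""
    return "positive", "placeholder explanation"


def verifier_score(text: str, label: str, explanation: str) -> float:
    """Step 2: a verifier scores label quality; the explanation is extra input it can use
    (e.g., a small model trained on a seed of gold labels)."""
    return 0.5


def annotate(corpus: list[str], human_budget: int) -> list[Annotation]:
    """Steps 1-3: label everything with the LLM, then route the least trusted items to humans."""
    annotations = []
    for text in corpus:
        label, explanation = llm_label_with_explanation(text)
        score = verifier_score(text, label, explanation)
        annotations.append(Annotation(text, label, explanation, score))
    # Step 3: humans re-annotate only the subset with the lowest verification scores.
    for ann in sorted(annotations, key=lambda a: a.verifier_score)[:human_budget]:
        ann.needs_human = True
    return annotations


results = annotate(["great product", "terrible service", "okay, I guess"], human_budget=1)
print([(a.label, a.needs_human) for a in results])
```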

Authors
Xinru Wang
Purdue University, West Lafayette, Indiana, United States
Hannah Kim
Megagon Labs, Mountain View, California, United States
Sajjadur Rahman
Megagon Labs, Mountain View, California, United States
Kushan Mitra
Megagon Labs, Mountain View, California, United States
Zhengjie Miao
Megagon Labs, Mountain View, California, United States
Paper URL

https://doi.org/10.1145/3613904.3641960

"AI enhances our performance, I have no doubt this one will do the same": The Placebo effect is robust to negative descriptions of AI
Abstract

Heightened AI expectations facilitate performance in human-AI interactions through placebo effects. While lowering expectations to control for placebo effects is advisable, overly negative expectations could induce nocebo effects. In a letter discrimination task, we informed participants that an AI would either increase or decrease their performance by adapting the interface, when in reality, no AI was present in any condition. A Bayesian analysis showed that participants had high expectations and performed descriptively better irrespective of the AI description when a sham-AI was present. Using cognitive modeling, we could trace this advantage back to participants gathering more information. A replication study verified that negative AI descriptions do not alter expectations, suggesting that performance expectations with AI are biased and robust to negative verbal descriptions. We discuss the impact of user expectations on AI interactions and evaluation.
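
The abstract attributes the performance advantage to participants "gathering more information." One standard way to formalize that mechanism is a sequential-sampling (drift-diffusion) model, in which a wider decision boundary means more evidence is accumulated before responding. The simulation below is a generic illustration of that mechanism under assumed parameters, not the authors' fitted cognitive model.

```python
# Generic drift-diffusion simulation of the "gathering more information" mechanism:
# a wider decision boundary means more evidence is accumulated before responding,
# which raises accuracy at the cost of time. Parameters are illustrative only.

import numpy as np


def simulate_ddm(drift: float, boundary: float, n_trials: int = 2000,
                 dt: float = 0.005, noise: float = 1.0, seed: int = 0) -> tuple[float, float]:
    """Return (accuracy, mean decision time) for a symmetric two-boundary diffusion."""
    rng = np.random.default_rng(seed)
    correct, times = 0, []
    for _ in range(n_trials):
        evidence, t = 0.0, 0.0
        while abs(evidence) < boundary:
            evidence += drift * dt + noise * np.sqrt(dt) * rng.standard_normal()
            t += dt
        correct += evidence >= boundary   # upper boundary = correct response
        times.append(t)
    return correct / n_trials, float(np.mean(times))


# A wider boundary (more information gathered) yields higher accuracy but slower responses.
print(simulate_ddm(drift=1.0, boundary=0.5))
print(simulate_ddm(drift=1.0, boundary=1.5))
```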

Authors
Agnes Mercedes Kloft
Aalto University, Espoo, Finland
Robin Welsch
Aalto University, Espoo, Finland
Thomas Kosch
HU Berlin, Berlin, Germany
Steeven Villa
LMU Munich, Munich, Germany
Paper URL

https://doi.org/10.1145/3613904.3642633

An Evaluation of Situational Autonomy for Human-AI Collaboration in a Shared Workspace Setting
Abstract

Designing interactions for human-AI teams (HATs) can be challenging due to an AI agent's potential autonomy. Previous work suggests that higher autonomy does not always improve team performance, and situation-dependent autonomy adaptation might be beneficial. However, there is a lack of systematic empirical evaluations of such autonomy adaptation in human-AI interaction. Therefore, we propose a cooperative task in a simulated shared workspace to investigate the effects of fixed levels of AI autonomy and situation-dependent autonomy adaptation on team performance and user satisfaction. We derive adaptation rules for AI autonomy from previous work and a pilot study. We implement these rules for our main experiment and find that team performance was best when humans collaborated with an agent adjusting its autonomy based on the situation. Additionally, users rated this agent highest in terms of perceived intelligence. From these results, we discuss the influence of varying autonomy degrees on HATs in shared workspaces.
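
A minimal sketch of what situation-dependent autonomy adaptation can look like as a rule table is given below. The situation features, autonomy levels, and thresholds are illustrative assumptions; the paper derives its own adaptation rules from prior work and a pilot study.

```python
# Illustrative sketch of situation-dependent autonomy adaptation for an agent in a
# shared workspace. Features, levels, and thresholds are assumptions for
# illustration, not the rules derived in the paper.

from dataclasses import dataclass
from enum import IntEnum


class Autonomy(IntEnum):
    SUGGEST_ONLY = 1       # agent proposes actions, human confirms each one
    ACT_ON_SUBTASK = 2     # agent completes delegated subtasks on its own
    FULLY_AUTONOMOUS = 3   # agent selects and executes tasks itself


@dataclass
class Situation:
    human_is_busy: bool        # human currently occupied with their own subtask
    workspace_conflict: bool   # agent's next action would occupy the human's area
    agent_confidence: float    # agent's confidence in its current plan (0..1)


def adapt_autonomy(s: Situation) -> Autonomy:
    """Pick an autonomy level from the current situation (fixed-level agents skip this step)."""
    if s.workspace_conflict or s.agent_confidence < 0.5:
        return Autonomy.SUGGEST_ONLY        # defer to the human when interference is likely
    if s.human_is_busy and s.agent_confidence >= 0.8:
        return Autonomy.FULLY_AUTONOMOUS    # act independently instead of interrupting
    return Autonomy.ACT_ON_SUBTASK          # default: handle delegated subtasks


print(adapt_autonomy(Situation(human_is_busy=True, workspace_conflict=False, agent_confidence=0.9)))
```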

Authors
Vildan Salikutluk
TU Darmstadt, Darmstadt, Germany
Janik Schöpper
TU Darmstadt, Darmstadt, Hessen, Germany
Franziska Herbert
TU Darmstadt, Darmstadt, Germany
Katrin Scheuermann
TU Darmstadt, Darmstadt, Germany
Eric Frodl
TU Darmstadt, Darmstadt, Germany
Dirk Balfanz
TU Darmstadt, Darmstadt, Germany
Frank Jäkel
TU Darmstadt, Darmstadt, Germany
Dorothea Koert
TU Darmstadt, Darmstadt, Germany
Paper URL

https://doi.org/10.1145/3613904.3642564

Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling
Abstract

As deep neural networks are more commonly deployed in high-stakes domains, their black-box nature makes uncertainty quantification challenging. We investigate the effects of presenting conformal prediction sets (a distribution-free class of methods for generating prediction sets with specified coverage) to express uncertainty in AI-advised decision-making. Through a large online experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. In a pre-registered analysis, we find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or less than Top-1 and Top-k displays for easy images, prediction sets excel at assisting humans in labeling out-of-distribution (OOD) images, especially when the set size is small. Our results empirically pinpoint practical challenges of conformal prediction sets and provide implications on how to incorporate them for real-world decision-making.
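
For reference, the sketch below shows split conformal prediction for a classifier: a nonconformity threshold is calibrated on held-out data so that prediction sets contain the true label with probability at least 1 - alpha. It is a generic illustration of the method class, not the authors' experimental setup.

```python
# Minimal sketch of split conformal prediction for classification: calibrate a
# nonconformity threshold on held-out data so that prediction sets contain the
# true label with probability >= 1 - alpha. Generic illustration, not the
# paper's experimental setup.

import numpy as np


def calibrate_threshold(cal_probs: np.ndarray, cal_labels: np.ndarray, alpha: float) -> float:
    """cal_probs: (n, k) softmax outputs on a calibration set; cal_labels: (n,) true classes."""
    n = len(cal_labels)
    # Nonconformity score: 1 - softmax probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return float(np.quantile(scores, q_level, method="higher"))


def prediction_set(test_probs: np.ndarray, threshold: float) -> np.ndarray:
    """Return all class indices whose nonconformity score falls below the threshold."""
    return np.where(1.0 - test_probs <= threshold)[0]


# Toy usage: random "softmax" outputs stand in for a real image classifier.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(10), size=500)
cal_labels = rng.integers(0, 10, size=500)
tau = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
print(prediction_set(rng.dirichlet(np.ones(10)), tau))  # a (possibly large) set of candidate labels
```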

Award
Honorable Mention
Authors
Dongping Zhang
Northwestern University, Evanston, Illinois, United States
Angelos Chatzimparmpas
Northwestern University, Evanston, Illinois, United States
Negar Kamali
Northwestern University, Evanston, Illinois, United States
Jessica Hullman
Northwestern University, Evanston, Illinois, United States
Paper URL

https://doi.org/10.1145/3613904.3642446
