Evaluating AI Technologies A

https://doi.org/10.1145/3613904.3641960

Large language models (LLMs) have shown remarkable performance across various natural language processing (NLP) tasks, indicating their significant potential as data annotators. Although LLM-generated annotations are more cost-effective and efficient to obtain, they are often erroneous for complex or domain-specific tasks and may introduce bias when compared to human annotations. Therefore, instead of completely replacing human annotators with LLMs, we need to leverage the strengths of both LLMs and humans to ensure the accuracy and reliability of annotations. This paper presents a multi-step human-LLM collaborative approach where (1) LLMs generate labels and provide explanations, (2) a verifier assesses the quality of LLM-generated labels, and (3) human annotators re-annotate a subset of labels with lower verification scores. To facilitate human-LLM collaboration, we make use of LLM's ability to rationalize its decisions. LLM-generated explanations can provide additional information to the verifier model as well as help humans better understand LLM labels. We demonstrate that our verifier is able to identify potentially incorrect LLM labels for human re-annotation. Furthermore, we investigate the impact of presenting LLM labels and explanations on human re-annotation through crowdsourced studies.

Purdue University, West Lafayette, Indiana, United States

Megagon Labs, Mountain View, California, United States

https://doi.org/10.1145/3613904.3642633

Heightened AI expectations facilitate performance in human-AI interactions through placebo effects. While lowering expectations to control for placebo effects is advisable, overly negative expectations could induce nocebo effects. In a letter discrimination task, we informed participants that an AI would either increase or decrease their performance by adapting the interface, when in reality, no AI was present in any condition. A Bayesian analysis showed that participants had high expectations and performed descriptively better irrespective of the AI description when a sham-AI was present. Using cognitive modeling, we could trace this advantage back to participants gathering more information. A replication study verified that negative AI descriptions do not alter expectations, suggesting that performance expectations with AI are biased and robust to negative verbal descriptions. We discuss the impact of user expectations on AI interactions and evaluation.

Aalto University, Espoo, Finland

HU Berlin, Berlin, Germany

LMU Munich, Munich, Germany

https://doi.org/10.1145/3613904.3642564

Designing interactions for human-AI teams (HATs) can be challenging due to an AI agent's potential autonomy. Previous work suggests that higher autonomy does not always improve team performance, and situation-dependent autonomy adaptation might be beneficial. However, there is a lack of systematic empirical evaluations of such autonomy adaptation in human-AI interaction. Therefore, we propose a cooperative task in a simulated shared workspace to investigate effects of fixed levels of AI autonomy and situation-dependent autonomy adaptation on team performance and user satisfaction. We derive adaptation rules for AI autonomy from previous work and a pilot study. We implement these rule for our main experiment and find that team performance was best when humans collaborated with an agent adjusting its autonomy based on the situation. Additionally, users rated this agent highest in terms of perceived intelligence. From these results, we discuss the influence of varying autonomy degrees on HATs in shared workspaces.

TU Darmstadt, Darmstadt, Germany

TU Darmstadt, Darmstadt, Hessen, Germany

TU Darmstadt, Darmstadt, Germany

TU Darmstadt , Darmstadt, Germany

TU Darmstadt, Darmstadt, Germany