Evaluating AI Technologies B

Conference Name
CHI 2024
Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users
Abstract

Advances in natural language processing and understanding have led to a rapid growth in the popularity of conversational user interfaces (CUIs). While CUIs introduce novel benefits, they also yield risks that may exploit people's trust. Although research looking at unethical design deployed through graphical user interfaces (GUIs) established a thorough taxonomy of so-called dark patterns, there is a need for an equally in-depth understanding in the context of CUIs. Addressing this gap, we interviewed 27 participants from three cohorts: researchers, practitioners, and frequent users of CUIs. Applying thematic analysis, we develop five themes reflecting each cohort's insights about ethical design challenges and introduce the CUI Expectation Cycle, bridging system capabilities and user expectations while respecting each theme's ethical caveats. This research aims to inform future work to consider ethical constraints while adopting a human-centred approach.

Authors
Thomas Mildner
University of Bremen, Bremen, Germany
Orla Cooney
University College Dublin, Dublin, Ireland
Anna-Maria Meck
BMW Group, Munich, Germany
Marion Bartl
University College Dublin, Dublin, Ireland
Gian-Luca Savino
University of St. Gallen, St. Gallen, Switzerland
Philip R. Doyle
HMD Research, Dublin, Ireland
Diego Garaialde
University College Dublin, Dublin, Ireland
Leigh Clark
Bold Insight UK, London, United Kingdom
John Sloan
University College Dublin, Dublin, Ireland
Nina Wenig
University of Bremen, Bremen, Germany
Rainer Malaka
University of Bremen, Bremen, Germany
Jasmin Niess
University of Oslo, Oslo, Norway
Paper URL

https://doi.org/10.1145/3613904.3642542

Video
EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria
Abstract

By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
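
The evaluation loop the abstract describes (user-defined criteria applied to prompt outputs by an LLM judge) can be illustrated with a minimal Python sketch. Everything below, including the call_llm placeholder, the criteria, and the prompt variants, is a hypothetical illustration rather than EvalLM's actual code.

# Minimal sketch of criterion-based evaluation of LLM outputs, in the spirit
# of the workflow described above; this is NOT EvalLM's implementation.

def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion API call (hypothetical).
    return f"[LLM response to: {prompt[:40]}...]"

def evaluate_output(task_input: str, output: str, criteria: dict) -> dict:
    """Ask a judge LLM to assess one output against each user-defined criterion."""
    results = {}
    for name, description in criteria.items():
        judge_prompt = (
            f"Criterion '{name}': {description}\n"
            f"Input: {task_input}\nOutput: {output}\n"
            "Rate 1-5 how well the output satisfies the criterion and "
            "explain the main weakness in one sentence."
        )
        results[name] = call_llm(judge_prompt)
    return results

# Hypothetical usage: compare two prompt variants on the same criteria.
criteria = {
    "tone": "The summary should sound friendly but professional.",
    "coverage": "All key points of the source text must be mentioned.",
}
source_text = "Example source text to summarise."
for variant in ["Summarise warmly: {text}", "Summarise briefly: {text}"]:
    output = call_llm(variant.format(text=source_text))
    print(variant, evaluate_output(source_text, output, criteria))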

Authors
Tae Soo Kim
KAIST, Daejeon, Korea, Republic of
Yoonjoo Lee
KAIST, Daejeon, Korea, Republic of
Jamin Shin
NAVER AI Lab, Seoul, Korea, Republic of
Young-Ho Kim
NAVER AI Lab, Seongnam, Gyeonggi, Korea, Republic of
Juho Kim
KAIST, Daejeon, Korea, Republic of
Paper URL

https://doi.org/10.1145/3613904.3642216

Video
Understanding Choice Independence and Error Types in Human-AI Collaboration
Abstract

The ability to make appropriate delegation decisions is an important prerequisite of effective human-AI collaboration. Recent work, however, has shown that people struggle to evaluate AI systems in the presence of forecasting errors, falling well short of relying on AI systems appropriately. We use a pre-registered crowdsourcing study (N=611) to extend this literature by examining two crucial but underexplored features of human-AI decision-making: choice independence and error type. Subjects in our study repeatedly complete two prediction tasks and choose which predictions they want to delegate to an AI system. For one task, subjects receive a decision heuristic that allows them to make informed and relatively accurate predictions. The second task is substantially harder to solve, and subjects must come up with their own decision rule. We systematically vary the AI system's performance such that it either provides the best possible prediction for both tasks or only for one of the two. Our results demonstrate that people systematically violate choice independence by taking the AI's performance in an unrelated second task into account. Humans who delegate predictions to a superior AI in their own expertise domain significantly reduce appropriate reliance when the model makes systematic errors in a complementary expertise domain. In contrast, humans who delegate predictions to a superior AI in a complementary expertise domain significantly increase appropriate reliance when the model systematically errs in the human expertise domain. Furthermore, we show that humans differentiate between error types and that this effect is conditional on the considered expertise domain. This is the first empirical exploration of choice independence and error types in the context of human-AI collaboration. Our results have broad and important implications for the future design, deployment, and appropriate application of AI systems.

Authors
Alexander Erlei
University of Goettingen, Goettingen, Germany
Abhinav Sharma
Indian Institute of Information Technology Guwahati, Guwahati, Assam, India
Ujwal Gadiraju
Delft University of Technology, Delft, Netherlands
Paper URL

https://doi.org/10.1145/3613904.3641946

Video
ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing
Abstract

Evaluating outputs of large language models (LLMs) is challenging, requiring making—and making sense of—many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
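
The template-based comparison workflow described above can be sketched as a cross product of prompt variables and models. The sketch below only illustrates that general idea under assumed names (query_model, the template, and the model identifiers are all hypothetical); it is not ChainForge's implementation.

# Sketch of comparing responses across models and prompt variations.
from itertools import product

def query_model(model: str, prompt: str) -> str:
    # Placeholder for a real model API call (hypothetical).
    return f"[{model} response to: {prompt}]"

template = "Answer concisely in the style of {style}: {question}"
variables = {
    "style": ["a lawyer", "a teacher"],
    "question": ["What is prompt engineering?"],
}
models = ["model-a", "model-b"]  # hypothetical model identifiers

# Fill the template with every combination of variable values, then send each
# filled prompt to every model so responses can be compared side by side.
keys = list(variables)
for values in product(*(variables[k] for k in keys)):
    prompt = template.format(**dict(zip(keys, values)))
    for model in models:
        print(model, "|", prompt, "->", query_model(model, prompt))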

Award
Honorable Mention
Authors
Ian Arawjo
Harvard University, Cambridge, Massachusetts, United States
Chelse Swoopes
Harvard University, Cambridge, Massachusetts, United States
Priyan Vaithilingam
Harvard University, Cambridge, Massachusetts, United States
Martin Wattenberg
Harvard, Boston, Massachusetts, United States
Elena L. Glassman
Harvard University, Cambridge, Massachusetts, United States
Paper URL

https://doi.org/10.1145/3613904.3642016

Video
CloChat: Understanding How People Customize, Interact, and Experience Personas in Large Language Models
Abstract

Large language models (LLMs) have facilitated significant strides in generating conversational agents, enabling seamless, contextually relevant dialogues across diverse topics. However, the existing LLM-driven conversational agents have fixed personalities and functionalities, limiting their adaptability to individual user needs. Creating personalized agent personas with distinct expertise or traits can address this issue. Nonetheless, we lack knowledge of how people customize and interact with agent personas. In this research, we investigated how users customize agent personas and their impact on interaction quality, diversity, and dynamics. To this end, we developed CloChat, an interface supporting easy and accurate customization of agent personas in LLMs. We conducted a study comparing how participants interact with CloChat and ChatGPT. The results indicate that participants formed emotional bonds with the customized agents, engaged in more dynamic dialogues, and showed interest in sustaining interactions. These findings contribute to design implications for future systems with conversational agents using LLMs.
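
The persona customization the abstract describes amounts to turning user-chosen attributes into a system prompt that conditions the agent. The sketch below illustrates that idea with a build_persona_prompt helper and attribute names of our own invention; it is not CloChat's implementation.

# Sketch of turning user-configured persona attributes into a system prompt.

def build_persona_prompt(persona: dict) -> str:
    """Compose user-chosen persona attributes into a system prompt."""
    lines = ["You are a conversational agent with the following persona:"]
    for attribute, value in persona.items():
        lines.append(f"- {attribute}: {value}")
    return "\n".join(lines)

# Hypothetical persona a user might configure through such an interface.
persona = {
    "role": "travel planner",
    "expertise": "budget trips in East Asia",
    "tone": "warm and encouraging",
}
system_prompt = build_persona_prompt(persona)
print(system_prompt)
# A chat loop would then prepend this system prompt to the conversation, e.g.:
# messages = [{"role": "system", "content": system_prompt},
#             {"role": "user", "content": "Plan a 3-day trip to Busan."}]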

Authors
Juhye Ha
Graduate School of Information, Yonsei University, Seoul, Korea, Republic of
Hyeon Jeon
Seoul National University, Seoul, Korea, Republic of
DaEun Han
Graduate School, Seoul, Korea, Republic of
Jinwook Seo
Seoul National University, Seoul, Korea, Republic of
Changhoon Oh
Yonsei University, Seoul, Korea, Republic of
Paper URL

https://doi.org/10.1145/3613904.3642472

Video