Evaluating AI Technologies B

Conference Name
CHI 2024
Listening to the Voices: Describing Ethical Caveats of Conversational User Interfaces According to Experts and Frequent Users
Abstract

Advances in natural language processing and understanding have led to a rapid growth in the popularity of conversational user interfaces (CUIs). While CUIs introduce novel benefits, they also yield risks that may exploit people's trust. Although research looking at unethical design deployed through graphical user interfaces (GUIs) established a thorough taxonomy of so-called dark patterns, there is a need for an equally in-depth understanding in the context of CUIs. Addressing this gap, we interviewed 27 participants from three cohorts: researchers, practitioners, and frequent users of CUIs. Applying thematic analysis, we develop five themes reflecting each cohort's insights about ethical design challenges and introduce the CUI Expectation Cycle, bridging system capabilities and user expectations while respecting each theme's ethical caveats. This research aims to inform future work to consider ethical constraints while adopting a human-centred approach.

Authors
Thomas Mildner
University of Bremen, Bremen, Germany
Orla Cooney
University College Dublin, Dublin, Ireland
Anna-Maria Meck
BMW Group, Munich, Germany
Marion Bartl
University College Dublin, Dublin, Ireland
Gian-Luca Savino
University of St. Gallen, St. Gallen, Switzerland
Philip R. Doyle
HMD Research, Dublin, Ireland
Diego Garaialde
University College Dublin, Dublin, Ireland
Leigh Clark
Bold Insight UK, London, United Kingdom
John Sloan
University College Dublin, Dublin, Ireland
Nina Wenig
University of Bremen, Bremen, Germany
Rainer Malaka
University of Bremen, Bremen, Germany
Jasmin Niess
University of Oslo, Oslo, Norway
Paper URL

https://doi.org/10.1145/3613904.3642542

Video
EvalLM: Interactive Evaluation of Large Language Model Prompts on User-Defined Criteria
Abstract

By simply composing prompts, developers can prototype novel generative applications with Large Language Models (LLMs). To refine prototypes into products, however, developers must iteratively revise prompts by evaluating outputs to diagnose weaknesses. Formative interviews (N=8) revealed that developers invest significant effort in manually evaluating outputs as they assess context-specific and subjective criteria. We present EvalLM, an interactive system for iteratively refining prompts by evaluating multiple outputs on user-defined criteria. By describing criteria in natural language, users can employ the system's LLM-based evaluator to get an overview of where prompts excel or fail, and improve these based on the evaluator's feedback. A comparative study (N=12) showed that EvalLM, when compared to manual evaluation, helped participants compose more diverse criteria, examine twice as many outputs, and reach satisfactory prompts with 59% fewer revisions. Beyond prompts, our work can be extended to augment model evaluation and alignment in specific application contexts.
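
The evaluation loop the abstract describes (user-defined criteria applied to prompt outputs by an LLM judge) can be illustrated with a minimal Python sketch. Everything below, including the call_llm placeholder, the criteria, and the prompt variants, is a hypothetical illustration rather than EvalLM's actual code.

# Minimal sketch of criterion-based evaluation of LLM outputs, in the spirit
# of the workflow described above; this is NOT EvalLM's implementation.

def call_llm(prompt: str) -> str:
    # Placeholder for a real chat-completion API call (hypothetical).
    return f"[LLM response to: {prompt[:40]}...]"

def evaluate_output(task_input: str, output: str, criteria: dict) -> dict:
    """Ask a judge LLM to assess one output against each user-defined criterion."""
    results = {}
    for name, description in criteria.items():
        judge_prompt = (
            f"Criterion '{name}': {description}\n"
            f"Input: {task_input}\nOutput: {output}\n"
            "Rate 1-5 how well the output satisfies the criterion and "
            "explain the main weakness in one sentence."
        )
        results[name] = call_llm(judge_prompt)
    return results

# Hypothetical usage: compare two prompt variants on the same criteria.
criteria = {
    "tone": "The summary should sound friendly but professional.",
    "coverage": "All key points of the source text must be mentioned.",
}
source_text = "Example source text to summarise."
for variant in ["Summarise warmly: {text}", "Summarise briefly: {text}"]:
    output = call_llm(variant.format(text=source_text))
    print(variant, evaluate_output(source_text, output, criteria))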

Authors
Tae Soo Kim
KAIST, Daejeon, Korea, Republic of
Yoonjoo Lee
KAIST, Daejeon, Korea, Republic of
Jamin Shin
NAVER AI Lab, Seoul, Korea, Republic of
Young-Ho Kim
NAVER AI Lab, Seongnam, Gyeonggi, Korea, Republic of
Juho Kim
KAIST, Daejeon, Korea, Republic of
Paper URL

https://doi.org/10.1145/3613904.3642216

Video
Understanding Choice Independence and Error Types in Human-AI Collaboration
Abstract

The ability to make appropriate delegation decisions is an important prerequisite of effective human-AI collaboration. Recent work, however, has shown that people struggle to evaluate AI systems in the presence of forecasting errors, falling well short of relying on AI systems appropriately. We use a pre-registered crowdsourcing study (N=611) to extend this literature by examining two crucial but underexplored features of human-AI decision-making: choice independence and error type. Subjects in our study repeatedly complete two prediction tasks and choose which predictions they want to delegate to an AI system. For one task, subjects receive a decision heuristic that allows them to make informed and relatively accurate predictions. The second task is substantially harder to solve, and subjects must come up with their own decision rule. We systematically vary the AI system's performance such that it either provides the best possible prediction for both tasks or only for one of the two. Our results demonstrate that people systematically violate choice independence by taking the AI's performance in an unrelated second task into account. Humans who delegate predictions to a superior AI in their own expertise domain significantly reduce appropriate reliance when the model makes systematic errors in a complementary expertise domain. In contrast, humans who delegate predictions to a superior AI in a complementary expertise domain significantly increase appropriate reliance when the model systematically errs in the human expertise domain. Furthermore, we show that humans differentiate between error types and that this effect is conditional on the considered expertise domain. This is the first empirical exploration of choice independence and error types in the context of human-AI collaboration. Our results have broad and important implications for the future design, deployment, and appropriate application of AI systems.

Authors
Alexander Erlei
University of Goettingen, Goettingen, Germany
Abhinav Sharma
Indian Institute of Information Technology Guwahati, Guwahati, Assam, India
Ujwal Gadiraju
Delft University of Technology, Delft, Netherlands
Paper URL

https://doi.org/10.1145/3613904.3641946

Video
ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing
Abstract

Evaluating outputs of large language models (LLMs) is challenging, requiring making—and making sense of—many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
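
The template-based comparison workflow described above can be sketched as a cross product of prompt variables and models. The sketch below only illustrates that general idea under assumed names (query_model, the template, and the model identifiers are all hypothetical); it is not ChainForge's implementation.

# Sketch of comparing responses across models and prompt variations.
from itertools import product

def query_model(model: str, prompt: str) -> str:
    # Placeholder for a real model API call (hypothetical).
    return f"[{model} response to: {prompt}]"

template = "Answer concisely in the style of {style}: {question}"
variables = {
    "style": ["a lawyer", "a teacher"],
    "question": ["What is prompt engineering?"],
}
models = ["model-a", "model-b"]  # hypothetical model identifiers

# Fill the template with every combination of variable values, then send each
# filled prompt to every model so responses can be compared side by side.
keys = list(variables)
for values in product(*(variables[k] for k in keys)):
    prompt = template.format(**dict(zip(keys, values)))
    for model in models:
        print(model, "|", prompt, "->", query_model(model, prompt))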

Award
Honorable Mention
Authors
Ian Arawjo
Harvard University, Cambridge, Massachusetts, United States
Chelse Swoopes
Harvard University, Cambridge, Massachusetts, United States
Priyan Vaithilingam
Harvard University, Cambridge, Massachusetts, United States
Martin Wattenberg
Harvard, Boston, Massachusetts, United States
Elena L. Glassman
Harvard University, Cambridge, Massachusetts, United States
Paper URL

https://doi.org/10.1145/3613904.3642016

Video
CloChat: Understanding How People Customize, Interact, and Experience Personas in Large Language Models
Abstract

Large language models (LLMs) have facilitated significant strides in generating conversational agents, enabling seamless, contextually relevant dialogues across diverse topics. However, the existing LLM-driven conversational agents have fixed personalities and functionalities, limiting their adaptability to individual user needs. Creating personalized agent personas with distinct expertise or traits can address this issue. Nonetheless, we lack knowledge of how people customize and interact with agent personas. In this research, we investigated how users customize agent personas and their impact on interaction quality, diversity, and dynamics. To this end, we developed CloChat, an interface supporting easy and accurate customization of agent personas in LLMs. We conducted a study comparing how participants interact with CloChat and ChatGPT. The results indicate that participants formed emotional bonds with the customized agents, engaged in more dynamic dialogues, and showed interest in sustaining interactions. These findings contribute to design implications for future systems with conversational agents using LLMs.
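
The persona customization the abstract describes amounts to turning user-chosen attributes into a system prompt that conditions the agent. The sketch below illustrates that idea with a build_persona_prompt helper and attribute names of our own invention; it is not CloChat's implementation.

# Sketch of turning user-configured persona attributes into a system prompt.

def build_persona_prompt(persona: dict) -> str:
    """Compose user-chosen persona attributes into a system prompt."""
    lines = ["You are a conversational agent with the following persona:"]
    for attribute, value in persona.items():
        lines.append(f"- {attribute}: {value}")
    return "\n".join(lines)

# Hypothetical persona a user might configure through such an interface.
persona = {
    "role": "travel planner",
    "expertise": "budget trips in East Asia",
    "tone": "warm and encouraging",
}
system_prompt = build_persona_prompt(persona)
print(system_prompt)
# A chat loop would then prepend this system prompt to the conversation, e.g.:
# messages = [{"role": "system", "content": system_prompt},
#             {"role": "user", "content": "Plan a 3-day trip to Busan."}]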

Authors
Juhye Ha
Graduate School of Information, Yonsei University, Seoul, Korea, Republic of
Hyeon Jeon
Seoul National University, Seoul, Korea, Republic of
DaEun Han
Graduate School, Seoul, Korea, Republic of
Jinwook Seo
Seoul National University, Seoul, Korea, Republic of
Changhoon Oh
Yonsei University, Seoul, Korea, Republic of
Paper URL

https://doi.org/10.1145/3613904.3642472

Video