Explaining and Evaluating AI Systems

Conference
CHI 2026
Interactive Explainable Ranking
Abstract

We propose an interactive decision-making tool for discovering and exploring explainable rankings for a given set of choices (e.g., job offers, vacation destinations, award candidates). We define an explainable ranking as an ordering of choices based on some consistent weighting of measured criteria. Our tool is designed to help users explore different orderings, criteria, and criterion weights in search of an explainable ranking that reflects their own personal preferences. To achieve this, we combine visualization, optimization, and (optionally) AI assistance to help users identify and correct or explain inconsistencies in their evaluation of different choices. Through user experiments, we demonstrate that our tool leads to more consistent explainable rankings with greater user confidence.
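The abstract defines an explainable ranking as an ordering of choices under a consistent weighting of measured criteria. A minimal illustration of that definition (not the authors' tool; the criteria, weights, and choice values below are invented for the example):

```python
# Sketch of an "explainable ranking": score each choice by one fixed,
# consistent weighting of its measured criteria, then sort by score.
# All names and numbers here are illustrative, not from the paper.

def explainable_ranking(choices, weights):
    """Rank choices (name -> {criterion: value}) by weighted score, best first."""
    def score(criteria):
        return sum(weights[c] * v for c, v in criteria.items())
    return sorted(choices, key=lambda name: score(choices[name]), reverse=True)

jobs = {
    "Offer A": {"salary": 0.8, "growth": 0.6, "location": 0.9},
    "Offer B": {"salary": 0.9, "growth": 0.4, "location": 0.5},
    "Offer C": {"salary": 0.6, "growth": 0.9, "location": 0.7},
}
weights = {"salary": 0.5, "growth": 0.3, "location": 0.2}

ranking = explainable_ranking(jobs, weights)  # ["Offer A", "Offer C", "Offer B"]
```

Because a single weight vector produces the whole ordering, the ranking is "explainable": each choice's position can be justified by the same weighted criteria.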

Award
Best Paper
Authors
Chao Zhang
Cornell University, Ithaca, New York, United States
Abe Davis
Cornell University, New York, New York, United States
PaperTrail: A Claim-Evidence Interface for Grounding Provenance in LLM-based Scholarly Q&A
Abstract

Large language models (LLMs) are increasingly used in scholarly question-answering (QA) systems to help researchers synthesize vast amounts of literature. However, these systems often produce subtle errors (e.g., unsupported claims, errors of omission), and current provenance mechanisms like source citations are not granular enough for the rigorous verification that the scholarly domain requires. To address this, we introduce PaperTrail, a novel interface that decomposes both LLM answers and source documents into discrete claims and evidence, mapping them to reveal supported assertions, unsupported claims, and information omitted from the source texts. We evaluated PaperTrail in a within-subjects study with 26 researchers who performed two scholarly editing tasks using PaperTrail and a baseline interface. Our results show that PaperTrail significantly lowered participants' trust compared to the baseline. However, this increased caution did not translate into behavioral changes, as people continued to rely on LLM-generated scholarly edits to avoid a cognitively burdensome task. We discuss the value of claim-evidence matching for understanding LLM trustworthiness in scholarly settings, and present design implications for cognition-friendly communication of provenance information.
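The core idea of claim-evidence matching can be sketched in a few lines. This is not PaperTrail's actual method (the paper does not describe its matching algorithm here); the word-overlap heuristic and threshold below are assumptions chosen only to make the idea concrete:

```python
# Illustrative claim-evidence matching: mark each answer claim as
# "supported" or "unsupported" by checking lexical overlap against
# sentences from the source document. The overlap heuristic and the
# 0.5 threshold are invented for this sketch.

def classify_claims(claims, evidence_sentences, threshold=0.5):
    results = {}
    for claim in claims:
        claim_words = set(claim.lower().split())
        # Fraction of the claim's words found in the best-matching sentence.
        best = max(
            len(claim_words & set(s.lower().split())) / len(claim_words)
            for s in evidence_sentences
        )
        results[claim] = "supported" if best >= threshold else "unsupported"
    return results

labels = classify_claims(
    ["the model uses attention", "the model is open source"],
    ["The model uses multi head attention layers."],
)
```

A real system would use semantic rather than lexical matching, but the interface idea is the same: every claim gets an explicit supported/unsupported label tied to specific evidence.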

Authors
Anna Martin-Boyle
University of Minnesota, Minneapolis, Minnesota, United States
Cara Leckey
NASA Langley, Poquoson, Virginia, United States
Martha Brown
NASA Langley Research Center, Hampton, Virginia, United States
Harmanpreet Kaur
University of Minnesota, Minneapolis, Minnesota, United States
GazeCoT: Unleashing Social Intelligence in Multimodal LLMs With Gaze-Informed Chain-of-Thought Reasoning
Abstract

Social intelligence is vital for effective human-AI interaction. While LLMs demonstrate strong text-based social intelligence, the vision modality remains challenging due to non-verbal social cues. For example, gaze is the primary conveyor of social attention, yet multimodal LLMs (MLLMs) struggle to perceive and interpret it accurately. Therefore, we propose GazeCoT, a pipeline that uses gaze estimation models to provide MLLMs with the attention of people in images or videos. The gaze information is provided as visual and text prompts compiled into a structured context to support MLLM social reasoning. Benchmark evaluation confirms that GazeCoT enhances MLLMs’ social intelligence by improving gaze perception. A user study in a challenging application involving parent-child interactions demonstrates that GazeCoT improves perceived explainability and trustworthiness by aligning MLLM social perception and social reasoning with human norms. We hope that GazeCoT, a versatile plug-and-play pipeline, can enable socially aware, MLLM-based HCI applications.
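The pipeline's text-prompt side, compiling gaze estimates into a structured context for the MLLM, can be sketched simply. This is a hypothetical illustration, not GazeCoT's implementation; the input format and wording are assumptions:

```python
# Illustrative sketch: fold per-person gaze targets (e.g., produced by a
# gaze-estimation model) into a structured text context that a multimodal
# LLM can reason over. The phrasing and data format are invented here.

def gaze_context(gaze_targets):
    """gaze_targets: list of (person, target) pairs from a gaze estimator."""
    lines = [f"- {person} is looking at {target}." for person, target in gaze_targets]
    return "Observed social attention:\n" + "\n".join(lines)

prompt = gaze_context([("the parent", "the child"), ("the child", "the toy")])
```

The resulting context would be prepended to the MLLM's input alongside visual prompts, so the model's chain-of-thought reasoning starts from explicit attention cues rather than having to infer gaze from pixels.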

Authors
Zhoutong Ye
Tsinghua University, Beijing, China
Xutong Wang
Tsinghua University, Beijing, China
Chengwen Zhang
Tsinghua University, Beijing, China
Ruiwen Zhang
Tsinghua University, Beijing, China
Mingze Sun
Tsinghua University, Beijing, China
Qinwei Li
Tsinghua University, Beijing, China
Chun Yu
Tsinghua University, Beijing, China
Yuanchun Shi
Tsinghua University, Beijing, China
PASTA: A Scalable Framework for Multi-Policy AI Compliance Evaluation
Abstract

AI compliance is becoming increasingly critical as AI systems grow more powerful and pervasive. Yet the rapid expansion of AI policies creates substantial burdens for resource-constrained practitioners lacking policy expertise. Existing approaches typically address one policy at a time, making multi-policy compliance costly. We present PASTA, a scalable compliance tool integrating four innovations: (1) a comprehensive model-card format supporting descriptive inputs across development stages; (2) a policy normalization scheme; (3) an efficient LLM-powered pairwise evaluation engine with cost-saving strategies; and (4) an interface delivering interpretable evaluations via compliance heatmaps and actionable recommendations. Expert evaluation shows PASTA’s judgments closely align with human experts (ρ ≥ .626). The system evaluates five major policies in under two minutes at approximately $3. A user study (N = 12) confirms that practitioners found its outputs easy to understand and actionable. Together, these components constitute a novel framework for scalable, automated AI governance.
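The pairwise-evaluation idea behind the compliance heatmap can be sketched as scoring every (model-card section, policy requirement) pair. This is not PASTA's engine; the trivial keyword judge stands in for the LLM judgment, and all section names and requirements are invented:

```python
# Hypothetical sketch of pairwise compliance evaluation: score each
# (model-card section, policy requirement) pair and collect the scores
# into a heatmap-style grid. The keyword check stands in for an LLM judge.

def judge(section_text, requirement):
    """Stand-in for an LLM compliance judge: trivial keyword match."""
    return 1.0 if requirement.lower() in section_text.lower() else 0.0

def compliance_heatmap(model_card, requirements):
    return {
        section: {req: judge(text, req) for req in requirements}
        for section, text in model_card.items()
    }

card = {
    "data": "Training data was documented and audited for bias.",
    "safety": "Red-teaming was performed before release.",
}
reqs = ["documented", "audited", "red-teaming"]
heatmap = compliance_heatmap(card, reqs)
```

With N sections and M requirements this is N × M judge calls, which is why the abstract highlights cost-saving strategies for the real LLM-powered engine.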

Authors
Yu Yang
University of British Columbia, Vancouver, British Columbia, Canada
Ig-Jae Kim
Korea Institute of Science and Technology, Seoul, Korea, Republic of
Dongwook Yoon
University of British Columbia, Vancouver, British Columbia, Canada
Video
"I think this is fair": Uncovering the Complexities of Stakeholder Decision-Making in AI Fairness Assessment
Abstract

Assessing fairness in artificial intelligence (AI) typically involves AI experts who select protected features, fairness metrics, and set fairness thresholds to assess outcome fairness. However, little is known about how stakeholders, particularly those affected by AI outcomes but lacking AI expertise, assess fairness. To address this gap, we conducted a qualitative study with 26 stakeholders without AI expertise, representing potential decision subjects in a credit rating scenario, to examine how they assess fairness when placed in the role of deciding on features with priority, metrics, and thresholds. We reveal that stakeholders' fairness decisions are more complex than typical AI expert practices: they considered features far beyond legally protected features, tailored metrics for specific contexts, set diverse yet stricter fairness thresholds, and even preferred designing customized fairness. Our results extend the understanding of how stakeholders can meaningfully contribute to AI fairness governance and mitigation, underscoring the importance of incorporating stakeholders' nuanced fairness judgments.

Authors
Lin Luo
University of Glasgow, Glasgow, United Kingdom
Yuri Nakao
FUJITSU LIMITED, Kawasaki, Kanagawa, Japan
Mathieu Chollet
University of Glasgow, Glasgow, United Kingdom
Hiroya Inakoshi
Fujitsu Limited, Kawasaki, Japan
Simone Stumpf
University of Glasgow, Glasgow, United Kingdom
Designing Staged Evaluation Workflows for LLMs: Integrating Domain Experts, Lay Users, and Model-Generated Evaluation Criteria
Abstract

Large Language Models (LLMs) are increasingly utilized for domain-specific tasks, yet evaluating their outputs remains challenging. A common strategy is to apply evaluation criteria to assess alignment with domain-specific standards, yet little is understood about how criteria differ across sources or where each type is most useful in the evaluation process. This study investigates criteria developed by domain experts, lay users, and LLMs to identify their complementary roles within an evaluation workflow. Results show that experts produce fact-based criteria with long-term value, lay users emphasize usability with a shorter-term focus, and LLMs target procedural checks for immediate task requirements. We also examine how criteria evolve between a priori and a posteriori phases, noting drift across stages as well as convergence in the a posteriori phase. Based on our observations, we propose design guidelines for a staged evaluation workflow combining the complementary strengths of these sources to balance quality, cost, and scalability.

Authors
Annalisa Szymanski
University of Notre Dame, South Bend, Indiana, United States
Simret Araya Gebreegziabher
University of Notre Dame, Notre Dame, Indiana, United States
Oghenemaro Anuyah
Microsoft, Redmond, Washington, United States
Ronald Metoyer
University of Notre Dame, South Bend, Indiana, United States
Toby Jia-Jun Li
University of Notre Dame, Notre Dame, Indiana, United States
Empowering Stakeholders with Participatory Auditing of Predictive AI: Perspectives from End-Users and Decision Subjects without AI Expertise
Abstract

Artificial intelligence (AI) applications have become ubiquitous in their impact on individuals and society, highlighting a crucial need for their responsible development. Recent research has called for participatory AI auditing, empowering individuals without AI expertise to audit AI applications throughout the entire AI development pipeline. Our work focuses on investigating how to support these kinds of auditors through participatory AI auditing tools and processes. We conducted a series of co-design workshops, using two health-related predictive AI applications as examples. Our results show that participants wanted to be part of AI audits, and were insightful in identifying the potential impacts of applications, but needed assistance in conducting audits, especially in how to measure impacts. Importantly, participants provided examples of impacts not considered in current risk/harm taxonomies. Our findings provide implications for the design of tools and processes to empower everyone to contribute to responsible AI development in the future.

Authors
Patrizia Di Campli San Vito
University of Glasgow, Glasgow, United Kingdom
Eva Fringi
University of Glasgow, Glasgow, United Kingdom
Penny S. Johnston
University of Stirling, Stirling, United Kingdom
Leonardo C. T. Bezerra
University of Stirling, Stirling, United Kingdom
Marios Aristodemou
University of York, York, United Kingdom
Siamak F. Shahandashti
University of York, York, United Kingdom
Emily O'Hara
University of Sheffield, Sheffield, United Kingdom
Laura Fiona Whyte
University of Glasgow, Glasgow, United Kingdom
Lin Luo
University of Glasgow, Glasgow, United Kingdom
Mark Wong
University of Glasgow, Glasgow, United Kingdom
Ayah Soufan
Strathclyde University, Glasgow, Scotland, United Kingdom
Yashar Moshfeghi
University of Strathclyde, Glasgow, United Kingdom
Simone Stumpf
University of Glasgow, Glasgow, United Kingdom