Steering and Evaluating Generative AI

Conference Name
CHI 2026
A Framework to Characterize Reporting on Generative AI Use
Abstract

Unlike traditional predictive AI models, today's generative AI models are increasingly designed to be general-purpose, able to perform a wide range of tasks. This makes it challenging to develop a reliable and useful understanding of the ways in which this technology is and could be used. As a result, academic and policy researchers and generative AI providers have started to publish the results of their own investigations into the use of generative AI. This information is, however, fragmented, potentially incomplete, sometimes ambiguous, and often lacking in methodological specificity. In this paper, we conduct an integrative review to build a multi-dimensional framework that specifies what kind of information about generative AI use could be reported and how, and illustrate its analytical utility by applying the framework to a collection of over 110 industry documents. Our analysis reveals systematic patterns and omissions in current industry reporting and reflects on the narratives this reporting collectively advances about generative AI use.

Authors
Agathe Balayn
Microsoft Research, New York City, New York, United States
Varun Nagaraj Rao
Princeton University, Princeton, New Jersey, United States
Su Lin Blodgett
Microsoft Research, Montreal, Quebec, Canada
Aylin Caliskan
University of Washington, Seattle, Washington, United States
Solon Barocas
Microsoft Research, New York, New York, United States
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Abstract

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.

Authors
Anna Martin-Boyle
University of Minnesota, Minneapolis, Minnesota, United States
William Humphreys
NASA Langley Research Center, Hampton, Virginia, United States
Martha Brown
NASA Langley Research Center, Hampton, Virginia, United States
Cara Leckey
NASA Langley, Poquoson, Virginia, United States
Harmanpreet Kaur
University of Minnesota, Minneapolis, Minnesota, United States
SemTabla: A Human-in-the-Loop Framework for Semantic Enrichment and Validation of Data Tables
Abstract

Data tables are widely used to record critical information, enabling decision-makers to derive insights through table question answering (Table QA). However, the metadata from table schemas alone often fails to capture the underlying business semantics embedded in the tabular data, leading to reasoning errors. Existing automated approaches to semantic enrichment face challenges of insufficient data utilization, narrow feature coverage, and limited interpretability. To overcome these limitations, we propose SemTabla, an interactive system that employs a human-in-the-loop mechanism to extract comprehensive and interpretable semantics from tabular data. Our key contributions include: (1) a hierarchical framework for extracting semantic attributes; (2) a novel sampling method that identifies critical but rare row instances; and (3) an interactive interface that supports visualization, validation, and refinement of the extracted table semantics. A user study confirmed the system's usability, and quantitative experiments demonstrate that the extracted semantics significantly enhance the reasoning capabilities of large language models.

Award
Honorable Mention
Authors
Zhuochen Jin
Huawei Cloud, Hangzhou, China
Yingjie Mi
Nanjing University, Nanjing, China
Yehang Zhu
Nanjing University, Nanjing, China
Yichen Yao
Nanjing University, Nanjing, China
Chongyang Yu
Nanjing University, Nanjing, China
Ke Xu
Nanjing University, Nanjing, China
LAPS: Automating Hypothesis-Driven Statistical Analysis of Public Survey Using Large Language Models
Abstract

Public surveys are indispensable resources for understanding social dynamics, yet their analysis often imposes a high cognitive load due to structural complexity. In this paper, we present LAPS, a Large Language Model (LLM)-assisted automated framework that supports end-to-end, hypothesis-driven statistical analysis of survey data. LAPS consists of four modules (i.e., Operationalization, Planning, Execution, and Reporting) with human-in-the-loop mechanisms to balance automation with user agency. To evaluate the applicability of LAPS, we conducted a within-subjects user study with 12 social science researchers across three analytical environments: traditional statistical tools, a general-purpose LLM, and LAPS. Our findings demonstrate that LAPS ensures researcher agency and analytical stability, reduces the cognitive burden in the analysis workflow, and produces trustworthy, coherent outputs. Based on these findings, we reflect on how LAPS improves researchers’ workflows and discuss design implications for scalable and trustworthy human-AI collaboration in survey-based research.

Authors
Jaehoon Kim
Hanyang University, Seoul, Korea, Republic of
Dayoung Jeong
Hanyang University, Seoul, Korea, Republic of
Beejin Son
Hanyang University, Seoul, Korea, Republic of
Hansung Kim
Hanyang University, Seoul, Korea, Republic of
Bogoan Kim
Chungbuk National University, Cheongju, Korea, Republic of
Kyungsik Han
Hanyang University, Seoul, Korea, Republic of
Video
Preference-Guided Prompt Optimization for Text-to-Image Generation
Abstract

Generative models are increasingly powerful, yet users struggle to guide them through prompts. The generative process is difficult to control and unpredictable, and user instructions may be ambiguous or under-specified. Prior prompt refinement tools rely heavily on human effort, while prompt optimization methods focus on numerical objective functions and are not designed for human-centered generative tasks, where feedback is better expressed as binary preferences and must converge within a few iterations. We present APPO, a preference-guided prompt optimization algorithm. Instead of iterating on prompts, users only provide binary preferential feedback. APPO adaptively balances its strategies between exploiting user feedback and exploring new directions, yielding effective and efficient optimization. We evaluate APPO on image generation, and the results show that APPO achieves satisfactory outcomes in fewer iterations and with lower cognitive load than manual prompt editing. We anticipate APPO will advance human-AI collaboration in generative tasks by leveraging user preferences to guide complex content creation.

Authors
Zhipeng Li
ETH Zürich, Zurich, Switzerland
Yi-Chi Liao
ETH Zürich, Zürich, Switzerland
Christian Holz
ETH Zürich, Zurich, Switzerland
Privy: Envisioning and Mitigating Privacy Risks for Consumer-facing AI Product Concepts
Abstract

AI creates and exacerbates privacy risks, yet practitioners lack effective resources to identify and mitigate these risks. We present Privy, a tool that guides practitioners without privacy expertise through structured privacy impact assessments to: (i) identify relevant risks in novel AI product concepts, and (ii) propose appropriate mitigations. Privy was shaped by a formative study with 11 practitioners, which informed two versions: one LLM-powered, the other template-based. We evaluated these two versions of Privy through a between-subjects, controlled study with 24 additional practitioners, whose assessments were reviewed by 13 independent privacy experts. Results show that Privy helps practitioners produce privacy assessments that experts deemed high quality: practitioners identified relevant risks and proposed appropriate mitigation strategies. These effects were augmented in the LLM-powered version. Practitioners themselves rated Privy as useful and usable, and their feedback illustrates how it helps overcome long-standing awareness, motivation, and ability barriers in privacy work.

Award
Honorable Mention
Authors
Hao-Ping (Hank) Lee
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Yu-Ju Yang
School of Information Sciences, Champaign, Illinois, United States
Matthew Bilik
University of Washington, Seattle, Washington, United States
Isadora Krsek
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Thomas Serban von Davier
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Kyzyl Monteiro
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Jason Lin
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Shivani Agarwal
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Jodi Forlizzi
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Sauvik Das
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States