Steering and Evaluating Generative AI

Conference Name
CHI 2026
A Framework to Characterize Reporting on Generative AI Use
Abstract

Unlike traditional predictive AI models, today's generative AI models are increasingly designed to be general-purpose, able to perform a wide range of tasks. This makes it challenging to develop a reliable and useful understanding of the ways in which this technology is and could be used. As a result, academic and policy researchers and generative AI providers have started to publish the results of their own investigations into the use of generative AI. This information is, however, fragmented, potentially incomplete, sometimes ambiguous, and often lacking in methodological specificity. In this paper, we conduct an integrative review to build a multi-dimensional framework that specifies what kind of information about generative AI use could be reported and how, and illustrate its analytical utility by applying the framework to a collection of over 110 industry documents. Our analysis reveals systematic patterns and omissions in current industry reporting and reflects on the narratives this reporting collectively advances about generative AI use.

Authors
Agathe Balayn
Microsoft Research, New York City, New York, United States
Varun Nagaraj Rao
Princeton University, Princeton, New Jersey, United States
Su Lin Blodgett
Microsoft Research, Montreal, Quebec, Canada
Aylin Caliskan
University of Washington, Seattle, Washington, United States
Solon Barocas
Microsoft Research, New York, New York, United States
An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Abstract

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.

Authors
Anna Martin-Boyle
University of Minnesota, Minneapolis, Minnesota, United States
William Humphreys
NASA Langley Research Center, Hampton, Virginia, United States
Martha Brown
NASA Langley Research Center, Hampton, Virginia, United States
Cara Leckey
NASA Langley, Poquoson, Virginia, United States
Harmanpreet Kaur
University of Minnesota, Minneapolis, Minnesota, United States
SemTabla: A Human-in-the-Loop Framework for Semantic Enrichment and Validation of Data Tables
Abstract

Data tables are widely used to record critical information, enabling decision-makers to derive insights through table question answering (Table QA). However, the metadata from table schemas alone often fails to capture the underlying business semantics embedded in the tabular data, leading to reasoning errors. Existing automated approaches to semantic enrichment face challenges of insufficient data utilization, narrow feature coverage, and limited interpretability. To overcome these limitations, we propose SemTabla, an interactive system that employs a human-in-the-loop mechanism to extract comprehensive and interpretable semantics from tabular data. Our key contributions include: (1) a hierarchical framework for extracting semantic attributes; (2) a novel sampling method that identifies critical but rare row instances; and (3) an interactive interface that supports visualization, validation, and refinement of the extracted table semantics. A user study confirmed the system's usability, and quantitative experiments demonstrate that the extracted semantics significantly enhance the reasoning capabilities of large language models.

Award
Honorable Mention
Authors
Zhuochen Jin
Huawei Cloud, Hangzhou, China
Yingjie Mi
Nanjing University, Nanjing, China
Yehang Zhu
Nanjing University, Nanjing, China
Yichen Yao
Nanjing University, Nanjing, China
Chongyang Yu
Nanjing University, Nanjing, China
Ke Xu
Nanjing University, Nanjing, China
LAPS: Automating Hypothesis-Driven Statistical Analysis of Public Survey Using Large Language Models
Abstract

Public surveys are indispensable resources for understanding social dynamics, yet their analysis often imposes a high cognitive load due to structural complexity. In this paper, we present LAPS, a Large Language Model (LLM)-assisted automated framework that supports end-to-end, hypothesis-driven statistical analysis of survey data. LAPS consists of four modules (i.e., Operationalization, Planning, Execution, and Reporting) with human-in-the-loop mechanisms to balance automation with user agency. To evaluate the applicability of LAPS, we conducted a within-subjects user study with 12 social science researchers across three analytical environments: traditional statistical tools, a general-purpose LLM, and LAPS. Our findings demonstrate that LAPS ensures researcher agency and analytical stability, reduces the cognitive burden in the analysis workflow, and produces trustworthy, coherent outputs. Based on these findings, we reflect on how LAPS improves researchers’ workflows and discuss design implications for scalable and trustworthy human-AI collaboration in survey-based research.

Authors
Jaehoon Kim
Hanyang University, Seoul, Korea, Republic of
Dayoung Jeong
Hanyang University, Seoul, Korea, Republic of
Beejin Son
Hanyang University, Seoul, Korea, Republic of
Hansung Kim
Hanyang University, Seoul, Korea, Republic of
Bogoan Kim
Chungbuk National University, Cheongju, Korea, Republic of
Kyungsik Han
Hanyang University, Seoul, Korea, Republic of
Video
Preference-Guided Prompt Optimization for Text-to-Image Generation
Abstract

Generative models are increasingly powerful, yet users struggle to guide them through prompts. The generative process is difficult to control and unpredictable, and user instructions may be ambiguous or under-specified. Prior prompt refinement tools rely heavily on human effort, while prompt optimization methods focus on numerical objective functions and are not designed for human-centered generative tasks, where feedback is better expressed as binary preferences and must converge within a few iterations. We present APPO, a preference-guided prompt optimization algorithm. Instead of iterating on prompts, users only provide binary preferential feedback. APPO adaptively balances its strategies between exploiting user feedback and exploring new directions, yielding effective and efficient optimization. We evaluate APPO on image generation, and the results show that APPO achieves satisfactory outcomes in fewer iterations and with lower cognitive load than manual prompt editing. We anticipate APPO will advance human-AI collaboration in generative tasks by leveraging user preferences to guide complex content creation.

Authors
Zhipeng Li
ETH Zürich, Zurich, Switzerland
Yi-Chi Liao
ETH Zürich, Zürich, Switzerland
Christian Holz
ETH Zürich, Zurich, Switzerland
Privy: Envisioning and Mitigating Privacy Risks for Consumer-facing AI Product Concepts
Abstract

AI creates and exacerbates privacy risks, yet practitioners lack effective resources to identify and mitigate these risks. We present Privy, a tool that guides practitioners without privacy expertise through structured privacy impact assessments to: (i) identify relevant risks in novel AI product concepts, and (ii) propose appropriate mitigations. Privy was shaped by a formative study with 11 practitioners, which informed two versions: one LLM-powered, the other template-based. We evaluated these two versions of Privy through a between-subjects, controlled study with 24 additional practitioners, whose assessments were reviewed by 13 independent privacy experts. Results show that Privy helps practitioners produce privacy assessments that experts deemed high quality: practitioners identified relevant risks and proposed appropriate mitigation strategies. These effects were augmented in the LLM-powered version. Practitioners themselves rated Privy as useful and usable, and their feedback illustrates how it helps overcome long-standing awareness, motivation, and ability barriers in privacy work.

Award
Honorable Mention
Authors
Hao-Ping (Hank) Lee
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Yu-Ju Yang
School of Information Sciences, Champaign, Illinois, United States
Matthew Bilik
University of Washington, Seattle, Washington, United States
Isadora Krsek
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Thomas Serban von Davier
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Kyzyl Monteiro
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Jason Lin
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Shivani Agarwal
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Jodi Forlizzi
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Sauvik Das
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States