A Framework to Characterize Reporting on Generative AI Use
Description

Unlike traditional predictive AI models, today's generative AI models are increasingly designed to be general-purpose, able to perform a wide range of tasks. This makes it challenging to develop a reliable and useful understanding of the ways in which this technology is and could be used. As a result, academic and policy researchers and generative AI providers have started to publish the results of their own investigations into the use of generative AI. This information is, however, fragmented, potentially incomplete, sometimes ambiguous, and often lacking in methodological specificity. In this paper, we conducted an integrative review to build a multi-dimensional framework that specifies what kind of information about generative AI use could be reported and how, and illustrated its analytical utility by applying the framework to a collection of over 110 industry documents. Our analysis reveals systematic patterns and omissions in current industry reporting and reflects on the narratives this reporting collectively advances about generative AI use.

An Expert Schema for Evaluating Large Language Model Errors in Scholarly Question-Answering Systems
Description

Large Language Models (LLMs) are transforming scholarly tasks like search and summarization, but their reliability remains uncertain. Current evaluation metrics for testing LLM reliability are primarily automated approaches that prioritize efficiency and scalability, but lack contextual nuance and fail to reflect how scientific domain experts assess LLM outputs in practice. We developed and validated a schema for evaluating LLM errors in scholarly question-answering systems that reflects the assessment strategies of practicing scientists. In collaboration with domain experts, we identified 20 error patterns across seven categories through thematic analysis of 68 question-answer pairs. We validated this schema through contextual inquiries with 10 additional scientists, which showed not only which errors experts naturally identify but also how structured evaluation schemas can help them detect previously overlooked issues. Domain experts use systematic assessment strategies, including technical precision testing, value-based evaluation, and meta-evaluation of their own practices. We discuss implications for supporting expert evaluation of LLM outputs, including opportunities for personalized, schema-driven tools that adapt to individual evaluation patterns and expertise levels.
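The schema described above groups fine-grained error patterns under broader categories and is applied to individual question-answer pairs. A minimal sketch of how such a schema could be represented as a data structure follows; the category and pattern names here are illustrative assumptions, not the paper's actual taxonomy.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ErrorPattern:
    category: str   # one of the schema's seven broad categories
    name: str       # one of the 20 fine-grained error patterns

@dataclass
class Annotation:
    """An expert's judgment on a single QA pair."""
    question: str
    answer: str
    patterns: list = field(default_factory=list)  # patterns the expert flagged

    def add(self, pattern: ErrorPattern):
        if pattern not in self.patterns:
            self.patterns.append(pattern)

# Illustrative pattern, not taken from the paper's schema.
FACTUAL_OVERCLAIM = ErrorPattern("factual", "overclaimed certainty")

ann = Annotation("What does gene X regulate?",
                 "Gene X definitively regulates pathway Y.")
ann.add(FACTUAL_OVERCLAIM)
```

Structuring annotations this way keeps each expert judgment tied to both the QA pair and the schema level it was made at, which is what a schema-driven evaluation tool would need.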

SemTabla: A Human-in-the-Loop Framework for Semantic Enrichment and Validation of Data Tables
Description

Data tables are widely used to record critical information, enabling decision-makers to derive insights through table question answering (Table QA). However, the metadata from table schemas alone often fails to capture the underlying business semantics embedded in the tabular data, leading to reasoning errors. Existing automated approaches to semantic enrichment struggle with insufficient data utilization, narrow feature coverage, and limited interpretability. To overcome these limitations, we propose SemTabla, an interactive system that employs a human-in-the-loop mechanism to extract comprehensive and interpretable semantics from tabular data. Our key contributions include: (1) a hierarchical framework for extracting semantic attributes; (2) a novel sampling method that identifies critical but rare row instances; and (3) an interactive interface that supports visualization, validation, and refinement of the extracted table semantics. A user study confirmed the system's usability, and quantitative experiments demonstrate that the extracted semantics significantly enhance the reasoning capabilities of large language models.
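The abstract's second contribution, sampling "critical but rare" rows so a human can validate the semantics they carry, can be illustrated with a toy frequency-based sampler. This is a sketch of the general idea under assumed inputs (rows as dicts, rarity measured per column value), not SemTabla's actual method.

```python
from collections import Counter

def rare_row_sample(rows, column, k=3):
    """Return the k rows whose value in `column` is least frequent.

    Rare values are the ones an automated enricher is most likely to
    mishandle, so they are surfaced first for human validation.
    """
    counts = Counter(r[column] for r in rows)
    return sorted(rows, key=lambda r: counts[r[column]])[:k]

rows = [
    {"region": "US"}, {"region": "US"}, {"region": "US"},
    {"region": "EU"}, {"region": "EU"},
    {"region": "APAC"},  # appears once: rare, so surfaced first
]
picked = rare_row_sample(rows, "region", k=1)
```

A real system would combine rarity with a notion of criticality (e.g., impact on downstream Table QA answers) rather than frequency alone.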

LAPS: Automating Hypothesis-Driven Statistical Analysis of Public Survey Using Large Language Models
Description

Public surveys are indispensable resources for understanding social dynamics, yet their analysis often imposes a high cognitive load due to structural complexity. In this paper, we present LAPS, a Large Language Model (LLM)-assisted automated framework that supports end-to-end, hypothesis-driven statistical analysis of survey data. LAPS consists of four modules (i.e., Operationalization, Planning, Execution, and Reporting) with human-in-the-loop mechanisms to balance automation with user agency. To evaluate the applicability of LAPS, we conducted a within-subjects user study with 12 social science researchers across three analytical environments: traditional statistical tools, a general-purpose LLM, and LAPS. Our findings demonstrate that LAPS ensures researcher agency and analytical stability, reduces the cognitive burden in the analysis workflow, and produces trustworthy, coherent outputs. Based on these findings, we reflect on how LAPS improves researchers’ workflows and discuss design implications for scalable and trustworthy human-AI collaboration in survey-based research.
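The four-module pipeline named in the abstract can be sketched as a chain of stages with a human checkpoint between them. The module names follow the abstract, but every internal detail (variable names, the test chosen, the shape of each artifact) is an assumption for illustration, not LAPS's actual implementation.

```python
# Skeleton of a hypothesis-driven analysis pipeline with human-in-the-loop
# checkpoints between stages. All internals are illustrative assumptions.

def operationalize(hypothesis):
    # Map a natural-language hypothesis to concrete survey variables
    # and a statistical test.
    return {"vars": ["q12", "q30"], "test": "chi-square"}

def plan(spec):
    # Turn the operationalized spec into an ordered list of analysis steps.
    return [f"crosstab {spec['vars'][0]} x {spec['vars'][1]}",
            f"run {spec['test']}"]

def execute(steps):
    # Run each planned step; here just a stub marking steps complete.
    return {step: "done" for step in steps}

def report(results):
    return f"Completed {len(results)} analysis steps."

def confirm(stage, artifact):
    # Human-in-the-loop hook: in a real system the researcher reviews
    # and may edit the artifact here before the pipeline continues.
    return artifact

spec = confirm("operationalization", operationalize("income ~ region"))
steps = confirm("planning", plan(spec))
summary = report(execute(steps))
```

The `confirm` hook is where the abstract's balance between automation and user agency would live: the pipeline pauses, the researcher approves or edits, and only then does the next module run.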

Preference-Guided Prompt Optimization for Text-to-Image Generation
Description

Generative models are increasingly powerful, yet users struggle to guide them through prompts. The generative process is difficult to control and unpredictable, and user instructions may be ambiguous or under-specified. Prior prompt refinement tools rely heavily on human effort, while prompt optimization methods focus on numerical objective functions and are not designed for human-centered generative tasks, where feedback is better expressed as binary preferences and optimization must converge within a few iterations. We present APPO, a preference-guided prompt optimization algorithm. Instead of editing prompts directly, users only provide binary preference feedback. APPO adaptively balances its strategies between exploiting user feedback and exploring new directions, yielding effective and efficient optimization. We evaluate APPO on image generation, and the results show that APPO enables users to achieve satisfactory outcomes in fewer iterations and with lower cognitive load than manual prompt editing. We anticipate APPO will advance human-AI collaboration in generative tasks by leveraging user preferences to guide complex content creation.
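The core loop the abstract describes, keep a current best prompt, propose a variant, and let a binary preference decide which survives, can be sketched in a few lines. This toy loop illustrates the exploit/explore idea only; the `prefer` and `mutate` callbacks, the exploration probability, and the acceptance rule are all assumptions, not APPO's algorithm.

```python
import random

def optimize_prompt(base_prompt, prefer, mutate, iters=5, explore_p=0.5, seed=0):
    """Toy preference-guided loop.

    prefer(a, b) -> True if the user prefers candidate a over b (binary
    feedback); mutate(p, explore) -> a new candidate, taking a larger
    step when exploring new directions.
    """
    rng = random.Random(seed)
    best = base_prompt
    for _ in range(iters):
        explore = rng.random() < explore_p   # explore a new direction...
        candidate = mutate(best, explore)    # ...or exploit local feedback
        if prefer(candidate, best):          # one binary comparison per step
            best = candidate
    return best

# Toy usage: the "user" prefers longer prompts; mutation appends detail words.
prefer = lambda a, b: len(a) > len(b)
mutate = lambda p, explore: p + (" highly detailed" if explore else " detailed")
result = optimize_prompt("a cat", prefer, mutate)
```

In an image-generation setting, `prefer` would be the user picking one of two rendered images, which is the low-effort binary signal the abstract argues for.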

Privy: Envisioning and Mitigating Privacy Risks for Consumer-facing AI Product Concepts
Description

AI creates and exacerbates privacy risks, yet practitioners lack effective resources to identify and mitigate these risks. We present Privy, a tool that guides practitioners without privacy expertise through structured privacy impact assessments to: (i) identify relevant risks in novel AI product concepts, and (ii) propose appropriate mitigations. Privy was shaped by a formative study with 11 practitioners, which informed two versions: one LLM-powered, the other template-based. We evaluated these two versions of Privy through a between-subjects, controlled study with 24 separate practitioners, whose assessments were reviewed by 13 independent privacy experts. Results show that Privy helps practitioners produce privacy assessments that experts deemed high quality: practitioners identified relevant risks and proposed appropriate mitigation strategies. These effects were augmented in the LLM-powered version. Practitioners themselves rated Privy as useful and usable, and their feedback illustrates how it helps overcome long-standing awareness, motivation, and ability barriers in privacy work.
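A structured privacy impact assessment, as guided by the template-based version the abstract mentions, amounts to filling a fixed record that pairs each identified risk with a mitigation. The field names below are illustrative assumptions about what such a template could contain, not Privy's actual template.

```python
# Hypothetical template for a structured privacy impact assessment record.
ASSESSMENT_TEMPLATE = {
    "product_concept": "",   # short description of the AI product concept
    "data_collected": [],    # personal data types the concept touches
    "risks": [],             # identified privacy risks
    "mitigations": [],       # one proposed mitigation per risk
}

def new_assessment(concept):
    # Copy the template (fresh lists per record) and fill in the concept.
    record = {k: (v.copy() if isinstance(v, list) else v)
              for k, v in ASSESSMENT_TEMPLATE.items()}
    record["product_concept"] = concept
    return record

a = new_assessment("voice-note summarizer")
a["data_collected"].append("audio recordings")
a["risks"].append("transcripts may expose third parties' speech")
a["mitigations"].append("on-device transcription; redact names by default")
```

Keeping risks and mitigations as parallel lists makes the expert-review step in the study concrete: a reviewer can check each risk for relevance and its paired mitigation for appropriateness.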
