More Human or More AI? Visualizing Human-AI Collaboration Disclosures in Journalistic News Production
Description

Within journalistic editorial processes, disclosing AI usage is currently limited to simplistic labels, which miss the nuance of how humans and AI collaborated on a news article. Through co-design sessions (N=10), we elicited 69 disclosure designs and implemented four prototypes that visually disclose human–AI collaboration in journalism. We then ran a within-subjects lab study (N=32) to examine how disclosure visualizations (Textual, Role-based Timeline, Task-based Timeline, Chatbot) and collaboration ratios (Primarily Human vs. Primarily AI) influenced visualization perceptions, gaze patterns, and post-experience responses. We found that textual disclosures were least effective in communicating human–AI collaboration, whereas the Chatbot offered the most in-depth information. Furthermore, while role-based timelines amplified perceived AI contribution in primarily human articles, task-based timelines shifted perceptions toward human involvement in primarily AI articles. We contribute human–AI collaboration disclosure visualizations and their evaluation, along with cautionary considerations on how visualizations can alter perceptions of AI's actual role during news article creation.

Open-ended Structured Question Assessment with Human-LLM Collaboration
Description

Open-ended Structured Questions (OSQs) assess not only students' knowledge but also their reasoning and expression. However, grading OSQs requires fine-grained, scoring point–level analysis, which is labor-intensive and difficult to scale. Although recent LLM-based and human–AI collaborative grading systems improve efficiency, they mainly operate at the whole-response level and lack support for point-level inspection, correction, and feedback integration. We present VeriGrader, a novel human–AI collaborative system for OSQ grading. It combines chain-of-thought prompting with scoring point– and response-level in-context learning to enable interpretable LLM grading and iterative refinement from instructor feedback. A coordinated multi-view interface supports efficient verification of response segments, matched scoring points, and rationales. We evaluate VeriGrader using real course data and a user study with 12 participants. Results show that VeriGrader improves grading efficiency, accuracy, and consistency over the baselines, demonstrating its effectiveness and promoting human–AI collaboration in educational assessment.
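The grading approach described above (chain-of-thought prompting combined with scoring point– and response-level in-context examples) can be sketched as a prompt-assembly step. This is a minimal illustration under assumed names and an assumed prompt format; it is not the authors' actual VeriGrader implementation.

```python
# Hypothetical sketch of scoring point-level, chain-of-thought grading
# prompt assembly, in the spirit of the VeriGrader description.
# All function names and the prompt layout are illustrative assumptions.

def build_grading_prompt(question, scoring_points, examples, response):
    """Assemble a prompt asking an LLM to check each scoring point
    against the student response, with step-by-step reasoning."""
    lines = [f"Question: {question}", "Scoring points:"]
    for i, (point, marks) in enumerate(scoring_points, 1):
        lines.append(f"  {i}. {point} ({marks} marks)")
    # Response-level in-context examples, e.g. instructor-verified gradings
    # fed back into the prompt for iterative refinement.
    for ex_response, ex_grading in examples:
        lines.append(f"Example response: {ex_response}")
        lines.append(f"Example grading: {ex_grading}")
    lines.append(f"Student response: {response}")
    lines.append("For each scoring point, quote the matching response "
                 "segment, reason step by step, then award marks.")
    return "\n".join(lines)

prompt = build_grading_prompt(
    "Why do comparison sorts need Omega(n log n) comparisons?",
    [("Mentions the decision-tree argument", 2),
     ("States the n! leaves lower bound", 1)],
    [("A comparison sort is a binary decision tree ...",
      "Point 1: 2/2 (quotes the decision-tree claim) ...")],
    "Each comparison splits the remaining orderings in two ...",
)
```

The returned string would then be sent to the LLM; the point-by-point structure is what makes the model's rationale inspectable and correctable at the scoring-point level rather than only at the whole-response level.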

Reactive Writers: How Co-Writing with AI Changes How We Engage with Ideas
Description

Emerging evidence shows that writing with AI assistance can change both the views people express and the opinions they hold. Yet, we lack a substantive understanding of behavioral and process-level changes in co-writing with AI that underlie the opinion-shaping power of these tools. We conducted a mixed-methods study, combining retrospective interviews with 19 participants about their co-writing experience with quantitative analysis tracing idea engagement in 1,291 AI co-writing sessions. Our analysis shows that engaging with the AI's suggestions---reading them and deciding whether to accept them---becomes a central activity, taking away from more traditional processes of ideation and language generation. As writers often do not complete their own ideation before engaging with suggestions, the suggested ideas and opinions seeded directions that writers then elaborated on. At the same time, writers did not notice the AI's influence and felt in control, as they---in principle---could always edit the final text. We term this shift Reactive Writing: an evaluation-first, suggestion-led writing practice that departs substantially from conventional composing in the presence of AI assistance and is highly vulnerable to AI-induced biases and opinion shifts.

When Help Hurts: Verification Load and Fatigue with AI Coding Assistants
Description

AI coding assistants help, but developers still spend effort verifying model output. We isolate interface effects by holding a single LLM fixed while N=60 participants solve three Python tasks with Inline, Chat, or Structured prompting, plus a no-AI control. AI reduced workload by 18.2 TLX points and time by 22% (25.0 vs. 32.1 min) and improved correctness (OR=1.71). Among the AI conditions, Inline is fastest and lowest-load on simple work; Chat yields higher correctness beyond a per-observation complexity threshold (z≈+0.41) without a time cost; Structured benefits novices at mid complexity. We introduce a mode-agnostic verification-load index (failures, time-to-first-compile, churn, pauses, switches) that partially mediates rising stress and fatigue across tasks. We translate these findings into design guidance: adaptive mode orchestration, transparency on demand, and verification-aware packaging; and we propose reporting verification load alongside outcomes to evaluate interfaces as models evolve.
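A composite over the five signals named above could be computed as in the following sketch. The abstract does not give the index's formula, so the equal-weight z-score aggregation here is an assumption for illustration, not the paper's actual definition.

```python
# Illustrative sketch of a mode-agnostic verification-load index
# built from the five raw signals the abstract names. The equal-weight
# z-score aggregation is an assumption; the paper's formula is not given.
from statistics import mean, stdev

SIGNALS = ["failures", "time_to_first_compile", "churn", "pauses", "switches"]

def verification_load(observations):
    """observations: list of dicts, one per task observation, each with
    the five raw signals. Returns one composite score per observation
    (mean of per-signal z-scores across the sample)."""
    zscored = {}
    for s in SIGNALS:
        vals = [o[s] for o in observations]
        mu = mean(vals)
        sd = stdev(vals) or 1.0  # guard against zero variance
        zscored[s] = [(v - mu) / sd for v in vals]
    return [mean(zscored[s][i] for s in SIGNALS)
            for i in range(len(observations))]
```

Standardizing each signal before averaging keeps the index comparable across interface modes that produce raw signals on very different scales (e.g. seconds to first compile vs. counts of context switches).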

Operationalizing Perceptions of Agent Gender: Foundations and Guidelines
Description

The “gender” of intelligent agents, virtual characters, social robots, and other agentic machines has emerged as a fundamental topic in studies of people's interactions with computers. Perceptions of agent gender can help explain user attitudes and behaviours—from preferences to toxicity to stereotyping—across a variety of systems and contexts of use. Yet, standards for capturing perceptions of agent gender do not exist. A scoping review was conducted to clarify how agent gender has been operationalized—labelled, defined, and measured—as a perceptual variable. One-third of studies manipulated but did not measure agent gender. Norms in operationalization remain obscure, limiting comprehension of results, congruity in measurement, and comparability for meta-analyses. The dominance of the gender binary model and latent anthropocentrism have placed arbitrary limits on knowledge generation and reified the status quo. We contribute a systematically developed and theory-driven meta-level framework that offers operational clarity and practical guidance for greater rigour and inclusivity.

Agentic Audio Moderators vs Humans in Think-Aloud Usability Testing
Description

Agentic AI holds promise for usability testing, yet its role as an audio moderator in think-aloud protocols is not well understood. This study explores: (1) how to design and develop an agentic audio moderator for think-aloud usability testing, and (2) how participants moderated by an agentic moderator differ from those moderated by a human in task performance, verbalization behaviors, user experience, and social perceptions of the moderator. Using a design-based research approach, we interviewed nine UX experts, iteratively developed an AI moderator, and evaluated it in a randomized controlled trial (N=60) with a note-taking application. Results show no significant differences between AI and human moderators in task performance or verbalization behaviors, though the AI moderator received lower social perception ratings. This work contributes the first design-oriented evaluation of AI moderators in usability testing, offering implications for developing more acceptable and effective agentic audio moderators.
