AI coding assistants help, but developers still spend substantial effort verifying model output. To isolate interface effects, we held a single LLM fixed while N=60 participants solved three Python tasks using Inline, Chat, or Structured prompting, plus a no-AI control. AI assistance reduced workload by 18.2 TLX points, cut completion time by 22% (25.0 vs. 32.1 min), and improved correctness (OR=1.71). Among the AI modes, Inline was fastest and lowest-load on simple tasks; Chat yielded higher correctness beyond a per-observation complexity threshold (z ≈ +0.41) with no time cost; Structured benefited novices at mid complexity. We introduce a mode-agnostic verification-load index (failures, time-to-first-compile, churn, pauses, and switches) that partially mediates rising stress and fatigue across tasks. We translate these findings into design guidance (adaptive mode orchestration, transparency on demand, and verification-aware packaging) and propose reporting verification load alongside outcome measures to evaluate interfaces as models evolve.
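As an illustration of how such a composite index could be operationalized, the following is a minimal sketch assuming each of the five components is z-scored across observations and averaged with equal weights; the component keys, units, and weighting here are illustrative assumptions, not the paper's exact specification:

```python
import statistics

COMPONENTS = ["failures", "time_to_first_compile", "churn", "pauses", "switches"]

def verification_load_index(observations):
    """Composite verification-load index (illustrative): z-score each
    component across observations, then average the five z-scores per
    observation. Higher values indicate heavier verification load."""
    # Per-component mean and standard deviation across all observations.
    stats = {}
    for c in COMPONENTS:
        vals = [obs[c] for obs in observations]
        stats[c] = (statistics.mean(vals), statistics.stdev(vals))
    # Equal-weight average of the z-scored components for each observation.
    index = []
    for obs in observations:
        zs = [(obs[c] - stats[c][0]) / stats[c][1] for c in COMPONENTS]
        index.append(sum(zs) / len(zs))
    return index

# Hypothetical per-task observations (counts, seconds, edited lines,
# pause count, mode switches) for demonstration only.
obs = [
    {"failures": 2, "time_to_first_compile": 90, "churn": 40, "pauses": 5, "switches": 3},
    {"failures": 0, "time_to_first_compile": 45, "churn": 15, "pauses": 2, "switches": 1},
    {"failures": 4, "time_to_first_compile": 150, "churn": 70, "pauses": 9, "switches": 6},
]
print(verification_load_index(obs))  # one load score per observation
```

Equal weighting keeps the index mode-agnostic, since no component presumes a particular interface; any validated weighting scheme from the study would replace the simple average.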
ACM CHI Conference on Human Factors in Computing Systems