Agentic AI holds promise for usability testing, yet its role as an audio moderator in think-aloud protocols is not well understood. This study explores (1) how to design and develop an agentic audio moderator for think-aloud usability testing, and (2) how participants moderated by an agentic moderator differ from those moderated by a human in task performance, verbalization behaviors, user experience, and social perceptions of the moderator. Using a design-based research approach, we interviewed nine UX experts, iteratively developed an AI moderator, and evaluated it in a randomized controlled trial (N=60) with a note-taking application. No significant differences were observed between the AI and human moderators in task performance or verbalization behaviors, though the AI moderator received lower social perception ratings. This work contributes the first design-oriented evaluation of AI moderators in usability testing, offering implications for developing more acceptable and effective agentic audio moderators.
ACM CHI Conference on Human Factors in Computing Systems