As multi-agent Large Language Models (LLMs) gain traction, designers must consider how to surface their internal reasoning in ways that foster appropriate trust. We present a design-led, qualitative, comparative structured observation study exploring how users interpret and evaluate transparency in multi-agent LLMs. Participants interacted with five interface variants, each instantiating different combinations of transparency-related design dimensions, across two task types: information-seeking and logical reasoning. We surface participants’ mental models, the cues they interpret as signals of transparency and trustworthiness, and how they weigh the costs and benefits of increasing process visibility. Transparency needs were dynamic and context-sensitive, with the ideal "Goldilocks" (i.e., "just right") transparency level shaped jointly by task demands, interface affordances, and user characteristics such as task expertise and dispositional AI trust. We highlight tensions between process visibility, information sufficiency, and cognitive effort, and synthesise these insights into design considerations for aligning transparency with user needs in future multi-agent LLM interfaces.
Despite growing recognition that responsible AI requires domain knowledge, current work on conversational AI primarily draws on clinical expertise that prioritises diagnosis and intervention. However, much everyday emotional support occurs in non-clinical contexts and therefore requires different conversational approaches. We examine how chaplains, who guide individuals through personal crises, grief, and reflection, perceive and engage with conversational AI. We recruited eighteen chaplains to build AI chatbots. While some chaplains viewed chatbots with cautious optimism, the majority pointed to limitations in chatbots’ ability to support everyday well-being. Our analysis reveals how chaplains perceive their pastoral care duties and the areas where AI chatbots fall short, along the themes of Listening, Connecting, Carrying, and Wanting. These themes resonate with the idea of attunement, recently highlighted as a relational lens for understanding the delicate experiences care technologies provide. This perspective informs chatbot design aimed at supporting well-being in non-clinical contexts.
AI-assisted usability analysis can potentially reduce the time and effort of finding usability problems, yet little is known about how AI's perceived expertise influences evaluators' analytic strategies and perceptions over time. We ran a within-subjects, five-session study (six hours per participant) with 12 professional UX evaluators who worked with two conversational assistants (CAs) designed to appear novice- or expert-like (differing in suggestion quantity and response accuracy). We logged behavioral measures (number of passes, suggestion acceptance rate), collected subjective ratings (trust, perceived efficiency), and conducted semi-structured interviews. Participants experienced an initial novelty effect and a subsequent dip in trust that recovered over time. Their efficiency improved as they shifted from a two-pass to a one-pass video inspection approach. Evaluators ultimately rated the expert-like CA as significantly more efficient, trustworthy, and comprehensive, despite not perceiving expertise differences early on. We conclude with design implications for adapting AI expertise to enable calibrated human-AI collaboration.
Virtual assistants (VAs) are increasingly positioned not just as tools, but as potential social companions capable of offering either emotional or informational support. Yet, how these forms of support should adapt to varying task difficulties and embodiment styles remains underexplored. We conducted two user studies with cognitive and physical tasks to investigate how support type (emotional vs. informational) shapes user perceptions across variations in task difficulty (easy vs. hard) and embodiment (non-embodied vs. embodied). In Study 1, emotional support positively influenced users' impressions of the VA in easy tasks, while informational support was more effective in difficult tasks. In Study 2, participants also preferred emotional support for easy tasks, but differences between support types were less pronounced for difficult tasks. Notably, embodiment exerted no significant influence in either study. These findings underscore the role of context in shaping effective support strategies, offering design insights for VAs as social companions.
ChatGPT’s memory feature is designed to provide users with greater control and more helpful responses. Yet, it remains unclear how users perceive this feature in relation to privacy. To address this gap, we conducted interviews with 20 ChatGPT users from diverse backgrounds. Our findings revealed four major characteristics that distinguish ChatGPT's memory from human memory: perceived unforgetfulness, detailedness, accuracy, and lack of emotions, highlighting the machine-like nature of AI memory. Moreover, both ChatGPT's memory and human memory were perceived as beneficial for relationship building. Notably, most participants experienced negative expectancy violations after learning what ChatGPT remembered about them. They expressed a strong need for greater visibility, accessibility, transparency, and user control in the design of future memory features. Drawing on users' suggestions and theoretical frameworks on privacy management, we provide design implications for developing a more transparent, responsible, and user-aligned memory experience that helps users navigate privacy-personalization trade-offs when interacting with LLM-based memories.
Large Language Models (LLMs) are expected to enhance medical education through personalized clinical skills training. However, their practical application remains underexplored from the perspective of students' user experience. This gap is critical because without understanding students' needs, LLM-based tools risk poor adoption and suboptimal learning outcomes. This study explores medical students' challenges and expectations when using LLM-based clinical skills training through a two-phase investigation involving 14 medical students. We integrated five Type 2 Diabetes cases into a probe platform and conducted probe-based studies followed by co-design workshops. We identified challenges across three categories: dialogue content (lack of realism, insufficient knowledge depth differentiation); dialogue presentation (information overload, single modality limitations); and dialogue interaction (inadequate guidance and feedback). Co-design workshops revealed expectations for enhanced patient modeling, personalized content delivery, structured presentation frameworks, and collaborative features. These findings provide design considerations for developing more effective, user-centered LLM-based medical education systems.
While conversational agents’ (CAs) semantic and syntactic capabilities have advanced, their pragmatic skills, that is, using language appropriately in context, have emerged as a critical focus in practical applications. Hence, scholars have integrated conversational skills derived from human-human interaction into CA designs. However, existing research mainly adopts an empirical approach and focuses on specific CA deployments, making it challenging to identify overarching patterns or develop a comprehensive methodology for transferring human pragmatic skills to CA design. Thus, we conducted a systematic review of 85 studies from primary databases (e.g., ACM, IEEE), focusing on designing CAs with human-derived conversational skills. We identified skill categories (verbal, paralinguistic, nonverbal), transfer strategies (from dialog data, from theories, and via co-design), implementations, and evaluation metrics. We consolidated these insights into a four-stage design process: human skill exploration, definition, transfer, and iterative evaluation. Future research can leverage this process to design CAs that achieve conversational goals through contextually appropriate language use.