Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

Abstract

Large Language Models (LLMs) are typically evaluated with general or domain-specific benchmarks that test capabilities with little grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.

Award
Honorable Mention
Authors
Hamna Hamna
Microsoft Corporation, Bangalore, Karnataka, India
Gayatri Bhat
Karya, Bengaluru, India
Sourabrata Mukherjee
Microsoft Research, Bengaluru, Karnataka, India
Faisal M. Lalani
Collective Intelligence Project, New York, New York, United States
Evan Hadfield
Collective Intelligence Project, New York, New York, United States
Divya Siddarth
Collective Intelligence Project, New York, New York, United States
Kalika Bali
Microsoft Research Lab India, Bangalore, India
Sunayana Sitaram
Microsoft Research India, Bangalore, Karnataka, India
Conference: CHI 2026

ACM CHI Conference on Human Factors in Computing Systems

Session: AI Explanations and Decision Support in Healthcare

Auditorium
7 presentations
2026-04-13, 20:15–21:45