Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings

Abstract

Large Language Models (LLMs) are typically evaluated with general or domain-specific benchmarks that test capabilities with little grounding in the lived realities of end users. Critical domains such as healthcare require evaluations that extend beyond artificial or simulated tasks to reflect the everyday needs, cultural practices, and nuanced contexts of communities. We propose Samiksha, a community-driven evaluation pipeline co-created with civil-society organizations (CSOs) and community members. Our approach enables scalable, automated benchmarking through a culturally aware pipeline in which community feedback informs what to evaluate, how the benchmark is built, and how outputs are scored. We demonstrate this approach in the health domain in India. Our analysis highlights how current multilingual LLMs address nuanced community health queries, while also offering a scalable pathway for contextually grounded and inclusive LLM evaluation.

Award
Honorable Mention
Authors
Hamna Hamna
Microsoft Corporation, Bangalore, Karnataka, India
Gayatri Bhat
Karya, Bengaluru, India
Sourabrata Mukherjee
Microsoft Research, Bengaluru, Karnataka, India
Faisal M. Lalani
Collective Intelligence Project, New York, New York, United States
Evan Hadfield
Collective Intelligence Project, New York, New York, United States
Divya Siddarth
Collective Intelligence Project, New York, New York, United States
Kalika Bali
Microsoft Research Lab India, Bangalore, India
Sunayana Sitaram
Microsoft Research India, Bangalore, Karnataka, India
Conference: CHI 2026

ACM CHI Conference on Human Factors in Computing Systems

Session: AI Explanations and Decision Support in Healthcare

Auditorium
7 presentations
2026-04-13, 20:15–21:45