Kaleidoscope: Semantically-grounded, Context-specific ML Model Evaluation

Abstract

Desired model behavior often differs across contexts (e.g., different geographies, communities, or institutions), but there is little infrastructure to facilitate context-specific evaluations key to deployment decisions and building trust. Here, we present Kaleidoscope, a system for evaluating models in terms of user-driven, domain-relevant concepts. Kaleidoscope’s iterative workflow enables generalizing from a few examples into a larger, diverse set representing an important concept. These example sets can be used to test model outputs or shifts in model behavior in semantically-meaningful ways. For instance, we might construct a “xenophobic comments” set and test that its examples are more likely to be flagged by a content moderation model than a “civil discussion” set. To evaluate Kaleidoscope, we compare it against template- and DSL-based grouping methods, and conduct a usability study with 13 Reddit users testing a content moderation model. We find that Kaleidoscope facilitates iterative, exploratory hypothesis testing across diverse, conceptually-meaningful example sets.
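To make the kind of concept-set test described above concrete, here is a minimal sketch. It is not Kaleidoscope's implementation: it assumes a toy, hypothetical flag_probability stand-in for a content moderation model and two hand-written example sets, and simply checks the directional expectation that the "xenophobic comments" set is flagged more often than the "civil discussion" set.

    # Hypothetical sketch of a concept-set comparison test (not Kaleidoscope's code).
    from statistics import mean

    def flag_probability(comment: str) -> float:
        """Toy stand-in for a content moderation model under test.
        In practice this would call the real model and return the
        probability that the comment should be flagged."""
        hostile_phrases = ("go back", "don't belong", "your kind")
        return 1.0 if any(p in comment.lower() for p in hostile_phrases) else 0.0

    # Hand-written example sets standing in for concept sets built in Kaleidoscope.
    xenophobic_comments = [
        "Go back to your own country.",
        "People like you don't belong here.",
    ]
    civil_discussion = [
        "I see your point, but I read the policy differently.",
        "Thanks for sharing the source; that changes my view a bit.",
    ]

    def mean_flag_rate(examples: list[str]) -> float:
        """Average flag probability over an example set."""
        return mean(flag_probability(c) for c in examples)

    def test_xenophobic_flagged_more_than_civil():
        # Directional hypothesis: the xenophobic set should be flagged
        # at a higher rate than the civil-discussion set.
        assert mean_flag_rate(xenophobic_comments) > mean_flag_rate(civil_discussion)

In Kaleidoscope itself, the example sets would not be hand-written as here; per the abstract, they are grown iteratively from a few seed examples into larger, more diverse sets representing the concept.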

Authors
Harini Suresh
MIT, Cambridge, Massachusetts, United States
Divya Shanmugam
MIT, Cambridge, Massachusetts, United States
Tiffany Chen
MIT, Cambridge, Massachusetts, United States
Annie G. Bryan
MIT, Cambridge, Massachusetts, United States
Alexander D'Amour
Google Research, Cambridge, Massachusetts, United States
John Guttag
MIT, Cambridge, Massachusetts, United States
Arvind Satyanarayan
MIT, Cambridge, Massachusetts, United States
Paper URL

https://doi.org/10.1145/3544548.3581482

Conference: CHI 2023

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2023.acm.org/)

Session: User Behavior Simulation and Modeling

Hall G2
6 presentations
2023-04-27, 18:00–19:30