Kaleidoscope: Semantically-grounded, Context-specific ML Model Evaluation

Desired model behavior often differs across contexts (e.g., different geographies, communities, or institutions), but there is little infrastructure to facilitate context-specific evaluations key to deployment decisions and building trust. Here, we present Kaleidoscope, a system for evaluating models in terms of user-driven, domain-relevant concepts. Kaleidoscope’s iterative workflow enables generalizing from a few examples into a larger, diverse set representing an important concept. These example sets can be used to test model outputs or shifts in model behavior in semantically-meaningful ways. For instance, we might construct a “xenophobic comments” set and test that its examples are more likely to be flagged by a content moderation model than a “civil discussion” set. To evaluate Kaleidoscope, we compare it against template- and DSL-based grouping methods, and conduct a usability study with 13 Reddit users testing a content moderation model. We find that Kaleidoscope facilitates iterative, exploratory hypothesis testing across diverse, conceptually-meaningful example sets.
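The set-versus-set test described in the abstract can be illustrated with a minimal sketch. This is not Kaleidoscope's implementation or API: `toy_flag_score`, `flag_rate`, `more_likely_flagged`, the phrase list, and the example sentences are all hypothetical stand-ins. The sketch only shows the shape of the check, namely that one concept set is flagged at a higher rate than another.

```python
from typing import Callable, Sequence

# Hypothetical stand-in for a content moderation model (an assumption, not
# the model studied in the paper). It returns a score in [0, 1].
FLAGGED_PHRASES = ("invaders", "go back", "those people")


def toy_flag_score(text: str) -> float:
    """Score a comment by counting matches against a small phrase list."""
    lowered = text.lower()
    hits = sum(1 for phrase in FLAGGED_PHRASES if phrase in lowered)
    return min(1.0, hits / 2)


def flag_rate(
    examples: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> float:
    """Fraction of examples whose score meets or exceeds the threshold."""
    if not examples:
        raise ValueError("example set must not be empty")
    flagged = sum(1 for text in examples if score(text) >= threshold)
    return flagged / len(examples)


def more_likely_flagged(
    set_a: Sequence[str],
    set_b: Sequence[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> bool:
    """True if set_a is flagged at a strictly higher rate than set_b."""
    return flag_rate(set_a, score, threshold) > flag_rate(set_b, score, threshold)


# Made-up example sets, for illustration only.
xenophobic_set = [
    "Those people are invaders.",
    "They should go back where they came from.",
    "Invaders, all of them, go back.",
]
civil_set = [
    "I disagree, but thanks for the sources.",
    "Could you explain your reasoning?",
]

if __name__ == "__main__":
    print(more_likely_flagged(xenophobic_set, civil_set, toy_flag_score))
```

In a real evaluation, the scoring function would wrap the model under test and the example sets would be much larger. A statistical comparison would likely replace the simple rate comparison used here.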

MIT, Cambridge, Massachusetts, United States

Google Research, Cambridge, Massachusetts, United States

MIT, Cambridge, Massachusetts, United States

https://doi.org/10.1145/3544548.3581482

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2023.acm.org/)

Hall G2

6 presentations

Start: 2023-04-27 18:00:00

End: 2023-04-27 19:30:00
