Politics of Datasets

https://doi.org/10.1145/3613904.3642014

The challenges of data collection in nonprofits for performance and funding reports are well-established in HCI research. Few studies, however, delve into improving the data collection process. Our study proposes ideas to improve data collection by exploring challenges that social workers experience when labeling their case notes. Through collaboration with an organization that provides intensive case management to those experiencing homelessness in the U.S., we conducted interviews with caseworkers and held design sessions where caseworkers, managers, and program analysts examined storyboarded ideas to improve data labeling. Our findings suggest several design ideas on how data labeling practices can be improved: Aligning labeling with caseworker goals, enabling shared control on data label design for a comprehensive portrayal of caseworker contributions, improving the synthesis of qualitative and quantitative data, and making labeling user-friendly. We contribute design implications for data labeling to better support multiple stakeholder goals in social service contexts.

University of Texas at Austin, Austin, Texas, United States

The University of Texas at Austin, Austin, Texas, United States

University of Texas at Austin, Austin, Texas, United States

The University of Texas at Austin, Austin, Texas, United States

University of Texas at Austin, Austin, Texas, United States

The University of Texas at Austin, Austin, Texas, United States

University of Texas at Austin, Austin, Texas, United States

https://doi.org/10.1145/3613904.3642669

While colonization has sociohistorically impacted people's identities across various dimensions, those colonial values and biases continue to be perpetuated by sociotechnical systems. One category of sociotechnical systems--sentiment analysis tools--can also perpetuate colonial values and bias, yet less attention has been paid to how such tools may be complicit in perpetuating coloniality, although they are often used to guide various practices (e.g., content moderation). In this paper, we explore potential bias in sentiment analysis tools in the context of Bengali communities who have experienced and continue to experience the impacts of colonialism. Drawing on identity categories most impacted by colonialism amongst local Bengali communities, we focused our analytic attention on gender, religion, and nationality. We conducted an algorithmic audit of all sentiment analysis tools for Bengali, available on the Python package index (PyPI) and GitHub. Despite similar semantic content and structure, our analyses showed that in addition to inconsistencies in output from different tools, Bengali sentiment analysis tools exhibit bias between different identity categories and respond differently to different ways of identity expression. Connecting our findings with colonially shaped sociocultural structures of Bengali communities, we discuss the implications of downstream bias of sentiment analysis tools.

University of Colorado Boulder, Boulder, Colorado, United States

University of Toronto, Toronto, Ontario, Canada

University of Colorado Boulder, Boulder, Colorado, United States

https://doi.org/10.1145/3613904.3642830

Data analysts have long sought to turn unstructured text data into meaningful concepts. Though common, topic modeling and clustering focus on lower-level keywords and require significant interpretative work. We introduce concept induction, a computational process that instead produces high-level concepts, defined by explicit inclusion criteria, from unstructured text. For a dataset of toxic online comments, where a state-of-the-art BERTopic model outputs “women, power, female,” concept induction produces high-level concepts such as “Criticism of traditional gender roles” and “Dismissal of women's concerns.” We present LLooM, a concept induction algorithm that leverages large language models to iteratively synthesize sampled text and propose human-interpretable concepts of increasing generality. We then instantiate LLooM in a mixed-initiative text analysis tool, enabling analysts to shift their attention from interpreting topics to engaging in theory-driven analysis. Through technical evaluations and four analysis scenarios ranging from literature review to content moderation, we find that LLooM’s concepts improve upon the prior art of topic models in terms of quality and data coverage. In expert case studies, LLooM helped researchers to uncover new insights even from familiar datasets, for example by suggesting a previously unnoticed concept of attacks on out-party stances in a political social media dataset.

Stanford University, Stanford, California, United States

University of Washington, Seattle, Washington, United States

Stanford University, Stanford, California, United States

https://doi.org/10.1145/3613904.3642452

Activists, governments, and academics regularly advocate for more open data. But how is data made open, and for whom is it made useful and usable? In this paper, we investigate and describe the work of making eviction data open to tenant organizers. We do this through an ethnographic description of ongoing work with a local housing activist organization. This work combines observation, direct participation in data work, and creating media artifacts, specifically digital maps. Our interpretation is grounded in D’Ignazio and Klein’s Data Feminism, emphasizing standpoint theory. Through our analysis and discussion, we highlight how shifting positionalities from data intermediaries to data accomplices affects the design of data sets and maps. We provide HCI scholars with three design implications when situating data for grassroots organizers: becoming a domain beginner, striving for data actionability, and evaluating our design artifacts by the social relations they sustain rather than just their technical efficacy.

Georgia Institute of Technology, Atlanta, Georgia, United States

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

Georgia Institute of Technology, Atlanta, Georgia, United States