Sensemaking is the iterative process of identifying, extracting, and explaining insights from data, where each iteration is referred to as the "sensemaking loop." However, little is known about how sensemaking behavior shifts between exploration and explanation during this process. This gap limits our understanding of the full scope of sensemaking, which in turn inhibits the design of tools that support the process. We contribute the first mixed-methods study characterizing how sensemaking evolves within computational notebooks. We study 2,574 Jupyter notebooks mined from GitHub: we identify data science notebooks that have undergone significant iteration, present a regression model that automatically characterizes sensemaking activity, and use this model to calculate and analyze shifts in activity across GitHub versions. Our results show that notebook authors engage in various sensemaking tasks over time, such as annotation, branching analysis, and documentation. We use these insights to recommend extensions to current notebook environments.
https://doi.org/10.1145/3544548.3580997
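The version-shift analysis described in this abstract can be sketched in Python; the features, labels, and column names below are hypothetical stand-ins for illustration, not the authors' pipeline.

import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy per-version features mined from two notebook histories: the fraction
# of markdown cells and the number of distinct analysis branches.
history = pd.DataFrame({
    "notebook": ["a", "a", "a", "b", "b", "b"],
    "version":  [1, 2, 3, 1, 2, 3],
    "md_ratio": [0.10, 0.25, 0.40, 0.05, 0.05, 0.30],
    "branches": [1, 2, 2, 1, 3, 3],
    # Hand-labeled sensemaking scores (0 = exploration, 1 = explanation);
    # a real study would obtain these from qualitative coding.
    "label":    [0.1, 0.4, 0.7, 0.0, 0.2, 0.6],
})

# Fit a regression that characterizes sensemaking activity from the features.
model = LinearRegression().fit(history[["md_ratio", "branches"]], history["label"])
history["score"] = model.predict(history[["md_ratio", "branches"]])

# Shift in sensemaking activity between consecutive GitHub versions.
history["shift"] = history.groupby("notebook")["score"].diff()
print(history[["notebook", "version", "score", "shift"]])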
Beyond self-report data, we lack reliable and non-intrusive methods for identifying flow. However, taking a step back and acknowledging that flow occurs during periods of focus gives us an opportunity to make progress toward measuring flow by isolating focused work. Here, we take a mixed-methods approach to design a logs-based metric that leverages machine learning and a comprehensive collection of log data to identify periods of related actions (indicating focus), and we validate this metric against self-reported time in focus or flow using diary data and quarterly survey data. Our results indicate that we can determine when software engineers at a large technology company experience focused work, including instances of flow. This metric was developed for engineering work but can be leveraged in other domains to non-disruptively measure when people experience focus. Future research can build on this work to identify signals associated with other facets of flow.
https://doi.org/10.1145/3544548.3581562
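As a rough illustration of such a logs-based metric, the sketch below sessionizes timestamped action logs by time gaps and flags long sessions of related actions as focused; the 10-minute gap, 15-minute threshold, and relatedness rule are invented for illustration, not the validated metric from the paper.

import pandas as pd

# Toy action logs for one engineer.
logs = pd.DataFrame({
    "ts": pd.to_datetime([
        "2023-01-02 09:00", "2023-01-02 09:08", "2023-01-02 09:16",
        "2023-01-02 10:30", "2023-01-02 10:31",
    ]),
    "action": ["edit", "edit", "test", "email", "email"],
})

# Start a new session whenever the gap between actions exceeds 10 minutes.
logs["session"] = (logs["ts"].diff() > pd.Timedelta(minutes=10)).cumsum()

def is_focused(session):
    # A session counts as focused work if it lasts at least 15 minutes and
    # its actions are related (here, crudely: all development actions).
    long_enough = session["ts"].max() - session["ts"].min() >= pd.Timedelta(minutes=15)
    related = session["action"].isin(["edit", "test"]).all()
    return long_enough and related

print(logs.groupby("session").apply(is_focused))  # session 0: True, session 1: False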
When conducting qualitative research, it is necessary to decide how many researchers should be involved in coding the data: Is one enough, or are more coders beneficial? To offer empirical evidence for this question, we designed a series of studies investigating qualitative coding. We replicated and extended a usable security and privacy study by Ion et al. to gather both simple survey data and complex interview data. We had a total of 65 students and seven researchers analyze different parts of this data. We analyzed the codebook creation process, the similarity of outcomes, and inter-rater reliability, and we compared the students' outcomes to the researchers'. We also surveyed SOUPS PC members from the past five years about their views on coding. The reviewers' views on coding practices for complex and simple data are almost identical. However, our results suggest that the coding process can differ for the two types of data, with complex data benefiting more from interaction between coders.
https://doi.org/10.1145/3544548.3580766
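The inter-rater reliability analysis mentioned above can be illustrated with Cohen's kappa, a standard chance-corrected agreement measure for two coders; the codes below are fabricated examples in the spirit of security-advice data, not the study's data.

from sklearn.metrics import cohen_kappa_score

# Codes assigned by two coders to the same ten interview excerpts.
coder_a = ["update", "password", "update", "antivirus", "update",
           "password", "2fa", "update", "antivirus", "2fa"]
coder_b = ["update", "password", "antivirus", "antivirus", "update",
           "2fa", "2fa", "update", "antivirus", "2fa"]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")  # agreement corrected for chance

For more than two coders, Krippendorff's alpha is a common alternative.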
This paper examines the practices involved in mobilizing social media data from their site of production to the institutional context of non-profit organizations. We report on nine months of fieldwork with a transnational and intergovernmental organization using social media data to understand the role of grassroots initiatives in Mexico, in the unique context of the COVID-19 pandemic. We show how different stakeholders negotiate the definition of problems to be addressed with social media data, the collective creation of ground truth, and the limitations involved in the process of extracting value from data. The meanings of social media data are not defined in advance; instead, they are contingent on the practices and needs of the organization that seeks to extract insights from the analysis. We conclude with a list of reflections and questions for researchers who mediate in the mobilization of social media data into non-profit organizations to inform humanitarian action.
https://doi.org/10.1145/3544548.3580916
Geospatial data is playing an increasingly critical role in the work of Earth and climate scientists, social scientists, and data journalists exploring spatiotemporal change in our environment and societies. However, existing software and programming tools for geospatial analysis and visualization are challenging to learn and difficult to use. The aim of this work is to identify the unmet computing needs of the diverse and expanding community of geospatial data users. We conducted a contextual inquiry study (n = 25) with domain experts using geospatial data in their current work. Through a thematic analysis, we found that participants struggled to (1) find and transform geospatial data to satisfy spatiotemporal constraints, (2) understand the behavior of geospatial operators, (3) track geospatial data provenance, and (4) explore the cartographic design space. These findings suggest design opportunities for developers and designers of geospatial analysis and visualization systems.
https://doi.org/10.1145/3544548.3581370
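Finding (2) can be made concrete with a small geopandas example of a classic operator pitfall: buffer() works in the units of the active coordinate reference system, so buffering lon/lat data operates on degrees rather than meters. The coordinates and CRS choices below are illustrative.

import geopandas as gpd
from shapely.geometry import Point

pt = gpd.GeoSeries([Point(-122.33, 47.61)], crs="EPSG:4326")  # lon/lat

naive = pt.buffer(500)  # 500 *degrees*: a radius far larger than the globe
metric = pt.to_crs("EPSG:32610").buffer(500)  # UTM zone 10N: 500 meters

print(naive.total_bounds)                       # enormous, meaningless extent
print(metric.to_crs("EPSG:4326").total_bounds)  # roughly 1 km across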
The work involved in gathering, wrangling, cleaning, and otherwise preparing data for analysis is often the most time-consuming and tedious aspect of data work. Although many studies describe data preparation within the context of data science workflows, there has been little research on data preparation in data journalism. We address this gap with a hybrid form of thematic analysis that combines deductive codes derived from existing accounts of data science workflows and inductive codes arising from an interview study with 36 professional data journalists. We extend a previous model of data science work to incorporate detailed activities of data preparation. We synthesize 60 dirty data issues from 16 taxonomies on dirty data and our interview data, and we provide a novel taxonomy to characterize these dirty data issues as discrepancies between mental models. We also identify four challenges faced by journalists: diachronic, regional, fragmented, and disparate data sources.
https://doi.org/10.1145/3544548.3581271
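A regional dirty-data discrepancy of the kind cataloged above can be sketched in pandas: two sources that encode the same dates with different regional conventions each parse cleanly on their own but silently disagree once combined. The data and formats are fabricated for illustration.

import pandas as pd

us = pd.DataFrame({"date": ["03/04/2023", "03/05/2023"], "cases": [10, 12]})
eu = pd.DataFrame({"date": ["04/03/2023", "05/03/2023"], "cases": [7, 9]})

# Naively concatenating mixes MM/DD and DD/MM conventions without warning.
merged = pd.concat([us, eu], ignore_index=True)

# Parsing each source with its own convention surfaces the discrepancy.
us["date"] = pd.to_datetime(us["date"], format="%m/%d/%Y")
eu["date"] = pd.to_datetime(eu["date"], format="%d/%m/%Y")
clean = pd.concat([us, eu], ignore_index=True)
print(clean)  # both sources now agree: 2023-03-04 and 2023-03-05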