Mobile sensor data collection in people’s daily lives is essential for understanding fine-grained human behaviors. However, in-the-wild data collection often results in missing data due to participant- and system-related issues. While existing monitoring systems in the mobile sensing field provide an opportunity to detect missing data, they fall short in monitoring data across many participants and sensors and in diagnosing the root causes of missing data while accounting for the heterogeneous sensing characteristics of mobile sensor data. To address these limitations, we undertook a multi-year iterative design process to develop a system for monitoring missing data in mobile sensor data collection. Our final prototype, DataSentry, enables detecting, diagnosing, and addressing missing data issues across many participants and sensors, considering both within- and between-person variability. Based on the iterative design process, we share our experiences, lessons learned, and design implications for developing advanced missing data management systems.
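The abstract stays at the level of design goals, so as a rough illustration of the detection half of the problem, the sketch below flags gaps in a single participant's sensor stream by comparing consecutive timestamps against an expected sampling interval. All names and thresholds (EXPECTED_INTERVAL, detect_missing_windows, the 1.5x tolerance) are hypothetical and are not part of DataSentry.

```python
import pandas as pd

# Hypothetical sketch, not DataSentry's implementation: flag windows of missing
# data in one participant's sensor stream given an expected sampling interval.
EXPECTED_INTERVAL = pd.Timedelta(seconds=60)  # assume one sample per minute

def detect_missing_windows(timestamps: pd.Series, tolerance: float = 1.5) -> pd.DataFrame:
    """Return gaps where consecutive samples are farther apart than expected."""
    ts = timestamps.sort_values().reset_index(drop=True)
    gaps = ts.diff()
    missing = gaps > EXPECTED_INTERVAL * tolerance
    return pd.DataFrame({
        "gap_start": ts.shift(1)[missing],
        "gap_end": ts[missing],
        "gap_length": gaps[missing],
    })
```

In practice such a check would run per participant and per sensor, since the abstract emphasizes heterogeneous sensing characteristics and within- and between-person variability.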
Sketching is one of the oldest techniques humans use to express themselves.
We sketch to visualize concepts, externalize memory, and communicate ideas.
However, we barely use sketching to interact with computers.
Given how naturally sketching comes to humans, we believe untapped potential exists in being able to simply draw commands onto a user interface.
In this paper, we present results of an elicitation study about expressing common operations in spreadsheets through sketching.
Spreadsheets are an interesting class of applications because they are widely used, support complex data and operations, and are available on touch-enabled devices.
Our results show that despite considerable variation in syntactic details, participants gravitate towards recurring patterns (e.g., enclosures and arrows, examples and cross-references, and temporal sequences of strokes).
The sketch patterns we identified can be a first step towards developing interpreters of sketched commands, and thus enable new means of interacting with spreadsheets and other applications.
Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.
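The abstract does not detail how Amplio identifies "empty data spaces"; as a hedged illustration of the general idea, the sketch below embeds existing prompts and ranks clusters by size, so that the smallest clusters point to under-covered regions worth augmenting. The model name, library choices, and function name are assumptions for illustration, not Amplio's implementation.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

def sparse_regions(prompts: list[str], n_clusters: int = 20):
    """Embed prompts, cluster them, and return the smallest clusters as
    candidate regions for augmentation (illustrative only)."""
    embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(prompts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto").fit_predict(embeddings)
    counts = np.bincount(labels, minlength=n_clusters)
    sparsest = np.argsort(counts)[:5]  # cluster ids with the fewest prompts
    return sparsest, counts
```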
Data analysts frequently employ code completion tools when writing custom scripts to tackle complex tabular data wrangling tasks. However, existing tools do not sufficiently link data contexts, such as schemas and values, with the code being edited. This not only leads to poor code suggestions but also causes frequent interruptions in the coding process, as users need additional code to locate and understand relevant data. We introduce Xavier, a tool designed to enhance data wrangling script authoring in computational notebooks. Xavier maintains users' awareness of data contexts while providing data-aware code suggestions. It automatically highlights the most relevant data based on the user's code, integrates both code and data contexts for more accurate suggestions, and instantly previews data transformation results for easy verification. To evaluate the effectiveness and usability of Xavier, we conducted a user study with 16 data analysts, showing its potential to streamline data wrangling script authoring.
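One way to read "integrates both code and data contexts" is to attach the active table's schema and a few sample rows to the code being completed before querying a completion model. The sketch below shows that idea only; the function name and prompt format are hypothetical and not Xavier's actual pipeline.

```python
import pandas as pd

def build_data_aware_prompt(df: pd.DataFrame, partial_code: str, n_rows: int = 3) -> str:
    """Combine schema, sample values, and the partially written script into a
    single completion prompt (illustrative; not Xavier's API)."""
    schema = ", ".join(f"{col}: {dtype}" for col, dtype in df.dtypes.astype(str).items())
    sample = df.head(n_rows).to_string(index=False)
    return (
        f"Table schema: {schema}\n"
        f"Sample rows:\n{sample}\n"
        f"Complete the following data wrangling code:\n{partial_code}"
    )
```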
Analyzing data subgroups is a common data science task to build intuition about a dataset and identify areas to improve model performance. However, subgroup analysis is prohibitively difficult in datasets with many features, and existing tools limit unexpected discoveries by relying on user-defined or static subgroups. We propose exploratory subgroup analysis as a set of tasks in which practitioners discover, evaluate, and curate interesting subgroups to build understanding of datasets and models. To support these tasks, we introduce Divisi, an interactive notebook-based tool underpinned by a fast approximate subgroup discovery algorithm. Divisi's interface allows data scientists to interactively re-rank and refine subgroups and to visualize their overlap and coverage in the novel Subgroup Map. Through a think-aloud study with 13 practitioners, we find that Divisi can help uncover surprising patterns in data features and their interactions, and that it encourages more thorough exploration of subtypes in complex data.
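Divisi's discovery algorithm is not described in the abstract; the toy sketch below only conveys the flavor of subgroup analysis by scoring single-feature slices by how much their error rate exceeds the overall rate. It ignores the feature combinations, approximation, and interactive re-ranking the paper addresses, and all names are hypothetical.

```python
import pandas as pd

def rank_subgroups(df: pd.DataFrame, error_col: str, min_size: int = 30) -> pd.DataFrame:
    """Score single-feature slices by error-rate lift over the overall rate
    (a toy stand-in, not Divisi's algorithm)."""
    overall = df[error_col].mean()
    rows = []
    for col in df.columns.drop(error_col):
        for value, group in df.groupby(col):
            if len(group) >= min_size:
                rate = group[error_col].mean()
                rows.append({"slice": f"{col} == {value!r}",
                             "size": len(group),
                             "error_rate": rate,
                             "lift": rate - overall})
    columns = ["slice", "size", "error_rate", "lift"]
    return pd.DataFrame(rows, columns=columns).sort_values("lift", ascending=False)
```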
TableCanoniser is a declarative grammar and interactive system for constructing relational tables from messy tabular inputs such as spreadsheets. We propose the concept of axis alignment to categorise input types and characterise the expanded scope of our system relative to existing tools. The declarative grammar consists of match conditions, which specify repeating patterns of input cells, and extract operations, which specify how matched values map to the output table. In the interactive interface, users can specify match and extract patterns by interacting with an input table, or author more advanced specifications in the coding panel. To refine and verify specifications, users interact with grammar-based provenance visualisations such as linked highlighting of input and output values, tree-based visualisation of matching patterns, and a mini-map overview of matched instances of patterns with annotations showing where cells are extracted to. We motivate and illustrate our work with real-world usage scenarios and workflows.
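The grammar itself is not reproduced in the abstract, so the sketch below is only a minimal analogue of the match/extract idea: a condition over adjacent cells plus a mapping from matched cells to output columns. The syntax is invented for illustration and differs from TableCanoniser's actual grammar.

```python
# Invented toy spec in the spirit of match conditions + extract operations;
# not TableCanoniser's syntax.
spec = {
    "match": lambda cell, right: isinstance(cell, str) and isinstance(right, (int, float)),
    "extract": {"category": 0, "amount": 1},  # output column -> offset in the matched pair
}

def apply_spec(grid, spec):
    """Scan a 2D grid of cells and emit one output row per horizontally adjacent match."""
    out = []
    for row in grid:
        for c in range(len(row) - 1):
            pair = (row[c], row[c + 1])
            if spec["match"](*pair):
                out.append({col: pair[i] for col, i in spec["extract"].items()})
    return out

# apply_spec([["Rent", 1200], ["Food", 300]], spec)
# -> [{'category': 'Rent', 'amount': 1200}, {'category': 'Food', 'amount': 300}]
```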
Data is one of the foundational aspects of making Artificial Intelligence (AI) work as intended. As large language models (LLMs) become the epicenter of AI, it is crucial to better understand how the datasets underpinning such models are created. The emergent nature of LLMs makes it critical to understand the challenges faced by practitioners developing Gen AI technologies in order to design alternatives that better respond to Gen AI's ethical issues. In this paper, we provide such an understanding by reporting on 25 interviews with practitioners who handle data in three distinct development stages of different LLMs. Our contributions are (1) empirical evidence of how uncertainty, data practices, and reliance mechanisms change across LLMs' development cycle; (2) an account of how the unique qualities of LLMs impact data practices and their implications for the future of Gen AI technologies; and (3) three opportunities for HCI researchers interested in supporting practitioners developing Gen AI technologies.