Playing with Data

https://dl.acm.org/doi/10.1145/3706598.3715269

Sketching is one of the oldest techniques humans use to express themselves. We sketch to visualize concepts, externalize memory, and communicate ideas. However, we barely use sketching to interact with computers. Given how naturally sketching comes to humans, we believe untapped potential exists in being able to simply draw commands onto a user interface. In this paper, we present results of an elicitation study about expressing common operations in spreadsheets through sketching. Spreadsheets are an interesting class of applications because they are widely used, support complex data and operations, and are available on touch-enabled devices. Our results show that despite considerable variation in syntactic details, participants gravitate towards recurring patterns (e.g., enclosures and arrows, examples and cross-references, and temporal sequences of strokes). The sketch patterns we identified can be a first step towards developing interpreters of sketched commands, and thus enable new means of interacting with spreadsheets and other applications.

University of Duisburg-Essen, Essen, Germany

University of Iceland, Reykjavík, Iceland

University of Iceland, Reykjavík, Iceland

10.1145/3706598.3715269

https://dl.acm.org/doi/10.1145/3706598.3713491

Data augmentation is crucial to make machine learning models more robust and safe. However, augmenting data can be challenging as it requires generating diverse data points to rigorously evaluate model behavior on edge cases and mitigate potential harms. Creating high-quality augmentations that cover these "unknown unknowns" is a time- and creativity-intensive task. In this work, we introduce Amplio, an interactive tool to help practitioners navigate "unknown unknowns" in unstructured text datasets and improve data diversity by systematically identifying empty data spaces to explore. Amplio includes three human-in-the-loop data augmentation techniques: Augment with Concepts, Augment by Interpolation, and Augment with Large Language Model. In a user study with 18 professional red teamers, we demonstrate the utility of our augmentation methods in helping generate high-quality, diverse, and relevant model safety prompts. We find that Amplio enabled red teamers to augment data quickly and creatively, highlighting the transformative potential of interactive augmentation workflows.

Harvard University, Boston, Massachusetts, United States

Apple, Seattle, Washington, United States

Apple, Cambridge, Massachusetts, United States

Apple, Pittsburgh, Pennsylvania, United States

Apple, Seattle, Washington, United States

10.1145/3706598.3713491

https://dl.acm.org/doi/10.1145/3706598.3714239

Data analysts frequently employ code completion tools in writing custom scripts to tackle complex tabular data wrangling tasks. However, existing tools do not sufficiently link data contexts, such as schemas and values, with the code being edited. This leads not only to poor code suggestions but also to frequent interruptions in the coding process, as users need additional code to locate and understand relevant data. We introduce Xavier, a tool designed to enhance data wrangling script authoring in computational notebooks. Xavier maintains users' awareness of data contexts while providing data-aware code suggestions. It automatically highlights the most relevant data based on the user's code, integrates both code and data contexts for more accurate suggestions, and instantly previews data transformation results for easy verification. To evaluate the effectiveness and usability of Xavier, we conducted a user study with 16 data analysts, showing its potential to streamline data wrangling script authoring.

Zhejiang University, Hangzhou, Zhejiang, China

Microsoft Research Asia, Beijing, China

The Hong Kong University of Science and Technology, Hong Kong, China

Zhejiang University, Ningbo, Zhejiang, China

Zhejiang University, Hangzhou, Zhejiang, China

10.1145/3706598.3714239

https://dl.acm.org/doi/10.1145/3706598.3713103

Analyzing data subgroups is a common data science task to build intuition about a dataset and identify areas to improve model performance. However, subgroup analysis is prohibitively difficult in datasets with many features, and existing tools limit unexpected discoveries by relying on user-defined or static subgroups. We propose exploratory subgroup analysis as a set of tasks in which practitioners discover, evaluate, and curate interesting subgroups to build understanding about datasets and models. To support these tasks we introduce Divisi, an interactive notebook-based tool underpinned by a fast approximate subgroup discovery algorithm. Divisi's interface allows data scientists to interactively re-rank and refine subgroups and to visualize their overlap and coverage in the novel Subgroup Map. Through a think-aloud study with 13 practitioners, we find that Divisi can help uncover surprising patterns in data features and their interactions, and that it encourages more thorough exploration of subtypes in complex data.

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

University of Michigan, Ann Arbor, Michigan, United States

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

10.1145/3706598.3713103

https://dl.acm.org/doi/10.1145/3706598.3714321

TableCanoniser is a declarative grammar and interactive system for constructing relational tables from messy tabular inputs such as spreadsheets. We propose the concept of axis alignment to categorise input types and characterise the expanded scope of our system relative to existing tools. The declarative grammar consists of match conditions, which specify repeating patterns of input cells, and extract operations, which specify how matched values map to the output table. In the interactive interface, users can specify match and extract patterns by interacting with an input table, or author more advanced specifications in the coding panel. To refine and verify specifications, users interact with grammar-based provenance visualisations such as linked highlighting of input and output values, tree-based visualisation of matching patterns, and a mini-map overview of matched instances of patterns with annotations showing where cells are extracted to. We motivate and illustrate our work with real-world usage scenarios and workflows.

Zhejiang University, Hangzhou, Zhejiang, China

Monash University, Melbourne, Victoria, Australia

Monash University, Melbourne, Victoria, Australia

Zhejiang University, Hangzhou, Zhejiang, China

10.1145/3706598.3714321

https://dl.acm.org/doi/10.1145/3706598.3714069

Data is one of the foundational aspects of making Artificial Intelligence (AI) work as intended. As large language models (LLMs) become the epicenter of AI, it is crucial to better understand how the datasets that maintain such models are created. The emergent nature of LLMs makes it critical to understand the challenges faced by practitioners developing Gen AI technologies, in order to design alternatives that better respond to Gen AI's ethical issues. In this paper, we provide such understanding by reporting on 25 interviews with practitioners who handle data in three distinct development stages of different LLMs. Our contributions are (1) empirical evidence of how uncertainty, data practices, and reliance mechanisms change across LLMs' development cycle; (2) how the unique qualities of LLMs impact data practices and their implications for the future of Gen AI technologies; and (3) three opportunities for HCI researchers interested in supporting practitioners developing Gen AI technologies.

IBM Research, Yorktown Heights, New York, United States

IBM Research, Sao Paulo, Brazil

University of Notre Dame, South Bend, Indiana, United States

Escuela Superior Politécnica del Litoral, Guayaquil, Ecuador

10.1145/3706598.3714069