In data-driven systems, integrating disparate data sources becomes challenging when incoming data does not conform to the system's specifications. Despite advances in automated schema matching, data integration tasks that involve complex semantic interrelationships still require users to manually identify and define transformations between datasets, which is cognitively demanding and time-consuming. We present DataSpeck, an end-to-end system that automates the conversion of disparate data sources to fit any pre-existing data specification. DataSpeck employs an AI-driven human-in-the-loop design: it uses LLMs to analyze semantic relationships and autonomously generate step-by-step transformation pipelines, and requests user attention only to resolve semantic ambiguities. In our technical evaluation, DataSpeck successfully automated ~86% of varied data transformations while generating interpretable strategies with confidence scores and targeted clarification requests. In a user study (N=12), participants using DataSpeck completed data conversion tasks ~53% faster and with significantly reduced cognitive load compared to Microsoft Excel with Copilot.
ACM CHI Conference on Human Factors in Computing Systems