Modern mobile applications rely on hidden interactions, gestures such as long presses and swipes that carry no visual cues, to provide functionality without cluttering the interface. While experienced users may discover these interactions through prior use or onboarding tutorials, their implicit nature makes them difficult for most users to uncover. Similarly, mobile agents, systems powered by vision language models (VLMs) that automate tasks on mobile user interfaces, struggle to detect hidden interactions or determine the actions needed to complete tasks. To address this challenge, we present GhostUI, a new dataset designed to enable the detection of hidden interactions in mobile applications. GhostUI provides before-and-after screenshots, simplified view hierarchies, gesture metadata, and task descriptions, allowing VLMs to better recognize concealed gestures and anticipate post-interaction states. Quantitative evaluations show that VLMs fine-tuned on GhostUI outperform baseline VLMs, particularly in predicting hidden interactions and inferring post-interaction screens, underscoring GhostUI's potential as a foundation for advancing mobile task automation.
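To make the dataset's contents concrete, the following is a minimal sketch of how one GhostUI-style example might be represented and packed into a fine-tuning prompt for a VLM. The field names, gesture encoding, and chat-message layout are illustrative assumptions, not the released schema.

```python
# Illustrative record for one hidden-interaction example; fields mirror the data the
# abstract describes (before/after screenshots, simplified view hierarchy, gesture
# metadata, task description). Names and structure are assumptions, not GhostUI's schema.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class HiddenInteractionExample:
    before_screenshot: str   # path to the pre-interaction screenshot
    after_screenshot: str    # path to the post-interaction screenshot
    view_hierarchy: Dict     # simplified view hierarchy (element ids, bounds, labels)
    gesture: Dict            # e.g. {"type": "long_press", "x": 540, "y": 1210}
    task_description: str    # natural-language description of the task

def to_vlm_messages(ex: HiddenInteractionExample) -> List[Dict]:
    """Pack one example into a chat-style message list for VLM fine-tuning."""
    return [
        {"role": "user", "content": [
            {"type": "image", "path": ex.before_screenshot},
            {"type": "text", "text": (
                f"Task: {ex.task_description}\n"
                f"View hierarchy: {ex.view_hierarchy}\n"
                "Which hidden gesture completes the task, and what screen follows?")},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": f"Gesture: {ex.gesture}"},
            {"type": "image", "path": ex.after_screenshot},
        ]},
    ]
```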
Augmentation allows rapid reconfiguration of passive physical interfaces to improve accessibility, support independent living through domestic automation, and more. However, its potential remains largely unrealized for novice users due to several key barriers. First, users rarely identify latent interaction problems within their built environments. Second, they often lack the knowledge to clearly express design intent. Third, many innovative solutions remain confined to research prototypes, limiting access. We introduce EUREXA, an agentic AI system named for the spirit of discovery (“Eureka!”). EUREXA supports end-users through a diagnose–discover–describe workflow: from input of varying ambiguity and complexity, it surfaces latent interaction challenges, presents reconfiguration opportunities through augmentations, and produces interpretable designs. Its key novelty is a dual search across public augmentation repositories and research articles, enabling reusable designs even when no design libraries or parametric tools exist. EUREXA transforms non-parametric models into parametric ones or directly generates fully explainable designs. To evaluate EUREXA across varied user inputs, complexities, and clarity levels, we define ambiguity metrics, conduct a user study, and report critical factors for advancing generative AI to help end-users readily augment physical interfaces through fabrication.
Web AI agents such as ChatGPT Agent and GenSpark are increasingly used for routine web-based tasks, yet they still rely on text-based input prompts, lack proactive detection of user intent, and offer no support for interactive data analysis and decision making. We present WebSeek, a mixed-initiative browser extension that enables users to discover and extract information from webpages and then flexibly build, transform, and refine tangible data artifacts, such as tables, lists, and visualizations, all within an interactive canvas. Within this environment, users can perform analysis, including data transformations such as joining tables or creating visualizations, while an in-built AI both proactively offers context-aware guidance and automation, and reactively responds to explicit user requests. An exploratory user study (N=15) with WebSeek as a probe reveals participants' diverse analysis strategies, underscoring their desire for transparency and control during human-AI collaboration.
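One of the canvas transformations mentioned above is joining extracted tables. The sketch below shows that operation in isolation, assuming tables are represented as lists of dictionaries; the column names are made up for the example, and WebSeek itself performs this inside the browser rather than through a script like this.

```python
# Inner-join of two tables extracted from webpages, represented (as an assumption)
# as lists of row dictionaries sharing a key column.
from typing import Dict, List

def join_tables(left: List[Dict], right: List[Dict], key: str) -> List[Dict]:
    """Return rows of `left` merged with the matching row of `right` on `key`."""
    index = {row[key]: row for row in right if key in row}
    return [{**row, **index[row[key]]} for row in left if row.get(key) in index]

prices  = [{"product": "A", "price": 9.99}, {"product": "B", "price": 4.50}]
reviews = [{"product": "A", "rating": 4.6}]
print(join_tables(prices, reviews, key="product"))
# [{'product': 'A', 'price': 9.99, 'rating': 4.6}]
```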
Information seeking on mobile devices is often fragmented, trapping users in repetitive cycles of context switching and data re-entry, which increases cognitive load and disrupts workflow. Existing mobile agents provide limited cross-source integration and are largely opaque, presenting progress as a linear feed with few opportunities to intervene, steer, or take control. We present DroidRetriever, a transparent, steerable system for cross-source mobile information seeking. It accepts voice or typed input; a multi-LLM pipeline then decomposes the task, navigates to target pages, captures screenshots, and synthesizes a concise report with citation-linked screenshots. We make the process transparent through a progress dashboard that combines sub-task progress with real-time exploration maps for seamless takeover. DroidRetriever also pauses on detected privacy or high-risk screens and prompts the user to intervene. Across 35 tasks spanning 24 apps, experiments and user studies demonstrate improved coverage and transparency and reduced workload. We release our code at https://github.com/AkimotoAyako/DroidRetriever.
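The control flow described above (decompose the task, navigate and capture evidence, pause on risky screens, then synthesize a cited report) can be sketched as below. Every helper passed in is a placeholder for an LLM call, device-control routine, or classifier; this is an assumed outline, not the released DroidRetriever code.

```python
# Skeleton of a DroidRetriever-style information-seeking loop. All callables are
# placeholders supplied by the caller; nothing here is the system's actual implementation.
from typing import Callable, Dict, List

def retrieve(task: str,
             decompose: Callable[[str], List[str]],      # LLM: task -> ordered sub-tasks
             navigate: Callable[[str], List[str]],       # agent: sub-task -> screenshot paths
             is_high_risk: Callable[[str], bool],        # detector for privacy/high-risk screens
             ask_user: Callable[[str], bool],            # pause and ask whether to proceed
             synthesize: Callable[[str, Dict[str, List[str]]], str],  # LLM: evidence -> report
             ) -> str:
    evidence: Dict[str, List[str]] = {}
    for sub_task in decompose(task):                     # 1. decompose the information need
        kept: List[str] = []
        for shot in navigate(sub_task):                  # 2. navigate app pages, capture screenshots
            if is_high_risk(shot) and not ask_user(shot):
                continue                                 # 3. pause on risky screens; user may skip
            kept.append(shot)
        evidence[sub_task] = kept                        # screenshots double as citations
    return synthesize(task, evidence)                    # 4. write a concise, citation-linked report
```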
Large language models promise a broad set of functions, but when not given a specific objective, they default to generic results. We demonstrate that inferring the user's in-the-moment objective, then rapidly optimizing for that singular objective, enables LLMs to produce specialized tools, interfaces, and responses. Our work introduces just-in-time objectives, which model a user's goals to specialize LLM systems on the fly. We contribute an architecture for automatically inducing such objectives by passively observing user behavior, then steering downstream AI systems through generation and evaluation against this objective. Inducing just-in-time objectives (e.g., “Clarify the abstract’s research contribution”) enables automatic generation of tools such as those that critique a draft based on relevant HCI methodologies, anticipate related researchers' reactions, or surface ambiguous terminology. In a series of experiments on participants' own tasks, JIT objectives enable LLM outputs that achieve 66–86% win rates over typical LLMs. In-person use sessions confirm that JIT objectives produce specialized tools that are unique to each participant and are rated as significantly higher quality than a standard LLM chat tool.
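The induce-then-steer pattern behind just-in-time objectives can be sketched as two steps: infer an objective from passively observed actions, then generate candidates and keep the one an evaluator scores highest against that objective. The `llm` callable, the prompts, and the 1-to-10 scoring scheme below are illustrative assumptions, not the paper's implementation.

```python
# Sketch of just-in-time objectives: induce an objective from observed behavior, then
# steer outputs via generate-and-evaluate against it. `llm` is any text-in/text-out call.
from typing import Callable, List

def induce_objective(llm: Callable[[str], str], observed_actions: List[str]) -> str:
    """Infer the user's in-the-moment objective from passively observed behavior."""
    log = "\n".join(observed_actions)
    return llm("Given these recent user actions, state the user's current objective "
               f"in one imperative sentence:\n{log}")

def steer(llm: Callable[[str], str], objective: str, request: str, n: int = 4) -> str:
    """Generate several candidate responses and keep the one that best serves the objective."""
    candidates = [llm(f"Objective: {objective}\nRequest: {request}\nRespond accordingly.")
                  for _ in range(n)]

    def score(candidate: str) -> float:
        reply = llm(f"On a scale of 1 to 10, how well does this response serve the "
                    f"objective '{objective}'? Reply with a number only.\n\n{candidate}")
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0

    return max(candidates, key=score)
```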
LLM-assisted technologies are increasingly used to support cognitive processing and information interpretation, yet their role in aiding memory recall—and how people choose to engage with them—remains underexplored. We studied participants who watched a short robbery video (approximating a one-time eyewitness scenario) and composed recall statements using either a default GPT or a guided GPT prompted with a standardized eyewitness protocol. Results show that default-condition participants who believed they had a clearer understanding of the event were more likely to trust GPT’s output, whereas guided-condition participants showed stronger alignment between subjective clarity and actual recall. Additionally, participants evaluated the legitimacy of the individuals in the incident differently across conditions. Interaction analysis further revealed that default-GPT users spontaneously developed diverse strategies, including building on existing recollections, requesting potentially missing details, and treating GPT as a recall coach. This work shows how GPT–user interplay subconsciously affects beliefs and perceptions of remembered events.
Learning to use feature-rich software is a persistent challenge, but generative AI tools promise to lower this barrier by replacing complex navigation with natural language prompts. We investigated how people approach prompt-based tools for 3D modeling in an observational study with 26 participants (14 casual users, 12 professionals). Consistent with earlier work, participants skipped tutorials and manuals, relying on trial and error. What differed in the generative AI context was where and how they sought support: the prompt box became the entry point for learning, collapsing onboarding into immediate action, while some casual users turned to external LLMs for prompts. Professionals used their 3D expertise to refine iterations and critically evaluated outputs, often discarding models that did not meet their standards, whereas casual users settled for “good enough.” We contribute empirical insights into how generative AI reshapes help-seeking, highlighting new practices of onboarding, recursive AI-for-AI support, and shifting expertise in interpreting outputs.