Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework allows the agent to mimic human-like interactions such as tapping and swiping through a simplified action space, eliminating the need for system back-end access and enhancing its versatility across various apps. Central to the agent's functionality is an innovative in-context learning method, where it either autonomously explores or learns from human demonstrations, creating a knowledge base used to execute complex tasks across diverse applications. We conducted extensive testing with our agent on over 50 tasks spanning 10 applications, ranging from social media to sophisticated image editing tools. Additionally, a user study confirmed the agent's superior performance and practicality in handling a diverse array of high-level tasks, demonstrating its effectiveness in real-world settings. Our project page is available at https://appagent-official.github.io/.
https://dl.acm.org/doi/10.1145/3706598.3713600
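To make the "simplified action space" idea in the abstract above more concrete, the following minimal Python sketch shows one way an agent loop over tap/swipe actions and a retrieved knowledge base could be structured; the action names, the knowledge-base format, and the stubbed LLM call are illustrative assumptions, not the paper's actual interface.

# Illustrative sketch of a simplified, UI-level action space for an app-operating
# agent, in the spirit of the AppAgent abstract above. Action names, the format of
# the knowledge base, and the LLM call are assumptions for illustration.
from dataclasses import dataclass


@dataclass
class Tap:
    element_id: int          # index of the UI element to tap


@dataclass
class Swipe:
    element_id: int
    direction: str           # "up", "down", "left", or "right"


def retrieve_notes(knowledge_base: dict, app: str) -> str:
    """Look up exploration/demonstration notes for the current app."""
    return knowledge_base.get(app, "")


def decide_action(task: str, screen_description: str, notes: str):
    """Placeholder for the LLM call that picks the next UI action.

    A real agent would send the task, a textual/visual description of the
    current screen, and the retrieved notes to a multimodal LLM and parse
    its reply into one of the actions above.
    """
    return Tap(element_id=0)  # stubbed decision


if __name__ == "__main__":
    kb = {"notes_app": "Element 0 opens the compose view."}
    action = decide_action(
        task="Write a note saying 'hello'",
        screen_description="Home screen of notes_app with a compose button (element 0)",
        notes=retrieve_notes(kb, "notes_app"),
    )
    print(action)  # Tap(element_id=0)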
Current AI sound awareness systems can provide deaf and hard of hearing (DHH) people with information about sounds, including discrete sound sources and transcriptions. However, synthesizing AI outputs based on DHH people’s ever-changing intents in complex auditory environments remains a challenge. In this paper, we describe the co-design process of SoundWeaver, a sound awareness system prototype that dynamically weaves AI outputs from different AI models based on users’ intents and presents synthesized information through a heads-up display. Adopting a Research through Design perspective, we created SoundWeaver with one DHH co-designer, adapting it to his personal contexts and goals (e.g., cooking at home and chatting in a game store). Through this process, we present design implications for the future of “intent-driven” AI systems for sound accessibility.
https://dl.acm.org/doi/10.1145/3706598.3714268
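As a rough illustration of the "weaving" idea in the abstract above, the following Python sketch combines stubbed outputs from a sound-recognition model and a speech-to-text model according to a declared user intent; the intent labels, confidence thresholds, and prioritization rules are assumptions for illustration, not SoundWeaver's implementation.

# Illustrative sketch of intent-driven synthesis of outputs from multiple sound-AI
# models for a heads-up display. The intent labels, stubbed model outputs, and
# prioritization rules are assumptions, not the system's actual logic.

def detected_sound_events() -> list[dict]:
    """Stand-in for a sound-recognition model's output."""
    return [
        {"label": "timer beep", "confidence": 0.92},
        {"label": "sizzling", "confidence": 0.71},
        {"label": "speech", "confidence": 0.88},
    ]


def speech_transcript() -> str:
    """Stand-in for a speech-to-text model's output."""
    return "Dinner is almost ready."


def weave(intent: str) -> str:
    """Select and order AI outputs for the display based on the user's intent."""
    events = detected_sound_events()
    if intent == "cooking":
        # Prioritize task-critical, non-speech sounds while cooking.
        alerts = [e["label"] for e in events
                  if e["label"] != "speech" and e["confidence"] > 0.8]
        return " | ".join(alerts) if alerts else "(no alerts)"
    if intent == "conversation":
        # Prioritize the transcript, but keep very high-confidence alerts visible.
        alerts = [e["label"] for e in events
                  if e["label"] != "speech" and e["confidence"] > 0.9]
        suffix = f"  [{', '.join(alerts)}]" if alerts else ""
        return speech_transcript() + suffix
    return "; ".join(e["label"] for e in events)


if __name__ == "__main__":
    print(weave("cooking"))        # timer beep
    print(weave("conversation"))   # Dinner is almost ready.  [timer beep]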
Perfume and fragrance have captivated people for centuries across different cultures. Inspired by the ephemeral nature of sprayable olfactory interactions and experiences, we explore the potential of applying a similar interaction principle to the auditory modality. In this paper, we present SoundMist, a sonic interaction method that enables users to generate ephemeral auditory presences by physically dispersing a liquid into the air, much like the fading phenomenon of fragrance. We conducted a study to understand the experiential factors inherent in sprayable sound interaction and held an ideation workshop to identify potential design spaces or opportunities that this interaction could shape. Our findings, derived from thematic analysis, suggest that physically sprayable sound interaction can induce experiences related to four key factors—materiality of sound produced by dispersed liquid particles, different sounds entangled with each liquid, illusive perception of temporally floating sound, and enjoyment derived from blending different sounds—and can be applied to artistic practices, safety indications, multisensory approaches, and emotional interfaces.
https://dl.acm.org/doi/10.1145/3706598.3713786
Non-speech sounds play an important role in setting the mood of a video and aiding comprehension. However, current non-speech sound captioning practices focus primarily on sound categories, an approach that fails to provide a rich sound experience for d/Deaf and hard-of-hearing (DHH) viewers. Onomatopoeia, which succinctly captures expressive sound information, offers a potential solution but remains underutilized in non-speech sound captioning. This paper investigates how onomatopoeia benefits DHH audiences in non-speech sound captioning. We collected 7,962 sound-onomatopoeia pairs from listeners and developed a sound-onomatopoeia model that automatically transcribes sounds into onomatopoeic descriptions indistinguishable from human-generated ones. A user evaluation with 25 DHH participants using the model-generated onomatopoeia demonstrated that onomatopoeia significantly improved their video viewing experience. Participants most favored captions that combined onomatopoeia with sound category, and expressed a desire to see such captions across genres. We discuss the benefits and challenges of using onomatopoeia in non-speech sound captions, offering insights for future practices.
https://dl.acm.org/doi/10.1145/3706598.3713911
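A small Python sketch of the caption styles compared in the abstract above (category only, onomatopoeia only, and the combination participants favored most); the example labels and bracketed formatting are assumptions for illustration, not the study's actual rendering.

# Illustrative sketch of composing non-speech sound captions in different styles.
# The example category/onomatopoeia strings and the formatting are assumptions.

def compose_caption(category: str, onomatopoeia: str, style: str) -> str:
    """Render a non-speech sound caption in the requested style."""
    if style == "category":
        return f"[{category}]"
    if style == "onomatopoeia":
        return f"[{onomatopoeia}]"
    if style == "both":
        # The style participants favored most: onomatopoeia plus category.
        return f"[{onomatopoeia} ({category})]"
    raise ValueError(f"unknown style: {style!r}")


if __name__ == "__main__":
    # The onomatopoeia string would come from a sound-onomatopoeia model.
    for style in ("category", "onomatopoeia", "both"):
        print(compose_caption("door creaking", "creeeak", style))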
Speech-to-text capabilities on mobile devices have proven helpful for hearing and speech accessibility, language translation, note-taking, and meeting transcripts. However, our foundational large-scale survey (n=263) shows that the inability to distinguish and indicate speaker direction makes them challenging to use in group conversations. SpeechCompass addresses this limitation through real-time, multi-microphone speech localization, where the direction of speech allows visual separation and guidance (e.g., arrows) in the user interface. We introduce efficient real-time audio localization algorithms and custom sound perception hardware, running on a low-power microcontroller with four integrated microphones, which we characterize in technical evaluations. Informed by a large-scale survey (n=494), we conducted an in-person study of group conversations with eight frequent users of mobile speech-to-text, who provided feedback on five visualization styles. Participants consistently valued diarization and localization visualization, and all agreed on the potential of directional guidance for group conversations.
https://dl.acm.org/doi/10.1145/3706598.3713631
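The abstract above centers on multi-microphone speech localization. The following Python sketch shows one classic building block of such systems: estimating direction of arrival from the time difference of arrival between two microphones via cross-correlation. SpeechCompass uses four microphones and its own efficient algorithms, so the two-microphone geometry, sample rate, and spacing here are assumptions for illustration only.

# Illustrative two-microphone direction-of-arrival estimate via cross-correlation.
# Sample rate, mic spacing, and the two-mic setup are assumptions for illustration.
import numpy as np

SAMPLE_RATE = 16_000      # Hz (assumed)
MIC_SPACING = 0.08        # meters between the two microphones (assumed)
SPEED_OF_SOUND = 343.0    # m/s


def estimate_angle(mic_a: np.ndarray, mic_b: np.ndarray) -> float:
    """Return the estimated arrival angle in degrees (0 = broadside)."""
    corr = np.correlate(mic_a, mic_b, mode="full")
    # Lag (in samples) by which mic_a lags mic_b; positive means mic_b hears it first.
    lag = np.argmax(corr) - (len(mic_b) - 1)
    tdoa = lag / SAMPLE_RATE                      # seconds
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(SPEED_OF_SOUND * tdoa / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))


if __name__ == "__main__":
    # Synthetic test: the same burst of noise reaches mic_b first and mic_a
    # three samples later, so the estimated angle is positive under this convention.
    rng = np.random.default_rng(0)
    signal = rng.standard_normal(1024)
    delay = 3
    mic_b = np.concatenate([signal, np.zeros(delay)])
    mic_a = np.concatenate([np.zeros(delay), signal])
    print(f"estimated angle: {estimate_angle(mic_a, mic_b):.1f} degrees")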
We introduce SPECTRA, a novel pipeline for personalizable sound recognition designed to understand DHH users' needs when collecting audio data, creating a training dataset, and reasoning about the quality of a model. To evaluate the prototype, we recruited 12 DHH participants who trained personalized models for their homes. We investigated waveforms, spectrograms, interactive clustering, and data annotating to support DHH users throughout this workflow, and we explored the impact of a hands-on training session on their experience and attitudes toward sound recognition tools. Our findings reveal the potential for clustering visualizations and waveforms to enrich users' understanding of audio data and refinement of training datasets, along with data annotations to promote varied data collection. We provide insights into DHH users' experiences and perspectives on personalizing a sound recognition pipeline. Finally, we share design considerations for future interactive systems to support this population.
https://dl.acm.org/doi/10.1145/3706598.3713294
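To illustrate the interactive-clustering idea in the abstract above, here is a minimal Python sketch that groups a user's recorded clips by a simple spectral feature so that similar sounds can be reviewed and labeled together; the feature choice and the use of k-means are assumptions for illustration, not SPECTRA's actual pipeline.

# Illustrative clustering of audio clips to support reviewing and labeling a
# personal sound-recognition dataset. Feature and clustering choices are assumed.
import numpy as np
from sklearn.cluster import KMeans


def clip_features(clip: np.ndarray, frame: int = 512) -> np.ndarray:
    """Average log-magnitude spectrum over fixed-size frames of a mono clip."""
    n_frames = len(clip) // frame
    frames = clip[: n_frames * frame].reshape(n_frames, frame)
    spectra = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spectra).mean(axis=0)


def cluster_clips(clips: list[np.ndarray], n_clusters: int = 2) -> np.ndarray:
    """Group clips into clusters the user can inspect and annotate."""
    features = np.stack([clip_features(c) for c in clips])
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(features)


if __name__ == "__main__":
    # Synthetic stand-ins for home recordings: low-frequency hums vs. high-pitched beeps.
    rng = np.random.default_rng(0)
    t = np.arange(16_000) / 16_000
    hums = [np.sin(2 * np.pi * 60 * t) + 0.1 * rng.standard_normal(t.size) for _ in range(3)]
    beeps = [np.sin(2 * np.pi * 2_000 * t) + 0.1 * rng.standard_normal(t.size) for _ in range(3)]
    print(cluster_clips(hums + beeps))   # e.g., [0 0 0 1 1 1] (cluster ids are arbitrary)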
Circadian fatigue, largely caused by sleep deprivation, significantly diminishes alertness and situational awareness. This issue becomes critical in environments where auditory awareness, such as responding to verbal instructions or localizing alarms, is essential for performance and safety. While head-mounted displays have demonstrated potential in enhancing situational awareness through visual cues, their effectiveness in supporting sound localization under the influence of circadian fatigue remains under-explored. This study addresses this knowledge gap through a longitudinal study (N=19) conducted over 2–4 months, tracking participants’ fatigue levels through daily assessments. Participants were called in to perform non-line-of-sight sound source identification and localization tasks in a virtual environment under high- and low-fatigue conditions, both with and without head-up display (HUD) assistance. The results show task-dependent effects of circadian fatigue. Unexpectedly, reaction times were shorter across all tasks under high-fatigue conditions. Yet, in sound localization, where precision is key, the HUD offered the greatest performance enhancement by reducing pointing error. The results suggest that the auditory channel is a robust means of enhancing situational awareness and provide support for incorporating spatial audio cues and HUDs as standard features in augmented reality platforms for fatigue-prone scenarios.
https://dl.acm.org/doi/10.1145/3706598.3713402