AppAgent: Multimodal Agents as Smartphone Users
Description

Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework allows the agent to mimic human-like interactions such as tapping and swiping through a simplified action space, eliminating the need for system back-end access and enhancing its versatility across various apps. Central to the agent's functionality is an innovative in-context learning method, where it either autonomously explores or learns from human demonstrations, creating a knowledge base used to execute complex tasks across diverse applications. We conducted extensive testing with our agent on over 50 tasks spanning 10 applications, ranging from social media to sophisticated image editing tools. Additionally, a user study confirmed the agent's superior performance and practicality in handling a diverse array of high-level tasks, demonstrating its effectiveness in real-world settings. Our project page is available at https://appagent-official.github.io/.
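
As an illustration of the simplified action space the abstract describes, here is a minimal Python sketch (not the authors' implementation): the agent acts only through human-like gestures parsed from a constrained LLM reply, so no system back-end access is needed. The reply format, the numeric element labels, and the `query_multimodal_llm` stub are assumptions for this example.

```python
# Minimal sketch of a simplified smartphone action space (not the authors'
# code). The reply format and the query_multimodal_llm stub are hypothetical.
from dataclasses import dataclass
from typing import Union

@dataclass
class Tap:
    element_id: int            # numeric label drawn over a UI element

@dataclass
class Swipe:
    element_id: int
    direction: str             # "up" | "down" | "left" | "right"

@dataclass
class TypeText:
    text: str

@dataclass
class Back:
    pass

Action = Union[Tap, Swipe, TypeText, Back]

def parse_action(reply: str) -> Action:
    """Parse a constrained reply such as 'tap(3)' or 'text("hello")'."""
    reply = reply.strip()
    if reply.startswith("tap("):
        return Tap(int(reply[4:-1]))
    if reply.startswith("swipe("):
        elem, direction = reply[6:-1].split(",")
        return Swipe(int(elem), direction.strip().strip('"'))
    if reply.startswith("text("):
        return TypeText(reply[5:-1].strip('"'))
    return Back()

def query_multimodal_llm(prompt: str, screenshot_png: bytes) -> str:
    # Placeholder for a real vision-language model call.
    return "back()"

def agent_step(task: str, screenshot_png: bytes, app_notes: str) -> Action:
    """One decision step: task + screenshot + learned notes -> one gesture."""
    prompt = (
        f"Task: {task}\n"
        f"Notes learned about this app:\n{app_notes}\n"
        'Reply with one action: tap(id), swipe(id, dir), text("..."), or back()'
    )
    return parse_action(query_multimodal_llm(prompt, screenshot_png))
```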

Weaving Sound Information to Support Real-Time Sensemaking of Auditory Environments: Co-Designing with a DHH User
Description

Current AI sound awareness systems can provide deaf and hard of hearing (DHH) people with information about sounds, including discrete sound sources and transcriptions. However, synthesizing AI outputs based on DHH people's ever-changing intents in complex auditory environments remains a challenge. In this paper, we describe the co-design process of SoundWeaver, a sound awareness system prototype that dynamically weaves AI outputs from different AI models based on users' intents and presents the synthesized information through a heads-up display. Adopting a Research through Design perspective, we created SoundWeaver with one DHH co-designer, adapting it to his personal contexts and goals (e.g., cooking at home and chatting in a game store). Through this process, we present design implications for the future of "intent-driven" AI systems for sound accessibility.
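
As a rough illustration of intent-driven weaving (not the SoundWeaver prototype), the sketch below routes outputs from several sound-AI models into a short, prioritized caption list based on the user's current intent. The intent names, output fields, and priority rules are all assumptions for the example.

```python
# Illustrative sketch of "weaving" outputs from several sound-AI models
# according to the user's current intent (not the SoundWeaver prototype).
from dataclasses import dataclass

@dataclass
class AIOutputs:
    sound_events: list[str]      # e.g. ["water boiling", "timer beep"]
    transcription: str           # speech-to-text of nearby conversation
    loudness_alerts: list[str]   # e.g. ["loud crash behind you"]

# Which kinds of information each intent foregrounds on the heads-up display.
INTENT_PRIORITIES = {
    "cooking":  ["loudness_alerts", "sound_events", "transcription"],
    "chatting": ["transcription", "loudness_alerts", "sound_events"],
}

def weave(intent: str, outputs: AIOutputs, max_items: int = 3) -> list[str]:
    """Return a short, prioritized list of captions for the display."""
    woven: list[str] = []
    for field in INTENT_PRIORITIES.get(intent, ["loudness_alerts"]):
        value = getattr(outputs, field)
        items = value if isinstance(value, list) else [value]
        woven.extend(item for item in items if item)
    return woven[:max_items]

# Example: while cooking, alerts and appliance sounds come before speech.
print(weave("cooking", AIOutputs(["timer beep"], "dinner's ready?", [])))
```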

Sprayable Sound: Exploring the Experiential and Design Potential of Physically Spraying Sound Interaction
Description

Perfume and fragrance have captivated people for centuries across different cultures. Inspired by the ephemeral nature of sprayable olfactory interactions and experiences, we explore the potential of applying a similar interaction principle to the auditory modality. In this paper, we present SoundMist, a sonic interaction method that enables users to generate ephemeral auditory presences by physically dispersing a liquid into the air, much like the fading phenomenon of fragrance. We conducted a study to understand the experiential factors inherent in sprayable sound interaction and held an ideation workshop to identify potential design spaces or opportunities that this interaction could shape. Our findings, derived from thematic analysis, suggest that physically sprayable sound interaction can induce experiences related to four key factors—materiality of sound produced by dispersed liquid particles, different sounds entangled with each liquid, illusive perception of temporally floating sound, and enjoyment derived from blending different sounds—and can be applied to artistic practices, safety indications, multisensory approaches, and emotional interfaces.

OnomaCap: Making Non-speech Sound Captions Accessible and Enjoyable through Onomatopoeic Sound Representation
Description

Non-speech sounds play an important role in setting the mood of a video and aiding comprehension. However, current non-speech sound captioning practices focus primarily on sound categories, failing to provide a rich sound experience for d/Deaf and hard-of-hearing (DHH) viewers. Onomatopoeia, which succinctly captures expressive sound information, offers a potential solution but remains underutilized in non-speech sound captioning. This paper investigates how onomatopoeia benefits DHH audiences in non-speech sound captioning. We collected 7,962 sound-onomatopoeia pairs from listeners and developed a sound-onomatopoeia model that automatically transcribes sounds into onomatopoeic descriptions indistinguishable from human-generated ones. A user evaluation with 25 DHH participants using the model-generated onomatopoeia demonstrated that onomatopoeia significantly improved their video viewing experience. Participants most favored captions combining onomatopoeia and sound category, and expressed a desire to see such captions across genres. We discuss the benefits and challenges of using onomatopoeia in non-speech sound captions, offering insights for future practices.

SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization
Description

Speech-to-text capabilities on mobile devices have proven helpful for hearing and speech accessibility, language translation, note-taking, and meeting transcripts. However, our foundational large-scale survey (n=263) shows that the inability to distinguish and indicate speaker direction makes them challenging to use in group conversations. SpeechCompass addresses this limitation through real-time, multi-microphone speech localization, where the direction of speech enables visual separation and guidance (e.g., arrows) in the user interface. We introduce efficient real-time audio localization algorithms and custom sound perception hardware, running on a low-power microcontroller with four integrated microphones, which we characterize in technical evaluations. Informed by a large-scale survey (n=494), we conducted an in-person study of group conversations with eight frequent users of mobile speech-to-text, who provided feedback on five visualization styles. Participants consistently valued diarization and localization visualization, and all agreed on the potential of directional guidance for group conversations.
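
For context, one common way to estimate speech direction from a small microphone array is GCC-PHAT time-difference-of-arrival estimation; the sketch below illustrates that idea and is not the SpeechCompass firmware. The sample rate, microphone spacing, and far-field model are assumptions for the example.

```python
# Illustrative direction-of-arrival sketch using GCC-PHAT (not the
# SpeechCompass firmware). Constants below are assumed example values.
import numpy as np

SAMPLE_RATE = 48_000          # Hz (assumed)
SOUND_SPEED = 343.0           # m/s
MIC_SPACING = 0.08            # metres between the two mics of a pair (assumed)

def gcc_phat(sig: np.ndarray, ref: np.ndarray, max_tau: float) -> float:
    """Estimate the time delay (seconds) of `sig` relative to `ref`."""
    n = sig.size + ref.size
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12            # PHAT weighting whitens the spectrum
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(max_tau * SAMPLE_RATE)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / SAMPLE_RATE

def direction_from_pair(left: np.ndarray, right: np.ndarray) -> float:
    """Angle of arrival in degrees for one mic pair (0 degrees = broadside)."""
    max_tau = MIC_SPACING / SOUND_SPEED
    tau = gcc_phat(left, right, max_tau)
    # Clamp in case the correlation peak falls just outside the physical range.
    ratio = float(np.clip(tau * SOUND_SPEED / MIC_SPACING, -1.0, 1.0))
    return float(np.degrees(np.arcsin(ratio)))

# With four microphones, two orthogonal pairs resolve a full 360-degree
# bearing, which a captioning UI can render as an arrow next to each caption.
```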

SPECTRA: Personalizable Sound Recognition for Deaf and Hard of Hearing Users through Interactive Machine Learning
Description

We introduce SPECTRA, a novel pipeline for personalizable sound recognition designed to understand DHH users' needs when collecting audio data, creating a training dataset, and reasoning about the quality of a model. To evaluate the prototype, we recruited 12 DHH participants who trained personalized models for their homes. We investigated waveforms, spectrograms, interactive clustering, and data annotation to support DHH users throughout this workflow, and we explored the impact of a hands-on training session on their experience and attitudes toward sound recognition tools. Our findings reveal the potential of clustering visualizations and waveforms to enrich users' understanding of audio data and their refinement of training datasets, and of data annotation to promote varied data collection. We provide insights into DHH users' experiences and perspectives on personalizing a sound recognition pipeline. Finally, we share design considerations for future interactive systems to support this population.
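
To make the pipeline concrete, below is a minimal sketch (not the SPECTRA implementation) of one way to personalize sound recognition: embed user-recorded clips, cluster them for review, and train a small classifier on the cleaned set. The band-energy features, k-means clustering, and k-NN classifier are illustrative choices, not necessarily the paper's.

```python
# Minimal personalizable sound-recognition sketch (not SPECTRA itself).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def embed(clip: np.ndarray, n_bands: int = 32) -> np.ndarray:
    """Very simple embedding: log mean magnitude in linearly spaced frequency bands."""
    spectrum = np.abs(np.fft.rfft(clip))
    bands = np.array_split(spectrum, n_bands)
    return np.log1p(np.array([band.mean() for band in bands]))

def cluster_for_review(embeddings: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Group clips so the user can inspect clusters, fix labels, or drop noisy clips."""
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embeddings)

def train_personal_model(embeddings: np.ndarray, labels: list[str]) -> KNeighborsClassifier:
    """Fit a small k-NN classifier on the user's own labelled recordings."""
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(embeddings, labels)
    return model

# Usage sketch: clips the user records at home (arrays of audio samples) plus
# labels such as "doorbell" or "microwave" yield a home-specific model; the
# clustering view helps spot mislabelled or low-quality clips before training.
```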

A Longitudinal Study on the Effects of Circadian Fatigue on Sound Source Identification and Localization using a Heads-Up Display
Description

Circadian fatigue, largely caused by sleep deprivation, significantly diminishes alertness and situational awareness. This issue becomes critical in environments where auditory awareness—such as responding to verbal instructions or localizing alarms—is essential for performance and safety. While head-mounted displays have demonstrated potential in enhancing situational awareness through visual cues, their effectiveness in supporting sound localization under the influence of circadian fatigue remains under-explored. This study addresses this knowledge gap through a longitudinal study (N=19) conducted over 2–4 months, tracking participants’ fatigue levels through daily assessments. Participants were called in to perform non-line-of-sight sound source identification and localization tasks in a virtual environment under high- and low-fatigue conditions, both with and without heads-up display (HUD) assistance. The results show task-dependent effects of circadian fatigue. Unexpectedly, reaction times were shorter across all tasks under high-fatigue conditions. Yet, in sound localization, where precision is key, the HUD offered the greatest performance enhancement by reducing pointing error. The results suggest that the auditory channel is a robust means of enhancing situational awareness and provide support for incorporating spatial audio cues and HUDs as standard features in augmented reality platforms for fatigue-prone scenarios.
