2. Sound & Music

Conference Name
UIST 2024
SonoHaptics: An Audio-Haptic Cursor for Gaze-Based Object Selection in XR
Abstract

We introduce SonoHaptics, an audio-haptic cursor for gaze-based 3D object selection. SonoHaptics addresses challenges around providing accurate visual feedback during gaze-based selection in Extended Reality (XR), e.g., lack of world-locked displays in no- or limited-display smart glasses and visual inconsistencies. To enable users to distinguish objects without visual feedback, SonoHaptics employs the concept of cross-modal correspondence in human perception to map visual features of objects (color, size, position, material) to audio-haptic properties (pitch, amplitude, direction, timbre). We contribute data-driven models for determining cross-modal mappings of visual features to audio and haptic features, and a computational approach to automatically generate audio-haptic feedback for objects in the user's environment. SonoHaptics provides global feedback that is unique to each object in the scene, and local feedback to amplify differences between nearby objects. Our comparative evaluation shows that SonoHaptics enables accurate object identification and selection in a cluttered scene without visual feedback.
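
As an aside, the following Python sketch illustrates the kind of cross-modal mapping the abstract describes, from visual features (color, size, position, material) to audio-haptic properties (pitch, amplitude, direction, timbre). The specific ranges and mapping functions are illustrative assumptions, not the data-driven mappings from the paper.

```python
# Illustrative sketch (not the authors' implementation): map visual features
# of a scene object to hypothetical audio-haptic cursor parameters.

from dataclasses import dataclass

@dataclass
class SceneObject:
    hue: float          # 0.0 (red) .. 1.0 (violet), normalized
    size: float         # 0.0 (smallest in scene) .. 1.0 (largest)
    azimuth_deg: float  # horizontal angle relative to the user, in degrees
    roughness: float    # 0.0 (smooth material) .. 1.0 (rough material)

def audio_haptic_cursor_params(obj: SceneObject) -> dict:
    """Map visual features to assumed audio-haptic properties."""
    return {
        # color -> pitch: example mapping over roughly two octaves
        "pitch_hz": 220.0 * (2.0 ** (2.0 * obj.hue)),        # 220-880 Hz
        # size -> amplitude: larger objects get stronger feedback
        "amplitude": 0.2 + 0.8 * obj.size,                    # 0.2-1.0
        # position -> direction: pan audio/haptics toward the object
        "pan": max(-1.0, min(1.0, obj.azimuth_deg / 90.0)),   # -1 left .. +1 right
        # material -> timbre: rougher materials get a brighter, noisier timbre
        "timbre_brightness": obj.roughness,
    }

print(audio_haptic_cursor_params(
    SceneObject(hue=0.3, size=0.7, azimuth_deg=-30, roughness=0.5)))
```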

Authors
Hyunsung Cho
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Naveen Sendhilnathan
Meta, Seattle, Washington, United States
Michael Nebeling
University of Michigan, Ann Arbor, Michigan, United States
Tianyi Wang
Purdue University, West Lafayette, Indiana, United States
Purnima Padmanabhan
Meta Reality Labs, Burlingame, California, United States
Jonathan Browder
Reality Labs Research, Meta Inc., Redmond, Washington, United States
David Lindlbauer
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Tanya R. Jonker
Meta Inc., Redmond, Washington, United States
Kashyap Todi
Reality Labs Research, Redmond, Washington, United States
Paper URL

https://doi.org/10.1145/3654777.3676384

Video
SonifyAR: Context-Aware Sound Generation in Augmented Reality
Abstract

Sound plays a crucial role in enhancing user experience and immersiveness in Augmented Reality (AR). However, current platforms lack support for AR sound authoring due to limited interaction types, challenges in collecting and specifying context information, and difficulty in acquiring matching sound assets. We present SonifyAR, an LLM-based AR sound authoring system that generates context-aware sound effects for AR experiences. SonifyAR expands the current design space of AR sound and implements a Programming by Demonstration (PbD) pipeline to automatically collect contextual information of AR events, including virtual content semantics and real-world context. This context information is then processed by a large language model to acquire sound effects with Recommendation, Retrieval, Generation, and Transfer methods. To evaluate the usability and performance of our system, we conducted a user study with eight participants and created five example applications, including an AR-based science experiment and an assistive application for low-vision AR users.
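
The sketch below illustrates, at a high level, the pipeline the abstract describes: packaging AR event context into an LLM prompt and dispatching on one of the four named acquisition methods. The prompt wording, the `query_llm` placeholder, and the handlers are hypothetical stand-ins, not SonifyAR's actual implementation.

```python
# Minimal sketch (not the SonifyAR implementation): route AR event context
# through an LLM decision into one of four sound acquisition methods.

import json

def query_llm(prompt: str) -> dict:
    # Placeholder: a real system would call an LLM API here.
    return {"method": "Retrieval", "query": "glass object hitting wooden table"}

def sonify_event(event: dict, real_world_context: dict) -> dict:
    prompt = (
        "Suggest a sound effect for this AR event.\n"
        f"Event: {json.dumps(event)}\n"
        f"Context: {json.dumps(real_world_context)}\n"
        "Answer with one of: Recommendation, Retrieval, Generation, Transfer."
    )
    decision = query_llm(prompt)

    # Each handler stands in for a different way of obtaining the sound asset.
    handlers = {
        "Recommendation": lambda q: f"recommended asset for '{q}'",
        "Retrieval": lambda q: f"library sound retrieved for '{q}'",
        "Generation": lambda q: f"generated sound for '{q}'",
        "Transfer": lambda q: f"style-transferred sound for '{q}'",
    }
    return {"sound": handlers[decision["method"]](decision["query"]),
            "method": decision["method"]}

print(sonify_event({"type": "collision", "virtual_object": "glass cup"},
                   {"surface": "wooden table"}))
```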

Authors
Xia Su
University of Washington, Seattle, Washington, United States
Jon E. Froehlich
University of Washington, Seattle, Washington, United States
Eunyee Koh
Adobe Research, San Jose, California, United States
Chang Xiao
Adobe Research, San Jose, California, United States
Paper URL

https://doi.org/10.1145/3654777.3676406

Video
Auptimize: Optimal Placement of Spatial Audio Cues for Extended Reality
Abstract

Spatial audio in Extended Reality (XR) provides users with better awareness of where virtual elements are placed, and efficiently guides them to events such as notifications, system alerts from different windows, or approaching avatars. Humans, however, are inaccurate in localizing sound cues, especially with multiple sources due to limitations in human auditory perception such as angular discrimination error and front-back confusion. This decreases the efficiency of XR interfaces because users misidentify from which XR element a sound is coming. To address this, we propose Auptimize, a novel computational approach for placing XR sound sources, which mitigates such localization errors by utilizing the ventriloquist effect. Auptimize disentangles the sound source locations from the visual elements and relocates the sound sources to optimal positions for unambiguous identification of sound cues, avoiding errors due to inter-source proximity and front-back confusion. Our evaluation shows that Auptimize decreases spatial audio-based source identification errors compared to playing sound cues at the paired visual-sound locations. We demonstrate the applicability of Auptimize for diverse spatial audio-based interactive XR scenarios.
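
The sketch below shows one way such a placement optimization could look: each sound source is kept within an assumed ventriloquism tolerance of its visual element while a brute-force search maximizes angular separation and rejects front-back mirror conflicts. The tolerance, scoring, and search strategy are illustrative assumptions, not Auptimize's actual formulation.

```python
# Illustrative sketch only (not Auptimize itself): choose a sound-source
# azimuth for each visual element (azimuth 0 = front, 90 = right, degrees).

from itertools import product

TOLERANCE_DEG = 30.0   # assumed ventriloquist-effect capture range
STEP_DEG = 10.0

def angular_distance(a, b):
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def front_back_conflict(a, b, threshold=15.0):
    # Sources near mirror positions about the interaural axis are confusable.
    return angular_distance(a, (180.0 - b) % 360.0) < threshold

def place_sources(visual_azimuths):
    if len(visual_azimuths) < 2:
        return tuple(visual_azimuths)
    offsets = [k * STEP_DEG for k in range(-int(TOLERANCE_DEG // STEP_DEG),
                                           int(TOLERANCE_DEG // STEP_DEG) + 1)]
    candidates = [[v + off for off in offsets] for v in visual_azimuths]
    best, best_score = None, float("-inf")
    for combo in product(*candidates):
        pairs = [(a, b) for i, a in enumerate(combo) for b in combo[i + 1:]]
        if any(front_back_conflict(a, b) for a, b in pairs):
            continue
        # Score a placement by its worst-case angular separation.
        score = min(angular_distance(a, b) for a, b in pairs)
        if score > best_score:
            best, best_score = combo, score
    return best

print(place_sources([0.0, 20.0, 160.0]))   # visual azimuths in degrees
```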

Authors
Hyunsung Cho
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Alexander Wang
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Divya Kartik
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Emily Liying Xie
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Yukang Yan
University of Rochester, Rochester, New York, United States
David Lindlbauer
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Paper URL

https://doi.org/10.1145/3654777.3676424

Video
EarHover: Mid-Air Gesture Recognition for Hearables Using Sound Leakage Signals
Abstract

We introduce EarHover, an innovative system that enables mid-air gesture input for hearables. Mid-air gesture input, which eliminates the need to touch the device and thus helps keep both hands and device clean, has been shown in previous surveys to be in high demand. However, existing mid-air gesture input methods for hearables have relied on adding cameras or infrared sensors. By focusing on the sound leakage phenomenon unique to hearables, we realize mid-air gesture recognition using a speaker and an external microphone, both of which are highly compatible with hearables. The signal that leaks outside the device can be measured by an external microphone, which captures differences in reflection characteristics caused by the hand's shape and speed during mid-air gestures. Among 27 candidate gestures, we determined the seven most suitable for EarHover in terms of signal discriminability and user acceptability. We then evaluated the gesture detection and classification performance of two prototype devices (in-ear and open-ear) for real-world application scenarios.
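
As a rough illustration of the sensing idea, the sketch below extracts spectral features from an external-microphone recording of the leaked signal and classifies a gesture with a nearest-centroid rule. The window sizes, features, and classifier are assumptions for illustration, not the paper's pipeline.

```python
# Conceptual sketch, not the EarHover pipeline: spectral features from the
# leaked-signal recording, classified by nearest centroid.

import numpy as np

SAMPLE_RATE = 16_000
FRAME = 512
HOP = 256

def spectrogram_features(signal: np.ndarray) -> np.ndarray:
    """Mean magnitude spectrum across frames of the gesture window."""
    frames = [signal[i:i + FRAME] * np.hanning(FRAME)
              for i in range(0, len(signal) - FRAME, HOP)]
    mags = np.abs(np.fft.rfft(np.stack(frames), axis=1))
    return mags.mean(axis=0)  # reflection changes show up as spectral shifts

def train_centroids(examples: dict) -> dict:
    return {label: np.mean([spectrogram_features(s) for s in sigs], axis=0)
            for label, sigs in examples.items()}

def classify(signal: np.ndarray, centroids: dict) -> str:
    feat = spectrogram_features(signal)
    return min(centroids, key=lambda lbl: np.linalg.norm(feat - centroids[lbl]))

# Toy usage with synthetic data standing in for recorded leakage signals.
rng = np.random.default_rng(0)
fake = {g: [rng.normal(size=SAMPLE_RATE // 2) for _ in range(3)]
        for g in ["swipe", "tap", "hover"]}
centroids = train_centroids(fake)
print(classify(fake["swipe"][0], centroids))
```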

Award
Best Paper
Authors
Shunta Suzuki
Keio University, Yokohama, Japan
Takashi Amesaka
Keio University, Yokohama, Japan
Hiroki Watanabe
Hokkaido University, Sapporo, Japan
Buntarou Shizuki
University of Tsukuba, Tsukuba, Ibaraki, Japan
Yuta Sugiura
Keio University, Yokohama, Japan
Paper URL

https://doi.org/10.1145/3654777.3676367

Video
Towards Music-Aware Virtual Assistants
Abstract

We propose a system for modifying spoken notifications in a manner that is sensitive to the music a user is listening to. Spoken notifications provide convenient access to rich information without the need for a screen. Virtual assistants see prevalent use in hands-free settings such as driving or exercising, activities where users also regularly enjoy listening to music. In such settings, virtual assistants will temporarily mute a user's music to improve intelligibility. However, users may perceive these interruptions as intrusive, negatively impacting their music-listening experience. To address this challenge, we propose the concept of music-aware virtual assistants, where speech notifications are modified to resemble a voice singing in harmony with the user's music. We contribute a system that processes user music and notification text to produce a blended mix, replacing original song lyrics with the notification content. In a user study comparing musical assistants to standard virtual assistants, participants expressed that musical assistants fit better with music, reduced intrusiveness, and provided a more delightful listening experience overall.
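
The sketch below illustrates one piece of such a system: aligning notification text to an extracted melody so it could be sung in place of the original lyrics. The syllable splitter, the one-syllable-per-note rule, and the placeholder melody are illustrative assumptions; melody extraction and singing-voice synthesis are not shown.

```python
# Minimal sketch under strong assumptions (not the authors' system):
# align notification syllables to melody notes.

import re

def rough_syllables(word: str) -> list:
    """Very rough syllable split by vowel groups (illustrative only)."""
    parts = re.findall(r"[^aeiouy]*[aeiouy]+(?:[^aeiouy]*$)?", word.lower())
    return parts or [word]

def align_text_to_melody(text: str, melody: list) -> list:
    """Assign one syllable per melody note; reuse the last note if needed."""
    syllables = [s for w in text.split() for s in rough_syllables(w)]
    aligned = []
    for i, syl in enumerate(syllables):
        note = melody[min(i, len(melody) - 1)]
        aligned.append({"syllable": syl, "pitch": note["pitch"],
                        "start": note["start"], "duration": note["duration"]})
    return aligned

# Placeholder melody, as if extracted from the user's current song.
melody = [{"pitch": p, "start": i * 0.5, "duration": 0.5}
          for i, p in enumerate(["C4", "D4", "E4", "G4", "E4", "D4"])]

for event in align_text_to_melody("New message from Alex", melody):
    print(event)
```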

Authors
Alexander Wang
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
David Lindlbauer
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Chris Donahue
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Paper URL

https://doi.org/10.1145/3654777.3676416

Video