Despite their ubiquity, wireless earbuds remain audio-centric due to size and power constraints. We present VueBuds, the first camera-integrated wireless earbuds for egocentric vision, capable of operating within stringent power and form-factor limits. Each VueBud embeds a camera into a Sony WF-1000XM3 to stream visual data over Bluetooth to a host device for on-device vision language model (VLM) processing. We show analytically and empirically that while each camera's field of view is partially occluded by the face, the combined binocular perspective provides comprehensive forward coverage. By integrating VueBuds with VLMs, we build an end-to-end system for real-time scene understanding, translation, visual reasoning, and text reading, all from low-resolution monochrome cameras drawing under 5 mW through on-demand activation. Through online and in-person user studies with 90 participants, we compare VueBuds against smart glasses across 17 visual question-answering tasks, and show that our system achieves response quality on par with Ray-Ban Meta. Our work establishes low-power camera-equipped earbuds as a compelling platform for visual intelligence, bringing rapidly advancing VLM capabilities to one of the most ubiquitous wearable form factors.
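The on-demand activation pattern above keeps the cameras unpowered except at the moment of a query. A minimal sketch of that host-side flow follows; `EarbudCamera`, `run_vlm`, and the frame resolution are hypothetical stand-ins for illustration, not the VueBuds API.

```python
import numpy as np

class EarbudCamera:
    """Hypothetical Bluetooth camera wrapper that powers on only to capture."""
    def capture_frame(self) -> np.ndarray:
        # Power the sensor, expose one monochrome frame, power back down.
        return np.zeros((240, 320), dtype=np.uint8)  # placeholder resolution

def run_vlm(question: str, frames: list) -> str:
    """Stub for host-side VLM inference; a real system would invoke a local model."""
    return f"(answer to {question!r} from {len(frames)} frames)"

def answer_visual_query(question: str, left: EarbudCamera, right: EarbudCamera) -> str:
    # Capture from both buds: each view is partially occluded by the face,
    # but together the binocular views cover the forward field.
    frames = [left.capture_frame(), right.capture_frame()]
    return run_vlm(question, frames)
```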
Embodiment can enhance conversational agents, for example by increasing their perceived presence. This is typically achieved through visual representations of a virtual body; however, visual modalities are not always available, such as when users interact with agents using headphones or display-less glasses. In this work, we explore auditory embodiment. By introducing auditory cues of bodily presence — through spatially localized voice and situated Foley audio from environmental interactions — we investigate how audio alone can convey embodiment and influence perceptions of a conversational agent. We conducted a 2 (spatialization: monaural vs. spatialized) × 2 (Foley: none vs. Foley) within-subjects study in which participants (n=24) engaged in conversations with agents. Our results show that spatialization and Foley increase co-presence, but reduce users’ perceptions of the agent’s attention and other social attributes.
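The paper does not specify its spatial renderer, but a minimal way to spatialize a mono agent voice is to apply interaural time and level differences; production systems would more likely use HRTFs. The sketch below uses illustrative parameter values throughout.

```python
import numpy as np

def spatialize(mono: np.ndarray, azimuth_deg: float, sr: int = 48000) -> np.ndarray:
    """Return an (n, 2) stereo signal placing `mono` at `azimuth_deg`
    (0 = front, positive = right) via a crude ITD delay and ILD gain."""
    az = np.deg2rad(azimuth_deg)
    itd = 0.000625 * np.sin(az)               # up to ~0.625 ms between ears
    delay = int(round(abs(itd) * sr))         # far-ear delay in samples
    far_gain = 10 ** (-6.0 * abs(np.sin(az)) / 20)  # up to -6 dB level difference
    far = np.concatenate([np.zeros(delay), mono])[: len(mono)] * far_gain
    near = mono.astype(float)
    # Positive azimuth: sound arrives at the right ear first, so left is far.
    left, right = (far, near) if itd > 0 else (near, far)
    return np.stack([left, right], axis=1)
```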
Hands are the chief appendage with which we manipulate the world around us, creating sounds as they go. As such, they are a rich source of information that computers can leverage for input and context sensing. Indeed, many prior works in HCI have explored this idea by instrumenting users' hands with a microphone, often integrated into a ring, wristband, or watch. In this work, we explore an alternative bare-hands approach: using a microphone array integrated into a user's headset or glasses, we apply beamforming to create a virtual microphone that tracks the user's fingers in 3D space. We show this method can capture even the subtle noise of a finger translating across surfaces, including skin-to-skin contact for micro-gestures, as well as passive widget interactions.
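For readers unfamiliar with the technique, delay-and-sum is the simplest beamformer: align each microphone channel by its extra acoustic travel time from the target point, then average. The sketch below assumes an illustrative array geometry and sample rate; it is not the paper's exact pipeline.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def delay_and_sum(signals: np.ndarray, mic_pos: np.ndarray,
                  target: np.ndarray, sr: int = 48000) -> np.ndarray:
    """Steer a virtual microphone at `target` (e.g., a tracked fingertip).
    signals: (n_mics, n_samples); mic_pos: (n_mics, 3); target: (3,)."""
    dists = np.linalg.norm(mic_pos - target, axis=1)
    # Extra travel time of each mic relative to the closest one.
    delays = (dists - dists.min()) / SPEED_OF_SOUND
    shifts = np.round(delays * sr).astype(int)
    out = np.zeros(signals.shape[1])
    for sig, s in zip(signals, shifts):
        # Advance later-arriving channels so the target's sound adds coherently.
        out[: signals.shape[1] - s] += sig[s:]
    return out / len(signals)
```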
Fog screens are a projection medium for mid-air graphics: they are see-through and reach-through. Fog screens are traditionally generated by laminar flows coming out of linear outlets and are thus usually static. Here, we show that ultrasonic beams create a streaming field of constrained airflow that directs aerosols in a fast and controllable way. We characterize various aerosols under streaming and optical diffusion, and evaluate control methods for obtaining different aerosol shapes. Aerosol columns are raised, moved, and pushed within a few hundred milliseconds to serve as a projection medium. Hands and other objects do not disturb the fog columns significantly, allowing for direct interaction. We showcase applications including projection on multiple columns that can be moved continuously within the display volume, reach-through interactions, and mixed reality for tabletop games.
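Although the abstract does not detail its control method, the standard primitive for steering airborne ultrasound is a phased array focused at a 3D point. The sketch below shows the usual focusing-phase computation, assuming 40 kHz transducers and an illustrative 16×16 grid; the paper's actual control scheme may differ.

```python
import numpy as np

FREQ = 40_000.0          # Hz, typical airborne-ultrasound transducer
SPEED_OF_SOUND = 343.0   # m/s
WAVELENGTH = SPEED_OF_SOUND / FREQ

def focus_phases(transducer_pos: np.ndarray, focal_point: np.ndarray) -> np.ndarray:
    """Per-transducer phase offsets (radians) so all emissions arrive in
    phase at `focal_point`. transducer_pos: (n, 3); focal_point: (3,)."""
    dists = np.linalg.norm(transducer_pos - focal_point, axis=1)
    # Farther transducers emit earlier (negative phase) so wavefronts align.
    return (-2 * np.pi * dists / WAVELENGTH) % (2 * np.pi)

# Example: a 16x16 grid with 10.5 mm pitch focused 20 cm above its center.
xs = (np.arange(16) - 7.5) * 0.0105
grid = np.array([[x, y, 0.0] for x in xs for y in xs])
phases = focus_phases(grid, np.array([0.0, 0.0, 0.20]))
```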
Silent and whispered speech offer promise for always-available voice interaction with AI, yet existing methods struggle to balance vocabulary size, wearability, silence, and noise robustness. We present NasoVoce, a nose-bridge–mounted interface that integrates a microphone and a vibration sensor. Positioned at the nasal pads of smart glasses, it unobtrusively captures both acoustic and vibration signals. The nasal bridge, close to the mouth, allows access to bone- and skin-conducted speech and enables reliable capture of low-volume utterances such as whispered speech. While the microphone captures high-quality audio, it is highly sensitive to environmental noise. Conversely, the vibration sensor is robust to noise but yields lower signal quality. By fusing these complementary inputs, NasoVoce generates high-quality speech robust against interference. Evaluation with Whisper Large-v2, PESQ, STOI, and MUSHRA ratings confirms improved recognition and quality. NasoVoce demonstrates the feasibility of a practical interface for always-available, continuous, and discreet AI voice conversations.
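One simple fusion strategy consistent with the abstract's complementary-sensor argument is a spectral crossover: take the noise-robust low band from the vibration sensor and the higher-fidelity high band from the microphone. The crossover frequency and FFT-based splice below are illustrative assumptions, not NasoVoce's actual fusion model.

```python
import numpy as np

def fuse(mic: np.ndarray, vib: np.ndarray, sr: int = 16000,
         crossover_hz: float = 1000.0) -> np.ndarray:
    """Combine time-aligned mic and vibration signals of equal length."""
    M, V = np.fft.rfft(mic), np.fft.rfft(vib)
    freqs = np.fft.rfftfreq(len(mic), d=1.0 / sr)
    low = freqs < crossover_hz
    # Vibration sensor below the crossover (noise-robust), mic above (full band).
    fused = np.where(low, V, M)
    return np.fft.irfft(fused, n=len(mic))
```

A crude splice like this introduces phase discontinuities at the crossover; a learned or filterbank-based fusion, as real systems use, would blend the bands smoothly.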
On-body computing systems offer new forms of interaction, but while they are increasingly integrated into everyday contexts, their unique privacy and safety challenges remain understudied.
This paper examines these challenges through a two-round interview study with N=15 experts in human-computer interaction and in privacy and safety, using speculative scenarios and adversarial roleplaying to elicit insights.
Our findings reveal risks specific to on-body interactions, including over-collection of sensitive data, unwanted inferences, harm to bystanders, and threats to bodily autonomy and psychological well-being.
Importantly, in the on-body context, privacy and safety concerns are deeply interconnected and cannot be addressed in isolation.
We contribute an empirically grounded characterization of these entangled challenges and derive eight actionable design guidelines to support safer, more privacy-aware on-body systems.
This work informs future research and design in ubiquitous computing by highlighting the need for proactive and integrated approaches to privacy and safety in trustworthy on-body computing.
Camera glasses create fundamental privacy tensions between wearers seeking recording functionality and bystanders concerned about unauthorized surveillance. We present a systematic multi-stakeholder evaluation of privacy mechanisms through surveys (N=525) and paired interviews (N=20) in China. Study 1 quantifies expectation-willingness gaps: bystanders consistently demand stronger information transparency and protective measures than wearers will provide, with disparities intensifying in sensitive contexts where 65–90% of bystanders would take defensive action. Study 2 evaluates twelve privacy-enhancing technologies, revealing four fundamental trade-offs that undermine current approaches: visibility versus disruption, empowerment versus burden, protection versus agency, and accountability versus exposure. These gaps reflect structural incompatibilities rather than inadequate goodwill, with context emerging as the primary determinant of privacy acceptability. We propose context-adaptive pathways that dynamically adjust protection strategies: minimal-friction visibility in public spaces, structured negotiation in semi-public environments, and automatic protection in sensitive contexts. Our findings contribute a diagnostic framework for evaluating privacy mechanisms and implications for context-aware design in ubiquitous sensing.