Current touch-based interactions on earphones are limited by hygiene concerns and the small interaction surface. Recent works attempt to bypass these issues with mid-air gesture systems based on active acoustic sensing; however, the transmitted signals may be audible and pose potential hearing risks. To address this, we propose FingerBar, a mid-air gesture recognition system for earphones that relies solely on microphones, without active signal transmission. FingerBar leverages the distinctive friction sounds generated by finger movements to recognize gestures. We design a gesture filtering pipeline to maintain robustness against everyday noise, and an adversarial training strategy further enhances user-independent performance. From a set of 16 candidate gestures, we identify the 7 most suitable for FingerBar based on user acceptability. Extensive evaluations demonstrate high accuracy and robustness, and a user study confirms the practicality and acceptability of the system. Our findings highlight the promise of passive acoustic sensing as a user-friendly interaction modality for earphones.
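The abstract does not detail the adversarial strategy; one common realization is domain-adversarial training with a gradient-reversal layer, treating user identity as the adversary's target so the shared encoder learns user-invariant gesture features. A minimal PyTorch sketch under that assumption (the layer sizes, class names, and 128-d acoustic feature input are all hypothetical):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class AdversarialGestureNet(nn.Module):
    def __init__(self, n_gestures=7, n_users=10, feat_dim=64):
        super().__init__()
        # Shared encoder over (flattened) acoustic features; a real model would be convolutional.
        self.encoder = nn.Sequential(nn.Linear(128, feat_dim), nn.ReLU())
        self.gesture_head = nn.Linear(feat_dim, n_gestures)  # task classifier
        self.user_head = nn.Linear(feat_dim, n_users)        # adversary

    def forward(self, x, lam=1.0):
        z = self.encoder(x)
        return self.gesture_head(z), self.user_head(GradReverse.apply(z, lam))

# One training step: minimize gesture loss while maximizing user confusion.
model = AdversarialGestureNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
x = torch.randn(32, 128)                  # stand-in batch of acoustic features
y_gesture = torch.randint(0, 7, (32,))
y_user = torch.randint(0, 10, (32,))

opt.zero_grad()
g_logits, u_logits = model(x, lam=0.5)
# The reversal layer flips the second term's gradient inside the encoder only.
loss = ce(g_logits, y_gesture) + ce(u_logits, y_user)
loss.backward()
opt.step()
```

With this construction a single backward pass simultaneously trains the user head to identify users and pushes the shared encoder to erase user-specific cues, which is what drives user-independent performance.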
Achieving touchpad-like pointing with a single IMU ring is highly desirable for portable and wearable interaction, yet challenging due to incomplete motion data and significant user variability. We present TraceRing, a finger-worn IMU system that enables precise two-dimensional cursor control. To address the limitations of generic end-to-end models, we propose a personalized training framework that learns user-specific representations through joint multi-task and contrastive learning while dynamically selecting the most suitable expert model. This approach enables personalization without per-user fine-tuning and reduces velocity prediction error by 33.9% over state-of-the-art baselines. Furthermore, a real-time study shows that TraceRing delivers speed and accuracy substantially exceeding those of AirMouse (average task completion time of 2.26 s vs. 3.01 s). These results establish TraceRing as a portable and comfortable alternative for mobile computing and AR interaction applications.
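The abstract says the framework "dynamically select[s] the most suitable expert model" without specifying the mechanism; one plausible realization is nearest-prototype gating in the contrastively learned user embedding space. A toy NumPy sketch (the embedding dimension, prototype construction, and all names are assumptions, not TraceRing's design):

```python
import numpy as np

def l2norm(v):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + 1e-8)

def select_expert(user_embedding, expert_prototypes):
    """Pick the expert whose prototype is most similar (cosine) to the user embedding.

    user_embedding: (d,) representation from the contrastively trained user encoder.
    expert_prototypes: (K, d) mean embeddings of the users each expert was trained on.
    """
    sims = l2norm(expert_prototypes) @ l2norm(user_embedding)
    return int(np.argmax(sims))

# Toy usage: 3 experts over a 16-d embedding space.
rng = np.random.default_rng(0)
prototypes = rng.normal(size=(3, 16))
new_user = prototypes[1] + 0.1 * rng.normal(size=16)   # resembles expert 1's users
assert select_expert(new_user, prototypes) == 1
```

The appeal of such gating is that a new user only needs a short motion sample to be embedded and routed, which is consistent with personalization without per-user fine-tuning.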
The proliferation of XR devices has made egocentric hand pose estimation a vital task, yet this perspective is inherently challenged by frequent finger occlusions. To address this, we propose a novel approach that leverages the rich information in dorsal hand skin deformation, unlocked by recent advances in dense visual featurizers. We introduce a dual-stream delta encoder that learns pose by contrasting features from a dynamic hand with those from a relaxed baseline pose. Our evaluation demonstrates that, using only cropped dorsal images, our method reduces the Mean Per Joint Angle Error (MPJAE) by 18% in self-occluded scenarios (fingers >= 50% occluded) compared to state-of-the-art techniques that depend on the whole hand's geometry and large model backbones. Consequently, our method not only enhances the reliability of downstream tasks such as index-finger pinch and tap estimation under occlusion, but also unlocks new interaction paradigms, such as detecting isometric force for a surface "click" without visible movement, all while keeping the model compact.
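A dual-stream delta encoder of the kind described could, for instance, tie weights across two streams of dense features (e.g., from a frozen featurizer such as DINOv2) and regress joint angles from the feature difference between the current frame and the relaxed baseline. A minimal PyTorch sketch; the pooling, dimensions, and head architecture are illustrative assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

class DeltaPoseRegressor(nn.Module):
    """Regress joint angles from the *change* in dense dorsal-skin features
    between the current frame and a relaxed-baseline frame."""
    def __init__(self, feat_dim=384, n_joint_angles=20):
        super().__init__()
        # Shared projection applied to both streams (weights tied).
        self.proj = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(256, 128), nn.ReLU(),
                                  nn.Linear(128, n_joint_angles))

    def forward(self, feat_dynamic, feat_baseline):
        # Pool dense patch features (B, N_patches, D) -> (B, D), then contrast.
        zd = self.proj(feat_dynamic.mean(dim=1))
        zb = self.proj(feat_baseline.mean(dim=1))
        return self.head(zd - zb)     # the "delta" carries the deformation cue

# Toy pass with stand-in dense features (14x14 patch grid, 384-d features).
model = DeltaPoseRegressor()
f_dyn = torch.randn(4, 196, 384)
f_base = torch.randn(4, 196, 384)
angles = model(f_dyn, f_base)         # (4, 20) predicted joint angles
```

Anchoring to a per-user relaxed baseline is what lets the regressor respond to skin deformation rather than absolute appearance, which is also why isometric force (deformation without motion) becomes detectable.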
Tracking hand poses on wrist wearables enables rich, expressive interactions, yet remains unavailable on commercial smartwatches, as prior implementations rely on external sensors or custom hardware, limiting their real-world applicability. To address this, we present WatchHand, the first continuous 3D hand pose tracking system implemented on off-the-shelf smartwatches using only their built-in speaker and microphone. WatchHand emits inaudible frequency-modulated continuous waves (FMCW) and captures their reflections from the hand. These acoustic signals are processed by a deep-learning model that estimates the 3D positions of 20 finger joints. We evaluate WatchHand across diverse real-world conditions (multiple smartwatch models, wearing hands, body postures, noise conditions, and pose-variation protocols) and achieve a mean per-joint position error of 7.87 mm in cross-session tests with device remounting. Although performance drops for unseen users or gestures, the model adapts effectively with lightweight fine-tuning on small amounts of data. Overall, WatchHand lowers the barrier to smartwatch-based hand tracking by eliminating additional hardware while enabling robust, always-available interactions on millions of existing devices.
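For readers unfamiliar with FMCW ranging, the standard pipeline transmits a linear chirp, mixes it with the received echo, and reads the target range off the resulting beat frequency f_b = 2BR/(cT). The NumPy sketch below illustrates that principle with an inaudible 18-22 kHz sweep; the parameters are illustrative, not WatchHand's configuration, which feeds such reflections to a deep model rather than a single-target FFT:

```python
import numpy as np

fs = 48_000               # speaker/mic sample rate
f0, f1 = 18_000, 22_000   # inaudible sweep band (B = 4 kHz)
T = 0.02                  # 20 ms chirp
c = 343.0                 # speed of sound (m/s)

t = np.arange(int(fs * T)) / fs
B = f1 - f0
tx = np.cos(2 * np.pi * (f0 * t + 0.5 * B / T * t**2))  # linear up-chirp

# Simulate an echo from a hand 15 cm away (round trip = 0.30 m).
delay = 2 * 0.15 / c
n_delay = int(round(delay * fs))
rx = np.zeros_like(tx)
rx[n_delay:] = tx[: len(tx) - n_delay]

# Dechirp: mixing tx with rx yields a beat tone at f_b = B * tau / T.
beat = tx * rx
spec = np.abs(np.fft.rfft(beat * np.hanning(len(beat)), n=1 << 16))
freqs = np.fft.rfftfreq(1 << 16, 1 / fs)
mask = freqs < 2_000                      # beat tones are low-frequency
f_b = freqs[mask][np.argmax(spec[mask])]
range_est = c * T * f_b / (2 * B)         # invert f_b = 2 * B * R / (c * T)
print(f"estimated hand distance: {range_est:.3f} m")   # ~0.150 m
```

A real hand produces a superposition of such beat tones, one per reflecting surface; it is this range profile over time that a learned model can map to full 3D joint positions.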
Current mobile hand tracking systems primarily rely on high-framerate (HFR) optical sensors to capture hand positions, resulting in high computational cost and limiting their applicability on end devices. We propose 3DRing, a 3D hand position tracking method that requires only low-framerate (LFR, <10 FPS) optical data and a single IMU ring. It consists of two stages: (1) a Deep Extended Kalman Filter module that predicts high-framerate hand positions from LFR optical measurements and a single IMU; and (2) a Reinforcement Learning module that adaptively selects minimal keyframes for calibration, further reducing the average optical framerate. Using only 6.61 FPS optical data, 3DRing achieves an average real-time tracking error of 1.75 cm and, in a 3D target selection task, an interaction efficiency of 86.0% relative to the 67 FPS hand tracking of the Meta Quest Pro, demonstrating a strong potential to reduce the reliance on optical data in mobile hand tracking.
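3DRing's module is a Deep EKF, in which learned components replace hand-tuned dynamics and noise models; the classical skeleton it builds on is a predict step driven by IMU acceleration at high rate plus a correct step applied only when a sparse optical keyframe arrives. A minimal constant-velocity sketch in NumPy (all rates and noise values are placeholders, not 3DRing's):

```python
import numpy as np

dt = 1 / 60.0                 # IMU/prediction rate
F = np.block([[np.eye(3), dt * np.eye(3)],
              [np.zeros((3, 3)), np.eye(3)]])              # constant-velocity transition
B = np.vstack([0.5 * dt**2 * np.eye(3), dt * np.eye(3)])   # acceleration input
H = np.hstack([np.eye(3), np.zeros((3, 3))])               # optical measures position only
Q = 1e-4 * np.eye(6)          # process noise (a Deep EKF would predict this)
R = 1e-3 * np.eye(3)          # optical measurement noise

x = np.zeros(6)               # state: [position, velocity]
P = np.eye(6)

def predict(accel):
    """Run at IMU rate using ring acceleration (assumed in the world frame)."""
    global x, P
    x = F @ x + B @ accel
    P = F @ P @ F.T + Q

def correct(optical_pos):
    """Run only when a low-framerate optical keyframe arrives (<10 FPS)."""
    global x, P
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (optical_pos - H @ x)
    P = (np.eye(6) - K @ H) @ P

# Between optical frames the filter dead-reckons on IMU alone:
for _ in range(9):
    predict(np.array([0.0, 0.0, 0.1]))
correct(np.array([0.01, 0.0, 0.02]))   # a sparse optical fix re-anchors the track
print(x[:3])                           # current 3D hand position estimate
```

The RL stage then decides *when* a correct step is worth an optical frame, trading tracking drift against the average framerate the camera must deliver.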
Scrolling is ubiquitous in our daily computing experience. We explore how single-handed microgestures can be used for scrolling. Based on an analysis of the basic components necessary for scrolling, we selected 3 microgestures: Tap, Hold, and Drag. Considering both rate and position control, we designed 4 microgesture-based scrolling techniques adapted to these 3 microgestures. We compared these 4 techniques in a laboratory experiment with 24 participants who performed 2 tasks: a reciprocal selection task, where participants scrolled the view to reach and select a target, and a counting task, where participants scrolled the view to count image occurrences. Our results suggest that the technique based on Drag microgestures with rate control is the most effective for scrolling operations, regardless of the task. This work demonstrates that microgestures, with their advantages for frequent everyday tasks, offer a promising approach to continuous and efficient scrolling control.
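The distinction between the two control mappings is central here: position control maps finger displacement to view displacement, while rate control maps a sustained displacement to a continuous scrolling velocity. A small illustrative sketch (the gains and dead zone are arbitrary values, not ones from the study):

```python
def position_control(drag_dy, gain=3.0):
    """Displacement-to-displacement: the view moves only while the finger moves."""
    return gain * drag_dy                       # scroll offset this frame (px)

def rate_control(drag_dy, gain=8.0, dead_zone=2.0):
    """Displacement-to-velocity: holding an offset keeps the view scrolling."""
    if abs(drag_dy) < dead_zone:                # small neutral zone to rest in
        return 0.0
    sign = 1.0 if drag_dy > 0 else -1.0
    return gain * (drag_dy - sign * dead_zone)  # scroll velocity (px/s)

# With rate control, a sustained 10 px drag offset scrolls continuously:
velocity = rate_control(10.0)                   # 64 px/s until the finger recenters
```

Rate control suits microgestures because the finger's travel on the thumb or fingers is tiny; a small held offset can cover an arbitrarily long document without repeated clutching.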
Reconstructing realistic digital twins has become crucial as advances in mixed reality, the metaverse, and robotics demand more accurate simulations of the physical world. Despite technical progress, building high-fidelity digital twins from a systematic and human-centered perspective remains underexplored. Drawing on the human processing model, we decompose human-centric reality into perception, motion, and cognition, and define a reality-preserving digital twin (RPDT) as a reconstruction that integrates these three dimensions. We present RealTwin, an attribute-graph-based representation and inference framework for RPDT. Leveraging the grounding capabilities of Multimodal Large Language Models (MLLMs), RealTwin chains AI tools to construct attribute graphs that faithfully encode real-world properties. We validate RealTwin through both a technical evaluation, showing promising success in graph parsing and attribute inference, and a user study assessing its applicability across diverse user groups. Informed by RealTwin, we discuss critical issues for future end-to-end, fine-grained, and scalable digital twin reconstruction, including ecology, interaction space, and real-world adoption.
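The abstract does not specify the attribute-graph schema; a minimal sketch, assuming nodes carry attribute dictionaries along the three RPDT dimensions (perception, motion, cognition) and edges store relation triples, might look like this (all field names and the example scene are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class AttributeNode:
    """One real-world entity with attributes along the three RPDT dimensions."""
    name: str
    perception: dict = field(default_factory=dict)   # e.g., appearance, material
    motion: dict = field(default_factory=dict)       # e.g., articulation, mass
    cognition: dict = field(default_factory=dict)    # e.g., affordance, semantics

@dataclass
class AttributeGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)        # (src, relation, dst) triples

    def add(self, node):
        self.nodes[node.name] = node

    def relate(self, src, rel, dst):
        self.edges.append((src, rel, dst))

# Toy graph an MLLM-driven tool chain might emit for a desk scene.
g = AttributeGraph()
g.add(AttributeNode("mug", perception={"material": "ceramic"},
                    motion={"movable": True}, cognition={"affordance": "graspable"}))
g.add(AttributeNode("desk", perception={"material": "wood"},
                    motion={"movable": False}, cognition={"affordance": "supports"}))
g.relate("mug", "on_top_of", "desk")
```

Separating the three dimensions at the schema level is what lets each be populated by a different grounding tool and validated independently, in line with the framework's chained-tool construction.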