108. Speech and Remapping Techniques

Immediately after the previous session

6

3 min 30 s

Uzuki Kumeta

Performative Vocal Synthesis for Foreign Language Intonation Practice

LipLearner: Customizable Silent Speech Interactions on Mobile Devices

LipIO: Enabling Lips as both Input and Output Surface

Visuo-haptic Crossmodal Shape Perception Model for Shape-Changing Handheld Controllers Bridged by Inertial Tensor

WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech interactions

Towards Applied Remapped Physical-Virtual Interfaces: Synchronization Methods for Resolving Control State Conflicts

Link: https://doi.org/10.1145/3544548.3581210

Typical foreign language (L2) pronunciation training focuses mainly on individual sounds. Intonation, the pattern of pitch change across words or phrases, is often neglected, despite its key role in word-level intelligibility and in the expression of attitudes and affect. This paper examines hand-controlled real-time vocal synthesis, known as Performative Vocal Synthesis (PVS), as an interaction technique for practicing L2 intonation in computer-aided pronunciation training (CAPT).

We evaluate a tablet-based interface where users gesturally control the pitch of a pre-recorded utterance by drawing curves on the touchscreen. Twenty-four subjects (12 French learners, 12 British controls) imitated English phrases with their voice and with the interface. Results of an acoustic analysis and an expert perceptual evaluation showed that, for the fall-rise intonation pattern that is typically difficult for francophones, learners' gestural imitations were more accurate than their vocal imitations, suggesting that PVS can help learners produce intonation patterns beyond the capabilities of their natural voice.

Link: https://doi.org/10.1145/3544548.3581465

A silent speech interface is a promising technology that enables private communication in natural language. However, previous approaches support only a small and inflexible vocabulary, which limits expressiveness. We leverage contrastive learning to learn efficient lipreading representations, enabling few-shot command customization with minimal user effort. Our model exhibits high robustness to different lighting, posture, and gesture conditions on an in-the-wild dataset. For 25-command classification, an F1-score of 0.8947 is achievable using only one shot, and performance can be further boosted by adaptively learning from more data. This generalizability allowed us to develop a mobile silent speech interface empowered with on-device fine-tuning and visual keyword spotting. A user study demonstrated that with LipLearner, users could define their own commands with high reliability guaranteed by an online incremental learning scheme. Subjective feedback indicated that our system provides essential functionalities for customizable silent speech interactions with high usability and learnability.
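
As an illustration of how few-shot command customization on top of a fixed embedding can work, here is a minimal sketch of a nearest-prototype classifier using cosine similarity. It is not the LipLearner implementation: the class `PrototypeClassifier`, its methods, and the toy 2-D vectors are assumptions made for illustration, and a real system would obtain its embeddings from a trained lipreading encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class PrototypeClassifier:
    """Hypothetical few-shot classifier: one mean embedding per command."""

    def __init__(self):
        self.shots = {}  # command name -> list of registered embeddings

    def add_shot(self, command, embedding):
        # Registering a command needs only one example; more can follow later.
        self.shots.setdefault(command, []).append(list(embedding))

    def prototype(self, command):
        # Component-wise mean of all examples registered for this command.
        vectors = self.shots[command]
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    def predict(self, embedding):
        # Pick the command whose prototype is most similar to the query.
        return max(self.shots, key=lambda c: cosine(self.prototype(c), embedding))


clf = PrototypeClassifier()
clf.add_shot("play", [1.0, 0.1])   # toy embeddings, not real lip features
clf.add_shot("stop", [0.1, 1.0])
print(clf.predict([0.9, 0.2]))
```

Calling `add_shot` again for an existing command moves its prototype toward the new example, which is one simple way to picture learning from more data over time.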

Link: https://doi.org/10.1145/3544548.3580775

We engineered LipIO, a novel device enabling the lips to be used simultaneously as an input and output surface. LipIO comprises two overlapping flexible electrode arrays: an outward-facing array for capacitive touch and a lip-facing array for electrotactile stimulation. While wearing LipIO, users feel the interface's state via lip stimulation and respond by touching their lip with their tongue or opposing lip. More importantly, LipIO provides co-located tactile feedback that allows users to feel where on the lip they are touching; this is key to enabling eyes- and hands-free interactions. Our three studies verified that participants perceived electrotactile output on their lips and subsequently touched the target location with their tongue with an average accuracy of 93% while wearing LipIO with five I/O electrodes and co-located feedback. Finally, we demonstrate the potential of LipIO in four exemplary applications that illustrate how it enables new types of eyes- and hands-free micro-interactions.

Link: https://doi.org/10.1145/3544548.3580724

We present a visuo-haptic crossmodal model of shape perception designed for shape-changing handheld controllers. The model uses the inertia tensor of an object to bridge the two senses. The model was constructed from the results of three perceptual experiments. In the first two experiments, we validate that the primary moment and product of inertia (MOI and POI) in the inertia tensor have critical effects on the haptic perception of object length and asymmetry. Then, from the results of the third experiment on crossmodal magnitude production, we estimate a haptic-to-visual shape matching model that uses MOI and POI as two link variables. Finally, we validate in a summative user study that the inverse of the shape matching model is effective for pairing a virtual object with a perceptually congruent haptic object, which is the functionality we need for shape-changing handheld interfaces to afford perceptually fulfilling sensory experiences in virtual reality.
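
For reference, the moments and products of inertia named in this abstract can be computed for a set of point masses with the standard rigid-body definition. The sketch below is a generic illustration, not the authors' model: the function name `inertia_tensor`, the point-mass approximation, and the choice of origin and axes are assumptions, and the abstract does not state the paper's own conventions.

```python
def inertia_tensor(points):
    """Return the 3x3 inertia tensor about the origin.

    points: iterable of (mass, x, y, z) tuples (hypothetical input format).
    Diagonal entries are moments of inertia; off-diagonal entries are the
    negated products of inertia, following the common sign convention.
    """
    ixx = iyy = izz = 0.0
    ixy = ixz = iyz = 0.0
    for m, x, y, z in points:
        ixx += m * (y * y + z * z)
        iyy += m * (x * x + z * z)
        izz += m * (x * x + y * y)
        ixy -= m * x * y
        ixz -= m * x * z
        iyz -= m * y * z
    return [
        [ixx, ixy, ixz],
        [ixy, iyy, iyz],
        [ixz, iyz, izz],
    ]


# A symmetric two-mass "rod" along the x-axis, then the same rod with an
# extra off-axis mass that makes the object asymmetric.
rod = [(1.0, 1.0, 0.0, 0.0), (1.0, -1.0, 0.0, 0.0)]
lopsided = rod + [(1.0, 1.0, 1.0, 0.0)]
print(inertia_tensor(rod))
print(inertia_tensor(lopsided))
```

In this toy case the symmetric rod has zero off-diagonal terms, while the added off-axis mass produces a non-zero xy term, which matches the intuition that products of inertia reflect asymmetry.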

Link: https://doi.org/10.1145/3544548.3580706

Recognizing whispered speech and converting it to normal speech creates many possibilities for speech interaction. Because the sound pressure of whispered speech is significantly lower than that of normal speech, it can be used as a semi-silent speech interaction in public places without being audible to others. Converting whispers to normal speech also improves the speech quality for people with speech or hearing impairments. However, conventional speech conversion techniques do not provide sufficient conversion quality or require speaker-dependent datasets consisting of pairs of whispered and normal speech utterances. To address these problems, we propose WESPER, a zero-shot, real-time whisper-to-normal speech conversion mechanism based on self-supervised learning.

WESPER consists of a speech-to-unit encoder, which generates hidden speech units common to both whispered and normal speech, and a unit-to-speech (UTS) decoder, which reconstructs speech from the encoded speech units. Unlike existing methods, this conversion is user-independent and does not require a paired dataset of whispered and normal speech. The UTS decoder can reconstruct speech in any target speaker's voice from speech units, and it requires only unlabeled speech data from the target speaker. We confirmed that the quality of the speech converted from a whisper was improved while preserving its natural prosody. Additionally, we confirmed the effectiveness of the proposed approach for speech reconstruction for people with speech or hearing disabilities.

Link: https://doi.org/10.1145/3544548.3580723

User interfaces in virtual reality enable diverse interactions within the virtual world, though they typically lack the haptic cues provided by physical interface controls. Haptic retargeting enables flexible mapping between dynamic virtual interfaces and physical controls to provide real haptic feedback. This investigation aims to extend these remapped interfaces to support more diverse control types. Many interfaces incorporate sliders, switches, and knobs. These controls hold fixed states between interactions, creating potential conflicts where a virtual control has a different state from the physical control. This paper presents two methods, "manual" and "automatic", for synchronizing physical and virtual control states and explores the effects of these methods on the usability of remapped interfaces. Results showed that interfaces without retargeting were the ideal configuration, but they lack the flexibility that remapped interfaces provide. Automatic synchronization was faster and more usable; however, manual synchronization is suitable for a broader range of physical interfaces.
