Researchers have demonstrated that Automatic Speech Recognition (ASR) systems perform differently across demographic groups. In this work, we examined how subtitle errors affect evaluations of speakers and their content using a preregistered online experiment (N=207, U.S.-based crowdworkers). Participants watched speakers with various accents deliver a talk accompanied by subtitles that were either accurate or error-prone. Our results indicate that error-prone subtitles consistently reduce both speaker and content evaluations, for all speakers. We did not observe a disparate impact across accent groups when controlling for subtitle quality. Taken together, however, the findings of this short paper imply that speakers with accents for which ASR systems perform poorly are likely to be further penalized through lower viewer evaluations.
Typical success-rate prediction models for tapping exclude targets near screen edges; however, design constraints often force such placements. Additionally, in scrollable UIs any element can move close to an edge. In this work, we model how target--edge distance affects 1D touch pointing accuracy. We propose the Skewed Dual Normal Distribution Model, which assumes the tap coordinate distribution is skewed by a nearby edge. The results of two smartphone experiments showed that, as targets approached the edge, the distribution's peak shifted toward the edge and its tail extended away. In contrast to prior reports, the success rate improved when the target touched the edge, suggesting a strategy of ``tapping the target together with the edge.'' By accounting for skew, our model predicts success rates across a wide range of conditions, including edge-adjacent targets, thus extending coverage to the whole screen and informing UI design support tools.
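The edge-skew idea can be made concrete with a small numerical sketch. The snippet below is not the paper's Skewed Dual Normal Distribution Model; it uses SciPy's skew-normal as a stand-in for a skewed tap-coordinate distribution and integrates it over the target extent, with every parameter value an illustrative assumption rather than a fitted estimate.

```python
# Illustrative sketch only: predicting 1D tap success rate by integrating a
# skewed tap-coordinate distribution over the target extent near a screen edge.
# scipy.stats.skewnorm stands in for the paper's Skewed Dual Normal Distribution
# Model; all parameter values below are assumptions, not fitted estimates.
from scipy.stats import skewnorm

def predicted_success_rate(center_mm, height_mm, skew=4.0, scale_mm=1.8):
    """Probability that a tap lands inside a target.

    center_mm : distance of the target center from the screen edge (edge = 0)
    skew > 0  : longer tail away from the edge, as reported for edge-near targets
    """
    dist = skewnorm(skew, loc=center_mm, scale=scale_mm)
    top = center_mm + height_mm / 2.0
    bottom = max(center_mm - height_mm / 2.0, 0.0)  # the edge clips the target
    return dist.cdf(top) - dist.cdf(bottom)

# Example: a 4 mm target whose center sits 2 mm from the edge
print(f"{predicted_success_rate(2.0, 4.0):.2f}")
```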
We introduce a multimodal dataset and experimental setup designed to support the development of adaptive collaborative systems. Data were collected from distributed teams working simultaneously across two continents, demonstrating the feasibility of sensing team cognition in geographically dispersed settings. The dataset includes synchronized EEG, audio transcripts, screen recordings, and behavioral annotations, enabling fine-grained analysis of collaboration in naturalistic settings. Our setup integrates neural and behavioral sensing to model team processes, using metrics such as task engagement, neural synchrony, and interaction patterns. These analyses reveal relationships between cognitive states and team dynamics, suggesting new directions for brain-computer interfaces that respond to team-level signals. By providing a shareable dataset, robust sensing infrastructure, and techniques for modeling distributed collaboration, this work enables future interactive systems that sense and support distributed teamwork in real time.
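One concrete example of the kind of team-level signal such a dataset supports is an inter-brain synchrony measure. The sketch below computes a phase-locking value between two EEG channels; the band limits, sampling rate, and choice of metric are illustrative assumptions, not the dataset's actual analysis pipeline.

```python
# Minimal sketch of one way to quantify inter-brain neural synchrony between
# two team members' EEG channels: the phase-locking value (PLV) in a chosen
# frequency band. Band limits, sampling rate, and channel choice are assumed
# for illustration and do not reflect the dataset's actual pipeline.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_locking_value(x, y, fs=256.0, band=(8.0, 12.0)):
    """PLV between two equally long EEG signals, e.g., in the alpha band (8-12 Hz)."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    phase_x = np.angle(hilbert(filtfilt(b, a, x)))
    phase_y = np.angle(hilbert(filtfilt(b, a, y)))
    return np.abs(np.mean(np.exp(1j * (phase_x - phase_y))))

# Example with synthetic signals: values near 1 indicate strong phase coupling
t = np.arange(0, 10, 1 / 256.0)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)
y = np.sin(2 * np.pi * 10 * t + 0.3) + 0.5 * np.random.randn(t.size)
print(f"PLV: {phase_locking_value(x, y):.2f}")
```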
Transcripts displayed on dictation interfaces can be hard to read due to recognition errors and disfluencies. LLM-based text auto-correction could help, but changing the text during production could cause distraction and unintended phrasing. To understand how to balance readability, attention, and accuracy, we conducted an eye-tracking experiment with 20 participants comparing five dictation interfaces: PLAIN (real-time transcription), AOC (periodic corrections), RAKE (keyword highlights), GP-TSM (grammar-preserving highlights), and SUMMARY (LLM-generated abstractive summary). Analyzing participants’ gaze patterns during speech composition and review, we found that during composition, participants spent only 7%-11% of their time in active reading regardless of the interface. Although SUMMARY introduced unfamiliar words and phrasing during composition, it was easier to read and preferred by participants. Our findings suggest a high user tolerance for altering spoken words in LLM-enabled dictation interfaces.
Repetitive indoor layouts frequently cause spatial disorientation. Current navigation systems typically intervene reactively with generic instructions, lacking insight into the environmental root causes or the user’s cognitive state. To address this problem, we conducted a VR experiment (N=40) systematically manipulating geometric symmetry and feature similarity while capturing multimodal behaviors. Results reveal a functional separation: geometric symmetry primarily drives exploratory body rotation, whereas feature similarity determines navigation outcomes. Critically, simultaneous failure of both cues triggers a performance collapse, increasing mean hesitation duration by 370%, and forces users to switch from active reorientation (scanning via body rotation) to locomotor compensation (e.g., wall-following) based on a dynamic cost-benefit trade-off. Leveraging these patterns, our CNN-BiLSTM model detects the behaviorally defined getting-lost state with >90% agreement with heuristic labels. We contribute design principles for systems with dual context awareness: by integrating environment context (geometric or featural ambiguity) and user context (cognitive state), such systems can deploy content-adaptive aids, specifically orienting or discriminating aids, to dynamically balance navigation efficiency with active spatial cognition.
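As a rough illustration of the detection component, the sketch below shows a generic CNN-BiLSTM classifier over windows of behavioral features. The paper's actual feature set, window length, and layer sizes are not reproduced here; all hyperparameters in the snippet are assumptions chosen only to make the example runnable.

```python
# Minimal PyTorch sketch of a CNN-BiLSTM classifier over multimodal behavior
# sequences (e.g., head yaw, velocity, and gaze features per time step).
# Feature count, window length, and layer sizes are assumptions for illustration;
# this is not the paper's exact architecture.
import torch
import torch.nn as nn

class CNNBiLSTM(nn.Module):
    def __init__(self, n_features=8, cnn_channels=32, lstm_hidden=64, n_classes=2):
        super().__init__()
        # 1D convolution over time extracts short-range motion patterns
        self.cnn = nn.Sequential(
            nn.Conv1d(n_features, cnn_channels, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # Bidirectional LSTM captures longer-range temporal context
        self.lstm = nn.LSTM(cnn_channels, lstm_hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, features)
        z = self.cnn(x.transpose(1, 2))   # -> (batch, channels, time/2)
        z, _ = self.lstm(z.transpose(1, 2))
        return self.head(z[:, -1, :])     # logits: "lost" vs. "not lost"

# Example: a batch of 4 windows, 120 time steps, 8 behavioral features
logits = CNNBiLSTM()(torch.randn(4, 120, 8))
```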
Colour naming links vision and language, yet effective cross-linguistic colour communication is limited by the lack of multilingual data and computational models for comprehensive colour name translation. We collected 6,408 unique colour-naming responses in five languages through online experiments and fieldwork. For each language, we train a "spin colour forest", a novel model of partially rotated decision trees that accurately estimates colour-naming distributions across the full gamut, consistently outperforming existing methods. Unlike prior work that assumed 11 universal colour categories, our results reveal cross-linguistic variation in naming granularity: to categorise the same perceptually uniform colour space, American English uses 47 indispensable colour names, British English 32, French 27, Greek 32, and Himba 7. Building on these findings, we develop a colour translation benchmark, which we demonstrate by evaluating both the lexical and perceptual accuracy of a large language model. Our evaluation reveals a critical lexical-perceptual disconnect, demonstrating that language models lack perceptual grounding in colour translation. Our data, models, and benchmark provide an empirical foundation for inclusive design that reflects how people communicate colour across cultures.
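For intuition about tree ensembles over rotated colour coordinates, the sketch below fits decision trees on randomly rotated copies of CIELAB values and averages their per-name probabilities. It conveys the general idea only; it is not the paper's "spin colour forest" (which uses partially rotated trees), and every setting shown is an assumption.

```python
# Illustrative sketch only: an ensemble of decision trees fit on randomly
# rotated copies of CIELAB colour coordinates, with per-tree class probabilities
# averaged to estimate a colour-naming distribution. This is NOT the paper's
# "spin colour forest" (partially rotated trees); all settings are assumed.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_rotation(rng):
    # Random 3x3 orthogonal matrix via QR decomposition of a Gaussian matrix
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    return q * np.sign(np.diag(r))

class RotatedColourTrees:
    def __init__(self, n_trees=25, seed=0):
        self.rng = np.random.default_rng(seed)
        self.n_trees = n_trees

    def fit(self, lab, names):
        # lab: (n, 3) CIELAB values; names: colour-name label per sample
        self.models = []
        for _ in range(self.n_trees):
            rot = random_rotation(self.rng)
            tree = DecisionTreeClassifier(min_samples_leaf=5)
            tree.fit(lab @ rot.T, names)
            self.models.append((rot, tree))
        self.classes_ = self.models[0][1].classes_
        return self

    def naming_distribution(self, lab):
        # Average probability of each colour name across the rotated trees
        return np.mean([tree.predict_proba(lab @ rot.T)
                        for rot, tree in self.models], axis=0)
```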