In multilingual conferences, translation support should not compromise non‑verbal cues or social interaction. Prior work shows that eXtended Reality (XR) subtitles aid comprehension, but it rarely examines translation latency. We conducted a study in a VR-simulated conference, testing latencies of 0, 1.5, 3, 4.5, and 6 seconds to measure overall comprehension and the attribution of verbal and non‑verbal information. Results showed that latencies beyond 3 seconds significantly increased subjective difficulty and reduced attribution accuracy, while shorter latencies showed no significant effects. Furthermore, participants noted that very low delay drew attention to the subtitles, reducing opportunities to observe the speaker. Guided by these insights, we designed and evaluated four VR subtitle interfaces: one traditional and three novel designs. Across delay conditions, Merged Subtitles improved opportunities to observe the speaker and yielded better emotion attribution and user experience than the other designs. We also propose design guidelines for XR subtitle interfaces tailored to different levels of translation latency.
ACM CHI Conference on Human Factors in Computing Systems