The recent surge in artificial intelligence, particularly in multimodal processing technology, has advanced human-computer interaction by altering how intelligent systems perceive, understand, and respond to contextual information (i.e., context awareness). Despite these advancements, there is a significant gap in comprehensive reviews examining them, especially from a multimodal data perspective, which is crucial for refining system design. This paper addresses a key aspect of this gap by conducting a systematic survey of data modality-driven Vision-based Multimodal Interfaces (VMIs). VMIs are essential for integrating multimodal data, enabling more precise interpretation of user intentions and of complex interactions across physical and digital environments. Unlike previous task- or scenario-driven surveys, this study highlights the critical role of the visual modality in processing contextual information and in facilitating multimodal interaction. Adopting a design framework that moves from the whole to the details and back, it classifies VMIs across dimensions, providing insights for developing effective, context-aware systems.
https://dl.acm.org/doi/10.1145/3706598.3714161
The ACM CHI Conference on Human Factors in Computing Systems (https://chi2025.acm.org/)