GazeCoT: Unleashing Social Intelligence in Multimodal LLMs With Gaze-Informed Chain-of-Thought Reasoning

Abstract

Social intelligence is vital for effective human-AI interaction. While LLMs demonstrate strong text-based social intelligence, the vision modality remains challenging because of non-verbal social cues. For example, gaze is the primary conveyor of social attention, yet multimodal LLMs (MLLMs) cannot accurately perceive or understand it. We therefore propose GazeCoT, a pipeline that uses gaze estimation models to provide MLLMs with the attention of people in images or videos. The gaze information is supplied as visual and text prompts compiled into a structured context to support MLLM social reasoning. Benchmark evaluation confirms that GazeCoT enhances MLLMs' social intelligence by improving gaze perception. A user study in a challenging application involving parent-child interactions demonstrates that GazeCoT improves perceived explainability and trustworthiness by aligning MLLM social perception and social reasoning with human norms. We hope that GazeCoT, a versatile plug-and-play pipeline, can enable socially aware, MLLM-based HCI applications.

Authors
Zhoutong Ye
Tsinghua University, Beijing, China
Xutong Wang
Tsinghua University, Beijing, China
Chengwen Zhang
Tsinghua University, Beijing, China
Ruiwen Zhang
Tsinghua University, Beijing, China
Mingze Sun
Tsinghua University, Beijing, China
Qinwei Li
Tsinghua University, Beijing, China
Chun Yu
Tsinghua University, Beijing, China
Yuanchun Shi
Tsinghua University, Beijing, China

Conference: CHI 2026

ACM CHI Conference on Human Factors in Computing Systems

Session: Explaining and Evaluating AI Systems

Area 1 + 2 + 3: theatre
7 presentations
2026-04-16, 20:15–21:45