Social intelligence is vital for effective human-AI interaction. While LLMs demonstrate strong text-based social intelligence, the vision modality remains challenging because of non-verbal social cues. For example, gaze is the primary conveyor of social attention, yet multimodal LLMs (MLLMs) cannot accurately perceive or interpret it. We therefore propose GazeCoT, a pipeline that uses gaze estimation models to supply MLLMs with the attention of people in images or videos. The gaze information is provided as visual and text prompts compiled into a structured context that supports MLLM social reasoning. Benchmark evaluations confirm that GazeCoT enhances MLLMs’ social intelligence by improving gaze perception. A user study in a challenging application involving parent-child interactions shows that GazeCoT improves perceived explainability and trustworthiness by aligning MLLM social perception and social reasoning with human norms. We hope that GazeCoT, a versatile plug-and-play pipeline, will enable socially aware, MLLM-based HCI applications.
ACM CHI Conference on Human Factors in Computing Systems