AppAgent: Multimodal Agents as Smartphone Users

Abstract

Recent advancements in large language models (LLMs) have led to the creation of intelligent agents capable of performing complex tasks. This paper introduces a novel LLM-based multimodal agent framework designed to operate smartphone applications. Our framework allows the agent to mimic human-like interactions such as tapping and swiping through a simplified action space, eliminating the need for system back-end access and enhancing its versatility across various apps. Central to the agent's functionality is an innovative in-context learning method, where it either autonomously explores or learns from human demonstrations, creating a knowledge base used to execute complex tasks across diverse applications. We conducted extensive testing with our agent on over 50 tasks spanning 10 applications, ranging from social media to sophisticated image editing tools. Additionally, a user study confirmed the agent's superior performance and practicality in handling a diverse array of high-level tasks, demonstrating its effectiveness in real-world settings. Our project page is available at \url{https://appagent-official.github.io/}.
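The "simplified action space" of human-like taps and swipes can be pictured with a short Python sketch. The class and method names below (ActionSpace, tap, swipe, input_text), the Element type, and the adb-style dispatch mentioned in the comments are illustrative assumptions rather than the paper's actual interface; the point is only that actions are grounded in on-screen elements instead of app back-end calls.

```python
from dataclasses import dataclass
from typing import Literal, Tuple


@dataclass
class Element:
    """A UI element detected on the current screen (e.g. from the accessibility tree)."""
    uid: str
    bounds: Tuple[int, int, int, int]  # left, top, right, bottom in screen pixels


def center(elem: Element) -> Tuple[int, int]:
    """Centre of an element's bounding box, used as the touch target."""
    left, top, right, bottom = elem.bounds
    return (left + right) // 2, (top + bottom) // 2


class ActionSpace:
    """Human-like gestures issued through the device's input layer only,
    so no app back-end or system API access is required."""

    def tap(self, elem: Element) -> None:
        x, y = center(elem)
        self._touch(x, y)

    def swipe(self, elem: Element,
              direction: Literal["up", "down", "left", "right"],
              distance: int = 300) -> None:
        x, y = center(elem)
        dx, dy = {"up": (0, -distance), "down": (0, distance),
                  "left": (-distance, 0), "right": (distance, 0)}[direction]
        self._drag(x, y, x + dx, y + dy)

    def input_text(self, text: str) -> None:
        self._keys(text)

    # Placeholder dispatch: on Android these could be forwarded to
    # `adb shell input tap/swipe/text`, but that is an assumption here.
    def _touch(self, x: int, y: int) -> None: ...
    def _drag(self, x1: int, y1: int, x2: int, y2: int) -> None: ...
    def _keys(self, text: str) -> None: ...
```

Keeping the interface this small is what lets the same agent operate many different apps: every action refers only to what is visible on screen, so nothing app-specific has to be wired in.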

Authors
Chi Zhang
Westlake University, Hangzhou, Zhejiang, China
Zhao Yang
Shanghai Supwisdom Information Technology Co., Ltd., Shanghai, China
Jiaxuan Liu
Tencent, Shanghai, China
Yanda Li
University of Technology Sydney, Sydney, NSW, Australia
Yucheng Han
Nanyang Technological University, Singapore
Xin Chen
ShanghaiTech University, Shanghai, China
Zebiao Huang
Tencent, Shanghai, China
Bin Fu
Tencent, Shanghai, China
Gang Yu
StepFun, Shanghai, China
DOI

10.1145/3706598.3713600

Paper URL

https://dl.acm.org/doi/10.1145/3706598.3713600


Conference: CHI 2025

The ACM CHI Conference on Human Factors in Computing Systems (https://chi2025.acm.org/)

Session: Auditory UI

G402
7 presentations
2025-04-28 20:10:00 – 2025-04-28 21:40:00