GUIs, Gaze, and Gesture-based Interaction

Conference Name
CHI 2023
WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics
Abstract

Modeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages associated with automatically extracted metadata. We analyze the composition of WebUI and show that while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We applied several strategies for incorporating the semantics found in web pages to improve visual UI understanding models in the mobile domain, where less labeled data is available, across three tasks: (i) element detection, (ii) screen classification, and (iii) screen similarity.
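
As an illustration of how crawled web metadata can act as weak supervision for visual element detection, the following minimal Python sketch pairs a rendered screenshot with element bounding boxes taken from page metadata. The JSON layout and field names (screenshot, elements, bbox, tag, role) and the filtering rule are hypothetical, chosen only for illustration; they are not the actual WebUI format.

```python
# Minimal sketch (not the authors' pipeline): turning automatically crawled
# web metadata into weak labels for visual element detection.
import json
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class DetectionExample:
    image_path: str
    boxes: List[Tuple[float, float, float, float]]  # (x, y, w, h) in pixels
    labels: List[str]                                # e.g. HTML tag or ARIA role

def load_weakly_labeled_page(metadata_path: str) -> DetectionExample:
    """Pair a rendered screenshot with element boxes extracted from the DOM."""
    with open(metadata_path) as f:
        meta = json.load(f)  # hypothetical per-page metadata file
    boxes, labels = [], []
    for el in meta["elements"]:
        x, y, w, h = el["bbox"]
        if w <= 0 or h <= 0:          # skip invisible / zero-area nodes (noise)
            continue
        boxes.append((x, y, w, h))
        labels.append(el.get("role") or el["tag"])
    return DetectionExample(meta["screenshot"], boxes, labels)
```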

Award
Honorable Mention
Authors
Jason Wu
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Siyan Wang
Wellesley College, Wellesley, Massachusetts, United States
Siman Shen
Grinnell College, Grinnell, Iowa, United States
Yi-Hao Peng
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Jeffrey Nichols
Snooty Bird LLC, San Diego, California, United States
Jeffrey P. Bigham
Carnegie Mellon University, Pittsburgh, Pennsylvania, United States
Paper URL

https://doi.org/10.1145/3544548.3581158

Video
WordGesture-GAN: Modeling Word-Gesture Movement with Generative Adversarial Network
Abstract

Word-gesture production models that can synthesize word-gestures are critical to the training and evaluation of word-gesture keyboard decoders. We propose WordGesture-GAN, a conditional generative adversarial network that takes arbitrary text as input to generate realistic word-gesture movements in both the spatial (i.e., (x, y) coordinates of touch points) and temporal (i.e., timestamps of touch points) dimensions. WordGesture-GAN introduces a Variational Auto-Encoder to extract and embed variations of user-drawn gestures into a Gaussian distribution, which can be sampled to control variation in generated gestures. Our experiments on a dataset with 38k gesture samples show that WordGesture-GAN outperforms existing gesture production models, including the minimum jerk model [37] and the style-transfer GAN [31,32], in generating realistic gestures. Overall, our research demonstrates that the proposed GAN structure can learn variations in user-drawn gestures, and the resulting WordGesture-GAN can generate word-gesture movement and predict the distribution of gestures. WordGesture-GAN can serve as a valuable tool for designing and evaluating gestural input systems.
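
To make the conditional-generation setup concrete, here is a minimal PyTorch sketch of a generator that consumes a word's key-center prototype path together with a latent vector sampled from a Gaussian; sampling different latent vectors yields different plausible gestures for the same word. The architecture, layer sizes, and names are illustrative assumptions, not the WordGesture-GAN model.

```python
# Illustrative sketch only, not the paper's architecture: a conditional
# generator mapping a word's key-center prototype path plus a latent vector
# z ~ N(0, I) to per-point (x, y, t) gesture coordinates.
import torch
import torch.nn as nn

class GestureGenerator(nn.Module):
    def __init__(self, latent_dim=32, hidden_dim=128):
        super().__init__()
        # Input per time step: prototype (x, y) + latent code broadcast over time.
        self.rnn = nn.GRU(2 + latent_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 3)  # predicts (x, y, t) per point

    def forward(self, prototype, z):
        # prototype: (batch, seq_len, 2); z: (batch, latent_dim)
        seq_len = prototype.size(1)
        z_seq = z.unsqueeze(1).expand(-1, seq_len, -1)
        h, _ = self.rnn(torch.cat([prototype, z_seq], dim=-1))
        return self.head(h)

# Different z samples produce different gestures for the same word prototype.
gen = GestureGenerator()
proto = torch.rand(1, 50, 2)              # hypothetical key-center path
gesture = gen(proto, torch.randn(1, 32))  # shape (1, 50, 3): x, y, timestamp
```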

Award
Honorable Mention
Authors
Jeremy Chu
Stony Brook University, Stony Brook, New York, United States
Dongsheng An
Stony Brook University, Stony Brook, New York, United States
Yan Ma
Stony Brook University, Stony Brook, New York, United States
Wenzhe Cui
Stony Brook University, Stony Brook, New York, United States
Shumin Zhai
Google, Mountain View, California, United States
Xianfeng David Gu
Stony Brook University, Stony Brook, New York, United States
Xiaojun Bi
Stony Brook University, Stony Brook, New York, United States
Paper URL

https://doi.org/10.1145/3544548.3581279

Video
UEyes: Understanding Visual Saliency across User Interface Types
Abstract

While user interfaces (UIs) display elements such as images and text in a grid-based layout, UI types differ significantly in the number of elements and how they are displayed. For example, webpage designs rely heavily on images and text, whereas desktop UIs tend to feature numerous small images. To examine how such differences affect the way users look at UIs, we collected and analyzed a large eye-tracking-based dataset, UEyes (62 participants and 1,980 UI screenshots), covering four major UI types: webpage, desktop UI, mobile UI, and poster. We analyze differences across UI types in biases related to factors such as color, location, and gaze direction. We also compare state-of-the-art predictive models and propose improvements for better capturing typical tendencies across UI types. Both the dataset and the models are publicly available.
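
As a concrete example of the kind of preprocessing commonly used when comparing saliency models against eye-tracking data, the sketch below converts fixation points on a UI screenshot into a continuous ground-truth saliency map via Gaussian blurring. The function and its parameters are illustrative assumptions, not the UEyes pipeline.

```python
# A common step when evaluating saliency predictors (a sketch, not UEyes'
# exact pipeline): turn discrete fixation points into a blurred saliency map.
import numpy as np
from scipy.ndimage import gaussian_filter

def fixations_to_saliency(fixations, height, width, sigma=25.0):
    """fixations: iterable of (x, y) pixel coordinates of gaze fixations."""
    fmap = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            fmap[yi, xi] += 1.0          # accumulate fixation counts per pixel
    smap = gaussian_filter(fmap, sigma=sigma)
    return smap / smap.max() if smap.max() > 0 else smap
```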

Authors
Yue Jiang
Aalto University, Espoo, Finland
Luis A. Leiva
University of Luxembourg, Esch-sur-Alzette, Luxembourg
Hamed Rezazadegan Tavakoli
Nokia Technologies, Espoo, Finland
Paul R. B. Houssel
University of Luxembourg, Esch-sur-Alzette, Luxembourg
Julia Kylmälä
Aalto University, Espoo, Finland
Antti Oulasvirta
Aalto University, Helsinki, Finland
Paper URL

https://doi.org/10.1145/3544548.3581096

Video
Relative Design Acquisition: A Computational Approach for Creating Visual Interfaces to Steer User Choices
Abstract

A central objective in computational design is to find an optimal design that maximizes a performance metric. We explore a different problem class with a computational approach we call relative design acquisition. As a motivational example, consider a user prompted to make a choice using buttons. One button may have a more visually appealing design and is therefore more likely to steer users to click it than the second button. In such a case, a relative design of a certain quality with respect to a reference design is acquired to guide a user decision. After mathematically formalizing this problem, we report the results of three experiments that demonstrate the approach’s efficacy in generating relative designs in a visual interface preference setting. The relative designs are controllable by a quality factor, which affects both comparative ratings and human decision time between the reference and relative designs.
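
To make the notion of acquiring a design of a chosen quality relative to a reference more concrete, here is a toy Python sketch that searches a design space for a candidate whose predicted quality sits at a fraction q of the reference's quality. The scoring function, design space, and random-search procedure are all illustrative assumptions, not the paper's formulation.

```python
# Toy sketch of the relative-design idea; the objective, score function,
# and search procedure are illustrative, not the paper's method.
import random

def acquire_relative_design(reference, score, sample_design, q, n_samples=2000):
    """Search designs so that score(candidate) is approximately q * score(reference)."""
    target = q * score(reference)
    best, best_gap = None, float("inf")
    for _ in range(n_samples):
        candidate = sample_design()
        gap = abs(score(candidate) - target)
        if gap < best_gap:
            best, best_gap = candidate, gap
    return best

# Example with a hypothetical one-parameter "button contrast" design space.
score = lambda d: d["contrast"]                      # stand-in quality model
sample = lambda: {"contrast": random.uniform(0, 1)}
reference = {"contrast": 0.9}
worse_button = acquire_relative_design(reference, score, sample, q=0.5)
```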

Authors
George B. Mo
University of Cambridge, Cambridge, United Kingdom
Per Ola Kristensson
University of Cambridge, Cambridge, United Kingdom
Paper URL

https://doi.org/10.1145/3544548.3581028

Video
Predicting Gaze-based Target Selection in Augmented Reality Headsets based on Eye and Head Endpoint Distributions
Abstract

Target selection is a fundamental task in interactive Augmented Reality (AR) systems. Predicting the intended target of selection in such systems can provide users with a smooth, low-friction interaction experience. Our work aims to predict gaze-based target selection in AR headsets using eye and head endpoint distributions, which describe the probability distributions of eye and head 3D orientations when a user triggers a selection input. We first conducted a user study to collect users’ eye and head behavior in a gaze-based pointing selection task with two confirmation mechanisms (air tap and blinking). Based on the study results, we then built two models: a unimodal model using only eye endpoints and a multimodal model using both eye and head endpoints. Results from a second user study showed that pointing accuracy improved by approximately 32% after integrating our models into gaze-based selection techniques.
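
The sketch below illustrates one way endpoint distributions can drive target prediction: each candidate target is scored by the likelihood of the observed eye and head endpoints under per-target Gaussian endpoint distributions, and the most likely target is selected. The covariances, coordinate conventions, and fusion by multiplication are illustrative assumptions rather than the paper's fitted models.

```python
# Minimal sketch (parameters invented for illustration) of endpoint-
# distribution-based target prediction in a gaze-pointing task.
import numpy as np
from scipy.stats import multivariate_normal

def predict_target(eye_endpoint, head_endpoint, targets,
                   eye_cov=np.diag([1.0, 1.0]), head_cov=np.diag([4.0, 4.0])):
    """Endpoints and targets are 2D angular coordinates (degrees), one target per row."""
    scores = []
    for t in targets:
        p_eye = multivariate_normal(mean=t, cov=eye_cov).pdf(eye_endpoint)
        p_head = multivariate_normal(mean=t, cov=head_cov).pdf(head_endpoint)
        scores.append(p_eye * p_head)   # multimodal fusion; drop p_head for unimodal
    return int(np.argmax(scores))       # index of the most likely target

targets = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
print(predict_target(np.array([4.4, 0.3]), np.array([3.0, 1.0]), targets))
```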

Authors
Yushi Wei
Xi'an Jiaotong-Liverpool University, Suzhou, China
Rongkai Shi
Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
Difeng Yu
University of Melbourne, Melbourne, Victoria, Australia
Yihong Wang
Xi'an Jiaotong-Liverpool University, Suzhou, China
Yue Li
Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
Lingyun Yu
Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
Hai-Ning Liang
Xi'an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
Paper URL

https://doi.org/10.1145/3544548.3581042

Video
Effective 2D Stroke-based Gesture Augmentation for RNNs
Abstract

Recurrent neural networks (RNNs) require large training datasets from which they learn new class models. This limitation prohibits their use in custom gesture applications where only one or two end-user samples are given per gesture class. One common way to enhance sparse datasets is to use data augmentation to synthesize new samples. Although there are numerous known techniques, they are often treated as standalone approaches when in reality they are often complementary. We show that by intelligently chaining together augmentation techniques that simulate different types of gesture production variability, such as those affecting the temporal and spatial qualities of a gesture, we can significantly increase RNN accuracy without sacrificing training time. Through experimentation on four public 2D gesture datasets, we show that RNNs trained with our data augmentation chaining technique achieve state-of-the-art recognition accuracy in both writer-dependent and writer-independent test scenarios.
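
As an illustration of chaining complementary spatial and temporal augmentations on a 2D stroke gesture, the sketch below composes rotation, scaling, positional jitter, and temporal resampling into a single synthetic sample. The specific transforms and parameter ranges are illustrative assumptions, not the paper's exact chain.

```python
# Sketch of chaining augmentations on a 2D stroke gesture stored as an
# (n, 2) array of x/y points; transforms and ranges are illustrative.
import numpy as np

def rotate(g, max_deg=10):
    a = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    R = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
    c = g.mean(axis=0)
    return (g - c) @ R.T + c            # rotate about the gesture centroid

def scale(g, lo=0.9, hi=1.1):
    return g * np.random.uniform(lo, hi, size=2)   # per-axis spatial scaling

def jitter(g, sigma=0.01):
    return g + np.random.normal(0, sigma, size=g.shape)  # point-wise noise

def resample_time(g, min_frac=0.8, max_frac=1.2):
    n_new = max(2, int(len(g) * np.random.uniform(min_frac, max_frac)))
    old_t = np.linspace(0, 1, len(g))
    new_t = np.linspace(0, 1, n_new)
    return np.stack([np.interp(new_t, old_t, g[:, d]) for d in range(2)], axis=1)

def augment(gesture, chain=(rotate, scale, jitter, resample_time)):
    for f in chain:                      # apply the whole chain to one sample
        gesture = f(gesture)
    return gesture
```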

Authors
Mykola Maslych
University of Central Florida, Orlando, Florida, United States
Eugene Matthew Taranta
University of Central Florida, Orlando, Florida, United States
Mostafa Aldilati
University of Central Florida, Orlando, Florida, United States
Joseph LaViola
University of Central Florida, Orlando, Florida, United States
Paper URL

https://doi.org/10.1145/3544548.3581358

Video