40. GUIs, Gaze, and Gesture-based Interaction

WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics
Description

Modeling user interfaces (UIs) from visual information allows systems to make inferences about the functionality and semantics needed to support use cases in accessibility, app automation, and testing. Current datasets for training machine learning models are limited in size due to the costly and time-consuming process of manually collecting and annotating UIs. We crawled the web to construct WebUI, a large dataset of 400,000 rendered web pages associated with automatically extracted metadata. We analyze the composition of WebUI and show that while automatically extracted data is noisy, most examples meet basic criteria for visual UI modeling. We applied several strategies for incorporating semantics found in web pages to increase the performance of visual UI understanding models in the mobile domain, where less labeled data is available: (i) element detection, (ii) screen classification and (iii) screen similarity.
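As a rough illustration of how automatically extracted web semantics could act as weak supervision for visual UI models such as element detection, the sketch below pairs a rendered screenshot with element bounding boxes and classes read from page metadata. The file layout, field names, and the WeaklyLabeledUIDataset class are assumptions for illustration, not the WebUI release format.

```python
# Hypothetical sketch: using crawled screenshots plus automatically
# extracted element metadata as weak labels for element detection.
# The file layout and field names are assumptions, not the WebUI format.
import json
from pathlib import Path

from PIL import Image
import torch
from torch.utils.data import Dataset


class WeaklyLabeledUIDataset(Dataset):
    """Pairs a rendered page screenshot with element boxes from metadata."""

    def __init__(self, root: str):
        self.samples = sorted(Path(root).glob("*.json"))

    def __len__(self) -> int:
        return len(self.samples)

    def __getitem__(self, idx: int):
        meta = json.loads(self.samples[idx].read_text())
        image = Image.open(self.samples[idx].with_suffix(".png")).convert("RGB")
        # Each element contributes a bounding box and a semantic class
        # (e.g. "button", "link", "image") derived from the page DOM.
        boxes = torch.tensor([e["bbox"] for e in meta["elements"]], dtype=torch.float)
        labels = torch.tensor([e["class_id"] for e in meta["elements"]], dtype=torch.long)
        return image, {"boxes": boxes, "labels": labels}
```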

WordGesture-GAN: Modeling Word-Gesture Movement with Generative Adversarial Network
Description

Word-gesture production models that can synthesize word-gestures are critical to the training and evaluation of word-gesture keyboard decoders. We propose WordGesture-GAN, a conditional generative adversarial network that takes arbitrary text as input to generate realistic word-gesture movements in both spatial (i.e., (x, y) coordinates of touch points) and temporal (i.e., timestamps of touch points) dimensions. WordGesture-GAN introduces a Variational Auto-Encoder to extract and embed variations of user-drawn gestures into a Gaussian distribution which can be sampled to control variation in generated gestures. Our experiments on a dataset with 38k gesture samples show that WordGesture-GAN outperforms existing gesture production models including the minimum jerk model [37] and the style-transfer GAN [31,32] in generating realistic gestures. Overall, our research demonstrates that the proposed GAN structure can learn variations in user-drawn gestures, and the resulting WordGesture-GAN can generate word-gesture movement and predict the distribution of gestures. WordGesture-GAN can serve as a valuable tool for designing and evaluating gestural input systems.
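The sketch below shows one plausible shape for such a conditional generator: a latent vector standing in for the VAE-encoded gesture-variation code is combined with a word condition and decoded into a fixed-length sequence of (x, y, t) touch points. The layer sizes, sequence length, and conditioning scheme are assumptions, not the published architecture.

```python
# Minimal sketch of a conditional gesture generator in the spirit of
# WordGesture-GAN: a Gaussian latent (the gesture-variation code) plus a
# word condition is decoded into (x, y, timestamp) touch points.
# Dimensions and the conditioning scheme are illustrative assumptions.
import torch
import torch.nn as nn


class GestureGenerator(nn.Module):
    def __init__(self, cond_dim: int = 64, latent_dim: int = 32, seq_len: int = 128):
        super().__init__()
        self.seq_len = seq_len
        self.decode = nn.Sequential(
            nn.Linear(cond_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.ReLU(),
            nn.Linear(512, seq_len * 3),  # x, y, and timestamp per point
        )

    def forward(self, cond: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        out = self.decode(torch.cat([cond, z], dim=-1))
        return out.view(-1, self.seq_len, 3)


# Sampling the latent from a standard Gaussian controls gesture variation.
cond = torch.randn(4, 64)               # word condition features (assumed)
z = torch.randn(4, 32)                  # variation code sampled from N(0, I)
gestures = GestureGenerator()(cond, z)  # shape: (4, 128, 3)
```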

UEyes: Understanding Visual Saliency across User Interface Types
Description

While user interfaces (UIs) display elements such as images and text in a grid-based layout, UI types differ significantly in the number of elements and how they are displayed. For example, webpage designs rely heavily on images and text, whereas desktop UIs tend to feature numerous small images. To examine how such differences affect the way users look at UIs, we collected and analyzed a large eye-tracking-based dataset, UEyes (62 participants and 1,980 UI screenshots), covering four major UI types: webpage, desktop UI, mobile UI, and poster. We analyze differences between UI types in biases related to factors such as color, location, and gaze direction. We also compare state-of-the-art predictive models and propose improvements to better capture typical tendencies across UI types. Both the dataset and the models are publicly available.
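A common preprocessing step when training or evaluating saliency predictors on such data is turning recorded fixation points into a continuous ground-truth saliency map by Gaussian blurring. The sketch below illustrates that step; the blur width and normalization are assumptions, not settings from the UEyes pipeline.

```python
# Sketch: convert recorded fixation points on a UI screenshot into a
# continuous ground-truth saliency map via Gaussian blurring.
# The sigma and normalization are illustrative, not UEyes settings.
import numpy as np
from scipy.ndimage import gaussian_filter


def fixation_map(fixations, height, width, sigma=25.0):
    """fixations: iterable of (x, y) pixel coordinates of recorded gaze."""
    fmap = np.zeros((height, width), dtype=np.float64)
    for x, y in fixations:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < height and 0 <= xi < width:
            fmap[yi, xi] += 1.0
    fmap = gaussian_filter(fmap, sigma=sigma)
    return fmap / fmap.max() if fmap.max() > 0 else fmap


saliency = fixation_map([(120, 80), (400, 300)], height=768, width=1024)
```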

Relative Design Acquisition: A Computational Approach for Creating Visual Interfaces to Steer User Choices
Description

A central objective in computational design is to find an optimal design that maximizes a performance metric. We explore a different problem class with a computational approach we call relative design acquisition. As a motivating example, consider a user prompted to make a choice using two buttons. One button may have a more visually appealing design, steering users to click it more often than the second button. In such a case, a design of a certain quality relative to a reference design is acquired in order to guide the user's decision. After mathematically formalizing this problem, we report the results of three experiments that demonstrate the approach's efficacy in generating relative designs in a visual interface preference setting. The relative designs are controllable by a quality factor, which affects both comparative ratings and human decision time between the reference and relative designs.
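One way to read this formulation is as a constrained search: rather than maximizing quality outright, the optimizer looks for design parameters whose predicted quality sits at a chosen offset from the reference design's quality. The sketch below illustrates that idea; predict_quality, the design parameterization, and the optimizer choice are hypothetical placeholders, not the paper's formulation.

```python
# Illustrative sketch of relative design acquisition: search for design
# parameters whose predicted quality differs from a reference design's
# quality by a chosen quality factor. All names are hypothetical.
import numpy as np
from scipy.optimize import minimize


def predict_quality(params: np.ndarray) -> float:
    """Stand-in for a learned model mapping design parameters to quality."""
    return float(-np.sum((params - 0.5) ** 2))


def acquire_relative_design(reference: np.ndarray, quality_factor: float) -> np.ndarray:
    """Find a design whose quality differs from the reference by quality_factor."""
    target = predict_quality(reference) + quality_factor

    def loss(params):
        return (predict_quality(params) - target) ** 2

    result = minimize(loss, x0=reference.copy(), method="Nelder-Mead")
    return result.x


relative = acquire_relative_design(np.array([0.2, 0.8]), quality_factor=-0.1)
```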

Predicting Gaze-based Target Selection in Augmented Reality Headsets based on Eye and Head Endpoint Distributions
Description

Target selection is a fundamental task in interactive Augmented Reality (AR) systems. Predicting the intended target of selection in such systems can provide users with a smooth, low-friction interaction experience. Our work aims to predict gaze-based target selection in AR headsets with eye and head endpoint distributions, which describe the probability distribution of eye and head 3D orientation when a user triggers a selection input. We first conducted a user study to collect users’ eye and head behavior in a gaze-based pointing selection task with two confirmation mechanisms (air tap and blinking). Based on the study results, we then built two models: a unimodal model using only eye endpoints and a multimodal model using both eye and head endpoints. Results from a second user study showed that the pointing accuracy is improved by approximately 32% after integrating our models into gaze-based selection techniques.
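The sketch below illustrates one way an endpoint-distribution model of this kind could be used at selection time: fit a 2D Gaussian to observed eye endpoints (offsets from the intended targets) and score candidate targets by the likelihood of the current gaze endpoint; a multimodal variant would additionally factor in a head-endpoint likelihood. The Gaussian assumption and all parameters here are illustrative, not the paper's fitted models.

```python
# Sketch of endpoint-distribution-based target prediction: fit a 2D
# Gaussian to eye endpoint offsets and pick the candidate target whose
# position makes the observed gaze endpoint most likely.
# The distributional form and numbers are illustrative assumptions.
import numpy as np
from scipy.stats import multivariate_normal


def fit_endpoint_model(offsets: np.ndarray):
    """offsets: (N, 2) eye endpoint offsets relative to intended targets."""
    return multivariate_normal(mean=offsets.mean(axis=0), cov=np.cov(offsets.T))


def predict_target(model, gaze_endpoint: np.ndarray, targets: np.ndarray) -> int:
    """Return the index of the most likely intended target."""
    scores = [model.pdf(gaze_endpoint - t) for t in targets]
    return int(np.argmax(scores))


offsets = np.random.default_rng(0).normal(scale=0.5, size=(200, 2))
model = fit_endpoint_model(offsets)
targets = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
print(predict_target(model, np.array([2.7, 0.2]), targets))  # most likely index 1
```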

Effective 2D Stroke-based Gesture Augmentation for RNNs
Description

Recurrent neural networks (RNNs) require large training datasets from which they learn new class models. This limitation prohibits their use in custom gesture applications where only one or two end-user samples are given per gesture class. One common way to enhance sparse datasets is to use data augmentation to synthesize new samples. Although there are numerous known techniques, they are often treated as standalone approaches when in reality they are complementary. We show that by intelligently chaining augmentation techniques together to simulate different types of gesture production variability, such as those affecting the temporal and spatial qualities of a gesture, we can significantly increase RNN accuracy without sacrificing training time. Through experimentation on four public 2D gesture datasets, we show that RNNs trained with our data augmentation chaining technique achieve state-of-the-art recognition accuracy in both writer-dependent and writer-independent test scenarios.
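The sketch below shows what chaining augmentations on a 2D stroke gesture can look like: each transform perturbs a different source of production variability (spatial jitter, uniform scaling, temporal resampling) and the transforms are composed to synthesize new training samples. The specific transforms and parameter ranges are illustrative choices, not the paper's exact chain.

```python
# Sketch of chained augmentation on a 2D stroke gesture: compose
# spatial and temporal perturbations to synthesize new samples from one
# user-provided stroke. Transforms and ranges are illustrative only.
import numpy as np

rng = np.random.default_rng(42)


def spatial_jitter(stroke, sigma=0.01):
    # Add small Gaussian noise to every (x, y) point.
    return stroke + rng.normal(scale=sigma, size=stroke.shape)


def uniform_scale(stroke, low=0.9, high=1.1):
    # Scale the whole stroke by a random factor.
    return stroke * rng.uniform(low, high)


def temporal_resample(stroke, n_points=64):
    # Resample the stroke at non-uniform time steps to vary pacing.
    t_old = np.linspace(0.0, 1.0, len(stroke))
    t_new = np.sort(rng.uniform(0.0, 1.0, n_points))
    return np.column_stack([np.interp(t_new, t_old, stroke[:, d]) for d in range(2)])


def augment(stroke):
    # Chain the transforms so each synthetic sample mixes variability types.
    for transform in (spatial_jitter, uniform_scale, temporal_resample):
        stroke = transform(stroke)
    return stroke


stroke = np.column_stack([np.linspace(0, 1, 32), np.sin(np.linspace(0, np.pi, 32))])
synthetic = augment(stroke)  # a new 64-point sample for RNN training
```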
