Engineering Interactive Applications

https://doi.org/10.1145/3411764.3445131

Directly manipulating the timeline, such as scrubbing for thumbnails, is the standard way of controlling how-to videos. However, when how-to videos involve physical activities, people must inconveniently alternate between controlling the video and performing the tasks. Adopting a voice user interface (VUI) allows people to control the video by voice while performing the tasks with their hands. However, naively translating timeline manipulation into a VUI results in temporal referencing (e.g., "rewind 20 seconds"), which requires a different mental model for navigation and thereby limits users' ability to peek into the content. We present RubySlippers, a system that supports efficient content-based voice navigation through keyword-based queries. Our computational pipeline automatically detects referenceable elements in the video and finds the video segmentation that minimizes the number of navigational commands needed. Our evaluation (N=12) shows that participants could perform three representative navigation tasks with fewer commands and less frustration using RubySlippers than with a conventional voice-enabled video interface.

KAIST, Daejeon, Republic of Korea

10.1145/3411764.3445131
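The keyword-based navigation described in the RubySlippers abstract above can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Segment` type, the `find_segments` helper, and the sample segments are hypothetical, and the sketch assumes the video has already been split into segments annotated with lowercase keywords.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical video segment: start time in seconds plus its keywords."""
    start: float
    keywords: frozenset


def find_segments(segments, query):
    """Return start times of segments whose keywords contain every query word."""
    words = {w.lower() for w in query.split()}
    return [s.start for s in segments if words <= s.keywords]


# Made-up segmentation of a cooking video, for illustration only.
segments = [
    Segment(0.0, frozenset({"intro"})),
    Segment(42.0, frozenset({"chop", "onion"})),
    Segment(95.5, frozenset({"fry", "onion", "pan"})),
]

matches = find_segments(segments, "fry onion")
```

A query such as "onion" matches two segments here, while "fry onion" narrows the result to one. How the segmentation is chosen decides how many commands a viewer needs to reach a target, and that is the quantity the abstract says the pipeline minimizes.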

https://doi.org/10.1145/3411764.3445162

Learning musical instruments using online instructional videos has become increasingly prevalent. However, pre-recorded videos lack the instantaneous feedback and personal tailoring that human tutors provide. In addition, existing video navigation is not optimized for instrument learning, which encumbers the learning experience. Guided by our formative interviews with guitar players and prior literature, we designed Soloist, a mixed-initiative learning framework that automatically generates customizable curriculums from off-the-shelf guitar video lessons. Soloist takes raw videos as input and leverages deep-learning-based audio processing to extract musical information. This back-end processing is used to provide an interactive visualization to support effective video navigation and real-time feedback on the user’s performance, creating a guided learning experience. We demonstrate the capabilities and specific use-cases of Soloist within the domain of learning electric guitar solos using instructional YouTube videos. A remote user study, conducted to gather feedback from guitar players, shows encouraging results as the users unanimously preferred learning with Soloist over unconverted instructional videos.

University of Toronto, Toronto, Ontario, Canada

10.1145/3411764.3445162

https://doi.org/10.1145/3411764.3445339

Explicitly alerting users is not always an optimal intervention, especially when they are not motivated to obey. For example, in video-based learning, learners who are distracted from the video would not follow an alert asking them to pay attention. Inspired by the concept of Mindless Computing, we propose a novel intervention approach, Mindless Attractor, that leverages the nature of human speech communication to help learners refocus their attention without relying on their motivation. Specifically, it perturbs the voice in the video to direct their attention without consuming their conscious awareness. Our experiments not only confirmed the validity of the proposed approach but also emphasized its advantages in combination with a machine learning-based sensing module. Namely, it would not frustrate users even when the intervention is activated by a false-positive detection of their attentive state. Our intervention approach can be a reliable way to induce behavioral change in human-AI symbiosis.

The University of Tokyo, Hongo, Japan

University of Tsukuba, Tsukuba, Japan

10.1145/3411764.3445339

https://doi.org/10.1145/3411764.3445227

The need to find or construct tables arises routinely to accomplish many tasks in everyday life, as a table is a common format for organizing data. However, when relevant data is found on the web, it is often scattered across multiple tables on different web pages, requiring tedious manual searching and copy-pasting to collect data. We propose KTabulator, an interactive system to effectively extract, build, or extend ad hoc tables from large corpora, by leveraging their computerized structures in the form of knowledge graphs. We developed and evaluated KTabulator using Wikipedia and its knowledge graph DBpedia as our testbed. Starting from an entity or an existing table, KTabulator allows users to extend their tables by finding relevant entities, their properties, and other relevant tables, while providing meaningful suggestions and guidance. The results of a user study indicate the usefulness and efficiency of KTabulator in ad hoc table creation.

University of Waterloo, Waterloo, Ontario, Canada

10.1145/3411764.3445227

https://doi.org/10.1145/3411764.3445786

Toolkits for shape-changing interfaces (SCIs) enable designers and researchers to easily explore the broad design space of SCIs. However, despite their utility, existing approaches are often limited in the number of shape-change features they can express. This paper introduces MorpheesPlug, a toolkit for creating SCIs that covers seven of the eleven shape-change features identified in the literature. MorpheesPlug comprises (1) a set of six standardized widgets that express the shape-change features with user-definable parameters; (2) software for 3D-modeling the widgets to create 3D-printable pneumatic SCIs; and (3) a hardware platform to control the widgets. To evaluate MorpheesPlug, we carried out ten open-ended interviews with novice and expert designers who were asked to design an SCI using our software. Participants highlighted the ease of use and expressivity of MorpheesPlug.

University of Copenhagen, Copenhagen, Denmark

University of Bristol, Bristol, United Kingdom

University of Copenhagen, Copenhagen, Denmark

10.1145/3411764.3445786

https://doi.org/10.1145/3411764.3445708

Embodied conversational agents have changed the ways we can interact with machines. However, these systems often do not meet users' expectations. A limitation is that the agents are monotonic in behavior and do not adapt to an interlocutor. We present SIVA (a Socially Intelligent Virtual Agent), an expressive, embodied conversational agent that can recognize human behavior during open-ended conversations and automatically align its responses to the conversational and expressive style of the other party. SIVA leverages multimodal inputs to produce rich and perceptually valid responses (lip syncing and facial expressions) during the conversation. We conducted a user study (N=30) in which participants rated SIVA as being more empathetic and believable than the control (agent without style matching). Based on almost 10 hours of interaction, participants who preferred interpersonal involvement evaluated SIVA as significantly more animate than the participants who valued consideration and independence.

Adobe Research, Seattle, Washington, United States

Institute for Creative Technologies, Los Angeles, California, United States

Microsoft, Seattle, Washington, United States

Microsoft Research, Redmond, Washington, United States

10.1145/3411764.3445708

https://doi.org/10.1145/3411764.3445563

Automated vehicles promise a future where drivers can engage in non-driving tasks without hands on the steering wheel for a prolonged period. Nevertheless, automated vehicles may still need to occasionally hand control back to drivers due to technology limitations and legal requirements. While some systems determine the need for driver takeover using driver context and road conditions to initiate a takeover request, studies show that the driver may not react to it. We present DeepTake, a novel deep neural network-based framework that predicts multiple aspects of takeover behavior to ensure that the driver is able to safely take over control when engaged in non-driving tasks. Using features from vehicle data, driver biometrics, and subjective measurements, DeepTake predicts the driver's intention, time, and quality of takeover. We evaluate DeepTake performance using multiple evaluation metrics. Results show that DeepTake reliably predicts the takeover intention, time, and quality, with an accuracy of 96%, 93%, and 83%, respectively. Results also indicate that DeepTake outperforms previous state-of-the-art methods on predicting driver takeover time and quality. Our findings have implications for the algorithm development of driver monitoring and state detection.

University of Virginia, Charlottesville, Virginia, United States

Bar-Ilan Univ., Ramat-Gan, Israel

University of Virginia, Charlottesville, Virginia, United States

10.1145/3411764.3445563

https://doi.org/10.1145/3411764.3445368

This paper describes how machine learning training data and symbolic knowledge from curators of conversational systems can be used together to improve the accuracy of those systems and to enable better curatorial tools. This is done in the context of a real-world practice of curators of conversational systems, who often embed taxonomically structured meta-knowledge into their documentation. The paper provides evidence that the practice is quite common among curators, that it is used as part of their collaborative practices, and that the embedded knowledge can be mined by algorithms. Further, this meta-knowledge can be integrated, using neuro-symbolic algorithms, into the machine learning-based conversational system to improve its run-time accuracy and to enable tools that support curatorial tasks. These results point towards new ways of designing development tools which explore an integrated use of code and documentation by machines.

IBM Research Brazil, Sao Paulo, Brazil

IBM Research, Sao Paulo, Brazil

IBM Research Brazil, Sao Paulo, Brazil

10.1145/3411764.3445368

https://doi.org/10.1145/3411764.3445646

Program synthesis, which generates programs based on user-provided specifications, can be obscure and brittle: users have few ways to understand and recover from synthesis failures. We propose interpretable program synthesis, a novel approach that unveils the synthesis process and enables users to monitor and guide the synthesis. We designed three representations that explain the underlying synthesis process with different levels of fidelity. We implemented an interpretable synthesizer and conducted a within-subjects study with eighteen participants on three challenging regular expression programming tasks. With interpretable synthesis, participants were able to reason about synthesis failures and strategically provide feedback, achieving a significantly higher success rate compared with a state-of-the-art synthesizer. In particular, participants with a high engagement tendency (as measured by NCS-6) preferred a deductive representation that shows the synthesis process in a search tree, while participants with a relatively low engagement tendency preferred an inductive representation that renders representative samples of programs enumerated during synthesis.

Harvard University, Cambridge, Massachusetts, United States

University of Michigan, Ann Arbor, Michigan, United States

Harvard University, Cambridge, Massachusetts, United States

University of Michigan, Ann Arbor, Michigan, United States

Harvard University, Cambridge, Massachusetts, United States

10.1145/3411764.3445646
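The idea of exposing the synthesis process, as described in the interpretable program synthesis abstract above, can be sketched with a toy enumerative regex synthesizer that records every candidate it tries. This is an illustration under stated assumptions, not the authors' synthesizer: the tiny grammar (`ATOMS`, `QUANTIFIERS`), the `candidates` and `synthesize` functions, and the example strings are all hypothetical.

```python
import itertools
import re

# Hypothetical, deliberately tiny grammar: a character class plus an optional quantifier.
ATOMS = ["[0-9]", "[a-z]", "[A-Z]"]
QUANTIFIERS = ["", "+", "*"]


def candidates(max_parts=2):
    """Enumerate concatenations of up to max_parts quantified character classes."""
    units = [atom + quant for atom in ATOMS for quant in QUANTIFIERS]
    for n in range(1, max_parts + 1):
        for combo in itertools.product(units, repeat=n):
            yield "".join(combo)


def synthesize(positives, negatives, max_parts=2):
    """Return the first pattern consistent with the examples, plus the search trace."""
    trace = []  # (pattern, accepted) pairs, in the order the candidates were tried
    for pattern in candidates(max_parts):
        ok = (all(re.fullmatch(pattern, s) for s in positives)
              and not any(re.fullmatch(pattern, s) for s in negatives))
        trace.append((pattern, ok))
        if ok:
            return pattern, trace
    return None, trace


pattern, trace = synthesize(["ab1", "x42"], ["abc", "7", "A1"])
# pattern is "[a-z]+[0-9]+"; trace lists every rejected candidate before it.
```

Returning the trace alongside the result is the design point: when synthesis fails or picks an unexpected pattern, the list of rejected candidates gives a user something concrete to inspect, which is the kind of visibility the abstract argues for.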

https://doi.org/10.1145/3411764.3445249

Modern visualization tools aim to allow data analysts to easily create exploratory visualizations. When the input data layout conforms to the visualization design, users can easily specify visualizations by mapping data columns to visual channels of the design. However, when there is a mismatch between data layout and the design, users need to spend significant effort on data transformation. We propose Falx, a synthesis-powered visualization tool that allows users to specify visualizations in a similarly simple way but without needing to worry about data layout. In Falx, users specify visualizations using examples of how concrete values in the input are mapped to visual channels, and Falx automatically infers the visualization specification and transforms the data to match the design. In a study with 33 data analysts on four visualization tasks involving data transformation, we found that users can effectively adopt Falx to create visualizations they otherwise cannot implement.

University of Washington, Seattle, Washington, United States

University of California, Santa Barbara, Santa Barbara, California, United States

University of Washington, Seattle, Washington, United States

University of Texas, Austin, Austin, Texas, United States

University of California, Berkeley, Berkeley, California, United States

University of Washington, Seattle, Washington, United States

10.1145/3411764.3445249
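The mismatch between data layout and visualization design mentioned in the Falx abstract above can be made concrete with a small example. The sketch shows the manual reshaping step a user would otherwise write by hand; it does not reproduce Falx's synthesis algorithm, and the `wide_to_long` helper and the sample table are hypothetical.

```python
def wide_to_long(rows, id_col, value_cols, key_name, value_name):
    """Reshape wide rows (one column per series) into long rows (one row per mark)."""
    long_rows = []
    for row in rows:
        for col in value_cols:
            long_rows.append({
                id_col: row[id_col],
                key_name: col,
                value_name: row[col],
            })
    return long_rows


# Wide layout: each year is its own column, so there is no single "year"
# column to map to the x channel of a chart.
wide = [
    {"city": "Seattle", "2019": 10, "2020": 12},
    {"city": "Austin", "2019": 7, "2020": 9},
]

# Long layout: "year" and "value" become columns that map directly to
# x and y, with "city" available for color.
long_rows = wide_to_long(wide, "city", ["2019", "2020"], "year", "value")
```

After reshaping, each row corresponds to one visual mark, which is the layout that column-to-channel mapping expects.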

https://doi.org/10.1145/3411764.3445323

There is increased interest in using virtual reality in education, but it often remains an isolated experience that is difficult to integrate into current instructional experiences. In this work, we adapt virtual production techniques from filmmaking to enable mixed reality capture of instructors so that they appear to be standing directly in the virtual scene. We also capitalize on the growing popularity of live streaming software for video conferencing and live production. With XRStudio, we develop a pipeline for giving lectures in VR, enabling live compositing using a variety of presets and real-time output to traditional video and more immersive formats. We present interviews with media designers experienced in film and MOOC production that informed our design. Through walkthrough demonstrations of XRStudio with instructors experienced with VR, we learn how it could be used in a variety of domains. In end-to-end evaluations with students, we analyze and compare the differences between traditional video and more immersive lectures with XRStudio.

University of Michigan, Ann Arbor, Michigan, United States

Swarthmore College, Swarthmore, Pennsylvania, United States

University of Michigan, Ann Arbor, Michigan, United States

10.1145/3411764.3445323

https://doi.org/10.1145/3411764.3445721

We present a multi-modal approach for automatically generating hierarchical tutorials from instructional makeup videos. Our approach is inspired by prior research in cognitive psychology, which suggests that people mentally segment procedural tasks into event hierarchies, where coarse-grained events focus on objects while fine-grained events focus on actions. In the instructional makeup domain, we find that objects correspond to facial parts while fine-grained steps correspond to actions on those facial parts. Given an input instructional makeup video, we apply a set of heuristics that combine computer vision techniques with transcript text analysis to automatically identify the fine-level action steps and group these steps by facial part to form the coarse-level events. We provide a voice-enabled, mixed-media UI to visualize the resulting hierarchy and allow users to efficiently navigate the tutorial (e.g., skip ahead, return to previous steps) at their own pace. Users can navigate the hierarchy at both the facial-part and action-step levels using click-based interactions and voice commands. We demonstrate the effectiveness of segmentation algorithms and the resulting mixed-media UI on a variety of input makeup videos. A user study shows that users prefer following instructional makeup videos in our mixed-media format to the standard video UI and that they find our format much easier to navigate.

Stanford University, Stanford, California, United States

Google Research, Mountain View, California, United States

Google Research, San Francisco, California, United States

Google, Atlanta, Georgia, United States

Stanford University, Stanford, California, United States

10.1145/3411764.3445721

https://doi.org/10.1145/3411764.3445419

Live streaming has gained popularity across diverse application domains in recent years. A core part of the experience is streamer-viewer interaction, which has been mainly text-based. Recent systems have explored extending viewer interaction to include visual elements with richer expression and increased engagement. However, understanding expressive visual inputs becomes challenging with many viewers, primarily due to the relative lack of structure in visual input. On the other hand, adding rigid structures can limit viewer interactions to narrow use cases or decrease the expressiveness of viewer inputs. To facilitate the sensemaking of many visual inputs while retaining the expressiveness and versatility of viewer interactions, we introduce a visual input management framework (VIMF) and a system, VisPoll, that help streamers specify, aggregate, and visualize many visual inputs. A pilot evaluation indicated that VisPoll can expand the types of viewer interactions. Our framework provides insights for designing scalable and expressive visual communication for live streaming.

University of Michigan, Ann Arbor, Michigan, United States

Adobe Research, Cambridge, Massachusetts, United States

University of California, San Diego, San Diego, California, United States

Adobe Research, San Jose, California, United States

Adobe Research, Seattle, Washington, United States

10.1145/3411764.3445419