ReactGenie: A Development Framework for Complex Multimodal Interactions Using Large Language Models

Authors: Jackie (Junrui) Yang, Yingtian Shi, Yuhan Zhang, Karina Li, Daniel Wan Rosli, Anisha Jain, Shuning Zhang, Tianshi Li, James A. Landay, Monica S. Lam

Article No.: 483, Pages 1–23. Published: 11 May 2024.

Abstract

By combining voice and touch interactions, multimodal interfaces can surpass the efficiency of either modality alone. Traditional multimodal frameworks require laborious developer work to support rich multimodal commands, where a single user command may correspond to one of exponentially many combinations of actions/function invocations. This paper presents ReactGenie, a programming framework that better separates multimodal input from the computational model to enable developers to create efficient and capable multimodal interfaces with ease. ReactGenie translates multimodal user commands into NLPL (Natural Language Programming Language), a programming language we created, using a neural semantic parser based on large language models. The ReactGenie runtime interprets the parsed NLPL and composes primitives in the computational model to implement complex user commands. As a result, ReactGenie allows easy implementation and unprecedented richness in commands for end-users of multimodal apps. Our evaluation showed that 12 developers could learn and build a non-trivial ReactGenie application in under 2.5 hours on average. In addition, compared with a traditional GUI, end-users can complete tasks faster and with less task load using ReactGenie apps.

1 Introduction

Multimodal interactions, combining multiple different input and output modalities, such as touch, voice, and graphical user interfaces (GUIs), offer increased flexibility, efficiency, and adaptability for diverse users and tasks [52]. However, the development of multimodal applications remains challenging for developers due to the complexity of managing multimodal commands and handling the low-level control logic for interactions. Existing frameworks [12, 14, 32, 41, 42, 49, 50] often require developers to manually handle these complexities, significantly increasing development costs and time. The voice modality, in particular, presents a unique challenge due to the compositionality and expressiveness of natural language. Sub-par implementations often greatly reduce the expressiveness of these multimodal interfaces. Various systems [28, 53] can automatically handle voice commands by converting them to UI actions, but they are prone to error and do not allow developers to fully control the app’s behavior.

The research described in this paper aims to provide developers with a simple programming abstraction (see Figure 1) by hiding the complexity of natural language understanding and supporting the composition of different modalities automatically. Our goal is to enable users to access off-screen content/actions and complete tasks that normally involve multiple GUI taps in a single multimodal command, as illustrated in Figure 2. This flexibility is achieved with little additional effort from developers compared to traditional GUI apps. This approach encourages the adoption of multimodal interactions and makes multimodal interactions more accessible to end-users.

This paper presents ReactGenie, a declarative programming framework for developing multimodal applications. The core concept behind ReactGenie is a better abstraction that separates the multimodal input and output interfaces from the underlying computation models. ReactGenie uses an object-oriented state abstraction to represent the computation model of the app and uses declarative UI components to represent the UI. Users’ compound multimodal commands are translated into a composition of multiple function calls using large language models (LLMs), e.g., to find the referred object/objects and make the right state change.

Existing declarative UI state management frameworks, such as Redux [6], use a single global state store to manage all of the state changes of the UI. The straightforward way to implement rich multimodal user commands in these existing frameworks is by making many imperative-style function calls. However, these function calls require the error-prone creation of many intermediate variables to store return values that are then used in the next function call as the programmer traverses the complex state stored in the monolithic object. These intermediate variables commonly lead to missing or dangling variable references when the neural semantic parser translates the user’s natural language input into code [36]. In contrast, the object-oriented state abstraction in ReactGenie encourages componentized classes instead of a single global state store. The componentized classes result in smaller objects, each equipped with methods for relevant operations. This design supports multiple chained method calls/property accesses (method chaining) and provides a straightforward representation of the user’s command with no need for intermediate variables (as shown in the example NLPL command in Figure 1). This allows ReactGenie to accurately compose the methods and properties of existing states needed for executing rich multimodal commands.
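To make this contrast concrete, below is a minimal TypeScript sketch of the componentized-state style; the Restaurant and Order classes and their methods are illustrative assumptions rather than ReactGenie’s actual API.

```typescript
// A minimal sketch of componentized state classes (names are illustrative,
// not ReactGenie's actual API).
class Order {
  foods: string[] = [];
  constructor(public restaurantName: string) {}
  // Returning `this` enables the chained style that NLPL relies on.
  addFoods(foods: string[]): Order {
    this.foods.push(...foods);
    return this;
  }
}

class Restaurant {
  orders: Order[] = [];
  constructor(public name: string) {}
  lastOrder(): Order {
    return this.orders[this.orders.length - 1];
  }
}

const tacoBell = new Restaurant("Taco Bell");
tacoBell.orders.push(new Order("Taco Bell").addFoods(["Crunchy Taco"]));

// Imperative, store-traversal code would need intermediate variables, e.g.:
//   const r = store.restaurants["Taco Bell"];
//   const last = r.orders[r.orders.length - 1];
//   const cart = createOrder(r); addFoods(cart, last.foods);
// With componentized classes, the same intent is one chained expression,
// which is the shape the semantic parser is asked to produce:
const reordered = new Order(tacoBell.name).addFoods(tacoBell.lastOrder().foods);
console.log(reordered.foods); // ["Crunchy Taco"]
```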

With ReactGenie, developers build graphical interfaces using a development workflow similar to a typical React + Redux [5] application. To add multimodality, the developer simply adds a few annotations to their code and example parses (pairs of expected end-user voice command examples and the corresponding function calls). These command examples indicate what methods/properties can be used in voice and how. By using the extracted class definitions and example parses from the developer’s state code, ReactGenie creates a parser that leverages an LLM [17] to translate the user’s natural language into NLPL (Natural Language Programming Language), a new domain-specific language (DSL). Combined with a custom-designed interpreter, ReactGenie can seamlessly handle multimodal commands and present the results in the graphical UI that the developer builds as usual.
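As a rough illustration of these developer-facing artifacts, the sketch below shows what example parses might look like; the registration helper, method names, and NLPL strings are assumptions for illustration rather than the framework’s actual annotation API.

```typescript
// A sketch, under assumed names, of the natural-language-specific artifacts a
// developer provides: example parses that pair an expected utterance with the
// NLPL it should parse to.

interface ExampleParse {
  utterance: string; // expected end-user voice command
  nlpl: string;      // corresponding NLPL (chained calls on state classes)
}

const exampleParses: ExampleParse[] = [];

// Hypothetical helper the developer calls alongside voice-accessible methods.
function registerExampleParse(utterance: string, nlpl: string): void {
  exampleParses.push({ utterance, nlpl });
}

registerExampleParse(
  "what did I order last week from McDonald's",
  'Restaurant.getByName("McDonald\'s").ordersSince("last week")'
);
registerExampleParse(
  "reorder my last meal from this restaurant",
  "Order.create(Restaurant.current()).addFoods(Restaurant.current().lastOrder().foods)"
);

// At build time, the framework would combine these examples with the class and
// method signatures extracted from the state code to prompt the LLM-based parser.
```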

As shown in Figure 1 left, developers can define both object-oriented state abstraction classes to handle data changes and UI components that explicitly map the state to the UI. Similar to React, when the user interacts with the app, the app’s state will be updated, and the UI will be re-rendered. What sets ReactGenie apart is its unique ability to support rich multimodal input, as shown in Figure 1 right.
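For instance, a state-to-UI mapping might look like the following React-style component; the Order shape and component name are illustrative assumptions. Because the component declares which state class it renders, the framework can reuse it to display an Order returned by a multimodal command.

```tsx
// A sketch of a declarative UI component that maps an Order state object to a
// view, in the React style the framework builds on (names are illustrative).
import React from "react";

interface Order {
  restaurantName: string;
  foods: string[];
}

export function OrderView({ order }: { order: Order }) {
  return (
    <div>
      <h3>{order.restaurantName}</h3>
      <ul>
        {order.foods.map((food) => (
          <li key={food}>{food}</li>
        ))}
      </ul>
    </div>
  );
}
```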

The main contributions of this research are as follows:

ReactGenie, a multimodal app development framework based on an object-oriented state abstraction that is easy for developers to learn and use and generates apps that support rich multimodal interactions.

A programming language, NLPL, used to represent users’ multimodal commands. This involves the design of the high-level annotation of user-accessible functions, the automatic generation of a neural semantic parser using LLMs that targets NLPL, a new DSL for rich multimodal commands, and an interpreter that executes NLPL. These systems support automatic and accurate handling of natural language understanding in ReactGenie.

Evaluations of ReactGenie:

For developers, we demonstrated its expressiveness through building three representative demo apps in different domains, its low development cost by comparing it with GPT-3 function calling, and its usability and learnability through a study with 12 developers successfully building a demo app.

For end-users, we measured the parser accuracy to be 90% with elicited commands from 50 participants and evaluated the usability of apps built using ReactGenie in a user study with 16 participants. We found users had a reduced cognitive load when using an app with ReactGenie-supported multimodal interactions compared to using a graphical user interface (GUI) app. They also preferred the multimodal app to the GUI-based app.

1.1 Targeted Interactions

ReactGenie supports rich interactions that are complex for current computer systems, but are intuitive for users. One example of a rich multimodal interaction is shown in the center of Figure 1: the user says, “Reorder my last meal from this restaurant” while touching the restaurant displayed on the screen. Such commands are common in human-to-human communication, yet for an app they involve multiple steps (retrieving the history of orders from the restaurant, creating an order, and adding food to the order). These commands are complex to implement today as they require combining inputs from both modalities and/or composing different features.

ReactGenie supports a typical family of gesture + speech multimodal interactions. This aligns with one of the categories of speech and gesture multimodal applications proposed by Sharon Oviatt’s seminal work [9]: The recognition modes ReactGenie supports are simultaneous and individual, meaning that users can use speech-only interactions, gesture-only interactions, or both at the same time (“What is the last time I ordered from this [touch on a restaurant] restaurant”). The supported gesture input type is touch/pen input, and the gesture vocabulary is deictic selection. This means that ReactGenie focuses on scenarios where the user’s gesture input resolves object references through pointing in a multimodal command. The speech vocabulary is arbitrary human sentences, and the type of linguistic processing is large-language model processing. The last two terms are new types we invented to better describe ReactGenie’s support for rich commands and the use of highly generalizable large-language models. Following Oviatt’s original classification, ReactGenie would be classified as large vocabulary and statistical language processing. ReactGenie uses late semantic fusion to fuse input from different modalities, which means the system integrates and interprets the meaning of inputs from multiple modalities only after each input has been independently processed and understood.
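The sketch below illustrates the late-fusion idea under these assumptions; the names and the NLPL string are illustrative, not the framework’s actual representation.

```typescript
// A rough sketch of late semantic fusion: speech and touch are processed
// independently and only merged at interpretation time (all names assumed).

interface Restaurant {
  name: string;
}

// Speech path: the parser emits NLPL containing an unresolved deictic
// reference ("this restaurant" -> Restaurant.current()).
const parsedNlpl = "Restaurant.current().lastOrderTime()";

// Touch path: the GUI layer independently maps the tap point to the state
// object backing the tapped component.
const tappedRestaurant: Restaurant = { name: "Taco Bell" };

// Fusion step: bind the deictic reference to the touched object before the
// interpreter executes the NLPL.
function fuse(nlpl: string, current: Restaurant) {
  return { nlpl, bindings: new Map([["Restaurant.current()", current]]) };
}

const fused = fuse(parsedNlpl, tappedRestaurant);
console.log(fused.bindings.get("Restaurant.current()")?.name); // "Taco Bell"
```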

With ReactGenie, the developer simply provides a small amount of additional information associated with each input method and function. Our system supports the full compositionality of input modalities and functions by automatically translating a user command into one of exponentially many possible action sequences. The richness of user interaction afforded by our system is unprecedented, as traditional multimodal programming frameworks require developers to hard-code every combination of features supported.

ReactGenie lets the programmer simply describe the functionality of their code, including the actions it supports and the relationship between the UI and the data. This allows ReactGenie to handle these rich multimodal commands in arbitrary combinations of actions without requiring direct developer input. The example in Figure 1 is supported by the following steps:

ReactGenie first translates the user’s voice command into NLPL code. For example, the user refers to an element in the UI by voice (“this restaurant”), and the semantic parser generates a special reference Restaurant.current().

ReactGenie extracts the tap point from the UI and uses the UI component code to map the tap point back to a state object Restaurant(name: "Taco Bell").

With the parsed DSL and UI context, ReactGenie’s interpreter can execute the generated NLPL using developer-defined states. It first retrieves the most recent order from “Taco Bell”, designated as “Taco 3/3”. Then, it creates a new order, designated as “New Taco”. Finally, the interpreter adds all the food items from “Taco 3/3” to “New Taco” and returns the new order.

ReactGenie passes the return value of the NLPL statement to the output UI mapping module. Because the return value is an Order object, ReactGenie searches in the developer’s UI component code to find a corresponding representation (Output UI Mapping) to present the result to the user. ReactGenie also generates a text response using the LLM based on the user’s input, parsed NLPL, and the return value: “Your cart is updated with the same order from this restaurant as the last time.”
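The output UI mapping step can be pictured with the following minimal sketch, in which a hypothetical registry associates state classes with the views that render them; the registry and lookup are assumptions for illustration, not ReactGenie’s actual API.

```typescript
// A minimal sketch of output UI mapping by return type: the runtime inspects
// the class of the NLPL result and looks up a view registered for it.

class Order {
  constructor(public foods: string[]) {}
}
class Restaurant {
  constructor(public name: string) {}
}

type Renderer = (value: unknown) => string;

// In this sketch the registry is filled by hand; in practice it would be
// derived from the developer's UI component code.
const viewRegistry = new Map<Function, Renderer>([
  [Order, (o) => `OrderView: ${(o as Order).foods.join(", ")}`],
  [Restaurant, (r) => `RestaurantView: ${(r as Restaurant).name}`],
]);

// Given the NLPL return value, pick the view registered for its class.
function present(result: object): string {
  const render = viewRegistry.get(result.constructor);
  return render ? render(result) : String(result);
}

// An NLPL result of type Order is shown with the developer's Order view.
console.log(present(new Order(["Crunchy Taco", "Burrito Supreme"])));
// -> "OrderView: Crunchy Taco, Burrito Supreme"
```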

During this process, the ReactGenie framework uses its knowledge about the developer’s app to automatically understand a multimodal compositional command, compose actions to execute, and find the appropriate interface to present the results to the user. This pipeline allows ReactGenie to handle more commands than prior frameworks with little developer input.

2 Related Work

In this section, we review related work on multimodal interaction systems, Graphical and Voice UI frameworks, and multimodal interaction frameworks.

2.1 Multimodal Interaction Systems

Many researchers have proposed multimodal interaction systems. The earliest multimodal interaction systems, such as Bolt’s “Put-that-there”, were developed in the 1980s [15]. They demonstrated that users can interact with a computer using voice and gestures. QuickSet [21] further demonstrated use cases of multimodal interaction on a mobile device and showed military and medical applications.

Recent work has explored different applications of multimodal interaction, including care-taking of older adults [48, 51], photo editing [38], and digital virtual assistants [33]. Researchers have also explored different devices and environments for multimodal interaction, including augmented reality [59], virtual reality [39, 56], wearables [16], and the Internet of Things [25, 34, 55, 58].

These projects have demonstrated the great potential of multimodal interaction systems. However, multimodal systems still have limited adoption in the real world due to the development complexity they currently require.

2.2 Graphical UI Frameworks

ReactGenie is built on top of an existing graphical UI framework to provide a familiar development experience. Model–view–controller (MVC) [35] is the traditional basis of UI development frameworks and is used in frameworks such as Microsoft’s Windows Forms [29], and Apple’s UIKit [1]. The model stores data while the controller manages GUI input and updates the GUI view based on data changes. Typically implemented in object-oriented programming languages, MVC can be compared to a shadow play, where objects (controllers) manipulate GUIs and data to maintain synchronization. However, updating the model with alternative modalities, such as voice, is not feasible due to the strong entanglement between models and GUI updates.

Garnet [43, 45], a user interface development environment introduced in the late 1980s, is another notable approach to GUI development. Garnet introduced concepts like data binding, which allows the GUI to be updated automatically when the data changes. It also tries abstracting the GUI state away from the presentation using interactors [44]. While interactors allow the UI state to be rewired and thus to be updated using another modality like voice or gesture [37], they do not enable manipulation of more abstract states (e.g., foods in a delivery order) that are not directly mapped to a single UI control.

Declarative UI frameworks, such as React [2], Flutter [3], and SwiftUI [7], are a more recent approach to UI development. With declarative UI frameworks, programmers write functions to transform data into UI interfaces, and the system automatically manages updates. To ease the management of states that may be updated by and reflected on multiple UI interfaces, centralized state management frameworks, such as Redux [6], Flux [10], and Pinia [4], are often used together with these declarative UI frameworks. They provide a single source of truth for the application state and allow state updates to be reflected across all presented UIs. This approach can be likened to an overhead projector, where the centralized state represents the writing and the transform functions represent the lens projecting the UI to the user. While this approach improves separation and UI updating, it sacrifices the object-oriented nature of the data model. This centralized state works well with button pushes but falls short in dynamically composing multiple actions to support rich multimodal commands.

ReactGenie reintroduces object-orientedness to centralized state management systems by representing the state as a sum of all class instances in the memory. Developers can declare classes and describe actions as member functions of the classes. ReactGenie captures all instantiated classes and stores them in a central state. This more modularized model is analogous to actors (class instances) in a movie set, with views (UI components) acting as cameras capturing different angles of the centralized state. In this way, ReactGenie enables rich action composition through type-checked function calls. Furthermore, developers can tag specific cameras to point at certain objects, enabling automatic UI updates from state changes. These features allow ReactGenie apps to easily support the compositionality of multimodal input and enable the interleaving of multimodal input with other graphical UI actions.

2.3 Voice UI Frameworks

Commercial voice or chatbot frameworks, such as Amazon Lex, Google Dialogflow, and Rasa, are designed to handle natural language understanding and generation. These frameworks allow developers to define intents and entities and then train the model to recognize the intents and entities from the user’s input. In this context, intents refer to categories of the user’s action, such as making a reservation or asking for weather information, and one action can only be mapped to one intent. Intents are usually mapped to different programming implementations to handle commands in the corresponding intent categories. These frameworks require a complete redevelopment of an application to support voice-only input. Frameworks such as Alexa Skills Kit and Google Actions allow developers to extend existing applications to support voice input. However, these still require manual work to build functions only for voice, and the visual UI updates are limited to simple text and a few pre-defined UI elements. Additionally, the one-intent-one-implementation nature of the intent-based architecture limits the compositionality of the voice commands.

Research-focused voice/natural language frameworks, such as Genie [19, 54] and other semantic parsers [13, 46], are designed to support better compositionality of voice commands. However, given that today’s app development is primarily geared toward mobile and graphical interfaces, these frameworks require extra work from the developer and do not support multimodal features. ReactGenie improves this experience by integrating the development of voice and graphical UIs, allowing developers to extensively reuse existing code and support multimodal interactions.

2.4 Multimodal Interaction Frameworks

Prior work has also proposed multimodal interaction frameworks that allow developers to build multimodal applications. One of the earliest works was presented by Cohen et al. [20]. It includes ideas like forming the user’s voice command as a function call and using the user’s touch point as a parameter to the function call. Later, researchers created standards [23, 24] and frameworks [12, 14, 32, 41, 42, 49, 50] to help developers build apps that can handle multiple inputs across different devices. Although these frameworks provide scaffolding for developers to build multimodal applications, they mostly treat voice as an event source that can trigger functions the developer has to explicitly implement for voice. Developers also have to manually update the UI to reflect the result of the voice command. This manual process limits voice commands to simple single-action commands and makes it difficult for developers to build richer multimodal applications.

Recently, research projects have explored generating voice commands by learning from demonstrations [27, 40, 47], extracting commands from graphical user interfaces with large language models [28, 53], or building multimodal applications from existing voice skills [57]. The first approach still requires developers to manually create demonstrations for each action and limits the compositionality of the voice commands. The second approach is useful for accessibility purposes, but it relies on the features being easily extractable from the GUI. It is uncertain how well the first two approaches can generalize to more complex UI tasks that require multiple UI actions. The third approach is constrained by what the voice skills provide and, traditionally, these have been limited due to the added development effort.

In comparison, ReactGenie leverages the existing GUI development workflow and requires only minimal annotations to the code to generate multimodal applications. Having access to the full object-oriented state programming codebase, ReactGenie can handle the natural complexity of multimodal input, compose the right series of function calls, and update the UI to reflect the result automatically.

3 System Design

In this section, we first define the design goals of the framework. Then, we describe the theory of operation that addresses the design goals. Finally, we discuss the implementation of the system components and workflow.

3.1 Design Goals

Our design goals include aspects of the interaction design of ReactGenie apps as well as the design of the framework itself.

3.1.1 Interaction Design.

ReactGenie is primarily designed to enhance user interaction with mobile applications, but the concept should also apply to apps on other platforms. Today, mobile applications are well-optimized for touch and graphical interactions. Users can use the graphical interface to see content on the screen and use touch to access actions on the screen. To further enhance the user’s performance and reduce cognitive load, ReactGenie focuses on supporting interactions that often involve touch actions used together with a voice command.

Here is a series of example commands from an interaction with a food ordering app between user A and their friend, user B:

A knows what they want, so A says, “Show me what I ordered last week from McDonald’s.” The app responds with the order history.

A wants to add a previously ordered food into the cart (not available on UI). A says, “Order this hamburger,” with a tap on the “Big Mac” entry in the order history, and the app adds a “Big Mac” to the shopping cart.