Design: XLang Agents

This page discusses how to bridge the gap between the concepts introduced earlier and real-world, usable XLang Agents. We hope it serves as a pragmatic reference for anyone interested in building a functional chat-UI agent for a specific scenario.

We’ll first describe the shared architecture and techniques employed by the three agents we've developed. Following that, details about the unique part of each agent are discussed.

Note: this page will be updated to keep pace with demo changes and the code release.

Architecture & Techniques

Agent

We adopt and adapt LangChain to construct our agents. LangChain is a framework for building LLM-powered applications, with various built-in components for working with LLMs. Specifically, our agents are based on the ReAct implementation in LangChain: a paradigm where an agent iteratively performs Thinking, Acting, and Observing to accomplish the assigned task(s):

The Thinking stage generates a reasoning trace that provides useful insights for further actions.
The Acting stage involves interaction with the environment.
The Observing stage reflects on the environment observation (state) to prepare for the next decision.
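The three stages above can be sketched as a simple loop. This is a minimal illustration, not LangChain's or XLang's actual code: `toy_policy` is a hypothetical stand-in for the LLM call, and the `add` tool and the `finish` action name are assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for an LLM call: given the history, return a
# (thought, action_name, action_input) triple. "finish" ends the loop.
def toy_policy(history: List[str]) -> Tuple[str, str, str]:
    observations = [line for line in history if line.startswith("Observation:")]
    if not observations:
        return ("I need to compute the sum first.", "add", "2,3")
    return ("I now know the answer.", "finish", observations[-1].split(": ", 1)[1])

def add(arg: str) -> str:
    a, b = (int(x) for x in arg.split(","))
    return str(a + b)

TOOLS: Dict[str, Callable[[str], str]] = {"add": add}

def react_loop(task: str, policy, tools, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        thought, action, action_input = policy(history)  # Thinking
        history.append(f"Thought: {thought}")
        if action == "finish":
            return action_input
        observation = tools[action](action_input)         # Acting
        history.append(f"Observation: {observation}")     # Observing
    return "Stopped: step limit reached."

answer = react_loop("What is 2 + 3?", toy_policy, TOOLS)
print(answer)
```

The step limit is a design choice of this sketch: it keeps a misbehaving policy from looping forever.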

Though ReAct offers a good starting point, we found that crucial elements still needed to be addressed to create a versatile, robust, and extensible agent for real-world applications.

We filled these gaps when developing XLang Agents:

Limited Tools: Only a few tools are provided in LangChain. In order to enable our agents to tackle complex real-world tasks, we have armed XLang Agents with a comprehensive set of tools.
User Interface: While LangChain supports chat formatting by providing a memory for the human-agent conversation history, users are bound to console or program inputs. This limits interaction mainly to developers and makes it hard for everyday users to communicate with the agent, even though their feedback is important for helping the agent advance its tasks. Thus, we have implemented a web UI and a backend server so that real users can communicate with agents effortlessly.
Information Presentation: The same piece of information (e.g., LLM context) often needs to be represented differently depending on the situation, be it on the user's end (frontend messages), in the agent's memory, or in database storage (backend storage). To accomplish this, we have created a DataModel class that maps raw information to these different representations.
Prompting: The current prompt in LangChain suggests that the user will utilize tools rather than the agent itself, leading to inconsistencies in the agent’s responses. Furthermore, the prompt explicitly includes phrases like “Think” and “Observation,” which might not be user-friendly or visually appealing for frontend display.
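To make the Information Presentation point above more concrete, here is a minimal sketch of one object exposing three views of the same table. The class name `TableDataModel` and its method names are hypothetical and chosen for illustration; they are not the actual XLang DataModel interface.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class TableDataModel:
    """Hypothetical wrapper: one raw table, three views of it."""
    name: str
    rows: List[Dict[str, Any]]

    def _columns(self) -> List[str]:
        return list(self.rows[0].keys()) if self.rows else []

    def for_frontend(self) -> Dict[str, Any]:
        # Full structured payload that a UI could render as a table.
        return {"type": "table", "name": self.name,
                "columns": self._columns(), "rows": self.rows}

    def for_llm(self, max_rows: int = 2) -> str:
        # Compact text preview so the LLM context stays short.
        columns = self._columns()
        lines = [" | ".join(columns)]
        for row in self.rows[:max_rows]:
            lines.append(" | ".join(str(row[c]) for c in columns))
        if len(self.rows) > max_rows:
            lines.append(f"... ({len(self.rows) - max_rows} more rows)")
        return "\n".join(lines)

    def for_storage(self) -> str:
        # Serialized form suitable for a database text column.
        return json.dumps({"name": self.name, "rows": self.rows})

table = TableDataModel("sales", [
    {"city": "HK", "qty": 3},
    {"city": "NYC", "qty": 5},
    {"city": "LA", "qty": 1},
])
print(table.for_llm())
```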

Environment

The environment implementation depends largely on the action space of the target agent. LangChain provides the Tool class, which encapsulates the initiation and execution of every action; we refer to an action as a tool in what follows. In practice, each agent fits its scenario by using a different scope and combination of tools. For example:

The Data Agent leverages programming languages like Python & SQL, along with powerful data-centric tools catering to advanced data analysis scenarios.
The Plugins Agent interfaces with a wide selection of over 200 plugins from third-party sources, aiding users in numerous daily tasks and activities.
The Web Agent utilizes the capabilities of the web browser extension, enabling automated exploration and navigation of websites.

Each agent's environment and the tools they utilize will be explored in greater depth in their respective sections.

In-context Human Feedback

XLang Agents employ in-context human feedback as a pivotal element of the agent framework, which (1) assists in completing the original task; (2) expands the scope of the task, adding depth and insight. This approach requires two critical components:

Frontend Chat UI: Users chat via the web-based UI, continuously giving natural language feedback based on the agent’s responses. This feedback acts as iterative “priors” that help the agent navigate and explore the task at hand. We base our front-end on chatbot-ui, an interface that closely resembles ChatGPT. As this front-end inherently supports chat, we have added options for tool selection and an adaptive UI that accommodates different types of tool responses, including JSON, tables, code, and images, among others.
Memory: When the front-end obtains human input, it's transferred to the agent's memory as a component of the LLM history context. Here is a comparison of the LLM contexts seen by different agents:
- RL Agent: $(h, a_1, o_1, …, a_{t-1}, o_{t-1})$
- ReAct Agent: $(h, a_1[…], o_1[…], r_1, …, a_{t-1}[…], o_{t-1}[…], r_{t-1})$
- XLang Agents: $(h_1, a_1[…], o_1[…], r_1, …, h_{t-1}, a_{t-1}[…], o_{t-1}[…], r_{t-1})$
  where $h$ denotes human input, $a$ denotes an action, $o$ denotes an observation, $r$ denotes the agent's response in natural language, and $[…]$ indicates that the action-observation loop can be performed iteratively by the agent. As shown, XLang Agents can incorporate multi-turn user feedback into the historical context.
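The XLang-style context above can be sketched as a flat list built from per-turn records. This is an illustrative data layout only; the `Turn` tuple shape and the role labels are assumptions, not the actual memory format.

```python
from typing import List, Tuple

# One turn: (h_i, [(a, o), ...], r_i). The inner list models the [...] loop.
Turn = Tuple[str, List[Tuple[str, str]], str]

def flatten_context(turns: List[Turn]) -> List[Tuple[str, str]]:
    context: List[Tuple[str, str]] = []
    for human, steps, response in turns:
        context.append(("human", human))
        for action, observation in steps:
            context.append(("action", action))
            context.append(("observation", observation))
        context.append(("response", response))
    return context

turns: List[Turn] = [
    ("Plot sales by city.",
     [("python: draw bar chart", "chart rendered")],
     "Here is the chart."),
    ("Use a pie chart instead.",
     [("python: draw pie chart", "chart rendered")],
     "Switched to a pie chart."),
]
context = flatten_context(turns)
roles = [role for role, _ in context]
print(roles)
```

Unlike the single $h$ in the RL and ReAct contexts, a human entry appears at the start of every turn here, which is what lets later feedback steer the task.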

Data Agent

As outlined in previous sections, we have developed the XLang Data Agent, which specializes in data-related work. We expect it to assist users with a diverse set of data-centric activities, including but not limited to data search, data query, data profiling, data visualization, and data transformation. For instance, as shown in the above figure, users can upload a data file, apply it, and ask questions about it.

Therefore, we have carefully selected tools that are suited to support our targeted scenarios.

Coding Tools

Python

The core feature of the Data Agent is to write and execute code and present the result to the user. Of the many programming languages that LLMs handle well, we chose Python as the most suitable. This is due to Python's robust open-source ecosystem, which hosts tens of thousands of packages serving various functions, all readily available through simple import statements.

With the Python tool, many data-related tasks can be done through a few lines of Python code. Here are the steps of a Python tool’s working pipeline:

In the Python tool, an LLM takes the input from the agent and transforms the user’s intent into executable code.
This generated code is then parsed and filtered for accuracy and safety.
The code is then executed in a Docker container which provides a Python code interpreter execution sandbox.
The final result is stored as a DataModel object, which is shared with both the user and the agent.

By letting the tool generate its own code, we not only reduce the agent's complexity but also improve its performance when delivering responses.
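The pipeline steps above (generate, filter, execute, store) can be sketched as follows. This is a simplified illustration under stated assumptions: `fake_llm_generate_code` is a hypothetical stand-in for the code-writing LLM, the deny-list is invented for the example, a plain subprocess replaces the Docker sandbox (so it provides no real isolation), and a dictionary replaces the DataModel object.

```python
import ast
import subprocess
import sys

BLOCKED_MODULES = {"os", "subprocess", "shutil", "socket"}  # assumed deny-list

def fake_llm_generate_code(intent: str) -> str:
    # Hypothetical stand-in for the code-writing LLM.
    return "values = [3, 5, 1]\nprint(sum(values) / len(values))"

def is_safe(code: str) -> bool:
    # Parse the code and reject it if it imports a blocked module.
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [(node.module or "").split(".")[0]]
        else:
            continue
        if BLOCKED_MODULES & set(names):
            return False
    return True

def run_python_tool(intent: str) -> dict:
    code = fake_llm_generate_code(intent)
    if not is_safe(code):
        return {"code": code, "stdout": "", "error": "rejected by filter"}
    # NOT a sandbox: a separate interpreter process stands in for Docker.
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=10)
    return {"code": code, "stdout": proc.stdout.strip(), "error": proc.stderr.strip()}

result = run_python_tool("What is the average quantity?")
print(result["stdout"])
```

An import deny-list is only a first line of defense; the container described in the steps above is what actually confines the code.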

SQL

Apart from Python, we've integrated SQL into our agent. Similar to the Python tool, the LLM generates an SQL query based on the user's input. We then use an in-memory SQL engine (sqlite3 in Python) to execute the query on the relevant data, so the user receives the resulting data promptly.
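As a rough sketch of the execution step, the snippet below loads a small table into an in-memory sqlite3 database and runs a query on it. The table name, columns, and the `generated_query` string are assumptions for illustration; in the real tool the query would come from the LLM.

```python
import sqlite3

def run_sql(rows, query):
    # ":memory:" creates a throwaway database that lives only in RAM.
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE sales (city TEXT, qty INTEGER)")
        conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
        cursor = conn.execute(query)
        columns = [d[0] for d in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()

# Stand-in for the LLM-generated query.
generated_query = (
    "SELECT city, SUM(qty) AS total FROM sales "
    "GROUP BY city ORDER BY total DESC"
)
columns, result = run_sql([("HK", 3), ("NYC", 5), ("HK", 4)], generated_query)
print(columns, result)
```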

Data Tools

We’ve also augmented Data Agent with various high-quality data tools, designed to significantly enhance your data analysis performance and productivity. Note, all tools are triggered with plain chat!

(We are adding more data tools!)

Kaggle Search

Kaggle is a popular platform for data science and machine learning enthusiasts, providing a vast collection of publicly available datasets and resources. We mainly leverage the official Kaggle API and use an LLM to understand the user’s request and generate search queries. We currently support dataset search and connection: the tool returns the top four results to the user, or connects to a dataset URI that the user specifies. Users can click "Download", whereupon the dataset is fetched, downloaded, and unzipped to their file system.

ECharts

ECharts is a powerful open-source data visualization library. It provides a wide variety of interactive charts and graphs that can be used to represent and visualize data in a user-friendly and engaging manner.

We use pyecharts, a Python package that supports generating ECharts charts from Python. Similar to the Python tool, we let the LLM write pyecharts code and execute it to obtain an ECharts configuration in JSON. This JSON is then passed to the front-end, which renders an interactive HTML object.
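To show what kind of JSON travels to the front-end, the sketch below builds a bar-chart option by hand with the standard library only. It deliberately does not use pyecharts; the hand-written dictionary stands in for the configuration that executed pyecharts code would produce, and `bar_chart_option` is a hypothetical helper.

```python
import json

def bar_chart_option(title, categories, values):
    # Keys follow the ECharts option layout: title, axes, and series.
    return {
        "title": {"text": title},
        "tooltip": {},
        "xAxis": {"type": "category", "data": categories},
        "yAxis": {"type": "value"},
        "series": [{"type": "bar", "data": values}],
    }

# The JSON string is what a front-end would hand to ECharts for rendering.
option_json = json.dumps(bar_chart_option("Sales by city", ["HK", "NYC"], [7, 5]))
print(option_json)
```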

Data Profiling

The data profiling tool is a heuristic, rule-based tool that offers fundamental information about the applied data. For instance, we design rules that check whether the uploaded table contains noisy cell values or missing information.

Additionally, we utilize the capabilities of the LLM to provide insights into actionable steps the user can take with the data. This coordination grants both the user and the system an initial understanding of the raw data, serving as an introductory overview of the dataset.
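A rule-based check of the kind described above might look like the sketch below. The specific rules (a list of missing-value markers, and flagging mostly-numeric columns that contain a few non-numeric cells) are assumptions made for illustration, not the tool's actual rules.

```python
from typing import Any, Dict, List

MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}  # assumed markers

def profile_table(rows: List[Dict[str, Any]]) -> Dict[str, Dict[str, int]]:
    report: Dict[str, Dict[str, int]] = {}
    for row in rows:
        for column, value in row.items():
            stats = report.setdefault(
                column, {"missing": 0, "non_numeric": 0, "total": 0})
            stats["total"] += 1
            text = "" if value is None else str(value).strip()
            if text.lower() in MISSING_TOKENS:
                stats["missing"] += 1
                continue
            try:
                float(text)
            except ValueError:
                stats["non_numeric"] += 1
    return report

def noisy_columns(report, threshold=0.5):
    # Flag columns where only a minority of present cells are non-numeric.
    flagged = []
    for column, stats in report.items():
        present = stats["total"] - stats["missing"]
        if present and 0 < stats["non_numeric"] / present < threshold:
            flagged.append(column)
    return flagged

rows = [
    {"city": "HK", "qty": "3"},
    {"city": "NYC", "qty": "five"},
    {"city": "LA", "qty": "N/A"},
    {"city": "SF", "qty": "8"},
    {"city": "", "qty": "2"},
]
report = profile_table(rows)
print(report, noisy_columns(report))
```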

Plugins Agent

In our journey to support tool-use system development and research beyond data-science applications, we've incorporated a diverse range of real-world plugins. These cover a wide range of areas, such as Information & News, Travel & Accommodation, Food & Leisure, Education & Learning, Productivity & Assistance, Finance & Shopping, Sports & Recreation, Technology & Development, and more. These plugins aim to help LLMs access real-time data and perform various tasks seamlessly.

Our system can manage a variety of tasks, from searching for products on shopping websites to facilitating communication or even assisting in website creation. With a repository of over 200 plugins, inspired by systems like ChatGPT plugins, we are proud to contribute to the open-source community. Our roadmap is to generate, connect, monitor, and control plugins automatically in the future.

One of the features we're most excited about is our user-friendly approach: Auto-selection. We've made an effort to simplify the process for users. Instead of manually selecting plugins, users can just choose “Auto” mode and express their needs, and our system endeavors to find the most suitable plugin, aiming for a more intuitive experience.

Currently we support 200+ plugins and an automatic plugin selector, which uses embedding-based retrieval across the areas listed above. More plugins will be added in the future to cover more scenarios of daily use.
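The idea of embedding-based retrieval for plugin selection can be sketched as ranking plugin descriptions by similarity to the user's request. Everything here is a toy assumption: a bag-of-words count vector replaces a learned embedding model, and the plugin names and descriptions are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; a real system would use a learned model.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

PLUGINS = {  # hypothetical names and descriptions
    "weather_lookup": "get the weather forecast and temperature for a city",
    "flight_search": "search and compare flight tickets between cities",
    "news_digest": "summarize breaking news headlines",
}

def select_plugins(query, plugins, top_k=1):
    q = embed(query)
    ranked = sorted(plugins,
                    key=lambda name: cosine(q, embed(plugins[name])),
                    reverse=True)
    return ranked[:top_k]

chosen = select_plugins("What is the weather forecast in Hong Kong tomorrow?", PLUGINS)
print(chosen)
```

With a learned embedding, requests that share no words with a description could still match it, which a word-count vector cannot do.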

General Plugins

Meeting diverse user needs necessitates a robust and expansive array of plugins. A mere handful won't suffice; it's a matter of reaching a critical mass where sheer volume sparks a qualitative leap in user experience. This transformative principle underpins our approach: quantity driving quality.

Inspired by and in collaboration with the ingenuity of the OpenAI-powered plugin community, we've curated and constructed a suite of plugins unparalleled in its breadth and versatility. Moreover, our vision extends beyond our immediate horizons, as we aspire to foster and expand this initiative alongside the open-source community, tapping into collective brilliance.

Once you've made your plugin selections, you can seamlessly embark on your dialogue journey. Beyond the constraints of conventional language models, you can now:

Shop for favorite products.
Manage and search emails efficiently.
Generate images from descriptions and visualize them.
Dive deep into file searches and document retrievals.
Access real-time financial insights.
Engage in academic research and discourse.
Stay updated with the latest in sports.
Digest breaking news.
Monitor weather updates tailored to your locale.
Tap into real-time data streams.
Interact with private databases securely and effectively.
By bridging these capabilities, we're not merely offering plugins; we're reshaping the very fabric of user interaction and potential.
…

(We will keep adding more plugins from all sources, and monitoring the status of these plugins.)

Understanding Plugins

At its core, a plugin can be viewed as an API call function. It connects the Plugins Agent to a wide range of external data sources, made accessible via their respective APIs. The process is fairly straightforward:

The agent devises the right input for the plugin.
Upon execution, the plugin returns an output, i.e., an observation.
The agent integrates this observation to enhance its response accuracy.

Incorporating OpenAPI for Plugin Familiarity

To ensure the agent comprehends the nuances of each plugin – from its functionality and parameter needs to its return format – we utilize OpenAPI files. These files are parsed and transformed into an easy-to-understand “manual” for the LLM. Think of this as a guidebook that tells the agent how to use each plugin.
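A minimal sketch of turning an OpenAPI description into such a text manual is shown below. It assumes the file has already been parsed into a dictionary, reads only a few common OpenAPI 3 fields, and uses a made-up weather spec; the real parser and manual format may differ.

```python
def openapi_to_manual(spec: dict) -> str:
    lines = [f"Plugin: {spec['info']['title']}", spec["info"].get("description", "")]
    for path, methods in spec.get("paths", {}).items():
        # Simplification: every key under a path is treated as an HTTP method.
        for method, op in methods.items():
            lines.append(f"- {method.upper()} {path}: {op.get('summary', '')}")
            for param in op.get("parameters", []):
                required = "required" if param.get("required") else "optional"
                ptype = param.get("schema", {}).get("type", "any")
                lines.append(
                    f"    * {param['name']} ({ptype}, {required}): "
                    f"{param.get('description', '')}")
    return "\n".join(line for line in lines if line)

# Hypothetical spec, written as the dictionary a YAML/JSON parser would return.
spec = {
    "info": {"title": "Weather Lookup", "description": "Current weather by city."},
    "paths": {
        "/forecast": {
            "get": {
                "summary": "Get the forecast for a city",
                "parameters": [
                    {"name": "city", "in": "query", "required": True,
                     "description": "City name", "schema": {"type": "string"}},
                    {"name": "days", "in": "query", "required": False,
                     "description": "Number of days", "schema": {"type": "integer"}},
                ],
            }
        }
    },
}
manual = openapi_to_manual(spec)
print(manual)
```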

Workflow

Let's break down the process:

The Plugins Agent is presented with instructions alongside a parsed YAML file. This file lists the available APIs and their respective descriptions.
As the conversation ensues, the agent decides whether to invoke a plugin.
If deemed necessary, the agent crafts the appropriate input for the selected plugin.
The API call function executes, returning its observation to the agent.
Using a combination of the user's instructions and the fresh observation, the agent crafts a more refined and accurate response.
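The workflow above can be sketched end to end with stubs. Here `decide_plugin` is a keyword rule standing in for the LLM's decision, and `weather_api` is a hypothetical plugin that returns canned data instead of calling a real API.

```python
from typing import Callable, Dict, Optional

def weather_api(city: str) -> str:
    # Hypothetical plugin: a real one would issue an HTTP request.
    fake_data = {"hong kong": "28C, sunny"}
    return fake_data.get(city.lower(), "no data")

PLUGIN_FUNCS: Dict[str, Callable[[str], str]] = {"weather": weather_api}

def decide_plugin(user_message: str) -> Optional[Dict[str, str]]:
    # Stand-in for the LLM deciding whether to invoke a plugin and
    # crafting its input; a simple keyword rule is used here.
    text = user_message.lower()
    if "weather" in text and " in " in text:
        city = text.split(" in ", 1)[1].strip(" ?.!")
        return {"plugin": "weather", "input": city}
    return None

def respond(user_message: str) -> str:
    call = decide_plugin(user_message)
    if call is None:
        return "No plugin needed; answering directly."
    observation = PLUGIN_FUNCS[call["plugin"]](call["input"])  # API call
    # A real agent would pass the observation back to the LLM to write the reply.
    return f"Observation for {call['input']}: {observation}"

reply = respond("What's the weather in Hong Kong?")
print(reply)
```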

Auto Plugin Selection

Discover ease, efficiency, and optimal performance with our next-generation plugins system. It's not just about having numerous plugins; it's about having the RIGHT ones.

In the vast world of plugins, many existing systems operate on an outdated premise: that users should pre-select which plugins they wish to use. This practice, though often seen as a technological necessity, is far from ideal. Asking users to sift through a myriad of plugins and determine which ones best suit their needs is like asking someone to find a needle in a haystack. Not only is it time-consuming, but it can also be overwhelming, especially given the sheer volume and complexity of plugins available today.

We believe technology should simplify processes, not complicate them. That's why we've revolutionized the way plugins are used with our new system. Instead of placing the onus on the user, our system intelligently auto-selects the best plugins based on individual user needs. Say goodbye to the tedious and often confusing task of manually selecting plugins. Our system does the hard work for you, ensuring that you have the best, tailor-made plugin experience every time.

Web Agent

In today's digital age, interaction with the web has become an integral part of our daily lives. Imagine a tool that can replicate human interactions on the web, with efficiency and accuracy that improve over time. This is the essence of our Web Agent.

Key points

A robust framework: We provide a robust framework (to be open-sourced later) for testing and applying automatic web navigation techniques in real settings. Researchers and developers can use it to showcase research results or build applications.
Power of the LLM: We leverage the capabilities of the LLM to engineer an agent that not only communicates but also interacts with its environment—in this case, the World Wide Web.
Beyond Just Web Navigation: The Web Agent isn't solely about web navigation. It synergizes with a chat interface, opening the doors to a multitude of diverse tasks.

Use Cases

What the Web Agent can do falls into two categories, corresponding to the two main things people use the web for: information retrieval and task execution.

For example, searching for information about Elon Musk or gathering people’s opinions about a movie counts as information retrieval, while posting a thread on Twitter counts as task execution.

Here are some common use cases:

Movie Comments: Want to catch a film and curious about what others are saying? The Web Agent navigates to the relevant comments page, summarizing feedback for you.
Stay Updated: Interested in Elon Musk’s latest tweets? Engage with the Web Agent, and it'll not only bring you a summary but can also help retweet or draft a new tweet inspired by Elon's latest.
And more...

For more use cases, you may refer to XLang Web Agent Use Cases

Workflow

Our Web Agent ecosystem is underpinned by two crucial agents:

Chat Agent: This is the user's primary interface. As you chat with the Chat Agent, it determines when to call upon its “comrade”, the Web Navigation Agent. Put simply, you can regard it as a normal chatbot armed with one massive and powerful plugin: the Web Navigation Agent, i.e., the Chrome extension you will need to use.
Web Navigation Agent: This Agent takes a query (e.g., "Search information about Elon Musk" or "Book a flight from HK to NYC departing today") as input and starts at a certain webpage (e.g., Google or Skyscanner). It then navigates the web as a human would and stops at a designated webpage. It will return the observation of the last webpage to the Chat Agent, and the Chat Agent will use the information it provides to answer the user’s question.

Here's a simple one-turn example corresponding to the flow chart at the beginning of this section (it can also be multi-turn, like a normal chat):

Imagine a movie night scenario: you want to watch movie M tonight, and you want to know people’s comments about it. You ask our Web Agent this question. The Chat Agent realizes that it needs to call the Web Navigation Agent, so it converts your question into a query, “Search comments about movie M”, and a start URL, “https://imdb.com”. A new IMDb window then opens, and the Web Navigation Agent begins to navigate the website with the query provided by the Chat Agent. It may type something into the search bar, click the “search” button, and... eventually finish on the comments page. After that, the Web Navigation Agent returns the observation of this page and its action history to the Chat Agent, which uses this information to answer your question. The chat can continue, and the Chat Agent will call the Web Navigation Agent whenever it needs assistance.
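The hand-off in this example can be sketched with two stub functions. Nothing here drives a real browser: `web_navigation_agent` returns a scripted action list and a placeholder observation, and the keyword check in `chat_agent` stands in for the LLM deciding to delegate. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class NavigationResult:
    observation: str                       # content of the final page
    actions: List[str] = field(default_factory=list)  # action history

def web_navigation_agent(query: str, start_url: str) -> NavigationResult:
    # Hypothetical stub: scripted actions instead of a real browser extension.
    actions = [
        f"open {start_url}",
        f"type '{query}' into search bar",
        "click search",
        "open comments page",
    ]
    # Placeholder text, not real review data.
    return NavigationResult("<text of the comments page>", actions)

def chat_agent(user_message: str) -> str:
    # Stand-in for the LLM deciding whether to delegate to web navigation.
    if "comments" in user_message.lower():
        result = web_navigation_agent(
            "Search comments about movie M", "https://imdb.com")
        return (f"After {len(result.actions)} browser actions, "
                f"I read: {result.observation}")
    return "I can answer that without browsing."

reply = chat_agent("What are people's comments about movie M?")
print(reply)
```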

Vision

The transformative potential of the Web Agent is boundless. Imagine a day when you effortlessly merge verbal commands with digital actions. A mere mention, like "Web Agent, handle my emails," and while you enjoy your morning coffee, the agent reads, sorts, and responds just as you would. Whether you seek news, a specific playlist, or a summary of work discussions, the Web Agent has got you covered. As we embrace the future, the Web Agent aims not just to be an assistant but a digital extension of oneself. No longer will the digital realm be daunting; with the Web Agent, it's all within arm's reach.

This feature, currently in its beta stage, encapsulates the spirit of actual, human-like, real-time browsing—a web navigation agent set to revolutionize our digital interactions.
