
GUI Agent Implementation Based on UI-TARS

Published:  at  16:00

Based on the UI-TARS multimodal vision model, combined with MCP (Model Context Protocol), this article explores building a next-generation cross-platform autonomous perception GUI Agent system. Consider this a starting point for discussion, and let’s explore the technology, scenarios, and future of GUI Agents together!

For the UI-TARS project, see http://github.com/bytedance/UI-TARS-desktop

Glossary

| Term | Explanation |
| --- | --- |
| GUI Agent | A GUI Agent (Graphical User Interface Agent) is an AI system driven by a multimodal vision model. It automatically reasons about and executes tasks that involve graphical interfaces on PCs, the Web, mobile apps, and more. It simulates human operations such as clicking, typing, dragging, and reading on-screen information, and so completes tasks requested by users. |
| UI-TARS | UI-TARS is a self-learning GUI Agent open-sourced by ByteDance. It is a next-generation native GUI agent model designed to interact with graphical user interfaces (GUIs) using human-like perception, reasoning, and operation capabilities. Unlike traditional modular frameworks, UI-TARS integrates all key components (perception, reasoning, reflection, and memory) into a single Vision-Language Model (VLM), enabling end-to-end task automation without predefined workflows or manual rules. For details, see "ByteDance’s Operator beats OpenAI to the punch? Free and open source, users say: saved $200!" |
| Computer Use | Computer Use is a capability first proposed by Anthropic, based on the Claude 3.5 Sonnet model. It allows AI to interact with a virtual machine desktop environment and perform operating-system-level tasks. |
| MCP | Model Context Protocol (MCP) is an open protocol that standardizes how applications provide context to LLMs. You can think of MCP as a USB-C port for AI applications: just as USB-C provides a standard way to connect devices to peripherals and accessories, MCP provides a standard way to connect AI models to different data sources and tools. For details, see "Practical Development of AI Agent Applications Based on MCP". |
| UI Agents | UI Agents use large models (VLM / LLM) to operate phones or computers automatically, simulating human behavior to complete specified tasks. |
| VLM | Vision-Language Models (VLMs) are models that can process the visual and language modalities at the same time. |
| MLLM | A Multimodal Large Language Model (MLLM) uses a powerful large language model (LLM) as the “brain” to perform multimodal tasks. MLLMs exhibit striking emergent capabilities, such as writing stories based on images and performing mathematical reasoning without OCR. |
| SSE | Server-Sent Events (SSE) is an HTTP-based technology that lets a server push data to clients in real time, in one direction only. When the server only needs to push data and does not need to receive data from the client, it is a simple and efficient alternative to WebSockets. |
| VNC | VNC (Virtual Network Computing) is a graphical desktop sharing system that uses the Remote Frame Buffer (RFB) protocol to control another computer remotely. It transmits keyboard and mouse input from one computer to another over the network and relays graphical screen updates back. |
| RPA | Robotic Process Automation (RPA) is a category of process automation software that automates rule-based routine operations by driving existing enterprise applications through their user interfaces. |
| HITL | Human-in-the-loop (HITL) refers to models that require interaction with humans. Human judgment is integrated into the automation process, enhancing the capabilities of the AI system. |

Background

Why do we need GUI Agents?

What is the essential way humans use electronic devices?

  1. Visual perception: observing and understanding the content on the screen through the eyes
  2. Finger operation: interacting with interfaces through gestures such as clicking and swiping
  3. Goal orientation: planning a series of operation steps based on task goals

Based on first principles, GUI Agents simulate the way humans use electronic devices, enabling truly native end-to-end general automation.

Demo

| Device Used | Instruction | Screen Recording |
| --- | --- | --- |
| Local computer (Computer Use) | Please help me open the autosave feature of VS Code and delay AutoSave operations for 500 milliseconds in the VSCode setting | Link |
| Local browser (Browser Use) | Could you help me check the latest open issue of the UI-TARS-Desktop project on Github? | Link |
| Remote virtual machine (Remote Computer) | Recognize the receipt content and organize it into Excel | Link |
| Remote browser (Remote Browser) | Order a Big Mac meal from McDonald’s and deliver it to Dinghao | Link |
| TV (TV Use) | Play episode 5 of The Lychee Road | Link |

For more showcases, see: https://seed-tars.com/showcase

Detailed Design

Overview

To build a GUI Agent system, three core components are needed:

  1. VLM (vision model): Responsible for understanding screen content and user instructions. Based on the user instruction + screenshot, it generates a natural language command (NL Command).

  2. Agent Operator: Based on the user instruction, it calls the model and invokes device capabilities through the MCP Client. Essentially, it is a workflow that decouples the logic of how the LLM obtains different contexts through the MCP architecture.

  3. Devices (external devices): Packaged and exposed as MCP Services. They can be PCs, mobile devices, virtual machines, Raspberry Pis, and so on. Any electronic device can be treated as a peripheral and integrated into the GUI Agent system.

Process Flow

The core GUI Agent process can roughly be divided into:

  1. Task perception: The system receives the user's natural-language instruction together with a screenshot, uses a multimodal model to parse them, and outputs an NLCommand (for example: Action: click(start_box='(529,46)')).

  2. Coordinate mapping: Convert the pixel coordinates perceived by the model into screen coordinates.

  3. Instruction conversion: Convert the parsed NLCommand into an executable Command. This involves converting the coordinate system from image coordinates to screen coordinates, preparing for subsequent execution.

  4. Command execution: Invoke MCP Services to execute the converted command.

Prerequisites

Agent Logic

At the Agent layer, a GUI Agent is mainly a loop. It pushes screenshots, model outputs, Actions, and so on to the client according to task execution status. Therefore, you only need to implement the following logic:

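import { GUIAgent } from '@ui-tars/sdk';
import { NutJSOperator } from '@ui-tars/operator-nut-js';

// `config` is assumed to be defined elsewhere and to hold your model
// endpoint (baseURL), API key, and model name.
const guiAgent = new GUIAgent({
  model: {
    baseURL: config.baseURL,
    apiKey: config.apiKey,
    model: config.model,
  },
  // The operator performs screenshots and actions on the local desktop.
  operator: new NutJSOperator(),
  // Called with status updates (screenshots, model outputs, actions) during the run.
  onData: ({ data }) => {
    console.log(data);
  },
  onError: ({ data, error }) => {
    console.error(error, data);
  },
});

await guiAgent.run('send "hello world" to x.com');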

The Operator can be replaced with any corresponding operation tool / framework, such as browser control (operator-browser), Android device control (operator-adb), and so on.
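To make the "mainly a loop" idea concrete, here is a minimal, self-contained sketch of what a loop of this shape might look like. It is not the @ui-tars/sdk implementation: `SketchOperator`, `SketchModel`, `LoopEvent`, `runLoop`, and the `maxSteps` limit are all hypothetical names introduced for illustration.

```typescript
// Hypothetical types for illustration only; these are NOT the @ui-tars/sdk API.
type Screenshot = { base64: string; width: number; height: number };

type LoopEvent =
  | { kind: 'screenshot'; step: number; screenshot: Screenshot }
  | { kind: 'prediction'; step: number; action: string }
  | { kind: 'end'; step: number; reason: 'finished' | 'max_steps' };

// Anything that can capture the screen and perform an action can act as an operator.
interface SketchOperator {
  screenshot(): Promise<Screenshot>;
  execute(action: string, shot: Screenshot): Promise<void>;
}

// A model maps (instruction, action history, screenshot) to the next action string.
type SketchModel = (
  instruction: string,
  history: string[],
  shot: Screenshot,
) => Promise<string>;

async function runLoop(
  instruction: string,
  model: SketchModel,
  operator: SketchOperator,
  onData: (event: LoopEvent) => void,
  maxSteps = 25, // assumed safety limit so the loop always terminates
): Promise<string[]> {
  const history: string[] = [];
  for (let step = 1; step <= maxSteps; step++) {
    // 1. Observe: capture the current screen and push it to the client.
    const shot = await operator.screenshot();
    onData({ kind: 'screenshot', step, screenshot: shot });

    // 2. Decide: ask the model for the next action.
    const action = await model(instruction, history, shot);
    onData({ kind: 'prediction', step, action });
    history.push(action);

    // 3. Stop when the model reports completion; otherwise act and repeat.
    if (action.startsWith('finished(')) {
      onData({ kind: 'end', step, reason: 'finished' });
      return history;
    }
    await operator.execute(action, shot);
  }
  onData({ kind: 'end', step: maxSteps, reason: 'max_steps' });
  return history;
}
```

Because the operator is passed in as a parameter, swapping desktop control for browser or Android control only means supplying a different object with the same two methods, which is the same idea the Operator abstraction expresses.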

Task Perception (Multimodal Model)

This is provided by the UI-TARS model. A System Prompt is defined, and given a “screenshot” and a “task instruction,” the model returns an operation tuple (NLCommand) in natural language. The benefit is that this decouples the model output from the operation instructions of any particular device.

Taking the System Prompt provided by the PC MCP Server as an example:

You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
```
Action_Summary: ...
Action: ...
```

## Action Space

click(start_box='[x1, y1, x2, y2]')
left_double(start_box='[x1, y1, x2, y2]')
right_single(start_box='[x1, y1, x2, y2]')
drag(start_box='[x1, y1, x2, y2]', end_box='[x3, y3, x4, y4]')
hotkey(key='')
type(content='') #If you want to submit your input, use "\n" at the end of `content`.
scroll(start_box='[x1, y1, x2, y2]', direction='down or up or right or left')
wait() #Sleep for 5s and take a screenshot to check for any changes.
finished()

## Note
- Use Chinese in `Action_Summary` part.

## User Instruction
{instruction}

Explanation of the model output fields:

Taking the task “open the Chrome browser” on the PC side as an example, the screenshot size is 2560 x 1440, and the model output is Action: left_double(start_box='(130,226)'):

Why does click in the System Prompt take a box with two corner points [x1, y1, x2, y2] instead of a single coordinate? Early UI-TARS multimodal models were not trained only for Computer Use scenarios. They were also trained for object detection, recognition, and understanding, and generated the corresponding boxes (x1, y1, x2, y2). For Computer Use scenarios, when the two corners coincide (x1=x2 and y1=y2), the box can be used directly as a single point. When they differ, the center point ((x1+x2)/2, (y1+y2)/2) is used.
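The box-to-point rule can be sketched in a few lines. This is an illustrative helper rather than code from UI-TARS; the name `boxToPoint` and the decision to accept both two-value points and four-value boxes are assumptions.

```typescript
type Point = { x: number; y: number };

// Hypothetical helper: accepts either a point [x, y] or a box [x1, y1, x2, y2]
// and returns the single point that the action should target.
function boxToPoint(coords: number[]): Point {
  if (coords.length === 2) {
    const [x, y] = coords;
    return { x, y };
  }
  if (coords.length === 4) {
    const [x1, y1, x2, y2] = coords;
    // A degenerate box (x1 === x2, y1 === y2) yields the point itself;
    // otherwise this is the center of the box.
    return { x: (x1 + x2) / 2, y: (y1 + y2) / 2 };
  }
  throw new Error(`Expected 2 or 4 coordinates, got ${coords.length}`);
}

console.log(boxToPoint([100, 200, 300, 400])); // { x: 200, y: 300 }
```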

Coordinate Mapping

Still using the “open Chrome” example above: how is the image-relative coordinate (130,226) calculated into the final absolute screen coordinate (332,325)?

Parameter descriptions:

Relative Coordinates and Absolute Coordinates

Positions on the screen are represented by X and Y Cartesian coordinates. The X coordinate starts at 0 on the left and increases to the right. Unlike in mathematics, the Y coordinate starts at 0 at the top and increases downward.

UI-TARS model coordinate system:
0,0       X increases -->
+---------------------------+
|                           | Y increases
| *(130, 226)               |     |
|   1000 x 1000 screen      |     |
|                           |     V
|                           |
|                           |
+---------------------------+ 999, 999

Relative coordinates: (130 / 1000, 226 / 1000) = (0.13, 0.226)

Mapped onto the actual screen coordinate system:
0,0       X increases -->
+---------------------------+
|                           | Y increases
| *(332, 325)               |     |
|   2560 x 1440 screen      |     |
|                           |     V
|                           |
|                           |
+---------------------------+ 2559, 1439
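The mapping for the “open Chrome” example can be written as a small function. This is a sketch under stated assumptions: the model reports coordinates on a 1000 x 1000 grid as in the diagram, and the result is rounded down, which is the rounding that reproduces (332, 325) for (130, 226) on a 2560 x 1440 screen. The name `modelToScreen` and the clamping to the visible area are illustrative choices.

```typescript
interface Point { x: number; y: number }
interface Size { width: number; height: number }

// Hypothetical helper: scale a point from the model's factor x factor grid
// to absolute screen pixels.
function modelToScreen(point: Point, screen: Size, factor = 1000): Point {
  const scale = (value: number, extent: number): number => {
    // relative = value / factor; absolute = relative * extent, rounded down
    const pixel = Math.floor((value * extent) / factor);
    // Clamp so the result always addresses a real pixel (0 .. extent - 1).
    return Math.min(Math.max(pixel, 0), extent - 1);
  };
  return { x: scale(point.x, screen.width), y: scale(point.y, screen.height) };
}

// 130 / 1000 * 2560 = 332.8 and 226 / 1000 * 1440 = 325.44
console.log(modelToScreen({ x: 130, y: 226 }, { width: 2560, height: 1440 }));
// { x: 332, y: 325 }
```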

Instruction Conversion

Different devices have specific operation instructions (also known as action spaces). The Operator for each device converts the corresponding NLCommand into that device’s operation instructions.

The currently supported action spaces are as follows:

# PC
PC = Enums[
    "hotkey",        # press a keyboard shortcut
    "type",          # type text on the keyboard
    "scroll",        # scroll the mouse wheel
    "drag",          # drag
    "click",         # left click
    "left_double",   # left double-click
    "right_single",  # right click
]

# Android phone
Mobile = Enums[
    "click",         # tap
    "scroll",        # swipe up, down, left, or right
    "type",          # input text
    "long_press",    # long press
    "KEY_HOME",      # return to Home
    "KEY_APPSELECT", # switch apps
    "KEY_BACK",      # go back
]
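As an illustration of what “converting an NLCommand into a device's operation instructions” can mean for the Mobile action space, the sketch below turns each action into arguments for `adb shell`. It uses standard Android `input` subcommands (`tap`, `swipe`, `text`, `keyevent`), but it is not how operator-adb is implemented; the `MobileAction` type, the function name, the swipe distance, and the durations are assumptions.

```typescript
// Hypothetical structured form of the Mobile action space listed above.
type MobileAction =
  | { type: 'click'; x: number; y: number }
  | { type: 'long_press'; x: number; y: number }
  | { type: 'type'; content: string }
  | { type: 'scroll'; x: number; y: number; direction: 'up' | 'down' | 'left' | 'right' }
  | { type: 'KEY_HOME' | 'KEY_BACK' | 'KEY_APPSELECT' };

const KEYCODES = {
  KEY_HOME: 'KEYCODE_HOME',
  KEY_BACK: 'KEYCODE_BACK',
  KEY_APPSELECT: 'KEYCODE_APP_SWITCH',
} as const;

// Returns the arguments to pass after `adb shell`.
function toAdbShellArgs(action: MobileAction, distance = 300): string[] {
  const px = (n: number): string => String(Math.round(n));
  switch (action.type) {
    case 'click':
      return ['input', 'tap', px(action.x), px(action.y)];
    case 'long_press':
      // A swipe that starts and ends on the same point, held for 1000 ms.
      return ['input', 'swipe', px(action.x), px(action.y), px(action.x), px(action.y), '1000'];
    case 'type':
      // `input text` expects spaces to be encoded as %s.
      return ['input', 'text', action.content.replace(/ /g, '%s')];
    case 'scroll': {
      // Assumption: "scroll down" reveals content further down, so the finger moves up.
      const dx = action.direction === 'left' ? distance : action.direction === 'right' ? -distance : 0;
      const dy = action.direction === 'up' ? distance : action.direction === 'down' ? -distance : 0;
      return ['input', 'swipe', px(action.x), px(action.y), px(action.x + dx), px(action.y + dy), '300'];
    }
    case 'KEY_HOME':
    case 'KEY_BACK':
    case 'KEY_APPSELECT':
      return ['input', 'keyevent', KEYCODES[action.type]];
  }
}

console.log(toAdbShellArgs({ type: 'click', x: 332, y: 325 }).join(' ')); // input tap 332 325
```

A PC operator would follow the same pattern with a different target, for example mapping `click` to a mouse library call instead of an `adb` command.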

Different device operations require the model to add corresponding training data so that it can better complete the corresponding tasks.

This step now has an SDK that can be used directly: @ui-tars/action-parser. See the test case for usage.
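For readers who want to see what such parsing involves, here is a minimal sketch that turns a model output string into a structured action. It is not the @ui-tars/action-parser implementation or its API; `ParsedAction`, `parseAction`, and `parseBox` are hypothetical names, and the sketch only handles the `name(key='value', ...)` shape shown in the System Prompt.

```typescript
interface ParsedAction {
  type: string;                  // e.g. "click", "left_double", "finished"
  args: Record<string, string>;  // raw argument strings, e.g. { start_box: "(130,226)" }
}

// Hypothetical parser for outputs such as:
//   Action_Summary: ...
//   Action: left_double(start_box='(130,226)')
function parseAction(output: string): ParsedAction {
  // Use the text after the last "Action:" label, if there is one.
  const idx = output.lastIndexOf('Action:');
  const call = (idx >= 0 ? output.slice(idx + 'Action:'.length) : output).trim();

  const match = /^(\w+)\(([\s\S]*)\)$/.exec(call);
  if (!match) {
    throw new Error(`Unrecognized action: ${call}`);
  }
  const [, type, rawArgs] = match;

  // Collect key='value' pairs; escape sequences such as \n are kept as written.
  const args: Record<string, string> = {};
  const argPattern = /(\w+)='((?:[^'\\]|\\.)*)'/g;
  let m: RegExpExecArray | null;
  while ((m = argPattern.exec(rawArgs)) !== null) {
    args[m[1]] = m[2];
  }
  return { type, args };
}

// Extracts the numbers from "(130,226)" or "[x1, y1, x2, y2]".
function parseBox(value: string): number[] {
  const numbers = value.match(/-?\d+(?:\.\d+)?/g);
  return numbers ? numbers.map(Number) : [];
}

const parsed = parseAction("Action: left_double(start_box='(130,226)')");
console.log(parsed.type, parseBox(parsed.args.start_box)); // left_double [ 130, 226 ]
```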

Command Execution

After obtaining the concrete command to execute, directly call the corresponding device MCP’s internal execCommand method. The flowchart for remote command execution on PC and mobile is as follows:

SDK (Developer Tools)

If implementing the above process feels cumbersome, you can use the UI-TARS SDK to implement it quickly:

MCP Servers

UI-TARS-related Operator tools can also be provided as MCP Servers:

Thoughts

The first thing that comes to mind when looking at the current vision-model-based GUI Agent solution is the evolution of Tesla FSD’s autonomous driving architecture.

Premise: Humans can operate a system UI by seeing it with their eyes (vision), so AI can also achieve this through vision. Condition for success: unlimited data + large-scale compute, until all edge cases are solved.

The benefits of vision-based GUI Agents in terms of technical evolution are:

The drawbacks are:

Application Scenarios

Agentic User Testing

Testing applications and products the way a human would operate them. For example, TestDriver is an asynchronous automated testing tool designed for GitHub. It can intelligently generate test cases and, by simulating real user behavior, provide broader test coverage than traditional selector-based frameworks. It supports functional testing for desktop applications, Chrome extensions, spelling and grammar, OAuth login, PDF generation, and more.

Computer Use can therefore be used for end-to-end functional verification, including checking layout integrity, element responsiveness, and visual consistency. The model can quickly identify interface issues, reducing the manual inspection workload. Inspection items include:

Scheduled Tasks

Based on the ChatGPT Tasks feature, we can implement requirements such as “automatically clock in every morning at 9:30.”
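If a hosted scheduler such as ChatGPT Tasks is not available, the same trigger could be approximated locally with a timer. The sketch below is that substitute, not the ChatGPT Tasks mechanism: it only computes how long to wait until the next occurrence of a given local time, and `msUntilNext` is a hypothetical helper name.

```typescript
// Hypothetical helper: milliseconds from `now` until the next hour:minute in local time.
function msUntilNext(hour: number, minute: number, now: Date = new Date()): number {
  const next = new Date(now.getTime());
  next.setHours(hour, minute, 0, 0);
  // If today's slot has already passed (or is exactly now), use tomorrow's.
  if (next.getTime() <= now.getTime()) {
    next.setDate(next.getDate() + 1);
  }
  return next.getTime() - now.getTime();
}

// At 09:00 local time, the next 09:30 is 30 minutes away.
console.log(msUntilNext(9, 30, new Date(2025, 0, 1, 9, 0, 0))); // 1800000
```

The returned delay could then be passed to `setTimeout` with a callback that starts the agent run, for example with an instruction such as "clock in".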

Consumer-Grade Applications

Computer Use is not yet ready for consumer-grade production deployment. Several problems need to be solved before it can be widely used:

For example, asking AI to send a red packet or buy a cup of coffee offers little advantage over doing it manually in the UI. I have not yet found a compelling deployment scenario; for now it resembles an “iOS Shortcuts”-style RPA approach. In the future, if Computer Use makes Agents the traffic entry point and integrates the ecosystem through MCP (food, entertainment, and lifestyle apps, README-style application manuals, and so on), the possibilities become far more interesting.

Future

Vision

Looking at the film Her, released in 2013, the sci-fi scenario of AI helping humans operate computers to complete tasks is gradually becoming reality.

A New Generation of Human-Computer Interaction Paradigm

Human-in-the-loop

When a GUI Agent cannot handle something and needs human help, it hands control back to the human. This is somewhat like the “safety driver” in “fully autonomous driving.” When AI capabilities are insufficient, humans step in, edge-case data is collected, and the AI is continuously iterated.

Bot-to-Bot Interaction

GUI Agent uses AI Coding to generate a Snake game

Q & A

Why do I need AI to operate my device for me?

I had this question at first too. Having AI order coffee is not as fast or reliable as tapping a few times myself.

In the long run:

Using the standards of the autonomous driving industry, GUI Agents can be classified as follows:

| Level | Name | Definition | Task Participation | Task Scenarios |
| --- | --- | --- | --- | --- |
| L0 | No automation | The task is fully controlled by humans, and the automated system performs no operations. | Human | All |
| L1 | Basic computer assistance | The computer provides certain assistive functions, such as automation tools or suggestions, but final decisions are still made by the user. | Human | Limited (for example: automatic spell check, simple data entry autofill) |
| L2 (current) | Computer-assisted execution (Copilot stage) | The computer can perform certain operations in specific tasks, but the user still needs to intervene or supervise. | Human (80%) + AI (20%) | Limited. For example: user adjustment is required |
| L3 | Partial automation (Agent stage) | The computer can independently execute tasks in more situations, but still requires user intervention in specific cases. | AI (50%) + Human (50%) | Limited |
| L4 | High automation | The computer can automatically handle most task scenarios, with the user only supervising in specific cases. | AI (80%) + Human (20%) | Limited |
| L5 | Full automation | The computer completes tasks fully autonomously, without user intervention or operation. | AI (100%) | All |

Difference from RPA?

After receiving a “task instruction,” a GUI Agent lists an action plan and performs the next round of thinking, planning, and operation based on “real-time screen changes.” It can actively explore unknown interfaces by trial and error; RPA, on the other hand, is mostly about fixed-process operations. This is a huge difference. For example, if a pop-up suddenly appears on the interface, a GUI Agent can handle it by clicking “Agree” or “Disagree.”

