---
name: cua-cloud
description: Comprehensive guide for building Computer Use Agents with the CUA framework. This skill should be used when automating desktop applications, building vision-based agents, controlling virtual machines (Linux/Windows/macOS), or integrating computer-use models from Anthropic, OpenAI, or other providers. Covers the Computer SDK (click, type, scroll, screenshot), the Agent SDK (model configuration, composition), supported models, provider setup, and MCP integration.
---
# CUA Framework

## Overview

CUA ("koo-ah") is an open-source framework for building Computer Use Agents: AI systems that see, understand, and interact with desktop applications through vision and action. It supports Windows, Linux, and macOS automation.
Key capabilities:
- Vision-based UI automation via screenshot analysis
- Multi-platform desktop control (click, type, scroll, drag)
- 100+ LLM providers via LiteLLM integration
- Composed agents (grounding + planning models)
- Local and cloud execution options
## Installation

```bash
# Computer SDK - desktop control
pip install cua-computer

# Agent SDK - autonomous agents (quoted so the extras survive shell globbing)
pip install "cua-agent[all]"

# MCP Server (optional)
pip install cua-mcp-server
```
**CLI installation:**

```bash
# macOS/Linux
curl -LsSf https://cua.ai/cli/install.sh | sh
```

```powershell
# Windows
powershell -ExecutionPolicy ByPass -c "irm https://cua.ai/cli/install.ps1 | iex"
```
## Computer SDK

### Computer Class

```python
from computer import Computer
import os

os.environ["CUA_API_KEY"] = "sk_cua-api01_..."

computer = Computer(
    os_type="linux",        # "linux" | "macos" | "windows"
    provider_type="cloud",  # "cloud" | "docker" | "lume" | "windows_sandbox"
    name="sandbox-name"
)

try:
    await computer.run()
    # Use computer.interface methods here
finally:
    await computer.close()
```
### Interface Methods

**Screenshot:**

```python
screenshot = await computer.interface.screenshot()
```

**Mouse actions:**

```python
await computer.interface.left_click(x, y)      # Left click at coordinates
await computer.interface.right_click(x, y)     # Right click
await computer.interface.double_click(x, y)    # Double click
await computer.interface.move_cursor(x, y)     # Move cursor without clicking
await computer.interface.drag(x1, y1, x2, y2)  # Click and drag
```

**Keyboard actions:**

```python
await computer.interface.type_text("Hello!")  # Type text
await computer.interface.key_press("enter")   # Press a single key
await computer.interface.hotkey("ctrl", "c")  # Key combination
```

**Scrolling:**

```python
await computer.interface.scroll(direction, amount)  # Scroll up/down/left/right
```

**File operations:**

```python
content = await computer.interface.read_file("/path/to/file")
await computer.interface.write_file("/path/to/file", "content")
```

**Clipboard:**

```python
text = await computer.interface.get_clipboard()
await computer.interface.set_clipboard("text to copy")
```
### Supported Actions (Message Format)

**OpenAI-style:**

- `ClickAction` - button (left/right/wheel/back/forward), x, y coordinates
- `DoubleClickAction` - same parameters as click
- `DragAction` - start and end coordinates
- `KeyPressAction` - key name
- `MoveAction` - x, y coordinates
- `ScreenshotAction` - no parameters
- `ScrollAction` - direction and amount
- `TypeAction` - text string
- `WaitAction` - duration

**Anthropic-style:**

- `LeftMouseDownAction` - x, y coordinates
- `LeftMouseUpAction` - x, y coordinates
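These action types map naturally onto the interface methods above. A minimal dispatcher sketch in plain Python; the dict field names (`type`, `x`, `y`, `text`) are assumed from the field lists above rather than taken from the SDK, and `FakeInterface` is an illustrative stand-in for `computer.interface`:

```python
import asyncio

class FakeInterface:
    """Stand-in for computer.interface that records calls (illustrative only)."""
    def __init__(self):
        self.calls = []

    async def left_click(self, x, y):
        self.calls.append(("left_click", x, y))

    async def type_text(self, text):
        self.calls.append(("type_text", text))

async def dispatch(interface, action):
    # Field names here are assumptions, not confirmed by the SDK.
    if action["type"] == "click" and action.get("button", "left") == "left":
        await interface.left_click(action["x"], action["y"])
    elif action["type"] == "type":
        await interface.type_text(action["text"])
    else:
        raise ValueError(f"unsupported action: {action['type']}")

iface = FakeInterface()
asyncio.run(dispatch(iface, {"type": "click", "x": 100, "y": 200}))
asyncio.run(dispatch(iface, {"type": "type", "text": "Hello!"}))
print(iface.calls)  # [('left_click', 100, 200), ('type_text', 'Hello!')]
```

The real framework handles this translation internally; the sketch only shows the shape of the mapping.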
## Agent SDK

### ComputerAgent Class

```python
from agent import ComputerAgent

agent = ComputerAgent(
    model="anthropic/claude-sonnet-4-5-20250929",
    tools=[computer],
    max_trajectory_budget=5.0  # Cost limit in USD
)

messages = [{"role": "user", "content": "Open Firefox and go to google.com"}]

async for result in agent.run(messages):
    for item in result["output"]:
        if item["type"] == "message":
            print(item["content"][0]["text"])
```
### Response Structure

```python
{
    "output": [AgentMessage, ...],  # List of messages
    "usage": {
        "prompt_tokens": int,
        "completion_tokens": int,
        "total_tokens": int,
        "response_cost": float
    }
}
```
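Given that shape, text output and cost can be tallied in user code. A sketch assuming only the keys shown above; the `sample` result is fabricated for illustration:

```python
def summarize(result):
    """Collect assistant text and total cost from one agent result dict."""
    texts = []
    for item in result["output"]:
        if item.get("type") == "message":
            for part in item.get("content", []):
                if "text" in part:
                    texts.append(part["text"])
    usage = result.get("usage", {})
    return {"text": " ".join(texts), "cost": usage.get("response_cost", 0.0)}

# Fabricated sample result matching the structure above
sample = {
    "output": [{"type": "message", "content": [{"text": "Opened Firefox."}]}],
    "usage": {"prompt_tokens": 900, "completion_tokens": 40,
              "total_tokens": 940, "response_cost": 0.0123},
}
print(summarize(sample))  # {'text': 'Opened Firefox.', 'cost': 0.0123}
```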
**Message types:**

- `UserMessage` - input from user/system
- `AssistantMessage` - text output from agent
- `ReasoningMessage` - agent thinking/summary
- `ComputerCallMessage` - intent to perform an action
- `ComputerCallOutputMessage` - screenshot result
- `FunctionCallMessage` - Python tool invocation
- `FunctionCallOutputMessage` - function result
## Supported Models

### CUA VLM Router (Recommended)

```python
model="cua/anthropic/claude-sonnet-4.5"  # Recommended
model="cua/anthropic/claude-haiku-4.5"   # Faster, cheaper
```

Single API key, cost tracking, managed infrastructure.
### Anthropic (BYOK)

```python
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."

model="anthropic/claude-sonnet-4-5-20250929"
model="anthropic/claude-haiku-4-5-20251001"
model="anthropic/claude-opus-4-20250514"
model="anthropic/claude-3-7-sonnet-20250219"
```
### OpenAI (BYOK)

```python
os.environ["OPENAI_API_KEY"] = "sk-..."

model="openai/computer-use-preview"
```
### Google Gemini

```python
model="gemini-2.5-computer-use-preview-10-2025"
```
### Local Models

```python
model="huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"
model="ollama_chat/0000/ui-tars-1.5-7b"
```
### Composed Agents

Combine a grounding model with a planning model:

```python
model="huggingface-local/GTA1-7B+openai/gpt-4o"
model="moondream3+openai/gpt-4o"
model="omniparser+anthropic/claude-sonnet-4-5-20250929"
model="omniparser+ollama_chat/mistral-small3.2"
```

Grounding models: UI-TARS, GTA, Holo, Moondream, OmniParser, OpenCUA.
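In the `+` syntax, the grounding model (left of the `+`) locates UI elements while the planning model (right) decides what to do. A sketch of how such a spec could be split in user code; this illustrates the naming convention only, not the framework's actual parser:

```python
def split_composed(model_spec):
    """Split 'grounding+planning' into parts; single model specs pass through."""
    if "+" in model_spec:
        grounding, planning = model_spec.split("+", 1)
        return {"grounding": grounding, "planning": planning}
    return {"grounding": None, "planning": model_spec}

print(split_composed("omniparser+anthropic/claude-sonnet-4-5-20250929"))
print(split_composed("openai/computer-use-preview"))
```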
### Human-in-the-Loop

```python
model="human/human"  # Pause for user approval
```
## Provider Types

### Cloud (Recommended)

```python
computer = Computer(
    os_type="linux",  # linux, windows, macos
    provider_type="cloud",
    name="sandbox-name",
    api_key="sk_cua-api01_..."
)
```

Get an API key from cloud.trycua.com.
### Docker (Local)

```python
computer = Computer(
    os_type="linux",
    provider_type="docker"
)
```

Images: `trycua/cua-xfce:latest`, `trycua/cua-ubuntu:latest`
### Lume (macOS Local)

```python
computer = Computer(
    os_type="linux",
    provider_type="lume"
)
```

Requires the Lume CLI to be installed.
### Windows Sandbox

```python
computer = Computer(
    os_type="windows",
    provider_type="windows_sandbox"
)
```

Requires `pywinsandbox` and the Windows Sandbox feature enabled.
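Which local provider applies usually follows from the host platform. A small helper sketch; the mapping below is a reasonable default inferred from the sections above, not an official recommendation, and `default_provider` is a hypothetical name:

```python
import sys

def default_provider(platform=None):
    """Pick a local provider_type for the host platform; fall back to cloud."""
    platform = platform or sys.platform
    if platform == "darwin":
        return "lume"             # macOS local VMs via the Lume CLI
    if platform.startswith("win"):
        return "windows_sandbox"  # requires the Windows Sandbox feature
    if platform.startswith("linux"):
        return "docker"           # e.g. trycua/cua-ubuntu:latest
    return "cloud"                # anything else: use the hosted sandboxes

print(default_provider())
```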
## MCP Integration

This project uses the CUA MCP Server for Claude Code integration:

```json
{
  "mcpServers": {
    "cua": {
      "type": "http",
      "url": "https://cua-mcp-server.vercel.app/mcp"
    }
  }
}
```
### MCP Tools Available

**Sandbox management:**

- `mcp__cua__list_sandboxes` - list all sandboxes
- `mcp__cua__create_sandbox` - create a VM (os, size, region)
- `mcp__cua__start/stop/restart/delete_sandbox`

**Task execution:**

- `mcp__cua__run_task` - autonomous task execution
- `mcp__cua__describe_screen` - vision analysis without action
- `mcp__cua__get_task_history` - retrieve task results
## Best Practices

### Task Design

```python
# Good - specific and sequential
"Open Chrome, navigate to github.com, click the Sign In button"

# Avoid - vague
"Log into GitHub"
```
### Error Recovery

```python
async for result in agent.run(messages):
    if result.get("error"):
        # Take a screenshot to understand the current state
        screenshot = await computer.interface.screenshot()
        # Then retry with more specific instructions
```
### Resource Management

```python
try:
    await computer.run()
    # ... perform tasks
finally:
    await computer.close()  # Always clean up
```
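The try/finally pattern can be packaged as an async context manager so cleanup is never forgotten. A sketch; `StubComputer` is an illustrative stand-in for the real `Computer` class (which may well support `async with` directly):

```python
import asyncio
from contextlib import asynccontextmanager

class StubComputer:
    """Minimal stand-in for Computer with run()/close() (illustrative only)."""
    def __init__(self):
        self.state = "new"
    async def run(self):
        self.state = "running"
    async def close(self):
        self.state = "closed"

@asynccontextmanager
async def managed(computer):
    # Start the VM, yield it for use, and always close it afterwards.
    await computer.run()
    try:
        yield computer
    finally:
        await computer.close()

async def main():
    comp = StubComputer()
    async with managed(comp) as c:
        assert c.state == "running"  # VM is up inside the block
    return comp.state                # closed even if the block raised

print(asyncio.run(main()))  # closed
```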
### Cost Control

```python
agent = ComputerAgent(
    model="cua/anthropic/claude-sonnet-4.5",
    max_trajectory_budget=5.0  # Stop at $5 spent
)
```