Sunbelt Computer Software

MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

A comprehensive macOS benchmark for evaluating computer-use agents. It provides
676 tasks across 25 applications, deterministic rule-based evaluation with
fine-grained multi-checkpoint scoring, and support for 3 agent frameworks.

🏆 Live Leaderboard Snapshot

See the full live leaderboard →

🔎 Overview

MacAgentBench is a comprehensive macOS agent benchmark with:

676 tasks across 25 applications
Deterministic rule-based evaluation with fine-grained multi-checkpoint scoring
3 agent frameworks (Baseline, Agent-S3, OpenClaw) and 16+ models evaluated
Containerized execution — each task runs in an independent Docker container
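
To make "multi-checkpoint scoring" concrete, here is a minimal sketch of how a deterministic, rule-based scorer with partial credit could look. It is not the benchmark's actual evaluator interface: `Checkpoint`, `score_task`, and the shape of the `state` dictionary are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation of the final app state after an agent run.
State = Dict[str, object]


@dataclass
class Checkpoint:
    name: str
    check: Callable[[State], bool]  # deterministic rule over the final state


def score_task(state: State, checkpoints: List[Checkpoint]) -> dict:
    """Return per-checkpoint results, a partial score, and a strict pass flag."""
    results = {cp.name: bool(cp.check(state)) for cp in checkpoints}
    passed = sum(results.values())
    return {
        "checkpoints": results,
        "partial_score": passed / len(checkpoints) if checkpoints else 0.0,
        # A task only counts as solved when every checkpoint holds.
        "success": bool(checkpoints) and passed == len(checkpoints),
    }


# Illustrative task: create a reminder titled "Buy milk" in the "Home" list.
checkpoints = [
    Checkpoint("reminder_exists", lambda s: "Buy milk" in s.get("reminders", {})),
    Checkpoint("in_home_list", lambda s: s.get("reminders", {}).get("Buy milk") == "Home"),
]

# The agent created the reminder but left it in the wrong list.
state = {"reminders": {"Buy milk": "Inbox"}}
result = score_task(state, checkpoints)
print(result)
```

Here the agent earns partial credit for the first checkpoint while the task as a whole still counts as failed, which is the distinction fine-grained scoring is meant to surface.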

📊 Key Results

Framework	Best Model	Pass@1
OpenClaw	Claude Opus 4.6	73.7%
Agent-S3	Claude Opus 4.6	66.9%
Baseline	Claude Opus 4.6	39.2%

🚀 Quick Start

1. Set Up the Environment

Download the macOS VM image (~50 GB):

pip install huggingface_hub
huggingface-cli download JetLM/OpenClaw-macOS --local-dir .

Install dependencies:

pip install -r requirements.txt

Start the macOS Docker container:

bash launcher/docker/simple_start.sh

Connect via VNC:

vncviewer localhost:5901
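
The VM may take a while to boot before the VNC port accepts connections. The helper below is a small sketch for polling the port before launching a viewer or an evaluation run. The function name `wait_for_port` and the timeout values are assumptions for illustration, not part of the repository.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 120.0, interval: float = 2.0) -> bool:
    """Poll a TCP port until it accepts a connection or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For the setup above, `wait_for_port("localhost", 5901)` would return `True` once the container's VNC server is reachable. It only checks that the port is open, not that macOS has finished booting.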

2. Run Evaluation

Configure your model API in run_example.sh
Run:

bash run_example.sh

For model-specific runs with parallel dispatch, see the scripts matching scripts/run_*.sh.
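
As a rough illustration of task-level parallel dispatch, the sketch below runs tasks on a fixed-size worker pool, so each worker picks up the next pending task as soon as it is free. It does not reproduce `parallel_dispatch.py`: `dispatch`, `fake_run_task`, and the task IDs are hypothetical, and the stub stands in for starting a container, running the agent, and evaluating the result.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def dispatch(
    task_ids: Iterable[str],
    run_task: Callable[[str], float],
    max_workers: int = 4,
) -> Dict[str, float]:
    """Run tasks on a fixed pool; each free worker pulls the next pending task."""
    scores: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, tid): tid for tid in task_ids}
        for fut in as_completed(futures):
            tid = futures[fut]
            try:
                scores[tid] = fut.result()
            except Exception:
                # A crashed task is recorded as a failure instead of aborting the run.
                scores[tid] = 0.0
    return scores


def fake_run_task(task_id: str) -> float:
    # Stand-in for "start a container, run the agent, evaluate" (hypothetical).
    if task_id == "broken":
        raise RuntimeError("container failed to start")
    return 1.0


scores = dispatch(["notes_001", "reminders_002", "broken"], fake_run_task, max_workers=2)
print(scores)
```

Scheduling per task rather than per pre-assigned batch keeps workers busy when task durations vary widely, which is the usual motivation for dynamic dispatch.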

Supported Model Types

📁 Project Structure

MacAgentBench/
├── tasks/                   # 676 task definitions (JSON)
│   ├── multi_app/           # 140 cross-application tasks
│   ├── new_reminders/       # Reminders app tasks
│   └── ...                  # 25 application domains
├── mm_agents/               # Agent implementations
│   ├── agent.py             # PromptAgent (GPT/Claude/Gemini)
│   ├── anthropic/           # Claude Computer Use agent
│   ├── qwen3vl_agent.py     # Qwen3-VL agent
│   ├── guiowl_agent.py      # GUI-Owl agent
│   ├── opencua/             # OpenCUA agent
│   ├── internvl_agent.py    # InternVL / ScaleCUA agent
│   ├── uitars_agent.py      # UI-TARS agent
│   └── openclaw_agent.py    # OpenClaw framework agent
├── evaluators/              # Rule-based evaluation functions
├── controllers/             # macOS VM environment control
├── Agent-S3/                # Agent-S3 framework integration
├── parallel_dispatch.py     # Dynamic task-level parallel dispatch
├── batch_run.py             # Core evaluation runner
├── run_example.sh           # Example evaluation script
└── scripts/                 # Run scripts & metric computation
    ├── run_*.sh             # Model-specific evaluation scripts
    ├── calc_metrics.py      # Pass@1 / Pass@k / Pass^k computation
    ├── calc_fine_eval_table.py  # Fine-grained evaluation
    ├── calc_skill_table.py  # Skill coverage analysis
    └── calc_per_category.py # Per-category breakdown
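
The tree mentions Pass@1, Pass@k, and Pass^k. The sketch below shows the commonly used combinatorial estimators for these metrics, given n attempts on a task of which c succeeded. Whether `calc_metrics.py` uses exactly these formulas is an assumption; the function names `pass_at_k` and `pass_hat_k` are illustrative.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 0 < k <= n):
        raise ValueError("require 0 <= c <= n and 0 < k <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts drawn from n (c correct) succeeds."""
    _validate(n, c, k)
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k attempts drawn from n (c correct) succeed (Pass^k)."""
    _validate(n, c, k)
    return comb(c, k) / comb(n, k)


# Example: 4 attempts on one task, 2 of them successful.
print(pass_at_k(4, 2, 1))   # Pass@1 equals the plain success rate, 0.5
print(pass_at_k(4, 2, 2))   # at least one success in 2 draws
print(pass_hat_k(4, 2, 2))  # both of 2 draws succeed
```

Pass@k rewards an agent that can solve a task at least once, while Pass^k measures consistency across repeated attempts. Benchmark-level numbers are typically the mean of these per-task values.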

🙌 Contribution Guide

We warmly welcome contributions! Here's how you can help:

Add new models — Integrate and test new agent models
Add new tasks — Submit macOS tasks that reflect real-world scenarios
Improve evaluators — Write verification scripts for new task types
Report issues — Open an Issue to discuss bugs or ideas

To contribute: fork the repo, make changes in a separate branch, and submit a Pull Request.

❤ Acknowledgments

We thank the following projects:

📬 Contact

If you have questions or would like to collaborate, please contact us at:

Yikun Fu, Shanghai AI Laboratory 📧 fuyikun123456@163.com
Bowen Fu, XJTU 📧 HappyBug@stu.xjtu.edu.cn
Zhenyu Wu 📧 zywu01@sjtu.edu.cn
Kaiyan Zhang 📧 zhang-ky22@mails.tsinghua.edu.cn
Biqing Qi, Shanghai AI Laboratory 📧 qibiqing@pjlab.org.cn

Model Type	Examples
`gpt`	GPT-5.4, Gemini 3.1 Pro
`claude`	Claude Opus 4.6
`qwen3vl`	Qwen3-VL-8B/32B
`InternVL`	InternVL3.5-8B/14B
`scalecua`	ScaleCUA-7B/32B
`uitars`	UI-TARS-7B/72B
`guiowl`	GUI-Owl-1.5-8B/32B
`OpenCUA`	OpenCUA-7B/32B
`openclaw`	Any model via OpenClaw framework
