A comprehensive macOS benchmark for evaluating computer use agents.
676 tasks across 25 applications, deterministic rule-based evaluation,
fine-grained multi-checkpoint scoring, and support for 3 agent frameworks.
See the full live leaderboard →
MacAgentBench is a comprehensive macOS agent benchmark with:
- 676 tasks across 25 applications
- Deterministic rule-based evaluation with fine-grained multi-checkpoint scoring
- 3 agent frameworks (Baseline, Agent-S3, OpenClaw) and 16+ models evaluated
- Containerized execution — each task runs in an independent Docker container
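To make the multi-checkpoint idea concrete, here is a minimal sketch of how a deterministic, rule-based scorer with several checkpoints could be structured. It is not the repository's evaluator code: the `Checkpoint` class, the `score_task` function, and the state dictionary keys are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    """One rule-based check (hypothetical structure, not the repo's schema)."""
    name: str
    check: Callable[[Dict[str, object]], bool]  # deterministic function of observed state


def score_task(state: Dict[str, object], checkpoints: List[Checkpoint]) -> Dict[str, object]:
    """Evaluate every checkpoint against the final VM state.

    Returns per-checkpoint results, the fraction passed (fine-grained score),
    and a strict success flag that is True only if all checkpoints pass.
    """
    results = {cp.name: bool(cp.check(state)) for cp in checkpoints}
    passed = sum(results.values())
    total = len(checkpoints)
    return {
        "checkpoints": results,
        "fine_score": passed / total if total else 0.0,
        "success": total > 0 and passed == total,
    }


# Illustrative checkpoints for an assumed Reminders task.
example_checkpoints = [
    Checkpoint("reminder_exists", lambda s: s.get("title") == "Buy milk"),
    Checkpoint("in_correct_list", lambda s: s.get("list") == "Groceries"),
    Checkpoint("has_due_date", lambda s: s.get("due") is not None),
    Checkpoint("is_flagged", lambda s: s.get("flagged") is True),
]
```

With this shape, an agent that completes only part of a task still receives partial credit through `fine_score`, while `success` keeps the strict all-or-nothing signal that a Pass@1 number would be built on.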
| Framework | Best Model | Pass@1 |
|---|---|---|
| OpenClaw | Claude Opus 4.6 | 73.7% |
| Agent-S3 | Claude Opus 4.6 | 66.9% |
| Baseline | Claude Opus 4.6 | 39.2% |
Download the macOS VM image (~50GB):

    pip install huggingface_hub
    huggingface-cli download JetLM/OpenClaw-macOS --local-dir .

Install dependencies:

    pip install -r requirements.txt

Start the macOS Docker container:

    bash launcher/docker/simple_start.sh

Connect via VNC:

    vncviewer localhost:5901

- Configure your model API in run_example.sh
- Run:

      bash run_example.sh

For specific models with parallel dispatch, see the scripts in scripts/run_*.sh.
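The container can take a while to start, so it may help to wait until the VNC port accepts connections before launching a viewer or an evaluation run. The sketch below uses only the Python standard library. The helper name `wait_for_port` and the timeout values are assumptions for illustration, not part of the repository.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Poll a TCP port until it accepts a connection or the timeout expires.

    Returns True as soon as a connection succeeds, False if the deadline passes.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError (e.g. connection refused) on failure.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example usage (port 5901 is the VNC port from the steps above):
#   if wait_for_port("localhost", 5901, timeout=120.0):
#       print("VNC port is accepting connections")
```

An open port only shows that the VNC server is listening. The macOS guest may still be booting, so a visual check through the viewer is still worthwhile.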
MacAgentBench/
├── tasks/ # 676 task definitions (JSON)
│ ├── multi_app/ # 140 cross-application tasks
│ ├── new_reminders/ # Reminders app tasks
│ ├── ... # 25 application domains
├── mm_agents/ # Agent implementations
│ ├── agent.py # PromptAgent (GPT/Claude/Gemini)
│ ├── anthropic/ # Claude Computer Use agent
│ ├── qwen3vl_agent.py # Qwen3-VL agent
│ ├── guiowl_agent.py # GUI-Owl agent
│ ├── opencua/ # OpenCUA agent
│ ├── internvl_agent.py # InternVL / ScaleCUA agent
│ ├── uitars_agent.py # UI-TARS agent
│ └── openclaw_agent.py # OpenClaw framework agent
├── evaluators/ # Rule-based evaluation functions
├── controllers/ # macOS VM environment control
├── Agent-S3/ # Agent-S3 framework integration
├── parallel_dispatch.py # Dynamic task-level parallel dispatch
├── batch_run.py # Core evaluation runner
├── run_example.sh # Example evaluation script
└── scripts/ # Run scripts & metric computation
    ├── run_*.sh # Model-specific evaluation scripts
    ├── calc_metrics.py # Pass@1, Pass@k, and Pass^k computation
    ├── calc_fine_eval_table.py # Fine-grained evaluation
    ├── calc_skill_table.py # Skill coverage analysis
    └── calc_per_category.py # Per-category breakdown
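For readers unfamiliar with the metrics named above, the following sketch shows one common way to estimate Pass@k and Pass^k from repeated runs of a task. This README does not specify how `calc_metrics.py` computes them, so the formulas here are an assumption: Pass@k uses the standard combinatorial estimator for "at least one of k sampled attempts succeeds", and Pass^k uses the analogous estimator for "all k sampled attempts succeed". The function names are hypothetical.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts, drawn without replacement
    from n recorded attempts of which c succeeded, is a success."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k attempts, drawn without replacement from n
    recorded attempts of which c succeeded, are successes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def benchmark_mean(per_task: List[Tuple[int, int]], k: int, metric=pass_at_k) -> float:
    """Average a metric over tasks, where each task is an (n, c) pair."""
    return sum(metric(n, c, k) for n, c in per_task) / len(per_task)
```

With k = 1 both estimators reduce to the plain success rate c / n, which is why Pass@1 is the headline number. Pass^k is the stricter of the two for k > 1, since it rewards consistency across repeated attempts.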
We warmly welcome contributions! Here's how you can help:
- Add new models — Integrate and test new agent models
- Add new tasks — Submit macOS tasks that reflect real-world scenarios
- Improve evaluators — Write verification scripts for new task types
- Report issues — Open an Issue to discuss bugs or ideas
To contribute: fork the repo, make changes in a separate branch, and submit a Pull Request.
We thank the following projects:
If you have questions or would like to collaborate, please contact us at:
- Yikun Fu, Shanghai AI Laboratory 📧 fuyikun123456@163.com
- Bowen Fu, XJTU 📧 HappyBug@stu.xjtu.edu.cn
- Biqing Qi, Shanghai AI Laboratory 📧 qibiqing@pjlab.org.cn


