GitHub - JetAstra/MacAgentBench: MacAgentBench: Benchmark agents where they actually work — on macOS. · GitHub

MacAgentBench: Benchmarking AI Agents on Real-World macOS Desktop

A comprehensive macOS benchmark for evaluating computer use agents.
676 tasks across 25 applications, deterministic rule-based evaluation,
fine-grained multi-checkpoint scoring, and support for 3 agent frameworks.


🔎 Overview

MacAgentBench is a comprehensive macOS agent benchmark with:

  • 676 tasks across 25 applications
  • Deterministic rule-based evaluation with fine-grained multi-checkpoint scoring
  • 3 agent frameworks (Baseline, Agent-S3, OpenClaw) and 16+ models evaluated
  • Containerized execution — each task runs in an independent Docker container
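
To make "deterministic rule-based evaluation with fine-grained multi-checkpoint scoring" concrete, here is a minimal sketch of how such a scorer could look. The `Checkpoint` type, the `score_task` function, the state dictionary, and the example Reminders task are all hypothetical and are not taken from the repository's `evaluators/` code; the sketch only assumes that each checkpoint is a pure rule over the final system state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration; not the repository's evaluator API.
State = Dict[str, object]


@dataclass
class Checkpoint:
    name: str
    check: Callable[[State], bool]  # deterministic rule over the final state


def score_task(checkpoints: List[Checkpoint], state: State) -> Dict[str, object]:
    """Run every rule and report per-checkpoint results plus two aggregates."""
    results = {cp.name: bool(cp.check(state)) for cp in checkpoints}
    passed = sum(results.values())
    return {
        "checkpoints": results,
        # Fraction of checkpoints satisfied (fine-grained credit).
        "partial_score": passed / len(checkpoints) if checkpoints else 0.0,
        # Strict success: every checkpoint must hold.
        "success": bool(checkpoints) and passed == len(checkpoints),
    }


# Imagined task: create a reminder "Buy milk" in the "Groceries" list.
checkpoints = [
    Checkpoint("reminder_exists", lambda s: "Buy milk" in s.get("reminders", [])),
    Checkpoint("list_is_groceries", lambda s: s.get("list") == "Groceries"),
]
final_state = {"reminders": ["Buy milk"], "list": "Inbox"}
report = score_task(checkpoints, final_state)
```

In this example the agent created the reminder but put it in the wrong list, so it earns partial credit on one checkpoint while the task as a whole does not count as a success.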

📊 Key Results

| Framework | Best Model | Pass@1 |
|-----------|-----------------|--------|
| OpenClaw | Claude Opus 4.6 | 73.7% |
| Agent-S3 | Claude Opus 4.6 | 66.9% |
| Baseline | Claude Opus 4.6 | 39.2% |

🚀 Quick Start

1. Set Up the Environment

Download the macOS VM image (about 50 GB):

pip install huggingface_hub
huggingface-cli download JetLM/OpenClaw-macOS --local-dir .

Install dependencies:

pip install -r requirements.txt

Start the macOS Docker container:

bash launcher/docker/simple_start.sh

Connect via VNC:

vncviewer localhost:5901
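
The VM can take a while to boot after the container starts, so the VNC port may not accept connections immediately. The helper below is a small sketch for polling the port before opening a viewer. `wait_for_port` is a hypothetical name, not part of the repository, and the default timeout and interval are arbitrary choices.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

Calling `wait_for_port("localhost", 5901, timeout=120)` before launching `vncviewer` avoids a failed first connection. A successful TCP connect only shows that the port is open; it does not guarantee that the macOS desktop has finished loading.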

2. Run Evaluation

  1. Configure your model API in run_example.sh
  2. Run:
bash run_example.sh

For specific models with parallel dispatch, see scripts in scripts/run_*.sh.
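
As a rough illustration of task-level parallel dispatch, the sketch below runs tasks on a fixed worker pool so that each free worker picks up the next pending task. It is not the logic of `parallel_dispatch.py`, which is not shown here; `dispatch`, `fake_run_task`, and the task IDs are hypothetical, and a real runner would start a container and call the evaluator instead of returning a constant.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, Iterable


def dispatch(
    task_ids: Iterable[str],
    run_task: Callable[[str], float],
    workers: int = 4,
) -> Dict[str, float]:
    """Run tasks on a fixed pool; idle workers take the next queued task."""
    results: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_task, tid): tid for tid in task_ids}
        for fut in as_completed(futures):
            tid = futures[fut]
            try:
                results[tid] = fut.result()
            except Exception:
                # One failing task should not stop the rest of the run.
                results[tid] = 0.0
    return results


def fake_run_task(task_id: str) -> float:
    # Stand-in for launching a container and evaluating one task.
    if task_id == "broken":
        raise RuntimeError("simulated failure")
    return 1.0


scores = dispatch(["notes_001", "mail_002", "broken"], fake_run_task, workers=2)
```

Dispatching at the task level, rather than splitting the task list into fixed shards up front, keeps workers busy when task durations vary widely.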

Supported Model Types

| Model Type | Examples |
|------------|----------|
| gpt | GPT-5.4, Gemini 3.1 Pro |
| claude | Claude Opus 4.6 |
| qwen3vl | Qwen3-VL-8B/32B |
| InternVL | InternVL3.5-8B/14B |
| scalecua | ScaleCUA-7B/32B |
| uitars | UI-TARS-7B/72B |
| guiowl | GUI-Owl-1.5-8B/32B |
| OpenCUA | OpenCUA-7B/32B |
| openclaw | Any model via OpenClaw framework |

📁 Project Structure

MacAgentBench/
├── tasks/                   # 676 task definitions (JSON)
│   ├── multi_app/           # 140 cross-application tasks
│   ├── new_reminders/       # Reminders app tasks
│   ├── ...                  # 25 application domains
├── mm_agents/               # Agent implementations
│   ├── agent.py             # PromptAgent (GPT/Claude/Gemini)
│   ├── anthropic/           # Claude Computer Use agent
│   ├── qwen3vl_agent.py     # Qwen3-VL agent
│   ├── guiowl_agent.py      # GUI-Owl agent
│   ├── opencua/             # OpenCUA agent
│   ├── internvl_agent.py    # InternVL / ScaleCUA agent
│   ├── uitars_agent.py      # UI-TARS agent
│   └── openclaw_agent.py    # OpenClaw framework agent
├── evaluators/              # Rule-based evaluation functions
├── controllers/             # macOS VM environment control
├── Agent-S3/                # Agent-S3 framework integration
├── parallel_dispatch.py     # Dynamic task-level parallel dispatch
├── batch_run.py             # Core evaluation runner
├── run_example.sh           # Example evaluation script
└── scripts/                 # Run scripts & metric computation
    ├── run_*.sh             # Model-specific evaluation scripts
    ├── calc_metrics.py      # Pass@1, Pass@k, and Pass^k computation
    ├── calc_fine_eval_table.py  # Fine-grained evaluation
    ├── calc_skill_table.py  # Skill coverage analysis
    └── calc_per_category.py # Per-category breakdown
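
The README does not define the metrics that `calc_metrics.py` computes. The sketch below assumes the commonly used definitions: Pass@k is the probability that at least one of k runs sampled from n attempts succeeds, and Pass^k is the probability that all k sampled runs succeed. The function names are hypothetical and the repository's script may compute these differently.

```python
from math import comb


def _validate(n: int, c: int, k: int) -> None:
    if not (0 < k <= n and 0 <= c <= n):
        raise ValueError("require 0 < k <= n and 0 <= c <= n")


def pass_at_k(n: int, c: int, k: int) -> float:
    """P(at least one success) when drawing k of n runs, c of which succeeded."""
    _validate(n, c, k)
    # comb(n - c, k) is 0 when fewer than k runs failed, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """P(all successes) when drawing k of n runs, c of which succeeded."""
    _validate(n, c, k)
    # comb(c, k) is 0 when fewer than k runs succeeded, giving 0.0.
    return comb(c, k) / comb(n, k)


# One task, 4 runs, 2 of them successful.
p_at_1 = pass_at_k(4, 2, 1)
p_at_2 = pass_at_k(4, 2, 2)
p_hat_2 = pass_hat_k(4, 2, 2)
```

For k = 1 both metrics reduce to the plain success rate c / n. A benchmark-level number would then be the mean of the per-task values across all tasks.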

🙌 Contribution Guide

We warmly welcome contributions! Here's how you can help:

  • Add new models — Integrate and test new agent models
  • Add new tasks — Submit macOS tasks that reflect real-world scenarios
  • Improve evaluators — Write verification scripts for new task types
  • Report issues — Open an Issue to discuss bugs or ideas

To contribute: fork the repo, make changes in a separate branch, and submit a Pull Request.

❤ Acknowledgments

We thank the following projects:

📬 Contact

If you have questions or would like to collaborate, please contact us at:
