About
AgentBench measures how well your AI agent performs across real-world tasks — file operations, research, data analysis, multi-step workflows, memory, error recovery, and tool usage.
40 tasks scored 0–100 across three layers: structural checks (40%), metrics (40%), and behavioral analysis (20%). All scoring is rule-based and automated — no LLM judges. Scores may vary ±3–5 points between runs due to non-deterministic agent execution. Composite scores roll up by domain and overall.
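The 40/40/20 rollup above is simple enough to sketch. A minimal illustration in Python; the dict layout and function name here are hypothetical, not AgentBench's actual output schema:

```python
# Weighted rollup of the three scoring layers described above
# (structural 40%, metrics 40%, behavioral 20%). The dict layout
# is illustrative only, not AgentBench's real result format.
WEIGHTS = {"structural": 0.40, "metrics": 0.40, "behavioral": 0.20}

def composite(layers):
    """Combine per-layer 0-100 scores into one 0-100 composite."""
    return round(sum(WEIGHTS[name] * score for name, score in layers.items()), 2)

task = {"structural": 90, "metrics": 75, "behavioral": 60}
print(composite(task))  # 0.4*90 + 0.4*75 + 0.2*60 = 78.0
```

Domain and overall scores then roll up the same way, by averaging task composites within each grouping.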
Two people using the same model can score 30 points apart based on their agent config alone.
This is not SWE-bench

SWE-bench tests code bug fixes, measures the model, uses pass/fail scoring, and holds the setup constant while swapping the model.

AgentBench tests general agent ability across 7 domains (40 tasks), measures your setup + config + prompts, uses 3-layer 0–100 scoring that is entirely rule-based, and holds the model constant while swapping the config.
vs. Academic Benchmarks
Academic agent benchmarks like THUDM/AgentBench test which model is best — they hold the setup constant and swap the LLM. Many rely on LLM-as-judge scoring, introducing subjectivity and non-reproducibility.
AgentBench tests which agent setup is best — same model, different configs, prompts, tools, and workflows. All scoring is rule-based — no LLM judges, no subjective grading. We recommend running 3x and averaging for a reliable baseline. Your score reflects how well you've configured your agent, not which API key you're using.
What's New
Per-tool-call tracing — every tool call is logged with millisecond timestamps into a trace.jsonl file. See exactly what your agent did, when, and how long each step took.
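Because trace.jsonl is JSON Lines (one JSON object per line), it is easy to post-process. A sketch that pulls out per-call durations; the field names ("tool", "start_ms", "end_ms") are guesses for illustration, so check a real trace for the actual schema:

```python
import json
from pathlib import Path

def summarize_trace(path):
    """Return (tool_name, duration_ms) per call from a trace.jsonl file.

    The field names used here ("tool", "start_ms", "end_ms") are
    assumptions; inspect your own trace.jsonl for the real schema.
    """
    calls = []
    for line in Path(path).read_text().splitlines():
        if line.strip():  # skip blank lines
            event = json.loads(line)  # JSON Lines: one object per line
            calls.append((event["tool"], event["end_ms"] - event["start_ms"]))
    return calls
```

Sorting the result by duration is a quick way to find which tool calls dominate a run.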
Custom prompt benchmarking — run /benchmark --custom "Your prompt" to test your setup on any prompt. Scores how your agent works (metrics + behavior), not what it produces.
Hookless metrics — tracing works on all platforms. No hooks required for per-call instrumentation.
Why a plugin, not a CLI?
The benchmark runner IS the thing being benchmarked.
AgentBench runs as a Claude Code plugin (or OpenClaw skill) — not as a standalone Python script. This is a deliberate choice. Your agent orchestrates the entire run: discovering tasks, managing context, spawning subagents, recovering from errors, and staying within limits. How well it handles all of that is part of the score.
A Python CLI would only test “can Claude write files when told to.” A plugin tests the full agent experience — the same experience you rely on daily.
Pros:
Tests your real agent workflow, not a simulation
Your CLAUDE.md, MCP servers, and custom tools all contribute to the score
Context management and error recovery are scored, not bypassed
Zero install friction — one command to add
Same environment you use for real work

Tradeoffs:
Scores can vary ±3–5 points between runs (agents aren't deterministic)
Long runs may hit context limits — run 3x and average for accuracy
Requires hooks to fire correctly for full scoring
Setup scripts assume a Unix environment (macOS/Linux)
Our recommendation: run the benchmark 3 times and take the average. This smooths out variance and gives you a reliable baseline. Then change your config and run again — the delta is what matters.
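The run-3x-and-diff workflow above is just averages and a subtraction. A sketch with made-up scores (none of these numbers come from a real run):

```python
from statistics import mean

# Composite scores from three runs of each config; the numbers are
# invented for illustration. Averaging smooths the +/-3-5 point
# run-to-run variance, and the delta is the signal to watch.
baseline_runs = [71.0, 76.0, 73.5]
tuned_runs = [80.0, 78.5, 82.0]

baseline = mean(baseline_runs)  # 73.5
tuned = mean(tuned_runs)        # about 80.2
delta = tuned - baseline        # the number that matters
print(f"baseline={baseline:.1f} tuned={tuned:.1f} delta={delta:+.1f}")
```

If the delta is inside the ±3–5 point noise band, treat the config change as inconclusive rather than an improvement.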