About
AgentBench measures how well your AI agent performs across real-world tasks — file operations, research, data analysis, multi-step workflows, memory, error recovery, and tool usage.
40 tasks scored 0–100 across three layers: structural checks (40%), metrics (40%), and behavioral analysis (20%). All scoring is rule-based and automated — no LLM judges. Scores may vary ±3–5 points between runs due to non-deterministic agent execution. Composite scores roll up by domain and overall.
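The 40/40/20 rollup above is simple enough to sketch. A minimal illustration in Python; the dict layout and function name here are hypothetical, not AgentBench's actual output schema:

```python
# Weighted rollup of the three scoring layers described above
# (structural 40%, metrics 40%, behavioral 20%). The dict layout
# is illustrative only, not AgentBench's real result format.
WEIGHTS = {"structural": 0.40, "metrics": 0.40, "behavioral": 0.20}

def composite(layers):
    """Combine per-layer 0-100 scores into one 0-100 composite."""
    return round(sum(WEIGHTS[name] * score for name, score in layers.items()), 2)

task = {"structural": 90, "metrics": 75, "behavioral": 60}
print(composite(task))  # 0.4*90 + 0.4*75 + 0.2*60 = 78.0
```

Domain and overall scores then roll up the same way, by averaging task composites within each grouping.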
Two people using the same model can score 30 points apart based on their agent config alone.
This is not SWE-bench

SWE-bench tests code bug fixes, measures the model, uses pass/fail scoring, and holds the setup constant while swapping the model.

AgentBench tests general agent ability across 7 domains (40 tasks), measures your setup + config + prompts, uses 3-layer 0–100 scoring that is entirely rule-based, and holds the model constant while swapping the config.
vs. Academic Benchmarks
Academic agent benchmarks like THUDM/AgentBench test which model is best — they hold the setup constant and swap the LLM. Many rely on LLM-as-judge scoring, introducing subjectivity and non-reproducibility.
AgentBench tests which agent setup is best — same model, different configs, prompts, tools, and workflows. All scoring is rule-based — no LLM judges, no subjective grading. We recommend running 3x and averaging for a reliable baseline. Your score reflects how well you've configured your agent, not which API key you're using.
What's New
Per-tool-call tracing — every tool call is logged with millisecond timestamps into a trace.jsonl file. See exactly what your agent did, when, and how long each step took.
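Because trace.jsonl is JSON Lines (one JSON object per line), it is easy to post-process. A sketch that pulls out per-call durations; the field names ("tool", "start_ms", "end_ms") are guesses for illustration, so check a real trace for the actual schema:

```python
import json
from pathlib import Path

def summarize_trace(path):
    """Return (tool_name, duration_ms) per call from a trace.jsonl file.

    The field names used here ("tool", "start_ms", "end_ms") are
    assumptions; inspect your own trace.jsonl for the real schema.
    """
    calls = []
    for line in Path(path).read_text().splitlines():
        if line.strip():  # skip blank lines
            event = json.loads(line)  # JSON Lines: one object per line
            calls.append((event["tool"], event["end_ms"] - event["start_ms"]))
    return calls
```

Sorting the result by duration is a quick way to find which tool calls dominate a run.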
Custom prompt benchmarking — run /benchmark --custom "Your prompt" to test your setup on any prompt. Scores how your agent works (metrics + behavior), not what it produces.
Hookless metrics — tracing works on all platforms. No hooks required for per-call instrumentation.
Why a plugin, not a CLI?
The benchmark runner IS the thing being benchmarked.
AgentBench runs as a Claude Code plugin (or OpenClaw skill) — not as a standalone Python script. This is a deliberate choice. Your agent orchestrates the entire run: discovering tasks, managing context, spawning subagents, recovering from errors, and staying within limits. How well it handles all of that is part of the score.
A Python CLI would only test “can Claude write files when told to.” A plugin tests the full agent experience — the same experience you rely on daily.
Pros:
Tests your real agent workflow, not a simulation
Your CLAUDE.md, MCP servers, and custom tools all contribute to the score
Context management and error recovery are scored, not bypassed
Zero install friction — one command to add
Same environment you use for real work

Tradeoffs:
Scores can vary ±3–5 points between runs (agents aren't deterministic)
Long runs may hit context limits — run 3x and average for accuracy
Requires hooks to fire correctly for full scoring
Setup scripts assume a Unix environment (macOS/Linux)
Our recommendation: run the benchmark 3 times and take the average. This smooths out variance and gives you a reliable baseline. Then change your config and run again — the delta is what matters.
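The run-3x-and-diff workflow above is just averages and a subtraction. A sketch with made-up scores (none of these numbers come from a real run):

```python
from statistics import mean

# Composite scores from three runs of each config; the numbers are
# invented for illustration. Averaging smooths the +/-3-5 point
# run-to-run variance, and the delta is the signal to watch.
baseline_runs = [71.0, 76.0, 73.5]
tuned_runs = [80.0, 78.5, 82.0]

baseline = mean(baseline_runs)  # 73.5
tuned = mean(tuned_runs)        # about 80.2
delta = tuned - baseline        # the number that matters
print(f"baseline={baseline:.1f} tuned={tuned:.1f} delta={delta:+.1f}")
```

If the delta is inside the ±3–5 point noise band, treat the config change as inconclusive rather than an improvement.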