AgentBench

Benchmark your AI agent across 40 real-world tasks. Rule-based scoring with no LLM judges: three layers of automated evaluation covering file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency.

40
tasks
7
domains
3
scoring layers

Features

Per-Call Tracing

Every tool call is logged with millisecond timestamps, and a full trace.jsonl is written for every run.

Custom Benchmarks

/benchmark --custom runs any prompt you supply, scoring the agent's behavior rather than its output.

Rule-Based Scoring

No LLM judges: three layers of automated, deterministic evaluation.
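To illustrate what deterministic, rule-based scoring means (this is a sketch, not AgentBench's actual rule set), one layer can be expressed as plain predicate checks over a run's output directory. The rule names, the checked filename `analysis.md`, and the `score` helper below are all hypothetical:

```python
import os

# Hypothetical rules for a file-creation task: each rule is a
# (name, predicate) pair evaluated against the run's working directory.
# No model in the loop -- the same inputs always produce the same score.
RULES = [
    ("file_created", lambda d: os.path.exists(os.path.join(d, "analysis.md"))),
    ("non_empty",    lambda d: os.path.exists(os.path.join(d, "analysis.md"))
                               and os.path.getsize(os.path.join(d, "analysis.md")) > 0),
]

def score(run_dir):
    """Return per-rule pass/fail results and the fraction of rules passed."""
    results = {name: bool(check(run_dir)) for name, check in RULES}
    return results, sum(results.values()) / len(RULES)
```

Because every rule is a pure check on artifacts, reruns are reproducible and scores can be audited line by line.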

Trace every tool call. Measure every millisecond.

{"seq":1,"tool":"Read","target":"inputs/data.csv","status":"ok"}
{"seq":2,"tool":"Bash","target":"wc -l data.csv","status":"ok"}
{"seq":3,"tool":"Write","target":"analysis.md","status":"ok"}
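Since each trace line is one JSON object per tool call, a few lines of Python are enough to aggregate a run. The sketch below uses only the fields visible in the sample above (`tool`, `status`); any timing field name would be an assumption, so it counts calls and failures rather than latencies:

```python
import json
from collections import Counter

def summarize_trace(lines):
    """Aggregate trace.jsonl lines into per-tool call counts and a failure total."""
    calls = [json.loads(line) for line in lines if line.strip()]
    by_tool = Counter(call["tool"] for call in calls)
    failures = sum(1 for call in calls if call["status"] != "ok")
    return {"total": len(calls), "by_tool": dict(by_tool), "failures": failures}
```

Fed the three sample lines above, this reports 3 total calls, one each for Read, Bash, and Write, and zero failures.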

How it works

01
Add the marketplace
/plugin marketplace add agentbench/agentbench
02
Install the plugin
/plugin install agentbench
03
Run benchmarks
/benchmark
04
Submit your results
Upload results.json to the leaderboard