AgentBench
Benchmark your AI agent on 40 real-world tasks. Rule-based scoring, no LLM judges: three layers of automated evaluation covering file creation, research, data analysis, multi-step workflows, memory, error handling, and tool efficiency.
40
tasks
7
domains
3
scoring layers
Features
Per-Call Tracing
Every tool call logged with millisecond timestamps. Full trace.jsonl for every run.
Custom Benchmarks
/benchmark --custom: test any prompt. Scores behavior, not output.
Rule-Based Scoring
No LLM judges. 3 layers of automated, deterministic evaluation.
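A deterministic check is just a predicate over the run's artifacts, so no judge model is needed. The sketch below is illustrative only; the task name, file name, and rules are hypothetical, not AgentBench's actual rubric or layer structure:

```python
import os
import re

def score_file_task(workdir):
    """Hypothetical deterministic checks for a file-creation task:
    the target file exists and contains a Markdown heading.
    Returns the fraction of checks passed (0.0 to 1.0)."""
    path = os.path.join(workdir, "analysis.md")  # assumed target file
    checks = {"file_exists": os.path.isfile(path)}
    if checks["file_exists"]:
        with open(path) as f:
            checks["has_heading"] = bool(re.search(r"^# ", f.read(), re.M))
    else:
        checks["has_heading"] = False
    return sum(checks.values()) / len(checks)
```

Because every rule is a plain boolean over files on disk, the same run always produces the same score, which is the property that distinguishes this style of evaluation from LLM judging.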
Trace every tool call. Measure every millisecond.
{"seq":1,"tool":"Read","target":"inputs/data.csv","status":"ok"}
{"seq":2,"tool":"Bash","target":"wc -l data.csv","status":"ok"}
{"seq":3,"tool":"Write","target":"analysis.md","status":"ok"}
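Each trace line is a standalone JSON object, so a run can be summarized with a few lines of Python. A minimal sketch, assuming only the fields shown in the sample above (`seq`, `tool`, `target`, `status`):

```python
import json

def summarize_trace(path):
    """Tally per-tool call counts and non-ok calls from a trace.jsonl file."""
    counts, failures = {}, 0
    with open(path) as f:
        for line in f:
            call = json.loads(line)
            counts[call["tool"]] = counts.get(call["tool"], 0) + 1
            if call["status"] != "ok":
                failures += 1
    return counts, failures
```

Run against the three-line sample above, this returns `{"Read": 1, "Bash": 1, "Write": 1}` with zero failures.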
How it works
01
Add the marketplace
/plugin marketplace add agentbench/agentbench
02
Install the plugin
/plugin install agentbench
03
Run benchmarks
/benchmark
04
Submit your results
Upload results.json to the leaderboard