Benchmarks
40 real-world tasks across 7 domains — hard and expert difficulty only. Rule-based scoring — no LLM judges. All three layers are automated. Includes interactive task types: SQL validation, pytest verification, git command checks, and mock API integration. Expert tasks use real-mode git repos with commit history and embedded scenarios.
How scoring works
Every task is scored 0–100 across three layers — all rule-based and automated. The layers are weighted and combined into a single composite score per task, then rolled up by domain and overall.
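The weighted roll-up can be sketched as follows, using the 40/40/20 layer split stated in the scoring methodology; the function name and example scores are illustrative, not part of the benchmark's actual code:

```python
def composite_score(l0: float, l1: float, l2: float) -> float:
    """Combine the three layer scores (each 0-100) into one 0-100 composite.

    Weights follow the stated split: L0 structural 40%,
    L1 metrics 40%, L2 behavioral 20%.
    """
    return 0.40 * l0 + 0.40 * l1 + 0.20 * l2

# Illustrative task scoring 90/75/60 across the three layers
print(composite_score(90, 75, 60))  # → 78.0
```

Domain and overall scores would then be averages of these per-task composites.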
Domains & Tasks
File Creation
4 tasks (4 hard). Can your agent produce well-structured documents, scaffolds, and migration scripts?
Convert a complex YAML configuration to TOML format, preserving nested structures, comments, and validation rules.
Read a migration guide and legacy JSON config, then write a Python script that transforms it into per-environment YAML files.
Extract structured data from log files using regex patterns, producing clean output with timestamps, levels, and messages.
Refactor 5 isolated skill files into an interconnected graph: add frontmatter, cross-references, and a graph.json manifest without losing original content.
Research
4 tasks (4 hard). Can your agent read, synthesize, and extract structured information from multiple sources?
Review a codebase for security vulnerabilities, bugs, and code quality issues. Produce a structured report with severity levels.
Trace a skipped test back through 15 commits of git history using log, blame, and diff to find the real regression cause.
Pull attendees, decisions, action items, and discussion points from an unstructured meeting transcript.
Read multiple documents with conflicting stakeholder perspectives and synthesize them into a coherent, prioritized requirements doc.
Data Analysis
6 tasks (3 hard, 3 expert). Can your agent find patterns, anomalies, and discrepancies across complex data?
Cross-reference two CSV datasets (40 products, 60 orders) to find missing items, unordered inventory, and stock overflows.
Reconcile financial data across CSV, JSON, and semi-structured text with 15 deliberate discrepancies.
Write and execute SQL queries against a SQLite database to answer business questions. Validated by running queries and checking results.
Multi-stage pipeline: ingest from 3 formats, clean, normalize, merge, and produce a validated output dataset.
Find a non-obvious pattern buried in a dataset that only emerges when combining multiple columns and applying statistical analysis.
Analyze 5 microservice logs with topology and SLA definitions to reconstruct a cascading failure timeline and find root cause.
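The SQL task above is validated by executing the agent's query and comparing results against ground truth. A minimal sketch of that check, with a hypothetical schema and expected answer (the function name and table are illustrative):

```python
import sqlite3

def validate_query(db_path: str, query: str, expected_rows: list) -> bool:
    """Run the agent's SQL against the database and compare to ground truth."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(query).fetchall()
    finally:
        conn.close()
    return rows == expected_rows
```

A validator like this is deterministic: the query either reproduces the expected rows or it does not, so no judge model is needed.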
Multi-Step
9 tasks (3 hard, 6 expert). Can your agent chain multiple actions into coherent pipelines and repo operations?
5-step pipeline: deduplicate 80 rows, clean text, categorize sentiment by rating, output clean CSV + summary report.
Execute a multi-branch git workflow: create branches, make commits, merge with conflict resolution, and tag releases.
Parse 200 log lines, categorize by level, identify top 5 errors by frequency, suggest root causes and fixes.
Turn vague, contradictory requirements into working features, making reasonable assumptions and documenting decisions.
Integrate with mock APIs, aggregate data from multiple endpoints, handle pagination and errors, produce a unified report.
Execute 6 interdependent steps where each depends on outputs from previous steps. Tests planning and state management.
Refactor code while keeping all existing tests passing. Verified by running pytest before and after.
Split a monolithic utils.py into 4 focused modules, update all imports, verify tests pass, and commit each step separately.
Follow a release checklist: version bumps, changelog compilation, cross-file consistency checks, and release tagging.
Memory
10 tasks (5 hard, 5 expert). Can your agent remember constraints, recall facts, handle contradictions, and filter noise across turns and context switches?
Recall 15 specific facts from a dense 2-page project briefing with 30+ data points. Ground-truth QA with deterministic answer checking.
Remember 12 events in random order, sort them chronologically, detect overlaps, and answer temporal reasoning questions with precise date arithmetic.
Recall information about one specific project out of three without leaking details from the others. Validates both presence and absence.
Distinguish 8 relevant technical decisions from 12 pieces of casual conversation noise in a standup transcript.
Recall and compute with specific numerical values from a financial report after a distraction. Tests both direct recall and derived calculations.
Remember and apply 3 formatting rules introduced at different points across 5 turns. By turn 4, all must be applied simultaneously.
Remember specific facts introduced early in a session, then recall them accurately after an extended unrelated conversation.
Internalize detailed session notes with 10 action items, 5 blockers, and 3 decisions, then recall specifics after a distraction task.
Correctly overwrite outdated information when given updates across turns. Final output must reflect only the latest values, not originals.
Accumulate requirements across 6 turns and produce a complete specification incorporating specific values from every previous turn.
Error Handling
4 tasks (3 hard, 1 expert). Can your agent handle broken inputs, cascading failures, and misleading errors?
Instructions contain deliberate contradictions. The agent should identify the conflicts and ask for clarification rather than guess.
Process a JSON file with 3 deliberate errors (missing bracket, trailing comma, unquoted key). Detect and report or fix.
Given 5 subtasks where one is impossible, complete the 4 possible ones and clearly report why the fifth can't be done.
Debug a multi-file Python project with subtle bugs across modules. Requires running tests, tracing failures, and fixing root causes.
Tool Efficiency
3 tasks (3 hard). Can your agent use the right tools with minimal waste and surgical precision?
Implement a function and write tests. Validated by running pytest — tests must pass.
Find all callers of a function across 30+ files using Grep, reading only relevant files. Penalizes unnecessary file opens.
A bug report points to dashboard sorting. The fix is one line. The agent should navigate directly to the file and fix it, touching at most 4–5 files.
Scoring methodology
Every tool call generates a line in trace.jsonl with millisecond timestamps. The three scoring layers use this trace:
L0 — Structural (40%): validates that expected files exist and contain required content.
L1 — Metrics (40%): counts tool calls, timing, and errors from the trace.
L2 — Behavioral (20%): analyzes patterns like read-before-write, tool misuse, and error recovery from the full JSONL log.
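Metrics in the L1 style could be tallied from the trace with a sketch like this. The field names (tool, timestamp_ms, error) are assumptions for illustration; the actual trace.jsonl schema is not documented here:

```python
import json
from collections import Counter

def summarize_trace(path: str) -> dict:
    """Tally tool calls, errors, and elapsed time from a JSONL trace.

    Field names (tool, timestamp_ms, error) are illustrative guesses;
    adapt them to the real trace.jsonl schema.
    """
    tools = Counter()
    errors = 0
    timestamps = []
    with open(path) as f:
        for line in f:
            event = json.loads(line)
            tools[event.get("tool", "unknown")] += 1
            if event.get("error"):
                errors += 1
            if "timestamp_ms" in event:
                timestamps.append(event["timestamp_ms"])
    elapsed = max(timestamps) - min(timestamps) if timestamps else 0
    return {
        "calls": sum(tools.values()),
        "by_tool": dict(tools),
        "errors": errors,
        "elapsed_ms": elapsed,
    }
```

Because each line is an independent JSON object, the trace can be scored incrementally without loading the whole run into memory.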
Custom Benchmarks
Run /benchmark --custom "Your prompt here" to test your agent on any prompt. Custom benchmarks use the same 3-layer scoring — measuring tool efficiency, planning, and behavior — not output correctness. Use it to A/B test config changes on your own workflows.
What gets measured
All metrics are captured via hooks — objective, not self-reported.