Benchmarks

40 real-world tasks across 7 domains — hard and expert difficulty only. Rule-based scoring — no LLM judges. All three layers are automated. Includes interactive task types: SQL validation, pytest verification, git command checks, and mock API integration. Expert tasks use real-mode git repos with commit history and embedded scenarios.

How scoring works

Every task is scored 0–100 across three layers — all rule-based and automated. The layers are weighted and combined into a single composite score per task, then rolled up by domain and overall.

L0 — Structural Checks (40%)
Files exist, format valid, required content present. Fully automated.

L1 — Metrics (40%)
Tool call count, planning time, error count. Fully automated.

L2 — Behavioral (20%)
Automated penalty-based analysis from JSONL event logs: tool misuse, read-before-write, efficiency, error recovery.
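As a minimal sketch, the weighted composite could be computed as follows. The 40/40/20 weights come from the layer descriptions above; the function name and dictionary keys are illustrative, not the benchmark's actual code:

```python
# Layer weights as described above: 40% structural, 40% metrics, 20% behavioral.
WEIGHTS = {"structural": 0.40, "metrics": 0.40, "behavioral": 0.20}

def composite_score(layer_scores: dict[str, float]) -> float:
    """Combine per-layer scores (each 0-100) into a single 0-100 task score."""
    return sum(WEIGHTS[layer] * score for layer, score in layer_scores.items())

# Example: strong structure, average metrics, weak behavior.
score = composite_score({"structural": 90, "metrics": 70, "behavioral": 50})  # 74.0
```

Per-task composites are then averaged per domain and overall, per the rollup described above.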

Domains & Tasks

File Creation

4 tasks (4 hard)

Can your agent produce well-structured documents, scaffolds, and migration scripts?

YAML to TOML Config Migration (hard)

Convert a complex YAML configuration to TOML format, preserving nested structures, comment context, and validation rules.

Write Config Migration Script (hard)

Read a migration guide and legacy JSON config, then write a Python script that transforms it into per-environment YAML files.

Log File Data Extraction (hard)

Extract structured data from log files using regex patterns, producing clean output with timestamps, levels, and messages.

Refactor Flat Skills into Linked Graph (hard)

Refactor 5 isolated skill files into an interconnected graph: add frontmatter, cross-references, and a graph.json manifest without losing original content.

Research

4 tasks (4 hard)

Can your agent read, synthesize, and extract structured information from multiple sources?

Code Review — Security and Bugs (hard)

Review a codebase for security vulnerabilities, bugs, and code quality issues. Produce a structured report with severity levels.

Investigate via Git Archaeology (hard)

Trace a skipped test back through 15 commits of git history using log, blame, and diff to find the real regression cause.

Extract Data from Meeting Notes (hard)

Pull attendees, decisions, action items, and discussion points from an unstructured meeting transcript.

Synthesize Conflicting Requirements (hard)

Read multiple documents with conflicting stakeholder perspectives and synthesize them into a coherent, prioritized requirements doc.

Data Analysis

6 tasks (3 hard, 3 expert)

Can your agent find patterns, anomalies, and discrepancies across complex data?

Cross-Reference Inventory & Orders (hard)

Cross-reference two CSV datasets (40 products, 60 orders) to find missing items, unordered inventory, and stock overflows.

Reconcile Multi-Format Financial Data (hard)

Reconcile financial data across CSV, JSON, and semi-structured text with 15 deliberate discrepancies.

SQL Query Challenge (hard)

Write and execute SQL queries against a SQLite database to answer business questions. Validated by running queries and checking results.

Data Pipeline — Clean and Merge (expert)

Multi-stage pipeline: ingest from 3 formats, clean, normalize, merge, and produce a validated output dataset.

Discover Hidden Pattern in Dataset (expert)

Find a non-obvious pattern buried in a dataset that only emerges when combining multiple columns and applying statistical analysis.

Detect Root Cause from Log Patterns (expert)

Analyze 5 microservice logs with topology and SLA definitions to reconstruct a cascading failure timeline and find root cause.

Multi-Step

9 tasks (3 hard, 6 expert)

Can your agent chain multiple actions into coherent pipelines and repo operations?

Clean & Analyze Customer Feedback (hard)

5-step pipeline: deduplicate 80 rows, clean text, categorize sentiment by rating, output clean CSV + summary report.

Git Workflow Management (hard)

Execute a multi-branch git workflow: create branches, make commits, merge with conflict resolution, and tag releases.

Analyze Server Logs (hard)

Parse 200 log lines, categorize by level, identify top 5 errors by frequency, suggest root causes and fixes.

Implement from Ambiguous Requirements (expert)

Turn vague, contradictory requirements into working features, making reasonable assumptions and documenting decisions.

API Integration and Data Aggregation (expert)

Integrate with mock APIs, aggregate data from multiple endpoints, handle pagination and errors, produce a unified report.

Execute 6-Step Dependency Chain (expert)

Execute 6 interdependent steps where each depends on outputs from previous steps. Tests planning and state management.

Refactor with Test Preservation (expert)

Refactor code while keeping all existing tests passing. Verified by running pytest before and after.

Execute Multi-File Refactor (expert)

Split a monolithic utils.py into 4 focused modules, update all imports, verify tests pass, and commit each step separately.

Prepare a Software Release (expert)

Follow a release checklist: version bumps, changelog compilation, cross-file consistency checks, and release tagging.

Memory

10 tasks (5 hard, 5 expert)

Can your agent remember constraints, recall facts, handle contradictions, and filter noise across turns and context switches?

Factual QA Recall from Briefing (hard)

Recall 15 specific facts from a dense 2-page project briefing with 30+ data points. Ground-truth QA with deterministic answer checking.

Temporal Event Ordering (hard)

Remember 12 events in random order, sort them chronologically, detect overlaps, and answer temporal reasoning questions with precise date arithmetic.

Selective Project Recall (hard)

Recall information about one specific project out of three without leaking details from the others. Validates both presence and absence.

Noise vs Signal Filtering (hard)

Distinguish 8 relevant technical decisions from 12 pieces of casual conversation noise in a standup transcript.

Numerical Precision Recall (hard)

Recall and compute with specific numerical values from a financial report after a distraction. Tests both direct recall and derived calculations.

Accumulate Constraints Progressively (expert)

Remember and apply 3 formatting rules introduced at different points across 5 turns. By turn 4, all must be applied simultaneously.

Recall Facts After Extended Distraction (expert)

Remember specific facts introduced early in a session, then recall them accurately after an extended unrelated conversation.

Cross-Session Project Handoff (expert)

Internalize detailed session notes with 10 action items, 5 blockers, and 3 decisions, then recall specifics after a distraction task.

Contradicting Information Updates (expert)

Correctly overwrite outdated information when given updates across turns. Final output must reflect only the latest values, not originals.

Incremental Specification Building (expert)

Accumulate requirements across 6 turns and produce a complete specification incorporating specific values from every previous turn.

Error Handling

4 tasks (3 hard, 1 expert)

Can your agent handle broken inputs, cascading failures, and misleading errors?

Handle Contradictory Instructions (hard)

Instructions contain deliberate contradictions. The agent should identify the conflicts and ask for clarification rather than guessing.

Handle Corrupted JSON (hard)

Process a JSON file with 3 deliberate errors (missing bracket, trailing comma, unquoted key). Detect and report or fix.

Complete Subtasks Despite Impossible One (hard)

Given 5 subtasks where one is impossible, complete the 4 possible ones and clearly report why the fifth can't be done.

Debug Python Project (expert)

Debug a multi-file Python project with subtle bugs across modules. Requires running tests, tracing failures, and fixing root causes.

Tool Efficiency

3 tasks (3 hard)

Can your agent use the right tools with minimal waste and surgical precision?

Fibonacci with Tests (hard)

Implement a function and write tests. Validated by running pytest — tests must pass.

Navigate Large Codebase Efficiently (hard)

Find all callers of a function across 30+ files using Grep, reading only relevant files. Penalizes unnecessary file opens.

Fix Bug with Minimal File Reads (hard)

A bug report points to dashboard sorting. The fix is one line. The agent should navigate directly to the file and fix it, touching at most 4-5 files.

Scoring methodology

Every tool call generates a line in trace.jsonl with millisecond timestamps. The three scoring layers use this trace:

L0 — Structural (40%) validates that expected files exist and contain required content. L1 — Metrics (40%) counts tool calls, timing, and errors from the trace. L2 — Behavioral (20%) analyzes patterns like read-before-write, tool misuse, and error recovery from the full JSONL log.

{"seq":1,"ts":1708900000123,"tool":"Read","target":"inputs/data.csv","status":"ok"}
{"seq":2,"ts":1708900001456,"tool":"Bash","target":"wc -l data.csv","status":"ok"}
{"seq":3,"ts":1708900002789,"tool":"Write","target":"analysis.md","status":"ok"}

Custom Benchmarks

Run /benchmark --custom "Your prompt here" to test your agent on any prompt. Custom benchmarks use the same 3-layer scoring — measuring tool efficiency, planning, and behavior — not output correctness. Use it to A/B test config changes on your own workflows.

What gets measured

All metrics are captured via hooks — objective, not self-reported.

Wall-clock time
Planning ratio
Tool call count
Error count
Subagent spawns
Context compactions
Token estimate
Tool breakdown
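The planning ratio is not defined above, so the sketch below assumes one plausible definition: the share of wall-clock time spent before the first write-type tool call. Both that definition and the set of "write" tools are assumptions, not the benchmark's actual formula:

```python
# Hypothetical definition: planning ratio = time before the first
# write-type tool call, as a fraction of total wall-clock time.
# The tool set below is an assumed classification, not the benchmark's.
WRITE_TOOLS = {"Write", "Edit", "Bash"}

def planning_ratio(events: list[dict]) -> float:
    """Fraction of the session spent before the first write-type call."""
    if len(events) < 2:
        return 0.0
    start, end = events[0]["ts"], events[-1]["ts"]
    # If no write-type call occurs, the whole session counts as planning.
    first_write = next((e["ts"] for e in events if e["tool"] in WRITE_TOOLS), end)
    return (first_write - start) / (end - start) if end > start else 0.0
```

For example, a session whose first write lands 600 ms into a 1000 ms run would score 0.6 under this definition.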