Benchmarks

40 real-world tasks across 7 domains, from easy to expert. This is not a coding benchmark: it tests how well your Claude Code agent handles the messy, varied work you actually throw at it every day. Expert tasks use real-mode git repos with commit history and embedded scenarios.

How scoring works

Every task is scored 0–100 across four layers. The layers are weighted and combined into a single composite score per task, then rolled up by domain and overall.

L0 (20%): Automated Checks
Files exist, format is valid, required content is present

L1 (35%): Metrics
Tool call count, planning time, error count

L2 (20%): Behavioral
Instruction adherence, tool choice, approach quality

L3 (25%): Output Quality
Completeness, accuracy, formatting, polish
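
The weighting above can be sketched in a few lines. This is only an illustration: it assumes the composite is a plain weighted sum of the four layer scores and that the domain and overall rollups are unweighted means, neither of which is spelled out here. The function names `composite_score` and `rollup` are hypothetical.

```python
from statistics import mean

# Layer weights in percent, as listed above (they sum to 100).
WEIGHTS = {"L0": 20, "L1": 35, "L2": 20, "L3": 25}


def composite_score(layer_scores):
    """Combine four 0-100 layer scores into one 0-100 composite.

    Assumption: a plain weighted sum; the real combination may differ.
    """
    return sum(WEIGHTS[layer] * layer_scores[layer] for layer in WEIGHTS) / 100


def rollup(task_scores_by_domain):
    """Average composite scores per domain and across all tasks.

    Assumption: unweighted means, with every task counting equally.
    """
    result = {d: mean(scores) for d, scores in task_scores_by_domain.items()}
    all_scores = [s for scores in task_scores_by_domain.values() for s in scores]
    result["overall"] = mean(all_scores)
    return result


example = {"L0": 100, "L1": 80, "L2": 70, "L3": 90}
print(composite_score(example))  # -> 84.5
```

Because Metrics carries the largest weight, an agent that produces a correct result wastefully can still lose a noticeable share of the composite.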
Difficulty: easy (E), medium (M), hard (H), expert (X)

Domains & Tasks

File Creation

6 tasks: 2E, 2M, 2H

Can your agent produce well-structured documents, scaffolds, and migration scripts?

Organize Employee Data into CSV (easy)

Parse messy, inconsistent text records into a clean, structured CSV. Tests data cleaning and format normalization.

Create Registration Form Template (easy)

Read form requirements and produce a well-organized template with field types, validation rules, and required/optional labels.

Create Pitch Deck Outline (medium)

Read a startup description and produce a presentation-ready outline with 8 sections, bullet points, and numbers.

Create Project Proposal (medium)

Turn a project brief into a full proposal with executive summary, objectives, timeline, budget table, and risk assessment.

Scaffold Project from Requirements (hard)

Read a requirements doc and scaffold a complete Python package with CLI entry point, module stubs, config handling, and test files.

Write Config Migration Script (hard)

Read a migration guide and legacy JSON config, then write a Python script that transforms it into per-environment YAML files.

Research

5 tasks: 3M, 2H

Can your agent read, synthesize, and extract structured information from multiple sources?

Compare Database Technologies (medium)

Read two technology descriptions and produce a structured comparison with pros/cons, a comparison table, and recommendations.

Extract Data from Meeting Notes (medium)

Pull attendees, decisions, action items, and discussion points from an unstructured meeting transcript.

Summarize Whitepaper (medium)

Read a lengthy document and produce 3 key findings, a limitations section, and a one-paragraph executive summary.

Synthesize Conflicting Requirements (hard)

Read multiple documents with conflicting stakeholder perspectives and synthesize them into a coherent, prioritized requirements doc.

Investigate via Git Archaeology (hard)

Trace a skipped test back through 15 commits of git history using log, blame, and diff to find the real regression cause.

Data Analysis

5 tasks: 1E, 1M, 2H, 1X

Can your agent find patterns, anomalies, and discrepancies across complex data?

Generate Summary Statistics (easy)

Compute mean/median/min/max, break down scores by department, and derive 3 actionable insights from survey data.

Find Anomalies in Sales Data (medium)

Spot deliberate data quality issues: negative quantities, mismatched totals, duplicates, future dates, and price outliers.
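
To make the five issue types concrete, here is a minimal sketch of the kind of checks involved. It is not the benchmark's grader or fixture: the column names, the sample rows, the 5x-median outlier rule, and the function name `find_anomalies` are all assumptions made for illustration.

```python
import csv
import io
from datetime import date
from statistics import median

# Hypothetical sample data; the real task's columns and values are not shown here.
SAMPLE = """order_id,date,quantity,unit_price,total
1,2024-01-05,2,10.00,20.00
2,2024-01-06,-1,10.00,-10.00
3,2024-01-07,3,10.00,35.00
3,2024-01-07,3,10.00,35.00
4,2099-01-01,1,10.00,10.00
5,2024-01-08,1,500.00,500.00
"""


def find_anomalies(rows, today, outlier_factor=5.0):
    """Return (row_number, issue) pairs for the five issue types."""
    rows = list(rows)
    typical_price = median(float(r["unit_price"]) for r in rows)
    issues = []
    seen = set()
    for number, row in enumerate(rows, start=1):
        qty = int(row["quantity"])
        price = float(row["unit_price"])
        total = float(row["total"])
        if qty < 0:
            issues.append((number, "negative quantity"))
        if abs(qty * price - total) > 0.01:
            issues.append((number, "mismatched total"))
        if date.fromisoformat(row["date"]) > today:
            issues.append((number, "future date"))
        key = tuple(row.values())
        if key in seen:
            issues.append((number, "duplicate row"))
        seen.add(key)
        # Assumed rule: a price far above the median counts as an outlier.
        if price > outlier_factor * typical_price:
            issues.append((number, "price outlier"))
    return issues


issues = find_anomalies(csv.DictReader(io.StringIO(SAMPLE)), today=date(2024, 6, 1))
for number, issue in issues:
    print(f"row {number}: {issue}")
```

Passing `today` explicitly keeps the future-date check reproducible regardless of when the script runs.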

Cross-Reference Inventory & Orders (hard)

Cross-reference two CSV datasets (40 products, 60 orders) to find missing items, unordered inventory, and stock overflows.

Reconcile Multi-Format Financial Data (hard)

Reconcile financial data across CSV, JSON, and semi-structured text with 15 deliberate discrepancies.

Detect Root Cause from Log Patterns (expert)

Analyze 5 microservice logs with topology and SLA definitions to reconstruct a cascading failure timeline and find root cause.

Multi-Step

5 tasks: 1M, 2H, 2X

Can your agent chain multiple actions into coherent pipelines and repo operations?

Meeting Notes to Task List (medium)

Extract 8 action items from a conversational transcript, assign owners, add deadlines, and prioritize High/Medium/Low.

Clean & Analyze Customer Feedback (hard)

5-step pipeline: deduplicate 80 rows, clean text, categorize sentiment by rating, output clean CSV + summary report.

Analyze Server Logs (hard)

Parse 200 log lines, categorize by level, identify top 5 errors by frequency, suggest root causes and fixes.

Execute Multi-File Refactor (expert)

Split a monolithic utils.py into 4 focused modules, update all imports, verify tests pass, and commit each step separately.

Prepare a Software Release (expert)

Follow a release checklist: version bumps, changelog compilation, cross-file consistency checks, and release tagging.

Memory

5 tasks: 2M, 2H, 1X

Can your agent remember constraints and recall facts across turns and context switches?

Persist & Recall Preferences (medium)

Write user preferences (TypeScript, pnpm, no auto-commit) to memory files, then recall them accurately on demand.

Recall After Distraction (medium)

Remember project details (budget, lead, deadline, stack), complete an unrelated task, then recall the details correctly.

Retain Constraints Across Turns (hard)

Remember to use kebab-case filenames and British English spelling across 4 consecutive file creation tasks.

Context Switch Between Projects (hard)

Investigate a bug in project-alpha, get interrupted to work on project-beta, then return to alpha with investigation state intact.

Accumulate Constraints Progressively (expert)

Remember and apply 3 formatting rules introduced at different points across 5 turns. By turn 4, all must be applied simultaneously.

Error Handling

6 tasks: 1E, 2M, 3H

Can your agent handle broken inputs, cascading failures, and misleading errors?

Handle Missing File (easy)

Attempt to read a file that doesn't exist. Should report it's missing rather than hallucinating data or crashing.

Handle Corrupted JSON (medium)

Process a JSON file with 3 deliberate errors (missing bracket, trailing comma, unquoted key). Detect and report or fix.
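
As an illustration of the "detect and report" half of this task, the sketch below uses Python's standard `json` module to report where parsing fails. The helper name `check_json` and the sample input are hypothetical. A strict parser stops at the first error, so finding all three would mean fixing and re-parsing repeatedly.

```python
import json


def check_json(text):
    """Parse JSON text and report the first syntax error instead of raising."""
    try:
        return {"ok": True, "data": json.loads(text)}
    except json.JSONDecodeError as err:
        # msg, lineno and colno are attributes of json.JSONDecodeError.
        return {"ok": False, "error": err.msg, "line": err.lineno, "column": err.colno}


# Hypothetical input with a trailing comma inside the array.
broken = '{"name": "report", "tags": [1, 2,]}'
print(check_json(broken))
print(check_json('{"name": "report", "tags": [1, 2]}'))
```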

Identify Impossible Request (medium)

Spot contradictions: CSV can't contain charts, a file can't be both CSV and JSON, and 50K rows can't fit in 5KB.

Debug Silent Data Loss (hard)

A 4-stage pipeline runs without errors but loses ~10% of records. Find where the validate stage crashes on nulls and the transform stage silently drops rows.

Debug Misleading Stack Trace (hard)

The error appears in orders.js, but the root cause is the auth middleware failing to attach req.user when the session cache expires.

Implement Partial Recovery (hard)

A data import script crashes on the first bad file. Implement per-file error isolation and produce a meaningful error report.
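
Per-file error isolation usually comes down to wrapping each file's processing in its own try/except and collecting failures for a report. The sketch below shows one way that might look. It assumes JSON input files, and the function name `import_files` and the file names are hypothetical; the actual task's script and data format are not described here.

```python
import json
import tempfile
from pathlib import Path


def import_files(paths):
    """Import each file independently; one bad file does not stop the rest."""
    records, failures = [], []
    for path in paths:
        try:
            records.append(json.loads(Path(path).read_text(encoding="utf-8")))
        except (OSError, ValueError) as err:
            # ValueError covers json.JSONDecodeError and UnicodeDecodeError.
            failures.append({"file": str(path), "error": f"{type(err).__name__}: {err}"})
    return records, failures


# Small demo with one truncated file, one valid file, and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    bad = Path(tmp) / "bad.json"
    good = Path(tmp) / "good.json"
    missing = Path(tmp) / "missing.json"
    bad.write_text('{"id": ', encoding="utf-8")
    good.write_text('{"id": 1}', encoding="utf-8")
    records, failures = import_files([bad, good, missing])

print(f"{len(records)} imported, {len(failures)} failed")
for failure in failures:
    print(f"{failure['file']} -> {failure['error']}")
```

Catching only the expected exception types means genuine programming errors still surface instead of being buried in the report.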

Tool Efficiency

5 tasks: 3E, 2H

Can your agent use the right tools with minimal waste and surgical precision?

Answer with Minimal Tool Use (easy)

Answer a simple question from a config file. Should read it once and respond — no extra operations.

Review Without Editing (easy)

Review a report for factual errors. Should identify issues without modifying the file.

Use the Right Tools (easy)

Extract emails from a text file. Should use native Read/Grep/Write tools, not bash commands like grep or sed.

Navigate Large Codebase Efficiently (hard)

Find all callers of a function across 30+ files using Grep, reading only relevant files. Penalizes unnecessary file opens.

Fix Bug with Minimal File Reads (hard)

A bug report points to dashboard sorting. The fix is one line. Should navigate directly to the file and fix it, touching at most 4-5 files.

What gets measured

All metrics are captured via hooks — objective, not self-reported.

Wall-clock time
Planning ratio
Tool call count
Error count
Subagent spawns
Context compactions
Token estimate
Tool breakdown