Benchmarks
40 real-world tasks across 7 domains, from easy to expert. This is not a coding benchmark: it tests how well your Claude Code agent handles the messy, varied work you actually throw at it every day. Expert tasks use real-mode git repos with commit history and embedded scenarios.
How scoring works
Every task is scored 0–100 across four layers. The layers are weighted and combined into a single composite score per task, then rolled up by domain and overall.
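In code terms, the rollup looks roughly like the sketch below. The layer names and weights are placeholders rather than the benchmark's actual ones; only the 0–100 scale and the weighted-average-then-average mechanic come from the description above.

```python
# Illustrative only: the layer names and weights here are assumed.
LAYER_WEIGHTS = {"correctness": 0.4, "completeness": 0.3, "efficiency": 0.2, "robustness": 0.1}

def composite(layer_scores: dict[str, float]) -> float:
    """Weight the four 0-100 layer scores into one composite score per task."""
    return sum(LAYER_WEIGHTS[layer] * score for layer, score in layer_scores.items())

def rollup(task_composites: list[float]) -> float:
    """Average task composites into a domain score (and domain scores into an overall score)."""
    return sum(task_composites) / len(task_composites)

task_score = composite({"correctness": 90, "completeness": 80, "efficiency": 70, "robustness": 100})
domain_score = rollup([task_score, 85.0, 62.5])  # 84.0 -> domain average of about 77.2
```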
Domains & Tasks
File Creation
6 tasks (2 Easy, 2 Medium, 2 Hard). Can your agent produce well-structured documents, scaffolds, and migration scripts?
Parse messy, inconsistent text records into a clean, structured CSV. Tests data cleaning and format normalization.
Read form requirements and produce a well-organized template with field types, validation rules, and required/optional labels.
Read a startup description and produce a presentation-ready outline with 8 sections, bullet points, and numbers.
Turn a project brief into a full proposal with executive summary, objectives, timeline, budget table, and risk assessment.
Read a requirements doc and scaffold a complete Python package with CLI entry point, module stubs, config handling, and test files.
Read a migration guide and legacy JSON config, then write a Python script that transforms it into per-environment YAML files.
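To give a feel for the last task's expected output, here is a minimal sketch of the transform, assuming PyYAML is available; the file names, the environments key, and the merge rule are illustrative assumptions, not details taken from the task fixture.

```python
import json
from pathlib import Path

import yaml  # PyYAML

# Assumed input layout: shared keys at the top level plus an "environments"
# map of per-environment overrides.
legacy = json.loads(Path("legacy_config.json").read_text())
shared = {k: v for k, v in legacy.items() if k != "environments"}

out_dir = Path("config")
out_dir.mkdir(exist_ok=True)

for env, overrides in legacy.get("environments", {}).items():
    merged = {**shared, **overrides}  # environment-specific values win over shared defaults
    (out_dir / f"{env}.yaml").write_text(yaml.safe_dump(merged, sort_keys=False))
```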
Research
5 tasks (3 Medium, 2 Hard). Can your agent read, synthesize, and extract structured information from multiple sources?
Read two technology descriptions and produce a structured comparison with pros/cons, a comparison table, and recommendations.
Pull attendees, decisions, action items, and discussion points from an unstructured meeting transcript.
Read a lengthy document and produce 3 key findings, a limitations section, and a one-paragraph executive summary.
Read multiple documents with conflicting stakeholder perspectives and synthesize them into a coherent, prioritized requirements doc.
Trace a skipped test back through 15 commits of git history using log, blame, and diff to find the real regression cause.
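The git-history task above is plain git archaeology; the sketch below shows one way to drive it, with the test path and the skip marker as assumptions about the repo.

```python
import subprocess

def git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True, text=True, check=True).stdout

# 1. Pickaxe search: which commits added or removed the skip marker?
suspects = git("log", "--oneline", "-S", "@pytest.mark.skip", "--", "tests/")

# 2. Who last touched the skipped test, and in which commits?
blame = git("blame", "--", "tests/test_orders.py")

# 3. Read the oldest suspect's full diff to find the real regression cause.
#    (Assumes the pickaxe search returned at least one commit.)
oldest_suspect = suspects.splitlines()[-1].split()[0]
diff = git("show", oldest_suspect)
```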
Data Analysis
5 tasks (1 Easy, 1 Medium, 1 Hard, 1 Expert). Can your agent find patterns, anomalies, and discrepancies across complex data?
Compute mean/median/min/max, break down scores by department, and derive 3 actionable insights from survey data.
Spot deliberate data quality issues: negative quantities, mismatched totals, duplicates, future dates, and price outliers.
Cross-reference two CSV datasets (40 products, 60 orders) to find missing items, unordered inventory, and stock overflows (see the sketch after this task list).
Reconcile financial data across CSV, JSON, and semi-structured text with 15 deliberate discrepancies.
Analyze 5 microservice logs with topology and SLA definitions to reconstruct a cascading failure timeline and find root cause.
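The inventory/orders cross-check flagged above reduces largely to set operations over the two files; a minimal sketch, with the file and column names assumed rather than taken from the task data:

```python
import csv

with open("products.csv", newline="") as f:
    products = {row["product_id"]: row for row in csv.DictReader(f)}
with open("orders.csv", newline="") as f:
    orders = list(csv.DictReader(f))

ordered_ids = {o["product_id"] for o in orders}

missing_from_inventory = ordered_ids - products.keys()  # ordered, but no such product exists
never_ordered = products.keys() - ordered_ids            # in inventory, never ordered
overflows = [
    o for o in orders
    if o["product_id"] in products
    and int(o["quantity"]) > int(products[o["product_id"]]["stock"])  # order exceeds stock on hand
]
```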
Multi-Step
5 tasks (1 Medium, 2 Hard, 2 Expert). Can your agent chain multiple actions into coherent pipelines and repo operations?
Extract 8 action items from a conversational transcript, assign owners, add deadlines, and prioritize High/Medium/Low.
5-step pipeline: deduplicate 80 rows, clean text, categorize sentiment by rating, output clean CSV + summary report.
Parse 200 log lines, categorize by level, identify the top 5 errors by frequency, and suggest root causes and fixes (see the sketch after this task list).
Split a monolithic utils.py into 4 focused modules, update all imports, verify tests pass, and commit each step separately.
Follow a release checklist: version bumps, changelog compilation, cross-file consistency checks, and release tagging.
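The log-triage task flagged above is mostly counting by level and by message; a sketch, assuming each line starts with a two-token timestamp followed by a level and a message:

```python
import re
from collections import Counter

level_counts: Counter[str] = Counter()
error_messages: Counter[str] = Counter()

with open("app.log") as f:
    for line in f:
        m = re.match(r"\S+\s+\S+\s+(DEBUG|INFO|WARN|ERROR)\s+(.*)", line)
        if not m:
            continue
        level, message = m.groups()
        level_counts[level] += 1
        if level == "ERROR":
            error_messages[message.strip()] += 1

top_errors = error_messages.most_common(5)  # starting point for root-cause and fix suggestions
```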
Memory
5 tasks (2 Medium, 1 Hard, 1 Expert). Can your agent remember constraints and recall facts across turns and context switches?
Write user preferences (TypeScript, pnpm, no auto-commit) to memory files, then recall them accurately on demand (see the sketch after this task list).
Remember project details (budget, lead, deadline, stack), complete an unrelated task, then recall the details correctly.
Remember to use kebab-case filenames and British English spelling across 4 consecutive file creation tasks.
Investigate a bug in project-alpha, get interrupted to work on project-beta, then return to alpha with investigation state intact.
Remember and apply 3 formatting rules introduced at different points across 5 turns. By turn 4, all must be applied simultaneously.
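For the preferences task flagged above, the behaviour under test is simple persistence and recall; a toy sketch, assuming CLAUDE.md as the memory file and a one-bullet-per-preference format:

```python
from pathlib import Path

MEMORY_FILE = Path("CLAUDE.md")  # assumed memory file location

def remember(preferences: dict[str, str]) -> None:
    lines = ["## User preferences"] + [f"- {k}: {v}" for k, v in preferences.items()]
    MEMORY_FILE.write_text("\n".join(lines) + "\n")

def recall() -> dict[str, str]:
    prefs: dict[str, str] = {}
    for line in MEMORY_FILE.read_text().splitlines():
        if line.startswith("- ") and ":" in line:
            key, value = line[2:].split(":", 1)
            prefs[key.strip()] = value.strip()
    return prefs

remember({"language": "TypeScript", "package manager": "pnpm", "auto-commit": "never"})
assert recall()["auto-commit"] == "never"
```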
Error Handling
6 tasks (1 Easy, 2 Medium, 3 Hard). Can your agent handle broken inputs, cascading failures, and misleading errors?
Attempt to read a file that doesn't exist. Should report it's missing rather than hallucinating data or crashing.
Process a JSON file with 3 deliberate errors (missing bracket, trailing comma, unquoted key). Detect and report or fix.
Spot contradictions: CSV can't contain charts, a file can't be both CSV and JSON, and 50K rows can't fit in 5KB.
A 4-stage pipeline runs without visible errors but loses ~10% of records. Find where the validate stage crashes on nulls and the transform stage silently drops rows.
The error surfaces in orders.js, but the root cause is the auth middleware failing to attach req.user when the session cache expires.
A data import script crashes on the first bad file. Implement per-file error isolation and produce a meaningful error report.
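The isolation pattern the last task looks for is a try/except around each file plus a report at the end; a sketch, with parse_file() and the directory layout as placeholders for the task's real importer:

```python
import json
from pathlib import Path

def parse_file(path: Path) -> list[dict]:
    """Placeholder for the real import logic."""
    return json.loads(path.read_text())

records: list[dict] = []
failures: list[dict] = []

for path in sorted(Path("incoming").glob("*.json")):
    try:
        records.extend(parse_file(path))  # one bad file must not abort the whole run
    except Exception as exc:
        failures.append({"file": str(path), "error": f"{type(exc).__name__}: {exc}"})

report = {
    "files_failed": len(failures),
    "records_imported": len(records),
    "failures": failures,  # the meaningful error report
}
Path("import_report.json").write_text(json.dumps(report, indent=2))
```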
Tool Efficiency
5 tasks (3 Easy, 2 Hard). Can your agent use the right tools with minimal waste and surgical precision?
Answer a simple question from a config file. Should read it once and respond — no extra operations.
Review a report for factual errors. Should identify issues without modifying the file.
Extract emails from a text file. Should use native Read/Grep/Write tools, not bash commands like grep or sed.
Find all callers of a function across 30+ files using Grep, reading only relevant files. Penalizes unnecessary file opens.
A bug report points to dashboard sorting. The fix is one line. Should navigate directly to the file and fix it, touching at most 4-5 files.
What gets measured
All metrics are captured via hooks — objective, not self-reported.
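As an example of the approach, a PostToolUse hook command can append every tool call to a log that the scorer reads afterwards. The sketch below is only an outline: the payload field names (tool_name, tool_input) reflect one reading of the Claude Code hook input and should be checked against the hooks documentation.

```python
#!/usr/bin/env python3
# Rough sketch of a PostToolUse hook command. The hook payload arrives as JSON
# on stdin; append the tool call to a JSONL log so metrics reflect observed
# behaviour rather than the agent's self-report. Field names are assumptions.
import json
import sys
import time
from pathlib import Path

event = json.load(sys.stdin)
record = {
    "ts": time.time(),
    "tool": event.get("tool_name"),
    "input": event.get("tool_input"),
}

log_path = Path("metrics/tool_calls.jsonl")
log_path.parent.mkdir(parents=True, exist_ok=True)
with log_path.open("a") as log:
    log.write(json.dumps(record) + "\n")
```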