The full agent trajectory.
Not just the code.

258k real engineering tasks captured end-to-end: reasoning traces, tool calls, code edits, and explicit human acceptance signals. Built for labs training the next generation of coding agents.

Why this goes beyond the benchmarks
| Capability | SWE-bench | HumanEval | CodeSearchNet | Zstate SWE |
| --- | --- | --- | --- | --- |
| Full reasoning traces | ✗ | ✗ | ✗ | ✓ |
| Tool use captured | ✗ | ✗ | ✗ | ✓ |
| Human acceptance signals | ✗ | ✗ | ✗ | ✓ |
| Real production tasks | ~ | ✗ | ✗ | ✓ |
| Action-level reward signal | ✗ | ✗ | ✗ | ✓ |
| Scale (tasks) | 300 | 164 | ~100k | 258k |

One corpus.
Three lenses.

258k
Tasks
Task dataset
Real engineering problems with cleaned prompts and execution summaries. The foundation for SFT on problem comprehension and solution planning.
Engineering problems · Prompts · Summaries
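As a rough illustration of how a task record might feed SFT, here is a minimal sketch. The field names (`task_id`, `prompt`, `execution_summary`) are assumptions for illustration, not the published schema:

```python
# Hypothetical sketch: field names are illustrative assumptions,
# not the published schema.
def to_sft_pair(task: dict) -> dict:
    """Turn one task record into an input/target pair for SFT."""
    return {
        "input": task["prompt"],              # cleaned problem statement
        "target": task["execution_summary"],  # how the agent solved it
    }

task = {
    "task_id": "t-001",
    "prompt": "Fix the failing date parser in utils/dates.py",
    "execution_summary": "Reproduced the failure, patched the format string, reran tests.",
}
pair = to_sft_pair(task)
```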
3.7M
Steps
Trajectory dataset
Full step-by-step agent traces: reasoning, tool usage, and code generation at every decision point. The complete picture of how an expert agent solves problems.
Traces · Tool use · Reasoning
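A trajectory step like the ones described above might look roughly like this. The class and field names are assumptions sketched for illustration, not the published format:

```python
from dataclasses import dataclass, field

# Hypothetical step schema: names and fields are illustrative
# assumptions, not the published format.
@dataclass
class TrajectoryStep:
    task_id: str
    step_index: int      # position within the trajectory (~14.5 steps per task on average)
    reasoning: str       # the agent's thinking before acting
    action_type: str     # e.g. "tool_call" or "code_edit"
    payload: dict = field(default_factory=dict)

step = TrajectoryStep(
    task_id="t-001",
    step_index=3,
    reasoning="The stack trace points at the date parser; inspect it first.",
    action_type="tool_call",
    payload={"tool": "semantic_search", "query": "date parsing"},
)
```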
130k
Signals
Reward dataset
Explicit user acceptance signals at the action level, with a ~50% acceptance rate. Supports iterative multi-accept workflows and action-level reward modelling.
Reward signals · RLHF · RLAIF
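To make the action-level reward idea concrete, here is a minimal sketch of mapping accept/reject signals to binary rewards. The record shape and the `accepted` field are assumptions, not the published schema:

```python
# Hypothetical sketch of action-level reward extraction. The record
# shape and "accepted" field are assumptions, not the published schema.
def to_reward_examples(actions: list[dict]) -> list[dict]:
    """Map explicit accept/reject signals to binary rewards per action."""
    return [
        {"action": a["diff"], "reward": 1.0 if a["accepted"] else 0.0}
        for a in actions
        if "accepted" in a  # only a subset of tasks carries the signal
    ]

actions = [
    {"diff": "+ fixed format string", "accepted": True},
    {"diff": "+ speculative refactor", "accepted": False},
    {"diff": "+ unlabeled edit"},  # no signal: excluded here, still usable for SFT
]
examples = to_reward_examples(actions)
```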
258k
Real engineering tasks

3.7M
Trajectory steps

1.7M
Tool interactions across 22 tools

130k
Accepted code actions

14.5
Steps per task on average

63k
Tasks with acceptance signal
Each capturing explicit human approval at the action level, not just pass/fail at test time.

~50%
Acceptance rate on code actions
A meaningful signal-to-noise ratio for reward model training.

6–7
Tool calls per task on average
Semantic search, call graph analysis, file edits, CLI execution, and more, logged with full context.

25%
Tasks carrying an acceptance signal
The most valuable subset for reward-sensitive training. The remaining tasks retain full trajectory data for SFT.

Every tool interaction
logged with context.

Semantic search
Call graph analysis
File edits
CLI execution
Directory traversal
Test runner
Symbol lookup
Dependency graph
Code commenting
Diff generation
Static analysis
and 11 more tools across the corpus

Average 6–7 tool calls per task. Each call recorded with inputs, outputs, and the reasoning step that preceded it, giving reward models the full decision context, not just the outcome.
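A logged tool call could be reconstructed into a (reasoning, call, outcome) triple along these lines. The field names below are illustrative assumptions, not the published schema:

```python
# Hypothetical sketch: rendering one logged tool call together with the
# reasoning step that preceded it. Field names are illustrative assumptions.
def decision_context(call: dict) -> str:
    """Render one tool call with its preceding reasoning and its outcome."""
    return (
        f"thought: {call['preceding_reasoning']}\n"
        f"call:    {call['tool']}({call['inputs']})\n"
        f"result:  {call['outputs']}"
    )

call = {
    "tool": "test_runner",
    "inputs": {"path": "tests/test_dates.py"},
    "outputs": "1 passed",
    "preceding_reasoning": "Verify the patch by rerunning the failing test.",
}
print(decision_context(call))
```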

Ready to see
the schema?

We're packaging a sample and schema for AI lab outreach now. Get in touch to be first in line, or to discuss a curated subset built for your training pipeline.

Request schema + sample · Discuss a custom subset