258k real engineering tasks captured end-to-end: reasoning traces, tool calls, code edits, and explicit human acceptance signals. Built for labs training the next generation of coding agents.
| Capability | SWE-bench | HumanEval | CodeSearchNet | Zstate SWE |
|---|---|---|---|---|
| Full reasoning traces | ✕ | ✕ | ✕ | ✓ |
| Tool use captured | ✕ | ✕ | ✕ | ✓ |
| Human acceptance signals | ✕ | ✕ | ✕ | ✓ |
| Real production tasks | ✓ | ✕ | ~ | ✓ |
| Action-level reward signal | ✕ | ✕ | ✕ | ✓ |
| Scale (tasks) | 300 | 164 | ~100k | 258k |
Tasks average 6–7 tool calls. Each call is recorded with its inputs, outputs, and the reasoning step that preceded it, giving reward models the full decision context, not just the outcome.
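To make the shape of that data concrete, here is a minimal sketch of what a single recorded task might look like. The field names and types below are illustrative assumptions, not the published Zstate SWE schema.

```python
from dataclasses import dataclass, field

@dataclass
class ToolCall:
    # Hypothetical record shape -- illustrative only, not the actual schema.
    reasoning: str  # the reasoning step that preceded this call
    tool: str       # e.g. "file_edit", "shell", "search" (assumed names)
    inputs: dict    # arguments passed to the tool
    outputs: str    # what the tool returned

@dataclass
class Task:
    task_id: str
    tool_calls: list[ToolCall] = field(default_factory=list)
    accepted: bool = False  # explicit human acceptance signal

# Example: one task with a single recorded call, then accepted by a human.
task = Task(task_id="demo-001")
task.tool_calls.append(ToolCall(
    reasoning="The failing test points at a missing null check in parse().",
    tool="file_edit",
    inputs={"path": "src/parser.py", "patch": "...add null check..."},
    outputs="edit applied",
))
task.accepted = True
```

Because each `ToolCall` pairs the action with the reasoning that produced it, a reward model can score individual decisions rather than only the final diff.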
We're preparing a sample and schema for AI labs now. Get in touch to be first in line, or to discuss a curated subset built for your training pipeline.