AI data built by experts who actually know the domain

Zstate delivers RLHF training data, SFT datasets, and evaluations built by credentialed specialists. Our engineering team takes your models into production.

Coding dataset
258k
Real software engineering tasks with full agent trajectories, tool calls, and human acceptance signals. Nothing synthetic.
3.7M Trajectory Steps
130k Reward Signals
1.7M Tool Interactions
Explore →
Domain experts
500+
Specialists across software engineering, healthcare & finance
Agentic systems shipped
40+
Production agentic AI systems deployed in regulated industries
What we work on
SWE Trajectories · RLHF Data · RL Environments · EHR Abstraction · Agentic Systems · Clinical NLP

Generic AI data breaks down exactly where it matters most

Generic Annotation
  • Crowdsourced workers with no clinical background labeling diagnostic reasoning tasks
  • Financial tasks assigned to general contractors who've never read a 10-K
  • High-volume throughput with no understanding of downstream model behavior
Zstate Domain Experts
  • Credentialed clinicians, nurses, and health informaticists who understand clinical workflows
  • Analysts, CPAs, and compliance officers evaluating financial model outputs
  • Expert-graded, audit-ready datasets built with deployment outcomes in mind

Two ways we help you build better AI

01
Primary

AI data & RLHF

Expert-annotated training data, SFT datasets, preference data, and evaluations. All built by domain specialists who understand what they're labeling.

  • RLHF preference data & reward model training
  • SFT instruction datasets from domain experts
  • Red-teaming & adversarial evaluation
  • Clinical NLP, diagnostic Q&A, EHR abstraction
  • Earnings analysis, risk data, compliance evaluation
  • Medical coding & ICD abstraction
Start a data project →
02
Powered by the same expertise

Agentic AI engineering

Production-grade agentic systems across software engineering, healthcare, and finance, built on AI-native architecture and designed from the ground up to scale.

  • End-to-end agentic system design & build
  • Multi-agent pipelines & workflow automation
  • AI-native architecture, not retrofitted legacy code
  • From prototype to scalable production deployment
  • Compliance-aware engineering for regulated industries
Talk to our engineers →

Four areas. Genuine depth in each.

Clinical NLP & reasoning
Training data for models that interpret clinical notes, discharge summaries, and physician reasoning. Evaluated by practicing clinicians.
Diagnostic Q&A & multimodal
Expert-graded preference data for diagnostic reasoning tasks, imaging interpretation, and clinical decision support evaluation.
Medical coding & EHR abstraction
ICD-10 and CPT coding validation and EHR data abstraction, handled by certified coders and health informaticists.
Earnings & analyst evaluation
Preference data and SFT datasets for models reasoning over earnings reports, 10-K filings, and sell-side research. Evaluated by credentialed analysts.
Risk & compliance data
Training and evaluation data for risk model assessment, regulatory compliance tasks, and stress testing scenarios. Reviewed by risk professionals.
Fraud detection & trade rationale
Expert-annotated datasets for fraud detection, trade rationale evaluation, and financial reasoning benchmarks.
Agentic system design
Architecture and build of multi-agent systems from scratch, including tool use, memory, orchestration, and handoff logic designed for complex, long-horizon workflows.
RL environment engineering
Custom reinforcement learning environments that simulate real expert decision workflows. Built to generate high-signal training data and meaningful evaluation benchmarks.
Production deployment & ops
From working prototype to production system, with guardrails, observability, human-in-the-loop checkpoints, and the infrastructure to run agents reliably at scale.
258k real engineering tasks
Complete agent trajectories across 258k real-world software engineering problems with reasoning traces, tool calls, code edits, and explicit user acceptance signals. Nothing synthetic.
Three derived datasets
Task dataset (258k cleaned prompts), Trajectory dataset (3.7M trajectory steps from full agent traces, with tool use and code generation), and Reward dataset (130k explicit user acceptance signals supporting multi-accept workflows).
Beyond SWE-bench
Where SWE-bench captures prompt → code, ours captures the full lifecycle: reasoning → tool calls (6–7 per task across 22 tools) → code edits → human acceptance. Real production tasks, not curated benchmarks.
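For a feel of what that lifecycle could look like as data, here is a minimal sketch of one task record, plus a quick check of the per-task tool-call rate against the headline figures on this page (1.7M tool interactions over 258k tasks). The class names, field names, and example values are illustrative assumptions, not the published schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToolCall:
    tool_name: str   # hypothetical field: which of the tools was invoked
    arguments: dict  # hypothetical field: arguments passed to the tool


@dataclass
class TrajectoryStep:
    reasoning: str                        # the agent's reasoning at this step
    tool_call: Optional[ToolCall] = None  # set when the step invokes a tool
    code_edit: Optional[str] = None       # set when the step edits code


@dataclass
class TaskRecord:
    prompt: str
    steps: List[TrajectoryStep] = field(default_factory=list)
    accepted: Optional[bool] = None       # None when no explicit signal exists


def count_tool_calls(record: TaskRecord) -> int:
    """Count the steps in a record that invoke a tool."""
    return sum(1 for step in record.steps if step.tool_call is not None)


# A made-up record: reasoning -> tool call -> code edit -> human acceptance.
example = TaskRecord(
    prompt="Fix the failing date parser test",
    steps=[
        TrajectoryStep("Locate the failing test", ToolCall("search", {"query": "parse_date"})),
        TrajectoryStep("Patch the parser", code_edit="- old line\n+ new line"),
    ],
    accepted=True,
)

# Headline figures from this page, rounded as published.
TOTAL_TASKS = 258_000
TOTAL_TOOL_INTERACTIONS = 1_700_000
avg_tool_calls = TOTAL_TOOL_INTERACTIONS / TOTAL_TASKS  # falls between 6 and 7
```

Because the published totals are rounded, the computed average is only approximate, but it lands inside the stated 6–7 tool calls per task.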
Explore the dataset →

From scope to delivery, without the hand-holding

01
Define scope
Task type, domain, quality bar, compliance requirements. We help you spec this if needed. We've seen what works.
02
Expert matching
We assign credentialed specialists from our vetted pool. Not crowd workers. Experts who understand what they're evaluating.
03
Iterative delivery
Data delivered in structured batches with QA loops, inter-annotator agreement metrics, and full audit trails.
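As one example of an inter-annotator agreement metric, the sketch below computes Cohen's kappa for two annotators labeling the same items. It assumes a simple two-annotator setup with categorical labels. The page does not say which agreement metrics are reported, so treat this as an illustration rather than the actual QA pipeline.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both annotators must label the same non-empty set of items")

    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement by chance, from each annotator's own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Two annotators choosing the preferred response ("A" or "B") on four items.
annotator_1 = ["A", "A", "B", "B"]
annotator_2 = ["A", "B", "B", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)  # observed 0.75, expected 0.5, kappa 0.5
```

A kappa near 1 indicates strong agreement beyond chance, while a value near 0 means the annotators agree about as often as chance alone would predict.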
04
Ongoing evals
Red-teaming, model feedback loops, and continuous evaluation as your model evolves. We stay in the cycle.

Need to go further? Our engineering team can deploy what we train →

What makes our data structurally different

Credentialed experts, not crowd workers
Our annotators hold software engineering credentials, clinical certifications, and finance licenses. They understand the task, not just the label schema. This is the difference between a data vendor and a domain partner.
Compliance-first by design
Compliance-first workflows aren't an add-on. They are the architecture. Built for the domains where data handling mistakes have legal and human consequences.
Engineers who ship production AI
We've built agentic systems for regulated industries. That means our training data is built with deployment outcomes in mind, not just F1 scores. We know what good data produces downstream.
Vertical depth, not horizontal breadth
We go deep in software engineering, healthcare, and finance instead of shallow across twenty industries. That depth is why our data is defensibly better, and why our clients don't look elsewhere.
Get started

Ready to build AI your domain trusts?

Whether you need expert training data or a production AI system, let's start with a conversation.

Data services
Start a data project
RLHF, SFT datasets, evaluations & red-teaming by domain experts
Engineering
Book an engineering call
Production-grade agentic AI systems for regulated industries
Get in touch →