We watch your agent work, then help you train a smarter and cheaper successor

Understudy is an open-source toolkit your coding agents use to capture evidence from production LLM workloads, evaluate cheaper models against it, and ship specialist routes you own. Start local — one command, no account:

curl -fsSL https://raw.githubusercontent.com/UnderstudyLabs/understudy-agent-tools/main/install.sh | bash

Run it from the repo you want to optimize. It installs the CLI and the Claude Code plugin, then opens Claude Code — /understudy:onboard takes it from there. No model downloads, no uploads, no account required.

What we do

Repeated LLM work rarely needs a frontier model — it needs enough intelligence at the right latency and price, with proof. Understudy turns your production traces and your experts' judgment into evals, then climbs an optimization ladder — prompt tuning, supervised fine-tuning, RL when the task earns it — until a cheaper route beats your baseline on held-out evidence.
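The ladder described above can be pictured as a loop that tries each optimization step in order and stops at the first candidate that is both cheaper and better on held-out evals. The following is only a minimal sketch under stated assumptions: `Candidate`, `climb_ladder`, the rung names, and the numbers are hypothetical illustrations, not Understudy's actual API or results.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Candidate:
    name: str             # hypothetical label for the rung that produced it
    score: float          # held-out eval score, higher is better
    cost_per_call: float  # cost in arbitrary units


def climb_ladder(
    baseline: Candidate,
    rungs: Sequence[Callable[[], Candidate]],
) -> Optional[Candidate]:
    """Try each rung in order; return the first candidate that is
    cheaper than the baseline and scores higher on held-out evals."""
    for build in rungs:
        candidate = build()
        if (candidate.score > baseline.score
                and candidate.cost_per_call < baseline.cost_per_call):
            return candidate
    return None  # no rung beat the baseline; keep the baseline route


# Illustrative numbers only (assumed, not measured).
baseline = Candidate("frontier-baseline", score=0.80, cost_per_call=1.00)
rungs = [
    lambda: Candidate("prompt-tuning", score=0.78, cost_per_call=0.10),
    lambda: Candidate("supervised-fine-tuning", score=0.84, cost_per_call=0.05),
    lambda: Candidate("rl", score=0.90, cost_per_call=0.05),
]
winner = climb_ladder(baseline, rungs)
```

In this sketch the loop stops at the second rung, so the more expensive RL step is never run: a later rung is only worth climbing when the earlier ones fail to beat the baseline.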

The loop is capture → evaluate → train → deploy. It starts on your machine inside the coding agents your team already uses, and you own everything it produces: the prompts, the evals, and the model weights.

Why it matters

Teams pay frontier prices for routine calls because replacing a model safely requires evidence nobody has time to collect. On our public benchmarks, optimized specialist routes score 13% higher than Sonnet 4.6 on a CRM task, serve 5.2× faster, and cut per-call cost by 50× on sentiment work. Each claim is backed by held-out evals you can read.

Read the benchmarks →

How it ships

Local first: the toolkit runs offline and nothing leaves your system. When a candidate is ready for live traffic, the hosted gateway is a drop-in replacement for the Anthropic and OpenAI endpoints. It captures traces per workload, routes 5% of a call site's traffic to the candidate, and ratchets up only as the evidence holds. Rollback is a one-field change.
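To make the staged rollout concrete, here is a minimal sketch of percentage-based routing with a ratchet. It assumes the "one field" is a traffic share on the route; the names `Route`, `pick_model`, and `ratchet`, the doubling step, and the model labels are hypothetical and do not describe the gateway's real configuration.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Route:
    baseline: str                  # model currently serving the call site
    candidate: str                 # cheaper specialist under evaluation
    candidate_share: float = 0.05  # assumed field: fraction of traffic sent to the candidate


def pick_model(route: Route, request_id: str) -> str:
    """Deterministically assign a request to the baseline or the candidate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # integer in 0..9999
    threshold = round(route.candidate_share * 10_000)
    return route.candidate if bucket < threshold else route.baseline


def ratchet(route: Route, evidence_holds: bool, step: float = 2.0) -> Route:
    """Increase the candidate's share only while the evidence holds."""
    if evidence_holds:
        route.candidate_share = min(1.0, route.candidate_share * step)
    return route


route = Route(baseline="frontier-model", candidate="specialist-model")
served = [pick_model(route, f"req-{i}") for i in range(10_000)]
candidate_hits = served.count(route.candidate)  # about 5% of requests

ratchet(route, evidence_holds=True)   # share doubles after a good eval window
route.candidate_share = 0.0           # rollback: one field sends all traffic to the baseline
```

Hashing the request ID rather than sampling at random keeps the assignment stable across retries, which makes per-request comparisons between baseline and candidate easier to audit.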