§01 The position

Cadence runs Claude. Opus 4.7, Sonnet 4.6, Haiku 4.5. That is the entire model lineup. No GPT, no Gemini, no Llama, no cross-vendor router, no "best model per task" abstraction layer.

This is contrarian. The dominant pattern in AI products today is multi-model: pick the cheapest model that passes your eval, swap when prices change, route by latency. It looks like flexibility. We think it is mostly cope.

§02 Tool use is the only benchmark that matters for this product

Cadence is an agent product. Agents call tools. Multi-step tool use under load is where models actually differ from each other, and most public evals do not measure it.

Our internal benchmark, run quarterly:

Internal tool-use benchmark · 2026 · Q1

    Claude Opus 4.7                    94% pass
    Closest competitor (rotating)      71% pass
    Open-weight frontier (8B class)    38% pass
    Gap, four runs running             ~22 pp · stable
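The harness itself stays internal, but the shape of the measurement is simple, and a minimal sketch makes it concrete: drive the agent loop against the Messages API with a stubbed tool, record every tool call, and pass the task only if the call sequence matches. The model ID, the search_files stub, and the scoring rule here are illustrative, not our production harness.

    # Sketch of a multi-step tool-use eval: run the agent loop with a stubbed
    # tool, record the call sequence, score pass/fail against the expected one.
    import anthropic

    client = anthropic.Anthropic()

    TOOLS = [{
        "name": "search_files",  # stub tool for the sketch
        "description": "Search the workspace for files matching a query.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }]

    def run_task(prompt: str, expected: list[str], max_steps: int = 8) -> bool:
        messages = [{"role": "user", "content": prompt}]
        calls: list[str] = []
        for _ in range(max_steps):
            resp = client.messages.create(
                model="claude-opus-4-7",  # illustrative model ID
                max_tokens=1024,
                tools=TOOLS,
                messages=messages,
            )
            messages.append({"role": "assistant", "content": resp.content})
            if resp.stop_reason != "tool_use":
                break  # model finished without asking for another tool
            results = []
            for block in resp.content:
                if block.type == "tool_use":
                    calls.append(block.name)
                    results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": "stubbed result",  # canned, keeps runs comparable
                    })
            messages.append({"role": "user", "content": results})
        return calls == expected

Pass rate is the fraction of tasks where the sequence survives every step. A model that drops a step, invents a tool name, or stops early fails the task outright.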

You can argue with our methodology. You can't argue with the user-facing impact. Agent products built on Opus feel reliable. Agent products built on the rest feel like demos that hit the showroom floor too early.

§03 Coherence beats configurability

Every major model family has a personality. It comes through in tool selection, error messages, code style, refusal patterns, and the way it asks clarifying questions. Anthropic's models are direct, opinionated, and willing to say "I don't know." They push back when a request is malformed instead of confabulating an answer.

That voice fits our product. It also creates compounding returns. Every prompt we tune, every system message, every UI affordance built around model output gets calibrated to one consistent behavioral profile. A multi-model product writes prompts that have to work across an averaged behavioral surface, which means writing for the lowest-common-denominator model in the rotation. We chose less optionality and more depth, and we don't regret it.

§04 The features only available downstream of commitment

Single-vendor commitment unlocks features that routing layers cannot expose:

Prompt caching. A 90% discount on repeat tokens turns long-context agent flows from prohibitive to default. We cache aggressively across system prompts, tool definitions, and conversation prefixes. Routers cannot do this — the cache key isn't portable across vendors, and the price/latency model changes underneath them.
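Mechanically, this is one field on the request, not an architecture. A minimal sketch, assuming the Messages API's cache_control marker; the file paths and model ID are illustrative:

    # Mark the stable prefix (tool definitions + system prompt) as cacheable.
    # Everything up to the cache_control breakpoint is reused on later calls;
    # only the growing conversation suffix is billed at the full input rate.
    import json
    from pathlib import Path

    import anthropic

    client = anthropic.Anthropic()

    system_prompt = Path("prompts/system.txt").read_text()         # illustrative path
    tool_definitions = json.loads(Path("tools.json").read_text())  # illustrative path

    resp = client.messages.create(
        model="claude-sonnet-4-6",  # illustrative model ID
        max_tokens=2048,
        tools=tool_definitions,     # must be byte-identical across calls to hit
        system=[{
            "type": "text",
            "text": system_prompt,  # long and stable; very short prefixes fall
                                    # under the model's minimum cacheable length
            "cache_control": {"type": "ephemeral"},  # breakpoint: cache ends here
        }],
        messages=[{"role": "user", "content": "Summarize today's agent runs."}],
    )

The response's usage block reports cache_creation_input_tokens and cache_read_input_tokens, which is how we audit that the aggressive caching is actually landing.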

Extended thinking. Sonnet and Opus expose thinking budgets we control per call. Other vendors approximate this with chain-of-thought prompting. Approximations leak — the budget isn't real, the trace isn't reliable, the signal you wanted is gone.
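The budget is a literal request parameter, enforced by the API rather than the prompt. A minimal sketch; the numbers and the prompt are illustrative:

    # Extended thinking: the budget is a hard cap we set per call, and the
    # trace comes back as its own content block, separate from the answer.
    import anthropic

    client = anthropic.Anthropic()

    resp = client.messages.create(
        model="claude-opus-4-7",  # illustrative model ID
        max_tokens=16000,         # must leave room above the thinking budget
        thinking={"type": "enabled", "budget_tokens": 8000},
        messages=[{
            "role": "user",
            "content": "Plan the migration of the scheduler to the new queue API.",
        }],
    )

    for block in resp.content:
        if block.type == "thinking":
            print("trace:", block.thinking[:200])  # inspectable reasoning trace
        elif block.type == "text":
            print("answer:", block.text)

Because the budget is enforced server-side, we tune it per task type instead of hoping a "think step by step" preamble buys the same depth.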

Computer use. Anthropic shipped a model trained to operate a screen. We use it. There is no abstraction layer that papers over "this model can drive a mouse and that one cannot." Either you commit to the capability or you build around its absence.
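Committing to the capability looks like this in practice: computer use is an Anthropic-defined tool type behind a beta flag, not a prompt convention. A minimal sketch; the version strings track Anthropic's beta releases and the display size is illustrative:

    # Computer use: the model emits screen actions (screenshot, left_click,
    # type, and so on) as tool_use blocks; the client executes them and returns
    # screenshots as tool results. Beta flag and tool version vary by release.
    import anthropic

    client = anthropic.Anthropic()

    resp = client.beta.messages.create(
        model="claude-sonnet-4-6",          # illustrative model ID
        max_tokens=2048,
        betas=["computer-use-2025-01-24"],  # beta opt-in; string tracks releases
        tools=[{
            "type": "computer_20250124",    # Anthropic-defined tool type
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        }],
        messages=[{
            "role": "user",
            "content": "Open the settings panel and enable dark mode.",
        }],
    )

No routing layer can offer this call on a vendor that didn't train for it, which is the point.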

MCP. The Model Context Protocol Anthropic open-sourced is the most coherent agent-tool interface anyone has shipped. We built our broker on it. Every tool Cadence exposes (system actions, file I/O, app integrations) speaks MCP natively. The stack is portable across Claude versions and survives model upgrades without rewriting tool definitions.
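For a sense of how small the tool side of that contract is, here is a minimal MCP server using the open-source Python SDK's FastMCP helper. The server name and the toy tool are illustrative; Cadence's broker exposes system actions through the same interface:

    # A complete MCP tool server: one tool, served over stdio. The protocol,
    # not the model version, owns the tool contract.
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("cadence-demo")  # illustrative server name

    @mcp.tool()
    def read_file(path: str) -> str:
        """Return the contents of a workspace file."""
        with open(path, encoding="utf-8") as f:
            return f.read()

    if __name__ == "__main__":
        mcp.run()  # stdio transport by default

Because the contract lives in the protocol, upgrading from one Claude version to the next changes nothing on this side of the wire, which is what the portability claim above amounts to in practice.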

None of these features is decisive on its own. Stacked, they're what separates a Claude-flavored wrapper from a product whose entire architecture assumes one model family.

§05 The research direction is the right one

The serious AI safety work in 2026 is happening at Anthropic. Interpretability research, constitutional AI, model welfare, sandboxing for computer use, scalable oversight: each of these has working artifacts you can read, not press releases you have to take on faith. Other labs treat them as PR exercises or skip them entirely. Anthropic publishes results, ships them into the model, and lets the safety team gate releases.

That is not a soft preference. It is a bet about which lab will still be shipping useful frontier models in 2030 and which will have either capped out, pivoted, or shipped something that breaks loudly enough to lose its enterprise contracts. We are betting on the lab that is doing the science.

For a product whose entire value proposition is "trust this autonomous software to drive your computer," picking the model family with the strongest interpretability and alignment story is not a vibes call. It is the only defensible engineering decision.

§06 On the criticism

We have heard the complaints. Refusal rates on edge-case prompts. Pricing per token. Context-window economics. Latency on Opus calls.

Some of it is real. Most of it is solved at the product layer. Prompt caching brings cost down by roughly an order of magnitude. Streaming hides latency. Prompt design pulls refusal rates near zero for legitimate use cases. The complaints are loudest from people building shallow integrations and rotating providers constantly. The hundreds of companies doing depth work with Claude have largely stopped having these problems.

We picked the constraint. The product is better for it.

§07 What this means in practice

There's no model picker in Cadence and there isn't going to be one. The choice between speed and quality gets made automatically, per request, against one consistent and very good family of models. We took on the responsibility of being right about that choice. Users don't have to.

The routing layer is straightforward. Haiku 4.5 handles command parsing and quick classifications. Sonnet 4.6 runs most agent steps. Opus 4.7 takes the hard ones: multi-step plans, code reasoning under uncertainty, decisions that affect the rest of the run. The routing is deterministic per task type and tuned against quality outcomes rather than per-token cost.
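"Deterministic per task type" means the router is a lookup table, not another model. A sketch; the task names and model IDs are illustrative:

    # Deterministic routing: task type -> model ID. No per-token cost signal,
    # no dynamic scoring; the table changes only when the quality evals move.
    MODEL_BY_TASK = {
        "parse_command":   "claude-haiku-4-5",
        "classify_intent": "claude-haiku-4-5",
        "agent_step":      "claude-sonnet-4-6",
        "plan_multi_step": "claude-opus-4-7",
        "code_reasoning":  "claude-opus-4-7",
    }

    def route(task_type: str) -> str:
        # Fail loudly on an unknown task type rather than guessing a default.
        if task_type not in MODEL_BY_TASK:
            raise ValueError(f"unrouted task type: {task_type}")
        return MODEL_BY_TASK[task_type]

An unknown task type is a bug in Cadence, not a reason to silently fall back to the cheapest model.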

If Anthropic ships something better, we update the day it lands. Until then, every pixel of this product is calibrated to Claude. Making the case for that choice is part of why Cadence Labs exists; the consequence of it is what the product looks like in practice.