§01 The throughput problem
A skilled typist sustains 80 words per minute. Sustained professional dictation runs 150 to 180. Reading comprehension peaks around 250 to 300. Frontier LLMs produce between 100 and 500 tokens per second at the API layer: at roughly 0.75 words per token, call it 4,500 to 22,000 words per minute, depending on the model.
Two years ago this was fine. You typed a prompt, you read the answer, the limiting factor was the model. The model is now cheap and fast and running tools. The constraint moved.
It moved to the operator. Specifically, the channel from operator intent to system action.
§02 Why GUIs hit a ceiling
Click-driven interfaces encode every action as a coordinate plus a target. A mouse trip averages 1.2 seconds end-to-end including target reacquisition (Fitts's Law, well replicated since 1954). A typical multi-agent operation runs about four seconds of operator time: switch window, navigate to the agent card, click an approval. During those four seconds, an agent streaming at the top of that output range has emitted two thousand tokens of new context.
The math compounds with concurrency. Three agents in parallel, each waiting on operator decisions, turn the GUI driver into a serialization point. The agents are async. The human, working through a mouse, is not.
Bigger buttons don't fix this. Better keyboard shortcuts don't either. The deeper problem is that the input channel and the response channel are sharing one cursor.
§03 Why voice fits the shape of the problem
Voice has three properties that matter for agent orchestration:
Concurrency-safe input. Speaking "approve the migration" while looking at a different agent's output costs almost nothing. Switching windows to click an approval button breaks visual attention. The eyes and the voice are independent muscle groups; the eyes and the cursor are not.
Name-based addressing. "Atlas, kill that" routes to one of N agents in a single syntactic move. In cursor-driven systems the addressee is encoded in where you click, so the cursor has to physically be there before the command can begin. Voice routes by name in any window state. The corollary: the more agents you run, the larger voice's lead grows.
Granularity at speed. "Approve the migration but skip the rename" is one second of speech and unambiguous to a properly routed parser. Clicking the same intent requires three discrete UI affordances, all visible, all correctly hit-tested. The number of distinct commands a voice operator can issue per minute exceeds the number a click operator can issue by an order of magnitude. We've measured this. The gap is wider than the WPM ratio because of the per-action setup cost on the cursor side.
The asymmetry is the design point. Voice is the command channel because it is bandwidth-efficient for emit. Screen is the response channel because it is bandwidth-efficient for consume. Don't try to use either channel for both.
§04 What dictation has to be to make this work
Most dictation software is the wrong shape for agent control. Voice assistants and traditional speech-to-text were built for short queries to a single endpoint, with cloud round-trips and ambiguous wake words. Operating a multi-agent system needs something different.
Local execution
A 400ms cloud round-trip per command is fine for "what's the weather." It is lethal in a flow where you issue a command every five to fifteen seconds. We run Whisper locally; first transcript token lands in under 200ms on Apple Silicon. Cloud-only dictation is a non-starter for this use case, regardless of which model is in the cloud.
Persistent listening with cheap activation
Wake words are friction. A held hotkey is faster, more deliberate, and works in noisy environments. We default to push-to-talk and let users opt into wake-word activation. The activation cost has to round to zero or operators stop using the channel.
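For concreteness, here is a minimal push-to-talk sketch in Swift, assuming a held Option key as the hotkey. The key choice and the start/stop hooks are illustrative, not Cadence's actual bindings, and global event monitors require the macOS Accessibility permission.

```swift
import AppKit

// Push-to-talk via a held modifier key: start capture on press,
// stop on release. No wake word, no vocal prefix.
final class PushToTalk {
    private var monitor: Any?
    private var held = false
    var onStart: () -> Void = {}   // begin capturing audio
    var onStop: () -> Void = {}    // stop capturing, hand off to transcription

    func activate() {
        monitor = NSEvent.addGlobalMonitorForEvents(matching: .flagsChanged) { [weak self] event in
            guard let self else { return }
            let down = event.modifierFlags.contains(.option)
            if down, !self.held { self.held = true; self.onStart() }
            if !down, self.held { self.held = false; self.onStop() }
        }
    }

    deinit {
        if let monitor { NSEvent.removeMonitor(monitor) }
    }
}
```

The held key makes activation a single muscle movement with no spoken preamble, which is what lets the cost round to zero.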
Context capture at command time
Every command attaches the active app, the current selection, the focused agent, and the clipboard. The parser disambiguates "approve that" without asking back. Context-free voice command systems force the operator to over-specify, which negates the speed advantage.
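A sketch of the bundle that rides along with each utterance. The field names are illustrative; the shape is the point. The parser sees this next to the transcript, so "approve that" resolves without a round trip back to the operator.

```swift
import Foundation

// Context is captured at the moment the hotkey is released, not at
// parse time, so the referents of "that" and "it" are frozen with
// the command itself.
struct CommandContext {
    let activeApp: String       // frontmost application at command time
    let selection: String?      // current text selection, if any
    let focusedAgent: String?   // agent whose pane last had focus, e.g. "Atlas"
    let clipboard: String?      // pasteboard contents at command time
    let capturedAt: Date
}

struct VoiceCommand {
    let transcript: String      // e.g. "approve that"
    let context: CommandContext
}
```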
Routing, not monolithic intent
Commands fan out to one of: a system action, a specific agent by name, the agent currently in focus, or a broadcast. Routing is part of the parse, not a step after it. This is what lets "Atlas, kill that. Cypher, keep going" resolve in one utterance.
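A sketch of what routing-in-the-parse looks like, with illustrative types rather than Cadence's internals:

```swift
// The four routing targets as a closed set. Resolving the route is
// part of parsing the utterance, not a dispatch step afterward.
enum Route {
    case system                // OS-level action: switch window, open app
    case agent(name: String)   // addressed by name: "Atlas, kill that"
    case focusedAgent          // whichever agent currently has focus
    case broadcast             // every running agent
}

struct ParsedCommand {
    let route: Route
    let action: String
}

// "Atlas, kill that. Cypher, keep going" yields two routed commands
// from a single utterance:
let commands: [ParsedCommand] = [
    ParsedCommand(route: .agent(name: "Atlas"),  action: "kill current task"),
    ParsedCommand(route: .agent(name: "Cypher"), action: "continue"),
]
```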
Confidence-aware execution
High-confidence commands execute. Medium-confidence commands prompt a single-keystroke confirmation. Low-confidence commands are rejected silently. The thresholds are tuned against false-positive cost, not against transcript accuracy. Accidentally killing an agent is much worse than missing a command, and the tuning reflects that asymmetry.
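In sketch form, with illustrative threshold values; the point is that the cutoffs vary with the cost of a false positive, not with transcript accuracy alone.

```swift
// Confidence gating. Destructive actions (kill, delete) demand more
// confidence than benign ones (focus, scroll), reflecting the
// asymmetry: a wrongly executed kill costs more than a missed command.
enum Disposition {
    case execute   // run immediately
    case confirm   // ask for a single-keystroke confirmation
    case reject    // drop silently; no error surfaced
}

func dispose(confidence: Double, destructive: Bool) -> Disposition {
    let executeAt = destructive ? 0.95 : 0.85   // illustrative cutoffs
    let confirmAt = destructive ? 0.80 : 0.60
    if confidence >= executeAt { return .execute }
    if confidence >= confirmAt { return .confirm }
    return .reject
}
```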
§05 The substrate is the product
We did not set out to build a dictation app. We set out to build the operator interface for multi-agent AI. Dictation is the substrate that makes that interface possible at speeds the GUI cannot match.
If you are driving one agent, GUI works. Type, read, type, read. If you are driving four, and four is realistic now rather than a future scenario, GUI is the bottleneck. You spend more time orchestrating windows than directing work.
The interface model that scales is voice in, screen out, agents named and addressable, state visible at a glance. Cadence is what that looks like running on a Mac.
§06 Where this falls down
Voice as primary input is not a universal claim. There are three places where it's the wrong choice, and we want to be specific about them so the position doesn't read as ideology.
Open offices. Shared desks, glass meeting rooms, hot-desking. Speaking commands draws attention you may not want, and the social cost is real. Push-to-talk with a held key helps. Foot pedals help more. Some of our heavier users wear bone-conduction headsets with directional mics, which moves the social problem rather than eliminating it. We're not going to pretend this is solved at the OS layer.
High-precision text editing. Renaming a variable, fixing a typo, picking the right closing brace. These are tasks where the cursor is faster than naming what you want, because the cursor's whole purpose is high-precision spatial selection. Cadence stays out of the way during them. The hotkey is silent until you press it.
Searching for the words. Voice is great when you know what you want to say and bad when you're inventing it. Brainstorming, exploratory writing, hunting for a phrase. These are jobs for the keyboard. We don't try to sell voice as an everything-tool. Voice in, screen out is a design principle, not a religious commitment.
Where does that leave the thesis? Roughly: voice wins decisively when you're directing work and reading results, which is the dominant shape of agent operation. It loses when you're producing prose, formatting it precisely, or talking to a system you don't have words for yet. We built Cadence for the first kind of work and we put real effort into staying out of the way during the second.