For a long time we believed the voice agent problem was a latency problem. The user speaks. We transcribe. We think. We synthesize. If we could get each step low enough — fifty milliseconds here, two hundred there — the whole pipeline would feel responsive. We were wrong about almost every part of that sentence.
The first system we shipped was a sequential pipeline of the kind every voice product is built from: voice activity detection, speech-to-text, an LLM, text-to-speech. The components were good. The whole thing was unusable. End-to-end response landed at three to seven seconds, no matter how aggressively we tuned each stage. We could feel the user wait. We could see them give up.
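To see why per-stage tuning could not rescue a design like this, it helps to write the arithmetic down. The sketch below is illustrative only: the stage names and millisecond figures are invented for the example, not measurements from our system. The one structural assumption is that each stage waits for the previous one to finish.

```python
# Hypothetical per-stage latencies in milliseconds (illustrative, not measured).
STAGES_MS = {
    "vad_endpoint": 800,        # trailing-silence wait before the turn is closed
    "speech_to_text": 500,
    "llm_full_response": 2200,
    "text_to_speech": 1500,
}


def sequential_latency_ms(stages):
    """In a strictly sequential pipeline, end-to-end latency is the sum of the stages."""
    return sum(stages.values())


def after_tuning(stages, saved_ms_per_stage):
    """Shave the same amount off every stage, never going below zero."""
    return {name: max(0, ms - saved_ms_per_stage) for name, ms in stages.items()}


baseline = sequential_latency_ms(STAGES_MS)
tuned = sequential_latency_ms(after_tuning(STAGES_MS, 50))
print(f"baseline: {baseline} ms, after tuning every stage by 50 ms: {tuned} ms")
```

Under these made-up numbers, fifty milliseconds won at every stage still leaves a response that takes well over four seconds. The sum is dominated by the fact that the stages wait on each other, not by any one of them.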
We tried to fix it. The Parallel Cognitive Voice Architecture ran two LLMs at once: an 8B model handling the immediate response, a 32B model thinking longer and allowed to override. It is a beautiful idea. It did not save time. We were still bottlenecked on speech generation. No amount of thinking-parallelism could move tokens through a synthesizer faster than the synthesizer was willing to produce them.
We tried streaming sentence-level TTS. Detect a sentence boundary in the LLM output, hand off to the synthesizer, start playback before the rest of the response existed. This dropped first-audio latency from two seconds to roughly five hundred milliseconds. It also introduced an entire class of problems we had not predicted: the synthesizer's prosody no longer matched the not-yet-generated remainder of the sentence; the LLM, asked to produce broadcast-friendly chunks, started writing differently from how it usually wrote. We had made the system faster and quieter — and a little dumber.
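The handoff described above can be sketched in a few lines. This is a minimal illustration, not the code we ran: `sentences_from_tokens` is a hypothetical helper, the boundary rule (terminal punctuation followed by whitespace) is deliberately naive, and a real system would need to handle abbreviations, decimals, and similar cases.

```python
import re
from typing import Iterable, Iterator

# Naive boundary: '.', '!' or '?' followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence as soon as the token stream shows it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a finished sentence; the last may still grow.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]
    # Flush whatever remains once the stream ends.
    if buffer.strip():
        yield buffer.strip()


llm_tokens = ["Sure", ".", " I", " can", " help", "!", " What", " next", "?"]
for sentence in sentences_from_tokens(llm_tokens):
    print("hand to TTS:", sentence)
```

Note what the sketch makes visible: each sentence goes to the synthesizer with no knowledge of what follows it, which is exactly where the prosody mismatch comes from.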
We tried predictive turn-taking. Instead of waiting out the conventional fixed silence threshold, we trained a small model to predict, from the trailing two seconds of audio, whether the user was about to finish speaking. Discourse markers, prosodic contour, clause completeness: all measurable signals. This worked. The system started responding sooner, sometimes uncannily soon. But the response itself was still being assembled by the same sequence of stages. We had moved the starting line; the race was still slow.
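The difference between the two endpointing strategies is easy to show in miniature. The sketch below is a simplification under stated assumptions: the 20 ms frame size, the 700 ms and 100 ms silence values, and the 0.8 confidence threshold are all invented for illustration, and `end_of_turn_prob` is a hypothetical stand-in for the trained predictor, not its real interface.

```python
from typing import Callable, Optional, Sequence

FRAME_MS = 20  # assumed VAD frame size


def fixed_threshold_endpoint(is_speech: Sequence[bool],
                             silence_ms: int = 700) -> Optional[int]:
    """Close the turn once a fixed run of silence follows speech. Returns time in ms."""
    needed = silence_ms // FRAME_MS
    run, heard_speech = 0, False
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech, run = True, 0
        elif heard_speech:
            run += 1
            if run >= needed:
                return (i + 1) * FRAME_MS
    return None


def predictive_endpoint(is_speech: Sequence[bool],
                        end_of_turn_prob: Callable[[int], float],
                        threshold: float = 0.8,
                        min_silence_ms: int = 100) -> Optional[int]:
    """Close the turn after a short silence if the (hypothetical) model is confident."""
    needed = max(1, min_silence_ms // FRAME_MS)
    run, heard_speech = 0, False
    for i, speech in enumerate(is_speech):
        if speech:
            heard_speech, run = True, 0
        elif heard_speech:
            run += 1
            if run >= needed and end_of_turn_prob(i) >= threshold:
                return (i + 1) * FRAME_MS
    return None


# One second of speech followed by one second of silence.
frames = [True] * 50 + [False] * 50
fixed_ms = fixed_threshold_endpoint(frames)
predicted_ms = predictive_endpoint(frames, end_of_turn_prob=lambda i: 0.9)
print(f"fixed: {fixed_ms} ms, predictive: {predicted_ms} ms")
```

With a confident predictor, the turn closes several hundred milliseconds earlier in this toy case. Everything downstream of that decision is unchanged, which is the limitation described above.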
We had spent months shaving two hundred milliseconds off a five-second problem. The actual constraint was not optimization. It was design.
The mistake we kept making
Each of these architectures — sequential, parallel-cognitive, streaming, predictive — assumed the same thing: that listening and speaking were separate stages, and the agent's job was to switch cleanly between them. That assumption is so old it disappears into the wallpaper. It is the architecture of a walkie-talkie. It is also wrong.
Real conversation is not a sequence of passes. People interrupt themselves. People begin speaking before the other person has finished — sometimes politely, often not. Backchannels — the mm-hm, the yeah — happen during the other speaker's sentence, and a system that cannot produce them sounds, at best, formal, and at worst, deaf. None of our pipelines could backchannel. None of them were even built to try.
What worked
The fix was a change of architecture, not another round of optimization. We adopted a full-duplex model (Moshi, from Kyutai) in which a single neural network handles both audio streams simultaneously. There is no turn detection because there is no turn boundary; the model is always listening and (potentially) always speaking. ASR comes almost for free, as a side effect of the model's inner monologue. End-to-end response time dropped to roughly two hundred milliseconds, but the more important number is one we had never tried to measure before: how often the system can produce something during the user's speech, rather than waiting for the user to finish.
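The change in control flow can be shown with a toy. To be clear about what this is: `ToyDuplexAgent` is a hypothetical stand-in written for this post, not Moshi's API and not a model of any kind. The 80 ms frame size and the backchannel rule are assumptions chosen only to make the shape of the loop visible: one step per frame, input and output together, with no point at which a "turn" ends.

```python
from dataclasses import dataclass
from typing import List, Optional

FRAME_MS = 80  # assumed frame duration for the sketch


@dataclass
class ToyDuplexAgent:
    """Stand-in for a full-duplex model: consumes one frame, may emit one frame."""
    backchannel_every: int = 25  # hypothetical: about every 2 s of continuous speech
    frames_heard: int = 0

    def step(self, user_is_speaking: bool) -> Optional[str]:
        if user_is_speaking:
            self.frames_heard += 1
            if self.frames_heard % self.backchannel_every == 0:
                return "mm-hm"  # emitted while the user is still talking
        else:
            self.frames_heard = 0
        return None  # stay silent for this frame


def run(agent: ToyDuplexAgent, user_frames: List[bool]) -> List[Optional[str]]:
    """There is no endpointing here: every input frame gets an output decision."""
    return [agent.step(frame) for frame in user_frames]


def overlap_rate(user_frames: List[bool], agent_out: List[Optional[str]]) -> float:
    """Fraction of user-speech frames during which the agent produced something."""
    speech = sum(user_frames)
    overlapped = sum(1 for u, a in zip(user_frames, agent_out) if u and a is not None)
    return overlapped / speech if speech else 0.0


user = [True] * 60  # 4.8 s of uninterrupted user speech
out = run(ToyDuplexAgent(), user)
print(f"overlap rate: {overlap_rate(user, out):.3f}")
```

In any of the pipelines described earlier, this overlap rate is zero by construction, because nothing is produced until the turn is declared over.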
This was humbling. We had been optimizing within a frame that was itself the problem. The right move was not to make the pipeline faster; it was to stop having a pipeline.
What we carry forward
Two things, mostly.
The first is a small piece of suspicion we now apply to every system we build: when an obvious-seeming optimization stops paying off, the structure may be wrong, not the parameters. Real-time is a property of design before it is a property of compute.
The second is a taste for systems that handle parallelism inside the model rather than around it. Pipelines are easy to reason about and easy to debug. They are also easy to make slow in ways that have nothing to do with their components. We will reach for them less.
We did not ship the voice agent we set out to ship. We shipped a different one, later, that we believe in more.