You say something out loud. A beat passes. Ophie answers. If that beat is too long, the whole thing falls apart — you start to feel like you're talking to a machine that is buffering, not a companion that is listening. The gap between your words ending and the reply beginning is the single most important number in a voice product, and most of the engineering we do on the voice side exists to keep that gap short.
This is the companion piece to our post on encryption at rest. That one was about how we protect what you say. This one is about how Ophie hears it, thinks, and speaks back — fast enough that the conversation feels like a conversation.
Why a fraction of a second decides everything
Human conversation runs on a clock most people never notice. When two people talk, the gaps between turns are tiny — on the order of 200 milliseconds — even though the mental work of producing even a simple spoken reply takes well over 600 milliseconds. We start planning our answer before the other person has finished. That overlap is what makes talking feel effortless, and it is the threshold a voice system has to aim for if it wants to feel natural rather than transactional.
The target isn't "fast." The target is "the silence after you stop talking is about as long as it would be with a person."
That research figure — drawn from cross-language studies of how people take turns — is why Ophie's voice pipeline is designed around a sub-200ms latency target over WebRTC. We want to be honest about that word: it is a design target the system aims for, not a benchmarked guarantee we measure and publish. Real-world latency depends on your network, the length of the reply, and how much reasoning a given turn requires. The architecture below is built to keep getting closer to that natural-conversation threshold.
The pipeline, stage by stage
When you speak to Ophie, your voice travels through four distinct stages before it comes back as speech. Each one is a separate system, and each one is a place where milliseconds can be won or lost.
- Capture and transcription. Your microphone audio is streamed to Deepgram running the nova-2-general model. It transcribes as you talk, not after you finish: streaming speech-to-text, so the words are ready the moment you stop.
- Reasoning. The transcript goes to the main model: Qwen3.5-397B-A17B, served on a DigitalOcean vLLM endpoint with thinking mode off for responsiveness. This is the part of Ophie that decides what to say.
- Speech synthesis. The reply text streams into Cartesia's sonic-3 model (by default the voice named "Brooke," with other curated voices available), which turns it into expressive, spoken audio. There is also a fallback synthesis path so a single provider hiccup doesn't leave you in silence.
- Transport. That audio is streamed back to your browser over WebRTC, the same open standard browsers use for real-time calls, orchestrated by LiveKit Agents on the server side and the LiveKit JavaScript SDK on the client.
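To make the shape of those four stages concrete, here is a minimal sketch of a streaming pipeline built from chained async generators. Every function in it is a hypothetical stand-in written for this post; none of them calls a real Deepgram, vLLM, Cartesia, or LiveKit API, and the "audio" is just tagged text. The point is only that each stage consumes the previous one's output as it arrives.

```python
import asyncio
from typing import AsyncIterator

# All stages below are hypothetical stand-ins, not real provider APIs.

async def microphone() -> AsyncIterator[bytes]:
    """Pretend microphone: each chunk happens to contain one spoken word."""
    for word in ["hello", "there"]:
        yield word.encode()

async def transcribe(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stage 1 stand-in: emit a partial transcript for every audio chunk."""
    async for chunk in audio:
        yield chunk.decode()

async def reason(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stage 2 stand-in: wait for the whole utterance, then stream reply tokens."""
    heard = [w async for w in words]
    for token in ["You", "said:", *heard]:
        yield token

async def synthesize(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stage 3 stand-in: turn each reply token into one audio frame."""
    async for token in tokens:
        yield f"<audio:{token}>".encode()

async def run_turn() -> list[bytes]:
    """Stage 4 stand-in: collect frames in the order a transport would send them."""
    frames = []
    async for frame in synthesize(reason(transcribe(microphone()))):
        frames.append(frame)
    return frames

if __name__ == "__main__":
    print(asyncio.run(run_turn()))
```

Because every stage is a generator, the first audio frame can leave the synthesizer as soon as the first reply token exists, rather than after the full reply has been written.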
A quick note on the names, because we see them mixed up a lot: the reasoning model that actually talks to you is the Qwen model on DigitalOcean. A few smaller models do support work off to the side — summarizing past sessions for recall, running safety checks — but they are not the voice you hear. We'd rather be precise than impressive.
Why WebRTC, and not just an audio file
The transport choice matters more than it sounds. WebRTC is the browser technology that lets a web app capture and stream audio between endpoints without an intermediary server re-encoding everything in the middle. It is built on an open standard designed for exactly this: real-time, low-latency media in the browser.
The alternative — generate the whole reply, render it to an audio file, send the file, play it — adds delay at every step and forces you to wait for the entire sentence before you hear the first word. Streaming over WebRTC means audio starts flowing back while the reply is still being produced. It is the difference between a reply that arrives and a reply that begins.
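A rough way to see the difference is to compare when the first audio could start in each approach. The sketch below is back-of-the-envelope arithmetic only: the function names, parameters, and every number passed in are illustrative assumptions, not measurements of Ophie.

```python
def batch_first_audio_ms(n_tokens: int, ms_per_token: float,
                         render_ms: float, transfer_ms: float) -> float:
    """Batch path: nothing plays until the whole reply is generated, rendered, and sent."""
    return n_tokens * ms_per_token + render_ms + transfer_ms

def streaming_first_audio_ms(ms_per_token: float, first_frame_ms: float,
                             network_ms: float) -> float:
    """Streaming path: playback can start once the first token is voiced and delivered."""
    return ms_per_token + first_frame_ms + network_ms

if __name__ == "__main__":
    # Made-up inputs, chosen only to show how the two formulas scale.
    for n_tokens in (10, 40, 160):
        batch = batch_first_audio_ms(n_tokens, ms_per_token=20, render_ms=300, transfer_ms=100)
        stream = streaming_first_audio_ms(ms_per_token=20, first_frame_ms=60, network_ms=40)
        print(f"{n_tokens:>4} tokens: batch {batch:.0f} ms, streaming {stream:.0f} ms")
```

Under these assumptions the batch delay grows with the length of the reply, while the streaming delay does not depend on reply length at all.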
Turn-taking: knowing when you're done
Speed is only half of natural conversation. The other half is knowing when it is your turn. If Ophie jumps in while you're mid-thought, that's worse than a slow reply. If it waits too long after you genuinely finish, the silence feels dead.
Inside the LiveKit session, Ophie uses Silero VAD (voice activity detection) to sense when you are and aren't speaking, which is what makes responsive interruption possible. On top of that, end-of-turn detection prefers a semantic model that tries to tell whether you've actually finished a thought or just paused to breathe, falling back to transcription-based detection when needed. How long it waits before treating a pause as the end of your turn is tunable, so the behavior can be dialed to be snappier or more patient.
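The decision logic described above can be sketched as a small function: ask a semantic model whether the thought is finished, fall back to a text heuristic if the model is missing or fails, and then require a shorter or longer silence accordingly. This is an illustration under assumptions, not Ophie's implementation: the function names, the punctuation heuristic, and the 300 ms and 1200 ms thresholds are all hypothetical.

```python
from typing import Callable, Optional

def finished_by_text(transcript: str) -> bool:
    """Fallback heuristic (an assumption): closing punctuation means a finished thought."""
    return transcript.rstrip().endswith((".", "?", "!"))

def required_silence_ms(transcript: str,
                        semantic_model: Optional[Callable[[str], bool]] = None,
                        short_ms: int = 300, long_ms: int = 1200) -> int:
    """Choose how much silence to wait for before ending the user's turn."""
    finished = None
    if semantic_model is not None:
        try:
            finished = semantic_model(transcript)
        except Exception:
            finished = None  # model unavailable: fall through to the text heuristic
    if finished is None:
        finished = finished_by_text(transcript)
    return short_ms if finished else long_ms

def turn_has_ended(silence_ms: int, transcript: str,
                   semantic_model: Optional[Callable[[str], bool]] = None,
                   short_ms: int = 300, long_ms: int = 1200) -> bool:
    """True once the observed silence reaches the required wait."""
    return silence_ms >= required_silence_ms(transcript, semantic_model, short_ms, long_ms)
```

Lowering `short_ms` makes replies snappier after complete sentences, while raising `long_ms` gives more room to someone who pauses mid-thought.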
Barge-in is the feature you only notice when it's missing: you should be able to cut Ophie off mid-sentence and have it stop and listen.
That's real here. Ophie's spoken responses allow interruptions, so if it starts down the wrong path you can talk over it and it will yield — the way a person would. Conversation isn't a walkie-talkie.
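One common way to get this behavior is to run playback as a cancellable task and race it against a "user started speaking" signal. The sketch below uses only Python's asyncio; the function names, the frame length, and the timed interruption are assumptions for illustration, and in a real session the signal would come from voice activity detection rather than a timer.

```python
import asyncio
from typing import Optional

async def play_reply(frames: list[str], played: list[str], frame_s: float) -> None:
    """Stand-in for audio playback: 'play' one frame, then wait one frame length."""
    for frame in frames:
        played.append(frame)
        await asyncio.sleep(frame_s)

async def speak_with_barge_in(frames: list[str], user_speaking: asyncio.Event,
                              frame_s: float = 0.05) -> list[str]:
    """Play frames until they run out or the user starts talking, whichever is first."""
    played: list[str] = []
    playback = asyncio.create_task(play_reply(frames, played, frame_s))
    interrupt = asyncio.create_task(user_speaking.wait())
    _, pending = await asyncio.wait({playback, interrupt},
                                    return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # stop the loser: playback on barge-in, the waiter otherwise
    await asyncio.gather(*pending, return_exceptions=True)
    return played

async def demo(interrupt_after_s: Optional[float] = None) -> list[str]:
    """Hypothetical driver: optionally simulate the user cutting in after a delay."""
    user_speaking = asyncio.Event()
    if interrupt_after_s is not None:
        asyncio.get_running_loop().call_later(interrupt_after_s, user_speaking.set)
    return await speak_with_barge_in([f"frame-{i}" for i in range(10)], user_speaking)

if __name__ == "__main__":
    print(len(asyncio.run(demo())), "frames played without interruption")
    print(len(asyncio.run(demo(0.12))), "frames played before barge-in")
```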
Engineered for the ear, not the screen
One detail we find telling: Ophie's replies are written to be heard, not read. The system that generates responses is instructed to avoid markdown, bullet lists, and other formatting that only makes sense on a screen, because the output is going straight to speech. It even inserts small timed pauses into the reply — slightly longer ones before emotionally heavy words — so the synthesized voice breathes the way a person does instead of racing through.
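The post describes this as an instruction given to the model, not a filter applied afterwards. Purely to illustrate the effect, here is a hypothetical post-processing pass that strips screen-only formatting and places a pause marker before selected words. The word list, the `<pause:…>` marker syntax, and the 400 ms duration are all invented for the example and do not reflect any provider's actual pause format.

```python
import re

# Hypothetical list of emotionally heavy words; not taken from Ophie.
HEAVY_WORDS = {"grief", "loss", "alone"}

def strip_markdown(text: str) -> str:
    """Remove list markers, headings, and emphasis so the text reads as plain speech."""
    text = re.sub(r"^\s*(?:[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"^\s*#+\s*", "", text, flags=re.MULTILINE)              # headings
    text = re.sub(r"[*_`]+", "", text)                                     # emphasis, code ticks
    return re.sub(r"\s*\n\s*", " ", text).strip()                          # join lines

def add_pauses(text: str, long_ms: int = 400) -> str:
    """Insert a made-up pause marker before each word on the heavy-word list."""
    out = []
    for word in text.split():
        if word.strip(".,!?;:").lower() in HEAVY_WORDS:
            out.append(f"<pause:{long_ms}ms>")
        out.append(word)
    return " ".join(out)

def prepare_for_speech(text: str) -> str:
    return add_pauses(strip_markdown(text))
```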
None of this changes what Ophie is. It is supplementary support, not therapy, and not a substitute for a licensed therapist. It never pretends to be a human clinician, and it's built to encourage healthy real-world connections rather than reliance on an app. The voice pipeline is in service of that: making the support that does exist feel less like dictating to a tool and more like being heard.
When safety overrides speed
There is one case where we deliberately throw the latency budget away. If Ophie's safety system detects a high-risk moment, it cancels whatever reply was being generated and delivers a pre-written crisis response instead of a freshly reasoned one. And no matter what fails upstream, a hardcoded fallback always surfaces the 988 Suicide and Crisis Lifeline and the Crisis Text Line (text HOME to 741741).
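The control flow described here can be sketched with asyncio: start generating a reply, run the safety check alongside it, cancel the reply if the check fires, and return a hardcoded message if anything fails. The function names, the callable signatures, and the wording of `CRISIS_RESPONSE` are hypothetical; only the 988 and 741741 details come from the text above.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical wording; the real pre-written response is not shown in this post.
CRISIS_RESPONSE = "I'm really glad you told me. You deserve support right now."
HARDCODED_FALLBACK = (
    "If you're in crisis, call or text 988 (the 988 Suicide and Crisis Lifeline), "
    "or text HOME to 741741 (the Crisis Text Line)."
)

async def respond(user_text: str,
                  generate: Callable[[str], Awaitable[str]],
                  is_high_risk: Callable[[str], Awaitable[bool]]) -> str:
    """Return the generated reply unless safety overrides it or something fails."""
    reply_task = asyncio.create_task(generate(user_text))
    try:
        if await is_high_risk(user_text):
            return CRISIS_RESPONSE        # the finally block cancels the in-flight reply
        return await reply_task
    except Exception:
        return HARDCODED_FALLBACK         # any upstream failure still surfaces crisis lines
    finally:
        if not reply_task.done():
            reply_task.cancel()
        await asyncio.gather(reply_task, return_exceptions=True)
```

In this sketch the crisis path never waits for the model: the pre-written response is returned as soon as the check completes, and the unfinished generation is cancelled.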
Fast is the goal almost all the time. But the pipeline is built so that when it actually matters, getting the right words out beats getting words out quickly.
References
- Levinson & Torreira (2015). Timing in turn-taking and its implications for processing models of language. Frontiers in Psychology (via NIH PMC).
- Mozilla / MDN Web Docs (2025). WebRTC API.
- WebRTC.org (2025). WebRTC — Real-time communication for the web.
Ophie offers supplementary, educational support for adults 18 and over. It is not medical advice, not therapy, and not a substitute for professional care. If you're in crisis, contact your local emergency services or a crisis line — in the US, call or text 988, or text HOME to 741741.