Behind the Scenes

Why We Built Ophie Voice-First

Ophie Team · Mar 18, 2026 · 6 min read

Think about the last time something was bothering you and a friend asked what was wrong. You probably didn't open a notes app and start typing. You talked. The words came out half-formed, looped back, corrected themselves mid-sentence — and somewhere in that messy process, you understood your own feeling a little better than before you started. That ordinary act of talking something out is the whole reason Ophie is voice-first.

When we sat down to design a companion for everyday emotional struggles, the first real decision wasn't which model to use or how to store transcripts. It was: how should a person talk to this thing? We chose voice as the default — with text always one tap away — and this post is the thinking, and a little of the engineering, behind that choice.

Why voice, not text

For most of what people do with a computer, typing is fine. You type to search, to write, to code. But emotional processing isn't a search query. When you're anxious or upset, the gap between what you feel and what you can put into a neatly typed sentence is exactly where the friction lives. Typing asks you to compose. Talking lets you think out loud.

Voice also carries things text strips away — pace, hesitation, the place where your sentence trails off. We're not making clinical claims about any of that. The simpler point is the one most people already know from their own lives: saying a worry out loud, even to no one in particular, often takes some of the weight off it. Ophie is built around that instinct rather than asking you to override it.

For emotional support, talking is more natural than typing. So voice is the default, not a feature we bolted on later.

The mission behind it

Ophie exists to be accessible, supplementary support for the ordinary stuff — a hard day, a stretch of loneliness, a minor conflict you keep replaying. It is not therapy and not a substitute for a licensed professional. We're deliberate about that boundary, and we built the product to respect it rather than blur it.

That includes designing against over-reliance. A companion that quietly becomes the only place you ever take your feelings is not a healthy one. Ophie is meant to be a place to think out loud and then go reconnect with the people in your life — it encourages healthy external connection rather than positioning itself as a replacement for it. Voice fits that mission because a quick conversation is lighter-weight than sitting down to journal. You can open Ophie, talk for a few minutes, and get on with your day.

The access problem we're trying to dent

Real mental-health care is hard to reach. It costs money many people don't have, waitlists can stretch for weeks, and stigma keeps a lot of people from ever picking up the phone. Those are real barriers, and Ophie doesn't pretend to remove them — a companion app is not a clinician.

But there's a large space below the threshold of professional care where someone mostly needs a lighter-weight, always-available place to think out loud. Ophie is available around the clock. No appointment, no waitlist, no scheduling a slot three weeks out for a feeling you have right now. When something hits at 1 a.m., the option to just talk is there.

What it takes to make voice feel real-time

A voice companion only works if it feels like a conversation, and conversation is unforgiving about timing. A delay that would be invisible in a chat app feels like dead silence when you're speaking. So a lot of the engineering behind Ophie is, quietly, about shaving milliseconds.

Under the hood, your words pass through a few stages: speech-to-text turns what you say into words (we use Deepgram), a reasoning model works out a thoughtful response, and natural text-to-speech turns that response back into a voice you can listen to (we use Cartesia). All of it streams over WebRTC through LiveKit, the same kind of real-time transport that powers live calls.

  • Speech-to-text (Deepgram) transcribes your voice as you speak, not after you finish.
  • A reasoning model decides how to respond, with the context of being a companion rather than a general assistant.
  • Text-to-speech (Cartesia) renders the reply in a natural voice instead of a robotic one.
  • WebRTC over LiveKit streams it all in both directions, the way a live call does.
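For readers who think in code, the stages above can be sketched as a chain of streaming steps. This is a toy illustration, not Ophie's actual code: `microphone`, `fake_stt`, `fake_reasoner`, and `fake_tts` are hypothetical stand-ins written with Python's standard library only, and none of them call Deepgram, Cartesia, or LiveKit. The point is the shape: each stage consumes a stream and produces a stream, so later stages can start before earlier ones have finished.

```python
import asyncio
from typing import AsyncIterator, List


async def microphone(words: List[str]) -> AsyncIterator[bytes]:
    """Hypothetical audio source: yields one 'audio chunk' per spoken word."""
    for word in words:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield word.encode("utf-8")


async def fake_stt(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stand-in for streaming speech-to-text: emits text as chunks arrive."""
    async for chunk in audio:
        yield chunk.decode("utf-8")


async def fake_reasoner(words: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stand-in for the reasoning model: waits for the whole utterance,
    then streams its reply one token at a time."""
    heard = [word async for word in words]
    reply = "You said: " + " ".join(heard)
    for token in reply.split():
        yield token


async def fake_tts(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stand-in for text-to-speech: turns each token into an 'audio frame'."""
    async for token in tokens:
        yield token.encode("utf-8")


async def run_pipeline(spoken: List[str]) -> List[bytes]:
    """Chain the stages and collect the outgoing audio frames."""
    frames = fake_tts(fake_reasoner(fake_stt(microphone(spoken))))
    return [frame async for frame in frames]


def converse(spoken: List[str]) -> List[bytes]:
    """Synchronous entry point for trying the sketch out."""
    return asyncio.run(run_pipeline(spoken))
```

Calling `converse(["rough", "day"])` returns the reply as a list of byte frames. In a real system the frames would be audio sent back over the live connection rather than encoded words.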

The target we design toward is latency under 200 ms for the parts of the pipeline we control, fast enough that the back-and-forth feels like talking, not like waiting on a machine to catch up.
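One way to picture a target like this is as a budget that per-stage timings are checked against. The sketch below is an assumption-heavy illustration: the stage names, the millisecond figures in the usage note, and the `controlled` flag are all made up for the example and are not measurements from Ophie. Only the 200 ms figure comes from the text above.

```python
from dataclasses import dataclass
from typing import Iterable

BUDGET_MS = 200.0  # the design target mentioned in the post


@dataclass
class StageTiming:
    name: str          # hypothetical stage label
    latency_ms: float  # time spent in this stage, in milliseconds
    controlled: bool   # True if this stage counts toward the budget


def controlled_latency_ms(timings: Iterable[StageTiming]) -> float:
    """Sum latency over only the stages flagged as under our control."""
    return sum(t.latency_ms for t in timings if t.controlled)


def within_budget(timings: Iterable[StageTiming], budget_ms: float = BUDGET_MS) -> bool:
    """True when the controlled stages together stay under the budget."""
    return controlled_latency_ms(timings) < budget_ms


# Illustrative numbers only, not real measurements.
example = [
    StageTiming("audio buffering", 40.0, controlled=True),
    StageTiming("orchestration", 60.0, controlled=True),
    StageTiming("third-party model call", 350.0, controlled=False),
]
```

With these invented figures, `controlled_latency_ms(example)` is 100.0 and `within_budget(example)` is `True`, because the stage marked as outside our control is left out of the sum.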

Why text still matters

Voice-first does not mean voice-only. A persistent text input is always there, and you can switch to it mid-session without losing your place. Sometimes you're somewhere you can't speak out loud — a quiet office, a shared room, a late hour with someone asleep nearby. Sometimes a person simply prefers to type, and that preference is reason enough.

Accessibility matters here too. Voice can't be the only door in. Some people can't use voice comfortably, or at all, and a companion that only listens would shut them out. So text is a first-class way to talk to Ophie, not a fallback we tolerate. The two run in parallel; you pick whichever fits the moment.
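A simple way to imagine "switching without losing your place" is a single session history that both input modes write to. The sketch below is hypothetical: `Session`, `Turn`, and their methods are names invented for illustration and do not describe Ophie's real data model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Turn:
    modality: str  # "voice" or "text"
    content: str   # what the person said or typed


@dataclass
class Session:
    """One shared history, regardless of how each turn arrived."""
    turns: List[Turn] = field(default_factory=list)

    def add_voice(self, transcript: str) -> None:
        """Record a spoken turn, using its transcript as the content."""
        self.turns.append(Turn("voice", transcript))

    def add_text(self, message: str) -> None:
        """Record a typed turn in the same history."""
        self.turns.append(Turn("text", message))

    def context(self) -> List[str]:
        """The conversation so far, in order, ignoring input mode."""
        return [turn.content for turn in self.turns]
```

Because spoken and typed turns land in the same list, whatever reads `context()` sees one continuous conversation even if the person switched modes halfway through.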

The boundary: a companion, not a therapist

Everything above is in service of one honest position. Ophie is a supportive companion for everyday struggles. It does not diagnose, it does not treat conditions, and it is not a replacement for a licensed professional. It is built for adults, and it's designed to point you toward real human support — friends, family, professionals — not to stand in for them.

We chose voice because it's the most natural way most people already process a feeling, and we built the rest of the product to stay inside its lane while doing that one thing well. If you want to see how the conversation actually works, the how-it-works page walks through a session, and our approach page goes deeper on the thinking behind it.