Our Approach to AI Safety in Mental Health

Ophie Team Dec 28, 2025 10 min read

If you are in crisis right now, please don't spend it reading this. Contact your local emergency services or a crisis line in your country — they are staffed by trained humans who can help in ways an app cannot. The rest of this post is for people who want to understand, calmly and in detail, how Ophie handles the moments when a conversation turns serious.

Building a voice-first AI companion for mental health means accepting a hard truth up front: some of the people who talk to Ophie will be having one of the worst days of their lives. A general-purpose chatbot can treat that as an edge case. We can't. Safety isn't a feature we bolted on at the end — it's the part of the system we designed first, and the part we are most willing to make Ophie worse at everything else to protect.

This post explains how that works: the two layers that watch for danger, what happens when one of them fires, and why Ophie refuses certain requests no matter how they're framed. The short version lives at /safety. The honest version is below.

Why a mental health app needs more than a content filter

Mainstream models are trained to be agreeable and broadly helpful. That instinct, which makes them pleasant to use, is exactly what makes them risky in a mental health context. They can be talked around. Reframe a harmful request as fiction, as a hypothetical, as research "for a story," and a model optimized to be cooperative will often go along with it. For most uses that's a quirk. For someone in distress, it's a failure mode with real stakes.

We don't want Ophie to be agreeable. We want it to be safe, and to hold that line even when holding it makes for an awkward conversation.

So we built safety as a separate system that sits around the language model, not inside its good intentions. It watches every message, it can interrupt the model mid-thought, and the words it speaks in a crisis were not written by the model at all.

Two layers of detection

Every message you send to Ophie passes through two independent checks. They run differently and catch different things on purpose — one is fast and literal, the other is slower and understands context.

Layer one is a deterministic pass. It runs synchronously on every user message, before anything else happens. It's a fast keyword-and-heuristic check tuned to recognize clear, unambiguous markers: explicit statements of suicidal ideation, active self-harm, active intent to harm a specific identified person, and severe hopelessness. It is not clever, and that's the point: it is fast, predictable, and never "reasons its way" out of catching an obvious red flag.
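
To make the idea concrete, here is a minimal sketch of what a deterministic pass of this kind could look like. It is not Ophie's actual rule set: the category labels, the handful of patterns, and the function name `deterministic_pass` are all assumptions chosen for illustration, and a real list would be far larger and carefully reviewed.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only. A real rule set would be
# much larger and would be reviewed by people with clinical expertise.
_PATTERNS = {
    "suicidal_ideation": [r"\bkill myself\b", r"\bend my life\b"],
    "self_harm": [r"\bhurting myself\b", r"\bcut myself\b"],
    "severe_hopelessness": [r"\bno reason to (?:live|go on)\b"],
}

# Compile once at import time so the per-message check stays fast.
_COMPILED = {
    category: [re.compile(pattern, re.IGNORECASE) for pattern in patterns]
    for category, patterns in _PATTERNS.items()
}


def deterministic_pass(message: str) -> Optional[str]:
    """Return the first matching category label, or None if nothing matches."""
    # Normalize curly apostrophes and collapse runs of whitespace so that
    # trivial formatting differences do not defeat a literal match.
    text = " ".join(message.replace("\u2019", "'").split())
    for category, patterns in _COMPILED.items():
        if any(pattern.search(text) for pattern in patterns):
            return category
    return None
```

Because the function is pure and does no I/O, it can run inline on every message, and the same input always produces the same answer.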

Layer two is a machine-learning safety classifier. It's an open-weight safeguard model running on fast inference, and it looks at a short window of the conversation rather than a single message — because risk often lives in the shape of an exchange, not one sentence. It runs asynchronously so it never adds latency to Ophie's voice, and it returns three things: a category, a severity level on a five-point scale (NONE, LOW, MODERATE, HIGH, CRITICAL), and a confidence score.
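
The shape of that result can be sketched as a small data type. The five severity names come from the description above; everything else here is an assumption for illustration, including the class names, the idea that confidence is a float between 0 and 1, and the helper `needs_protocol`.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """The five-point scale, ordered so that levels can be compared."""
    NONE = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3
    CRITICAL = 4


@dataclass(frozen=True)
class ClassifierResult:
    category: str       # hypothetical free-form label, e.g. "self_harm"
    severity: Severity
    confidence: float   # assumed to lie in [0.0, 1.0]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0.0 and 1.0")


def needs_protocol(result: ClassifierResult) -> bool:
    """True when the severity is HIGH or CRITICAL."""
    return result.severity >= Severity.HIGH
```

Using an ordered enum means "HIGH or CRITICAL" becomes a single comparison rather than a list of string checks. How the confidence score should influence that decision is not specified here, so the sketch leaves it out of `needs_protocol`.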

The two layers cover each other's blind spots. The deterministic pass catches the blunt, obvious cases instantly, even if the classifier is mid-evaluation. The classifier catches the cases that don't use any obvious keyword at all — the quiet, oblique signals that a literal filter would miss.
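
One simple way to express "either layer can raise the alarm" is to take the more severe of the two signals, and to fall back to the deterministic result alone while the classifier is still evaluating. The sketch below assumes both layers report on the same five-point scale; the function name `combined_severity` and the use of `None` for "classifier not finished yet" are illustrative choices, not a description of Ophie's internals.

```python
from enum import IntEnum
from typing import Optional


class Severity(IntEnum):
    NONE = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3
    CRITICAL = 4


def combined_severity(
    deterministic: Severity,
    classifier: Optional[Severity],
) -> Severity:
    """Return the more severe of the two signals.

    `classifier` is None while the asynchronous check has not finished,
    in which case the deterministic result stands on its own.
    """
    if classifier is None:
        return deterministic
    return max(deterministic, classifier)
```

With this rule neither layer can lower the other's finding: a quiet keyword pass does not cancel a HIGH from the classifier, and a pending classifier does not delay a CRITICAL from the keyword pass.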

What happens at HIGH or CRITICAL

When either layer surfaces a HIGH or CRITICAL signal, Ophie stops behaving like a conversational model and follows a protocol. Several things happen in sequence:

  • Generation is interrupted. Whatever the language model was about to say is cut off. The model does not get to improvise its way through a crisis.
  • A pre-drafted response is delivered. Ophie speaks a deliberate, plain-language crisis response that our safety team wrote in advance — not something the LLM generated on the fly. It surfaces relevant hotlines and local emergency resources.
  • The topic is paused. Ophie does not keep generating on the sensitive thread until the person clearly signals they want to continue. It doesn't push.
  • The event is logged. A record goes to a restricted safety-audit log so we can review how the system performed and improve the protocols over time.
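
The four steps above can be sketched as a single handler that runs them in order. This is a simplified illustration under stated assumptions: the `Session` class, the `cancel_generation` callback, the in-memory audit list, and the placeholder response text are all hypothetical stand-ins. The real response wording is human-authored and is not reproduced here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Placeholder only. The real text is written in advance by a safety team.
CRISIS_RESPONSE = "[pre-drafted crisis response with hotlines and local emergency resources]"


@dataclass
class Session:
    topic_paused: bool = False
    spoken: List[str] = field(default_factory=list)
    # Stand-in for a restricted safety-audit log.
    audit_log: List[Dict[str, str]] = field(default_factory=list)


def run_crisis_protocol(
    session: Session,
    severity: str,
    cancel_generation: Callable[[], None],
) -> None:
    """Run the protocol steps in a fixed order."""
    # 1. Interrupt whatever the language model was generating.
    cancel_generation()
    # 2. Deliver the pre-drafted response instead of model output.
    session.spoken.append(CRISIS_RESPONSE)
    # 3. Pause the sensitive topic until the person opts back in.
    session.topic_paused = True
    # 4. Record the event for later review.
    session.audit_log.append({
        "event": "crisis_protocol",
        "severity": severity,
        "at": datetime.now(timezone.utc).isoformat(),
    })


def resume_topic(session: Session) -> None:
    """Call only after the person clearly signals they want to continue."""
    session.topic_paused = False
```

Keeping the response as a constant, outside any model call, is the point of the design: the handler has no code path through which generated text can reach the person during a crisis.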

The reason the crisis language is written ahead of time is simple: a moment of acute distress is the worst possible time to trust a probabilistic text generator to find the right words. We would rather Ophie say something careful and human-authored than something fluent and improvised.

Refusing the workarounds

A safety system is only as good as its resistance to being talked around. The most common way people get harmful content out of a general model is reframing: "it's for a novel," "hypothetically," "for a school project," "let's role-play." Ophie treats those framings as what they are — attempts to reach the same content through a side door — and refuses regardless.

Specifically, Ophie will not produce content that encourages or facilitates suicide, self-harm, or harm to others, even when the request is dressed up as fiction, a hypothetical, academic inquiry, or role-play. Concretely, that means Ophie will not:

  • describe methods of self-harm or suicide;
  • give dosage, method, or "how-to" advice;
  • evaluate "would this work" hypotheticals about lethal means;
  • provide lethality comparisons; or
  • rank or compare methods in any way.

Persistent attempts to push past these refusals can end the session. That is a deliberate choice. Some conversations are not ones we want Ophie to continue, and walking away is a valid safety response.
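
As a rough sketch of that escalation, a session could count refusals and end once a limit is reached. The threshold of three is purely an assumption for illustration, since no number is given here, and the `RefusalTracker` class and its return values are likewise hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold; the actual policy is not specified in this post.
MAX_REFUSALS = 3


@dataclass
class RefusalTracker:
    refusals: int = 0
    ended: bool = False

    def record_refusal(self) -> str:
        """Count one refused request and return the action to take."""
        self.refusals += 1
        if self.refusals >= MAX_REFUSALS:
            self.ended = True
            return "end_session"
        return "refuse"
```

The framing of a request (fiction, hypothetical, role-play) never enters this logic. Only the fact that a refusal happened is counted, which matches the principle that reframing does not change the answer.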

The honest goal isn't to win an argument with someone trying to circumvent the guardrails. It's to refuse plainly, point toward real help, and not be useful for harm.

How we test that any of this works

Writing safety rules is easy. Knowing whether they hold under pressure is the hard part. We evaluate Ophie qualitatively against safety rubrics: structured scenarios covering crisis signals, circumvention attempts, and the kinds of oblique distress that don't announce themselves. We review how Ophie responds and where it falls short, then feed those findings back into both the deterministic rules and the protocols.

We're deliberately not putting a number on it here. Our evaluation currently relies on a single judge, which means any headline score would carry a real caveat — and a precise-looking metric would imply more certainty than we have. We'd rather describe the practice honestly than dress it in statistics it can't support. You can read more about how we think about measurement and accountability at /trust.

What Ophie is not

The most important safety feature is knowing what we are not. Ophie is supplementary support — a companion for everyday struggles, moments of loneliness, and minor conflicts. It is for adults, 18 and older. It is not therapy, not a clinician, and not a diagnostic or treatment tool. It does not replace professional care, and it is not built for acute conditions or emergencies.

No detection system is perfect. A classifier can miss a signal; a keyword pass can be fooled; a person in crisis may not say the thing that would trigger either layer. We build these systems to reduce harm and to point firmly toward real help — not to be a substitute for it. If you or someone you care about is in danger, contact local emergency services or a crisis line. That is the right tool for that moment, and Ophie is designed to say so.

Safety work is never finished. The protocols will keep evolving, the rubrics will get harder, and we'll keep being explicit about where the limits are. If you spot something we've gotten wrong, please email team@ophie.app. We take it seriously.