tl;dr: Use /ws3.
/ws3 is Rime’s flagship WebSocket endpoint. It supports all current and future Rime TTS models, delivers the lowest possible TTFB via direct model streaming, and returns audio as structured JSON messages. Unless you have a specific reason to use a legacy endpoint, start here.
Choosing an endpoint
Rime offers three WebSocket endpoints. Here’s how they compare:

| Feature | /ws3 | /ws2 | /ws |
|---|---|---|---|
| Message format | JSON | JSON | Raw binary audio |
| Arcana | ✅ | ❌ | ✅ |
| Mist v1 | ✅ | ✅ | ✅ |
| Mist v2 | ✅ | ✅ | ✅ |
| Word-level timestamps | ✅ | ✅ | ❌ |
| Context IDs | ✅ | ✅ | ❌ |
| TTFB optimization | ✅ Actively optimized | ❌ | ❌ |
/ws3 — JSON WebSocket (flagship)
Supported models:
- `arcana` (all versions)
- `mistv1`, `mistv2`
- All future Rime TTS models
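As a rough sketch, a /ws3 session might look like the following. The host URL is a placeholder, auth is omitted, and the JSON field names (`text`, `operation`, `type`, `data`) are assumptions for illustration; consult Rime’s API reference for the exact schema.

```python
# Minimal /ws3 client sketch. URL, auth, and JSON field names below are
# assumptions for illustration -- see Rime's API reference for the real schema.
import asyncio
import base64
import json

import websockets  # pip install websockets

WS3_URL = "wss://<rime-host>/ws3"  # placeholder host; auth (e.g. a token) omitted


async def synthesize(text: str) -> bytes:
    audio = bytearray()
    async with websockets.connect(WS3_URL) as ws:
        # Assumed framing: a JSON message carrying the text to synthesize.
        await ws.send(json.dumps({"text": text}))
        # Assumed framing: eos synthesizes whatever remains in the buffer,
        # then the server closes the connection (see Operations below).
        await ws.send(json.dumps({"operation": "eos"}))
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("type") == "chunk":  # assumed audio-chunk message type
                audio.extend(base64.b64decode(msg["data"]))  # assumed base64 payload
    return bytes(audio)


if __name__ == "__main__":
    pcm = asyncio.run(synthesize("Hello from Rime."))
    print(f"received {len(pcm)} bytes of audio")
```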
/ws2 — Legacy JSON WebSocket
/ws2 supports the same JSON message format as /ws3 but is limited to the mist model family (v1 and v2) running on Baseten. TTFB is not further optimized beyond what Baseten’s infrastructure provides. This endpoint will not receive new model support.
Supported models:
- `mistv1`
- `mistv2`
/ws2 is functionally equivalent to /ws3 for mist workloads, but /ws3 is preferred. Existing integrations using /ws2 will continue to work.
/ws — Binary WebSocket (legacy)
/ws sends and receives raw audio bytes rather than JSON. It supports a broader model set than /ws2 but does not benefit from TTFB optimization. This endpoint is suited for clients that need raw PCM/binary audio and cannot handle JSON framing.
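For contrast, a hedged sketch of consuming /ws, where responses arrive as raw audio frames with no JSON envelope. The URL is again a placeholder, and sending input as a plain text frame is our assumption, not something this page specifies.

```python
# /ws sketch: responses are raw audio frames, not JSON. Placeholder URL;
# sending input as a plain WebSocket text frame is an assumption.
import asyncio

import websockets


async def binary_stream(text: str) -> bytes:
    audio = bytearray()
    async with websockets.connect("wss://<rime-host>/ws") as ws:
        await ws.send(text)
        async for frame in ws:
            if isinstance(frame, bytes):  # binary frame = raw PCM audio
                audio.extend(frame)
    return bytes(audio)
```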
Supported models:
- `mistv1`
- `mistv2`
- `arcana`
Features available on JSON endpoints (/ws3 and /ws2)
Word-level timestamps
Both JSON endpoints return word-level timing data alongside audio. This is useful for tracking which words have already been spoken — for example, when an end-user interrupts the assistant mid-sentence.
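As a small illustration of why the timing data matters, here is a sketch that buffers word timings as they stream in and, at interruption time, reports what the user actually heard. The message shape (`type`, `words`, `starts`) is hypothetical; only the word-level-timing behavior comes from this page.

```python
# Hypothetical message shape -- real field names live in Rime's API reference.
word_timings: list[tuple[str, float]] = []  # (word, start time in seconds)


def on_timestamps(msg: dict) -> None:
    # Buffer timings as they arrive alongside the audio stream.
    if msg.get("type") == "timestamps":  # assumed message type
        word_timings.extend(zip(msg["words"], msg["starts"]))  # assumed fields


def spoken_before(playback_position_s: float) -> list[str]:
    """Words whose start time precedes the playback position at interruption."""
    return [w for w, t in word_timings if t <= playback_position_s]
```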
Context IDs
On /ws3 and /ws2, you can attach a `contextId` to any text message and it will be echoed back on the corresponding audio chunk event. This is useful for correlating audio output to specific turns or requests in a multi-turn conversation.
Rime does not maintain multiple simultaneous context IDs. The audio chunk event will carry the most recent context ID that was active at the time the audio was requested. If you send two messages before any audio is synthesized, only the later context ID will be reflected on the first audio chunk.
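A sketch of how `contextId` correlation might look in a turn-based client; the JSON field names are assumptions.

```python
import base64
import json


async def speak_turn(ws, turn_id: str, text: str) -> None:
    # Assumed framing: contextId rides alongside the text payload.
    await ws.send(json.dumps({"text": text, "contextId": turn_id}))


def route_chunk(msg: dict, buffers: dict[str, bytearray]) -> None:
    # Chunk events echo the most recent contextId, so audio can be
    # bucketed per turn. "type", "data", and "contextId" are assumed names.
    if msg.get("type") == "chunk":
        buffers.setdefault(msg["contextId"], bytearray()).extend(
            base64.b64decode(msg["data"])
        )
```

Because only the most recent context ID is tracked, a client that needs strict per-turn correlation should let one turn’s audio finish (or flush it) before sending the next turn’s text.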
Operations
In addition to sending text, your client can send structured operation messages to control the synthesis pipeline.
flush
Forces the current text buffer to be synthesized immediately and the resulting audio to be sent.
flush is especially useful in `segment=never` mode. See the Segmentation guide for details.
clear
Discards the accumulated text buffer without synthesizing it. Useful when the user interrupts the assistant and you want to cancel queued speech.
eos (end of stream)
Synthesizes whatever remains in the buffer, then immediately closes the connection.
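Putting the three operations together, a hedged sketch: the `{"operation": ...}` envelope is an assumed framing for the operations this page describes, so check Rime’s API reference for the exact shape.

```python
import json


async def flush(ws) -> None:
    await ws.send(json.dumps({"operation": "flush"}))  # synthesize the buffer now


async def clear(ws) -> None:
    await ws.send(json.dumps({"operation": "clear"}))  # drop queued text (barge-in)


async def eos(ws) -> None:
    await ws.send(json.dumps({"operation": "eos"}))  # synthesize the rest, then close
```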
Next steps
How text is buffered and when synthesis is triggered is controlled by the `segment` parameter. Understanding segmentation is the key to getting predictable, low-latency behavior from the WebSocket API.
Segmentation & behavior settings: Learn how `segment=never`, `segment=bySentence`, and `segment=immediate` work, and how to choose the right one for your use case.
