tl;dr: All Rime models stream.
Rime’s TTS API streams audio as it is generated, rather than waiting for the full utterance to be synthesized. Every Rime model (Coda, Arcana, and Mist) supports streaming over both HTTP and WebSockets, with sub-200ms end-to-end latency standard via the cloud API. For real-time voice agents, streaming is the default mode of operation, not an add-on.
Choosing a transport
| | HTTP streaming | WebSocket (JSON) | SSE |
|---|---|---|---|
| Endpoint | POST /v1/rime-tts | wss://users-ws.rime.ai/ws3 | POST /v1/rime-tts with Accept: text/event-stream |
| Models | All | All | Mist v2 only |
| Connection | One request per utterance | Persistent, multi-utterance | One request per utterance |
| Input | Full text upfront | Incremental text (e.g. from a streaming LLM) | Full text upfront |
| Word-level timestamps | ❌ | ✅ | ❌ |
| Interruption handling | Close the connection | clear operation + context IDs | Close the connection |
| Best for | Simple integrations, server-side synthesis | Voice agents, conversational AI, telephony | Browser EventSource clients |
- Building a voice agent? Use the WebSocket API. A persistent connection avoids per-utterance handshakes, you can feed text in as your LLM generates it, and word-level timestamps tell you exactly what was spoken when a caller interrupts.
- Synthesizing complete sentences server-side? HTTP streaming is the simplest path: one POST, audio bytes stream back in the response body.
- Already using LiveKit, Pipecat, Vapi, or Daily? The integrations handle transport for you — Rime plugs in as the TTS stage of the pipeline.
HTTP streaming
A single POST to https://users.rime.ai/v1/rime-tts returns audio bytes in the response body as they are generated. The audio format is controlled by the Accept header (Opus, MP3, WAV, PCM, or G.711 μ-law; see the full format table).
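A minimal Python sketch of that request, using only the standard library. The endpoint and the role of the Accept header come from the text above; the Bearer-token Authorization header, the JSON body fields (text, speaker, modelId), and the placeholder values for voice and model are assumptions for illustration, so check them against the API reference before relying on them.

```python
import json
import urllib.request

RIME_TTS_URL = "https://users.rime.ai/v1/rime-tts"


def build_tts_request(text, api_key, accept):
    """Build (but do not send) a streaming TTS request."""
    # Assumed body fields and placeholder values; the text above only
    # specifies the endpoint and the Accept header.
    body = {"text": text, "speaker": "example_voice", "modelId": "mist"}
    return urllib.request.Request(
        RIME_TTS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Accept": accept,                      # selects the audio format
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


def stream_to_file(request, path, chunk_size=4096):
    """Send the request and write audio bytes to disk as they are read."""
    total = 0
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)  # up to chunk_size bytes
            if not chunk:
                break
            out.write(chunk)
            total += len(chunk)
    return total
```

For example, `stream_to_file(build_tts_request("Hello", key, "audio/PCMU"), "hello.ulaw")` would request μ-law output and save it as it arrives. In a voice pipeline, each chunk would be handed to the player or the call leg instead of a file.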
WebSocket streaming
Rime’s WebSocket API (wss://users-ws.rime.ai/ws3) holds a persistent connection: send text messages as your application produces them, and receive structured JSON events back — base64 audio chunks, word-level timestamps, and a done event per synthesis batch.
Events returned by the server include chunk (base64 audio), timestamps (word-level timing), done (synthesis batch complete), and error. Text buffering and synthesis triggering are controlled by the segment parameter; see Segmentation & behavior settings.
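One way to handle these events on the client is a small dispatcher that decodes audio, collects timing data, and reports when a batch is finished. The sketch below leaves the WebSocket client out so it can run offline. The JSON field names (type, data, word_timestamps, message) are assumptions, not the confirmed schema; the event schema in the WebSocket API overview is the authority.

```python
import base64
import json


def handle_event(raw_message, audio, timestamps):
    """Apply one JSON event to the running state.

    Returns True once a "done" event arrives, False otherwise.
    """
    event = json.loads(raw_message)
    kind = event.get("type")  # assumed discriminator field
    if kind == "chunk":
        # Assumed field name; audio arrives base64-encoded.
        audio.extend(base64.b64decode(event["data"]))
    elif kind == "timestamps":
        # Assumed field name; stored as received for later lookup.
        timestamps.append(event.get("word_timestamps"))
    elif kind == "error":
        raise RuntimeError(event.get("message", "synthesis error"))
    return kind == "done"
```

In a receive loop, call `handle_event` on every incoming text frame with a shared `bytearray` and list, and stop reading for the current batch when it returns True.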
See the WebSocket API overview for choosing between /ws3, /ws2, and /ws, the full event schema, context IDs, and interruption handling.
Server-sent events (SSE)
For clients built around EventSource, Mist v2 supports server-sent events: the same POST /v1/rime-tts endpoint with Accept: text/event-stream streams audio as events over a standard HTTP response. SSE is only available for Mist v2; for other models, use HTTP or WebSocket streaming.
Streaming for telephony and IVR
For phone-based voice agents and IVR systems, request audio that matches the telephony codec directly — Rime synthesizes G.711 μ-law (audio/PCMU) and 8kHz-sampled audio natively, so no transcoding step sits between synthesis and the caller:
- Set Accept: audio/PCMU (HTTP) or audioFormat=mulaw (WebSocket) for μ-law output.
- Set samplingRate: 8000 to match the telephony stream and shrink payloads.
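Streamed chunks arrive in arbitrary sizes, while telephony media streams usually expect fixed-duration frames. G.711 μ-law uses one byte per sample, so 8kHz audio is 8,000 bytes per second. The sketch below re-chunks a stream into 20ms frames of 160 bytes. The 20ms frame length is an assumed, common packetization interval, and the code assumes the response body is raw μ-law with no container header.

```python
SAMPLE_RATE_HZ = 8000   # matches samplingRate: 8000
BYTES_PER_SAMPLE = 1    # G.711 mu-law encodes each sample in 8 bits
FRAME_MS = 20           # assumed packetization interval
FRAME_BYTES = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * FRAME_MS // 1000


def frames(chunks, frame_bytes=FRAME_BYTES):
    """Yield fixed-size frames from an iterable of arbitrarily sized chunks."""
    buffer = bytearray()
    for chunk in chunks:
        buffer.extend(chunk)
        while len(buffer) >= frame_bytes:
            yield bytes(buffer[:frame_bytes])
            del buffer[:frame_bytes]
    if buffer:
        # Pad the final partial frame with mu-law silence (0xFF).
        yield bytes(buffer) + b"\xff" * (frame_bytes - len(buffer))
```

Each yielded frame can then be forwarded to the carrier or media server on a 20ms cadence.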
Latency
Streaming is the single biggest lever on perceived responsiveness: time-to-first-audio, not total synthesis time, is what a listener experiences. Rime’s cloud API delivers sub-200ms end-to-end latency as standard (with mistv3, typical time-to-first-byte is well below 100ms), and Coda achieves sub-100ms model latency on the GPU engine when self-hosted or on-prem.
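To see the difference between time-to-first-audio and total synthesis time in your own setup, you can time a stream as it is consumed. This is a minimal sketch, not a Rime utility: `measure_stream` is a hypothetical helper that accepts any iterable of audio byte chunks and takes an injectable clock so it can be tested without a network.

```python
import time


def measure_stream(chunks, clock=time.perf_counter):
    """Consume audio chunks and time them.

    Returns (time_to_first_audio, total_time, byte_count) in seconds/bytes.
    time_to_first_audio is None if no non-empty chunk arrives.
    """
    start = clock()
    first = None
    total_bytes = 0
    for chunk in chunks:
        if first is None and chunk:
            first = clock() - start  # first audible bytes arrived
        total_bytes += len(chunk)
    return first, clock() - start, total_bytes
```

The timer starts when the function is called, so pass a lazy generator that sends the request on its first iteration; otherwise connection and request time are excluded from the measurement.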
See the dedicated Latency page for measured benchmarks per model, what affects response time, and how to reduce it (regional endpoints, payload sizing, and text normalization).

