
tl;dr: All Rime models stream.

Rime’s TTS API streams audio as it is generated, rather than waiting for the full utterance to be synthesized. Every Rime model — Coda, Arcana, and Mist — supports streaming over both HTTP and WebSockets, with sub-200ms end-to-end latency standard via the cloud API. For real-time voice agents, streaming is the default mode of operation, not an add-on.

Choosing a transport

| | HTTP streaming | WebSocket (JSON) | SSE |
|---|---|---|---|
| Endpoint | POST /v1/rime-tts | wss://users-ws.rime.ai/ws3 | POST /v1/rime-tts with Accept: text/event-stream |
| Models | All | All | Mist v2 only |
| Connection | One request per utterance | Persistent, multi-utterance | One request per utterance |
| Input | Full text upfront | Incremental text (e.g. from a streaming LLM) | Full text upfront |
| Word-level timestamps | | Yes | |
| Interruption handling | Close the connection | clear operation + context IDs | Close the connection |
| Best for | Simple integrations, server-side synthesis | Voice agents, conversational AI, telephony | Browser EventSource clients |

Rules of thumb:
  • Building a voice agent? Use the WebSocket API. A persistent connection avoids per-utterance handshakes, you can feed text in as your LLM generates it, and word-level timestamps tell you exactly what was spoken when a caller interrupts.
  • Synthesizing complete sentences server-side? HTTP streaming is the simplest path: one POST, audio bytes stream back in the response body.
  • Already using LiveKit, Pipecat, Vapi, or Daily? The integrations handle transport for you — Rime plugs in as the TTS stage of the pipeline.

HTTP streaming

A single POST to https://users.rime.ai/v1/rime-tts returns audio bytes in the response body as they are generated. The audio format is controlled by the Accept header (Opus, MP3, WAV, PCM, or G.711 μ-law — see the full format table).
cURL
curl --request POST \
  --url https://users.rime.ai/v1/rime-tts \
  --header 'Accept: audio/mpeg' \
  --header "Authorization: Bearer $RIME_API_KEY" \
  --header 'Content-Type: application/json' \
  --output output.mp3 \
  --data '{
  "text": "Streaming audio from Rime, as it is generated.",
  "modelId": "coda",
  "speaker": "astra",
  "language": "en"
}'
Python
import os, requests

with requests.post(
    "https://users.rime.ai/v1/rime-tts",
    headers={
        "Authorization": f"Bearer {os.environ['RIME_API_KEY']}",
        "Content-Type": "application/json",
        "Accept": "audio/mpeg",
    },
    json={
        "text": "Streaming audio from Rime, as it is generated.",
        "modelId": "coda",
        "speaker": "astra",
        "language": "en",
    },
    stream=True,
) as response:
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=4096):
        ...  # feed chunk to your player, telephony stream, or file
The first audio bytes arrive long before synthesis finishes, so consume the body incrementally (as above) rather than buffering the whole response; otherwise you give back most of the latency win. See Latency for measured time-to-first-audio per model.
Endpoint references: Coda · Arcana · Mist v3 · Mist v2

WebSocket streaming

Rime’s WebSocket API (wss://users-ws.rime.ai/ws3) holds a persistent connection: send text messages as your application produces them, and receive structured JSON events back — base64 audio chunks, word-level timestamps, and a done event per synthesis batch.
Python
import asyncio, base64, json, os
import websockets

async def main():
    url = "wss://users-ws.rime.ai/ws3?speaker=astra&modelId=coda&audioFormat=mp3"
    headers = {"Authorization": f"Bearer {os.environ['RIME_API_KEY']}"}

    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({"text": "Hello from Rime over WebSockets."}))
        await ws.send(json.dumps({"operation": "eos"}))  # synthesize and close

        async for message in ws:
            event = json.loads(message)
            if event["type"] == "chunk":
                audio_bytes = base64.b64decode(event["data"])
                ...  # play or forward the audio
            elif event["type"] == "timestamps":
                print(event["word_timestamps"]["words"])
            elif event["type"] == "done":
                print("synthesis complete")

asyncio.run(main())
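
The example above sends one complete sentence. For the incremental case (text arriving from a streaming LLM), a minimal sketch of the sending side could look like the following. The `{"text": ...}` and `{"operation": "eos"}` shapes come from the example above; the helper names are hypothetical, and the `{"operation": "clear"}` shape is an assumption made by analogy with `eos`, so check it against the WebSocket API reference before relying on it.

```python
import json
from typing import Iterable, Iterator


def text_messages(tokens: Iterable[str], end_of_stream: bool = True) -> Iterator[str]:
    """Yield one JSON text message per non-empty token (hypothetical helper).

    When end_of_stream is True, a final eos operation is appended, which in the
    example above means "synthesize what is buffered and close".
    """
    for token in tokens:
        if token:  # skip empty deltas, which some LLM streams emit
            yield json.dumps({"text": token})
    if end_of_stream:
        yield json.dumps({"operation": "eos"})


def clear_message() -> str:
    """Build a clear operation for interruptions.

    Assumption: the message mirrors the eos operation's shape.
    """
    return json.dumps({"operation": "clear"})
```

Inside the `async with` block, each message would be passed to `await ws.send(...)` as it is produced. On a long-lived, multi-utterance connection you would likely pass `end_of_stream=False` so the connection stays open between utterances.
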
The server emits four event types: chunk (base64 audio), timestamps (word-level timing), done (synthesis batch complete), and error. Text buffering and synthesis triggering are controlled by the segment parameter — see Segmentation & behavior settings.
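
The example loop handles three of the four event types. A sketch of a handler that also covers `error` might look like this. The `type`, `data`, and `word_timestamps` fields are taken from the example above; the `SynthesisState` and `RimeStreamError` names are hypothetical, `words` is assumed to be a list, and the `message` field on error events is an assumption (the handler falls back to the raw payload if it is absent).

```python
import base64
import json
from dataclasses import dataclass, field


@dataclass
class SynthesisState:
    """Accumulates what the server has sent so far (hypothetical container)."""
    audio: bytearray = field(default_factory=bytearray)
    words: list = field(default_factory=list)
    done: bool = False


class RimeStreamError(RuntimeError):
    """Raised when the server sends an error event (hypothetical name)."""


def handle_event(raw: str, state: SynthesisState) -> None:
    """Apply a single server event to the running state."""
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "chunk":
        # Audio arrives base64-encoded inside the JSON event.
        state.audio.extend(base64.b64decode(event["data"]))
    elif kind == "timestamps":
        # Assumption: "words" is a list, as the example's print call suggests.
        state.words.extend(event["word_timestamps"]["words"])
    elif kind == "done":
        state.done = True
    elif kind == "error":
        # Assumption: error events carry a "message" field.
        raise RimeStreamError(event.get("message", raw))
    # Unknown event types are ignored so newer server events do not break the client.
```

In a real player you would forward each decoded chunk immediately instead of accumulating it; the accumulation here only keeps the sketch easy to inspect.
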

See the WebSocket API overview for choosing between /ws3, /ws2, and /ws, the full event schema, context IDs, and interruption handling.
Endpoint references: Coda · Arcana · Mist v3 · Mist v2

Server-sent events (SSE)

For clients built around EventSource, Mist v2 supports server-sent events: the same POST /v1/rime-tts endpoint with Accept: text/event-stream streams audio as events over a standard HTTP response. SSE is only available for Mist v2 — for other models, use HTTP or WebSocket streaming.
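
Outside the browser there is no EventSource, so a client has to split the response into events itself. The sketch below parses the generic `text/event-stream` framing (`event:` and `data:` fields, blank line as the event delimiter). It says nothing about Rime specifically: the event names and how audio is encoded in the `data` field are not described in this section, so treat the payload handling as something to confirm in the Mist v2 reference. The function name is hypothetical.

```python
from typing import Iterable, Iterator, Tuple


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from decoded text/event-stream lines."""
    event, data = "message", []  # "message" is the default event name in the SSE format
    for line in lines:
        line = line.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event; dispatch only if it carried data.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            name, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # a single leading space after the colon is dropped
            if name == "event":
                event = value
            elif name == "data":
                data.append(value)
```

The input is expected to be text lines; with `requests`, `response.iter_lines()` yields bytes by default, so decode each line before passing it in.
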

Streaming for telephony and IVR

For phone-based voice agents and IVR systems, request audio that matches the telephony codec directly — Rime synthesizes G.711 μ-law (audio/PCMU) and 8kHz-sampled audio natively, so no transcoding step sits between synthesis and the caller:
  • Set Accept: audio/PCMU (HTTP) or audioFormat=mulaw (WebSocket) for μ-law output.
  • Set samplingRate: 8000 to match the telephony stream and shrink payloads.
Telephony and voice-agent platform guides: LiveKit · Vapi · SignalWire · Daily · VideoSDK
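
As a rough illustration of the two settings above, the sketch below builds an HTTP request for μ-law at 8 kHz and re-chunks the streamed bytes into 20 ms frames, the packet size many telephony media streams expect. G.711 uses one byte per sample, so 8000 Hz × 20 ms gives 160 bytes per frame. Assumptions: `samplingRate` is sent in the JSON body, the response is raw headerless μ-law, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, Tuple

SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
FRAME_BYTES = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 bytes: one byte per G.711 sample
ULAW_SILENCE = b"\xff"  # mu-law encoding of zero amplitude


def telephony_request(text: str, speaker: str, model_id: str) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Return (headers, json_body) for mu-law output at 8 kHz (add Authorization yourself)."""
    headers = {"Accept": "audio/PCMU", "Content-Type": "application/json"}
    body = {
        "text": text,
        "speaker": speaker,
        "modelId": model_id,
        "samplingRate": SAMPLE_RATE_HZ,  # assumption: passed in the JSON body
    }
    return headers, body


def frame_ulaw(chunks: Iterable[bytes], frame_bytes: int = FRAME_BYTES) -> Iterator[bytes]:
    """Re-chunk arbitrary-size network chunks into fixed-size mu-law frames."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        while len(buf) >= frame_bytes:
            yield bytes(buf[:frame_bytes])
            del buf[:frame_bytes]
    if buf:
        # Pad the final partial frame with silence so every frame has the same length.
        yield bytes(buf) + ULAW_SILENCE * (frame_bytes - len(buf))
```

With the HTTP example earlier on this page, `frame_ulaw(response.iter_content(chunk_size=4096))` would yield frames ready to hand to the telephony leg.
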

Latency

Streaming is the single biggest lever on perceived responsiveness: time-to-first-audio, not total synthesis time, is what a listener experiences. Rime’s cloud API delivers sub-200ms end-to-end latency as standard — with mistv3, typical time-to-first-byte is well below 100ms — and Coda achieves sub-100ms model latency on the GPU engine when self-hosted or on-prem.
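
To see the difference between time-to-first-audio and total synthesis time in your own environment, a small wrapper around any chunk iterator is enough. This is a sketch, not a benchmark harness: the function name and the returned keys are hypothetical, and the numbers it reports include network and client overhead, not only model latency.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def measure_stream(
    chunks: Iterable[bytes],
    sink: Callable[[bytes], object],
    start: float,
    clock: Callable[[], float] = time.perf_counter,
) -> Dict[str, Optional[float]]:
    """Forward chunks to sink and record when the first non-empty one arrived.

    start must be read from the same clock *before* the request is sent,
    otherwise connection and request time are missing from the measurement.
    """
    first: Optional[float] = None
    total_bytes = 0
    for chunk in chunks:
        if not chunk:
            continue  # ignore empty keep-alive chunks
        if first is None:
            first = clock() - start
        total_bytes += len(chunk)
        sink(chunk)
    return {
        "time_to_first_audio_s": first,
        "total_s": clock() - start,
        "bytes": total_bytes,
    }
```

With the HTTP example earlier on this page, you would read `start = time.perf_counter()` just before `requests.post(...)` and then call `measure_stream(response.iter_content(chunk_size=4096), sink, start)`, where `sink` is whatever writes to your player or file.
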

See Latency for measured benchmarks per model, what affects response time, and how to reduce it: regional endpoints, payload sizing, and text normalization.