
What is segmentation?

When you stream text to Rime’s WebSocket API token by token, Rime needs to know when to synthesize audio. Should it wait for a complete sentence? Synthesize every token immediately? Wait until you say so explicitly? The segment query parameter answers this question. It controls how Rime buffers the text you send and when it fires off synthesis to the model backend.
wss://users-ws.rime.ai/ws3?speaker=<voice>&modelId=<model>&segment=<setting>
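As a sketch, the connection URL can be assembled with Python's standard library (the speaker and model values here are illustrative; substitute your own):

```python
from urllib.parse import urlencode

# Illustrative voice/model values; substitute your own.
params = {"speaker": "astra", "modelId": "arcana", "segment": "never"}
url = f"wss://users-ws.rime.ai/ws3?{urlencode(params)}"
print(url)  # wss://users-ws.rime.ai/ws3?speaker=astra&modelId=arcana&segment=never
```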
There are three settings:
Setting               Default?           Best for
segment=never         No (recommended)   Full control — you decide when to synthesize
segment=bySentence    Yes                Sentence-structured text streamed token by token
segment=immediate     No                 Pre-segmented phrases or real-time pass-through
All three settings are available on /ws3 and /ws2. The segment parameter is not applicable to /ws (binary WebSocket).

segment=never — Full control

This is the recommended setting for production voice agents and conversational AI applications. It gives you precise control over synthesis timing and avoids the edge cases and heuristics associated with automatic segmentation.

How it works

Under segment=never, Rime never synthesizes audio automatically. It accumulates every token you send into a buffer and waits. Audio is only produced when your client explicitly sends a flush operation.
{ "operation": "flush" }
Once flushed, Rime synthesizes the entire accumulated buffer as a single utterance. If synthesis from a previous flush is still in progress when a new flush arrives, Rime will hold the newly accumulated text and synthesize it as soon as the current synthesis finishes.

What you are responsible for

  1. Sending well-formed, concatenable tokens. If someone concatenated every token you send, the result should be a properly spaced, punctuated utterance. For example:
    ["Hello, ", "how ", "can ", "I ", "help ", "you ", "today?"]
    → "Hello, how can I help you today?"  ✅
    
    Avoid sending tokens that, when joined, produce malformed text:
    ["Hello,", "how", "can"]
    → "Hello,howcan"  ❌  (missing spaces)
    
  2. Sending flush when you’re done with an utterance. This is your signal to Rime that the buffer contains a complete, speakable phrase.
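One way to catch malformed token streams before sending is a quick client-side check. This is a sketch, not part of Rime's API, and it assumes the space-delimited token convention shown above (each token carries its own trailing space); tokenizers that legitimately split mid-word would trigger false positives:

```python
def check_concatenation(tokens):
    """Join tokens and flag boundaries where text would run together.

    Assumes each token carries its own trailing space, as in the
    examples above. Mid-word splits would be flagged as problems.
    """
    joined = "".join(tokens)
    problems = []
    for a, b in zip(tokens, tokens[1:]):
        # A token that doesn't end in whitespace, followed by a token
        # starting with a letter/digit, suggests a missing separator.
        if a and b and not a[-1].isspace() and b[0].isalnum():
            problems.append((a, b))
    return joined, problems

joined, problems = check_concatenation(["Hello, ", "how ", "can ", "I ", "help?"])
# joined == "Hello, how can I help?", no boundaries flagged
bad_joined, bad = check_concatenation(["Hello,", "how", "can"])
# bad_joined == "Hello,howcan"; both boundaries flagged
```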

What Rime is responsible for

  • Synthesizing audio whenever a flush is received.
  • Queuing the buffer for synthesis if the previous utterance is still being produced.
  • Never synthesizing mid-stream without your explicit instruction.

Example

import asyncio
import json
import websockets
import base64

API_KEY = "your-api-key"
URL = "wss://users-ws.rime.ai/ws3?speaker=astra&modelId=arcana&segment=never"

async def run():
    async with websockets.connect(URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}) as ws:
        # Stream an LLM response token by token
        tokens = ["Sure, ", "I'd ", "be ", "happy ", "to ", "help ", "with ", "that."]
        for token in tokens:
            await ws.send(json.dumps({"text": token}))

        # Tell Rime to synthesize everything accumulated so far
        await ws.send(json.dumps({"operation": "flush"}))

        # End the session so the server closes the connection after synthesis
        await ws.send(json.dumps({"operation": "eos"}))

        # Receive and play audio chunks until the server closes the stream
        while True:
            try:
                msg = json.loads(await ws.recv())
            except websockets.exceptions.ConnectionClosedOK:
                break
            if msg["type"] == "chunk":
                audio = base64.b64decode(msg["data"])
                # stream audio to your player...
            elif msg["type"] == "timestamps":
                print("Timestamps:", msg["word_timestamps"])

asyncio.run(run())

Handling interruptions

If your user interrupts the assistant while audio is playing, send a clear operation to discard the buffer and stop queued synthesis:
{ "operation": "clear" }
Then begin streaming your new response tokens immediately.
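As a sketch, the interrupt sequence can be packaged as a helper that builds the messages to send in order. The helper name and structure are illustrative, not part of Rime's SDK; only the clear operation and text message shapes come from the API above:

```python
import json

def interruption_messages(new_tokens):
    """Build the ordered messages for a barge-in: first discard the
    server-side buffer, then stream the new response tokens."""
    msgs = [json.dumps({"operation": "clear"})]
    msgs.extend(json.dumps({"text": t}) for t in new_tokens)
    return msgs

msgs = interruption_messages(["Okay, ", "I'll ", "stop."])
# msgs[0] is the clear operation; the rest are the new text messages
```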

segment=bySentence — Default

This is the default behavior when segment is not specified. Rime buffers tokens and synthesizes audio each time it detects a sentence or phrase boundary in the accumulated text.

How it works

Rime watches the incoming token stream for sentence-ending punctuation: ., ?, !. When one is encountered and no audio is currently being synthesized, Rime synthesizes everything up to that boundary and sends the audio back.

What you are responsible for

  1. Separating sentences with spaces. Tokens sent without trailing spaces can cause words to run together after concatenation.
  2. Not splitting tokens at sentence-ending punctuation. Rime’s heuristics fire on received tokens. If a single token ends with sentence-ending punctuation in the middle of what should be a larger phrase (e.g., "2." in "2.5ml"), it may trigger an early synthesis.
    ❌  tokens = ["take ", "2.", "5ml ", "of ", "the ", "solution."]
    ✅  tokens = ["take ", "2.5ml ", "of ", "the ", "solution."]
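If you cannot control how the upstream tokenizer splits text, one mitigation is to regroup tokens client-side before sending, so that a chunk is only emitted once it no longer ends in bare sentence punctuation. A minimal sketch (this regrouping logic is a client-side assumption, not something Rime provides):

```python
def regroup(tokens):
    """Merge tokens so that no emitted chunk ends in bare sentence
    punctuation (e.g. the "2." of "2.5ml"), which could otherwise
    trigger early synthesis under segment=bySentence."""
    merged, buf = [], ""
    for tok in tokens:
        buf += tok
        # Safe to emit once the buffer doesn't end with '.', '?', '!'
        # (a trailing space means the punctuation was word-final).
        if buf and buf[-1] not in ".?!":
            merged.append(buf)
            buf = ""
    if buf:  # flush whatever remains at end of stream
        merged.append(buf)
    return merged

regroup(["take ", "2.", "5ml ", "of ", "the ", "solution."])
# → ["take ", "2.5ml ", "of ", "the ", "solution."]
```

The trade-off: a token ending exactly in sentence punctuation is held until the next token arrives (or the stream ends). Genuine sentence ends with a trailing space (e.g. "done. ") are emitted immediately.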
    

What Rime is responsible for

  • Accumulating tokens until a sentence boundary is detected.
  • Synthesizing the buffer at that boundary, only if no audio is currently being produced.
  • Using heuristics to determine whether a given punctuation mark constitutes a sentence end.

When to use this

segment=bySentence works well when you’re streaming text that is already well-structured — clean sentence boundaries, no numbers or abbreviations that could confuse the period heuristic. It requires less client-side coordination than segment=never but is less predictable in edge cases.
Note: segment=bySentence relies on heuristics. Text with decimal numbers, abbreviations (e.g. Dr., 2.5ml), or mid-sentence ellipses can cause early synthesis. For production voice agents, consider segment=never for more reliable control.

Example

import asyncio
import json
import websockets
import base64

API_KEY = "your-api-key"
# segment=bySentence is the default, but shown explicitly here for clarity
URL = "wss://users-ws.rime.ai/ws3?speaker=cove&modelId=mistv2&segment=bySentence"

async def run():
    async with websockets.connect(URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}) as ws:
        sentences = [
            "The weather today is sunny. ",
            "Temperatures will reach the mid-seventies. ",
            "Enjoy the sunshine!",
        ]
        for sentence in sentences:
            await ws.send(json.dumps({"text": sentence}))

        await ws.send(json.dumps({"operation": "eos"}))

        while True:
            try:
                msg = json.loads(await ws.recv())
            except websockets.exceptions.ConnectionClosedOK:
                break
            if msg["type"] == "chunk":
                audio = base64.b64decode(msg["data"])
                # stream to player...

asyncio.run(run())

segment=immediate — Synthesize on receipt

Under segment=immediate, Rime synthesizes audio as soon as text arrives in the buffer — provided no audio is currently being produced.

How it works

Each time Rime receives a text message and the synthesis pipeline is idle, it synthesizes whatever is in the buffer immediately. If synthesis is already in progress, incoming tokens continue to accumulate. Once the current synthesis finishes, Rime flushes the entire accumulated buffer as a single utterance.
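The accumulate-while-busy behavior can be pictured as a small state machine. The class below is a client-side mental model written as a sketch (all names are invented for illustration), not Rime's actual implementation:

```python
class ImmediateBuffer:
    """Models segment=immediate: synthesize on receipt when idle,
    accumulate while busy, then flush the whole buffer once idle."""

    def __init__(self):
        self.busy = False
        self.buffer = ""
        self.synthesized = []  # utterances handed to the synthesizer

    def receive(self, text):
        self.buffer += text
        if not self.busy:       # pipeline idle: synthesize immediately
            self._synthesize()

    def _synthesize(self):
        self.synthesized.append(self.buffer)
        self.buffer = ""
        self.busy = True

    def finish_synthesis(self):
        """Called when the current utterance's audio is done."""
        self.busy = False
        if self.buffer:         # tokens arrived while busy: flush them
            self._synthesize()

buf = ImmediateBuffer()
buf.receive("Your order has been confirmed.")  # idle, synthesized at once
buf.receive("It will ")                        # busy, accumulates
buf.receive("arrive soon.")
buf.finish_synthesis()                         # flushes the accumulated buffer
# buf.synthesized == ["Your order has been confirmed.", "It will arrive soon."]
```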

What you are responsible for

  1. Sending complete, speakable phrases. Because Rime may synthesize on the very first token it receives, each message — or the concatenation of messages received while synthesis is busy — should form something that sounds natural when spoken on its own.
  2. Ensuring concatenated tokens are properly spaced and formatted. When multiple tokens arrive during an active synthesis, they’ll be joined and synthesized together. The same concatenation rules apply as in segment=never.

What Rime is responsible for

  • Synthesizing immediately upon receiving text when the pipeline is idle.
  • Accumulating tokens while synthesis is active, then synthesizing the full buffer once idle again.

When to use this

segment=immediate is useful when your client is sending pre-segmented phrases that are already complete utterances — for example, if you’re controlling segmentation on your side and just want Rime to synthesize each phrase as fast as possible without any punctuation-based logic.

Example

import asyncio
import json
import websockets
import base64

API_KEY = "your-api-key"
URL = "wss://users-ws.rime.ai/ws3?speaker=astra&modelId=arcana&segment=immediate"

async def run():
    async with websockets.connect(URL, additional_headers={"Authorization": f"Bearer {API_KEY}"}) as ws:
        # Each message is already a complete phrase
        phrases = [
            "Your order has been confirmed.",
            "It will arrive within three to five business days.",
            "Is there anything else I can help you with?",
        ]
        for phrase in phrases:
            await ws.send(json.dumps({"text": phrase}))
            await asyncio.sleep(0.05)  # small delay to allow synthesis to start

        await ws.send(json.dumps({"operation": "eos"}))

        while True:
            try:
                msg = json.loads(await ws.recv())
            except websockets.exceptions.ConnectionClosedOK:
                break
            if msg["type"] == "chunk":
                audio = base64.b64decode(msg["data"])
                # stream to player...

asyncio.run(run())

Summary: which setting should I use?

Use segment=never if:
  • You’re building a voice agent or conversational AI application.
  • You’re streaming LLM output token by token and want full control over when synthesis fires.
  • You want deterministic, predictable behavior without relying on punctuation heuristics.
Use segment=bySentence if:
  • Your text is well-formed prose with clean sentence boundaries.
  • You want Rime to handle segmentation automatically without sending flush.
  • You’re prototyping and simplicity matters more than precision.
Use segment=immediate if:
  • You’re sending pre-segmented, complete phrases from your own logic.
  • You want the fastest possible synthesis start for each phrase.
  • You’re not streaming token by token — you’re sending complete utterances at once.