import asyncio
import json
import websockets
import base64

class RimeClient:
    def __init__(self, speaker, api_key):
        # All synthesis arguments are query parameters on the connection URL.
        self.url = f"wss://users-ws.rime.ai/ws3?speaker={speaker}&modelId=arcana&audioFormat=mp3"
        self.auth_headers = {
            "Authorization": f"Bearer {api_key}"
        }
        self.audio_data = b''

    async def send_messages(self, websocket, messages):
        # Each message is a JSON object: {"text": ...} or {"operation": ...}.
        for message in messages:
            await websocket.send(json.dumps(message))

    async def handle_audio(self, websocket):
        # Receive JSON events until the server closes the connection
        # (which it does after processing an "eos" operation).
        while True:
            try:
                audio = await websocket.recv()
            except websockets.exceptions.ConnectionClosedOK:
                break
            message = json.loads(audio)

            if message['type'] == 'chunk':
                # Audio arrives base64-encoded, in the format requested
                # via the audioFormat query parameter.
                self.audio_data += base64.b64decode(message['data'])
            elif message['type'] == 'timestamps':
                print("Rime model pronounced the words...\n")
                for w, t in zip(message['word_timestamps']['words'], message['word_timestamps']['start']):
                    print(f"'{w}' at time {t}")

    async def run(self, messages):
        # additional_headers requires a recent websockets release (older
        # versions used extra_headers instead).
        async with websockets.connect(self.url, additional_headers=self.auth_headers) as websocket:
            # Send and receive concurrently so audio streams back while
            # text is still being sent.
            await asyncio.gather(
                self.send_messages(websocket, messages),
                self.handle_audio(websocket),
            )

    def save_audio(self, file_path):
        with open(file_path, 'wb') as f:
            f.write(self.audio_data)
        print(f"\nAudio saved to {file_path}")


# The "clear" operation discards the buffered "This is a test " before it
# is synthesized; "eos" flushes what remains and closes the connection.
messages = [
    {"text": "This "},
    {"text": "is "},
    {"text": "a "},
    {"text": "test "},
    {"operation": "clear"},
    {"text": "This "},
    {"text": "is "},
    {"text": "an "},
    {"text": "incomplete "},
    {"text": "sentence "},
    {"operation": "eos"},
]

client = RimeClient("astra", api_key="xxx")
asyncio.run(client.run(messages))

client.save_audio("output.mp3")

Overview

In addition to a plaintext websocket implementation, Rime also has an implementation that sends and receives events as JSON objects. Like the other implementation, all synthesis arguments are provided as query parameters when establishing the connection. The websocket API buffers input text until it encounters one of the following punctuation characters: ., ?, !. This matters most for the initial messages sent to the API, since synthesis won't begin until there are sufficient tokens to generate audio with natural prosody. After the first synthesis of any given utterance, enough time has typically elapsed that subsequent audio contains multiple clauses, and the buffering becomes largely invisible.
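As a concrete sketch of that buffering (reusing the json import and a connected websocket from the example above): the first two fragments produce no audio, and the period completes the sentence and triggers synthesis.

import json

async def send_fragments(websocket):
    # With the default bySentence segmentation, the first two fragments
    # sit in the buffer; the "." completes the sentence and synthesis of
    # "Hello there. " begins as a single utterance.
    await websocket.send(json.dumps({"text": "Hello "}))
    await websocket.send(json.dumps({"text": "there"}))
    await websocket.send(json.dumps({"text": ". "}))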

Messages

Send

Text

This is the most common message; it contains the text to synthesize.
Schema:
type TextMessage = {
  text: string,
  contextId?: string,
}
Examples:
{
    "text": "this is the minimum text message."
}

{
    "text": "this is a text message with a context id.",
    "contextId": "159495B1-5C81-4C73-A51A-9CE10A08239E"
}
Context IDs can be provided, and will be attached to subsequent messages that the server sends back to the client. Rime will not maintain multiple simultaneous context IDs; events carry the most recent context ID at the time the audio was requested. In the examples above, even if both messages are received by the server before it sends any audio, the audio response for the first sentence will be tagged with contextId: null, and the audio for the second will be tagged with its UUID.
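As a sketch, a receive loop can group audio by the contextId on each event, so a superseded utterance's audio can be discarded wholesale (a hypothetical helper, not part of the example client above):

import base64
import json

# Hypothetical helper: accumulate audio per context ID. The key is None
# for audio requested before any contextId was sent.
async def collect_audio_by_context(websocket):
    audio_by_context = {}
    # Iteration ends when the server closes the connection.
    async for raw in websocket:
        event = json.loads(raw)
        if event["type"] == "chunk":
            key = event["contextId"]
            audio_by_context[key] = audio_by_context.get(key, b"") + base64.b64decode(event["data"])
    return audio_by_context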

Clear

Your client can clear out the accumulated buffer, which is useful in the case of interruptions.
{ "operation": "clear" }

Flush

This forces the contents of the buffer, if any, to be synthesized and the generated audio to be sent over.
{ "operation": "flush" }

EOS

Send eos when your client wants to generate audio for whatever remains in the buffer and then have the connection immediately closed.
{ "operation" : "eos" }

Receive

Chunk

The most common event will be the audio chunk.
type Base64String = string

type AudioChunkEvent = {
  type: "chunk",
  data: Base64String,
  contextId: string | null,
}
The audio will be a base64-encoded chunk of audio bytes in the audio format specified when the connection was established. If you provided a context ID when sending the relevant text, it will be included here.
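The example client requests mp3, so the decoded bytes can be written to a file directly. If you request audioFormat=pcm instead, the decoded bytes are raw samples; here is a sketch of wrapping them in a WAV container, assuming 16-bit mono samples at the default 24000 Hz sampling rate (verify these details for your configuration).

import wave

# Assumes audioFormat=pcm with 16-bit mono samples at the default
# 24000 Hz sampling rate; pcm_bytes is the accumulated, base64-decoded
# chunk data.
def save_wav(pcm_bytes, file_path):
    with wave.open(file_path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)   # 16-bit samples
        f.setframerate(24000)
        f.writeframes(pcm_bytes)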

Timestamps

Word timestamps are provided so that, in the event of an interruption, you can tell precisely what has already been said.
type TimestampsEvent = {
  type: "timestamps",
  word_timestamps: {
    words: string[],
    start: number[],
    end: number[],
  },
}
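As a sketch of how these parallel arrays might be used: given the playback offset at which the user interrupted (in the same time base as start and end), find the words that had finished playing. This helper is hypothetical, not part of the API.

# Hypothetical helper: word_timestamps is the object from a timestamps
# event; interrupt_time is the playback offset at which the user broke
# in, in the same time base as the start/end arrays.
def words_spoken_before(word_timestamps, interrupt_time):
    return [
        word
        for word, end in zip(word_timestamps["words"], word_timestamps["end"])
        if end <= interrupt_time
    ]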

Error

In the event of a malformed or unexpected input, the server will immediately respond with an error message. The server will not close the connection, and will still accept subsequent well-formed messages. It’s up to the client to decide if it wants to close upon receiving an error.
type ErrorEvent = {
  type: "error",
  message: string,
}
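Since the connection stays open after an error, a receive loop can log the message and keep going (a sketch):

import json

async def handle_events(websocket):
    # Iteration ends when the server closes the connection; error events
    # do not close it, so we log them and keep reading.
    async for raw in websocket:
        event = json.loads(raw)
        if event["type"] == "error":
            print(f"Rime error: {event['message']}")
            continue
        # ... handle "chunk" and "timestamps" events as above ...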

Variable Parameters

speaker (string, required)
Must be one of the voices listed in our documentation for arcana.

text (string, required)
The text you'd like spoken. The character limit per request is 500 via the API and 1,000 in the dashboard UI.

modelId (string)
Must be set to arcana; otherwise the websocket server will default to mistv2 for speech synthesis.

audioFormat (string)
One of mp3, mulaw, or pcm.

lang (string, default: "eng")
If provided, the language must match the language spoken by the provided speaker. This can be checked in our voices documentation.

repetition_penalty (float, default: 1.5)
Penalizes new tokens based on whether they appear in the prompt and the generated text so far; values > 1 encourage the model to use new tokens, while values < 1 encourage it to repeat tokens. Typical range is 1 to 2. We do not recommend changing this from the default value.

temperature (float, default: 0.5)
Controls the randomness of sampling; lower values make the model more deterministic, higher values more random, and zero means greedy sampling. Typical range is 0 to 1. We do not recommend changing this from the default value.

top_p (float, default: 1)
Controls the cumulative probability of the top tokens to consider. Must be in (0, 1]; set to 1 to consider all tokens. We do not recommend changing this from the default value.

samplingRate (int, default: 24000)
The sampling rate in Hz.
  • On-cloud: accepted values are 8000, 16000, 22050, 24000, 44100, 48000, and 96000. Anything above 24000 is upsampling.
  • On-prem: any value is accepted.

segment (string, default: "bySentence")
Controls how text is segmented for synthesis. Available options:
  • "immediate" - synthesizes text immediately without waiting for complete sentences
  • "never" - never segments the text; waits for an explicit flush or eos
  • "bySentence" (default) - waits for complete sentences before synthesis
Note: for backward compatibility, setting immediate=true in the query params is equivalent to segment=immediate. If a null value is provided, it defaults to "bySentence".
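All of these are provided as query parameters on the connection URL. A sketch of building the URL with urllib (the parameter values here are illustrative):

from urllib.parse import urlencode

# Illustrative parameter values; see the descriptions above.
params = {
    "speaker": "astra",
    "modelId": "arcana",
    "audioFormat": "pcm",
    "samplingRate": 16000,
    "segment": "bySentence",
}
url = f"wss://users-ws.rime.ai/ws3?{urlencode(params)}"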