Overview

In addition to the plaintext websocket implementation, Rime also offers an implementation that sends and receives events as JSON objects. As with the plaintext implementation, all synthesis arguments are provided as query parameters when establishing the connection.

The websocket API buffers inputs up to one of the following punctuation characters: ., ?, !. This is most pertinent for the initial messages sent to the API, as synthesis won’t begin until there are sufficient tokens to generate audio with natural prosody. After the first synthesis of any given utterance, enough time has typically elapsed that subsequent audio contains multiple clauses, and the buffering becomes largely invisible.
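As a rough sketch, establishing a connection and sending a first message might look like the following in TypeScript (browser or recent Node). The endpoint URL and speaker value are placeholders and authentication is omitted; substitute the values from your own setup.

// Sketch only: endpoint, voice, and auth handling are assumptions.
const params = new URLSearchParams({
  speaker: "example_speaker", // placeholder; must be a documented voice
  audioFormat: "mp3",
  samplingRate: "22050",
});

const ws = new WebSocket(`wss://users.rime.ai/ws2?${params}`); // hypothetical URL

ws.onopen = () => {
  // Ends in a period, so it passes the punctuation buffer described above.
  ws.send(JSON.stringify({ text: "Hello there. How can I help you today?" }));
};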

Messages

Send

Text

This is the most common message, which contains text for synthesis.

schema:

type TextMessage = {
  text: string,
  contextId?: string,
}

examples:

{
    "text": "this is the minimum text message."
}

{
    "text": "this is a text message with a context id.",
    "contextId": "159495B1-5C81-4C73-A51A-9CE10A08239E"
}

A context ID can be provided, and it will be attached to subsequent messages that the server sends back to the client. Rime does not maintain multiple simultaneous context IDs; events carry the most recent context ID at the time the audio was requested. In the examples above, even if both messages reach the server before it sends any audio, the audio for the first sentence will be tagged with contextId: null, and the audio for the second will be tagged with its UUID.
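For instance, sending the two example messages above over an open connection could look like this (the ws handle is the connection from the earlier sketch; the UUID is illustrative):

ws.send(JSON.stringify({ text: "this is the minimum text message." }));
ws.send(
  JSON.stringify({
    text: "this is a text message with a context id.",
    contextId: "159495B1-5C81-4C73-A51A-9CE10A08239E",
  })
);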

Clear

Your client can clear out the accumulated buffer, which is useful in the case of interruptions.

{ "operation": "clear" }

EOS

At times, your client will want to generate audio for whatever remains in the buffer and then have the connection closed immediately.

{ "operation" : "eos" }

Receive

Chunk

The most common event will be the audio chunk.

type Base64String = string

type AudioChunkEvent = {
  type: "chunk",
  data: Base64String,
  contextId: string | null,
}

The audio will be a base64-encoded chunk of audio bytes in the audio format specified when the connection was established. If you provided a context ID when sending the relevant text, it’ll be included here.
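In Node, decoding a chunk might look like the sketch below; what you do with the raw bytes depends on your playback or streaming pipeline.

// Decode the base64 payload into raw audio bytes (in the audioFormat you
// requested when connecting). Buffer is Node-specific.
function handleChunk(event: AudioChunkEvent): Buffer {
  const audioBytes = Buffer.from(event.data, "base64");
  console.log(`${audioBytes.length} bytes, contextId=${event.contextId}`);
  return audioBytes;
}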

Timestamps

Word timestamps are provided so that, in the event of an interruption, you know precisely what has already been said.

type TimestampsEvent = {
  type: "timestamps",
  wordTimestamps: {
    words: string[],
    start: number[],
    end: number[],
  },
}
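As a sketch, assuming start and end are offsets in seconds from the beginning of the utterance (an assumption, as the units aren’t specified here), you could estimate which words were heard before an interruption:

// Words whose end time falls before the interruption point.
function spokenBefore(event: TimestampsEvent, cutoffSeconds: number): string[] {
  const { words, end } = event.wordTimestamps;
  return words.filter((_, i) => end[i] <= cutoffSeconds);
}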

Error

In the event of a malformed or unexpected input, the server will immediately respond with an error message. The server will not close the connection, and will still accept subsequent well-formed messages. It’s up to the client to decide if it wants to close upon receiving an error.

type ErrorEvent = {
  type: "error",
  message: string,
}
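Putting the three event types together, a single message handler can dispatch on the type field. This is a sketch: ws is the connection from the earlier example, handleChunk is the decoding sketch above, and the other handler bodies are placeholders for your own logic.

type RimeEvent = AudioChunkEvent | TimestampsEvent | ErrorEvent;

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data.toString()) as RimeEvent;
  switch (msg.type) {
    case "chunk":
      handleChunk(msg);
      break;
    case "timestamps":
      console.log("word timings:", msg.wordTimestamps.words);
      break;
    case "error":
      // The server keeps the connection open; close it here if you prefer.
      console.error("rime error:", msg.message);
      break;
  }
};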

Variable Parameters

speaker
string
required

Must be one of the voices listed in our documentation.

text
string
required

The text you’d like spoken

modelId
string

Choose mist for hyper-realistic conversational voices or v1 for Rime’s first-gen model (default: v1)

audioFormat
string

One of mp3, mulaw, or pcm

pauseBetweenBrackets
bool
default:
"false"

When set to true, adds pauses between words enclosed in angle brackets. The number inside the brackets specifies the pause duration in milliseconds.
Example: “Hi. <200> I’d love to have a conversation with you.” adds a 200ms pause between the first and second sentences.

phonemizeBetweenBrackets
bool
default:
"false"

When set to true, you can specify the phonemes for a word enclosed in curly brackets.
Example: “{h’El.o} World” will pronounce “Hello” as expected. More details on this feature are incoming!

inlineSpeedAlpha
string

Comma-separated list of speed values applied to words in square brackets. Values < 1.0 speed up speech, > 1.0 slow it down. Example: “This sentence is [really] [fast]” with inlineSpeedAlpha “0.5, 3” will make “really” fast and “fast” slow.

samplingRate
int

The value, if provided, must be between 4000 and 44100. Default: 22050

speedAlpha
float
default:
"1.0"

Adjusts the speed of speech. Lower than 1.0 is faster than default. Higher than 1.0 is slower than default.

reduceLatency
bool
default:
"false"

Reduces the latency of response, at the cost of some possible mispronunciation of digits and abbreviations.
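To tie the parameters together, here is a sketch that enables the bracket features via query parameters and then exercises them in a text message. The endpoint and voice are placeholders.

const params = new URLSearchParams({
  speaker: "example_speaker",   // placeholder; use a documented voice
  audioFormat: "pcm",
  pauseBetweenBrackets: "true",
  inlineSpeedAlpha: "0.5, 3",
});

const ws = new WebSocket(`wss://users.rime.ai/ws2?${params}`); // hypothetical URL

ws.onopen = () => {
  // A 200ms pause after "Hi.", then the listed speed values applied to the
  // bracketed words, in order.
  ws.send(JSON.stringify({ text: "Hi. <200> This sentence is [really] [fast]." }));
};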