Websockets
Mist v2 plain-text WebSocket (/ws): send text, receive raw audio bytes.
Overview
Rime’s websocket implementation accepts bare text and responds with audio bytes in the selected format. All synthesis arguments are provided as query parameters when establishing the connection. The websocket API buffers input until it receives one of the following punctuation characters: ".", ",", "?", or "!". This matters most for the first messages sent to the API, because synthesis won’t begin until there are enough tokens to generate audio with natural prosody. After the first synthesis of a given utterance, enough time has typically elapsed that subsequent audio contains multiple clauses, and the buffering becomes largely invisible.
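As a rough illustration of the buffering described above, the following Python sketch models a buffer that holds text until one of the four punctuation characters arrives. It is only a client-side mental model: the function name split_on_punctuation is hypothetical, and the server's actual segmentation logic may differ.

```python
PUNCTUATION = {".", ",", "?", "!"}

def split_on_punctuation(chunks):
    """Return (ready_segments, leftover_buffer) for a stream of text chunks.

    A segment becomes "ready" as soon as a punctuation character is seen;
    anything after the last punctuation character stays in the buffer.
    """
    buffer = ""
    ready = []
    for chunk in chunks:
        for ch in chunk:
            buffer += ch
            if ch in PUNCTUATION:
                ready.append(buffer.strip())
                buffer = ""
    return ready, buffer

ready, leftover = split_on_punctuation(["Hello", " world", ". How are", " you"])
```

Here the first segment is released only once the period arrives, while " How are you" stays buffered because no terminating punctuation has been sent yet.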
Messages
Send
The messages your client sends to the websocket API are bare (non-serialized) text.
Receive
The messages your client receives are raw audio bytes in the audio format specified at connection time.
Commands
At times you will want to interact with the API and manipulate the stored buffer. These are the currently supported commands.
<CLEAR>
This clears the current buffer. Used in the event of interruptions.
<FLUSH>
This forces whatever buffer exists, if any, to be synthesized, and the generated audio to be sent over.
<EOS>
This forces whatever buffer exists, if any, to be synthesized, and for the server to close the connection after sending the generated audio.
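To make the effect of the three commands concrete, here is a toy Python model of the text buffer. It is a sketch under assumptions: the class BufferModel is hypothetical, it ignores punctuation-triggered synthesis for brevity, and it only mirrors the behavior described above rather than the server's real implementation.

```python
class BufferModel:
    """Toy model of the server-side text buffer and its three commands."""

    def __init__(self):
        self.buffer = ""
        self.synthesized = []  # text that would have been turned into audio
        self.closed = False

    def handle(self, message):
        if self.closed:
            raise RuntimeError("connection already closed")
        if message == "<CLEAR>":
            self.buffer = ""            # drop pending text, e.g. on interruption
        elif message in ("<FLUSH>", "<EOS>"):
            if self.buffer:             # synthesize only if something is buffered
                self.synthesized.append(self.buffer)
                self.buffer = ""
            if message == "<EOS>":
                self.closed = True      # connection closes after the final audio
        else:
            self.buffer += message      # ordinary text is buffered

model = BufferModel()
for msg in ["Hello there", "<CLEAR>", "Sorry, go ahead", "<FLUSH>", "Goodbye", "<EOS>"]:
    model.handle(msg)
```

In this run, "Hello there" is discarded by <CLEAR>, "Sorry, go ahead" is synthesized by <FLUSH>, and "Goodbye" is synthesized by <EOS> before the model marks the connection closed.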
Variable Parameters
Must be one of the voices listed in our documentation.
The text you’d like spoken. Character limit per request is 500 via the API and 1,000 in the dashboard UI.
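One way to stay under the 500-character API limit is to split long text on the client before sending it. The sketch below is an illustration, not part of Rime's API: the helper chunk_text is hypothetical, it prefers to break at sentence-ending punctuation, and it hard-splits any single sentence that exceeds the limit.

```python
import re

MAX_CHARS = 500  # per-request API limit stated above

def chunk_text(text, limit=MAX_CHARS):
    """Split text into pieces of at most `limit` characters, preferring sentence ends."""
    # Each match is a sentence with its trailing punctuation and whitespace,
    # or a final fragment that has no terminating punctuation.
    sentences = re.findall(r"[^.?!]*[.?!]+\s*|[^.?!]+$", text)
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any single sentence longer than the limit.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        if len(current) + len(sentence) > limit:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```

Joining the returned chunks reproduces the original text, so no characters are lost by the split.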
Set to mistv2.
One of mp3, mulaw, or pcm.
If provided, the language must match the language spoken by the provided speaker. This can be checked in our voices documentation.
When set to true, adds pauses between words enclosed in angle brackets. The number inside the brackets specifies the pause duration in milliseconds.
Example: “Hi. <200> I’d love to have a conversation with you.” adds a 200ms pause between the first and second sentences.
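A small helper can insert the pause markers for you. This is a minimal sketch; the function name join_with_pauses is hypothetical, and the markers only take effect when the pause option described above is set to true on the connection.

```python
def join_with_pauses(sentences, pause_ms):
    """Join sentences with a <N> marker, where N is the pause in milliseconds."""
    if pause_ms < 0:
        raise ValueError("pause duration must be non-negative")
    marker = f"<{int(pause_ms)}>"
    return f" {marker} ".join(s.strip() for s in sentences)

text = join_with_pauses(["Hi.", "I'd love to have a conversation with you."], 200)
```

With these inputs the result matches the example above: a 200 ms pause marker between the two sentences.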
When set to true, you can specify the phonemes for a word enclosed in curly brackets.
Example: “{h’El.o} World” will pronounce “Hello” as expected. Learn more about custom pronunciation.
Comma-separated list of speed values applied to words in square brackets. Values < 1.0 speed up speech, > 1.0 slow it down.
Example: for “This is [slow] and [fast]”, use “3, 0.5” to make “slow” slower and “fast” faster.
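Because the bracketed words and the speed list must stay in the same order, it can help to build both from a single structure. The following sketch assumes nothing about the API beyond the format shown above; the helper mark_inline_speeds is hypothetical.

```python
def mark_inline_speeds(words):
    """Build bracketed text and the matching comma-separated speed list.

    `words` is a list of (text, speed) pairs; speed is None for text
    that should be spoken at the normal rate.
    """
    parts, speeds = [], []
    for text, speed in words:
        if speed is None:
            parts.append(text)
        else:
            parts.append(f"[{text}]")   # bracketed span gets its own speed value
            speeds.append(str(speed))
    return " ".join(parts), ", ".join(speeds)

text, speed_list = mark_inline_speeds(
    [("This is", None), ("slow", 3), ("and", None), ("fast", 0.5)]
)
```

The speed list would then be passed as the corresponding query parameter value (URL-encoded), with one entry per bracketed span.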
The value, if provided, must be between 4000 and 44100. Default: 22050
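A client can check the sampling rate before connecting rather than waiting for the server to reject it. This sketch assumes the stated range is inclusive at both ends, which the text above does not say explicitly; the function name resolve_sampling_rate is hypothetical.

```python
DEFAULT_SAMPLING_RATE = 22050
MIN_SAMPLING_RATE = 4000
MAX_SAMPLING_RATE = 44100

def resolve_sampling_rate(value=None):
    """Return the sampling rate to request, applying the documented default."""
    if value is None:
        return DEFAULT_SAMPLING_RATE
    # Assumption: both bounds are inclusive.
    if not MIN_SAMPLING_RATE <= value <= MAX_SAMPLING_RATE:
        raise ValueError(
            f"sampling rate must be between {MIN_SAMPLING_RATE} and {MAX_SAMPLING_RATE}"
        )
    return value
```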
Adjusts the speed of speech. Lower than 1.0 is faster and higher than 1.0 is slower.
Note: this is the legacy Mist v2 convention. Newer models (Coda, Arcana, Mist v3) invert it: for those, higher than 1.0 is faster.
mist/mistv2 only. Skips text normalization of the input text prior to synthesizing audio. This will reduce latency at the cost of some possible mispronunciation of digits and abbreviations.
Controls how text is segmented for synthesis. Available options:
- “immediate” - Synthesizes text immediately without waiting for complete sentences
- “never” - Never segments the text, waits for explicit flush or EOS
- “bySentence” (default) - Waits for complete sentences before synthesis
immediate=true in query params is equivalent to segment=immediate. If a null value is provided, it will default to “bySentence”.
mist/mistv2 only. If set to true, Rime will save any out-of-vocabulary (OOV) words encountered in the text for the user or team to review on the Speech QA dashboard. Note: It may take up to 15 minutes for OOV words to appear on your dashboard.
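The segment options and the immediate shorthand described above can be resolved on the client before the connection URL is built. The sketch below is illustrative only: resolve_segment and build_ws_url are hypothetical helpers, the host wss://example.invalid/ws is a placeholder rather than Rime's real endpoint, and rejecting a conflicting segment/immediate pair is this sketch's own choice, since the precedence is not documented here.

```python
from urllib.parse import urlencode

SEGMENT_MODES = {"immediate", "never", "bySentence"}

def resolve_segment(segment=None, immediate=False):
    """Return the effective segment mode for the given inputs."""
    if immediate:
        # immediate=true is equivalent to segment=immediate.
        if segment not in (None, "immediate"):
            raise ValueError("immediate=true conflicts with segment=" + segment)
        return "immediate"
    if segment is None:
        return "bySentence"  # documented default for a null value
    if segment not in SEGMENT_MODES:
        raise ValueError(f"unknown segment mode: {segment}")
    return segment

def build_ws_url(base_url, **params):
    """Append non-null parameters to the base URL as a query string."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{base_url}?{query}"

# Placeholder host; substitute the real endpoint and the remaining parameters.
url = build_ws_url("wss://example.invalid/ws", segment=resolve_segment())
```

Any other synthesis arguments would be passed as additional keyword arguments, using the parameter names from the reference above.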
