Overview
Rime’s websocket implementation accepts bare text, and responds with audio bytes of the selected format. All synthesis arguments are provided as query parameters when establishing the connection.
Messages
Send
The messages your client will send to the websocket API will be bare (non-serialized) text.
Receive
The messages your client will receive will be raw audio bytes in the audio format specified at connection time.
Commands
There will be times when you want to interact with the API and manipulate the stored buffer. These are the currently supported commands.
<CLEAR>
This clears the current buffer. Used in the event of interruptions.
<FLUSH>
This forces whatever buffer exists, if any, to be synthesized, and the generated audio to be sent over.
<EOS>
This forces whatever buffer exists, if any, to be synthesized, and for the server to close the connection after sending the generated audio.
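The three commands can be wrapped in a couple of small helpers. This is a minimal sketch, assuming each command is sent as an ordinary text message containing the literal token; the names `speak` and `interrupt` are hypothetical, and `send` stands in for whatever send function your websocket client provides.

```python
from typing import Callable, Iterable

# Command tokens as listed above.
CLEAR = "<CLEAR>"
FLUSH = "<FLUSH>"
EOS = "<EOS>"


def speak(send: Callable[[str], None], chunks: Iterable[str], close: bool = False) -> None:
    """Send text chunks as bare text, then force synthesis of the buffer.

    With close=True the final command is <EOS>, which also asks the server
    to close the connection after sending the audio; otherwise <FLUSH>.
    """
    for chunk in chunks:
        send(chunk)
    send(EOS if close else FLUSH)


def interrupt(send: Callable[[str], None]) -> None:
    """Drop whatever text is still buffered, e.g. when the user interrupts."""
    send(CLEAR)


# Usage with a list standing in for a real connection.
sent = []
speak(sent.append, ["Hello, ", "world."])
interrupt(sent.append)
speak(sent.append, ["Goodbye."], close=True)
```

Passing `send` as a callable keeps the helpers independent of any particular websocket library.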
Variable Parameters
Must be one of the voices listed in our documentation for arcana.
The text you’d like spoken. Character limit per request is 500 via the API and 1,000 in the dashboard UI.
This value must be set to arcana, else the websockets server will default to mistv2 for speech synthesis.
One of mp3, mulaw, or pcm.
If provided, the language must match the language spoken by the provided speaker. This can be checked in our voices documentation.
The repetition penalty. We do not recommend changing this from the default value. Typical range is 1 to 2. Float that penalizes new tokens based on whether they appear in the prompt and the generated text so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to repeat tokens.
The temperature. We do not recommend changing this from the default value. Typical range is 0 to 1. Float that controls the randomness of the sampling. Lower values make the model more deterministic, while higher values make the model more random. Zero means greedy sampling.
The top p. We do not recommend changing this from the default value. Typical range is 0 to 1. Float that controls the cumulative probability of the top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens.
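To make the "cumulative probability of the top tokens" wording concrete, here is a toy illustration of top-p (nucleus) filtering in general. It is not Rime's implementation; the function name `top_p_filter` and the three-token distribution are made up for the example.

```python
def top_p_filter(probs: dict, top_p: float) -> dict:
    """Keep the most likely tokens until their cumulative probability
    reaches top_p, then renormalize the survivors."""
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    kept = {}
    cumulative = 0.0
    # Walk tokens from most to least likely.
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}


toy = {"a": 0.5, "b": 0.3, "c": 0.2}
narrow = top_p_filter(toy, 0.7)   # keeps "a" and "b"
full = top_p_filter(toy, 1.0)     # keeps every token
```

Lower values of top p shrink the candidate set, which is why the output becomes less varied.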
The sampling rate (Hz).
- On-cloud: Accepted values: 8000, 16000, 22050, 24000, 44100, 48000, 96000. Values above 24000 are upsampled.
- On-prem: Any value is accepted.
The value, if provided, must be between 4000 and 44100. Default: 22050
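A client-side check can catch an unsupported rate before connecting. The sketch below covers only the on-cloud list of accepted values; the function name `check_on_cloud_rate` is hypothetical, and it does not model the on-prem case or the 4000 to 44100 range.

```python
# On-cloud accepted sampling rates (Hz), as listed above.
ON_CLOUD_RATES = (8000, 16000, 22050, 24000, 44100, 48000, 96000)
# Rates above this value are upsampled.
UPSAMPLE_THRESHOLD_HZ = 24000


def check_on_cloud_rate(rate_hz: int) -> bool:
    """Return True if the rate would be upsampled, False otherwise.

    Raises ValueError if the rate is not in the on-cloud list.
    """
    if rate_hz not in ON_CLOUD_RATES:
        raise ValueError(f"{rate_hz} Hz is not an accepted on-cloud sampling rate")
    return rate_hz > UPSAMPLE_THRESHOLD_HZ


default_is_upsampled = check_on_cloud_rate(22050)
high_is_upsampled = check_on_cloud_rate(48000)
```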
Controls how text is segmented for synthesis. Available options:
- “immediate” - Synthesizes text immediately without waiting for complete sentences
- “never” - Never segments the text, waits for explicit flush or EOS
- “bySentence” (default) - Waits for complete sentences before synthesis
immediate=true in query params is equivalent to segment=immediate. If a null value is provided, it will default to “bySentence”.
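The segmentation rules above can be mirrored on the client when building the connection URL. This is a sketch under stated assumptions: the base URL is a placeholder, `resolve_segment` and `connection_url` are hypothetical names, and because the behavior when `immediate=true` is combined with a different `segment` value is not stated, the sketch simply rejects that combination.

```python
from typing import Optional
from urllib.parse import urlencode

SEGMENT_MODES = ("immediate", "never", "bySentence")


def resolve_segment(segment: Optional[str] = None, immediate: bool = False) -> str:
    """Collapse the `segment` and `immediate` options into one mode."""
    if immediate:
        # immediate=true is equivalent to segment=immediate.
        if segment not in (None, "immediate"):
            raise ValueError("immediate=true conflicts with segment=" + segment)
        return "immediate"
    if segment is None:
        # A null value falls back to the default.
        return "bySentence"
    if segment not in SEGMENT_MODES:
        raise ValueError("unknown segment mode: " + segment)
    return segment


def connection_url(base_url: str, segment: Optional[str] = None, immediate: bool = False) -> str:
    """Append the resolved segment mode as a query parameter."""
    query = urlencode({"segment": resolve_segment(segment, immediate)})
    return f"{base_url}?{query}"


# Placeholder endpoint, not the real service URL.
url = connection_url("wss://example.invalid/ws", immediate=True)
```

The remaining synthesis arguments would be added to the same query dictionary before encoding.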