Documentation Index
Fetch the complete documentation index at: https://docs.rime.ai/llms.txt
Use this file to discover all available pages before exploring further.
tl;dr: Rime is really fast.
Sub-200ms end-to-end latency is standard via the cloud API. Coda, Rime’s flagship model, achieves sub-100ms model latency on the GPU engine when self-hosted or on-prem. Via the cloud API, the total time you measure also includes network round-trip — typically 25–50ms from most of the continental US when you pick the closest regional endpoint. Contact us to optimize further.Real-time performance benchmarks
These numbers were measured on a single Lambda H100 SXM machine — H100 SXM5 GPU, 26 vCPUs, 221 GiB RAM, Ubuntu 24.04, NVIDIA driver
595.58.03.| Metric | Coda | Arcana v3 | Mist v3 |
|---|---|---|---|
| TTFA (P50) @ 1 Concurrency | 96 ms | 118 ms | 37 ms |
| TTFA (P90) @ 1 Concurrency | 98 ms | 119 ms | 56 ms |
| TTFA (P50) @ 12 Concurrency | 150 ms | 168 ms | 37 ms |
| TTFA (P90) @ 12 Concurrency | 181 ms | 182 ms | 56 ms |
| RTF (P99) | 0.33 | 0.29 | 0.004 |
TTS and latency
The response time from any text-to-speech (TTS) API depends on a variety of factors, including network latency, the length of the text input, any text preprocessing needed prior to model inference, and payload size. Rime’s TTS API has been designed from the ground up to be the fastest to respond. The following sections will have information on the contributing factors to response latency and recommendations for reducing this latency.Factors that affect API response time
- Inference time: In general, server processing time will include the amount of time that the server takes to process the API request; for deep-learning TTS models, this will include model inference time but also text preprocessing functions and audio postprocessing.
- Payload size: The size of the response data sent between the server and the client can impact API response time — large payloads take longer to transmit across the network.
- Network latency: Network latency affects response time due to the simple fact that distance from the server will increase the amount of time it takes for the payload to get to the client.
Streaming
Assuming the client is able to stream audio, whether on-device or through a telephony provider or system, streaming will always be faster than non-streaming requests to our API. The reason for this is simple: a great deal of end-to-end response time comes down to the client. If we reduce the size of the initial payload chunk, we can reduce TTFB, which is the time audio can start streaming.
Python
Recommendations for reducing response time
- Use the API parameter
noTextNormalization, which turns off text normalization, to reduce the amount of computation needed to prepare input text for TTS inference. This can safely be used in cases where there are no digits, abbreviations, or tricky punctuation, such as inYes, I grew up on one twenty-three Main Street in Oakland, California.instead ofYes, I grew up on 123 Main St. in Oakland, CA.. - Request audio that meets the needs of your application. Remember: Smaller payloads move across the network more quickly. For example, if you’re building an application for telephony, request 8000kHz-sampled audio using the
samplingRateparameter.Rime’s models produce PCM audio by default, and conversion and downsampling must take place for sampling rates other than 22kHz. Still, the reduction in payload size and therefore network overhead more than makes up for increased server processing time.
Network latency by region
Rime serves the API from multiple regions so you can route requests to the data center closest to your application. Picking the closest endpoint typically shaves tens of milliseconds off round-trip time. These are rough round-trip times (RTT) — the back-and-forth network delay between your application and Rime — from major US metros to each region:| From metro | To US East | To US West |
|---|---|---|
| East Coast (NYC, DC, Boston, Atlanta) | 5–25 ms | 60–85 ms |
| Midwest / South (Chicago, Dallas, Denver) | 25–55 ms | 35–65 ms |
| West Coast (SF, LA, Seattle) | 60–85 ms | 5–25 ms |
- Same region (your app and the Rime endpoint in the same AWS region): typically 1–10 ms.
- Coast-to-coast: ~60 ms is the physical floor, set by the speed of light in fiber across ~2,500 miles. 60–85 ms is normal.
- Above 90 ms between major US metros usually points to a suboptimal network route, not Rime.
rime speedtest.
