# Performance tuning

> Tune Rime on-prem deployments for throughput, latency, and concurrency.

# Overview

When we talk about *performance* for a real-time audio streaming service, we
typically mean a combination of the following metrics:

* Initial latency, or time-to-first-frame/byte (TTFF/TTFB), defined as the time
  elapsed between sending the request and receiving the first frame.
  Lower is better.
* Real-time factor (RTF), defined as the ratio of the time spent on
  processing to the stream duration. A value ≤ 1 is required to stream in
  real time. Lower is better.
* Concurrency, defined as the number of requests a service can handle at the
  same time. Higher is better.
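
As a worked example of the first two definitions, the sketch below computes TTFB and RTF for a single request from client-side timestamps. The `StreamTiming` type and its field names are hypothetical, and it assumes "time spent on processing" is measured as the wall-clock time from sending the request to receiving the last frame.

```python
from dataclasses import dataclass


@dataclass
class StreamTiming:
    """Client-side timestamps for one streamed request (hypothetical type)."""
    request_sent: float    # seconds, when the request was sent
    first_frame: float     # seconds, when the first audio frame arrived
    last_frame: float      # seconds, when the final audio frame arrived
    audio_duration: float  # seconds of audio in the delivered stream


def ttfb(t: StreamTiming) -> float:
    """Initial latency: time from sending the request to the first frame."""
    return t.first_frame - t.request_sent


def rtf(t: StreamTiming) -> float:
    """Real-time factor: processing time divided by stream duration.

    Assumption: processing time is approximated by the wall-clock time
    between sending the request and receiving the last frame.
    """
    return (t.last_frame - t.request_sent) / t.audio_duration


timing = StreamTiming(request_sent=0.0, first_frame=0.4,
                      last_frame=2.5, audio_duration=5.0)
print(f"TTFB={ttfb(timing):.2f}s RTF={rtf(timing):.2f}")
```

Here 5 seconds of audio were produced in 2.5 seconds, so the RTF is 0.5 and the stream can be played back in real time.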

Initial latency and RTF typically trade off against concurrency: both tend to
rise as the concurrency level goes up. To keep a real-time streaming service
within its performance targets under high load, we recommend limiting the
maximum number of concurrent requests the system handles, and queuing or
rejecting requests beyond that limit.

<Note>Rime does not provide rate-limiting or queuing support, as it is highly
dependent on the exact deployment being used.</Note>
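
Because limiting is left to your deployment, here is a minimal sketch of one way to cap concurrency in front of the service, using `asyncio`. The `AdmissionGate` and `CapacityExceeded` names are hypothetical and not part of Rime. The gate admits up to `max_concurrent` requests, lets up to `max_queued` more wait, and rejects the rest.

```python
import asyncio


class CapacityExceeded(Exception):
    """Raised when both the active slots and the wait queue are full."""


class AdmissionGate:
    """Caps in-flight requests; queues a bounded number, rejects the rest."""

    def __init__(self, max_concurrent: int, max_queued: int):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._max_queued = max_queued
        self._queued = 0

    async def run(self, handler):
        """Run `handler` (a zero-argument coroutine function) under the cap."""
        # Reject immediately if no slot is free and the queue is full.
        if self._slots.locked() and self._queued >= self._max_queued:
            raise CapacityExceeded("too many concurrent requests")
        self._queued += 1
        try:
            await self._slots.acquire()  # waits here while queued
        finally:
            self._queued -= 1
        try:
            return await handler()
        finally:
            self._slots.release()
```

A reasonable starting value for `max_concurrent` is the maximum concurrency you measure for your own hardware and performance target.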

# Metrics

Rime provides some off-the-shelf metrics for reference only. Your mileage may
vary depending on the hardware/software setup, latency constraints, and the
actual traffic.

## Methodology

We use the [`armchair`](https://crates.io/crates/armchair) tool for
benchmarking a system and report the following:

* Initial latency: measured when there is 1 concurrent request (`-c 1`).
* Max concurrency: measured at a performance target of 100% success rate, 99th
  percentile of initial latency (TTFB) ≤ 1s, and 99th percentile of RTF ≤ 1
  (`--target=success:1.00,ttfb:p99@1s,rtf:p99@1.00`).

| Model     | Hardware | Initial latency | Max concurrency |
| --------- | -------- | --------------- | --------------- |
| Arcana v2 | H100     | μ=400ms         | 32              |

`armchair` is run with the default arguments, on the same machine that runs
the model image, to eliminate network latency.
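
To make the performance target concrete, the sketch below checks a set of per-request results against the same three conditions: 100% success, p99 TTFB ≤ 1s, and p99 RTF ≤ 1. It is an illustration only. The function names are hypothetical, and the nearest-rank percentile used here is an assumption that may differ from how `armchair` computes percentiles.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_target(results, ttfb_limit=1.0, rtf_limit=1.0):
    """Check results against success:1.00, ttfb:p99, and rtf:p99 limits.

    `results` is a list of (ok, ttfb_seconds, rtf) tuples, one per request.
    """
    if not results or not all(ok for ok, _, _ in results):
        return False  # any failed request breaks the 100% success target
    ttfbs = [ttfb for _, ttfb, _ in results]
    rtfs = [rtf for _, _, rtf in results]
    return (percentile(ttfbs, 99) <= ttfb_limit
            and percentile(rtfs, 99) <= rtf_limit)
```

With 100 requests, a single slow outlier still passes a p99 check under this definition, while two do not.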

# Performance tuning

<Note>Performance tuning is only available for Arcana model images tagged with
`20251027` or later.</Note>

## Environment variables

The model image exposes the following environment variables, which you can
tune to improve concurrency under a given performance constraint:

* `DECODER_MAX_BATCH`, defaults to `32`
* `DECODER_NUM_SESSIONS`, defaults to `6`
* `GENERATOR_MAX_BATCH`, defaults to `32`
* `GENERATOR_GPU_MEMORY_UTILIZATION`, defaults to `0.8`

The defaults for these variables are set to accommodate the lowest spec that
Rime supports, so we recommend tuning them with a benchmark-driven approach.
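
As an illustration of overriding these variables, the sketch below builds a container launch command with `-e` flags. It assumes the image is run with Docker and GPU access (`--gpus all`). The image reference `arcana-image:TAG` is a hypothetical placeholder, and a real deployment will need further options, such as port mappings, that are omitted here.

```python
import shlex

# Defaults as listed above.
DEFAULTS = {
    "DECODER_MAX_BATCH": "32",
    "DECODER_NUM_SESSIONS": "6",
    "GENERATOR_MAX_BATCH": "32",
    "GENERATOR_GPU_MEMORY_UTILIZATION": "0.8",
}


def docker_run_command(image, overrides):
    """Return a `docker run` command line with tuned environment variables."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown tuning variables: {sorted(unknown)}")
    env = {**DEFAULTS, **{k: str(v) for k, v in overrides.items()}}
    args = ["docker", "run", "--gpus", "all"]
    for name, value in env.items():
        args += ["-e", f"{name}={value}"]
    args.append(image)  # hypothetical placeholder image reference
    return shlex.join(args)


print(docker_run_command("arcana-image:TAG",
                         {"DECODER_MAX_BATCH": 64, "GENERATOR_MAX_BATCH": 64}))
```

Rejecting unknown names guards against typos, which would otherwise leave the server silently running with its defaults.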

## Tuning

You can use [`armchair`](https://crates.io/crates/armchair) to tune the
environment variables with the following workflow:

1. Get a base performance report by running `armchair` with specific
   performance constraints, without specifying the concurrency level (`-c`).
2. If RTF is significantly lower than 1, increase both `DECODER_MAX_BATCH` and
   `GENERATOR_MAX_BATCH`.
3. If the server fails to start with an OOM error, decrease
   `GENERATOR_GPU_MEMORY_UTILIZATION`.
4. Repeat the process until the benchmarked concurrency level converges with
   `DECODER_MAX_BATCH` and `GENERATOR_MAX_BATCH`. This is the maximum
   concurrency that the system can accept.
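
The workflow above can be sketched as a loop. This is one interpretation under stated assumptions, not a Rime tool. `benchmark` is a hypothetical callback that restarts the server with the given settings, runs `armchair`, and returns the measured maximum concurrency and p99 RTF. `ServerOOM` is a hypothetical exception for a failed start. The step sizes and the `rtf_headroom` threshold, which stands in for "significantly lower than 1", are arbitrary choices.

```python
class ServerOOM(Exception):
    """Hypothetical: the server failed to start with an OOM error."""


def tune(benchmark, batch=32, gpu_util=0.8, batch_step=8,
         util_step=0.05, rtf_headroom=0.8, max_rounds=20):
    """Raise both *_MAX_BATCH values until RTF headroom is used up.

    `benchmark(batch, gpu_util)` returns (max_concurrency, p99_rtf) or
    raises ServerOOM. `batch` is applied to both DECODER_MAX_BATCH and
    GENERATOR_MAX_BATCH; `gpu_util` to GENERATOR_GPU_MEMORY_UTILIZATION.
    """
    best = 0
    for _ in range(max_rounds):
        try:
            concurrency, p99_rtf = benchmark(batch, gpu_util)
        except ServerOOM:
            # Step 3: server failed to start, so lower the memory fraction.
            gpu_util = round(gpu_util - util_step, 2)
            continue
        best = concurrency
        if p99_rtf >= rtf_headroom:
            break  # Step 4: no RTF headroom left, stop increasing
        batch += batch_step  # Step 2: RTF well below 1, raise both batches
    return {"max_batch": batch,
            "gpu_memory_utilization": gpu_util,
            "max_concurrency": best}
```

In practice each round involves a server restart and a full benchmark run, so a small `max_rounds` and coarse steps keep the tuning time manageable.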
