
# Performance tuning

> Tune supported Coda and Arcana on-prem deployments for concurrency without exceeding latency or real-time-factor targets.

<Note>Performance tuning applies to both Coda and Arcana on-prem deployments.</Note>

Tune concurrency against an explicit service-level target. Rime does not provide rate limiting or request queues, because those controls depend on the deployment. Enforce the concurrency limit in your own infrastructure.

## What to measure

* **Initial latency**, or time to first frame or byte (TTFF/TTFB), measures the time from sending a request until the first frame arrives. Lower is better.
* **Real-time factor (RTF)** is processing time divided by stream duration. RTF must remain at or below 1 for real-time delivery, because an RTF above 1 means the audio takes longer to produce than it takes to play. Lower is better.
* **Concurrency** is the number of simultaneous requests the service can handle while meeting the target. Higher is better.
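As a rough illustration of the first two metrics, the sketch below times one streamed response. It is not part of any Rime client: `measure_stream` and its parameters are hypothetical names. It assumes the response arrives as an iterable of raw audio chunks, that the request is sent when iteration begins, and that the audio has a known, constant byte rate (for example, 48,000 bytes per second for 16-bit mono PCM at 24 kHz).

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    bytes_per_second: int,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (ttfb_seconds, rtf) for one streamed response.

    Hypothetical helper: assumes constant-bitrate audio so that
    stream duration can be derived from the byte count.
    """
    start = clock()
    ttfb = None
    total_bytes = 0
    for chunk in chunks:
        if ttfb is None:
            # Initial latency: time until the first chunk arrives.
            ttfb = clock() - start
        total_bytes += len(chunk)
    processing_time = clock() - start

    if ttfb is None or total_bytes == 0:
        raise ValueError("stream produced no audio")

    stream_duration = total_bytes / bytes_per_second
    # RTF: processing time divided by stream duration.
    return ttfb, processing_time / stream_duration
```

For compressed or variable-bitrate output, the stream duration would have to come from the decoded audio instead of the byte count.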

Latency and RTF usually rise with concurrency. Queue or reject requests above the measured limit instead of allowing overload to degrade every active stream.
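One way to reject requests above the measured limit is a small in-process gate in front of the model container. The sketch below is a minimal example, assuming a single-process `asyncio` service. `ConcurrencyGate` and `OverCapacity` are hypothetical names, and `make_request` stands in for whatever coroutine forwards the request.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class OverCapacity(Exception):
    """Raised when the gate is already at its concurrency limit."""


class ConcurrencyGate:
    """Admit at most `limit` requests at a time and reject the rest."""

    def __init__(self, limit: int) -> None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        self.limit = limit
        self.active = 0

    async def run(self, make_request: Callable[[], Awaitable[T]]) -> T:
        # There is no await between the check and the increment,
        # so the count stays consistent on a single event loop.
        if self.active >= self.limit:
            raise OverCapacity(f"{self.active} requests already in flight")
        self.active += 1
        try:
            return await make_request()
        finally:
            self.active -= 1
```

A caller would typically translate `OverCapacity` into a retryable error for the client. To queue instead of reject, an `asyncio.Semaphore(limit)` around the same call would make excess requests wait, at the cost of a longer initial latency for the queued requests.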

## Coda

For Coda, `GENERATOR_MAX_BATCH` is the only control you need to tune, and its default works for most deployments.

| Variable              | Default |
| :-------------------- | :------ |
| `GENERATOR_MAX_BATCH` | `32`    |

Increase it to `64` or `128` when you serve higher traffic or run on newer GPUs. A larger batch supports higher concurrency at the cost of longer container startup time. Leave the other controls listed below at their defaults.

## Arcana

Arcana exposes additional batching, session, and memory controls. Tune them together against representative traffic.

### Tuning workflow

Use [`armchair`](https://crates.io/crates/armchair) to repeat this benchmark-driven loop:

1. Run a baseline without specifying concurrency (`-c`) and set the latency, real-time-factor, and success-rate constraints you need.
2. If RTF is significantly lower than 1, increase both `DECODER_MAX_BATCH` and `GENERATOR_MAX_BATCH`.
3. If the server fails to start with an out-of-memory error, decrease `GENERATOR_GPU_MEMORY_UTILIZATION`.
4. Repeat until the benchmarked concurrency converges with `DECODER_MAX_BATCH` and `GENERATOR_MAX_BATCH`. Treat that value as the maximum concurrency for the deployment.
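The decision made between benchmark runs (steps 2 and 3) can be sketched as a small function. This is an illustration only: `Settings`, `next_settings`, the doubling of batch sizes, the `0.8` cutoff used for "significantly lower than 1", and the `0.05` memory step are all assumptions, not Rime recommendations. The benchmark itself is still run separately with `armchair`.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the values being tuned."""
    decoder_max_batch: int = 32
    generator_max_batch: int = 32
    gpu_memory_utilization: float = 0.8


def next_settings(
    current: Settings,
    *,
    started_ok: bool,
    rtf: float,
    rtf_headroom: float = 0.8,   # assumed cutoff for "significantly lower than 1"
    memory_step: float = 0.05,   # assumed decrement after an out-of-memory start
) -> Settings:
    """Propose the settings for the next benchmark run."""
    if not started_ok:
        # Step 3: the server ran out of memory at startup.
        lowered = round(current.gpu_memory_utilization - memory_step, 2)
        return replace(current, gpu_memory_utilization=lowered)
    if rtf < rtf_headroom:
        # Step 2: there is headroom, so raise both batch sizes together.
        return replace(
            current,
            decoder_max_batch=current.decoder_max_batch * 2,
            generator_max_batch=current.generator_max_batch * 2,
        )
    # No change proposed: benchmark again and compare concurrency (step 4).
    return current
```

For example, `next_settings(Settings(), started_ok=True, rtf=0.5)` proposes batch sizes of 64 for both variables and leaves memory utilization unchanged.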

### Controls

These model-container environment variables control batching, sessions, and memory use:

| Variable                           | Default |
| :--------------------------------- | :------ |
| `DECODER_MAX_BATCH`                | `32`    |
| `DECODER_NUM_SESSIONS`             | `6`     |
| `GENERATOR_MAX_BATCH`              | `32`    |
| `GENERATOR_GPU_MEMORY_UTILIZATION` | `0.8`   |

The defaults support Rime's lowest supported hardware specification. Higher-capacity hardware should still be tuned against representative traffic rather than a fixed concurrency assumption.
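When overriding these variables, it can help to generate the values from one place so that a typo in a variable name fails loudly instead of being ignored. The sketch below is a hypothetical helper (`render_env` is not a Rime tool) that merges overrides onto the defaults from the table and emits `KEY=value` lines, a format that container runtimes commonly accept as an environment file.

```python
from typing import Mapping, Union

# Defaults copied from the table above.
DEFAULTS = {
    "DECODER_MAX_BATCH": "32",
    "DECODER_NUM_SESSIONS": "6",
    "GENERATOR_MAX_BATCH": "32",
    "GENERATOR_GPU_MEMORY_UTILIZATION": "0.8",
}


def render_env(overrides: Mapping[str, Union[int, float, str]]) -> str:
    """Merge overrides onto the defaults and return KEY=value lines."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Reject misspelled names rather than silently passing them through.
        raise KeyError(f"unknown variable(s): {sorted(unknown)}")
    merged = {**DEFAULTS, **{name: str(value) for name, value in overrides.items()}}
    return "\n".join(f"{name}={value}" for name, value in merged.items())


if __name__ == "__main__":
    print(render_env({"DECODER_MAX_BATCH": 64, "GENERATOR_MAX_BATCH": 64}))
```

How the resulting variables reach the model container depends on your orchestration, so that part is left out here.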

### Reference benchmark

Rime benchmarks with [`armchair`](https://crates.io/crates/armchair):

* Initial latency is measured at one concurrent request (`-c 1`).
* Maximum concurrency requires a 100% success rate, P99 initial latency (TTFB) at or below one second, and P99 RTF at or below 1 (`--target=success:1.00,ttfb:p99@1s,rtf:p99@1.00`).
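To make the target concrete, the sketch below applies the same three constraints to a set of per-request measurements you have collected yourself. It does not reproduce how `armchair` computes its statistics: `percentile` and `meets_target` are hypothetical helpers, and the nearest-rank percentile method is an assumption.

```python
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Ceiling division keeps the rank arithmetic in integers.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


def meets_target(
    ttfb_seconds: Sequence[float],
    rtfs: Sequence[float],
    failed_requests: int = 0,
    max_ttfb_seconds: float = 1.0,
    max_rtf: float = 1.0,
) -> bool:
    """Check 100% success, P99 TTFB <= 1 s, and P99 RTF <= 1."""
    if failed_requests > 0 or not ttfb_seconds or not rtfs:
        return False
    return (
        percentile(ttfb_seconds, 99) <= max_ttfb_seconds
        and percentile(rtfs, 99) <= max_rtf
    )
```

With the nearest-rank method, a single slow request in 100 samples does not move the P99, but two do, so small sample sets can hide tail latency.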

| Model     | Hardware | Initial latency | Maximum concurrency |
| :-------- | :------- | :-------------- | :------------------ |
| Arcana v2 | H100     | 400 ms (mean)   | 32                  |

This reference runs `armchair` with its default settings on the same machine that serves the image, which removes network latency from the measurement. Actual results vary with hardware, software, latency targets, and traffic shape.