
# Load balancing

> Load-balance Rime on-prem deployments using the ORCA cost-header signal.

To ensure real-time streaming, it is often necessary to limit the number of
inference requests that a model is processing concurrently.

To support this, Rime's Arcana model images return an [Open Request Cost
Aggregation (ORCA) header](https://github.com/envoyproxy/envoy/issues/6614)
that reports load based on the number of concurrent requests, so that a load
balancer can route accordingly.

<Note>Arcana model containers have returned the full set of ORCA headers
since the `20260115` release.</Note>

## HTTP ORCA header

The ORCA header in HTTP responses looks like:

```
endpoint-load-metrics: TEXT application_utilization=0.5, cpu_utilization=0.3128, mem_utilization=0.2453, rps_fractional=0.0000, eps=0.0000
```
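
For a custom load balancer or a monitoring script, the header value can be
split into named metrics. The following is a minimal sketch, assuming the
`TEXT` format shown above (a format prefix followed by comma-separated
`name=value` pairs). The function name `parse_orca_text_header` is
hypothetical and not part of any Rime API.

```python
def parse_orca_text_header(value: str) -> dict[str, float]:
    """Parse a TEXT-format `endpoint-load-metrics` header value.

    Assumes the layout shown above: the literal prefix `TEXT`, a space,
    then comma-separated `name=value` pairs with numeric values.
    """
    fmt, _, body = value.strip().partition(" ")
    if fmt != "TEXT":
        raise ValueError(f"unsupported ORCA header format: {fmt!r}")

    metrics: dict[str, float] = {}
    for pair in body.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate a trailing comma
        name, sep, raw = pair.partition("=")
        if not sep:
            raise ValueError(f"malformed metric: {pair!r}")
        metrics[name.strip()] = float(raw)
    return metrics


header = (
    "TEXT application_utilization=0.5, cpu_utilization=0.3128, "
    "mem_utilization=0.2453, rps_fractional=0.0000, eps=0.0000"
)
metrics = parse_orca_text_header(header)
print(metrics["application_utilization"])  # 0.5
```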

## Max concurrency

The `application_utilization` metric is calculated by dividing the number of
concurrent inference requests by a preconfigured max concurrency.
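
As a worked example of this division, the sketch below reproduces the
arithmetic. The function name is hypothetical, and whether the container caps
the reported value at `1.0` is not stated here, so the sketch leaves it
uncapped.

```python
def application_utilization(concurrent_requests: int, max_concurrency: int) -> float:
    """Concurrent inference requests divided by the configured max concurrency."""
    if max_concurrency <= 0:
        raise ValueError("max_concurrency must be positive")
    # Assumption: no capping; values above 1.0 mean the capacity is exceeded.
    return concurrent_requests / max_concurrency


# 4 in-flight requests against a max concurrency of 8
print(application_utilization(4, 8))  # 0.5
```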

You can override the max concurrency after [parameter
tuning](/docs/on-prem/performance) by setting the
`INFERENCE_CONCURRENCY_CAPACITY` variable to the desired max concurrency.

<Note>The `INFERENCE_CONCURRENCY_CAPACITY` variable is only used to calculate
the utilization to inform the load balancer. Setting it does not reject or
queue overflowing requests.</Note>
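
Because the container itself does not reject or queue overflow, any limit has
to be enforced by the load balancer. The following is a minimal sketch of one
possible routing policy, not Rime's or any specific load balancer's
implementation: route to the backend with the lowest reported
`application_utilization`, and skip backends at or above a limit. The function
name `pick_backend`, the backend names, and the `limit` parameter are all
hypothetical.

```python
from typing import Optional


def pick_backend(
    utilization_by_backend: dict[str, float], limit: float = 1.0
) -> Optional[str]:
    """Return the least-utilized backend below `limit`, or None if all are full.

    `utilization_by_backend` maps a backend name to its most recently
    reported `application_utilization` value.
    """
    candidates = {
        name: util for name, util in utilization_by_backend.items() if util < limit
    }
    if not candidates:
        return None  # caller decides whether to queue, retry, or reject
    return min(candidates, key=candidates.get)


print(pick_backend({"arcana-0": 0.5, "arcana-1": 0.25, "arcana-2": 1.0}))  # arcana-1
```

Returning `None` when every backend is saturated leaves the overflow decision
(queue, retry, or reject) to the caller.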
