ORCA headers are returned by Arcana model containers since the 20251220
release.

HTTP ORCA header

The ORCA header in HTTP responses looks like:

Max concurrency

The inference_concurrency_utilization named metric is calculated by dividing
the number of concurrent inference requests by a preconfigured max
concurrency.
After parameter tuning, you can override the max concurrency by setting the
INFERENCE_CONCURRENCY_CAPACITY variable to the desired value.
The INFERENCE_CONCURRENCY_CAPACITY variable is only used to calculate the
utilization reported to the load balancer. Setting it does not cause requests
above that limit to be rejected or queued.
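
The calculation described above can be illustrated with a minimal Python sketch. It is not the container's actual implementation. It assumes that INFERENCE_CONCURRENCY_CAPACITY is read as an environment variable, and the default of 8 and the function names are hypothetical placeholders. Whether the real metric is capped at 1.0 is not stated here; the sketch leaves it uncapped.

```python
import os

# Hypothetical default: the real preconfigured max concurrency is not
# stated in this document.
DEFAULT_MAX_CONCURRENCY = 8


def max_concurrency(env=os.environ):
    """Return the capacity used as the denominator of the utilization metric.

    Assumption: INFERENCE_CONCURRENCY_CAPACITY is an environment variable
    holding a positive integer.
    """
    raw = env.get("INFERENCE_CONCURRENCY_CAPACITY")
    if raw is None:
        return DEFAULT_MAX_CONCURRENCY
    capacity = int(raw)
    if capacity <= 0:
        raise ValueError("INFERENCE_CONCURRENCY_CAPACITY must be a positive integer")
    return capacity


def inference_concurrency_utilization(in_flight, capacity):
    """Concurrent inference requests divided by the max concurrency."""
    return in_flight / capacity


if __name__ == "__main__":
    capacity = max_concurrency({"INFERENCE_CONCURRENCY_CAPACITY": "4"})
    for in_flight in (1, 4, 6):
        # Nothing is rejected or queued here: the capacity only scales
        # the reported number.
        print(in_flight, inference_concurrency_utilization(in_flight, capacity))
```

With a capacity of 4, six concurrent requests give a ratio of 1.5 in this sketch, which reflects that the variable informs the load balancer rather than limiting traffic.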
