Rime’s TTS Models
Rime now offers a suite of models tailored for different production needs:
- Arcana v3
  modelId: arcana
  - Our flagship TTS model that combines ultra-realistic, expressive voices with low latency (~120 ms TTFB out of the engine) and native multilingual code-switching across more than 10 languages.
  - Enterprise-grade ergonomics for high-volume, real-time deployments at scale.
  - Speaker performance optimized for business telephony and IVR.
- Arcana v2
  modelId: arcanav2
  - Ultra-realistic and expressive voices (including laughter and whispering) with low latency (~250 ms TTFB out of the engine).
  - Built for high-volume conversational applications.
- Mistv2
  modelId: mistv2
  - Optimized for speed and fine-grained control, giving you accurate pronunciation and high concurrency for use cases that demand quick synthesis and customization.
  - This is the highest-precision, lowest-WER (word error rate) model on the market; use it when perfect fidelity to the text is needed.
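
As a rough illustration of choosing between these models in application code, the sketch below maps a coarse description of the need to one of the three modelId values listed above. The category names, the helper functions, and the payload shape are hypothetical and for illustration only; the only details taken from this page are the modelId strings themselves. Check Rime's API reference for the actual request format.

```python
# Model IDs exactly as listed above. The keys ("flagship", "expressive",
# "precision") are made-up labels for this sketch, not Rime terminology.
MODEL_IDS = {
    "flagship": "arcana",      # Arcana v3: lowest latency, multilingual
    "expressive": "arcanav2",  # Arcana v2: expressive conversational voices
    "precision": "mistv2",     # Mistv2: strict fidelity to the input text
}


def pick_model_id(need: str) -> str:
    """Return the modelId for a coarse need, or raise on an unknown label."""
    try:
        return MODEL_IDS[need]
    except KeyError:
        known = ", ".join(sorted(MODEL_IDS))
        raise ValueError(f"unknown need {need!r}; expected one of: {known}") from None


def build_request(text: str, need: str = "flagship") -> dict:
    """Build a request body.

    Assumption: only the "modelId" field name comes from this page; the
    "text" field and the overall shape are placeholders.
    """
    return {"modelId": pick_model_id(need), "text": text}
```

For example, `build_request("Your order has shipped.", need="precision")` would select mistv2 for a case where exact wording matters more than expressiveness.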
What Makes Arcana v3 Different
- Real-Time Conversational Performance: Arcana v3 delivers TTS with industry-leading latency (sub-120 ms on-prem and 200 ms via the cloud API). That is fast enough for mid-utterance control and barge-in, so back-and-forth interactions feel natural, with no awkward pauses.
- Multilingual & Code-Switching: A single model supports more than 10 languages (including English, Spanish, Hindi, Arabic, French, Portuguese, German, Japanese, Hebrew, and Tamil) and can switch between them mid-utterance without losing prosody or voice identity.
- Word-Level Timestamps: Structural metadata enables text-audio alignment, real-time highlighting, better interruption handling, and smarter orchestration in voice applications.
- Enterprise-Grade Deployment: Arcana v3 scales with high concurrency per machine, ORCA headers for seamless auto-scaling, and a robust suite of TTS-specific observability metrics.
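
To make the word-level timestamps point concrete, here is a minimal sketch of how such metadata could drive real-time highlighting and interruption handling. It assumes each word arrives with a start and end time in seconds; that shape, the `WordTiming` type, and the sample values are assumptions for illustration, not Rime's actual response format.

```python
from bisect import bisect_right
from typing import List, NamedTuple, Optional


class WordTiming(NamedTuple):
    """Hypothetical timestamp record: times are seconds from audio start."""
    word: str
    start: float
    end: float


def word_at(timings: List[WordTiming], t: float) -> Optional[int]:
    """Index of the word being spoken at playback time t, or None in a gap.

    Assumes timings are sorted by start time and do not overlap.
    """
    starts = [w.start for w in timings]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < timings[i].end:
        return i
    return None


def spoken_before(timings: List[WordTiming], t: float) -> str:
    """Text the listener fully heard before a barge-in at time t."""
    return " ".join(w.word for w in timings if w.end <= t)


# Made-up sample data for the sketch.
SAMPLE = [
    WordTiming("Hello", 0.00, 0.40),
    WordTiming("and", 0.45, 0.60),
    WordTiming("welcome", 0.60, 1.10),
]
```

A UI could call `word_at` on each playback tick to highlight the current word, while an orchestration layer could call `spoken_before` when the user interrupts, so the conversation history records only what was actually heard.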

