Rime’s TTS Models
Rime now offers a suite of models tailored for different production needs:
- Arcana v3: our flagship TTS model, combining ultra-realistic, expressive voices with low latency (~120 ms TTFB on-prem) and native multilingual code-switching across more than 10 languages. It includes word-level timestamps and enterprise-grade ergonomics for high-volume, real-time deployments, and it supports natural elements like laughter and breathing.
- Mistv2: optimized for speed and fine-grained control, with accurate pronunciation and high concurrency for use cases that demand quick synthesis and customization.
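To make the TTFB (time to first byte) figure concrete, here is a minimal sketch of how you might measure it in your own client. It does not call Rime's API: `measure_ttfb` and `fake_stream` are hypothetical names, and the only assumption is that your TTS client exposes the response as an iterator of audio byte chunks.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttfb(chunks: Iterable[bytes]) -> Tuple[Optional[float], bytes]:
    """Consume an audio stream and return (TTFB in ms, full audio).

    TTFB is None if the stream never yields a non-empty chunk.
    """
    start = time.perf_counter()
    ttfb_ms: Optional[float] = None
    audio = bytearray()
    for chunk in chunks:
        # Record the elapsed time when the first non-empty chunk arrives.
        if ttfb_ms is None and chunk:
            ttfb_ms = (time.perf_counter() - start) * 1000.0
        audio.extend(chunk)
    return ttfb_ms, bytes(audio)


def fake_stream(first_byte_delay_s: float = 0.02) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumption)."""
    time.sleep(first_byte_delay_s)  # simulated synthesis + network delay
    yield b"\x00" * 320
    yield b"\x00" * 320


ttfb_ms, audio = measure_ttfb(fake_stream())
print(f"TTFB: {ttfb_ms:.1f} ms, {len(audio)} bytes received")
```

In a real measurement, start the clock immediately before the request is sent, and pass the streaming response body in place of `fake_stream()`.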
What Makes Arcana v3 Different
- Real-Time Conversational Performance: Arcana v3 delivers TTS with industry-leading latency (sub-120 ms on-prem and 200 ms via the cloud API). That is fast enough for mid-utterance control and barge-in, so back-and-forth interactions feel natural, with no awkward silences.
- Multilingual & Code-Switching: A single model supports more than 10 languages, including English, Spanish, Hindi, Arabic, French, Portuguese, German, Japanese, Hebrew, and Tamil, and can switch between them mid-utterance without losing prosody or voice identity.
- Word-Level Timestamps: Structural metadata enables text-audio alignment, real-time highlighting, better interruption handling, and smarter orchestration in voice applications.
- Enterprise-Grade Deployment: Arcana v3 supports high concurrency per machine, exposes ORCA headers for seamless auto-scaling, and provides a robust suite of TTS-specific observability metrics.
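As an illustration of what word-level timestamps make possible, the sketch below shows real-time highlighting and interruption handling on the client side. The `WordTiming` shape (a word plus start and end times in seconds) is an assumption made for this example, not Rime's actual response format, and the sample timings are invented.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class WordTiming:
    """Hypothetical per-word timing record (assumed shape)."""
    word: str
    start: float  # seconds from the start of the audio
    end: float    # seconds from the start of the audio


def word_at(timings: List[WordTiming], t: float) -> Optional[int]:
    """Index of the word being spoken at playback time t, or None in a gap."""
    starts = [w.start for w in timings]
    i = bisect_right(starts, t) - 1  # last word starting at or before t
    if i >= 0 and t < timings[i].end:
        return i
    return None


def spoken_text(timings: List[WordTiming], t: float) -> str:
    """Words fully spoken by time t, e.g. what the user heard before barge-in."""
    return " ".join(w.word for w in timings if w.end <= t)


# Invented sample data for illustration only.
timings = [
    WordTiming("Hello", 0.00, 0.32),
    WordTiming("from", 0.35, 0.52),
    WordTiming("Rime", 0.55, 0.90),
]

print(word_at(timings, 0.40))      # index of the word to highlight
print(spoken_text(timings, 0.53))  # text heard before an interruption at 0.53 s
```

When a user interrupts playback, recording only the text returned by `spoken_text` in the conversation history keeps the assistant's record in line with what was actually heard.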

