Rime provides industry-leading text-to-speech (TTS) AI models built for real-time conversational experiences at scale. Our latest flagship model, Arcana v3, delivers authentic, natural, and expressive speech with the speed and reliability required for production voice AI, whether you’re building intelligent IVRs, multilingual voice agents, or anything in between.
Rime’s TTS Models
Rime now offers a suite of models tailored for different production needs:
- Arcana v3
  modelId: arcana
  - Our flagship TTS model that combines ultra-realistic, expressive voices with low latency (~120 ms TTFB out of the engine) and native multilingual code-switching across more than 10 languages.
  - Enterprise-grade ergonomics for high-volume, real-time deployments at scale.
  - Speaker performance optimized for business telephony and IVR.
- Arcana v2
  modelId: arcanav2
  - Ultra-realistic and expressive voices (including laughter and whispering) with low latency (~250 ms TTFB out of the engine).
  - Built for high-volume conversational applications.
- Mist v3
  modelId: mistv3
  - A major update to the Mist engine with typical TTFB well below 100 ms, significantly faster than previous Mist versions while preserving quality and predictability.
- Mist v2
  modelId: mistv2
  - Optimized for speed and fine-grained control, giving you accurate pronunciation and high concurrency for use cases that demand quick synthesis and customization.
  - The highest-precision, lowest word error rate (WER) model on the market; use it when perfect fidelity to the text is needed.
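As a rough illustration of choosing between these models, the sketch below collects the model IDs and latency figures quoted above into a lookup table and picks the first model that fits a latency budget. The `TtsModel` class, the `pick_model_id` helper, and the "first match in listed order" rule are assumptions made for this example; they are not part of Rime's API. Mist v3's "well below 100 ms" is treated as a 100 ms upper bound, and Mist v2 has no quoted figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TtsModel:
    """Hypothetical record for one model listed above."""
    name: str
    model_id: str
    # Approximate engine TTFB in milliseconds as quoted in the list;
    # None where no figure is given.
    ttfb_ms: Optional[int]


# Listed in the same order as the documentation.
MODELS = [
    TtsModel("Arcana v3", "arcana", 120),
    TtsModel("Arcana v2", "arcanav2", 250),
    TtsModel("Mist v3", "mistv3", 100),   # "well below 100 ms": used as an upper bound
    TtsModel("Mist v2", "mistv2", None),  # no TTFB figure quoted
]


def pick_model_id(max_ttfb_ms: int) -> Optional[str]:
    """Return the modelId of the first listed model whose quoted TTFB fits the budget.

    Models without a quoted figure are skipped; None means nothing fits.
    """
    for model in MODELS:
        if model.ttfb_ms is not None and model.ttfb_ms <= max_ttfb_ms:
            return model.model_id
    return None


if __name__ == "__main__":
    for budget in (300, 110, 50):
        print(budget, pick_model_id(budget))
```

The returned string is the value you would pass as `modelId` in a synthesis request. In practice the choice also depends on voice quality, language coverage, and pronunciation control, which a latency budget alone does not capture.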
What Makes Arcana v3 Different
- Real-Time Conversational Performance: Arcana v3 delivers TTS with industry-leading latency (sub-120 ms on-prem and 200 ms via the cloud API), enabling natural back-and-forth interactions without awkward pauses. That is fast enough for mid-utterance control and barge-in.
- Multilingual & Code-Switching: A single model supports more than 10 languages (including English, Spanish, Hindi, Arabic, French, Portuguese, German, Japanese, Hebrew, and Tamil) and can switch between them mid-utterance without losing prosody or voice identity.
- Word-Level Timestamps: Structural metadata enables text-audio alignment, real-time highlighting, better interruption handling, and smarter orchestration in voice applications.
- Enterprise-Grade Deployment: Arcana v3 scales with high concurrency per machine, ORCA headers for seamless auto-scaling, and a robust suite of TTS-specific observability metrics.
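
To make the word-level timestamp use cases above more concrete, here is a minimal sketch of how a client might use such timestamps for real-time highlighting and interruption handling. The `(word, start_seconds, end_seconds)` tuple shape, the helper names, and the sample timings are all assumptions for illustration; the actual timestamp format returned by Rime may differ.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical timestamp shape: (word, start_seconds, end_seconds),
# sorted by start time. Not Rime's actual response format.
WordTiming = Tuple[str, float, float]


def word_index_at(timings: List[WordTiming], t: float) -> Optional[int]:
    """Index of the word being spoken at playback time t, or None during silence."""
    starts = [start for _, start, _ in timings]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < timings[i][2]:
        return i
    return None


def spoken_before(timings: List[WordTiming], t: float) -> str:
    """Text fully spoken before an interruption (barge-in) at time t."""
    return " ".join(word for word, _, end in timings if end <= t)


# Made-up sample timings for demonstration only.
timings: List[WordTiming] = [
    ("Thanks", 0.00, 0.35),
    ("for", 0.35, 0.50),
    ("calling", 0.50, 0.95),
    ("Rime", 1.05, 1.40),
]

if __name__ == "__main__":
    print(word_index_at(timings, 0.40))  # word to highlight at 0.40 s
    print(spoken_before(timings, 0.60))  # what the caller heard before barging in
```

A voice agent could use the second helper to record only the words a caller actually heard before interrupting, so the conversation history reflects what was spoken rather than the full generated text.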

