Bengaluru-based AI startup Sarvam AI has announced the release of Bulbul-v2, a text-to-speech (TTS) model built specifically for India. Supporting 11 Indian languages, the model is designed to deliver speech with authentic regional accents that the company describes as sounding “just like India.”
In a recent post on LinkedIn, Sarvam AI emphasized that Bulbul-v2 generates lifelike, expressive audio, avoiding the flat or robotic tone common in many TTS systems. It also offers high-speed processing and customizable voice options, and is particularly suited to brands and enterprises looking to localize content at scale.
According to Sarvam, Bulbul-v2 represents a leap forward for speech AI in India, setting new standards in terms of naturalness, responsiveness, and affordability. As part of its broader mission to democratize access to AI in India, the startup is offering low-latency API access at India-friendly pricing, helping expand the technology’s reach across industries.
Notably, Sarvam AI is the first Indian startup selected by the central government to develop India’s sovereign large language model (LLM) under the national IndiaAI initiative, which aims to build indigenous capabilities in artificial intelligence.
What is Bulbul-v2?
Bulbul-v2 is Sarvam AI’s flagship TTS model, engineered to mirror India’s linguistic diversity and speech patterns. It supports real-time synthesis, multi-language inputs, and code-mixed text, making it adept at handling natural conversations across different Indian languages. The model also includes multiple voice personas, giving users creative flexibility.
Key capabilities include:
- Realistic voice prosody (rhythm, tone, and intonation)
- Voice customization (adjust pitch, speed, and volume)
- Language-aware text processing, including smart handling of numbers, dates, and mixed-language sentences
- Sample rate options ranging from 8 kHz to 24 kHz for adaptable audio quality
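To make the capability list concrete, here is a minimal Python sketch of how a client might bundle these controls before sending text for synthesis. It is an illustration only, not Sarvam AI's actual API: the class `VoiceSettings`, the function `build_payload`, every field name, the default values, and the `hi-IN` language code are assumptions. The only detail taken from the list above is the 8 kHz to 24 kHz sample rate range.

```python
from dataclasses import dataclass, asdict

# The article states only that sample rates range from 8 kHz to 24 kHz.
MIN_SAMPLE_RATE_HZ = 8_000
MAX_SAMPLE_RATE_HZ = 24_000


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container for the controls named in the capability list."""
    language: str = "hi-IN"       # assumed language-code format
    pitch: float = 0.0            # assumed: 0.0 means "no pitch shift"
    pace: float = 1.0             # assumed: 1.0 means normal speed
    loudness: float = 1.0         # assumed: 1.0 means normal volume
    sample_rate_hz: int = 22_050  # must fall inside the stated range

    def __post_init__(self):
        # Reject values outside the range quoted in the article.
        if not MIN_SAMPLE_RATE_HZ <= self.sample_rate_hz <= MAX_SAMPLE_RATE_HZ:
            raise ValueError(
                f"sample_rate_hz must be between {MIN_SAMPLE_RATE_HZ} "
                f"and {MAX_SAMPLE_RATE_HZ}, got {self.sample_rate_hz}"
            )
        if self.pace <= 0:
            raise ValueError("pace must be positive")
        if self.loudness < 0:
            raise ValueError("loudness must not be negative")


def build_payload(text: str, settings: VoiceSettings) -> dict:
    """Combine the input text and voice settings into one request dictionary."""
    if not text.strip():
        raise ValueError("text must not be empty")
    return {"text": text, **asdict(settings)}


if __name__ == "__main__":
    # Telephony-style request at the low end of the stated sample rate range.
    print(build_payload("नमस्ते, आपका स्वागत है", VoiceSettings(sample_rate_hz=8_000)))
```

Validating the settings on the client side means an out-of-range sample rate fails before any network call is made. The real parameter names and limits should be taken from Sarvam AI's own API documentation.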
What Can Bulbul-v2 Do?
The model can instantly convert text into natural speech using preset or custom configurations. Users have fine-grained control over audio output, allowing them to tailor speech style to specific use cases, whether customer service, storytelling, or content localization.
The integrated text preprocessing system intelligently normalizes text inputs to enhance clarity and pronunciation, especially for numerical or hybrid linguistic inputs.
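As a rough illustration of what number normalization involves, the Python sketch below expands digit sequences into spoken English words using the Indian numbering system (thousand, lakh, crore). It is a toy example and not Sarvam AI's preprocessing pipeline: the function names, the choice of English output, and the nine-digit limit are all assumptions made for the sketch.

```python
import re

ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

# Indian grouping: crore (10^7), lakh (10^5), thousand (10^3).
GROUPS = ((10_000_000, "crore"), (100_000, "lakh"), (1_000, "thousand"))


def _below_hundred(n: int) -> str:
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def _below_thousand(n: int) -> str:
    hundreds, rest = divmod(n, 100)
    parts = []
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest:
        parts.append(_below_hundred(rest))
    return " ".join(parts)


def number_to_words(n: int) -> str:
    """Spell out 0 <= n < 10**9 using Indian-style groups."""
    if not 0 <= n < 10**9:
        raise ValueError("only 0 to 99,99,99,999 is supported in this sketch")
    if n == 0:
        return "zero"
    parts = []
    for value, name in GROUPS:
        count, n = divmod(n, value)
        if count:
            parts.append(_below_hundred(count) + " " + name)
    if n:
        parts.append(_below_thousand(n))
    return " ".join(parts)


def normalize_numbers(text: str) -> str:
    """Replace digit runs (with optional comma grouping) by spoken words."""
    def expand(match: re.Match) -> str:
        digits = match.group(0).replace(",", "")
        if len(digits) > 9:  # too large for this sketch; leave unchanged
            return match.group(0)
        return number_to_words(int(digits))

    return re.sub(r"[0-9]+(?:,[0-9]+)*", expand, text)


if __name__ == "__main__":
    print(normalize_numbers("Pay 1,50,000 rupees by 2024"))
    # prints: Pay one lakh fifty thousand rupees by two thousand twenty-four
```

A production system would also need to choose the output language per sentence, read years and phone numbers differently from amounts, and handle dates, none of which this sketch attempts.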
Released as a follow-up to Bulbul-v1, which launched in August 2024 with six voice presets, Bulbul-v2 adds more nuanced voice personalities and greater scalability.
Given its speed, affordability, and Indian linguistic orientation, Bulbul-v2 is being positioned as a competitive alternative to global TTS models, especially for developers, educators, and businesses aiming for localized engagement.