AI4Bharat’s Ambitious “Ten Trillion Token” Project to Advance Indian Language AI

AI4Bharat, an AI lab incubated at IIT Madras, is leading a large-scale initiative to collect 10 trillion tokens of language data for developing next-generation AI models. Tokens are the basic units that large language models (LLMs) process; a token can represent a word, a subword, or a single character.
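To make the three granularities concrete, here is a minimal Python sketch. It is illustrative only and is not AI4Bharat's tokenizer: the function names and the tiny vocabulary are hypothetical, and the greedy longest-match split is a simplified stand-in for the learned subword schemes that production LLMs use.

```python
def word_tokens(text):
    """Split on whitespace: one token per word."""
    return text.split()


def char_tokens(text):
    """One token per non-whitespace character."""
    return [ch for ch in text if not ch.isspace()]


def subword_tokens(word, vocab):
    """Greedy longest-match split against a (hypothetical) vocabulary.

    Falls back to a single character when no vocabulary entry matches,
    so the loop always advances.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking one character at a time.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces
```

With the assumed vocabulary `{"lang", "uages"}`, `subword_tokens("languages", {"lang", "uages"})` splits the word into `["lang", "uages"]`, while `word_tokens` would keep it as one token and `char_tokens` would break it into nine.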

According to The Economic Times, AI4Bharat cofounder Mitesh Khapra stated that the organization has spent the past three years gathering linguistic data from nearly every district in India, covering all 22 officially recognized languages. The data has been sourced through voice samples from individuals across diverse demographics and professions.

Khapra said that AI4Bharat has built its own tools for data collection, and that startups, academic institutions, and deeptech firms can use the lab's resources to develop AI models. He emphasized that its datasets, models, and scripts are open-source, allowing others to build upon them.

The collected data will power the “Ten Trillion Token” project, aimed at creating native Indic AI models that prioritize Indian languages rather than treating them as an afterthought. Khapra stressed the importance of gathering synthetic data that encapsulates linguistic and cultural nuances. The project has potential applications across various sectors, including agriculture, education, and digital finance.

AI4Bharat’s initiative aligns with a parallel effort by People+ai, an organization supported by Aadhaar architect Nandan Nilekani. People+ai is assembling 10 trillion language tokens from government documents and conversations to construct essential datasets for training foundational AI models. With English dominating internet content, AI4Bharat and People+ai seek to bridge the gap by developing robust datasets that capture India’s linguistic diversity, script variations, and grammatical structures.

This initiative builds upon AI4Bharat’s previous contributions, such as IndicVoices, an open-source speech dataset spanning 22 Indian languages. The dataset, funded by the Ministry of Electronics and IT’s Bhashini initiative and by non-profit organizations, underscores the lab’s commitment to democratizing AI for Indian languages.
