AI4Bharat’s Ambitious “Ten Trillion Token” Project to Advance Indian Language AI

IIT Madras-incubated AI lab, AI4Bharat, is spearheading a large-scale initiative to amass 10 trillion tokens of language data to develop next-generation AI solutions. Tokens, the fundamental units processed by large language models (LLMs), can represent words, subwords, or characters.
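As a rough illustration of how the same text yields different token counts at different granularities, here is a minimal Python sketch. The sample phrase and variable names are chosen for this example only. Production LLM tokenizers typically learn a subword vocabulary from data, which this sketch does not attempt to reproduce.

```python
# Illustrative only: real LLM tokenizers learn subword vocabularies from data.
# The sample phrase below is an assumption made for this sketch.
text = "नमस्ते भारत"

# Word-level tokens: split on whitespace.
word_tokens = text.split()

# Character-level tokens: one token per Unicode code point.
char_tokens = list(text)

# Byte-level tokens: one token per UTF-8 byte.
byte_tokens = list(text.encode("utf-8"))

print("words:", len(word_tokens))
print("characters:", len(char_tokens))
print("bytes:", len(byte_tokens))
```

The finer the granularity, the more tokens the same phrase produces, so a figure such as "10 trillion tokens" depends on the tokenizer used to count it.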

According to The Economic Times, AI4Bharat cofounder Mitesh Khapra stated that the organization has spent the past three years gathering linguistic data from nearly every district in India, covering all 22 officially recognized languages. The data has been sourced through voice samples from individuals across diverse demographics and professions.

Khapra highlighted that AI4Bharat has built its own data-collection tools, which startups, academic institutions, and deeptech firms can use for AI model development. He emphasized that the lab's datasets, models, and scripts are open-source, allowing others to build upon them.

The collected data will power the “Ten Trillion Token” project, aimed at creating native Indic AI models that prioritize Indian languages rather than treating them as an afterthought. Khapra stressed the importance of gathering synthetic data that encapsulates linguistic and cultural nuances. The project has potential applications across various sectors, including agriculture, education, and digital finance.

AI4Bharat’s initiative aligns with a parallel effort by People+ai, an organization supported by Aadhaar architect Nandan Nilekani. People+ai is assembling 10 trillion language tokens from government documents and conversations to construct essential datasets for training foundational AI models. With English dominating internet content, AI4Bharat and People+ai seek to bridge the gap by developing robust datasets that capture India’s linguistic diversity, script variations, and grammatical structures.

This initiative builds upon AI4Bharat’s previous contributions, such as IndicVoices, an open-source speech dataset spanning 22 Indian languages. The dataset, funded by the Ministry of Electronics and IT’s Bhashini initiative and by non-profit organizations, underscores the lab’s commitment to democratizing AI for Indian languages.

 
