Tech Mahindra’s Project Indus – Revolutionizing AI for India’s Linguistic Diversity

Tech Mahindra, a prominent Indian IT company, has embarked on a venture known as Project Indus, aimed at creating a foundational model for Indic languages. The initiative seeks to address a limitation of existing large language models (LLMs), such as OpenAI’s GPT models, which are trained predominantly on English data. By developing an Indic LLM, Tech Mahindra aims to improve how well language models understand and generate content in Indic languages, which are spoken by a substantial portion of the global population.

The project’s initial phase will focus on supporting 40 Hindi dialects, with plans to expand to more languages and dialects in the future. The primary objective is to first develop an LLM for text continuation and then extend it to support dialogue. Once the model’s performance reaches a satisfactory level, Tech Mahindra intends to release it as open source, making it accessible to a wider audience.

Developing an Indic LLM carries several advantages for India. It places a strong emphasis on cultural sensitivity, so that generated content respects local customs and norms. It also promotes inclusivity by catering to non-English speakers in the country, thus democratizing AI and expanding its reach. The versatility of a foundational model such as an LLM enables it to perform a wide range of tasks, benefiting specialized industries such as healthcare, retail, and tourism.

However, building an effective AI model comes with the challenge of sourcing high-quality datasets. While English datasets are abundant, datasets for Indic languages and dialects are scarce. To overcome this hurdle, Tech Mahindra is collaborating with various stakeholders, including the Indian government, to create these datasets. The company is extracting text from diverse sources such as Common Crawl, newspapers, Wikipedia, and YouTube descriptions. It also encourages contributions from speakers of different dialects to help build these datasets.

Reducing bias in the training data is crucial to limiting bias in a language model’s outputs. Tech Mahindra has taken proactive measures during the data collection phase to prevent biases related to race, ethnicity, gender, and other factors. The company employs a combination of human annotation and automatic techniques to this end.

Tech Mahindra’s Project Indus has far-reaching implications. By improving AI’s proficiency in Indic languages, the initiative stands to give millions of people in India and beyond better access to AI-driven technologies and more culturally sensitive content.

Project Indus represents a significant step forward in the field of natural language processing, particularly for Indic languages. It promises to bridge the gap between English-centric AI models and the diverse linguistic landscape of India, ultimately empowering users and preserving the rich linguistic diversity of the country.

Disclaimer: The views expressed in this feature article are those of the author. This is not meant to be an advisory to purchase or invest in products, services or solutions of a particular type, or those promoted and sold by a particular company, their legal subsidiary in India or their channel partners. No warranty or any other liability is either expressed or implied.
Reproduction or copying, in part or in whole, is not permitted unless approved by the author.

