Tech Mahindra, a prominent Indian IT company, has launched Project Indus, an initiative to build a foundational model for Indian languages. The project seeks to address a limitation of existing large language models (LLMs), such as OpenAI’s GPT models, which are trained predominantly on English datasets. By developing an Indic LLM, Tech Mahindra aims to improve how well language models understand and generate content in Indic languages, which are spoken by a substantial portion of the global population.
The project’s initial phase will focus on supporting 40 different Hindi dialects, with plans to expand to more languages and dialects in the future. The primary objective is to first develop an LLM for text continuation and then extend it to dialogue. Once the model’s performance reaches a satisfactory level, Tech Mahindra intends to release it as open source, making it accessible to a wider audience.
Developing an Indic LLM carries several advantages for India. It places a strong emphasis on cultural sensitivity, so that generated content respects local customs and norms. It also promotes inclusivity by catering to non-English speakers in the country, democratizing AI and expanding its reach. Because a foundational model such as an LLM can perform a wide range of tasks, it can also serve specialized industries such as healthcare, retail, and tourism.
However, building an effective AI model depends on sourcing high-quality datasets. While English datasets are abundant, datasets for Indic languages and dialects are scarce. To overcome this hurdle, Tech Mahindra is collaborating with various stakeholders, including the Indian government, to create these datasets. The company is extracting text from diverse sources such as Common Crawl, newspapers, Wikipedia, and YouTube descriptions, and it encourages speakers of different dialects to contribute material.
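To make the data-sourcing step more concrete, here is a minimal sketch of one common way to pull Hindi-script text out of a mixed web crawl: keep only lines whose letters fall mostly in the Unicode Devanagari block (U+0900 to U+097F) and drop exact duplicates. This is an illustration under stated assumptions, not Tech Mahindra’s actual pipeline; the function names, the 0.8 ratio threshold, and the 10-character minimum are all hypothetical choices.

```python
import unicodedata

# Unicode range of the Devanagari block, the script used to write Hindi.
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F


def devanagari_ratio(text: str) -> float:
    """Return the fraction of letters and combining marks that are Devanagari."""
    # Only count letters (L*) and marks (M*); ignore spaces, digits, punctuation.
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END for ch in relevant)
    return hits / len(relevant)


def filter_lines(lines, min_ratio=0.8, min_chars=10):
    """Keep unique lines that are long enough and mostly Devanagari.

    min_ratio and min_chars are illustrative thresholds, not tuned values.
    """
    seen = set()
    kept = []
    for line in lines:
        # Normalize so visually identical strings compare equal.
        normalized = unicodedata.normalize("NFC", line).strip()
        if len(normalized) < min_chars or normalized in seen:
            continue
        if devanagari_ratio(normalized) >= min_ratio:
            seen.add(normalized)
            kept.append(normalized)
    return kept


if __name__ == "__main__":
    sample = [
        "यह एक परीक्षण वाक्य है।",
        "This is an English sentence.",
        "यह एक परीक्षण वाक्य है।",  # exact duplicate, dropped
    ]
    for line in filter_lines(sample):
        print(line)
```

A script-based filter like this cannot tell Hindi dialects apart from one another, or from other languages written in Devanagari such as Marathi or Nepali, which is one reason contributions from native speakers remain important.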
Biased training data leads to biased model outputs, so addressing bias at the dataset level is crucial. Tech Mahindra has taken measures during the data collection phase to prevent biases related to race, ethnicity, gender, and other factors, using a combination of human annotation and automatic techniques.
If it succeeds, Project Indus stands to benefit millions of people in India and beyond by giving them better access to AI-driven technologies and more culturally sensitive content.
The project represents a significant step forward in natural language processing for Indic languages. It promises to bridge the gap between English-centric AI models and the diverse linguistic landscape of India, empowering users and helping to preserve the country’s rich linguistic diversity.