Krutrim: Disrupting AI giants with Indian data dominance
Krutrim, an artificial intelligence (AI) venture co-founded by Bhavish Aggarwal of Ola, has entered the increasingly competitive AI race dominated by players such as Google, Microsoft, and OpenAI.
However, what sets the firm apart from these players is that its models have been built with the largest representation of Indian data, which powers its generative AI (GenAI) applications across Indian languages.
Currently, the AI models known as large language models (LLMs) are trained primarily on English-language data. Given India’s multicultural and multilingual context, these models struggle to capture the richness of the country’s linguistic diversity. Experts argue that training on data sets specific to the country is crucial.
“This is a problem statement at the crossroads of knowledge and language,” said Ravi Jain, head of strategy at Krutrim, in an interview. “Our differentiation is driven by the data used in the training and the various languages we incorporate, including their richness and depth. This will define the quality of the output in terms of the applications we build, setting us apart from the (technology) giants operating in 180 countries.”
Krutrim, meaning “artificial” in Sanskrit, is a family of LLMs that includes Krutrim Base and Krutrim Pro. The latter boasts multimodal and larger knowledge capabilities, along with various technical advancements for inference. It is trained on over 2 trillion tokens, the chunks of text that a model reads or generates.
A team of computer scientists, based in Bengaluru and San Francisco, has trained this model, which will also power Krutrim’s conversational AI assistant capable of understanding and speaking multiple Indian languages fluently.
When asked about the source of the data, Jain mentioned that the first model the firm built had a significant representation of Indian data available in the public domain.
“Imagine all the Indian data in various languages on the web, including many PDFs (portable document format files). So, we have a substantial amount of publicly available data in different languages,” Jain explained. “As we progress, digitising non-digitised data, especially in many Indian languages, will become a crucial part of our journey. If we can incorporate them into the corpus and train the models, it will make a big difference.”
Last month, Krutrim became available for public beta testing.
The AI chatbot, similar to OpenAI’s ChatGPT, is accessible in two languages: English and Hindi.
“This is a starting point for us and our first-generation product. There is much more to come, and improvements will be significant as we build on this foundation,” remarked Ola founder Bhavish Aggarwal recently on X. Aggarwal emphasised that Krutrim is firmly rooted in Indian values and data, covering over 10 Indian languages, and is ready to assist in English, Hindi, Tamil, Bengali, Marathi, Kannada, Gujarati, and even Hinglish.
“While some ‘hallucinations’ may occur, they are much less prevalent in Indian contexts compared to other global platforms. We are working diligently to find and rectify them,” Aggarwal assured. Indeed, Krutrim recently provided incorrect responses to users’ queries. Screenshots shared on social media showed the chatbot incorrectly stating that the West Indies won the 1983 Cricket World Cup. It also erroneously asserted that Hillary Clinton won the 2014 US presidential elections, among other errors.
When asked about addressing such issues, Jain acknowledged that generative models can make mistakes as their insights are based on information in the public domain, which can have diverse views.
“As we go along, this (publicly available data) would become the most important part of our journey about how to digitise data that is not digitised yet. And that is the case with many Indian languages. If we can make them part of the corpus and train the models, that will make a big difference.” Ravi Jain, head of strategy, Krutrim