Deccan Chronicle

AI to learn local Tenglish to fit in

- NAVEENA GHANATE | DC

In multilingual societies like India, people tend to mix languages. Urban Indian speakers communicate in Hinglish or Tenglish (blends of Hindi or Telugu with English), as in “Sui Dhaaga movie lo hero evaru” (“Who is the hero in the movie Sui Dhaaga?”).

This tendency to mix languages carries over into conversations on digital and social media platforms. While humans can make sense of such text easily, for artificial intelligence (AI) it is a Herculean task because several languages are combined in a single sentence. Researchers from IIIT-Hyderabad, Microsoft and Carnegie Mellon University have developed a system for such code-mixed language, called WebShodh.

Prof. Manish Shrivastava of IIIT-Hyderabad said, “It is a natural evolution of languages to borrow words and the speaking patterns have phrases from other languages, often noticed among urban Indian language users. We see that in every region of India, specially cities. We switch between our native language and English typically, largely because our education medium is English. Often two languages are mixed; like in Telangana we mix Deccani and Telugu. This is challenging and an important research problem we need to tackle.”

Rather than building an AI that handles only perfect Hindi, researchers are developing one that handles Hinglish, because the need exists now and will persist in the future. Code-mixed phrases and sentences don’t have a clearly defined structure; they are more free-flowing and casual, relying on the common heritage of the two speakers.

Since there is no dictionary for Hinglish or Tenglish, the data was collected from people's social media conversations, where such languages are used more frequently. Recreating this through algorithms is a challenge. Prof. Shrivastava said, “Natural language processing depends largely on availability of annotated data which has been marked by experts. For Indian languages, this annotated data is difficult to find. With code-mix, the problem is even more as it cannot be brought from formal sources. We have to go through social media and other open platforms where comments are public.”

Around 3-4 lakh datasets were created for Hinglish and Tenglish. Dr Manoj Chinnakotla of Microsoft, who is also visiting faculty at IIITH, said, “Today’s multilingual societies require software which supports interaction in code-mix languages. WebShodh is a testament that despite severe constraint on resources, AI based systems could still be built for code-mix languages. WebShodh currently uses very few resources such as bi-lingual dictionaries. Through more online user interactions, WebShodh has the potential to collect more user data using which the system could be re-trained to further boost accuracy of results.”
