OpenSource For You

CodeSport

In this month’s column, we discuss the problem of content moderation and compliance checking using NLP techniques.


We have been discussing the machine reading comprehension task over the last couple of months. This month, we take a break from that discussion and focus on a real-life problem, which NLP (natural language processing) can help solve. Let us start with a question to our readers. We all know what the coolest job in information technology is these days. As Hal Varian, Google's chief economist, remarked a few years back, it is the job of the data scientist (https://hbr.org/2012/10/data-scientist-the-sexiestjob-of-the-21st-century). But do you know what the worst technology job is? Well, it is that of the content moderators on social media sites such as Facebook or YouTube. Their job is to constantly sift through the user generated content (UGC) getting posted on the websites and filter out content that is abusive.

Content moderation requires analysing a wide variety of user generated content – blogs, emails on community forums, news/articles posted on social media sites, tweets, videos, photos, and even online games. Content moderators need to identify unsuitable/abusive content and ensure that it is taken down quickly.

Content moderation gained a lot of public attention last year, when a user posted a live video of a killing on Facebook. Sites such as YouTube and Facebook employ a large number of human content moderators whose job is to ensure that abusive/illegal content is blocked from public viewing. This includes filtering out pornographic material, violent visuals or language, exploitative images of minors, the soliciting of sexual favours, racist comments, etc, from the text, video or audio tracks posted on the Internet. However, performing this task leads to enormous stress and burnout among the human content moderators. There have even been cases of post-traumatic stress disorder (PTSD) among people working in this space. In addition, as the volume of UGC on the Internet increases exponentially, human moderation cannot scale and often becomes error prone.

There are two basic types of content moderation – reactive and proactive. In reactive content moderation, the filtering happens offline, in the sense that moderators scan content after it has been posted and decide whether it is acceptable or not. In proactive content moderation, as soon as the content is submitted, it is analysed for any objectionable content in real time, before it gets posted.
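To make the distinction concrete, here is a minimal Python sketch of the two modes. The is_acceptable() check is a hypothetical placeholder for whatever classifier a site actually uses; it is not any real site's API.

# A minimal sketch of the two moderation modes, assuming a
# hypothetical is_acceptable() classifier.
def is_acceptable(text):
    # Placeholder check -- stands in for a real, trained moderation model.
    return "offensive-example" not in text.lower()

def submit_proactive(text, published):
    # Proactive: analyse content in real time, before it ever goes public.
    if not is_acceptable(text):
        return False  # rejected at submission time
    published.append(text)
    return True

def sweep_reactive(published):
    # Reactive: content is already live; scan it and take down violations.
    return [post for post in published if is_acceptable(post)]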

Given the typical need for real-time filtering of objectionable content on large social media sites, human moderation cannot prevent objectionable content from reaching public view in time. Due to these issues with human content moderation, there has been a trend towards automated approaches to online content moderation. Large Internet sites such as Facebook and YouTube have invested heavily in developing machine learning/AI based tools for automatic content moderation. While content moderation is applicable to multiple media such as video, text and speech, in this column, we focus on the problem of content moderation for text.

Whatever the form of the content, we first need to understand what makes this problem challenging.

Let us first consider text. The obvious approach is to create a lexicon of words associated with abusive, hateful and objectionable text. Given this lexicon, it is straightforward to flag objectionable content. Yet, why doesn't this approach work? There are a number of reasons. First and foremost, people who create and post objectionable content always look for ways to circumvent the moderation scheme. For instance, if your objectionable content lexicon contains the word 'bullshit', the submitter can write 'bulls**t' to fool it. While such simple character substitutions can be caught by regular-expression based matching, people also circumvent moderation by substituting an innocuous-sounding word for the objectionable one (for instance, using 'grape' in place of 'rape'). Hence simple lexicon based systems are easily circumvented by intelligent workarounds, as the sketch below illustrates.
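Here is a minimal Python sketch of a lexicon based filter and the obfuscation tricks described above. The two-word lexicon and the wildcard handling are illustrative assumptions, not a production word list or any real site's pipeline.

import re

# Illustrative two-word lexicon; a real list would be far larger.
LEXICON = {"bullshit", "rape"}

def flag_naive(text):
    # Exact word lookup against the lexicon.
    return any(w in LEXICON for w in re.findall(r"\w+", text.lower()))

def flag_with_wildcards(text):
    # Treat masking characters such as '*' as single-character wildcards,
    # so that 'bulls**t' still matches the lexicon entry 'bullshit'.
    for token in re.findall(r"[\w*]+", text.lower()):
        pattern = re.escape(token).replace(r"\*", ".")
        if any(re.fullmatch(pattern, word) for word in LEXICON):
            return True
    return False

print(flag_naive("bulls**t"))           # False: obfuscation evades exact lookup
print(flag_with_wildcards("bulls**t"))  # True: wildcard matching catches it
print(flag_with_wildcards("grape"))     # False: the innocuous stand-in word
                                        # ('grape' for 'rape') still slips through

Note that no amount of character-level normalisation helps with the last case: catching a semantic substitution like 'grape' for 'rape' requires understanding the surrounding context, which is precisely where NLP techniques come in.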

Sandya Mannarswamy
