OpenSource For You

Using R to Mine and Analyse Popular Sentiments

Public opinion is important to enterprises, politicians, governments, film stars and most of us. Opinions can be found on various social media sites. This article demonstrates how popular sentiments can be mined from Twitter with the help of the R language.

- By: Dipankar Ray. The author is a member of IEEE and IET, and has more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using neural networks.

Opinion mining is an area of text mining where data analytics extracts opinions, emotions and sentiments from a corpus of text. That is why it is also known as sentiment analysis. One of the most common applications of opinion mining is to track attitudes and moods on the Web, especially to survey the status of products, services, brands or even the consumers. The main purpose is to know whether what is being surveyed is viewed positively or negatively by a given audience.

The R language supports the entire text mining paradigm through its various packages. Here, we will discuss only opinion mining on Twitter data, and only the packages that concern this topic.

To begin with, let’s consider the Sentiment package by Timothy Jurka. The package sentiment_0.1.tar.gz is available at https://cran.r-project.org/src/contrib/Archive/sentiment/. After downloading and extracting it, the package can be loaded from its source directory with load_all() from the devtools package.

The Sentiment package supports different tools to analyse a text corpus, and assigns a weightage to each sentence on the basis of the sentiment values of different emotional keywords. Since micro-bloggers use Twitter to express their thoughts in a casual way, it is a good place for sentiment analysis of different current topics. This information is gathered on the basis of the emotional keywords within the Twitter text. There are two functions that analyse the sentiment measure of a sentence.

1. The classify_emotion() function analyses some text and classifies it under different types of emotions. The predefined emotions are anger, disgust, fear, joy, sadness and surprise. Classifying text into these categories is done using one of the following two algorithms:
i. The Naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti’s emotions lexicon
ii. The voter procedure

2. The classify_polarity() function performs the classification on the basis of symbolic enumeration values. The symbolic values are defined as very positive, positive, neutral, negative and very negative. In this case also, there are two algorithms:
i. The Naive Bayes algorithm trained on Janyce Wiebe’s subjectivity lexicon
ii. The voter algorithm

Readers can select any one of them.

To discuss the technicalities of opinion mining with R, I have carried out an exercise using the micro-blogging site Twitter. As mentioned earlier, the reason is obvious: Twitter comments consist of just a little text and contain informal communication, so analysing their sentiment value is easier than for a formal text. For instance, a formal film review offers a well-designed analysis of the product, but the personal, casual emotions of the writer are absent.

Preparation

The preparation for this sentiment analysis involves two stages. The first is to install and load the packages required for Twitter’s JSON connectivity. After this, install and load the packages related to opinion mining.

Preparation for Twitter

Twitter connectivity and reading require the following three packages. Install them if they are not already present, then load the libraries.

install.packages("twitteR")
install.packages("ROAuth")
install.packages("httr")

library(twitteR)
library(ROAuth)
library(httr)

Preparation for text mining and sentiment analysis

For sentiment analysis, I have downloaded the sentiment package and loaded it as shown below. The remaining relevant packages are installed and loaded as discussed earlier.

install.packages("tm")
install.packages("plyr")
install.packages("wordcloud")
install.packages("RColorBrewer")
install.packages("stringr")
install.packages("openNLP")

library(tm)
library(plyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)

# devtools contains load_all(); the sentiment package requires tm and NLP
library(devtools)
load_all("sentiment")

Twitter connectivity

I have considered a predefined Twitter application account to access the Twitter space. Interested readers should create their own account to reproduce this opinion mining experiment. You may refer to the earlier article in the January 2018 issue of OSFY.

download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

cred <- OAuthFactory$new(consumerKey='HTgXiD3kqncGM93bxlBczTfhR',
    consumerSecret='djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe',
    requestURL='https://api.twitter.com/oauth/request_token',
    accessURL='https://api.twitter.com/oauth/access_token',
    authURL='https://api.twitter.com/oauth/authorize')

# enter the PIN from https://api.twitter.com/oauth/authorize
cred$handshake(cainfo="cacert.pem")

This handshaking protocol requires PIN verification, so enter the PIN shown on the logged-in Twitter screen. Saving the Twitter authentication data is a helpful step for future access to your Twitter account.

save(cred, file="twitter authentication.Rdata")

consumerKey <- 'HTgXiD3kqncGM93bxlBczTfhR'
consumerSecret <- 'djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe'
AccessToken <- '1371497582-xD5GxHnkpg8z6k0XqpnJZ3XvIyc1vVJGUsDXNWZ'
AccessTokenSecret <- 'Qm9tV2XvlOcwbrL2z4QktA3azydtgIYPqflZglJ3D4WQ3'

setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)

Analysis of text: Create a corpus

To perform a proper sentiment analysis, I have selected the popular ‘Women’s reservation’ topic, and analysed the Twitter corpus to study the sentiments of all the participants with a Twitter account. To restrict the search to a locality within India, I have set the geocode to Latitude 21.146633 and Longitude 79.088860, within a radius of 1000 miles (Figure 1).

In total, 1500 tweets with the given keyword have been selected using the function searchTwitter().

# read tweets
read_tweets = searchTwitter("womens+reservation", n=1500, lang="en", geocode='21.14,79.08,1000mi')

The + sign within the search string indicates a search for tweets containing both the strings.

# extract the text from tweets
twitter_txt = sapply(read_tweets, function(x) x$getText())

Since Twitter data often contains special characters along with different graphical icons, all non-alphabetic content must be filtered out of the tweets. This filtering is done by removing retweets, @-mentions, punctuation, numbers, HTML links, extra white space and all the ‘crazy’ characters. To make the text uniform, everything is converted to lower case. Finally, all the NAs are removed and the names attribute is set to NULL.

# remove @-mentions
twitter_txt = gsub("@\\w+", "", twitter_txt)

# convert "crazy" characters to ASCII
twitter_txt = iconv(twitter_txt, to = "ASCII//TRANSLIT")

# create a corpus from the Twitter data
docsCorpus <- Corpus(VectorSource(twitter_txt))
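The text above also mentions stripping retweet markers and HTML links, which the listing does not show. The lines below are a minimal sketch in base R, assuming the usual “RT @user:” prefix and http/https links; the sample tweets and regular expressions are my own illustrations, not from the original listing.

```r
# hypothetical sample tweets (assumed data, for illustration)
twitter_txt <- c("RT @user: great initiative http://t.co/abc123",
                 "a plain tweet with no links")

# strip leading retweet markers such as "RT @user:" (assumed pattern)
twitter_txt <- gsub("^RT @\\w+:\\s*", "", twitter_txt)

# strip http/https links (assumed pattern)
twitter_txt <- gsub("http\\S+", "", twitter_txt)
```

Such gsub() clean-ups are best done before building the tm corpus, so that the later tm_map() steps work on plain words only.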

# remove all punctuation
docsNoPunc <- tm_map(docsCorpus, removePunctuation)

# remove all numbers
docsNoNum <- tm_map(docsNoPunc, removeNumbers)

# convert all to lower case
docsLower <- tm_map(docsNoNum, tolower)

# remove stop words
docsNoWords <- tm_map(docsLower, removeWords, stopwords("english"))

# remove extra white space
docsNoSpace <- tm_map(docsNoWords, stripWhitespace)

# convert the corpus back to a character vector
docsList = lapply(docsNoSpace, as.character)
twitter_txt = as.character(docsList)

# set the names attribute to NULL
names(twitter_txt) = NULL

The next task is to classify the Twitter corpus into different classes. For this, I have used the Bayesian algorithm with a prior equal to 1. The emotion classifier evaluates each sentence with respect to the six sentiment measures, and finally assigns a classification measure to it. All unclassified sentences are categorised as NA and are finally marked as ‘unknown’.

# classify emotions
class_emo = classify_emotion(twitter_txt, algorithm="bayes", prior=1.0)

# get emotion best fit
> head(class_emo, 1)
     ANGER              DISGUST            FEAR
[1,] "1.46871776464786" "3.09234031207392" "2.06783599555953"
     JOY                SADNESS           SURPRISE           BEST_FIT
[1,] "1.02547755260094" "1.7277074477352" "2.78695866252273" NA

emotion = class_emo[,7]

# substitute NAs with "unknown"
emotion[is.na(emotion)] = "unknown"
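To see what such an emotion classifier is doing conceptually, here is a toy keyword-count version with an assumed two-emotion lexicon. It only illustrates the idea of scoring sentences against per-emotion word lists and falling back to NA for unclassified text; it is not the Naive Bayes model or the lexicon that the sentiment package actually uses.

```r
# toy emotion lexicon (assumed words, for illustration only)
lexicon <- list(
  joy   = c("happy", "great", "wonderful"),
  anger = c("furious", "hate", "angry")
)

# score a sentence by counting lexicon hits per emotion, then pick the best fit
classify_toy <- function(sentence, lexicon) {
  words <- tolower(strsplit(sentence, "\\s+")[[1]])
  scores <- sapply(lexicon, function(kw) sum(words %in% kw))
  if (all(scores == 0)) return(NA)   # unclassified, like the package's NA
  names(which.max(scores))
}

classify_toy("what a happy and wonderful day", lexicon)   # "joy"
classify_toy("nothing emotional here", lexicon)           # NA
```

The real classifier weights words probabilistically instead of counting hits, which is why class_emo holds fractional scores rather than integers.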

Classification of sentences also requires the polarity measure of each sentence on the basis of its sentiment values. There are three polarity measures: neutral, positive and negative. The classify_polarity() function assigns one of these three measures to each sentence, using the Bayesian algorithm to calculate the final sentiment polarity of each sentence of the corpus.

# classify polarity
class_pol = classify_polarity(twitter_txt, algorithm="bayes")

> head(class_pol, 1)
     POS                NEG                POS/NEG           BEST_FIT
[1,] "24.2844130953411" "18.5054868578024" "1.3122817725329" "neutral"

# get polarity best fit
polarity = class_pol[,4]

For the final statistical analysis of these sentiment values, convert the corpus into R’s most convenient data structure, a data frame, containing both the emotion and polarity values of the Twitter corpus.

sent_df = data.frame(text=twitter_txt, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)

# sort data frame
sent_df = within(sent_df, emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

A histogram of the distribution of emotions is helpful to get a perceptual idea of the sentiment distribution within the Twitter corpus. Figure 2 demonstrates this distribution plot, drawn with the ggplot() function.

# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
  geom_bar(aes(y=..count.., fill=emotion)) +
  scale_fill_brewer(palette="Dark2") +
  xlab("emotion categories") +
  ylab("number of tweets") +
  ggtitle("Sentiment Analysis of tweets about\n Women's Reservation \n(classification by emotion)") +
  theme(text = element_text(size = 12, family = "Impact"))

Similar to the emotion distribution, we can obtain a polarity distribution of the corpus by plotting a histogram of the polarity values of the data frame sent_df (Figure 3).

# plot distribution of polarity
ggplot(sent_df, aes(x=polarity)) +
  geom_bar(aes(y=..count.., fill=polarity)) +
  scale_fill_brewer(palette="RdGy") +
  xlab("polarity categories") +
  ylab("number of tweets") +
  ggtitle("Sentiment Analysis of tweets about \n Women's Reservation \n(classification by polarity)") +
  theme(text = element_text(size = 12, family = "Impact"))

Both frequency distribution histograms are quite useful graphical depictions of the sentiments of 1500 tweets spread over the Indian subcontinent. As expected, the largest number of tweets is categorised as unclassified, but the remaining values give a clear idea of individuals’ feelings about the subject of debate. The analysis is based on some predefined emotion categories, and the vocabulary of the test database is also limited, so there is ample opportunity for improvement.
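As a quick numeric complement to the histograms, the same distributions can be tabulated directly with base R’s table() and prop.table(). The sketch below uses a small hypothetical stand-in for sent_df, since the actual tweet counts will differ on every run.

```r
# hypothetical stand-in for sent_df (assumed values, for illustration)
sent_df <- data.frame(
  polarity = c("positive", "neutral", "neutral", "negative", "positive"),
  stringsAsFactors = FALSE
)

# absolute counts per polarity class
print(table(sent_df$polarity))

# percentage share per polarity class
print(round(100 * prop.table(table(sent_df$polarity)), 1))
```

The same two calls on sent_df$emotion give the numeric counterpart of Figure 2.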

Figure 1: Twitter keyword search area
Figure 2: Histogram plot of the distribution of emotions
Figure 3: Histogram plot of the distribution of polarity
