Using R to Mine and Analyse Popular Sentiments

Public opinion is important to enterprises, politicians, governments, film stars and most of us. Opinions can be found on various social media sites. This article demonstrates how popular sentiments can be mined from Twitter with the help of the R program.

By: Dipankar Ray. The author is a member of IEEE and IET, and has more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network.

Opinion mining is an area of text mining where data analytics extracts opinions, emotions and sentiments from a corpus of text. That is why it is also known as sentiment analysis. One of the most common applications of opinion mining is to track the attitudes and moods on the Web, especially to survey the status of products, services, brands or even the consumers. The main purpose is to know whether what is being surveyed is viewed positively or negatively by a given audience.

The R language supports the entire text mining paradigm with its various packages. We will discuss only opinion mining on Twitter data here, and only through the packages which concern this topic.

To begin with, let's consider the Sentiment package of Timothy Jurka. The package sentiment_0.1.tar.gz is available at https://cran.r-project.org/src/contrib/Archive/sentiment/. After downloading it, the package can be loaded from its source directory with the devtools function load_all().
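Since the package lives only in the CRAN archive, it cannot be fetched with a plain install.packages() call by name. The following is a small sketch of two possible ways to obtain it, assuming the tarball has been downloaded to the working directory and that its dependencies (the article later notes tm and NLP) are already installed:

# a sketch, not the author's exact steps: install the downloaded tarball locally
install.packages("sentiment_0.1.tar.gz", repos = NULL, type = "source")

# alternatively, install straight from the CRAN archive URL with devtools
library(devtools)
install_url("https://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.1.tar.gz")

The article itself unpacks the tarball and loads it from source with load_all("sentiment"), as shown in the preparation section below.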

The Sentiment package supports different tools to analyse the text corpus, and assigns a weightage to each sentence on the basis of the sentiment values of different emotional keywords. Since microbloggers use Twitter to express their thoughts in a casual way, it is a good place for sentiment analysis on different current topics. This information is gathered on the basis of different emotional keywords within the Twitter text. There are two functions that analyse the sentiment measure of a sentence.

1. The function classify_emotion() helps us to analyse some text and classify it under different types of emotions. The predefined emotions are anger, disgust, fear, joy, sadness, and surprise. Classifying text into these categories is done using one of the following two algorithms: i. the naive Bayes classifier trained on Carlo Strapparava and Alessandro Valitutti's emotions lexicon; ii. the voter procedure.

2. The classify_polarity() function performs the classification on the basis of some symbolic enumeration values. The symbolic values are defined as very positive, positive, neutral, negative, and very negative. In this case also, there are two algorithms: i. the naive Bayes algorithm trained on Janyce Wiebe's subjectivity lexicon; ii. the voter algorithm.

Readers can select any one of them; a minimal illustration of both functions is sketched below.
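The following sketch assumes the sentiment package has already been loaded (as shown later in the preparation section); the sample sentences are only illustrative and are not part of the original exercise:

# a minimal sketch, assuming the sentiment package is loaded
sample_txt <- c("I am so happy and excited today", "This is terrible and disappointing")

# returns a matrix with columns ANGER, DISGUST, FEAR, JOY, SADNESS, SURPRISE and BEST_FIT
classify_emotion(sample_txt, algorithm = "bayes", prior = 1.0)

# returns a matrix with columns POS, NEG, POS/NEG and BEST_FIT
classify_polarity(sample_txt, algorithm = "bayes")

# the voter procedure can be selected instead of the naive Bayes classifier
classify_emotion(sample_txt, algorithm = "voter")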

To discuss the technicalities of opinion mining with R, I have carried out an exercise using the micro-blogging site Twitter. As mentioned earlier, the reason is obvious: Twitter comments are made up of just a little text and contain informal communication. Analysis of their sentimental value is easier than that of a formal text. For instance, if we consider a formal film review, we will find a well-designed analysis of the product, but the personal, casual emotions of the writer will be absent.

Preparation

The preparation for this sentiment analysis involves two stages. The first is to install and load all the packages required for Twitter JSON connectivity. After this, install and load the packages related to opinion mining.

Preparation for Twitter

Twitter connectivity and reading require the following three packages. If they are not already installed, install them. Then load the libraries, once the packages have been installed properly.

install.packages("twitteR")
install.packages("ROAuth")
install.packages("httr")

library(twitteR)
library(ROAuth)
library(httr)

Preparation for text mining and sentiment analysis

For sentiment analysis, I have downloaded the sentiment package and loaded it as shown below. The remaining relevant packages are installed and loaded as discussed earlier.

library(devtools)      # devtools contains load_all()
load_all("sentiment")  # this package requires tm and NLP

install.packages("tm")
install.packages("plyr")
install.packages("wordcloud")
install.packages("RColorBrewer")
install.packages("stringr")
install.packages("openNLP")

library(tm)
library(plyr)

library(ggplot2)
library(wordcloud)
library(RColorBrewer)
library(sentiment)

Twitter connectivity

I have used a predefined Twitter application account to access the Twitter space. Interested readers should create their own account to repeat this opinion mining experiment. You may refer to the earlier article in the OSFY January 2018 issue.

download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

cred <- OAuthFactory$new(consumerKey='HTgXiD3kqncGM93bxlBczTfhR',
    consumerSecret='djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe',
    requestURL='https://api.twitter.com/oauth/request_token',
    accessURL='https://api.twitter.com/oauth/access_token',
    authURL='https://api.twitter.com/oauth/authorize')

# enter the PIN from https://api.twitter.com/oauth/authorize
cred$handshake(cainfo="cacert.pem")

This handshaking protocol requires PIN verification, so enter the PIN shown on the Twitter logged-in screen. Saving the Twitter authentication data is a helpful step for future access to your Twitter account.

save(cred, file="twitter authentication.Rdata")
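In a later session, the saved credentials can be restored with base R's load(); this is only a sketch of the idea, not a step from the original listing, and it assumes the same working directory:

# restore the saved cred object in a future session
load("twitter authentication.Rdata")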

consumerKey='HTgXiD3kqncGM93bxlBczTfhR'
consumerSecret='djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe'

AccessToken <- '1371497582-xD5GxHnkpg8z6k0XqpnJZ3XvIyc1vVJGUsDXNWZ'
AccessTokenSecret <- 'Qm9tV2XvlOcwbrL2z4QktA3azydtgIYPqflZglJ3D4WQ3'

setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)

Analysis of text: Create a corpus

To perform a proper sentiment analysis, I have selected the popular 'Women's reservation' topic, and analysed the Twitter corpus to study the sentiments of all the participants with a Twitter account. To have a locality search within India, I have set the geocode at Latitude 21.146633 and Longitude 79.088860, within a radius of 1000 miles (Figure 1).

In total, 1500 tweets with the given keyword have been selected using the function searchTwitter().

# read tweets containing the keyword
read_tweets = searchTwitter("womens+reservation", n=1500, lang="en", geocode='21.14,79.08,1000mi')

The + sign within the search string indicates a search for tweets containing both the strings.

# extract the text from the tweets
twitter_txt = sapply(read_tweets, function(x) x$getText())

Since Twitter data often contains special characters along with different graphical icons, it is necessary to filter out all non-alphabetic content from the tweets. This filtering has been done by removing retweets, @ mentions, punctuation, numbers, HTML links, extra white spaces and all the 'crazy' characters. To make all the text uniform, everything has been converted to lower case letters. Finally, all the NAs have been removed and the name attributes set to NULL.
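The retweet, HTML-link and NA removal steps mentioned above are not shown explicitly in the listing that follows; a minimal sketch of how they could be done is given here (the regular expressions are an assumption, not the author's original code):

# drop leading retweet markers such as "RT @user:" (pattern is an assumption)
twitter_txt = gsub("^RT\\s+@\\w+:?\\s*", "", twitter_txt)
# remove HTML links
twitter_txt = gsub("http[^[:space:]]*", "", twitter_txt)
# drop any NA entries, if present
twitter_txt = twitter_txt[!is.na(twitter_txt)]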

# remove @people mentions
twitter_txt = gsub("@\\w+", "", twitter_txt)

# "crazy" characters: just convert them to ASCII
twitter_txt = iconv(twitter_txt, to = "ASCII//TRANSLIT")

# create a corpus from the Twitter data
docsCorpus <- Corpus(VectorSource(twitter_txt))

# remove all punctuation
docsNoPunc <- tm_map(docsCorpus, removePunctuation)

# remove all numbers
docsNoNum <- tm_map(docsNoPunc, removeNumbers)

# convert all to lower case
docsLower <- tm_map(docsNoNum, tolower)

# remove stop words
docsNoWords <- tm_map(docsLower, removeWords, stopwords("english"))

# remove extra white space
docsNoSpace <- tm_map(docsNoWords, stripWhitespace)

# convert the corpus back to text
docsList = lapply(docsNoSpace, as.character)
twitter_txt = as.character(docsList)

# set the name attributes to NULL
names(twitter_txt) = NULL

The next task is to classify the Twitter corpus into different classes. For this, I have used the Bayesian algorithm with the prior equal to 1. In this case, the emotion classifier will try to evaluate each sentence with respect to its six sentiment measures, and will finally assign a classification measure to each sentence. All the unclassified sentences are categorised as NA and are finally marked as 'unknown'.

# classify emotion
class_emo = classify_emotion(twitter_txt, algorithm="bayes", prior=1.0)

# get emotion best fit
> head(class_emo, 1)
     ANGER              DISGUST            FEAR
[1,] "1.46871776464786" "3.09234031207392" "2.06783599555953"
     JOY                SADNESS            SURPRISE           BEST_FIT
[1,] "1.02547755260094" "1.7277074477352"  "2.78695866252273" NA

emotion = class_emo[,7]

# substitute NAs with "unknown"
emotion[is.na(emotion)] = "unknown"

Classification of sentences requires the polarity measure of each sentence on the basis of its sentiment values. There are three sentiment measures: neutral, positive and negative. The polarity measure function classify_polarity() assigns one of these three sentiment measures to each sentence. This function uses the Bayesian algorithm to calculate the final sentiment polarity of each sentence of the corpus.

# classify polarity
class_pol = classify_polarity(twitter_txt, algorithm="bayes")

> head(class_pol, 1)
     POS                NEG                POS/NEG           BEST_FIT
[1,] "24.2844130953411" "18.5054868578024" "1.3122817725329" "neutral"

# get polarity best fit
polarity = class_pol[,4]

For the final statistical analysis of these sentiment values, convert the corpus into the most convenient data structure of R, i.e., a data frame. This contains both the emotion and polarity values of the Twitter corpus.

sent_df = data.frame(text=twitter_txt, emotion=emotion, polarity=polarity, stringsAsFactors=FALSE)

# sort data frame
sent_df = within(sent_df,
    emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))

A histogram plot of the distribution of emotions is helpful to get a perceptual idea of the sentiment distribution within the Twitter corpus. Figure 2 demonstrates this distribution plot, drawn using the ggplot() function.

# plot distribution of emotions
ggplot(sent_df, aes(x=emotion)) +
    geom_bar(aes(y=..count.., fill=emotion)) +
    scale_fill_brewer(palette="Dark2") +
    xlab("emotion categories") + ylab("number of tweets") +
    ggtitle("Sentiment Analysis of tweets about\n Women's Reservation \n(classification by emotion)") +
    theme(text = element_text(size = 12, family="Impact"))

Similar to the emotion distribution, we may obtain a polarity distribution of the corpus by plotting a histogram of the polarity values of the data frame sent_df (Figure 3).

ggplot(sent_df, aes(x=polarity)) +
    geom_bar(aes(y=..count.., fill=polarity)) +
    scale_fill_brewer(palette="RdGy") +
    xlab("polarity categories") + ylab("number of tweets") +
    ggtitle("Sentiment Analysis of tweets about \n Women's Reservation \n(classification by polarity)") +
    theme(text = element_text(size = 12, family = "Impact"))
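The wordcloud and RColorBrewer packages loaded during preparation are not used in the listings above. As a possible extension, the cleaned tweets can be grouped by emotion category and visualised as a comparison word cloud; the following is only a sketch under that assumption, not part of the original exercise:

# a sketch: build one document per emotion category and plot a comparison cloud
emos = levels(factor(sent_df$emotion))
emo_docs = sapply(emos, function(e)
    paste(sent_df$text[sent_df$emotion == e], collapse=" "))

emo_corpus = Corpus(VectorSource(emo_docs))
tdm = as.matrix(TermDocumentMatrix(emo_corpus))
colnames(tdm) = emos

comparison.cloud(tdm, colors=brewer.pal(max(3, length(emos)), "Dark2"),
                 random.order=FALSE, title.size=1.5)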

It can be seen that both these frequency distribution histograms are quite useful graphical depictions of the sentiments of 1500 tweets spread over the Indian subcontinent. As expected, the largest number of tweets fall in the unclassified ('unknown') category, but the remaining values are quite helpful for getting a clear idea of individuals' feelings about the subject of debate. The analysis is based on some predefined emotion categories and the vocabulary of the test database is also limited, so there are ample opportunities for improvement.

Figure 1: Twitter keyword search area

Figure 2: Histogram plot of the distribution of emotions

Figure 3: Histogram plot of the distribution of polarity
