Ex­plore Twit­ter Data Us­ing R

As of August 2017, Twitter had 328 million active users, with 500 million tweets being sent every day. Let’s look at how the open source R programming language can be used to analyse the tremendous amount of data created by this very popular social media platform.

OpenSource For You. By: Dipankar Ray. The author is a member of IEEE and IET, with more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network and diff

So­cial net­work­ing web­sites are ideal sources of Big Data, which has many ap­pli­ca­tions in the real world. These sites con­tain both struc­tured and un­struc­tured data, and are per­fect plat­forms for data min­ing and sub­se­quent knowl­edge dis­cov­ery from the source. Twit­ter is a pop­u­lar source of text data for data min­ing. Huge vol­umes of Twit­ter data con­tain many va­ri­eties of top­ics, which can be an­a­lysed to study the trends of dif­fer­ent cur­rent sub­jects, like mar­ket eco­nomics or a wide va­ri­ety of so­cial is­sues. Ac­cess­ing Twit­ter data is easy as open APIs are avail­able to trans­fer and ar­range data in JSON and ATOM for­mats.

In this article, we will look at an R programming implementation for Twitter data analysis and visualisation. This will give readers an idea of how to use R to analyse Big Data. As a micro-blogging network for the exchange and sharing of short public messages, Twitter provides a rich repository of hyperlinks, multimedia and hashtags that depict the contemporary social scenario in a given geographical location. From the originating tweets and the responses to them, as well as the retweets by other users, it is possible to implement opinion mining over a subject of interest in a geopolitical location. By analysing the favourite counts and the popularity of users as measured by their followers’ count, it is also possible to make a weighted statistical analysis of the data.

Start your ex­plo­ration

Exploring Twitter data using R requires some preparation. First, you need to have a Twitter account. Using that account, register an application under your Twitter account at the https://apps.twitter.com/ site. The registration process requires basic personal information and produces four keys that connect the R application to the Twitter application. For example, an application myapptwitterR1 may be created as shown in Figure 1.

In turn, this will create your ap­pli­ca­tion set­tings, as shown in Fig­ure 2.

A consumer key, a consumer secret, an access token and the access token secret together form the final authentication, which is done using the setup_twitter_oauth() function.

>setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)

It is also necessary to create an object to save the authentication for future use. This is done by OAuthFactory$new(), with the values passed as named arguments, as follows:

credential <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=requestURL, accessURL=accessURL, authURL=authURL)

Fig­ure 1: Twit­ter ap­pli­ca­tion set­tings

Here, re­questURL, ac­cessURL and au­thURL are avail­able from the ap­pli­ca­tion set­ting of https://apps.twit­ter.com/.

Con­nect to Twit­ter

This exercise requires a few R packages that provide the Twitter-related functions. Here is an R script to start the Twitter data analysis task. To access Twitter data through the just created application myapptwitterR, one needs the twitteR, ROAuth and httr packages.

>setwd('d:\\r\\twitter')
>install.packages("twitteR")
>install.packages("ROAuth")
>install.packages("httr")
>library("twitteR")
>library("ROAuth")
>library("httr")

To test this on the MS Windows platform, download the cURL CA certificate bundle into the current working directory, as follows:

>download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

Before the final connectivity to the Twitter application, save all the necessary key values and OAuth URLs to suitable variables (the access token and its secret are stored in AccessToken and AccessTokenSecret in the same way):

>consumerKey='HTgXiD3kqncGM93bxlBczTfhR'
>consumerSecret='djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe'
>requestURL='https://api.twitter.com/oauth/request_token'
>accessURL='https://api.twitter.com/oauth/access_token'
>authURL='https://api.twitter.com/oauth/authorize'

Figure 2: Histogram of created time tag

With these preparations, one can now create the required connectivity object:

>cred <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=requestURL, accessURL=accessURL, authURL=authURL)
>cred$handshake(cainfo="cacert.pem")

Authentication to a Twitter application is done by the function setup_twitter_oauth() with the stored key values as:

>setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)
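Hard-coding keys in a script makes them easy to leak. As a minimal sketch, the four values could instead be read from environment variables before authenticating. The helper read_twitter_keys() and the environment variable names below are hypothetical and not part of twitteR; only setup_twitter_oauth() comes from the package, and it is called only if the keys are present and the package is installed.

```r
# Hypothetical helper: collect the four keys from environment variables.
# The variable names are assumptions chosen for this sketch.
read_twitter_keys <- function() {
  keys <- c(
    consumerKey       = Sys.getenv("TWITTER_CONSUMER_KEY"),
    consumerSecret    = Sys.getenv("TWITTER_CONSUMER_SECRET"),
    AccessToken       = Sys.getenv("TWITTER_ACCESS_TOKEN"),
    AccessTokenSecret = Sys.getenv("TWITTER_ACCESS_TOKEN_SECRET")
  )
  absent <- names(keys)[!nzchar(keys)]
  if (length(absent) > 0) {
    stop("Missing Twitter credentials: ", paste(absent, collapse = ", "))
  }
  keys
}

# Authenticate only when all keys are set and twitteR is available.
keys <- tryCatch(read_twitter_keys(), error = function(e) NULL)
if (!is.null(keys) && requireNamespace("twitteR", quietly = TRUE)) {
  twitteR::setup_twitter_oauth(keys[["consumerKey"]], keys[["consumerSecret"]],
                               keys[["AccessToken"]], keys[["AccessTokenSecret"]])
}
```

The environment variables can be set in the shell or in a per-user .Renviron file, so the script itself can be shared without exposing the secrets.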

With all this done suc­cess­fully, we are ready to ac­cess Twit­ter data. As an ex­am­ple of data anal­y­sis, let us con­sider the simple prob­lem of opin­ion min­ing.

Data anal­y­sis

To demonstrate how data analysis is done, let’s get some data from Twitter. The twitteR package provides the function searchTwitter() to retrieve tweets that match the keywords searched for. Twitter organises tweets using hashtags. With the help of a hashtag, you can expose your message to an audience interested in only some specific subject. If the hashtag is a popular keyword related to your business, it can act to increase your brand’s awareness levels. The use of popular hashtags helps one to get noticed. Analysis of hashtag appearances in tweets or on Instagram can reveal different trends in what people are thinking about the hashtag keyword. So this can be a good starting point to decide your business strategy.

To demonstrate hashtag analysis using R, we have picked the number one hashtag keyword, #love, for the study. Other than this search keyword, the searchTwitter() function also requires the maximum number of tweets that the function call will return. For this discussion, let us consider the maximum number as 500. Depending upon the speed of your Internet connection and the traffic on the Twitter server, you will get the responses within a few minutes as an R list class object.

>tweetList<-searchTwitter("#love",n=500)
>mode(tweetList)
[1] "list"
>length(tweetList)
[1] 500

In R, a list object is a compound data structure that can contain all types of R objects, including other lists. For further analysis, it is necessary to investigate its structure, for example with str(tweetList[[1]]). Since it is an object of 500 list items, the structure of the first item is sufficient to understand the schema of the set of records.

The struc­ture shows that there are 20 fields of each list item, and the fields con­tain in­for­ma­tion and data re­lated to the tweets.

Since the data frame is the most ef­fi­cient struc­ture for pro­cess­ing records, it is now nec­es­sary to con­vert each list item to the data frame and bind these row-by-row into a sin­gle frame. This can be done in an el­e­gant way us­ing the do.call() func­tion call, as shown here:

>loveDF<-do.call("rbind",lapply(tweetList, as.data.frame))

Function lapply() will first convert each list item to a data frame, then do.call() will bind these row by row. Now we have a set of records with 19 fields (one less than the list!) in a regular format ready for analysis. Here, we shall mainly consider the ‘created’ field to study the distribution pattern of the arrival of tweets.
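The lapply() and do.call() combination can be tried without a Twitter connection. The following is a minimal sketch on three made-up records; the field names and values are illustrative only, and the records are plain lists rather than the status objects returned by searchTwitter().

```r
# Three made-up records standing in for tweets (illustrative values only).
records <- list(
  list(text = "first #love",  screenName = "user_a", retweetCount = 0),
  list(text = "second #love", screenName = "user_b", retweetCount = 3),
  list(text = "third #love",  screenName = "user_c", retweetCount = 1)
)

# lapply() turns each record into a one-row data frame ...
rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)

# ... and do.call() hands all of them to a single rbind() call.
toyDF <- do.call("rbind", rows)
str(toyDF)
```

Because do.call() passes every one-row frame to rbind() as a separate argument, the binding happens in one call rather than in an explicit loop.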

Figure 3: Histogram of ordered created time tag

$ screenName : chr "Lezzardman"
$ retweetCount : num 0
$ isRetweet : logi FALSE
$ retweeted : logi FALSE
$ longitude : chr NA
$ latitude : chr NA
$ location : chr "Bay Area, CA, #CLGWORLDWIDE <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>"
$ language : chr "en"
$ profileImageURL: chr "http://pbs.twimg.com/profile_images/444325116407603200/XmZ92DvB_normal.jpeg"

The fifth column is ‘created’; we shall explore the statistical characteristics of this field.

>attach(loveDF) # attach the frame for further processing
>head(loveDF['created'],2) # first 2 records, for demonstration
              created
1 2017-10-04 06:11:03
2 2017-10-04 06:10:55

Twitter records each tweet’s time of creation as a Coordinated Universal Time (UTC) time-stamp. This keeps all records on a single normalised time frame and makes it easy to draw a frequency histogram of the ‘created’ time tag.

>hist(created,breaks=15,freq=TRUE,main="Histogram of created time tag")
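Since the ‘created’ values are UTC time-stamps, it can help to see how one instant is displayed in a local time zone and how time-stamps can be grouped into coarser bins. The sketch below uses the first time-stamp printed earlier; the choice of the Asia/Kolkata zone and the four extra time-stamps are assumptions made for illustration.

```r
# First 'created' value from the sample output, parsed as UTC.
created_utc <- as.POSIXct("2017-10-04 06:11:03", tz = "UTC")

# Same instant, displayed in an assumed local time zone (India Standard Time).
format(created_utc, tz = "Asia/Kolkata", usetz = TRUE)

# Five illustrative time-stamps, counted per minute.
stamps <- created_utc - c(0, 8, 70, 75, 130)
per_minute <- table(format(stamps, "%H:%M", tz = "UTC"))
per_minute
```

The per-minute table is a coarse, text-only counterpart of the frequency histogram drawn by hist().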

Fig­ure 4: Cu­mu­la­tive fre­quency dis­tri­bu­tion

If we want to study the pattern of how the hashtag #love appears in the data set, we can take the differences of consecutive time elements of the vector ‘created’. The R function diff() can do this. It returns iterated lagged differences of the elements of a numeric vector. In this case, both the lag and the number of iterations (the lag and differences arguments) are one, which are the defaults. To have a time series from the ‘created’ vector, it first needs to be sorted on its integer value; here, we have done this before creating the series, as follows:

>detach(loveDF)
>sortloveDF<-loveDF[order(as.integer(loveDF$created)),]
>attach(sortloveDF)
>hist(as.integer(abs(diff(created))))

This distribution shows that the majority of tweets in this group arrive within the first few seconds of the previous tweet, and a much smaller number arrive after longer intervals. This shape is what one expects from the inter-arrival times of a Poisson process, so the arrival pattern appears to follow the Poisson Distribution, and it is now possible to model the number of times an event occurs in a given time interval.
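To see what such a histogram looks like when the arrivals really are Poisson, the same steps can be run on simulated data. This is a sketch under stated assumptions: the arrival rate of 0.25 tweets per second, the start time and the variable names are all made up, and no Twitter data is involved.

```r
set.seed(42)                          # make the simulation repeatable
rate <- 0.25                          # assumed mean arrival rate, tweets per second
gaps <- rexp(500, rate = rate)        # exponential gaps between consecutive tweets
start <- as.POSIXct("2017-10-04 06:00:00", tz = "UTC")
created_sim <- start + cumsum(gaps)   # simulated 'created' time-stamps

# Same recipe as above: sort, difference, convert to seconds.
gap_sec <- as.numeric(abs(diff(sort(created_sim))), units = "secs")

hist(gap_sec, breaks = 15, main = "Simulated inter-arrival gaps")
mean(gap_sec)                         # should be close to 1 / rate
```

Asking for the gaps explicitly in seconds avoids depending on the unit that diff() picks automatically for time differences.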

Let’s check the cu­mu­la­tive dis­tri­bu­tion pat­tern, and the num­ber of tweets ar­riv­ing within a time in­ter­val. For this we have to write a short R func­tion to get the cu­mu­la­tive val­ues within each in­ter­val. Here is the demo script and the graph plot:

countarrival <- function(created) {
  i = 1
  s <- seq(1, 15, 1)
  for (t in seq(1, 15, 1)) {
    # share of the 500 tweets whose gap to the next tweet is under t seconds
    s[i] <- sum((as.integer(abs(diff(created)))) < t) / 500
    i = i + 1
  }
  return(s)
}

To create a cu­mu­la­tive value of the ar­riv­ing tweets within a given in­ter­val, coun­tar­rival() uses sum() func­tion over diff() func­tion af­ter con­vert­ing the val­ues into an in­te­ger.

>s<-countarrival(created)
>x<-seq(1,15,1)
>y<-s
>lo<-loess(y~x)
>plot(x,y)
>lines(predict(lo), col='red', lwd=2)
# each y value is sum((as.integer(abs(diff(created))))<t)/500

To have a smooth time series curve, the loess() function has been used with the predict() function. Predicted values based on the local regression model fitted by loess() are plotted along with the x-y frequency values.
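The cumulative proportions can also be computed without an explicit loop. The sketch below works on simulated gaps (an assumption; no Twitter data is used) and checks a vectorised version against R's built-in empirical distribution function ecdf(). Unlike countarrival(), it divides by the actual number of gaps rather than by a fixed 500.

```r
set.seed(1)
gaps <- rexp(499, rate = 0.25)   # simulated gaps in seconds (not Twitter data)
int_gaps <- as.integer(gaps)     # truncate to whole seconds, as countarrival() does
x <- 1:15

# Proportion of gaps strictly below each threshold, written without a loop.
y_vec <- sapply(x, function(t) mean(int_gaps < t))

# For whole-second gaps, P(gap < t) equals P(gap <= t - 1), which ecdf() returns.
y_ecdf <- ecdf(int_gaps)(x - 1)
all.equal(y_vec, y_ecdf)

lo <- loess(y_vec ~ x)           # local regression smoother
plot(x, y_vec)
lines(x, predict(lo), col = "red", lwd = 2)
```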

This is a classic example of a probability distribution of arrival times. The pattern in Figure 5 shows the cumulative distribution of the gaps between arrivals in a Poisson process, and can be used to model the number of events occurring within a given time interval. The X-axis contains one-second time intervals. Since this is a cumulative probability plot, the Y-axis value is the likelihood that the next tweet arrives within the corresponding X-axis value or less. For instance, since 4 on the X-axis approximately corresponds to 60 per cent on the Y-axis, there is roughly a 60 per cent chance that the next tweet will arrive in 4 seconds or less. In conclusion, the data is consistent with events that are mutually independent and occur at a constant rate per unit time interval.
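The reading of about 60 per cent at 4 seconds can be turned into a rough arrival rate. As a worked sketch, assume the gaps between tweets are exponentially distributed, so that the chance of a gap of at most t seconds is 1 - exp(-lambda * t); the variable names below are illustrative, and the 0.60 value is only the approximate figure read off the plot.

```r
p <- 0.60                          # cumulative probability read off the plot
t_sec <- 4                         # corresponding time in seconds

# Solve 1 - exp(-lambda * t_sec) = p for the rate lambda.
lambda <- -log(1 - p) / t_sec
lambda                             # about 0.229 tweets per second
1 / lambda                         # implied mean gap, about 4.4 seconds

# Consistency checks with R's built-in distribution functions.
pexp(t_sec, rate = lambda)                    # exponential CDF, gives back 0.6
ppois(0, lambda * t_sec, lower.tail = FALSE)  # P(at least one tweet in 4 s), also 0.6
```

The last two lines express the same probability from two sides: the waiting time being at most 4 seconds, and the count of tweets in 4 seconds being at least one.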

This data anal­y­sis and vi­su­al­i­sa­tion shows that the ar­rival pat­tern is ran­dom and fol­lows the Pois­son Dis­tri­bu­tion. The reader may test the ar­rival pat­tern with a dif­fer­ent key­word too.

[Figure: plot with axes labelled Cumulative-time-interval and created-time]
