
Explore Twitter Data Using R

As of August 2017, Twitter had 328 million active users, with 500 million tweets being sent every day. Let's look at how the open source R programming language can be used to analyse the tremendous amount of data created by this very popular social media platform.

- By: Dipankar Ray. The author is a member of IEEE and IET, with more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using neural networks.

Social networking websites are ideal sources of Big Data, which has many applications in the real world. These sites contain both structured and unstructured data, and are perfect platforms for data mining and subsequent knowledge discovery from the source. Twitter is a popular source of text data for data mining. Huge volumes of Twitter data contain many varieties of topics, which can be analysed to study the trends of different current subjects, like market economics or a wide variety of social issues. Accessing Twitter data is easy, as open APIs are available to transfer and arrange data in JSON and ATOM formats.

In this article, we will look at an R programming implementation of Twitter data analysis and visualisation. This will give readers an idea of how to use R to analyse Big Data. As a microblogging network for the exchange and sharing of short public messages, Twitter provides a rich repository of hyperlinks, multimedia and hashtags, depicting the contemporary social scenario in a geolocation. From the originating tweets and the responses to them, as well as the retweets by other users, it is possible to carry out opinion mining on a subject of interest in a geopolitical location. By analysing favourite counts and the popularity of users as reflected in their follower counts, it is also possible to make a weighted statistical analysis of the data.

Start your exploration

Exploring Twitter data using R requires some preparation. First, you need to have a Twitter account. Using that account, register an application at https://apps.twitter.com/. The registration process requires basic personal information and produces four keys for connecting the R application with the Twitter application. For example, an application myapptwitterR1 may be created, as shown in Figure 1.

In turn, this will create your application settings, as shown in Figure 2.

A consumer key, a consumer secret, an access token and an access token secret together form the final authentication, which is performed with the setup_twitter_oauth() function.

>setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessTokenSecret)

It is also necessary to create an object to save the authentication for future use. This is done by OAuthFactory$new() as follows:

credential <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=requestURL, accessURL=accessURL, authURL=authURL)

Figure 1: Twitter application settings

Here, requestURL, accessURL and authURL are available from the application settings page at https://apps.twitter.com/.

Connect to Twitter

This exercise requires R to have a few packages for calling all the Twitter-related functions. Here is an R script to start the Twitter data analysis task. To access Twitter data through the just-created application myapptwitterR1, one needs the twitteR, ROAuth and httr packages.

>setwd('d:\\r\\twitter')

>install.packages("twitteR")

>install.packages("ROAuth")

>install.packages("httr")

>library("twitteR")

>library("ROAuth")

>library("httr")

To test this on the MS Windows platform, download the curl CA certificate bundle (cacert.pem) into the current working directory, as follows:

>download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")

Before the final connectivity to the Twitter application, save all the necessary key values to suitable variables:

>consumerKey='HTgXiD3kqncGM93bxlBczTfhR'

>consumerSecret='djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe'

>requestURL='https://api.twitter.com/oauth/request_token'

>accessURL='https://api.twitter.com/oauth/access_token'

>authURL='https://api.twitter.com/oauth/authorize'

Figure 2: Histogram of created time tag

With these preparations, one can now create the required connectivity object:

>cred <- OAuthFactory$new(consumerKey=consumerKey, consumerSecret=consumerSecret, requestURL=requestURL, accessURL=accessURL, authURL=authURL)

>cred$handshake(cainfo="cacert.pem")

Authentication to a Twitter application is done by the function setup_twitter_oauth() with the stored key values:

>setup_twitter_oauth(consumerKey, consumerSecret, accessToken, accessTokenSecret)

With all this done successfully, we are ready to access Twitter data. As an example of data analysis, let us consider the simple problem of opinion mining.

Data analysis

To demonstrate how data analysis is done, let's get some data from Twitter. The twitteR package provides the function searchTwitter() to retrieve tweets based on the keyword searched for. Twitter organises tweets using hashtags. With the help of a hashtag, you can expose your message to an audience interested in some specific subject. If the hashtag is a popular keyword related to your business, it can act to increase your brand's awareness levels. The use of popular hashtags helps one to get noticed. Analysis of hashtag appearances in tweets or Instagram posts can reveal trends in what people are thinking about the hashtag keyword. So this can be a good starting point for deciding your business strategy.

To demonstrate hashtag analysis using R, we have picked the number one hashtag keyword #love for this study. Besides the search keyword, the searchTwitter() function also requires the maximum number of tweets that the call will return. For this discussion, let us take this maximum as 500. Depending upon the speed of your Internet connection and the traffic on the Twitter server, you will get the responses as an R list class object within a few minutes.

>tweetList <- searchTwitter("#love", n=500)

>mode(tweetList)

[1] “list”

>length(tweetList)

[1] 500

In R, a list is a compound data structure that can contain all types of R objects, including other lists. For further analysis, it is necessary to investigate its structure. Since ours is an object of 500 list items, the structure of the first item is sufficient to understand the schema of the set of records.

The structure shows that each list item has 20 fields, which contain the information and data related to the tweet.

Since the data frame is the most efficient structure for processing records, it is now necessary to convert each list item to a data frame and bind these row by row into a single frame. This can be done in an elegant way using the do.call() function, as shown here:

>loveDF <- do.call("rbind", lapply(tweetList, as.data.frame))

Function lapply() will first convert each list item to a data frame, and then do.call() will bind these, one by one. Now we have a set of records with 19 fields (one less than the list!) in a regular format, ready for analysis. Here, we shall mainly consider the 'created' field to study the distribution pattern of the arrival of tweets.

Figure 3: Histogram of ordered created time tag
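The tail end of the structure listing for the first list item is reproduced below. It would have been produced by a call along the following lines (the exact call is an assumption, as it is not shown in the original):

>str(tweetList[[1]])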

$ screenName : chr "Lezzardman"
$ retweetCount : num 0
$ isRetweet : logi FALSE
$ retweeted : logi FALSE
$ longitude : chr NA
$ latitude : chr NA
$ location : chr "Bay Area, CA, #CLGWORLDWIDE"
$ language : chr "en"
$ profileImageURL: chr "http://pbs.twimg.com/profile_images/444325116407603200/XmZ92DvB_normal.jpeg"
>

The fifth column field is 'created'; we shall try to explore the different statistical characteristics of this field.

>attach(loveDF) # attach the frame for further processing

>head(loveDF['created'], 2) # first 2 record set items for demo
created
1 2017-10-04 06:11:03
2 2017-10-04 06:10:55

Twitter follows the Coordinated Universal Time (UTC) tag as the time-stamp to record a tweet's time of creation. This helps to maintain a normalised time frame for all records, and makes it easy to draw a frequency histogram of the 'created' time tag.

>hist(created, breaks=15, freq=TRUE, main="Histogram of created time tag")

Figure 4: Cumulative frequency distribution

If we want to study the pattern of how the word 'love' appears in the data set, we can take the differences of consecutive time elements of the vector 'created'. The R function diff() can do this: it returns iterated lagged differences of the elements of a numeric vector. In this case, we need both the lag and the number of iterations to be one (a toy illustration of diff() itself follows the script below). To have a time series from the 'created' vector, it first needs to be converted to an integer; here, we have done it before creating the series, as follows:

>detach(loveDF)

>sortloveDF <- loveDF[order(as.integer(created)),]

>attach(sortloveDF)

>hist(as.integer(abs(diff(created))))
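As a quick aside, this is what diff() returns with the default lag of one (a toy vector, not the Twitter data):

>diff(c(3, 7, 12, 20)) # consecutive differences: 7-3, 12-7, 20-12
[1] 4 5 8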

This distribution shows that the majority of tweets in this group arrive within the first few seconds of one another, and far fewer tweets arrive at longer intervals. From the distribution, it is apparent that the arrival times follow a Poisson Distribution pattern, so it is now possible to model the number of times an event occurs in a given time interval.
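To probe this claim a little further, one can estimate the arrival rate from the data and overlay the corresponding exponential curve (the inter-arrival distribution of a Poisson process) on the histogram. This is a minimal sketch, not part of the original session; it assumes the sorted 'created' vector is still attached:

># inter-arrival gaps in seconds, from the sorted time stamps
>gaps <- as.integer(abs(diff(created)))

># for a Poisson arrival process, gaps are exponentially
># distributed with rate lambda = 1/mean(gap)
>lambda <- 1/mean(gaps)

># density-scale histogram with the fitted exponential curve
>hist(gaps, breaks=15, freq=FALSE, main="Inter-arrival times")
>curve(dexp(x, rate=lambda), add=TRUE, col="blue", lwd=2)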

Let's check the cumulative distribution pattern, i.e., the number of tweets arriving within a given time interval. For this, we have to write a short R function to get the cumulative values within each interval. Here is the demo script and the graph plot:

countarrival <- function(created) {
  i <- 1
  s <- seq(1, 15, 1)
  for (t in seq(1, 15, 1)) {
    s[i] <- sum((as.integer(abs(diff(created)))) < t) / 500
    i <- i + 1
  }
  return(s)
}

To compute the cumulative fraction of tweets arriving within each time interval, countarrival() applies sum() over the diff() values after converting them to integers, dividing by the total count of 500.

>s <- countarrival(created)

>x <- seq(1, 15, 1)

>y <- s

>lo <- loess(y~x)

>plot(x, y)

>lines(predict(lo), col='red', lwd=2)

To obtain a smooth curve over the time series, the loess() function has been used together with the predict() function. The predicted values from the local polynomial regression model fitted by loess() are plotted along with the x-y frequency values.

This is a classic example of a probability distribution of arrival times. The pattern in Figure 5 shows a cumulative Poisson Distribution, and can be used to model the number of events occurring within a given time interval. The X-axis shows one-second time intervals. Since this is a cumulative probability plot, each Y value gives the likelihood that the next tweet arrives within the corresponding X-axis value or less. For instance, since 4 on the X-axis corresponds to approximately 60 per cent on the Y-axis, the next tweet will arrive within 4 seconds about 60 per cent of the time. In conclusion, we can say that all the events are mutually independent and occur at a known and constant average rate per unit time interval.
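The same 60 per cent reading can be recovered from the exponential inter-arrival model, where the probability of the next tweet arriving within t seconds is 1 - exp(-lambda*t). A small sanity check, assuming the lambda estimated in the earlier sketch:

># probability that the next tweet arrives within t seconds
>t <- 4
>1 - exp(-lambda * t) # roughly 0.6 if the model fits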

This data analysis and visualisation shows that the arrival pattern is random and follows the Poisson Distribution. The reader may test the arrival pattern with a different keyword too.
