Session 9 - "Emerging trends in MKTR" - is going to be reading-heavy (you've been warned). Sooo many great readings and so little time. Anyway, there's a Twitter-based reading for which I'm putting up sample code below. Will ask the AAs to load the data on LMS. But before that, some background.
Some folks have asked why we stopped where we did with text analysis, when, obviously, so much more downstream analysis and processing could have been done. Sure, a lot is possible and doable in R, but class time is limited and only so much can fit in. One particular Q that came up:
"Can we do better sentiment analysis than what we just did for the session 6 HW?" Sure, we can. It would be great if we could categorize sentiment and then classify text responses accordingly.
Here's what Wikipedia says on the subject:
A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy."
Anyway, R does sentiment analysis. Its package twitteR (the last 'R' is capital) lets you set what keywords you want mined from Twitter feeds and where in the world you want this data collected from (specify the latitude and longitude of major cities, for example, and a 50-mile radius around them), collect that data, text mine it, analyze its content, score its sentiment and more. Neat, eh? Well, that's R for you.
Now, finally, on popular demand, here is some elementary R code that I used to analyze tweeple reactions to the latest Bond movie 'Skyfall'.
Step 1: Invoke the appropriate libraries. Ensure you have the 'twitteR' and 'sentiment' packages downloaded and installed.
library(twitteR)
library(sentiment)
library(tm)
library(Snowball)
library(wordcloud)
Step 2: Send R to search for and save the data you want. This step is a little involved. Please read the instructions given in the bullet points below carefully.
- First, copy and paste the block of code below into an empty notepad. Make all edits to the code in this notepad and then copy-paste it into the R console.
- Your PGP username and password (the ones you use to connect to the web) are required. Enter these in the code in place of 'username' and 'password' in the 'set proxy in R' step.
- If you want tweets from a specific city only, use the geocode option in the searchTwitter() function below. For example, geocode="29.0167,77.3833,50mi" refers to tweets originating within a 50-mile radius of the center of Delhi (note: no spaces inside the geocode string).
- In write.table(), write the collected tweets to a plain text (notepad) file only.
- If you ask R to save more than n=500 tweets in the searchTwitter() function, it might take up to a couple of minutes (depending on your web connection) to find and save them.
###### search in twitter #######

# set proxy in R
Sys.setenv(http_proxy = "http://username:password@172.16.0.87:8080")

# send R to go collect data
rev = searchTwitter("#skyfall", n=500, lang="en")

## -- template for location-specific searches (fill in searchString and the dates) --
# searchTwitter(searchString, n=25, lang="en", since=date, until=date, geocode="38.5,81.4,50mi")

rev[1:5]                  # shows first 5 tweets
rev.df = twListToDF(rev)  # changes tweets to a data frame

# save data
write.table(as.matrix(rev.df[, 1]), file.choose())
Step 3: Standard text mining stuff, which we already saw in session 6. I won't go into making barplots and histograms; you can do that yourself using the session 6 code.
x = readLines(file.choose())
x1 = Corpus(VectorSource(x))

# standardize the text - remove blanks, uppercase,
# punctuation, English connectors etc.
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeNumbers)

# add the movie's own terms to the stopword list and remove them
myStopwords <- c(stopwords('english'), "skyfall", "bond")
x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix
x1mat = DocumentTermMatrix(x1)
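By the way, the wordcloud package loaded in Step 1 doesn't get used above. As a quick visual check on the corpus, a minimal sketch (assuming the x1mat doc-term matrix from the code above is in your workspace) could look like this:

```r
# term frequencies summed over all tweets, most frequent first
freq = colSums(as.matrix(x1mat))
freq = sort(freq, decreasing = TRUE)

# plot up to 100 terms; min.freq drops the rarest words
# (brewer.pal comes along with wordcloud's RColorBrewer dependency)
wordcloud(names(freq), freq, max.words = 100, min.freq = 3,
          colors = brewer.pal(8, "Dark2"))
```

The max.words and min.freq values here are illustrative; tune them to your corpus size.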
Step 4: Invoke sentiment analysis. Classify the tweets by emotion and find the polarity (i.e., which emotion pole - positive or negative - dominates a text output) using simple functions.
## --- inspect only those tweets which
## got a clear sentiment orientation ---
library(sentiment)
a1 = classify_emotion(x1)
a2 = x[(!is.na(a1[, 7]))]  # which tweets had a clear emotion (column 7 is the best fit)
a2[1:10]

# what is the polarity score of each tweet?
# that is, what's the ratio of pos to neg content?
b1 = classify_polarity(x1)
dim(b1)    # dimensions of the polarities table
b1[1:5, ]  # view a few rows
Step 5: Now we dive deeper into emotion classification. Six primary emotion states are available in the sentiment package's output: "anger", "disgust", "fear", "joy", "sadness", and "surprise". We classify which tweets score high on which emotion type and view a few rows of each type.
## --- convert the a1 output into a regular numeric matrix ---
a1a = data.matrix(as.numeric(a1))
a1b = matrix(a1a, nrow(a1), ncol(a1))  # build the sentiment type-score matrix
a1b[1:4, ]  # view a few rows

# recover and remove the mode (baseline) value of each column
mode1 <- function(x) {names(sort(-table(x)))[1]}
for (i1 in 1:6){  # for the 6 primary emotion dimensions
  mode11 = as.numeric(mode1(a1b[, i1]))
  a1b[, i1] = a1b[, i1] - mode11
}
summary(a1b)

a1c = a1b[, 1:6]
colnames(a1c) <- c("anger", "disgust", "fear", "joy", "sadness", "surprise")
a1c[1:10, ]  # view a few rows

## -- see the top 10 tweets in "joy" (for example)
a1c = as.data.frame(a1c); attach(a1c)
test = x[(joy != 0)]; test[1:10]

# likewise, the top few tweets in "anger" and "sadness"
test = x[(anger != 0)]; test[1:10]
test = x[(sadness != 0)]; test[1:10]
Could more be done downstream? Can I now cluster tweets by sentiment? Do collocation dendrograms by sentiment polarity?
Sure and more.
But I will stop here for now.
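For the curious, though, here's a rough sketch of what clustering tweets by sentiment might look like. It assumes the b1 polarity matrix from Step 4 and the raw tweets x are still in the workspace; the choice of complete-linkage clustering and of 3 clusters is illustrative, not definitive:

```r
# numeric polarity scores: columns 1-3 of b1 hold POS, NEG and the POS/NEG ratio
b1num = apply(b1[, 1:3], 2, as.numeric)

# hierarchical clustering on standardized scores
d = dist(scale(b1num))  # Euclidean distances between tweets
fit = hclust(d)         # complete-linkage agglomerative clustering
plot(fit)               # view the dendrogram

# cut the tree into, say, 3 sentiment clusters and eyeball a few tweets from each
k1 = cutree(fit, k = 3)
x[k1 == 1][1:5]
```

Base R's dist(), hclust() and cutree() are enough here; no extra packages needed.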
See you in class soon. Ciao.
Sudhir
Good day Prof. Voleti,
I had been struggling with the twitter API handshake with R for a long time. I had tried multiple codes but they seemed to give some error or the other. I recently managed to crack the code and thought I'd share it with everyone, in case others are having trouble too.
> require(ROAuth)
Loading required package: ROAuth
Loading required package: RCurl
Loading required package: bitops
Loading required package: digest
> require(twitteR)
Loading required package: twitteR
Loading required package: rjson
> reqURL <- "http://api.twitter.com/oauth/request_token"
> accessURL <- "http://api.twitter.com/oauth/access_token"
> authURL <- "http://api.twitter.com/oauth/authorize"
> consumerKey <- "YOUR CONSUMER KEY"
> consumerSecret <- "YOUR CONSUMER SECRET"
> twitCred <- OAuthFactory$new(consumerKey=consumerKey,
+ consumerSecret=consumerSecret,
+ requestURL=reqURL,
+ accessURL=accessURL,
+ authURL=authURL
+ )
> twitCred$handshake(cainfo="cacert.pem", ssl.verifypeer=FALSE)
To enable the connection, please direct your web browser to:
http://api.twitter.com/oauth/authorize?oauth_token= "YOUR AUTHORIZATION PIN"
When complete, record the PIN given to you and provide it here: VPVPbiutaQw3xEPRyf4yweRdP2KoqRIVzgy1JmV4Rnw
> registerTwitterOAuth(twitCred)
[1] TRUE