Sunday, December 16, 2012

Session 7 R code for Basic Text Analysis

Hi all,

Session 7 has the following modules:

  • Continuing Qualitative MKTR from Session 6, the first sub-module covers Focus Group Discussions (FGDs). A short video 'reading' accompanies it.
  • The second sub-module in Qualitative MKTR - Analysis of Unstructured Text - comes next. The R code below deals with elementary text mining in R.
  • The third module deals with building and testing hypotheses in MKTR. We'll use R's statistical abilities to run quick tests on two major classes of hypotheses.
So without any further ado, here goes sub-module 1.

1. Elementary Text Mining in R

First, install and load the required libraries as shown below, then read in the data (file 'Q25.txt' in the Session 7 folder on LMS).
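
If any of these packages are not already on your machine, a one-time installation along the lines of the sketch below should work (a minimal sketch; depending on your R version, the 'Snowball' and 'sentiment' packages may have to be installed from archived sources rather than directly from CRAN).

# one-time installation (skip if the packages are already installed)
install.packages("tm")
install.packages("wordcloud")
install.packages("Snowball")   # may need an archived version on newer R releases
install.packages("sentiment")  # may need an archived version on newer R releases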

library(tm)
library(Snowball)
library(wordcloud)
library(sentiment)

###
### --- code for basic text mining ---
###

# first, read-in data from 'Q25.txt'
x = readLines(file.choose())
x1 = Corpus(VectorSource(x))
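
A couple of quick sanity checks (not part of the original code) help confirm the data were read in correctly; both calls below are standard base-R / tm functions:

length(x) # number of text responses read in
inspect(x1[1:2]) # view the first two documents of the corpus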

Now run the following code to process the unstructured text and obtain from it a document-term matrix. The output obtained is shown below.

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeNumbers)
# add corpus-specific stopwords ('ice', 'cream') to the standard English list

myStopwords <- c(stopwords('english'), "ice", "cream")

x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix #
x1mat = DocumentTermMatrix(x1)
# --- clean up & sort the doc-term matrix --- #
# drop terms that occur in very few documents
mydata = removeSparseTerms(x1mat,0.998)
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,] # convert to data frame, view first 10 rows
mydata.df1 = mydata.df[, order(-colSums(mydata.df))] # sort columns by term frequency (descending)
dim(mydata.df1)
mydata.df1[1:10, 1:10] # view 10 rows & 10 cols

# view frequencies of the top few terms
colSums(mydata.df1) # term name & freq listed
'Stopwords' in the above code are words we do not want analyzed; the list of stopwords can be arbitrarily long. Notice the document corpus x1 and the document-term matrix (only the first 10 rows and columns are shown). The colSums() call lists the total frequency of occurrence of each term in the entire corpus.
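
Since the blog screenshots are not reproduced here, the same information can be recovered numerically with one extra line (not in the original code):

# top terms by total frequency across the corpus
sort(colSums(mydata.df1), decreasing = TRUE)[1:15]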

2. Making Wordclouds

Wordclouds are a useful way to visualize the relative frequencies of words in the corpus (word size is proportional to frequency). The colors of the words are random, though.

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1)*10, scale=c(4, 0.5), colors=1:10)

So, what can we say from a wordcloud? The one above seems to suggest that 'chocolate' is the flavor most on people's minds, followed by vanilla. The relative importance of other words can be assessed similarly. But we learn precious little else.

It would be more useful to see which words are used together most often within a document. For example, 'butter' could mean anything - from 'butter scotch' to 'butter pecan' to 'peanut butter'. To find which pairs of words most commonly occur together, I tweaked the R code a bit, and the result is a 'collocation dendrogram':

# --- making dendrograms to #
# visualize word collocations --- #
min1 = min(mydata$ncol, 25) # use at most the top 25 terms
test = matrix(0,min1,min1)
test1 = test
for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}

# make dissimilarity matrix out of the freq one
test2 = (max(test1)+1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]
# now plot the collocation dendrogram
d <- dist(test2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # Ward's hierarchical clustering
plot(fit) # display dendrogram

Looking at the resulting dendrogram, can you say which flavors seem to go best with 'coffee'? With 'vanilla'? With 'strawberry'?
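
A complementary, purely numeric way to probe such co-occurrence is tm's findAssocs(), which lists terms whose frequencies correlate with a chosen term. A minimal sketch (the term 'coffee' and the 0.1 correlation threshold are only illustrative; the term must actually appear in your matrix):

# terms correlated with 'coffee' in the document-term matrix
findAssocs(x1mat, "coffee", 0.1)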

It is important not to lose sight of the weaknesses and problems that dog current text mining capabilities.

  • For instance, someone writing 'not good' versus 'good' will both have 'good' picked up by the text miner. So 'meaning' per se is lost on the miner (a partial workaround is sketched just after this list).
  • The text miner is notoriously poor at picking up wit, sarcasm, exaggerations and the like. Again, the 'meaning' part is lost on the miner.
  • Typos etc. can also play havoc. Synonyms can cause trouble too, in some contexts. E.g., some say 'complex' while others say 'complicated' to refer to the same attribute; these will appear as separate terms in the analysis.
  • So text mining is more useful as an exploratory tool, to point out qualitatively what topics and words appear to weigh most on respondent minds. It helps downstream analysis by providing inputs for hypothesis building, for more in-depth investigation later and so on.
  • It *is* important that the more interesting comments, opinions etc be manually checked before arriving at any conclusions.
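
On the 'not good' problem: one partial workaround is to tokenize the text into two-word phrases (bigrams) so that negations stay attached to the words they modify. Below is a minimal sketch, assuming the RWeka package is installed (this is not part of the course code):

library(RWeka)
# tokenizer that produces two-word phrases instead of single words
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# doc-term matrix of bigrams: 'not good' and 'very good' now become distinct terms
# note: for this to help, 'not' must not have been removed as a stopword earlier
x1mat2 = DocumentTermMatrix(x1, control = list(tokenize = BigramTokenizer))
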
That concludes our small, elementary text-mining foray in R. R's capabilities in the text-mining arena are quite advanced, extensible and evolving; we ventured in only far enough to work through a simple example. That example, however, scales up easily to larger and more complex datasets of unstructured text. Onwards now to our second sub-module, wherein we combine two important elements of text analysis for MKTR - social media chatter and sentiment mining.

3. Sentiment Mining of Twitter Data

The twitteR package showcases R's social media capabilities well. It allows you to search for particular keywords in particular geographic areas (cities, for example). Thus, you could compare the response to the movie 'Talaash' in Delhi versus, say, Hyderabad.
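
For reference, such a keyword-plus-location search looks roughly like the sketch below (the geocode string, tweet count and variable names are illustrative placeholders, and the OAuth authentication that Twitter requires is omitted):

library(twitteR)
# authenticate with Twitter first (setup omitted here)
# search for '#skyfall' tweets posted within ~25 miles of central London
tweets = searchTwitter("#skyfall", n = 1500, geocode = "51.5072,-0.1276,25mi")
tweet.text = sapply(tweets, function(t) t$getText()) # extract the raw tweet text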

For our classwork exercise, I am using Twitter data on #skyfall from London, collected over the weekend following the movie's release. Read the data in and run the text-mining R code as we did above. Only after that, implement basic sentiment analysis on the tweets as follows:

#######################################
### --- sentiment mining block ---- ###
#######################################

# After doing text analysis, run this
### --- sentiment analysis --- ###
# read-in positive-words.txt
pos=scan(file.choose(), what="character", comment.char=";")
# read-in negative-words.txt
neg=scan(file.choose(), what="character", comment.char=";")
# add our own positive words to the existing list
pos.words = c(pos, "wow", "kudos", "hurray")
neg.words = c(neg) # negative word list used as-is

# match() returns the position of the matched term or NA
pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")
wordcloud(rownames(b1), b1[,1], scale=c(5, 1), colors=1:10)
neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")
wordcloud(rownames(b2), b2[,1], scale=c(5, 1), colors=1:10)
Two wordclouds will appear: one containing only words with positive sentiment or emotional content, the other containing only negative ones.
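
If a single summary number is wanted alongside the two wordclouds, a crude corpus-level net-sentiment count can be computed from the same objects (a rough add-on, not part of the original code; it ignores negation and context entirely):

# total frequency of positive vs. negative terms, and the difference
pos.freq = sum(colSums(mydata.df1)[pos.matches])
neg.freq = sum(colSums(mydata.df1)[neg.matches])
pos.freq; neg.freq; pos.freq - neg.freq # net sentiment count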

4. Determining Sentiment Polarities

Can we measure 'how much' emotional content or intensity a tweet or comment contains? At least at an ordinal level, perhaps. The package 'sentiment' offers a way to measure sentiment polarity in terms of the log-likelihood of a comment being of one polarity versus another. This can be a useful first step in basic sentiment analysis.

#######################################
### --- sentiment mining block II ---- ###
#######################################

### --- inspect only those tweets #
# which got a clear sentiment orientation ---
a1=classify_emotion(x1) # classify each tweet on the 6 primary emotion dimensions plus a best-fit label
a2=x[(!is.na(a1[,7]))] # 447 of the 1566 tweets had clear polarity
#a3=PlainTextDocument(a2)
a2[1:10]
# what is the polarity of each tweet? #
# that is, what's the ratio of pos to neg content? #
b1=classify_polarity(x1)
dim(b1)
b1[1:5,] # view polarities table
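
To see how the tweets split across polarity labels, the best-fit label (the last column of classify_polarity()'s output) can be tabulated with one extra line (not in the original code):

# tabulate the best-fit polarity label across tweets
table(b1[, ncol(b1)])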

5. Determining Sentiment Dimensions

Can we go beyond sentiment polarities and get more specific about which primary emotion dominates a particular tweet, opinion or comment? It turns out the sentiment package in R does provide one way to do this. How well established or usable it is in a given context, though, remains a case of caveat emptor.

# build the sentiment type-score matrix
a1a = data.matrix(as.numeric(a1))
a1b = matrix(a1a, nrow(a1), ncol(a1))
a1b[1:4,] # view a few rows

# recover and remove the mode values
mode1 <- function(x){names(sort(-table(x)))[1]}
for (i1 in 1:6){ # for the 6 primary emotion dimensions
mode11=as.numeric(mode1(a1b[,i1]))
a1b[,i1] = a1b[,i1]-mode11}

summary(a1b)
a1c = a1b[,1:6]
colnames(a1c) <- c("Anger", "Disgust", "fear", "joy", "sadness", "surprise")
a1c[1:10,]
## -- see the first few tweets scoring on "joy" (for example) ---
a1c=as.data.frame(a1c);attach(a1c)
test = x[(joy != 0)]; test[1:10]
# for the top few tweets in "Anger" ---
test = x[(Anger != 0)]; test[1:10]
test = x[(sadness != 0)]; test[1:10]
test = x[(Disgust != 0)]; test[1:10]
test = x[(fear != 0)]; test[1:10]
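
To get a quick count of how many tweets register a non-zero score on each primary emotion after the mode-adjustment above, one extra line suffices (an add-on, not part of the original code):

# number of tweets with a non-zero score on each emotion dimension
colSums(a1c != 0)
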
That does it for text and sentiment analysis in R. Again, these were just exploratory forays. The serious manager can choose to invest in R's capabilities for delivering such analyses quickly and economically. Your exposure to this area now enables you to take that call as a manager tomorrow.

This is it for now. I will put up the hypothesis-testing R code in a separate blog post.

Sudhir
