Wednesday, October 31, 2012

Session 6 R code

Hi all,

R helps qualitative analysis by processing unstructured text into usable output. I hope you'll appreciate just how much time and hassle is saved by the simple application of R to the text analysis problem. We'll see two big examples in class -

(1) processing open-ended survey responses in a large dataset to reveal which terms are used most frequently (as a word cloud) and which sets of terms occur together (as collocation dendrograms). We will use data from the Ice-cream survey (dataset put up on LMS) for this part.

(2) processing consumers' text output (think of product or movie reviews) to yield some basic measures of the "emotional" content and direction - whether positive or negative (also called 'valence' in academic research parlance) - of the consumer's response. We'll use some product reviews downloaded from Amazon and some movie reviews downloaded from Times of India for this one. Again, datasets will be put up on LMS.

1. Structuring text analysis in R

The question was "Q.20. If Wows offered a line of light ice-cream, what flavors would you want to see? Please be as specific as possible." The responses are laid out as a column in an Excel sheet, with each cell holding one person's response. There was no limit to how much text one could write in response to the question.

First, before we begin, load these libraries - 'tm', 'Snowball' and 'wordcloud'. Ensure these packages are installed and ready on your machine.

library(tm)
library(Snowball)
library(wordcloud)
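
If any of these aren't on your machine yet, something along these lines should fetch them from CRAN first (a one-time step; note that on newer R/tm setups the stemmer ships as 'SnowballC' rather than 'Snowball', so treat that line as version-dependent):

install.packages(c("tm", "wordcloud")) # one-time install; needs a CRAN mirror
install.packages("Snowball") # on newer R/tm versions, try "SnowballC" instead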

Next, copy the relevant column from the Excel sheet and save the unstructured text as a plain-text (notepad) file. The relevant dataset is Q20.txt in the Session 6 folder on LMS.

x=readLines(file.choose()) # reads the file
x1=Corpus(VectorSource(x)) # creates Doc corpus

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeWords, stopwords("english"))
x1 = tm_map(x1, stemDocument)
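
To get a feel for what these cleanup steps actually do, here is a tiny illustration on a made-up response (the sentence is mine, not from the dataset):

# toy example: watch the cleanup steps in action
toy = Corpus(VectorSource("I'd LOVE a Mango flavor, please!!"))
toy = tm_map(toy, stripWhitespace)
toy = tm_map(toy, tolower)
toy = tm_map(toy, removePunctuation)
toy = tm_map(toy, removeWords, stopwords("english"))
toy = tm_map(toy, stemDocument)
inspect(toy) # should show something like "id love mango flavor pleas"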

# make the doc-term matrix #
x1mat = DocumentTermMatrix(x1)
The above may take 1-2 minutes, so please be patient.
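
Once it's done, a quick sanity check on what came out doesn't hurt:

dim(x1mat) # no. of docs x no. of terms
inspect(x1mat[1:5, 1:5]) # peek at a corner of the matrix (assumes at least 5 docs & terms)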

Let's see which terms occur most frequently in the open-endeds for that question - "see" as in visualize them, not just tabulate them.

mydata = removeSparseTerms(x1mat,0.99) # removes sparse entries
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,] # to data frame; view first 10 rows
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))] # sort terms by frequency, descending
# view frequencies of the top few terms
colSums(mydata.df1) # term name & freq listed
# make barplot for term frequencies #
barplot(data.matrix(mydata.df1)[,1:8], horiz=TRUE)
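
The barplot above stacks the per-document counts within each term's bar. If you'd rather see one clean bar per term, a small variant on the same objects does the trick:

# variant: one bar per term, straight from the column sums
barplot(colSums(mydata.df1)[1:8], horiz=TRUE, las=1)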

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1), scale=c(4, 0.5), colors=1:10)
The first image here is of a horizontal bar chart depicting relative frequencies of term occurrence.

The second image depicts a word-cloud of the term frequencies. Two different ways of seeing the same thing.

The time has come to find collocations (sometimes called co-locations) of terms in the data. The basic task is "find which words occur together most often in consumers' responses".

# making dendrograms to visualize word collocations
min1 = min(mydata$ncol, 25) # cap the analysis at the top 25 terms
test = matrix(0,min1,min1)
test1 = test
for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
test = mydata.df1[ ,i1]*mydata.df1[ ,i2] # within-doc co-occurrence of terms i1 & i2
test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }} # symmetric co-occurrence matrix
# make dissimilarity matrix out of the freq one
test2 = (max(test1)+1) - test1 # high co-occurrence --> small distance
rownames(test2) <- colnames(mydata.df1)[1:min1]
# now plot the collocation dendrogram
d <- dist(test2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # Ward's method (renamed "ward.D" in newer R versions)
plot(fit) # display dendrogram
And the resulting dendrogram looks like this:
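
By the way, a dendrogram is read by cutting the tree at some height. If you want R to hand you the groupings directly, a sketch like this works on the fit object above (the choice of 5 clusters is arbitrary, just for illustration):

groups = cutree(fit, k=5) # assign each term to one of 5 clusters
groups # view the term-to-cluster assignments
rect.hclust(fit, k=5, border="red") # draw the cluster boxes on the dendrogram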
I'll incorporate the second part later today. For now, this is it.

Elementary Sentiment Analysis

Update: The second part - rudimentary sentiment analysis - is in. I don't want to overly raise expectations, since what's done here is fairly basic. But it's fairly useful in multiple MKTR contexts, and very much extensible in multiple ways on R.

Recall the written feedback you gave in session 2, rating the MKTR session. Well, I've managed to make a soft copy of your comments, and that is the input dataset here. So now I'll try to mine the sentiment behind your MKTR assessment comments. This dataset is available as student_feedback.txt in the 'Session 6' folder on LMS. Please try this at home.

1. Step one is always, load the libraries needed.

library(tm)
library(wordcloud)
library(Snowball)
Next, read in the data. The data were originally in an Excel sheet, with each student's comments in one cell and all the comments forming one 68x1 column. I copied them into a plain-text (notepad) file, which will be put up on LMS for you to practice the following code with.
# read-in file first #
x = readLines(file.choose())
summary(x) # view summary
# create Doc corpus
x1=Corpus(VectorSource(x))
So we create what is called a 'corpus' of documents - done when the text input comes from multiple people and each person's output can be treated as a separate 'document'.
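
To confirm the read-in went fine, peek inside the corpus:

length(x1) # no. of documents; should match the no. of respondents
inspect(x1[1:2]) # view the first two comments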

2. Read-in the emotional-content word lists.

There are generic lists compiled by different research teams (in different contexts) of emotion-laden words. We'll use a general list for now. The lists are provided as plain-text files. Read them into R as follows:

# Sentiment word lists (positive vs negative)
# downloaded opinion lexicon from
# "http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html"


# read-in positive-words.txt
pos=scan(file.choose(), what="character", comment.char=";")

# read-in negative-words.txt
neg=scan(file.choose(), what="character", comment.char=";")
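
It's worth eyeballing what got read in before going further:

length(pos); head(pos) # no. of positive words & the first few of them
length(neg); head(neg) # ditto for the negative words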

3. Add additional emotional-words as required

Since the words used to verbalize emotions (or 'affect', as psychologists call it) depend a lot on context, we have the freedom to add our own context-specific emotional words to the list. E.g., "sweet" may be a positive word in a chocolate-chip context but not so much in a potato-chips context (just as an example).

# including our own positive words to the existing list
pos.words=c(pos,"wow", "kudos", "hurray")

#including our own negative words
neg.words=c(neg,"wait", "waiting", "too")

4. Clean up the text of irrelevant words, blank spaces and the like.

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeWords, stopwords("english"))
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix #
mydata = DocumentTermMatrix(x1)

mydata.df <- as.data.frame(inspect(mydata)); mydata.df[1:10,] # to data frame; view first 10 rows
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))] # sort terms by frequency, descending

5. Now extract the most frequently used emotional words and plot them in a wordcloud.

# match() returns the position of the matched term or NA

pos.matches = match(colnames(mydata.df1), pos.words) # position of each term in the positive list
pos.matches = !is.na(pos.matches) # TRUE if the term is a positive word
b1 = colSums(mydata.df1)[pos.matches] # frequencies of the positive terms found
b1 = as.data.frame(b1)
colnames(b1) = c("freq")
neg.matches = match(colnames(mydata.df1), neg.words) # same drill for the negative list
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")
b = rbind(b1,b2) # stack positive & negative term frequencies

wordcloud(rownames(b), b[,1]*20, scale=c(5, 1), colors=1:10) # x20 scaling (likely) just helps low counts clear wordcloud's default min.freq cutoff
Now this is what the class told me in terms of their feedback.

Sure, one can do downstream analysis on this - assign positivity or negativity scores to each comment, categorize the emotion more finely (not just in positive/negative terms but in more detail: joy, satisfaction, anger etc.). One can think of clustering respondents based on their scores along different emotional dimensions. Think of the collocation analysis possibilities that arise, etc.
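
As a teaser for that downstream path, here's a minimal sketch of a net sentiment score per comment, built from the same objects as above (positive word count minus negative word count per row - crude, but a start):

# minimal sketch: net sentiment score per comment
pos.cols = !is.na(match(colnames(mydata.df1), pos.words))
neg.cols = !is.na(match(colnames(mydata.df1), neg.words))
score = rowSums(mydata.df1[, pos.cols, drop=FALSE]) - rowSums(mydata.df1[, neg.cols, drop=FALSE])
summary(score) # distribution of comment-level sentiment
hist(score) # quick visual check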

All this is possible and very much doable in R. Think of the applications in terms of product reviews, recommendation systems, mining social media for information, measuring "buzz" etc. We'll continue down that path a little more in Session 9 - Emerging trends in MKTR.

Okay then, that's it for now. See you in class. Ciao.

Sudhir
