R helps qualitative analysis by processing at least some of the unstructured text into usable output. I hope you'll appreciate just how much time and hassle is saved by the simple application of R to the text-analysis problem. We'll see two big examples in class -
(1) processing open-ended survey responses in a large dataset to reveal which terms are used most frequently (as a word cloud) and which sets of terms occur together (as collocation dendrograms). We will use data from the Ice-cream survey (dataset put up on LMS) for this part.
(2) processing consumers' text output (think of product or movie reviews) to yield some basic measures of its "emotional" content and direction - whether positive or negative (also called 'valence' in academic research parlance). We'll use some product reviews downloaded from Amazon and some movie reviews downloaded from the Times of India for this one. Again, the datasets will be put up on LMS.
1. Structuring text analysis in R
The question was: "Q.20. If Wows offered a line of light ice-cream, what flavors would you want to see? Please be as specific as possible." The responses are laid out as a column in an Excel sheet, with each cell holding one person's response. There was no limit on how much text one could write in response to the question.
First, before we begin, load these libraries - 'tm', 'Snowball' and 'wordcloud'. Ensure these packages are installed and ready on your machine.
library(tm)
library(Snowball)
library(wordcloud)
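If any of these aren't on your machine yet, a one-time install along these lines should do it. (One caveat: on newer R setups the stemming package on CRAN is SnowballC rather than Snowball, so adjust the name if the install complains.)

# one-time install, if the packages aren't already present
install.packages(c("tm", "Snowball", "wordcloud"))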
Next, copy the relevant columns from the Excel sheet and save the unstructured text as a plain-text (Notepad) file. The relevant dataset is Q20.txt in the session 6 folder on LMS.
x = readLines(file.choose())    # read the file
x1 = Corpus(VectorSource(x))    # create Doc corpus

# standardize the text - remove blanks, uppercase,
# punctuation, English connectors etc.
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeWords, stopwords("english"))
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix
x1mat = DocumentTermMatrix(x1)
Let's "see" which terms occur most frequently in the open-endeds for that question - "see" as in visualize them, not just tabulate them.
mydata = removeSparseTerms(x1mat, 0.99)    # removes sparse entries
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,]
mydata.df1 = mydata.df[, order(-colSums(mydata.df))]

# view frequencies of the top few terms
colSums(mydata.df1)    # term name & freq listed

# make barplot for term frequencies
barplot(data.matrix(mydata.df1)[, 1:8], horiz = TRUE)

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1), scale = c(4, 0.5), colors = 1:10)
The first image is the barplot, the second a word-cloud of the same term frequencies - two different ways of seeing the same thing.
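If you also want a plain tabulation to go with the plots, tm's findFreqTerms() gives one. A quick check, reusing x1mat from above; the cutoff of 5 is just an illustrative choice:

# list all terms appearing 5 or more times
findFreqTerms(x1mat, lowfreq = 5)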
The time has come to find collocations (sometimes written co-locations) of terms in the data. The basic task is: "find which words occur together most often in consumers' responses".
# make dendrograms to visualize word collocations
min1 = min(mydata$ncol, 25)
test = matrix(0, min1, min1)
test1 = test
for (i1 in 1:(min1 - 1)) {
  for (i2 in i1:min1) {
    test = mydata.df1[, i1] * mydata.df1[, i2]
    test1[i1, i2] = sum(test)
    test1[i2, i1] = test1[i1, i2]
  }
}

# make a dissimilarity matrix out of the frequency one
test2 = (max(test1) + 1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]

# now plot the collocation dendrogram
d <- dist(test2, method = "euclidean")    # distance matrix
fit <- hclust(d, method = "ward")
plot(fit)    # display dendrogram
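As a cross-check on the dendrogram, tm also has a built-in association finder, findAssocs(), which lists terms whose occurrence correlates with a query term. A small sketch - the query "chocol" is just a hypothetical stemmed term (the stem of "chocolate", plausible for ice-cream data), and 0.3 an arbitrary correlation cutoff:

# terms whose occurrence correlates (>= 0.3) with the stemmed query term
findAssocs(x1mat, "chocol", 0.3)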
I'll incorporate the second part later today. For now, this is it.
2. Elementary sentiment analysis in R
Update: The second part - rudimentary sentiment analysis - is in. I don't want to overly raise expectations, since what's done here is fairly basic. But it is useful in multiple MKTR contexts, and very much extensible in multiple ways on R.
Recall the written feedback you gave in session 2 rating the MKTR session. Well, I've managed to make a soft copy of your comments, and that is the input dataset here. So now, I'll try to mine the sentiment behind your MKTR assessment comments. This dataset is available as student_feedback.txt in the 'Session 6' folder on LMS. Please try this at home.
1. Step one, as always: load the needed libraries.
library(tm)
library(wordcloud)
library(Snowball)
# read-in file first
x = readLines(file.choose())
summary(x)    # view summary

# create Doc corpus
x1 = Corpus(VectorSource(x))
2. Read-in the emotional-content word lists.
There are generic lists of emotion-laden words, compiled by different research teams in different contexts. We'll use a general list for now. The lists are provided as plain-text files. Read them into R as follows:
# Sentiment words (positive vs negative);
# opinion lexicon downloaded from
# "http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html"

# read-in positive-words.txt
pos = scan(file.choose(), what = "character", comment.char = ";")

# read-in negative-words.txt
neg = scan(file.choose(), what = "character", comment.char = ";")
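A quick sanity check that the read-in worked never hurts; the exact counts will depend on the lexicon version you downloaded:

length(pos); head(pos)    # how many positive words, and a peek at the first few
length(neg); head(neg)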
3. Add additional emotion words as required
Since the words used to verbalize emotions (or "affect", as psychologists call it) depend a lot on context, we have the freedom to add our own context-specific emotional words to the list. E.g., "sweet" may be a positive word in a chocolate-chips context but not so much in a potato-chips context.
# include our own positive words in the existing list
pos.words = c(pos, "wow", "kudos", "hurray")

# include our own negative words
neg.words = c(neg, "wait", "waiting", "too")
4. Clean up the text: remove irrelevant words, blank spaces and the like.
# standardize the text - remove blanks, uppercase,
# punctuation, English connectors etc.
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeWords, stopwords("english"))
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix
mydata = DocumentTermMatrix(x1)
mydata.df <- as.data.frame(inspect(mydata)); mydata.df[1:10,]
mydata.df1 = mydata.df[, order(-colSums(mydata.df))]
5. Now extract the most frequently used emotional words and plot them in a wordcloud.
# match() returns the position of the matched term, or NA
pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")

neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")

b = rbind(b1, b2)
wordcloud(rownames(b), b[, 1] * 20, scale = c(5, 1), colors = 1:10)
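One small design note: the b[,1]*20 multiplier in the wordcloud() call simply inflates the raw frequencies - presumably small for a single class's comments - so the words render at a visible size; the relative sizes across words are unchanged.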
The resulting word-cloud shows what the class told me in its feedback.
Sure, one can do downstream analysis on this - assign positivity or negativity scores to each comment (a minimal sketch of this follows below), categorize the emotion more finely - not just in positive/negative terms but in more detail: joy, satisfaction, anger etc. One can think of clustering respondents based on their scores along different emotional dimensions, of the collocation-analysis possibilities that arise, and so on.
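To make the first of these concrete, here is a minimal sketch of comment-level scoring - positive word count minus negative word count per response - reusing mydata.df and the word lists built above. (It assumes the row order of mydata.df matches the line order of x, which it does here since the corpus was built straight from readLines().)

# flag which doc-term-matrix columns are positive / negative words
pos.cols = !is.na(match(colnames(mydata.df), pos.words))
neg.cols = !is.na(match(colnames(mydata.df), neg.words))

# per-comment valence score = (positive word count) - (negative word count)
score = rowSums(mydata.df[, pos.cols, drop = FALSE]) -
        rowSums(mydata.df[, neg.cols, drop = FALSE])

summary(score)                   # distribution of comment-level valence
x[which(score == max(score))]    # view the most positive comment(s)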
All this is possible and very much doable in R. Think of the applications: product reviews, recommendation systems, mining social media for information, measuring "buzz" etc. We'll continue down that path a little more in Session 9 - Emerging trends in MKTR.
Chalo, dassit for now. See you in class. Ciao.
Sudhir