Session 7 has the following modules:
- Continuing Qualitative MKTR from Session 6, the first sub-module covers Focus Group Discussions (FGDs). A short video 'reading' accompanies it.
- The second sub-module in Qualitative MKTR, Analysis of Unstructured Text, comes next. The R code below deals with elementary text mining in R.
- The third module deals with building and testing hypotheses in MKTR. We'll use R's statistical abilities to run quick tests on two major classes of hypotheses.
1. Elementary Text Mining in R
First, load the required libraries as shown below (install them first if you haven't already) and read in the data (file 'Q25.txt' in the Session 7 folder on LMS).
library(tm)
library(Snowball)
library(wordcloud)
library(sentiment)

### --- code for basic text mining --- ###
# first, read in the data from 'Q25.txt'
x = readLines(file.choose())
x1 = Corpus(VectorSource(x))
Now run the following code to process the unstructured text and obtain a document-term matrix from it. The output obtained is shown below.
# standardize the text - remove blanks, uppercase,
# punctuation, English connectors etc.
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeNumbers)
# add 'ice' and 'cream' to the standard English stopword list and remove them
myStopwords <- c(stopwords('english'), "ice", "cream")
x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)

# make the document-term matrix
x1mat = DocumentTermMatrix(x1)

# --- sort the document-term matrix --- #
# remove sparse entries
mydata = removeSparseTerms(x1mat, 0.998)
dim(mydata.df <- as.data.frame(inspect(mydata)))
mydata.df[1:10, ]
# order columns by decreasing term frequency
mydata.df1 = mydata.df[, order(-colSums(mydata.df))]
dim(mydata.df1)
mydata.df1[1:10, 1:10]   # view 10 rows & 10 cols
# view frequencies of the top few terms
colSums(mydata.df1)   # term name & freq listed
2. Making Wordclouds
Wordclouds are a useful way to visualize the relative frequencies of words in the corpus (word size is proportional to frequency). The colors of the words are assigned at random, though.
# make a wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1)*10,
          scale = c(4, 0.5), colors = 1:10)
It would be more useful to see which words are used together most often in a document. For example, 'butter' could mean anything, from 'butter scotch' to 'butter pecan' to 'peanut butter'. To find which pairs of words occur together most often, I tweaked the R code a bit, and the code below produces a 'collocation dendrogram':
# --- making dendrograms to visualize word collocations --- #
min1 = min(mydata$ncol, 25)   # use at most the top 25 terms
test = matrix(0, min1, min1)
test1 = test
for (i1 in 1:(min1-1)) {
  for (i2 in i1:min1) {
    test = mydata.df1[, i1] * mydata.df1[, i2]   # co-occurrence counts
    test1[i1, i2] = sum(test)
    test1[i2, i1] = test1[i1, i2]
  }
}
# make a dissimilarity matrix out of the co-occurrence frequency matrix
test2 = (max(test1) + 1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]
# now plot the collocation dendrogram
d <- dist(test2, method = "euclidean")   # distance matrix
fit <- hclust(d, method = "ward")
plot(fit)   # display dendrogram
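As an aside (my addition, not part of the session code), tm's findAssocs() offers a quick cross-check on the dendrogram: it lists the terms whose document-level counts correlate with a given term above a chosen cutoff. The term 'butter' and the 0.2 cutoff below are just illustrative choices.

# aside (not part of the session code): terms correlated with 'butter'
# in the document-term matrix, above an illustrative 0.2 cutoff
findAssocs(x1mat, "butter", 0.2)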
It is important not to lose sight of the weaknesses and problems that dog current text mining capabilities.
- For instance, whether someone writes 'not good' or just 'good', the text miner picks up 'good' in both cases, since negation words are dropped along with the other stopwords. So 'meaning' per se is lost on the miner; see the toy sketch after this list.
- The text miner is notoriously poor at picking up wit, sarcasm, exaggerations and the like. Again, the 'meaning' part is lost on the miner.
- Typos etc. can also play havoc. Synonyms can cause trouble too in some contexts: e.g., some respondents may say 'complex' while others say 'complicated' to refer to the same attribute, and these will appear as separate terms in the analysis.
- So text mining is more useful as an exploratory tool, to point out qualitatively what topics and words appear to weigh most on respondent minds. It helps downstream analysis by providing inputs for hypothesis building, for more in-depth investigation later and so on.
- It *is* important that the more interesting comments, opinions etc be manually checked before arriving at any conclusions.
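To make the first point above concrete, here is a toy sketch (my own two-document example, not from the session data) showing how the negation word gets stripped along with the stopwords, so both documents end up counting 'good':

# toy illustration (hypothetical mini-corpus): 'not' is an English stopword,
# so the negation is stripped before terms are counted
toy = c("the service was good", "the service was not good")
toy1 = Corpus(VectorSource(toy))
toy1 = tm_map(toy1, removeWords, stopwords('english'))
inspect(DocumentTermMatrix(toy1))   # both rows show a count of 1 for 'good'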
3. Sentiment Mining of Twitter Data
The twitteR package showcases R's social media capabilities well. It allows you to search for particular keywords in particular geographic areas (cities, for example). Thus, you could compare the response to the movie 'Talaash' in Delhi versus, say, in Hyderabad.
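A rough sketch of what such a search might look like is below (assuming you have already registered Twitter API credentials with the twitteR package; the keys, coordinates and tweet count are placeholders, not the values used for the class dataset):

# sketch only: pulling geo-restricted tweets with twitteR (placeholder values)
library(twitteR)
# setup_twitter_oauth("api_key", "api_secret", "access_token", "access_secret")   # your own keys
# geocode = 'latitude,longitude,radius' restricts the search to an area (here, roughly London)
tweets = searchTwitter("#skyfall", n = 500, geocode = "51.51,-0.13,25mi")
x = sapply(tweets, function(t) t$getText())   # extract the tweet text
x1 = Corpus(VectorSource(x))                  # then proceed with the text-mining steps above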
For our classwork exercise, I am using Twitter data on #skyfall from London over the weekend following the movie's release. Read the data in and run the text-mining R code as we did above. Only then implement basic sentiment analysis on the tweets, as follows:
#######################################
### --- sentiment mining block --- ###
#######################################
# run this after the text-analysis steps above

### --- sentiment analysis --- ###
# read in positive-words.txt
pos = scan(file.choose(), what = "character", comment.char = ";")
# read in negative-words.txt
neg = scan(file.choose(), what = "character", comment.char = ";")
# add our own positive words to the existing list
pos.words = c(pos, "wow", "kudos", "hurray")
neg.words = c(neg)

# match() returns the position of the matched term or NA
pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")
wordcloud(rownames(b1), b1[, 1], scale = c(5, 1), colors = 1:10)   # positive-term wordcloud

neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")
wordcloud(rownames(b2), b2[, 1], scale = c(5, 1), colors = 1:10)   # negative-term wordcloud
4. Determining Sentiment Polarities
Can we measure 'how much' emotional content or intensity a tweet or comment contains? At least at the ordinal level, perhaps. The 'sentiment' package offers a way to measure sentiment polarity in terms of the log-likelihood of a comment being of one polarity versus another. This can be a useful first step in basic sentiment analysis.
##########################################
### --- sentiment mining block II --- ###
##########################################
### --- inspect only those tweets which got --- ###
### --- a clear sentiment orientation --- ###
a1 = classify_emotion(x1)
a2 = x[(!is.na(a1[, 7]))]   # 447 of the 1566 tweets had clear polarity
# a3 = PlainTextDocument(a2)
a2[1:10]

# what is the polarity of each tweet?
# that is, what's the ratio of positive to negative content?
b1 = classify_polarity(x1)
dim(b1)
b1[1:5, ]   # view polarities table
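A small follow-up I find handy (my addition, assuming the fourth column of classify_polarity()'s output holds the 'BEST_FIT' polarity label, as in the sentiment package's documentation): tally how many tweets fall into each overall polarity class.

# my addition: tally tweets by the overall polarity class assigned by classify_polarity()
# (assumes column 4 of b1 holds the 'BEST_FIT' label)
table(b1[, 4])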
5. Determining Sentiment Dimensions
Can we do more than just sentiment polarities? Can we get more specific about which primary emotion dominates a particular tweet, opinion or comment? It turns out the sentiment package in R provides one way to do this. How well established or usable it is in a given context is a matter of caveat emptor.
# build the sentiment type-score matrix
a1a = data.matrix(as.numeric(a1))
a1b = matrix(a1a, nrow(a1), ncol(a1))
a1b[1:4, ]   # view a few rows

# recover and remove the mode (baseline) values
mode1 <- function(x) { names(sort(-table(x)))[1] }
for (i1 in 1:6) {   # for the 6 primary emotion dimensions
  mode11 = as.numeric(mode1(a1b[, i1]))
  a1b[, i1] = a1b[, i1] - mode11
}
summary(a1b)
a1c = a1b[, 1:6]
colnames(a1c) <- c("Anger", "Disgust", "fear", "joy", "sadness", "surprise")
a1c[1:10, ]

## --- see the top few tweets in "joy" (for example) --- ##
a1c = as.data.frame(a1c); attach(a1c)
test = x[(joy != 0)]; test[1:10]
# and likewise for the other emotion dimensions
test = x[(Anger != 0)]; test[1:10]
test = x[(sadness != 0)]; test[1:10]
test = x[(Disgust != 0)]; test[1:10]
test = x[(fear != 0)]; test[1:10]
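As a quick follow-up (my addition), one can also tally how many tweets register a non-baseline score on each primary emotion dimension before reading individual tweets:

# my addition: count tweets with a non-zero (non-baseline, after mode removal)
# score on each of the six primary emotion dimensions
colSums(a1c != 0)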
This is it for now. Will put up the hypothesis-testing related R code in a separate blog post.
Sudhir