Saturday, November 3, 2012

Session 6 HW & Project Announcement

Update - Project related Announcement:

Hi all,

I met a couple of project groups in the past two days who came seeking inputs and guidance for their projects. It was insightful to get a first-hand view of the student perspective on MKTR tools and methods, what they're looking to get out of the project, how some students are combining projects from other courses ('Pricing', for instance), and so on. For the record, I'm fine if you seek to leverage data already collected in other projects for the MKTR one, as long as the D.P. and the R.O.s have marketing substance to them.

I will be available during working hours every day (except Tue and Thu) from tomorrow to Sunday 11-Nov in my office, 2118, in case any group wants to run their project status by me and get some informal feedback and pointers for the way forward. No formal appointment is necessary, and it's fine if at least half the group shows up (not everybody's schedules may agree). Just call my office extn # 7106 and drop by if I'm in. Bringing a PPT (or printout) of your proposal and plans with you would certainly help.

Sudhir

#############################################

Hi all,

The homework for session 6, due 11-Nov Sunday noon, is described here.

I've put up some 86 user reviews of the Samsung Galaxy S3, pulled from Amazon, on LMS. The AAs aren't in yet and I'm not that familiar with LMS, so pls let me know if you have trouble accessing the dataset.

The code to execute the assignment is also put up here (it's a minor variation on the classwork code). You are strongly advised to first try replicating the classwork examples, available in this blog post, on your machine before attempting this one.

Your task is to use R to text-analyze the dataset. Figure out:
(i) What most people seem to be saying about the product, and thereby interpret a general 'sense' of the talk or buzz around it.

(ii) List what positive emotions seem associated with the S3, and thereby interpret what its strengths are. The business implications of such early word-of-mouth signals, instantaneous customer feedback, buzz etc. for positioning, branding, promotions, communications and other tools in the Mktg repertoire are easy to see.

(iii) List what negative associations seem to be around, ideate on what S3's plausible weaknesses are, and suggest how it could defend itself. The business importance of early-warning systems, damage assessment and speedy damage control is hard to miss.

Thus, this HW essentially asks you to do the following: from a business point of view, interpret the first few indications of online chatter surrounding the Samsung Galaxy S3 - a SWOT analysis of sorts. Such an activity would normally fall under the rubric of Mktg intelligence, perhaps. But tomorrow's world will likely blur the Mktg Intelligence-Mktg Research distinction anyway.

Here's the code for analysis:

## ##
## --- Sentiment mining the Samsung Galaxy S3 --- ##
## ##

# first load libraries
library(tm)
library(wordcloud)
library(Snowball) # for stemming; on newer setups, the SnowballC package replaces this

Now read in data from S3.csv directly.

# Paste reviews in .csv, one per row.
# Do Ctrl+H and replace all commas with blanks in the reviews.
# Now read in each review as 1 doc with scan() as shown below.

x=scan(file.choose(), sep=",",
dec = ".", what=character(),
na.strings = "NA",
strip.white = TRUE,
comment.char = "", allowEscapes = TRUE, flush = FALSE )
x1=Corpus(VectorSource(x)) # created Doc corpus
summary(x1)
If all went well, R will say in blue font "A corpus with 86 text documents". Good.
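If the count looks off, here's a quick sanity check (a sketch, using the x and x1 objects created above):

# how many reviews did scan() read in?
length(x) # should print 86
x[1] # eyeball the first raw review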

Next, we parse the text documents, remove stopwords (like "the" etc.), and add our own stopwords to the list on a contextual basis. For instance, 'samsung' and 'phone' would show up as the most frequently used terms. Duh, it's a Samsung phone review, after all. So it's not that informative to have these two terms occupy the top 2 slots.

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)

# adding 'phone' & 'samsung' to stopwords list
myStopwords <- c(stopwords('english'), "phone", "samsung")
x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)
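To verify the cleaning worked, you can peek at a processed review or two before building the matrix (a quick sketch on the same x1 object):

# eyeball a couple of cleaned, stemmed reviews
inspect(x1[1:2])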

OK. Now it's time to build a word-frequency matrix, sort it, get summaries like basic counts etc., and see which words top the frequency list using a barplot.

# --- make the doc-term matrix --- #
x1mat = DocumentTermMatrix(x1)

# --- drop sparse terms & sort the TermDoc matrix --- #
mydata = removeSparseTerms(x1mat, 0.99)
# (on newer tm versions, use as.matrix(mydata) in place of inspect(mydata))
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,]
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))]

# -- view frequencies of the top few terms --
colSums(mydata.df1) # term name & freq listed

# -- make barplot for term frequencies -- #
barplot(data.matrix(mydata.df1)[,1:10])
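If the stacked bars look cluttered, here's a sketch of a plainer one-bar-per-term variant (same data, just summed over documents; the colour is an arbitrary choice):

# one bar per term, term names on the x-axis
barplot(colSums(mydata.df1[ ,1:10]), las=2, col="steelblue")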

Barplots are passé, perhaps. So let's get some more detail and colour added: we'll make a wordcloud. Then, we use collocation analysis to see which words occur together most often in a 'typical' review. We view this as a 'collocation dendrogram'.

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1), scale=c(4, 0.5), colors=1:10)

# --- making dendrograms to visualize
# word-collocations --- #
min1 = min(mydata$ncol, 25) # use at most the top 25 words
test = matrix(0,min1,min1)
test1 = test
for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}

# make dissimilarity matrix out of the freq one
test2 = (max(test1)+1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]

# now plot the collocation dendrogram
d <- dist(test2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # note: newer R versions rename this method "ward.D"
plot(fit) # display dendrogram
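To read word-groups off the dendrogram more easily, you could draw boxes around them (a sketch; k=3 is just an assumed group count, pick whatever your plot suggests):

# draw boxes around k word-groups on the plotted dendrogram
rect.hclust(fit, k=3, border="red")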

OK, time to wade into sentiment analysis now. People are passionate about brands in certain categories, and mobile phones are pretty much up there on that list. Let's see the emotional-connect quotient of the reviewers.

So now, we will build wordlists of positive and negative terms, match the reviews' frequent terms with the wordlists and analyze the results.

### --- sentiment analysis --- ###

# read-in positive-words.txt
pos=scan(file.choose(), what="character", comment.char=";")

# read-in negative-words.txt
neg=scan(file.choose(), what="character", comment.char=";")

# including our own positive words to the existing list
pos.words=c(pos,"sleek", "slick", "light")

#including our own negative words
neg.words=c(neg,"wait", "heavy", "too")

# match() returns the position of the matched term or NA

pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")

# positive word cloud #
# know your strengths #
wordcloud(rownames(b1), b1[,1]*20, scale=c(8, 1), colors=1:10)

Well, so what is the S3 perceived to be strong on, in terms of emotional connect? And what about its weaknesses?

neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")

# negative word cloud #
# know your weak points #
wordcloud(rownames(b2), b2[,1]*20, scale=c(8, 1), colors=1:10)
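For a crude overall read on the buzz (Q (i) above), you could also tally total positive versus negative term frequencies. A sketch, using the b1 and b2 objects from above:

# rough net-sentiment tally
pos.freq = sum(b1$freq); neg.freq = sum(b2$freq)
pos.freq; neg.freq
round(pos.freq/(pos.freq + neg.freq), 2) # share of emotional talk that is positive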

At this point, one may ask, "Well, word clustering based on frequency is all fine. But can we also cluster people based on their emotional connect (as seen in their text output)?" Sure. Here goes.

# say we decide to use the top 30 most-frequent words
# to segment users into groups #

mydata.df2 = mydata.df1[,1:30]

# now plot a dendrogram clustering the reviews
d <- dist(mydata.df2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # again, "ward.D" on newer R versions
plot(fit) # display dendrogram
# tossup between 2 & 3 clusters

## -- clustering people through reviews -- ##

# Determine number of clusters #
wss <- (nrow(mydata.df2)-1)*sum(apply(mydata.df2,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata.df2,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #
# seems like elbow is at k=2
The elbow plot seems to suggest k1=2. If you get something else on your scree plot, choose that value for k1 and proceed.

Now, in order to characterize the segments in terms of their emotional text output, let us see what the top 3 words are for each segment and decide.

### for each cluster returns 3 most frequent terms ###

# k-means clustering of reviews
k <- 2
kmeansResult <- kmeans(mydata.df2, k)
kmeansResult$size # segment sizes
# cluster centers
round(kmeansResult$centers, digits=3)

## print cluster profile ##
for (i in 1:k) {
cat(paste("cluster ", i, ": ", sep=""))
s <- sort(kmeansResult$centers[i,], decreasing=T)
cat(names(s)[1:3], "\n")}
# print the words of every cluster
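To help characterize the segments, it may also help to read one raw review from each. A sketch, assuming the x and kmeansResult objects from above:

# print the first raw review assigned to each segment
for (i in 1:k) {
cat("--- sample review, cluster", i, "---\n")
print(x[which(kmeansResult$cluster == i)[1]]) }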
That's it. Pls do this, paste your output on a PPT and interpret the results. Answer the Qs above in bullet points, nothing fancy.

Pls feel free to ask around and take help from your peers. You're always welcome to approach me, the AAs or Ankit Anand with any queries. I'd prefer you use the blog's comments section, which reaches me fastest. I look forward to hearing your feedback on this and other HWs.

Sudhir

13 comments:

  1. Dear Professor,

    Although we added 'phone' as a stopword in the R code, its plural form, 'phones', is still showing up among the top 5 words by frequency when we run the analysis. Perhaps we should add 'phones' to the stopword list as well? Thanks.

    Regards,
    Harneet Chawla

    1. Hi Harneet,

      Yes. You can make your stopword-list as long as you want. It takes some iterations to filter out the chaff but that's part of the learning curve, I guess.
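      For instance, a sketch of an extended list (the extra terms here are just examples; re-run the removeWords step after editing it):

      myStopwords <- c(stopwords('english'), "phone", "phones", "samsung", "galaxy")
      x1 = tm_map(x1, removeWords, myStopwords)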

      Sudhir

  2. Dear Professor,
    It seems there is some error in the collocation dendrogram. From what I understand of the code, the word cloud should be restricted to 25 words, but it's throwing out so many! Also, R is throwing many errors saying certain objects are not found. Any tips, sir?

    1. Hi Swati,

      Could you let me know what the error messages say exactly? If they are warning() messages, ignore them.

      Also, the 25 limit is pretty tight. That part of the code was not executed for some reason, which is why you're getting a much larger word list. Again, copy-paste line by line and let me know what error messages you see for any given line, and I can debug it.

      Sudhir

    2. Thanks Professor,
      It worked when I copy-pasted line by line without the comments part... the object definitions might have been treated as comments the last time I ran the code.

  3. Prof

    While creating dendrograms, it gives an error:
    "Error in matrix(0, min1, min1) : object 'min1' not found", and similarly 'test1' not found.

    Could you please suggest what the solution might be?

    1. Hi Shashaank,

      Clearly, the 'min1' definition did not execute.

      Pls copy-paste the code, starting a few lines before the min1 line, line by line. Do not copy anything in comments (lines starting with a '#').

      It should work. I haven't heard about this issue from other folks yet.

      Sudhir

  4. Sir,

    I am having trouble "reading in" the positive and negative word lists from the text files. I keep getting an error: "Error in file.choose() : file choice cancelled".

    Could you please advise what I can do to rectify the issue?

    Thank you,

    Rahul

    1. Hi Rahul,

      I have not faced such an error before, and I have to wonder why it may be happening. Is yours a Mac, by any chance? In any case, could you ask your peers whether they have seen this too? I haven't heard of this particular problem yet.

      Sudhir

  5. Hi Professor,
    When I enter code:

    # --- making dendograms to visualize # word-collocations --- # min1 = min(mydata$ncol, 25) # find for top 25 words
    test = matrix(0,min1,min1)
    test1 = test
    for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
    test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
    test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}

    It gives me an error: objects 'min1' and 'test' not found.

    How should I rectify this?

    Thnx,
    Abhinav

    1. Hi Abhinav,

      I see the issue. Pls ensure there is no comment character ('#') in front of any line with executable code. Pls copy-paste the code line by line and do not copy the lines starting with a '#'.

      Should work. In an earlier revision, I changed the code and forgot to put in HTML line breaks, which is why the line folds and mixes comments with executable code.

      I hope you've not merely waited for me to reply and have instead taken help from your peers - borrowed their analysis output and written up your own interpretation, as it were.

      Sudhir

  6. Dear Professor,
    How can we plot a dendrogram for positive or negative words? This is important because customers might be talking about several products; positive or negative emotions could then belong to any of them. The above code only picks out the positive and negative emotions, without telling us which product they were used for.

    thanks & regards
    Abhishek

    1. Hi Abhishek,

      You're right. Machines don't 'get' context, sarcasm and all that, which is why human intervention will remain critical in this area. Machines can narrow down the field big time and cut much of the human grunt-work, though, e.g. by directing us to the 'typical', the 'interesting' or the extreme/outlier comments.

      The hope is that, when processing many people's text output, such issues of sarcasm, compound sentences, mixed negative and positive words, etc. will average out.

      Sudhir

