Saturday, November 3, 2012

Session 6 HW & Project Announcement

Update - Project related Announcement:

Hi all,

I met a couple of project groups in the past two days who came seeking inputs and guidance for their projects. It was insightful to get a first-hand view of the student perspective on MKTR tools and methods, what they're looking to get out of the project, how some students are combining projects from other courses ('Pricing', for instance), and so on. For the record, I'm fine if you seek to leverage data already collected in other projects for the MKTR one, as long as the D.P. and the R.O.s have marketing substance to them.

I will be available during working hours every day (except Tue and Thu) from tomorrow to Sunday 11-Nov in my office, 2118, in case any group wants to run their project status by me and get some informal feedback and pointers for the way forward. No formal appointment is necessary, and it's fine if at least half the group shows up (not everybody's schedules may agree). Just call my office extn # 7106 and drop by if I'm in. Bringing a PPT (or printout) of your proposal and plans with you would certainly help.

Sudhir

#############################################

Hi all,

The homework for session 6, due 11-Nov Sunday noon, is described here.

I've put up some 86 user reviews of the Samsung Galaxy S3, pulled from Amazon, on LMS. The AAs aren't in yet and I'm not that familiar with LMS, so pls let me know if you have trouble accessing the dataset.

The code to execute the assignment is also put up here (it's a minor variation on the classwork code). You are strongly advised to first try replicating the classwork examples, available in this blog post, on your machine before attempting this one.

Your task is to use R to text-analyze the dataset. Figure out:
(i) What most people seem to be saying about the product, and thereby interpret a general 'sense' of the talk or buzz around it.

(ii) List what positive emotions seem associated with the S3, and thereby interpret what its strengths are. The business implications of such early word-of-mouth signals, instantaneous customer feedback, buzz etc. for positioning, branding, promotions, communications and other tools in the Mktg repertoire are easy to see.

(iii) List what negative associations seem to be around, ideate on what S3's plausible weaknesses are, and suggest how it could defend itself. The business importance of early-warning systems, damage assessment and speedy damage control is hard to miss.

Thus, this HW essentially asks you to do the following: from a business point of view, interpret the first few indications of online chatter surrounding the Samsung Galaxy S3 - a SWOT analysis of sorts. Such an activity would normally fall under the rubric of Mktg intelligence, perhaps. But tomorrow's world will likely blur the Mktg Intelligence-Mktg Research distinction anyway.

Here's the code for analysis:

## ##
## --- Sentiment mining the Samsung Galaxy S3 --- ##
## ##

# first load libraries
library(tm)
library(wordcloud)
library(Snowball) # for stemming; on newer setups, the SnowballC package replaces this

Now read in data from S3.csv directly.

# Paste reviews in .csv, one per row.
# Do Ctrl+H and replace all commas with blanks in the reviews.
# Now read in each review as 1 doc with scan() as shown below.

x=scan(file.choose(), sep=",",
dec = ".", what=character(),
na.strings = "NA",
strip.white = TRUE,
comment.char = "", allowEscapes = TRUE, flush = FALSE )
x1=Corpus(VectorSource(x)) # created Doc corpus
summary(x1)
If all went well, R will say in blue font "A corpus with 86 text documents". Good.
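If the count looks off, here's a quick sanity check (a sketch, using the x and x1 objects created above):

# how many reviews did scan() read in?
length(x) # should print 86
x[1] # eyeball the first raw review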

Next, we parse the text documents, remove stopwords (like "the" etc.), and add our own stopwords to the list on a contextual basis. For instance, 'samsung' and 'phone' would show up as the most frequently used terms. Duh, it's a Samsung phone review, after all. So it's not that informative to have these two terms occupy the top 2 slots.

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)

# adding 'phone' & 'samsung' to stopwords list
myStopwords <- c(stopwords('english'), "phone", "samsung")
x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)
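To verify the cleaning worked, you can peek at a processed review or two before building the matrix (a quick sketch on the same x1 object):

# eyeball a couple of cleaned, stemmed reviews
inspect(x1[1:2])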

OK. Now it's time to build a word-frequency matrix, sort it, get summaries like basic counts etc., and see which words top the frequency list using a barplot.

# --- make the doc-term matrix --- #
x1mat = DocumentTermMatrix(x1)

# --- drop sparse terms & sort the TermDoc matrix --- #
mydata = removeSparseTerms(x1mat, 0.99)
# (on newer tm versions, use as.matrix(mydata) in place of inspect(mydata))
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,]
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))]

# -- view frequencies of the top few terms --
colSums(mydata.df1) # term name & freq listed

# -- make barplot for term frequencies -- #
barplot(data.matrix(mydata.df1)[,1:10])
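If the stacked bars look cluttered, here's a sketch of a plainer one-bar-per-term variant (same data, just summed over documents; the colour is an arbitrary choice):

# one bar per term, term names on the x-axis
barplot(colSums(mydata.df1[ ,1:10]), las=2, col="steelblue")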

Barplots are passé, perhaps. So let's get some more detail and colour added: we'll make a wordcloud. Then, we use collocation analysis to see which words occur together most often in a 'typical' review. We view this as a 'collocation dendrogram'.

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1), scale=c(4, 0.5), colors=1:10)

# --- making dendrograms to visualize
# word-collocations --- #
min1 = min(mydata$ncol, 25) # use at most the top 25 words
test = matrix(0,min1,min1)
test1 = test
for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}

# make dissimilarity matrix out of the freq one
test2 = (max(test1)+1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]

# now plot the collocation dendrogram
d <- dist(test2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # note: newer R versions rename this method "ward.D"
plot(fit) # display dendrogram
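To read word-groups off the dendrogram more easily, you could draw boxes around them (a sketch; k=3 is just an assumed group count, pick whatever your plot suggests):

# draw boxes around k word-groups on the plotted dendrogram
rect.hclust(fit, k=3, border="red")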

OK, time to wade into sentiment analysis now. People are passionate about brands in certain categories, and mobile phones are pretty much up there on that list. Let's see the emotional-connect quotient of the reviewers.

So now, we will build wordlists of positive and negative terms, match the reviews' frequent terms with the wordlists and analyze the results.

### --- sentiment analysis --- ###

# read-in positive-words.txt
pos=scan(file.choose(), what="character", comment.char=";")

# read-in negative-words.txt
neg=scan(file.choose(), what="character", comment.char=";")

# including our own positive words to the existing list
pos.words=c(pos,"sleek", "slick", "light")

#including our own negative words
neg.words=c(neg,"wait", "heavy", "too")

# match() returns the position of the matched term or NA

pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")

# positive word cloud #
# know your strengths #
wordcloud(rownames(b1), b1[,1]*20, scale=c(8, 1), colors=1:10)

Well, so what is the S3 perceived to be strong on, in terms of emotional connect? And what about its weaknesses?

neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")

# negative word cloud #
# know your weak points #
wordcloud(rownames(b2), b2[,1]*20, scale=c(8, 1), colors=1:10)
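For a crude overall read on the buzz (Q (i) above), you could also tally total positive versus negative term frequencies. A sketch, using the b1 and b2 objects from above:

# rough net-sentiment tally
pos.freq = sum(b1$freq); neg.freq = sum(b2$freq)
pos.freq; neg.freq
round(pos.freq/(pos.freq + neg.freq), 2) # share of emotional talk that is positive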

At this point, one may ask, "Well, word clustering based on frequency is all fine. But can we also cluster people based on their emotional connect (as seen in their text output)?" Sure. Here goes.

# say we decide to use the top 30 most-frequent words
# to segment users into groups #

mydata.df2 = mydata.df1[,1:30]

# now plot a dendrogram clustering the reviews
d <- dist(mydata.df2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # again, "ward.D" on newer R versions
plot(fit) # display dendrogram
# tossup between 2 & 3 clusters

## -- clustering people through reviews -- ##

# Determine number of clusters #
wss <- (nrow(mydata.df2)-1)*sum(apply(mydata.df2,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata.df2,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #
# seems like elbow is at k=2
The elbow plot seems to suggest k1=2. If you get something else on your scree plot, choose that value for k1 and proceed.

Now, in order to characterize the segments in terms of their emotional text output, let us see what the top 3 words are for each segment and decide.

### for each cluster returns 3 most frequent terms ###

# k-means clustering of reviews
k <- 2
kmeansResult <- kmeans(mydata.df2, k)
kmeansResult$size # segment sizes
# cluster centers
round(kmeansResult$centers, digits=3)

## print cluster profile ##
for (i in 1:k) {
cat(paste("cluster ", i, ": ", sep=""))
s <- sort(kmeansResult$centers[i,], decreasing=T)
cat(names(s)[1:3], "\n")}
# print the words of every cluster
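To help characterize the segments, it may also help to read one raw review from each. A sketch, assuming the x and kmeansResult objects from above:

# print the first raw review assigned to each segment
for (i in 1:k) {
cat("--- sample review, cluster", i, "---\n")
print(x[which(kmeansResult$cluster == i)[1]]) }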
That's it. Pls do this, paste your output on a PPT and interpret the results. Answer the Qs above in bullet points, nothing fancy.

Pls feel free to ask around and take help from your peers. You're always welcome to approach me, the AAs or Ankit Anand with any queries. I'd prefer you use the blog's comments section, which reaches me fastest. I look forward to hearing your feedback on this and other HWs.

Sudhir

13 comments:

  1. Dear Professor,

    Although we added 'phone' as a stopword in the R code, its plural form, 'phones', is still showing up among the top 5 words by frequency when we run the analysis. Perhaps we should add 'phones' to the stopword list as well? Thanks.

    Regards,
    Harneet Chawla

    1. Hi Harneet,

      Yes. You can make your stopword-list as long as you want. It takes some iterations to filter out the chaff but that's part of the learning curve, I guess.
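      For instance, a sketch of an extended list (the extra terms here are just examples; re-run the removeWords step after editing it):

      myStopwords <- c(stopwords('english'), "phone", "phones", "samsung", "galaxy")
      x1 = tm_map(x1, removeWords, myStopwords)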

      Sudhir

  2. Dear Professor,
    It seems there is some error in the collocation dendrogram. From what I understand of the code, the word cloud should be restricted to 25 words, but it's throwing out so many! Also, R is throwing many errors saying certain objects are not found. Any tips, sir?

    1. Hi Swati,

      Could you let me know what the error messages say exactly? If they are warning() messages, ignore them.

      Also, the 25 limit is pretty tight. That part of the code was not executed for some reason, which is why you're getting a much larger word list. Again, copy-paste line by line and let me know what error messages you see for any given line, and I can debug it.

      Sudhir

    2. Thanks Professor,
      It worked when I copy-pasted line by line without the comments part... the object definitions might have been treated as comments the last time I ran the code.

  3. Prof

    While creating dendrograms, it gives an error:
    "Error in matrix(0, min1, min1) : object 'min1' not found", and similarly 'test1' not found.

    Could you please suggest what the solution might be?

    1. Hi Shashaank,

      Clearly, the 'min1' definition did not execute.

      Pls copy-paste the code, starting a few lines before the min1 line, line by line. Do not copy anything in comments (lines starting with a '#').

      It should work. I haven't heard about this issue from other folks yet.

      Sudhir

  4. Sir,

    I am having trouble "reading in" the positive and negative word lists from the text files. I keep getting an error: "Error in file.choose() : file choice cancelled".

    Could you please advise what I can do to rectify the issue?

    Thank you,

    Rahul

    1. Hi Rahul,

      I have not faced such an error before, and I have to wonder why it may be happening. Is yours a Mac, by any chance? In any case, could you ask your peers whether they have seen this too? I haven't heard of this particular problem yet.

      Sudhir

  5. Hi Professor,
    When I enter code:

    # --- making dendograms to visualize # word-collocations --- # min1 = min(mydata$ncol, 25) # find for top 25 words
    test = matrix(0,min1,min1)
    test1 = test
    for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
    test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
    test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}

    It gives me an error: objects 'min1' and 'test' not found.

    How should I rectify this?

    Thnx,
    Abhinav

    1. Hi Abhinav,

      I see the issue. Pls ensure there is no comment character ('#') in front of any line with executable code. Pls copy-paste the code line by line and do not copy the lines starting with a '#'.

      Should work. In an earlier revision, I changed the code and forgot to put in HTML line breaks, which is why the line folds and mixes comments with executable code.

      I hope you've not merely waited for me to reply and have instead taken help from your peers - borrowed their analysis output and written up your own interpretation, as it were.

      Sudhir

  6. Dear Professor,
    How can we plot a dendrogram for positive or negative words? This is important because customers might be talking about several products; positive or negative emotions could then belong to any of them. The above code only picks out the positive and negative emotions, without telling us which product they were used for.

    thanks & regards
    Abhishek

    1. Hi Abhishek,

      You're right. Machines don't 'get' context, sarcasm and all that, which is why human intervention will remain critical in this area. Machines can narrow down the field big time and cut much of the human grunt-work, though, e.g. by directing us to the 'typical', the 'interesting' or the extreme/outlier comments.

      The hope is that, when processing many people's text output, such issues of sarcasm, compound sentences, mixed negative and positive words, etc. will average out.

      Sudhir

