Tuesday, December 24, 2013

Updates on Session 9

Hi all,

Session 9 is done. One more to go.

1. Notes from Today's session:

We covered a lot in this rather eclectic session - from Hypothesis formulation to Social network analysis (SNA). A few quick notes on the same:

i. Hypothesis formulation and testing fits neatly into the Causal research (Experimentation) topic that we did in Session 7. After all, logical, measurable hypotheses underlie the experimental method.

ii. The two types of tests we did - association (chi-square) and differences (t-tests) - cover the majority of situations you are likely to face. However, even for other, more esoteric testing requirements, R is handy and available.
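For reference, here's how both tests look in R on made-up toy numbers (the variables and counts below are purely illustrative, not from our survey data):

```r
# (a) Test of association: chi-square on a 2x2 cross-tab (toy counts)
tab = matrix(c(20, 30, 25, 25), nrow = 2,
             dimnames = list(gender = c("M", "F"), brand = c("A", "B")))
chisq.test(tab)   # H0: the two variables are independent

# (b) Test of differences: t-test on two groups' mean spend (simulated)
set.seed(1)
spend.seg1 = rnorm(30, mean = 100, sd = 15)
spend.seg2 = rnorm(30, mean = 110, sd = 15)
t.test(spend.seg1, spend.seg2)   # H0: the two group means are equal
```

In both cases, a p-value below your chosen significance level (say, 0.05) lets you reject the null.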

iii. Regression modeling borrows much from your Stats core. However, even if repetitive, I couldn't risk leaving it out as IMO, basic regression modeling forms a critical part of tomorrow's Mktg manager's repertoire.

iv. The 3 basic regression variants we covered, viz. quadratic terms (for ideal-point estimation), log-log models (for elasticities) and interaction effects, open up a lot of space for managers to maneuver and test conjectures in.
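To make point iv concrete, here's a minimal sketch of all three variants via lm() on simulated data (the variable names and coefficients below are invented for illustration):

```r
# Simulated data, purely illustrative
set.seed(2)
price = runif(100, 1, 10)
ad    = runif(100, 0, 5)
sales = 50 + 8*price - 0.6*price^2 + 3*ad + 0.5*price*ad + rnorm(100, 0, 2)

# (i) quadratic term: ideal point sits at -b1/(2*b2) when b2 < 0
fit.quad = lm(sales ~ price + I(price^2))
coef(fit.quad)

# (ii) log-log model: the log(price) coefficient is the price elasticity
fit.ll = lm(log(sales) ~ log(price))
coef(fit.ll)["log(price)"]

# (iii) interaction effect: does advertising moderate price response?
fit.int = lm(sales ~ price * ad)   # expands to price + ad + price:ad
summary(fit.int)
```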

v. The SNA portion was new to me too, in a sense. I'd done my network theory basics way back in grad study but getting back in touch felt good. SNA will gain in importance and applicability going forward. We already saw the kinds of Qs it can provide guidance and answers for.

vi. We've only scratched the surface where SNA is concerned. R's capabilities extend much further, though let me quickly admit that I myself have not explored very far in this area. A big limitation on SNA is how much data we can collect from Twitter, FB etc., since their connection APIs set rather small limits on automated data downloads.
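For those who want to poke further, here's a minimal sketch of the kind of SNA R can do via the igraph package. This is not our classwork code; the who-named-whom edge list below is invented for illustration:

```r
# install.packages("igraph") # if not already installed
library(igraph)

# made-up 'who named whom as a friend' pairs
edges = data.frame(from = c("Amit", "Amit", "Bela", "Chitra", "Dev"),
                   to   = c("Bela", "Chitra", "Chitra", "Dev", "Amit"))
g = graph_from_data_frame(edges, directed = TRUE)

degree(g, mode = "in")  # in-degree: how often each person was named
betweenness(g)          # who bridges otherwise separate parts of the network
plot(g)                 # quick-and-dirty network plot
```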

vii. The R code and data for today's classwork are up on LMS. Pls try to replicate them at home; interested folks can play around with the code further and see.

2. Regarding SNA data:

Folks, normally I would remove identifying information about students from the data, but in this case (and in the JSM homework case, which called for individual-level plots) that was not possible. So the identifying information remains.

Pls remember: since only 6 friends could be listed, it's not possible for people to list all their friends. Hence some names might be missing from the SNA dataset of Co2014. Pls take that in the right spirit rather than blame anybody for non-reciprocity.

Bottomline: I don't want MKTR to be remembered for any negative reasons whatsoever.

3. Some Announcements:

a. Those who have missed filling up a survey can also take up the 2-page write-up assignment I'd given earlier to folks who'd missed pre-read quizzes, at this link. Deadline is Thursday.

b. There will be 5 pre-read quizzes in all (including the session 10 one). I'll consider your top 4 scores for grading. So if you performed really badly on one, you can let it go.

c. Update: I have decided to drop one reading from your pre-read list for session 10. You now have two pre-reads for session 10. One is this McKinsey article: Capturing Business Value with Social Technologies. Scan it quickly; it doesn't require a very in-depth read, IMHO, but the general ideas should be clear.

d. Your other pre-read for session 10 is a famous Wired article from 2004 which went on to become a major 2007 book on the subject: The Long Tail, by Chris Anderson. It's an excellent article on a new economic paradigm enabled by technology.

e. The practice exam is up on LMS. Its solution is up too, but only those Qs which have one clear answer have been solved; the more open-ended Qs have been left blank. The practice exam is a good template for what you can expect in the end-term. Most pre-reads per se will not figure in the exam unless explicitly covered in the slides, as you can see from the practice exam.

f. I solved the Session 8 HW again. The findAssocs() function seems to do OK for Qs 1-9, so keep it as is. findAssocs() is not needed for the Amazon reviews analysis anyway.

Well, that's it from me for now. See you on Thursday.

Sudhir

Saturday, December 21, 2013

Assignment in Lieu of one Pre-read Quiz

Hi All,

The following write-up-based assignment is *only* for those folks who have missed one pre-read quiz or may have done badly in any one quiz.

Problem background:

You are a consultant and your client, a multinational manufacturing behemoth, wants to know trends and impact of disruption in manufacturing technologies in the next decade with particular emphasis on 'additive manufacturing' (a.k.a. 3 dimensional printing) technologies.

Your D.P. is to find "Which industries and product categories will shift earliest to (or be most affected by) 3D printing tech and around what time line?".

An alternative D.P. says, "What are the most likely consumer uses of 3D printing and around what time line?"

Choose any one of the two D.P.s, build corresponding R.O.s and write a 3 (or fewer) page report (Times New Roman 12 font, 1.5 line spacing, standard margins) outlining your principal findings in solving that R.O. through secondary research alone.

Hint:

Google for 'economist.com 3D printing' (without the inverted commas). Scan through the links that appear on the first page. I have posted a few examples below.

How 3D printers work (7 Sept 2013)

3D printing Out of the box (6 Aug 2013)

3D printing scales up (7 Sept 2013)

Inventing HP in 3D (28 Nov 2013)

Pls ensure you have:

  • Written your name and PGID on the document
  • Clearly spelt out which D.P. you have chosen
  • Clearly spelt out your R.O.(s)
  • Clearly included citations of sources (URLs etc) either as footnotes or as a separate References section outside the page limit.
The deadline is before the start of the next class. Pls submit electronically to a dropbox that Ankit will make on LMS for this purpose.

Any queries etc., contact me.

Thanks.

Sudhir

Friday, December 20, 2013

Session 8 HW

Hi all,

Pls find below the last HW for MKTR - that for session 8.

The files for classwork and HW are both up on LMS. It is *highly* recommended that you first try the classwork code before the HW one.

The HW can be done in groups but submissions must be individual only. Also, the interpretation should be your own; feel free to take help from peers for running the analysis.

Oh, and one other thing. For text analytics, it is better to leave RStudio and run the analysis directly on the original R GUI. It will be there under Program Files in the Start menu. The same copy-paste will work there also; plots appear in a cascading window. This is recommended but not mandatory.

HW Qs:

The following Qs are based on your survey responses

  • Q1. List the top 5 firms that people have expressed preference for.
  • Q2. For each of the top 3 most frequently cited firms, name the top two firms that co-occur the most and with what correlation coefficient.
  • Q3. Name two 'singleton' firms - firms that do not have any co-occurrence with any other firms in the network. [Hint: Invoke the plot.words.network function]
  • Q4. Name three singleton people (who apparently do not share firm preferences with anybody else in the class). [Hint: Invoke the plot.ppl.network function]
  • Q5. Analyze the wordcloud for the top loyalty-commanding brands for Co2014
  • Q6. For each of the top 3 most frequently cited brands, name the top two brands that co-occur the most and with what correlation coefficient.
  • Q7. In the brands associations plot, do you see any natural groupings emerge? Do brands of a particular category or price level bunch up together? [Hint: Invoke plot.words.network function]
  • Q8. Name a few people who seem to share a lot of brand preferences with others in the class? [Hint: Invoke plot.ppl.network function]
  • ### following Qs are for web extraction of data from amazon ###
  • Q9. Collect 100 odd reviews from Amazon for xbox 360. Analyze the wordcloud. What themes seem to emerge from the wordcloud?
  • Q10. Analyze the positive wordcloud. What are the xbox's seeming strengths? What can they position around?
  • Q11. Analyze the negative wordcloud. What are the xbox's seeming weaknesses? What can they prioritize and fix?
Deadline is the coming Thursday midnight. Submission must be in the form of PPTs only. Write your name and PGID on the title slide and use your name as the file name. A dropbox will be made for this.

Any Qs etc, contact me.

Sudhir

Monday, December 16, 2013

Session 6 HW

Hi all,

Update: Mailbag

Received this email from Kanwal and my response is as under - displayed here for wider dissemination.

Dear Professor, I have read all the blog posts but I am confused about future assignments. All, I understand is that we have to submit a focus group assignment on 21 Dec. Can you please tell what are the next assignments and when they are due. Our exams start next week. Many thanks, Kanwal Kella

My response:

Hi Kanwal, Am not sure on what exactly is confusing here. Let me list it all out anyway.

Session 4 HW (FGD) - due 21-Dec Saturday
Session 5 - Segmentation and targeting - No HW
Session 6 - JSM HW - due a week later - before the beginning of session 9 on 24-Dec Tuesday
Course feedback text survey taking - due before session 8 on 19-Dec Thursday

Session 7 - Causal research - No HW
Session 8 - Text Analysis - a rather limited HW exercise will be due a week later, before the start of session 10, on 26 Dec Thursday
Sessions 9 and 10 - No HW

Hope that helps.

Sudhir

*****************************

Here is the HW for session 6. As mentioned before - this includes the HW for session 5 as well.

1. Session 6 HW: Part1 - JSMs

This HW is also a group submission. You will need to co-operate with the rest of your group to get it done.

  • JSM based homework:
  • Collect basic demographic information about your group mates - #yrs of workex, previous industry, educational qualifications, intended major etc.
  • Run individual-level JSM analysis on each of your team mates (and yourself) using the code below (place the appropriate name in student.name = c("") in that code)
  • Compare the JSMs you obtain - what salient similarities and differences do you see?
  • Now, using the demographic data you have collected, speculate on which demographic characteristics are best able to explain at least some of the similarities and differences you see.
  • Place (i) the 4 JSMs, (ii) your list of salient similarities and differences (preferably in tabular form), and (iii) the subset of demographic variables that best explain the JSMs in a PPT.

Update:
I'm dropping Part 2 of the HW. Submit Part 1 and that will be sufficient.

2. Session 6 HW: Part 2 - Segmentation and Positioning PDAs

  • Connector PDA case based homework:
  • Pls scan through the basic facts about the ConneCtor PDA 2001 (segmentation) and (Positioning) cases in MEXL
  • Segment the dataset given along basis variables using model based clustering
  • Profile and characterize the segments that emerge. Give each one a reasonable, descriptive name.
  • Speculate on which of these segments you as the firm would most like to target. In other words, rate these segments in terms of their (High/Medium/Low) attractiveness for you.
  • Look at the discriminant variables list corresponding to your chosen segment. Based on the list, speculate on how you might target your chosen segment.
  • Paste the results you obtain (including the segment descriptions in tabular form) onto the PPT and submit.
Any queries etc, contact me.

Sudhir

Session 6 Updates

Hi all,

Session 6 covers two main ways to map perceptual data - (i) using the attribute ratings (AR) method to create p-maps and joint-space maps (JSMs), and (ii) using the overall similarity (OS) approach to create multidimensional scaling (MDS) maps.

We also saw some 101 stuff on positioning, definitional terms, common positioning strategies etc. The point was to get you thinking on how the mapping process could throw insights onto positioning in general, which strategy to adopt based on what criteria etc.

Next, what follows is the code and snapshots of the plots that emerge from the classwork examples I did. Again, you are strongly encouraged to replicate the classwork examples at home. Copy-paste only a few lines of code at a time, after reading the comments next to each line of code.

{P.S. - the statements following a '#' are comments for documentation purposes only and aren't executed.} So, without further ado, let us start right away:

##########################################

1. Simple Data Visualization using biplots: USArrests example.

We use the USArrests data (an inbuilt R dataset) to see how it can be visualized in 2 dimensions. Just copy-paste the code below onto the R console [hit 'Enter' after the last line]. You'll need to install the package "MASS". Don't reinstall if you have already installed it previously - a package once installed stays installed across sessions.

rm(list = ls()) # clear workspace

install.packages("MASS") # install MASS package

mydata = USArrests # USArrests is an inbuilt dataset

pc.cr = princomp(mydata, cor=TRUE) # princomp() is the core func

summary(pc.cr) # summarize the pc.cr object

biplot(pc.cr) # plot the pc.cr object

abline(h=0); abline(v=0) # draw horiz and vertical axes

This is what the plot should look like. Click on image for larger view.

2. Code for making Joint Space maps:

I have coded a user-defined function called JSM in R. You can use it whenever you need to make joint space maps, just by invoking the function. All it requires to work is a perceptions table and a preference rating table. First copy-paste the entire block of code below onto your R console. Those interested in reading the code, pls copy-paste line-by-line. I have put explanations in comments ('#') for what the code is doing.

## --- Build func to run simple perceptual maps --- ##

JSM = function(inp1, prefs){ #JSM() func opens

# inp1 = perception matrix with row and column headers
# brands in rows and attributes in columns
# prefs = preferences matrix

par(pty="s") # set square plotting region

fit = prcomp(inp1, scale.=TRUE) # extract prin compts

plot(fit$rotation[,1:2], # use only top 2 prinComps

type ="n", xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), # plot parms

main ="Joint Space map - Home-brew on R") # plot title

abline(h=0); abline(v=0) # build horiz and vert axes

attribnames = colnames(inp1);

brdnames = rownames(inp1)

# -- insert attrib vectors as arrows --

for (i1 in 1:nrow(fit$rotation)){

arrows(0,0, x1 = fit$rotation[i1,1]*fit$sdev[1],

y1 = fit$rotation[i1,2]*fit$sdev[2], col="blue", lwd=1.5);

text(x = fit$rotation[i1,1]*fit$sdev[1], y = fit$rotation[i1,2]*fit$sdev[2],

labels = attribnames[i1],col="blue", cex=1.1)}

# --- make co-ords within (-1,1) frame --- #

fit1=fit; fit1$x[,1]=fit$x[,1]/apply(abs(fit$x),2,sum)[1]

fit1$x[,2]=fit$x[,2]/apply(abs(fit$x),2,sum)[2]

points(x=fit1$x[,1], y=fit1$x[,2], pch=19, col="red")

text(x=fit1$x[,1], y=fit1$x[,2], labels=brdnames, col="black", cex=1.1)

# --- add preferences to map ---#

k1 = 2; #scale-down factor

pref = data.matrix(prefs)# make data compatible

pref1 = pref %*% fit1$x[,1:2];

for (i1 in 1:nrow(pref1)){

segments(0, 0, x1 = pref1[i1,1]/k1, y1 = pref1[i1,2]/k1, col="maroon2", lwd=1.25);

points(x = pref1[i1,1]/k1, y = pref1[i1,2]/k1, pch=19, col="maroon2");

text(x = pref1[i1,1]/k1, y = pref1[i1,2]/k1, labels = rownames(pref)[i1], adj = c(0.5, 0.5), col ="maroon2", cex = 1.1)}

# voila, we're done! #

} # JSM() func ends

3. OfficeStar MEXL example done on R

Goto LMS folder 'Session 6 files'. The file 'R code officestar.txt' contains the code (which I've broken up into chunks and annotated below) and the files 'officestar data1.txt' and 'officestar pref data2.txt' contain the average perceptions or attribute table and preferences table respectively.

Step 3a: Read in the attribute table into 'mydata'.

# -- Read in Average Perceptions table -- #

mydata = read.table(file.choose(), header = TRUE)

mydata = t(mydata) #transposing to ease analysis

mydata #view the table read

# extract brand and attribute names #

brdnames = rownames(mydata);

attribnames = colnames(mydata)

Step 3b: Read into R the preferences table into 'prefs'.

# -- Read in preferences table -- #

pref = read.table(file.choose())

dim(pref) #check table dimensions

pref[1:10,] #view first 10 rows

Data reading is done. You should see the data read-in as in the figure above. We can start analysis now. Finally.

Step 3c: Run Analysis

# creating empty pref dataset

pref0 = pref*0; rownames(pref0) = NULL

JSM(mydata, pref0) # p-map without prefs information

The above code will generate a p-map (without the preference vectors). Should look like the image below (click for larger image):

However, to make true joint-space maps (JSMs), wherein the preference vectors are overlaid atop the p-map, run the one line code below:

JSM(mydata, pref)

That is it. That one function call executes the entire JSM sequence. The result can be seen in the image below.

Again, the JSM function is generic and can be applied to *any* dataset in the input format we just saw, to make joint space maps from. Am sure you'll leverage the code for analyzing your project datasets. Let me or Ankit know in case any assistance is needed in this regard.

4. Session 2 survey Data on firm Perceptions:

Look up the LMS folder 'session 6 files'. Save the data and code files to your machine. The data files are 'courses data.txt' for the raw data on perceptions and 'courses data prefs.txt' for the preference data with student names on it. Now let the games begin.

# read in data

mydata = read.table(file.choose()) # 'courses data.txt'

head(mydata)

# I hard coded attribute and brand names

attrib.names = c("Brd.Equity", "career.growth.oppty", "roles.challenges", "remuneration", "overall.preference")

brand.names = c("Accenture", "Cognizant", "Citi", "Facebook", "HindLever")

Should you try using your project data or some other dataset, you'll need to enter the brand and attribute names for that dataset in the same order in which they appear in the dataset, separately as given above. I then wrote a simple function, titled 'pmap.inp()' to denote "p-map input", to transform the raw data into a brands-attributes average perceptions table. Note that the code below is specific to the last set of columns being the preferences data.

# construct p-map input matrices using pmap.inp() func

pmap.inp = function(mydata, attrib.names, brand.names){ #> pmap.inp() func opens

a1 = NULL

for (i1 in 1:length(attrib.names)){

start = (i1-1)*length(brand.names)+1; stop = i1*length(brand.names);

a1 = rbind(a1, apply(mydata[,start:stop], 2, mean)) } # i1 loop ends

rownames(a1) = attrib.names; colnames(a1) = brand.names

a1 } # pmap.inp() func ends

a1 = pmap.inp(mydata, attrib.names, brand.names)

The above code should yield the average perceptions table that will look something like this:

And now, we're ready to run the analysis. First the p-map without the preferences and then the full JSM.

# now run the JSM func on data

percep = t(a1[2:nrow(a1),]); percep

# prefs = mydata[, 1:length(brand.names)]

prefs = read.table(file.choose(), header = TRUE) # 'courses data prefs.txt'

prefs1 = prefs*0; rownames(prefs1) = NULL # null preferences doc created

JSM(percep, prefs1) # for p-map sans preferences

Should produce the p-map below: (click for larger image)

And the one-line JSM run:

JSM(percep, prefs) # for p-map with preference data

Should produce the JSM below:

Follow the rest of the HW code given to run segment-wise JSMs in the same fashion.

5. Running JSMs for individuals: (Useful for your HW)

One of your session 6 HW components will require you to make individual-level JSMs and compare them with the class average JSMs. Use the following code to get that done:

### --- For Session 6 HW --- ###

# Use code below to draw individual level JSM plots:

student.name = c("Sachin") # say, student's name is Sachin

# retain only that row in the raw data which has name 'Sachin'
mydata.test = mydata[(rownames(prefs) == student.name),]

# run the pmap.inp() func to build avg perceptions table
a1.test = pmap.inp(mydata.test, attrib.names, brand.names)

percep.test = t(a1.test[1:(nrow(a1.test)-1),]);

# introduce a small perturbation lest matrix not be of full rank
percep.test = percep.test + matrix(rnorm(nrow(percep.test)*ncol(percep.test))*0.01, nrow(percep.test), ncol(percep.test));

prefs.test = prefs[(rownames(prefs) == student.name),]; prefs.test

# run analysis on percep.test and prefs.test
JSM(percep.test, prefs.test)

This is what I got as Mr Sachin's personal JSM:

More generally, change the student name to the one you want and run the above code.

6. Running MDS code with Car Survey Data:

In LMS folder 'session 6 files', the data are in 'mds car data raw v1.txt'. Read it in and follow the instructions here.

# --------------------- #
### --- MDS code ---- ###
# --------------------- #

rm(list = ls()) # clear workspace

mydata = read.table(file.choose(), header = TRUE) # 'mds car data raw v1.txt'

dim(mydata) # view dimension of the data matrix

brand.names = c("Hyundai", "Honda", "Fiat", "Ford", "Chevrolet", "Toyota", "Nissan", "TataMotors", "MarutiSuzuki")

Note that I have hard-coded the brand names into 'brand.names'. If you want to use this MDS code for another dataset (for your project, say) then you'll have to likewise hard-code the brand.names in. Next, I defined a function called run.mds() that takes as input the raw data and the brand names vector, runs the analysis and outputs the MDS map. Cool, or what..

### --- copy-paste MDS func below as a block --- ###

### -------------block starts here ---------------- ###

run.mds = function(mydata, brand.names){

# build distance matrix #

k = length(brand.names);

dmat = matrix(0, k, k);

for (i1 in 1:(k-1)){ a1 = grepl(brand.names[i1], colnames(mydata));

for (i2 in (i1+1):k){a2 = grepl(brand.names[i2], colnames(mydata));
# note use of Regex here

a3 = a1*a2;

a4 = match(1, a3);

dmat[i1, i2] = mean(mydata[, a4]);

dmat[i2, i1] = dmat[i1, i2] } #i2 ends

} # i1 ends

colnames(dmat) = brand.names;

rownames(dmat) = brand.names

### --- run metric MDS --- ###

d = as.dist(dmat)

# Classical MDS into k dimensions #

fit = cmdscale(d,eig=TRUE, k=2) # cmdscale() is core MDS func

fit # view results

# plot solution #

x = fit$points[,1];
y = fit$points[,2];

plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), type="p",pch=19, col="red");

text(x, y, labels = rownames(fit$points), cex=1.1, pos=1);

abline(h=0); abline(v=0)# horiz and vertical lines drawn

} # run.mds func ends

### ---------block ends here-------------------- ###

Time now to finally invoke the run.mds func and get the analysis results:

# run MDS on raw data (before segmenting)

run.mds(mydata, brand.names)

The resulting MDS map looks like this:

OK, that's quite a bit now for classwork replication. Let me know if any code anywhere is not running etc due to any issues.

Sudhir

Thursday, December 12, 2013

Session 5 Updates - Targeting

Hi all,

We'll quickly go over the targeting portion of the PDA case. Pls ensure you're comfortable with the how and why of segmentation and targeting from the lecture slides before going ahead with this one. I will assume you know the contents of the slides well for what follows.

#----------------------------------------------#
##### PDA caselet from MEXL - Targeting #######
#----------------------------------------------#

rm(list = ls()) # clear workspace

# read in 'PDA case discriminant variables.txt'

mydata = read.table(file.choose(), header=TRUE)

head(mydata) # view top few rows of dataset

The last column labeled 'memb' is the cluster membership assigned by mclust in the previous blogpost.

The purpose of targeting is to *predict*, with as much accuracy as feasible, a previously unknown customer's segment membership. Since we cannot make such predictions with certainty, what we obtain as output are probabilities of segment membership for each customer.

First, we must assess how much accuracy our targeting algorithm has. There are many targeting algorithms developed and deployed for this purpose. We'll use the simplest and best known - the multinomial logit model.

To assess accuracy, we split the dataset *randomly* into a training dataset and a validation dataset. The code below does that (we use 'test' in place of validation in the code below).

# build training and test samples using random assignment

# about two-thirds (65%) of the sample is for training

train_index = sample(1:nrow(mydata), floor(nrow(mydata)*0.65))

train_data = mydata[train_index, ]

test_data = mydata[-(train_index), ]

train_x = data.matrix(train_data[ ,c(2:18)])

train_y = data.matrix(train_data[ , ncol(mydata)])

test_x = data.matrix(test_data[ ,c(2:18)])

test_y = test_data[ , ncol(mydata)]

And now we're ready to run logit (from the package 'textir'). Ensure the package is installed. And just follow the code below.

###### Multinomial logit using Rpackage textir ###

library(textir)

covars = mydata[ ,c(2,4,14)]; s = sdev(mydata[ ,c(2,4,14)]);

dd = data.frame(cbind(memb=mydata$memb,covars,mydata[ ,c(3,5:13,15:18)]));

train_ml = dd[train_index, ];

test_ml = dd[-(train_index), ];

gg = mnlm(counts = as.factor(train_ml$memb), penalty = 1, covars = train_ml[ ,2:18]);

prob = predict(gg, test_ml[ ,2:18]);

head(prob);

Should see the following result.

Note the table below shows probabilities. To read the table, consider the first row. Each column in the first row shows the probability that the first row respondent belongs to cluster 1 (with column 1 probability), to cluster 2 (with column 2 probability) and so on.

For convenience's sake, we simply assign each member to the cluster in which he/she has the maximum probability of belonging. Now we can compare how well our predicted memberships agree with the actual ones.

To see this, run the following code and obtain what is called a 'confusion matrix' - a cross-tabulation between observed and predicted memberships. In the confusion matrix, the diagonal cells represent correctly classified respondents and off-diagonal cells the misclassified ones.

pred = matrix(0, nrow(test_ml), 1);

accuracy = matrix(0, nrow(test_ml), 1);

for(j in 1:nrow(test_ml)){

pred[j, 1] = which.max(prob[j, ]);

if(pred[j, 1]==test_ml$memb[j]) {accuracy[j, 1] = 1}

}

mean(accuracy)

The mean accuracy of the algo appears to be 63% in my run. Yours may vary slightly due to the randomly allocated training and validation samples. This 63% accuracy compares very well indeed with the 25% average accuracy we'd get if we were to depend merely on chance to allocate respondents to clusters.
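To print the confusion matrix itself, R's table() function does the job. A self-contained toy illustration (the predicted and actual labels below are made up, not the PDA data):

```r
# made-up segment labels for 10 respondents
actual    = c(1, 1, 2, 2, 2, 3, 3, 1, 2, 3)
predicted = c(1, 2, 2, 2, 1, 3, 3, 1, 2, 2)

conf.mat = table(predicted, actual)  # rows: predicted, cols: observed
conf.mat

sum(diag(conf.mat)) / sum(conf.mat)  # overall hit-rate; here 0.7
```

The diagonal cells of conf.mat are the correctly classified respondents, exactly as described above.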

That's it for now. Any queries etc. contact me over email or better still, use the comments section below this post.

Sudhir

Wednesday, December 11, 2013

Session 5 Updates

Hi all,

Yesterday in Session 5 we covered two major topics - Segmentation and Targeting.

Sorry about the delay in bringing out this blog post. In this blog post, I shall lay out the classwork examples (which you might want to try replicating) and their interpretation.

There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - agglomerative hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), partitioning (here, we do K-means) and model-based clustering. Each has its pros and cons (as discussed in class). Also, as mentioned in class, the goal here is to give tomorrow's managers (i.e., you) exposure to the intuition behind clustering and the various methods in play. Going into technical detail is not a part of this course. However, I'm open to discussing, and happy to receive, Qs of a technical nature outside of class time.

1. Cluster Analysis Data preparation

First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.

# first install these packages
# Note: You only need to install a package ONCE.
# Thereafter a library() call is enough.

install.packages("cluster")
install.packages("mclust")
install.packages("textir")
install.packages("clValid")
# Now read-in data#

mydata = USArrests

Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. To get rid of this problem, just copy-paste the following code.

# Prepare Data #

mydata = na.omit(mydata) # listwise deletion of missing

mydata.orig = mydata # keep an unscaled copy for profiling clusters later

mydata = scale(mydata) # standardize variables

2. Now we first do agglomerative hierarchical clustering, plot dendrograms, slice them and see what is happening.

# Ward Hierarchical Clustering

d = dist(mydata, method = "euclidean") # distance matrix

fit = hclust(d, method="ward") # run hclust func

plot(fit) # display dendrogram

Click on image for larger size.

Eyeball the dendrogram. Imagine horizontally slicing through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How do you know?

Sometimes eyeballing is enough to give a clear idea, sometimes not. Various stopping-rule criteria have been proposed for where to cut a dendrogram - each with its pros and cons.

For the purposes of MKTR, I'll use three well-researched internal validity criteria available in the "clValid" package, viz. Connectivity, Dunn's index and Silhouette width - to determine the optimal no. of clusters. We don't need to go into any technical detail about these 3 metrics, for this course.

#-- Q: How to know how many clusters in hclust are optimal? #

library(clValid)

intval = clValid(USArrests, 2:10, clMethods = c("hierarchical"), validation = "internal", method = c("ward"));

summary(intval)

The result will look like the below image. 2 of the 3 metrics support a 2-cluster solution, so let's go with the majority opinion in this case.

Since we decided 2 is better, we set the optimal no. of clusters 'k1' to 2 below, thus:

k1 = 2 # from clValid metrics

Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# cut tree into k1 clusters

groups = cutree(fit, k=k1) # cut tree into k1 clusters

3. Coming to the second approach, 'partitioning', we use the popular K-means method.

Again, the Q arises: how do we know the optimal no. of clusters? MEXL (and quite a few other commercial software packages) require you to magically come up with the correct number as input to K-means. R provides better guidance on choosing the no. of clusters, so with R you can actually take an informed call.

#-- Q: How to know how many clusters in kmeans are optimal? #

library(clValid)

intval = clValid(USArrests, 2:10, clMethods = c("kmeans"), validation = "internal", method = c("ward"));

summary(intval)

The following result obtains. Note there's no majority opinion emerging in this case. I'd say, choose any one result that seems reasonable and proceed.

# K-Means Cluster Analysis

fit = kmeans(mydata, k1) # k1 cluster solution


To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize the cluster, interpret what it is that's common among a cluster's membership, and give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.

# get cluster means

aggregate(mydata.orig,by=list(fit$cluster),FUN=mean)

# append cluster assignment

mydata1 = data.frame(mydata, fit$cluster);

mydata1[1:10,]


OK, that's fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In two dimensions, the easiest way is to plot the clusters against the two biggest principal components that arise. Before copy-pasting the following code, ensure you have the 'cluster' package installed.

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph

install.packages("cluster")
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)

Two clear-cut clusters emerge, with Missouri seeming to border the two. Some overlap is also seen. Overall, clusplot puts a nice visualization over the clustering process. Neat, eh? Try doing this with R's competitors...:)

4. Finally, the last (and best) approach - Model based clustering. 'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions, and because it penalizes surplus complexity (i.e., it resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: when in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge. Install the 'mclust' package first, then run the following code.

install.packages("mclust")

# Model Based Clustering

library(mclust)

fit = Mclust(mydata)

fit # view solution summary

The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree-plot predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember: when in doubt, go with mclust.

fit$BIC # lookup all the options attempted

classif = fit$classification # classifn vector

mydata1 = cbind(mydata.orig, classif) # append to dataset

mydata1[1:10,] #view top 10 rows

# Use below only if you want to save the output

write.table(mydata1,file.choose())#save output

The classification vector is appended to the original dataset as its last column. We can now easily assign individual units to segments. Visualize the solution and see how exactly it differs from that of the other approaches.

library(cluster)

clusplot(mydata, classif, color=TRUE, shade=TRUE, labels=2, lines=0) # plot on the mclust classification

Imagine you're a medium-sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution had 3 segments - not 2 or 4. To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.

# get cluster means

cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean);

cmeans

In the pic above, the way to interpret the segments is to characterize each in terms of the variables that best distinguish that cluster from the others. Typically, we look for variables that attain their highest or lowest values for that cluster. In the figure above, it is clear that the first cluster (first column) is the most 'unsafe' (having the highest murder rate, assault rate, etc.) and the last cluster the most 'safe'.

Thus, from mclust, it seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe. From the kmeans solution, we have 2 clusters - 'Unsafe' and 'Safe' emerging.
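Once you've profiled the centroids, it helps to attach the segment names to the classification vector itself, so downstream analysis can work with labels instead of cluster numbers. A sketch on the kmeans route (the ordering by the Murder centroid and the 'Unsafe'/'Safe'/'Super-safe' names are my illustrative choices, not anything R produces):

```r
set.seed(1)
mydata <- scale(USArrests)
fit <- kmeans(mydata, centers = 3, nstart = 25)
# order clusters from most to least 'unsafe' by their Murder centroid
ord <- order(fit$centers[, "Murder"], decreasing = TRUE)
labels <- character(3)
labels[ord] <- c("Unsafe", "Safe", "Super-safe")
segment <- labels[fit$cluster]   # named segment for each state
table(segment)                   # segment sizes, by name
```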

Now, we can do the same copy-paste for any other datasets that may show up in classwork or homework. I'll close the segmentation module here. R tools for the Targeting module are discussed in the next section of this blog post. Any queries or comments, pls use the comments box below to reach me fastest.

****************************************************

2. Segmenting and Targeting in R (PDA case - classwork example)

We saw a brief intro to the Conglomerate PDA case in the class handout in session 5. For the full length case, go to the 'Documents' folder in your local drive, then to 'My Marketing Engineering v2.0' folder, within that to the 'Cases and Exercises' folder and within that to the 'ConneCtor PDA 2001 (Segmentation)' folder (if you've installed MEXL, that is). For the purposes of this session, what was given in the handout is enough, however.

Without further ado, let's start. I'll skip the Hclust and k-means steps and go straight to model based clustering (mclust).

#----------------------------------------------#
##### PDA caselet from MEXL - Segmentation #####
#----------------------------------------------#

rm(list = ls()) # clear workspace

# read in 'PDA basis variables.txt' below
mydata = read.table(file.choose(), header=TRUE); head(mydata)

### --- Model Based Clustering --- ###

library(mclust)

fit = Mclust(mydata); fit

fit$BIC # lookup all the options attempted

classif = fit$classification # classifn vector

The image above shows the result. Click for larger picture.

We plot what the clusters look like in 2-D using the clusplot() function in the cluster library. What clusplot() essentially does is run principal components analysis on the dataset, plot the two largest components as axes, and plot the rows as points in this 2-D space. In the clusplot below, we can see that the top 2 components explain x% of the total variance in the dataset. Anyway, this plot is illustrative only and not critical to our analysis here.
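The 'x% of the variance' figure that clusplot() prints can be cross-checked directly with base R's prcomp(). A sketch, using the built-in USArrests data as a stand-in since the PDA file isn't bundled here:

```r
# PCA on the correlation matrix (scale. = TRUE standardizes variables)
pca <- prcomp(USArrests, scale. = TRUE)
# cumulative share of total variance, component by component
var.explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
round(var.explained[2], 3)   # share captured by the first 2 components
```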

But how to interpret what the clusters mean? To interpret the clusters, we have to first *characterize* the clusters in terms of which variables most distinguish them from the other clusters.

For instance, see the figure below in which the cluster means (or centroids) of the 4 clusters we obtained via mclust are shown on each of the 15 basis variables we started with.

Thus, I would tend to characterize or profile cluster 1 (the first column above) in terms of the variables that assume extreme values for that cluster (e.g., very LOW on Price, Monthly, Ergonomic considerations, Monitor requirements and very HIGH on use.PIM) and so on for the other 3 clusters as well.
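This 'which variables assume extreme values' step can be semi-automated: on standardized data, just look for the largest-magnitude centroid entries per cluster. A sketch (kmeans on USArrests as a stand-in; with the PDA data you'd apply the same idea to the mclust centroids):

```r
set.seed(1)
mydata <- scale(USArrests)
fit <- kmeans(mydata, centers = 3, nstart = 25)
cmeans <- fit$centers        # standardized cluster centroids
# for each cluster, the single most distinguishing basis variable
apply(cmeans, 1, function(row) names(which.max(abs(row))))
```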

What follows the profiling of the clusters is then 'naming' or labeling of the clusters suitably to reflect the cluster's characteristics. Then follows an evaluation of how 'attractive' the cluster is for the firm based on various criteria.

Thus far in the PDA case, what we did was still in the realm of segmentation. Time to now enter Targeting which lies at the heart of predictive analytics in Marketing. Since this blog post has gotten too large already, shall take that part to the next blogpost.

Sudhir

Wednesday, December 4, 2013

Session 4 Updates

Hi all,

Session 4 Big-picture Recap:

The readings-heavy Session 4 'Qualitative Research' covers many topics of interest. To recap the four big-picture take-aways from the session, let me use bullet-points:

  • We studied Observation Techniques - both of the plain vanilla observation (Reading 1 - Museums) and the 'immersive' ethnographic variety (Reading 2 - adidas).
  • We then ventured into deconstructing the powerful habit formation process and arrived at a 3-step loop framework to describe it for Marketing purposes: cue-routine-reward.
  • We saw how the innovative combination of qualitative insight and predictive analytics can lead to windfall $$ profits (Reading 3-Target and reading 4-Febreze)
  • Finally we saw how unstructured respondent interaction personified by a focus group discussion (FGD) can be a powerful qualitative tool for digging up customer insights.

Update:
This is a link to the NYT video that failed to play in class today.

**************************************

There are two parts to Session 4 HW.

Part 1 of Session 4 HW: Survey filling

Fill up these two surveys please, each under 15 minutes apiece. Kindly do so positively by the Sunday midnight deadline.

Survey 1 for Conjoint analysis in Session 7

Survey 2 link for Social Network Analysis in Session 9

**************************************

Part 2 of Session 4 HW: FGD

Read the following two Economist articles (added later: and one ET article) outlining a new product about to hit the shelves that wants to do an Apple on Apple.

Every step you take

India's answer to Google Glass: Hands free wearable device enables users to carry out computer functions (Economic Times)

The people’s panopticon

Problem context:

You are a mid-sized technology firm with a US presence. Your R&D division has recently won angel funding for hiring bright talent and developing applications for the Google Glass platform. You, as the Marketing manager, need to give inputs to the tech team on what kind of apps and products may appeal to customers. You have a few ideas in mind but are unsure if they'll appeal to customers.

Run an FGD to explore tech-savvy early-majority customers' expectations and wishes from the Google glass (or more generally, a wearable networking and technology) platform. Pitch your ideas to the group and see how they receive it, what their expectations, concerns and first impressions are etc.

Submission format:

  • Form a group (no more than 4 people) and select a name for it (based on a well-known Indian brand)
  • Title slide of your PPT should have your group name, member names and PGIDs
  • Choose a D.P. and corresponding R.O.(s) for the given problem context for the FGD.
  • Next slide, write your D.P. and R.O.(s) clearly.
  • Third slide, introduce the FGD participants and a line or so on why you chose them (tabular form is preferable for this)
  • Fourth Slide, write a bullet-pointed exec summary of the big-picture take-aways from the FGD
  • Fifth Slide on, describe and summarize what happened in the FGD
  • Note if unification and / or polarization dynamics happened in the FGD
  • Name your slide deck groupname_FGD.pptx and drop it in the appropriate dropbox by the start of session 6
  • Extra points if you can put up a short video on YouTube of the FGD in progress and its major highlights. Share the link on the PPT

Any queries etc., pls feel free to email me.

********************************************

Update 1: FGD HW guidelines: (This is based on my experience with the Term 5 FGD HW in Hyd)

To keep it focussed and brief, lemme use the bullet-points format

  • The point of the FGD is *not* to 'solve' the problem, but merely to point a likely direction where a solution can be found. So don't brainstorm for a 'solution', that is NOT the purpose of the FGD.
  • Ensure the D.P. and R.O.s are aligned and sufficiently exploratory before the FGD starts. Different R.O.s lead to very different FGD outcomes: compare, for example, an R.O. of "Explore which portable devices will be most cannibalized by Google Glass" with "Explore the potential for new-to-the-world applications using Google Glass".
  • Keep your D.P. and R.O. tightly focussed, simple and do-able in a mini-FGD format. Having too broad a focus or too many sub-topics will lead nowhere in the 30 odd minutes you have.
  • Start broad: Given an R.O., explore how people connect with or relate to portability, technology and devices in general, and their understanding of what constitutes a 'cool device', 'excitement', 'memorability', 'social currency' or 'talkability' in a device, and so on. You might want to start with devices in general and not narrow down to Google Glass right away (depending on the constructs you seek, of course).
  • Prep the moderator well: The moderator has a particularly crucial role. Have a broad list of constructs of interest and focus on giving each enough time and traction (without being overly pushy). For example, the mod could start by asking the group: "What do you think about portable devices? Where do you see the trend going in portable devices like your smartphone, fuel bands and so on?" to get the ball rolling, then steer it to keep it on course.
  • Converge on Google Glass in detail: After exploring devices in general, explore the particulars of Google Glass as a device - what is it, how is it viewed or understood, what are the perceptions, hopes and expectations around it, etc.
  • Do some background research on Tech trends and their Evolution first. See if any interesting analogies come up.
  • See where people agree in general, change opinions on interacting with other people on any topic, disagree sharply on some topics and stand their ground etc.
  • In your PPT report, mention some of the broad constructs you planned to explore via the FGD.
  • Report (among other things) what directions seem most likely to be fruitful for investigation.

BTW, here are some FGD video links by MKTR groups in Hyderabad in the last term.

Their FGD topic was different, though. FYI.

That's quite enough for now. Good luck for the placement season.

Sudhir

Tuesday, December 3, 2013

Session 3 Updates

Hi all,

Update:

This blog post from last year contains more details on how to interpret factor analysis results in R.

Session 3 covers two seemingly diverse topics - Questionnaire design and Data reduction via Factor analysis.

Each topic brings its own HW with it. And yes, let me pre-empt some possible gripes that may arise.... No, the HWs aren't overly heavy yet. The Hyd folks did similar HWs despite also having a project.

The Questionnaire design portion has a HW that asks you to program a websurvey based on a D.P.-R.O. that you extract from a given M.P. Let me introduce that HW right away below:

Consider the following problem context:

Flipkart, a leading Indian e-tailer, wants to know about how students in premier professional colleges in India view shopping online. Flipkart believes that this segment will, a few years down, become profitable, a source of positive word of mouth from a set of opinion leaders. This will seed the next wave of customer acquisition and growth and is hence a high stakes project for Flipkart.

Flipkart wants to get some idea about the online buying habits, buying process concerns and considerations, product categories of interest, basic demographics, media consumption (in order to better reach this segment) and some idea of the psychographics of this segment.

As lead consultant in this engagement, you must now come up with a quick way to prioritize and assess the target-segment's perceptions on these diverse parameters.

HW Q: Build a short survey (no longer than 12-15 minutes of fill-up time for the average respondent) on qualtrics web survey software for this purpose. Pls submit websurvey link in this google form. The deadline is before Session 5 starts.

******************************************

OK, now let's start. Fire up your Rstudio. Download all the data files required from the 'Session 3 files' folder on LMS.

Copy the code below and paste it on the 'Console' in Rstudio. A window will open up asking for the location of the dataset to be read in. Read in 'factorAn data.txt'. Use the 'Packages' tab in the lower right pane in Rstudio to install the nFactors package.

rm(list = ls()) # clear workspace first

# read in the data 'factorAn data.txt'

mydata=read.table(file.choose(),header=TRUE)

mydata[1:5,] #view first 5 rows

# install the required package first

install.packages("nFactors")

# determine optimal no. of factors

library(nFactors) # invoke library

ev = eigen(cor(mydata)) # get eigenvalues

ap = parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05);

nS = nScree(ev$values, ap$eigen$qevpea);

plotnScree(nS)

A scree plot should appear, like the one below:

On the scree plot that appears, the green horizontal line represents the eigenvalue = 1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That count is the optimal no. of factors; here, it is 2. The plot looks intimidating as it is, so pls do not bother with any of the other color-coded information given - blue, black or green. Just stick to the instructions above. Now, we set 'k1' to 2 as shown below:

k1 = 2 # set optimal no. of factors

If the optimal no. of factors changes when you use a new dataset, simply change the value of 'k1' in the line above (e.g., 'k1 = 6' or whatever you get as optimal) and paste it onto the R console. The rest of the code runs as-is.
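A quick base-R cross-check of the 'count the triangles before the cutoff' rule is the Kaiser criterion: count the eigenvalues of the correlation matrix that exceed 1. A sketch, demonstrated on the built-in USArrests data (your 'factorAn data.txt' would go through the same function):

```r
# Kaiser criterion: no. of eigenvalues of the correlation matrix above 1
kaiser.count <- function(df) sum(eigen(cor(df))$values > 1)
kaiser.count(USArrests)
```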

# extracting k1 factors with varimax rotation

fit = factanal(mydata, k1, scores="Bartlett", rotation="varimax");

print(fit, digits=2, cutoff=.3, sort = TRUE)

You'll see something like this below (click for larger image)

Clearly, the top 3 variables load onto factor 1 and the bottom 3 onto factor 2.

Another point of interest is the last line in the image above which says "Cumulative Var". It stands for Cumulative variance explained by the factor solution. For our 2 factor solution, the cumulative variance explained is 0.74 or 74%. In other words, close to three-quarters or 74% of the net information content in the original 6-variable dataset is retained by the 2-factor solution.
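That 'Cumulative Var' line is nothing mysterious: it is just the total of the squared loadings divided by the number of variables. A self-contained sketch on simulated two-factor data (the data here is synthetic, purely to make the code runnable):

```r
set.seed(1)
n <- 200
f1 <- rnorm(n); f2 <- rnorm(n)   # two latent factors
mydata <- data.frame(
  x1 = f1 + rnorm(n, 0, .4), x2 = f1 + rnorm(n, 0, .4), x3 = f1 + rnorm(n, 0, .4),
  x4 = f2 + rnorm(n, 0, .4), x5 = f2 + rnorm(n, 0, .4), x6 = f2 + rnorm(n, 0, .4))
fit <- factanal(mydata, factors = 2, rotation = "varimax")
# 'Cumulative Var' = total squared loadings / no. of variables
cum.var <- sum(fit$loadings^2) / ncol(mydata)
round(cum.var, 2)
```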

Also, look at the 'Uniquenesses' of the variables. The more 'unique' a variable is, the less it is explained by the factor solution. Hence, oftentimes we drop variables with very high uniqueness (say over 2/3rds) and re-run the analysis on the remaining variables. The dropped variables can essentially be considered factors in their own right and are included as such in downstream analysis. If there is *any* aspect of the above process that you want to see expanded or in more detail, pls let me know. I shall do so as best I can.
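The drop rule in the paragraph above can be coded directly: flag variables whose uniqueness exceeds 2/3. A self-contained sketch with one deliberately pure-noise variable (synthetic data again, just for runnability):

```r
set.seed(1)
n <- 300
f1 <- rnorm(n)                              # one latent factor
mydata <- data.frame(x1 = f1 + rnorm(n, 0, .4),
                     x2 = f1 + rnorm(n, 0, .4),
                     x3 = f1 + rnorm(n, 0, .4),
                     noise = rnorm(n))      # loads on no factor
fit <- factanal(mydata, factors = 1)
# variables that are candidates to drop and re-run without
drop.vars <- names(which(fit$uniquenesses > 2/3))
drop.vars
```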

We can now plot the variables onto the top 2 factors (if we want to) and see how that looks like. Also, we can save the factor scores for later use downstream, if we want to.

# plot factor 1 by factor 2

load <- fit$loadings[,1:2]

par(col="black") #black lines in plots

plot(load,type="p",pch=19,col="red") # set up plot

abline(h=0);abline(v=0)#draw axes

text(load,labels=names(mydata),cex=1,pos=1)

# view & save factor scores

fit$scores[1:4,]#view factor scores

write.table(fit$scores, file.choose())

The above is the plot of the variables on the first two factors. The variables closest to an axis (factor) load onto it.

*************************************

Session 3 - Homework 2: Factor Analysis:

  • In the Session 3 files' folder on LMS, there is a dataset labeled 'personality survey responses new.txt'.
  • This is *your* data - 33 psychographic variables that should map onto 4 personality factors - that you answered in the surveys of Session 2 HW.
  • Read it into R using the code given above.
  • Run the analysis as per the code given above or as given in 'R code for factor analysis.txt' notepad.
  • Look up the scree plot and decide what is the optimal # factors.
  • Plug that number into 'k1= ' piece of the code.
  • Copy and save the plots as metafiles directly onto a PPT slide.
  • Copy and paste the R results tables either into excel or as images onto PPT.
  • See the image below for 'Q reference' to map which variable meant what in the survey.
  • *Interpret* any 5 of the factors that emerge based on the variables that load onto them. Label those factors.
  • Write your name and PGID on the PPT title slide.
  • Name the PPT yourname_session3HW.pptx and submit into the session 3 dropbox on LMS
  • Submission deadline is a week from now, before the start of session 5.

Above is the Qs cross-reference, just in case.

Any Qs or clarifications etc., contact me. Pls check this blog post regularly for updates. As Qs and clarifications come to me, I will update this post to handle them.

Sudhir