Tuesday, December 24, 2013

Updates on Session 9

Hi all,

Session 9 is done. One more to go.

1. Notes from Today's session:

We covered a lot in this rather eclectic session - from Hypothesis formulation to Social network analysis (SNA). A few quick notes on the same:

i. Hypothesis formulation and testing fits neatly into the Causal research (Experimentation) topic that we did in Session 7. After all, logical, measurable hypotheses underlie the experimental method.

ii. The two types of tests we did - Association (chi-square) and Differences (t-tests) - cover the majority of situations you are likely to face. However, even for other, more esoteric testing requirements, R is handy and available. (A tiny R sketch of both tests follows at the end of these notes.)

iii. Regression modeling borrows much from your Stats core. However, even if repetitive, I couldn't risk leaving it out as IMO, basic regression modeling forms a critical part of tomorrow's Mktg manager's repertoire.

iv. The 3 basic regression variants we covered, viz. quadratic terms (for ideal point estimation), log-log models (for elasticities) and interaction effects, open up a lot of space for managers to maneuver and test conjectures in. (See the lm() sketch at the end of these notes.)

v. The SNA portion was new to me too, in a sense. I'd done my network theory basics way back in grad school, and getting back in touch felt good. SNA will only gain in importance and applicability going forward. We already saw the kinds of Qs it can provide guidance and answers for.

vi. We've only scratched the surface where SNA is concerned. R's capabilities (via the igraph package, for instance - see the sketch at the end of these notes) extend much further, but let me quickly admit that I have myself not explored very far in this area. A big practical limitation on SNA is how much data we can collect from Twitter, FB etc., since their APIs set rather small limits on automated data downloads.

vii. The R code and data for today's classwork are up on LMS. Pls try to replicate it at home; interested folks can play around with the code further and see what emerges.
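To make notes (ii), (iv) and (vi) above a little more concrete, here are three tiny R sketches. All object and variable names in them ('survey', 'df', the edge-list file etc.) are made up purely for illustration - swap in your own data before running.

# --- (ii) the two workhorse tests --- #

tab = table(survey$gender, survey$brand.preferred) # cross-tab of two categorical variables
chisq.test(tab) # association: chi-square test of independence
t.test(satisfaction ~ gender, data = survey) # differences: two-sample t-test

# --- (iv) the three regression variants, in lm() formula terms --- #

fit.q = lm(sales ~ price + I(price^2), data = df) # quadratic term, for ideal-point estimation
fit.e = lm(log(sales) ~ log(price), data = df) # log-log model: slope reads as an elasticity
fit.i = lm(sales ~ price * adspend, data = df) # interaction: does ad spend change the price response?
summary(fit.q) # the same summary() call reads off any of the three

# --- (vi) going further with SNA via the igraph package --- #

library(igraph) # install.packages("igraph") if not already installed
edges = read.table(file.choose(), header = TRUE) # a two-column who-named-whom edge list
g = graph.data.frame(edges, directed = TRUE) # build the network object
degree(g, mode = "in") # who gets named most often
plot(g, vertex.size = 5, edge.arrow.size = 0.3) # quick network plot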

2. Regarding SNA data:

Folks, normally I would remove identifying information about students from the data but in this case (and in the JSM homework case which called for individual level plots), that was not possible. So the identifying information remains.

Pls remember: Since only 6 friends could be listed, it's not possible for people to list all their friends. Hence some names might be missing in the SNA dataset of Co2014. Pls take that in the right spirit rather than blame anybody for non-reciprocity.

Bottomline: I don't want MKTR to be remembered for any negative reasons whatsoever.

3. Some Announcements:

a. Those who have missed filling up a survey can also take up the 2-page write-up assignment I'd given earlier to folks who'd missed pre-read quizzes, here, at this link. Deadline is Thursday.

b. There will be 5 pre-read quizzes in all (including the session 10 one). I'll consider your top 4 scores for grading. So if you performed really badly on one, you can let it go.

c. Update: I have decided to drop this reading from your pre-read list for session 10. You have two pre-reads for session 10. One is this McKinsey article: Capturing Business Value with Social Technologies. Scan it quickly; it doesn't require a very in-depth read, IMHO. But the general ideas should be clear.

d. Your other pre-read for session 10 is a famous Wired article from 2004 which went on to become a major 2006 book on the subject: The Long Tail, by Chris Anderson. It's an excellent article on a new economic paradigm enabled by technology.

e. The practice exam is up on LMS. Its solution is up too. But only those Qs which have one clear answer have been solved. More open-ended Qs have been left blank. The practice exam is a good template for what you can expect to see in the end-term. Most pre-reads per se will not come unless explicitly covered in the slides, as you can see in the practice exam.

f. I solved Session 8 HW again. The findAssocs() function seems to do OK for Qs 1-9, so keep it as is. findAssocs() is not needed for the Amazon reviews analysis anyway.

Well, that's it from me for now. See you on Thursday.

Sudhir

Saturday, December 21, 2013

Assignment in Lieu of one Pre-read Quiz

Hi All,

The following write-up-based assignment is *only* for those folks who have missed one pre-read quiz or done badly in any one quiz.

Problem background:

You are a consultant and your client, a multinational manufacturing behemoth, wants to know trends and impact of disruption in manufacturing technologies in the next decade with particular emphasis on 'additive manufacturing' (a.k.a. 3 dimensional printing) technologies.

Your D.P. is to find "Which industries and product categories will shift earliest to (or be most affected by) 3D printing tech and around what time line?".

An alternative D.P. says, "What are the most likely consumer uses of 3D printing and around what time line?"

Choose any one of the two D.P.s, build corresponding R.O.s and write a 3 (or fewer) page report (Times New Roman 12 font, 1.5 line spacing, standard margins) outlining your principal findings in solving that R.O. through secondary research alone.

Hint:

Google for 'economist.com 3D printing' (without the inverted commas). Scan through the links that appear on the first page. I have posted a few examples below.

How 3D printers work (7 Sept 2013)

3D printing Out of the box (6 Aug 2013)

3D printing scales up (7 Sept 2013)

Inventing HP in 3D (28 Nov 2013)

Pls ensure you have:

  • Written your name and PGID on the document
  • Clearly spelt out which D.P. you have chosen
  • Clearly spelt out your R.O.(s)
  • Clearly included citations of sources (URLs etc) either as footnotes or as a separate References section outside the page limit.
The deadline is before the start of the next class. Pls submit electronically to a dropbox that Ankit will make on LMS for this purpose.

Any queries etc., contact me.

Thanks.

Sudhir

Friday, December 20, 2013

Session 8 HW

Hi all,

Pls find below the last HW for MKTR - that for session 8.

The files for classwork and HW are both up on LMS. *Highly* recommended to first try classwork code before the HW one.

HW can be done in groups but submission must be individual only. Also, the interpretation should be your own; feel free to take help from peers for running the analysis.

Oh, and one other thing. For text analytics, it's better to leave Rstudio and run the analysis directly on the original R GUI. It will be there under Program Files in the Start menu. The same copy-paste will work there too. Plots appear in a cascading window. This is recommended but not mandatory.

HW Qs:

The following Qs are based on your survey responses

  • Q1. List the top 5 firms that people have expressed preference for.
  • Q2. For each of the top 3 most frequently cited firms, name the top two firms that co-occur the most and with what correlation coefficient.
  • Q3. Name two 'singleton' firms - firms that do not have any co-occurrence with any other firms in the network. [Hint: Invoke plot.words.network function]
  • Q4. Name three singleton people who apparently do not share firm preferences with anybody else in the class. [Hint: Invoke plot.ppl.network function]
  • Q5. Analyze the wordcloud for the top loyalty-commanding brands for Co2014
  • Q6. For each of the top 3 most frequently cited brands, name the top two brands that co-occur the most and with what correlation coefficient.
  • Q7. In the brands associations plot, do you see any natural groupings emerge? Do brands of a particular category or price level bunch up together? [Hint: Invoke plot.words.network function]
  • Q8. Name a few people who seem to share a lot of brand preferences with others in the class? [Hint: Invoke plot.ppl.network function]
  • ### following Qs are for web extraction of data from amazon ###
  • Q9. Collect 100-odd reviews from Amazon for the Xbox 360. Analyze the wordcloud. What themes seem to emerge from the wordcloud?
  • Q10. Analyze the positive wordcloud. What are the Xbox's seeming strengths? What can they position around?
  • Q11. Analyze the negative wordcloud. What are the Xbox's seeming weaknesses? What can they prioritize and fix?
Deadline is the coming Thursday midnight. Submission must be in the form of PPTs only. Write your name and PGID on the title slide and use your name as the file name. A dropbox will be made for this.

Any Qs etc, contact me.

Sudhir

Monday, December 16, 2013

Session 6 HW

Hi all,

Update: Mailbag

Received this email from Kanwal and my response is as under - displayed here for wider dissemination.

Dear Professor, I have read all the blog posts but I am confused about future assignments. All, I understand is that we have to submit a focus group assignment on 21 Dec. Can you please tell what are the next assignments and when they are due. Our exams start next week. Many thanks, Kanwal Kella

My response:

Hi Kanwal, Am not sure on what exactly is confusing here. Let me list it all out anyway.

Session 4 HW (FGD) - due 21-Dec Saturday
Session 5 - Segmentation and targeting - No HW
Session 6 - JSM HW - due a week later - before the beginning of session 9 on 24-Dec Tuesday
Course feedback text survey taking - due before session 8 on 19-Dec Thursday

Session 7 - Causal research - No HW
Session 8 - Text Analysis - a rather limited HW exercise will be due a week later, before the start of session 10, on 26 Dec Thursday
Sessions 9 and 10 - No HW

Hope that helps.

Sudhir

*****************************

Here is the HW for session 6. As mentioned before - this includes the HW for session 5 as well.

1. Session 6 HW: Part1 - JSMs

This HW is also a group submission. You will need to co-operate with the rest of your group to get it done.

  • JSM based homework:
  • Collect basic demographic information about your group mates - #yrs of workex, previous industry, educational qualifications, intended major etc.
  • Run individual level JSM analysis on each of your team mates (and yourself) using the code below (place the appropriate name in student.name = c("") in that code)
  • Compare the JSMs you obtain - what salient similarities and differences do you see?
  • Now, using the demographic data you have collected, speculate on which demographic characteristics are best able to explain at least some of the similarities and differences you see.
  • Place (i) the 4 JSMs, (ii) your list of salient similarities and differences (preferably in tabular form), and (iii) the subset of demographic variables that best explain the JSMs in a PPT.

Update:
I'm dropping Part 2 of the HW. Submit Part 1 and that will be sufficient.

2. Session 6 HW: Part 2 - Segmentation and Positioning PDAs

  • Connector PDA case based homework:
  • Pls scan through the basic facts about the ConneCtor PDA 2001 (segmentation) and (Positioning) cases in MEXL
  • Segment the dataset given along basis variables using model based clustering
  • Profile and characterize the segments that emerge. Give each one a reasonable, descriptive name.
  • Speculate on which of these segments you as the firm would most like to target. In other words, rate these segments in terms of their (High/Medium/Low) attractiveness for you.
  • Look at the discriminant variables list corresponding to your chosen segment. Based on the list, speculate on how you might target your chosen segment.
  • Paste the results you obtain (including the segment descriptions in tabular form) onto the PPT and submit.
Any queries etc, contact me.

Sudhir

Session 6 Updates

Hi all,

Session 6 covers two main ways to map perceptual data - (i) using the attribute ratings (AR) method to create p-maps and joint-space maps (JSMs), and (ii) using the overall similarity (OS) approach to create multidimensional scaling (MDS) maps.

We also saw some 101 stuff on positioning, definitional terms, common positioning strategies etc. The point was to get you thinking on how the mapping process could throw insights onto positioning in general, which strategy to adopt based on what criteria etc.

OK, next, what will follow is the code and snapshots of the plots that emerge from the classwork examples I did. Again, you are strongly encouraged to replicate the classwork examples at home. Copy-paste only a few lines of code at a time after reading the comments next to each line of code.

{P.S. - the statements following a '#' are for documentation purposes only and aren't executed}. So, without further ado, let us start right away:

##########################################

1. Simple Data Visualization using biplots: USArrests example.

We use USArrests data (inbuilt R dataset) to see how it can be visualized in 2 dimensions. Just copy-paste the code below onto the R console [Hit 'enter' after the last line]. Need to install package "MASS". Don't reinstall if you have already installed it previously. A package once installed lasts forever.

rm(list = ls()) # clear workspace

install.packages("MASS") # install MASS package

mydata = USArrests # USArrests is an inbuilt dataset

pc.cr = princomp(mydata, cor=TRUE) # princomp() is the core func

summary(pc.cr) # summarize the pc.cr object

biplot(pc.cr) # plot the pc.cr object

abline(h=0); abline(v=0) # draw horiz and vertical axes

This is what the plot should look like. Click on image for larger view.

2. Code for making Joint Space maps:

I have coded a user-defined function called JSM in R. You can use it whenever you need to make joint space maps, just by invoking the function. All it requires to work is a perceptions table and a preference rating table. First copy-paste the entire block of code below onto your R console. Those interested in reading the code, pls copy-paste line-by-line. I have put explanations in comments ('#') for what the code is doing.

## --- Build func to run simple perceptual maps --- ##

JSM = function(inp1, prefs){ #JSM() func opens

# inp1 = perception matrix with row and column headers
# brands in rows and attributes in columns
# prefs = preferences matrix

par(pty="s") # set square plotting region

fit = prcomp(inp1, scale.=TRUE) # extract prin compts

plot(fit$rotation[,1:2], # use only top 2 prinComps

type ="n", xlim=c(-1.5,1.5), ylim=c(-1.5,1.5), # plot parms

main ="Joint Space map - Home-brew on R") # plot title

abline(h=0); abline(v=0) # build horiz and vert axes

attribnames = colnames(inp1);

brdnames = rownames(inp1)

# -- insert attrib vectors as arrows --

for (i1 in 1:nrow(fit$rotation)){

arrows(0,0, x1 = fit$rotation[i1,1]*fit$sdev[1],

y1 = fit$rotation[i1,2]*fit$sdev[2], col="blue", lwd=1.5);

text(x = fit$rotation[i1,1]*fit$sdev[1], y = fit$rotation[i1,2]*fit$sdev[2],

labels = attribnames[i1],col="blue", cex=1.1)}

# --- make co-ords within (-1,1) frame --- #

fit1=fit; fit1$x[,1]=fit$x[,1]/apply(abs(fit$x),2,sum)[1]

fit1$x[,2]=fit$x[,2]/apply(abs(fit$x),2,sum)[2]

points(x=fit1$x[,1], y=fit1$x[,2], pch=19, col="red")

text(x=fit1$x[,1], y=fit1$x[,2], labels=brdnames, col="black", cex=1.1)

# --- add preferences to map ---#

k1 = 2; #scale-down factor

pref = data.matrix(prefs)# make data compatible

pref1 = pref %*% fit1$x[,1:2];

for (i1 in 1:nrow(pref1)){

segments(0, 0, x1 = pref1[i1,1]/k1, y1 = pref1[i1,2]/k1, col="maroon2", lwd=1.25);

points(x = pref1[i1,1]/k1, y = pref1[i1,2]/k1, pch=19, col="maroon2");

text(x = pref1[i1,1]/k1, y = pref1[i1,2]/k1, labels = rownames(pref)[i1], adj = c(0.5, 0.5), col ="maroon2", cex = 1.1)}

# voila, we're done! #

} # JSM() func ends

3. OfficeStar MEXL example done on R

Go to LMS folder 'Session 6 files'. The file 'R code officestar.txt' contains the code (which I've broken up into chunks and annotated below) and the files 'officestar data1.txt' and 'officestar pref data2.txt' contain the average perceptions (or attribute) table and the preferences table respectively.

Step 3a: Read in the attribute table into 'mydata'.

# -- Read in Average Perceptions table -- #

mydata = read.table(file.choose(), header = TRUE)

mydata = t(mydata) #transposing to ease analysis

mydata #view the table read

# extract brand and attribute names #

brdnames = rownames(mydata);

attribnames = colnames(mydata)

Step 3b: Read the preferences table into 'pref'.

# -- Read in preferences table -- #

pref = read.table(file.choose())

dim(pref) #check table dimensions

pref[1:10,] #view first 10 rows

Data reading is done. You should see the data read-in as in the figure above. We can start analysis now. Finally.

Step 3c: Run Analysis

# creating empty pref dataset

pref0 = pref*0; rownames(pref0) = NULL

JSM(mydata, pref0) # p-map without prefs information

The above code will generate a p-map (without the preference vectors). Should look like the image below (click for larger image):

However, to make true joint-space maps (JSMs), wherein the preference vectors are overlaid atop the p-map, run the one line code below:

JSM(mydata, pref)

That is it. That one function call executes the entire JSM sequence. The result can be seen in the image below.

Again, the JSM function is generic and can be applied to *any* dataset in the input format we just saw to make joint space maps from. Am sure you'll leverage the code for analyzing your project datasets. Let me or Ankit know in case any assistance is needed in this regard.

4. Session 2 survey Data on firm Perceptions:

Look up the LMS folder 'session 6 files'. Save the data and code files to your machine. Data files are 'courses data.txt' for the raw data on perceptions and 'courses data prefs.txt' for the preference data with student names on it. Now let the games begin.

# read in data

mydata = read.table(file.choose()) # 'courses data.txt'

head(mydata)

# I hard coded attribute and brand names

attrib.names = c("Brd.Equity", "career.growth.oppty", "roles.challenges", "remuneration", "overall.preference")

brand.names = c("Accenture", "Cognizant", "Citi", "Facebook", "HindLever")

Should you try using your project data or some other dataset, you'll need to enter the brand and attribute names for that dataset in the same order in which they appear in the dataset, separately as given above. I then wrote a simple function, titled 'pmap.inp()' to denote "p-map input", to transform the raw data into a brands-attributes average perceptions table. Note that the code below is specific to the last set of columns being the preferences data.

# construct p-map input matrices using pmap.inp() func

pmap.inp = function(mydata, attrib.names, brand.names){ #> pmap.inp() func opens

a1 = NULL

for (i1 in 1:length(attrib.names)){

start = (i1-1)*length(brand.names)+1; stop = i1*length(brand.names);

a1 = rbind(a1, apply(mydata[,start:stop], 2, mean)) } # i1 loop ends

rownames(a1) = attrib.names; colnames(a1) = brand.names

a1 } # pmap.inp() func ends

a1 = pmap.inp(mydata, attrib.names, brand.names)

The above code should yield the average perceptions table that will look something like this:

And now, we're ready to run the analysis. First the p-map without the preferences and then the full JSM.

# now run the JSM func on data

percep = t(a1[2:nrow(a1),]); percep

# prefs = mydata[, 1:length(brand.names)]

prefs = read.table(file.choose(), header = TRUE) # 'courses data prefs.txt'

prefs1 = prefs*0; rownames(prefs1) = NULL # null preferences doc created

JSM(percep, prefs1) # for p-map sans preferences

Should produce the p-map below: (click for larger image)

And the one-line JSM run:

JSM(percep, prefs) # for p-map with preference data

Should produce the JSM below:

Follow the rest of the HW code given to run segment-wise JSMs in the same fashion.

5. Running JSMs for individuals: (Useful for your HW)

One of your session 6 HW components will require you to make individual level JSMs and compare them with the class average JSMs. Use the following code to get that done:

### --- For Session 6 HW --- ###

# Use code below to draw individual level JSM plots:

student.name = c("Sachin") # say, student's name is Sachin

# retain only that row in the raw data which has name 'Sachin'
mydata.test = mydata[(rownames(prefs) == student.name),]

# run the pmap.inp() func to build avg perceptions table
a1.test = pmap.inp(mydata.test, attrib.names, brand.names)

percep.test = t(a1.test[1:(nrow(a1.test)-1),]);

# introduce a small perturbation lest matrix not be of full rank
percep.test = percep.test + matrix(rnorm(nrow(percep.test)*ncol(percep.test))*0.01, nrow(percep.test), ncol(percep.test));

prefs.test = prefs[(rownames(prefs) == student.name),]; prefs.test

# run analysis on percep.test and prefs.test
JSM(percep.test, prefs.test)

This is what I got as Mr Sachin's personal JSM:

More generally, change the student name to the one you want and run the above code.

6. Running MDS code with Car Survey Data:

In LMS folder 'session 6 files', the data are in 'mds car data raw v1.txt'. Read it in and follow the instructions here.

# --------------------- #
### --- MDS code ---- ###
# --------------------- #

rm(list = ls()) # clear workspace

mydata = read.table(file.choose(), header = TRUE) # 'mds car data raw v1.txt'

dim(mydata) # view dimension of the data matrix

brand.names = c("Hyundai", "Honda", "Fiat", "Ford", "Chevrolet", "Toyota", "Nissan", "TataMotors", "MarutiSuzuki")

Note that I have hard-coded the brand names into 'brand.names'. If you want to use this MDS code for another dataset (for your project, say) then you'll have to likewise hard-code the brand.names in. Next, I defined a function called run.mds() that takes as input the raw data and the brand names vector, runs the analysis and outputs the MDS map. Cool, or what..

### --- copy-paste MDS func below as a block --- ###

### -------------block starts here ---------------- ###

run.mds = function(mydata, brand.names){

# build distance matrix #

k = length(brand.names);

dmat = matrix(0, k, k);

for (i1 in 1:(k-1)){ a1 = grepl(brand.names[i1], colnames(mydata));

for (i2 in (i1+1):k){a2 = grepl(brand.names[i2], colnames(mydata));
# note use of Regex here

a3 = a1*a2;

a4 = match(1, a3);

dmat[i1, i2] = mean(mydata[, a4]);

dmat[i2, i1] = dmat[i1, i2] } #i2 ends

} # i1 ends

colnames(dmat) = brand.names;

rownames(dmat) = brand.names

### --- run metric MDS --- ###

d = as.dist(dmat)

# Classical MDS into k dimensions #

fit = cmdscale(d,eig=TRUE, k=2) # cmdscale() is core MDS func

fit # view results

# plot solution #

x = fit$points[,1];
y = fit$points[,2];

plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), type="p",pch=19, col="red");

text(x, y, labels = rownames(fit$points), cex=1.1, pos=1);

abline(h=0); abline(v=0)# horiz and vertical lines drawn

} # run.mds func ends

### ---------block ends here-------------------- ###

Time now to finally invoke the run.mds func and get the analysis results:

# run MDS on raw data (before segmenting)

run.mds(mydata, brand.names)

The resulting MDS map looks like this:

OK, that's quite a bit now for classwork replication. Let me know if any code anywhere is not running etc due to any issues.

Sudhir

Thursday, December 12, 2013

Session 5 Updates - Targeting

Hi all,

We'll quickly go over the targeting portion of the PDA case. Pls ensure you're comfortable with the How and why of segmentation and targeting from the lecture slides before going ahead with this one. I will assume you know the contents of the slides well for what follows.

#----------------------------------------------#
##### PDA caselet from MEXL - Targeting #######
#----------------------------------------------#

rm(list = ls()) # clear workspace

# read in 'PDA case discriminant variables.txt'

mydata = read.table(file.choose(), header=TRUE)

head(mydata) # view top few rows of dataset

The last column labeled 'memb' is the cluster membership assigned by mclust in the previous blogpost.

The purpose of targeting is to *predict* with as much accuracy as feasible, a previously unknown customer's segment membership. Since we cannot make such predictions with certainty, what we obtain as output are probabilities of segment membership for each customer.

First, we must assess how much accuracy our targeting algorithm has. There are many targeting algorithms developed and deployed for this purpose. We'll use the simplest and best known - the multinomial logit model.

To assess accuracy, we split the dataset *randomly* into a training dataset and a validation dataset. The code below does that (we use 'test' in place of validation in the code below).

# build training and test samples using random assignment

# roughly two-thirds (65%) of the sample is for training

train_index = sample(1:nrow(mydata), floor(nrow(mydata)*0.65))

train_data = mydata[train_index, ]

test_data = mydata[-(train_index), ]

train_x = data.matrix(train_data[ ,c(2:18)])

train_y = data.matrix(train_data[ , ncol(mydata)])

test_x = data.matrix(test_data[ ,c(2:18)])

test_y = test_data[ , ncol(mydata)]

And now we're ready to run logit (from the package 'textir'). Ensure the package is installed. And just follow the code below.

###### Multinomial logit using Rpackage textir ###

library(textir)

covars = normalize(mydata[ ,c(2,4,14)], s = sdev(mydata[ ,c(2,4,14)])); # scale the metric covariates; normalize() and sdev() ship with the (older) textir package - scale() is an alternative if they are unavailable

dd = data.frame(cbind(memb=mydata$memb,covars,mydata[ ,c(3,5:13,15:18)]));

train_ml = dd[train_index, ];

test_ml = dd[-(train_index), ];

gg = mnlm(counts = as.factor(train_ml$memb), penalty = 1, covars = train_ml[ ,2:18]);

prob = predict(gg, test_ml[ ,2:18]);

head(prob);

Should see the following result.

Note the table below shows probabilities. To read the table, consider the first row. Each column in the first row shows the probability that the first row respondent belongs to cluster 1 (with column 1 probability), to cluster 2 (with column 2 probability) and so on.

For convenience sake, we merely assign the member to that cluster at which he/she has maximum probability of belonging. Now, we can compare how well our predicted membership agrees with the actual membership.

To see this, run the following code and obtain what is called a 'confusion matrix' - a cross-tabulation between observed and predicted memberships. In the confusion matrix, the diagonal cells represent correctly classified respondents and off-diagonal cells the misclassified ones.

pred = matrix(0, nrow(test_ml), 1);

accuracy = matrix(0, nrow(test_ml), 1);

for(j in 1:nrow(test_ml)){

pred[j, 1] = which.max(prob[j, ]);

if(pred[j, 1]==test_ml$memb[j]) {accuracy[j, 1] = 1}

}

mean(accuracy)

The mean accuracy of the algo appears to be 63% in my run. Yours may vary slightly due to the randomly allocated training and validation samples. This 63% accuracy compares very well indeed with the 25% average accuracy we would get if we were to depend merely on chance to allocate respondents to clusters.
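Oh, and to actually see the confusion matrix mentioned above (rather than just the overall accuracy), a one-line cross-tab on the objects we just created will do. A minimal sketch:

table(predicted = pred[, 1], actual = test_ml$memb) # diagonal cells = correctly classified respondents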

That's it for now. Any queries etc. contact me over email or better still, use the comments section below this post.

Sudhir

Wednesday, December 11, 2013

Session 5 Updates

Hi all,

Yesterday in Session 5 we covered two major topics - Segmentation and Targeting.

Sorry about the delay in bringing out this blog post. In this blog post, I shall lay out the classwork examples (which you might want to try replicating) and their interpretation.

There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - Agglomerative Hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), Partitioning (here, we do K-means) and model based clustering. Each has its pros and cons (as discussed in class). Also, as mentioned in class, the goal here is to give tomorrow's managers (i.e., you) an exposure to the intuition behind clustering and the various methods in play. Going into technical detail is not a part of this course. However, I'm open to discussing and happy to receive Qs of a technical nature, outside of class time.

1. Cluster Analysis Data preparation

First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.

# first install these packages
# Note: You only need to install a package ONCE.
# Thereafter a library() call is enough.

install.packages("cluster")
install.packages("mclust")
install.packages("textir")
install.packages("clValid")
# Now read-in data#

mydata = USArrests

Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. To get rid of this problem, just copy-paste the following code.

# Prepare Data #

mydata = na.omit(mydata) # listwise deletion of missing

mydata.orig = mydata # keep an unscaled copy for profiling cluster means later

mydata = scale(mydata) # standardize variables
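By the way, here is a tiny illustration of the weight-in-Kgs vs weight-in-grams point made above (toy numbers, purely illustrative):

x = c(60, 75, 90) # three people's weights in Kgs
dist(x) # pairwise distances on the Kg scale
dist(x * 1000) # same people measured in grams - distances inflate 1000-fold
dist(scale(x)); dist(scale(x * 1000)) # after scale(), both versions give identical distances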

2. Now we first do agglomerative Hierarchical clustering, plot dendrograms, slice them at different heights and see what is happening.

# Ward Hierarchical Clustering

d = dist(mydata, method = "euclidean") # distance matrix

fit = hclust(d, method="ward") # run hclust func

plot(fit) # display dendrogram

Click on image for larger size.

Eyeball the dendrogram. Imagine horizontally slicing through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How to know?

Sometimes eyeballing is enough to give a clear idea, sometimes not. Various stopping-rule criteria have been proposed for where to cut a dendrogram - each with its pros and cons.

For the purposes of MKTR, I'll use three well-researched internal validity criteria available in the "clValid" package, viz. Connectivity, Dunn's index and Silhouette width - to determine the optimal no. of clusters. We don't need to go into any technical detail about these 3 metrics, for this course.

#-- Q: How to know how many clusters in hclust are optimal? #

library(clValid)

intval = clValid(USArrests, 2:10, clMethods = c("hierarchical"), validation = "internal", method = c("ward"));

summary(intval)

The result will look like the below image. 2 of the 3 metrics support a 2-cluster solution, so let's go with the majority opinion in this case.

Since we decided 2 is better, we set the optimal no. of clusters 'k1' to 2 below, thus:

k1 = 2 # from clValid metrics

Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# cut tree into k1 clusters

groups = cutree(fit, k=k1) # cut tree into k1 clusters

3. Coming to the second approach, 'partitioning', we use the popular K-means method.

Again, the Q arises: how to know the optimal no. of clusters? MEXL (and quite a few other commercial software packages) require you to magically come up with the correct number as input to K-means. R provides better guidance on choosing the no. of clusters. So with R, you can actually take an informed call.

#-- Q: How to know how many clusters in kmeans are optimal? #

library(clValid)

intval = clValid(USArrests, 2:10, clMethods = c("kmeans"), validation = "internal", method = c("ward"));

summary(intval)

The following result obtains. Note there's no majority opinion emerging in this case. I'd say, choose any one result that seems reasonable and proceed.

# K-Means Cluster Analysis

fit = kmeans(mydata, k1) # k1 cluster solution


To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize the cluster, interpret what is it that's common among a cluster's membership, give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.

# get cluster means

cmeans = aggregate(mydata.orig, by = list(fit$cluster), FUN = mean); t(cmeans) # view cluster centroids

# append cluster assignment

mydata1 = data.frame(mydata, fit$cluster);

mydata1[1:10,]


OK, that is fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In 2-dimensions, the easiest way is to plot the clusters on the 2 biggest principal components that arise. Before copy-pasting the following code, ensure we have the 'cluster' package installed.

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph

install.packages("cluster")
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)

Two clear cut clusters emerge. Missouri seems to border the two. Some overlap is also seen. Overall, the clusPlot seems to put a nice visualization over the clustering process. Neat, eh? Try doing this with R's competitors...:)

4. Finally, the last (and best) approach - Model based clustering. 'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions and because it penalizes for surplus complexity (resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: When in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge. Install the 'mclust' package for this first. Then run the following code.

install.packages("mclust")

# Model Based Clustering

library(mclust)

fit = Mclust(mydata)

fit # view solution summary

The mclust solution has 3 components! Something neither the dendrogram nor the k-means diagnostics predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.

fit$BIC # lookup all the options attempted

classif = fit$classification # classifn vector

mydata1 = cbind(mydata.orig, classif) # append to dataset

mydata1[1:10,] #view top 10 rows

# Use below only if you want to save the output

write.table(mydata1,file.choose())#save output

The classification vector is appended to the original dataset as its last column. We can now easily assign individual units to segments. Visualize the solution. See how exactly it differs from that for the other approaches.

fit1=cbind(classif)

rownames(fit1)=rownames(mydata)

library(cluster)

clusplot(mydata, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)

Imagine if you're a medium sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution had 3 segments - not 2 or 4. To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.

# get cluster means

cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean);

cmeans

In the pic above, the way to understand or interpret the segments would be to characterize each segment in terms of which variables best describe that cluster as distinct from the other clusters. Typically, we look for variables that attain their highest or lowest values for that cluster. In the figure above, it is clear that the first cluster (first column) is the most 'unsafe' (in terms of having the highest murder rate, assault rate etc.) and the last cluster the most 'safe'.

Thus, from mclust, it seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe. From the kmeans solution, we have 2 clusters - 'Unsafe' and 'Safe' emerging.

Now, we can do the same copy-paste for any other datasets that may show up in classwork or homework. I'll close the segmentation module here. R tools for the Targeting module are discussed in the next section of this blog post. Any queries or comment, pls use the comments box below to reach me fastest.

****************************************************

2. Segmenting and Targeting in R (PDA case - classwork example)

We saw a brief intro to the Conglomerate PDA case in the class handout in session 5. For the full length case, go to the 'Documents' folder in your local drive, then to 'My Marketing Engineering v2.0' folder, within that to the 'Cases and Exercises' folder and within that to the 'ConneCtor PDA 2001 (Segmentation)' folder (if you've installed MEXL, that is). For the purposes of this session, what was given in the handout is enough, however.

Without further ado, let's start. I'll skip the Hclust and k-means steps and go straight to model based clustering (mclust).

#----------------------------------------------#
##### PDA caselet from MEXL - Segmentation #####
#----------------------------------------------#

rm(list = ls()) # clear workspace

# read in 'PDA basis variables.txt' below
mydata = read.table(file.choose(), header=TRUE); head(mydata)

### --- Model Based Clustering --- ###

library(mclust)

fit = Mclust(mydata); fit

fit$BIC # lookup all the options attempted

classif = fit$classification # classifn vector

The image above shows the result. Click for larger picture.

We plot what the clusters look like in 2-D using the clusplot() function in the cluster library. What clusplot() essentially does is perform a factor analysis on the dataset, plot the first two (or largest two) factors as axes and the rows as factor score points in this 2-D space. In the clusplot below, we can see that the top 2 factors explain x% of the total variance in the dataset. Anyway, this plot is illustrative only and not critical to our analysis here.
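The clusplot call itself is the same one we used for the USArrests example, just with the mclust classification vector plugged in. A small sketch, assuming 'mydata' and 'classif' as created in the code above:

library(cluster)

clusplot(mydata, classif, color=TRUE, shade=TRUE, labels=2, lines=0) # 2-D view of the mclust segments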

But how to interpret what the clusters mean? To interpret the clusters, we have to first *characterize* the clusters in terms of which variables most distinguish them from the other clusters.

For instance, see the figure below in which the cluster means (or centroids) of the 4 clusters we obtained via mclust are shown on each of the 15 basis variables we started with.

Thus, I would tend to characterize or profile cluster 1 (the first column above) in terms of the variables that assume extreme values for that cluster (e.g., very LOW on Price, Monthly, Ergonomic considerations, Monitor requirements and very HIGH on use.PIM) and so on for the other 3 clusters as well.
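That centroid table can be reproduced with the same aggregate() call we used in the USArrests example. A sketch on the 'mydata' and 'classif' objects from above:

cmeans = aggregate(mydata, by = list(segment = classif), FUN = mean) # segment centroids on each basis variable

t(cmeans) # transpose: basis variables in rows, segments in columns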

What follows the profiling of the clusters is then 'naming' or labeling of the clusters suitably to reflect the cluster's characteristics. Then follows an evaluation of how 'attractive' the cluster is for the firm based on various criteria.

Thus far in the PDA case, what we did was still in the realm of segmentation. Time to now enter Targeting which lies at the heart of predictive analytics in Marketing. Since this blog post has gotten too large already, shall take that part to the next blogpost.

Sudhir

Wednesday, December 4, 2013

Session 4 Updates

Hi all,

Session 4 Big-picture Recap:

The readings-heavy Session 4 'Qualitative Research' covers many topics of interest. To recap the four big-picture take-aways from the session, let me use bullet-points:

  • We studied Observation Techniques - both of the plain vanilla observation (Reading 1 - Museums) and the 'immersive' ethnographic variety (Reading 2 - adidas).
  • We then ventured into deconstructing the powerful habit formation process and arrived at a 3-step loop framework to describe it for Marketing purposes: cue-routine-reward.
  • We saw how the innovative combination of qualitative insight and predictive analytics can lead to windfall $$ profits (Reading 3-Target and reading 4-Febreze)
  • Finally we saw how unstructured respondent interaction personified by a focus group discussion (FGD) can be a powerful qualitative tool for digging up customer insights.

Update:
This is a link to the NYT video that failed to play in class today.

**************************************

There are two parts to Session 4 HW.

Part 1 of Session 4 HW: Survey filling

Fill up these two surveys please, each less than 15 minutes apiece. Kindly do so positively by the Sunday midnight deadline.

Survey 1 for Conjoint analysis in Session 7

Survey 2 link for Social Network Analysis in Session 9

**************************************

Part 2 of Session 4 HW: FGD

Read the following two Economist articles (added later: and one ET article) outlining a new product about to hit the shelves that wants to do an Apple on Apple.

Every step you take

India's answer to Google Glass: Hands free wearable device enables users to carry out computer functions (Economic Times)

The people’s panopticon

Problem context:

You are a mid-sized technology firm with a US presence. Your R&D division has recently won angel funding for hiring bright talent and developing applications for the Google Glass platform. You, as the Marketing manager, need to give inputs to the tech team on what kind of apps and products may appeal to customers. You have a few ideas in mind but are unsure if they'll appeal to customers.

Run an FGD to explore tech-savvy early-majority customers' expectations and wishes from the Google glass (or more generally, a wearable networking and technology) platform. Pitch your ideas to the group and see how they receive it, what their expectations, concerns and first impressions are etc.

Submission format:

  • Form a group (no more than 4 people) and select a name for it (based on a well-known Indian brand)
  • Title slide of your PPT should have your group name, member names and PGIDs
  • Choose a D.P. and corresponding R.O.(s) for the given problem context for the FGD.
  • Next slide, write your D.P. and R.O.(s) clearly.
  • Third slide, introduce the FGD participants and a line or so on why you chose them (tabular form is preferable for this)
  • Fourth Slide, write a bullet-pointed exec summary of the big-picture take-aways from the FGD
  • Fifth Slide on, describe and summarize what happened in the FGD
  • Note if unification and / or polarization dynamics happened in the FGD
  • Name your slide deck groupname_FGD.pptx and drop it in the appropriate dropbox by the start of session 6
  • Extra points if you can put up a short video on YouTube of the FGD in progress and its major highlights. Share the link on the PPT

Any queries etc., pls feel free to email me.

********************************************

Update 1: FGD HW guidelines: (This is based on my experience with the Term 5 FGD HW in Hyd)

To keep it focussed and brief, lemme use the bullet-points format

  • The point of the FGD is *not* to 'solve' the problem, but merely to point a likely direction where a solution can be found. So don't brainstorm for a 'solution', that is NOT the purpose of the FGD.
  • Ensure the D.P. and R.O.s are aligned and sufficiently exploratory before the FGD can start. Different ROs lead to very different FGD outcomes. For example, if you define your R.O. as "Explore which portable devices will be most cannibalized due to Google Glass" versus "Explore potential for new to the world applications using Google Glass", etc.
  • Keep your D.P. and R.O. tightly focussed, simple and do-able in a mini-FGD format. Having too broad a focus or too many sub-topics will lead nowhere in the 30 odd minutes you have.
  • Start broad: Given an R.O., explore how people connect with or relate to portability, Technology and devices in general, their understanding of what constitutes a 'cool device', their understanding of what constitutes 'excitement', memorability', 'social currency' or 'talkability' in a device and so on. You might want to start with devices in general and not narrow down to Google Glass right away (depending on the constructs you seek, of course).
  • Prep the moderator well: The moderator in particular has a crucial role. Have a broad list of constructs of interest and focus on getting them enough time and traction (without being overly pushy). For example, the mod could start by asking the group: "What do you think about portable devices? Where do you see the trend going in portable devices like your smartphone, fuel bands and so on?" and get the ball rolling, then steer it to keep it on course.
  • Converge on Google Glass in detail: After exploring devices in general, explore the particulars of Google Glass as a device - what is it, how is it viewed or understood, what are the perceptions, hopes and expectations around it etc.
  • Do some background research on Tech trends and their Evolution first. See if any interesting analogies come up.
  • See where people agree in general, change opinions on interacting with other people on any topic, disagree sharply on some topics and stand their ground etc.
  • In your PPT report, mention some of the broad constructs you planned to explore via the FGD.
  • Report (among other things) what directions seem most likely to be fruitful for investigation.

BTW, here are some FGD video links by MKTR groups in Hyderabad in the last term.

Their FGD topic was different, though. FYI.

That's quite enough for now. Good luck for the placement season.

Sudhir

Tuesday, December 3, 2013

Session 3 Updates

Hi all,

Update:

This blog post from last year contains more details on how to interpret factor analysis results in R.

Session 3 covers two seemingly diverse topics - Questionnaire design and Data reduction via Factor analysis.

Each topic brings its own HW with it. And yes, let me pre-empt some possible gripes that may arise.... No, the HWs aren't overly heavy yet. The Hyd folks did similar HWs despite also having a project.

The Questionnaire design portion has a HW that asks you to program a websurvey based on a D.P.-R.O. that you extract from a given M.P. Let me introduce that HW right away below:

Consider the following problem context:

Flipkart, a leading Indian e-tailer, wants to know how students in premier professional colleges in India view shopping online. Flipkart believes that this segment will, a few years down the line, become profitable and a source of positive word of mouth from a set of opinion leaders. This will seed the next wave of customer acquisition and growth and is hence a high stakes project for Flipkart.

Flipkart wants to get some idea about the online buying habits, buying process concerns and considerations, product categories of interest, basic demographics, media consumption (in order to better reach this segment) and some idea of the psychographics of this segment.

As lead consultant in this engagement, you must now come up with a quick way to prioritize and assess the target-segment's perceptions on these diverse parameters.

HW Q: Build a short survey (no longer than 12-15 minutes of fill-up time for the average respondent) on qualtrics web survey software for this purpose. Pls submit websurvey link in this google form. The deadline is before Session 5 starts.

******************************************

OK, now let's start. Fire up your Rstudio. Download all the data files required from the 'Session 3 files' folder on LMS.

Copy the code below and paste it on the 'Console' in Rstudio. A window will open up asking for the location of the dataset to be read in. Read in 'factorAn data.txt'. Use the 'Packages' tab in the lower right pane in Rstudio to install the nFactors package.

rm(list = ls()) # clear workspace first

# read in the data 'factorAn data.txt'

mydata=read.table(file.choose(),header=TRUE)

mydata[1:5,] #view first 5 rows

# install the required package first

install.packages("nFactors")

# determine optimal no. of factors

library(nFactors) # invoke library

ev = eigen(cor(mydata)) # get eigenvalues

ap = parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05);

nS = nScree(ev$values, ap$eigen$qevpea);

plotnScree(nS)

A scree plot should appear, like the one below:

On the scree plot that appears, the green horizontal line represents the Eigenvalue=1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That is the optimal no. of factors. Here, it is 2. The plot looks intimidating as it is; hence, pls do not bother with any other color-coded information given - blue, black or green. Just stick to the instructions above. Now, we set 'k1' to 2 as shown below:

k1 = 2 # set optimal no. of factors

If the optimal no. of factors changes when you use a new dataset, simply change the value of "k1" in the line above. Copy paste the line onto a notepad, change it to 'k1=6' or whatever you get as optimal and paste onto R console. Rest of the code runs as-is.

# extracting k1 factors with varimax rotation

fit = factanal(mydata, k1, scores="Bartlett", rotation="varimax");

print(fit, digits=2, cutoff=.3, sort = TRUE)

You'll see something like this below (click for larger image)

Clearly, the top 3 variables load onto factor 1 and the bottom 3 onto factor 2.

Another point of interest is the last line in the image above which says "Cumulative Var". It stands for Cumulative variance explained by the factor solution. For our 2 factor solution, the cumulative variance explained is 0.74 or 74%. In other words, close to three-quarters or 74% of the net information content in the original 6-variable dataset is retained by the 2-factor solution.

Also, look at the 'Uniquenesses' of the variables. The more 'unique' a variable is, the less it is explained by the factor solution. Hence, oftentimes, we drop variables with very high uniqueness (say over 2/3rds) and re-run the analysis on the remaining variables. The dropped variables can essentially be considered factors in their own right and are included as such in downstream analysis. If there is *any* aspect of the above process that you want to see expanded or see more detail on, pls let me know. I shall do so to the best I can.
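To apply that thumb-rule in practice, something like the below works on the 'fit' object from above (the 2/3rds cutoff is just the rule of thumb mentioned, not a hard rule):

round(fit$uniquenesses, 2) # view the uniqueness of each variable

mydata.reduced = mydata[, fit$uniquenesses < (2/3)] # drop the very 'unique' variables

# re-run factanal() on mydata.reduced; treat the dropped variables as factors in their own right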

We can now plot the variables onto the top 2 factors (if we want to) and see how that looks like. Also, we can save the factor scores for later use downstream, if we want to.

# plot factor 1 by factor 2

load <- fit$loadings[,1:2]

par(col="black") #black lines in plots

plot(load,type="p",pch=19,col="red") # set up plot

abline(h=0);abline(v=0)#draw axes

text(load,labels=names(mydata),cex=1,pos=1)

# view & save factor scores

fit$scores[1:4,]#view factor scores

write.table(fit$scores, file.choose())

The above is the plot of the variable on the first two factors. The variables closest to the axes (factors) load onto it.

*************************************

Session 3 - Homework 2: Factor Analysis:

  • In the Session 3 files' folder on LMS, there is a dataset labeled 'personality survey responses new.txt'.
  • This is *your* data - 33 psychographic variables that should map onto 4 personality factors - that you answered in the surveys of Session 2 HW.
  • Read it into R using the code given above.
  • Run the analysis as per the code given above or as given in 'R code for factor analysis.txt' notepad.
  • Look up the scree plot and decide what is the optimal # factors.
  • Plug that number into 'k1= ' piece of the code.
  • Copy and save the plots as metafiles directly onto a PPT slide.
  • Copy and paste the R results tables either into excel or as images onto PPT.
  • See the image below for 'Q reference' to map which variable meant what in the survey.
  • *Interpret* any 5 of the factors that emerge based on the variables that load onto them. Label those factors.
  • Write your name and PGID on the PPT title slide.
  • Name the PPT yourname_session3HW.pptx and submit into the session 3 dropbox on LMS
  • Submission deadline is a week from now, before the start of session 5.

Above is the Qs cross-reference, just in case.

Any Qs or clarifications etc, contact me. Pls check this blog post regularly for updates. As Qs and clarifications come to me, I will update this post to handle them.

Sudhir

Thursday, November 28, 2013

Session 2 Updates and HW (Mohali)

Hi all,

Session Recap:
Session 2 got done today. We ventured into psychometric scaling and attempted to measure complex constructs using the Likert scale, among others. We also embarked on a common-sensical approach to survey design.

There were some Qs that "jump ahead" in the sense that I hope to cover them in Session 3 - Questionnaire Design. And Qs that seem to want to "force" a 'right' answer in a multiple choice context. Well, one issue with a lot of MKTR is that it is context-sensitive, so it's hard to proclaim 'right' answers that will hold true in general. "It depends" is usually a better bet. Wherever possible I do try to point out general principles and frameworks but in many cases, the problem context decides whether something is true or not.

For example, "inferring scale reliability via analysis of demographic profiles" raised quite a few Qs [PsyScaling Q7]. Well, there are product categories where demographics alone can explain enough to evaluate scale relibility on their basis. But in an increasing number of product categories, demographics do not explain product choice very much. In such instances, its hard to conclude definitely about scale reliability on their basis. At least, that was the limited point, I was making.

Study Group Formation details:

  • Since there's no project, there's no project group. However, a number of homework activities are group-based, hence pls form HW or study groups.
  • Regarding group formation, pls send an email (only one per group) to the AA in the format prescribed.
  • To help in group formation, pls find the list of all final registrants for MKTR_146 on LMS.
  • If you are unable to find group partners and would like to be allotted to a group, let the AA know.
  • Choose a well-known brand as your group name and write your group name next to your PGID and name in the spreadsheet.
  • Pls complete the group formation exercise well before the start of session 3.

*******************************************

Pls read *any 2* of the articles from the business press below. The following 2 HWs are based on your chosen articles.

HW Part 1: Reducing an M.P. to a D.P. to an R.O.

For each article,

  • Q.1.1. write a short description of what a mgmt problem (M.P.) may look like.
  • Q.1.2. Write one D.P. corresponding to the M.P.
  • Q.1.3. Write an example or two of R.O.s that correspond to the D.P.
HW Part 2: Construct Analysis

  • Q.2.1. List a few major constructs you find (if any) in each of the two articles that are of MKTR interest.
  • Q.2.2. Pick any one construct you have listed in Q.2.1. and break it down into a few aspects.
  • Q.2.3. Make a table with 2 columns. In the first column, write the names of the aspects you came up with. In the second column, corresponding to each aspect, write a Likert statement that you might use in a Survey Questionnaire to measure that aspect.

Session 2 HW submission format:

  • Use a plain white blank PPT.
  • On the title slide, write your name and PGID.
  • For slide headers, use format "HW1: [Article name]" (and so on for the next article chosen)
  • Pls mention clearly the Question numbers you are solving in the slide body. Use fresh slides for each new article
  • Use a blank slide to separate HW2 from HW1.
  • Save the slide deck as session2HW_yourname.ppt and put it in the dropbox on LMS before the start of session 4.
*******************************************

Session 2 HW part 3 - Survey filling

Pls complete the following two surveys latest by Sunday (01-Dec) midnight. I reckon it'll take you max 15 minutes on each survey.

Note: Pls answer as truthfully as you can. The data will be used to illustrate MKTR tools and R's analysis prowess in the classroom. It will *not* be shared with anybody outside the classroom.

Survey 1 (for JSM and perceptual mapping)

Survey 2 (Standard psychographic Personality profile questionnaire)

*******************************************

Session 3 Preview:
In Session 3 we cover two broad topics. For the first, we continue the "principles of survey design" part and wade into Questionnaire Design Proper. Be sure to read the pre-read on Questionnaire Design as it covers the basics. It'll thus help lighten my load considerably in class. And who knows if there's another pre-reads quiz lurking somewhere in Session 3 as well...

For the second broad topic, we do Data Reduction via Factor Analysis. For this we'll need R. Two ways to get R:

(i) The easy way is to copy the .exe files for both R and Rstudio put up by Ankit on LMS. First install R by clicking on the .exe file and following instructions. Then repeat the same for Rstudio.exe.

(ii) The second way is to directly download and install R from the following CRAN link, for both the Windows and Mac versions:

Download and Install R

Download and install Rstudio (only *after* R has been installed)

If you have any trouble with R installation, contact IT and let me know. Watch this space for more updates.
That's it for now. See you in the next class.
Sudhir

Tuesday, November 26, 2013

Welcome Message and Session 1 Updates (Mohali)

Hi Co2014 @ Mohali,

Welcome to MKTR.

The first session got done today. We covered some subject preliminaries and the crucial task of problem formulation.

Reading for session 2:
"What is Marketing Research" (HBS note 9-592-013, Aug 1991) in the coursepack.

About pedagogy going forward:
The parts of the pedagogy that make MKTR distinctive are: pre-read quick-checks using index cards, in-class reads, this blog and R.

About this blog:
This is an informal blog that concerns itself solely with MKTR affairs. It's informal in the sense that it's not part of the LMS system and the language used here is more on the casual side. Otherwise, it's a relevant place to visit if you're taking MKTR. Pls expect MKTR course related announcements, R code, Q&A, feedback etc. on here.

Last year, students said that using both LMS and the blog (not to mention email alerts etc.) was inefficient and confusing. I'm hence making this blog the single-point contact for all relevant MKTR course related info henceforth. LMS will be used only for file transfers, and email notifications will go out to the class only in emergencies. Each session's blog-post will be updated with later news coming at the top of the post. Kindly bookmark this blog and visit regularly for updates.

About Session 2:
Session 2 deals with psychographic scaling techniques and delves into the intricacies of defining and measuring "constructs" - complex mental patterns that manifest as patterns of behavior. This session also sets the stage for questionnaire design to make an entry in Session 3.

There will be two homework assignments for session 2. These merely involve filling up surveys that I will send you (we will use this data later in the course for perceptual mapping and segmentation). Nothing too troublesome, as you can see.

Any Qs etc, pls feel free to email me or use the comments section below.

Sudhir Voleti

Saturday, November 16, 2013

Session 8 HW queries

Hi all,

Some session 8 HW related queries that IMHO merit wider dissemination...

Satish Wrote:

Dear Professor,
I am having trouble interpreting the output of the factor regression and was wondering whether you could help me understand it better...

I understand that we use the factor regression for categorical variables. But in the Session 8 HW, the quant, qualitative etc are not categorical variables but we are forcing them to be categorical – correct? I didn’t understand why we were doing this.. (eg: summary(lm(overall ~ factor(quant1) +factor(quali1) +factor(R1) +factor(HWs1) +factor(blog1))))

Also, how does the interpretation of the results from the factor regression differ from that of regular regression? For example, what does each beta coefficient mean in a factor regression? I understand that ‘high’ is the reference in each of the factors but what exactly does it mean when we say that (for example) increasing the factor(quant) low would decrease the overall rating? (as shown by the negative sign)

Could you please elaborate? Thanks
Best Regards
Satish

My response:

Hi Satish (and Swati, who had a similar Q),

1. True that we use dummy variables (in R, the factor() function makes dummy 0/1 variables out of a categorical variable) for categorical or nonmetric variables, and that the raw data for the Co2014 feedback was metric.

2. The point of the HW was to get you to run a dummy-variables regression anyway. I *discretized* the metric X variables into categorical X variables on a High/Med/Low scale. Normally we wouldn't do this (metric variables are any day much more informative than nonmetric factors), but for this HW, we did.

3. The interpretation of a factor regression is straightforward - take the High/Med/Low case. By default R chooses one of the 3 categories (typically the first, here High) as the reference, sets it to zero and measures the effect of the other two factor levels (Med and Low) against this zero baseline. If Med and Low have a higher impact than High, their coefficients are positive; if they have a lower impact, the coefficients are negative; and if they have about the same impact as High, the coefficients are insignificant.

4. Changing the reference makes no difference to the rest of the regression; it only moves the baseline up or down. For example, if you were to make Low the reference, just add the negative of the old Low coefficient to the coefficients of High, Med and Low, and you have your new set of coefficients (Low itself becomes the new zero baseline).

To test this, just tweak the code slightly: replace factor(quali1) with relevel(factor(quali1), ref = "Low") in the code (base R's relevel() is what changes the reference level) and then run the analysis again. Note what happens to the coeffs, and to the overall fit in R-square terms etc.
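
For concreteness, here's a minimal sketch of that comparison, assuming the discretized HW variables sit in a data frame I'm calling fb (the data frame name is just illustrative; the column names follow the HW code, and I keep only two predictors for brevity):

# fit the dummy-variables regression with R's default reference level
fb$quali1 = factor(fb$quali1)
fit1 = lm(overall ~ factor(quant1) + quali1, data = fb)
summary(fit1) # the omitted (reference) level of each factor is the zero baseline

# now make "Low" the reference level for quali1 and re-fit
fb$quali1 = relevel(fb$quali1, ref = "Low")
fit2 = lm(overall ~ factor(quant1) + quali1, data = fb)
summary(fit2) # coefficients shift by a constant; R-square and fitted values stay the same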

Hope that helps.

Anupama writes:

I have one more query –
I am unable to interpret negative coefficients in the variant of regression when you introduced categories of independent variables in the assignment.

If the overall qualitative rating is on a scale of 1-9….should I understand categories as - Low -> 1-3; Med -> 4-6; High -> 7-9 ?
With the above understanding, should I interpret negative coefficient for qual1-Low as ….
‘decrease in low qualitative rating increase overall rating’ => ‘Increase in quality rating increases overall rating’ ?

Overall implications => Professor should give high importance to qualitative material and low level of quantitative material and rest of the factors(like HW, blog) are not significant enough to affect overall rating?

Please let me know if my above understanding is correct.

Also, it would be great if we can addendum to the Session 8 and provide solution to this assignment.
It would help us in preparation for end-term exam.

My response:

Hi Anupama,

This is correct:

‘decrease in low qualitative rating increase overall rating’ => ‘Increase in quality rating increases overall rating’ ?

The High/Med/Low categories were chosen, I think, based on this rule: High (Low) if the score is more than one standard deviation above (below) the mean; everything else is Medium.
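
In code, that rule would look something like the sketch below (assuming a metric vector quali holding the raw ratings; the variable name is just illustrative):

m = mean(quali); s = sd(quali) # cutoffs at mean +/- 1 standard deviation
quali1 = ifelse(quali > m + s, "High", ifelse(quali < m - s, "Low", "Med"))
table(quali1) # see how respondents fall into the three buckets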

I have received a few more such queries; will write a blog post and share my responses.

P.S.
Will put up the session 8 HW solution (actually, a few exemplarily good submissions) on LMS.

Friday, November 15, 2013

Project related Mailbag, Q&A

Hi all,

This post will be a general-purpose one for project-related Q&A (the earlier one was for end-term Q&A).

Priyo wrote to me with these Qs:

Hi Prof,

Requesting some clarity regarding further scope of work for the project submission.

Data Collection Guidelines—sample size, demographic, etc.
Analysis and flow of slides.
Scope of conclusion.
Read the grading criteria on your blog post but couldn't glean much regarding the above.

Thanks,
Priyo

My response:

Hi Priyo,

>> Data Collection Guidelines—sample size, demographic, etc.

The data collection guidelines are, well, flexible. Am not expecting rigorous adherence to sample size requirements for instance. I'd say, with 4 people in a group, each person collecting some 10-15 survey responses is quite enough.

Regarding demographics, ideally go for offerings targeted at the most abundant demographic at ISB - upper-middle-class urban youth in their mid-to-late 20s.

>> Analysis and flow of slides.

These should be, in one word, 'common-sensical'. There's no right or wrong way, just context-dependent implementation, I guess.

>> Scope of conclusion.

Depends on the scope of the DP and ROs. If the ROs are confirmatory, then yes, a yes/no kind of clear decision recommendation would be nice.

If exploratory, mere pointers or indications for further investigation are typically deemed sufficient.

Hope that helps.

Sudhir

P.S.
Will put this up on the blog for wider dissemination.

P.S. watch this space for more such updates. More recent updates at the top of the post.

********************************

Update: Incorporating social network analysis via R for MKTR insights:

I got an interesting Q from a student working on twitteR, on whether and how text analytics relates to social network analysis in general. So I started digging around... And recently discovered that R has a full suite of social media mapping and network analysis applications.

Yes, text analytics is merely the tip of the iceberg. R can go far deeper and far higher (at the same time) than mere text analytics. Now let's talk in terms of *networks* - vertices (or nodes) and edges signifying relations between the nodes...

Am going to demo social network analysis 101 on R using your course feedback 'overall feedback.txt' and the names of the associated students in 'names.txt' (see LMS). Social network analysis would be a full lecture (or, perhaps, even a full course) by itself, but the major take-aways can be skimmed through rather quickly, I reckon.

Try these classwork examples at home; maybe you'll want to do something like this for your project?
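
Before the code below will run, load the text-mining packages (I'm assuming tm and RWeka are already installed from the session 9 classwork):

library(tm) # for Corpus, TermDocumentMatrix, weightTfIdf
library(RWeka) # for NGramTokenizer, Weka_control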

# read-in data first

names = read.table(file.choose()) # 'names.txt'

x = readLines(file.choose()) # 'overall feedback.txt'

x1 = Corpus(VectorSource(x)) # build corpus

ngram <- function(x1) NGramTokenizer(x1, Weka_control(min = 1, max = 2))

tdm0 <- TermDocumentMatrix(x1, control = list(tokenize = ngram,
tolower = TRUE,
removePunctuation = TRUE,
removeNumbers = TRUE,
stopwords = TRUE,
stemDocument = TRUE)) # patience. Takes a minute.
# remove columns with zero sums

dim(tdm0); a0 = NULL;

for (i1 in 1:ncol(tdm0)){ if (sum(tdm0[, i1]) == 0) {a0 = c(a0, i1)} }

if (length(a0) >0) { tdm1 = tdm0[, -a0]} else {tdm1 = tdm0}; dim(tdm1)

inspect(tdm1[1:5, 1:10])# to view elements in tdm1, use inspect()

# convert tdms to dtms
# dtm weighting from Tf to TfIdf (term freq Inverse Doc freq)
dtm0 = t(tdm1) # docs are rows and terms are cols
dtm = weightTfIdf(dtm0) # new dtm with TfIdf weighting (weightTfIdf() is tm's TfIdf weighting function)

# rearrange terms in descending order of TfIDF and view

a1 = apply(dtm, 2, sum); a2 = sort(a1, decreasing = TRUE, index.return = TRUE);

dtm01 = dtm0[, a2$ix]; dtm1 = dtm[, a2$ix];

The above analysis was standard, pretty much what we did in the classwork in session 9. What follows comes with a twist. We'll need to install the package 'igraph' for this.

What we do next is find 'relations' or connections between terms - for our context, I define a 'connection' between two terms as their intra-document co-occurrence, i.e. how often those terms occurred together in a document, across all docs in the corpus. Somewhat like a cluster dendrogram, I guess, but way cooler.

install.packages("igraph") # install once per comp

### --- making social network of top-40 terms --- ###

dtm1.new = inspect(dtm1[, 1:40]); # top-40 tfidf weighted terms

term.adjacency.mat = t(dtm1.new) %*% dtm1.new; dim(term.adjacency.mat)
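# entry (i, j) of this matrix is the (TfIdf-weighted) co-occurrence of terms i and j across documents;
# the diagonal (each term with itself) turns into self-loops in the graph, which simplify() removes below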

## -- now invoke igraph and build a social network --

library(igraph)

g <- graph.adjacency(term.adjacency.mat, weighted = T, mode = "undirected")

g <- simplify(g) # remove loops

V(g)$label <- V(g)$name # set labels and degrees of vertices

V(g)$degree <- degree(g)

# -- now the plot itself

set.seed(1234) # set seed to make the layout reproducible

layout1 <- layout.fruchterman.reingold(g)

plot(g, layout=layout1)

You should see something like this: [plot of the term co-occurrence network]

The image depicts connections between terms. Of course, one may say that social networks are built among *people*, not among terms.

OK. Sure.

So can we build one among people using a similar procedure? You bet. This time, we'd be connecting people using the common terms they used in the corpus. The code to do that is below:

### --- make similar network for the individuals --- ###

dtm2.new = inspect(dtm1[,]); dim(dtm2.new)

term.adjacency.mat2 = dtm2.new %*% t(dtm2.new); dim(term.adjacency.mat2)
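# entry (i, j) here measures the (weighted) overlap in terms used by respondents i and j,
# so people who wrote about the same things end up connected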

rownames(term.adjacency.mat2) = as.matrix(names)
colnames(term.adjacency.mat2) = as.matrix(names)

g1 <- graph.adjacency(term.adjacency.mat2,
weighted = T, mode = "undirected");

g1 <- simplify(g1) # remove loops

V(g1)$label <- V(g1)$name # set labels and degrees of vertices
V(g1)$degree <- degree(g1)

# -- now the plot itself --

set.seed(1234) # set seed to make the layout reproducible

layout2 <- layout.fruchterman.reingold(g1)

plot(g1, layout=layout2)

And the result will look something like this: [plot of the respondent network]

Recall Session 9's segmentation exercise on ice-cream flavor comments? We were trying to cluster together respondents based on similarity or affinity in terms used. k-means scree plots were a poor way of judging the number of clusters. The above graph provides much better insight that way. Seems like there are two big clusters and some 2-3 smaller ones in the periphery.

Sure, important and interesting Qs arise from this kind of analysis... For instance, "Who is the most representative commentator, i.e. the one whose words best represent the class's?", "Who is best connected with majority opinion?" and so on. Marketers routinely ask similar Qs to try to detect "influencers" (along various metrics of node centrality etc.). But again, that is a whole other session.
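
If you want to poke at such Qs yourself, here's a rough sketch using igraph's standard centrality and community-detection functions on the respondent graph g1 built above (illustrative only, not the 'official' way to define an influencer):

# rank respondents by a couple of standard centrality metrics
cent = data.frame(name = V(g1)$name,
degree = degree(g1), # number of connections
betweenness = betweenness(g1)) # how often a node sits on shortest paths between others
head(cent[order(-cent$degree), ]) # top few by degree centrality

# let igraph find the clusters we eyeballed in the plot
comm = fastgreedy.community(g1)
table(membership(comm)) # cluster sizes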

Sudhir

Wednesday, November 13, 2013

End-term updates

Hi all,

Update: R tutorial announcement:

22-Nov Friday 4-6 pm at AC8 LT (tentatively). Will get a venue confirmation and update here. Would be nice if every group is represented at the tutorial.

Sudhir

1. Exam related Cases:

There are two full-length cases in your course-pack: The 'Coop' case, and 'The Fashion Channel' case.

At least one of these two cases will feature in the end-term exam. I'd rather you not try to read the case for the first time in the exam hall. Pls read them at home beforehand.

Preferably, discuss them in a group on the lines of the following Qs:

  • What is the management problem? Describe also the major symptoms.
  • What are some of the likely decision problems (DPs) that emerge based on the management problem?
  • List a few research objectives (R.O.s) that emerge based on the DPs.

2. End-term exam pattern Notes

  • There are a total of 50 Qs, 2 marks each.
  • The Qs are broken down into 8 Question-sets, each having tables or figures and Qs based on them.
  • The Qs are all short-answer - True/False, fill-in-the-blanks, write an expression for ...., name these factors - that type of stuff.
  • If any Q comes from any pre-read, the concerned pre-read will be specified in the Q itself. So bring your course-pack to the exam. Not all pre-reads are relevant, only the ones I've specifically asked you to read. Properly speaking, those pre-reads are a part of the course.
  • Nothing that was not covered in class will show up anywhere in the exam.
  • At least one question set relates to a full-length case (see above)
  • Time will not be a problem - you'll have 150 minutes for a 120 minute paper.

Pls use the comments section to this post for any Q&A so that it is visible to the class at large.

See you in class for our last classroom meeting. Any feedback you have on how to improve any aspect the course etc is welcome at any time.

3. R tutorial and R in your resume:

From the project point of view, if you want an R tutorial at any time between now and 24-Nov (when I leave for Mohali), pls let me know. A quorum of at least 5 people must sign up for the tutorial, which can go as technical as the attendees want. Ideally, one person from each project group would attend.

If you want to include R in your resume, then provided that you have made a good-faith attempt at installing and running R for your HWs, provided that you intend to continue to invest in R going forward and provided that you have been able to read the code sent and get a "sense" of the analysis, pls consider amending and using any relevant subset of the following (as applicable):

  • State what you have done w.r.t R:
  • Have developed a familiarity with the R environment
  • Have used R to apply and analyze a wide variety of Mktg Research tools (from structured hypothesis testing to text analytics and social media analysis)
  • Have gained some understanding of the flexibility and extensibility of the system (installing and using packages and interfaces with external repositories)
  • Have analyzed a full credit course project on the platform
  • State what you will do going forward:
  • Intend to continue investing in and developing greater insights into the R platform
  • Are attracted by the low-cost, license-free unrestricted use terms and rapid-analysis capabilities of the open-source platform
  • Believe in R's promise of substantially expanding enterprise analytics capabilities while keeping a tight lid on costs
  • Believe in the philosophy of collaborative design, rapid rollout and innate scalability
  • Bottomline: are convinced of R's compelling cost/benefit calculus for delivering enormous value to the organization

Again, folks, ensure that *only* that subset that applies to you in a bona fide way is used for your statements of purpose, on the resume etc. Making false claims has a habit of coming back to bite you at inconvenient times.

Good luck for the exam and, going forward, for the placements as well.

Sudhir