Wednesday, October 31, 2012

Session 4 HW deadline extension

Update: Folks, there seems to be a widespread misconception that there is an MDS component in the HW for session 4. No, there is not! MDS requires a lot of work prepping data into the right format for entry, which is why I did not give out any MDS HW.

Hi all, I received a few emails and this one sums the issue up rather well:

"Dear Professor,
In view of the multiple job postings/deadlines this week, we just wanted to check with you if there is any possibility of extending the deadline for HW 4, to Saturday Nov. 3rd.
This is an important assignment that I would like to attempt with full sincerity and not under time pressures. Will be very grateful if you can accommodate the request."
My response:
Hi S,

I've received a similar request from a few other students as well.
Normally I would accede to this request. However, my concern is that this may become a precedent of sorts, and not of the healthy variety. The HW was announced last Thursday, after all.
I understand there's a lot going on and some students may have difficulty submitting on time. A late submission (within an extra 3 days) will only carry a nominal penalty in this case, so go ahead and submit whenever you can.
However, pls note that failure to submit within this time frame will be treated as a non-submission.
Thanks,

I'd also like to mention that the HWs are HCC1, so you are free to collaborate with others in running the analyses. However, the interpretation and write-up should be individual effort only. So my suggestion is: pool your resources to reduce the time pressure for all involved. Sudhir

Session 5 HW

Update: The HW Qs can be found on the Session 5 HW word doc, which is up on LMS. Answers must be written/typed only in the space provided. Feel free to give and take help in the data analysis part but the interpretation & write-up must be individual only. No need to attach any graphs and charts.

Hi all,

There are 2 parts to the session 5 homework. One deals with segmentation-targeting using the preference, demographic and psychographic data collected in the session 2 HW; the other is a simple spreadsheet modeling exercise for demand estimation.

1. Segmentation-Targeting homework

A clean, 'anonymous' version of the dataset (with names/PGIDs removed) is up on LMS. Your task is to come up with a segmentation of people based on their psychographic disposition. There are a few demographic variables also - gender, workex, whether the intended major is marketing and whether the edu background is engineering. Each of you answered 25 psychographic Qs on a 1-5 agree-disagree scale (5=completely agree), yielding a 71x25 matrix. Now, 25 columns is too large for an interpretable segmentation to happen easily. So...

(i) Reduce the data to a smaller set of factors.

Justify your choice of the no. of factors. Comment on the suitability of the factor solution in terms of information content retained. Interpret what the factors may mean. Use your judgment and pick the top 10 factors and/or standalone variables that should be considered as basis variables for a segmentation step.

(ii) Now segment the 71 respondents into a manageable no. of similar groups.

Justify your choice of clustering method and your choice of the no. of clusters. Profile the clusters by looking at each segment's size and average scores on each basis variable/factor. Give each cluster a name.

(iii) Use the demographic information given in a discriminant analysis setting (either on MEXL or R). Which demographic variables do you think are good predictors of segment membership?

Note: Preferably use MEXL for the discriminant analysis, as I haven't yet figured out the MANOVA inference part for linear discriminants on R. If you choose to use R for this part, then ignore the significance testing for now. A rough R sketch tying steps (i)-(iii) together follows below.
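To make the workflow concrete, here is a minimal, hedged R sketch of the whole flow, reusing pieces of the Session 4 and 5 Rcode posts. The column positions and the values of k1 and k2 below are placeholders only - I'm assuming the psychographic items sit in the first 25 columns of the LMS file and the demographics follow them; adjust to the actual file and justify your own choices.

# rough sketch only - adjust column positions and k1/k2 to your data #
library(nFactors); library(MASS)
dat = read.table(file.choose(), header=TRUE) # anonymized LMS dataset
psy = dat[, 1:25] # assuming the psychographic items are cols 1-25
demo = dat[, 26:ncol(dat)] # assuming the demographics follow

# (i) factor-analyze the psychographics #
ev <- eigen(cor(psy)) # eigenvalues for the scree plot
ap <- parallel(subject=nrow(psy), var=ncol(psy), rep=100, cent=.05)
plotnScree(nScree(ev$values, ap$eigen$qevpea))
k1 = 5 # your justified no. of factors goes here
fit <- factanal(psy, k1, scores="Bartlett", rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE) # map variables to factors, name them
scores = fit$scores # 'reduced data' used as basis variables

# (ii) cluster the respondents on the factor scores #
k2 = 3 # your justified no. of clusters goes here
cl = kmeans(scale(scores), k2)
table(cl$cluster) # segment sizes
aggregate(as.data.frame(scores), by=list(segment=cl$cluster), FUN=mean) # segment profiles

# (iii) discriminant analysis with the demographics as predictors #
z = lda(factor(cl$cluster) ~ ., data=demo)
z # view discriminant coefficients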

2. Demand estimation exercise

Pls open the elementary spreadsheet model put up on LMS. Change the sample size parameter (while assuming the proportions remain the same) and see how much the upper and lower bounds of revenue vary. Now answer the following questions:

(a) What is the minimum sample size required (ignore the population size for this one) to get the revenue projection confidence interval to be no wider than Rs 10 lakh? Rs 5 lakh? Rs 20 lakh?

(b) At the 90% confidence level, what is the minimum sample size required to get the revenue projection confidence interval to be no wider than Rs 10 lakh? Rs 5 lakh? Rs 20 lakh?
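If you want to sanity-check your spreadsheet answers, here is a minimal R sketch of the underlying arithmetic. The population size, per-buyer revenue and assumed buying proportion below are purely hypothetical placeholders - substitute the values actually used in the LMS spreadsheet.

# hypothetical placeholders - use the values from the LMS spreadsheet #
N = 100000 # population size (assumed)
price = 200 # revenue per buyer in Rs (assumed)
p = 0.3 # sample proportion of likely buyers (assumed)
conf = 0.95 # confidence level; switch to 0.90 for part (b)
z = qnorm(1 - (1-conf)/2) # z-value for the chosen confidence level
W = 10*100000 # target CI width, e.g. Rs 10 lakh

# revenue estimate = N*price*p; CI width = 2*z*N*price*sqrt(p*(1-p)/n). #
# Setting the width equal to W and solving for n gives: #
n.min = p*(1-p) * (2*z*N*price / W)^2
ceiling(n.min) # minimum sample size (ignoring the finite population correction)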

Thanks,

Sudhir

Session 6 Rcode

Hi all,

R helps qualitative analysis by processing some part of unstructured text into usable output. I hope you'll appreciate just how much time and hassle is saved by the simple application of R to the text analysis problem. We'll see two big examples in class -

(1) processing open-ended survey responses in a large dataset to reveal which terms are used most frequently (as a word cloud) and which sets of terms occur together (as collocation dendrograms). We will use data from the Ice-cream survey (dataset put up on LMS) for this part.

(2) processing consumers' text output (think of product or movie reviews) to yield some basic measures of the "emotional" content and direction - whether positive or negative (also called 'valence' in academic research parlance) - of the consumer's response. We'll use some product reviews downloaded from Amazon and some movie reviews downloaded from Times of India for this one. Again, datasets will be put up on LMS.

1. Structuring text analysis in R

The question was "Q.20. If Wows offered a line of light ice-cream, what flavors would you want to see? Please be as specific as possible." The responses are laid out as a column in an excel sheet with each cell marking one person's response. There was no limit to how much text one wanted to write in response to the question.

First, before we begin, load these libraries - 'tm', 'Snowball' and 'wordcloud'. Ensure these packages are installed and ready on your machine.

library(tm)
library(Snowball)
library(wordcloud)

Next, copy the relevant columns from the excel sheet and save the unstructured text in a notepad. The relevant dataset is Q20.txt in the session 6 folder of LMS.

x=readLines(file.choose()) # reads the file
x1=Corpus(VectorSource(x)) # create Doc corpus

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeWords, stopwords("english"))
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix #
x1mat = DocumentTermMatrix(x1)
The above may take 1-2 minutes, so pls be patient.

Let's see which terms occur most frequently in the open-endeds for that question. 'See' as in visualize them, not just tabulate them.

mydata = removeSparseTerms(x1mat,0.99) # removes sparse entries
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,]
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))]
# view frequencies of the top few terms
colSums(mydata.df1) # term name & freq listed
# make barplot for term frequencies #
barplot(data.matrix(mydata.df1)[,1:8], horiz=TRUE)

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1), scale=c(4, 0.5), colors=1:10)
The first image here is of a horizontal bar chart depicting relative frequencies of term occurrence.

The second image depicts a word-cloud of the term frequencies. Two different ways of seeing the same thing.

The time has come to find collocations (sometimes called co-locations) of terms in the data. The basic task is: "find which words occur together most often in consumers' responses".

# making dendrograms to visualize word collocations
min1 = min(mydata$ncol,25)
test = matrix(0,min1,min1)
test1 = test
for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}
# make dissimilarity matrix out of the freq one
test2 = (max(test1)+1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]
# now plot the collocation dendrogram
d <- dist(test2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # note: newer R versions call this method "ward.D"
plot(fit) # display dendrogram
And the image looks like this:
I'll incorporate the second part later today. For now, this is it.

Elementary sentiment Analysis

Update: The second part - rudimentary sentiment analysis - is in. I don't want to raise expectations overly, since what's done here is fairly basic. But it is quite useful in multiple MKTR contexts, and very much extensible in multiple ways on R.

Recall the written feedback you gave in session 2 rating the MKTR session. Well, I've managed to make a soft-copy of your comments and that is the input dataset here. So now, I'll try to mine the sentiment behind your MKTR assessment comments. This dataset is available as student_feedback.txt in the 'Session 6' folder in LMS. Pls try this at home.

1. Step one is always, load the libraries needed.

library(tm)
library(wordcloud)
library(Snowball)
Next, read-in the data. The data were originally in an excel sheet with each student's comments in one cell and all the comments in one 68x1 column. I copied them onto a notepad and the notepad will be put up on LMS for you to practice the following code with.
# read-in file first #
x = readLines(file.choose())
summary(x) # view summary
# create Doc corpus
x1=Corpus(VectorSource(x))
So we create what is called a 'corpus' of documents - done when the text input comes from multiple people and each person's output can be treated as a separate 'document'.

2. Read-in the emotional-content word lists.

There are generic lists compiled by different research teams (in different contexts) of emotion-laden words. We'll use a general list for now. The lists are provided as notepads. Read them into R as follows:

# Sentiment word lists (positive vs negative)
# downloaded opinion lexicon from
# "http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html"


# read-in positive-words.txt
pos=scan(file.choose(), what="character", comment.char=";")

# read-in negative-words.txt
neg=scan(file.choose(), what="character", comment.char=";")

3. Add additional emotional-words as required

Since words used to verbalize emotions (or, as psychologists call it, "affect") depend a lot on context, we have the freedom to add our own context-specific emotional words to the list. E.g., "sweet" may be a positive word in a chocolate-chip context but not so much in a potato-chips context (just as an example).

# including our own positive words to the existing list
pos.words=c(pos,"wow", "kudos", "hurray")

#including our own negative words
neg.words=c(neg,"wait", "waiting", "too")

4. Clean-up the text of irrelevant words, blank spaces and the like.

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeWords, stopwords("english"))
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix #
mydata = DocumentTermMatrix(x1)

mydata.df <- as.data.frame(inspect(mydata)); mydata.df[1:10,]
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))]

5. Now extract the most frequently used emotional words and plot them in a wordcloud.

# match() returns the position of the matched term or NA

pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")
neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")
b = rbind(b1,b2)

wordcloud(rownames(b), b[,1]*20, scale=c(5, 1), colors=1:10)
Now this is what the class told me in terms of their feedback.

Sure, one can do downstream analysis on this - assign scores of positivity or negativity to each comment, categorize the emotion more finely - not just in positive/negative terms but in more detail - joy, satisfaction, anger etc. One can think of clustering respondents based on their score along different emotional dimensions. Think of the co-location analysis possibilities that arise, etc.
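For instance, a bare-bones version of the comment-level scoring (a sketch only, reusing the mydata.df1, pos.words and neg.words objects created above) could look like this:

# net sentiment score per comment = positive-word hits minus negative-word hits #
pos.cols = !is.na(match(colnames(mydata.df1), pos.words))
neg.cols = !is.na(match(colnames(mydata.df1), neg.words))
score = rowSums(mydata.df1[, pos.cols, drop=FALSE]) - rowSums(mydata.df1[, neg.cols, drop=FALSE])
score[1:10] # net scores for the first 10 comments
hist(score, main="Distribution of comment-level sentiment") # quick look at the spread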

All this is possible and very much do-able on R. Think of the applications in terms of product reviews, recommendations systems, mining social media for information, measuring "buzz" etc. We'll continue down that path a little more in Session 9 - Emerging trends in MKTR.

Chalo, dassit for now. See you in class. Ciao.

Sudhir

Monday, October 29, 2012

Interpreting a Joint Space Map - Session 4 HW help

Hi all,

I received some pretty pointed Qs on how exactly one should interpret a Joint Space map. Here's my attempt at a simplified answer.

A perceptual map positions multiple brands against multiple attributes. Each one of you evaluated 4 course offerings along 5 attributes (4 regular + 1 preference attribute). So we should be able to make p-maps for each one of you, individually, right? Below is the rating set one of you (let's call him/her SK) gave:

Courses GSB INVA MGTO SAIT
Conceptual Value 3 4 5 4
Practical Relevance 4 5 5 5
Interest Stimulated 4.8 4.9 5.2 5.1
Difficulty 3 6 3 3
Preference 1 4 5 6

Note: The third row of ratings was '5' for all courses. A constant row or column is problematic as the matrix would be singular. So I had to introduce small variation there, hence the '4.8' type numbers you see.

The ratings clearly say that SK perceives SAIT, INVA & MGTO to be high on the first 3 attributes, INVA high on difficulty, SAIT high on preference and GSB low on all attributes. Any perceptual map (henceforth, p-map) must faithfully capture and depict at least this much information. Let us see how well our p-map does:

What if all courses are rated the same on an attribute?
Notice that when all courses are rated the same (score was '5') on 'Interest', that attribute is no longer informative in the p-map. Apart from Interest, seems like the map pretty much captures the basic ordering of SK's perceptions. Would you agree?

How does SK's preference vector - the maroon line - enter the picture?
Well, SK has given his/her preferences for each of the 4 courses. And each course has its position on the map already marked in terms of (x,y) co-ordinates. So, we weight each course's (x,y) co-ordinates with SK's preference score. The weighted averages of the x- and y- values become SK's preference vector co-ordinates. Simple, no?
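In R terms, that weighting step is a one-liner. Below is a tiny sketch with made-up (x, y) co-ordinates for the four courses; the actual co-ordinates come out of the joint space map code in the Session 4 Rcode post (which uses a scaled weighted sum rather than an average - the direction is the same).

# made-up (x, y) positions of the 4 courses on the map - for illustration only #
coords = matrix(c(-0.9, -0.2, 0.1, 0.8, 0.5, -0.3, 0.6, 0.1), nrow=4, byrow=TRUE)
rownames(coords) = c("GSB", "INVA", "MGTO", "SAIT")
pref.sk = c(1, 4, 5, 6) # SK's preference ratings for the 4 courses
pref.vec = (pref.sk %*% coords) / sum(pref.sk) # preference-weighted average (x, y)
pref.vec # the direction in which SK's preference vector points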

Look closely and you can see that SK's preference ordering is borne out on SK's p-map. SAIT has the highest pref and is closest (in perpendicular distance) to the higher side of the maroon line, followed closely by MGTO. INVA is exactly at the mean preference score ('4' for SK) and so sits almost on the origin w.r.t. the maroon line. GSB is well below the mean level and so appears on the opposite side.

That was one student's p-map. To see how p-maps vary when attribute evaluations change, consider another student HS's ratings set followed by his/her Joint Space map.

Courses GSB INVA MGTO SAIT
Conceptual Value 7 1 3 5
Practical Relevance 5 3 3 7
Interest Stimulated 7 6 1 5
Difficulty 1 5 1 4
Preference 7 1 3 5

Compare this map with SK's map. What would happen if I take an *average* of these students' scores and draw a *third* map? Would it adequately represent the perceptions of either, neither or both students? How better to find out than to simply perform the experiment, eh? Below is the attribute table of the average of SK's and HS' ratings. The preferences are not averaged, instead we'll get two pref lines - 1 per student.

Courses GSB INVA MGTO SAIT
Conceptual Value 5 2.5 4 4.5
Practical Relevance 4.5 4 4 6
Interest Stimulated 5.9 5.45 3.1 5.05
Difficulty 2 5.5 2 3.5
preference SK 1 4 5 6
preference HS 7 1 3 5
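(The averaging itself is a one-liner in R. A sketch, assuming the two students' 5x4 rating tables above have been read into hypothetical objects ratings.sk and ratings.hs, with the attribute rows first and the preference row last:)

# element-wise average of the two attribute tables (first 4 rows only) #
avg.attrib = (ratings.sk[1:4, ] + ratings.hs[1:4, ]) / 2
# the preference rows are NOT averaged - they enter as 2 separate rows #
pref = rbind(SK = c(1, 4, 5, 6), HS = c(7, 1, 3, 5))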

Compare the three p-maps. Look at student 1's preference vector in the combined map. Where it was clearly favoring MGTO and SAIT on one side and pointing slightly away from INVA on the other, it now shows a weird position away from all 3. Think of how distorted preference vector locations can become when we average over just 2 people.

Now imagine averaging over a group of 71 respondents. Would the distortions cancel out or get aggravated? Segmentation procedures try to find groups of people with similar perceptions. Based on what we have just seen, is the case for segmentation prior to p-mapping strengthened? I'll leave you with these Qs as you interpret and get the final answers to your homeworks for session 4.

Sudhir

R tutorial - Afterthoughts

Hi all,

First off, thanks for attending the tutorial in good numbers and making it interactive and interesting. Now I have a much better sense of practical issues students are facing in taking to R. Accordingly, I have revised, re-formatted and updated the Rcode blog-posts for session 4 and will do the same for session 5 also shortly.

Some thoughts:
1. Data input remains the biggest headache.
The good news of course is that this is eminently solvable. I have edited the session 4 blog-post to reflect this. The data entry part for the joint space maps code is broken into steps 2a, 2b and 2c with clear instructions on the sequence to be followed. Practice data input a few times and after that it becomes a peaceful, mechanical process (as it should be).

2. The graphs I get should be put up on the blog so that people can compare their output with mine.
This is now done. Pls see the plots uploaded as images under each of the steps mentioned in the Session 4 Rcode post.

Click on the images for a larger size. Let me know if you are unable to see the images for any reason.

3. Google docs spreadsheet is sometimes causing issues with simple copy-pasting.
However, the tables are read in fully despite the error message shown. So pls ignore that particular error message and proceed.

4. File-name confusion erupts, occasionally.
I named most datasets we read in 'mydata' for consistency's sake. But you can choose any (non-system-reserved) name. No problemo.

5. Package installation issues were there.
I suspect it could be the CISCO problem with the FTP protocol. However, I was able to download packages peacefully after yesterday. Pls let me know if you are still unable to download and install packages. I suggest choosing a US based R server due to their higher bandwidth.

In Session 4, for MDS, we need package 'MASS'; for eigenvalue scree plots, we need package 'nFactors'; and in session 5, we will need packages 'mclust' and 'cluster' for a variety of cluster analysis algorithms. Pls download these and keep this part ready, as and when you are able to, from your homes.

Added later:
6. The point of it all

People have asked how R is helping fulfill the course goals. Let me recap and clarify.

Folks, the course is designed for managers to understand the scale and scope of the MKTR challenges facing them in the near future. It isn't designed to teach how to code in R.

Now, I think the best way for folks to understand MKTR challenges is:
- to be exposed to the fundamental concepts in the area (through pre-reads, reinforced by classroom discussion and reflection)
- to be exposed to the tools of the trade (through exercises, assignments, the hands-on project and discussions)
- to be exposed to the emerging changes and trends driven by economics, technology and firm strategy (through in-class readings, project and discussions).

R contributes to the course goals by stressing the last 2 aspects. Of course, you are free to use any software platform you like.

However, I recommend R for the following reasons:

(i) You get your hands dirty with data. Unlike some packages (SPSS, SAS etc.), R is not about doing analyses from afar but about shaping the flow and transformation of data line by line, para by para. There's no substitute for actually grappling with data to understand what is going on.
(ii) You get acquainted with a very powerful, very flexible and totally cutting-edge platform for *all* (and not merely your MKTR) computing needs. Sans licensing worries too. This becomes important because problems are increasingly multi-disciplinary, and as managers you have to think outside the MKTR box to solve them.
(iii) You learn the *correct* way - the scientific, data-driven, objective way - to decision making in the data analysis stage. For instance, Qs like "How many factors/clusters should we have?", "Why is one segmentation model better than another?" etc. often call for judgment calls and heuristics which can be risky in new and untested situations. Better to have R do the heavy lifting of model fitting and complexity-penalization and then guide you to the correct decision.
(iv) R offers economies of scale - even large datasets [with 1000s of observations] are as easily and speedily crunched as the small ones. Having larger RAM and faster processors always helps, of course. [However, if your datasets are millions of observations long, then go to SAS.]
(v) R offers economies of scope - not just plain vanilla S-T-P analysis or regressions but v cool, v emerging stuff from text analytics to sentiment analysis to pattern recognition - are all there. Somebody somewhere in the world faced that problem, solved it and put up the general solution procedure neatly packaged into a R library module. And R continues to grow with each passing month.
(vi) Sure, data input/output is not as easy-breezy as in some other platforms but that is the only hurdle to cross (and not a particularly high one at that).

I hope I have persuaded you to stay the course with R and not give up because of point (vi).

That is all I can think of for now. Pls use the comments section below for your queries, suggestions and feedback. Sudhir

Sunday, October 28, 2012

Rcode for Session 5 Classwork

Hi all,

Welcome to the Rcodes to Session 5 classroom examples. Pls try these at home. Use the blog comments section for any queries or comments.

There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - agglomerative hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), partitioning (here, we do K-means) and model-based clustering. Each has its pros and cons. Model-based is probably the best around; highly recommended.

1. Cluster Analysis Data preparation
First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.

#first read-in data#
mydata = USArrests

Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. To get rid of this problem, just copy-paste the following code.

# Prepare Data #
mydata <- na.omit(mydata) # listwise deletion of missing
mydata.orig = mydata #save orig data copy
mydata <- scale(mydata) # standardize variables

2. Now we first do agglomerative hierarchical clustering, plot dendrograms, slice them at different heights and see what is happening.

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward") # note: newer R versions call this method "ward.D"
plot(fit) # display dendrogram

Eyeball the dendrogram. Imagine horizontally slicing through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How to know? Sometimes eyeballing is enough to give a clear idea, sometimes not. Suppose you decide 2 is better. Then set the optimal no. of clusters 'k1' to 2.

k1 = 2 # eyeball the no. of clusters
Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# cut tree into k1 clusters
groups <- cutree(fit, k=k1)
# draw dendrogram with red borders around the k1 clusters
rect.hclust(fit, k=k1, border="red")

3. Coming to the second approach, 'partitioning', we use the popular K-means method.

Again, the Q arises: how to know the optimal no. of clusters? Eyeballing the dendrogram might sometimes help. But at other times, what should you do? MEXL (and most commercial software too) requires you to magically come up with the correct number as input to K-means. R does one better and shows you a scree plot of sorts that depicts how the within-segment variance (a proxy for clustering solution quality) varies with the no. of clusters. So with R, you can actually take an informed call.

# Determine number of clusters #
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #
Look for an "elbow" in the scree plot. The interior node at which the angle formed by the 'arms' is the smallest. This scree-plot is not unlike the one we saw in factor-analysis. Again, as with the dendogram, we get either 2 or 4 as the options available. Suppose we go with 2.

# Use optimal no. of clusters in k-means #
k1=2
Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# K-Means Cluster Analysis
fit <- kmeans(mydata, k1) # k1 cluster solution

To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize the cluster, interpret what is it that's common among a cluster's membership, give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.

# get cluster means
aggregate(mydata.orig,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata1 <- data.frame(mydata.orig, fit$cluster)

OK, that is fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In 2 dimensions, the easiest way is to plot the clusters on the 2 biggest principal components that arise. Before copy-pasting the following code, ensure you have the 'cluster' package installed.

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)
Two clear cut clusters emerge. Missouri seems to border the two. Some overlap is also seen. Overall, the clusPlot seems to put a nice visualization over the clustering process. Neat, eh? Try doing this with R's competitors...:)

4. Finally, the last (and best) approach - Model based clustering.

'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions and because it penalizes for surplus complexity (resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: When in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge.

Install the 'mclust' package for this first. Then run the following code.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
fit # view solution summary
The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree plot predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.

fit$BIC # lookup all the options attempted
classif = fit$classification # classifn vector
mydata1 = cbind(mydata.orig, classif) # append to dataset
mydata1[1:10,] #view top 10 rows

# Use only if you want to save the output
write.table(mydata1,file.choose())#save output
The classification vector is appended to the original dataset as its last column. Can now easily assign individual units to segments.

Visualize the solution. See how exactly it differs from that for the other approaches.

fit1=cbind(classif)
rownames(fit1)=rownames(mydata)
library(cluster)
clusplot(mydata, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)
Imagine if you're a medium sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution had 3 segments - not 2 or 4.

To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.

# get cluster means
cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean); cmeans
Seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe.

5. Discriminant Analysis for targeting

I only demo linear discriminant analysis on R. Other, more complex functions such as the quadratic discriminant are available for enthusiasts to explore (let me know if you really want to go there). The following code, copy-pasted, does a simple linear discriminant analysis on the mclust output and tries to see which targeting variables best discriminate (or predict) segment membership between 2 particular clusters. Thus, to know which variables help determine segment membership in cluster 1 versus cluster 2, look at discriminant function 1, and so on.

# run discriminant analysis for targeting
library(MASS)
# mydata1 from the mclust step above already has 'classif' as its last column
z <- lda(classif ~ ., data = mydata1) # run linear discriminant
z # view result
For k1 clusters, there will be (k1-1) discriminant functions. Look at the largest coefficients (in absolute terms) for each targeting variable to roughly assess importance. Admittedly, I haven't incorporated the inference part of LDA as yet (it involves invoking MANOVA) but for now, this should do.
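If you do want a rough omnibus check in the meantime, one way (a sketch only, reusing the mydata.orig and classif objects from the mclust step; not a full substitute for the LDA inference mentioned above) is to run a MANOVA of the variables on the cluster assignment and look at Wilks' lambda:

# rough check: do the variables jointly differ across the clusters? #
fit.m = manova(as.matrix(mydata.orig) ~ factor(classif))
summary(fit.m, test = "Wilks") # a small p-value suggests the segments differ overall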

Now, we can do the same copy-paste for any other datasets that may show up in classwork or homework. I'll close the segmentation and targeting piece here.

Sudhir

Saturday, October 27, 2012

Attendance related Queries and other issues

Update: Quick Announcement

The first half of session 5 will cover a holistic S-T-P (Segmentation-Targeting and Positioning) view. The emphasis will be on the segmentation part and the clustering techniques it relies upon. The second half will cover a primer on demand estimation techniques. This is coming up now because, having done sampling basics (in session 3) and segmentation (in session 5), you are now, finally, ready to take a crack at the big question - "what is the likely demand for xyz product or service offering in the ABC market?". Will talk about this a little more in a later blog post introducing the background and motivation for Session 5 topics (along the lines of what I did for session 4 here).

Hi all,

Two points to elaborate upon in this blog post. Pls respond with your views and comments (if any) in the comments section.

Attendance-related queries

There have been quite a few queries in the past week relating to how a student may make up for the attendance-CP points for a class that may have been missed for any reason.

The attendance component is 8% and 1% is allotted for each session fully attended. There will be a total of 10 such sessions (sessions 2-10 and the guest lecture session). So a student can peacefully miss up to 2 sessions without incurring *any* attendance penalty. However, if a student misses more than two sessions, for *whatever* reason, then I believe s/he should be ready to accept some nominal grade penalty (a 1% dunk is not the end of the world, after all). Not imposing such a penalty would be unfair to other students who have diligently worked around their schedules to attend sessions.

Re the CP marks, they are aggregated across sessions. So it's quite OK for a person to be very active in one session and relatively inactive in another - since the CP score is averaged over sessions. So, missing a session or two doesn't overly skew the CP points if you make up in the sessions you do attend.

BTW, the CP grade has both oral and quick-check written components. The oral component will get more weightage simply because it enlivens the class and brings interaction value directly to bear. So, kindly don't work on the assumption that one CP form is a substitute for the other.

Regarding the Guest Lecture

Thank you for making the guest speaker session a success.

For the record, I don't agree 100% with everything Mr Singhal said. However, he is spot on in that the future MKTR challenges need a fresh approach and cannot reliably be extrapolated from past techniques, data and trends.

I hope you've noticed that MKTR@ISB also works on similar assumptions. The emphasis in MKTR@ISB is also on discerning emerging trends, gaining fresh perspective, honing problem formulation competencies, digging for insights beneath the visible surface, learning recombinatory methods on a super-versatile tools platform (R) among other things. To this end, I appreciate your feedback on how to sharpen and smarten the course components, contents and delivery.

Mr Singhal's slide deck will be up on LMS. A recording of the session should be available with LRC in a few days.

Regards,

Sudhir

Friday, October 26, 2012

Some Notes on Session 4

Hi all,

A few quick points.

1. Optional R tutorial this Sunday?

Many of you have said in the written quick-check feedback that an opt-in tutorial would help. How about an hour-long tutorial this Sunday at, say, 4 pm? Interested folks, pls bring your laptops with R installed and LAN cables. R installation instructions are mentioned in this previous blogpost. I'll intimate the venue (most likely AC8 LT since it has R already installed) as soon as I can get confirmation.

Pls write and let me know if the time is not convenient to a majority of you for any reason.

P.S.
Let me sweeten the tutorial deal a little bit. While walking you through how R works, I will *solve* the session 4 Homework with you. You are free to directly use the results for your homework submission.:)

P.P.S.
BTW, here are some excellent video tutorials on R that cover most newbie FAQs. If you can't find the answer to your query there, then feel free to contact me or my RA Mr Ankit Anand directly.

How to do stuff in r in two minutes or less

UC Denver site with video compilations of how to do basic stuff on R

2. Project Proposals provisionally accepted

I went through a few (not all, yet) project proposals. Interesting perspectives on some of them; some R.O.s need tweaking. For some others, the tools might need a rethink. But overall, good show.

Must mention that a few teams did not stick to the format prescribed (and described in such careful detail). No titles, verbose R.O.s, the management problem conflated with the DP, absence of plausible alternative DPs, etc. But the majority of proposals I saw seemed OK.

For the record, pls assume that you have provisional approval for your project proposals. If there's any issue, I'll contact you directly. I shall also put up some good (example) projects from previous years. That should help accelerate learning based on what worked and didn't in the past.

3. Rationale for Session 4 homeworks

Let's dissect the session 4 HW in some more detail.

Perceptual mapping Homework

Split core courses dataset into two by engineering versus non-engineering background. Save the split datasets in a separate excel worksheet.

For each split part of the dataset, bring data into the input format needed for MEXL or R. Thus, you will get two 4x4 tables with average scores for each attribute on each course offering. And the corresponding preference tables.

Now run the analysis. Save the plots on a PPT.

Examine the perceptual maps to see if something “jumps” out at you in terms of perceptual differences that can be interesting/usable from a Marketing standpoint. Record your observations as bullet points on a slide. Try to go a little beyond the obvious; dig beneath the surface phenomena and see what insights might come.

The exact same thing repeats for work-experience based segmentation. Now sort the dataset by work-ex length. Split the original dataset into two - those with workex below the median and those with workex above it. Repeat the exact same thing we did above.
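If you're doing this part in R, here is a minimal sketch of the splitting step. The column names ('engg', 'workex') are hypothetical placeholders - substitute whatever the actual column names are in the HW file.

# assuming a respondent-level file with columns 'engg' (1/0) and 'workex' #
# (in months), plus the course-attribute ratings #
dat = read.table(file.choose(), header = TRUE)
dat.e = subset(dat, engg == 1) # engineering background
dat.ne = subset(dat, engg == 0) # non-engineering background
dat.hi = subset(dat, workex > median(dat$workex)) # above-median work-ex
dat.lo = subset(dat, workex <= median(dat$workex)) # at or below median
# compute the 4x4 mean-attribute table for each split and feed it into #
# the joint space map code from the Session 4 Rcode blog-post #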

Factor analysis

Read in the psychographic responses dataset into R. Run the factor analysis code as shown in the Session 4 R code blogpost. Answer the following questions:

(i) How many factors are optimal? How did you arrive at this decision?
(ii) How much of variance is explained by the optimal no. of factors?
(iii) Map which variables load onto which factors.
Now this part is really important:
(iv) Interpret what the factors may mean (i.e., what may be the underlying construct behind the variables). Give each factor a suitable name.
(v) For those attending the R tutorial this Sunday: we'll also see how to read, interpret and save the factor loadings and the factor scores from factor analysis. The factor scores can be used as 'reduced data' for downstream analysis.

Well, that's it from me for now. Hope to see you Sunday.

Sudhir

Thursday, October 25, 2012

Session 4 R code and Homework Data

Update:

Pls see this Session 4 HW guidance blog-post before submitting your HW.

I have re-formatted parts of this blog-post. Inline images have been added for each tool's graphical output. Pls let me know if you're having any trouble viewing the images. Use the comments box below for fastest response.

Hi all,

I hope you had the chance to run the R demo of session 3 at home just to see first-hand R's ease-of-use in MKTR.

Before jumping into the R code for Session 4, first, a small admin step - you need to know how to download and install packages in R. Say, you need the package 'MASS' installed. Go to the Packages menu in the menu bar and click on 'Install Package(s)...'; a window will open asking which server to download the package from. Be patriotic and choose 'India'.

A second window will open listing all the packages in R (at present, this list grows every month) in alphabetical order. Click on the package you want and sit back. R will automatically download and install the package for you. It might take a minute or two at most.

All the data for the below examples are stored as tables in this google spreadsheet.

OK, let's get started with the R codes then.

1. Simple Data Visualization using biplots. We use USArrests data (inbuilt R dataset) to see how it can be visualized in 2 dimensions. Just copy-paste the code below onto the R console [Hit 'enter' after the last line].

pc.cr <-princomp(USArrests, cor = TRUE)# scale data
summary(pc.cr)
biplot(pc.cr)
This is what the plot should look like.

2. Code for making Joint Space maps. I use as my example the OfficeStar dataset that I also demo in class, from MEXL's built-in examples database. The code is generic and can be applied to any dataset in the right format that you want to make joint space maps from. To facilitate comparison, I use as the input format in R the same tables that you would otherwise use in MEXL.

For data input, read in only the cells with numbers, no headers. The headers will need to be read in separately.

Step 2a: Read in the attribute table into 'mydata'. Only number cells, no headers.

# Read in the 5x4 mean attributes table #
mydata = read.table(file.choose())
mydata = t(mydata)#transposing to ease analysis
mydata #view the table read

Step 2b: Read the preferences table into 'pref'. Only the number cells, no headers.

# Read in preferences table#
pref=read.table(file.choose())
dim(pref) #check table dimensions
pref[1:10,]#view first 10 rows

Step 2c: Now read in table headers for 'mydata'. The below code is specific to the current example. If you're working on another dataset, pls edit the headers below (on a notepad) and then copy-paste into R.

# Read in headers separately #
attribnames = c("Large choice","Low prices","Service quality","Product Quality","Convenience")
brdnames = c("Office star", "Paper & co.","Office Eqpmt", "supermkt")
rownames(mydata) = brdnames
colnames(mydata) = attribnames
mydata # view table with headers

Data reading is done. Can start analysis now. Finally.

Step 2d: Plot the Joint space maps. The following code is general and can be used directly for any joint space mapping you may want to do once you have correctly read-in the data.

par(pty="s")#square plotting region
fit = prcomp(mydata, scale.=TRUE)
plot(fit$rotation[,1:2], type="n",xlim=c(-1.5,1.5), ylim=c(-1.5,1.5),main="Joint Space map - Home-brew on R")
abline(h=0); abline(v=0)
for (i1 in 1:nrow(fit$rotation)){
arrows(0,0, x1=fit$rotation[i1,1]*fit$sdev[1],y1=fit$rotation[i1,2]*fit$sdev[2], col="blue", lwd=1.5);
text(x=fit$rotation[i1,1]*fit$sdev[1],y=fit$rotation[i1,2]*fit$sdev[2], labels=attribnames[i1],col="blue", cex=1.1)}

# make co-ords within (-1,1) frame #
fit1=fit
fit1$x[,1]=fit$x[,1]/apply(abs(fit$x),2,sum)[1]
fit1$x[,2]=fit$x[,2]/apply(abs(fit$x),2,sum)[2]
points(x=fit1$x[,1], y=fit1$x[,2], pch=19, col="red")
text(x=fit1$x[,1], y=fit1$x[,2], labels=brdnames,col="black", cex=1.1)

# --- add preferences to map ---#
k1 = 2; #scale-down factor
pref=data.matrix(pref)# make data compatible
pref1 = pref %*% fit1$x[,1:2]
for (i1 in 1:nrow(pref1)){segments(0,0, x1=pref1[i1,1]/k1,y1=pref1[i1,2]/k1, col="maroon2", lwd=1.25)}
# voila, we're done! #
This is what you should get. Again, just ensure data input happened correctly. The rest follows through peacefully.

3. Code for Joint Space maps on your Session 2 Homework dataset (term 4 courses).

For data input, read in only the cells with numbers, no headers. The headers will need to be read in separately. Following steps mirror steps 2a, 2b and 2c above.
Step 3a: Read in the attribute table into 'mydata'. Only number cells, no headers.

# Read in the 4x4 mean attributes table #
mydata = read.table(file.choose())
mydata = t(mydata)#transposing to ease analysis
mydata #view the table read

Step 3b: Read the preferences table into 'pref'. Only the number cells, no headers.

# Read in preferences table#
pref=read.table(file.choose())
dim(pref) #check table dimensions
pref[1:10,]#view first 10 rows

Step 3c: Now read in table headers for 'mydata'. The below code is specific to the current example. If you're working on another dataset, pls edit the headers below (on a notepad) and then copy-paste into R.

# Read in headers separately #
attribnames = c("Conceptual & theoretical value","Practical relevance","Interest sustained","Difficulty level")
brdnames = c("GSB", "INVA","MGTO", "SAIT")
rownames(mydata) = brdnames
colnames(mydata) = attribnames
mydata # view table with headers

The rest of the analysis uses exactly the same block of code as in step 2d after the data input part.

4. MDS code

MDS or Multi-dimensional scaling is the way we analyze overall-similarity (OS) data. Recall that you rated your impression of overall similarity among 9 car brands in a series of paired comparisons. We use that data for MDS here. The input is required in a particular format and there's some data cleaning and aggregation required. Luckily, my RA Shri Ankit Anand was around and a big help. Pls find the data for MDS input in the above google spreadsheet starting at cell P1.

First, we do metric MDS. Metric MDS uses metric data as its input. Since you rated similarity-dissimilarity on a 1-7 interval scale, metric MDS will work just fine. Here's the code for metric MDS, just copy-paste onto the R console.

Read in data (with the headers).

# read in the dissimilarities matrix #
mydata=read.table(file.choose(),header=TRUE)
d = as.dist(mydata)
# Classical MDS into k dimensions#
fit <- cmdscale(d,eig=TRUE, k=2)
fit # view results

# plot solution #
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", type="p",pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)
This is my graphical MDS output.

Suppose the similarity-dissimilarity judgments were in yes/no terms rather than in metric ratings. Then metric MDS becomes dicey to use as it relies on interval scaling assumptions. The more robust (but somewhat less efficient) nonmetric MDS then becomes the way to go.

Nonmetric MDS is just as easy to run. However, make sure you have the MASS package installed before running it.

library(MASS)
d <- as.dist(mydata)
fit <- isoMDS(d, k=2)
fit # view results

# plot solution #
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Nonmetric MDS", type="p", pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)

5. Factor analysis code

We will use exploratory (or 'common') factor analysis first, on the toothpaste survey dataset. This dataset can be found starting at cell P22 in the google spreadsheet mentioned above. You need to install package 'nFactors' (R is case-sensitive, always) for running the scree plot.

# read in the data #
mydata=read.table(file.choose(),header=TRUE)
mydata[1:5,]#view first 5 rows

# determine optimal no. of factors#
library(nFactors)
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)

On the scree plot that appears, the green horizontal line represents the eigenvalue=1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That is the optimal no. of factors. Here, it is 2. The plot looks intimidating as it is, so pls do not bother with any of the other color-coded information given - blue, black or green. Just stick to the instructions above.

k1=2 # set optimal no. of factors
If the optimal no. of factors changes when you use a new dataset, simply change the value of k1 in the line above. Copy paste the line onto a notepad, change it to 'k1=6' or whatever you get as optimal and paste onto R console. Rest of the code runs as-is.

# extracting k1 factors #
# with varimax rotation #
fit <- factanal(mydata, k1, scores="Bartlett",rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)

# plot factor 1 by factor 2 #
load <- fit$loadings[,1:2]
par(col="black")#black lines in plots
plot(load,type="p",pch=19,col="red") # set up plot
abline(h=0);abline(v=0)#draw axes
text(load,labels=names(mydata),cex=1,pos=1)
# view & save factor scores #
fit$scores[1:4,]#view factor scores

#save factor scores (if needed)
write.table(fit$scores, file.choose())

In case you are wondering what the variable loadings onto factors look like in R (after factor rotation), here is the relevant R console snapshot.
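(If the snapshot doesn't display for you, roughly the same view can be reproduced on your console with the line below, using the 'fit' object from the factor analysis code above.)

print(fit$loadings, digits=2, cutoff=.3, sort=TRUE) # loadings below 0.3 are blanked out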

Clearly, the top 3 variables load onto factor 1 and the bottom 3 onto factor 2. That is all for now. See you soon.

Sudhir

P.S.
We had done this survey-based exercise in class. I've sent you its data in an excel file. For practice's sake, pls run it through a joint space map and see if you get the results I did (pasted below).
This one is using Female students' inputs

And this one is using inputs from the male students:

Session 4 Background and Intro

Hi all,

With Session 4, we finally get into heavy usage of MKTR software. The session comprises two broad topics - Perceptual Mapping and Data Reduction. It promises to be rich in tools, concepts and learning in a variety of ways - from a data visualization standpoint to map-interpretations to demographic-basis segmentation.

In the first broad topic - Perceptual mapping - we learn how to invoke, handle and exploit two types of perceptual maps - Attribute based Joint-space maps and similarity based MDS maps.

Admittedly, I couldn't readily locate a canned package in R that does Joint Space maps. But the ingredients are all there, so I wrote up my own home-brew R code for joint space maps. The output tallies nicely with MEXL output, BTW.

To my knowledge, MEXL doesn't do MDS (multi-dimensional scaling) and so, we'll have to rely solely on R for this one.

For the second broad session topic - Data Reduction - we use Factor Analysis. And in factor analysis, R does pretty much everything you can ask for - from principal components extraction to principal axis factoring to singular value decompositions of arbitrary matrices to enabling every factor rotation scheme you can imagine. In contrast, (from what I have seen so far), MEXL does only common factor analysis, that too sans any factor rotation and with little in the way of guidance for selecting the optimal no. of factors. Oh, and it limits the maximum no. of observations your dataset can have (to 200, from what I recall).

Last but not least, pls expect your homework for this session to be on the heavier side only. I'm hoping you'll choose R for this project. In at least one Q, you won't have the MEXL option available to you anyway. I'll put up the classwork as well as homework datasets and the R code for it in a separate post.

Sudhir

Monday, October 22, 2012

Feedback & perspectives from Session 2


Hi all,
I've read your feedback & comments from the Session 2 quick-checks. Thanks for speaking up and sharing your thoughts.

Most comments concerned a few issues that came up repeatedly. Let me respond with my perspective to the same:

1. "too fast-Paced"

Over half of you have said this in one form or another. Fair enough. I shall slow down henceforth. My plan now is to prioritize subtopics & drop quite a few slides, passages and readings. I'll do so.

However, the flip side is that originally intended content coverage may suffer a shortfall. To make up, I'll send you additional readings and pre-readings as required.

2. "Basic concepts not emphasized"

Each session's pre-reads are specifically meant to develop familiarity with the basic concepts required for that session. We then build on and extend the basic concepts using various pedagogical tools - cases, simulations, class-discussion, assignments etc.

In MKTR (as in most elective courses), basic concepts are only re-capped at best, not re-taught (so to say). I'd also like to point out that the pre-reads are carefully chosen. Some date back over a decade. They're classics, have stood the test of time and have provided solid grounding in the basic concepts using simple text, examples and illustrations to generations of students.

3. "Live example/simulation expected"

Will happen. The quant (and yes, R) part of the course will have plenty of data play. The Homework for Session 3 does a real-world questionnaire analysis (Session 2's HW was too crowded already). So yes, by managing time within and outside class, I hope to be able to address some of the points that have been raised.

4. "Readings' direct relevance to MKTR?"

OK. This one is interesting and I think unique. The exact words used by Sudarshan B were "Except the last example all other examples/readings did not pertain to Marketing per se. The course is after all Marketing Research." OK, fair enough, here's my (rather long-ish) perspective.

Session 2 was about Survey design. And surveys are extensively used in MKTR to gauge customer opinions on a variety of topics. Opinions on a lot of topics are easy to ask for, give and analyze downstream and ample guidance on how to handle them can be found in the pre-reads, textbook chapters etc.

However, getting opinions on an increasing variety of topics is not-so-straightforward for a number of reasons - complexity, newness, vagueness or sensitivity of the subject; indifference, unawareness, noncomprehension, nonarticulation etc on part of respondents and so on. My aim in session 2 was to bring out some of these non-straightforward aspects so to say, via session 2 and its readings.

Reading 1 (Meet the new Boss: Big Data) talks about how a complex, multi-dimensional and intangible 'object' like the psychographic profile (obtained from routine survey research) when combined with outcomes data can be used to yield a predictive analytics "expert" system. Is there a more direct marketing application? Sure. Imagine if you had customer demo- and psychographics combined with their purchase history and response to (say) couponing or promotions. We'll discuss the possibilities of precisely this scenario in Sessions 4 and 6.

Reading 2 talks about the perils of eliciting opinion on new and complex subjects (e.g., stem cell research), how phrasing influences the results, classic signs to look for that opinion hasn't formed yet (extreme variation in poll results across random groups of respondents), and the importance of having safeguards when measuring opinions on complex subject matter.

Now, one can argue that the link to MKTR needs to be made stronger etc. and that would be a fair point. However, to question the readings' relevance itself is, I think, a bit of a stretch. I would argue that folks should take a broad view of the in-class readings. My attempt here is to dig beyond the obvious, surface phenomena and to discern the economic and strategic drivers of these observed phenomena. Doing so requires that we take a more holistic view of business trends and drivers rather than pigeonhole ourselves into a Marketing silo.

5. Last but not least, I guess I have to mention this comment from Harmeet simply because it spells out precisely what I was hoping to achieve with the pedagogy and the readings. It's heartening that it got through to at least a few people in its intended form. To quote Harmeet verbatim:

"All the material was an add on to the pre reads & not a repitition of what we already know. I love the in class excercises & passage reading analysis - really helps to internalize the concepts well. Would be nice if you gave appropriate time to finish reading the passage before starting discussions/doing quick checks."

Aah, that comment is sweet vindication only. Jai ho.

Well, see you in class tomorrow.

Sudhir

Sunday, October 21, 2012

PPT template for project proposal submissions

Hi all,

Please find listed here a quick 5-7 slide template for submission of your project proposals on PPT. To demonstrate the PPT proposal template, I shall use HW3 of session 2 as an example.

1. Slide 1 - Title Slide
Give the project an informative title and list the names & PGIDs of the team members working on it.
For example, for the problem statement we have in Session 2's questionnaire design homework, we could give a title like "Assessing E-tailing's appeal among the young and upwardly mobile" or something like that.

2. Slide 2 - Management problem
Pls give a condensed problem statement describing the essential context of the management problem. For instance, the problem statement used for the questionnaire design homework can be used as a base. Further condense it as required to fit it comfortably onto one slide.
Another example of a management problem context statement could be Venkatesan Chang's musings we saw in Session 1's slides. More generally, I would encourage teams to base their management problems on real world concerns. Pick up any business magazine or newspaper and chances are around 10% of the articles may describe problem contexts for specific firms or industries that you could modify and adapt for your project.

Here are a few articles I saw from which management problem contexts can be seen emerging:

(a) The appeal of 3D movies can lead to a survey estimating preference for -> willingness to pay for -> likely demand for movies in the 3D format among a particular target segment. Here's the management problem sourced from the Economist (July 2011) The appeal of 3D movies - Cinema's great hope


(b) Here's an interesting possibility that requires folks to look at reeeally new products - akin to forecasting email's effect on postal services in 1994 - the impact on small-scale manufacturing of 3D printing services. This too is sourced from the Economist (Dec 2011)

(c) Here's a desi innovation that might get a huge fillip in demand as demographics start to favor it. Demand soars for a "House-call doctor services" for the elderly and the chronically infirm. Source is Economic times, 2012.

(d) Here's an interview with the boss of the cafe coffee day chain and he describes some interesting looking initiatives CCD is taking in trying to leverage facebook and other social media to provide speedy feedback on CCD Ops nationwide etc.

And so on. These are merely a few examples. There's no dearth of good management problems to find.

3. Slide 3 - Decision problem
Condensing the management problem to a decision problem (D.P.) is tricky stuff, as we saw in the Venkatesan Chang case. Make appropriate assumptions and achieve this step. State the decision problem chosen in clear words. If possible, also list a few alternative D.P. statements that were considered but not chosen.

4. Slide 4 - Research objectives (R.O.s)
Ensure your set of R.O.s "covers" the D.P., in that they address the central question(s) raised by the D.P. State the R.O.s in the prescribed format (look up the Session 1 slides for this).

5. Slide 5 - Tools mapping
Map the R.O.s onto particular MKTR tools (or sets of tools in a particular sequence). You can refer to the MKTR toolbox for starters but are free to choose tools from outside that toolbox as well. Just to be clear, "tools" here refers to methods or approaches being followed - e.g., the survey method, secondary analysis or experiments are all "MKTR tools" for us.

Important Note:
It is imperative that your projects are submitted and approved early. Pls submit your team's project proposal for approval positively by midnight Wednesday. A dropbox on LMS will be set up for this purpose. Projects can only be altered to a very limited extent once approval has been granted. FYI.

*********************************************

In case you are wondering what the grading criteria may be for the project (as that might influence which project you finally choose), let me outline some thoughts on these criteria based on what I have used in the past. These criteria are indicative only and are not exhaustive. However, they give a fairly good idea of what you can expect. Pls plan so that your chosen problem yields enough material that your deliverable of 30-odd slides, due at the end of the term, doesn't lack substance in these broad areas.

(i) Quality of the management problem context chosen - How interesting, relevant, forward-looking and do-able it is within the time and bandwidth constraints we are operating under in the course.

(ii) Quality of the D.P.s chosen - How well they align with and address the business problem vaguely outlined in the project scope document; how well they can be resolved given the data at hand. Etc.

(iii) Quality of the R.O.s - How well defined and specific the R.O.s are in general; how well the R.O.s cover and address the D.P.s; how well they map onto specific analysis tools; how well they lead to specific recommendations made to the client in the end. Etc.

(iv) Clarity, focus and purpose in the Methodology - This flows from the D.P. and the R.O.s. Why you chose this particular series of analysis steps in your methodology and not some alternative. The methodology section would essentially be a subset of a full-fledged research design. The emphasis should be on simplicity, brevity and logical flow.

(v) Quality of Assumptions made - Assumptions should be reasonable and clearly stated in different steps. Was there opportunity for any validation of assumptions downstream, any reality checks done to see if things are fine?

(vi) Quality of results obtained - The actual analysis performed and the results obtained. What problems were encountered and how did you circumvent them? How useful are the results? If they're not very useful, how did you transform them post-analysis into something more relevant and usable?

(vii) Quality of insight obtained, recommendations made - How all that you did so far is finally integrated into a coherent whole to yield data-backed recommendations that are clear, actionable, specific to the problem at hand and likely to significantly impact the decisions downstream. How well the original D.P. is now 'resolved'.

(viii) Quality of learnings noted - Post-facto, what generic learnings and take-aways from the project emerged. More specifically, "what would you do differently in questionnaire design, in data collection and in data analysis to get a better outcome?".

(ix) Completeness of submission - Was sufficient info provided to track back what you actually did, if required - preferably in the main slides, else in the appendices? For instance, were Q no.s provided for the inputs to a factor analysis or cluster analysis exercise? Were links to appendix tables present in the main slides? Etc.

(x) Creativity, story and flow - Was the submission reader-friendly? Does a 'story' come through in the interconnection between one slide and the next? Were important points highlighted, cluttered slides animated in sequence, callouts and other tools used to emphasize important points on particular slides, and so on?

OK. That's quite a lot already, I guess. I don't want to spook anybody this early in the course (or later in the course, for that matter). But now, let none say that they didn't know how the project would be viewed and graded.

Sudhir

Saturday, October 20, 2012

Tips for Questionnaire Design homework, Session 2

Hi all,

A few quick tips on HW3 of Session 2.

1. Prioritizing and Streamlining the Problem Statement
It probably would have become amply clear that in less than 12 minutes of respondent time, getting enough info out of respondents to address all of the things Flipkart wants is infeasible. This in turn emphasizes the *prioritize* aspect of the problem description. Typically, in real business settings too, due to time and cost constraints, not everything a client wants can always be covered. Prioritizing what's important and do-able is critical. Pls make assumptions as necessary and narrow down the questionnaire scope so that it can be completed within the time constraint mentioned.

2. Using Skip-logic
Qualtrics (and most other websurvey software) allows you to program "skip-logic" into questions. This means that, based on answers to so-called "gateway" questions, different respondents can be directed to different locations in the questionnaire, thereby 'skipping' entire blocks of questions that are not relevant to them.
For example, a person who doesn't use e-shopping and doesn't intend to do so in the future can be safely directed away from many of the Qs that concern particular e-shopping behavior.

3. Feel free to borrow from the web and other sources
Some preliminary exploratory discussions with Flipkart users to understand their purchase decision process *before* questionnaire design can be immensely helpful. Similarly, if there are sections of questionnaires available online (e.g. demographics, psychographics, domain-related etc.) or from other sources, feel free to borrow and adapt to your needs. But *please* cite the source. I would imagine a thorough look through Flipkart's website itself would be a good starting point.

Hope that helps.

Sudhir

An R primer for Session 3 and beyond

Hi all,

Time to gently introduce R into MKTR. In Session 3, I attempt to demonstrate sampling-related concepts using R. This does not require that you as students do anything other than watch and listen, discuss and learn. However, I would encourage you to try the same things I demo in class at home on R. All you have to do is copy-paste the R code below (in the grey shaded boxes) onto the R console. Since there is no grading or assignment pressure involved, consider this a gentle intro to R. Later, when the graded homework assignments happen, expect the same thing - I will put out R code here on the blog and you will have the option of using it to solve your homework directly.

1. To download and install R:
Go to the following link
CRAN project: R download links
download the installer, and follow its instructions.

After that, I expect that in < 20 minutes (on a good net connection), you will have top-class computational firepower ready on your machines.

Open R and, well, look around. The GUI won't look like much probably, but appearances can be deceiving (as we'll see later in the course in Session 6 - Qualitative research and Session 7 - Experimentation).

2. Basic Input/Output in R
Store any dataset you want to read into R in a standalone .txt or .csv file anywhere on your computer. Then simply copy-paste the following code:
mydata = read.table(file.choose(), header = TRUE)
dim(mydata);     summary(mydata);     mydata[1:5,]
Note: The dataset should have column headers, else put "header = FALSE" above. With this, we have both read the dataset (named 'mydata') into R and run a quick check on it to see its summary characteristics.
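If your file happens to be comma-separated (.csv), the same idea works with read.csv, which defaults to sep = ",". This is just a small optional variation on the code above, in case read.table gives you trouble with a .csv file:
read.csv(file.choose(), header = TRUE) # read.csv assumes comma-separated values
dim(mydata);     summary(mydata);     mydata[1:5,] # same quick-checks as before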

3. Read in Session 3 Dataset into R
For session 3, to demonstrate sampling concepts on R, I collected data from students in the MKTR class of 2011 on their self-reported height (in cms) and weight (in kgs). This data is available for copying from this google spreadsheet. Please copy the data onto a .txt file on your computer and use the code in step 2 to read it into R.

4. Run your first set of analyses on R - histograms for what the data look like.
# What are the data like? #
attach(mydata)
par(mfcol=c(2,1)) # makes plots on a single page in a 2 x 1 pattern (one panel per variable)
for (i in 1:ncol(mydata)) # loops i over the data columns (here, height and weight)
{hist(mydata[,i], breaks=30, main=dimnames(mydata)[[2]][i], col="gray")
abline(v=mean(mydata[,i]), lwd=3, lty=2, col="Red")}
Just copy-paste the above code into the R console. Should run peacefully.

5. Set the sample size (k) and see by how much the sample-based estimates differ from the true values.
# Randomly sample 10 values & estimate mean height, weight.
k = 10
ht = sample(height,k); ht
wt = sample(weight,k); wt
mean(ht); mean(wt)
mean(height); mean(weight)
error.ht = mean(height)-mean(ht); error.ht
error.wt = mean(weight)-mean(wt); error.wt
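A small optional aside (not something I did in class): if you want your random draws to be reproducible across runs, set the random seed before sampling; the seed value 123 below is arbitrary. You can also compare your observed error against the theoretical standard error of the mean, sd/sqrt(k).
set.seed(123) # any fixed number works; makes sample() give the same draws on every run
ht = sample(height, k); mean(ht)
sd(height)/sqrt(k) # rough theoretical yardstick for how large the sampling error tends to be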

6. Now set a higher sample size, say k=40 (instead of 10) and rerun the above code. The errors come down, usually (but not always).
# Randomly sample 40 values & estimate mean height, weight.
k = 40
ht = sample(height,k); ht
wt = sample(weight,k); wt
mean(ht); mean(wt)
mean(height); mean(weight)
error.ht = mean(height)-mean(ht); error.ht
error.wt = mean(weight)-mean(wt); error.wt

7. Plot and see how closely the sample distribution approximates the population distribution.
par(mfrow=c(2,2))
hist(mydata[,1], breaks=30, main ="Population Height", xlim=c(140,200), col="gray")
abline(v=mean(mydata[,1]), lwd=3, lty=2, col="Red")
hist(mydata[,2], breaks=30, main="Population Weight", xlim=c(40,100), col="gray")
abline(v=mean(mydata[,2]), lwd=3, lty=2, col="Red")
hist(ht, breaks=20, main="Sample size k=40", xlim=c(140,200), col="beige")
abline(v=mean(ht), lwd=3, lty=2, col="Red")
hist(wt, breaks=20, main="Sample size k=40", xlim=c(40,100), col="beige")
abline(v=mean(wt), lwd=3, lty=2, col="Red")
By the way, if you want to save the above (or any other) plot from R, just right-click on the plot window, copy as metafile and then paste it onto your PPT or Word doc, etc.
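Alternatively, if you prefer saving plots from code (say, you're not on Windows), a device-based route like the sketch below should work; the file name is just an illustration, and the file lands in your working directory (check getwd()).
png("sampling_plots.png", width = 800, height = 600) # open a .png graphics device
hist(ht, breaks=20, main="Sample size k=40", col="beige") # any plotting code from above goes here
dev.off() # close the device; the .png file is written out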

8. Time to deep-dive into Sampling Distributions now. Ask yourself, what happens if we repeat the sampling process 1000 times? We get 1000 means from 1000 samples. Now, what do the 1000 means look like as a histogram?
outp = matrix(0,nrow=1000,ncol=2)
k = 10;# set sample size
for (i in 1:1000){ outp[i,1]=mean(sample(height,k))
outp[i,2]=mean(sample(weight,k))}
par(mfrow=c(2,2))
for (i in 1:ncol(outp)){
hist(outp[,i], breaks=10, main=paste("sample size =", k),
xlab=dimnames(mydata)[[2]][i], xlim=range(mydata[,i]))}

9. Now repeat the above code but with k=40 (instead of k=10) and see. The par() call is commented out below so that the new plots fill the remaining panels of the 2x2 page from step 8, letting you compare k=10 (top row) and k=40 (bottom row) on one page. Do you see a fall or a rise in the std error? The standard error should roughly halve, following the square-root rule: the sample size went up 4x, so the std error comes down by a factor of sqrt(4) = 2.
outp = matrix(0,nrow=1000,ncol=2)
k = 40;# set sample size
for (i in 1:1000){ outp[i,1]=mean(sample(height,k))
outp[i,2]=mean(sample(weight,k))}
#par(mfrow=c(2,2))
for (i in 1:ncol(outp)){
hist(outp[,i], breaks=10, main=paste("sample size =", k),
xlab=dimnames(mydata)[[2]][i], xlim=range(mydata[,i]))}
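If you want a quick numeric check of the square-root rule (my addition, not part of the class demo), look at the standard deviation of the 1000 sample means, which is the standard error:
apply(outp, 2, sd) # std error of the sample means for height and weight at the current k
Run this line once after step 8 (k=10) and once after step 9 (k=40); the k=40 values should be roughly half the k=10 ones.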

If all went well and you were able to replicate what I got in class in Session 3, well then, congratulations, I guess. You've run R peacefully for MKTR.

I'd strongly recommend folks try R to the max extent feasible for this course. Pls correspond with me via this blog for any R-related queries, trouble-shoots etc.

Sudhir

P.S.
Here's code for demo-ing the CLT on data that are a priori non-normal: one sample drawn from an exponential distribution and the other from a log-normal one.

x1 = rexp(10000) # exponential distbn
x2 = exp(rnorm(10000)) # lognormal distbn
outp = matrix(0,nrow=1000,ncol=2)
k = 40;# set sample size
par(mfrow=c(2,2))

hist(x1, breaks=40)
hist(x2, breaks=40)

for (i in 1:1000){ outp[i,1]=mean(sample(x1,k))}
hist(outp[,1], breaks=10, main=paste("sample size =", k))

for (i in 1:1000){ outp[i,2]=mean(sample(x2,k))}
hist(outp[,2], breaks=10, main=paste("sample size =", k))
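A simple visual check of the CLT claim (my suggestion, not part of the original demo) is a normal Q-Q plot of the sample means; if the points hug the reference line, the sampling distribution is approximately normal even though x1 and x2 clearly are not.
qqnorm(outp[,1], main="Q-Q plot: means of exponential samples") # sample means from x1
qqline(outp[,1], col="red", lwd=2) # reference line for a normal distribution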

I also tried the flipped-classroom thing and have put up two YouTube vids on how to read in and write out data from R. Here are the links:
5 Steps to Read data into R
4 Steps to Save data from R
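In case you want the gist of the save-data video in one line, something like the following should do it (the file name is illustrative; the file goes to your working directory):
write.csv(mydata, "mydata_out.csv", row.names = FALSE) # writes 'mydata' out as a .csv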




Friday, October 19, 2012

Notes on Session 2

Hi all,

First off, an affable welcome to the final registrants for MKTR. I hope the course meets your expectations.

Session 2 is done. Some new pedagogical tools have been tried. I shall await the feedback and response to the same in the quick-checks.

1. Re the in-class readings for Session 2, here are the original sources:

Reading 1 - Meet the New Boss: Big Data
Reading 2 - The Unbearable Lightness of Opinion Polls
Readings 3 and 4 - Book excerpts from Malcolm Gladwell's Blink

2. There were queries about the grading scheme. Let me explain.

- Grade points for the index-card pre-reads quiz come from the 15% allocated to homeworks, because the pre-reads are, properly speaking, homework. I think this point was mentioned in the Session 1 slides.

- The longer, full-page quick-check is basically an attendance sheet. The grade points for this come from the 8% earmarked for attendance.

- There is also a CP component to the full-page quick-checks, based on a 0-1-2 rubric. A blank space or random commentary on any quick-check question invites a "0"; every good-faith attempt gets a "1" (I expect most folks to get this); and extraordinarily presented or argued points get a "2". Some of these "2" points will feature here on the blog as well.


3. About the homework assignments for session 2:
There are 3 pieces to the homework. 2 are straightforward websurvey filling exercises. Here are the websurvey links:
Session 2, HW 1, Survey
Session 2, HW 2, web Survey

HW 1 basically collects data from you for what we call Attribute Rating (AR) based perceptual mapping, to be used in Session 4. HW 2 collects data using the Overall Similarity (OS) method to perform perceptual mapping, again in Session 4. HW 2 also additionally features a psychographics section. Please fill out *both* surveys sincerely. I expect each survey to take no more than 10-15 minutes.

The third homework requires you to design a short questionnaire for Flipkart for the following context:
Flipkart, a leading Indian e-tailer, wants to know how students in premier professional colleges in India view shopping online. Flipkart believes that this segment will, a few years down the line, become profitable and a source of positive word of mouth from a set of opinion leaders. This will seed the next wave of customer acquisition and growth and is hence a high-stakes project for Flipkart.
Flipkart wants to get some idea of the buying habits, buying-process concerns and considerations, product categories of interest, demography, media consumption and psychographics of this segment.
As lead consultant in this engagement, you must now come up with a quick way to prioritize and assess target segment perceptions on these diverse parameters.

Build a quick survey (that should take no longer than 10-12 minutes of fill-up time for an average respondent) on the Qualtrics web survey software for this purpose. Pls submit the websurvey link via a google form that we will send you, and a hard copy of the survey as well (printed on both sides of the page) to your section AA before Session 3 starts.

4. Project group formation must now proceed apace. Pls have your team rep email a list of team members (names, PGIDs) and your team-name (any small town in India) to any one of the AAs.

I will meanwhile work on some example projects that could be considered and put out a small PPT template in which the project proposal can be submitted for approval.

***

OK, That's all from me for now. Will see you all in session 3 then.

Sudhir

Wednesday, October 17, 2012

Session 1, done.

Hi all,

Session 1 is over. In case folks want to see where the in-class readings were sourced from, here is a list:

Reading 1: "Have Breakfast… or…Be Breakfast!" 
Reading 2: The magic of good service
Reading 3: Even in emerging markets, Nokia's star is fading
Reading 4: Join them or Joyn them
Reading 5: Jobs of the future

About pedagogy going forward
I attended the evening session of the ENDM class yesterday. Arun Pereira is a great teaching resource and I'd be foolish not to take advantage of such a resource. The session was interesting to me in terms of approach to classroom instruction. Of course, the courses are different and I'm not Arun, so not everything he does will work for me. Still, there seems to be enough scope to pick and choose things that might work. One such (and this is based also on past student feedback) is the need to maintain a sort of continuous student engagement in the classroom. Hence, I plan to borrow more of his "quick-checks" format than I'd first intended. This is FYI. It may not work out very well in MKTR and if that is the case, I'd appreciate your letting me know.

The other parts of the pedagogy that make MKTR distinctive - the in-class reads and R - will remain as before.

About this blog
This is an informal blog that concerns itself solely with MKTR affairs. It's informal in the sense that it's not part of the LMS system and the language used here is more on the casual side. Otherwise, it's a relevant place to visit every now and then if you're taking MKTR. Pls expect MKTR course-related announcements, R code, Q&A, feedback etc. on here.

About Session 2
Session 2 deals with psychographic scaling techniques and delves into the mechanics of questionnaire design. Am not sure 'mechanics' is the right word here because questionnaire design is more of an art than an exact science.

There will be a series of 3 homework assignments for session 2. Two concern merely filling up surveys that I will send you (this data we will use later in the course for perceptual mapping and segmentation) and the third asks you to design a small questionnaire for a specific context. Nothing too troublesome, as you can see.

The two pre-reads for Session 2 might appear lengthy-ish but are easy reads (I think). And yes, there will be a short pre-reads-based quick-check on them. So do please read them and come.
The pre-reads for each session are listed in page ix of the course-pack.

OK, see you folks in session 2 then.

Sudhir