
Wednesday, October 30, 2013

Session 5 Updates

Hi all,
Yesterday in Session 5 we covered two major topics - Segmentation and Targeting. Sorry about the delay in bringing out this blog post. In it, I shall lay out the classwork examples (which you might want to try replicating) and their interpretation, and the HW for this session.

There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - Agglomerative Hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), Partitioning (here, we do K-means) and model based clustering. Each has its pros and cons. Model based is probably the best around, highly recommended.

1. Cluster Analysis Data preparation
First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.
#first read-in data#
mydata = USArrests
Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. To get rid of this problem, just copy-paste the following code.
# Prepare Data #

mydata = na.omit(mydata) # listwise deletion of missing

mydata.orig = mydata # save a copy of the original (unscaled) data for profiling later

mydata = scale(mydata) # standardize variables
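If you want to see the scaling effect for yourself, here is a minimal check (optional; my illustration, using hclust with its default settings and a 2-cluster cut) comparing how states group with and without standardization:

# optional: compare 2-cluster solutions on raw vs standardized data
d.raw = dist(USArrests) # distances on raw (unscaled) data
d.scaled = dist(scale(USArrests)) # distances on standardized data
table(raw = cutree(hclust(d.raw), k=2), scaled = cutree(hclust(d.scaled), k=2))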

2. Now we first do agglomerative hierarchical clustering, plot dendrograms, slice them at different points and see what is happening.

# Ward Hierarchical Clustering

d = dist(mydata, method = "euclidean") # distance matrix

fit = hclust(d, method="ward") # run hclust func

plot(fit) # display dendrogram

Click on image for larger size.

Eyeball the dendrogram. Imagine horizontally slicing through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How to know? Sometimes eyeballing is enough to give a clear idea, sometimes not. Various stopping-rule criteria have been proposed for where to cut a dendrogram - each with its pros and cons. I'll go with a subjective, visual criterion for the purposes of this course.

Suppose you decide 2 is better. Then set the optimal no. of clusters 'k1' to 2.

k1 = 2 # eyeball the no. of clusters

Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# cut tree into k1 clusters

groups = cutree(fit, k=k1)# cut tree into k1 clusters
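To see the chosen clusters on the dendrogram itself, you can also draw red borders around the k1 groups (this same line appears in the segmentation post further below):

# draw dendrogram with red borders around the k1 clusters
rect.hclust(fit, k=k1, border="red")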

3. Coming to the second approach, 'partitioning', we use the popular K-means method. Again, the Q arises, how to know the optimal no. of clusters? Eyeballing the dendrogram might sometimes help. But at other times, what should you do? MEXL (and most commercial software too) requires you to magically come up with the correct number as input to K-means. R does one better and shows you a scree plot of sorts that shows how the within-segment variance (a proxy for clustering solution quality) varies with the no. of clusters. So with R, you can actually take an informed call.

# Determine number of clusters #

wss = (nrow(mydata)-1)*sum(apply(mydata,2,var));

for (i in 2:15) wss[i] = sum(kmeans(mydata,centers=i)$withinss);

plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #

Look for an "elbow" in the scree plot - the interior point at which the angle formed by the 'arms' is sharpest. This scree-plot is not unlike the one we saw in factor analysis. Again, as with the dendrogram, we get either 2 or 4 as the options available. Suppose we go with 2.
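If eyeballing the elbow feels too subjective, one crude numeric aid (my suggestion, not a formal stopping rule) is to look at how much the within-group SS drops with each extra cluster and see where the drops level off:

# successive drops in within-group sum of squares
round(-diff(wss), 2)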
# Use optimal no. of clusters in k-means #

k1=2

Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.
# K-Means Cluster Analysis

fit = kmeans(mydata, k1) # k1 cluster solution


To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize the cluster, interpret what it is that's common among a cluster's membership, and give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.
# get cluster means

aggregate(mydata.orig,by=list(fit$cluster),FUN=mean)

# append cluster assignment

mydata1 = data.frame(mydata, fit$cluster);

mydata1[1:10,]

OK, that is fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In 2 dimensions, the easiest way is to plot the clusters on the 2 biggest principal components that arise. Before copy-pasting the following code, ensure you have the 'cluster' package installed.
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph

install.packages("cluster")
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)

Two clear cut clusters emerge. Missouri seems to border the two. Some overlap is also seen. Overall, the clusPlot seems to put a nice visualization over the clustering process. Neat, eh? Try doing this with R's competitors...:)

4. Finally, the last (and best) approach - Model based clustering.

'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions and because it penalizes for surplus complexity (resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: When in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge.

Install the 'mclust' package for this first. Then run the following code.

install.packages("mclust")

# Model Based Clustering

library(mclust)

fit = Mclust(mydata)

fit # view solution summary

The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree-plot predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.

fit$BIC # lookup all the options attempted

classif = fit$classification # classifn vector

mydata1 = cbind(mydata.orig, classif) # append to dataset

mydata1[1:10,] #view top 10 rows

# Use below only if you want to save the output

write.table(mydata1,file.choose())#save output

The classification vector is appended to the original dataset as its last column. Can now easily assign individual units to segments.

Visualize the solution. See how exactly it differs from that for the other approaches.

fit1=cbind(classif)
rownames(fit1)=rownames(mydata)
library(cluster)
clusplot(mydata, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)
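mclust also ships with its own diagnostic plots. If you want to see the BIC comparison across the models it tried, something like the below should work (plot.Mclust is part of the mclust package):

# optional: mclust's built-in plots
plot(fit, what="BIC") # BIC across model types and no. of clusters
plot(fit, what="classification") # pairwise scatterplots coloured by cluster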
Imagine if you're a medium sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution had 3 segments - not 2 or 4.

To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.
# get cluster means
cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean); cmeans
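A quick count of how many states fall into each segment is also handy when naming the segments:

# segment sizes
table(classif)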
Seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe. Now, we can do the same copy-paste for any other datasets that may show up in classwork or homework. I'll close the segmentation module here. R tools for the Targeting module are discussed in the next blog post. Any queries or comment, pls use the comments box below to reach me fastest.

###############################

Targeting in R

This is the code for classwork MEXL example "Conglomerate's PDA". This is the roadmap for what we are going to do:

  • First we segment the customer base using model based clustering or mclust, the recommended method.
  • Then we randomly split the dataset into training and test samples. The test sample is about one-third of the original dataset in size, following accepted practice.
  • Then we try to establish via the training sample, how the discriminant variables relate to segment membership. This is where we train the Targeting algorithm to learn about how discriminant variables relate to segment memberships.
  • Then comes the real test - validate algorithm performance on the test dataset. We compare prediction accuracy across traditional and proposed methods.
  • Since R is happening, there are many targeting algorithms to choose from on R. I have decided to go with one that has shown good promise of late - the randomForest algorithm. Where we had seen decision trees in Session 5, think now of 'decision forests' in a sense...
  • Other available algorithms that we can run (provided there is popular demand) are artificial neural nets (multi-layer perceptrons) and Support vector machines. But for now, these are not part of this course.
So without further ado, let me start right away.

1. Segment the customer base

To read in the data, directly save and use the 'basis' and 'discrim' notepads I have sent you by email. Then ensure you have packages 'mclust' and 'cluster' installed before running the clustering code.
# read-in basis and discrim variables
basis = read.table(file.choose(), header=TRUE)
dim(basis); basis[1:3,]
summary(basis)

discrim = read.table(file.choose(), header=TRUE)
dim(discrim); discrim[1:3,]
summary(discrim)

# Run segmentation on the basis dataset

library(mclust) # invoke library

fit = Mclust(basis) # run mclust

fit # view result

classif = fit$classification

# print cluster sizes

for (i1 in 1:max(classif)){print(sum(classif==i1))}

# Cluster Plot against 1st 2 principal components

require(cluster)

fit1 = cbind(classif)

rownames(fit1)=rownames(basis)

clusplot(basis, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)

The segmentation produces 4 optimal clusters. Below is the clusplot where, interestingly, despite our using 15 basis variables, we see decent separation among the clusters in the top 2 principal components directly.

Click on the above image for larger size.
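If you are curious how much of the total variation those top 2 principal components actually capture (a side-check, not part of the classwork), a quick look via prcomp:

# variance explained by the first two principal components
pca = prcomp(basis, scale.=TRUE)
summary(pca)$importance[ ,1:2]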

2. Split dataset into Training & Test samples

Read in the dataset 'PDA case discriminant variables.txt' from LMS for the below analysis:

rm(list = ls()) # clear workspace

# 'PDA case discriminant variables.txt'

mydata = read.table(file.choose(), header=TRUE)

head(mydata)

# build training and test samples using random assignment

train_index = sample(1:nrow(mydata), floor(nrow(mydata)*0.65));

# two-thirds of sample is for training

train_index[1:10];

train_data = mydata[train_index, ];

test_data = mydata[-(train_index), ];

train_x = data.matrix(train_data[ ,c(2:18)]);

train_y = data.matrix(train_data[ ,19]);

# for classification we need as.factor

test_x = data.matrix(test_data[ ,c(2:18)]); test_y = test_data[ ,19]
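The roadmap above mentioned the randomForest algorithm. For reference, here is a minimal sketch of how it could be run on these training and test objects - my illustration only, not part of the graded classwork, and it assumes the 'randomForest' package is installed:

# optional: randomForest sketch for targeting (illustration only)
# install.packages("randomForest") # run once if not installed
library(randomForest)
rf = randomForest(x=train_x, y=as.factor(train_y), ntree=500)
rf_pred = predict(rf, test_x) # predicted segment for each test respondent
mean(as.character(rf_pred) == as.character(test_y)) # holdout accuracy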

Last year, when Targeting was a full lecture session, I used the most popular machine learning algorithms - neural nets, random forests and support vector machines (all available on R, of course) to demonstrate targeting. Those notes can be found here.

3. Use multinomial logit for Targeting

Will need to install the 'textir' package for this one.

###### Multinomial logit using Rpackage textir #######

install.packages("textir")

library(textir)

covars = normalize(mydata[ ,c(2,4,14)], s=sdev(mydata[ ,c(2,4,14)])); # normalizing the data

dd = data.frame(cbind(memb=mydata$memb, covars, mydata[ ,c(3,5:13,15:18)]));

train_ml <- dd[train_index, ];

test_ml = dd[-(train_index), ];

gg = mnlm(counts = as.factor(train_ml$memb), penalty = 1, covars = train_ml[ ,2:18]);

prob = predict(gg, test_ml[ ,2:18]);

head(prob);

pred = matrix(0, nrow(test_ml), 1);

accuracy = matrix(0, nrow(test_ml), 1);

for(j in 1:nrow(test_ml)){

pred[j, 1] = which.max(prob[j, ]);

if(pred[j, 1]==test_ml$memb[j]) {accuracy[j, 1] = 1}

}

mean(accuracy)

You'll see something like this (but not the exact same thing because the training and test samples were randomly chosen)

Look at the probabilities table given. The table tells us the probability that respondent 1 (in row 1) belongs to segment 1, 2, 3 or 4. We get maximum probability for segment 1, so we say that respondent 1 belongs to segment 1 with a 61% probability. In some cases, all the probabilities may be less than 50%; if so, just take the maximum and assign the respondent to that segment. Now, we let loose the logit algorithm onto the test sample. The algo comes back with its predictions. In the real world, we will go by what the machine says. But in this case, since we have the actual segment memberships, we can validate the results. This is what I got when I tried to assess the accuracy of the algo's predictions:

So, the algo is able to predict with 60%-odd accuracy - not bad considering that random allocation would have given you at best a 25% success rate. Besides, this is simple logit - more sophisticated algos exist that can do better, perhaps even much better.
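If you also want to see which segments get confused with which (rather than just the overall accuracy), a simple cross-tab of predicted vs actual memberships does the job:

# confusion matrix: predicted vs actual segment membership on the test sample
table(predicted = pred[ ,1], actual = test_ml$memb)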

That's it for now. Will put up the HW for this session in a separate update (deadline now is 9-Nov Saturday midnight) here (watch this space).

Session 5 HW update:

There will be no HW for session 5. I figure I can combine segmentation and targeting bits into the session 6 HW.

Sudhir

Monday, December 17, 2012

Session 5 HW

Hi all,

Your 'Session 5 HW' is out, in a folder of the same name on LMS. The R code I used to test the HW is put up as a notepad. Feel free to use blocks of that R code directly for your HW.

Important: Pls read the short caselet in a PDF file 'Conglomerate PDA' in the HW folder *before* attempting the exercise. As an instructor, I assure you that if you try interpreting the analyses without reading the caselet, it will show.

Recommended:

  • Pls try the classwork examples on R before trying the HW examples. The classwork blog-post has explanations for each block of code.
  • Ensure you have the required packages loaded before you start. These are (IIRC): nFactors, cluster, mclust and rpart.

HW Questions:

  • Q1. Are there any 'constructs' underlying the basis variables used for segmentation in the PDA caselet? What might they be? Give a name to and interpret these factors/ constructs.
  • Q2. Segment respondent data using hclust, mclust and k-means (with scree-plot). Record how many segments you find. Draw clusplots for k-means and mclust outputs.
  • Q3. Characterize or 'profile' the clusters obtained from mclust. Name the segments (similar to the China-digital consumer segments reading)
  • Q4. Read-in discriminant data for the PDA case. Make a dataset consisting of the interval scaled discriminant variables only. Now plot a decision tree to see which variables best explain the membership to the largest segment (segment 1). List the variables by order of importance.
Submission format is a set of PPT slides:
  • Ensure your name and PGID are written on the title slide.
  • Ensure all your plots (and the important tables) are pasted as images on the slides. Typically metafile images are best.
  • Pls give each slide an informative title and mention question number on it.
  • Pls be aware that while you are free to consult peers on the R part of the plots making, interpretation and writing up is solely an individual activity.
  • Submission deadline is Monday Midnight 24-Dec-2012 in a LMS dropbox.
Session 6 HW is in the making. Will happen shortly and its deadline will be close to this one's. FYI.

Any Qs on this HW, pls contact me directly. Pls use the blog comments pages to reach me fastest. Feedback on any aspect of the HW or the course is most welcome.

Sudhir

Sunday, December 9, 2012

Targeting Tools in R - Decision Trees

Hi all,

Today's second module - Targeting - is crucial in MKTR and underlies much of the work we're currently engaged in in terms of predictive analytics. It is also a lengthy one and so will likely spill over into part of Wednesday's class. Just FYI.

1. Decision Trees

First off, a quick drive by on what they are and why they matter in MKTR. This is how Wiki defines Decision trees:

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

OK, lots of fancy verbiage there. Perhaps an example can illustrate better. Many cognitive-logical decisions can be represented in an algorithmic form with a tree-like structure. For example - "Should we enter market A or not?" Imagine two paths out of this question - one saying 'yes' and the other 'no'. Each of these paths (or 'branches') can then be further split into more branches, say, 'cost' and 'benefit' and so on, till a decision is reachable.

Essentially, we are trying to partition this dataset along that branch network which best explains variation in Y - our chosen variable of interest.

2. Building a simple Decision Tree

The dataset used is an internal R dataset, 'cu.summary'. No need to read it in, it's pre-loaded. Exhibit 3 in the handout gives some rows of the dataset.

# view dataset summary #
library(rpart)
data(cu.summary)
dim(cu.summary)
summary(cu.summary)
cu.summary[1:3,]
This is the dataset summary. Click for larger image.

Now make sure you have installed the 'rpart' package before doing the rest.

# Regression Tree Example

# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
method="anova", data=cu.summary)

summary(fit) # detailed summary of splits
The summary of results gives a lot of things. Note the formula used, the number of splits (nsplit), the variable importance (on a constant sum of 100) and loads of detail on each node. Of course, we'll be plotting the nodes, so no problemo.
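If you want just the variable importance numbers without wading through the full summary, the fitted rpart object stores them; rescaling to a constant sum of 100 reproduces the figures in the summary (assumes a reasonably recent rpart version):

# variable importance, rescaled to a constant sum of 100
round(100*fit$variable.importance/sum(fit$variable.importance), 1)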

OK. Time to plot the tree itself. Just copy-paste the code below.

par(col="black", mfrow=c(1,1))
# create attractive postscript plot of tree
post(fit, file = "", title = "Regression Tree for Mileage ", cex=0.7)
Click for larger image.
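If the postscript-style plot is hard to read on screen, the plain base-graphics pair also works and is easy to tweak:

# simpler base-graphics version of the same tree
plot(fit, uniform=TRUE, margin=0.1)
text(fit, use.n=TRUE, cex=0.8)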

These algorithms follow certain rules - start rules, stop rules etc. - and essentially operate by optimizing some criterion. In this case, it is minimizing the informational entropy associated with each possible branch split. However, in some critical applications, one may want to go beyond minimizing entropy and say, well, I want to assess statistical significance for a node split into branches. In such delicate situations, nonparametric conditional decision trees ride to the rescue. Ensure you have the party package installed before trying the following:

##--- nonparametric conditional inference ---
library(party)

fit2 <- ctree(Mileage~Price + Country + Reliability + Type,
data=na.omit(cu.summary))

plot(fit2)
So when should you go for one (rpart) or the other (party)? Traditional decision trees are fine in most apps and are also more intuitive to explain and communicate, so maybe that is what you want to stick with.

3. Multivariate Decision Trees

For this variation in decision trees, copy data from cells A1 to I52 in this google spreadsheet. This is data from your preference ratings for the term 4 offerings. Your 4-dimensional preference vector now becomes the Y variable here.

The task now is to partition the dataset's demographic variables to best explain the distribution of the preference vector. Ensure you have package mvpart installed before trying the following code.

##--- multivariate decision trees ---
data = read.table(file.choose(), header = TRUE)

library(mvpart)
mvpart(data.matrix(data[,1:4])~workex+Gender+Engg+Masters+MktgMajor, data)
Click for a larger image.

Couple of points as I sign off.

  • The algos we executed today with just a few copy-pastes of generic R code are actually quite sophisticated and very powerful. Now available on a platter to you thanks to R.
  • Implementation of business solutions based on these very algos can easily cost upwards of thousands of USD per installation. Those savings are very real, especially if yours is a small firm or startup. I hope the significance of this is appreciated.
  • The applications are more important than the coding. The interpretation, the context and the understanding of what a tool is meant to do is more important than merely getting code to run.
  • My effort is to expose you to the possibilities so that when opportunity arises, you will be able to connect tool to application and extract insight in the process.

Well, this is it for now from me. The targeting algos - randomForest and neural nets, alongside good old logit - will come next session. See you in class tomorrow.

Sudhir

Segmentation Tools in R - Session 5

Hi all,

Welcome to the R codes to Session 5 classroom examples. Pls try these at home. Use the blog comments section for any queries or comments.

There are 3 basic modules of direct managerial relevance in session 5 -

(a) Segmentation: We use cluster analysis tools in R (Hierarchical, k-means and model-based algorithms)
(b) Decision Trees: Using some sophisticated R algorithms (recursive partitioning, conditional partitioning, random forests)
(c) Targeting: Using both parametric (Multinomial Logit) and non-parametric (Machine learning based) algorithms.

Now it may well be true that one session may not do justice to this range of topics. So, even if it spills over into other sessions, no problemo. But I hope and intend to cover this well.

This blog-post is for Segmentation via cluster analysis. The other two will be discussed in a separate blog post for clarity.

There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - Agglomerative Hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), Partitioning (here, we do K-means) and model based clustering. Each has its pros and cons. Model based is probably the best around, highly recommended.

1. Cluster Analysis Data preparation
First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.

#first read-in data#
mydata = USArrests

Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. To get rid of this problem, just copy-paste the following code.

# Prepare Data #
mydata <- na.omit(mydata) # listwise deletion of missing
mydata.orig = mydata #save orig data copy
mydata <- scale(mydata) # standardize variables

2. Now we first do agglomerative hierarchical clustering, plot dendrograms, slice them at different points and see what is happening.

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
Click on image for larger size.

Eyeball the dendrogram. Imagine horizontally slicing through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How to know? Sometimes eyeballing is enough to give a clear idea, sometimes not. Suppose you decide 2 is better. Then set the optimal no. of clusters 'k1' to 2.

k1 = 2 # eyeball the no. of clusters
Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# cut tree into k1 clusters
groups <- cutree(fit, k=k1)
# draw dendrogram with red borders around the k1 clusters
rect.hclust(fit, k=k1, border="red")

3. Coming to the second approach, 'partitioning', we use the popular K-means method.

Again, the Q arises, how to know the optimal no. of clusters? Eyeballing the dendrogram might sometimes help. But at other times, what should you do? MEXL (and most commercial software too) requires you to magically come up with the correct number as input to K-means. R does one better and shows you a scree plot of sorts that shows how the within-segment variance (a proxy for clustering solution quality) varies with the no. of clusters. So with R, you can actually take an informed call.

# Determine number of clusters #
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #
Look for an "elbow" in the scree plot - the interior point at which the angle formed by the 'arms' is sharpest. This scree-plot is not unlike the one we saw in factor analysis. Again, as with the dendrogram, we get either 2 or 4 as the options available. Suppose we go with 2.

# Use optimal no. of clusters in k-means #
k1=2
Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# K-Means Cluster Analysis
fit <- kmeans(mydata, k1) # k1 cluster solution

To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize the cluster, interpret what it is that's common among a cluster's membership, and give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.

# get cluster means
aggregate(mydata.orig,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata1 <- data.frame(mydata.orig, fit$cluster)

OK, that is fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In 2 dimensions, the easiest way is to plot the clusters on the 2 biggest principal components that arise. Before copy-pasting the following code, ensure you have the 'cluster' package installed.

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)
Two clear cut clusters emerge. Missouri seems to border the two. Some overlap is also seen. Overall, the clusPlot seems to put a nice visualization over the clustering process. Neat, eh? Try doing this with R's competitors...:)

4. Finally, the last (and best) approach - Model based clustering.

'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions and because it penalizes for surplus complexity (resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: When in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge.

Install the 'mclust' package for this first. Then run the following code.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
fit # view solution summary
The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree-plot predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.

fit$BIC # lookup all the options attempted
classif = fit$classification # classifn vector
mydata1 = cbind(mydata.orig, classif) # append to dataset
mydata1[1:10,] #view top 10 rows

# Use only if you want to save the output
write.table(mydata1,file.choose())#save output
The classification vector is appended to the original dataset as its last column. Can now easily assign individual units to segments.

Visualize the solution. See how exactly it differs from that for the other approaches.

fit1=cbind(classif)
rownames(fit1)=rownames(mydata)
library(cluster)
clusplot(mydata, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)
Imagine if you're a medium sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution had 3 segments - not 2 or 4.

To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.

# get cluster means
cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean); cmeans
Seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe.

Now, we can do the same copy-paste for any other datasets that may show up in classwork or homework. I'll close the segmentation module here. R tools for the Targeting module are discussed in the next blog post. Any queries or comment, pls use the comments box below to reach me fastest.

Sudhir