Wednesday, December 11, 2013

Session 5 Updates

Hi all,

Yesterday in Session 5 we covered two major topics - Segmentation and Targeting.

Sorry about the delay in getting this post out. Below, I lay out the classwork examples (which you might want to try replicating) and their interpretation.

There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - Agglomerative Hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), Partitioning (here, we do K-means) and model based clustering. Each has its pros and cons (as discussed in class). Also, as mentioned in class, the goal here is to give tomorrow's managers (i.e., you) an exposure to the intuition behind clustering and the various methods in play. Going into technical detail is not a part of this course. However, I'm open to discussing, and happy to receive, Qs of a technical nature outside of class time.

1. Cluster Analysis Data preparation

First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.

# first install these packages
# Note: You only need to install a package ONCE.
# Thereafter a library() call is enough.

install.packages("cluster")
install.packages("mclust")
install.packages("textir")
install.packages("clValid")
# Now read in the data #

mydata = USArrests

Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. A tiny illustration of this (with made-up numbers) is below, followed by the data-prep code, which you can just copy-paste.
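Here is a minimal sketch of the scaling effect (the weights are hypothetical and not part of USArrests):

# tiny illustration of scaling effects (hypothetical numbers)
w_kg = c(60, 75, 90) # weights in kilograms
w_g = w_kg * 1000 # the same weights in grams
dist(w_kg) # pairwise distances on the kg scale
dist(w_g) # the same distances, inflated 1000-fold on the gram scale
scale(w_kg); scale(w_g) # after standardizing, both versions are identical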

# Prepare Data #

mydata = na.omit(mydata) # listwise deletion of missing

mydata.orig = mydata # keep an unscaled copy for profiling clusters later

mydata = scale(mydata) # standardize variables
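A quick sanity check, if you like (purely optional): after scaling, each column should have mean roughly 0 and standard deviation roughly 1.

summary(mydata) # column means should now be ~0
apply(mydata, 2, sd) # and standard deviations ~1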

2. Now we first do agglomerative Hierarchical clustering, plot dendrograms, slice them at different heights and see what happens.

# Ward Hierarchical Clustering

d = dist(mydata, method = "euclidean") # distance matrix

fit = hclust(d, method="ward") # run hclust func

plot(fit) # display dendrogram

Eyeball the dendrogram. Imagine horizontally slicing through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How to know?

Sometimes eyeballing is enough to give a clear idea, sometimes not. Various stopping-rule criteria have been proposed for where to cut a dendrogram - each with its pros and cons.
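If you want to see the candidate cuts on the dendrogram itself, base R's rect.hclust() draws boxes around the clusters a given cut would produce. A quick sketch (run it right after the plot(fit) call above):

plot(fit) # redraw the dendrogram
rect.hclust(fit, k=2, border="red") # outline a 2-cluster cut
rect.hclust(fit, k=4, border="blue") # and a 4-cluster cut, for comparison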

For the purposes of MKTR, I'll use three well-researched internal validity criteria available in the "clValid" package, viz. Connectivity, Dunn's index and Silhouette width - to determine the optimal no. of clusters. We don't need to go into any technical detail about these 3 metrics, for this course.

#-- Q: How to know how many clusters in hclust are optimal? #

library(clValid)

intval = clValid(USArrests, 2:10, clMethods = c("hierarchical"), validation = "internal", method = c("ward"));

summary(intval)

In the summary output, 2 of the 3 metrics support a 2-cluster solution, so let's go with the majority opinion in this case.

Since we decided 2 is better, we set the optimal no. of clusters 'k1' to 2 below, thus:

k1 = 2 # from clValid metrics

Note: If, for another dataset, the optimal no. of clusters changes to, say, 5, then use 'k1 = 5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.
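If you'd rather not read the summary table by eye, clValid also provides an optimalScores() helper that reports, for each validity measure, the best score and the corresponding no. of clusters. A small sketch, using the intval object from above:

optimalScores(intval) # best k per validity measure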

# cut tree into k1 clusters

groups = cutree(fit, k=k1) # cut tree into k1 clusters
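To see how the 50 states split across the k1 clusters, a couple of quick checks (cutree returns a named vector, so state names carry through):

table(groups) # no. of states in each cluster
groups[1:10] # cluster assignments for the first 10 states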

3. Coming to the second approach, 'partitioning', we use the popular K-means method.

Again, the Q arises: how to know the optimal no. of clusters? MEXL (and quite a few other commercial software packages) require you to magically come up with the correct number as input to K-means. R provides better guidance on choosing the no. of clusters. So with R, you can actually take an informed call.

#-- Q: How to know how many clusters in kmeans are optimal? #

library(clValid)

intval = clValid(USArrests, 2:10, clMethods = c("kmeans"), validation = "internal");

summary(intval)

This time, no majority opinion emerges across the three metrics. I'd say choose any one result that seems reasonable and proceed.
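Another common heuristic (not part of the clValid output) is a 'scree plot' of the within-group sum of squares against the no. of clusters; you look for an elbow where the curve flattens out. A rough sketch, using the scaled mydata from above:

# scree plot of within-group sum of squares (a rough heuristic)
wss = numeric(10)
for (i in 1:10) wss[i] = sum(kmeans(mydata, centers=i)$withinss)
plot(1:10, wss, type="b", xlab="No. of clusters", ylab="Within-group sum of squares")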

# K-Means Cluster Analysis

fit = kmeans(mydata, k1) # k1 cluster solution


To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize the cluster, interpret what is it that's common among a cluster's membership, give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.

# get cluster means (on the original, unscaled variables)

cmeans = aggregate(mydata.orig, by=list(fit$cluster), FUN=mean)

t(cmeans) # view cluster centroids, one column per cluster

# append cluster assignment

mydata1 = data.frame(mydata, fit$cluster);

mydata1[1:10,]
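The kmeans object also stores a few handy pieces of the solution directly; for instance (these are standard components of R's kmeans output):

fit$size # no. of states in each cluster
fit$centers # cluster centroids on the standardized variables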

OK, that is fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In two dimensions, the easiest way is to plot the clusters against the two largest principal components. Before copy-pasting the following code, ensure the 'cluster' package is installed.

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph

install.packages("cluster")
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)

Two clear-cut clusters emerge. Missouri seems to border the two. Some overlap is also seen. Overall, clusplot() provides a nice visualization of the clustering solution. Neat, eh? Try doing this with R's competitors...:)
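To check any particular state's assignment (Missouri, say), note that the cluster vector is named by state, so you can index it directly:

fit$cluster["Missouri"] # which k-means cluster Missouri landed in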

4. Finally, the last (and best) approach - Model based clustering. 'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions, and because it penalizes for surplus complexity (resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: when in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge. Install the 'mclust' package for this first. Then run the following code.

install.packages("mclust")

# Model Based Clustering

library(mclust)

fit = Mclust(mydata)

fit # view solution summary

The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree-plot predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.

fit$BIC # lookup all the options attempted
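If you prefer a picture over the BIC table, mclust's plot method can chart the BIC values across all the models it tried (a small sketch; the 'what' argument is available in recent versions of mclust):

plot(fit, what = "BIC") # BIC across models and no. of components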

classif = fit$classification # classifn vector

mydata1 = cbind(mydata.orig, classif) # append to dataset

mydata1[1:10,] #view top 10 rows

# Use below only if you want to save the output

write.table(mydata1,file.choose())#save output
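A quick check of segment sizes from the mclust classification, before moving on:

table(classif) # no. of states in each mclust segment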

The classification vector is appended to the original dataset as its last column, so we can now easily assign individual units to segments. Next, visualize the solution and see how exactly it differs from that of the other approaches.

library(cluster)

clusplot(mydata, classif, color=TRUE, shade=TRUE, labels=2, lines=0)

Imagine you're a medium-sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution had 3 segments - not 2 or 4. To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.

# get cluster means

cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean);

cmeans

In the cluster-means output, the way to understand or interpret the segments is to characterize each segment in terms of the variables that best distinguish that cluster from the others. Typically, we look for variables that attain their highest or lowest values for that cluster. Here, it is clear that the first cluster is the most 'unsafe' (in terms of having the highest murder rate, assault rate, etc.) and the last cluster the most 'safe'.

Thus, from mclust, it seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe. From the kmeans solution, we have 2 clusters - 'Unsafe' and 'Safe' emerging.

Now, we can do the same copy-paste for any other datasets that may show up in classwork or homework. I'll close the segmentation module here. R tools for the Targeting module are discussed in the next section of this blog post. For any queries or comments, pls use the comments box below to reach me fastest.

****************************************************

2. Segmenting and Targeting in R (PDA case - classwork example)

We saw a brief intro to the Conglomerate PDA case in the class handout in session 5. For the full length case, go to the 'Documents' folder in your local drive, then to 'My Marketing Engineering v2.0' folder, within that to the 'Cases and Exercises' folder and within that to the 'ConneCtor PDA 2001 (Segmentation)' folder (if you've installed MEXL, that is). For the purposes of this session, what was given in the handout is enough, however.

Without further ado, let's start. I'll skip the Hclust and k-means steps and go straight to model based clustering (mclust).

#----------------------------------------------#
##### PDA caselet from MEXL - Segmentation #####
#----------------------------------------------#

rm(list = ls()) # clear workspace

# read in 'PDA basis variables.txt' below
mydata = read.table(file.choose(), header=TRUE); head(mydata)

### --- Model Based Clustering --- ###

library(mclust)

fit = Mclust(mydata); fit

fit$BIC # lookup all the options attempted

classif = fit$classification # classifn vector

The printed summary shows that mclust selects a 4-cluster solution for the PDA basis variables.

We plot what the clusters look like in 2-D using the clusplot() function in the cluster library. What clusplot() essentially does is run a principal components analysis on the dataset, plot the first two (i.e., largest two) components as axes and the rows as points in this 2-D space. The plot's subtitle reports what share of the total variance in the dataset the top two components explain. Anyway, this plot is illustrative only and not critical to our analysis here.
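The clusplot call itself isn't reproduced above, so here is a sketch of what produces that picture, following the same pattern as in the USArrests example (mydata here is the PDA basis-variable data read in earlier, and classif is the mclust classification vector):

library(cluster)
clusplot(mydata, classif, color=TRUE, shade=TRUE, labels=2, lines=0)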

But how to interpret what the clusters mean? To interpret the clusters, we have to first *characterize* the clusters in terms of which variables most distinguish them from the other clusters.

For instance, see the table below, in which the cluster means (or centroids) of the 4 clusters we obtained via mclust are shown for each of the 15 basis variables we started with.
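That centroid table comes from the same aggregate() pattern we used in the USArrests example; a sketch, assuming the mydata and classif objects from above:

cmeans = aggregate(mydata, by=list(classif), FUN=mean)
t(cmeans) # clusters as columns, basis variables as rows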

Thus, I would tend to characterize or profile cluster 1 (the first column above) in terms of the variables that assume extreme values for that cluster (e.g., very LOW on Price, Monthly, Ergonomic considerations, Monitor requirements and very HIGH on use.PIM) and so on for the other 3 clusters as well.

What follows the profiling of the clusters is then 'naming' or labeling of the clusters suitably to reflect the cluster's characteristics. Then follows an evaluation of how 'attractive' the cluster is for the firm based on various criteria.

Thus far in the PDA case, what we did was still in the realm of segmentation. Time to now enter Targeting, which lies at the heart of predictive analytics in Marketing. Since this blog post has gotten too large already, I shall take that part to the next blog post.

Sudhir
