Hi all,
Welcome to the R codes for the Session 5 classroom examples. Please try these at home. Use the blog comments section for any queries or comments.
There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - agglomerative hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), partitioning (here, we do K-means) and model-based clustering. Each has its pros and cons. Model-based is probably the best around, and is highly recommended.
1. Cluster Analysis Data preparation
First read in the data. USArrests comes pre-loaded in R, so no sweat. I use the USArrests dataset as the running example throughout the cluster analysis.
#first read-in data#
mydata = USArrests
Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. To get rid of this problem, just copy-paste the following code.
# Prepare Data #
mydata <- na.omit(mydata) # listwise deletion of missing
mydata.orig = mydata #save orig data copy
mydata <- scale(mydata) # standardize variables
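To see the units problem concretely, here is a tiny made-up sketch (the 'weight' numbers below are hypothetical, not part of USArrests): the same two records look about 2 units apart or about 1000 units apart depending purely on whether weight is recorded in kilograms or grams, and scaling removes that arbitrariness.
# toy illustration: units drive distances unless we standardize
w_kg  <- c(60, 61)      # two hypothetical weights, in kilograms
w_g   <- w_kg * 1000    # the same weights, recorded in grams
other <- c(5, 7)        # some other variable, identical in both versions
dist(cbind(w_kg, other))                 # approx 2.2
dist(cbind(w_g, other))                  # approx 1000 - the units dominate
dist(cbind(scale(w_kg), scale(other)))   # = 2
dist(cbind(scale(w_g), scale(other)))    # same 2 - after scaling, units no longer matter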
2. Now we first do agglomerative hierarchical clustering, plot dendrograms, slice them at different heights and see what is happening.
# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward.D2") # Ward's method (current R versions use "ward.D2"; older versions accepted "ward")
plot(fit) # display dendrogram
Eyeball the dendrogram. Imagine horizontally slicing through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How do you know? Sometimes eyeballing is enough to give a clear idea, sometimes not. Suppose you decide 2 is better. Then set the optimal no. of clusters 'k1' to 2.
k1 = 2 # eyeball the no. of clusters
Note: If, for another dataset, the optimal no. of clusters changes to, say, 5, then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.
# cut tree into k1 clusters
groups <- cutree(fit, k=k1)
# draw dendrogram with red borders around the k1 clusters (run this while the dendrogram plot is still displayed)
rect.hclust(fit, k=k1, border="red")
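If eyeballing alone feels iffy, here is a small optional check I'm adding (not part of the class code): the hclust object stores the height at which each merge happened, and a big gap between successive heights hints at a natural place to cut the tree.
# optional: inspect the largest merge heights; a big drop suggests a cut point
round(rev(fit$height)[1:8], 2) # the eight largest merge heights
plot(1:8, rev(fit$height)[1:8], type="b", xlab="Merge step (from top of tree)", ylab="Merge height")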
3. Coming to the second approach, 'partitioning', we use the popular K-means method.
Again the question arises: how do we know the optimal no. of clusters? Eyeballing the dendrogram might sometimes help. But at other times, what should you do? MEXL (and most commercial software too) requires you to magically come up with the correct number as input to K-means. R does one better and shows you a scree plot of sorts, depicting how the within-segment variance (a proxy for clustering solution quality) varies with the no. of clusters. So with R, you can actually take an informed call.
# Determine number of clusters #
set.seed(1234) # optional: fix the random seed so the k-means runs below are reproducible
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) # total within-group SS for a 1-cluster solution
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss) # within-group SS for 2 to 15 clusters
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #
Look for an "elbow" in the scree plot - the point at which the angle formed by the two 'arms' is sharpest. This scree plot is not unlike the one we saw in factor analysis. Again, as with the dendrogram, we get either 2 or 4 as the options available. Suppose we go with 2.
# Use optimal no. of clusters in k-means #
k1=2
Note: As before, if the optimal no. of clusters for your dataset is different (say, 5), change 'k1' accordingly before copy-pasting the rest of the code as-is.
# K-Means Cluster Analysis
fit <- kmeans(mydata, k1) # k1 cluster solution
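Before characterizing the clusters, it is worth a quick optional peek at what the kmeans object itself carries:
fit$size # no. of states in each cluster
round(fit$centers, 2) # cluster centroids, in standardized units
fit$betweenss/fit$totss # share of total variance captured by the clustering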
To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize the cluster, interpret what it is that's common among a cluster's membership, and give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.
# get cluster means
aggregate(mydata.orig,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata1 <- data.frame(mydata.orig, fit$cluster)
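A quick way (my small addition) to put names to the clusters is to list which states landed where:
table(fit$cluster) # cluster sizes
split(rownames(mydata1), fit$cluster) # state names, listed cluster-wise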
OK, that is fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In 2 dimensions, the easiest way is to plot the clusters against the 2 biggest principal components that arise. Before copy-pasting the following code, ensure the 'cluster' package is installed.
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)
Two clear-cut clusters emerge. Missouri seems to sit on the border between the two. Some overlap is also seen. Overall, clusplot seems to put a nice visualization over the clustering process. Neat, eh? Try doing this with R's competitors... :)
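In case you're curious what clusplot is doing under the hood, here is a rough by-hand sketch of the same idea: project the standardized data onto its first two principal components and colour the points by cluster.
# rough, by-hand version of the clusplot idea
pc <- prcomp(mydata) # mydata is already standardized
plot(pc$x[,1:2], col=fit$cluster, pch=19, xlab="PC 1", ylab="PC 2")
text(pc$x[,1:2], labels=rownames(mydata), pos=3, cex=0.6)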
4. Finally, the last (and best) approach - Model based clustering.
'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions and because it penalizes for surplus complexity (resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: When in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge.
Install the 'mclust' package for this first. Then run the following code.
# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
fit # view solution summary
The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree plot predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.
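To actually see the fit-versus-complexity tradeoff that mclust resolved, these two optional lines help (in recent versions of the mclust package):
plot(fit, what="BIC") # BIC by no. of components, for each covariance model tried
summary(fit) # details of the chosen solution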
fit$BIC # lookup all the options attempted
classif = fit$classification # classifn vector
mydata1 = cbind(mydata.orig, classif) # append to dataset
mydata1[1:10,] #view top 10 rows
# Use only if you want to save the output
write.table(mydata1,file.choose())#save output
The classification vector is appended to the original dataset as its last column. We can now easily assign individual units to segments.
Let's visualize the solution and see how exactly it differs from those of the other approaches.
library(cluster)
clusplot(mydata, classif, color=TRUE, shade=TRUE, labels=2, lines=0)
Imagine you're a medium-sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution has 3 segments - not 2 or 4.
To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.
# get cluster means
cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean); cmeans
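One optional way I find handy to read the centroid table: subtract the overall (grand) means, so positive entries flag clusters that are above average on that variable and negative entries flag below-average ones.
# cluster means relative to the overall means (positive = above average)
round(sweep(as.matrix(cmeans[,-1]), 2, colMeans(mydata.orig), "-"), 1)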
Seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe.
5. Discriminant Analysis for targeting
I only demo linear discriminant analysis in R. Other, more complex functions such as the quadratic discriminant are available for enthusiasts to explore (let me know if you really want to go there). The following code, copy-pasted, does a simple linear discriminant analysis on the mclust output and tries to see which targeting variables best discriminate (or predict) segment membership between two particular clusters. Thus, to know which variables help determine segment membership in cluster 1 versus cluster 2, look at discriminant function 1, and so on.
# run discriminant analysis for targeting
library(MASS)
z <- lda(classif ~ ., data=mydata1) # run linear discriminant; mydata1 already contains 'classif' as its last column
z # view result
If there are k clusters, there will be (k-1) discriminant functions (here, 3 clusters give 2 functions). Look at the largest coefficients (in absolute terms) for each targeting variable to roughly assess its importance. Admittedly, I haven't incorporated the inference part of LDA as yet (it involves invoking MANOVA), but for now, this should do.
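As a quick optional sanity check (my addition), you can see how well the discriminant functions recover the mclust segment membership by cross-tabulating actual versus predicted classes:
# optional: confusion matrix for the LDA classification
pred <- predict(z) # in-sample predictions from the fitted lda
table(actual=classif, predicted=pred$class)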
Now, we can do the same copy-paste for any other datasets that may show up in classwork or homework. I'll close the segmentation and targeting piece here.
Sudhir