Yesterday in Session 5 we covered two major topics - Segmentation and Targeting. Sorry about the delay in bringing out this blog post. In this post, I lay out the classwork examples (which you might want to try replicating), their interpretation, and the HW for this session.

There are many approaches to cluster analysis, and R handles a dizzying variety of them. We'll focus on 3 broad approaches - agglomerative hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), partitioning (here, we do K-means) and model based clustering. Each has its pros and cons. Model based clustering is probably the best of the lot and comes highly recommended.

1. Cluster Analysis Data preparation
First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.
# first read in the data
mydata = USArrests

# Prepare Data
mydata = na.omit(mydata)   # listwise deletion of missing values
mydata.orig = mydata       # keep an unscaled copy for profiling clusters later
mydata = scale(mydata)     # standardize the variables
2. Now we first do agglomerative hierarchical clustering: plot the dendrogram, cut it at different points and see what is happening.
# Ward Hierarchical Clustering
d = dist(mydata, method = "euclidean")   # distance matrix
fit = hclust(d, method = "ward")         # run the hclust function
plot(fit)                                # display the dendrogram
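To help with the eyeballing, you can draw borders around candidate cluster solutions directly on the dendrogram you just plotted. A small sketch (the k values below are just examples to compare):

# draw borders around candidate solutions on the dendrogram (run right after plot(fit))
rect.hclust(fit, k = 2, border = "red")    # a 2-cluster candidate
rect.hclust(fit, k = 4, border = "blue")   # a 4-cluster candidate, for comparison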
Suppose you decide 2 is better. Then set the optimal no. of clusters 'k1' to 2.
k1 = 2   # eyeball the no. of clusters
Note: if, for another dataset, the optimal no. of clusters changes to, say, 5, then use 'k1 = 5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.
# cut tree into k1 clusters
groups = cutree(fit, k = k1)
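A quick way to see how the units got distributed across clusters, using the 'groups' vector just created (a minimal sketch):

# cluster sizes and the first few assignments (sketch)
table(groups)   # how many states fall in each cluster
groups[1:10]    # cluster membership of the first 10 states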
3. Coming to the second approach, partitioning, we use the popular K-means method. Again, the question arises: how do we know the optimal no. of clusters? Eyeballing the dendrogram might sometimes help. But at other times, what should you do? MEXL (and most commercial software too) requires you to magically come up with the correct number as input to K-means. R does one better and shows you a scree plot of sorts, showing how the within-segment variance (a proxy for clustering solution quality) varies with the no. of clusters. So with R, you can actually take an informed call.
# Determine number of clusters
wss = (nrow(mydata)-1)*sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] = sum(kmeans(mydata, centers = i)$withinss)
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
# Use optimal no. of clusters in k-means
k1 = 2
# K-Means Cluster Analysis
fit = kmeans(mydata, k1)   # k1 cluster solution
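Before profiling the clusters, it is worth a quick look at what the kmeans object itself reports. A minimal sketch (note that kmeans uses random starting centers, so results can shift slightly between runs unless you set a seed first):

# quick inspection of the k-means fit (sketch)
fit$size           # number of units in each cluster
fit$centers        # cluster centers, in the scaled units
fit$tot.withinss   # total within-cluster sum of squares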
To understand a clustering solution, we need to go beyond merely identifying which individual unit goes to which cluster. We have to characterize each cluster, interpret what it is that's common among its members, and give each cluster a name, an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.
# get cluster means
aggregate(mydata.orig, by = list(fit$cluster), FUN = mean)

# append cluster assignment
mydata1 = data.frame(mydata, fit$cluster)
mydata1[1:10, ]
# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
install.packages("cluster")
library(cluster)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)
4. Finally, the last (and best) approach - model based clustering. 'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions, and because it penalizes for surplus complexity (it resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: when in doubt, use model based clustering. And yes, mclust is available *only* on R, to my knowledge. Install the 'mclust' package for this first, then run the following code.
install.packages("mclust") # Model Based Clustering library(mclust) fit = Mclust(mydata) fit # view solution summary |
The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree plot suggested. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.
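If you want to see why mclust settled on 3 components, here is a short sketch using standard mclust functions:

# inspect the chosen model and the BIC values across candidate models (sketch)
summary(fit)              # model name, number of components, log-likelihood, BIC
plot(fit, what = "BIC")   # BIC for each covariance model vs. number of components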
fit$BIC                                 # look up all the options attempted
classif = fit$classification            # classification vector
mydata1 = cbind(mydata.orig, classif)   # append to dataset
mydata1[1:10, ]                         # view top 10 rows

# Use below only if you want to save the output
write.table(mydata1, file.choose())     # save output
The classification vector is appended to the original dataset as its last column. We can now easily assign individual units to segments. Next, visualize the solution and see how exactly it differs from that of the other approaches.
fit1 = cbind(classif)
rownames(fit1) = rownames(mydata)
library(cluster)
clusplot(mydata, fit1, color = TRUE, shade = TRUE, labels = 2, lines = 0)
# get cluster means
cmeans = aggregate(mydata.orig, by = list(classif), FUN = mean)
cmeans
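One nice extra from mclust: it also tells you how sure it is about each assignment. A minimal sketch using the fitted object from above:

# classification uncertainty per unit (1 minus the maximum posterior probability)
round(fit$uncertainty[1:10], 3)   # first 10 units; values near 0 mean confident assignments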
###############################
Targeting in R
This is the code for classwork MEXL example "Conglomerate's PDA". This is the roadmap for what we are going to do:
- First we segment the customer base using model based clustering or mclust, the recommended method.
- Then we randomly split the dataset into training and test samples. The test sample is about one-third of the original dataset in size, following accepted practice.
- Then, using the training sample, we try to establish how the discriminant variables relate to segment membership. This is where we train the Targeting algorithm.
- Then comes the real test - validate algorithm performance on the test dataset. We compare prediction accuracy across traditional and proposed methods.
- Since this is R, there are many targeting algorithms to choose from. I have decided to go with one that has shown good promise of late - the randomForest algorithm. Where we saw decision trees in Session 5, think now of 'decision forests', in a sense...
- Other available algorithms that we can run (provided there is popular demand) are artificial neural nets (multi-layer perceptrons) and Support vector machines. But for now, these are not part of this course.
# read in basis and discriminant variables
basis = read.table(file.choose(), header = TRUE)
dim(basis); basis[1:3, ]
summary(basis)

discrim = read.table(file.choose(), header = TRUE)
dim(discrim); discrim[1:3, ]
summary(discrim)

# Run segmentation on the basis variables
library(mclust)       # invoke library
fit = Mclust(basis)   # run mclust
fit                   # view result
classif = fit$classification

# print cluster sizes
for (i1 in 1:max(classif)){print(sum(classif == i1))}

# Cluster Plot against 1st 2 principal components
require(cluster)
fit1 = cbind(classif)
rownames(fit1) = rownames(basis)
clusplot(basis, fit1, color = TRUE, shade = TRUE, labels = 2, lines = 0)
The segmentation produces 4 optimal clusters. Below is the clusplot where, interestingly, despite our using 15 basis variables, we see decent separation among the clusters in the top 2 principal components directly.
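If you want to check how much of the variation in the 15 basis variables the first two principal components actually capture, here is a quick sketch using base R's prcomp:

# variance captured by the first two principal components of the basis variables (sketch)
pc = prcomp(basis, scale. = TRUE)
summary(pc)$importance[ , 1:2]   # std. deviation, proportion of variance, cumulative proportion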
2. Split dataset into Training & Test samples

Read in the dataset 'PDA case discriminant variables.txt' from LMS for the below analysis:
rm(list = ls())   # clear workspace

# 'PDA case discriminant variables.txt'
mydata = read.table(file.choose(), header = TRUE)
head(mydata)

# build training and test samples using random assignment
train_index = sample(1:nrow(mydata), floor(nrow(mydata)*0.65))   # two-thirds of sample is for training
train_index[1:10]
train_data = mydata[train_index, ]
test_data = mydata[-(train_index), ]

train_x = data.matrix(train_data[ , c(2:18)])
train_y = data.matrix(train_data[ , 19])   # for classification we need as.factor
test_x = data.matrix(test_data[ , c(2:18)])
test_y = test_data[ , 19]
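A quick sanity check on the split, using the objects just created (a minimal sketch; 'memb', the segment membership column, is the variable name used in the next block):

# check the split sizes and the segment balance in each sample (sketch)
dim(train_data); dim(test_data)
table(train_data$memb)   # segment counts in the training sample
table(test_data$memb)    # segment counts in the test sample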
Last year, when Targeting was a full lecture session, I used the most popular machine learning algorithms - neural nets, random forests and support vector machines (all available on R, of course) to demonstrate targeting. Those notes can be found here.

3. Use multinomial logit for Targeting

You will need to install the 'textir' package for this one.
###### Multinomial logit using R package textir #######

install.packages("textir")
library(textir)

covars = normalize(mydata[ , c(2,4,14)], s = sdev(mydata[ , c(2,4,14)]))   # normalizing the data
dd = data.frame(cbind(memb = mydata$memb, covars, mydata[ , c(3,5:13,15:18)]))

train_ml = dd[train_index, ]
test_ml = dd[-(train_index), ]

gg = mnlm(counts = as.factor(train_ml$memb), penalty = 1, covars = train_ml[ , 2:18])
prob = predict(gg, test_ml[ , 2:18])
head(prob)

pred = matrix(0, nrow(test_ml), 1)
accuracy = matrix(0, nrow(test_ml), 1)
for (j in 1:nrow(test_ml)){
  pred[j, 1] = which.max(prob[j, ])
  if (pred[j, 1] == test_ml$memb[j]) {accuracy[j, 1] = 1}
}
mean(accuracy)
You'll see something like this (though not the exact same thing, because the training and test samples were randomly chosen).
Look at the probabilities table given. The table tells us the probability that respondent 1 (in row 1) belongs to segment 1, 2, 3 or 4. We get maximum probability for segment 1, so we say that respondent 1 belongs to segment 1 with a 61% probability. In some cases, all the probabilities may be less than 50%; if so, just take the maximum and assign the respondent to that segment. Now we let the logit algorithm loose on the test sample, and the algorithm comes back with its predictions. In the real world, we would go by what the machine says. But in this case, since we have the actual segment memberships, we can validate the results. This is what I got when I tried to assess the accuracy of the algorithm's predictions:
So, the algorithm is able to predict with roughly 60% accuracy - not bad, considering that random allocation would have given you at best a 25% success rate. Besides, this is simple logit; more sophisticated algorithms exist that can do better, perhaps even much better.
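For instance, the randomForest algorithm mentioned in the roadmap above is one such candidate. A minimal sketch, assuming the train_x, train_y, test_x and test_y objects from the split in step 2 (not a definitive implementation - tuning is left out):

# random forest as an alternative targeting algorithm (sketch)
install.packages("randomForest")
library(randomForest)
set.seed(1234)   # arbitrary seed, only for reproducibility
rf = randomForest(x = train_x, y = as.factor(train_y), ntree = 500)
rf_pred = predict(rf, test_x)   # predicted segment for each test-sample unit
mean(as.character(rf_pred) == as.character(test_y))   # hit rate on the test sample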
That's it for now. I will put up the HW for this session in a separate update here (the deadline now is 9-Nov, Saturday midnight), so watch this space.
Session 5 HW update:
There will be no HW for Session 5. I figure I can combine the segmentation and targeting bits into the Session 6 HW.
Sudhir