Update: Received this email from Mr Suraj Amonkar of section C on one possible reason why we saw what we did in the 2-step procedure:
Hello Professor,
I have attached a document which explains a bit about the sample-size effect in two-step clustering.
Since the method uses the first step to form “pre-clusters” and the second step to run “hierarchical clustering” on them, I suspect that too small a number of samples will not give the method enough information to form good “pre-clusters”, especially if the number of variables is high relative to the number of samples.
“SPSS has three different procedures that can be used to cluster data: hierarchical cluster analysis, k-means cluster, and two-step cluster. They are all described in this chapter. If you have a large data file (even 1,000 cases is large for clustering) or a mixture of continuous and categorical variables, you should use the SPSS two-step procedure. If you have a small data set and want to easily examine solutions with increasing numbers of clusters, you may want to use hierarchical clustering. If you know how many clusters you want and you have a moderately sized data set, you can use k-means clustering.”
Also, there are methods that automatically detect the ideal “k” for k-means. This, in essence, would be similar to the “two-step” approach followed by SPSS (which is based on the bottom-up hierarchical approach). Please find attached a paper describing an implementation of this method. I am not sure whether SPSS has an implementation of this, but R or Matlab might.
Thanks,
Suraj.
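(An aside from me: for those curious what Suraj's 'pre-clusters, then hierarchical clustering' description looks like in practice, here is a rough base-R sketch. It captures the flavour of the method, not SPSS's exact algorithm; df stands in for a largish numeric data frame, and 30 pre-clusters is an arbitrary choice.)

set.seed(42)  # fix the RNG so the pre-clustering is reproducible
pre <- kmeans(scale(df), centers = 30, nstart = 25)  # step 1: form many small 'pre-clusters'
hc <- hclust(dist(pre$centers), method = "ward.D2")  # step 2: hierarchically cluster the pre-cluster centres
plot(hc)  # eyeball the dendrogram for a natural cut
final <- cutree(hc, k = 3)  # suppose the dendrogram suggests 3 clusters
obs.cluster <- final[pre$cluster]  # map each observation to its final cluster via its pre-cluster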
My response:
Nice, Suraj.
It confirms a lot of what I had in mind regarding two-step. However, the 2-step algo worked perfectly well in 2009 on this very same 20-row dataset. Perhaps that was, as someone mentioned, because the random number generator seed in the installed software was the same for everybody back then.
Shall put this up on the blog (and have the files uploaded or something) later.
Thanks again.
Sudhir
Added later: Also, a look through the attached papers shows that calculating the optimal number of clusters in a k-means scenario is indeed a difficult problem. Algorithms have evolved to address it, but I'd rather we not go there at this point in the course.
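For the curious, though, one common heuristic - pick the k that maximises the average silhouette width - takes only a few lines in R. This is a sketch, not a prescription: it assumes the cluster package, a hypothetical numeric data frame df, and an arbitrary search range of 2 to 8 clusters.

library(cluster)  # for silhouette()
d <- dist(scale(df))  # distances on standardized data
ks <- 2:8  # candidate numbers of clusters
avg.sil <- sapply(ks, function(k) {
  set.seed(42)  # fix the RNG so each k is comparable
  km <- kmeans(scale(df), centers = k, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])  # average silhouette width at this k
})
best.k <- ks[which.max(avg.sil)]  # the k with the widest average silhouette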
Sudhir.
Hi All,
SPSS as we know it has changed between 2009 (when I last used it, successfully, in class) and now. In the interim, IBM took over SPSS and seems to have injected its own code and programs into select routines, one of which is the two-step cluster analysis. This change hasn't necessarily been for the better.
1. First off, a few days ago when making slides for this session, I noticed that a contrived dataset - from a textbook example, no less - which I had used without problems in 2009 was giving an erroneous result when 2-step cluster analyzed. The result shown was 'only one cluster found optimal for the data' or some such thing. The 20-row dataset is designed to produce 2 (or at most 3) cleanly separating clusters, so something was off for sure.
2. In section A, I overrode the 'automatically find optimal #clusters' option and manually chose 3. In doing so, I negated the most important advantage 2-step clustering gives us - an objective, information-criterion based determination of the optimal # clusters. Sure, when overridden, the 2-step solution still gives some info on the 'quality' of the clustering solution - based, I suspect, on some variant of the ratio of between-cluster to within-cluster variance that we typically use to assess clustering solution quality.
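(For reference, that between-to-total variance ratio is trivial to compute from a k-means fit in R - a sketch on the same hypothetical df, with 3 clusters forced as in class; SPSS's own 'quality' metric may be computed differently.)

set.seed(42)
km <- kmeans(scale(df), centers = 3, nstart = 25)  # k = 3 chosen manually, as in section A
km$betweenss / km$totss  # share of total variance lying between clusters; nearer 1 = cleaner separation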
3. In section B, when I sought to repeat the same exercise, it turned out that some students were getting 2 clusters as optimal whereas I (and some other students) continued to get 1 as optimal. Now what is going on here? Either the algorithm itself is now so unreliable that it fails basic consistency tests on the same dataset, or there's something quirky about this particular dataset due to which we see this discrepancy.
I'd like to know whether you get different optimal #clusters when doing the homework with the same input.
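(If the seed hypothesis from my reply above is right, the effect is easy to reproduce in R with plain k-means - again a sketch on the hypothetical df; SPSS's two-step internals may of course differ.)

set.seed(1)
km1 <- kmeans(scale(df), centers = 3)  # nstart = 1 by default: a single random start
set.seed(99)
km2 <- kmeans(scale(df), centers = 3)  # same data, same call, different seed
table(km1$cluster, km2$cluster)  # a pure label swap shows one non-zero cell per row; anything messier means genuinely different solutions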
4. Which brings me to why, despite the issues that have dogged SPSS (license-related ones, primarily), I've insisted upon and stuck with SPSS. Well, for one, it's far more user-friendly and intuitive to work with than Excel, R or JMP. The second big reason is that SPSS is the industry standard, and there's much greater resume value in saying you're comfortable conducting SPSS-based analyses than in saying 'JMP', which many in industry may never have heard of.
A third reason is that in a number of procedures - cluster analysis and MDS among them - SPSS allows us to objectively (that is, based on information criteria or other such objective metrics) determine the optimal number of clusters, axes, etc., which would otherwise have to be done subjectively, risking errors along the way. Also, in many other applications, forecasting included, SPSS provides a whole host of options that are not available in other packages (R excepted, of course).
5. Some students reported that their ENDM-based SPSS licenses expired yesterday. Well, homework submission for the quant part has anyway been made flexible, so I'll not stress too much about timely submission. However, undue delay is not called for either. I'm hoping most such students can work around the issue with the virtual-machine solution that IT has documented and sent you.
Well, I hope that clarifies what's been going on and why we are where we are with the software side of the MKTR story.
Sudhir
P.S.
6. All LTs are booked almost all day on Friday; the only slot I got is 12.30-2.30 PM. The poll results do not suggest strong demand for an R tutorial.
So I'm announcing a general SPSS-cum-R hands-on Q&A session for Friday, 11-Nov, 12.30-2.30 PM in the AC2 LT. Attendance is entirely optional. If nobody bothers to show up, that's fine too - I'll simply pack up and head home.