I've been meeting groups all afternoon and evening today, and some things have come up which IMHO merit wider dissemination:
1. Have some basic roadmap in mind before you start: This is important, else you risk getting lost in the data and all the analyses that are now possible. There are literally millions of ways in which a dataset that size can be sliced and diced. Groups that have no broad, big-picture idea of where they want to go with the analysis inevitably run into problems.
Now don't get me wrong, this is not to pre-judge or straitjacket your perspective or anything - the initial plan you have in mind doesn't restrict your options. It can and should be revised and improved upon as the analysis proceeds.
Update: OK, some may ask - can we get a more specific example? Here is what I had in mind when I was thinking of a broad, basic plan, from an example I outlined in the comments section of a post below:
E.g. - First we clean the data for missing values in Qs 7, 10, 27, etc. -> then do a factor analysis on the psychographics and demographics -> then do a cluster analysis on the factors -> then estimate the sizes of the segments thus obtained -> then look up supply-side options -> arrive at recommendations. (A rough code sketch of this pipeline follows below.)
Hope that clarifies.
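For those more comfortable reading code than prose, here is a rough sketch of that same pipeline in Python (the project work itself is on JMP, of course). The file name, the Q7/Q10/Q27 labels, the "psy_" column prefix, and the factor and cluster counts are all placeholder assumptions - substitute whatever your actual dataset and plan call for.

```python
# A rough Python sketch of the plan above (the course itself uses JMP).
# Column names, file name, and the factor/cluster counts are hypothetical.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

df = pd.read_csv("survey.csv")                        # placeholder file name

# Step 1: clean out rows with missing values in the key questions only
key_qs = ["Q7", "Q10", "Q27"]
df = df.dropna(subset=key_qs)

# Step 2: factor analysis on the psychographic/demographic items
psycho_cols = [c for c in df.columns if c.startswith("psy_")]   # assumed prefix
X = StandardScaler().fit_transform(df[psycho_cols])
fa = FactorAnalysis(n_components=4, random_state=0)             # 4 factors assumed
scores = fa.fit_transform(X)

# Step 3: cluster analysis (k-means) on the factor scores
km = KMeans(n_clusters=3, n_init=10, random_state=0)            # 3 segments assumed
df["segment"] = km.fit_predict(scores)

# Step 4: estimate segment sizes from cluster membership
print(df["segment"].value_counts(normalize=True).round(2))
```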
2. Segmentation is the key: The Project, at its core, boils down to an STP or Segmentation-Targeting-Positioning exercise. And it is the Segmentation part which is crucial to getting the T and P parts right. What inputs to use for the segmentation, what clustering bases to use, how many clusters to extract via k-means, how best to characterize those clusters, and how to decide which among them is the most attractive are, IMHO, the really tricky questions in the project.
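On the "how many clusters" question: JMP has its own clustering diagnostics, but the general idea can be sketched outside JMP too. Below is an illustrative Python snippet, run on made-up factor scores, that compares candidate values of k using within-cluster inertia and silhouette scores - purely to show the kind of comparison involved, not a prescription.

```python
# Comparing candidate k for k-means (sketched in Python, not JMP).
# `scores` stands in for the factor scores you would actually cluster.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

scores = np.random.default_rng(0).normal(size=(200, 4))   # made-up data for illustration

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(scores)
    sil = silhouette_score(scores, km.labels_)
    print(f"k={k}: inertia={km.inertia_:.1f}, silhouette={sil:.3f}")

# Look for the k where inertia stops dropping sharply and the silhouette stays high.
```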
3. Kindly ask around for JMP gyan: A good number of the folks I have met seemed to have basic confusion regarding factor and cluster analyses and how to run them on JMP - this after I thought I'd done a good job going step-by-step through the procedures in class and interpreting the results. Kindly ask around for clarifications etc. on the JMP implementation of these procedures. The textbook contains good overviews of the conceptual aspects of these methods.
I'm hopeful that at least a few folks in each group have a handle on these critical procedures - factor and cluster. I shall, for completeness' sake, again go through them quickly tomorrow in class.
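For those who want to see the conceptual flow of factor analysis outside JMP, here is a minimal illustrative sketch in Python on made-up item data: standardize the items, fit a small number of factors, and read off the loadings to interpret and name each factor. The item names and the two-factor choice are assumptions for illustration only.

```python
# A minimal conceptual sketch of factor analysis (Python, not JMP):
# standardize items, fit factors, inspect loadings to name each factor.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
items = pd.DataFrame(rng.normal(size=(150, 6)),
                     columns=[f"item_{i}" for i in range(1, 7)])  # stand-in survey items

Z = StandardScaler().fit_transform(items)
fa = FactorAnalysis(n_components=2, random_state=0).fit(Z)

# Loadings: which items load heavily on which factor - the basis for naming factors
loadings = pd.DataFrame(fa.components_.T,
                        index=items.columns,
                        columns=["Factor1", "Factor2"])
print(loadings.round(2))
```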
4. The 80-20 rule very much applies in data cleaning: Chances are that under 20% of the columns in the dataset will yield over 80% of its usable information content. So don't waste time cleaning data (i.e., removing missing values, nonsense answers, etc.) from all the columns - only the important ones. Again, you need to have some basic plan in mind before you can ID the important columns.
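One quick way to see where the missing values are concentrated before deciding which columns merit any cleaning effort at all - sketched in pandas, with "survey.csv" standing in for whatever your actual file is called:

```python
# Share of missing values per column, worst first (pandas sketch; file name is a placeholder)

import pandas as pd

df = pd.read_csv("survey.csv")

missing_share = df.isna().mean().sort_values(ascending=False)
print(missing_share.head(10).round(2))    # the ten worst-affected columns
```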
Also, not all data cleaning need mean dropping rows. In some instances, missing values can perhaps be safely imputed using the column mean, median, or mode (depending on the data type).
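For illustration, a small sketch of such imputation on made-up data - the median (or mean) for a numeric item, the mode for a categorical one:

```python
# Simple imputation sketch: median for numeric items, mode for categorical ones.
# The tiny DataFrame below is made-up stand-in data.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Q7":  [4.0, np.nan, 3.0, 5.0],        # numeric item with a missing value
    "Q27": ["yes", "no", None, "yes"],     # categorical item with a missing value
})

df["Q7"]  = df["Q7"].fillna(df["Q7"].median())           # numeric: median (or mean)
df["Q27"] = df["Q27"].fillna(df["Q27"].mode().iloc[0])   # categorical: mode

print(df)
```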
Chalo, enough for now. More as updates occur.
Sudhir