Sunday, November 13, 2011

Take-aways from past projects

Update: Received a few emails along the lines of the one below, so might as well clarify:

Hi Prof,

There seems to be some confusion on the final report submission due date. This is to check whether submission is due in session 10.

Kindly confirm.
S

My response:

No. It's due on 27-Nov - the day before term 6.
Hope that clarifies.
Sudhir
----------------------------------------------------------
Hi all,

As promised, some collected thoughts on where folks ran into obstacles in past years. Am collating and putting up relevant excerpts from past blogposts to start with. Shall add and expand this thread as more info comes in.
------------------------------------------------------------------
From last year (financial planning and savings avenues analysis)


[Some] stuff that IMHO merits wider dissemination.
1. Let me rush to clarify that no great detail is expected in the supply side Q.
As in, you're not expected to say - "XYZ should offer an FD [fixed deposit] with a two year minimum lock-in offering 8.75% p.a.".
No.
Saying "XYZ should offer FDs in its product portfolio." is sufficient.


2. Make the assumption that the sample adequately represents the target population - of young, urban, upwardly mobile professionals. 

3. Yes, data cleaning is a long, messy process. But it is worthwhile: once it's done, the rest of the analyses follow very easily indeed, in seconds. 

4. It helps to start with some set of conjectures in mind - about candidate product classes and potential target segments. One can then use statistical analyses to confirm or disprove particular hypotheses about them.

5. There is no 'right or wrong' approach to the problem. There is, however, a difference between a logical, coherent, data-driven path to actionable recommendations and one that is not. I'll be looking for logical errors, coherence issues, unsustainable assumptions and the like in your journey to the recommendations you make in phase III.
------------------------------------------------------------------
From last year on stumbles in the analysis phase:
1. Have some basic roadmap in mind before you start: This is important, else you risk getting lost in the data and all the analyses that are now possible. There are literally millions of ways in which a dataset that size can be sliced and diced. Groups that had no broad, big-picture idea of where they wanted to go with the analysis inevitably ran into problems.

Now don't get me wrong, this is not to pre-judge or straitjacket your perspective or anything - the initial plan you have in mind doesn't restrict your options. It can and should be changed and improvised as the analysis proceeds.

Update: OK. Some may ask - can we get a more specific example? Here is the kind of broad, basic plan I had in mind, from an example I outlined in the comments section of a post below:
E.g. - First we clean the data of missing values in Qs 7, 10, 27 etc. -> then do factor analysis on the psychographics and demographics -> then cluster analysis on the factors -> then estimate the sizes of the segments thus obtained -> then look up supply side options -> arrive at recommendations.

Hope that clarifies.
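
To make that broad plan concrete, here is a minimal sketch of the same pipeline in Python. We use JMP/MEXL in class, so treat this purely as an illustration of the sequence of steps; the file and column names (survey.csv, Q7_1, Q10, Q27_1) are hypothetical placeholders.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import FactorAnalysis
from sklearn.cluster import KMeans

df = pd.read_csv("survey.csv")                  # hypothetical file name

# Step 1: clean out missing values - only in the key questions
key_cols = ["Q7_1", "Q10", "Q27_1"]             # placeholder column names
df = df.dropna(subset=key_cols)

# Step 2: factor analysis on the psychographic + demographic items
X = StandardScaler().fit_transform(df[key_cols])
scores = FactorAnalysis(n_components=2, random_state=0).fit_transform(X)

# Step 3: cluster analysis (k-means) on the factor scores
df["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scores)

# Step 4: estimate the sizes of the segments thus obtained
print(df["segment"].value_counts(normalize=True))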
2. Segmentation is the key: The project, at its core, boils down to an STP or Segmentation-Targeting-Positioning exercise. And it is the Segmentation part that is crucial to getting the T and P parts right. What inputs to use for segmentation, what clustering bases to choose, how many clusters to extract via k-means, how best to characterize those clusters, and how to decide which among them is best/most attractive are, IMHO, the real tricky questions in the project.
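
On the "how many clusters" question specifically, one common heuristic (not a course-mandated method, just one reasonable check) is to compare k-means solutions across several values of k using the silhouette score. A sketch, continuing from the factor scores (`scores`) in the pipeline sketch above:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# `scores` holds the factor scores from the earlier sketch
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(scores)
    print(k, round(silhouette_score(scores, labels), 3))

Pick a k where the fit is decent AND the clusters stay sizeable and interpretable - the statistics alone shouldn't decide it.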

3. Kindly ask around for software tool gyan: A good number of folk I have met seemed to have basic confusion regarding factor and cluster analyses and how to run these on the software tool. This after I thought I'd done a good job going step-by-step over the procedure in class and interpreting the results. Kindly ask around for clarifications etc on the JMP implementation of these procedures. The textbook contains good overviews about the conceptual aspects of these methods.

I'm hopeful that at least a few folk in each group have a handle on these critical procedures - factor and cluster. I shall, for completeness' sake, again go through them quickly tomorrow in class.

4. The 80-20 rule very much applies in data cleaning: Chances are that under 20% of the columns in the dataset will yield over 80% of its usable information content. So don't waste time cleaning data (i.e. removing missing values, nonsense answers etc.) from all the columns - just the important ones. Again, you need to have some basic plan in mind before you can ID the important columns.

Also, not all data cleaning needs to mean dropping rows. In some instances, missing values can perhaps be safely imputed using column means, medians or the mode (depending on data type). 
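
For instance, a small type-aware imputation sketch in Python (column choices are hypothetical; impute only where it is defensible for the question at hand):

import pandas as pd

def impute(df, cols):
    """Fill missing values: median for numeric columns, mode otherwise."""
    out = df.copy()
    for c in cols:
        if pd.api.types.is_numeric_dtype(out[c]):
            out[c] = out[c].fillna(out[c].median())       # robust to outliers
        else:
            out[c] = out[c].fillna(out[c].mode().iloc[0]) # most common value
    return out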

Chalo, enough for now. More as updates occur.
Sudhir
------------------------------------------------------------------
More from last year on specific problems encountered in final presentations:


1. Research Objective (R.O.) matters.
Recall from lectures 1, 2 & 3 my repeated exhortations that "A clear-cut R.O. that starts with an action verb defined over a crisp, actionable object sets the agenda for all that follows". Well, that wasn't all blah-blah. Its effects are measurable, as I came to see.

Had the entire group been on board with, and agreed upon, a single well-defined R.O., then planning, delegating and recombining different modules into a whole would have been much simpler. Coherence matters a lot in a project this complex, with coordination issues of the kind you must have faced. It was likely to visibly impact the quality of the outcome, and IMHO, it did.

2. Two broad approaches emerged - Asset First and Customer First.
One, where you define your research objective (R.O.) as "Identify the most attractive asset class." and the other, "Identify the most attractive customer segment." The two R.O.s lead to two very different downstream paths.

Most groups preferred the first (asset first) route. Here, the game was to ID the most attractive asset classes using size, monetary value of the addressable market or some such criterion, and then filter in only those respondents who showed some interest in the selected asset classes. Then characterize the respondent segments thus (indirectly) obtained and build recommendations on that basis, as sketched in code below.

I was trying to nudge people towards the second, "Customer segmentation first" route, partly because it aligns much more closely with the core Marketing STP (Segmentation-Targeting-Positioning) way. In this approach, the entire respondent base is first segmented along psychographic, behavioral, motivational or demographic bases; then different segments are evaluated for attractiveness based on some criterion - monetary value, count or share etc.; and then the most attractive segments are profiled/analyzed for asset class preferences and investments.
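
The filtering step of the asset-first route might look like this in Python (a sketch only; income, savings_pct and the alloc_* allocation-percentage columns are hypothetical placeholders):

import pandas as pd

df = pd.read_csv("survey.csv")                          # hypothetical file name
asset_cols = ["alloc_fd", "alloc_equity", "alloc_gold"] # % allocations, placeholders

# Addressable rupee value per asset class:
# income x savings rate x % of savings allocated to that class
rupee_value = df[asset_cols].mul(df["income"] * df["savings_pct"], axis=0)
top_asset = rupee_value.sum().idxmax()

# Filter in only respondents showing some interest in the winning class
interested = df[df[top_asset] > 0]
print(top_asset, len(interested))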

Am happy to say that in a majority of the groups, once a group implicitly chose a particular R.O., the approach that followed was logically consistent with the R.O. 

3. Some novel, surprising things.
Just reeling off a few quick ones that come to mind.

One, how do you select the "most attractive" segment or asset class given a set of options? Some groups went with a simple count criterion - count the # of respondents in each cluster and pick the largest one. Some groups went further and used a value criterion - multiply the count by (% savings times average income times % asset class allocation) to arrive at a rupee figure. This latter approach is more rigorous and objective, IMHO. Only 2 groups went even further in their choice of an attractiveness criterion - the customer lifetime value (CLV) criterion. They multiplied the rupee value per annum per respondent by a (cleaned up) "years to retirement" variable to obtain the revenue stream value of a respondent over his/her pre-retirement lifetime. (Post-retirement, people become net consumers rather than net savers, so retirement is a clean break point.) I thought this last approach was simply brilliant. Wow. And of the two groups that used this idea, one went further and normalized cluster lifetime earnings by cluster size, giving a crisp comparison benchmark.
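
In code, the three criteria might look like this - a sketch with hypothetical column names, assuming df is the cleaned survey data frame and already carries a per-respondent "segment" label:

import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical; one row per respondent,
                                 # assumed to already have a "segment" column

df["value_pa"] = df["income"] * df["savings_pct"] * df["alloc_pct"]  # Rs./yr
df["clv"] = df["value_pa"] * df["years_to_retirement"]  # pre-retirement stream

by_seg = df.groupby("segment").agg(
    count=("clv", "size"),         # count criterion
    value_pa=("value_pa", "sum"),  # value criterion (Rs. per annum)
    clv=("clv", "sum"),            # lifetime value criterion
)
by_seg["clv_per_head"] = by_seg["clv"] / by_seg["count"]  # size-normalized
print(by_seg.sort_values("clv", ascending=False))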

Two, how do you select the basis variables for a good clustering solution? Regardless of which approach you took, a good segmentation solution - one in which clusters are clear, distinct, sizeable and actionable - would be required. One clear thing that emerged across multiple groups was that using only the Q27 psychographics and the demographics wasn't yielding a good clustering solution. The very first few runs (<1 minute each on JMP, and I'm told several minutes on MEXL) should have signaled that things were off with this approach. Adding more variables was key: typically, groups that added savings motivation variables, the Q7 constant sum etc. were able to see a better clustering solution. There is seldom an ideal clustering solution, and that's a valuable learning when dealing with real data (unlike the made-up data of classroom examples).

One group that stood out on this second point used all 113 variables in the dataset in a factor analysis -> got some 43-odd factors -> labeled and IDed them -> then selectively chose 40 from among the 43 as a segmenting basis and obtained a neat clustering solution. The reason this 'brute force' approach stands out in my mind is that it leaves no place for subjective judgment and no chance that some correlations among disparate variables will have been overlooked. It's also risky, as such attempts are fraught with multi-collinearity and inference issues. Anyway, it seemed to have worked.
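
If you want a quick, objective check on competing basis-variable sets, one way (a sketch, not a prescription; the column names are placeholders and df is the cleaned survey data frame) is to run the same k-means on each candidate basis and compare silhouette scores:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

bases = {
    "Q27 psychogr + demogr":  ["Q27_1", "Q27_2", "age", "income"],
    "+ motivation, Q7 items": ["Q27_1", "Q27_2", "age", "income",
                               "motive_1", "Q7_1"],
}
for name, cols in bases.items():
    X = StandardScaler().fit_transform(df[cols].dropna())
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
    print(name, round(silhouette_score(X, labels), 3))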
[...]
Anyway, like I have repeatedly mentioned - effort often correlates positively with learning. So I'm hoping your effort on this project did translate into enduring learning regarding research design, data work, modeling, project planning, delegation and coordination, among other things.
------------------------------------------------------------------
OK, that's it for now. Shall update the thread with more learnings from past years for reference purposes.


Sudhir

