Sunday, November 20, 2011

Data cleaning tips for Project Phase III

Hi all,

As you know by now, one of the exam questions related to output from your Phase III project (the factor analysis one). While analyzing the data for that problem, I noticed quite a few irregularities. This is typical, and hence the need for a reasonable cleaning of the data prior to analysis.

For instance, in the Qs measuring the importance of telecom service provider attributes, a few pranksters rated everything "Never heard of it." Clearly, it's a stretch to think internet-savvy folks have never heard of even one of a telecom carrier's service attributes. Such rows should be cleaned (i.e., either removed from the analysis or, where there are only gaps here and there, the missing values imputed) before analysis can proceed.
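
To make this concrete, here is a minimal sketch of how such a check might look in Python/pandas (the equivalent is easy enough in Excel or R as well). The file name, the "attr_" column prefix, and the exact label used for the "Never heard of it" option are all my own placeholders; substitute whatever your export actually uses.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to your own survey export.
df = pd.read_csv("phase3_survey.csv")
attr_cols = [c for c in df.columns if c.startswith("attr_")]  # attribute-importance Qs

# Flag respondents who answered "Never heard of it" to every attribute Q
# (assuming the option is stored as this text label rather than a numeric code).
all_never_heard = (df[attr_cols] == "Never heard of it").all(axis=1)
print("Suspect rows:", all_never_heard.sum())

df_clean = df.loc[~all_never_heard].copy()
```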

Some advice for speeding up the cleaning process:

1. Do *not* attempt to clean the data in all the columns. That is way too much. Consider cleaning the data only in the important Qs on which you will do quantitative analysis using advanced tools (factor/cluster analysis, MDS, regressions, etc.). So choose carefully the columns to clean. Needless to say, this will depend on what decision problems you have identified for resolution.

2. An easy way to start the cleaning process is to check whether there is reasonable variation in the responses of each respondent. For instance, insert a new column after the psychographic Q columns and, for each row, compute the standard deviation of that respondent's psychographic responses. Respondents with very high or very low standard deviations should be investigated for possible cleaning: this would catch, for example, people who mark the same response for every Q, or those who mark only the extremes for some reason. (A quick sketch of this check is given after these tips.)

3. If there are gaps or missing values (coded as -99 in the dataset) in critical columns, you may consider replacing them with the median or the mean of that column. The fancier, more general name for such work is imputation, which also includes various model-based predictions used as replacement values. Without imputation, one is forced to drop the entire row for want of a few missing values here and there. Always ensure the imputed values are 'reasonable' before going down this path. (A sketch follows the tips below.)

More appropriately, imputation might work even better if done segment-wise. Suppose you have segmented the consumer base on some clustering variables. Since those segments are assumed homogeneous in some sense, one can impute missing values more sensibly with segment means/medians. (See the last sketch below.)

4. Don't spend too much time on data cleaning either. In the first half-hour or so spent on cleaning, chances are you will cover the majority of the troubled rows; after that, returns diminish quickly. So draw a line somewhere, stop, and start the analysis from that point on.
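
For tip 2, continuing from the sketch above, a per-respondent standard deviation over the psychographic Qs flags both straight-liners and extreme-only responders. I'm assuming here that these columns are numerically coded (Likert-type) and share a "psy_" prefix, which is my invention.

```python
# Per-respondent standard deviation across the psychographic Qs
# (assumed numeric Likert-type columns; the "psy_" prefix is hypothetical).
psy_cols = [c for c in df_clean.columns if c.startswith("psy_")]
df_clean["psy_sd"] = df_clean[psy_cols].std(axis=1)

# Zero SD = same answer to every Q; unusually high SD = only extremes marked.
suspects = df_clean[(df_clean["psy_sd"] == 0) |
                    (df_clean["psy_sd"] > df_clean["psy_sd"].quantile(0.95))]
print(suspects[["psy_sd"]].describe())
```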
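
For tip 3, the -99 codes can be converted to proper missing values and then filled with the column median (the mean works the same way). Which columns count as "critical" is of course your call; I simply reuse the psychographic columns from above for illustration.

```python
import numpy as np

critical_cols = psy_cols  # or whichever numeric columns you will actually analyze

# Treat -99 as missing, then fill each column's gaps with that column's median.
df_clean[critical_cols] = df_clean[critical_cols].replace(-99, np.nan)
df_clean[critical_cols] = df_clean[critical_cols].apply(
    lambda col: col.fillna(col.median())
)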
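
And for the segment-wise variant, instead of the overall-median fill above, the same fill can be done within each segment. This assumes a 'segment' column already exists from an earlier clustering step (again, a hypothetical name).

```python
# Segment-wise imputation: fill gaps with the median of the respondent's
# own segment (the 'segment' column is assumed to come from your clustering).
df_clean[critical_cols] = (
    df_clean.groupby("segment")[critical_cols]
            .transform(lambda col: col.fillna(col.median()))
)
```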

Update:
I'll share on the blog my answers to any Qs that groups ask me. Conversely, I request groups to first check the blog to see whether their Qs have already been answered here in some form or the other.

Hope that clarifies.

Sudhir

P.S.
Any feedback on the exam, the project, or any other course feature is welcome. As always.

