Sunday, November 8, 2009

Data cleaning woes

Got this email today:

Hello Professor,

Even after spending a considerable amount of time cleaning & recoding the data we haven’t been able to finish it yet. I have tried to categorize various activities which we are undertaking

1. Codify non numerical variables
2. Replace blanks with zeros
3. For questions with a others field or a question where users have entered free flowing text we have to code them separately (18 out of 38 questions have a free flowing text field which is around 50% of the questions)

For point 3 above we did try to prioritize the questions but even the prioritized list is sizeable.

Let me know if we are missing something here. Is there a more efficient way of cleaning up data which we are not considering?

Thanks.

My response to th whole class, copied here:

People,

Sorry for this HEAVY email, but am now getting worried only. Pls download and delete lest it overload your mailbox size limits. Can’t access blackboard from home currently.

Pls refer to the email forwarded below.

Seems o me too many groups are stuck in data cleaning only. Time is precious and I’d rather you spend it on activity that has more return in terms of both deliverable completion and learning.

The analysis process is *not* sequential. You do NOT have to first finish cleaning up *all* the data and then and only then start formal analyses. Pls don’t go down that pointless path!

Data cleaning and recoding are ideally done on a strictly as-needed basis.

1. First ID the question of interest (e.g. “Can the hatchback buyers be identified using demographic markers?”).
2. Then map that question onto info needed (e.g. “will need a crosstab of current/previous hatchback owners + those saying they will buy hatchbacks + demographic variables”)
3. Then only ID the cols for decoding. (e.g. “OK, current owner brand-model + demographic”)
4. One BIG help in decoding and recoding data are…no surprises….Pivots. Their ability to list in a column all the data (sans duplications and blank cells) is priceless. Pivots are not for crosstabs alone – they are indispensable in listing out column-wise data. Am sure most of you already know this but am taking no chances, hence have cleaned up a couple of messy columns for demonstration purposes.
5. Have attached an excel sheet in which I start to clean up data for Brand, Model, year and fuel-type using a pivot. Then SORT and VLOOKUP become your friends once you have pivoted the free-of-duplicates list of column data. Look at the sheet marked ‘selected raw data’ for the columns selected. Then ‘pivot’ contains a pivot from which I made duplicates-free lists of columns entries. Most turned out to be unusable. Theek hai. Leave them as blank and proceed with what looks fine. We have a few 1000 fine ones, we are good. Now what remains is doing a VLOOKUP from the cleaned cols sheet to the raw data. This took me all of 15 mins maybe.

People, pls DON’T get caught up in frivolous data cleaning exercises. Analyze the data that is already relatively cleaner – such as those options selected off a menu rather than typed in such as attribute preferences, the psychographics and the demographics.

P.S.
I now realize we should had a drop down of the top 10 brands and the top 10 models each with an others option in the drop down….. Anyway, we all learn the hard way, sometimes, I guess. Sorry about that.

Sudhir

No comments:

Post a Comment

Constructive feedback appreciated. Please try to be civil, as far as feasible. Thanks.