Tuesday, December 18, 2012

Session 6 HW and other Notes

Article Update:
Found this really neat article: 'Different ways in which you can become a data scientist'. Given that data science is going to become an element of competitive advantage in the medium term, I'd say the article is of interest to all MKTR students (regardless of background).

**********************************

Update: Here's an interesting article on job trends in the coming year. 6 startup trends in 2013: bootstrapping, marketing, B2B. Thought it relevant to highlight this part of the article for our MKTR course:

5) Marketing becomes as hot as tech. "By 2017, CMOs will be spending more on IT than CIOs. Driving this massive shift is the customer data that simply did not exist a decade ago." -Ajay Agarwal, Bain Capital Ventures

6) Service marketplaces — not individual suppliers — will become the "brand." Just as Amazon has become a leading brand for books (versus individual publishers), consumers will look to branded marketplaces for various services, such as teaching, cleaning, or construction. -Eric Chin, general partner, Crosslink Capital

Point? What looks arcane and abstract right now (e.g., the decision trees and targeting algorithms we used) is the future.

**********************************

Hi all,

Your Session 6 HW (on some targeting algorithms we covered in class) requires you to go through the short caselet 'Kirin' (PDF uploaded). This HW has 4 Qs as detailed below and uses the same PPT submission format as the Session 5 HW. The submission deadline is Wednesday, 26-Dec, midnight, into a dropbox.

Some background to the Qs first. Before targeting we need to do segmentation. Q2 deals with segmentation and the interpretation of segments. But prior to segmentation, we need to know what constructs may underlie the basis variables; Q1 deals with that. You have already done factor analysis and segmentation in the Session 5 HW, so they receive less emphasis here. The focus is on Q3 and Q4, where we apply the randomForest and neural-net algorithms respectively.

This HW also demonstrates the importance of selecting good discriminant variables. As it turns out, the discriminant variables used here are lousy and yield remarkably low predictive accuracy rates even with such sophisticated algos. The takeaway? Methods cannot alleviate deficiencies in the data beyond a point. OK, without further ado, here we go:

Questions for Session 6 HW:

  • Q1. Find what constructs may underlie the basis variables. Use factor analysis; report the eigenvalue scree plot and factor loadings table. Answer the following Q sub-parts:
  • (i) Which variables load less than 30% onto the factor solution? (Hint: look for a uniqueness of 1 - 0.30 = 0.70 or above.)
  • (ii) ID and label the constructs you find among the variables that do load well onto the factor solution.
  • Q2. Use mclust to segment the respondents. Answer the following Q parts.
  • (i) What is the optimal no. of clusters?
  • (ii) Report a clusplot. What is the % of variance explained by the top 2 principal components in the cluster solution?
  • (iii) ID and label the segments you find.
  • Q3. Split the Kirin dataset into a training sample (the first 212 rows) and a test sample (the remaining 105 rows). Train the randomForest algorithm on the training sample and predict the test sample's segment membership. Answer the following Qs:
  • (i) Record predictive accuracy in both training and test samples
  • (ii) Which segment appears to have the highest error rate?
  • (iii) List the top 3 variables that best discriminate among the segments (use Mean Decrease in Accuracy metric)
  • Q4. Using the same train/test split of the Kirin dataset, fit a multinomial logit via neural nets on the training sample. Predict the test sample's segment membership.
  • (i) Record predictive accuracy in both training and test samples
  • (ii) Which segment appears to have the highest error rate?
  • (iii) List the top 3 variables that best discriminate among the segments (use significance as a metric here)
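To give a feel for what Q1 and Q2 involve, here is a minimal R sketch. Note the caveats: `mydata` below is a *synthetic stand-in* for the Kirin basis variables (replace it with the actual caselet data), and the number of factors is chosen by the simple eigenvalue-greater-than-1 rule, so adapt as needed.

```r
# Sketch only: 'mydata' is a synthetic placeholder for the Kirin basis variables
set.seed(1)
mydata <- data.frame(matrix(rnorm(317 * 8), ncol = 8))

## Q1: factor analysis
ev <- eigen(cor(mydata))$values
plot(ev, type = "b", main = "Eigenvalue scree plot")   # scree plot
nfac <- sum(ev > 1)                                    # eigenvalue > 1 rule
fit <- factanal(mydata, factors = nfac, rotation = "varimax")
print(fit$loadings, cutoff = 0.3)   # factor loadings table
round(fit$uniquenesses, 2)          # uniqueness >= 0.70 => loads < 30%

## Q2: model-based segmentation with mclust
library(mclust)
seg <- Mclust(mydata)
summary(seg)                        # reports the optimal no. of clusters (by BIC)
library(cluster)
clusplot(mydata, seg$classification, color = TRUE, shade = TRUE,
         labels = 4, lines = 0)     # plot title reports % variance of top 2 PCs
```

The `clusplot` title directly answers Q2(ii); segment labeling in Q1(ii) and Q2(iii) is your interpretation of the loadings and cluster means.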
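For Q3 and Q4, a hedged sketch of the train/test workflow follows. Again, `mydata` and `seg.labels` below are synthetic placeholders: in the HW, the discriminant variables come from the Kirin data and the segment labels from your mclust solution.

```r
# Sketch only: synthetic placeholders for the Kirin discriminant variables
# and the segment memberships obtained in Q2
set.seed(2)
mydata <- data.frame(matrix(rnorm(317 * 8), ncol = 8))
seg.labels <- factor(sample(1:3, 317, replace = TRUE))

df    <- data.frame(mydata, segment = seg.labels)
train <- df[1:212, ]       # training sample: first 212 rows
test  <- df[213:317, ]     # test sample: remaining 105 rows

## Q3: random forest
library(randomForest)
rf <- randomForest(segment ~ ., data = train, importance = TRUE)
rf$confusion                                  # training accuracy & per-segment error
mean(predict(rf, test) == test$segment)       # test-sample accuracy
importance(rf, type = 1)                      # Mean Decrease in Accuracy

## Q4: multinomial logit via neural nets
library(nnet)
ml <- multinom(segment ~ ., data = train, trace = FALSE)
mean(predict(ml, train) == train$segment)     # training accuracy
mean(predict(ml, test)  == test$segment)      # test accuracy
z <- summary(ml)$coefficients / summary(ml)$standard.errors
round(2 * pnorm(-abs(z)), 3)                  # rough p-values for significance
```

With random placeholder data the accuracies will hover near chance; that is exactly the "lousy discriminant variables" effect mentioned above, which you should also see (to a lesser degree) with the real Kirin data.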

Some Notes on why use R:

In case you are wondering why you are being asked to bother with R, some points to note:

  • It's imperative you learn how to run MKTR analyses at least once. You won't be able to effectively lead a team of people doing what you've never done yourself. Sure, your analytics team will run the analysis, but you need an idea of what that entails, what to expect, etc.
  • Folks who have any programming experience at school or at work can probably vouch for the fact that the R language is about as straightforward as, well, English. Pls implement the R code *yourself* to get a feel of the platform. Pls help your peers out who may not be as comfortable with R.
  • To those who are ambivalent or undecided about using R, I say: pls give it a sincere try. It's a worthwhile investment. Learning in this course is best leveraged with a basic understanding of what a versatile analytics platform can do.
  • Pls share problems in the code or the data that you found, workarounds that you figured out, other packages you discovered that do the same thing better/faster etc on the blog as well. R is a community based platform and it draws its strength from a distributed user and developer base.
  • Those who are determined not to touch R are free to do so. No harm, no foul. Pls borrow the plots and tables from your peers, but interpret them yourself.
Dassit for now. Gotta get to work on tomorrow's session slides and associated blog-code.

Sudhir

5 comments:

  1. Dear Prof,

    In the Hyd example from here: http://marketing-yogi.blogspot.in/2012/11/interpreting-factor-analysis-results.html you were removing the factors where the variables were not loading heavily.

    In our HW 5, we have added a few variables along with the factor. Can you please explain the difference between the 2 approaches.

    Thanks

    ReplyDelete
    Replies
    1. Hi Anon,

      We'd discussed this aspect in the R tutorial yesterday. Wish more folks had attended.

      High uniqueness for a variable means its variance is not being explained (or captured) well by the factor solution. So drop these variables from the factor solution. But they remain informative and can be treated as stand-alone factors in their own right. So we append them to the factor solution obtained before proceeding to downstream analysis.

      Hope that clarifies.

      Sudhir

      Delete
    2. Thanks Prof for the clarification. This makes things clearer now. Sorry could not attend the session yesterday. We all are pressed for time due to assignments and 4 exams next week and Day 2 prep.

      Delete
  2. I think there is a small error in the code. In Q2 replace fit <- Mclust(dmat) with fit <- Mclust(mydata)

    Please let me know if I got this wrong.

    ReplyDelete
    Replies
    1. That is correct, dmat should be replaced with mydata. Shall ask Ankit to make the correction on LMS.

      Sudhir

      Delete
