Today's second module - Targeting - is crucial in MKTR and underlies much of the predictive-analytics work we're currently engaged in. It is also a lengthy one and so will likely spill over into part of Wednesday's class. Just FYI.
1. Decision Trees
First off, a quick drive-by on what they are and why they matter in MKTR. This is how Wikipedia defines decision trees:
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.
OK, lots of fancy verbiage there. Perhaps an example can illustrate better. Many cognitive-logical decisions can be represented in an algorithmic form with a tree-like structure. For example - "Should we enter market A or not?" Imagine two paths out of this question - one saying 'yes' and the other 'no'. Each of these paths (or 'branches') can then be further split into more branches, say, 'cost' and 'benefit', and so on until a decision is reached.
Essentially, we are trying to partition the dataset along the branch network that best explains variation in Y - our chosen variable of interest.
2. Building a simple Decision Tree
The dataset used is an internal R dataset, 'cu.summary' (it ships with the 'rpart' package). No need to read it in; it's pre-loaded once the package is attached. Exhibit 3 in the handout gives some rows of the dataset.
# view dataset summary
library(rpart)
data(cu.summary)
dim(cu.summary)
summary(cu.summary)
cu.summary[1:3,]
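If you want a slightly deeper look before fitting anything, a couple of base-R helpers (my addition, not in the handout) show the variable types and where the missing values sit - both of which matter for tree fitting:

# quick structural look: variable types and missingness
str(cu.summary)              # factor vs numeric columns
colSums(is.na(cu.summary))   # NA counts per column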
Now make sure you have installed the 'rpart' package (via install.packages("rpart")) before running the code above and the rest of this section.
# Regression Tree Example
# grow tree
fit <- rpart(Mileage ~ Price + Country + Reliability + Type,
       method = "anova", data = cu.summary)
summary(fit) # detailed summary of splits
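Once the tree is grown, rpart can also tell you whether it is over-grown. A minimal sketch using rpart's own complexity-parameter tools follows; the cp value of 0.05 below is just an illustrative pick, not a recommendation - read it off the cp table for your own fit.

printcp(fit)   # cross-validated error at each complexity level
plotcp(fit)    # visualize the cp table
fit.pruned <- prune(fit, cp = 0.05)   # prune back using a cp from the table above
print(fit.pruned)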
OK. Time to plot the tree itself. Just copy-paste the code below.
par(col = "black", mfrow = c(1, 1))
# create attractive postscript plot of tree
post(fit, file = "", title = "Regression Tree for Mileage", cex = 0.7)
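If post() gives trouble on your machine, base graphics offer a plainer fallback with the same information:

# basic plot of the same tree
plot(fit, uniform = TRUE, margin = 0.1)
text(fit, use.n = TRUE, cex = 0.7)   # label nodes and show observation counts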
These algorithms follow certain rules - start rules, stop rules, etc. - and essentially operate by maximizing some criterion. For the anova method we used here, that criterion is the reduction in the sum of squared errors achieved by each candidate split (for classification trees, impurity measures such as entropy or the Gini index play the same role); a toy sketch of this search appears right after the next code block. However, in some critical applications, one may want to go beyond a greedy split criterion and say, well, I want to assess statistical significance for a node split into branches. In such delicate situations, nonparametric conditional decision trees ride to the rescue. Ensure you have the 'party' package installed before trying the following:
##--- nonparametric conditional inference ---
library(party)
fit2 <- ctree(Mileage ~ Price + Country + Reliability + Type,
        data = na.omit(cu.summary))
plot(fit2)
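And here is the toy sketch promised above, to make the 'maximizing some criterion' idea concrete: try every cut-point on one predictor and keep the one that most reduces the sum of squared errors. This is my hand-rolled illustration of the principle, not rpart's actual internal code.

# toy search for the best single split of Mileage on Price
d <- na.omit(cu.summary[, c("Mileage", "Price")])
sse <- function(y) sum((y - mean(y))^2)   # within-node sum of squares
cuts <- sort(unique(d$Price))
cuts <- cuts[-length(cuts)]               # candidate cut-points
gain <- sapply(cuts, function(cutpt) {
  left  <- d$Mileage[d$Price <= cutpt]
  right <- d$Mileage[d$Price >  cutpt]
  sse(d$Mileage) - (sse(left) + sse(right))   # reduction in SSE from this split
})
cuts[which.max(gain)]   # best single cut-point on Price by SSE reduction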
3. Multivariate Decision Trees
For this variation in decision trees, copy the data from cells A1 to I52 of this Google spreadsheet. This is data from your preference ratings for the term 4 offerings. Your 4-dimensional preference vector now becomes the Y variable here.
The task now is to partition the dataset along the demographic variables that best explain the distribution of the preference vector. Ensure you have the 'mvpart' package installed before trying the following code.
##--- multivariate decision trees ---
data <- read.table(file.choose(), header = TRUE)
library(mvpart)
mvpart(data.matrix(data[, 1:4]) ~ workex + Gender + Engg + Masters + MktgMajor, data)
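One practical caveat: the mvpart package was later archived from CRAN, so install.packages("mvpart") may fail on newer R versions. Assuming you have the 'remotes' helper package, one workaround is to install from the CRAN archive (the version number below is per the archive at the time of writing; treat it as an assumption):

# fallback install from the CRAN archive if mvpart is unavailable
# install.packages("remotes")   # if not already installed
remotes::install_version("mvpart", version = "1.6-2")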
A couple of points as I sign off.
- The algos we executed today with just a few copy-pastes of generic R code are actually quite sophisticated and very powerful, and they're now available to you on a platter thanks to R.
- Implementations of business solutions based on these very algos can easily cost upwards of thousands of USD per installation. Those savings are very real, especially if yours is a small firm or startup. I hope you appreciate what this means.
- The applications are more important than the coding. The interpretation, the context, and the understanding of what a tool is meant to do are more important than merely getting code to run.
- My effort is to expose you to the possibilities so that when opportunity arises, you will be able to connect tool to application and extract insight in the process.
Well, this is it for now from me. The targeting-based algos - random forests and neural nets, alongside good old logit - will come next session. See you in class tomorrow.
Sudhir