Sunday, December 9, 2012

Targeting Tools in R - Decision Trees

Hi all,

Today's second module - Targeting - is crucial in MKTR and underlies much of the predictive analytics work we're currently engaged in. It is also a lengthy one and so will likely spill over into part of Wednesday's class. Just FYI.

1. Decision Trees

First off, a quick drive-by on what they are and why they matter in MKTR. This is how Wikipedia defines decision trees:

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

OK, lots of fancy verbiage there. Perhaps an example will illustrate better. Many cognitive-logical decisions can be represented algorithmically in a tree-like structure. For example - "Should we enter market A or not?" Imagine two paths out of this question - one saying 'yes' and the other 'no'. Each of these paths (or 'branches') can then be split further into more branches - say, 'cost' and 'benefit' - and so on, until a decision is reachable.

Essentially, we are trying to partition the dataset along the branch network that best explains variation in Y - our chosen variable of interest.
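To make that concrete, here is a minimal toy sketch (simulated data, not from the module) showing how a single split on an X variable can soak up most of the variation in Y:

# toy illustration with simulated data
set.seed(1)
x <- runif(100)
y <- ifelse(x > 0.5, 10, 2) + rnorm(100)
var(y)                      # total variation in Y
tapply(y, x > 0.5, var)     # variation within each branch is far smaller

A tree-growing algorithm essentially searches over candidate splits like 'x > 0.5', picks the one that shrinks this within-branch variation the most, and then repeats the search within each branch.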

2. Building a simple Decision Tree

The dataset used is an internal R dataset, 'cu.summary'. No need to read it in - it comes bundled with the rpart package. Exhibit 3 in the handout gives some rows of the dataset.

# view dataset summary #
library(rpart)
data(cu.summary)
dim(cu.summary)        # number of rows and columns
summary(cu.summary)    # variable-by-variable summary
cu.summary[1:3,]       # peek at the first three rows
Running this prints the dataset's dimensions and a variable-by-variable summary.

Now make sure you have installed the 'rpart' package before doing the rest.
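If it isn't installed yet, a one-time install from CRAN (assuming an internet connection and the default mirror) looks like this:

install.packages("rpart")   # one-time install from CRAN
library(rpart)              # load it in each new session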

# Regression Tree Example

# grow tree: predict Mileage from the other four variables
fit <- rpart(Mileage ~ Price + Country + Reliability + Type,
             method = "anova", data = cu.summary)

summary(fit)   # detailed summary of splits
The summary of results gives a lot of things. Note the formula used, the number of splits (nsplit), the variable importance (on a constant sum of 100) and plenty of detail on each node. Of course, we'll be plotting the nodes, so no problemo.
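If you want to dig into the fit without wading through the full summary, rpart has a few handy accessors. A short sketch, using the same fit object as above (the cp value is just an illustrative choice; in practice you'd pick the cp that minimizes the cross-validated error in the printcp table):

printcp(fit)                # complexity parameter table: one row per split
plotcp(fit)                 # cross-validated error vs. tree size
fit$variable.importance     # raw importance scores (summary rescales these to sum to 100)

# optionally prune back to a smaller tree at a chosen cp value
fit.pruned <- prune(fit, cp = 0.05)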

OK. Time to plot the tree itself. Just copy-paste the code below.

par(col="black", mfrow=c(1,1))
# create attractive postscript plot of tree
# (file = "" draws on the current device instead of writing a .ps file)
post(fit, file = "", title = "Regression Tree for Mileage", cex = 0.7)
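If post() gives you trouble, plain base-graphics plotting of an rpart tree works too - a quick sketch:

plot(fit, uniform = TRUE, margin = 0.1)   # draw the tree skeleton
text(fit, use.n = TRUE, cex = 0.7)        # label nodes; use.n adds observation counts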

These algorithms follow certain rules - start rules, stop rules etc. - and essentially operate by optimizing some split criterion. For classification trees this is often an impurity measure such as informational entropy or the Gini index; for a regression tree like ours, rpart picks the split that most reduces the sum of squared errors. However, in some critical applications, one may want to go beyond optimizing such a criterion and say, well, I want to assess statistical significance for a node split into branches. In such delicate situations, nonparametric conditional inference trees ride to the rescue.
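To make the entropy criterion concrete, here is a tiny illustrative function (not part of rpart) that scores a node's class distribution - a pure node scores 0 bits, a 50:50 node scores 1 bit:

# informational entropy of a vector of class proportions (illustrative only)
entropy <- function(p) -sum(p[p > 0] * log2(p[p > 0]))
entropy(c(1, 0))        # pure node: 0 bits
entropy(c(0.5, 0.5))    # maximally mixed node: 1 bit

Ensure you have the 'party' package installed before trying the following: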

##--- nonparametric conditional inference ---
library(party)

fit2 <- ctree(Mileage ~ Price + Country + Reliability + Type,
              data = na.omit(cu.summary))   # ctree is fit on complete cases here

plot(fit2)
So when should you go for one (rpart) or the other (party)? Traditional decision trees are fine in most applications and are also more intuitive to explain and communicate, so that is probably what you want to stick with; reach for conditional inference trees when you need formal significance tests behind each split.
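If you're curious how much the two trees actually disagree, here's a quick sketch comparing their fitted values on the same complete-case data (the object names are mine):

dat <- na.omit(cu.summary)                # same complete cases ctree was fit on
pred.rpart <- predict(fit, newdata = dat)
pred.ctree <- as.numeric(predict(fit2))
cor(pred.rpart, pred.ctree)               # how closely the two trees' predictions agree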

3. Multivariate Decision Trees

For this variation on decision trees, copy data from cells A1 to I52 in this google spreadsheet. This is data from your preference ratings for the term 4 offerings. Your 4-dimensional preference vector now becomes the Y variable here.

The task now is to partition the dataset along its demographic variables to best explain the distribution of the preference vector. Ensure you have the package 'mvpart' installed before trying the following code.

##--- multivariate decision trees ---
data = read.table(file.choose(), header = TRUE)   # pick the file you saved the copied cells into

library(mvpart)
# columns 1:4 hold the 4-dimensional preference vector (the multivariate Y)
mvfit <- mvpart(data.matrix(data[,1:4]) ~ workex + Gender + Engg + Masters + MktgMajor, data)
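Since mvpart builds on the rpart machinery, the fitted object can be inspected much like before - a sketch, assuming the mvfit name I assigned above:

printcp(mvfit)   # complexity table for the multivariate tree
summary(mvfit)   # per-node detail, as with the univariate fit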

A couple of points as I sign off.

  • The algos we executed today with just a few copy-pastes of generic R code are actually quite sophisticated and very powerful - and now available to you on a platter, thanks to R.
  • Implementations of business solutions based on these very algos can easily cost upwards of thousands of USD per installation. Those savings are very real, especially if yours is a small firm or startup. I hope you appreciate what this means.
  • The applications are more important than the coding. The interpretation, the context and the understanding of what a tool is meant to do matter more than merely getting code to run.
  • My effort is to expose you to the possibilities, so that when opportunity arises, you will be able to connect tool to application and extract insight in the process.

Well, this is it for now from me. The other targeting-based algos - random forests and neural nets, alongside good old Logit - will come next session. See you in class tomorrow.

Sudhir
