Tuesday, December 11, 2012

Targeting in R - Session 6 PDA Caselet

Update: NYT Video link for social media as a focus group

*******************************

Hi all,

This is the code for the classwork MEXL example "Conglomerate's PDA". Here is the roadmap for what we are going to do:

  • First we segment the customer base using model-based clustering (mclust), the recommended method.
  • Then we randomly split the dataset into training and test samples. The test sample is about one-third of the original dataset in size, following accepted practice.
  • Then, using the training sample, we establish how the discriminant variables relate to segment membership. This is where we train the Targeting algorithm.
  • Then comes the real test - validate algorithm performance on the test dataset. We compare prediction accuracy across traditional and proposed methods.
  • Since we are working in R, there are many targeting algorithms to choose from. I have decided to go with one that has shown good promise of late - the randomForest algorithm. Where we saw decision trees in Session 5, think now of 'decision forests', in a sense...
  • Other available algorithms that we can run (provided there is popular demand) are artificial neural nets (multi-layer perceptrons) and Support vector machines. But for now, these are not part of this course.
So without further ado, let me start right away.

1. Segment the Customer Base

To read in the data, save the 'basis' and 'discrim' notepad files I have sent you by email and use them directly. Then ensure you have the packages 'mclust' and 'cluster' installed before running the clustering code.
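If any of these (or the packages used later in this post) are missing, a one-line install takes care of it. This line is my addition to the emailed instructions; run it once and comment it out thereafter.

# install once if needed; randomForest, nnet and car are used later in this post
install.packages(c("mclust", "cluster", "randomForest", "nnet", "car"))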

# read-in basis and discrim variables
basis = read.table(file.choose(), header=TRUE)
dim(basis); basis[1:3,]
summary(basis)

discrim = read.table(file.choose(), header=TRUE)
dim(discrim); discrim[1:3,]
summary(discrim)

# Run segmentation on the basis variables
library(mclust) # invoke library
fit <- Mclust(basis) # run mclust
fit # view result
classif = fit$classification
# print cluster sizes
for (i1 in 1:max(classif)){print(sum(classif==i1))}
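(As an aside, and not part of the original code: table() gives the same cluster-size counts in a single call.)

table(classif) # cluster sizes in one line, equivalent to the loop above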

# Cluster Plot against 1st 2 principal components
require(cluster)
fit1=cbind(classif)
rownames(fit1)=rownames(basis)
clusplot(basis, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)
The segmentation finds 4 clusters to be optimal. Below is the clusplot; interestingly, despite our using 15 basis variables, we see decent separation among the clusters in the first 2 principal components alone.
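If you want to see how mclust settled on 4 segments, here is a minimal sketch (my addition, using standard mclust accessors) that shows the chosen model and the BIC values behind that choice.

# how mclust picked the number of segments
summary(fit) # best model and number of mixture components
fit$G # optimal number of components (4 here)
plot(fit, what = "BIC") # BIC across models and cluster counts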

2. Split dataset into Training & Test samples

I wrote a function that randomly splits the dataset into training and test samples. Copy-paste it as a block into the R console and we can then invoke it peacefully downstream anywhere in this R session.

# Function to split dataset into training and test samples

split.data <- function(data){ # func begins
  train.index = sample(1:nrow(data), round((2/3)*nrow(data)))
  a2 = 1:nrow(data)
  test.index = a2[!(a2 %in% train.index)]

  ### split sample into 2/3rd training and 1/3rd test
  train = data[train.index,]
  test = data[test.index,]
  outp = list(train, test)
  outp } # func ends

# run split.data func on our data
data = cbind(basis, discrim, classif)
outp = split.data(data) # outp[[1]] is the training sample, outp[[2]] the test sample

# create training and test datasets
n1 = ncol(basis); n2 = ncol(data)
basis.training = outp[[1]][,1:n1]
discrim.training = outp[[1]][,(n1+1):(n2-1)]
y.training = outp[[1]][,n2]
y.test = outp[[2]][,n2]
discrim.test = outp[[2]][,(n1+1):(n2-1)]
There, now we have the training and test samples defined both for basis and for discriminant matrices. Time to train the machine on the training dataset next.
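One small aside before we move on (my addition, not part of the original code): because split.data() calls sample(), the split, and hence all the accuracy numbers downstream, will differ from run to run. Fixing the random seed before splitting makes your results reproducible.

# optional: fix the seed (12345 is an arbitrary choice) before splitting
set.seed(12345)
outp = split.data(data)
# then re-create the training and test objects as above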

3. Train the Targeting Algorithm

Before proceeding further, a quick note on what random forests are and how they relate to decision trees. Recall what decision trees are (loosely speaking) - 'classifiers' of sorts that pick the variable that best splits a given Y variable into 2 or more branches. For instance, in our car example in class, we saw that our Y variable 'Mileage' was best split first by 'Price' and then by 'Type'. We call this procedure 'recursive partitioning' and it is repeated at each node (or branch ending).

For a small set of X variables, individual decision trees are fine. But what if we had X variables numbering in the dozens, scores, even hundreds? To handle this, instead of one decision tree, imagine we grow 1000s of trees, each built on a random subset of the data and considering only a random subset of the variables at each split. We then aggregate predictions across all the trees (by majority vote, for classification) and summarize variable importance and fit across the whole forest. That, in short, is the random forest algorithm. A forest of decision trees that lets us parallelize the recursive-partitioning work, keep track of variable importance and predictive fit, and thereby tackle large datasets efficiently. Again, this is just an intuitive explanation, not rigorous in a technical sense, so don't take it too literally.

Ensure you have library 'randomForest' installed before trying the below code.

### --- using the randomForest algo ---- ###
library(randomForest)

test.rf <- randomForest(factor(y.training) ~ .,
                        data = discrim.training,
                        xtest = discrim.test, ytest = factor(y.test),
                        ntree = 3000, proximity = TRUE, importance = TRUE)
print(test.rf)

That was it. If no error showed up, the algo has executed successfully. Those few lines of code invoked a rather powerful classification algorithm that finds use in myriad scientific explorations today. Fortunately for us, it is right up our street in the Targeting arena. Let us proceed further and see how the algo has performed on in-sample and out-of-sample prediction. This is the output I got on executing the above code:

The output is as straightforward to read as the input was to run. Let me summarize in bullet points for ease of interpretation.
  • The output tells us that the Type of run was "Classification", implying that a regression-based run for continuous data is very much possible (recall we did the regression-based decision tree only for the Mileage example).
  • The no. of variables tried at each split (mtry, in randomForest parlance) defaults to roughly the square root of the number of predictors; with 17 discriminant variables that works out to 4. That it matches our 4 segments here is a coincidence.
  • The confusion matrix is a cross-tab between the predicted and the observed classifications. The higher the numbers along the diagonal, the better. Interestingly, we have an in-sample accuracy of ~62% but a test-sample predictive accuracy of ~72%! Usually it's the other way round. (A short sketch after this list shows how to pull these numbers straight from the fitted object.)
  • Does a mere 72% accuracy seem too low for such a much-touted algorithm? Aah, well, let's see. First, random guessing across 4 classes would give us only ~25% accuracy.
  • Second, there were only 107 training cases for 17 discriminant variables. If you try any parametric model, say a Logit, on this, the chances the model will give up midway are quite high indeed. And even if it goes through, don't expect much by way of variable significance in the results. Rather than just talk, let me show you an application of the multinomial Logit on this dataset shortly.
  • But what about variable significance (or 'importance') in random Forests, first? Sure, we're going there next.
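Before that, here is a minimal sketch (my addition, not in the original post) for pulling the two confusion matrices and accuracy numbers straight from the fitted object, in case the printed output has scrolled past. Strictly speaking, what I loosely called 'in-sample' accuracy above is the out-of-bag (OOB) estimate that randomForest reports.

# confusion matrices stored on the fitted object (last column is class.error)
oob.tab = test.rf$confusion[, 1:(ncol(test.rf$confusion)-1)]
sum(diag(oob.tab))/sum(oob.tab) # OOB ('in-sample') accuracy, ~62% here

test.tab = test.rf$test$confusion[, 1:(ncol(test.rf$test$confusion)-1)]
sum(diag(test.tab))/sum(test.tab) # test-sample accuracy, ~72% here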

4. Interpreting Random Forest Output

Copy-paste the following code to get the variable importance table and an associated plot. The actual versus predicted segment membership tables can also be obtained using the code below.

# List the importance of the variables, sorted by mean decrease in accuracy
rn <- round(importance(test.rf), 2)
rn[order(rn[, (ncol(rn)-1)], decreasing=TRUE),]
varImpPlot(test.rf) # plot variable importance

# -- predictions ---
# observed and predicted y outputted
yhat.train = cbind(y.training, test.rf$predicted)# insample
yhat.test = cbind(y.test, test.rf$test[[1]])# test sample
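A quick usage note (my addition): since each yhat matrix pairs the observed and predicted segment labels column-wise, a simple cross-tab recovers the confusion matrices directly.

table(observed = yhat.train[,1], predicted = yhat.train[,2]) # in-sample
table(observed = yhat.test[,1], predicted = yhat.test[,2]) # test sample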

The code above also produces a variable importance plot. For targeting, focus on the top few variables in the left-hand panel (mean decrease in accuracy).
That was it. We used a sophisticated Targeting algorithm on our decidedly tiny and simple data application. This algo scales up to much larger and more complex applications, BTW.

5. A neural-net based Logit Model

The Q may well arise: how do I know how good the proposed new algo really is unless I benchmark it against something known and traditional? So, how about we compare it with the well-established, parametric logistic regression? Since our Y has 4 levels, we would ordinarily go for the multinomial Logit model (in the R package 'mlogit').

Let me tell you what happens when I try to replicate the above in mlogit. The program throws error messages galore. It says that with 160 rows and 17 variables we get a matrix that is close to singular (expected, BTW). So I try reducing the number of X variables. The problem persists. Finally I give up.

And turn to artificial neural-nets.

Turns out that the R package 'nnet' implements a multinomial Logit on an underlying neural net. The function multinom() is the key here. The code below will invoke, execute and help interpret multinomial Logit results for our PDA dataset.

### --- doing multinom for Targeting --- ###
library(nnet) # use neural nets to fit a MNL!
library(car) # for Anova func call
# Build a Regression model.
a1 <- multinom(formula = y.training ~ ., data = discrim.training, trace = FALSE, maxit = 1000)

# For inference
summary(a1, Wald.ratios=TRUE)
cat('==== ANOVA ====')
print(Anova(a1))

This is what I get from the ANOVA table:
Notice right off how variable significance seems to be such a problem. Well, in predictive analytics, we're more concerned with how well the variables as a group are able to predict segment membership than how significant any particular variable is. Hence, the block of code that follows gives us how well neural nets are able to predict both within sample (on the training dataset) and on the test dataset.

### Model fit in training sample ###

a3 = round(a1$fitted.values,3) # probability table

# build predicted Y values vector
yhat = matrix(0, nrow(a3), 1)
for (i1 in 1:nrow(a3)){ for (i2 in 1:ncol(a3)){
if (a3[i1, i2] == max(a3[i1,])) {yhat[i1,1] = i2}}}

# confusion matrix for insample
table(y.training, yhat)

# predictive accuracy
sum(diag(table(y.training, yhat)))/nrow(discrim.training)

This is what I got:
Wow or what? Neural nets netted us 76% accuracy in training-sample prediction. In contrast, we got only some 62% accuracy with random forests. Does that mean neural nets are better than random forests in this application? Well, hold on.

The test dataset is still there. How well does the neural net do there? It won't necessarily trump random forests here just because it did so in the training sample.

# build predicted Y values vector for test dataset
yhat1 = predict(a1, cbind(y.test, discrim.test), type="class")
# confusion matrix for multinom test dataset
table(y.test, yhat1)

# predictive accuracy
sum(diag(table(y.test,yhat1)))/nrow(discrim.test)

See what I mean? Neural nets get 64% accuracy compared to 72% for random forests on the test dataset. I'd say, in any given app, try both. Use whichever gives you the higher test-dataset prediction accuracy.
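If you want the two test-set accuracies side by side for that comparison, here is a small sketch (my addition, reusing the objects created above; note that cbind() stored the random forest's predicted factor as integer codes 1-4, which coincide with the segment labels).

# test-set accuracy: random forest vs multinom
rf.acc = sum(yhat.test[,1] == yhat.test[,2])/nrow(yhat.test)
mnl.acc = sum(diag(table(y.test, yhat1)))/nrow(discrim.test)
c(randomForest = rf.acc, multinom = mnl.acc)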

Well, Dassit for now. See you in class tomorrow.

Sudhir

5 comments:

  1. Dear Professor,

    I am not able to find the data set for today's class not even on LMS.. could you please guide on the file name or send through email before class?
    Kruti

  2. Dear Professor,

    I am not able to install the package - randomForest.

    I am using the code uploaded on LMS. Kindly, suggest what i should do.

    Thanks,
    Vipul

    1. Well, use the results obtained from any of your peers in that case. Interpretation should be an individual activity. I'm assuming this is for the HW submission thing.

      Else, one can download package binary for Windows and 'install from local zip file' in the Packages pull-down menu.

      Hope that helped.

      Sudhir

  3. These lessons are great, I hope you don't mind someone not in your class asking a question. What if you have an observation that gets classified as being part of one of your clusters (because it will be forced into one no matter what) but really it shouldn't be considered as being part of that cluster? Is there a way to measure probability of actually belonging to that cluster?

    For instance, if I have 2 classes "white males age 25" and "black males age 45". Then I predict on an observation that in reality is an "asian male age 65" it will have to force that observation into one of the 2 classes. Is there a way to obtain a "confidence" for each predicted observation besides just the votes generated by RF? Maybe the answer isn't in RF, but could you point me to a methodology for this sort of thing? Thanks!

