Hi all,
This is the code for classwork MEXL example "Conglomerate's PDA". This is the roadmap for what we are going to do:
- First we segment the customer base using model-based clustering (the mclust package), the recommended method.
- Then we randomly split the dataset into training and test samples. The test sample is about one-third of the original dataset in size, following accepted practice.
- Then, using the training sample, we establish how the discriminant variables relate to segment membership. This is where the targeting algorithm learns that relationship.
- Then comes the real test: validating algorithm performance on the test dataset. We compare prediction accuracy across the traditional and proposed methods.
- Since this is R, there are many targeting algorithms to choose from. I have decided to go with one that has shown good promise of late: the randomForest algorithm. Where we saw decision trees in Session 5, think now of 'decision forests', in a sense...
- Other available algorithms that we can run (provided there is popular demand) are artificial neural nets (multi-layer perceptrons) and support vector machines. But for now, these are not part of this course.
1. Segment the Customer Base
To read in the data, save the 'basis' and 'discrim' text files I sent you by email and use them directly. Then ensure you have the packages 'mclust' and 'cluster' installed before running the clustering code.
# read in basis and discrim variables
basis = read.table(file.choose(), header = TRUE)
dim(basis); basis[1:3, ]
summary(basis)

discrim = read.table(file.choose(), header = TRUE)
dim(discrim); discrim[1:3, ]
summary(discrim)

# run segmentation on the basis variables
library(mclust) # invoke library
fit <- Mclust(basis) # run mclust
fit # view result

classif = fit$classification
# print cluster sizes
for (i1 in 1:max(classif)){print(sum(classif == i1))}

# cluster plot against 1st 2 principal components
require(cluster)
fit1 = cbind(classif)
rownames(fit1) = rownames(basis)
clusplot(basis, fit1, color = TRUE, shade = TRUE, labels = 2, lines = 0)
2. Split the Dataset into Training & Test Samples
I wrote a function that randomly splits the dataset into training and test samples. Copy-paste it as a block into the R console and we can then invoke it peacefully anywhere downstream in this R session.
# function to split dataset into training and test samples
# (tip: call set.seed() first if you want a reproducible split)
split.data <- function(data){ # func begins
  # split sample into 2/3rds training and 1/3rd test
  train.index = sample(1:nrow(data), round((2/3)*nrow(data)))
  a2 = 1:nrow(data)
  test.index = a2[!(a2 %in% train.index)]
  train = data[train.index, ]
  test = data[test.index, ]
  outp = list(train, test)
  outp
} # func ends

# run split.data func on our data
data = cbind(basis, discrim, classif)
outp = split.data(data)

# create training and test datasets
n1 = ncol(basis); n2 = ncol(data)
basis.training = outp[[1]][, 1:n1]
discrim.training = outp[[1]][, (n1+1):(n2-1)]
y.training = outp[[1]][, n2]
y.test = outp[[2]][, n2]
discrim.test = outp[[2]][, (n1+1):(n2-1)]
3. Train the Targeting Algorithm
Before proceeding further, a quick note on what random forests are and how they relate to decision trees. Recall what decision trees are (loosely speaking): 'classifiers' of sorts that pick, at each step, the X variable that best splits the data with respect to a given Y variable into 2 or more branches. For instance, in our car example in class, we saw that our Y variable 'Mileage' was best split first by 'Price' and then by 'Type'. We call this procedure 'recursive partitioning', and it is repeated at each node (or branch ending).
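To jog the memory, here is a minimal single-tree sketch. I am assuming the 'cu.summary' data that ships with rpart for the car example; if the in-class file differed, substitute it.

# a single regression tree, for intuition only
# (cu.summary is a built-in rpart dataset with Mileage, Price and Type)
library(rpart)
fit.tree <- rpart(Mileage ~ Price + Type, data = cu.summary, method = "anova")
plot(fit.tree); text(fit.tree) # inspect which variable splits first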
For a small set of X variables, individual decision trees are fine. But what if we had X variables numbering in the dozens, scores, even hundreds? To handle this, instead of one decision tree, imagine we had 1000s of trees, each grown on a random subset of the original variable set (and, typically, a bootstrap sample of the rows). We then summarize fit and variable importance across all the trees in the forest. That, in short, is the random forest algorithm: a forest of decision trees that lets us parallelize some recursive partitioning tasks, keep track of variable importance and predictive fit, and thereby tackle large datasets efficiently. Again, this is just an intuitive explanation, not rigorous in a technical sense, so don't take it too literally.
Ensure you have the library 'randomForest' installed before trying the code below.
### --- using the randomForest algo --- ###
library(randomForest)
test.rf <- randomForest(factor(y.training) ~ ., data = discrim.training,
                        xtest = discrim.test, ytest = factor(y.test),
                        ntree = 3000, proximity = TRUE, importance = TRUE)
print(test.rf)
The output is as straightforward to read as the input was to run. Let me summarize in bullet points for ease of interpretation.
- The output tells us that the Type of run was "Classification", implying that a regression-based run for continuous data is very much possible (recall we did the regression-based decision tree only for the Mileage example).
- The no. of variables tried at each split (mtry) defaults, for classification runs, to roughly the square root of the number of X variables. With our 17 discriminant variables that works out to 4, which happens to coincide with our 4 segments (see the quick check after this list).
- The confusion matrix is a cross-tab between the predicted and the observed classifications. The higher the numbers along the diagonal, the better. Interestingly, we have an in-sample accuracy of ~62% but a test-sample predictive accuracy of ~72%! Usually it's the other way round.
- Does a mere 72% accuracy seem too low for such a much-touted algorithm? Aah, well, let's see. First, random guessing across 4 classes would give us only ~25% accuracy.
- Second, there were only 107 training cases for 17 discriminant variables. If you try any parametric model, say a Logit, on this, the chances the model will give up midway are quite high indeed. And even if it goes through, don't expect much by way of variable significance in the results. Rather than just talk about it, let me show you an application of the multinomial Logit on this dataset shortly.
- But what about variable significance (or 'importance') in random forests, first? Sure, we're going there next.
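A quick way to verify the mtry and chance-accuracy claims above in your own session, assuming the test.rf object from Step 3 (a quick add-on check, not part of the classwork code proper):

# default mtry for a classification forest is floor(sqrt(p))
floor(sqrt(ncol(discrim.training))) # 17 X variables, so 4
test.rf$mtry # the mtry actually used in the run above
# chance-level baseline accuracy across 4 roughly equal segments
1/length(unique(y.training)) # ~0.25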
4. Interpreting Random Forest Output
Copy-paste the following code to get variable-importance tables and an associated plot. The actual versus predicted segment-membership tables can also be obtained using the code below.
# list the importance of the variables
rn <- round(importance(test.rf), 2)
rn[order(rn[, (ncol(rn)-1)], decreasing = TRUE), ]
varImpPlot(test.rf) # plot variable importance

# --- predictions ---
# observed and predicted y, side by side
yhat.train = cbind(y.training, test.rf$predicted) # in-sample
yhat.test = cbind(y.test, test.rf$test[[1]]) # test sample
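As a small add-on (not part of the classwork code proper), the accuracy figures quoted earlier can be reproduced directly from the confusion matrices that the randomForest object stores:

# overall accuracy from the stored confusion matrices
# (drop the trailing class.error column before summing)
conf.train = test.rf$confusion[, 1:(ncol(test.rf$confusion)-1)]
sum(diag(conf.train))/sum(conf.train) # in-sample accuracy
conf.test = test.rf$test$confusion[, 1:(ncol(test.rf$test$confusion)-1)]
sum(diag(conf.test))/sum(conf.test) # test-sample accuracy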
5. A Neural-Net Based Logit Model
The Q may well arise: how do I know how good the proposed new algo really is unless I run it against something known and traditional? So, how about we benchmark it against the well-established, parametric logistic regression model? Since our Y has 4 levels, we would ordinarily go for the multinomial Logit model (in the R package 'mlogit').
Let me tell you what happens when I try to replicate the above in mlogit. The program throws error messages galore. It says that with 160 rows and 17 variables we get a matrix close to singular (expected, BTW). So I try reducing the number of X variables. The problem persists. Finally, I give up.
And turn to artificial neural-nets.
Turns out that the R package 'nnet' implements a multinomial Logit on an underlying neural net. The function multinom() is the key here. The code below will invoke, execute and help interpret multinomial Logit results for our PDA dataset.
### --- doing multinom for Targeting --- ###
library(nnet) # use neural nets to fit a MNL!
library(car) # for Anova func call

# build the multinomial logit model
a1 <- multinom(formula = y.training ~ ., data = discrim.training,
               trace = FALSE, maxit = 1000)

# for inference
summary(a1, Wald.ratios = TRUE)
cat('==== ANOVA ====')
print(Anova(a1))
### model fit in training sample ###
a3 = round(a1$fitted.values, 3) # probability table

# build predicted Y values vector (pick the max-probability class per row)
yhat = matrix(0, nrow(a3), 1)
for (i1 in 1:nrow(a3)){
  for (i2 in 1:ncol(a3)){
    if (a3[i1, i2] == max(a3[i1, ])) {yhat[i1, 1] = i2}
  }
}

# confusion matrix, in-sample
table(y.training, yhat)

# predictive accuracy, in-sample
sum(diag(table(y.training, yhat)))/nrow(discrim.training)
The test dataset is still there. How well does the neural net do on it? It won't necessarily trump random forests here just because it did so in the training sample.
# build predicted Y values vector for the test dataset
yhat1 = predict(a1, cbind(y.test, discrim.test), type = "class")

# confusion matrix for multinom test dataset
table(y.test, yhat1)

# predictive accuracy
sum(diag(table(y.test, yhat1)))/nrow(discrim.test)
Well, Dassit for now. See you in class tomorrow.
Sudhir
Dear Professor,
I am not able to find the dataset for today's class, not even on LMS. Could you please guide me on the file name, or send it through email before class?
Kruti
Sent over email.
Sudhir
Dear Professor,
I am not able to install the package 'randomForest'.
I am using the code uploaded on LMS. Kindly suggest what I should do.
Thanks,
Vipul
Well, use results obtained from any of your peers in that case. Interpretation should be an individual activity. I'm assuming this is for the HW submission thing.
Else, one can download the package binary for Windows and 'install from local zip file' in the Packages pull-down menu.
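For those more comfortable with the console, something like the below should also work (the file path is purely illustrative; point it at wherever you saved the downloaded zip):

install.packages("C:/Downloads/randomForest.zip", repos = NULL, type = "win.binary")
library(randomForest) # verify the package now loads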
Hope that helped.
Sudhir
These lessons are great; I hope you don't mind someone not in your class asking a question. What if you have an observation that gets classified as part of one of your clusters (because it will be forced into one no matter what) but really shouldn't be considered part of that cluster? Is there a way to measure the probability of it actually belonging to that cluster?
For instance, if I have 2 classes, "white males age 25" and "black males age 45", and I predict on an observation that in reality is an "asian male age 65", it will be forced into one of the 2 classes. Is there a way to obtain a "confidence" for each predicted observation besides just the votes generated by RF? Maybe the answer isn't in RF, but could you point me to a methodology for this sort of thing? Thanks!
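For what it's worth, both packages used in this post expose per-observation probabilities that speak to this question. A minimal sketch, assuming the fit and test.rf objects from the post are in the session:

fit$z # mclust: posterior probability of each cluster, per observation
fit$uncertainty # mclust: 1 minus the largest posterior, per observation
head(test.rf$test$votes) # randomForest: per-class vote fractions on the test data

Neither fully answers the 'should belong to no cluster' case, but these are the standard starting points in the two packages.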