Thursday, December 12, 2013

Session 5 Updates - Targeting

Hi all,

We'll quickly go over the targeting portion of the PDA case. Please ensure you're comfortable with the how and why of segmentation and targeting from the lecture slides before going ahead with this one. I will assume you know the contents of the slides well for what follows.

#----------------------------------------------#
##### PDA caselet from MEXL - Targeting #######
#----------------------------------------------#

rm(list = ls()) # clear workspace

# read in 'PDA case discriminant variables.txt'

mydata = read.table(file.choose(), header=TRUE)

head(mydata) # view top few rows of dataset

The last column, labeled 'memb', is the cluster membership assigned by mclust in the previous blog post.
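If you want a quick feel for how big each of these clusters is before modeling anything, a one-line tabulation of the membership column (something like the below) should do:

table(mydata$memb) # number of respondents in each cluster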

The purpose of targeting is to *predict*, with as much accuracy as feasible, a previously unknown customer's segment membership. Since we cannot make such predictions with certainty, what we obtain as output are probabilities of segment membership for each customer.

First, we must assess how much accuracy our targeting algorithm delivers. Many targeting algorithms have been developed and deployed for this purpose; we'll use the simplest and best known - the multinomial logit model.

To assess accuracy, we split the dataset *randomly* into a training dataset and a validation dataset. The code below does that (it uses 'test' in place of 'validation').

# build training and test samples using random assignment

# 65% of the sample is used for training, the rest for validation ('test')

train_index = sample(1:nrow(mydata), floor(nrow(mydata)*0.65))

train_data = mydata[train_index, ]

test_data = mydata[-(train_index), ]

# predictor matrices (columns 2 to 18) and segment-membership vectors for each sample

train_x = data.matrix(train_data[ , c(2:18)])

train_y = data.matrix(train_data[ , ncol(mydata)])

test_x = data.matrix(test_data[ , c(2:18)])

test_y = test_data[ , ncol(mydata)]
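As a quick sanity check on the split (not strictly required), you may want to verify the sample sizes and confirm that every segment shows up in both samples; something along these lines should work:

nrow(train_data); nrow(test_data) # sizes of the training and test samples

table(train_data$memb) # segment counts in the training sample

table(test_data$memb) # segment counts in the test sample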

And now we're ready to run the logit (from the package 'textir'). Ensure the package is installed, then just follow the code below.
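If textir isn't installed on your machine yet, a one-time install along these lines should do it:

install.packages("textir") # one-time install; library(textir) then loads it each session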

###### Multinomial logit using Rpackage textir ###

library(textir)

covars = normalize(mydata[ ,c(2,4,14)], s = sdev(mydata[ ,c(2,4,14)])); # scale these covariates using textir's normalize() and sdev()

dd = data.frame(cbind(memb=mydata$memb,covars,mydata[ ,c(3,5:13,15:18)]));

train_ml = dd[train_index, ];

test_ml = dd[-(train_index), ];

# fit the multinomial logit on the training sample

gg = mnlm(counts = as.factor(train_ml$memb), penalty = 1, covars = train_ml[ ,2:18]);

# predicted segment-membership probabilities for the holdout (test) sample

prob = predict(gg, test_ml[ ,2:18]);

head(prob);

You should see a table of probabilities as the result - one row per test respondent, one column per cluster.

To read the table, consider the first row. Each column in the first row shows the probability that the first respondent belongs to cluster 1 (column 1), to cluster 2 (column 2), and so on.
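As a quick check on this interpretation, the probabilities in each row should add up to (roughly) one; a line like this verifies that:

rowSums(head(prob)) # each row of probabilities should sum to about 1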

For convenience's sake, we merely assign each respondent to the cluster for which he/she has the maximum probability of belonging. Now we can compare how well our predicted memberships agree with the actual ones.

To see this, run the following code, which does exactly that assignment and then checks it against the observed memberships. Cross-tabulating observed versus predicted memberships yields what is called a 'confusion matrix': the diagonal cells represent correctly classified respondents and the off-diagonal cells the misclassified ones.

# assign each test respondent to his/her highest-probability cluster and
# flag whether the prediction matches the observed membership

pred = matrix(0, nrow(test_ml), 1);

accuracy = matrix(0, nrow(test_ml), 1);

for(j in 1:nrow(test_ml)){

pred[j, 1] = which.max(prob[j, ]);

if(pred[j, 1]==test_ml$memb[j]) {accuracy[j, 1] = 1}

}

mean(accuracy) # share of test respondents classified correctly
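The accuracy figure above summarizes just the diagonal of the confusion matrix; to see the full cross-tabulation of observed versus predicted memberships, a simple table() call on the vectors we just built should suffice:

table(observed = test_ml$memb, predicted = pred) # diagonal cells = correctly classified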

The mean accuracy of the algorithm appears to be 63% in my run. Yours may vary slightly due to the randomly allocated training and validation samples. This 63% accuracy compares very well indeed with the 25% average accuracy we would get if we depended merely on chance to allocate respondents to clusters.
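Where does that 25% benchmark come from? With four clusters and purely random, equal-probability assignment, we would expect to classify about one in four respondents correctly. A quick way to compute this baseline from the data itself is:

1/length(unique(mydata$memb)) # chance accuracy under random, equal-probability assignment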

That's it for now. For any queries, contact me over email or, better still, use the comments section below this post.

Sudhir
