Monday, November 5, 2012

A better way to do Discriminant Analysis

Update: Folks, don't worry about the discriminant part for the HW. Consider it not part of the session 5 HW. I ought not to have included it when I didn't cover it in depth. Sorry about the confusion.

Hi all,

I am attaching below the R code for performing discriminant analysis using a multinomial logit model. A quick background to all this:

  • After segmentation, comes targeting.
  • In targeting, the main goal is to *predict* which segment a given customer may belong to given some easily observed traits of that customer (which we call 'discriminant' variables). Typically these used to be demographic variables, but increasingly we see behavioral and transactions-based variables becoming discriminants.
  • Traditionally, '(linear) discriminant analysis' was performed to see which discriminant variables were significant predictors of segment membership. However, this process is messy and hard to interpret. Since the 1980s, this method has been overtaken by discrete choice models - notably the Logit model.
  • We shall use a particular variant of the Logit model called the 'multinomial Logit' to perform discriminant analysis. The code for the same is given below.
  • I was planning on covering Logit in session 8 for secondary data analysis, but might as well introduce it now.
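For reference, the multinomial logit models the probability that respondent i belongs to segment j as a softmax over linear scores of the discriminant variables, with one segment's coefficients fixed at zero as the reference level (this is the standard MNL form; the notation here is mine, not from the session slides):

```latex
P(\text{segment}_i = j \mid x_i) \;=\; \frac{\exp(x_i' \beta_j)}{\sum_{k=1}^{K} \exp(x_i' \beta_k)},
\qquad \beta_1 = 0 \ \text{(reference segment)}
```

A significantly positive element of beta_j for some discriminant variable means that variable shifts probability mass toward segment j, relative to the reference segment.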

To demonstrate this example, I am taking your Term 4 course ratings dataset available on this google spreadsheet.

  • I run the dataset through Mclust using the 4 attribute ratings and the preference ratings (20 variables in all) as my basis variables. Mclust says a 5-cluster solution is optimal. I save the segment allocations.
  • I have also attached a set of 4 discriminant variables starting at cell X5 of the spreadsheet. I have shaded the cells grey to highlight them. The segment classification is the first column of this discriminant dataset.
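For completeness, the segmentation step above looks roughly like this. This is a sketch assuming the mclust package is installed; 'basis' here is random stand-in data, not the actual ratings, and the object names are mine:

```r
# sketch of the Mclust segmentation step (toy stand-in data, not the real ratings)
library(mclust)
set.seed(1)
basis = matrix(rnorm(60 * 4), 60, 4)  # stand-in for the basis-variable matrix
fit   = Mclust(basis)                 # BIC-based selection of no. of clusters
segment = fit$classification          # save the segment allocations
table(segment)                        # how many respondents per segment
```

On the real data, `fit$classification` is the column I pasted into the spreadsheet as the first column of the discriminant dataset.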
We are finally ready. We read this data in and process it using the 'mlogit' package in R.

First, read in the data and run some basic summaries. Note that no column header in the data contains blank spaces. The segment variable should always be the first column.

##
## --- using mlogit for Discriminant ---
##

# first read-in data
# ensure segment membership is the first column

discrim = read.table(file.choose(), header=TRUE)
dim(discrim); discrim[1:4,]
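If you don't have the spreadsheet handy, a small mock dataset with the same layout lets you run the rest of the code. The column names below are my assumption, taken from the model formula used later in this post; the values are made up:

```r
# mock discriminant data: segment membership first, then the predictors
# (column names assumed from the mlogit formula used below; values made up)
set.seed(1)
n = 20
discrim = data.frame(
  segment    = sample(1:5, n, replace = TRUE),  # Mclust segment labels
  female     = rbinom(n, 1, 0.4),               # 1 = female
  engineer   = rbinom(n, 1, 0.6),               # 1 = engineering background
  workex_yrs = round(runif(n, 0, 8), 1)         # years of work experience
)
dim(discrim); discrim[1:4, ]
```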

Now, the data will need to be reformatted into the 'long' shape that multinomial logit (MNL) estimation expects: one row per respondent per segment alternative.

# now reformat data for MNL input
# each respondent becomes k1 rows, one per segment alternative

k1 = max(discrim[,1]) # no. of segments there are
test = NULL
for (i0 in 1:nrow(discrim)){
  chid = NULL
  test0 = matrix(0, k1, ncol(discrim))
  for (i1 in 1:k1){
    test0[i1, 1] = (discrim[i0, 1] == i1)  # 1 if this is the chosen segment
    chid = rbind(chid, cbind(i0, i1))      # respondent id, alternative id
    for (i2 in 2:ncol(discrim)){ test0[i1, i2] = discrim[i0, i2] }  # copy predictors
  }
  test = rbind(test, cbind(test0, chid))
} # i0 ends
colnames(test) = c(colnames(discrim), "chid", "altvar")
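To see what this expansion produces, here is the same loop run on a tiny two-respondent, three-segment example (all data values invented for illustration; I use separate object names so it doesn't overwrite your real `discrim` and `test`):

```r
# toy check of the long-format expansion: 2 respondents, 3 segments
demo = data.frame(segment = c(2, 3), female = c(1, 0), workex_yrs = c(4, 2))
k1d = max(demo[, 1])
demo_long = NULL
for (i0 in 1:nrow(demo)) {
  chid = NULL
  t0 = matrix(0, k1d, ncol(demo))
  for (i1 in 1:k1d) {
    t0[i1, 1] = (demo[i0, 1] == i1)   # choice indicator
    chid = rbind(chid, cbind(i0, i1))
    for (i2 in 2:ncol(demo)) { t0[i1, i2] = demo[i0, i2] }
  }
  demo_long = rbind(demo_long, cbind(t0, chid))
}
colnames(demo_long) = c(colnames(demo), "chid", "altvar")
demo_long  # 6 rows: 2 respondents x 3 alternatives
```

Each respondent contributes exactly one row with segment = 1 (the segment actually chosen), and the predictor values are repeated across that respondent's rows.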

Now we set up and run the MNL. I will interpret the results after that.

# setup data for mlogit
library(mlogit)
test1a = data.frame(test)
attach(test1a)
test1 = mlogit.data(test1a, choice = "segment", shape = "long", id.var = "chid", alt.var = "altvar")

# run mlogit
summary(mlogit(segment ~ 0|female+engineer+workex_yrs, data = test1))

That was it. The simple formula "segment ~ 0|female+engineer+workex_yrs" runs a rather complex discrete choice model. Notice how the dependent variable (segment) and the independent predictors (female, engineer, workex_yrs) have been placed and used.
This is the result I got (the output table appears as an image in the original post). How to read it:

  • The dependent variable consists of 5 values - membership to segments 1 to 5. 'Frequency of alternatives' in the result above gives how often they occurred.
  • Look at the 'Coefficients:' table in the results image above. It gives the coefficient estimate, std error and p-value (significance) for each of the discriminant variables. Thus the coefficient for '2:female' gives the parameters for how a person being female affects their probability of membership to segment 2. And so on.
  • Membership to segment 1 is the "reference level" and all coefficients for it are set to zero. All other coefficients are relative to this zero reference level.
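To make the 'relative to the reference level' point concrete, here is how a set of linear scores translates into membership probabilities for one respondent. The numbers are invented for illustration, not taken from my output:

```r
# hypothetical linear scores x'beta_k for segments 1..5 (values made up);
# segment 1 is the reference level, so its score is fixed at 0
score = c(0, 0.8, -0.5, 0.2, 0.1)
prob  = exp(score) / sum(exp(score))  # multinomial logit probabilities
round(prob, 3)
```

A positive score relative to the reference raises that segment's membership probability; if all scores were zero, every segment would be equally likely.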
  • Our set of discriminant variables was not a good one because most are not significant in their ability to predict psychographic segment membership.
  • McFadden's R-squared is a fit metric analogous to the regular R-squared. It says the model explains only about 6% of the variation in segment membership.
We will cover Logit models with a smaller and easier example in Session 8.

That's it for now from me. Ciao.

Sudhir

2 comments:

  1. Hello Professor
    When I'm running the code for MNL input I'm getting the following error
    Error in 1:nrow(data) : argument of length 0
    > colnames(test) = c(colnames(discrim), "chid", "altvar")
    Error in `colnames<-`(`*tmp*`, value = c("segment", "female", "engineer", :
    attempt to set colnames on object with less than two dimensions

    Please help me in this regard

    Replies
    1. Hi 'Unknown',

      Let me look into this. I've received a few
      other queries about it as well. I'll do after Thursday, right now am swamped with prepping Session 8.

      Sudhir

      P.S.
Pls write your name after your comment, for politeness' sake. Thanks.

