Below is the R code for performing discriminant analysis using the multinomial logit model. Some quick background to all this:
- After segmentation, comes targeting.
- In targeting, the main goal is to *predict* which segment a given customer may belong to given some easily observed traits of that customer (which we call 'discriminant' variables). Typically these used to be demographic variables, but increasingly we see behavioral and transactions-based variables becoming discriminants.
- Traditionally, '(linear) discriminant analysis' was performed to see which discriminant variables were significant predictors of segment membership. However, this process is messy and its output is hard to interpret. Since the 1980s, this method has been overtaken by discrete choice models, notably the Logit model.
- We shall use a particular variant of the Logit model called the 'multinomial Logit' to perform discriminant analysis. The code for the same is given below.
- I was planning on covering Logit in session 8 for secondary data analysis, but might as well introduce it now.
To demonstrate, I am using your Term 4 course ratings dataset, available on this Google spreadsheet.
- I run the dataset through Mclust using the 4 attribute ratings and the preference ratings (20 variables in all) as my basis variables. Mclust says a 5 cluster solution is optimal. I save the segment allocations.
- I have also attached a set of 4 discriminant variables starting at cell X5 of the spreadsheet. I have shaded the cells grey to highlight them. The segment classification is the first column of this discriminant dataset.
First, read in the data and run some basic summaries. Note that no column header in the data should contain blank spaces, and the segment variable should always be the first column.
    ##
    ## --- using mlogit for Discriminant ---
    ##

    # first read-in data
    # ensure segment membership is the first column
    discrim = read.table(file.choose(), header=TRUE)
    dim(discrim); discrim[1:4,]
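A few quick checks at this point can catch input problems before the reshaping step. The sketch below is only illustrative: `check_discrim` is a hypothetical helper (not part of the code in this post), and the toy data frame uses made-up values. Its column names are assumed to match those in the model formula further down.

```r
# Hypothetical helper (not in the original post): basic checks on the
# discriminant data before it is reshaped for MNL.
check_discrim <- function(d) {
  # must be a non-empty data frame with a segment column plus predictors
  stopifnot(is.data.frame(d), nrow(d) > 0, ncol(d) >= 2)
  seg <- d[[1]]
  # segment labels are assumed to be whole numbers 1, 2, ..., k
  stopifnot(is.numeric(seg), !anyNA(seg), all(seg == round(seg)), min(seg) >= 1)
  # column names should contain no blanks
  stopifnot(!any(grepl(" ", colnames(d), fixed = TRUE)))
  invisible(TRUE)
}

# Toy stand-in for the real data (made-up values, for illustration only)
toy <- data.frame(segment = c(1, 3, 2, 1),
                  female = c(0, 1, 1, 0),
                  engineer = c(1, 1, 0, 0),
                  workex_yrs = c(2, 0, 4, 3))
check_discrim(toy)
```

On the real data the call would be `check_discrim(discrim)`. If it stops with an error, the file was probably not read in the expected layout.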
Now, the data will need to be reformatted as multinomial logit (MNL) input.
    # now reformat data for MNL input
    k1 = max(discrim[,1])   # no. of segments there are
    test = NULL
    for (i0 in 1:nrow(discrim)){
      chid = NULL
      test0 = matrix(0, k1, ncol(discrim))
      for (i1 in 1:k1){
        test0[i1, 1] = (discrim[i0, 1] == i1)
        chid = rbind(chid, cbind(i0, i1))
        for (i2 in 2:ncol(discrim)){
          test0[i1, i2] = discrim[i0, i2]
        }
      }
      test = rbind(test, cbind(test0, chid))
    } # i0 ends
    colnames(test) = c(colnames(discrim), "chid", "altvar")
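To see what the reshaping produces, here is a minimal vectorized sketch of the same idea: each person's row is repeated once per segment, and the first column becomes a 0/1 flag marking the segment that person actually belongs to. The function name `to_long` and the toy data are my own illustrations, not part of the original code, and the sketch assumes segments are numbered 1 to k.

```r
# Hypothetical vectorized version of the reshaping step (illustration only).
# Assumes the first column of d holds segment labels 1, 2, ..., k.
to_long <- function(d) {
  n <- nrow(d)
  k <- max(d[[1]])
  chid <- rep(seq_len(n), each = k)      # person index, repeated k times
  altvar <- rep(seq_len(k), times = n)   # candidate segments 1..k for each person
  long <- d[chid, , drop = FALSE]        # copy each person's row k times
  # 1 on the row whose candidate segment is the person's actual segment
  long[[1]] <- as.numeric(d[[1]][chid] == altvar)
  long$chid <- chid
  long$altvar <- altvar
  rownames(long) <- NULL
  long
}

# Made-up toy data: 3 people, 3 segments
toy <- data.frame(segment = c(1, 3, 2),
                  female = c(0, 1, 1),
                  workex_yrs = c(2, 0, 4))
long_toy <- to_long(toy)
long_toy
```

With 3 people and 3 segments the result has 9 rows, one per person-segment pair, which is the "long" shape that mlogit expects.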
Now we set up and run the MNL. I will interpret the results after that.
    # setup data for mlogit
    library(mlogit)
    test1a = data.frame(test)
    attach(test1a)
    test1 = mlogit.data(test1a, choice = "segment", shape = "long", id.var = "chid", alt.var = "altvar")

    # run mlogit
    summary(mlogit(segment ~ 0|female+engineer+workex_yrs, data = test1))
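As a cross-check, the same kind of model can also be fitted without the long-format reshaping by using `multinom` from the `nnet` package, which takes one row per person. To be clear, this is a swapped-in alternative and not the mlogit approach used in this post. The data below are randomly generated purely for illustration, so the estimates themselves mean nothing.

```r
library(nnet)  # a "recommended" package that normally ships with R

# Made-up toy data, for illustration only (not the course ratings data)
set.seed(42)
n <- 90
toy <- data.frame(
  segment = factor(sample(1:3, n, replace = TRUE)),  # first level (1) is the reference
  female = rbinom(n, 1, 0.4),
  engineer = rbinom(n, 1, 0.6),
  workex_yrs = round(runif(n, 0, 6))
)

# One row per person; no chid/altvar columns needed
fit <- multinom(segment ~ female + engineer + workex_yrs, data = toy, trace = FALSE)

coef(fit)  # one row of coefficients per non-reference segment

# Predicted membership probabilities for the first few people
probs <- predict(fit, newdata = head(toy), type = "probs")
probs
```

Each row of `probs` sums to 1 across the segments, which is what you would use in targeting to assign a new customer to their most likely segment.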
This is the result I got:
Here is how to read the output table above:
- The dependent variable consists of 5 values - membership to segments 1 to 5. 'Frequency of alternatives' in the result above gives how often they occurred.
- Look at the 'Coefficients:' table in the results image above. It gives the coefficient estimate, std error and p-value (significance) for each of the discriminant variables. Thus the coefficient for '2:female' captures how a person being female shifts their chances of belonging to segment 2 rather than the reference segment. And so on.
- Membership to segment 1 is the "reference level" and all coefficients for it are set to zero. All other coefficients are relative to this zero reference level.
- Our set of discriminant variables was not a good one because most are not significant in their ability to predict psychographic segment membership.
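- 'McFadden's R-square' is a pseudo R-square, a fit metric loosely analogous to the regular R-square. It compares the model's log-likelihood with that of an intercept-only model rather than measuring variance explained. At about 0.06 here, it says the discriminant variables improve the fit only slightly.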
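For reference, the points above can be written out compactly. This is the standard textbook form of the multinomial logit with segment 1 as the reference level. The symbols are my own notation and are not printed in the mlogit output: $x_i$ is customer $i$'s vector of discriminant variables (including the intercept), $\beta_j$ is the coefficient vector for segment $j$, and $\ln L$ denotes a log-likelihood.

```latex
% Segment membership probabilities, with segment 1 as the reference level
P(\text{segment}_i = j \mid x_i)
  = \frac{\exp(x_i^\top \beta_j)}{\sum_{m=1}^{5} \exp(x_i^\top \beta_m)},
  \qquad \beta_1 = 0

% Each coefficient is a shift in log-odds relative to segment 1
\ln \frac{P(\text{segment}_i = j \mid x_i)}{P(\text{segment}_i = 1 \mid x_i)}
  = x_i^\top \beta_j, \qquad j = 2, \dots, 5

% McFadden's pseudo R-square: model vs. intercept-only log-likelihood
R^2_{\text{McFadden}} = 1 - \frac{\ln L_{\text{model}}}{\ln L_{\text{intercept-only}}}
```

So a positive '2:female' coefficient, for example, would mean that being female raises the log-odds of belonging to segment 2 rather than segment 1.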
That's it for now from me. Ciao.
Sudhir
Hello Professor
When I'm running the code for MNL input I'm getting the following error:
Error in 1:nrow(data) : argument of length 0
> colnames(test) = c(colnames(discrim), "chid", "altvar")
Error in `colnames<-`(`*tmp*`, value = c("segment", "female", "engineer", :
attempt to set colnames on object with less than two dimensions
Please help me in this regard
Hi 'Unknown',
Let me look into this. I've received a few other queries about it as well. I'll do so after Thursday; right now I am swamped with prepping Session 8.
Sudhir
P.S.
Pls write your name after your comment, for politeness' sake. Thanks.