Marketing Yogi: Session 4 - MDS, Factor Analysis: R code and HW

Update:

Received this email from ms Priyanka Guha:

Hello Professor,
I think there is a missing data in row 19 and column AJ of the second data sheet and so, I am getting an error while running the code. Can you please suggest how to go forward about that.

My response:

Hi Priyanka,
Yes, I just checked.
Pls replace it with the median for that column (3, I think). More generally, missing data is a part of most real world data sets. We usually impute the missing values provided there’s not too many of them. Means, medians etc are often used.

******************************************

Hi all,

Like I said in my opening statement in class today: Session 4 onwards MKTR gets heavy on R and tools and analysis.

Last post, I discussed, demo-ed and deconstructed Joint Space maps (JSMs). This post is dedicated to MDS (and if its not too long), factor analysis. So, w/o further ado, here goes:

5. MDS code

MDS or Multi-dimensional scaling is the way we analyze overall-similarity (OS) data. Recall that you rated your impression of overall similarity among 9 car brands in a series of paired comparisons. We use that data for MDS here. The input is required in a particular format and there's some data cleaning and aggregation required. Luckily, a simple R function now takes care of that problem.

Pls find the data for MDS and subsequently factor analysis input in this google spreadsheet.

This is what the code below is helping you do:

Read in data with headers from cells D38 to AM90 into 'inp'.
Then read in the brand names vector (cells C39 to C47) into 'names'.
Build a dissimilarity (or distance) matrix 'dmat'

# read in the dissimilarities matrix #
inp = read.table(file.choose(),header=TRUE)
# read in brand names vector #
names = read.table(file.choose())

# build distance matrix #
k = nrow(names)
dmat = matrix(0, k, k)

for (i1 in 1:(k-1)){
a1=grepl(names[i1,1], colnames(inp))
for (i2 in (i1+1):k){
a2=grepl(names[i2,1], colnames(inp))
a3 = a1*a2
a4 = match(1, a3)
dmat[i1, i2] = mean(inp[, a4])
dmat[i2, i1] = dmat[i1, i2]
}}

colnames(dmat) = names[,1]
rownames(dmat) = names[,1]

Click on the image to see what the distance matrix 'dmat' looks like. This is the average of the similarity perceptions for the whole class.

First, we do metric MDS. Metric MDS uses metric data as its input. Since you rated similarity-dissimilarity on a 1-7 interval scale, metric MDS will work just fine. Here's the code for metric MDS, just copy-paste onto the R console.

d = as.dist(dmat)
# Classical MDS into k dimensions#
fit <- cmdscale(d,eig=TRUE, k=2)
fit # view results

# plot solution #
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), type="p",pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)

This is my graphical MDS output. Click on image for larger size.

Suppose the similarity-dissimilarity judgments were in yes/no terms rather than in metric ratings. Then metric MDS becomes dicey to use as it relies on interval scaling assumptions. The more robust (but somewhat less efficient) nonmetric MDS then becomes the way to go. I would recommend using nonmetric MDS when in doubt, and certainly for MDS maps for individual respondents.

Nonmetric MDS is just as easy to run. However, make sure you have the MASS package installed before running it.

library(MASS)
d <- as.dist(dmat)
fit <- isoMDS(d, k=2)
fit # view results

# plot solution $
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), main="Nonmetric MDS", type="p", pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)

Click on image for larger size.

5a. Subset analyses with MDS

We can analyze select subsets of interest via MDS. I've defined a generic function 'dmat.subset' to do this with. It requires you to provide the brand names vector 'names', the input matrix 'inp' and a subset selection vector 'k1'. First, just copy-paste the following code onto the R console.

# dmat.subset func define #
dmat.subset <- function(names, inp, k1){

k = nrow(names)
dmat = matrix(0, k, k)
inp1 = inp[k1, ]

for (i1 in 1:(k-1)){
a1=grepl(names[i1,1], colnames(inp1))
for (i2 in (i1+1):k){

a2=grepl(names[i2,1], colnames(inp1))
a3 = a1*a2
a4 = match(1, a3)
dmat[i1, i2] = mean(inp1[, a4])
dmat[i2, i1] = dmat[i1, i2]

}}
colnames(dmat) = names[,1]
rownames(dmat) = names[,1]

dmat }

To select the second row in 'inp', simply define

k1 = 2

To select the second, fifth and tenth rows (say), write

k1 = c(2, 5, 10)

To select rows 2 to 5 and 22-to-39, write

k1 = c(2:5, 22:39)

And so on. Hopefully, the logic is clear enough now.

Suppose you want to use PGID to select. Then first read cells B39 to B90 into a dataset, say 'pgid'. Say you want to select PGID 61310363. Now do the following (useful from your HW point of view):

# read-in pgid vector
pgid = read.table(file.choose())

k1 = (pgid == 61310363) # define subset

Now, once having selected the subset of interest into 'k1', we can run MDS on it quite simply as:

# define dist matrix for the subset
dmat = dmat.subset(names, inp, k1)

### --- nonMetric MDS for individuals ---
library(MASS)
d <- as.dist(dmat)
fit <- isoMDS(d, k=2)
fit # view results

# plot solution $
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), main="Nonmetric MDS", type="p", pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=0.85, pos=1)
abline(h=0); abline(v=0)

This is what I got for pgid 363. Notice how an individual's MDS plot differs markedly from the overall average for the class. Which one would you trust more? Why (not)?

5b. HW2 for Session 4 on MDS

Pls use the data on the google spreadsheet mentioned in this post. I have also sent it separately in the worksheet labeled 'MDS HW' in the excel file emailed to you.

Your task is to:

Draw an MDS map using your own individual input.
Interpret what the axes mean and what you might have had at the back of your mind when you were evaluating the survey Qs.
Make a note of whether there are any attributes you had in mind which are not reflected in the MDS. And if so, what these attributes are.
Now choose any 2 other students and draw 2 MDS plots corresponding to each of these students. Take care that their maps should differ substantially from your own. Interpret what major attributes they may have had in mind when they filled in the survey.

The format for submission:

Paste the MDS output on PPT slides.
For each plot, write your interpetation of the plot using bullet points on the next slide.
Any queries, clarifications etc, pls use the comments box below to reach me.
Any help needed with R, pls reach out to the AA Mr Ankit Anand or to me.

And here's the point of it all, the value-add that will accrue once you successfully (and sincerely) navigate this Session4-HW2.

Good luck. And I hope you have fun doing the HW rather than push it in too serious a mood.

6. Factor analysis code

Update: This blog-post contains a detailed explanation of how to interpret factor analysis results using a HW dataset I'd given out in term 5, Hyd.

We will use exploratory (or 'common') factor analysis first on the toothpaste survey dataset. This dataset can be found from calls H4 to P34 in the google spreadsheet mentioned above. Need to install package 'nFactors' (R is case sensitive, always) for running the scree plot.

# read in the data #
mydata=read.table(file.choose(),header=TRUE)
mydata[1:5,]#view first 5 rows

# install the required package first
install.packages("nFactors")

# determine optimal no. of factors#
library(nFactors) # invoke library
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)

On the scree plot that appears, The green horizontal line represents the Eigenvalue=1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That is the optimal no. of factors. Here, it is 2. The plot looks intimidating as it is, hence, pls do not bother with any other color-coded information given - blue, black or green. Just stick to the instructions above.

k1=2 # set optimal no. of factors

If the optimal no. of factors changes when you use a new dataset, simply change the value of k1 in the line above. Copy paste the line onto a notepad, change it to 'k1=6' or whatever you get as optimal and paste onto R console. Rest of the code runs as-is.

# extracting k1 factors #
# with varimax rotation #
fit <- factanal(mydata, k1, scores="Bartlett",rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)

# plot factor 1 by factor 2 #
load <- fit$loadings[,1:2]
par(col="black")#black lines in plots
plot(load,type="p",pch=19,col="red") # set up plot
abline(h=0);abline(v=0)#draw axes
text(load,labels=names(mydata),cex=1,pos=1)
# view & save factor scores #
fit$scores[1:4,]#view factor scores

#save factor scores (if needed)
write.table(fit$scores, file.choose())

Click on image for larger size.

In case you are wondering how the variable loading onto factors looks like in R (after factor rotation), here is the relevant R console snapshot.

Clearly, the top 3 variables load onto factor 1 and the bottom 3 onto factor 2. That is all for now. See you soon.

6a. Factor Analysis Session 4-HW3.

Go to the 'HW3 Factor Analysis input data' worksheet. Read-in the dataset from cells B5 to L56. The annotation for the variables V1-V11 can be found in columns N and O.

Your task is to:

Run a factor analysis on the psychographics matrix
Use the scree plot to identify the optimal no. of factors
Assign variables to factors
Interpret what the factors may mean given the variables that load on them

Submission format:

Again, PPT format. leave one slide blank to separate HWs and HW3 in your PPT
Paste the scree plot and the actual factor table obtained
Write your interpretation of the factors as bullet points on a fresh slide

I will go over the factor analysis part again in class and in the tutorial. In particular, we will cover what is factor analysis, what are factor loadings and factor scores, how to use factor scores in downstream analysis and so on.

Sudhir

14 comments:

SSDecember 12, 2012 at 5:34 PM
Sir,
Can we just use the Distance matrix that you have put up on the blog, as I am getting an error with the Distance Matrix as well(for the Car Data)
Thankyou
SSDecember 12, 2012 at 8:17 PM
Professor,
Excerpts from the blog above(Factor Analysis Homework)-

This dataset can be found from calls H4 to P34 in the google spreadsheet mentioned above.
Should it not be K4 to P34??
Regards
SSDecember 12, 2012 at 8:39 PM
Professor,
Is the following a part of the homework as well?
6a. Factor Analysis Session 4-HW3.
What data set is it referring to?Where is it? 6a. Factor Analysis Session 4-HW3.

Go to the 'HW3 Factor Analysis input data' worksheet. Read-in the dataset from cells B5 to L56. The annotation for the variables V1-V11 can be found in columns N and O.??

Thankyou
AnonymousDecember 17, 2012 at 9:16 AM
6a. Factor Analysis Session 4-HW3???

Prof, can you please make the labeling of the ques a little bit easier e.g. We can just have a simple PDF with q1, q2, q3 instead of 6a-4-HW3.

Sorry if I am nitpicking here, but we would want to ideally spend more time on the content of the assignment rather than trying to decipher structure of the Hw.

Thanks.
Priyanka PrakashDecember 18, 2012 at 12:14 PM
Hi Professor

I'm getting the below error:

> d <- as.dist(mydata)
Warning message:
In as.dist.default(mydata) : non-square matrix
> fit <- isoMDS(d, k=2)
Error in cmdscale(d, k) :
length of 'dimnames' [1] not equal to array extent
In addition: Warning message:
In cmdscale(d, k) :
number of items to replace is not a multiple of replacement length

Do suggest the steps required.

-Priyanka
AnonymousDecember 18, 2012 at 12:40 PM
Hi Professor,

Even after replacing "mydata" with "dmat" I faced the following error:

> fit <- isoMDS(d, k=2)
Error in isoMDS(d, k = 2) :
an initial configuration must be supplied with NA/Infs in 'd'

Please guide.

Thanks,
Roohi
AnonymousDecember 19, 2012 at 12:15 AM
Hi Professor,

While installing the nFactor package i am getting the following error:

package ‘nFactors’ is not available (for R version 2.15.2)

Kindly, suggest which version i should install..

Constructive feedback appreciated. Please try to be civil, as far as feasible. Thanks.

Marketing Yogi

Wednesday, December 5, 2012

Session 4 - MDS, Factor Analysis: R code and HW

14 comments:

Blog Archive