Wednesday, December 5, 2012

Session 4 - MDS, Factor Analysis: R code and HW

Update:

Received this email from ms Priyanka Guha:

Hello Professor,

I think there is a missing data in row 19 and column AJ of the second data sheet and so, I am getting an error while running the code. Can you please suggest how to go forward about that.

My response:
Hi Priyanka,

Yes, I just checked.

Pls replace it with the median for that column (3, I think). More generally, missing data is a part of most real world data sets. We usually impute the missing values provided there’s not too many of them. Means, medians etc are often used.

.

******************************************

Hi all,

Like I said in my opening statement in class today: Session 4 onwards MKTR gets heavy on R and tools and analysis.

Last post, I discussed, demo-ed and deconstructed Joint Space maps (JSMs). This post is dedicated to MDS (and if its not too long), factor analysis. So, w/o further ado, here goes:

5. MDS code

MDS or Multi-dimensional scaling is the way we analyze overall-similarity (OS) data. Recall that you rated your impression of overall similarity among 9 car brands in a series of paired comparisons. We use that data for MDS here. The input is required in a particular format and there's some data cleaning and aggregation required. Luckily, a simple R function now takes care of that problem.

Pls find the data for MDS and subsequently factor analysis input in this google spreadsheet.

This is what the code below is helping you do:

  • Read in data with headers from cells D38 to AM90 into 'inp'.
  • Then read in the brand names vector (cells C39 to C47) into 'names'.
  • Build a dissimilarity (or distance) matrix 'dmat'
# read in the dissimilarities matrix #
inp = read.table(file.choose(),header=TRUE)
# read in brand names vector #
names = read.table(file.choose())

# build distance matrix #
k = nrow(names)
dmat = matrix(0, k, k)

for (i1 in 1:(k-1)){
a1=grepl(names[i1,1], colnames(inp))
for (i2 in (i1+1):k){
a2=grepl(names[i2,1], colnames(inp))
a3 = a1*a2
a4 = match(1, a3)
dmat[i1, i2] = mean(inp[, a4])
dmat[i2, i1] = dmat[i1, i2]
}}

colnames(dmat) = names[,1]
rownames(dmat) = names[,1]
Click on the image to see what the distance matrix 'dmat' looks like. This is the average of the similarity perceptions for the whole class.

First, we do metric MDS. Metric MDS uses metric data as its input. Since you rated similarity-dissimilarity on a 1-7 interval scale, metric MDS will work just fine. Here's the code for metric MDS, just copy-paste onto the R console.

d = as.dist(dmat)
# Classical MDS into k dimensions#
fit <- cmdscale(d,eig=TRUE, k=2)
fit # view results

# plot solution #
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", main="Metric MDS", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), type="p",pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)
This is my graphical MDS output. Click on image for larger size.

Suppose the similarity-dissimilarity judgments were in yes/no terms rather than in metric ratings. Then metric MDS becomes dicey to use as it relies on interval scaling assumptions. The more robust (but somewhat less efficient) nonmetric MDS then becomes the way to go. I would recommend using nonmetric MDS when in doubt, and certainly for MDS maps for individual respondents.

Nonmetric MDS is just as easy to run. However, make sure you have the MASS package installed before running it.

library(MASS)
d <- as.dist(dmat)
fit <- isoMDS(d, k=2)
fit # view results


# plot solution $
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), main="Nonmetric MDS", type="p", pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=1.1, pos=1)
abline(h=0); abline(v=0)
Click on image for larger size.

5a. Subset analyses with MDS

We can analyze select subsets of interest via MDS. I've defined a generic function 'dmat.subset' to do this with. It requires you to provide the brand names vector 'names', the input matrix 'inp' and a subset selection vector 'k1'. First, just copy-paste the following code onto the R console.

# dmat.subset func define #
dmat.subset <- function(names, inp, k1){

k = nrow(names)
dmat = matrix(0, k, k)
inp1 = inp[k1, ]

for (i1 in 1:(k-1)){
a1=grepl(names[i1,1], colnames(inp1))
for (i2 in (i1+1):k){

a2=grepl(names[i2,1], colnames(inp1))
a3 = a1*a2
a4 = match(1, a3)
dmat[i1, i2] = mean(inp1[, a4])
dmat[i2, i1] = dmat[i1, i2]

}}
colnames(dmat) = names[,1]
rownames(dmat) = names[,1]

dmat }

To select the second row in 'inp', simply define

k1 = 2

To select the second, fifth and tenth rows (say), write

k1 = c(2, 5, 10)

To select rows 2 to 5 and 22-to-39, write

k1 = c(2:5, 22:39)

And so on. Hopefully, the logic is clear enough now.

Suppose you want to use PGID to select. Then first read cells B39 to B90 into a dataset, say 'pgid'. Say you want to select PGID 61310363. Now do the following (useful from your HW point of view):

# read-in pgid vector
pgid = read.table(file.choose())

k1 = (pgid == 61310363) # define subset

Now, once having selected the subset of interest into 'k1', we can run MDS on it quite simply as:

# define dist matrix for the subset
dmat = dmat.subset(names, inp, k1)

### --- nonMetric MDS for individuals ---
library(MASS)
d <- as.dist(dmat)
fit <- isoMDS(d, k=2)
fit # view results

# plot solution $
x <- fit$points[,1]
y <- fit$points[,2]
plot(x, y, xlab="Coordinate 1", ylab="Coordinate 2", xlim = c(floor(min(x)), ceiling(max(x))), ylim = c(floor(min(y)), ceiling(max(y))), main="Nonmetric MDS", type="p", pch=19, col="red")
text(x, y, labels = rownames(fit$points), cex=0.85, pos=1)
abline(h=0); abline(v=0)
This is what I got for pgid 363. Notice how an individual's MDS plot differs markedly from the overall average for the class. Which one would you trust more? Why (not)?

5b. HW2 for Session 4 on MDS

Pls use the data on the google spreadsheet mentioned in this post. I have also sent it separately in the worksheet labeled 'MDS HW' in the excel file emailed to you.

Your task is to:

  • Draw an MDS map using your own individual input.
  • Interpret what the axes mean and what you might have had at the back of your mind when you were evaluating the survey Qs.
  • Make a note of whether there are any attributes you had in mind which are not reflected in the MDS. And if so, what these attributes are.
  • Now choose any 2 other students and draw 2 MDS plots corresponding to each of these students. Take care that their maps should differ substantially from your own. Interpret what major attributes they may have had in mind when they filled in the survey.

The format for submission:

  • Paste the MDS output on PPT slides.
  • For each plot, write your interpetation of the plot using bullet points on the next slide.
  • Any queries, clarifications etc, pls use the comments box below to reach me.
  • Any help needed with R, pls reach out to the AA Mr Ankit Anand or to me.

And here's the point of it all, the value-add that will accrue once you successfully (and sincerely) navigate this Session4-HW2.

Good luck. And I hope you have fun doing the HW rather than push it in too serious a mood.

6. Factor analysis code

Update: This blog-post contains a detailed explanation of how to interpret factor analysis results using a HW dataset I'd given out in term 5, Hyd.

We will use exploratory (or 'common') factor analysis first on the toothpaste survey dataset. This dataset can be found from calls H4 to P34 in the google spreadsheet mentioned above. Need to install package 'nFactors' (R is case sensitive, always) for running the scree plot.

# read in the data #
mydata=read.table(file.choose(),header=TRUE)
mydata[1:5,]#view first 5 rows

# install the required package first
install.packages("nFactors")

# determine optimal no. of factors#
library(nFactors) # invoke library
ev <- eigen(cor(mydata)) # get eigenvalues
ap <- parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)

On the scree plot that appears, The green horizontal line represents the Eigenvalue=1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That is the optimal no. of factors. Here, it is 2. The plot looks intimidating as it is, hence, pls do not bother with any other color-coded information given - blue, black or green. Just stick to the instructions above.

k1=2 # set optimal no. of factors
If the optimal no. of factors changes when you use a new dataset, simply change the value of k1 in the line above. Copy paste the line onto a notepad, change it to 'k1=6' or whatever you get as optimal and paste onto R console. Rest of the code runs as-is.

# extracting k1 factors #
# with varimax rotation #
fit <- factanal(mydata, k1, scores="Bartlett",rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)

# plot factor 1 by factor 2 #
load <- fit$loadings[,1:2]
par(col="black")#black lines in plots
plot(load,type="p",pch=19,col="red") # set up plot
abline(h=0);abline(v=0)#draw axes
text(load,labels=names(mydata),cex=1,pos=1)
# view & save factor scores #
fit$scores[1:4,]#view factor scores

#save factor scores (if needed)
write.table(fit$scores, file.choose())
Click on image for larger size.

In case you are wondering how the variable loading onto factors looks like in R (after factor rotation), here is the relevant R console snapshot.

Clearly, the top 3 variables load onto factor 1 and the bottom 3 onto factor 2. That is all for now. See you soon.

6a. Factor Analysis Session 4-HW3.

Go to the 'HW3 Factor Analysis input data' worksheet. Read-in the dataset from cells B5 to L56. The annotation for the variables V1-V11 can be found in columns N and O.

Your task is to:

  • Run a factor analysis on the psychographics matrix
  • Use the scree plot to identify the optimal no. of factors
  • Assign variables to factors
  • Interpret what the factors may mean given the variables that load on them

Submission format:

  • Again, PPT format. leave one slide blank to separate HWs and HW3 in your PPT
  • Paste the scree plot and the actual factor table obtained
  • Write your interpretation of the factors as bullet points on a fresh slide

I will go over the factor analysis part again in class and in the tutorial. In particular, we will cover what is factor analysis, what are factor loadings and factor scores, how to use factor scores in downstream analysis and so on.

Sudhir


14 comments:

  1. Sir,
    Can we just use the Distance matrix that you have put up on the blog, as I am getting an error with the Distance Matrix as well(for the Car Data)
    Thankyou

    ReplyDelete
    Replies
    1. HI SS<

      OK. However if you show me a screenshot of the error, I could guide you around it.

      While you are free to borrow plots from your peers (or the blog), I'd say try to run the code a few times yourself. If its not working, then fine, don't burn too much time over it.

      I'm presently in Hyd (returning Sunday) but Ankit's available in Mohali for any R related assistance.

      Sudhir

      Delete
  2. Professor,
    Excerpts from the blog above(Factor Analysis Homework)-

    This dataset can be found from calls H4 to P34 in the google spreadsheet mentioned above.
    Should it not be K4 to P34??
    Regards

    ReplyDelete
    Replies
    1. This refers to the excel file sent over email to you on 6-dec. Pls check email.

      Sudhir

      Delete
  3. Professor,
    Is the following a part of the homework as well?
    6a. Factor Analysis Session 4-HW3.
    What data set is it referring to?Where is it? 6a. Factor Analysis Session 4-HW3.

    Go to the 'HW3 Factor Analysis input data' worksheet. Read-in the dataset from cells B5 to L56. The annotation for the variables V1-V11 can be found in columns N and O.??

    Thankyou

    ReplyDelete
    Replies
    1. Hi SS,

      Yes, there was an excel file 'S4 HW data.xls' sent over email to the class on 06-dec at 10.59 am. The above refers to the third worksheet in that file.

      Hope that clarifies.

      Sudhir

      Delete
  4. 6a. Factor Analysis Session 4-HW3???

    Prof, can you please make the labeling of the ques a little bit easier e.g. We can just have a simple PDF with q1, q2, q3 instead of 6a-4-HW3.

    Sorry if I am nitpicking here, but we would want to ideally spend more time on the content of the assignment rather than trying to decipher structure of the Hw.

    Thanks.

    ReplyDelete
    Replies
    1. Hi Anon,

      Sounds like a good idea re the numbering of the questions. The blog is written the way it is because there are clear sub-modules within a session.

      It should be simple enough matter to collect all the Qs in one place. However, I'd rather people read their way through the blog and arrived at the Qs so that the context and point of the Qs is more clear.

      I agree on the Q numbering part though. Will work on it.

      Sudhir

      Delete
  5. Hi Professor

    I'm getting the below error:

    > d <- as.dist(mydata)
    Warning message:
    In as.dist.default(mydata) : non-square matrix
    > fit <- isoMDS(d, k=2)
    Error in cmdscale(d, k) :
    length of 'dimnames' [1] not equal to array extent
    In addition: Warning message:
    In cmdscale(d, k) :
    number of items to replace is not a multiple of replacement length

    Do suggest the steps required.

    -Priyanka

    ReplyDelete
    Replies
    1. Sorry, that should be 'dmat' and not 'mydata'. Pls try now. I';ll update the blog.

      Sudhir

      Delete
  6. Hi Professor,

    Even after replacing "mydata" with "dmat" I faced the following error:

    > fit <- isoMDS(d, k=2)
    Error in isoMDS(d, k = 2) :
    an initial configuration must be supplied with NA/Infs in 'd'

    Please guide.

    Thanks,
    Roohi

    ReplyDelete
    Replies
    1. Hi Roohi,

      Its hard to debug from a distance. Are you around in school right now? Could you drop by with your laptop? I'm in 114, ext 1714. Let me know.

      Sudhir

      Delete
  7. Hi Professor,

    While installing the nFactor package i am getting the following error:

    package ‘nFactors’ is not available (for R version 2.15.2)

    Kindly, suggest which version i should install..

    ReplyDelete
    Replies
    1. Hi Anon,

      Don't know why. Nobody else has reported this issue so far for nFactors.

      Go here:http://cran.r-project.org/web/packages/nFactors/index.html

      Download the windows binary for the package 'nFactors_2.3.3.zip', save it someplace. Don't unzip it or anything.

      Goto R console -> 'Packages' menu pull down -> 'Install Package(s) from local zip files'

      Should install peacefully, I hope.

      Sudhir

      Delete

Constructive feedback appreciated. Please try to be civil, as far as feasible. Thanks.