This post was prompted by a student query asking for a walkthrough of factor analysis along the lines of the earlier post on interpreting joint space maps. A look at a couple of HW submissions I received also convinced me that this would be a good idea. So here goes.
In the Session 5 HW, the first question asks you to work with the psychographic profiles you provided in Session 2. It asks you to (i) reduce the 25 psychographic Qs to a smaller number of factors based on common constructs that might underlie some of the Qs. We will do this first, and I'll explain the process step-by-step.
Next (ii) the HW asks you to segment the class based on the psychographic factors/constructs obtained. We'll do that too. In both the factor and cluster analysis Qs, you're required to interpret what the underlying construct is (for factor analysis) and what the characteristics of the segments are (in cluster analysis). So, without further ado, let's get started.
1. Factor Analysis
If you want to see what each line of code does, paste the code line by line instead of as a block. If the cursor stays on the same line after you paste, hit Enter to run it.
# read-in data
mydata = read.table(file.choose(), header=TRUE)
dim(mydata)          # show dataset dimensions
summary(mydata)      # show variable summaries
mydata[1:10, 1:12]   # show first 10 rows, first 12 cols
Now, ensure you have the 'nFactors' package installed before this next step.
# determine optimal no. of factors
library(nFactors)
ev <- eigen(cor(mydata))   # get eigenvalues
ap <- parallel(subject=nrow(mydata), var=ncol(mydata), rep=100, cent=.05)
nS <- nScree(ev$values, ap$eigen$qevpea)
plotnScree(nS)             # scree plot with parallel-analysis benchmarks
Count how many green triangles appear before the black line cuts the green line. That is the optimal number of factors. Set 'k1' to this number; the scree plot above says k1 = 10.
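As an aside (my addition, not part of the original post), the nScree object also tabulates the suggested number of factors numerically, in case you'd rather not count triangles off the plot:

# optional check: nScree reports the suggested number of factors directly
nS$Components   # columns noc, naf, nparallel, nkaiser (different stopping rules)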
# 10 green triangles before the line-cut, so set k1 to 10
k1 = 10   # set optimal no. of factors
Next, extract the factors. R's factanal() does a varimax rotation of the factor axes by default, so no worries on that front. Then print the solution.
# extracting k1 factors with varimax rotation
fit <- factanal(mydata, k1, scores="Bartlett", rotation="varimax")
print(fit, digits=2, cutoff=.3, sort=TRUE)
There are 10 factors and, thanks to the factor rotation, each of the 25 variables (V1-V25) usually loads onto no more than one factor. However, there are some exceptions, which we will see shortly. See the last line, "Cumulative Var", in the image? It says the 10-factor solution explains 56% of the total variation in the original 25 variables.
To interpret the factor table, look at the first column, for Factor 1. It says that three variables load on Factor 1 - V4, V17 and V9. Look up the annotation table pasted above (after the data read-in) and ask yourself: what is common between these variables? What construct might underlie all of them? Similarly, Factor 2 comprises V2, V3 and V24. And so on.
However, when we come to Factor 3, two issues emerge: one, Factor 3 "shares" V24 with Factor 2, and two, V5 loads negatively on Factor 3. How do we interpret these? The "sharing" of a variable across factors is generally not a good thing. It has happened here despite the factor rotation (it would be more widespread and much worse without the rotation). One convenient way out is to 'assign' all the weight of the variable to one factor and ignore the other (this may distort the factor interpretation, though). Another interpretation is that both Factors 2 and 3 carry some characteristic of V24 in them. Regarding the negative loading of V5 on Factor 3, it means that people who disagree with V5 score strongly on Factor 3. And so on.
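To make the variable-to-factor assignments easier to read off, here is a small sketch (my addition, not in the original HW code). It lists, for each factor, the variables whose absolute loading exceeds the 0.3 cutoff used in the print() call above:

# sketch only: list the variables loading on each factor beyond a cutoff
loadmat <- as.matrix(fit$loadings)   # 25 x k1 matrix of rotated loadings
cutoff  <- 0.3
for (j in 1:ncol(loadmat)) {
  vars  <- rownames(loadmat)[abs(loadmat[, j]) > cutoff]
  signs <- ifelse(loadmat[vars, j] > 0, "+", "-")
  cat("Factor", j, ":", paste0(signs, vars, collapse=", "), "\n")
}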
The rest of the factors can be interpreted similarly. Ideally, use a tabular form to list the variables in each factor along with their interpretation. I'm pasting a simplified example from Co2010's car project here:
2. Segmentation Using Factor Analysis Output
There are 70 of you who took that survey, so the original psychographic matrix was 70x25 in dimension. But now we want a matrix only 70x10 in dimension that we can use as basis variables for segmentation. This 70x10 matrix is called the 'factor scores' matrix. It gives each respondent's "score" on each of the 10 factors we have selected. Thus the factors replace the original variables, and data reduction is achieved.
# view & save factor scores
fit$scores[1:4, ]      # show first 4 rows of factor scores
fscores = fit$scores   # save them in a dataset called fscores
write.table(fscores, file.choose())
For clustering, I went directly to model-based clustering, as it's the best of the methods available to us.
# Model Based Clustering
library(mclust)
fit <- Mclust(fscores)
fit   # view solution summary
When a set of basis variables doesn't yield a good clustering solution, chances are that the basis-variable set is suboptimal rather than that the population is truly homogeneous. So we have to rationalize the basis-variable set. Look at R's factor loadings table (two images above). Look for factors in which *none* of the constituent variables loads heavily (say, with a loading over 0.67). These are the weakest links, informationally speaking, and can perhaps be dropped. From the table above, Factors 7 and 10 seem to be the leading candidates.
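To make that check concrete, here is a rough sketch (my addition, not in the original post). It assumes the factanal() output is still available under a separate name, say fa.fit, since the name fit has just been reused for the Mclust object above:

# sketch only: flag informationally weak factors from their loadings
# fa.fit is assumed to hold the earlier factanal() result
loadmat <- as.matrix(fa.fit$loadings)
apply(abs(loadmat), 2, max)    # max |loading| per factor; weak if none exceeds ~0.67
apply(abs(loadmat), 2, mean)   # mean |loading| per factor; low values flag weak factors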
So, I drop Factor 7 first (it has the lowest average loading), steer clear of Mclust (since it will likely again give a 1-component solution), and run hierarchical and k-means methods instead.
# factor 7 has the lowest avg loading
fscores.store = fscores            # save a copy, just in case
fscores = fscores.store[, c(-7)]   # drop factor 7's column

# Now let's steer clear of Mclust and try the alternative methods.
# Start with hierarchical clustering (Ward's method)
d <- dist(fscores, method = "euclidean")   # distance matrix
fit <- hclust(d, method = "ward")          # use method = "ward.D" in newer R versions
plot(fit)                                  # display dendrogram
# Eyeballing says 3 or 5 clusters.

# Now use the k-means scree plot to visualize
par(col="black")
# Determine number of clusters
wss <- (nrow(fscores)-1) * sum(apply(fscores, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(fscores, centers=i)$withinss)
plot(1:15, wss, type="b",
     xlab="Number of Clusters",
     ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot.
# Eyeballing the scree plot says 5. Go with 5.
OK, now how do we characterize the segments? We profile them on their basis variables - in this case, the 9 factors we used as inputs. The following code gives each segment's means along the 9 basis factors (each row of means is sometimes called a 'centroid' vector). Identify which factors a particular segment is high/medium/low on and profile it accordingly.
# set optimal no. of clusters
k1 = 5   # k1 is set

# K-Means Cluster Analysis
fit <- kmeans(fscores, k1)   # k1-cluster solution on the 9 retained factors
fit$size                     # relative segment sizes

# get cluster means (centroids)
aggregate(fscores, by=list(fit$cluster), FUN=mean)

# append cluster assignment to the original data
mydata1 <- data.frame(mydata, fit$cluster)

# save dataset with cluster membership
write.table(mydata1, file.choose())
As for ClusPlots, don't bother when you have as many as 9 basis variables. The plot will usually be uninterpretable anyway. Still, here goes:
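Since the plot itself isn't reproduced here, a standard ClusPlot call with the cluster package would look something like this (my sketch):

# plot the cluster solution on the first two principal components
library(cluster)
clusplot(fscores, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)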
3. Discriminant Analysis
Take the saved dataset into MEXL and then do discriminant analysis.
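If you'd rather stay in R instead of MEXL, a linear discriminant analysis can be run with MASS::lda. This is my own sketch, not part of the HW instructions, and it assumes the saved dataset mydata1 from the k-means step above:

# sketch only: linear discriminant analysis in R (alternative to MEXL)
library(MASS)
lda.fit <- lda(fit.cluster ~ ., data = mydata1)    # predict segment from the 25 psychographic Qs
lda.fit                                            # discriminant function coefficients
table(predict(lda.fit)$class, mydata1$fit.cluster) # in-sample confusion matrix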
The reason I'm underwhelmed with discriminant analysis is that it has in recent years been superseded by the logit model. We'll cover the logit model in Session 8. There I can take a brief detour, return to the segment-prediction power that discriminant models are supposed to have, and explain why logit is a better alternative than the linear discriminant analysis we are currently stuck with.
Well, that's it for now. Ciao.
Sudhir
Dear professor,
Can you please explain again how you drew the conclusion you did when you said "So I looked at Mclust's expanded output and saw that both 1 and 2 component solutions had the same BIC. (See image below)"? I can't spot the 1 or 2 cluster solutions in the image. Also, what is BIC, please?
Regards,
Shouri Kamtala
Hi Shouri,
BIC stands for "Bayesian information criterion" and, like the adjusted R-square in a regression, it is a measure of complexity-penalized model fit.
You're right - in my hurry I took the second column instead of the second row as a 2-component solution, whereas it too is a 1-component solution. Unfortunately, we now have a situation with real data (even from a small sample) where neither the k-means scree plot nor Mclust nor the hclust dendrograms give much guidance on what to do. We have to go with judgment in such cases, I guess. But at least we ensure that the other, more objective avenues have been explored and ruled out first.
Hope that helped.
Sudhir
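An aside on the exchange above (my addition, not part of the original reply): the full BIC table can also be inspected directly from the Mclust object, rather than read off the image:

# view the BIC values by number of components and model type
fit <- Mclust(fscores)
fit$BIC           # rows = number of components, columns = covariance model names
summary(fit$BIC)  # the top models by BIC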
Hi Shouri,
I forgot to add this: one way to get a better clustering solution is to drop a few of the basis variables. Use judgment, or alternately, drop the factor that has the least variation (lowest standard deviation) and hence carries the least information. See if the Mclust solution improves. If not, drop one more and check again. And so on. I'll do this and update this post afterwards.
Sudhir
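A minimal sketch of that procedure (my addition, not part of the original comment) might look like this:

# sketch: drop the least-varying factor and re-run model-based clustering
apply(fscores, 2, sd)                                     # std deviation of each factor's scores
fscores2 <- fscores[, -which.min(apply(fscores, 2, sd))]  # drop the lowest-sd factor
fit2 <- Mclust(fscores2)                                  # check whether the solution improves
fit2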
Hi Professor,
The different shapes of the scree plot could be due to the inherent randomness in the kmeans function, which is used to generate the wss array.
par(col="black")
# Determine number of clusters #
wss <- (nrow(fscores)-1)*sum(apply(fscores,2,var))
wss #print wss[1]
fscores #print fscores
for (i in 2:15) wss[i] <- sum(kmeans(fscores,centers=i)$withinss)
wss # print wss[1:15]. Prints different values for the wss array every time
kmeans documentation link: http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=Rcmdr:KMeans
Regards,
Goutham
Thanks, Goutham.
While I knew that kmeans uses a different random seed each time it needs a set of starting values, I wasn't aware it could make such a big difference overall. Usually the solution tends to converge peacefully regardless of starting values. If it doesn't, then maybe Mclust was right after all and there are no multiple segments in the data. Perhaps.
Sudhir
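An aside on the exchange above (my addition): fixing the random seed and using multiple random starts makes the k-means scree plot reproducible and less sensitive to starting values:

# reproducible k-means scree plot: fixed seed, multiple random starts
set.seed(1234)   # any fixed seed
wss <- (nrow(fscores)-1) * sum(apply(fscores, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(fscores, centers=i, nstart=25)$withinss)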
Professor,
When you wrote that Factor 1 loads onto V4, V17 and V10 (factor analysis), I believe you made a typo, because from what I can see it loads onto V4, V17 and V9. Kindly clarify whether my reading is correct or there is an underlying meaning!
Hi Pradyut,
You're right about the typo. It should be V9 and not V10. I shall make the correction in the blog.
Sudhir