Tuesday, December 3, 2013

Session 3 Updates

Hi all,

Update:

This blog post from last year contains more details on how to interpret factor analysis results in R.

Session 3 covers two seemingly diverse topics - Questionnaire design and Data reduction via Factor analysis.

Each topic brings its own HW with it. And yes, let me pre-empt some possible gripes that may arise.... No, the HWs aren't overly heavy yet. The Hyd folks did similar HWs despite also having a project.

The Questionnaire design portion has a HW that asks you to program a websurvey based on a D.P.-R.O. that you extract from a given M.P. Let me introduce that HW right away below:

Consider the following problem context:

Flipkart, a leading Indian e-tailer, wants to know how students in premier professional colleges in India view shopping online. Flipkart believes that this segment will, a few years down the line, become profitable and a source of positive word of mouth from a set of opinion leaders. This will seed the next wave of customer acquisition and growth, and is hence a high-stakes project for Flipkart.

Flipkart wants to get some idea about the online buying habits, buying process concerns and considerations, product categories of interest, basic demographics, media consumption (in order to better reach this segment) and some idea of the psychographics of this segment.

As lead consultant in this engagement, you must now come up with a quick way to prioritize and assess the target-segment's perceptions on these diverse parameters.

HW Q: Build a short survey (no longer than 12-15 minutes of fill-up time for the average respondent) on the Qualtrics web survey software for this purpose. Please submit the websurvey link in this google form. The deadline is before Session 5 starts.

******************************************

OK, now let's start. Fire up your RStudio. Download all the data files required from the 'Session 3 files' folder on LMS.

Copy the code below and paste it into the 'Console' in RStudio. A window will open up asking for the location of the dataset to be read in. Read in 'factorAn data.txt'. Use the 'Packages' tab in the lower right pane in RStudio to install the nFactors package.

rm(list = ls()) # clear workspace first

# read in the data 'factorAn data.txt'

mydata=read.table(file.choose(),header=TRUE)

mydata[1:5,] #view first 5 rows

# install the required package first

install.packages("nFactors")

# determine optimal no. of factors

library(nFactors) # invoke library

ev = eigen(cor(mydata)) # get eigenvalues

ap = parallel(subject=nrow(mydata),var=ncol(mydata),rep=100,cent=.05);

nS = nScree(ev$values, ap$eigen$qevpea);

plotnScree(nS)

A scree plot should appear, like the one below:

On the scree plot that appears, the green horizontal line represents the Eigenvalue = 1 level. Simply count how many green triangles (in the figure above) lie before the black line cuts the green line. That is the optimal no. of factors. Here, it is 2. The plot looks intimidating as it is; hence, please do not bother with any other color-coded information given - blue, black or green. Just stick to the instructions above. Now, we set 'k1' to 2 as shown below:

k1 = 2 # set optimal no. of factors

If the optimal no. of factors changes when you use a new dataset, simply change the value of 'k1' in the line above. Copy the line into a notepad, change it to 'k1 = 6' or whatever you get as optimal, and paste it onto the R console. The rest of the code runs as-is.
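If you'd rather not eyeball the scree plot, the count can be cross-checked programmatically. This is an optional sketch (not part of the assigned code) of the same criterion the green line represents, namely counting eigenvalues above 1; it assumes 'mydata' has already been read in as above:

```r
# optional cross-check of the scree-plot count (eigenvalues > 1)
# assumes 'mydata' is the dataset read in earlier
ev <- eigen(cor(mydata))  # eigenvalues of the correlation matrix
k1 <- sum(ev$values > 1)  # how many eigenvalues lie above the green line at 1
k1                        # for 'factorAn data.txt' this should print 2
```

Note that parallel analysis (the black line) can occasionally suggest a different number than this eigenvalue-greater-than-1 count; when the two disagree, follow the scree-plot instructions above.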

# extracting k1 factors with varimax rotation

fit = factanal(mydata, k1, scores="Bartlett", rotation="varimax");

print(fit, digits=2, cutoff=.3, sort = TRUE)

You'll see output like the image below (click for larger image):

Clearly, the top 3 variables load onto factor 1 and the bottom 3 onto factor 2.

Another point of interest is the last line in the image above which says "Cumulative Var". It stands for Cumulative variance explained by the factor solution. For our 2 factor solution, the cumulative variance explained is 0.74 or 74%. In other words, close to three-quarters or 74% of the net information content in the original 6-variable dataset is retained by the 2-factor solution.
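As an optional sanity check (not part of the assigned code), 'Cumulative Var' can be recomputed by hand: it is the total of the squared loadings divided by the number of variables. This sketch assumes 'fit' and 'mydata' exist from the steps above:

```r
# recompute "Cumulative Var" from the loadings matrix
# assumes 'fit' and 'mydata' exist from the steps above
ss.loadings <- colSums(fit$loadings^2)      # SS loadings, one per factor
cum.var <- sum(ss.loadings) / ncol(mydata)  # cumulative variance explained
round(cum.var, 2)                           # ~0.74 for the dataset above
```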

Also, look at the 'Uniquenesses' of the variables. The more 'unique' a variable is, the less it is explained by the factor solution. Hence, oftentimes we drop variables with very high uniqueness (say, over 2/3rds) and re-run the analysis on the remaining variables. The dropped variables can essentially be considered factors in their own right and are included as such in downstream analysis. If there is *any* aspect of the above process that you want to see expanded or covered in more detail, please let me know. I shall do so as best I can.
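To make the drop-and-re-run step concrete, here is an illustrative sketch (the 2/3rds cutoff follows the discussion above; adjust it to taste). It assumes 'fit', 'mydata' and 'k1' exist from the earlier steps, and that enough variables survive the cut for factanal to run:

```r
# drop variables with uniqueness above 2/3 and re-run the factor analysis
# assumes 'fit', 'mydata' and 'k1' exist from the steps above
keep <- names(fit$uniquenesses)[fit$uniquenesses <= 2/3]  # variables to retain
mydata2 <- mydata[, keep]                                 # reduced dataset
fit2 <- factanal(mydata2, k1, scores = "Bartlett", rotation = "varimax")
print(fit2, digits = 2, cutoff = .3, sort = TRUE)
```

The dropped variables (those not in 'keep') can then be carried into downstream analysis as stand-alone factors, per the note above.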

We can now plot the variables onto the top 2 factors (if we want to) and see what that looks like. We can also save the factor scores for later use downstream.

# plot factor 1 by factor 2

load <- fit$loadings[,1:2]

par(col="black") #black lines in plots

plot(load,type="p",pch=19,col="red") # set up plot

abline(h=0);abline(v=0)#draw axes

text(load,labels=names(mydata),cex=1,pos=1)

# view & save factor scores

fit$scores[1:4,]#view factor scores

write.table(fit$scores, file.choose())

The above is the plot of the variables on the first two factors. The variables closest to an axis (factor) load onto that factor.

*************************************

Session 3 - Homework 2: Factor Analysis:

  • In the 'Session 3 files' folder on LMS, there is a dataset labeled 'personality survey responses new.txt'.
  • This is *your* data - 33 psychographic variables that should map onto 4 personality factors - that you answered in the surveys of Session 2 HW.
  • Read it into R using the code given above.
  • Run the analysis as per the code given above or as given in 'R code for factor analysis.txt' notepad.
  • Look up the scree plot and decide the optimal # of factors.
  • Plug that number into the 'k1 = ' piece of the code.
  • Copy and save the plots as metafiles directly onto a PPT slide.
  • Copy and paste the R results tables either into Excel or as images onto the PPT.
  • See the image below for 'Q reference' to map which variable meant what in the survey.
  • *Interpret* any 5 of the factors that emerge based on the variables that load onto them. Label those factors.
  • Write your name and PGID on the PPT title slide.
  • Name the PPT yourname_session3HW.pptx and submit it into the session 3 dropbox on LMS.
  • Submission deadline is a week from now, before the start of session 5.

Above is the Qs cross-reference, just in case.

For any Qs or clarifications etc., contact me. Please check this blog post regularly for updates. As Qs and clarifications come to me, I will update this post to address them.

Sudhir
