Sunday, December 2, 2012

Session 3 - Intro to R in a Sampling Demo

Hi all,

Time to gently introduce R into MKTR. In Session 3, I attempt to demonstrate sampling related concepts using R. This does not require that you as students do anything other than watch and listen, discuss and learn.  However, I would encourage you to try the same thing I demo in class at home on R. All you have to do is copy-paste the Rcode below (in the grey shaded boxes) onto the R console. Since there is no grading or assignment pressure involved, consider this a gentle intro to R.  Later, when the graded homework assignments happen, then also, expect the same thing - I will put out R code here on the blog and you have the option of using it to solve your homeworks directly.

1. To download and install R:
Go to the following link
CRAN project: R download links
download the installer, and follow its instructions.

After that, I expect that in < 20 minutes (on a good net connection), you will have top-class computational firepower ready on your machines.

Open R and, well, look around. The GUI won't look like much probably, but appearances can be deceiving (as we'll see later in the course in Session 6 - Qualitative research and Session 7 - Experimentation).

2. Read in Session 3 Dataset into R
For session 3, to demonstrate sampling concepts on R, I collected data from students in the MKTR class of 2011 on their self-reported height (in cms) and weight (in kgs).

This data is available for copying from this google spreadsheet. 

Please copy the data onto a .txt file on your computer and save it (preferably on the desktop). Use the code in step 3 to read it into R.


3. Basic Input/Output in R
  • The '#' symbol is a comment character and text following it on any line is not executed
  • The code below will read the dataset into an R object called 'mydata'. You can give any name you want to the dataset.

mydata = read.table(file.choose(),

header = TRUE) # TRUE only if there're column headers

  • After reading in *any* dataset, its always good practice to lookup data summaries - to eyeball if all is well or not.
  • If you want to understand the code, copy-paste the text in the grey boxes line by line. If you don't particularly care for coding in R, just copy paste the entire block of code in the grey boxes onto the R console.
dim(mydata) # show dataset's dimensions

summary(mydata) # show variables' descriptive summaries

mydata[1:5,] # show first 5 rows of mydata

To sum up, we have now both read in the dataset (named 'mydata') into R and run a quick-check on it to see its summary characteristics.
Click on the image for full-size.
4. Run your first set of analyses on R - histograms for what the data look like.
# What are the data like? Visualizing. #

attach(mydata) # enables calling columns by name

par(mfcol=c(2,1))# makes plots on a single page
# in a 2 x 1 pattern

for (i in 1:ncol(mydata)){ # starts loop for i from 1 to 4

hist(mydata[,i], breaks=30, # histogram with 30 breaks

main=dimnames(mydata)[[2]][i], # give plot title

col="gray") # shade the hist grey

abline(v=mean(mydata[,i]), # draw vertical line at mean

lwd=3, lty=2, col="Red") # of color red & width 3

} # loop ends

Just copy-paste the above code into the R console. Should run peacefully.

Those wanting to see what the code does should copy-paste line-by-line. Others can copy-paste the entire block of code in grey.

You should see something like the image above. Now, before doing any more copy-paste you have to click on the R console once to return control there.

5. Set sample size (k) and see by how much sample based estimates differ from true values.
# Randomly sample 10 values & estimate mean height, weight.

k = 10 # set sample size 'k' to 10

ht = sample(height,k); ht # sample k height values randomly

wt = sample(weight,k); wt # same for weight

mean(ht) # show mean of the height sample

mean(wt) # show mean of the weight sample

error.ht = mean(height)-mean(ht) # calculate sampling error

error.ht # show sampling error in height

error.wt = mean(weight)-mean(wt)
error.wt # do likewise for weight

This is what I got. Your figures will likely be different (because your samples will be different).

6. Now set a higher sample size, say k=40 (instead of 10) and rerun the above code. The errors come down, usually (but not always).
# Randomly sample 10 values & estimate mean height, weight.
k = 40
ht = sample(height,k); ht
wt = sample(weight,k); wt
mean(ht); mean(wt)
mean(height); mean(weight)
error.ht = mean(height)-mean(ht); error.ht
error.wt = mean(weight)-mean(wt); error.wt

7. Plot and see how closely sample distribution approximates population distribution.
par(mfrow=c(2,2)) # draws plots in 2x2 pattern

hist(mydata[,1], breaks=30, # same hist() function
main ="Population Height", xlim=c(140,200), col="gray")
abline(v=mean(mydata[,1]), lwd=3, lty=2, col="Red")

hist(mydata[,2], breaks=30, main="Population Weight", xlim=c(40,100), col="gray")
abline(v=mean(mydata[,2]), lwd=3, lty=2, col="Red")

hist(ht, breaks=20, main="Sample size k=40", xlim=c(140,200), col="beige")
abline(v=mean(ht), lwd=3, lty=2, col="Red")

hist(wt, breaks=20, main="Sample size k=40", xlim=c(40,100), col="beige")
abline(v=mean(wt), lwd=3, lty=2, col="Red")

By the way, if you want to save the above (or any other) plot from R, just right-click, copy as metafile and then paste onto your PPT or word doc etc.

8. Time to deep-dive into Sampling Distributions now. Ask yourself, what happens if we repeat the sampling process 1000 times? We get 1000 means from 1000 samples. Now, what do the 1000 means look like as a histogram?
outp = matrix(0,nrow=1000,ncol=2)# build empty output matrix

k = 10;# set sample size

for (i in 1:1000){ # open loop

outp[i,1]=mean(sample(height,k))# save sample statistics
outp[i,2]=mean(sample(weight,k))# in the outp matrix

} # close loop

par(mfrow=c(2,2)) # 2x2 pattern plots

for (i in 1:ncol(outp)){ # open plotting loop again

hist(outp[,i], breaks=10,main=c( "sample size=",k),

xlab=dimnames(mydata)[[2]][i],xlim=range(mydata[,i]))}


9. Now repeat the above code but with k=40 (instead of k=10) and see. The plots are superimposed in a 2x2 pattern. Do you see fall or rise in the std error? The standard error will roughly half, following the square root rule.
outp = matrix(0,nrow=1000,ncol=2)
k = 40;# set sample size
for (i in 1:1000){ outp[i,1]=mean(sample(height,k))
outp[i,2]=mean(sample(weight,k))}
#par(mfrow=c(2,2))
for (i in 1:ncol(outp)){
hist(outp[,i], breaks=10,main=c( "sample size=",k),
xlab=dimnames(mydata)[[2]][i],xlim=range(mydata[,i]))}

If all went well and you were able to replicate what I got in class in Session 3, well then, congratulations, I guess. You've run R peacefully for MKTR.

I'd strongly recommend folks try R to the max extent feasible for this course. Pls correspond with me using this blog for any R related queries, trouble-shoots etc.

Sudhir

P.S.
I also tried the flipped classroom thing and have putup two youtube vids on how to read in and write out data from R. Here are the links:
5 Steps to Read data into R
4 Steps to Save data from R




No comments:

Post a Comment

Constructive feedback appreciated. Please try to be civil, as far as feasible. Thanks.