Sunday, December 16, 2012

Session 7 - Hypothesis Testing R code

Hi all,

The previous blog-post introduced unstructured text analysis in Session 7. This one covers a separate module - hypothesis testing. In this post, I give the R code required for the classwork examples on two common types of hypothesis tests - tests of differences and tests of association.

Module 1. Hypothesis Testing

  • t-tests for differences
Use the notepad titled 'Nike dataset.txt' for this.
## hypothesis testing
## Nike example

# read-in data
nike = read.table(file.choose(), header=TRUE)
dim(nike); summary(nike); nike[1:5,] # some summaries
attach(nike) # Call indiv columns by name
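As an aside, attach() is convenient for small classroom examples like this one, but if you prefer not to attach the data frame you can refer to the columns directly with the $ operator. A minimal equivalent sketch:

# same summaries without attach()
summary(nike$Awareness)
table(nike$Gender)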

Q is: "Is awareness for Nike (X, say) different from 3?" We first delete the rows that have a '0' in them, since these were non-responses, and then proceed with the t-test as usual.

# remove rows with zeros #
x1 = Awareness[(Awareness != 0)]
# test if x1's mean *significantly* differs from mu
t.test(x1, mu=3) # reset 'mu=' value as required.
This is the result I got:
To interpret, first recall the hypothesis. The null said: "mean awareness is no different from 3". However, the p-value of the t-test is well below 0.01, so we reject the null with over 99% confidence and accept the alternative (H1: mean awareness is significantly different from 3). Happily, notice that R states the alternative hypothesis in plain words as part of its output.
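To see where this number comes from, here is a small sketch that re-computes the one-sample t-statistic by hand, t = (sample mean - mu) / (sd / sqrt(n)), along with the corresponding two-sided p-value. It uses only x1 and the mu of 3 we already have, so it should match what t.test() printed:

# one-sample t-statistic computed manually
n = length(x1)
t.manual = (mean(x1) - 3) / (sd(x1)/sqrt(n))
t.manual
# two-sided p-value from the t distribution with n-1 degrees of freedom
2*pt(-abs(t.manual), df=n-1)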

The second Q we had was:

# “Q: Does awareness for Nike exceed 4?”
# change 'mu=3' to 'mu=4' now
t.test(x1, mu=4, alternative="greater") # one-sided test
The p-value I get is 0.2627. We can no longer reject the null, even at the 90% confidence level. Thus, we infer that the mean awareness of about 4.18 does *not* significantly exceed 4.
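A quick side note: t.test() returns an object you can store and query, which is handy when you want the p-value or confidence interval programmatically rather than reading it off the printout. A small sketch on the same test:

# store the result and pull out individual pieces
res = t.test(x1, mu=4, alternative="greater")
res$p.value   # the p-value (0.2627 in my run)
res$estimate  # the sample mean
res$conf.int  # the one-sided confidence interval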

The next Q was: "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?"

# first make the two vectors
# (subset from the full columns so the Gender and Awareness rows stay aligned,
#  dropping the '0' non-responses as before)
xm = Awareness[(Gender==1) & (Awareness != 0)]
xf = Awareness[(Gender==2) & (Awareness != 0)]
# Alternative hypothesis says xf '>' xm
t.test(xf, xm, alternative="greater")
In the code above, we set 'alternative=' to whatever the alternative hypothesis says. In this case it said "greater than", so we used "greater"; had it said "less than", we would have used "less".
The p-value is slightly above 0.05, so strictly speaking we cannot reject the null at the 95% level. However, we are almost at 95%, and the confidence cut-offs we use are essentially arbitrary. In such circumstances, we can choose to accept the alternative that "xf at 5.18 significantly exceeds xm at 3.73". It is a judgment call.
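One more aside before moving on: by default, R's t.test() runs the Welch version of the two-sample test, which does not assume the two groups have equal variances. If you want the classical pooled-variance test, or want to check the equal-variance assumption itself, here is a sketch:

# classical pooled-variance two-sample t-test
t.test(xf, xm, alternative="greater", var.equal=TRUE)
# F-test for equality of the two variances
var.test(xf, xm)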

  • chi-square tests for association
The data for this example is on the same Google doc, starting at cell K5. Code for the first classwork example, on Gender versus Internet Usage, follows. First, let me state the hypotheses:
  • H0: "There is no systematic association between Gender and Internet Usage." In other words, the distribution of people we see across the 4 cells is a purely random phenomenon.
  • H1: "There is systematic association between Gender and Internet Usage."
# read in data WITHOUT headers
a1=read.table(file.choose()) # 2x2 matrix
a1 # view the data once
chisq.test(a1) # voila. Done!
This is the result I get:
Clearly, with a p-value of 0.1441, we cannot reject the null even at the 90% level. So we cannot infer any significant association between Gender and Internet Usage levels. Our entire sample was only 30 people. Suppose we had a much bigger sample but a similar distribution across the cells. Would the inference change? Let's find out.

Suppose we scale the previous matrix up by a factor of 10. We then have 300 people and the same relative distribution across the four cells. At this sample size, however, random variation can no longer explain the large differences between cells, and the inference changes.

a1*10 # view the new dataset
chisq.test(a1*10)
The results are as expected. While a 5-person difference can be attributed to random variation in a 30-person dataset, a 50-person difference cannot be so attributed in a 300-person dataset.
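To see the mechanics behind this, note that the object chisq.test() returns carries the expected cell counts under the null, so you can compare observed against expected directly for both tables. A small sketch:

# expected counts under the null of 'no association'
chisq.test(a1)$expected      # 30-person table
chisq.test(a1*10)$expected   # 300-person table
# note: for 2x2 tables chisq.test() applies Yates' continuity correction by default;
# use chisq.test(a1, correct=FALSE) if you want the uncorrected statistic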

Our last example on tests of association uses the Nike dataset. Does Nike Usage vary systematically with Gender? Suppose we had reason to believe so. Then our null would be: "Usage does not vary systematically with Gender; random variation can explain the pattern we see." Let us test this in R:

# build cross tab
mytable=table(Usage,Gender)
mytable #view crosstab

chisq.test(mytable)
The example above is also meant to showcase the cross-tab capabilities of R's table() function; R handles 3-way cross-tabs and more without any fuss. Anyway, here's the result: it seems we can reject the null at the 95% level.
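To illustrate the 3-way point, here is a small sketch using the three columns we already have in the Nike data. ftable() flattens the 3-way table into a more readable layout; since Awareness takes many values here, the table will be wide - this is just to show the syntax:

# 3-way cross-tab on the Nike data
t3 = table(Usage, Gender, Awareness)
ftable(t3) # flattened, easier-to-read display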

Dassit from me for now. See you all in class.

Sudhir
