The previous blog-post introduced unstructured text analysis in Session 7. This one covers a separate module: hypothesis testing. Here I post the R code required for the classwork examples on two common types of hypothesis tests: tests of differences and tests of association.
Module 1. Hypothesis Testing
- t-tests for differences
## hypothesis testing
## Nike example
# read-in data
nike = read.table(file.choose(), header=TRUE)
dim(nike); summary(nike); nike[1:5,]   # some summaries
attach(nike)   # Call indiv columns by name
The first Q is: "Is awareness for Nike (X, say) different from 3?" We first delete the rows that have a '0' in them, as these were non-responses. Then we proceed with the t-test as usual.
# remove rows with zeros
x1 = Awareness[(Awareness != 0)]
# test if x1's mean *significantly* differs from mu
t.test(x1, mu=3)   # reset 'mu=' value as required.
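To see what t.test() is computing here, the sketch below repeats the same steps on a small made-up vector (awareness_demo is hypothetical, not the Nike file) and works out the t statistic and two-sided p-value by hand for comparison.

```r
# Hypothetical awareness ratings on a 1-5 scale; 0 marks a non-response.
awareness_demo <- c(4, 0, 3, 5, 2, 4, 0, 3, 4, 5, 3, 4)

# Drop the non-responses, as in the Nike example.
x_demo <- awareness_demo[awareness_demo != 0]

# One-sample t statistic by hand: (sample mean - mu) / (sd / sqrt(n)).
mu0 <- 3
n <- length(x_demo)
t_by_hand <- (mean(x_demo) - mu0) / (sd(x_demo) / sqrt(n))

# Two-sided p-value from the t distribution with n - 1 degrees of freedom.
p_by_hand <- 2 * pt(-abs(t_by_hand), df = n - 1)

# The built-in test should report the same statistic and p-value.
res <- t.test(x_demo, mu = mu0)
print(c(t_by_hand = t_by_hand, t_builtin = unname(res$statistic)))
print(c(p_by_hand = p_by_hand, p_builtin = res$p.value))
```

If the two rows of each printout agree, the hand calculation matches what t.test() reports.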
The second Q we had was:
# Q: "Does awareness for Nike exceed 4?"
# change 'mu=3' to 'mu=4' now
t.test(x1, mu=4, alternative="greater")   # one-sided test
Next Q was: "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?"
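# first make the two vectors; filter on Awareness and Gender together,
# because x1 is shorter than Gender once the zero rows are dropped
xm = Awareness[(Awareness != 0) & (Gender == 1)]
xf = Awareness[(Awareness != 0) & (Gender == 2)]
# Alternative Hypothesis says xf '>' xm
t.test(xf, xm, alternative="greater")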
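One thing to watch when splitting by gender: the non-response filter and the gender condition have to be applied to vectors of the same length, otherwise rows get misaligned. A minimal sketch on made-up data (the data frame demo and its values are hypothetical; the coding 1 = male, 2 = female follows the example above) that filters the rows first and then splits:

```r
# Hypothetical data: Awareness on 1-5 (0 = non-response); Gender 1 = male, 2 = female.
demo <- data.frame(
  Awareness = c(4, 0, 3, 5, 2, 4, 0, 3, 4, 5, 3, 4),
  Gender    = c(1, 1, 2, 2, 1, 2, 2, 1, 1, 2, 1, 2)
)

# Keep only the respondents; Awareness and Gender stay aligned row by row.
valid <- demo[demo$Awareness != 0, ]

# Split the filtered awareness scores by gender.
xm_demo <- valid$Awareness[valid$Gender == 1]
xf_demo <- valid$Awareness[valid$Gender == 2]

# One-sided test: alternative hypothesis is mean(xf) > mean(xm).
# Note that t.test() uses the Welch (unequal variance) version by default.
res2 <- t.test(xf_demo, xm_demo, alternative = "greater")
print(res2)
```

Working from a filtered data frame rather than attached columns keeps the two conditions tied to the same rows.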
- chi-square tests for association
- H0: "There is no systematic association between Gender and Internet Usage." In other words, the distribution of people we see in the 4 cells is a purely random phenomenon.
- H1: "There is systematic association between Gender and Internet Usage."
# read in data WITHOUT headers
a1 = read.table(file.choose())   # 2x2 matrix
a1   # view the data once
chisq.test(a1)   # voila. Done!
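For intuition about what chisq.test() compares, here is a rough sketch with a made-up 2x2 table of 30 people (the counts and labels in m_demo are hypothetical, not the classwork data). It computes the counts expected under H0 and the chi-square statistic by hand.

```r
# Hypothetical 2x2 table: rows = Gender, columns = Internet Usage (light/heavy).
m_demo <- matrix(c(10, 5, 5, 10), nrow = 2,
                 dimnames = list(Gender = c("M", "F"), Usage = c("Light", "Heavy")))

# Under H0 (no association), expected count = row total * column total / grand total.
expected_by_hand <- outer(rowSums(m_demo), colSums(m_demo)) / sum(m_demo)

# Pearson chi-square statistic: sum over cells of (observed - expected)^2 / expected.
chisq_by_hand <- sum((m_demo - expected_by_hand)^2 / expected_by_hand)

# correct = FALSE switches off the Yates continuity correction that
# chisq.test() applies to 2x2 tables by default, so the numbers match.
res3 <- chisq.test(m_demo, correct = FALSE)
print(expected_by_hand)
print(c(by_hand = chisq_by_hand, builtin = unname(res3$statistic)))
```

With the default settings (correction on), the reported statistic for a 2x2 table will be somewhat smaller than the hand-computed one.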
Suppose we scaled up the previous matrix by 10. We will then have 300 people, distributed across the four cells in the same proportions as before. But at this sample size, random variation can no longer explain the differences we see between the cells (they are now ten times larger in absolute terms), and the inference will change.
a1*10   # view the new dataset
chisq.test(a1*10)
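The effect of sample size can be seen in a small sketch, again with a made-up 2x2 table of 30 people (m_small is hypothetical, not the classwork matrix). The proportions in the cells are identical in both tables; only the number of people changes.

```r
# Hypothetical 2x2 table with 30 people, and the same table scaled up by 10.
m_small <- matrix(c(10, 5, 5, 10), nrow = 2)
m_big   <- m_small * 10

# Default chisq.test() settings, as in the classwork code.
p_small <- chisq.test(m_small)$p.value
p_big   <- chisq.test(m_big)$p.value
print(c(p_small = p_small, p_big = p_big))

# Without the continuity correction, the statistic scales exactly with
# the sample size: 10 times the people gives 10 times the chi-square value.
stat_small <- unname(chisq.test(m_small, correct = FALSE)$statistic)
stat_big   <- unname(chisq.test(m_big,   correct = FALSE)$statistic)
print(c(stat_small = stat_small, stat_big = stat_big))
```

For these particular counts, the small table does not reject H0 at the 5% level while the scaled-up table does.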
Our last example on tests of association uses the Nike dataset. Does Nike Usage vary systematically with Gender? Suppose we had reason to believe so. Then our Null would be: "Usage does not vary systematically with Gender; random variation can explain the pattern we see." Let us test this in R:
# build cross tab
mytable = table(Usage, Gender)
mytable   # view crosstab
chisq.test(mytable)
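For readers without the Nike file at hand, the same cross-tab workflow can be tried on made-up vectors. In this sketch usage_demo and gender_demo are hypothetical, and the three usage levels are an assumption for illustration only.

```r
# Hypothetical raw data: one entry per respondent.
# Usage coded 1-3 (assumed levels), Gender coded 1 = male, 2 = female.
usage_demo  <- c(rep(1, 20), rep(2, 20), rep(3, 20))
gender_demo <- c(rep(1, 12), rep(2, 8),
                 rep(1, 10), rep(2, 10),
                 rep(1, 6),  rep(2, 14))

# table() counts how many respondents fall in each Usage x Gender cell.
tab_demo <- table(Usage = usage_demo, Gender = gender_demo)
print(tab_demo)

# Chi-square test of association on the 3x2 cross tab.
# Degrees of freedom = (rows - 1) * (columns - 1) = 2 here.
res4 <- chisq.test(tab_demo)
print(res4)
```

The Yates correction only applies to 2x2 tables, so for a 3x2 table like this one the default call reports the plain Pearson statistic.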
Dassit from me for now. See you all in Class
Sudhir