Hi all,
The previous blog post, introducing the Session 8 modules, can be found here. In this blog post, I post the R code required for the classwork examples, plus links to the associated datasets.
Module 1. Hypothesis Testing
- t-tests for differences
## hypothesis testing: Nike example ##

# read-in data
nike = read.table(file.choose(), header=TRUE)
dim(nike); summary(nike); nike[1:5,]  # some summaries
attach(nike)  # call individual columns by name
Our first Q is: "Is awareness for Nike (X, say) different from 3?" We first delete the rows that have a '0' in them, as these were non-responses, and then proceed with the t-test as usual.
# remove rows with zeros
x1 = Awareness[(Awareness != 0)]

# test if x1's mean *significantly* differs from mu
t.test(x1, mu=3)  # reset 'mu=' value as required
The second Q we had was:
# Q: "Does awareness for Nike exceed 4?"
# change 'mu=3' to 'mu=4' now
t.test(x1, mu=4, alternative="greater")  # one-sided test
Next Q was: "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?"
# first make the two vectors
# (subset Gender the same way we subset Awareness, so the lengths match)
g1 = Gender[(Awareness != 0)]
xm = x1[(g1==1)]
xf = x1[(g1==2)]

# Alternative Hypothesis says xf '>' xm
t.test(xf, xm, alternative="greater")
- Chi-square tests for Association
- H0: "There is no systematic association between Gender and Internet Usage." In other words, the distribution of people we see across the 4 cells is a purely random phenomenon.
- H1: "There is a systematic association between Gender and Internet Usage."
# read in data WITHOUT headers
a1 = read.table(file.choose())  # 2x2 matrix
a1  # view the data once
chisq.test(a1)  # voila. Done!
Suppose we scaled up the previous matrix by 10. We would then have 300 people with the same relative distribution across the four cells. But now, at this sample size, random variation can no longer explain the large differences between the cells, and the inference changes.
a1*10  # view the new dataset
chisq.test(a1*10)
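To see what drives the change in inference, note that the test object exposes the expected cell counts under the independence null; comparing these with the observed counts shows the same proportional gaps, now backed by 10 times the sample:

# expected cell counts under H0 (independence), for the scaled data
chisq.test(a1*10)$expected
a1*10  # observed counts, for comparison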
Our last example on tests of association uses the Nike dataset. Does Nike Usage vary systematically with Gender? Suppose we had reason to believe so. Then our Null would be: Usage does not vary systematically with Gender; random variation can explain the pattern we see. Let us test this in R:
# build cross tab
mytable = table(Usage, Gender)
mytable  # view crosstab
chisq.test(mytable)
Module 2. Secondary Data Analysis
You will be exposed to three broad types of secondary data in this module. First, a small sample (or subset) of a supermarket shopping-basket dataset, on which we will perform *affinity analysis*. The second is aggregate sales data for a category (beer), provided by syndicated data providers (like AC Nielsen and IRI). The third is a dummy dataset on which we will exercise the Logit model.
- Affinity analysis on supermarket shopping basket data
# read-in data
data = read.table(file.choose(), header=TRUE)
dim(data); data[1:5, 1:10]

# --- build affinity matrix shell ---
d1 = ncol(data)
afmat = matrix(0, d1, d1)
rownames(afmat) <- colnames(data)
colnames(afmat) <- colnames(data)

# --- fill up afmat's lower triangle ---
for (i1 in 2:d1){
  for (i2 in 1:(i1-1)){  # columns strictly below the diagonal
    test = data[, i1]*data[, i2]
    # % of baskets in which both items occur
    afmat[i2, i1] = 100*sum(test > 0)/nrow(data)
    afmat[i1, i2] = afmat[i2, i1]  # mirror into the upper triangle
  }
  afmat[i1, i1] = 0  # zero out the diagonal
}

# sort rows & columns by total affinity
colSum = apply(afmat, 2, sum)
a1 = sort(colSum, index.return=TRUE, decreasing=TRUE)
afmat1 = afmat[, a1$ix]
afmat2 = afmat1[a1$ix, ]
# each cell shows the % of times the row & column items
# were purchased together, in the same basket
afmat2[1:40, 1:8]  # partially view affinity matrix

# see some basic summaries
max(afmat2)
mean(afmat2)
median(afmat2)
Could the above affinity also be viewed in the form of a collocation dendrogram? Or as a wordcloud? Sure. Here is the code for doing so, and my dendrogram result.
# affinity dendrogram?
mydata = afmat2[1:50, 1:50]  # cluster top 50 categories

# Ward hierarchical clustering
d <- as.dist(mydata)  # distance matrix
fit <- hclust(d, method="ward.D")  # "ward" in R versions before 3.1.0
plot(fit)  # display dendrogram

# affinity wordcloud?
colSum = apply(mydata, 2, sum)
library(wordcloud)
wordcloud(colnames(mydata), colSum, col=1:10)
- Analysis of Syndicated Datasets
Much of real-world commerce depends on trading data - buying and selling it - via intermediaries called 'syndicated data providers'. AC Nielsen and IRI are two of the largest such providers in the retail trade. We will next analyze a small portion of a syndicated dataset to gain familiarity with its format, build some basic models, and run some neat regression variants on the data.
The (small) portion of the dataset which we will use for analysis is here in this google document. First, as usual, read in the data and let the games begin.
- There's a text column in the data ('brand2', for brand names), so it's better to read it in from .csv rather than .txt, using the following code.
- Also, I do some variable transformations involving the sku.size variable (which measures the size of the sale unit in fluid ounces; e.g., a 6-pack of 12 oz units has 6*12 = 72 ounces of beer volume).
# read data from a .csv file
x = read.table(file.choose(), header=TRUE, sep=",")
dim(x); summary(x); x[1:5,]  # summary views
attach(x)  # call individual columns by name

# some variable transformations
sku.size.sq = sku.size*sku.size
- Start with the simplest thing we can: a plain linear regression of sales volume on the 4Ps and on some environmental and control variables (month dummies for seasonality, etc.).
# run an ordinary linear regression
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo
    + sku.size + bottle + golden + light + lite + reg
    + jan + feb + march + april + factor(brand2), data=x)

# view results
summary(beer.lm)
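Before adding complexity, it can help to glance at R's standard residual diagnostics for the fit above; a minimal sketch:

# standard diagnostic plots: residuals vs fitted, normal Q-Q,
# scale-location, and leverage
par(mfrow=c(2,2))
plot(beer.lm)
par(mfrow=c(1,1))  # reset the plotting grid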
- We move on to a linear regression with quadratic terms, to accommodate a non-linearly shaped response.
- E.g., suppose sales rise with SKU size, but only up to a point (the 'ideal' point), and then fall off as SKU sizes get too big to sell well. Thus, there may be an 'optimal' SKU size out there. We cannot capture this effect with a linear term alone (see the sketch after the regression code below).
# run an OLS with quadratic terms
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo
    + sku.size + sku.size.sq + bottle
    + golden + light + lite + reg
    + jan + feb + march + april + factor(brand2), data=x)

# view results
summary(beer.lm)
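As an aside, here is a minimal sketch of how one might back out the implied 'optimal' SKU size from the fitted quadratic; this is only meaningful if the fitted sku.size.sq coefficient is negative (an inverted U), which is what the ideal-point story requires:

# volsold ~ b1*sku.size + b2*sku.size.sq + ... peaks at -b1/(2*b2)
b = coef(beer.lm)
b1 = b["sku.size"]; b2 = b["sku.size.sq"]
# valid only if b2 < 0; check the sign before interpreting
opt.size = -b1/(2*b2)
opt.size  # implied sales-maximizing SKU size, in fluid ounces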
- A log-log regression takes the natural log of the metric terms on both sides of the equation and runs them through an ordinary regression.
- The output is not ordinary, though. The coefficients of the logged variables can now be directly interpreted as point elasticities of sales with respect to those variables (see the snippet after the regression below).
# run a log-log regression
beer.lm = lm(log(volsold) ~ log(adspend1) + log(price)
    + log(distbn) + log(promo)
    + sku.size + sku.size.sq + bottle
    + golden + light + lite + reg
    + jan + feb + march + april + factor(brand2), data=x)
summary(beer.lm)
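For instance, here is a minimal way to read off the price elasticity from the fitted model (the coefficient name follows the formula above):

# the coefficient on log(price) is the point price elasticity:
# a 1% price increase changes volume sold by roughly this percent
coef(beer.lm)["log(price)"]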
- What if there are more variables than required for a regression? How do we know which ones are most relevant and which can safely be dropped?
- Enter variable selection: we use a scientific, complexity-penalized fit metric, the Akaike Information Criterion (AIC), to decide on the best-fitting set of variables for a given regression.
## variable selection
library(MASS)
step = stepAIC(beer.lm, direction="both")
step$anova  # display results
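One follow-up worth knowing: the object stepAIC returns is itself the final fitted lm model, so it can be summarized and used directly:

# 'step' is the AIC-selected model; view it as usual
summary(step)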
Whew. With that, Logit alone remains to be covered. Will do so after a break. Shall put this up on the blog meanwhile for checking purposes.
- Discrete Choice Models: The (Multinomial) Logit
Logit is slightly hard to get running in R; there are lots of things to keep track of when entering the data. But here's some generic code that should do the trick. The dataset for the following example is on this google spreadsheet.
- Step 1: Read in the data
- Drop any variables that will not be used in the analysis, e.g., 'storenum'
- Convert any nominal or categorical variables into dummy variables
- Keep the dependent variable 'instorepromo' as the leftmost variable in the input dataset
### --- do MNL on the store-promo example --- ###
data = read.table(file.choose(), header=TRUE)
summary(data); attach(data)
data[1:3,]
data = data[, 2:ncol(data)]  # drop storenum
- Step 2: To convert the categorical 'coupon' variable (in general, any set of categorical variables) into a set of binary dummies, I wrote the following subroutine. Just copy-paste it in.
## --- function to build dummy vars from numeric vectors ---
makeDummy <- function(x1){
  x1 = data.matrix(x1); test = NULL
  for (i1 in 1:ncol(x1)){
    test1 = NULL
    a1 = x1[, i1][!duplicated(x1[, i1])]  # unique levels of column i1
    k1 = length(a1)
    test1 = matrix(0, nrow(x1), (k1-1))  # k1-1 dummy columns
    colnames0 = NULL
    for (i2 in 1:(k1-1)){
      test1[, i2] = 1*(a1[i2] == x1[, i1])
      colnames0 = c(colnames0, paste(colnames(x1)[i1], a1[i2]))
    }
    colnames(test1) = colnames0
    test1 = data.matrix(test1)
    test = cbind(test, test1)
  }
  test
}

# convert coupon to a binary variable
coupon1 = makeDummy(coupon)
colnames(coupon1) = "coupon1"  # name it
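A quick sanity check may help here. Assuming 'coupon' takes two values, makeDummy returns a single 0/1 column, and cross-tabulating it against the original variable confirms the coding:

# each original coupon level should map to exactly one dummy value
table(coupon, coupon1)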
- Step 3: Reformat the dataset for mlogit input. Again, just copy-paste.
# make a dataset with the Y variable first
data1 = cbind(instorepromo, coupon1, sales, clientrating)

# now reformat the data for MNL input
k1 = max(data1[,1])  # no. of promo formats (choice alternatives)
test = NULL
for (i0 in 1:nrow(data1)){
  chid = NULL; test0 = matrix(0, k1, ncol(data1))
  for (i1 in 1:k1){
    test0[i1, 1] = (data1[i0, 1] == i1)  # 1 for the chosen alternative
    chid = rbind(chid, cbind(i0, i1))
    for (i2 in 2:ncol(data1)){ test0[i1, i2] = data1[i0, i2] }
  }
  test = rbind(test, cbind(test0, chid))
}  # i0 ends
colnames(test) = c(colnames(data1), "chid", "altvar")
colnames(test) = c("y", colnames(test)[2:ncol(test)])
dim(test); test[1:5,]  # view a few rows
summary(test)

# build mlogit's data input matrix
library(mlogit)  # needed for mlogit.data() and mlogit()
test1a = data.frame(test)
attach(test1a)
test1 = mlogit.data(test1a, choice = "y", shape = "long",
    id.var = "chid", alt.var = "altvar")
summary(test1); test1[1:5,]  # view a few rows
- Step 4: Run the analysis, enjoy the result.
- After all this exhausting work, we've finally reached the top of the hill. No more work now. Let R take over from here.
- P.S.: This may have seemed like a lot of work for such a small dataset. But please note: this code is generic and can be used with much bigger and more complex datasets with equal facility.
# run mlogit
fit1 = mlogit(y ~ 0 | coupon1 + sales + clientrating, data=test1)
summary(fit1)
It says that the store profile that best suits (or has the highest probability of being) in-store promo type 1 is: (i) must be of coupon type 2, (ii) must have high sales, (iii) must have high client rating.
Alternatively, we could say that a store with (i) coupon type 1, (ii) low sales, and (iii) low clientrating has a higher chance of being in-store promo type 3 or 2 than type 1.
What's more, we can compute the exact probabilities involved. But that is fodder for another day and another course, perhaps. Here's the code for it. (You're welcome.)
## --- code to obtain fitted values ---
coeff = as.matrix(fit1$coefficients)
a1 = exp(model.matrix(fit1) %*% coeff)
attach(test1)
chid = test1[, (ncol(test1)-1)]
Prob.matrix = NULL
for (i1 in 1:max(chid)){
  pr = NULL
  test2 = a1[(chid==i1), 1]
  pr = test2/sum(test2)  # logit probabilities for store i1
  k2 = ncol(test1)
  a2 = cbind(test1[(chid==i1), c(1, (k2-1), k2)], pr)
  Prob.matrix = rbind(Prob.matrix, a2)
}
colnames(Prob.matrix) = c("observed y", "row id", "promo_format", "Probability")
Prob.matrix[, ncol(Prob.matrix)] = round(Prob.matrix[, ncol(Prob.matrix)], 3)
Prob.matrix[,1] = as.numeric(Prob.matrix[,1])
Prob.matrix[1:10,]  # view a few rows of probabilities
And now for the final piece - prediction.
Suppose there are 5 stores for which we don't know what promo_format is in force. Can we predict the promo format given each store's profile? Let's find out. I have pasted the prediction dataset on the same google doc, starting at cell H3. Read it in and let the following code do its job.
## --- code for predicting membership for new data ---
# copy from cell H3 of the google doc
newdata1 = read.table(file.choose(), header=TRUE)
attach(newdata1)

# convert coupon to a binary variable
coupon1 = makeDummy(coupon)
colnames(coupon1) = "coupon1"  # name it

# make dataset for analysis
newdata = cbind(coupon1, sales, clientrating)
Next, convert the data format for mlogit processing:
# reformat newdata for mlogit input
test3 = NULL
intcpt = matrix(0, k1, (k1-1))  # alternative-specific intercept dummies
for (i2 in 2:k1){ intcpt[i2, (i2-1)] = 1 }
for (i0 in 1:nrow(newdata)){
  test3a = NULL; test3b = NULL
  for (i1 in 1:ncol(newdata)){
    test3a[[i1]] = matrix(0, k1, (k1-1))
    for (i2 in 2:k1){ test3a[[i1]][i2, (i2-1)] = newdata[i0, i1] }
    test3b = cbind(test3b, test3a[[i1]])
  }
  test3b = cbind(intcpt, test3b)
  test3 = rbind(test3, test3b)
}
colnames(test3) = colnames(model.matrix(fit1))
dim(test3); test3[1:5,]  # view a few rows
Now, finally, run the new data and obtain predicted probabilities of belonging to particular promo formats in a probability table.
# obtain predicted probabilities
coeff = as.matrix(fit1$coefficients)
a1 = exp(test3 %*% coeff)
chid1 = NULL
for (i in 1:nrow(newdata1)){ chid1 = c(chid1, rep(i, k1)) }
altvar1 = rep(seq(from=1, to=k1), nrow(newdata1)); altvar1
Prob.matrix = NULL
for (i1 in 1:max(chid1)){
  pr = NULL
  test2 = a1[(chid1==i1), 1]
  pr = test2/sum(test2)  # logit probabilities for new store i1
  Prob.matrix = rbind(Prob.matrix, cbind(i1, altvar1[1:k1], pr))
}
colnames(Prob.matrix) = c("row id", "promo_format", "Probability")
Prob.matrix[, ncol(Prob.matrix)] = round(Prob.matrix[, ncol(Prob.matrix)], 3)
Prob.matrix  # view predicted probs
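As a last touch, a minimal sketch (using the Prob.matrix just built) that picks, for each new store, the promo format with the highest predicted probability:

# for each new store, pick the format with the highest probability
pm = data.frame(Prob.matrix)  # cols become row.id, promo_format, Probability
pred.format = sapply(split(pm, pm$row.id),
    function(d) d$promo_format[which.max(d$Probability)])
pred.format  # predicted promo format per new store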
With that I close. It's 6 am on the morning I am to cover this subject in class; I'm just glad I managed to get it in order with enough time to prep the material.
See you all in class,
Sudhir
Dear Sudhir,
For the 2-sample t-test, "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?", I am getting some different results. Just wanted to let you know.
> # first make the two vectors
> xm = x1[(Gender==1)]
> xf = x1[(Gender==2)]
> # Alternative Hypothesis says xf '>' xm
> t.test(xf, xm, alternative="greater")
Welch Two Sample t-test
data: xf and xm
t = 1.6775, df = 41.796, p-value = 0.05046
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
-0.002581457 Inf
sample estimates:
mean of x mean of y
4.666667 3.739130
>
Dear Sudhir,
Is it OK to have an interaction variable comprised of a categorical variable and a continuous variable?
Hi Samik,
It's possible to do, but typically not much recommended, because the interpretation gets muddled. Ideally, interactions represent a condition wherein two categorical variables are active in particular ways.
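That said, if you want to try it, here is a minimal sketch; the variable names (y, x, grp, mydf) are hypothetical:

# interaction between a continuous x and a categorical grp:
# y ~ x*factor(grp) expands to main effects plus the interaction,
# i.e., a separate slope on x for each level of grp
fit = lm(y ~ x*factor(grp), data=mydf)
summary(fit)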
Sudhir
Dear Professor,
In this blog you have said: "The image above gives the results. It says that the store profile that best suits (or has the highest probability of being) in-store promo type 1 is - (i) must be of coupon type 2, (ii) must have high sales, (iii) must have high client rating." I do not understand how to infer this from the mlogit output. Request you to please clarify.
Hi Divya,
Sorry about the delayed response; I was off-campus most of today. I'll be available post-lunch tomorrow (Sat) in my office. Pls call my ext #7106 and drop by. Am happy to help.
Sudhir
Dear Professor,
Your recent blog was very helpful. Thank you!
For logit too, we have used "april" as the base/reference. This means that, depending on the coefficients of the other variables, the predicted sales will be higher or lower.
Regards,
DCF