Wednesday, November 7, 2012

Session 8 Classwork R Code

Update: All classwork example data files are now available directly from LMS as an Excel sheet. This is in response to reports of issues in reading data off the Google spreadsheet.

Hi all,

The previous blog post introducing the Session 8 modules can be found here. In this post, I share the R code required for the classwork examples, along with links to the associated datasets.

Module 1. Hypothesis Testing

  • t-tests for differences
The Nike dataset is on this Google spreadsheet, starting at cell B5.
## hypothesis testing
## Nike example

# read-in data
nike = read.table(file.choose(), header=TRUE)
dim(nike); summary(nike); nike[1:5,] # some summaries
attach(nike) # Call indiv columns by name

# Q is: "Is awareness for Nike (X, say) different from 3?" We first delete the rows that have a '0' in them, as these were non-responses. Then we proceed with the t-test as usual.

# remove rows with zeros #
x1 = Awareness[(Awareness != 0)]
# test if x1's mean *significantly* differs from mu
t.test(x1, mu=3) # reset 'mu=' value as required.
This is the result I got:
To interpret, first recall the hypothesis. The null said: "mean awareness is no different from 3". However, the p-value of the t-test is well below 0.01. Thus, we reject the null with over 99% confidence and accept the alternative (H1: mean awareness is significantly different from 3). Happily, please notice that R states the alternative hypothesis in plain words as part of its output.
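
If you prefer to pull these numbers out programmatically rather than read them off the printout, the object returned by t.test() stores them. A small optional sketch:
# store the test and extract its pieces (optional)
res = t.test(x1, mu=3)
res$p.value   # the p-value discussed above
res$conf.int  # 95% confidence interval for mean awareness
res$estimate  # sample mean of x1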

The second Q we had was:

# “Q: Does awareness for Nike exceed 4?”
# change 'mu=3' to 'mu=4' now
t.test(x1, mu=4, alternative=c("greater")) # one-sided test
The p-value I get is 0.2627. We can no longer reject the null even at the 90% confidence level. Thus, we infer that the mean awareness of about 4.18 does *not* significantly exceed 4.

Next Q was: "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?”

# first make the two vectors
xm = x1[(Gender==1)]
xf = x1[(Gender==2)]
# Alternative Hypothesis says xf '>' xm
t.test(xf, xm, alternative="greater")
In the code above, we set 'alternative=' to whatever the alternative hypothesis says. In this case it said greater than, so we used "greater"; otherwise we would have used "less".
The p-value is below 0.05, so we reject the null at the 95% level and accept the alternative that "xf at 5.18 significantly exceeds xm at 3.73".
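
For reference, here is what the other settings of 'alternative=' would look like on the same data (a small optional sketch, not part of the classwork):
# two-sided and 'less than' versions of the same comparison
t.test(xf, xm, alternative="two.sided")
t.test(xf, xm, alternative="less")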

  • Chi-square tests for Association
The data for this example is on the same Google doc, starting at cell K5. Code for the first classwork example, on Gender versus Internet Usage, follows. First, let me state the hypotheses:
  • H0: "There is no systematic association between Gender and Internet Usage." In other words, the distribution of people we see across the 4 cells is a purely random phenomenon.
  • H1: "There is systematic association between Gender and Internet Usage."
# read in data WITHOUT headers
a1=read.table(file.choose()) # 2x2 matrix
a1 # view the data once
chisq.test(a1) # voila. Done!
This is the result I get:
Clearly, with a p-value of 0.1441, we cannot reject the null even at the 90% level. So we cannot infer any significant association between Gender and Internet Usage levels. The entire sample was only 30 people large. Suppose we had a much bigger sample but a similar distribution across cells. Would the inference change? Let's find out.

Suppose we scaled up the previous matrix by 10. We will then have 300 people and a corresponding distribution in the four cells. But now, at this sample size, random variation can no longer explain the huge differences we will see between the different cells and the inference will change.

a1*10 # view the new dataset
chisq.test(a1*10)
The results are as expected. While a 5 person difference can be attributed to random variation in a 30 person dataset, a 50 person variation cannot be so attributed in a 300 person dataset.
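
To see where the deviation comes from, you can also compare the observed and expected cell counts that the test works with (an optional sketch):
# observed vs expected cell counts under the null of no association
chisq.test(a1*10)$observed
chisq.test(a1*10)$expected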

Our last example on tests of association uses the Nike dataset. Does Nike Usage vary systematically with Gender? Suppose we had reason to believe so. Then our null would be: Usage does not vary systematically with Gender; random variation can explain the pattern we see. Let us test this in R:

# build cross tab
mytable=table(Usage,Gender)
mytable #view crosstab

chisq.test(mytable)
The example above is also meant to showcase the cross-tab capabilities of R via the table() function; R handles 3-way cross-tabs and more just as easily. Anyway, here's the result: it seems we can reject the null at the 95% level.
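
For instance, a 3-way cross-tab is just one more argument to table(). A minimal sketch, where binning the Awareness column as the third dimension is my own illustrative choice, not part of the classwork:
# 3-way cross-tab: Usage x Gender x (binned) Awareness
mytable3 = table(Usage, Gender, cut(Awareness, breaks=c(0, 3, max(Awareness))))
mytable3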

Module 2. Secondary Data Analysis

This module exposes you to three broadly used types of secondary data. The first is a small sample, or subset, of a supermarket shopping-basket dataset, on which we will perform *affinity analysis*. The second is aggregate sales data for a category (beer) of the kind provided by syndicated data providers (such as A.C. Nielsen and IRI). The third is a dummy dataset on which we will exercise the Logit model.

  • Affinity analysis on supermarket shopping basket data
Data on some 1018 shopping baskets, covering some 276 product categories from a Hyderabad supermarket, are put up as an Excel sheet on LMS. I used Excel's pivot tables to extract the 141 most purchased categories, and read the resulting 1018 x 141 matrix into R. That matrix is available as a text file for R input on LMS. Now use the R code below:
# read-in data
data = read.table(file.choose(), header=TRUE)
dim(data); data[1:5, 1:10]

# --- build affinity matrix shell ---
d1 = ncol(data)
afmat = matrix(0, d1, d1)
rownames(afmat) <- colnames(data)
colnames(afmat) <- colnames(data)

# --- fill up afmat lower triangle ---
for (i1 in 2:d1){
  for (i2 in 1:(i1-1)){
    test = data[, i1]*data[, i2]                    # basket-wise co-occurrence
    afmat[i2, i1] = 100*sum(test > 0)/nrow(data)    # % of baskets containing both
    afmat[i1, i2] = afmat[i2, i1] }                 # mirror into the other triangle
}  # diagonal stays at 0

# sort rows & columns by total affinity, highest first
colSum = apply(afmat, 2, sum)
a1 = sort(colSum, index.return=TRUE, decreasing=TRUE)
afmat1 = afmat[, a1$ix]   # reorder columns
afmat2 = afmat1[a1$ix, ]  # reorder rows the same way
So, with that long piece of code, we have processed the data and built an 'affinity matrix'. Its rows and columns are both product categories. Each cell in the matrix tells how often the column category and the row category were purchased together in the average shopping basket. Let us see what it looks like and get some summary measures as well.

# each cell shows % of times the row & col items #
# were purchased together, in the same basket #
afmat2[1:40,1:8] # partially view affinity matrix

# see some basic summaries
max(afmat2)
mean(afmat2)
median(afmat2)
We viewed some 40 rows and 8 columns of the affinity matrix. Each cell depicts the % of shopping baskets in which the row and column categories occurred together. For example, the first row, second column says that vegetables and juices were part of the same purchase basket 7.66% of the time, whereas bread and vegetables were so 8.54% of the time. The average cross-category affinity is 0.27%. The image below shows the output for some 12 rows and 6 columns of the affinity matrix.
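
You can also look up any particular pair directly by name. A small sketch - the labels 'vegetables', 'juices' and 'bread' here are placeholders; use whatever colnames(afmat2) shows for your data:
colnames(afmat2)[1:10]            # check the actual category labels first
afmat2["vegetables", "juices"]    # affinity % for one specific pair
afmat2["bread", "vegetables"]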

Could the above affinity be viewed also in the form of a collocation dendrogram? Or as a word-cloud? Sure. Here is the code for doing so and my dendrogram result.

# affinity dendrogram?
mydata = afmat2[1:50, 1:50] # cluster top 50 categories

# Ward Hierarchical Clustering
d <- as.dist(mydata) # distance matrix
fit <- hclust(d, method="ward") # note: newer R versions call this method "ward.D"
plot(fit) # display dendrogram

# affinity wordcloud?
colSum = apply(mydata, 2, sum)
library(wordcloud)
wordcloud(colnames(mydata), colSum, col=1:10)
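
One caveat worth flagging: as.dist() above treats the affinity numbers themselves as distances, so high-affinity pairs end up far apart in the dendrogram. If you would rather have high affinity mean 'close', an optional variant (my own tweak, not part of the classwork) is to flip the scale before clustering:
# optional: convert affinity (a similarity) into a dissimilarity first
d2 <- as.dist(max(mydata) - mydata)
fit2 <- hclust(d2, method="ward") # "ward.D" in newer R versions
plot(fit2)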

  • Analysis of Syndicated Datasets

Much of real-world commerce depends on trading data - buying and selling it - via intermediaries called 'syndicated data providers'. A.C. Nielsen and IRI are two of the largest such providers in the retail trade. We will next analyze a small portion of a syndicated dataset to gain familiarity with its format, build some basic models and run some neat regression variants on the data.

The (small) portion of the dataset, which we will use for analysis is here in this google document. First, as usual, read-in the data and let the games begin.

  • There's a text column in the data ('brand2', for brand names), so it's better to read it in from .csv rather than .txt, using the code below.
  • I also do a variable transformation to obtain a squared SKU-size term. (sku.size measures the size of the sale unit in fluid ounces; e.g., a 6-pack of 12 oz cans has 6*12 = 72 ounces of beer volume.)
# read data from a .csv file
x = read.table(file.choose(), header=TRUE, sep=",")
dim(x); summary(x); x[1:5,] # summary views
attach(x) # call indiv columns

# some variable transformations
sku.size.sq = sku.size*sku.size

  • Start with the simplest thing we can - a plain linear regression of sales volume on the 4Ps and on some environmental and control variables (months for seasonality etc.).
# running an ordinary linear regression
beer.lm = lm(volsold~adspend1+price +distbn+promo +sku.size+bottle +golden+light+lite+reg +jan+feb +march+april +factor(brand2), data=x)
# view results
summary(beer.lm)
Notice how the use of factor() enables us to directly use a text column of brand names inside the regression. The following image shows the result I get.
Reading regression output is straightforward. To refresh your memory, try the following Qs: (i) What is the R-square? (ii) How many variables significantly impact sales volume at 95%? At 99%? (iii) Which brand appears to have the highest impact on sales volume (relative to the reference brand 'Amstel')? (iv) How do beer sales fare in the months Jan, Feb and March relative to the reference month April? (v) Do larger-sized SKUs have higher sales volume? Etc. Please feel free to consult others around you and get the interpretation of standard regression output clarified in case you've forgotten it.
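
A few extras that may help answer those questions directly from the fitted object (an optional sketch):
coef(summary(beer.lm))          # full coefficient table as a matrix
summary(beer.lm)$r.squared      # R-square, for Q(i)
levels(factor(brand2))[1]       # the reference brand absorbed into the intercept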

  • We move on to a linear regression with quadratic terms, to accommodate a non-linear response shape.
  • E.g., suppose sales rise with SKU size but only up to a point (the 'ideal' point), and then start falling off as SKU sizes get too big to sell well. Thus, there may be an 'optimal' SKU size out there. We cannot capture this effect with a linear term alone.
# running an OLS with quadratic terms
beer.lm = lm(volsold~adspend1+price +distbn+promo+sku.size +sku.size.sq+bottle +golden+light+lite+reg +jan+feb+march+april +factor(brand2), data=x)
#view results
summary(beer.lm)
The image shows that yes, both sku.size and its quadratic (squared) term are significant at the 99% level *and* they have opposite signs. So yes, there does indeed seem to be an optimal size at which SKUs sell best, other things being equal.
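
Since the linear and squared terms have opposite signs, the implied turning point can be backed out from the fitted coefficients. A minimal sketch, assuming the coefficient names match the variable names used in the formula above:
# implied 'optimal' SKU size for a quadratic response: -b1/(2*b2)
b = coef(beer.lm)
opt.sku = -b["sku.size"] / (2 * b["sku.size.sq"])
opt.sku # in fluid ounces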

  • A log-log regression takes the natural log of metric terms on both sides of the equation and runs them as an ordinary regression.
  • The output is not ordinary though: the coefficients of the logged variables can now be directly interpreted as point elasticities of sales.
# run a log-log regression
beer.lm = lm(log(volsold)~log(adspend1)+log(price) +log(distbn)+log(promo) +sku.size+sku.size.sq+bottle +golden+light+lite+reg +jan+feb+march+april +factor(brand2), data=x)
summary(beer.lm)
The results say that the price elasticity of sales is -2.29, the distribution elasticity is 1.32, and the promotion and advertising elasticities are, respectively, 0.57 and 0.44. In other words, a 1% increase in price is associated with roughly a 2.29% drop in sales volume, other things being equal.

  • What if there are more variables than required for a regression? How to know which ones are most relevant and which ones can safely be dropped?
  • Enter variable selection - we use a scientific, complexity-penalized fit metric, the Akaike Information Criterion or AIC, to decide what the best-fitting set of variables is for a given regression.
## variable selection
library(MASS)
step = stepAIC(beer.lm, direction="both")
step$anova # display results
Look at the initial regression equation and the final regression equation in the output (mentioned in blue font). Look at the variables R wants dropped (prefixed with a '-') and those R wants added back (prefixed with a '+') after dropping. Thus, a simple three-line piece of R code helps identify the best-fitting set of variables from a given set of predictors.
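
Note that 'step' itself is the final fitted model, so you can inspect it or compare AICs directly (an optional sketch):
summary(step)   # the regression R settled on
AIC(step)       # its AIC ...
AIC(beer.lm)    # ... versus the full model's AIC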

Whew. With that, Logit alone remains to be covered. Will do so after a break. Shall put this up on the blog meanwhile for checking purposes.

Module 3. Discrete Choice Models: The (Multinomial) Logit

Hi all,

Logit is slightly hard to get running in R - there are lots of things to keep track of when setting up the data. But here's some generic code that should do the trick. You will need the mlogit package installed (install.packages("mlogit")). The dataset for the following example is on this google spreadsheet.

Step 1: Read in the data

  • Drop any variables that will not be used in the analysis, e.g. 'storenum'.
  • Convert any nominal or categorical variables into dummy variables.
  • Keep the dependent variable 'instorepromo' as the leftmost variable in the input dataset.
###
### --- do MNL on store-promo example ---
###

data = read.table(file.choose(), header=TRUE)
summary(data); attach(data)
data[1:3,]
data = data[,2:ncol(data)] # drop storenum

  • Step 2: To convert the categorical 'coupon' (in general, any set of categorical variables) into a set of binary variables, I wrote the following subroutine. Just copy-paste it in.
## --- write func to build dummy vars for numeric vectors ----

makeDummy <- function(x1){
  x1 = data.matrix(x1); test = NULL

  for (i1 in 1:ncol(x1)){
    # distinct levels of this column
    test1 = NULL; a1 = x1[,i1][!duplicated(x1[,i1])];
    k1 = length(a1); test1 = matrix(0, nrow(x1), (k1-1));
    colnames0 = NULL

    # one 0/1 column per level, except the last (which acts as the reference)
    for (i2 in 1:(k1-1)){
      test1[,i2] = 1*(a1[i2] == x1[,i1])
      colnames0 = c(colnames0, paste(colnames(x1)[i1], a1[i2])) }
    colnames(test1) = colnames0
    test1 = data.matrix(test1); test = cbind(test, test1) }
  test}

# convert coupon to binary variable
coupon1 = makeDummy(coupon)
colnames(coupon1) = "coupon1" # name it
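
A quick sanity check on the dummy (optional): each original coupon level should map cleanly onto the 0/1 column.
head(cbind(coupon, coupon1))  # original codes alongside the dummy
table(coupon, coupon1[,1])    # each coupon level should land in exactly one dummy value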

  • Step 3: Reformat the dataset for mlogit input. Again, just copy-paste.
# make dataset with Y variable as the first
data1 = cbind(instorepromo, coupon1, sales, clientrating)

# now reformat data for MNL input
k1 = max(data1[,1]) # no. of segments there are
test = NULL
for (i0 in 1:nrow(data1)){
  chid = NULL;
  test0 = matrix(0, k1, ncol(data1))
  for (i1 in 1:k1){
    # col 1: indicator = 1 if alternative i1 was the observed choice
    test0[i1, 1] = (data1[i0, 1] == i1);
    chid = rbind(chid, cbind(i0, i1))
    # remaining cols: repeat the store's covariates for each alternative
    for (i2 in 2:ncol(data1)){ test0[i1, i2] = data1[i0, i2] }}
  test = rbind(test, cbind(test0, chid)) } # i0 ends
colnames(test) = c(colnames(data1), "chid", "altvar")
colnames(test)=c("y",colnames(test)[2:ncol(test)])
dim(test); test[1:5,] # view few rows
summary(test)

# build mlogit's data input matrix
library(mlogit) # provides mlogit.data() and mlogit()
test1a = data.frame(test)
attach(test1a)
test1 = mlogit.data(test1a, choice = "y", shape = "long", id.var = "chid", alt.var = "altvar")
summary(test1); test1[1:5,] #view few rows

  • Step 4: Run the analysis, enjoy the result.
  • After all this exhausting work, we've finally reached the top of the hill. No more work now. Let R take over from here.
  • P.S.: This may have seemed like a lot of work for such a small dataset. But please note, this code is generic and can be used with much bigger and more complex datasets with equal facility.
# run mlogit
fit1 = mlogit(y ~ 0|coupon1+sales+clientrating, data=test1)
summary(fit1)
The image above gives the results.
It says that the store profile that best suits (or has the highest probability of being) in-store promo type 1 is one that: (i) is of coupon type 2, (ii) has high sales, and (iii) has a high client rating.

Alternately, we could say that a store with (i) coupon type 1, (ii) low sales and (iii) a low client rating has a higher chance of being in-store promo type 3 or 2 than type 1.
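
One way to read the coefficients more concretely is to exponentiate them: each exp(coefficient) is the multiplicative effect on the odds of that promo type relative to the base type 1 (a small optional sketch):
# coefficients as odds multipliers, relative to promo type 1
exp(coef(fit1))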

What's more, we can compute the exact probabilities involved. But that is fodder for another day and another course perhaps. Here's the code for it and the results it gives. (You're Welcome)

##---write code to obtain fitted values

coeff = as.matrix(fit1$coefficients)
a1 = exp(model.matrix(fit1) %*% coeff)

attach(test1)
chid = test1[,(ncol(test1)-1)]
Prob.matrix = NULL
for (i1 in 1:max(chid)){
  pr = NULL
  test2 = a1[(chid==i1), 1]  # exp(Xb) for this store's k1 alternatives
  pr = test2/sum(test2)      # logit probabilities: each alternative's share
  k2 = ncol(test1)
  a2 = cbind(test1[(chid==i1), c(1,(k2-1),k2)], pr)
  Prob.matrix = rbind(Prob.matrix, a2)}

colnames(Prob.matrix) = c("observed y", "row id", "promo_format", "Probability")
Prob.matrix[,ncol(Prob.matrix)] = round(Prob.matrix[,ncol(Prob.matrix)], 3)
Prob.matrix[,1]=as.numeric(Prob.matrix[,1])
Prob.matrix[1:10,]#view few rows of probabilities
The top 10 rows of the calculated probabilities matrix can be seen in the image. Notice how nicely the calculated probabilities and observed choices match.
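
A rough way to quantify that match (an optional sketch, assuming the rows within each store id stay in alternative order 1..k1, as constructed above): pick the format with the highest fitted probability per store and cross-tab it against the observed format.
# predicted format = alternative with the highest fitted probability
pred = tapply(Prob.matrix[, "Probability"], Prob.matrix[, "row id"], which.max)
table(observed = data1[, 1], predicted = pred) # rows: observed, cols: predicted
mean(data1[, 1] == pred) # simple hit rate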

And now for the final piece - prediction.

Suppose there are 5 stores for which we don't know what promo_format is in force. Can we predict the promo format given the store profile? Let's find out. I have pasted the prediction dataset on the same Google doc, starting at cell H3. Read it in and let the following code do its job.

##--code for predicting membership for new data
# copy from cell H3 of google doc
newdata1 = read.table(file.choose(), header=TRUE)
attach(newdata1)

# convert coupon to binary variable
coupon1 = makeDummy(coupon)
colnames(coupon1) = "coupon1" # name it

# make dataset for analysis
newdata = cbind(coupon1, sales, clientrating)

Next, convert the data format for mlogit processing:

# reformat newdata for mlogit input
test3 = NULL
intcpt = matrix(0,k1, (k1-1))
for (i2 in 2:k1){ intcpt[i2, (i2-1)] = 1 }
for (i0 in 1:nrow(newdata)){
  test3a = NULL; test3b = NULL;
  for (i1 in 1:ncol(newdata)){
    # expand each covariate into alternative-specific columns,
    # matching the column layout of model.matrix(fit1)
    test3a[[i1]] = matrix(0, k1, (k1-1))
    for (i2 in 2:k1){ test3a[[i1]][i2, (i2-1)] = newdata[i0, i1] }
    test3b = cbind(test3b, test3a[[i1]]) }
  test3b = cbind(intcpt, test3b)
  test3 = rbind(test3, test3b) }

colnames(test3) = colnames(model.matrix(fit1))
dim(test3); test3[1:5,] # view a few rows

Now, finally, run the new data through the fitted model and obtain the predicted probabilities of belonging to particular promo formats in a probability table.

# obtain predicted probabilities
coeff = as.matrix(fit1$coefficients)
a1 = exp(test3 %*% coeff)
chid1=NULL; for (i in 1:nrow(newdata1)){chid1=c(chid1, rep(i,k1)) }
altvar1=rep(seq(from=1, to=k1),nrow(newdata1));altvar1
Prob.matrix = NULL
for (i1 in 1:max(chid1)){
pr = NULL
test2 = a1[(chid1==i1), 1]
pr = test2/sum(test2)
Prob.matrix = rbind(Prob.matrix, cbind(i1, altvar1[1:k1], pr)) }

colnames(Prob.matrix) = c("row id", "promo_format", "Probability")
Prob.matrix[,ncol(Prob.matrix)] = round(Prob.matrix[,ncol(Prob.matrix)], 3)

Prob.matrix # view predicted probs
This is the result I got:

With that I close. It's 6 am on the morning that I am to cover this material in class. Am just glad I have managed to get it in order with enough time to prep.

See you all in Class

Sudhir

6 comments:

  1. Dear Sudhir,

    For the 2 sample t-test, "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?”, I am getting some different results. Just wanted to let you know.

    > # first make the two vectors
    > xm = x1[(Gender==1)]
    > xf = x1[(Gender==2)]
    > # Alternative Hypothesis says xf '>' xm
    > t.test(xf, xm, alternative="greater")

    Welch Two Sample t-test

    data: xf and xm
    t = 1.6775, df = 41.796, p-value = 0.05046
    alternative hypothesis: true difference in means is greater than 0
    95 percent confidence interval:
    -0.002581457 Inf
    sample estimates:
    mean of x mean of y
    4.666667 3.739130

    >

  2. Dear Sudhir,

    Is it OK to have an interaction variable comprised of a categorical variable and a continuous variable?

    1. Hi Samik,

      It's possible to do but typically not recommended, because the interpretation gets muddled. Ideally, interactions represent a condition wherein two categorical variables are active in particular ways.

      Sudhir

  3. Dear Professor,
    in this blog you have said "The image above gives the results. It says that the store profile that best suits (or has the highest probability of being) in-store promo type 1 is - (i) must be of coupon type 2, (ii) must have high sales, (iii)must have high client rating." I do not understand how to infer this from the mlogit output. Request you to please clarify

  4. Hi Divya,

    Sorry about the delayed response, I was off-campus most of today. I'll be available post-lunch tomorrow (Sat) in my office; please call my ext #7106 and drop by. Am happy to help.

    Sudhir

    1. Dear Professor,

      Your recent blog was very helpful. Thank you!
      For logit too, we have used "april" as the base/reference. This means that, depending on the coefficients of the other variables, the predicted sales will be higher or lower.

      Regards,
      DCF

