Tuesday, December 18, 2012

Session 8 Classwork R code - Secondary Data Analysis

Update: Video link on what firms are thinking today about Experimentation in MKTR

**********************************************

Got this email from Sharath Srinivas:

Dear Prof. Voleti,

I see the power of R from the few HWs that we have done. But, wanted to check if there are useful GUI extensions to R.

If someone could provide Matlab GUIDE-like options to build a GUI on top of R code, it would accelerate the adoption of R and its use in businesses. Somehow this does not seem to be happening, which is surprising to me. Do you have any insights on this issue?

My response:

Hi Sharath,

Thanks for asking. Yes, there are a few. I can think of ‘R Commander’ and ‘Rattle’ among GUI extensions to R, the former for statistical analysis and the latter for data mining. RStudio is available as a developer platform for writing extensions to R code. I am told Enterprise R has some (non-free) GUI versions.

R started off as a tool for the research community. Being freeware, it did not (and could not) attract professional app developers who worked for profit. Even R has its limitations, I guess. The research community didn’t much care for user-friendliness at that stage.

IMO, in terms of the sheer intellectual horsepower that is using, sharing, tweaking, modifying and extending R, I think R is world class and then some. No commercial vendor’s corporate R&D budget, however big, can compete with all the university departments in the world put together.

Anyway, we take the highs with the lows. Compared to what I was using before for MKTR – SPSS, MEXL, JMP etc. – R allows me so much more freedom to create, extend and imagine data-modeling possibilities. I want to expose MKTR students to that same imagination of possibilities as part of the course. Sure, the course is more than just the R part of it.

This is the first year in which I have made R the flagship platform for MKTR, so things haven’t settled into a stable equilibrium yet. The course structure, content, organization etc. are still evolving and changing (as you no doubt would have guessed by now). In that vein, any feedback on what can be done to make R easier/more accessible would be very welcome.

*****************************************

Hi all,

The first module in Session 8 - Experimentation - doesn't have any R component to it. We already covered the relevant portion earlier, under 'Hypothesis testing'. Onwards to Module 2.

Module 2. Secondary Data Analysis

There are two broad types of secondary data that you will be exposed to in this module. The first is a small sample or subset of a supermarket shopping-basket dataset, on which we will perform *affinity analysis*. The second is aggregate sales data for a category (beer) provided by syndicated data providers (like A C Nielsen and IRI).

  • Affinity analysis on supermarket shopping basket data
Data on some 1018 shopping baskets, covering some 276 product categories from a Hyderabad supermarket, are up as an Excel sheet on LMS. I used Excel's pivot tables to extract the 141 most-purchased categories, and read the resulting 1018 x 141 matrix into R. The matrix is available as a text file for R input on LMS. Now use the R code as follows:
# read-in affinity analysis dataset.txt
data = read.table(file.choose(), header=TRUE)
dim(data); data[1:5, 1:10]

# --- build affinity matrix shell ---
d1 = ncol(data)
afmat = matrix(0, d1, d1)
rownames(afmat) <- colnames(data)
colnames(afmat) <- colnames(data)

# --- fill up afmat, category pair by pair ---
for (i1 in 2:d1){ for (i2 in 1:(i1-1)){
  test = data[, i1]*data[, i2]                  # > 0 only when both items are in a basket
  afmat[i2, i1] = 100*sum(test > 0)/nrow(data)  # % of baskets containing both categories
  afmat[i1, i2] = afmat[i2, i1] } }             # mirror into the lower triangle; diagonal stays 0

# sort rows and columns so the most-connected categories come first
colSum = apply(afmat, 2, sum)                              # total affinity per category
a1 = sort(colSum, index.return = TRUE, decreasing = TRUE)  # sort order of categories
afmat1 = afmat[, a1$ix]                                    # reorder columns
afmat2 = afmat1[a1$ix, ]                                   # reorder rows the same way
So, with that long piece of code, we have processed the data and built an 'affinity matrix'. Its rows and columns are both product categories. Each cell in the matrix tells how often the row and column categories were purchased together, as a percentage of all shopping baskets. Let us see what it looks like and get some summary measures as well.

# each cell shows % of times the row & col items #
# were purchased together, in the same basket #
afmat2[1:40,1:8] # partially view affinity matrix

# see some basic summaries
max(afmat2)
mean(afmat2)
median(afmat2)
We viewed some 40 rows and 8 columns of the affinity matrix. Each cell depicts the % of shopping baskets in which the row and column categories occurred together. For example, the first row, second column says that vegetables and juices were part of the same purchase basket 7.66% of the time, whereas bread and vegetables were so 8.54% of the time. The average cross-category affinity is 0.27%. The image below shows the output for some 12 rows and 6 columns of the affinity matrix.
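If you'd rather not eyeball the matrix for the strongest associations, here is a minimal sketch that lists the top 10 category pairs by affinity. It assumes only the afmat2 object built above; the category names will of course depend on your data.

# sketch: list the top 10 category pairs by affinity
pairs = which(upper.tri(afmat2), arr.ind = TRUE)   # unique (row, col) index pairs
aff = afmat2[pairs]                                # affinity % for each pair
top = order(aff, decreasing = TRUE)[1:10]          # positions of the 10 largest
data.frame(item1 = rownames(afmat2)[pairs[top, 1]],
           item2 = colnames(afmat2)[pairs[top, 2]],
           affinity.pct = round(aff[top], 2))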

Could the above affinity also be viewed in the form of a collocation dendrogram? Or as a word-cloud? Sure. Here is the code for doing so, and my dendrogram result.

# affinity dendrogram?
mydata = afmat2[1:50, 1:50] # cluster top 50 categories

# Ward Hierarchical Clustering
d <- as.dist(mydata) # treat the affinity matrix as a distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram

# affinity wordcloud?
colSum = apply(mydata, 2, sum)
library(wordcloud)
wordcloud(colnames(mydata), colSum, col=1:10)

  • Analysis of Syndicated Datasets

Much of real-world commerce depends on trading data - buying and selling it - via intermediaries called 'syndicated data providers'. AC Nielsen and IRI are two of the largest such providers in the retail trade. We will next analyze a small portion of a syndicated dataset to gain familiarity with its format, build some basic models and run some neat regression variants on the data.

The (small) portion of the dataset that we will use for analysis is up on LMS. First, as usual, read in the data and let the games begin.

  • There's a text column in the data ('brand2' for brand names), so it's better to read it in from .csv rather than .txt, using the following code.
  • Also, I do a variable transformation to obtain a squared sku.size term (sku.size measures the size of the sale unit in fluid ounces; e.g., a 6-pack of 12 oz units has 6*12 = 72 ounces of beer volume).
# read data from syndicated beer dataset.csv
x = read.csv(file.choose(), header=TRUE)
dim(x); summary(x); x[1:5,] # summary views
attach(x) # call indiv columns

# some variable transformations
sku.size.sq = sku.size*sku.size   # squared term, used in the quadratic specification later

  • Start with the simplest thing we can - a plain linear regression of sales volume on the 4Ps and on some environmental and control variables (months for seasonality etc.).
# running an ordinary linear regression
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo + sku.size + bottle
             + golden + light + lite + reg + jan + feb + march + april
             + factor(brand2), data = x)
# view results
summary(beer.lm)
Notice how the use of factor() enables us to directly use a text column of brand names inside the regression. The following image shows the result I get.
Reading regression output is straightforward. To refresh your memory on this, try the following Qs. (i) What is the R-square? (ii) How many variables significantly impact sales volume at 95%? At 99%? (iii) Which brand appears to have the highest impact on sales volume (relative to the reference brand 'Amstel')? (iv) How do beer sales fare in the months Jan, Feb and March relative to the reference month April? (v) Do larger-sized SKUs have higher sales volume? Etc. Please feel free to consult others around you and get the interpretation of standard regression output clarified, in case you've forgotten.
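If you'd rather pull some of these answers out of R directly than read them off the printout, here is a minimal sketch using the fitted beer.lm object; nothing in it is specific to this dataset.

# sketch: extract a few answers directly from the fitted model object
s = summary(beer.lm)
s$r.squared                                  # (i) the R-square
coefs = s$coefficients                       # estimates, std. errors, t stats, p-values
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]  # (ii) terms significant at 95%
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.01]  #      ... and at 99%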

  • We move on to running a linear regression with quadratic terms, to accommodate a non-linearly shaped response.
  • E.g., suppose sales rise with SKU size, but only up to a point (the 'ideal' point). They then start falling off as SKU sizes get too big to sell well. Thus, there may be an 'optimal' SKU size out there. We cannot capture this effect with a linear term alone.
# running an OLS with quadratic terms
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo + sku.size + sku.size.sq
             + bottle + golden + light + lite + reg + jan + feb + march + april
             + factor(brand2), data = x)
# view results
summary(beer.lm)
The image shows that yes, both sku.size and its quadratic (or square) term are significant at the 99% level *and* they have opposite signs. So yes, indeed, there seems to be an optimal size at which SKUs sell best, other things being equal.
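As a quick follow-up, the quadratic specification implies a turning point at -b1/(2*b2), where b1 and b2 are the coefficients on sku.size and sku.size.sq. A small sketch, assuming those coefficient names as in the output above:

# sketch: implied sales-maximizing SKU size from the quadratic fit
b = coef(beer.lm)
opt.size = -b["sku.size"] / (2 * b["sku.size.sq"])
opt.size   # in fluid ounces, other things being equal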

  • A log-log regression takes the natural log of the metric variables on both sides of the equation and runs them through an ordinary regression.
  • The output is not ordinary, though. The coefficients of the logged variables can now be directly interpreted as point elasticities of sales with respect to those variables.
# run a log-log regression
beer.lm = lm(log(volsold) ~ log(adspend1) + log(price) + log(distbn) + log(promo)
             + sku.size + sku.size.sq + bottle + golden + light + lite + reg
             + jan + feb + march + april + factor(brand2), data = x)
summary(beer.lm)
Thus, the results say that the price elasticity of sales is -2.29, the distribution elasticity of sales is 1.32, and the promotion and advertising elasticities of sales are, respectively, 0.57 and 0.44.
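To get a feel for what such elasticities mean, a small back-of-the-envelope sketch: with a price elasticity of -2.29, a 1% price increase implies roughly a 2.3% drop in volume sold, other things being equal.

# sketch: volume impact implied by a price elasticity of -2.29
((1.01)^(-2.29) - 1) * 100   # ~ -2.3% volume change for a 1% price hike
((1.05)^(-2.29) - 1) * 100   # ~ -10.6% volume change for a 5% price hike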

  • What if there are more variables than required for a regression? How do we know which ones are most relevant and which ones can safely be dropped?
  • Enter variable selection - we use a scientific, complexity-penalized fit metric, the Akaike Information Criterion or AIC, to decide what the best-fitting set of variables is for a given regression.
  • However, there is a problem I must resolve first. Turns out this variable selection function doesn't work when we have categorical data.
  • So we first convert the categorical data into dummy variable columns, using the function below.
## --- Func to build dummy variable columns from a categorical vector ---
makeDummy <- function(x1){
  x1 = data.matrix(x1); test = NULL

  for (i1 in 1:ncol(x1)){                     # for each input column
    test1 = NULL
    a1 = x1[, i1][!duplicated(x1[, i1])]      # unique values (levels)
    k1 = length(a1)
    test1 = matrix(0, nrow(x1), (k1-1))       # dummies for all but one (reference) level
    colnames0 = NULL

    for (i2 in 1:(k1-1)){
      test1[, i2] = 1*(a1[i2] == x1[, i1])    # 1 if this row takes level i2, else 0
      colnames0 = c(colnames0, paste(colnames(x1)[i1], a1[i2])) }
    colnames(test1) = colnames0
    test1 = data.matrix(test1); test = cbind(test, test1) }
  test }

attach(x)                    # (x was attached earlier; the repeat is harmless)
brands = makeDummy(brand2)   # one 0/1 column per brand, minus a reference brand
data.new = cbind(x, brands)  # append the dummy columns to the original data

# same OLS as before, but with explicit brand dummies instead of factor(brand2)
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo + sku.size + sku.size.sq
             + bottle + golden + light + lite + reg + jan + feb + march + april
             + brands, data = data.new)
# view results
summary(beer.lm)

Note that the results are the same as when we had used factor(brand2). Now we are ready to use the variable selection method on the regression results object 'beer.lm'.
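As an aside, base R's model.matrix() builds the same kind of dummy columns in one line; a minimal sketch for reference (the column names it produces will differ from makeDummy's):

# sketch: dummy columns for brand2 via model.matrix()
brands.mm = model.matrix(~ brand2, data = x)[, -1]   # drop the intercept column
head(brands.mm)   # one 0/1 column per non-reference brand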

## variable selection
library(MASS)
step = stepAIC(beer.lm, direction="both")
step$anova # display results
Look at the initial regression equation and the final regression equation (shown in blue font in the image). Look at the variables R wants dropped (marked with a '-') and those R wants added back (marked with a '+') after dropping. Thus, a simple 3-line piece of code in R helps identify the best-fitting set of variables from a given set of predictors. Nothing surprising about what R told us here - drop the insignificant ones. But it could get more complicated in more involved datasets, perhaps.
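If you want to carry the selected specification forward, one way (a sketch, assuming the 'step' and 'data.new' objects created above) is to refit it and compare the two models on AIC directly.

# sketch: refit the step-selected model and compare AICs
beer.lm.final = lm(formula(step), data = data.new)
AIC(beer.lm, beer.lm.final)   # lower AIC = better complexity-penalized fit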

See you all in Class

Sudhir
