**********************************************
Got this email from Sharath Srinivas:
Dear Prof. Voleti, I see the power of R from the few HWs that we have done. But I wanted to check if there are useful GUI extensions to R.
If someone could provide Matlab GUIDE-like options to build a GUI on top of R code, it would accelerate the adoption of R and its use in businesses. Somehow this does not seem to be happening, which is surprising to me. Do you have any insights on this issue?
My response:
Hi Sharath,

Thanks for asking. Yes, there are a few. I can think of 'R Commander' and 'Rattle' among GUI extensions to R - the former for statistical analysis, the latter for data mining. RStudio is available as a development platform for writing and extending R code. Enterprise R versions have some (non-free) GUIs, I am told.
R started off as a tool for the research community. Being freeware, it did not (and could not) attract professional app developers who worked for profit. Even R has its limitations, I guess. The research community didn’t much care for user-friendliness at that stage.
IMO, in terms of the sheer intellectual horsepower using, sharing, tweaking, modifying and extending R, I think R is world class and then some. No commercial vendor's corporate R&D budget, however big, can compete with all the university departments in the world put together.
Anyway, we take the highs with the lows. Compared to what I was using before for MKTR - SPSS, MEXL, JMP etc. - R allows me so much more freedom to create, extend and imagine data modeling possibilities. I want to expose MKTR students to that same imagination of possibilities as part of the course. Sure, the course is more than just its R part.
This is the first year in which I have made R the flagship platform for MKTR, so things haven't settled into equilibrium yet. The course structure, content, organization etc. are still evolving and changing (as you no doubt will have guessed by now). In that vein, any feedback on what can be done to make R easier and more accessible would be very welcome.
Hi all,
The first module in Session 8 - experimentation doesn't have any R component to it. We already covered the relevant portion under 'Hypothesis testing' in the previous section. Onwards to Module 2.
Module 2. Secondary Data Analysis
There are two broad types of secondary data that you will be exposed to in this module. The first is a small sample or subset of a supermarket shopping-basket dataset, on which we will perform *affinity analysis*. The second is aggregate sales data for a category (beer) provided by syndicated data providers (such as AC Nielsen and IRI).
- Affinity analysis on supermarket shopping basket data
# read-in affinity analysis dataset.txt
data = read.table(file.choose(), header = TRUE)
dim(data); data[1:5, 1:10]

# --- build affinity matrix shell ---
d1 = ncol(data)
afmat = matrix(0, d1, d1)
rownames(afmat) <- colnames(data)
colnames(afmat) <- colnames(data)

# --- fill up afmat lower triangle ---
for (i1 in 2:d1){
  for (i2 in 1:(i1-1)){
    test = data[, i1]*data[, i2]
    # % of baskets in which both items appear
    afmat[i2, i1] = 100*sum(test > 0)/nrow(data)
    afmat[i1, i2] = afmat[i2, i1]
  }
  afmat[i1, i1] = 0
}

# sort rows & columns by overall affinity
colSum = apply(afmat, 2, sum)
a1 = sort(colSum, index.return = TRUE, decreasing = TRUE)
afmat1 = afmat[, a1$ix]
afmat2 = afmat1[a1$ix, ]
# each cell shows % of times the row & col items
# were purchased together, in the same basket
afmat2[1:40, 1:8]   # partially view affinity matrix

# see some basic summaries
max(afmat2); mean(afmat2); median(afmat2)
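To see what each cell of the affinity matrix means, here is a toy check of the same computation on a made-up 3-item, 3-basket matrix (item names and data are invented for illustration):

```r
# hypothetical basket data: rows = baskets, columns = items (1 = bought)
baskets = matrix(c(1, 1, 0,
                   1, 0, 1,
                   1, 1, 0), nrow = 3, byrow = TRUE)
colnames(baskets) = c("bread", "milk", "eggs")

# bread & milk appear together in baskets 1 and 3, i.e. 2 of 3 baskets
test = baskets[, "bread"]*baskets[, "milk"]
affinity = 100*sum(test > 0)/nrow(baskets)
affinity   # about 66.67
```

The full loop in the code above just repeats this pairwise calculation for every pair of item columns.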
Could the above affinity be viewed also in the form of a collocation dendrogram? Or as a word cloud? Sure. Here is the code for doing so, along with my dendrogram result.
# affinity dendrogram?
mydata = afmat2[1:50, 1:50]   # cluster top 50 categories

# Ward hierarchical clustering
d <- as.dist(mydata)   # distance matrix
fit <- hclust(d, method = "ward")   # newer R versions call this "ward.D"
plot(fit)   # display dendrogram

# affinity wordcloud?
colSum = apply(mydata, 2, sum)
library(wordcloud)
wordcloud(colnames(mydata), colSum, col = 1:10)
- Analysis of Syndicated Datasets
Much of real-world commerce depends on trading data - buying and selling it - via intermediaries called 'syndicated data providers'. AC Nielsen and IRI are two of the largest such providers in the retail trade. We will next analyze a small portion of a syndicated dataset to gain familiarity with its format, build some basic models and run some neat regression variants on the data.
The (small) portion of the dataset which we will use for analysis is up on LMS. First, as usual, read in the data and let the games begin.
- There's a text column in the data ('brand2', for brand names), so it's better to read it in from .csv rather than .txt, using the following code.
- Also, I do some variable transformations to obtain a sku.size variable (which measures the size of the sale unit in fluid ounces; e.g., a 6-pack of 12 oz units has 6*12 = 72 ounces of beer volume).
# read data from syndicated beer dataset.csv
x = read.csv(file.choose(), header = TRUE)
dim(x); summary(x); x[1:5, ]   # summary views
attach(x)   # call indiv columns

# some variable transformations
sku.size.sq = sku.size*sku.size
- Start with the simplest thing we can do - a plain linear regression of sales volume on the 4 Ps and on some environmental and control variables (months for seasonality, etc.).
# running an ordinary linear regression
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo
             + sku.size + bottle + golden + light + lite + reg
             + jan + feb + march + april
             + factor(brand2), data = x)

# view results
summary(beer.lm)
- We move on to running a linear regression with quadratic terms, to accommodate a non-linearly shaped response.
- E.g., suppose sales rise with SKU size, but only up to a point (the 'ideal' point); beyond it, they start falling off as SKU sizes get too big to sell well. Thus, there may be an 'optimal' SKU size out there. We cannot capture this effect with a linear term alone.
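Once the quadratic regression below is fit, the 'optimal' SKU size is just the turning point of the fitted parabola, -b1/(2*b2). A minimal sketch with made-up coefficient values (not from the actual beer data):

```r
# hypothetical fitted coefficients: volsold = b0 + b1*sku.size + b2*sku.size^2 + ...
b1 = 2.4     # made-up linear coefficient
b2 = -0.02   # made-up quadratic coefficient; b2 < 0 gives an inverted U
opt.size = -b1/(2*b2)   # turning point of the parabola
opt.size   # about 60 ounces
```

In practice you would pull b1 and b2 out of `coef(beer.lm)` and check that the quadratic coefficient is indeed negative before interpreting the turning point as a maximum.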
# running an OLS with quadratic terms
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo
             + sku.size + sku.size.sq + bottle
             + golden + light + lite + reg
             + jan + feb + march + april
             + factor(brand2), data = x)

# view results
summary(beer.lm)
- A log-log regression takes the natural log of the metric variables on both sides of the equation and runs them through an ordinary regression.
- The output is not ordinary, though. The coefficients of the logged variables can now be directly interpreted as point elasticities of sales with respect to those variables.
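A quick sanity check of the elasticity interpretation: if we simulate sales with a known price elasticity, the log-log regression should recover it as the coefficient on log(price). All names and numbers here are made up for illustration:

```r
# simulate sales with a known price elasticity of -1.5
set.seed(42)
price = runif(500, 1, 10)
volsold = exp(5 - 1.5*log(price) + rnorm(500, sd = 0.1))

# the log-log fit should recover the elasticity as the slope
fit = lm(log(volsold) ~ log(price))
coef(fit)["log(price)"]   # close to -1.5
```

This is why the log-log form is so popular with syndicated sales data: a coefficient of -1.5 says a 1% price increase is associated with roughly a 1.5% drop in volume sold.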
# run a log-log regression
beer.lm = lm(log(volsold) ~ log(adspend1) + log(price)
             + log(distbn) + log(promo)
             + sku.size + sku.size.sq + bottle
             + golden + light + lite + reg
             + jan + feb + march + april
             + factor(brand2), data = x)
summary(beer.lm)
- What if there are more variables than required for a regression? How do we know which ones are most relevant and which can safely be dropped?
- Enter variable selection: we use a scientific, complexity-penalized fit metric - the Akaike Information Criterion, or AIC - to decide which set of variables best fits a given regression.
- However, there is a problem to resolve first. It turns out this variable selection function doesn't work when we have categorical data.
- So we first convert the categorical data to dummy variable columns.
## --- Func to build dummy vars for numeric vectors ---
makeDummy <- function(x1){
  x1 = data.matrix(x1)
  test = NULL
  for (i1 in 1:ncol(x1)){
    a1 = x1[, i1][!duplicated(x1[, i1])]   # unique levels
    k1 = length(a1)
    test1 = matrix(0, nrow(x1), (k1-1))    # one dummy per level, bar one
    colnames0 = NULL
    for (i2 in 1:(k1-1)){
      test1[, i2] = 1*(a1[i2] == x1[, i1])
      colnames0 = c(colnames0, paste(colnames(x1)[i1], a1[i2]))
    }
    colnames(test1) = colnames0
    test = cbind(test, data.matrix(test1))
  }
  test
}

attach(x)
brands = makeDummy(brand2)
data.new = cbind(x, brands)

# running an OLS with quadratic terms
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo
             + sku.size + sku.size.sq + bottle
             + golden + light + lite + reg
             + jan + feb + march + april + brands, data = data.new)

# view results
summary(beer.lm)
Note that the results are the same as when we used factor(brand2). Now we are ready to use the variable selection method on the regression results object 'beer.lm'.
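As an aside, base R's model.matrix() does essentially what makeDummy() does: it expands a factor into dummy columns, dropping one reference level. A minimal sketch on toy data (the tiny brand2 factor here is invented for illustration):

```r
# toy data frame with a 3-level factor, standing in for the brand column
df = data.frame(brand2 = factor(c("A", "B", "A", "C")))

# expand the factor into dummies; drop the intercept column
dums = model.matrix(~ brand2, data = df)[, -1]
dim(dums)   # 4 rows x 2 dummy columns (3 levels minus 1 reference)
```

The resulting numeric columns can be cbind-ed to the data just like the makeDummy() output, so either route works for feeding the variable selection step.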
## variable selection
library(MASS)
step = stepAIC(beer.lm, direction = "both")
step$anova   # display results
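If you want to try stepAIC() before the beer data is loaded, here is a self-contained illustration on R's built-in mtcars data (a stand-in for the beer dataset, which isn't bundled with R):

```r
library(MASS)   # for stepAIC()

# start from the 'kitchen sink' model with every available predictor
full = lm(mpg ~ ., data = mtcars)

# stepwise search in both directions, penalized by AIC
slim = stepAIC(full, direction = "both", trace = FALSE)

# the retained model should fit at least as well on the AIC criterion,
# with no more terms than the full model
length(coef(slim)) <= length(coef(full))   # TRUE
```

The `step$anova` component, as in the code above, records which variables were dropped (or re-added) at each step of the search.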
See you all in Class
Sudhir