Tuesday, December 25, 2012

Sessions 9 and 10 Annotations

Hi all,

The following annotation is regarding the Socio-economic class (SEC) classification I covered in session 9.

  • As mentioned in class, the SEC system has changed substantially in this decade. I discussed in class what the old system was like.
  • Mr Sharath Srinivas from your class has sent slides that explain the new system (Thanks!). These slides are put up on LMS.
  • This Wiki page also explains the change that has happened and the reasons driving that change.
  • Pls go through the new SEC system at your leisure. I believe it will help to know about the new SEC in your placement interviews with B2C Marketing firms.
  • However, since I haven't discussed it explicitly in class, the new SEC system will not be important from your end-term viewpoint.
**************************

In Session 10, as I wrap up the course, I champion R as an Analysis (and not just analytics) platform for industry use, try to convince you to include R in your consideration sets, and talk about its learning curve. This post from term 5 also contains some pointers and sources for learning R.

For instance, to demonstrate R's wide flexibility and interoperability with so many other packages around, I pick common 'how to' questions people may encounter in MKTR and Google them for R. Thus, one could ask "How to test hypotheses in R?", "How to sort in R?", "How to do cluster analysis in R?", or "How to do ANOVA in R?" and so on.

Some of these results I present below:

Typically, somebody somewhere has already created an R package with the basic functions, which others later extend. For running SQL queries in R, look at the first result - a package 'sqldf' comes up. The third result, a PDF, also shows up. To my untrained eye, the fifth result too looks promising, and so on.
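
To make that concrete, here is a minimal, hedged sketch of how 'sqldf' is typically used (the toy data frame and query below are my own illustration, not taken from the search results):

# install.packages("sqldf") # one-time install, if needed
library(sqldf)

# a toy data frame (hypothetical)
df = data.frame(brand = c("A", "B", "A"), sales = c(10, 25, 5))

# run a SQL query directly against the data frame
sqldf("SELECT brand, SUM(sales) AS total_sales FROM df GROUP BY brand")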

****************************

Here's another instance: say, in your work, you come across a potential opportunity that you want to prove out, but it involves text analytics. And your firm isn't currently using any licensed commercial text analysis platform. But you recall MKTR@ISB and turn to R. You google for 'text analytics in R' and here's what you'd find:

The first two results show you the 'tm' package in R. The first link is from 2008, when 'tm' was first introduced as a basic version; it's much enhanced in functionality now. The third search result is the 'tm' package manual. The fifth result is a research paper that provides Perl+R programs for a particular function. And so on. Any decent stats grad you hire as an analyst can pick up R in a few weeks. You would have inexpensive access to a great platform.

***************************

Say you're working on something and you want to merge two datasets. You don't know how. Just google it. That's how I learned R in my initial days: getting stuck on something and then figuring out through Google how to do it in R.
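
For the record, the workhorse here is base R's merge() function. A minimal sketch (the toy data frames and key column 'id' are my own illustration):

# two small data frames sharing a key column 'id'
df1 = data.frame(id = 1:3, age = c(25, 32, 41))
df2 = data.frame(id = 2:4, city = c("Hyd", "Pune", "Delhi"))

merge(df1, df2, by = "id") # inner join - keeps only ids present in both
merge(df1, df2, by = "id", all = TRUE) # full outer join - keeps all ids, fills NAs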

Let me quickly mention the first search result: Quick-R is a great resource for do-it-yourself R. In fact, this blog's format for R code is largely inspired by Quick-R.

Well, Dassit for now. Over to the last class tomorrow; I hope you have gained as much from the course as you had expected.

Sudhir

Saturday, December 22, 2012

Exam Pattern & Session 10 case Write-up

Hi all,

1. One page Case write-up for session 10

Pls submit a one-page case write-up [one side of the page only, Font: Times New Roman, size 11, spacing 1.5 lines] on The Fashion Channel case, based on the following:

  • What is the management problem? Describe also the major symptoms.
  • What are some of the likely decision problems (DPs) that emerge based on the management problem?
  • List a few research objectives (R.O.s) that emerge based on the DPs.
Pls remember to type your name, PGID and section when submitting it in session 10. The case write-up may count for a small portion of the HW grade.

2. End-term exam pattern Notes

I am working on the question paper. My notes for the end-term exam:

  • There are a total of 50 Qs, 2 marks each.
  • The Qs are broken down into 5-6 Question-sets, each having tables or figures and Qs based on them.
  • The Qs are all short-answer - True/False, fill-in-the-blanks, write the expression for ..., name these factors - that type of thing.
  • If any Q comes from any pre-read, the concerned pre-read will be specified in the Q itself. So bring your course-pack to the exam. Not all pre-reads are relevant, only the ones I've specifically asked you to read. Properly speaking, those pre-reads are a part of the course.
  • Nothing that was not covered in class will show up anywhere in the exam.
  • Time will not be a problem - you'll have 150 minutes for a 120 minute paper.
  • Please bring a calculator with you (no mobiles or laptops allowed). Borrow or otherwise arrange for this.

Pls use the comments section to this post for any Q&A so that it is visible to the class at large.


Friday, December 21, 2012

Session 7 HW Assignment

Hi all,

Pls find on LMS a folder titled 'Session 7 HW' with a set of files inside.

  • This HW (along with the next one, for Session 8) will be due at midnight on 6-Jan-2013 - the last day before the new term starts.
  • Both these HWs are PPT based, using the same submission format as earlier MKTR assignments.
  • My advice: don't wait till the last minute; collaborate with peers and submit early.
  • Reminder: R portion can be shared with peers, but write-up and interpretation should be individual only.
The Qs for the Session 7 HW are:

Reading Based HWs:

Reading 1:"How ‘social intelligence’ can guide decisions"

Q1. Summarize the main points the article makes. (in <30 words, in bullet points, font 20 on your PPT).

Q2. List some real-world examples of how social media has changed the marketing intelligence cycle. (in 1 slide max, in bullet points, font 20 on your PPT).

Q3. List any three functions that the social-intelligence toolkit replaces in the traditional toolkit. Give one real-world example for each function you list (exclude examples already in the reading). (in <30 words, in bullet points, font 20 on your PPT).

********************************

Reading 2: "Big Data, Big Ruse"

Q4. Summarize the main points the article makes (in <30 words, in bullet points, font 20 on your PPT).

Q5. With respect to the main points raised in the article, how does an open-source, no-frills analysis platform (i.e. R, basically) compare to commercial BI platforms? Pls take a stand (doesn't necessarily have to be pro-R) and argue your case. (in <30 words, in bullet points, font 20 on your PPT).

******************************

Text and Sentiment Analysis: Homework

I've put up some 85 user reviews of the Xbox 360 and the Sony PlayStation 3, pulled from Amazon, into xbox reviews.txt and ps3reviews.txt respectively.

The code to execute the assignment is also put up here, in R code textAnalysis.txt (it's a minor variation on the classwork code). You are strongly advised to first try replicating the classwork examples on your machine, available in this blog-post, before trying this one.

Q6. Your task is to use R to text analyze the dataset. Figure out:

(i) what most people seem to be saying about each product. And thereby interpret a general 'sense' of the talk or buzz around the product.

(ii) List what positive emotions seem associated with the xbox. Likewise for the PS3. And thereby interpret what each product's strengths are. [The business implications of such early signs of Word-of-mouth, instantaneous customer feedback, buzz etc for positioning, branding, promotions, communications and other tools in the Mktg repertoire are easy to see.]

(iii) List what negative associations seem to be around. And ideate on each product's plausible weaknesses and how it can try to defend itself. [The business importance of early warning systems, damage assessment and speedy damage control are hard to miss.]

************************************

Hypothesis Testing HW

Read in car preference ratings from G20 Infiniti data.txt into R. The G20 case background intro is available in the first couple of pages of the G20 Infiniti MEXL caselet.

The relevant R code can be found in R code hypth testing.txt. You are advised to refer to the classwork examples as well, available in this blog-post.

Q7. Build and test the following Hypotheses (comment on what can be inferred)

(i) No significant Preference difference between Audi and Honda.

(ii) Significantly higher preference for Honda as compared to G20

(iii) Split the 1-9 rating scale into Hi/Medium/Low preference, corresponding to Hi = {9,8,7}, Medium = {6,5,4} and Low = {3,2,1}. Test whether the Hi/Med/Low preference ratings across Ford and Pontiac are systematically related or not (a sketch of the recoding step appears after this question list).

(iv) Run a chisquare test similar to (iii) for G20 versus Saab's preference ratings.

In each case above, copy-paste the R output table and interpret it on the same slide.
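
For (iii), a hedged sketch of the recoding step is below. The column names 'Ford' and 'Pontiac' are my assumption - check the actual names in your data with summary() and adjust accordingly.

# band the 1-9 ratings into Low/Medium/Hi
# ('Ford' and 'Pontiac' are assumed column names - replace with the actual ones)
ford.band = cut(Ford, breaks = c(0, 3, 6, 9), labels = c("Low", "Medium", "Hi"))
pontiac.band = cut(Pontiac, breaks = c(0, 3, 6, 9), labels = c("Low", "Medium", "Hi"))

# stack the banded counts into a brand x preference-band table
mytable = rbind(Ford = table(ford.band), Pontiac = table(pontiac.band))
mytable # view the cross-tab

chisq.test(mytable) # tests whether the band distribution differs systematically across brands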

We will solve this HW tomorrow in the proposed R tutorial. Dassit for now. For any queries, comments, feedback etc., feel free to contact me anytime through email or the blog comments.

Sudhir

Thursday, December 20, 2012

Session 8 - Annotations

Hi all,

In the Regression Modeling module in Session 8, we talked about how quadratic terms can act to capture simple curves. Here is an R simulation to demonstrate the same, hopefully to greater clarity.

Here is what the code does. It randomly draws 100 numbers between 0 and 2 (into vector x1), creates a quadratic term vector x2 (the square of x1), then plots 4 scenarios in which a vector and its quadratic term together give rise to different simple curve shapes. Read the code as you copy-paste it and hopefully, as you walk the data through the processing, the point of the exercise will become clear.

## --- Quadratic Terms Effect: Demo --- ##

# generate random numbers in vector x1
x1=runif(100)*2

# x2 is the quadratic term (or square) of x1
x2=x1*x1
x3 = cbind(x1,x2) # x3 is a matrix
x3=x3[order(x1),] # sort x3 rows in ascending order of x1

# different variations of outcome y
op <- par(mfrow = c(2,2), oma = c(0, 0, 3, 0))

# when x1 has +ve and x2 a negative coeff
y = x3[,1] - 0.5*x3[,2];
plot(y=y,x=x3[,1],type="b", main="(A) First +ve then -ve")

# when x1 has -ve and x2 a +ve coeff
y= -1*x3[,1] + 0.5*x3[,2];
plot(y=y,x=x3[,1],type="b", main="(B) First -ve then +ve")

# when x1 has +ve and x2 a +ve coeff
y= x3[,1] + 0.5*x3[,2];
plot(y=y,x=x3[,1],type="b", main="(C) Both +ve")

# when x1 has -ve and x2 a -ve coeff
y= -1*x3[,1] - 0.5*x3[,2];
plot(y=y,x=x3[,1],type="b", main="(D) Both -ve")

mtext("Capturing simple curves using Quadratic terms", outer = TRUE, cex=1.2)

par(op)
This is the image I got.
Observe how changing the signs changes the curve type captured in panels (A) to (D) in the figure. Thus in summary, we learn that quadratic terms give much flexibility to capture nonlinear shapes, useful in modeling secondary data in MKTR. Higher order polynomials (cubes etc) can capture even finer and more complex curve shapes.
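
If you want to see that last point for yourself, here is a small follow-on sketch (my own extension, not part of the class demo) that adds a cubic term and lets lm() recover an S-shaped curve:

# simulate data whose true curve has a cubic term
x1 = sort(runif(100)*2)
y = 2*x1 - 3*x1^2 + x1^3 + rnorm(100)*0.05

# fit a cubic polynomial; I() protects the power terms inside the formula
fit = lm(y ~ x1 + I(x1^2) + I(x1^3))
summary(fit) # coefficients should come out near 2, -3 and 1

plot(x1, y, main = "Cubic term captures an S-shaped curve")
lines(x1, fitted(fit), col = "red")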

***************

There was a Q about the rationale or intuition behind the chi-squared test. I'd say, start with this wikipedia page on the common Pearson's chi-squared test. The R code below helps simulate some chi-square distributions of varying degrees of freedom (from 1 to 8 in this case) on R.

## --- chi-squared distribution simulation on R --- ##

# define list Z for storing std normal draws
Z = NULL;
for (i1 in 1:8){ Z[[i1]] = rnorm(1000) }
Z1 = Z[[1]]; for (i1 in 2:8) { Z1 = cbind(Z1, Z[[i1]]) }

# create chi-squared distribution C for varying deg of freedom
Z2 = Z1*Z1
C = Z2
for (i1 in 2:8){ C[,i1] = as.matrix(apply(Z2[,1:i1],1,sum)) }

par(mfrow=c(1,1))
test = C[,1]
plot(density(C[,1]), col="black", type="l", main = "ChiSq distbn simulation in R")

for (i2 in 2:ncol(C)){
lines(density(x = C[,i2]), col=i2)}
Compare this with the plot on the wikipedia page. The point of this exercise was to demonstrate the construction of the chi-square distribution - as a sum of squares of standard normal variates. This is strictly part of the stats core and I don't want to go any further into it. However, folks with more specific Qs and clarifications are free to drop by and contact me.

Dassit for now. Need to work on your Session 7 and 8 HWs and prepare the end-term paper and so on.

Sudhir

Tuesday, December 18, 2012

Session 8 Classwork R code - Secondary Data Analysis

Update: Video link on what firms are thinking today about Experimentation in MKTR

**********************************************

Got this email from Sharath Srinivas:

Dear Prof. Voleti,

I see the power of R from the few HWs that we have done. But, wanted to check if there are useful GUI extensions to R.

If someone could provide Matlab GUIDE-like options to build GUIs on top of R code, it would accelerate the adoption of R and its use in businesses. Somehow this does not seem to be happening, which is surprising to me. Do you have any insights on this issue?

My response:

Hi Sharath,

Thanks for asking. Yes, there are a few. I can think of ‘R commander’ and ‘Rattle’ among GUI extensions to R, the former for statistical analysis and the latter for data mining. RStudio is available as a developer platform to write extensions to R code. Enterprise R has some (non-free) GUI versions, I am told.

R started off as a tool for the research community. Being freeware, it did not (and could not) attract professional app developers who worked for profit. Even R has its limitations, I guess. The research community didn’t much care for user-friendliness at that stage.

IMO, in terms of the sheer intellectual horsepower that is using, sharing, tweaking, modifying and extending R, I think R is world class and then some. No commercial vendor's corporate R&D budget, however big, can compete with all the University departments in the world put together.

Anyway, we take the highs with the lows. Compared to what I was using before for MKTR – SPSS, MEXL, JMP etc – R allows me so much more freedom to create, extend and imagine data modeling possibilities. I only want to expose the same imagination of possibilities to MKTR students as part of the course. Sure, the course is more than just the R part of it.

This is the first year in which I have made R the flagship platform for MKTR. So things haven’t settled to equilibrium stability yet. The course structure, content, organization etc is still evolving and changing (as you no doubt would have guessed by now). In that vein, any feedback on what can be done to make R easier/ more accessible would be very welcome.

*****************************************

Hi all,

The first module in Session 8 - Experimentation - doesn't have any R component to it. We already covered the relevant portion under 'Hypothesis testing' in the previous session. Onwards to Module 2.

Module 2. Secondary Data Analysis

There are 2 types of broadly used secondary data that you will be exposed to in this module. First, a small sample or subset of a supermarket shopping basket dataset upon which we will perform *affinity analysis*. The second is aggregate sales data for a category (beer) provided by syndicated data providers (like A C Nielsen and IRI).

  • Affinity analysis on supermarket shopping basket data
Data on some 1018 shopping baskets covering some 276 product categories from a Hyderabad supermarket are put up as an Excel sheet on LMS. I used Excel's pivot tables to extract the 141 most purchased categories. The resulting 1018*141 matrix I read into R. The matrix is available as a text file for R input on LMS. Now use the R code as follows:
# read-in affinity analysis dataset.txt
data = read.table(file.choose(), header=TRUE)
dim(data); data[1:5, 1:10]

# --- build affinity matrix shell ---
d1 = ncol(data)
afmat = matrix(0, d1, d1)
rownames(afmat) <- colnames(data)
colnames(afmat) <- colnames(data)

# --- fill up afmat lower triangle ---
for (i1 in 2:d1){ for (i2 in 1:(d1-1)){
test = data[, i1]*data[, i2]
afmat[i2, i1] = 100*sum(test[]>0)/nrow(data)
afmat[i1, i2] = afmat[i2, i1] }
afmat[i1, i1] = 0 }

colSum = apply(afmat, 2, sum)
a1=sort(colSum, index.return=TRUE, decreasing=TRUE)
afmat1=afmat[,a1$ix]
afmat2=afmat1[a1$ix,]
So, with that long piece of code, we have processed the data and built an 'affinity matrix'. Its rows and columns are both product categories. Each cell in the matrix tells how often the column category and the row category were purchased together in the average shopping basket. Let us see what it looks like and get some summary measures as well.

# each cell shows % of times the row & col items #
# were purchased together, in the same basket #
afmat2[1:40,1:8] # partially view affinity matrix

# see some basic summaries
max(afmat2)
mean(afmat2)
median(afmat2)
We viewed some 40 rows and 8 columns of the affinity matrix. Each cell depicts the % of shopping baskets in which the row and column categories occurred together. For example, the first row, second column says that Vegetables and Juices were part of a purchase basket together 7.66% of the time, whereas Bread and Vegetables were 8.54% of the time. The average cross-category affinity is 0.27%. The image below shows the output for some 12 rows and 6 columns of the affinity matrix.
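
A small aside: since the affinity matrix carries category names on both dimensions, you can query any pair directly. The category names below are illustrative - check the actual spellings with colnames(afmat2) first.

colnames(afmat2)[1:10] # see the actual category names
afmat2["Bread", "Vegetables"] # % of baskets containing both categories
sort(afmat2[, "Bread"], decreasing = TRUE)[1:5] # top 5 categories co-purchased with Bread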

Could the above affinity be viewed also in the form of a collocation dendrogram? Or as a word-cloud? Sure. Here is the code for doing so and my dendrogram result.

# affinity dendogram?
mydata = afmat2[1:50, 1:50] # cluster top 50 categories

# Ward Hierarchical Clustering
d <- as.dist(mydata) # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendogram

# affinity wordcloud?
colSum = apply(mydata, 2, sum)
library(wordcloud)
wordcloud(colnames(mydata), colSum, col=1:10)

  • Analysis of Syndicated Datasets

Much of real-world commerce depends on trading data - buying and selling it - via intermediaries called 'Syndicated data providers'. AC Nielsen and IRI are two of the largest such around in the retail trade. We will next analyze a small portion of a syndicated dataset to gain familiarity with its format, build some basic models and run some neat regression variants on the data.

The (small) portion of the dataset, which we will use for analysis is up on LMS. First, as usual, read-in the data and let the games begin.

  • There's a text column in the data ('brand2' for brand names). So it's better to read it in from .csv rather than .txt, using the following code.
  • Also I do a variable transformation to obtain sku.size.sq, the square of sku.size (which measures the size of the sale unit in fluid ounces; e.g., a 6-pack of 12 Oz units has 6*12 = 72 ounces of beer volume).
# read data from syndicated beer dataset.csv
x = read.csv(file.choose(), header=TRUE)
dim(x); summary(x); x[1:5,] # summary views
attach(x) # call indiv columns

# some variable transformations
sku.size.sq = sku.size*sku.size

  • Start with the simplest thing we can - a plain linear regression of sales volume on the 4Ps and on some environmental and control variables (months for seasonality etc.).
# running an ordinary linear regression
beer.lm = lm(volsold~adspend1+price +distbn+promo +sku.size+bottle +golden+light+lite+reg +jan+feb +march+april +factor(brand2), data=x)
# view results
summary(beer.lm)
Notice how the use of factor() enables us to directly use a text column consisting of brand names inside the regression. The following image shows the result I get.
Reading regression output is straightforward. To refresh your memory on this, try the following Qs. (i) What is the R-square? (ii) How many variables significantly impact sales Volume at 95%? at 99%? (iii) Which brand appears to have the highest impact on sales volume (relative to reference brand 'Amstel')? (iv) How do beer sales fare in the months Jan, Feb and March relative to the reference month April? (v) Do larger sized SKUs have higher sales volume? Etc. Pls feel free to consult others around and get the interpretation of standard regression output clarified in case you've forgotten or something.
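
If you'd rather pull these numbers out of the fitted object than read them off the printout, a minimal sketch follows (standard lm accessors, applied to the beer.lm object above):

summary(beer.lm)$r.squared # the R-square
summary(beer.lm)$adj.r.squared # adjusted R-square
round(coef(summary(beer.lm)), 3) # full coefficient table: estimate, std error, t value, p-value
rownames(coef(summary(beer.lm)))[coef(summary(beer.lm))[, 4] < 0.05] # variables significant at 95%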

  • We move to running a linear regression with quadratic terms to accommodate non-linear shaped response.
  • E.g., suppose sales rise with SKU size but only up to a point (the 'ideal' point). They then start falling off as SKU sizes get too big to sell very well. Thus, there may be an 'optimal' SKU size out there. We cannot capture this effect with just a linear term.
# running an OLS with quadratic terms
beer.lm = lm(volsold~adspend1+price +distbn+promo+sku.size +sku.size.sq+bottle +golden+light+lite+reg +jan+feb+march+april +factor(brand2), data=x)
#view results
summary(beer.lm)
The image shows that yes, both sku.size and its quadratic (or square) term are significant at the 99% level *and* they have opposite signs. So yes, indeed, there seems to be an optimal size at which SKUs sell best, other things being equal.

  • A log-log regression takes the natural log of metric terms on both sides of the equation and runs them as an ordinary regression.
  • The output is not ordinary though. The coefficients of the logged variables can now be directly interpreted as point elasticities of sales with respect to those variables.
# run a log-log regression
beer.lm = lm(log(volsold)~log(adspend1)+log(price) +log(distbn)+log(promo) +sku.size+sku.size.sq+bottle +golden+light+lite+reg +jan+feb+march+april +factor(brand2), data=x)
summary(beer.lm)
Thus, the results say that the price elasticity of sales is -2.29, the distribution elasticity is 1.32, and the promotion and advertising elasticities are, respectively, 0.57 and 0.44.
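
To make the elasticity reading concrete, a quick back-of-envelope interpretation (approximate, holding everything else fixed):

# % change in volume ~ elasticity x % change in the marketing variable
-2.29 * 1 # a 1% price increase -> roughly a 2.3% drop in volume sold
1.32 * 5 # a 5% gain in distribution -> roughly a 6.6% gain in volume
0.57 * 10 # a 10% rise in promotion -> roughly a 5.7% rise in volume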

  • What if there are more variables than required for a regression? How to know which ones are most relevant and which ones can safely be dropped?
  • Enter variable selection - we use a scientific, complexity penalized fit metric, the Akaike Information Criterion or AIC, to decide what the best fitting set of variables is for a given regression.
  • However, there is a problem I must resolve first. Turns out this variable selection func doesn't work when we have categorical data.
  • So we first convert the categorical data to dummy variable columns.
## --- Func to build dummy vars for numeric vectors
makeDummy <- function(x1){
x1 = data.matrix(x1); test=NULL

for (i1 in 1:ncol(x1)){
test1=NULL; a1=x1[,i1][!duplicated(x1[,i1])];
k1 = length(a1); test1 = matrix(0, nrow(x1), (k1-1));
colnames0 = NULL

for (i2 in 1:(k1-1)){
test1[,i2] = 1*(a1[i2] == x1[,i1])
colnames0 = c(colnames0, paste(colnames(x1)[i1], a1[i2])) }
colnames(test1) = colnames0
test1 = data.matrix(test1); test = cbind(test, test1) }
test}

attach(x)
brands = makeDummy(brand2)
data.new = cbind(x,brands)

# running an OLS with quadratic terms
beer.lm = lm(volsold~adspend1+price +distbn+promo+sku.size +sku.size.sq+bottle +golden+light+lite+reg +jan+feb+march+april+brands, data=data.new)
#view results
summary(beer.lm)

Note that the results are the same as when we had used factor(brand2). Now we are ready to use the variable selection method on the regression results object 'beer.lm'.

## variable selection
library(MASS)
step = stepAIC(beer.lm, direction="both")
step$anova # display results
Look at the initial regression equation and the final regression equation (mentioned in blue font in the output). Look at the variables R wants dropped (marked with a '-') and those R wants added back (marked with a '+') after dropping. Thus, a simple 3-line piece of R code helps ID the best-fitting set of variables from a given set of predictors. Nothing surprising about what R told us here - drop the insignificant ones. But it could get more complicated in more involved datasets, perhaps.

See you all in Class

Sudhir

Session 6 HW and other Notes

Article Update:
Found this really neat article: Different ways in which you can become a data scientist. Given that Data sciences are going to become an element of competitive advantage in the medium term, I'd say that article is of interest to all MKTR students (regardless of background).

**********************************

Update: Here's an interesting article on job trends in the coming year. 6 startup trends in 2013: bootstrapping, marketing, B2B. Thought it relevant to highlight this part of the article for our MKTR course:

5) Marketing becomes as hot as tech. “By 2017, CMOs will be spending more on IT than CIOs. Driving this massive shift is the customer data that simply did not exist a decade ago.” - Ajay Agarwal, Bain Capital Ventures

6) Service marketplaces — not individual suppliers — will become the “brand.” Just as Amazon has become a leading brand for books (versus individual publishers), consumers will look to branded marketplaces for various services, such as teaching, cleaning, or construction. - Eric Chin, general partner, Crosslink Capital

Point? What looks arcane and abstract right now (e.g., the decision trees or targeting algorithms we used) is the future.

**********************************

Hi all,

Your Session 6 HW (on some targeting algorithms we covered in class) requires you to go through the short caselet 'Kirin' (PDF uploaded). This HW has 4 Qs as detailed below and will use a PPT submission format (the same as for the Session 5 HW). The submission deadline is 26-Dec Wednesday midnight, into a dropbox.

Some background to the Qs first. Before targeting we need to do segmentation. Q2 deals with segmentation and the interpretation of segments. However, prior to segmentation, we need to know what constructs may underlie the basis variables. Q1 deals with this aspect. You did factor analysis and segmentation already in Session 5 HW. It occurs again in this HW in Q1 and Q2 but with less emphasis. The focus is more on Q3 and Q4 where we apply the randomForest and neural net algorithms respectively.

This HW also demonstrates the importance of selecting good discriminant variables. As it turns out, the discriminant variables used here are lousy and yield remarkably low predictive accuracy rates even with such sophisticated algos. The takeaway? Methods cannot alleviate deficiencies in the data beyond a point. OK, without further ado, here we go:

Questions for Session 6 HW:

  • Q1. Find what constructs may underlie the basis variables. Use factor analysis; report the eigenvalue scree plot & factor loadings table. Answer the following Q sub-parts:
  • (i) Which variables load less than 30% onto the factor solution? (Hint: Look for Uniqueness threshold of 1-0.30 = 0.70 or above)
  • (ii) ID and label the constructs you find among the variables that do load well onto the factor solution.
  • Q2. Use mclust to segment the respondents. Answer the following Q parts.
  • (i) What is the optimal no. of clusters?
  • (ii) Report a clusplot. What is the % of variance explained by the top 2 principal components in the cluster solution?
  • (iii) ID and label the segments you find.
  • Q3. Split the kirin dataset into a training sample (first 212 rows) and a test sample (the remaining 105 rows). Train the randomForest algorithm on the training sample. Predict the test sample's segment membership (see the sketch after this question list). Answer the following Qs:
  • (i) Record predictive accuracy in both training and test samples
  • (ii) Which segment appears to have the highest error rate?
  • (iii) List the top 3 variables that best discriminate among the segments (use the Mean Decrease in Accuracy metric)
  • Q4. Use the split kirin datasets to try multinomial logit with neural nets on the training and test samples. Predict test sample's segment membership.
  • (i) Record predictive accuracy in both training and test samples
  • (ii) Which segment appears to have the highest error rate?
  • (iii) List the top 3 variables that best discriminate among the segments (use significance as a metric here)
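
As referenced in Q3, here is a hedged sketch of the split-and-train step. The object names 'kirin.discrim' (the discriminant variables) and 'segment' (the cluster memberships from Q2) are my assumptions - substitute your own object names from the HW files.

library(randomForest)

# first 212 rows for training, remaining 105 for test
train.x = kirin.discrim[1:212, ]; train.y = segment[1:212]
test.x = kirin.discrim[213:317, ]; test.y = segment[213:317]

fit.rf <- randomForest(factor(train.y) ~ ., data = train.x,
xtest = test.x, ytest = factor(test.y),
ntree = 3000, importance = TRUE)

print(fit.rf) # confusion matrices give training and test accuracy (Q3 i, ii)
varImpPlot(fit.rf) # Mean Decrease in Accuracy for Q3 (iii)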

Some Notes on why use R:

In case you are wondering why you are being asked to bother with R, some points to note:

  • It's imperative you learn how to run MKTR analyses at least once. The reason is you won't be able to effectively lead a team of people who do what you've never done. Sure, your analytics team will run the analysis, but you need to have an idea of what that entails, what to expect, etc.
  • Folks who have any programming experience at school or at work can probably vouch for the fact that the R language is about as straightforward as, well, English. Pls implement the R code *yourself* to get a feel of the platform. Pls help your peers out who may not be as comfortable with R.
  • Those who are ambivalent or undecided about using R, I say pls give it a sincere try. It's a worthwhile investment. Learning in this course is best leveraged with a basic understanding of what a versatile analytics platform does.
  • Pls share problems in the code or the data that you found, workarounds that you figured out, other packages you discovered that do the same thing better/faster etc on the blog as well. R is a community based platform and it draws its strength from a distributed user and developer base.
  • Those who are determined to not touch R are free to do so. No harm, no foul. Pls borrow the plots and tables from your peers, but interpret them yourself.
Dassit for now. Gotta get to work on tomorrow's session slides and associated blog-code.

Sudhir

Monday, December 17, 2012

Session 5 HW

Hi all,

Your 'Session 5 HW' is out, in a folder of the same name on LMS. The R code I used to test the HW is put up as a notepad. Feel free to use blocks of that R code directly for your HW.

Important: Pls read the short caselet in a PDF file 'Conglomerate PDA' in the HW folder *before* attempting the exercise. As an instructor, I assure you that if you try interpreting the analyses without reading the caselet, it will show.

Recommended:

  • Pls try the classwork examples on R before trying the HW examples. The classwork blog-post has explanations for each block of code.
  • Ensure you have the required packages loaded before you start. These are (IIRC): nFactors, cluster, mclust and rpart.

HW Questions:

  • Q1. Are there any 'constructs' underlying the basis variables used for segmentation in the PDA caselet? What might they be? Give a name to and interpret these factors/ constructs.
  • Q2. Segment respondent data using hclust, mclust and k-means (with scree-plot). Record how many segments you find. Draw clusplots for k-means and mclust outputs.
  • Q3. Characterize or 'profile' the clusters obtained from mclust. Name the segments (similar to the China-digital consumer segments reading)
  • Q4. Read-in discriminant data for the PDA case. Make a dataset consisting of the interval scaled discriminant variables only. Now plot a decision tree to see which variables best explain the membership to the largest segment (segment 1). List the variables by order of importance.
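
For Q4, a hedged sketch of the decision-tree step is below. The object names 'discrim' (the interval-scaled discriminant variables) and 'classif' (the mclust segment memberships) follow the classwork code; substitute your own names as needed.

library(rpart)

# flag membership in the largest segment (segment 1) as a 0/1 factor
tree.data = data.frame(seg1 = factor(1*(classif == 1)), discrim)

fit <- rpart(seg1 ~ ., method = "class", data = tree.data)
summary(fit) # variable importance is listed near the top of the summary
post(fit, file = "", title = "What explains membership in segment 1?", cex = 0.7)
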
Submission format is a set of PPT slides:
  • Ensure your name and PGID are written on the title slide.
  • Ensure all your plots (and the important tables) are pasted as images on the slides. Typically metafile images are best.
  • Pls give each slide an informative title and mention question number on it.
  • Pls be aware that while you are free to consult peers on the R part of the plots making, interpretation and writing up is solely an individual activity.
  • Submission deadline is Monday Midnight 24-Dec-2012 in a LMS dropbox.
Session 6 HW is in the making. Will happen shortly and its deadline will be close to this one's. FYI.

Any Qs on this HW, pls contact me directly. Pls use the blog comments pages to reach me fastest. Feedback on any aspect of the HW or the course is most welcome.

Sudhir

Sunday, December 16, 2012

Session 7 - Hypothesis Testing R code

Hi all,

The previous blog-post introduced Unstructured text analysis in Session 7. This one is a separate module - to do Hypothesis testing. In this blog-post, I post R code required for classwork examples for two common types of hypothesis tests - tests of differences and tests of association.

Module 1. Hypothesis Testing

  • t-tests for differences
Use the notepad titled 'Nike dataset.txt' for this.
## hypothesis testing
## Nike example

# read-in data
nike = read.table(file.choose(), header=TRUE)
dim(nike); summary(nike); nike[1:5,] # some summaries
attach(nike) # Call indiv columns by name

# Q is: "Is awareness for Nike (X, say) different from 3?” We first delete those rows that have a '0' in them as these were non-responses. Then we proceed with t-tests as usual.

# remove rows with zeros #
x1 = Awareness[(Awareness != 0)]
# test if x1's mean *significantly* differs from mu
t.test(x1, mu=3) # reset 'mu=' value as required.
This is the result I got:
To interpret, first recall the hypothesis. The null said: "mean awareness is no different from 3". However, the p-value of the t-test is well below 0.01. Thus, we reject the null with over 99% confidence and accept the alternative (H1: Mean awareness is significantly different from 3). Happily, pls notice that R states the alternative hypothesis in plain words as part of its output.

The second Q we had was:

# “Q: Does awareness for Nike exceed 4?”
# change 'mu=3' to 'mu=4' now
t.test(x1, mu=4, alternative=c("greater")) # one-sided test
The p-value I get is 0.2627. We can no longer reject the Null even at the 90% confidence level. Thus, we infer that mean awareness of 4.18 odd does *not* significantly exceed 4.

Next Q was: "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?”

# first make the two vectors
xm = x1[(Gender==1)]
xf = x1[(Gender==2)]
# Alternative Hypothesis says xf '>' xm
t.test(xf, xm, alternative="greater")
In the code above, we specify for 'alternative=' whatever the alternative hypothesis says. In this case it said greater than, so we used "greater". Else we would have used "less".
The p-value is slightly above 0.05, so strictly speaking we cannot reject the Null at the 95% level. However, we are almost at 95%, and the confidence level cutoffs we set are essentially arbitrary. We can choose to accept the alternative that "xf at 5.18 significantly exceeds xm at 3.73" in such circumstances. It is a judgment call.

  • chi.square-tests for Association
The data for this example is on the same google doc starting cell K5. Code for the first classwork example on Gender versus Internet Usage follows. First, let me state the hypotheses:
  • H0: "There is no systematic association between Gender and Internet Usage." . In other words, distribution of people we see in the 4 cells is a purely random phenomenon.
  • H1: "There is systematic association between Gender and Internet Usage."
# read in data WITHOUT headers
a1=read.table(file.choose()) # 2x2 matrix
a1 # view the data once
chisq.test(a1) # voila. Done!
This is the result I get:
Clearly, with a p-value of 0.1441, we cannot reject the Null even at the 90% level. So we cannot infer any significant association between Gender and Internet Usage levels. The entire sample we had was 30 people large. Suppose we had a much bigger sample but a similar distribution across cells. Would the inference change? Let's find out.

Suppose we scaled up the previous matrix by 10. We will then have 300 people and a corresponding distribution in the four cells. But now, at this sample size, random variation can no longer explain the huge differences we will see between the different cells and the inference will change.

a1*10 # view the new dataset
chisq.test(a1*10)
The results are as expected. While a 5 person difference can be attributed to random variation in a 30 person dataset, a 50 person variation cannot be so attributed in a 300 person dataset.

Our last example on tests of association uses the Nike dataset. Does Nike Usage vary systematically with Gender? Suppose we had reason to believe so. Then our Null would be: Usage does not vary systematically with Gender, random variation can explain the pattern of variation. Let us test this in R:

# build cross tab
mytable=table(Usage,Gender)
mytable #view crosstab

chisq.test(mytable)
The example above is also meant to showcase the cross-tab capabilities of R using the table() function. R peacefully does 3-way cross-tabs and more as well. Anyway, here's the result: it seems we can reject the Null at the 95% level.

Dassit from me for now. See you all in Class

Sudhir

Session 7 R code for Basic Text Analysis

Hi all,

Session 7 has the following modules:

  • Continuing Qualitative MKTR from session 6, we cover Focus Group Discussions (or FGDs) in the first sub-module. A short video 'reading' will be present.
  • The second sub-module in Qualitative MKTR - Analysis of Unstructured text - is covered next. The R code below deals with elementary text mining in R.
  • The third Module deals with building and Testing Hypotheses in MKTR. We'll use R's statistical abilities to run quick tests on two major classes of Hypotheses.
So without any further ado, here goes sub-module 1.

1. Elementary Text Mining in R

First install the required libraries as shown below and read in the data (in 'Q25.txt' in the Session 7 folder on LMS).

library(tm)
library(Snowball)
library(wordcloud)
library(sentiment)

###
### --- code for basic text mining ---
###

# first, read-in data from 'Q25.txt'
x = readLines(file.choose())
x1 = Corpus(VectorSource(x))

Now, run the following code to process the unstructured text and obtain from it a term-frequency document matrix. The output obtained is shown below.

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeNumbers)
# add corpus-specific words ('ice', 'cream') to the stopword list

myStopwords <- c(stopwords('english'), "ice", "cream")

x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)

# make the doc-term matrix #
x1mat = DocumentTermMatrix(x1)
# --- sort the TermDoc matrix --- #
# removes sparse entries
mydata = removeSparseTerms(x1mat,0.998)
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,]
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))]
dim(mydata.df1)
mydata.df1[1:10, 1:10] # view 10 rows & 10 cols

# view frequencies of the top few terms
colSums(mydata.df1) # term name & freq listed
'Stopwords' in the above code are words we do not want analyzed. The list of stopwords can be arbitrarily long.
Notice the Document Corpus x1 and the term-frequency document matrix (first 10 rows & cols only). The image below gives the total frequency of occurrence of each term in the entire corpus.

2. Making Wordclouds

Wordclouds are useful ways to visualize relative frequencies of words in the corpus (size is proportional to frequency). The colors of the words are random, though.

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1)*10, scale=c(4, 0.5), colors=1:10)
So, what can we say from a wordcloud? The above one seems to suggest that 'chocolate' is the flavor most on the minds of people, followed by vanilla. The relative importance of other words can be assessed similarly. But precious little else we find.

It would be more useful to actually see which words are used together most often in a document. For example, 'butter' could mean anything - from 'butter scotch' to 'butter pecan' to 'peanut butter'. To obtain which pairs of words occur most commonly together, I tweaked the R code a bit; the result is a 'collocation dendrogram':

# --- making dendograms to #
# visualize word-collocations --- #
min1 = min(mydata$ncol,25) # of top 25 terms
test = matrix(0,min1,min1)
test1 = test
for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}

# make dissimilarity matrix out of the freq one
test2 = (max(test1)+1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]
# now plot collocation dendogram
d <- dist(test2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendogram

Click on the image for larger size. Can you look at the image and say what flavors seem to go best with 'coffee'? With 'Vanilla'? With 'Strawberry'?

It is important not to lose sight of the weaknesses and problems that dog current text mining capabilities.

  • For instance, someone writing 'not good' versus 'good' will both have 'good' picked up by the text miner. So 'meaning' per se is lost on the miner.
  • The text miner is notoriously poor at picking up wit, sarcasm, exaggerations and the like. Again, the 'meaning' part is lost on the miner.
  • Typos etc can also play havoc. Synonyms can cause trouble too, in some contexts. E.g., Some say 'complex', others may say 'complicated' to refer to the same attribute. These will appear as separate terms in the analysis.
  • So text mining is more useful as an exploratory tool, to point out qualitatively what topics and words appear to weigh most on respondent minds. It helps downstream analysis by providing inputs for hypothesis building, for more in-depth investigation later and so on.
  • It *is* important that the more interesting comments, opinions etc be manually checked before arriving at any conclusions.
That concludes our small, elementary text-mining foray using R. R's capabilities in the text-mining arena are quite advanced, extensible and evolving. We ventured in to get a simple example done. This example however scales up easily to larger and more complex datasets of unstructured text. Onwards now to our second sub-module wherein we combine two important elements within text analysis for MKTR - social media chatter and sentiment mining.

3. Sentiment Mining of twitter data

The twitteR package showcases R's social media capabilities well. It allows you to search for particular keywords in particular geographic areas (cities, for example). Thus, you could compare the response to the movie 'Talaash' in Delhi versus, say, in Hyderabad.

For our classwork exercise, I am using twitter data on #skyfall from London, in the weekend following its release. Read the data in. Run the R code for text mining as we did above. Only after that, implement basic sentiment analysis on the tweets as follows:

#######################################
### --- sentiment mining block ---- ###
#######################################

# After doing text analysis, run this
### --- sentiment analysis --- ###
# read-in positive-words.txt
pos=scan(file.choose(), what="character", comment.char=";")
# read-in negative-words.txt
neg=scan(file.choose(), what="character", comment.char=";")
# including our own positive words to the existing list
pos.words=c(pos,"wow", "kudos", "hurray")
neg.words = c(neg)

# match() returns the position of the matched term or NA
pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")
wordcloud(rownames(b1), b1[,1], scale=c(5, 1), colors=1:10)
neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")
wordcloud(rownames(b2), b2[,1], scale=c(5, 1), colors=1:10)
Two wordclouds will appear - one only for words that have positive sentiment or emotional content, the other for negative ones.

4. Determining Sentiment Polarities

Can we measure 'how much' emotional content or intensity etc a tweet or comment may contain? Well, at least at the ordinal level, perhaps. The package 'sentiment' offers a way to measure sentiment polarities in terms of log-likelihoods of comments being of one polarity versus another. This can be a useful first step in basic sentiment analysis.

#######################################
### --- sentiment mining block II ---- ###
#######################################

### --- inspect only those tweets #
# which got a clear sentiment orientation ---
a1=classify_emotion(x1)
a2=x[(!is.na(a1[,7]))] # 447 of the 1566 tweets had clear polarity
#a3=PlainTextDocument(a2)
a2[1:10]
# what is the polarity of each tweet? #
# that is, what's the ratio of pos to neg content? #
b1=classify_polarity(x1)
dim(b1)
b1[1:5,] # view polarities table

5. Determining Sentiment Dimensions

Can we do more than just do sentiment polarities? Can we get more specific about which primary emotion dominates a particular tweet or opinion or comment? Turns out the sentiment package in R does provide one way out. How well established or usable this is in a given context is caveat emptor.

a1a=data.matrix(as.numeric(a1))
a1b=matrix(a1a,nrow(a1),ncol(a1))
# build sentiment type-score matrix
a1b[1:4,] # view few rows

# recover and remove the mode values
mode1 <- function(x){names(sort(-table(x)))[1]}
for (i1 in 1:6){ # for the 6 primary emotion dimensions
mode11=as.numeric(mode1(a1b[,i1]))
a1b[,i1] = a1b[,i1]-mode11}

summary(a1b)
a1c = a1b[,1:6]
colnames(a1c) <- c("Anger", "Disgust", "fear", "joy", "sadness", "surprise")
a1c[1:10,]
## -- see top 20 tweets in "Joy" (for example) ---
a1c=as.data.frame(a1c);attach(a1c)
test = x[(joy != 0)]; test[1:10]
# for the top few tweets in "Anger" ---
test = x[(Anger != 0)]; test[1:10]
test = x[(sadness != 0)]; test[1:10]
test = x[(Disgust != 0)]; test[1:10]
test = x[(fear != 0)]; test[1:10]
That does it for text and sentiment analysis on R. Again, these were just exploratory forays. The serious manager can choose to invest in R for delivering these analytical capabilities quickly and economically. Your exposure to this area now enables you to take that call as a manager tomorrow.

This is it for now. Will put up the hypothesis testing related R code in a separate blog-post.

Sudhir

Saturday, December 15, 2012

Session 4 HW Help

Hi all,

Thanks to the tireless efforts of one Mr Shaurya Singh, I've put up what is (almost) the solution to your Session 4 HW on LMS.

The data and code required to run all your session 4 HW exercises (due Wednesday 19-Dec midnight) are up in a folder imaginatively named 'session 4 HW' on LMS.

As you know, session 4 HW has three parts to it. The data and code for each part are placed in separate subfolders.

Instead of copy-pasting code from the blog, I urge you to copy-paste blocks of code directly from the R code.txt files in each of the subfolders. The data required to be read in are available directly as .txt files in the same sub-folders.

I now estimate no more than 30 minutes for you to run through the entire R portion of your session 4 HW. Pls ensure you have a good interpretation (or 'story') ready to be written as bullet points onto your HW slides.

My colleagues tell me there is hope the students may come back to normal with Day 1 of placements over, for the rest of the term. Well, we'll see. Cheers

Sudhir

Tuesday, December 11, 2012

Targeting in R - Session 6 PDA Caselet

Update: NYT Video link for social media as a focus group

*******************************

Hi all,

This is the code for classwork MEXL example "Conglomerate's PDA". This is the roadmap for what we are going to do:

  • First we segment the customer base using model based clustering or mclust, the recommended method.
  • Then we randomly split the dataset into training and test samples. The test sample is about one-third of the original dataset in size, following accepted practice.
  • Then we try to establish via the training sample, how the discriminant variables relate to segment membership. This is where we train the Targeting algorithm to learn about how discriminant variables relate to segment memberships.
  • Then comes the real test - validate algorithm performance on the test dataset. We compare prediction accuracy across traditional and proposed methods.
  • Since R is happening, there are many targeting algorithms to choose from on R. I have decided to go with one that has shown good promise of late - the randomForest algorithm. Where we had seen decision trees in Session 5, think now of 'decision forests' in a sense...
  • Other available algorithms that we can run (provided there is popular demand) are artificial neural nets (multi-layer perceptrons) and Support vector machines. But for now, these are not part of this course.
So without further ado, let me start right away.

1. Segment the customer Base.

To read-in data, directly save and use the 'basis' and 'discrim' notepads I have sent you by email. Then ensure you have packages 'mclust' and 'cluster' installed before running the clustering code.

# read-in basis and discrim variables
basis = read.table(file.choose(), header=TRUE)
dim(basis); basis[1:3,]
summary(basis)

discrim = read.table(file.choose(), header=TRUE)
dim(discrim); discrim[1:3,]
summary(discrim)

# Run segmentation on the basis dataset
library(mclust) # invoke library
fit <- Mclust(basis) # run mclust
fit # view result
classif = fit$classification
# print cluster sizes
for (i1 in 1:max(classif)){print(sum(classif==i1))}

# Cluster Plot against 1st 2 principal components
require(cluster)
fit1=cbind(classif)
rownames(fit1)=rownames(basis)
clusplot(basis, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)
The segmentation produces 4 optimal clusters. Below is the clusplot where, interestingly, despite our using 15 basis variables, we see decent separation among the clusters in the top 2 principal components directly.
Click on the above image for larger size.

2. Split dataset into Training & Test samples

I wrote a function that randomly splits the dataset into training and test samples. Copy-paste it as a block into the R console and we can then invoke it peacefully downstream anywhere in this R session.

# Function to split dataset into training and test samples

split.data <- function(data){ #func begins
train.index = sample(1:nrow(data), round((2/3)*nrow(data)))
a2=1:nrow(data);
test.index=a2[!(a2 %in% train.index)]

### split sample into 2/3rds training and 1/3rds test
train = data[train.index,]
test = data[test.index,]
outp = list(train, test)
outp } # func ends

# run split.data func on our data
data = cbind(basis, discrim, classif)
outp = split.data(data)

# create training and test datasets
n1 = ncol(basis); n2 = ncol(data)
basis.training = outp[[1]][,1:n1]
discrim.training = outp[[1]][,(n1+1):(n2-1)]
y.training = outp[[1]][,n2]
y.test = outp[[2]][,n2]
discrim.test = outp[[2]][,(n1+1):(n2-1)]
There, now we have the training and test samples defined both for basis and for discriminant matrices. Time to train the machine on the training dataset next.

3. Train the Targeting Algorithm

Before proceeding further, a quick note on what random forests are and how they relate to decision trees. Recall what decision trees are (loosely speaking) - 'classifiers' of sorts that pick the variable that best splits a given Y variable into 2 or more branches. For instance, in our car example in class, we saw that our Y variable 'Mileage' was best split first by 'Price' and then by 'Type'. We call this procedure 'recursive partitioning' and it is repeated at each node (or branch ending).

For a small set of X variables, individual decision trees are fine. But what if we had X variables numbering the dozens, scores, hundreds even? To solve this issue, instead of one decision tree, imagine we had 1000s of trees, each acting on random subsets of the original variable set. Each time a forest grows, we then summarize the fit obtained across different variables. Then, we grow another forest. Then another. That, in short, is the random forest algorithm. A forest of decision trees that allows us to efficiently parallelize some recursive partitioning tasks, keep track of variable importance and predictive fit, and thereby tackle large datasets efficiently. Again, this is just an intuitive explanation, it is not rigorous in a technical sense, so don't take it too literally.

Ensure you have library 'randomForest' installed before trying the below code.

### --- using the randomForest algo ---- ###
library(randomForest)

test.rf <- randomForest(factor(y.training) ~.,
data = discrim.training,
xtest = discrim.test, ytest = factor(y.test),
ntree = 3000, proximity = TRUE, importance = TRUE)
print(test.rf)

That was it. If no error showed up, the algo has executed successfully. Those few lines of code invoked a rather powerful classification algorithm that finds use in myriad scientific explorations today. Fortunately for us, it is right up our street in the Targeting arena. Let us proceed further and see how the algo has performed on in-sample and out-of-sample prediction. This is the output I got on executing the above code:

The output is as straightforward to read as the input was to run. Let me summarize in bullet points for ease of interpretation.
  • The output tells us that the Type of run was "Classification" implying that a regression based run for continuous data is very much possible (recall we did the regression based decision tree only for the Mileage example).
  • The no. of variables tried at each split (mtry) defaults to roughly the square root of the number of predictors; with 17 discriminant variables, that works out to 4 here.
  • The confusion matrix is a cross-tab between the predicted and the observed classifications. The higher the numbers along the diagonal, the better. Interestingly, we have an in-sample accuracy of ~62% but a test sample predictive accuracy of ~72%! Usually it's the other way round.
  • Does a mere 72% accuracy seem too low for such a much-touted algorithm? Aah, well, let's see. First, random guessing would give us ~25% accuracy (roughly, across 4 classes).
  • Second, there were only 107 training cases for 17 discriminant variables. If you try any parametric model like, say, Logit for this, the chances the model will give up midway are quite high indeed. And even if it goes through, don't expect much by way of variable significance in the results. In fact, let me show you an application of the multinomial Logit for this dataset shortly.
  • But what about variable significance (or 'importance') in random Forests, first? Sure, we're going there next.

4. Interpreting random Forest Output

Copy paste the following code to get variable importance tables and an associated plot. Also, the actual versus predicted segment membership tables can also be obtained using the code below.

# List the importance of the variables.
rn <- round(importance(test.rf), 2)
rn[order(rn[, (ncol(rn)-1)], decreasing=TRUE),]
varImpPlot(test.rf) # plot variable importance

# -- predictions ---
# observed and predicted y outputted
yhat.train = cbind(y.training, test.rf$predicted)# insample
yhat.test = cbind(y.test, test.rf$test[[1]])# test sample

This is the variable importance plot. It's useful to focus on the top few variables in the left-side panel.
That was it. We used a sophisticated Targeting algorithm for our decidedly tiny and simple data application. This algo scales up massively to many larger and more complex applications, BTW.

5. A neural-net based Logit Model

The Q may well arise, how do I know how good the proposed new algo really is unless I run it on something known and traditional? So, how about we run it on the well-established, parametric Logistic Regression model? Since our Y has 4 levels, we would ordinarily go for the Multinomial Logit model (in the R package 'mlogit').

Let me tell you what happens when I try to replicate the above in mlogit. The program shows error messages galore. Says that for 160 rows and 17 variables, we get a matrix close to Singular (expected, BTW). So I try reducing the number of X variables. The problem persists. Finally I give up.

And turn to artificial neural-nets.

Turns out that the R package 'nnet' implements a multinomial Logit on an underlying neural net. The function multinom() is the key here. The code below will invoke, execute and help interpret multinomial Logit results for our PDA dataset.

### --- doing multinom for Targetting --- ###
library(nnet) # use neural nets to fit a MNL!
library(car) # for Anova func call
# Build a Regression model.
a1 <- multinom(formula = y.training ~ ., data = discrim.training, trace = FALSE, maxit = 1000)

# For inference
summary(a1, Wald.ratios=TRUE)
cat('==== ANOVA ====')
print(Anova(a1))

This is what I get from the ANOVA table:
Notice right off how variable significance seems to be such a problem. Well, in predictive analytics, we're more concerned with how well the variables as a group are able to predict segment membership than with how significant any particular variable is. Hence, the block of code that follows shows how well neural nets are able to predict both within sample (on the training dataset) and on the test dataset.

### Model fit in training sample ###

a3 = round(a1$fitted.values,3) # predicted probability table, one column per segment

# build predicted Y values vector: assign each row to its highest-probability segment
yhat = matrix(0, nrow(a3), 1)
for (i1 in 1:nrow(a3)){ for (i2 in 1:ncol(a3)){
if (a3[i1, i2] == max(a3[i1,])) {yhat[i1,1] = i2}}}

# confusion matrix for in-sample fit
table(y.training, yhat)

# predictive accuracy = share of cases on the diagonal
sum(diag(table(y.training, yhat)))/nrow(discrim.training)

This is what I got:
Wow or what? The neural net netted us 76% accuracy in training-sample prediction. In contrast, we got only some 62% accuracy with random forests. Does that mean neural nets are better than random forests in this application? Well, hold on.

The test dataset is still there. How well does the neural net do there? It won't necessarily trump random forests there just because it did so in the training sample.

# build predicted Y values vector for test dataset
yhat1 = predict(a1, cbind(y.test, discrim.test), type="class")
# confusion matrix for multinom test dataset
table(y.test, yhat1)

# predictive accuracy
sum(diag(table(y.test,yhat1)))/nrow(discrim.test)

See what I mean? The neural net gets 64% accuracy compared to 72% for random forests on the test dataset. I'd say, in any given application, try both. Use whichever gives you the higher test-dataset prediction accuracy.

Well, Dassit for now. See you in class tomorrow.

Sudhir

Sunday, December 9, 2012

Targeting Tools in R - Decision Trees

Hi all,

Today's second module - Targeting - is crucial in MKTR and underlies much of the work currently being done in predictive analytics. It is also a lengthy one, so it will likely spill over into part of Wednesday's class. Just FYI.

1. Decision Trees

First off, a quick drive by on what they are and why they matter in MKTR. This is how Wiki defines Decision trees:

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility.

OK, lots of fancy verbiage there. Perhaps an example can illustrate better. Many cognitive-logical decisions can be represented algorithmically in a tree-like structure. For example - "Should we enter market A or not?" Imagine two paths out of this question - one saying 'yes' and the other 'no'. Each of these paths (or 'branches') can then be split further into more branches, say, 'cost' and 'benefit', and so on till a decision is reached.

Essentially, we are trying to partition this dataset along that branch network which best explains variation in Y - our chosen variable of interest.

2. Building a simple Decision Tree

The dataset used is 'cu.summary', which comes bundled with the 'rpart' package. No need to read it in; it's available once rpart is loaded. Exhibit 3 in the handout gives some rows of the dataset.

# view dataset summary #
library(rpart)
data(cu.summary)
dim(cu.summary)
summary(cu.summary)
cu.summary[1:3,]
This is the dataset summary.

Make sure you have the 'rpart' package installed (the library(rpart) call above needs it) before doing the rest.

# Regression Tree Example

# grow tree
fit <- rpart(Mileage~Price + Country + Reliability + Type,
method="anova", data=cu.summary)

summary(fit) # detailed summary of splits
The summary of results gives a lot of things. Note the formula used, the number of splits (nsplit), the variable importance (on a constant sum of 100) and hajaar detail on each node. Of course, we'll be plotting the nodes, so no problemo.
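By the way, if you want a more compact view than summary(fit), rpart also provides printcp() and plotcp(), which show the splits and the cross-validated error at each tree size; a quick sketch:

printcp(fit) # splits and cross-validated error by complexity parameter
plotcp(fit) # visual aid for deciding where to prune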

OK. Time to plot the tree itself. Just copy-paste the code below.

par(col="black", mfrow=c(1,1))
# create attractive postcript plot of tree
post(fit, file = "", title = "Regression Tree for Mileage ", cex=0.7)
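In case the post() output is hard to read on your machine, a plainer sketch using rpart's standard base-graphics plotting should also work:

plot(fit, uniform=TRUE, margin=0.1) # draw the tree skeleton
text(fit, use.n=TRUE, cex=0.8) # label the nodes; use.n adds the no. of observations per node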

These algorithms follow certain rules - start rules, stop rules, etc. - and essentially operate by optimizing some splitting criterion. In rpart's case, splits are chosen to reduce node impurity (for a regression tree like ours, the within-node sum of squares; for classification trees, measures such as the Gini index or entropy). However, in some critical applications, one may want to go beyond impurity reduction and, say, assess the statistical significance of a node split into branches. In such delicate situations, nonparametric conditional inference trees ride to the rescue. Ensure you have the 'party' package installed before trying the following:

##--- nonparametric conditional inference ---
library(party)

fit2 <- ctree(Mileage~Price + Country + Reliability + Type,
data=na.omit(cu.summary))

plot(fit2)
When should you go for one (rpart) or the other (party)? Traditional decision trees are fine in most applications and are also more intuitive to explain and communicate, so that is probably what you want to stick with by default.

3. Multivariate Decision Trees

For this variation in decision trees, copy data from cells A1 to I52 in this google spreadsheet. This is data from your preference ratings for the term 4 offerings. Your 4-dimensional preference vector now becomes the Y variable here.

The task now is to partition the dataset along its demographic variables so as to best explain the distribution of the preference vector. Ensure you have the 'mvpart' package installed before trying the following code.

##--- multivariate decision trees ---
data = read.table(file.choose(), header = TRUE) # pick the saved data file

library(mvpart)
# the 4 preference ratings (columns 1-4) form the multivariate Y; demographics are the splitters
mvpart(data.matrix(data[,1:4])~workex+Gender+Engg+Masters+MktgMajor, data)
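Aside: since the data come from a spreadsheet, you could also paste them straight from the clipboard instead of saving a file first. A hedged sketch (the 'clipboard' connection works on Windows; on other platforms you'd need a different route):

data = read.table("clipboard", header = TRUE) # read the copied cells directly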

A couple of points as I sign off.

  • The algos we executed today with just a few copy-pastes of generic R code are actually quite sophisticated and very powerful. They are now available to you on a platter, thanks to R.
  • Commercial implementations of these very algos can easily cost thousands of USD per installation. Those savings are very real, especially if yours is a small firm or startup. I hope the significance of this finds appreciation.
  • The applications are more important than the coding. The interpretation, the context and the understanding of what a tool is meant to do matter more than merely getting code to run.
  • My effort is to expose you to the possibilities, so that when opportunity arises, you will be able to connect tool to application and extract insight in the process.

Well, this is it for now from me. The targeting-based algos - random forests and neural nets, alongside good old Logit - will come next session. See you in class tomorrow.

Sudhir

Segmentation Tools in R - Session 5

Hi all,

Welcome to the R codes to Session 5 classroom examples. Pls try these at home. Use the blog comments section for any queries or comments.

There are 3 basic modules of direct managerial relevance in session 5 -

(a) Segmentation: We use cluster analysis tools in R (Hierarchical, k-means and model-based algorithms)
(b) Decision Trees: Using some sophisticated R algorithms (recursive partitioning, conditional partitioning, random forests)
(c) Targeting: Using both parametric (Multinomial Logit) and non-parametric (Machine learning based) algorithms.

Now it may well be true that one session cannot do justice to this range of topics. So even if it spills over into other sessions, no problemo. But I hope and intend to cover this well.

This blog-post is for Segmentation via cluster analysis. The other two will be discussed in a separate blog post for clarity.

There are many approaches to doing cluster analysis and R handles a dizzying variety of them. We'll focus on 3 broad approaches - agglomerative hierarchical clustering (under which we will do basic hierarchical clustering with dendrograms), partitioning (here, we do K-means) and model-based clustering. Each has its pros and cons. Model-based is probably the best around and highly recommended.

1. Cluster Analysis Data preparation
First read in the data. USArrests is pre-loaded, so no sweat. I use the USArrests dataset example throughout for cluster analysis.

#first read-in data#
mydata = USArrests

Data preparation is required to remove variable scaling effects. To see this, consider a simple example. If you measure weight in Kgs and I do so in Grams - all other variables being the same - we'll get two very different clustering solutions from what is otherwise the same dataset. To get rid of this problem, just copy-paste the following code.

# Prepare Data #
mydata <- na.omit(mydata) # listwise deletion of missing
mydata.orig = mydata #save orig data copy
mydata <- scale(mydata) # standardize variables
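A quick sanity check, in case you want to convince yourself the standardization worked (a minimal sketch using the scaled mydata from above):

round(colMeans(mydata), 3) # each variable's mean should now be ~0
apply(mydata, 2, sd) # and each standard deviation should be 1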

2. Now we first do agglomerative hierarchical clustering, plot dendrograms, slice them at different heights and see what happens.

# Ward Hierarchical Clustering
d <- dist(mydata, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display the dendrogram
Click on image for larger size.

Eyeball the dendrogram. Imagine slicing horizontally through the dendrogram's longest vertical lines, each of which represents a cluster. Should you cut it at 2 clusters or at 4? How do you know? Sometimes eyeballing is enough to give a clear idea, sometimes not. Suppose you decide 2 is better. Then set the optimal no. of clusters 'k1' to 2.

k1 = 2 # eyeball the no. of clusters
Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# cut tree into k1 clusters
groups <- cutree(fit, k=k1)
# draw dendrogram with red borders around the k1 clusters
rect.hclust(fit, k=k1, border="red")
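Before moving on, you may want a quick look at how many states fall in each hierarchical cluster and what those clusters look like on the original (unscaled) variables. A small sketch, assuming groups and mydata.orig from above:

table(groups) # cluster sizes
aggregate(mydata.orig, by=list(cluster=groups), FUN=mean) # cluster means on the original scale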

3. Coming to the second approach, 'partitioning', we use the popular K-means method.

Again, the Q arises: how to know the optimal no. of clusters? Eyeballing the dendrogram might sometimes help. But at other times, what should you do? MEXL (and most commercial software too) requires you to magically come up with the correct number as input to K-means. R does one better and shows you a scree plot of sorts - how the within-segment variance (a proxy for clustering solution quality) varies with the no. of clusters. So with R, you can actually take an informed call.

# Determine number of clusters #
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #
Look for an "elbow" in the scree plot. The interior node at which the angle formed by the 'arms' is the smallest. This scree-plot is not unlike the one we saw in factor-analysis. Again, as with the dendogram, we get either 2 or 4 as the options available. Suppose we go with 2.

# Use optimal no. of clusters in k-means #
k1=2
Note: If for another dataset, the optimal no. of clusters changes to, say, 5 then use 'k1=5' in the line above instead. Don't blindly copy-paste that part. However, once you have set 'k1', the rest of the code can be peacefully copy-pasted as-is.

# K-Means Cluster Analysis
fit <- kmeans(mydata, k1) # k1 cluster solution
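A couple of quick checks on the k-means solution may also help; a sketch, assuming the fit object just created:

fit$size # no. of observations in each cluster
round(fit$betweenss/fit$totss, 3) # share of total variance captured between clusters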

To understand a clustering solution, we need to go beyond merely IDing which individual unit goes to which cluster. We have to characterize each cluster, interpret what it is that's common among its members, and give each cluster a name and an identity, if possible. Ideally, after this we should be able to think in terms of clusters (or segments) rather than individuals for downstream analysis.

# get cluster means
aggregate(mydata.orig,by=list(fit$cluster),FUN=mean)
# append cluster assignment
mydata1 <- data.frame(mydata.orig, fit$cluster)

OK, that is fine. But can I actually, visually, *see* what the clustering solution looks like? Sure. In 2 dimensions, the easiest way is to plot the clusters on the 2 biggest principal components that arise. Before copy-pasting the following code, ensure you have the 'cluster' package installed.

# Cluster Plot against 1st 2 principal components
# vary parameters for most readable graph
library(cluster)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE,labels=2, lines=0)
Two clear-cut clusters emerge. Missouri seems to sit on the border between the two. Some overlap is also seen. Overall, clusplot puts a nice visualization over the clustering process. Neat, eh? Try doing this with R's competitors... :)

4. Finally, the last (and best) approach - Model based clustering.

'Best' because it is the most general approach (it nests the others as special cases), is the most robust to distributional and linkage assumptions and because it penalizes for surplus complexity (resolves the fit-complexity tradeoff in an objective way). My thumb-rule is: When in doubt, use model based clustering. And yes, mclust is available *only* on R to my knowledge.

Install the 'mclust' package for this first. Then run the following code.

# Model Based Clustering
library(mclust)
fit <- Mclust(mydata)
fit # view solution summary
The mclust solution has 3 components! Something neither the dendrogram nor the k-means scree plot predicted. Perhaps the assumptions underlying the other approaches don't hold for this dataset. I'll go with mclust simply because it is more general than the other approaches. Remember, when in doubt, go with mclust.
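If you'd like to see the BIC comparison behind that choice visually, a sketch like the following should work (the exact plotting interface can vary a little across mclust versions):

plot(fit, what = "BIC") # BIC by no. of components, for each covariance-structure family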

fit$BIC # lookup all the options attempted
classif = fit$classification # classifn vector
mydata1 = cbind(mydata.orig, classif) # append to dataset
mydata1[1:10,] #view top 10 rows

# Use only if you want to save the output
write.table(mydata1,file.choose())#save output
The classification vector is appended to the original dataset as its last column. We can now easily assign individual units to segments.

Visualize the solution. See how exactly it differs from that of the other approaches.

fit1=cbind(classif)
rownames(fit1)=rownames(mydata)
library(cluster)
clusplot(mydata, fit1, color=TRUE, shade=TRUE,labels=2, lines=0)
Imagine if you're a medium sized home-security solutions vendor looking to expand into a couple of new states. Think of how much it matters that the optimal solution had 3 segments - not 2 or 4.

To help characterize the clusters, examine the cluster means (sometimes also called 'centroids') for each basis variable.

# get cluster means
cmeans=aggregate(mydata.orig,by=list(classif),FUN=mean); cmeans
Seems like we have 3 clusters of US states emerging - the unsafe, the safe and the super-safe.

Now, we can do the same copy-paste for any other dataset that may show up in classwork or homework. I'll close the segmentation module here. R tools for the Targeting module are discussed in the next blog post. Any queries or comments, pls use the comments box below to reach me fastest.

Sudhir