Friday, November 30, 2012

External Speaker - Talk Agenda

Hi all,

Mr Sanjay Dutta of ITC will deliver a talk as an external speaker for the course on Friday 7-Dec from 3 to 5 pm. Attendance is mandatory.

Mr Dutta laid out the agenda for the talk very well in his email, partially reproduced below:

The perspective, Prof, that I want to bring in is the challenge Market Research faces due to its scrappy record of providing meaningful insights that business can use. For far too long, MR has been content supplying data and answering here-and-now questions, churning project after project. That has been their business model. When it comes to insights, to predictions, most MR agencies have shied away, even the biggest and best in the field. We work with all of them.

There are many forces at work: from people having little time or patience to answer survey questionnaires or attend focus groups, to the growing realization that the decisions we make are mostly emotional and instinctive, seemingly irrational, and hence not very amenable to a question-answer format in which coming across as a rational, logical person is a non-negotiable self-image. Enter ethnography, tracking over the net and the analysis of the big data that results, Behavioral Economics, and neuro-marketing, to name a few. Today we know more about the consumer than we ever could before, and we are revealing ever more data about ourselves.

Analytics is the skill gap at the moment in converting this mountain of data into business-friendly insights.

Research is evolving rapidly. Any conference one attends or paper one sees nowadays shows a new set of agencies and skill sets very different from the traditional players.

We are conducting elaborate validation studies with some of those specialists, in areas like Neuroscience and Behavioral Economics. The evidence emanating from these new methods on how the consumer feels, thinks and acts is quite different from what has been assumed so far. It is scientific and compelling. Our research protocols may change after those validations.

The presentation is meant to provide, in two hours, a peek into the kind of role research should start playing soon, by exposing students to the challenges and the excitement of the new that lies ahead.
I hope you are as excited about the talk as I am. I'll provide more detail in due course.

Sudhir

Thursday, November 29, 2012

Notes on Session 2 - Mohali

Hi all,

First off, an affable welcome to the final registrants for MKTR. I hope the course meets your expectations.

Session 2 is done. Kindly bookmark this blog and visit it periodically for updates.

1. Re the in-class readings for Session 2, here are the original sources:

Reading 1 - Meet the New Boss: Big Data
Reading 2 - The Unbearable Lightness of Opinion Polls
Readings 3 and 4 - Book excerpts from Malcolm Gladwell's Blink

 ****
2. About the homework assignments for session 2:

There are 2 pieces to the homework. Both are straightforward websurvey filling exercises. Here are the websurvey links:

Perceptual Map input using Term IV Core Courses as stimuli

Multidimensional Scaling Map Input


The first websurvey basically collects data from you for what we call Attribute Ratings (AR) based perceptual mapping, to be used in Session 4. It also contains a small section with open-ended Qs on general MKTR topics. The external speaker, Mr Sanjay Dutta of ITC, has asked for this information to incorporate into his talk. He will speak about real-world MKTR applications on Friday, 7-Dec, 3-5 pm.

The second websurvey collects data using the Overall Similarity (OS) method of perceptual mapping, again for Session 4. Both surveys additionally feature a psychographics section. Please fill out *both* surveys sincerely. I expect each to take no more than 15-20 minutes.


***

3. The AA Mr Ankit Anand tells me that some students have ignored the seating arrangement and the name-tags requirement while in class. This is causing issues with attendance and CP marking, both grade-bearing components in this course. Henceforth, pls stick to the seating chart and use your name-tags. Folks not following these instructions risk losing some attendance and CP marks.


***

4. Pls use the poll alongside and vote for your choice. I'm told that ASA expressly forbids changes to the grading scheme after session 1, so no guarantees anything will change should you vote against the end-term. FYI, the end-term is open-notes and short-answer/MCQ based (much like the pre-reads quiz sample you saw in class today). So it is designed to not be burdensome, anyhow.

****

5. For those curious to know what is coming up in Session 3, here's a blog post on Session 3 topics from last time. Not much has changed since then. I'll make a new blog post for Session 3 Mohali with tags 'Mohali' and '2012' to make it easier to search. The above blog post contains a gentle introduction to R, some R code to start with, and so on.

***

OK, That's all from me for now. Will see you all in session 3 then.

Sudhir

Wednesday, November 28, 2012

Session 1 related material

Hi all,

Session 1 is over. In case folks want to see where the in-class readings were sourced from, here is a list:

Reading 1: "Have Breakfast… or…Be Breakfast!" 
Reading 2: The magic of good service
Reading 3: Even in emerging markets, Nokia's star is fading
Reading 4: Join them or Joyn them
Reading 5: Jobs of the future

About pedagogy going forward
The parts of the pedagogy that make MKTR distinctive: pre-reads quick-checks using index cards, in-class reads, this blog, and R.

About this blog
This is an informal blog that concerns itself solely with MKTR affairs. It's informal in the sense that it's not part of the LMS system and the language used here is more on the casual side. Otherwise, it's the relevant place to visit if you're taking MKTR. Pls expect MKTR course related announcements, R code, Q&A, feedback etc. on here.

In term 5, students said that using both LMS and the blog (not to mention email alerts etc.) was inefficient and confusing. I'm hence making this blog the single point of contact for all relevant MKTR course info henceforth. LMS will be used only for file transfers, and email notifications will go to the class only in emergencies. Each session's blog post will be updated with later news appearing at the top of the post. Kindly bookmark this blog and visit regularly for updates.

About Session 2
Session 2 deals with psychographic scaling techniques and delves into the mechanics of questionnaire design. Am not sure 'mechanics' is the right word here because questionnaire design is more of an art than an exact science.

There will be a series of 3 homework assignments for session 2. These concern merely filling up surveys that I will send you (this data we will use later in the course for perceptual mapping and segmentation). Nothing too troublesome, as you can see.

The two pre-reads for session 2 might appear lengthy-ish but are easy reads (I think). And yes, there will be a short pre-reads based quick-check on them. So do please read them and come.
The pre-reads for each session are listed in page ix of the course-pack.

OK, see you folks in session 2 then.

Sudhir

Monday, November 26, 2012

MKTR @ ISB, Term 6 in Mohali.

Update:
Might as well clear some confusion. The homeworks and the end-term will each be 40% of the total grade. CP will be 12%. I'd initially put CP at 22% but later realized that for a non-case-based course like MKTR, having such a large CP component, one that could by itself swing letter grades, might not be wise.
Now, I hear ASA is saying I can't change the grading criteria set in session 1 and cannot use a course project even if students may want it later on. Well, I've asked for clarifications from ASA, let's see.

Sudhir

Hi Class of 2013 at Mohali,

The MKTR course starts today. Welcome to the informal course blog.

There have been a lot of changes, revamps, streamlining, additions etc based on feedback, hindsight and some foresight this year.

I've taken up MEXL as one of the main software platforms for MKTR this year based on popular preference. The other platform MKTR will use is R.

I don't want anyone to feel iffy about the use of R. In SPSS or MEXL you use a menu-driven process to execute analysis. I have endeavored to make R as close to menu-driven in ease of use as possible. Most assignments and MKTR tools on R have ready code associated with them. I will publish the code needed to solve any assignment I give on this blog. All you need to do is essentially copy-paste my R code to run the same analysis. Pls see this blog-post as an example.
Bottomline: The course is not about you "learning" to code on R but to use ready-made code to get R to do what you want.

One reason why R is on this course is that it allows me to introduce a wide range of new research methods that are only now coming into the mainstream and that will take a long while before they make their way into commercial software packages. Things like text mining (see this blog-post for what we did in term 5), Twitter stream extraction and analysis (see this blog-post) etc. are just a few copy-pastes of R code away. Good luck trying those things on MEXL, JMP or SPSS.

Actively seeking student feedback has helped me course-correct in the past term. I intend to continue this practice.

More later today when we meet in class.

Sudhir

Sunday, November 25, 2012

Some project related Q&A

Hi all,

Been receiving project-specific emails, but some of the answers may benefit from wider dissemination. So here goes.

Hello Professor,

We are currently working to finish our marketing research assignment. We have a query about the perceptual map output we are getting from our survey.

We got 3 segments using mclust. We still haven't interpreted them. However, we used this segmented data to draw our perceptual maps for each segment. I am attaching the outputs with this email. We are not sure how to interpret the perceptual map for segment 1. Please help us interpret this output.

Also, please have a look at segment 2 and 3 output as well, and please confirm if they make sense to you

Many Thanks!
G

My response:
Hi G and team,

The blue arrows should be the attributes and the red dots the brands. Your maps seem to be the other way round.

Pls use the transpose of mydata ('t(mydata)' in R) as input to the JSM procedure. The other option is doing it in MEXL.

Interpretation should be straightfwd after that, I think.

Sudhir

***************************

Here's my response to one team that ran into clustering issues.

Hi S and team,

Some quick observations from my side:

1. Why factorize the demographics? Usually these would be used as discriminants. But if you’ve good reason to do this and the factors that emerge make sense, then fine, I guess.

2. What psychographics did you use and how many Qs do you have? Ideally, they would revolve around lifestyle habits. If factorizing them gives useful constructs as factors, then use the factor scores of the factors of interest as inputs to your downstream cluster analysis. In R-code terms, we'd have:

# view & save factor scores #
fit$scores[1:4, ] # view factor scores

#save factor scores (if needed)
write.table(fit$scores, file.choose())
The fit$scores object contains the factor scores.

3. Now, you needn’t even use all the factors that emerge, just the ones that seem meaningful. For the rest of the variables (loading onto factors that do not seem interesting), you can directly use the variables as-is instead of its factor score in downstream cluster analysis.

4. Typically, one would want to segment on the basis of needs or benefits sought. All other observed characteristics of consumers would become discriminant variables. Demographics usually fit this bill. And also purchase history data if such exist. I would suggest trying a logit:

segment membership = f(demographics, other factors)

to see if there is any systematic relation between the needs-based psychographic segments you got and the observed discriminant variables (a short R sketch follows after point 5 below). Even if there is none, try profiling the segments that emerge by eyeballing the centroids of the demographics of those segments to see if there is any link between the two.

5. From segmentation can flow segment sizes and from segment profiles can emerge campaign ideas for selling to different segments. And so on and so forth.
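
To make point 4 concrete, here is a minimal R sketch of profiling needs-based segments with demographics. It is illustrative only: the data frame 'df' and the columns 'segment' (cluster membership from mclust), 'age' and 'income' are hypothetical names, and multinom() from the nnet package is used as one convenient way to run the segment-membership logit.

# hypothetical data frame 'df' with columns: segment (1/2/3 from mclust), age, income
library(nnet)                        # provides multinom() for a multinomial logit
df$segment = factor(df$segment)      # treat segment membership as categorical
fit = multinom(segment ~ age + income, data = df)
summary(fit)                         # do the demographics discriminate between segments?

# eyeball segment profiles: demographic centroids by segment
aggregate(cbind(age, income) ~ segment, data = df, FUN = mean)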

Hope that helped.

Sudhir

**************************

This one from a team having mlogit issues:

Dear Professor,

We are facing some issues in identifying factors in R using mlogit. We were wondering if we can seek 10 minutes of your time to help us run the code for R to get the result. Please let us know what would be a good time today to have this discussion.

PS: Attached is the file that we are trying to work with. The columns on the left are the factors that impact the rightmost column, "contrib". We wanted to run mlogit to assess which factor has the max impact. But the code on the blog was for discriminant analysis.

Regards,

N

My response:
Folks,

This is straightfwd regression, no need to go logistic here.

The dependent variable ‘contrib’ appears to be continuous. Then after you define the conceptual model:

Y = f( X1, X2,…)

Just run a simple linear model for f(.), or perhaps some variation of a linear model: log-log, quadratics, interaction terms and the like. Regression results are also easy to interpret.
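
Since I haven't seen the team's actual column names, here is only a rough sketch of what such a regression run might look like in R. It assumes the data sit in a .csv with a continuous 'contrib' column and the candidate drivers as the remaining columns; 'X1' and 'X2' in the commented line are hypothetical names.

# read in the data; assumes a .csv with headers, 'contrib' plus predictor columns
dat = read.csv(file.choose(), header = TRUE)
summary(dat)

# plain linear model: regress contrib on all the other columns
fit = lm(contrib ~ ., data = dat)
summary(fit)    # check signs and significance of each driver

# a possible log-log variant (only if contrib and the Xs are strictly positive):
# fit2 = lm(log(contrib) ~ log(X1) + log(X2), data = dat)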

Sudhir

********************************

Another one on mlogit:

Dear Sir,
I am completely stuck at mlogit. I have made my file, 1st col of dependent variable (binary) followed by 17 col of independent variables.
Whenever I try to read the data (be it csv or notepad), it shows the following error:

discrim = read.table(file.choose(), header=TRUE)
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 13 elements
> dim(discrim); discrim[1:17,]
Error: object 'discrim' not found
> discrim = read.table(file.choose(), header=TRUE)
Error in file.choose() : file choice cancelled
> dim(discrim); discrim[1:17,]
Can you pls suggest what I am doing wrong, and what it means ('discrim' not found)?
D

My response:
Hi D,

Some causes that come immediately to my mind:

1. The file 'discrim' wasn't read-in properly. So either there are spaces in the column headers or some variables are character strings.

2. In such a scenario, it is always safer to read in from a .csv file rather than a .txt one.

3. As long as this error is showing, the file wasn't even read in correctly.

"Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
line 1 did not have 13 elements"

It's typically good practice to view a few rows of the file right after reading it in, to see if all is well.

discrim[1:5,] etc.

4. 17 X variables in a logit is quite a lot. Ensure you have a sufficient amount of data (#observations) to sustain such an estimation exercise. It might take a few minutes of run-time in R.
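
For what it's worth, here is a minimal sketch of the read-in and logit steps. It assumes the data have been re-saved as a .csv with clean, single-word column headers and that the binary outcome column is named 'y' (a hypothetical name); glm() is used here as one simple way to run a binary logit.

# read in from .csv and sanity-check before modeling
discrim = read.csv(file.choose(), header = TRUE)
dim(discrim); str(discrim)      # no. of rows/cols and each column's type
discrim[1:5, ]                  # eyeball the first few rows

# binary logit on all 17 predictors
logit.fit = glm(y ~ ., data = discrim, family = binomial)
summary(logit.fit)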

Good luck. I'm off to Mohali tomorrow but shall be available over email. I'd prefer the blog for such Q&A in future.

Hope this helps. Let me know either way.

Sudhir

More as they come.

Wednesday, November 21, 2012

Project deadline extended by 24 hours

Hi all,

Notes on Project expectations:
Had a conversation with a couple of folks. Here's to set some project-related expectations straight.

  • The max slide limit is now down to 25 from 30.
  • This doesn't mean you have to fill-up 25 slides. It just says don't exceed 25 slides. If you are able to fit in a good story within 15 or 20 slides, that's perfectly OK.
  • Use at least 2 tools from our MKTR toolkit. You don't have to use everything we covered. Any 2 tools - say, text mining + p-maps, regression modeling + p-maps, or surveys + secondary data, etc. - is good enough.
Guidance on R resources:
People asked for some good R links which they could work off directly. I think an excellent resource (whose format I borrowed for the blog R code) is Quick-R. Here's a link for factor analysis in R.
  • In the above Quick-R link, look at the pane on the left for links to most functions you have worked with in MKTR. And the format is similar to what we are used to on the blog.
  • As an example, here's Quick-R on cluster analysis. As you can see we only used a subset of these functions in our course. Down the line you can explore more of this functionality for your purposes.
  • Why restrict ourselves to just the tools we did in MKTR? Here's ready to run R code for time series analysis, courtesy Quick-R, of course. Here's stuff on Generalized linear (regression) models from Quick-R.
  • More generally, R's own help functions are quite useful. If you want to see the syntax and/or description for any function in R (say, mclust), then simply type "?mclust" (without the quotes) and hit enter. It opens an HTML page with explanations.
  • I must mention a special thanks to Mr Mohd. Nizam who was kind enough to give me all the R packages we have used in the course in binary form. I plan to upload them on LMS for term 6 Mohali so that students can directly download and install all the required packages at the start itself.
  • I will continue to be available over phone and email even when in Mohali should further queries on R arise beyond this short course. I'd prefer the blog for communication - to reinforce its value as a platform for our MKTR on R ideas.
  • Oh, in case you needed a little more selling-to regarding R's extensibility, here's a short list of lists of what all R can do at present. I hope you'll find it useful to convince others (and yourself, if necessary) that it may be worthwhile investing effort in R.
*******************

Received this communication from Shouri.

For the future, it'll ease our burden a bit if you have only one channel of communication - maybe have everything on the blog and nothing on LMS/mail (except for urgent notifications). Also, the course is understandably assignment-heavy to get our hands dirty with data - maybe one of these components (the group project or the end-term) could then be eliminated to make the course load manageable. In my opinion, the course was a lot of work. Thank you.
My response:
Thanks for the feedback, Shouri.

Points noted and in fact, based also on other feedback I have received, I intend to act on it for the Term 6 MKTR course in Mohali.

(1) I'll drop the course project and re-allocate its points among CP, HWs and exam.
(2) I'll make the blog the single point of contact for all course-related announcements. LMS will be used only for file up/downloads.
Yes, I'm keen on the course being known as not overly burdensome. I'm happy to make efforts to lighten the workload as far as feasible, to that end.

Sudhir

****************************

I hope the exam wasn't overly bothersome. Received this email from Abhishek:

Dear Professor,

Request you to please postpone the submission of project to next Wednesday. A lot of people have exams and would also need the much awaited break.
Would really appreciate if you give it a thought. Thanks!

Best Regards
Abhishek

and my response is here:
Hi Abhishek,

Thanks for the feedback. I'm happy to do as you say but there's a problem - I shouldn't impose term 5 work once term 6 has started.

Hence, as a compromise, I shall extend the deadline till Monday midnight, an extension of some 24 hours, for whatever that is worth.

Ideally, I'd rather folks submitted their work prior to that itself. <25 slides and nothing very fancy expected - just evidence that you have understood the context and formulated the problem well. The rest (data collection, analysis) follows accordingly. Additional material can go into the annexes, which can be referred to directly from within the main PPT. However, anything really important should go into the main PPT itself. There's no guarantee annexes will be seen or evaluated.

Sreenath and Chandana, kindly set the dropbox deadline accordingly.

Regards,

Last but not least...
Sudhir

Monday, November 19, 2012

Exam related Q&A

Hi all,

Shall post answers to Qs I receive that I think can do with greater dissemination.

Update: Got these Qs from Amrita over email today. My responses are prefaced with '>>'

Hi Amrita,

Need some clarification on the following questions based on hw-session 8-

1. Pls explain the difference b/w standardized and unstandardized co-efficients? What are they used to interpret respectively?

>> Ordinarily, we cannot compare unstandardized betas because the scale of the variable affects the coefficient. E.g., measuring weight in kgs instead of grams would raise the weight coefficient 1000 times. Standardized coeffs are the coeffs we get when the X variables have been standardized (i.e. transformed to mean 0, variance 1). These coeffs can hence be compared with one another.
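
In case it helps, here is a small, purely illustrative R sketch of getting standardized betas by scaling the variables before running lm(). The data frame 'df' and the columns 'sales', 'price' and 'adspend' are hypothetical names.

# unstandardized betas (scale-dependent)
fit.raw = lm(sales ~ price + adspend, data = df)

# standardize Y and the Xs (mean 0, variance 1), then re-run
df.std  = as.data.frame(scale(df[, c("sales", "price", "adspend")]))
fit.std = lm(sales ~ price + adspend, data = df.std)
coef(fit.std)   # these coefficients are now comparable across predictors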

2. For Q4, HW session 8, what do we refer to (which column of the coefficients table) to gauge the impact on sales? Also, just an addition to point 2: usually we refer to the beta coefficients (and to the t-statistic in the absence of beta coefficients), and as per that the ranking should be LnPrice, Bud, Miller, LnSize, Promo, Light. However, as per the solution it is Bud, Miller, LnPrice, LnSize, Promo, Light, which is according to the t-statistic. Hence need a clarification on this.

>>If standardized coeffs are given, we use them. If not, we usually go with the (absolute) value of the t-stat. In the exam, your R output might not contain the standardized coeffs; in that case, go with the t-stats. This was mentioned in a blog-post also. We look at the absolute value of the t-stat or the standardized betas. I used the absolute value (i.e. ignoring sign) of the t-stats to build the list in Q4, HW session 8.

3. For q1, session8, pls advise whether we need to refer to the ‘ standard coefficient- beta’ column or the ‘t-statistic’ column?

>> Either is fine. If the std beta coeff is given, go with it. I went with the t-stat. I think much of this was explained in the blog posts mentioned in my last email to the class. Pls go through those again carefully.

Sudhir

************************************

Shouri has sent in some more Qs. My responses to his earlier set of Qs is here.

My responses are prefaced with a '>>' symbol.

1. A scale, say, from Least likely (1) to Most likely (7) - this is strictly an ordinal scale, right? Or can it pass off as an interval scale?

>>A 1-7 Likert is an interval scale. 'xyz likely' is scale guidance. An interval scale has ordinal (direction) properties, so yes, it's ordinal too.

2. Please confirm that we are not covering the article “Forecasting the adoption of a new product”

>> We're not. More generally, *anything* not covered in class in both sections won't land up in the exam. I'd say go easy on the pre-reads as well. The in-class reads and guest lecture topics too are not important from the exam point of view, besides.

3. Is a perceptual map plotted on principal components or factors?

>> Principal components. More generally, whenever we want to "project" data into 2 dimensions for plotting or visualization purposes, we go with principal components. When we want to know about possible underlying constructs etc, we use common factor analysis (factor loadings table and all that).
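
To make the distinction concrete, here is a small illustrative sketch in R; 'ratings' is a hypothetical respondents-by-attributes numeric matrix.

# principal components: for projecting data into 2-D (perceptual-map style plots)
pc = prcomp(ratings, scale. = TRUE)
summary(pc)     # variance explained by each component
biplot(pc)      # quick 2-D map of rows and attribute directions

# common factor analysis: for uncovering underlying constructs
fa = factanal(ratings, factors = 2, scores = "regression")
fa$loadings     # the factor loadings table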

4. Surrogate analysis has been mentioned as a qualitative way of demand estimation. Can you briefly explain what this is?

>> It is taking an existing good or service as a surrogate for the proposed good or service and then trying to estimate demand for the new offering based on the evolution of demand for the identified surrogate. Thus, for example, we could say that 3D printers will go the way of 2D photo printing in their eventual adoption by the private sector - i.e. people can print photos at home but don't; they prefer photo kiosks for this. Similarly, people could print objects at home in 3D but probably won't, and will prefer a kiosk model for this too. Etc.

I couldn't cover this in the second section, so it's out of exam bounds. Again, *anything* in the slides that was not taken up in class is out of bounds.

Hope that clarifies. Shall append more Q&As as received to the top of this post.

Sudhir

Saturday, November 17, 2012

How to interpret mlogit Output

Hi all,

Have received several emails asking for more detail on the mlogit output interpretation. I shall use the old mlogit example we did in session 8 classwork. To recap, the following 2 images give the context to the problem:

The image above shows that the dependent (or Y) variable is discrete and can take one of 3 values - EDLP, Deep-infrequent and Frequent-shallow. For convenience, let us denote them as 1, 2, 3. These 1, 2, 3 are nominal and have no other interpretation. The second image below shows what we have set out to do.
So, what we have set out to do includes standard regression stuff such as estimating the direction (positive or negative), as well as some idea of the relative magnitude, of each variable's impact on the probability that the store belongs to y=1, 2 or 3. It also includes weightier stuff like calculating predicted probabilities for new store profiles.

After running the analysis, this was the results table:

Things to note here:
  • There are 3 levels of Y (y=1,2,3) and what we are modeling is the probability that a given profile's y takes values 1,2 or 3 i.e. Pr(y=1,2 or 3)
  • The leftmost numbers in the row names of the Coefficients table indicate which of y=1,2,3 that coefficient belongs to. Thus '2:sales' refers to y=2 and '3:sales' refers to y=3. The fact that '2:sales' has estimate -2.696 means that to calculate the probability that y=2 {denoted by 'Pr(y=2)'}, we use -2.696 as the coefficient for sales.
  • Likewise, because '3:sales' has the value -5.358, we would use -5.358 as the coefficient for sales to calculate Pr(y=3). The same generalizes for coupon1 and clientrating as well.
  • One level is the 'reference' level, the baseline relative to which the coefficients for the other 2 levels of y are measured. In our example, the reference level is y=1. Thus, all the coefficients for Pr(y=1) are set to zero.
  • This means that 1:(intercept), 1:coupon1, 1:sales and 1:clientrating all have estimate '0'.
  • Now, since y=2 has a negative coefficient for sales, it means that the higher the sales, the lower the probability that y=2 compared to y=1. In other words, as sales go up, Pr(y=1) goes up and Pr(y=2) comes down.
  • Similarly, since y=3 has an even more negative coefficient than y=2, it means that the higher the sales, the lower the probability that y=3 compared to y=2 or y=1. Thus, as sales rise, Pr(y=1) > Pr(y=2) > Pr(y=3).
  • Putting a similar interpretation to the variable 'coupon1', I'd say, if coupon1=TRUE (i.e. coupon variable has value '1' instead of '2') then Pr(y=1) < Pr(y=2) < Pr(y=3). And so on
  • The intercept doesn't have any interpretation per se. It is used as is and its variable always has value 1. The estimate for 1:(intercept) will have value 0, as usual.
  • Clientrating is not significant. This means clientrating does not seem to vary systematically with Y. So, while we use the Estimates shown when computing fitted values, we do not infer anything from clientrating in this case. (A short R sketch of fitting such a model and computing predicted probabilities follows below.)
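
For those who want to replicate this in R, here is a rough sketch of fitting such a model and getting predicted probabilities for a new store profile. Note this is not the exact class code: it uses multinom() from the nnet package (which also reports coefficients relative to a reference level) rather than the mlogit package, and the data frame 'store' and the values in 'newstore' are assumptions for illustration.

# assumes a data frame 'store' with columns y (1/2/3), sales, coupon1, clientrating
library(nnet)
store$y = relevel(factor(store$y), ref = "1")     # make y=1 the reference level
fit = multinom(y ~ sales + coupon1 + clientrating, data = store)
summary(fit)     # coefficients for y=2 and y=3, relative to y=1 (whose coeffs are 0)

# predicted Pr(y=1), Pr(y=2), Pr(y=3) for a new store profile (hypothetical values)
newstore = data.frame(sales = 5, coupon1 = 1, clientrating = 3)
predict(fit, newdata = newstore, type = "probs")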
Well, I hope that helps. Folks, I'm available for meetings from 2 pm today onwards and all of tomorrow. Just call my extn #7106 and drop by.

Sudhir

Thursday, November 15, 2012

Some Admin Announcements

Update: Got this email with Qs from SK. My response is also appended in it.

Hi S,

My answers to your Qs are prefaced with '>>'

1. In Q1, you have ordered even insignificant X variables. Can they be ordered if they are insignificant?

>>You can't infer anything from insignificant variables. Still, an ordering by 'importance' was asked for, t-stats were used as a proxy (ideally standardized coefficients should be used, but in their absence t-stats will do too even though the correlation isn't perfect) and hence an ordering was given. Yes, I wouldn't read too much into the ordering or impact of non-significant variables.

2. In Q4, we see that LnSize is insignificant. How then can we conclude that if average size could be raised by one ounce, average sales would go up by close to 2%.

>>Tricky one, this. LnSize is insignificant at the 90% level but significant at the 85% confidence level. True, we take an arbitrary, 'safe' cut-off of 95%; however, when calculating fitted values, taking marginal effects or making predictions etc., the variables below the 95% threshold don't go away - they continue to count. That said, I see the point of your objection. Such a Q, if it were to occur in the exam, would be redundant - as in, both TRUE and FALSE would get marks on this one.

3. In Q5, I know that adjusted R-square is not a measure of variation explained in the sample. On the net, some people interpret adjusted R-square as the percentage of variation explained in the population and not the sample. If you look at this question in that light, the question seems ambiguous. Please clarify.

>>That would not be the right interpretation of Adj-R-square. The simple or multiple R-square is the answer to that Q, period.

4. In Q6, again “Promo” and “Light” are both insignificant. How can we conclude anything?

>>Agreed. Again, a redundant Q - both TRUE and FALSE will count for this one. In the exam there's a third category called 'CAN'T SAY' and it would be best suited to Qs dealing with insignificant variables.

Sudhir

*********************************************

Update: I have sent an email to you all with the attachments as LMS is playing games and both the AAs are on leave today.

Received this email from DB:

Dear Sir,
Pl clarify one doubt: in the following table, when asked about "Importance", do we just refer to the "t" values and not the Betas?
My response:
Hi D,

You're right in that ideally, we would look at 'standardized beta values'. This means we have standardized the X variables (so no scaling effects are there) and the betas we get after that are comparable with one another.

However, in the absence of standardized beta values (which may need to be computed separately), we usually look at t-stats, which are often very highly correlated with the standardized betas. It's not perfect but it does the job. In this case, with the SPSS output, since you have standardized betas given to you, you can go with those, no problem.

So yes, you could put LnPrice before Miller in the example given and that would be correct. In the exam however, if standardized betas are not given, pls rely on t-stats.

Hope that clarifies.

***************************************

Hi all,

Based on what the AAs have told me, a 'solution' to the sample end-term is in order. Hence, I have uploaded a "solution to HW 8" document into the session 8 folder. If it is not visible for any reason, pls email me and let me know.

Some folks have emailed saying they might need hard copies of lecture slides for sessions they may have missed. If you email me the sessions you have missed, I will keep printouts ready with my secretary Ms Sireesha. You can pick the same up during working hours anytime.

I'm teaching the PhD class (FPM students) between 21 and 23-Nov and am off to Mohali for term 6 on 24-Nov. However, if you want to meet for any reason, pls email for an appointment and I'll ensure I'm around at the appointed time.

Was a pleasure teaching you. Class size does make a difference - smaller classes are inherently nicer to interact with. I've come to know more students by name in this year than I have in all the 3 previous years combined. Thanks!

Sudhir

Tuesday, November 13, 2012

End-Term notes & Case write-up for Session 10

Hi all,

1. One page Case write-up for session 10

Pls submit a one-page case write-up [one side of the page only, Font: Times New Roman, size 11, spacing 1.5 lines] on the fashion channel case based on the following:

  • What is the management problem? Describe also the major symptoms.
  • What are some of the likely decision problems (DPs) that emerge based on the management problem?
  • List a few research objectives (R.O.s) that emerge based on the DPs.
Pls remember to type your name, PGID and section when submitting it in session 10. The case write-up may get some small portion of the HW grade.

2. End-term exam pattern Notes

Just finished making the Q paper. Based on some Qs received by students, my notes for the end-term exam:

  • There are a total of 50 Qs, 2 marks each.
  • The Qs are broken down into 6 Question-sets, each having tables or figures and Qs based on them
  • The Qs are all short-answer - True/False, fill-in-the-blanks, write expression for ...., name these factors, type of stuff.
  • If any Q comes from any pre-read, the concerned pre-read will be specified in the Q itself. Properly speaking the pre-reads are a part of the course.
  • Nothing that was not covered in class (common to both sections) will show up anywhere in the exam.
  • Time will not be a problem - you'll have 150 minutes for a 120 minute paper.
  • Please bring a calculator with you (no mobiles or laptops allowed). Borrow or otherwise arrange for this.

Pls use the comments section to this post for any Q&A so that it is visible to the class at large.

3. Post-mortem of session 9

For once, the session went far more smoothly and peacefully in Section B than in Section A. I totally lost track of time in Sec A, something I'm usually sensitive to. IMO, it's partly because an all-reading session is new to me and I didn't quite know how to time it well. Anyway, Section B benefited from my wising up after my Section A travails.

I've received some very informative emails as a follow-up to the in-class discussions during session 9. I'm putting these emails up here for wider dissemination:

Update: Got this email from Shashaank Singhal:

Prof

An article that might interest you about crowdsourcing

http://econsultancy.com/in/blog/11098-how-coca-cola-uses-co-creation-to-crowdsource-new-marketing-ideas?utm_medium=feeds&utm_source=blog

Also I had mentioned the site www.crowdspring.com that crowdsources a lot of branding work (including LG’s design a phone project)

Thanks, Shashaank.

The samsung controversy and the murky world of tech blogging

Thanks to Hari for this link. Deals with the paid promotions and referrals issue that we talked about in Reading 4.

Varun Parwal from Section B sent some great articles on the fMRI phenomenon:

Professor, here is the article
http://www.sciencedaily.com/releases/2011/06/110613014455.htm

MRI scans predict pop music success

The study in brief (PDF file)

Using analytics to predict the success of Hollywood blockbusters:

An insightful article on how Obama's team used a metric driven campaign in the recent US elections - and how a blogger used the publicly available data to predict Obama's victory:
Some useful lessons here for all companies utilizing digital media in one form or the other!

Would be great if you could share more of the related/dropped readings for the class today – I at least would like to know more and of course read more on the pros/cons of these advancements in marketing.

Thanks Varun. Will do so in a while. Right now am working away to get the slides ready for printing for tomorrow's session.

Sudhir

Monday, November 12, 2012

twitteR - twitter data collection & analysis on R

Hi all,

Session 9 -"Emerging trends in MKTR" - is going to be reading heavy (you've been warned). Sooo many great readings and so little time. Anyway, there's this twitter based reading for which I'm putting up sample code below. Will ask AAs to load the data on LMS. But before that, some background.

Some folks have asked why we stopped where we did with text analysis. When, obviously, so much more downstream analysis and processing could have been done. Sure, a lot is possible and do-able on R. But class time is limited and only so much can fit in. One particular Q that came up:

"Can we do better sentiment analysis than what we just did for the session 6 HW?"
Sure, we can. It would be great if we could categorize sentiment and then classify text responses accordingly.
Here's what Wikipedia says on the subject:
A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. Advanced, "beyond polarity" sentiment classification looks, for instance, at emotional states such as "angry," "sad," and "happy."

The rise of social media such as blogs and social networks has fueled interest in sentiment analysis. With the proliferation of reviews, ratings, recommendations and other forms of online expression, online opinion has turned into a kind of virtual currency for businesses looking to market their products, identify new opportunities and manage their reputations. As businesses look to automate the process of filtering out the noise, understanding the conversations, identifying the relevant content and actioning it appropriately, many are now looking to the field of sentiment analysis. If web 2.0 was all about democratizing publishing, then the next stage of the web may well be based on democratizing data mining of all the content that is getting published.

Several research teams in universities around the world currently focus on understanding the dynamics of sentiment in e-communities through sentiment analysis. The CyberEmotions project, for instance, recently identified the role of negative emotions in driving social network discussions. Sentiment analysis could therefore help understand why certain e-communities die or fade away (e.g., MySpace) while others seem to grow without limits (e.g., Facebook).

The problem is that most sentiment analysis algorithms use simple terms to express sentiment about a product or service. However, cultural factors, linguistic nuances and differing contexts make it extremely difficult to turn a string of written text into a simple pro or con sentiment. The fact that humans often disagree on the sentiment of text illustrates how big a task it is for computers to get this right. The shorter the string of text, the harder it becomes.

Anyway, R does sentiment analysis. Its package twitteR (the last 'R' is capital) lets you set what keywords you want mined from twitter feeds and where in the world you want this data collected from (specify the latitude and longitude of major cities, for example, and a 50-mile radius around them), collect that data, text mine it, analyze its content, score its sentiment and more. Neat, eh? Well, that's R for you.

Now, finally, on popular demand, here is some elementary R code that I used to analyze tweeple reactions to the latest Bond movie 'Skyfall'.

Step 1: Invoke appropriate libraries. Ensure you've the 'twitteR' and 'sentiment' packages downloaded and installed.

library(twitteR)
library(sentiment)
library(tm)
library(Snowball)
library(wordcloud)

Step 2: Send R to search for and save the data you want. This step is a little involved. Pls read instruction given in bullet points below carefully.

  • First copy and paste the block of code below into an empty notepad. Make all edits to the code in this notepad and then copy-paste it into the R console.
  • Your PGP username and password that you use to connect to the web is required. Enter these in the code in place of 'username' and 'password' as given in 'set proxy for R' step.
  • If you want specific city based tweets only, use the geocode option in the searchTwitter() function below. For example, ' geocode="29.0167, 77.3833, 50mi" ' refers to tweets originating from a 50 mile radius around the center of Delhi.
  • In write.table(), write the tweets collected to a notepad only.
  • If you ask R to save more than n=500 tweets in the searchTwitter() function, it might take upto a couple of minutes (depending on your web connection) to find and save them.
###### search in twitter #######
#set proxy in R
Sys.setenv(http_proxy = "http://username:password@172.16.0.87:8080")

# send R to go collect data
rev = searchTwitter("#skyfall", n=500, lang="en")

## -- to do location specific searches ---
searchTwitter(searchString, n=25, lang="en", since=date, until=date, geocode = "38.5, 81.4, 50mi")

rev[1:5] #shows first 5 tweets
rev.df = twListToDF(rev) # changes tweets to data frame
#save data
write.table(as.matrix(rev.df[ ,1]), file.choose())
Here's what the first 5 recent tweets that R captured looked like:

Step 3: Standard text mining stuff which we already saw in session 6. I won't go into making barplots and histograms, you can do that yourself using the session 6 code.

x = readLines(file.choose())
x1 = Corpus(VectorSource(x))
# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)
x1 = tm_map(x1, removeNumbers)
# add 'skyfall' and 'bond' to the stopword list, then remove all stopwords
myStopwords <- c(stopwords('english'), "skyfall", "bond")
x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)
# make the doc-term matrix #
x1mat = DocumentTermMatrix(x1)

Step 4: Invoke sentiment analysis. Classify the tweets by emotion, find the polarity (or which emotion pole - pos or neg - dominates a text output) using simple functions.

## --- inspect only those tweets which
## got a clear sentiment orientation ---

library(sentiment)
a1=classify_emotion(x1)
a2=x[(!is.na(a1[,7]))] # keep tweets that got a clear emotion classification
a2[1:10]

# what is the polarity score of each tweet? #
# that is, what's the ratio of pos to neg content? #
b1=classify_polarity(x1)
dim(b1)
# build polarities table
b1[1:5,] # view a few rows
The top 10 tweets are a subset of the ones which have emotional content in them. The bottom table shows rows from the emotional polarities table - gives a measure of the POS score, the NEG score, the POS/NEG ratio and then computes a net-net balance polarity for the document (in this case, a tweet) under the column BEST_FIT.

Step 5: Now we dive deeper into emotion classification. Six primary emotion states are available in twitter output from the sentiment package: "Anger", "Disgust", "fear", "joy", "sadness", and "surprise". We classify which tweets score high on which emotion type and view a few rows of each type.

##--- changing the a1 thing to reg data frame
a1a=data.matrix(as.numeric(a1))
a1b=matrix(a1a,nrow(a1),ncol(a1))
# build sentiment type-score matrix
a1b[1:4,] # view few rows

# recover and remove the mode values
mode1 <- function(x){names(sort(-table(x)))[1]}
for (i1 in 1:6){ # for the 6 primary emotion dimensions
mode11=as.numeric(mode1(a1b[,i1]))
a1b[,i1] = a1b[,i1]-mode11 }
summary(a1b)
a1c = a1b[,1:6]
colnames(a1c) <- c("Anger", "Disgust", "fear", "joy", "sadness", "surprise")
a1c[1:10,] # view a few rows

## -- see top 10 tweets in "Joy" (for example)
a1c=as.data.frame(a1c);attach(a1c)
test = x[(joy != 0)]; test[1:10]
# for the top few tweets in "Anger"
test = x[(Anger != 0)]; test[1:10]
test = x[(sadness != 0)]; test[1:10]
The above image shows the results I got. If you pull out tweets at a later time, you will get a different set of tweets and a different set of results than what I got. To replicate my results, pls use my dataset (up on LMS under skyfall_twitteR.txt).

Could more be done downstream? Can I now cluster tweets by sentiment? Do collocation dendrograms by sentiment polarity?
Sure and more.
But I will stop here for now.

See you in class soon. Ciao.

Sudhir

Sunday, November 11, 2012

Some Project Guidelines

Hi all,

I have met with some project groups to informally discuss their approach to the project etc. Here are some of my notes from those meetings which I think all groups can benefit from.

  • Pls feel free to update, revise and edit your management problem, decision problems and R.O.s at any stage in the project.
  • The R.O.s shouldn't be overly ambitious as that leads to last minute rush and compromises on quality. They shouldn't be overly narrow to the point of being trivial either. Make them 'reasonable' so that you get scope enough to plan and think through the process, to showcase various MKTR tools, etc.
  • It's OK to piggyback on existing projects - e.g. in Pricing or ENDM, for which you may have done some groundwork and data collection. However, these should be suitably adapted for MKTR requirements, use of MKTR tools as discussed in class, etc.
  • Projects can focus either on an exploratory or a confirmatory research design. If your project involves a bit of both of them, pls ensure it is not overly burdensome in terms of resource commitments.
  • It's OK to go confirmatory on well-established or 'traditional' product categories such as the many FMCG ones we see around. However, for new and emerging categories, pls rethink going the confirmatory route. Ask if you have enough clarity about consumer perspectives on that category to go confirmatory on it.
  • Be creative in using some of the extended analysis tools we discussed in the classroom - e.g. text analysis, collocation dendrograms for a variety of objects, etc. Scour the web for source material for text analysis (e.g., customer or movie reviews, articles from the popular press, etc.)
  • What is important is that your project should have a coherent 'storyline' running through it - the mgmt problem -> the decision problems -> the ROs -> the tools mapped -> the data collected -> analysis done -> findings and conclusions.
  • Ensure that whatever ROs you choose are 'covered' by your analysis, that your DP is covered by the ROs and so on. Coherence, storyline, conclusion.
  • Don't worry overly about 'representativeness' of sample at this stage. However I would advise that folks choose product/service categories to focus on that would have your peer group at ISB as their primary target audience. This helps in getting access to a large and ready pool of subjects/respondents.
  • If you want to bring particular tools into play and showcase them (e.g. perceptual maps) then you can revise the ROs such that scope for this is created.
  • The submission is set for the day before the first day of term 6. So time is not that compressed, however plan it well.
  • In an earlier blogpost I had laid out an indicative list of grading criteria for the project. Pls go through them. The project, after all, carries the largest chunk of grading weight in this course.

That's it from me for now. Pls use the comments box to reach me fast.

Sudhir

Saturday, November 10, 2012

Session 8 HW & End-term Sample Paper

Hi all,

The session 8 HW, which doubles up as the end-term sample paper, has been put up on LMS. It has two sets of Qs:

  • One Q set is based on regression results from the beer dataset. Incidentally, this is the same Q that showed up in last year's end-term (which is why the tables in it are from SPSS and not R).
  • Pls note the Q types - fill-in-the-blanks, true/false, MCQs, perhaps a brief one line text input (e.g., 'write the conceptual model for...' or 'write the econometric model expression for ...').
  • Expect end-term Qs to be along similar lines. We're planning 40 such small Qs over about 8-10 Q-sets covering most of the modules done in class.
  • The exam is designed for a max of 2 hours though you will have 2.5 hours officially. It's open-book, open-notes, so don't sweat about memorizing random stuff.
  • The second Q set is based on a logit model results set from the Nike example we covered in class in session 8.
  • Write or type only in the space provided, both for this HW and for the end-term later on.
  • The HW is due as a hardcopy submission in the break during session 10. No scope for late submissions here.
  • More miscellaneous and general announcements:

  • The meetings with groups over their project status have been instructive and informative. I'll put up a separate blog post on this soon. Update: That blogpost is now up and is here
  • The R portion is now more or less done. No further HWs after session 8's. The coming week will be more consolidation & integration of what we did in the past 4 weeks and about looking ahead to the future.
  • I have had some informative discussions with a few students who have volunteered feedback on what can be done to improve the course, better differentiate it from similar offerings and so on. I would appreciate further pointers from folks who feel strongly about any aspect of the course.

That's it for now. Ciao.

Sudhir

Friday, November 9, 2012

An Aside

Students from the web 2.0 class dropped by. They were making a viral video. Well, yours truly put in a cameo appearance for the noble cause.

Heh.

Wednesday, November 7, 2012

Session 8 Classwork Rcode

Update: All classwork examples data files are now available directly from LMS on an excel sheet. This is in response to reports of issues in reading data off the google spreadsheet.

Hi all,

The previous blog-post introducing Session 8 modules can be found here. In this blog-post, I post R code required for classwork examples and links to the associated datasets.

Module 1. Hypothesis Testing

  • t-tests for differences
The Nike dataset is on this google spreadsheet starting cell B5.
## hypothesis testing
## Nike example

# read-in data
nike = read.table(file.choose(), header=TRUE)
dim(nike); summary(nike); nike[1:5,] # some summaries
attach(nike) # Call indiv columns by name

# Q is: "Is awareness for Nike (X, say) different from 3?” We first delete those rows that have a '0' in them as these were non-responses. Then we proceed with t-tests as usual.

# remove rows with zeros #
x1 = Awareness[(Awareness != 0)]
# test if x1's mean *significantly* differs from mu
t.test(x1, mu=3) # reset 'mu=' value as required.
This is the result I got:
To interpret, first recall the hypothesis. The null said: "mean awareness is no different from 3". However, the p-value of the t-test is well below 0.01. Thus, we reject the null with over 99% confidence and accept the alternative (H1: mean awareness is significantly different from 3). Happily, pls notice that R states the alternative hypothesis in plain words as part of its output.

The second Q we had was:

# “Q: Does awareness for Nike exceed 4?”
# change 'mu=3' to 'mu=4' now
t.test(x1, mu=4, alternative=c("greater")) # one-sided test
The p-value I get is 0.2627. We can no longer reject the Null even at the 90% confidence level. Thus, we infer that mean awareness of 4.18 odd does *not* significantly exceed 4.

Next Q was: "Does awareness for Nike in females (Xf, say) exceed that in males (Xm)?”

# first make the two vectors
xm = x1[(Gender==1)]
xf = x1[(Gender==2)]
# Alternative Hypothesis says xf '>' xm
t.test(xf, xm, alternative="greater")
In the code above, we specify for 'alternative=' whatever the alternative hypothesis says. In this case it said greater than, so we used "greater". Else we would have used "less".
The p-value is below 0.05. So we reject the Null at the 95% level. We accept the alternative that "xf at 5.18 significantly exceeds xm at 3.73".

  • chi-square tests for Association
The data for this example is on the same google doc starting cell K5. Code for the first classwork example on Gender versus Internet Usage follows. First, let me state the hypotheses:
  • H0: "There is no systematic association between Gender and Internet Usage." In other words, the distribution of people we see in the 4 cells is a purely random phenomenon.
  • H1: "There is systematic association between Gender and Internet Usage."
# read in data WITHOUT headers
a1=read.table(file.choose()) # 2x2 matrix
a1 # view the data once
chisq.test(a1) # voila. Done!
This is the result I get:
Clearly, with a p-value of 0.1441, we can no longer reject the Null at even the 90% level. So we cannot infer any significant association between Gender and Internet Usage levels. The entire sample we had was 30 people large. Suppose we had a much bigger sample but a similar distribution across cells. Would the inference change? Let's find out.

Suppose we scaled up the previous matrix by 10. We will then have 300 people and a corresponding distribution in the four cells. But now, at this sample size, random variation can no longer explain the huge differences we will see between the different cells and the inference will change.

a1*10 # view the new dataset
chisq.test(a1*10)
The results are as expected. While a 5 person difference can be attributed to random variation in a 30 person dataset, a 50 person variation cannot be so attributed in a 300 person dataset.

Our last example on tests of association uses the Nike dataset. Does Nike Usage vary systematically with Gender? Suppose we had reason to believe so. Then our Null would be: Usage does not vary systematically with Gender, random variation can explain the pattern of variation. Let us test this in R:

# build cross tab
mytable=table(Usage,Gender)
mytable #view crosstab

chisq.test(mytable)
The example above is also meant to showcase the cross-tab capabilities of R using the table() function. R peacefully does 3-way cross-tabs and more as well. Anyway, here's the result: seems we can reject the Null at the 95% level.
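
As a quick illustration of a higher-order cross-tab (not something we ran in class), here's a small sketch. It assumes a hypothetical third column, say 'Income', in the Nike data, included only to show the syntax.

# 'Income' is a hypothetical extra column
mytable3 = table(Usage, Gender, Income)
ftable(mytable3)               # flattens the 3-way table for easier viewing
chisq.test(mytable3[, , 1])    # a chi-square test on any 2-way slice of it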

Module 2. Secondary Data Analysis

There are 3 types of broadly used secondary data that you will be exposed to in this module. First, a small sample or subset of a supermarket shopping basket dataset upon which we will perform *affinity analysis*. The second is aggregate sales data for a category (beer) provided by syndicated data providers (like A C Nielsen and IRI). The third is a dummy dataset upon which we will exercise the Logit model.

  • Affinity analysis on supermarket shopping basket data
Data on some 1018 shopping baskets and covering some 276 product categories from a Hyd Supermarket are put up as an excel sheet on LMS. I used Excel's Pivot tables to extract the 141 most purchased categories. The resulting 1018*141 matrix I read into R. The matrix is available as a text file for R input on LMS. Now use the R code as follows:
# read-in data
data = read.table(file.choose(), header=TRUE)
dim(data); data[1:5, 1:10]

# --- build affinity matrix shell ---
d1 = ncol(data)
afmat = matrix(0, d1, d1)
rownames(afmat) <- colnames(data)
colnames(afmat) <- colnames(data)

# --- fill up afmat lower triangle ---
for (i1 in 2:d1){ for (i2 in 1:(d1-1)){
test = data[, i1]*data[, i2]
afmat[i2, i1] = 100*sum(test[]>0)/nrow(data)
afmat[i1, i2] = afmat[i2, i1] }
afmat[i1, i1] = 0 }

colSum = apply(afmat, 2, sum)
a1=sort(colSum, index.return=TRUE, decreasing=TRUE)
afmat1=afmat[,a1$ix]
afmat2=afmat1[a1$ix,]
So, with that long piece of code, we have processed the data and built an 'affinity matrix'. Its rows and columns are both product categories. Each cell in the matrix tells how often the column category and the row category were purchased together in the average shopping basket. Let us see what it looks like and get some summary measures as well.

# each cell shows % of times the row & col items #
# were purchased together, in the same basket #
afmat2[1:40,1:8] # partially view affinity matrix

# see some basic summaries
max(afmat2)
mean(afmat2)
median(afmat2)
We viewed some 40 rows and 8 columns of the affinity matrix. Each cell depicts the % of shopping baskets in which the row and column categories occurred together. For example, the first row, second column says that vegetables and juices were part of the same purchase basket 7.66% of the time, whereas bread and vegetables were so 8.54% of the time. The average cross-category affinity is 0.27%. The image below shows the output for some 12 rows and 6 columns of the affinity matrix.

Could the above affinity also be viewed in the form of a collocation dendrogram? Or as a word-cloud? Sure. Here is the code for doing so, and my dendrogram result.

# affinity dendrogram?
mydata = afmat2[1:50, 1:50] # cluster top 50 categories

# Ward Hierarchical Clustering
d <- as.dist(mydata) # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram

# affinity wordcloud?
colSum = apply(mydata, 2, sum)
library(wordcloud)
wordcloud(colnames(mydata), colSum, col=1:10)

  • Analysis of Syndicated Datasets

Much of real-world commerce depends on trading data - buying and selling it - via intermediaries called 'Syndicated data providers'. AC Nielsen and IRI are two of the largest such around in the retail trade. We will next analyze a small portion of a syndicated dataset to gain familiarity with its format, build some basic models and run some neat regression variants on the data.

The (small) portion of the dataset, which we will use for analysis is here in this google document. First, as usual, read-in the data and let the games begin.

  • There's a text column in the data ('brand2', for brand names), so it's better to read it in from .csv rather than .txt, using the code below.
  • I also do a variable transformation on sku.size, which measures the size of the sale unit in fluid ounces (e.g., a 6-pack of 12 oz units has 6*12 = 72 ounces of beer volume), squaring it for use as a quadratic term later.
# read data from a .csv file
x = read.table(file.choose(), header=TRUE, sep=",")
dim(x); summary(x); x[1:5,] # summary views
attach(x) # call indiv columns

# some variable transformations
sku.size.sq = sku.size*sku.size

  • Start with the simplest thing we can do - a plain linear regression of sales volume on the 4Ps and on some environmental and control variables (months for seasonality etc.).
# running an ordinary linear regression
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo + sku.size + bottle + golden + light + lite + reg + jan + feb + march + april + factor(brand2), data=x)
# view results
summary(beer.lm)
Notice how the use of factor() enables us to directly use a text column of brand names inside the regression. The following image shows the result I get.
Reading regression output is straightforward. To refresh your memory, try the following Qs: (i) What is the R-square? (ii) How many variables significantly impact sales volume at 95%? At 99%? (iii) Which brand appears to have the highest impact on sales volume (relative to the reference brand 'Amstel')? (iv) How do beer sales fare in Jan, Feb and March relative to the reference month April? (v) Do larger-sized SKUs have higher sales volume? Etc. Pls feel free to consult others around you and get the interpretation of standard regression output clarified in case you've forgotten.
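If you'd rather pull these quantities out of R than read them off the printed output, here is a small sketch using the beer.lm object fitted above (the column indices refer to the standard coefficient table that summary() returns for an lm fit):

# pull key quantities from the fitted model object
s = summary(beer.lm)
s$r.squared; s$adj.r.squared # (i) R-square and adjusted R-square
pvals = s$coefficients[, 4] # p-values sit in the 4th column of the coef table
sum(pvals < 0.05); sum(pvals < 0.01) # (ii) no. of terms significant at 95% / 99%
round(sort(s$coefficients[, 1], decreasing=TRUE)[1:10], 3) # largest coefficients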

  • We move to running a linear regression with quadratic terms to accommodate non-linear shaped response.
  • E.g., suppose sales rise with SKU size but only up to a point (the 'ideal' point), and then start falling off as SKU sizes get too big to sell well. Thus, there may be an 'optimal' SKU size out there. We cannot capture this effect with a linear term alone.
# running an OLS with quadratic terms
beer.lm = lm(volsold ~ adspend1 + price + distbn + promo + sku.size + sku.size.sq + bottle + golden + light + lite + reg + jan + feb + march + april + factor(brand2), data=x)
# view results
summary(beer.lm)
The image shows that yes, both sku.size and its quadratic (or square) term are significant at the 99% level *and* they have opposite signs. So yes, indeed, there seems to be an optimal size at which SKUs sell best, other things being equal.
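A nice by-product: with opposite signs on sku.size (positive) and sku.size.sq (negative), the fitted sales curve is a downward-opening parabola in SKU size, so the implied sales-maximizing size sits at -b_linear/(2*b_quadratic). A quick sketch to compute it, assuming the coefficient names are as they appear in summary(beer.lm):

# implied sales-maximizing SKU size from the quadratic fit
b = coef(beer.lm)
sku.opt = -b["sku.size"] / (2 * b["sku.size.sq"]) # vertex of the parabola
sku.opt # in fluid ounces, other things being equal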

  • A log-log regression takes the natural log of metric terms on both sides of the equation and runs them as an ordinary regression.
  • The output is not ordinary though. The coefficients of log variables can now be directly interpreted as point elasticities with respect to sales.
# run a log-log regression
beer.lm = lm(log(volsold) ~ log(adspend1) + log(price) + log(distbn) + log(promo) + sku.size + sku.size.sq + bottle + golden + light + lite + reg + jan + feb + march + april + factor(brand2), data=x)
summary(beer.lm)
The results say that the price elasticity of sales is -2.29, the distribution elasticity is 1.32, and the promotion and advertising elasticities are 0.57 and 0.44 respectively.
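To make the elasticity reading concrete: a coefficient of -2.29 on log(price) means a 1% price increase is associated with roughly a 2.29% drop in volume sold, other things equal. The elasticities can also be pulled straight from the fitted object (coefficient names follow the formula terms):

# elasticities = coefficients on the logged predictors
round(coef(beer.lm)[c("log(adspend1)", "log(price)", "log(distbn)", "log(promo)")], 2)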

  • What if there are more variables than required for a regression? How to know which ones are most relevant and which ones can safely be dropped?
  • Enter variable selection - we use a scientific, complexity-penalized fit metric, the Akaike Information Criterion (AIC), to decide which set of variables best fits a given regression.
## variable selection
library(MASS)
step = stepAIC(beer.lm, direction="both")
step$anova # display results
Look at the initial regression equation and the final regression equation in the output (shown in blue font). The variables R wants dropped carry a '-' and those it wants added back after dropping carry a '+'. Thus, a simple 3-line piece of R code helps identify the best-fitting set of variables among a given set of predictors.
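In case it helps, stepAIC() also returns the re-fitted best model itself, so you can inspect it directly and compare AICs before and after selection:

# 'step' holds the re-fitted best-AIC model
summary(step) # coefficients of the selected model
AIC(beer.lm); AIC(step) # AIC before vs. after variable selection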

Whew. With that, Logit alone remains to be covered. Will do so after a break. Shall put this up on the blog meanwhile for checking purposes.

3. Discrete Choice Models: The (Multinomial) Logit

Hi all,

Logit is slightly hard to get running on R. Lots of things to keep track of when entering data. But here's some generic code that should do the trick. The dataset for the following example is on this google spreadsheet.

Step1: Read-in data

  • Drop any variables that will not be used in the analysis, e.g. 'storenum'.
  • Convert any nominal or categorical variables into dummy variables
  • Keep the dependent variable 'instorepromo' as the leftmost variable in the input dataset
###
### --- do MNL on store-promo example ---
###

data = read.table(file.choose(), header=TRUE)
summary(data); attach(data)
data[1:3,]
data = data[,2:ncol(data)] # drop storenum

  • Step 2: To convert the categorical 'coupon' variable (in general, any set of categorical variables) into a set of binary dummy variables, I wrote the following subroutine. Just copy-paste it in.
## --- function to build dummy variables for categorical (numerically coded) columns ----

makeDummy <- function(x1){
x1 = data.matrix(x1); test=NULL

for (i1 in 1:ncol(x1)){
test1=NULL; a1=x1[,i1][!duplicated(x1[,i1])]; # distinct levels of column i1
k1 = length(a1); test1 = matrix(0, nrow(x1), (k1-1)); # one dummy per level, minus a reference level
colnames0 = NULL

for (i2 in 1:(k1-1)){
test1[,i2] = 1*(a1[i2] == x1[,i1]) # 1 if the row takes that level, else 0
colnames0 = c(colnames0, paste(colnames(x1)[i1], a1[i2])) }
colnames(test1) = colnames0
test1 = data.matrix(test1); test = cbind(test, test1) } # append this column's dummies
test}

# convert coupon to binary variable
coupon1 = makeDummy(coupon)
colnames(coupon1) = "coupon1" # name it

  • Step 3: Reformat the dataset for mlogit input. Again, just copy-paste.
# make dataset with Y variable as the first
data1 = cbind(instorepromo, coupon1, sales, clientrating)

# now reformat data for MNL input
k1 = max(data1[,1]) # no. of alternatives (promo types) there are
test = NULL
for (i0 in 1:nrow(data1)){
chid = NULL;
test0 = matrix(0, k1, ncol(data1))
for (i1 in 1:k1){
test0[i1, 1] = (data1[i0, 1] == i1);
chid = rbind(chid, cbind(i0, i1))
for (i2 in 2:ncol(data1)){ test0[i1, i2] = data1[i0, i2] }}
test = rbind(test, cbind(test0, chid)) } # i0 ends
colnames(test) = c(colnames(data1), "chid", "altvar")
colnames(test)=c("y",colnames(test)[2:ncol(test)])
dim(test); test[1:5,] # view few rows
summary(test)

# build mlogit's data input matrix
library(mlogit) # load the mlogit package
test1a = data.frame(test)
attach(test1a)
test1 = mlogit.data(test1a, choice = "y", shape = "long", id.var = "chid", alt.var = "altvar")
summary(test1); test1[1:5,] #view few rows

  • Step 4: Run the analysis, enjoy the result.
  • After all this exhausting work, we've finally reached the top of the hill. No more work now. Let R take over from here.
  • P.S.: This may have seemed like a lot of work for such a small dataset. But pls note, this code is generic and can be used with much bigger and more complex datasets with equal facility.
# run mlogit
fit1 = mlogit(y ~ 0|coupon1+sales+clientrating, data=test1)
summary(fit1)
The image above gives the results.
It says that the store profile that best suits (or has the highest probability of being) in-store promo type 1 is one that (i) has coupon type 2, (ii) has high sales, and (iii) has a high client rating.

Alternatively, we could say that a store with (i) coupon type 1, (ii) low sales and (iii) a low clientrating has a higher chance of being in-store promo type 3 or 2 than type 1.

What's more, we can compute the exact probabilities involved. But that is fodder for another day and another course, perhaps. Here's the code for it and the results it gives. (You're welcome.)

##---write code to obtain fitted values

coeff = as.matrix(fit1$coefficients) # estimated coefficients
a1 = exp(model.matrix(fit1) %*% coeff) # exp(Xb) for every store-alternative row

attach(test1)
chid = test1[,(ncol(test1)-1)] # choice-situation (store) id
Prob.matrix = NULL
for (i1 in 1:max(chid)){ # for each store
pr = NULL
test2 = a1[(chid==i1), 1]
pr = test2/sum(test2) # normalize exp(Xb) into choice probabilities
k2=ncol(test1)
a2 = cbind(test1[(chid==i1),c(1,(k2-1),k2)],pr)
Prob.matrix = rbind(Prob.matrix, a2)}

colnames(Prob.matrix) = c("observed y", "row id", "promo_format", "Probability")
Prob.matrix[,ncol(Prob.matrix)] = round(Prob.matrix[,ncol(Prob.matrix)], 3)
Prob.matrix[,1]=as.numeric(Prob.matrix[,1])
Prob.matrix[1:10,]#view few rows of probabilities
The top 10 rows of the calculated probabilities matrix can be seen in the image. Notice how nicely the calculated probabilities and observed choices match.

And now for the final piece - prediction.

Suppose there are 5 stores for which we don't know what promo_format is in force. Can we predict the promo format given the store profile? Let's find out. I have pasted the prediction dataset on the same google doc, starting from cell H3. Read it in and let the following code do its job.

##--code for predicting membership for new data
# copy from cell H3 of google doc
newdata1 = read.table(file.choose(), header=TRUE)
attach(newdata1)

# convert coupon to binary variable
coupon1 = makeDummy(coupon)
colnames(coupon1) = "coupon1" # name it

# make dataset for analysis
newdata = cbind(coupon1, sales, clientrating)

Next, convert the data format for mlogit processing:

# reformat newdata for mlogit input
test3 = NULL
intcpt = matrix(0,k1, (k1-1))
for (i2 in 2:k1){ intcpt[i2, (i2-1)] = 1 }
for (i0 in 1:nrow(newdata)){
test3a = NULL; test3b=NULL;
for (i1 in 1:ncol(newdata)){
test3a[[i1]]=matrix(0, k1, (k1-1))
for (i2 in 2:k1){ test3a[[i1]][i2, (i2-1)] = newdata[i0, i1] }
test3b = cbind(test3b, test3a[[i1]]) }
test3b = cbind(intcpt, test3b)
test3 = rbind(test3, test3b) }

colnames(test3) = colnames(model.matrix(fit1))
dim(test3); test3[1:5,] # view a few rows

Now, finally, run the new data and obtain predicted probabilities of belonging to particular promo formats in a probability table.

# obtain predicted probabilities
coeff = as.matrix(fit1$coefficients)
a1 = exp(test3 %*% coeff)
chid1=NULL; for (i in 1:nrow(newdata1)){chid1=c(chid1, rep(i,k1)) }
altvar1=rep(seq(from=1, to=k1),nrow(newdata1));altvar1
Prob.matrix = NULL
for (i1 in 1:max(chid1)){
pr = NULL
test2 = a1[(chid1==i1), 1]
pr = test2/sum(test2)
Prob.matrix = rbind(Prob.matrix, cbind(i1, altvar1[1:k1], pr)) }

colnames(Prob.matrix) = c("row id", "promo_format", "Probability")
Prob.matrix[,ncol(Prob.matrix)] = round(Prob.matrix[,ncol(Prob.matrix)], 3)

Prob.matrix # view predicted probs
This is the result I got:

With that I close. It's 6 am on the morning of the day I am to cover this subject in class. Am just glad I've managed to get it in order with time enough to prep the material.

See you all in class.

Sudhir

Monday, November 5, 2012

A better way to do Discriminant Analysis

Update: Folks, don't worry about the Discriminant part of the HW; consider it not part of the session 5 HW. I ought not to have put it in when I didn't cover it in depth. Sorry about the confusion.

Hi all,

I am attaching below the R code for performing discriminant analysis using the multinomial logit. A quick background to all this:

  • After segmentation, comes targeting.
  • In targeting, the main goal is to *predict* which segment a given customer may belong to given some easily observed traits of that customer (which we call 'discriminant' variables). Typically these used to be demographic variables, but increasingly we see behavioral and transactions-based variables becoming discriminants.
  • Traditionally, '(linear) discriminant analysis' was performed to see which discriminant variables were significant predictors of segment membership. However, this process is messy and hard to interpret, and since the '80s it has been overtaken by discrete choice models - notably the Logit model.
  • We shall use a particular variant of the Logit model called the 'multinomial Logit' to perform discriminant analysis. The code for the same is given below.
  • I was planning on covering Logit in session 8 for secondary data analysis, but might as well introduce it now.

To demonstrate this example, I am taking your Term 4 course ratings dataset available on this google spreadsheet.

  • I ran the dataset through Mclust using the 4 attribute ratings and the preference ratings (20 variables in all) as my basis variables. Mclust says a 5-cluster solution is optimal. I save the segment allocations (a minimal sketch of this step appears just after this list).
  • I have also attached a set of 4 discriminant variables starting at cell X5 of the spreadsheet. I have shaded those cells grey to highlight them. The segment classification is the first column of this discriminant dataset.
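For completeness, here is a minimal sketch of the Mclust step described in the first bullet above. The object name 'basis' is a placeholder for a data frame holding the 20 basis variables read in from the spreadsheet:

# model-based clustering of the basis variables (sketch; 'basis' is a placeholder)
library(mclust)
fit.mc <- Mclust(basis) # BIC-based search over cluster solutions
fit.mc$G # optimal no. of clusters (5 in my run)
segment <- fit.mc$classification # save segment memberships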
We are finally ready. We read this data in and process it using the 'mlogit' package in R.

First, read-in the data and run some basic summaries. Note that in the data, no column header has any blank spaces in it. The segment variable should always be the first variable.

##
## --- using mlogit for Discriminant ---
##

# first read-in data
# ensure segment membership is the first column

discrim = read.table(file.choose(), header=TRUE)
dim(discrim); discrim[1:4,]

Now, the data will need to be reformatted as multinomial logit (MNL) input.

# now reformat data for MNL input

k1 = max(discrim[,1]) # no. of segments there are
test = NULL
for (i0 in 1:nrow(discrim)){
chid = NULL;
test0 = matrix(0, k1, ncol(discrim))
for (i1 in 1:k1){
test0[i1, 1] = (discrim[i0, 1] == i1);
chid = rbind(chid, cbind(i0, i1))
for (i2 in 2:ncol(discrim)){ test0[i1, i2] = discrim[i0, i2] }}
test = rbind(test, cbind(test0, chid)) } # i0 ends
colnames(test) = c(colnames(discrim), "chid", "altvar")

Now we setup and run MNL. I will interpret the results after that.

# setup data for mlogit
library(mlogit)
test1a = data.frame(test)
attach(test1a)
test1 = mlogit.data(test1a, choice = "segment", shape = "long", id.var = "chid", alt.var = "altvar")

# run mlogit
summary(mlogit(segment ~ 0|female+engineer+workex_yrs, data = test1))
That was it. The simple statement "segment ~ 0|female+engineer+workex_yrs" runs a rather complex discrete choice model. Notice how the dependent variable (segment) and the independent predictors (female, engineer, workex_yrs) have been placed and used.
This is the result I got:

How to read the output table above:

  • The dependent variable consists of 5 values - membership to segments 1 to 5. 'Frequency of alternatives' in the result above gives how often they occurred.
  • Look at the 'Coefficients:' table in the results image above. It gives the coefficient estimate, std error and p-value (significance) for each discriminant variable. Thus the coefficient for '2:female' tells how being female affects the odds of membership to segment 2 relative to the reference segment. And so on. (A short sketch after this list shows how to convert these coefficients into odds ratios.)
  • Membership to segment 1 is the "reference level" and all coefficients for it are set to zero. All other coefficients are relative to this zero reference level.
  • Our set of discriminant variables was not a good one because most are not significant in their ability to predict psychographic segment membership.
  • McFadden's R-square is a fit metric analogous to the regular R-square, though based on log-likelihoods rather than explained variance. A value of about 0.06 indicates that the model explains rather little - a weak fit.
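As flagged in the bullet on the coefficients table, a more concrete read on coefficient sizes comes from exponentiating them, which gives odds ratios relative to the reference segment. A minimal sketch (this assumes you save the fitted model to an object instead of wrapping it in summary() directly):

# odds ratios relative to the reference segment (segment 1)
fit <- mlogit(segment ~ 0 | female + engineer + workex_yrs, data = test1)
round(exp(coef(fit)), 2) # e.g. '2:female' = multiplicative change in the odds of
# segment 2 (vs segment 1) when female = 1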
We will cover Logit models with a smaller and easier example in Session 8.

That's it for now from me. Ciao.

Sudhir

Saturday, November 3, 2012

Session 6 HW & Project Announcement

Update - Project related Announcement:

Hi all,

I met a couple of project groups in the past two days who had come seeking inputs and guidance for their project. I found a first-hand view of the student perspective on MKTR tools and methods insightful - what they're looking to get out of the project, how some students are combining projects from other courses ('Pricing' for instance), and so on. For the record, I'm fine if you leverage data already collected in other projects for the MKTR one, as long as the D.P. and the R.O.s have marketing substance to them.

I will be available during working hours, every day (except Tue and Thu) from tomorrow to Sunday 11-Nov in my office 2118, in case any group wants to run their project status by me and get some informal feedback and pointers for the way forward. No formal appointment is necessary, and it's fine if at least half the group shows up (not everybody's schedules may agree). Just call my office extn # 7106 and drop by if I'm in. Bringing a PPT (or printout) of your proposal and plans with you would certainly help.

Sudhir

#############################################

Hi all,

The homework for session 6, due 11-Nov Sunday noon, is described here.

I've put up some 86 user reviews of the Samsung Galaxy S3, pulled from Amazon, on LMS. The AAs aren't in yet and I'm not that familiar with LMS, so pls let me know if you have trouble accessing the datasets.

The code to execute the assignment is also put up here (it's a minor variation on the classwork code). You are strongly advised to first try replicating the classwork examples, available in this blog post, on your machine before trying this one.

Your task is to use R to text-analyze the dataset. Figure out:
(i) What most people seem to be saying about the product, and thereby interpret a general 'sense' of the talk or buzz around the product.

(ii) What positive emotions seem associated with the S3, and thereby what S3's strengths are. The business implications of such early signs of word-of-mouth, instantaneous customer feedback, buzz etc. for positioning, branding, promotions, communications and other tools in the Mktg repertoire are easy to see.

(iii) What negative associations seem to be around. Ideate on S3's plausible weaknesses and how it can try to defend itself. The business importance of early-warning systems, damage assessment and speedy damage control is hard to miss.

Thus, this HW essentially asks you to interpret, from a business point of view, the first few indications of online chatter surrounding the Samsung Galaxy S3 - a SWOT of sorts. Such an activity would normally fall under the rubric of Mktg intelligence, but tomorrow's world will likely make the Mktg Intelligence vs Mktg Research distinction blurry anyway.

Here's the code for analysis:

## ##
## --- Sentiment mining the Samsung Galaxy S3 --- ##
## ##

# first load libraries
library(tm)
library(wordcloud)
library(Snowball)

Now read in data from S3.csv directly.

# Paste reviews in .csv, one per row.
# Do Ctrl+H and replace all commas with blanks in the reviews.
# Now read in each review as 1 doc with scan() as shown below.

x=scan(file.choose(), sep=",",
dec = ".", what=character(),
na.strings = "NA",
strip.white = TRUE,
comment.char = "", allowEscapes = TRUE, flush = FALSE )
x1=Corpus(VectorSource(x)) # created Doc corpus
summary(x1)
If all went well, R will say in blue font "A corpus with 86 text documents". Good.

Next, we parse the text documents, remove stopwords (like "the" etc.), and add our own stopwords to the list on a contextual basis. For instance, 'samsung' and 'phone' would show up as the most frequently used terms. Duh, it's a Samsung phone review after all. So it's not that informative to have these two terms occupy the top 2 slots.

# standardize the text - remove blanks, uppercase #
# punctuation, English connectors etc. #
x1 = tm_map(x1, stripWhitespace)
x1 = tm_map(x1, tolower)
x1 = tm_map(x1, removePunctuation)

# Adding 'phone' &' samsung' to stopwords list
myStopwords <- c(stopwords('english'), "phone", "samsung")
x1 = tm_map(x1, removeWords, myStopwords)
x1 = tm_map(x1, stemDocument)

OK. Now time to build a word-frequency matrix, sort it, get summaries like basic counts etc and see which words top the frequency list using a barplot.

# --- make the doc-term matrix --- #
x1mat = DocumentTermMatrix(x1)

# --- sort the TermDoc matrix --- #
mydata = removeSparseTerms(x1mat,0.99)
dim(mydata.df <- as.data.frame(inspect(mydata))); mydata.df[1:10,]
mydata.df1 = mydata.df[ ,order(-colSums(mydata.df))]

# -- view frequencies of the top few terms --
colSums(mydata.df1) # term name & freq listed

# -- make barplot for term frequencies -- #
barplot(data.matrix(mydata.df1)[,1:10])

Barplots are passe, perhaps. So let's add some more detail and color. We'll make a wordcloud. Then, we use collocation analysis to see which words occur together most often in a 'typical' review, and view the result as a 'collocation dendrogram'.

# make wordcloud to visualize word frequencies
wordcloud(colnames(mydata.df1), colSums(mydata.df1), scale=c(4, 0.5), colors=1:10)

# --- making dendrograms to visualize
# word collocations --- #
min1 = min(mydata$ncol, 25) # use at most the top 25 words
test = matrix(0,min1,min1)
test1 = test
for(i1 in 1:(min1-1)){ for(i2 in i1:min1){
test = mydata.df1[ ,i1]*mydata.df1[ ,i2]
test1[i1,i2] = sum(test); test1[i2, i1] = test1[i1, i2] }}

# make dissimilarity matrix out of the freq one
test2 = (max(test1)+1) - test1
rownames(test2) <- colnames(mydata.df1)[1:min1]

# now plot collocation dendrogram
d <- dist(test2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram

OK, time to wade into sentiment analysis now. People are passionate about brands in certain categories, and mobile phones are pretty much up there on that list. Let's see the emotional-connect quotient of the reviewers.

So now, we will build wordlists of positive and negative terms, match the reviews' frequent terms with the wordlists and analyze the results.

### --- sentiment analysis --- ###

# read-in positive-words.txt
pos=scan(file.choose(), what="character", comment.char=";")

# read-in negative-words.txt
neg=scan(file.choose(), what="character", comment.char=";")

# including our own positive words to the existing list
pos.words=c(pos,"sleek", "slick", "light")

#including our own negative words
neg.words=c(neg,"wait", "heavy", "too")

# match() returns the position of the matched term or NA

pos.matches = match(colnames(mydata.df1), pos.words)
pos.matches = !is.na(pos.matches)
b1 = colSums(mydata.df1)[pos.matches]
b1 = as.data.frame(b1)
colnames(b1) = c("freq")

# positive word cloud #
# know your strengths #
wordcloud(rownames(b1), b1[,1]*20, scale=c(8, 1), colors=1:10)

Well, so what is the S3 perceived to be strong on in terms of emotional connect quotient? How about S3's weaknesses?

neg.matches = match(colnames(mydata.df1), neg.words)
neg.matches = !is.na(neg.matches)
b2 = colSums(mydata.df1)[neg.matches]
b2 = as.data.frame(b2)
colnames(b2) = c("freq")

# negative word cloud #
# know your weak points #
wordcloud(rownames(b2), b2[,1]*20, scale=c(8, 1), colors=1:10)

At this point, one may ask: "Well, word clustering based on frequency is all fine, but can we also cluster people based on their emotional connect (as seen in their text output)?" Sure. Here goes.

# say we decide to use the top 30 emotional words
# to segment users into groups #

mydata.df2 = mydata.df1[,1:30]

# now plot collocation dendrogram
d <- dist(mydata.df2, method = "euclidean") # distance matrix
fit <- hclust(d, method="ward")
plot(fit) # display dendrogram
# tossup between 2 & 3 clusters

## -- clustering people through reviews -- ##

# Determine number of clusters #
wss <- (nrow(mydata.df2)-1)*sum(apply(mydata.df2,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata.df2,centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# Look for an "elbow" in the scree plot #
# seems like elbow is at k=2
The elbow plot seems to suggest k = 2. If you get something else on your scree plot, choose that value for k and proceed.

Now, in order to characterize the segments in terms of their emotional text output, let us see what the top 3 words are for each segment and decide.

### for each cluster returns 3 most frequent terms ###

# k-means clustering of reviews
k <- 2
kmeansResult <- kmeans(mydata.df2, k)
kmeansResult$size # segment sizes
# cluster centers
round(kmeansResult$centers, digits=3)

## print cluster profile ##
for (i in 1:k) {
cat(paste("cluster ", i, ": ", sep=""))
s <- sort(kmeansResult$centers[i,], decreasing=T)
cat(names(s)[1:3], "\n")}
# print the words of every cluster
That's it. Pls do this, paste your output on a PPT and interpret the results. Answer the Qs above in bullet points, nothing fancy.

Pls feel free to ask around and take help from your peers. You're always welcome to approach me, the AAs or Ankit Anand with any queries. I'd prefer you use the blog's comments section to reach me fastest. I look forward to hearing your feedback on this and other HWs.

Sudhir