Saturday, November 26, 2011

Eleventh hour Phase III Q&A

Update:
Got this email:
Hi Professor,


Will it be possible to extend the deadline for submission of the group project from 6 PM to midnight, i.e. 12 AM?

There are a couple of reasons behind this request:

1) We followed the instructions for extension of SPSS recommended by IT services. However it involved Virtual Box creation which is a terribly slow way to operate heavy data on SPSS (each excel file is 3-4 MB, and it takes 10 minutes just to transfer data from excel to SPSS run on Virtual Box)

2) A lot of time was spent looking for secondary data which we did not expect.

3) Conflicting deadlines for Pricing, CSOT, ENDM and BVFS between today and tomorrow.

Will be grateful if the extension is allowed.
Regards,
My response:
Hi G and Team,


I have no problem with extending the deadline another 6 hrs.

I don't know how to program turnitin.com.

Chandana has set the deadline for 6 PM on the dropbox.

Let me try to reach her and see if she can extend this to 12 midnight.

Shall inform the class soon regarding this.

Sudhir
Hope that clarifies. Shall let you folks know soon via (yet another) mass e-mail.

Sudhir
------------------------------------------------
Hi all,

Got this series of mails from "t" just now. Have edited to remove specifics and focus on general take-aways to share with the class.

Hi Prof,
One question on cluster analysis-
While choosing variables for cluster analysis, can we pick 3 factor scores which are essentially 3 buckets of multiple variables and two individual variables (ie scores on ‘I am generally budget conscious’ and ‘ I carefully plan my finances’)? Does that distort the output? Can it be interpreted?
Regards,
T
My response:
Hi T,
No, that's perfectly fine. In fact, it is recommended when certain variables don't load very well onto the factor solution in the upstream stage.
As for interpretation, yes, it follows the same as it would if they'd all been factors. Factor scores are essentially variable values as far as we are concerned in downstream analysis.
And then T replied:
Thanks for that prof. The only problem is that the variables under question are becoming disproportionately important in the 'predictor importance' scores and are pulling down the importance of other factor scores.
At which I wrote:
Make sure all the input variables to the cluster analysis program are standardized (i.e. subtract the mean and divide by the std dev) before clustering, to remove variable-scaling effects. That should help.
Sudhir
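(For anyone doing this step outside SPSS, a minimal R sketch of what I mean - 'seg_vars' is a hypothetical data frame holding whatever numeric, NA-free clustering inputs you've chosen:)

  z  <- scale(seg_vars)          # subtract each column's mean and divide by its std dev
  km <- kmeans(z, centers = 4)   # the choice of 4 clusters is just a placeholder
  table(km$cluster)              # resulting segment sizes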

P.S.
Will keep you posted as more Q&A happens.

Phase III - Grading Criteria

Hi All,

Might as well outline some thoughts on the grading criteria for the project. These are indicative only and are not exhaustive. However, they give a fairly good idea of what you can expect. Pls ensure your deliverable doesn't lack substance in these broad areas.

1. Quality of the D.P.(s) - How well it aligns with and addresses the business problem vaguely outlined in the project scope document; How well it can be resolved given the data at hand. Etc.

2. Quality of the R.O.s - How well defined and specific the R.O.s are in general; How well the R.O.s cover and address the D.P.s; How well they map onto specific analysis tools; How well they lead to specific recommendations made to the client in the end. Etc.

3. Quality and rigor of Data cleaning - The thinking that went into the data cleaning exercise; the logic behind the way you went about it; the ways adopted to minimize throwing out useful observations, using imputations for instance; the final size of the clean dataset that you ended up with for in-depth analysis. The data section should contain these details, ideally.

4. Clarity, focus and purpose in the Methodology - Flows from the D.P. and the R.O.s. Why you chose this particular series of analysis steps in your methodology and not some alternative. The methodology section would essentially be a subset of a full-fledged research design. The emphasis should be on simplicity, brevity and logical flow.

5. Quality of Assumptions made - Assumptions should be reasonable and clearly stated at the different steps. Was there opportunity for any validation of assumptions downstream; were any reality checks done to see if things are fine?

6. Quality of results obtained - The actual analysis performed and the results obtained. What problems were encountered and how did you circumvent them? How useful are the results? If they're not very useful, how did you transform them post-analysis into something more relevant and usable?

7. Quality of insight obtained, recommendations made - How all that you did so far is finally integrated into a coherent whole to yield data-backed recommendations that are clear, actionable, specific to the problem at hand and likely to significantly impact the decisions downstream. How well the original D.P. is now 'resolved'.

8. Quality of learnings noted - Post-facto, what generic learnings and take-aways from the project emerged. More specifically, "what would you do differently in questionnaire design, in data collection and in data analysis to get a better outcome?".

9. Completeness of submission - Was sufficient info provided to track back what you actually did, if required - preferably in the main slides, else in the appendices? For instance, were Q nos. provided for the inputs to a factor analysis or cluster analysis exercise? Were links to appendix tables present in the main slides? Etc.

10. Creativity, story and flow - Was the submission reader-friendly? Does a 'story' come through in the interconnection between one slide and the next? Were important points highlighted, cluttered slides animated in sequence, callouts and other tools used to emphasize important points in particular slides, and so on?

OK. That's quite a lot already, I guess.

Sudhir

Thursday, November 24, 2011

Phase III - D.P. definition travails

Update:
Hi all,

Another Q I got asked recently which, I think, bears wider dissemination.

1. The 40 slide limit is the upper bound. Feel free to have a lower #slides in your deliverable. No problem.

2. The amount of work Phase III may take, by my estimate, would be about 15-20 person-hours in all - say about 3 hours per group member. Anything much more than that and perhaps you are going about it the wrong way. The project is the high point of applied learning in MKTR_125. So yes, 3-4 hours of effort on the project is not an unfair amount of load. Besides, effort correlates well with learning, in my experience.

3. Do allow yourself to have fun with the project - it's not meant to be some sooper-serious burden, oh no. Keeping a light disposition, a witty touch, a sense of optimism and the big picture in mind helps with flexibility, creativity, out-of-the-box thinking and all those nice things, in my experience. What's more, if you enjoyed doing the project, rest assured it *will* show in the output.

Hopefully that allays some concerns about workload expectations etc. relating to Phase III.

Sudhir

------------------------------------------------------------------------------
Folks,

The example PPTs put up don't have a D.P. explicitly written down because I had given the D.P. in the 2009 project. In the 2011 project, however, you have been given the flexibility to come up with a D.P. of your own, consistent with the project scope. This is both a challenge and an opportunity.

I got some Qs on whether the handset section responses can be ignored because the client is primarily a service provider. Well, the handset section contains important information about emerging trends in the consumer's mindspace, the mobile application space and the mobile-related brands' perception space. A telecom service provider would dearly like to know how many target-segment people have switched or are switching to smartphones, which apps they use most often so that the carrier can emphasize those apps more, and so on. So no, the handset section should not be ignored, IMHO.

Hope that clarified.

Sudhir

--------------------------------------------------------------
Got an email asking whether such-and-such was a good D.P. - R.O. combo.

Now, I can't share what the D.P. itself was, but my response to that team had generalities which might do with some dissemination within the class.

Hi R and team,


Looks good. However, I don't quite see what the decision problem is. Step 1 seems to state more an R.O. than a D.P.

Sure, sometimes a D.P. maps exactly onto a single R.O. and that may well be the case here. But such a D.P. would perhaps be overly narrow considering the project scope initially outlined.

I'd rather you state a broader D.P. and break it down into 3 R.O.s each corresponding to the S, T and P parts of S-T-P.

Well, that's my opinion, you don't necessarily have to buy into it. What you do is ultimately your call.

Do email with queries as they arise.

Good luck with the project and happy learning.
Sudhir
More miscellaneous Q&A:
Dear Sir,


While running the factor analysis on the psychographic questions, do we re-label the levels of agreement and disagreement as 1-5? Will this help us in any way?

Another way out could be to label the responses as 1 and 0 where 1 is for a level of agreement and 0 for disagreement. In the Ice-Cream Survey HW the responses to these psychographic questions were binary which actually helped in the analysis.

A quick response will be really helpful. Our approach is to first segment the consumers using factor analysis and then find some interesting insights into usage based on these segments.

Cheers,

S
My response:


Hi S,
1-5 (or 1-7 in the case of a 7-point scale) is the conventional practice. Safe to go with the conventional practice.

You could well choose to go with something else, but you may need to justify why.

Alternately, copy the data columns of interest onto a new sheet, reduce the #responses to three (-1 for disagree, 0 for neither and +1 for agree), re-run the factor analysis and see if the variance explained, factor structure etc. that you now get are better than with 1-5.

The idea is that as and when you run into specific problems, you think up neat, creative ways to negotiate them and move on. Therein lies the learning in the analysis portion of MKTR. :)

Hope that helped. Do write in with more queries should they arise.
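(If you'd rather do that three-level recode in R than on a spreadsheet, a minimal sketch - 'lik' is a hypothetical data frame holding the 1-5 psychographic columns:)

  # collapse 1-5 agreement scores into -1 (disagree), 0 (neutral), +1 (agree)
  lik3 <- as.data.frame(lapply(lik, function(x) ifelse(x <= 2, -1, ifelse(x == 3, 0, 1))))
  # re-run the factor analysis on lik3 and compare variance explained against the 1-5 run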
----------------------------------------------------------------------------
Dear Sir,

We are facing an issue with regard to clubbing the dataset with Q48 (in terms of Rank) into the final data set that was uploaded earlier. The following are the discrepancies:

• The new Data set for Q48 does not have serial no’s to link it with earlier set

• The no of rows (entries) in the new data set for Q48 are more than the original one

• There should be ranks from 1 to 3, but we see some data entries with ranks up to 9

It would be beneficial for all the groups to have a final data set with Q48(in terms of ranks) extracted from Qualtrics.

Looking forward for a quick response from your end.

Regards,

M

My response:

Hi M,

The respondent ID that is there (leftmost column in the Q48 dataset) can be used to match (via the VLOOKUP function in Excel) with the same rows in the original dataset. The blog post mentioned this specifically.

Once you match the Q48 dataset entries to the original dataset ones, the problem of the Q48 dataset having more or fewer responses also goes away.

The ranks 1 to 3 alone are relevant. Some people ranked more than the top 3 items, so you may see ranks up to 9, but those we can ignore, if need be.

Hope that clarifies.
Sudhir
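(The same match can be done in one line in R, for those who prefer it to Excel's VLOOKUP - 'master' and 'q48' are hypothetical data frames, and 'ResponseID' stands in for whatever the respondent ID column is actually called:)

  # keep every row of the master dataset, attach the matching Q48 ranks by respondent ID
  merged <- merge(master, q48, by = "ResponseID", all.x = TRUE)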
----------------------------------------------------------------
Received this email today from group "J":
Dear Sir,

We have the following analysis for our MKTR project:

Decision problem: Foreign handset player wants to enter the Indian market
Research Objective: Identify the most lucrative segment to enter.

This would make some of the questions about service provision irrelevant. Do you think we are on the right track?
Regards,
My response:
Hi Team J,


The project scope was designed from the viewpoint of a large Indian telecom service provider.
So a purely handset-maker's perspective would be limiting, I feel, and perhaps not entirely consistent with the original project scope document.

My suggestion is you consider modifying the D.P. such that the data on service provider characteristics can also be integrated and used in some manner. For instance, a handset maker looking to ally with a service provider, perhaps.

In general, the D.P. should not be so limiting as to banish a good part of the data we have collected from the analysis. There is plenty of scope to creatively come up with D.P.s that, while focusing on the handset side of the story, also use the telecom carrier data.

Hope that clarifies.
Sudhir

Monday, November 21, 2011

Project Phase III Q&A

Update:
A more recent email with similar Qs:

Hi Professor,

Hope you’re well!

I’m writing to ask some help with the data cleaning of the Project’s Phase III data. Can you please help with the following queries?

1. What does it mean when the data field has both blanks as well as “-99” as the response? Thought it was the same thing, yet Q45, for instance, has both fields.

2. Can you please also advise on how to handle coding of ranking/rating questions? Specifically:
a. Q41 – Rating of top two categories of channels watched
b. Q48 – Ranking of only Top 3 brands out of a list of about 10

3. Also, how do we code blank responses (especially for interval or ratio scaled questions)?
a. Do we put a “0” or leave it blank?
b. How does SPSS handle zeroes vs. blanks?

Please advise.
My response:

Hi A and Team,
1. "-99" means respondent has seen the Q but chosen to ignore it. Blank means the respondent never saw the Q, i.e. the skip logic didn't lead him/her to the Q in the first place.


2. Q41 - select multiple options is what was done for the TV channel Q. So folks have selected 2 (and some have selected more than 2). There is no ranking implicit in what was chosen.

Q48 - I'll download this Q afresh so that the rankings are visible. At present, under 'download as labels', we are unable to see the rankings.

3. a. Depends on how many blanks there are. If the entire row is blank, drop that row - the Q was not relevant to the respondent and hence Qualtrics skipped those Qs for him/her. If a few columns here and there are at "-99", then either impute the mean/median for that column or "0" for "do not know". It's a call you have to take given the context the Q arises in.

b. Doesn't handle them very well but allows you to do some basic ops. SPSS will ask whether you want to exclude cases (i.e. rows) with missing observations or whether you would rather replace the missing cells with the column means. Choose wisely and proceed.

Am sending the Q48 ranking data afresh to the AAs for an LMS upload. Should be up for viewing and download soon. Use the respondent ID to VLOOKUP and match like rows in the master dataset you are currently working on.

Hope that clarified things somewhat.
Sudhir
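(On point 3 above, for anyone doing the recode and imputation outside SPSS, a minimal R sketch - 'dat' is a hypothetical all-numeric data frame of just the columns you plan to analyze:)

  dat[dat == -99] <- NA                        # treat "seen but skipped" as missing
  blank_rows <- rowSums(!is.na(dat)) == 0      # rows that are entirely blank
  dat <- dat[!blank_rows, ]                    # the Q block wasn't relevant to these respondents
  for (j in names(dat)) {                      # simple column-wise mean imputation
    m <- mean(dat[[j]], na.rm = TRUE)          # or median(), per your call
    dat[[j]][is.na(dat[[j]])] <- m
  }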
-------------------------------------------------------------------------------------------------
Got this email from a team:

Dear Chandana,
Phase 3 project requires a 40 slide PPT as our deliverable. I would like to get the following things clarified:
a) Is 40 the minimum or the maximum limit? It seems too big a task.
b) Can we generate output tables of tools such as cluster and factor analysis and include them in the body of the PPT, or should the tables be part of the appendix alone?
c) Is secondary data analysis mandatory/allowed/optional? The entire 40-slide limit cannot be covered through the primary survey alone.
Kindly help in getting these points clarified as we are in the process of finalizing our approach.

My response:

Hi Team G,
1. The 40 slide limit is the upper limit. Feel free to have your final PPT deliverable less than 40 slides long.
2. There is little point in pasting SPSS output tables for factor/cluster analyses inside the 40-slide limit, IMHO. I'd rather groups present the interpretation/info/insight that emerges from such techniques. A hyperlink to the appendices that contain the SPSS tables shouldn't hurt at all, though.
3. Secondary data usage is welcome as long as the sources are documented and cited meticulously.
IMHO, the main challenges arise in deciding upon a suitable D.P. and its constituent R.O.s. What follows is straightforward once these are set. I'd say, don't be overly ambitious in defining your D.P., nor overly shallow in scope either.
I hope that clarified things at least somewhat. Pls feel free to write in with queries as and when.

Regards,

Sudhir



Sunday, November 20, 2011

Data cleaning tips for Project Phase III

Hi all,

As you by now well know, one of the exam questions related to output from your Phase III project (the factor analysis one). While analyzing data for this particular problem, I did notice quite a few irregularities in the data. This is typical. And hence, the need for a reasonable cleaning of the data prior to analysis.

For instance, in the Qs measuring the importance of telecom service provider attributes, there were a few pranksters who rated everything "Never heard of it." Clearly, it's a stretch to think internet-savvy folks have never heard of even one of a telecom carrier's service attributes. Such rows should be cleaned (i.e. either removed from the analysis or, if there are gaps etc., have missing values imputed for those gaps) before analysis can proceed.

Some advice for speeding up the cleaning process:

1. Do *not* attempt to clean data in all the columns. That is way too much. Only for important Qs, on which you will do quantitative analysis using advanced tools (factor/cluster analysis, MDS, regressions etc.), should you consider cleaning the data. So choose carefully the columns to clean data in. Needless to say, this will depend on what decision problems you have identified for resolution.

2. An easy way to start the cleaning process is to check whether there is reasonable variation in responses for each respondent. Thus, for instance, after the psychographic Q columns, insert a new column and in it, for each row, compute the standard deviation of the responses to the psychographic Qs. The respondents with very high or very low standard deviations should be investigated for possible data cleaning. This method would catch, for example, people who mark the same response for every Q, or those who mark only extremes for some reason. (A small R sketch of this check appears after point 4 below.)

3. If there are gaps or missing values (these are depicted with -99 in the dataset) in critical columns, then you may consider replacing these with the median or the mean of that column. A fancier, more general name for such work is imputation, which also includes various model-based predictions to use as the replacement value. Without imputation, one will be forced to drop an entire row for want of a few missing values here and there. Always ensure the imputed values are 'reasonable' before going down this path.

More appropriately, imputation might work better if used segment-wise. Suppose you've segmented the consumer base on some clustering variables. Since those segments are assumed homogeneous in some sense, one can better impute missing values with segment means/medians, perhaps. (Again, a sketch of this appears after point 4 below.)

4. Don't spend too much time on data cleaning either. In the first half-hour odd spent on data cleaning, chances are you will cover the majority of troubled rows. After that, there is a diminishing-returns pattern. So draw a line somewhere, stop, and start the analysis from that point on.
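(For points 2 and 3 above, minimal R sketches of what I have in mind - 'dat', 'psych_cols', 'segment' and 'x' are all hypothetical names for your data frame, psychographic columns, cluster membership and the column being repaired:)

  # Point 2: flag respondents whose psychographic answers barely vary (straight-liners)
  # or swing only between extremes
  resp_sd <- apply(dat[, psych_cols], 1, sd, na.rm = TRUE)
  suspect <- which(resp_sd == 0 | resp_sd > quantile(resp_sd, 0.95, na.rm = TRUE))

  # Point 3: impute a column's missing values with its segment-wise median
  seg_med <- ave(dat$x, dat$segment, FUN = function(v) median(v, na.rm = TRUE))
  dat$x[is.na(dat$x)] <- seg_med[is.na(dat$x)]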

Update:
I'll share my answers here on the blog to any Qs asked to me by groups. Conversely, I request groups to first check the blog for whether their Qs have already been answered here in some form or the other.

Hope that clarifies.

Sudhir

P.S.
Any feedback on the exam, project or any other course feature etc is welcome. As always.


Friday, November 18, 2011

Homework and Exam related Q&A

Post-Exam Update:
I later noticed, after the exam had started, that there was a typo in the binary logit question. The coeff of income^2 was shown as -.088 instead of -0.88. Hence, the predicted probabilities of channel watched for all 3 cases (min, max and mean profile of respondents) would now come up as 1. Folks who show the expression and calculations will get full credit for this problem.
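(For reference, the calculation in question is the standard binary logit probability, p = exp(u)/(1+exp(u)). A hedged numerical sketch - only the income-squared coefficient of -0.88 (the intended value) comes from the exam; the other coefficients and the income value below are placeholders:)

  b0 <- 1.0; b1 <- 0.5; b2 <- -0.88     # intercept, income and income^2 coefficients (b0, b1 made up)
  income <- 3                           # e.g. the mean-profile income
  u <- b0 + b1 * income + b2 * income^2 # linear predictor
  p <- exp(u) / (1 + exp(u))            # predicted probability of watching the channel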

A related Q regarding the DTH caselet. There are many possible assumptions you could make and from each, a different research design might flow. As long as you've stated clearly your assumptions and the research design that follows is logically consistent with your assumptions, you are OK.

Hope that clarifies.

Sudhir

Update:

Another thing I might as well clarify regarding the quant portion - only interpretation of a model and its associated tables will be asked for.

Even there, only those tables that we explicitly have discussed in class, not merely shown but discussed, will be important from an exam viewpoint.

Sudhir

Am getting quite a few Qs regarding this. A wrote in just now:
Hi Chandana,

Need a quick clarification. For this homework, questions 1, 2, 3 and 4 are mandatory and 5, 6, 7 are optional. Is that right?
Regards,
A
My response:
Hi A,
Yes, only Q1-4 are mandatory. The rest are optional in the session 9 HW.
Sudhir
----------------------------------------------------------------
Got this just now:
Professor,

When I mailed you today morning, I had high aspirations of finishing my studies and then meeting you for a quick review. Unfortunately, I just finished going through Session 4 hand out. My sincere apologies but I guess I will have to cancel this appointment…and give the slot to a better prepared student…!!

Thanks,
K
My response:
That's OK. Drop by anyway. I expected I'd be busy during these office hrs but am (pleasantly) surprised. Seems folks have on average understood the material well and don't need additional office hrs now.

Besides, the exam is not designed to be troublesome. I wouldn't worry overmuch if I were you.

Sudhir
----------------------------------------------------------------


Hi all,

Was asked yesterday about this and might as well share with the whole class.

The Q was: in the factor analysis homework, what to do if the factor solution with eigenvalues > 1 is still showing cumulative variance explained of < 60%?

Well, if the % is in the mid to late 50s, just go with it.

If not, it would seem like the factor solution is not doing a great job of explaining variance in the input variables. This is presumably because at least some variables are not well correlated with the others and are hence weakening the factor solution.

To ID such variables, look either at the correlation tables or at the communalities table. The variables that show the least correlation with the others should progressively be removed and the factor analysis re-conducted at each step till the 60% criterion is met.

If a variable loads entirely on its own factor, drop that factor and use that variable as-is in downstream analysis.
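(A rough R sketch of that drop-and-re-run loop, for anyone working outside SPSS - 'items' is a hypothetical data frame of the input variables, and the loop assumes at least one eigenvalue stays above 1:)

  X <- scale(items)
  repeat {
    ev <- eigen(cor(X, use = "pairwise.complete.obs"))
    k  <- sum(ev$values > 1)                            # eigenvalue > 1 rule for #factors
    if (sum(ev$values[1:k]) / ncol(X) >= 0.60) break    # stop once 60% variance is explained
    load <- ev$vectors[, 1:k, drop = FALSE] %*% diag(sqrt(ev$values[1:k]), k)
    comm <- rowSums(load^2)                             # communalities of each variable
    X <- X[, -which.min(comm), drop = FALSE]            # drop the weakest variable and re-run
  }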

Hope that clarifies. Any more such Qs and I shall share it here.

Sudhir

Thursday, November 17, 2011

Project related Q&A

Update:

I split up the old post which had both Q&A as well as deliverable format. The old post is now exclusively deliverable format and is available here:

http://marketing-yogi.blogspot.com/2011/11/more-phase-iii-project-q.html

A recent post on take-aways from past projects (particularly those in 2010) can be found here:

http://marketing-yogi.blogspot.com/2011/11/take-aways-from-past-projects.html
------------------------------------------------------------------------------

Hi all,

Got asked a few Qs, might as well put up answers here.

1."The decision problem is vague. Should we focus more on telecom services or handset features?"

My response: The scope document is vague for good reason - it provides enough wriggle room to interpret decision problems variously. Pls come up with your own decision problem that is not inconsistent with the project scope.

Regarding telecom service versus handset, well, the scope doc says the client XYZ is a major player in the telecom services space - so it is primarily a service provider. The handset features thing is likely secondary to its main goal of improving its position among telecom service providers.

Pls come up with a suitable D.P. and appropriate downstream analysis. The main theme of interest will be how well your D.P. aligns with the scope document and how logical and consistent the downstream analysis is with the stated D.P.

2. Scope says, XYZ wants to know "where the market is going". What does this mean?

Well, XYZ will want to know many things, sure, but only so many are knowable with the data at hand. My intention behind that part was to illuminate the current standing of different offerings first and then project a few years into the future. Make assumptions as necessary. There'll be little data to back up projections into the future, understandably so. Shall later expand on what I mean by this point.

More Q&A as they come will be on this page as updates.

Sudhir
-----------------------------------------------------------

Update: More project related guidance

In general, there are a few Questions which the client is likely to find of interest and which I think each project team could consider going through.

(i) What is the turnover (i.e. rate of change or switching) among attractive customer segments in telecom carriers and handsets? [Hint: see the 'how long have you been with your carrier/handset' Q, among others.] Is there a trend? Anything systematic that might indicate the market is moving towards, or appears to be favoring, a certain set of attributes more than others?

(ii) What kind of usage patterns, apps etc are the attractive segments (say in terms of purchasing power or WTP etc) moving towards? More voice? More data? More something? From here can flow recommendations to the client on which applications to focus on, which platforms to explore alliances with etc perhaps.

(iii) Perceptual maps of current reality - where attractive customer segments perceive current carriers and/or handsets to be. What are the critical dimensions or axes along which such perceptions have formed? What are the attractive gaps in positioning that may emerge? Which niche, perhaps, can a new or existing service lay claim to on the positioning map?

More Qs will be added as they occur.

Pls note that it is *not* necessary that these Qs be answered or recommendations made along their lines. It's just a guide for teams to think about incorporating into their current plan/roadmap. Incorporating these Qs makes sense only if they align with the D.P.s and the R.O.s you have chosen.

Sudhir 

Wednesday, November 16, 2011

Student Contributions - Week 5

Well, the Nielsen story did invite some strong-ish reactions from folks here and there. Good, good.

Here's what Anshul has to say about his experience:
Hi Prof. Voleti,


As discussed in class today, I wanted to mention some shortcomings of Nielsen data that I noticed in my line of work.

I worked in media planning and buying, and used the TAM data quite extensively in order to make decisions regarding the best TV channels to use, as well as in the post analysis of campaigns.
However, quite often, Nielsen data would throw up seemingly garbled figures, due to the following reasons:

1. Insufficient sample size in some markets for the selected TG. This was fine for some weeks, but not others indicating that the sample size could vary from week to week for the same TG and market.

2. Missing data on some ads resulted from inaccurate reporting (missing figures, for instance). In some cases, this was due to inaccurate coding of the creatives.

A possible reason for the errors is that there is still a lot of manual work involved between the time data is collected and when it is shared with agencies.

Therefore, we had to always be careful while using Nielsen TAM data, and tried to cross-check it wherever possible. That was, however, not always possible and we just had to assume that the TAM data was accurate.

This was my observation and thought I’d share. Hope this helps!
Kind regards,
Anshul Joshi
Thanks, Anshul.

Any further student contributions will be put up here.

Sudhir

Phase III Project - Deliverables and format

Hi Class,


Your deliverable consists of a 40-slide PPT (excluding appendices). It should be emailed to me with a copy to the AAs before the deadline: 27-Nov, 6 PM.

More specifically, your PPT should contain:


1. Filename (mandatory) - should be of the form group_name.pptx when submitted.

2. Title slide (mandatory) - Project title - something concise but informative that describes the gist of what you've done. In addition, the title slide should also contain your Group Name, team-member names, PGIDs and MKTR section.
Pls note: make no mention of your team members' names anywhere else in the PPT; have them only in the title slide.

3. Presentation Outline (Optional) - a Contents page that outlines your presentation structured along sections (e.g. methodology, [...] , recommendations, etc).

4. Decision Problem(s) (Mandatory) - State as clearly as feasible the decision problems (D.P.s) you are analyzing. Give numbers to the D.P.s if there's more than one.

If small enough in number, you may also list the R.O.s you have in this slide.


5. A Methodology Section (Mandatory) - Preferably in graph or flowchart form, lay down what analysis procedures you used, and in what order, to answer the particular Research Objectives (R.O.s) that cover the D.P.s.

6. A Data section (Mandatory) - explains the nature and structure of the data that were used. Be very brief but very informative - state (i) the dimensions of the data matrices used as input to the different procedures, and (ii) the sources of data - cited sources if secondary data are used, and question numbers in the survey questionnaire if primary data are used.

I strongly suggest using a tabular format here. Packs a lot of info into compact space. Easy to read and compare too.

Some clue as to what filters or conditions were used to clean the data would be very valuable also.


7. Model Expressions (Mandatory) - Write the conceptual and/or mathematical expressions of any dependence models used. Then directly use the results in downstream analysis.

Kindly place in the appendix section a descriptives table of the input data, a brief explanation of the X variables used, and of course, output tables along with interpretation.

8. Appendix Section (Optional): Some of the less important tables can be plugged into a separate appendix section (outside the 40 slide limit) in case you are running out of slide space. Have only the most important results tables in the main portion.

9. Recommendations (Mandatory) - crisp, clear, in simple words directed towards the client. Emphasize the usability and actionability of the recommendations.

Further, I strongly advise groups to make their PPT deliverables reader-friendly. That is:
- use as simple language as feasible.
- Animate the slides if they are cluttered so that when I go through them later, the proper sequence of material will show up.
- Highlight keywords and important phrases.
- Use callouts etc as required to call attention to particularly important points.
- An economy of words (sentence fragments, for example) is always welcome.

I hope that clarifies a lot of issues with deliverable format. Groups that flout these norms may lose a few points here and there.

Sudhir

Monday, November 14, 2011

Exam related Q&A

Update: Pls read the Coop case in the course-pack. Bring the coursepack with you to the exam hall.

Hi All,

Have just now emailed the end-term exam to ASA and asked the AAs to put up the practice exam on LMS. Admittedly, preparing an open-book exam is not fun. Still, I did the best I could.

1. Ideally, questions would be close-ended and easy to sort out. However, some questions are perforce open-ended and involve text answers. To avoid confusion and overly long-winded answers, the exam is limited-space only. Meaning, there'll be some space provided within which your answer should fit. Anything written outside the given space will *not* be counted for evaluation.

2. The end-term has 5 questions, is designed for <2 hours but you will have 2.5 hours to do it in. So, time will not be a problem.

3. I reckon, as long as you're structured in your thinking, economical in your use of words, well-planned in your approach to the answer and reasonable in the assumptions you make and state, you should be OK.

Update: Let me re-emphasize that given constrained resources (answer space, in this case, and the use of pens rather than pencils), proper planning and organization will go a long way in helping students. I'd strongly advise making a brief outline of the answer first on the rough paper provided before actually using the answer space.

Please feel free to use technical terms as appropriate, highlight keywords etc if that helps graders better understand your emphasis when making a point.

4. The practice exam is about half as long as the end-term, but the question types etc. are broadly the same. No solution set will be made for the practice exam; it is just for practice.

Do bring a calculator to the exam.

Any other Qs etc you have on the exam, I'll be happy to take here and add to this thread.

Sudhir

Sunday, November 13, 2011

Take-aways from past projects

Update: Received a few such emails so might as well clarify:

Hi Prof,

There seems to be some confusion on the final report submission due date. This is to check whether submission is due in session 10.

Kindly confirm.
S

My response:

No. It's due on 27-Nov - the day before term 6.
Hope that clarifies.
Sudhir
----------------------------------------------------------
Hi all,

As promised, some collected thoughts on where folks ran into obstacles in past years. Am collating and putting up relevant excerpts from past blogposts to start with. Shall add and expand this thread as more info comes in.
------------------------------------------------------------------
From last year (financial planning and savings avenues analysis)


[Some] stuff that IMHO merits wider dissemination.
1. Let me rush to clarify that no great detail is expected in the supply side Q.
As in, you're not expected to say - "XYZ should offer an FD [fixed deposit] with a two year minimum lock-in offering 8.75% p.a.".
No.
Saying "XYZ should offer FDs in its product portfolio." is sufficient.


2. Make the assumption that the sample adequately represents the target population - of young, urban, upwardly mobile professionals. 

3. Yes, data cleaning is a long, messy process. But it is worthwhile since once it's done, the rest of the analyses follow through very easily indeed, in seconds. 

4. It helps to start with some idea or set of conjectures about a set of product classes and a set of potential target segments in mind, perhaps. One can then use statistical analyses to either confirm or disprove particular hypotheses about them.

5. There is no 'right or wrong' approach to the problem. There is however a logical, coherent and data-driven approach to making actionable recommendations versus one that is not. I'll be looking for logical errors, coherency issues, unsustainable assumptions and the like in your journey to the recommendations you make in phase III.
------------------------------------------------------------------
From last year on stumbles in the analysis phase:
1. Have some basic roadmap in mind before you start: This is important, else you risk getting lost in the data and all the analyses that are now possible. There are literally millions of ways in which a dataset that size can be sliced and diced. Groups that had no broad, big-picture idea of where they wanted to go with the analysis inevitably ran into problems.

Now don't get me wrong, this is not to pre-judge or straitjacket your perspective or anything - the initial plan you have in mind doesn't restrict your options. It can and should be changed and improvised as the analysis proceeds.

Update: OK. Some may ask - can we get a more specific example? Here is what I had in mind when thinking 'broad, basic plan', from an example I outlined in the comments section of a post below:
E.g. - First we clean the data for missing values in Qs 7, 10, 27 etc. -> then do factor analysis on psychographics and demographics -> then do cluster analysis on the factors -> then estimate the segment sizes thus obtained -> then look up supply-side options -> arrive at recommendations.

Hope that clarifies.
2. Segmentation is the key: The Project essentially, at its core, boils down to an STP or Segmentation-Targeting-Positioning exercise. And it is the Segmentation part which is crucial to getting the TP parts right. What inputs to have for the segmentation part, what clustering bases to use, how many clusters to get out via k-means, how best to characterize those clusters and how to decide which among them is best/most attractive are, IMHO, the real tricky questions in the project.

3. Kindly ask around for software-tool gyan: A good number of folk I have met seemed to have basic confusion regarding factor and cluster analyses and how to run these on the software tool. This after I thought I'd done a good job going step-by-step over the procedure in class and interpreting the results. Kindly ask around for clarifications etc. on the JMP implementation of these procedures. The textbook contains good overviews of the conceptual aspects of these methods.

I'm hopeful that at least a few folk in each group have a handle on these critical procedures - factor and cluster. I shall, for completeness' sake, again go through them quickly tomorrow in class.

4. The 80-20 rule applies very much so in data cleaning: Chances are under 20% of the columns in the dataset will yield over 80% of its usable information content. So don't waste time cleaning data (i.e. removing missing values, nonsense answers etc.) from all the columns, just the important ones only. Again, you need to have some basic plan in mind before you can ID the important columns.

Also, not all data cleaning need mean dropping rows. In some instances, missing values can perhaps be safely imputed using column means or medians or the mode (depending on data type). 

OK, enough for now. More as updates occur.
Sudhir
------------------------------------------------------------------
More from last year on specific problems encountered in final presentations:


1. Research Objective (R.O.) matters.
Recall from lectures 1, 2 & 3 my repeated exhortations that "A clear-cut R.O. that starts with an action verb defined over a crisp, actionable object sets the agenda for all that follows". Well, that wasn't all blah-blah-blah. Its effects are measurable, as I came to see.

Had the entire group been on board with and agreed upon a single, well-defined R.O., then planning, delegation and recombining different modules into a whole would have been much simplified. Coherence matters much in a project this complex and with coordination issues of the kind you must've faced. It was likely to visibly impact the quality of the outcome, and IMHO, it did.

2. Two broad approaches emerged - Asset First and Customer First.
One, where you define your research objective (R.O.) as "Identify the most attractive asset class." and the other, "Identify the most attractive customer segment." The two R.O.s lead to 2 very different downstream paths.

Most groups preferred the first (asset-first) route. Here, the game was to ID the most attractive asset classes using size, monetary value of the addressable market or some such criterion, and then filter in only those respondents who showed some interest in the selected asset classes. Then characterize the respondent segments thus (indirectly) obtained and build recommendations on that basis.

I was trying to nudge people towards the second, "customer segmentation first" route, partly because it aligns much more closely with the core Marketing STP (Segmentation-Targeting-Positioning) way. In this approach, the entire respondent base is first segmented along psychographic, behavioral, motivational or demographic bases; then different segments are evaluated for attractiveness based on some criterion - monetary value, count or share etc.; and then the most attractive segments are profiled/analyzed for asset class preferences and investments.

Am happy to say that in a majority of the groups, once a group implicitly chose a particular R.O., the approach that followed was logically consistent with the R.O. 

3. Some novel, surprising things.
Just reeling off a few quick ones that do come to my mind.

One, how do you select the "most attractive" segment or asset class given a set of options? Some groups went with a simple count criterion - count the # of respondents corresponding to that cluster and pick the largest one. Some groups went further and used a value criterion - multiply the count with (% savings times average income times % asset class allocation) to arrive at a rupee figure. This latter approach is more rigorous and objective, IMHO. Only 2 groups went even further in their choice of an attractiveness criterion - to a customer lifetime value (CLV) criterion. They multiplied the rupee value per annum per respondent with a (cleaned up) "years to retirement" variable to obtain the revenue stream value of a respondent over his/her pre-retirement lifetime. Post-retirement, people become net consumers and not net savers, so post-retirement is a clean break from pre-retirement. I thought this last approach was simply brilliant. Wow. And even within the two groups that did use this idea, one went further and normalized cluster lifetime earnings by cluster size, giving a crisp comparison benchmark.
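(For concreteness, the value and CLV criteria above amount to something like the following - every column name here is a placeholder, not the actual variable name from that dataset:)

  dat$value <- dat$income * dat$savings_pct * dat$alloc_pct      # rupee value per annum per respondent
  dat$clv   <- dat$value * dat$yrs_to_retire                     # pre-retirement value stream
  aggregate(cbind(value, clv) ~ cluster, data = dat, FUN = sum)  # compare segments on totals
  aggregate(clv ~ cluster, data = dat, FUN = mean)               # CLV normalized by cluster size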

Two, how to select the basis variables for a good clustering solution? Regardless of which approach you took, a good segmenting solution in which clusters are clear, distinct, sizeable and actionable would be required. One clear thing that emerged across multiple groups was that using only the Q27 psychographics and the Demographics wasn't yielding a good clustering solution. The very first few runs (<1 minute each on JMP and I'm told several minutes on MEXL) should have signaled that things were off with this approach. Adding more variables would have been key. Typically, groups adding savings motivation variables, Q7 constant sum etc were able to see a better clustering solution. There is seldom any ideal clustering solution and that's a valuable learning when dealing with real data (unlike the made-up data of classroom examples).

One group that stood out on the second point used all 113 variables in the dataset in a factor analysis -> got some 43-odd factors -> labeled and IDed them -> then selectively chose 40 from among the 43 as a segmenting basis and obtained a neat clustering solution. The reason this 'brute force' approach stands out in my mind is that there's no place for subjective judgment, no chance that some correlations among disparate variables will have been overlooked, etc. It's also risky, as such attempts are fraught with multi-collinearity and inference issues. Anyway, it seemed to have worked.
[...]
Anyway, like I have repeatedly mentioned - effort often correlates positively with learning. So I'm hoping your effort on this project did translate into enduring learning regarding research design, data work, modeling,  project planning, delegation and coordination among other things.
------------------------------------------------------------------
OK, that's it for now. Shall update the thread with more learnings from past years for reference purposes.


Sudhir

Thursday, November 10, 2011

Project Updates

Update: Received a few such emails so might as well clarify:
Hi Prof,


There seems to be some confusion on the final report submission due date. This is to check whether submission is due in session 10.

Kindly confirm.
S
My response:
No. It's due on 27-Nov - the day before term 6.
Hope that clarifies.
Sudhir

Update:
The Phase III dataset is available for download from LMS. Happy analyzing.

'-99' is the code for a question seen but not answered. The proportion of -99s for different Qs might give some clue as to the cost of not forcing those Qs to be answered.

Some exemplary projects from the MKTR 2009 car survey have also been put up on LMS. Teams Ajmer, Mohali and Kargil did well, whereas Jhumri-Taliya didn't do so great.

Just got done with the general tutorial/Q&A in AC2. My thanks to those who showed up. Got some valuable feedback as well - some of it possible to implement right away for sessions 9 & 10, perhaps.

I'll make detailed how-to slides on SPSS so that getting too deep into the tool does not occupy class time. You can go home and practice the classroom examples at leisure. I'll focus more on analysis results and their interpretation. Unlike many other courses you may have taken, MKTR is a tool-heavy course, so some level of engagement with the SPSS tool is unavoidable.

Hope that clarifies.

Sudhir
-------------------------------------------------------------------
Hi All,

Am closing the survey responses for MKTR Project 2011 - we have some 3,153 responses in all. After getting rid of say about 20-25% invalid responses, we may still have well above 2000 responses. Quite sufficient for most of our purposes, I reckon.

I'll put up a preliminarily cleaned version of the dataset in Excel file format on LMS by noon, 11-Nov Friday. Phase III can then begin. Pls also find some exemplary PPTs on the MKTR 2009 car project analyses on LMS by Friday noon.

Sudhir

SPSS Issues (not license related)

Update: Received this email from Mr Suraj Amonkar of section C on one possible reason why we saw what we did in the 2-step procedure:
Hello Professor,


I have attached the document which explains a bit the sample-size effect for two step clustering.

Since the method uses the first step to form "pre-clusters" and the second step to use "hierarchical clustering", I suspect having too small a number of samples will not give the method enough information to form good "pre-clusters". Especially if the number of variables are high, relative to the number of samples

“ SPSS has three different procedures that can be used to cluster data: hierarchical cluster analysis, k-means cluster, and two-step cluster. They are all described in this chapter. If you have a large data file (even 1,000 cases is large for clustering) or a mixture of continuous and categorical variables, you should use the SPSS two-step procedure. If you have a small data set and want to easily examine solutions with increasing numbers of clusters, you may want to use hierarchical clustering. If you know how many clusters you want and you have a moderately sized data set, you can use k-means clustering. “
Also, there are methods that automatically detect the ideal “k” for k-means. This in essence would be similar to the “two-step” approach followed by SPSS (which is based on the bottom-up hierarchical approach). Please find attached a paper describing an implementation of this method. I am not sure if SPSS has some implementation for this ; but R or Matlab might.

Thanks,
Suraj.
My response:
Nice, Suraj.


It confirms a lot of what I had in mind regarding two-step. However, the 2-step algo did work perfectly well in 2009 for this very same 20-row dataset. Perhaps it was, as someone was mentioning, because the random number generator seed in the software installed was the same for everybody back then.

Shall put this up on the blog (and have the files uploaded or something) later.

Thanks again.

Sudhir
Added later: Also, a look through the attached papers shows that calculating the optimal number of clusters in a k-means scenario is indeed a difficult problem. Some algorithms have evolved to address it, but I'd rather we not go there at this point in the course.

Sudhir.
Hi All,

SPSS as we know it has changed from what it was like in 2009 (when I last used it, successfully, in class) to what it is like now. In the interim, IBM took over SPSS and seems to have injected its own code and programs into select routines, one of which is the two-step cluster analysis solution. This change hasn't necessarily been for the better.

1. First off, a few days ago when making slides for this session, I noticed that a contrived dataset, from a textbook example no less, that I had used without problems in 2009, was giving an erroneous result when 2-step cluster analyzed. The result shown was 'only one cluster found optimal for the data' or some such thing. The 20-row dataset is designed to produce 2 (or at most 3) cleanly separating clusters. So something was off for sure.

2. In section A, I over-rode the 'automatically find optimal #clusters' option and manually chose 3. In doing so, I negated the most important advantage 2-step clustering gives us - an information-criterion-based, objective determination of the optimal # of clusters. Sure, when over-ridden, the 2-step solution still gives some info on the 'quality' of the clustering solution - based, I suspect, on some variant of the ratio of between-cluster to within-cluster variance that we typically use to assess clustering solution quality.

3. In section B, when I sought to repeat the same exercise, it turned out that some students were getting 2 clusters as optimal whereas I (and some other students) continued to get 1 as optimal. Now what is going on here? Either the algorithm itself is now so unreliable that it fails these basic consistency tests for datasets, or maybe there's something quirky about this particular dataset due to which we see this discrepancy.

I'd like to know whether you get a different optimal # of clusters when doing the homework with the same input.

4. Which brings me to why, despite the issues that've dogged SPSS - license-related ones, primarily - I've insisted upon and stuck with SPSS. Well, it's far more user-friendly and intuitive to work with than Excel, R or JMP, for one. Of course, the second big reason is that SPSS is the industry standard and there's much greater resume value in saying you're comfortable conducting SPSS-based analyses than in saying 'JMP', which many in industry may never have heard of.

A third reason is that in a number of procedures - cluster analysis and MDS among them - SPSS allows us to objectively (that is, based on information criteria or other such objective metrics) determine the optimal number of clusters, axes etc., which would otherwise need to be done subjectively, risking errors along the way. Also, in many other applications, forecasting among them, SPSS provides a whole host of options that are not available in other packages (R excepted, of course).

5. Some students reported their ENDM-based SPSS licenses expired yesterday. Well, homework submission for the quant part has anyway gone flexible, so I'll not stress too much on timely submission. However, undue delay is not called for, either. I'm hoping most such students are able to work around the issue with the virtual machine solution that IT has documented and sent you.

Well, I hope that clarifies what's been going on and why we are where we are with the software side of the MKTR story.


Sudhir

P.S.
6. All LTs are booked almost all day on Friday. The only slot I got is 12.30-2.30 PM on Friday. The poll results do not suggest a strong demand for an R tutorial. So I'm announcing a general SPSS-cum-R hands-on Q&A session for 11-Nov Friday, 12.30-2.30 PM in the AC2 LT. Attendance is entirely optional. If nobody bothers to show up, that's fine too - I'll simply pack up and head home.

Wednesday, November 9, 2011

Session 8 Homework Q&A

Update: Received this email over the weekend.
Dear sir

Actually, I think the issue is not with column V7 (V7 has both 1's and 0's), but rather with column V19, which is all "." (dots) when added to SPSS.
On eliminating V19 and adding V7 back in, the factor reduction is running smooth.
R

Update: Have just sent in my solutions to the session 7 homeworks to the AAs. Should be up on LMS soon.

Hi all,

Well, well... an early bird decided to take on the Session 8 HW today itself...

Here's an email I got from V:
Dear Prof.,


When trying to do the “factor analysis” on the “hw1 ice-cream dataset” I am encountering the following issue –

The data type is “nominal” (0,1) and when I run a factor analysis on SPSS, it throws up the following error message –

Warnings

There are fewer than two cases, at least one of the variables has zero variance, there is only one variable in the analysis, or correlation coefficients could not be computed for all pairs of variables. No further statistics will be computed.
Could you please help with what I might be doing wrong.

Thanks!

Regards,
V
My response:
Aha.


Check up the descriptive stats. See if all the variables are indeed 'variables'. If I recall correctly, one of the questions was such that *everybody* answered the same way.

So the std deviation in that column was zero. Will need to eliminate that one first.

Sudhir
And then his response:
Thanks professor!


It seems to be working after I eliminated V7.

Regards,
V
All's well that ends well I guess.
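(The general learning: screen for constant or near-constant columns before a factor run. In R, for instance, the check is a two-liner - 'hw1' here is a hypothetical data frame of the ice-cream survey items:)

  sds <- sapply(hw1, sd, na.rm = TRUE)   # per-column standard deviations
  hw1 <- hw1[, sds > 0]                  # drop constant (zero-variance) columns such as V7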

Sudhir

9-Nov Phase II interim results

Hi all,

We've some 2600+ responses. Am hoping for 2000+ valid responses now. Great going! Here's the latest status check on which team's where.

Responses by team (Count of V1):
Agra 73
Bhilai 77
Bijapur 60
Cochin 59
Dehradun 67
Dhanbad 37
Dum dum 119
Durgapur 125
Gulmarg 80
Guntur 76
HasmatPet 50
Jaipur 83
Jamshedpur 81
Jaunpur 71
Jhujuni 254
Kakinada 67
Kesaria 107
Naihati 88
Palgat 112
Panaji 55
Peepli 51
Portblair 44
Rampur 117
Sangli 73
Shivneri 55
Sultanpur 82
Team Leh 47
Trawadi 61
Trichy 61
Ujjain 27
(blank) 294
Grand Total 2654

Teams with 8-10 respondents per team member (or, say, 60+ valid responses per team) are at par. So don't worry overmuch about it.
 
Sudhir

Tuesday, November 8, 2011

Session 7 Homework issues

Got these few emails:

Dear Prof/TA,
I could see two different Homework Part II for Session 7. One in page 11 of the handout that was given to us.
The other is in the slides in Addendum slides for session 7 called Homework Part II ( Optional )
My question is, which one is actually Homework Part II for session 7 and is it optional ?  Sorry if I misunderstood instructions in the class.
Thanks
H

My response:

Yes, the slide deck contains the optional part. Feel free to ignore it.
There are two questions in the session 7 HW - one related to standard regressions (the mobile usage example) and one related to multinomial logit in the worksheet 'hw4 MNL'.
Hope that clarifies.


Then this one:

Professor,
I am unclear on Part-II of hw 7. We are asked on predicting the edulevel using the means and modes of the relevant model. So we have the “us” or the numerator part of the logit model, but not the “them” part.
To analyze the “them” part, we have 3 levels in rateplan, 3 levels in gender and familysize.
·         Is taking family size as the mean value for the denominator appropriate?
·         If so, this will give a total of 9 combinations (or parts) for the denominator.
·         We then look at the probability of education level 1 and 2 and whichever is more probable is our answer
Is this approach correct?
Sincerely,
R

My response:

Hi R,
Pls look at the addendum for session 7, put up on LMS, on the logit-based prediction model.
There, given a set of X values X={10,9,1} for {sales, clientrating, coupon1}, we predicted the probability of instorepromo being 1,2 or 3.
Similarly, once you have a set of Xs for {edu, famsize, rateplan, etc.}, use the SAME X profile in both the numerator and the denominator of the logit expression.
Hope that clarifies.
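(A minimal numerical sketch of that prediction step, using the addendum's example profile X = {10, 9, 1} for {sales, clientrating, coupon1}; the coefficient vectors below are placeholders, not the actual estimates:)

  x  <- c(1, 10, 9, 1)                  # intercept term + the X profile {sales, clientrating, coupon1}
  b2 <- c(0.5, -0.10, 0.20, 0.30)       # coefficients for outcome level 2 (level 1 is the base)
  b3 <- c(-1.0, 0.05, 0.10, -0.20)      # coefficients for outcome level 3
  u  <- c(0, sum(x * b2), sum(x * b3))  # base level's utility is 0; the same x enters every term
  p  <- exp(u) / sum(exp(u))            # predicted probabilities of levels 1, 2 and 3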
Another:
Hi Chandana,
Can you let me know where the worksheets for the homework are? I’m unable to find them on LMS. Session 7 slide 29 says worksheet labeled ‘ex2’; Session 7 slide 44 says worksheet name ‘hw 1 MNL’. I can’t find either.
Thanks,
RC

My response:
'ex2' is the standard-regressions-based homework - the mobile usage one. 'hw4 MNL' is the logit-based homework; it was wrongly written as 'hw1 MNL'. MNL is Multinomial Logit in the sheet name. I hope that clarifies.
Shall put up more homework-related Q&A in this thread as they happen. Most recent on top.

Sudhir

SPSS travails regarding a course clash

Update:
ITCS has sent an email with instructions on how to use the virtual machine workaround that Manojna (below) had helpfully pointed out. This is for those 60 students whose licenses may run out this week. Pls follow the instructions and get your SPSS extended. IT has also kindly agreed to help out should you run into any difficulties during this process. I'm glad a way has been found around this issue.

Sudhir


Update: Received the following from Manojna Belle.
Dear professor,


I’m one of the students affected by the course clash. Thought of sharing the work around that I am using to get around the problem. Here’s what I did:

I installed virtual box (a virtual machine software by Sun/Oracle – it’s open source available at https://www.virtualbox.org/wiki/Downloads), installed windows on it and then installed SPSS there. Seems to be working fine. The whole process took me about 45 mins.

Good thing about this is that an “image” can be created from this virtual machine. Whoever needs to replicate this can just install the virtual machine and import this image. So I’m guessing if IT can do what I did and create an image, everyone can setup their system in about 30 mins by simply installing virtual machine and importing the “image”.

Note: The virtual machine I configured didn't have online access for some reason (not even the intranet). Had to install Cisco NAC to resolve it. But doing that was a little tricky - the only way to transfer content onto the virtual machine without internet/intranet is to mount a folder from the "host" machine. Others won't have to do any of this if IT can provide an "image" that has online access enabled.

Thanks and Regards,

Manojna Belle

Have forwarded the email to IT to see if they can use this to deploy on a large scale. There are 60 people in common between ENDM and MKTR. So yes, there is a lot at stake here.

Sudhir
---------------------------------------------------------------------------
Hi all,

I received this email from Gaurav yesterday:
Hello Prof,


A lot of us are also taking Prof Arun Preira's course on Entrepreneurial Decision Making (ENDM). We installed SPSS trial version for one of the assignments in that course some days back (more than a week). Since this would be a 15 day trial it might not last the duration of the marketing research course. Request you to have a look at what we can do.

Thanks and Regards,

Gaurav.
I forwarded this to IT:
Hi Team IT,


Could you tell me if the 15 day license would work if a previous 15 day license had already been deployed for another course (see email appended below)?

Pls advise.

Regards,
Sudhir
And this is the reply I got:
Dear Professor,


Regret to inform you that it doesn't allow us to re-install the same for another 15 days, because it registers all the values in system files.

Regards,
Satish

Basically, I'm at a loss at this juncture. SPSS is the mainstay for this course, and I was under the impression MEXL is the mainstay for the other course. Now we find ourselves in a fix. I'll take it up with ASA and again with IT to see if a solution is possible.

I wonder how many such cases of overlap between the 2 courses there actually are. I've asked ASA to send me the list of those affected.

Update:
Basically, after a conversation with the IT folks, I see that we have a few options here, none of them particularly good.

Option 1: Re-install the OS. That clears the registry and the new 15 day license will be accepted automatically. However, this typically tends to take an hour odd per machine and is likely to overwhelm both student and IT team time. Still, folks willing to spare an hour for this can opt for it.

Option 2: Partition the hard drive and install Linux or WinXP or something there. Within that, an SPSS version can be installed. Not sure how different this is from option 1 in time and trouble terms.

Option 3: Do without SPSS in class. Borrow a peer's machine for a while or use the LRC lab's comps to complete homeworks on. Not a great option but workable since there is no comp-based SPSS exam component this year.

If anybody has any other ideas, pls let me know. Shall be happy to explore them further. Team IT has agreed to again approach the IBM-SPSS vendor with a fresh request but is not optimistic about a positive response given how much we have already asked of them this year.

Sudhir

Monday, November 7, 2011

7-Nov Phase II interim status-check

Hi All,

Am extremely pleased to say we've 1700+ responses already. Assuming even a quarter of them are unusable, we'll still have 1000+ responses to play around with. Good! And we still have some 3 days to go. Great!

Here's the latest status-check for where groups are currently:

Agra 57


Bhilai 58

Bijapur 40

Cochin 35

Dehradun 51

Dhanbad 29

Dum dum 105

Durgapur 108

Gulmarg 30

Guntur 53

HasmatPet 27

Jaipur 44

Jamshedpur 48

Jaunpur 65

Jhujuni 160

Kakinada 18

Kesaria 81

Naihati 30

Palgat 81

Panaji 27

Peepli 27

Portblair 30

Rampur 61

Sangli 65

Shivneri 49

Sultanpur 67

Team Leh 29

Trawadi 13

Trichy 50

Ujjain 20

(blank) 204

Grand Total 1763


I'd say about 8-10 valid responses per team member would be par for this phase. Of course, the more the merrier. Being at or around par would earn most groups 5% of the total 8% grade. The top quarter of groups get 7 or maybe 8%, based on the actual numbers.

Soon, very soon, we'll be up and running on this one with Phase III.

Sudhir

Sunday, November 6, 2011

About Mind-reading

Hi All,

We'd discussed in session 5 (qualitative research) some of the ethical issues and risks posed by mind-reading-ish technologies. Well, well, just this week, The Economist carried a major piece outlining similar thoughts.
Reading the brain: Mind-goggling

Regarding how far tech has progressed, it says:
Bin He and his colleagues at the University of Minnesota report that their volunteers can successfully fly a helicopter (admittedly a virtual one, on a computer screen) through a three-dimensional digital sky, merely by thinking about it. Signals from electrodes taped to the scalp of such pilots provide enough information for a computer to work out exactly what the pilot wants to do.


That is interesting and useful. Mind-reading of this sort will allow the disabled to lead more normal lives, and the able-bodied to extend their range of possibilities still further. But there is another kind of mind-reading, too: determining, by scanning the brain, what someone is actually thinking about.

Well, imagine the possibilities for psychological and qualitative research then, eh?

Shall append more relevant articles to this post as I find them.

Sudhir

Homework session 6 Issues

Update:
OK, IT tells me they've already sent instructions for the SPSS trial version download. Great. Then this homework turns out to be much easier than I had first imagined. Good. Some of the gyan on re-coding and transforming data for the t-test elaborated below would still hold, I guess.

Hi all,

1. Pls let me know if there are any queries etc. you're facing w.r.t. the session 6 homework. Shouldn't take more than an hour odd, by my reckoning, but if you've no clue on how to approach the questions, then it can seem quite daunting, I now realize.

2. I'll present my own solution to this homework in class in a few slides.

3. The most common-sensical approach, the way I see it, for the first two questions is to take out the four concerned columns into a fresh worksheet (and keep Respondent ID also, to keep count), build pivots and run chisq.test() in R on the pivots obtained (see the R sketch after this list). In Excel, you'll need to also generate the expected distribution. This is done as (row total * column total)/(overall total) for each cell in the table. As a general rule, ignore blanks and non-standard responses in your cross-tabs.

4. For the t-test question, you'll need to re-code the data into metric (interval) form. So use CTRL+H or the 'Find and replace' function in Excel to transform the text responses obtained into a 1-5 scale (or a -2 to +2 scale) or something similar. Sort the responses to weed out blanks and other such. Then run the simple TTEST() function in Excel.

5. The above is only one way of doing these things. It seemed to me to be a common-sensical approach and so I elaborated on it. You may reach the answers in a quicker, smarter way, perhaps. That is entirely fine too.
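Here's the R sketch referred to in points 3 and 4 above. The data frame and column names (resp, q1, q2, q3) and the response labels are assumptions purely for illustration; map them to your own worksheet.

# Cross-tab questions: chi-square test of independence on the pivot
tab <- table(resp$q1, resp$q2)    # the pivot / cross-tab (drop blanks and non-standard responses beforehand)
chisq.test(tab)                   # expected counts (row total * column total / overall total) are computed internally

# T-test question: re-code the text responses onto a 1-5 scale first
codes  <- c("Strongly disagree" = 1, "Disagree" = 2, "Neutral" = 3, "Agree" = 4, "Strongly agree" = 5)
scores <- codes[as.character(resp$q3)]   # anything non-standard (incl. blanks) becomes NA
scores <- scores[!is.na(scores)]         # weed those out
t.test(scores, mu = 3)                   # e.g. a one-sample test against the scale midpoint; adapt to the hypothesis you've formulated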

Hope that helps. Pls use the comments thread below for Q&A in case of any queries.

Sudhir

Saturday, November 5, 2011

Where groups stand currently in Phase II

Hi all,


Qualtrics is reporting some 784 responses so far. IMHO, that is quite encouraging indeed. Linear extrapolations are dicey but at this rate I can cautiously hope we'll meet the 1500 target on time.

Of course, of these 784, about a fifth odd would be invalid/incomplete responses on average (in the optimistic case). Some surveys take less than a minute, it seems, and others report opening-to-closing times of several hours. There's been some murmuring about people taking the survey more than once on behalf of the same or different groups. Will have to check IP addresses and weed out duplicates for this.
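For the curious, the duplicate check I have in mind is just a few lines in R. The file and column names below (phase2_responses.csv, IPAddress) are illustrative; the actual Qualtrics export may label them differently.

# Flag repeat responses from the same IP address
resp <- read.csv("phase2_responses.csv", stringsAsFactors = FALSE)
dup  <- duplicated(resp$IPAddress)   # TRUE for the second and later responses from an IP
table(dup)                           # how many potential duplicates
clean <- resp[!dup, ]                # one possible rule: keep only the first response per IP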

Am attaching below the status of what I have received thus far.

Group #responses


Agra 14

Bhilai 34

Bijapur 15

Cochin 12

Dehradun 26

Dhanbad 9

Dum dum 54

Durgapur 69

Gulmarg 20

Guntur 18

HasmatPet 9

Jaipur 17

Jaunpur 44

Jhujuni 36

Kakinada 2

Kesaria 34

Naihati 10

Palgat 54

Panaji 7

Peepli 6

Rampur 40

Sangli 47

Shivneri 43

Sultanpur 27

Team Leh 14

Trawadi 1

Trichy 26

Ujjain 6

Blank 89

Will update again on Tuesday. Shall put up the names column of folks who have responded on LMS on Monday (I don't quite know how to operate LMS, will need to wait for the AAs to come in on Monday and do this) so that groups can check if their contacts have responded or not and follow up accordingly.
 
The number of blanks (at 89, it's easily the mode) is worrisome. Respondents apparently aren't taking the time and trouble to record the name of the group that sent them the survey. Pls reinforce the importance of this aspect amongst your contacts, folks.
 
Phase II is truly underway and soon, very soon, there'll be plenty of data to play around with. Plenty of it. And Phase III will be upon us. The division of grade is roughly 8%, 8% and 19% for the 3 phases, respectively.
 
Pls use the comments section of this post for clarifications or other Q&A, as far as possible.
 
Sudhir

Friday, November 4, 2011

General Info Re Quant Sessions

Update: Quant sessions Homework related


Got this email from SS:
Hi,
I know that for the first few classes we had to submit HW that was handwritten. But for the current homework, it doesn't make sense to do it handwritten, as most of the work will be done in SPSS (cross-tabs, hypothesis testing).
Does that requirement of homeworks being handwritten still stand? Or can we submit a printed copy containing the tables and the statistics from SPSS?
Thanks, SS
My response:
Hi SS,
The pivots in question are 2-by-2 tables, at most 3-by-3 ones. Am sure it'll be just as easy writing them down as copy-pasting them.
Yes, I'd like to keep the handwritten requirement continuing. It ensures that I on my part don't give out unrealistic homeworks that can't be written out in a few lines.
Regards,


More generally: 
The quant-side homeworks emphasize practice. More than that: data preparation, some re-coding here and there, data cleaning to get rid of incomplete or nonsensical responses from the analysis frame, and so on. These are common issues that will crop up in the analysis phase of almost all quantitative MKTR. It's important you know how to address and get around these. However, I don't need to know the details of how you got around these obstacles - there are various, equivalent ways of doing so. I just want the final answers - whether something was significant or not, the hypotheses as formulated, null rejected or not, what the p-value is, etc. Followed by what this means for MKTR recommendations and the decision problem. In a few lines. That's all.

Another way to think of it is that the homework is an exec summary of the work you've done (which is typically no longer than 1, at most 2 pages) rather than a detailed report with appendices of the work you've done.

Regards,

Sudhir
----------------------------------------------------------------------------

Hi All,

Session 6, the first of the quant sessions, was eye-opening. Was admittedly a tad nervous because hajaar things can go wrong in such situations. The point of the quant sessions is to get a good idea of what quant analysis can and cannot do, and let the software do the rest. Hence the emphasis on taking apart and understanding some of the technical jargon that often surrounds quant analysis.

Anyway, as expected, some stumbles and hiccups did happen.

For Sections A and B - the addendum contains details on the main reason why sample-sizing is tricky, namely that finding the sample std deviation before sampling is done is a chicken-and-egg situation. Hence, the need to use past experience to guide the sigma decision and/or arrive first at an initial sample size before iterating and arriving at the final sample size.
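As a quick illustration of that iteration, here's a small R sketch; the sigma guess, confidence level and tolerable error below are placeholder numbers only.

# Sample size for estimating a mean: n = (z * sigma / E)^2
z     <- 1.96                        # 95% confidence
sigma <- 1.5                         # initial guess for the std dev, from past studies or a pilot
E     <- 0.25                        # tolerable margin of error
n0    <- ceiling((z * sigma / E)^2)  # initial sample size

sigma_hat <- 1.2                     # hypothetical estimate from the n0 responses actually collected
n1 <- ceiling((z * sigma_hat / E)^2) # revised sample size; iterate until it settles
c(n0, n1)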

For section C - that SPSS failed to open was unlucky. Am hoping for better luck session 7 onwards.

For all sections - Please ensure SPSS is downloaded and ready for session 7 a day before the session starts. IT has assured me the instructions will be sent on time. If they aren't out by Sunday evening, let me know.

Will conduct an R tutorial next weekend sometime. Will go over all the procedures we have covered in class on R. For the session 6 homework, folks, feel free to consult within and across groups. No problem. I'm just interested that folks ultimately find a way to do the homeworks for themselves. How you get there is up to you.

Sudhir

Thursday, November 3, 2011

Qualitative Homework Related

Update: This I got from Vimalakumar S
Dear Professor,


This is in response to the blog entry regarding Qualitative Homework and use of Software tools to count words.

I have two issues to discuss here: -

1. I find equating all open-ended questions and their answers with qualitative inputs erroneous. Just because the survey question did not restrict the responses to a few choices does not make the inputs qualitative. Another proof of this is that most of us who attempted the assignment were inclined to count the recurring words rather than read and understand what the consumer was saying. In my opinion, if you can reduce the data collected onto a frequency distribution chart, it is no longer qualitative. Maybe the survey should have limited the answer choices based on a prior understanding of possible responses.


2. A related issue would be the question of "How to analyse qualitative inputs?" And in my opinion, counting words is not the answer. The researcher turns to qualitative questioning to understand the consumer in a manner better than what a simple quantitative question would convey. To analyse the inputs and take in all their richness, we need a better tool than a frequency distribution chart of recurring words.

Regards,

Vimal
My take: Sure. Qualitative info for qualitative understanding has its own importance, and we're yet to find a better way than expert analysis for that one. However, when open-endeds are used the way they are in surveys, some attempt at categorization (and thereby dimension reduction) is not a bad idea. Thanks for writing in and sharing your thoughts.

Update: Session 6 related.
The quant portion has started. Expectedly there have been a few hiccups, esp in the absence of SPSS. Hopefully session 7 will fare smoother than this. I shall resend a slide deck as an addendum to session 6 slides with material that may have been missed in one section or the other. The slide deck will pointedly include R code, instructions and screen caps.

Hi All,

The session 5 homework relating to some preliminary analysis of open-ended qualitative data was, I take it, quite challenging.

I would expect folks to take a straightforward, common-sensical approach: take a random sample of some, say, 100-odd responses and analyze them assuming they represent the rest of the responses for the purposes of the homework. Use of Excel functions like FIND() etc. would be par for this course.

However, as always happens with homeworks, some have gone further, gone creative or gone haywire in coming up with new approaches to this problem.

Below I present what one student did from his email:

Hey Professor,


I used the following two tools to answer the Qualitative research homework and try and make sense of the mounds of data provided to us and figure out the top reasons :

1.) Word Frequency : http://writewords.org.uk/word_count.asp

2.) Phrase Frequency : http://writewords.org.uk/phrase_count.asp

Of course, I had to figure out the correct number of words to use in the phrase, but after that, it made it much easier to consume the raw data of the survey. There are many more tools which also do this, but this was the first result from Google :).

I was initially planning to build a word cloud and highlight the most common words, but I gave that up after I realized that words like "I" and "had" were the most frequent.

Regards,
Shyam Seshadri

Thanks, Shyam.
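For those who'd rather stay inside R, a crude word count along the same lines takes only a few lines. The vector name 'answers' below is an assumption; read your open-ended column into it first.

# Crude word-frequency count on open-ended responses
words <- unlist(strsplit(tolower(answers), "[^a-z]+"))  # split on anything that isn't a letter
words <- words[nchar(words) > 0]
freq  <- sort(table(words), decreasing = TRUE)
head(freq, 20)   # as Shyam notes, words like "i" and "had" will dominate unless you drop such stop-words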
 
I'd very much like to hear if you've used something new or creative. Shall put it up on the blog here. Pls use the comments space below for Q&A.

Wednesday, November 2, 2011

Project Phase II is here

Hi All,

Happy to inform that we've somehow managed to put together a reasonably concise questionnaire (49-odd questions in the longest string) based loosely on the project scope document. Sure, some aspects had to be left out as they would have made the survey unwieldy, but that's a call one has to make at different points in time in MKTR.

Here is the phase II final survey link:


Shall ask the AAs to put up the corresponding Word doc on LMS.

Should you find any bloopers in the questionnaire, pls let me know ASAP.

Some points to note:
1. Pls email your contacts the above link to take the survey. Phase II ends 10-Nov Thursday noon. Responses coming in after this will not be counted for analysis.

2. Will release the consolidated database of responses for analysis to the class by Friday 11-Nov noon. Phase III will begin in earnest then.

3. Pls call up, SMS, and gently remind your contacts off-campus to fill up the survey. In particular, the last section has a question asking respondents to select from a drop-down the name of the group that sent them the survey. Hence, pls ensure the respondents know your group's name! We'll simply tally the group responses, and the groups bringing in the most valid, completed responses score higher in Phase II (the tally itself is sketched after this list). This weekend should be a busy time and am hoping we get a good many responses then.

4. Pls ensure your group shows up for the guest lecture tomorrow. Attendance there will be worth between 2 and 3% of the course grade, just FYI. Even now I see the sign-up sheet is half empty.
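As mentioned in point 3, the tally itself will be trivial once the responses are exported; a sketch in R, with illustrative file and column names only:

# Tally valid responses by the group named in the drop-down
resp <- read.csv("phase2_responses.csv", stringsAsFactors = FALSE)
sort(table(resp$GroupName), decreasing = TRUE)   # 'GroupName' stands in for whatever the export calls that column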

Regards,
Sudhir