Monday, October 29, 2012

R tutorial - Afterthoughts

Hi all,

First off, thanks for attending the tutorial in good numbers and making it interactive and interesting. I now have a much better sense of the practical issues students face in taking to R. Accordingly, I have revised, re-formatted and updated the Rcode blog-post for session 4 and will shortly do the same for session 5.

Some thoughts:
1. Data input remains the biggest headache.
The good news of course is that this is eminently solvable. I have edited the session 4 blog-post to reflect this. The data entry part for the joint space maps code is broken into steps 2a, 2b and 2c with clear instructions on the sequence to be followed. Practice data input a few times and after that it becomes a peaceful, mechanical process (as it should be).
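For practice, here is a minimal sketch of a read-and-check routine. It is not the Session 4 code: the small pasted table and its column names are made up for illustration, and only the object name 'mydata' follows the convention used in the sessions.

```r
# Hypothetical example data: a small tab-separated table, as it would
# look when copied from a spreadsheet. The columns are invented.
pasted <- "brand\tprice\tquality\nA\t3\t4\nB\t5\t2\nC\t4\t5"

# header = TRUE treats the first row as column names;
# sep = "\t" matches what spreadsheet copy-paste usually produces.
mydata <- read.table(text = pasted, header = TRUE, sep = "\t")

# To read from a saved file instead, pick it interactively:
# mydata <- read.table(file.choose(), header = TRUE, sep = "\t")

# Always inspect what was read before running any analysis.
print(dim(mydata))   # number of rows and columns
str(mydata)          # column names and types
print(head(mydata))  # first few rows
```

If the dimensions or column types look wrong at this point, fix the input before moving on; most downstream errors trace back to this step.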

2. The graphs I get should be put up on the blog so that people can compare their output with mine.
This is now done. Pls see the plots uploaded as images under each of the steps mentioned in the Session 4 Rcode.

Click on the images for a larger size. Let me know if you are unable to see the images for any reason.

3. Google docs spreadsheet is sometimes causing issues with simple copy-pasting.
However, the tables are read in fully despite the error message shown. So pls ignore that particular error message and proceed.

4. File-name confusion erupts, occasionally.
I named most of the datasets we read in 'mydata' for consistency's sake. But you can choose any name you like, as long as it is not a reserved word. No problemo.

5. There were package installation issues.
I suspect it could be the CISCO problem with the FTP protocol. However, I have been able to download packages peacefully since yesterday. Pls let me know if you are still unable to download and install packages. I suggest choosing a US-based R server because of their higher bandwidth.

In Session 4, for MDS, we need package 'MASS'; for eigenvalue scree plots, we need package 'nFactors'; and in session 5, we will need packages 'mclust' and 'cluster' for a variety of cluster analysis algorithms. Pls download these and keep them ready as and when you are able to from your homes.
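A quick way to see which of these packages are still missing on your machine is sketched below. The mirror URL and the 'do_install' switch are my assumptions for illustration; any CRAN mirror that works from your network will do.

```r
# Packages named in this post for Sessions 4 and 5.
needed <- c("MASS", "nFactors", "mclust", "cluster")

# Which of them are not yet in the local library?
missing_pkgs <- setdiff(needed, rownames(installed.packages()))
print(missing_pkgs)

# Install only the missing ones. Flip the switch to TRUE when you have
# a working connection. The repos URL is an assumption, not a
# recommendation from the course.
do_install <- FALSE
if (do_install && length(missing_pkgs) > 0) {
  install.packages(missing_pkgs, repos = "https://cloud.r-project.org")
}
```

Passing 'repos' explicitly skips the mirror-selection dialog, which can help when the dialog itself fails to load.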

Added later:
6. The point of it all

People have asked how R is helping fulfill the course goals. Let me recap and clarify.

Folks, the course is designed for managers to understand the scale and scope of the MKTR challenges facing them in the near future. It isn't designed to teach how to code in R.

Now, I think the best way for folks to understand MKTR challenges is:
- to be exposed to the fundamental concepts in the area (through pre-reads, reinforced by classroom discussion and reflection)
- to be exposed to the tools of the trade (through exercises, assignments, the hands-on project and discussions)
- to be exposed to the emerging changes and trends driven by economics, technology and firm strategy (through in-class readings, project and discussions).

R contributes to the course goals by stressing the last 2 aspects. Of course, you are free to use any software platform you like.

However, I recommend R for the following reasons:

(i) You get your hands dirty with data. Unlike some packages (SPSS, SAS etc) R is not about doing analyses from afar but shaping the flow and transformation of data line by line, para by para. There's no substitute for actually grappling with data to understand what is going on.
(ii) You get acquainted with a very powerful, very flexible and totally cutting edge platform for *all* (and not merely your MKTR) computing needs. Sans licensing worries too. This becomes important because, increasingly, problems are becoming multi-disciplinary and as managers you have to think outside the MKTR box to solve them.
(iii) You learn the *correct* way (the scientific, data-driven, objective way) to make decisions in the data analysis stage. For instance, Qs like "How many factors/clusters should we have?", "Why is one segmentation model better than another?" etc. often call for judgment calls and heuristics, which can be risky in new and untested situations. Better to have R do the heavy lifting of model fitting and complexity-penalization and then guide you to the correct decision.
(iv) R offers economies of scale - even large datasets [with 1000s of observations] are as easily and speedily crunched as the small ones. Having larger RAM and faster processors always helps, of course. [However, if your datasets are millions of observations long, then go to SAS.]
(v) R offers economies of scope: not just plain vanilla S-T-P analysis or regressions but very cool, emerging stuff, from text analytics to sentiment analysis to pattern recognition, is all there. Somebody somewhere in the world faced that problem, solved it and put up the general solution procedure neatly packaged as an R package. And R continues to grow with each passing month.
(vi) Sure, data input/output is not as easy-breezy as in some other platforms but that is the only hurdle to cross (and not a particularly high one at that).
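To make the "How many factors?" question in point (iii) concrete, here is a minimal base-R sketch of the eigenvalue scree idea. It uses the simple eigenvalue-greater-than-one rule, not the criteria offered by 'nFactors' or 'mclust', and the built-in 'mtcars' dataset stands in for real survey data purely as an assumption for illustration.

```r
# mtcars ships with base R; here it stands in for a ratings table.
x <- mtcars

# Eigenvalues of the correlation matrix: each one is the variance
# captured by one principal component of the standardized data.
ev <- eigen(cor(x))$values

# Eigenvalue-greater-than-one rule: keep components that explain
# more variance than a single standardized variable would.
n_factors <- sum(ev > 1)

# Scree plot: look for the "elbow" where the curve flattens out.
plot(ev, type = "b", xlab = "Component", ylab = "Eigenvalue",
     main = "Scree plot")
abline(h = 1, lty = 2)  # reference line for the rule above

print(round(ev, 2))
print(n_factors)
```

The rule and the elbow will not always agree, which is exactly the kind of judgment call that more formal, penalized criteria are meant to reduce.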

I hope I have persuaded you to stay the course with R and not give up because of point (vi).

That is all I can think of for now. Pls use the comments section below for your queries, suggestions and feedback. Sudhir

4 comments:

  1. Dear professor,

    I am getting the following error when I try to install the MASS or nFactors packages. Can you email us the packages please?

    --- Please select a CRAN mirror for use in this session ---
    Warning: unable to access index for repository http://ftp.iitm.ac.in/cran/bin/windows/contrib/2.15
    Warning: unable to access index for repository http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.15
    Error in install.packages(NULL, .libPaths()[1L], dependencies = NA, type = type) :
    no packages were specified
    In addition: Warning message:
    In open.connection(con, "r") :
    unable to connect to 'cran.r-project.org' on port 80.

  2. OK. Have sent email to the AAs. The package binaries for MASS, cluster and nFactors will be put up on LMS.

    Sudhir

  3. Professor, in your post above, you say "[However, if your datasets are millions of observations long, then go to SAS.]". Is it not true that R can also handle millions of records of data? I have heard R is extremely powerful that way too. SAS is an incredibly expensive piece of software whose acquisition cost cannot be justified at all, especially for start-ups.

  4. Hi Shouri,

    Yes. R is hugely powerful and all; however, there are limitations in the way it reads and processes data (it holds the entire dataset in RAM). SAS processes data differently, row by row rather than the entire dataset at a time. So for very large datasets, I find SAS much more efficient. Sure, R can do the job, but for some of my retail datasets, which are millions of rows long, R runs out of RAM space even before the data are fully loaded.

    R, by the way, is amenable to parallel processing. So you could hook up multiple workstations that have R installed and run analyses on all of them simultaneously for large dimensional problems that are amenable to parallel processing. (Think of bootstrapping and jackknifing examples, or Bayesian analysis.)

    Sudhir

