Friday, November 15, 2013

Project-related Mailbag, Q&A

Hi all,

This post is a general-purpose one for project-related Q&A (the earlier one covered end-term Q&A).

Priyo wrote to me with these Qs:

Hi Prof,

Requesting some clarity regarding further scope of work for the project submission.

Data Collection Guidelines—sample size, demographic, etc.
Analysis and flow of slides.
Scope of conclusion.
Read the grading criteria on your blog post but couldn't glean much regarding the above.

Thanks,
Priyo

My response:

Hi Priyo,

>> Data Collection Guidelines—sample size, demographic, etc.

The data collection guidelines are, well, flexible. Am not expecting rigorous adherence to sample size requirements for instance. I'd say, with 4 people in a group, each person collecting some 10-15 survey responses is quite enough.

Regarding demographics, ideally go for offerings targeted at the most abundant demographic at ISB - upper-middle-class urban youth in their mid-to-late 20s.

>> Analysis and flow of slides.

These should be, in one word, 'common-sensical'. There's no right or wrong way, just context-dependent implementation, I guess.

>> Scope of conclusion.

Depends on the scope of the DP and ROs. If the ROs are confirmatory, then yes, a yes/no kind of clear decision recommendation would be nice.

If exploratory, mere pointers or indications for further investigation are typically deemed sufficient.

Hope that helps.

Sudhir

P.S.
Will put this up on the blog for wider dissemination.

P.S. Watch this space for more such updates; the most recent updates appear at the top of the post.

********************************

Update: Incorporating social network analysis via R for MKTR insights:

I got an interesting Q from a student working with the twitteR package, on whether and how text analytics relates to social network analysis in general. So I started digging around and discovered that R has a full suite of social media mapping and network analysis tools.

Yes, text analytics is merely the tip of the iceberg. R can go far deeper and far higher (at the same time) than text analytics alone. Now let's talk in terms of *networks* - vertices (or nodes) and edges signifying relations between the nodes...

Am going to demo social network analysis 101 on R using your course feedback 'overall feedback.txt' and the names of the associated students in 'names.txt' (see LMS). Social network analysis would be a full lecture (or, perhaps, even a full course) by itself, but the major take-aways can be skimmed through rather quickly, I reckon.

Try these classwork examples at home - maybe you'll want to do something similar for your project?

# load required packages first

library(tm) # for Corpus(), TermDocumentMatrix(), etc.
library(RWeka) # for NGramTokenizer()

# read-in data next

names = read.table(file.choose()) # 'names.txt'

x = readLines(file.choose()) # 'overall feedback.txt'

x1 = Corpus(VectorSource(x)) # build corpus

ngram <- function(x1) NGramTokenizer(x1, Weka_control(min = 1, max = 2))

tdm0 <- TermDocumentMatrix(x1, control = list(tokenize = ngram,
                           tolower = TRUE,
                           removePunctuation = TRUE,
                           removeNumbers = TRUE,
                           stopwords = TRUE,
                           stemming = TRUE)) # patience. Takes a minute.
# remove columns (documents) with zero sums

dim(tdm0)

a1 = apply(tdm0, 2, sum) # column totals; apply() densifies the sparse tdm column-wise
a0 = which(a1 == 0) # indices of empty documents, if any

if (length(a0) > 0) { tdm1 = tdm0[, -a0] } else { tdm1 = tdm0 }; dim(tdm1)

inspect(tdm1[1:5, 1:10])# to view elements in tdm1, use inspect()

# convert tdm to dtm (docs as rows, terms as cols)
# and change dtm weighting from Tf to TfIdf (term freq x inverse doc freq)
dtm0 = t(tdm1) # docs are rows and terms are cols
dtm = tfidf(dtm0) # new dtm with TfIdf weighting; tfidf() here is the weighting helper from the classwork code
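In case that tfidf() helper isn't loaded in your session, tm ships a built-in TfIdf weighting function that should serve just as well. A minimal sketch, assuming dtm0 as built above (the name dtm.alt is just illustrative):

dtm.alt = weightTfIdf(dtm0, normalize = TRUE) # TfIdf weighting via tm; down-weights terms common to most docs

dtm.alt can then be used in place of dtm in the steps that follow.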

# rearrange terms in descending order of TfIDF and view

a1 = apply(dtm, 2, sum); a2 = sort(a1, decreasing = TRUE, index.return = TRUE);

dtm01 = dtm0[, a2$ix]; dtm1 = dtm[, a2$ix];

The above analysis is standard - pretty much what we did in the classwork in session 9. What follows comes with a twist. You will need to install the package 'igraph' for this.

What we do next is find 'relations' or connections between terms. For our context, I define a 'connection' between two terms as their intra-document co-occurrence, i.e. how often the two terms occurred together in a document, across all docs in the corpus. Somewhat like a cluster dendrogram, I guess, but way cooler.
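To see why the matrix product used below captures co-occurrence, here is a tiny toy sketch with a made-up 3-document, 4-term count matrix (purely illustrative - not the course data):

# toy example: entry (i, j) of t(X) %*% X counts the docs in which terms i and j co-occur
X = matrix(c(1, 1, 0, 0,
             1, 0, 1, 0,
             0, 1, 1, 1),
           nrow = 3, byrow = TRUE,
           dimnames = list(paste0("doc", 1:3), c("price", "brand", "taste", "pack")))

t(X) %*% X # diagonal = no. of docs containing each term; off-diagonals = co-occurrence counts

With the TfIdf-weighted dtm we use below, the same product gives weighted co-occurrence strengths rather than raw counts, but the reading is the same: bigger entries mean more strongly related terms.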

install.packages("igraph") # install once per comp

### --- making social network of top-40 terms --- ###

dtm1.new = as.matrix(dtm1[, 1:40]) # top-40 TfIdf-weighted terms, densified for the matrix algebra below

term.adjacency.mat = t(dtm1.new) %*% dtm1.new; dim(term.adjacency.mat)

## -- now invoke igraph and build a social network --

library(igraph)

g <- graph.adjacency(term.adjacency.mat, weighted = T, mode = "undirected")

g <- simplify(g) # remove loops

V(g)$label <- V(g)$name # set labels and degrees of vertices

V(g)$degree <- degree(g)

# -- now the plot itself

set.seed(1234) # set seed to make the layout reproducible

layout1 <- layout.fruchterman.reingold(g)

plot(g, layout=layout1)
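If the default rendering looks cluttered, one optional touch-up is to scale vertex and label sizes by degree and edge widths by co-occurrence weight before re-plotting. A small, hedged sketch, assuming the g and layout1 objects from above:

# optional: emphasize better-connected terms and stronger co-occurrences
V(g)$size = 2 + 0.5 * V(g)$degree # node size proportional to degree
V(g)$label.cex = 0.7 + V(g)$degree / max(V(g)$degree) # bigger labels for well-connected terms
E(g)$width = 3 * E(g)$weight / max(E(g)$weight) # edge width proportional to co-occurrence strength

plot(g, layout = layout1)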

Either way, you should see a plot depicting the connections between terms. Of course, one may say that social networks are built among *people*, not among terms.

OK. Sure.

So can we build one among people using a similar procedure? You bet. This time, we'd be connecting people using the common terms they used in the corpus. The code to do that is below:

### --- make similar network for the individuals --- ###

dtm2.new = as.matrix(dtm1); dim(dtm2.new) # this time keep all terms, densified

term.adjacency.mat2 = dtm2.new %*% t(dtm2.new); dim(term.adjacency.mat2)

rownames(term.adjacency.mat2) = as.matrix(names)
colnames(term.adjacency.mat2) = as.matrix(names)

g1 <- graph.adjacency(term.adjacency.mat2,
weighted = T, mode = "undirected");

g1 <- simplify(g1) # remove loops

V(g1)$label <- V(g1)$name # set labels and degrees of vertices
V(g1)$degree <- degree(g1)

# -- now the plot itself --

set.seed(1234) # set seed to make the layout reproducible

layout2 <- layout.fruchterman.reingold(g1)

plot(g1, layout=layout2)

And the result will look something like this:

Recall session 9's segmentation exercise on ice-cream flavor comments? We were trying to cluster respondents together based on similarity or affinity in the terms they used. k-means scree plots were a poor way of judging the number of clusters; the above graph provides much better insight on that front. Seems like there are two big clusters and some 2-3 smaller ones in the periphery (igraph's community-detection routines can formalize that eyeballed count - see the sketch below).
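A hedged sketch of that community-detection idea, assuming the g1 object built above (walktrap is just one of several algorithms igraph offers):

# detect communities (clusters) of respondents in the term-sharing network
comm = walktrap.community(g1) # random-walk based community detection
sizes(comm) # how many respondents fall in each community
membership(comm) # which community each respondent belongs to

plot(comm, g1, layout = layout2) # overlay the detected communities on the earlier plot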

Sure, important and interesting Qs arise from this kind of analysis... For instance, "Who is the most representative commentator, i.e. the one whose words best represent the class's?", "Who is best connected with majority opinion?" and so on. Marketers routinely ask similar Qs to try to detect "influencers" (along various metrics of node centrality, etc.). But again, that is a whole other session.
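For the curious, here is a quick, hedged sketch of how such centrality Qs could be posed to the g1 network above (these are generic igraph centrality measures, not a definitive 'influencer' metric):

# a few standard node-centrality measures on the respondent network
cent = data.frame(name = V(g1)$name,
                  degree = degree(g1), # no. of direct connections
                  betweenness = betweenness(g1), # how often a node lies on shortest paths between others
                  closeness = closeness(g1)) # how near a node is, on average, to all others

head(cent[order(cent$degree, decreasing = TRUE), ]) # most connected respondents first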

Sudhir

2 comments:

  1. Hi Prof,

    Is it possible to analyze partworths for only a subset of the large number of auto-generated ME-XL bundles? Deleting such bundles is causing ME-XL to error out presently. Else, if we set respondent data for bundles that are not of interest to us to zero, will it skew the analysis, or is this a viable alternative?

    Thanks,
    Priyo

  2. Hi Priyo,

    OK. Didn't know it would cause errors. Not a great fan of ME-XL anyway. However, setting those ratings to zero will bias the results. I'd suggest running individual regressions for each respondent (Y = preference rating of the product profile, X = product attributes) and using the coefficients as part-worths (it's not exactly the same, but will do as a proxy). Hope that helps.
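    A minimal sketch of that per-respondent regression idea, assuming a hypothetical data frame conjoint.df with columns resp.id, rating and product-attribute dummies attr1, attr2, attr3 (all names here are illustrative):

    # run one regression per respondent; the coefficients serve as part-worth proxies
    respondents = unique(conjoint.df$resp.id)
    partworths = t(sapply(respondents, function(i) {
      sub = conjoint.df[conjoint.df$resp.id == i, ] # this respondent's profile ratings
      coef(lm(rating ~ attr1 + attr2 + attr3, data = sub)) # one set of part-worths per respondent
    }))
    rownames(partworths) = respondents
    head(partworths)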

    Sudhir

