Sorry folks, I typed in the wrong URL by mistake. Those looking for the sessions 8 and 9 pre-reads, pls go here.
The following are pre-reads for session 7:
1. AI Meets the C-Suite (McKinsey Quarterly), and
2. Track Customer Attitudes to Predict Their Behaviors (HBR).
The course-pack reading for session 7 is optional; these two are mandatory.
Sudhir
-----------------------------
Hi all,
You might want to look up the list of shiny apps in the session 5 updates blog post here, before we start.
The homework has three parts; only two of the three need to be done and submitted.
Homework part 1: (group submission, mandatory)
Choose any non-obscure product or service on Flipkart or Amazon (or any other review aggregation source).
Your research objectives (R.O.s) are:
(1) Find the top few things people like about the product.
(2) Find the top few things people dislike about the product.
(3) Suggest a (re-)positioning strategy for the product based on the above.
Pull 100+ reviews of the product.
Note: A Flipkart shinyapp is already available; just follow the instructions on its first page.
We're working on an Amazon shinyapp as well; watch this space for updates.
Update: It turns out Amazon pages are now dynamic (they were static pages till last year), so no Amazon shinyapp is possible.
Text analyze the corpus for insights.
Not everything we can do is up on shiny. It would help massively if at least one member per group runs the classwork R code successfully on their machine.
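If you end up pulling reviews by hand (e.g., saving the product page from your browser), here is a minimal base-R sketch of regex-based extraction. The class name "review-text" and the sample snippets are made-up placeholders; real Flipkart or Amazon markup will differ, and regex HTML parsing is fragile, so treat this as a quick-and-dirty fallback only, not the shinyapp's method:

```r
# Toy saved page; in practice, read your saved HTML with readLines() and
# paste() it into one string. The "review-text" class is a placeholder.
page <- '<div class="review-text">Great battery life</div>
<div class="review-text">Camera is mediocre</div>'

# Grab each review div, then strip the tags. (.*?) is a lazy match,
# so each div is captured separately; perl=TRUE enables that syntax.
m <- gregexpr('<div class="review-text">(.*?)</div>', page, perl = TRUE)
hits <- regmatches(page, m)[[1]]
reviews <- gsub('<div class="review-text">|</div>', "", hits)
reviews   # a character vector, one review per element
```

Once you have 100+ such strings, write them to a text file and feed that to the text-analysis code as usual.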
Homework part 2: (Individual submission - option 1)
Use tm.plugin.webmining to pull data from any of the following news aggregators. Pick any product/ firm/ brand/ celebrity that has been in the news lately.
Pull the last 100+ news articles wherein this entity was mentioned in the article title.
Recall the classroom example wherein we did this for Zara:
install.packages("tm")                   # if using for the first time
install.packages("tm.plugin.webmining")  # ditto
library(tm)
library(tm.plugin.webmining)
# Note: run the below in base R, not RStudio
zara <- WebCorpus(GoogleNewsSource("Zara"))
x1 = zara                          # save the corpus in a local object
x1 = unlist(lapply(x1, content))   # strip relevant content from x1
x1 = gsub("\n", "", x1)            # remove newline chars
x1[1:5]                            # view content
write.table(x1, file.choose(), row.names=F, col.names=F)  # save file as 'zara_news.txt'
Alternately, try running this shiny app for Google News pulls. It's not very stable, but it will do for now.
Text-analyze the corpus for sentiment.
Note: Do you see how the corpus thus obtained can potentially help you mine, measure and score some notion of "PR buzz" for the entity?
Your task: ID the two most positive and two most negative articles.
In a PPT slide or two, write what you found about the reasons for positive and negative sentiment.
Update: Pls insert the following lines of code after you run the older sentiment analysis code.
This is to obtain the most positive and most negative documents.
head(pol$all[order(pol$all[,3], decreasing=T),])   # top positive-polarity documents
head(pol$all[order(pol$all[,3], decreasing=F),])   # top negative-polarity documents
Homework part 2: (Individual submission - option 2)
Alternately, instead of option 1 above, you could do the following.
Take any long (as in 10+ pages) soft-copy article that you know and have read.
Use the textsplit shiny app to split it into uniform-length parts (of, say, 25-50 words each).
Now, text-analyze the split document for topics using the shinyapp for topic mining.
In a PPT, paste the wordclouds for each topic and write your interpretation of what that topic means (a few descriptive words is all).
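For intuition, the splitting step can be sketched in a few lines of base R. This only illustrates the idea (fixed-size word chunks); it is not the textsplit app's actual code:

```r
# Split a long text into chunks of roughly n words each
split_text <- function(txt, n = 40) {
  words  <- unlist(strsplit(txt, "\\s+"))        # tokenize on whitespace
  groups <- ceiling(seq_along(words) / n)        # chunk index per word
  tapply(words, groups, paste, collapse = " ")   # re-join words chunk-wise
}

article <- paste(rep("lorem ipsum dolor sit amet", 20), collapse = " ")  # 100 words
chunks  <- split_text(article, n = 40)
length(chunks)   # 100 words in chunks of 40 -> 3 chunks
```

Each chunk then becomes one "document" for the topic-mining shinyapp.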
Deliverables and Deadlines:
The deadline for this session's HWs is a week from now: next Friday (26-Sept), midnight.
Separate drop boxes will be up for session 5 HW part 1 and HW part 2.
For both homework parts, pls submit a zipped folder containing (a) the text dataset you used, and (b) the PPT you made.
Pls remember to write your (group) name and PGID on the title slide, and name the PPT name_HWnumber.pptx.
Added later: The PPT should be <10 slides in length. Feel free to add more slides in an annexure, if required.
The HWs are all HCC level 0. Feel free to take any help from anybody as required.
Any queries etc, contact me.
Ciao.
Sudhir
Dear Sir,
Could you kindly explain how document frequency is computed? As per my understanding, it is the number of times a word occurs per 100 words in the document. Kindly correct me if I am wrong.
Regards,
Nithya
Hi Nithya,
Are you referring to the TFIDF weighting scheme? Well, in the classroom example, my corpus had 100 docs, hence I divided the term frequency TF by 100. Otherwise, we divide by the no. of docs in the corpus.
In any case, there exist many schemes to compute TFIDFs, and we can always come up with our own, besides.
So, for now, don't worry about it and use R's internal TFIDF scheme. Hope that helps.
Sudhir
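To see the mechanics, here is a toy TF-IDF computation in base R on a made-up three-document corpus. This uses the common TF × log(N/DF) weighting, which is one of many schemes and not necessarily the exact one tm uses internally:

```r
# Toy corpus: three documents, already tokenized
docs  <- list(c("price", "quality", "price"),
              c("quality", "service"),
              c("price", "delivery"))
terms <- sort(unique(unlist(docs)))

# Term-frequency matrix: rows = docs, cols = terms
tf <- t(sapply(docs, function(d) table(factor(d, levels = terms))))

# Document frequency: in how many docs does each term appear?
df <- colSums(tf > 0)

# TF-IDF = TF * log(N / DF), where N = number of docs
tfidf <- sweep(tf, 2, log(length(docs) / df), "*")
round(tfidf, 3)
```

Note that a term appearing in every document gets IDF = log(N/N) = 0, i.e. it is weighted down to nothing, which is the whole point of the scheme.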
Hello Prof,
While executing the command
zara <- WebCorpus(GoogleNewsSource("Zara"))
I get the following error:
Error in function (type, msg, asError = TRUE) : couldn't connect to host
How can I fix this?
Hi Anand,
Use base R and not RStudio. Also, I have updated the code in the post above for newline characters. Check now and see.
Sudhir
Hi Sir,
I am facing this error while running base R.
Try this:
https://wordcloud.shinyapps.io/googlenews/
Hello Prof. In homework part 2 option 1, you wrote "Your task: ID the two most positive and two most negative articles." How do we ID articles? Did you mean topics?
Hi Sharath,
No, it would be articles. Imagine the pulled articles are documents and you have terms as columns. Upon sentiment analysis (like we did for the Iron Man reviews), you get polarities for each document. Hope that helps.
Sudhir
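As a toy base-R illustration of the idea (the document names and polarity scores here are entirely made up), ranking documents by a polarity column works like this:

```r
# Hypothetical per-document polarity scores, as produced by sentiment analysis
pol_scores <- data.frame(doc      = c("article1", "article2", "article3", "article4"),
                         polarity = c(0.42, -0.31, 0.87, -0.65))

# Two most positive documents
head(pol_scores[order(pol_scores$polarity, decreasing = TRUE), ], 2)

# Two most negative documents
head(pol_scores[order(pol_scores$polarity, decreasing = FALSE), ], 2)
```

In the homework, those top-ranked rows point you at the articles to read and explain in your PPT.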
Hi Professor,
I tried running the R code as well as the shiny app for the Google News pull. Both seem to be timing out while trying to establish the connection.
Please help!
Error as seen in R:
Error in function (type, msg, asError = TRUE) : connect() timed out!
Thanks
Sonam
Hi Sonam,
It would be good to attend the R tutorial today and pose the Q there. Aashish Pandey, who built the shiny app, will be conducting the tutorial.
Sudhir
Sir, in the shiny app for text analysis, I'm getting the following error: NA indices not allowed
Request your help.
Hi Aditi,
Pls reach out to Aashish Pandey for shiny queries. Did you attend today's R tutorial? In any case, I shall postpone the session 5 HW deadline by 24 hrs; too many folks are facing issues with the text analytics pieces.
Sudhir
Hi Professor,
While topic mining, should we go with the number of topics suggested by the log Bayes factor, or can we input the number of topics ourselves? I see that the Narendra Modi I-Day speech had 6 topics.
Regards,
Rohit
You can override the machine. Topic interpretability should be our main concern; model fit etc. can come later, I guess.
Sudhir
Thanks Prof.
Hi Professor,
The shiny app for basic text analysis appears to be down. I've been trying to access it for quite some time. Please help!
Thanks
Sonam