Session 8's two main topics - (i) hypothesis formulation and testing, and (ii) basic (regression) modeling - have been covered.
A recap of big picture session take-aways:
- Assumptions, beliefs and conjectures about events of interest - the stuff of ideas, basically - underlie hypotheses.
- It's critical that the null and alternative hypotheses be defined carefully so that they present a mutually exclusive, exhaustive and logical range of events.
- We make it hard to reject the null (i.e. status quo) hypothesis in order to minimize false positives (also called 'Type I errors').
- Tests of differences (t-tests, basically) and tests of association (chi-square tests, primarily) together account for a large proportion of MKTR hypothesis testing in practice.
- Modeling underlies our every attempt to structure, order and interpret data.
- Regression modeling is among the most common approaches in practice due to its strengths, viz., prediction, description, control, existence and magnitude information, etc.
- The principal regression variants - quadratic and higher order polynomial terms, log-log forms, interaction effects etc. - allow much flexibility in modeling dependence relations.
- Last but not least, the ability to rapidly formulate and test conjectures, as well as interpret regression modeling results, is a core (i.e. non-outsourceable) managerial skill.
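To make the regression variants in the recap a bit more concrete, here is a minimal sketch in R. The data are simulated and the variable names (price, promo, sales) are my own assumptions for illustration, not from any class dataset.

```r
# Simulated data; all names and numbers here are illustrative assumptions.
set.seed(42)
n     <- 200
price <- runif(n, 1, 10)
promo <- rbinom(n, 1, 0.5)
sales <- exp(5 - 0.8 * log(price) + 0.3 * promo + rnorm(n, sd = 0.2))

# Quadratic term: I() stops '^' from being read as formula syntax.
fit_quad <- lm(sales ~ price + I(price^2))

# Log-log form: the slope on log(price) reads as a price elasticity.
fit_loglog <- lm(log(sales) ~ log(price) + promo)

# Interaction: price * promo expands to price + promo + price:promo.
fit_inter <- lm(sales ~ price * promo)

round(coef(fit_loglog), 2)
```

Since the data were generated from a log-log relation, the log-log fit should recover a slope near the -0.8 used in the simulation; the other two forms are shown only for their syntax.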
OK, having beaten the drum on why Session 8's topics are relevant, interesting and important, let me proceed to explaining the Classwork examples we saw in Hypothesis testing.
1. Hypothesis Testing, Select classwork Examples:
Copy-paste the code below. Pls don't do it all at once, but a few lines at a time, after reading the descriptive comments (following the '#' symbol).
rm(list = ls()) # clear workspace
mydata = read.table(file.choose(), header=TRUE)
attach(mydata) # allows us to call columns in mydata by name
head(mydata) # view top few rows, just in case
summary(mydata) # see summary of all cols

# t-test for differences
# Q: Does preference for INVA equal 4?
t.test(INVA, mu = 4) # t.test() is the core func
# Q: Does preference for INVA exceed 4?
t.test(INVA, mu = 4, alternative = "greater")
# Q: is pref for MGTO > GSB?
t.test(MGTO, GSB, alternative = "greater")
# Q: Did women prefer SAIT more than men did?
xf = SAIT[(male == 0)] # SAIT ratings from women
xm = SAIT[(male == 1)] # SAIT ratings from men
t.test(xf, xm, alternative = "greater")
# Q: Did men and women prefer SAIT equally?
t.test(xf, xm)
Upon running the above code, each t.test() statement will output results. I will examine only a few here.
The hypothesis being tested is shown in red font above as a commented question. The above example is a one-sample one-tailed test. R tells us what the alternative hypothesis is (in blue font), shows the mean of the quantity being tested and yields the p-value.
This is a two-sample, one-tailed test. The result is again non-significant at the 95% confidence level.
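If you'd rather pull the numbers out of a test than read them off the printout, t.test() returns an object whose pieces can be accessed by name. A minimal sketch on made-up ratings (the 'ratings' vector is an assumption, not the class data):

```r
set.seed(1)
ratings <- round(rnorm(60, mean = 4.3, sd = 1.2), 1)  # made-up preference scores

# One-sample, one-tailed: H0: mean <= 4 versus H1: mean > 4
res <- t.test(ratings, mu = 4, alternative = "greater")

res$estimate     # sample mean
res$statistic    # t-statistic
res$parameter    # degrees of freedom (n - 1 here)
res$p.value      # p-value for the stated alternative
res$alternative  # the alternative hypothesis, "greater"

# The one-tailed p-value is just the upper-tail area beyond t.
p_by_hand <- pt(unname(res$statistic), df = 59, lower.tail = FALSE)

# Decision at the 5% significance level
decision <- if (res$p.value < 0.05) "reject H0" else "fail to reject H0"
decision
```

The same fields are available for the two-sample tests; only the degrees of freedom are computed differently there.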
The above were tests of differences, which deal with metric data. Next, look at a test of association, which deals with relations between non-metric (i.e., nominal, ordinal or categorical) variables. [BTW, the two types of tests are connected. It is certainly possible to recast a test of difference as a test of association by changing the hypothesis and the form of the variables involved.]
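Here is a rough sketch of that recasting on simulated data (the 'male' and 'score' variables are assumptions for illustration). The same question is asked once as a difference in means and once as an association between two yes/no variables.

```r
set.seed(7)
male  <- rbinom(120, 1, 0.6)
score <- rnorm(120, mean = 4 + 0.5 * male, sd = 1)   # made-up metric ratings

# As a test of difference: metric outcome, two groups.
diff_test <- t.test(score[male == 1], score[male == 0])

# As a test of association: turn score into a yes/no, then cross-tabulate.
hi_score   <- score > mean(score)
xtab       <- table(male, hi_score)
assoc_test <- chisq.test(xtab)

c(t_test = diff_test$p.value, chisq = assoc_test$p.value)
```

The two p-values will generally differ, because turning a metric variable into a yes/no throws away information.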
# Q: Are higher than avg workex people mainly male? engineers?
# make a yes/no out of 'workex'
hi.workex = (workex > mean(workex)); summary(hi.workex)
mytable = table(engineer, hi.workex) # build crosstab of counts
mytable; chisq.test(mytable) # chisq.test() is the func
mytable = table(male, hi.workex) # build crosstab of counts
mytable; chisq.test(mytable)
Only one illustrative result is shown here, below:
A few Qs had come up regarding the t and chi-square distributions. Well, these distributions depend on degrees of freedom, so they implicitly account for the adjustments required as sample sizes (or table sizes) change. Still, the 'higher the sample size, the better the inference' mantra generally holds.
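One way to see the degrees-of-freedom point is to look at how the critical values move. A small sketch (the particular df values are arbitrary choices of mine):

```r
df <- c(5, 10, 30, 100, 1000)

# Two-tailed 5% critical values of the t distribution
t_crit <- qt(0.975, df)
data.frame(df = df, t_crit = round(t_crit, 3))

# Large-sample limit: the standard normal critical value, about 1.96
z_crit <- qnorm(0.975)
z_crit

# 5% critical values of the chi-square distribution grow with df
chi_crit <- qchisq(0.95, df = c(1, 2, 4))
chi_crit
```

With few degrees of freedom the t cutoff is well above 1.96, so small samples need a larger t-statistic to reject the null; as df grows, the cutoff settles toward the normal value.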
2. Regression Modeling, Beer classwork example:
Will discuss only the simple regression model results for beer dataset ('beer dataset.txt') here. Copy code below line by line and paste onto R console.
mydata = read.table(file.choose(),header=TRUE)
dim(mydata); summary(mydata) # view summary of variables
head(mydata) # view a few data rows
attach(mydata) # enables calling variables by name
summary(lm(volsold ~ # lm() is linear model func, volsold is Y
  price + distbn + promo + adspend1 # Mktg mix variables
  + factor(brand2) + bottle + light + amber + golden + lite + reg + sku.size # prod attributes
  + factor(month) )) # control variables, lm() closes
Upon running the code, we get the following results:
The table's columns are: 'Estimate', the coefficient; 'Std. Error', the standard error of that estimate; 't value', the t-statistic computed as Estimate / Std. Error; and 'Pr(>|t|)', the two-tailed p-value.
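The relation between the columns can be checked by hand. A minimal sketch on simulated data (x and y are made up for illustration; this is not the beer dataset):

```r
set.seed(3)
x   <- rnorm(50)
y   <- 2 + 1.5 * x + rnorm(50)
fit <- lm(y ~ x)

tab <- coef(summary(fit))   # matrix holding the four columns of the table
tab

# t value = Estimate / Std. Error
t_manual <- tab[, "Estimate"] / tab[, "Std. Error"]

# p-value = two-tailed area beyond |t|, using the residual degrees of freedom
p_manual <- 2 * pt(abs(t_manual), df = df.residual(fit), lower.tail = FALSE)

all.equal(t_manual, tab[, "t value"])
all.equal(p_manual, tab[, "Pr(>|t|)"])
```

Both all.equal() calls should return TRUE, which confirms that the last two columns are derived entirely from the first two plus the degrees of freedom.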
Some things to note in the results (based on Qs received in class):
1. There are seven brands in the data but only six brand intercepts. The seventh, 'Amstel', is the reference brand and its coefficient is fixed to zero, a priori. Why does this happen? Why only 6 brands? Why fix one brand to zero - isn't that arbitrary?
2. Well, consider this: Suppose we had gender in the X variables. We could have two columns - 'male' and 'female' - each coded 1 or 0. But we can use only one of the two columns in the regression, not both, because male = 1 - female and vice versa. In other words, given the intercept, the two columns are linearly dependent. Similarly, having the seventh brand in would make the right hand side of the above regression linearly dependent, and the coefficients could not be uniquely estimated.
3. Further, the choice of reference brand does not matter for the results. Any brand can be chosen as reference and the other brands' coefficients would adjust accordingly. For example, if we make Miller the reference brand, then we simply add 3.872*10^4 to the coefficients of all the brands (including Amstel). Then Miller would go to zero and the other brands would have a value relative to Miller.
4. Given average values of the X variables, we can multiply them by the coefficient estimates, add up the products along with the intercept, and obtain the average Y value. This is called the predicted Y (or Y-hat). Similarly, we can manipulate some of the X variables and compute the hypothetical Y-hat for that X-vector. This facility is a neat and powerful advantage of the regression modeling approach.
5. However, be careful and don't stretch predictions beyond the limits of the regression. For instance, just because 'Promotions' has a positive effect on sales doesn't mean that if we increase promotions to 1000x, we'll end up with 1000x times 1.188*10^2 (the promotion coefficient) in sales... that's probably stretching the model way beyond its reasonable limits.
6. Can't reiterate enough how critical it is that tomorrow's managers in general, and Mktg ones in particular, be comfortable with the regression approach. Certainly expect an exam Q or two on this.
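The reference-brand and Y-hat points above can be tried out on a toy example. Everything below is simulated; the brand effects, prices and the three-brand setup are my assumptions, not the beer data.

```r
set.seed(11)
brand <- factor(sample(c("Amstel", "Bud", "Miller"), 150, replace = TRUE))
price <- runif(150, 5, 15)
shift <- c(Amstel = 0, Bud = 20, Miller = -40)          # made-up brand effects
sales <- 500 - 10 * price + unname(shift[as.character(brand)]) + rnorm(150, sd = 5)
d     <- data.frame(sales, price, brand)

# Amstel is the reference here because it is the first factor level.
fit_a <- lm(sales ~ price + brand, data = d)

# Same model with Miller as the reference brand.
d$brand_m <- relevel(d$brand, ref = "Miller")
fit_m     <- lm(sales ~ price + brand_m, data = d)

coef(fit_a)
coef(fit_m)

# The fit itself is unchanged; only the brand coefficients are re-expressed.
all.equal(fitted(fit_a), fitted(fit_m))

# Y-hat for a hypothetical X-vector: Bud at a price of 10.
new_x <- data.frame(price = 10, brand = factor("Bud", levels = levels(d$brand)))
y_hat <- predict(fit_a, newdata = new_x)
y_hat
```

Comparing the two coefficient vectors shows each brand coefficient shifting by the same constant (Miller's coefficient under the Amstel reference, with the sign flipped), while fitted values stay identical. Keep the new X-vector within the range of the data the model was fit on.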
3. HW for Session 8:
Save the file 'feedback ratings.txt' and use the code given in 'session 8 HW R code.txt'. Solve the following Qs by interpreting analysis results:
Q1. Test of Differences: Test the hypotheses that Quant ratings are significantly (i) greater than quali ratings, (ii) greater than R ratings and (iii) about the same as overall ratings. Identify whether the tests you run are one-sample or two-sample, and one-tailed or two-tailed. Interpret the p-value for inference on significant differences.
We now divide the sample into 3 groups - High, medium and low - where 'High' ratings are one standard deviation or more above the mean, 'Low' ratings are one stdev or more below the mean and the rest are in the 'Medium' zone. [I wrote a function in the code that'll do this part automatically]
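For those curious about what such a grouping looks like, here is one possible sketch. The function name rate_band() and the ratings are hypothetical; the function in the HW code file may well be written differently.

```r
# Hypothetical helper: bands ratings by +/- one standard deviation around the mean.
rate_band <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  band <- ifelse(x >= m + s, "High",
          ifelse(x <= m - s, "Low", "Medium"))
  factor(band, levels = c("Low", "Medium", "High"))
}

quant  <- c(2, 5, 6, 6, 7, 7, 7, 8, 8, 10)   # made-up ratings
quant1 <- rate_band(quant)
table(quant1)
```

Returning a factor with ordered levels keeps the Low/Medium/High order intact in crosstabs, which is handy when the result is passed to table() and chisq.test().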
Q2a. Test of Association: Suppose you conjecture that the people who rate Quant High (Low) also rate (i) R High (Low), (ii) the blog High (Low) and (iii) the HWs High (Low). Test this conjecture. Interpret the results.
Q2b. Some of the HWs are done on R and a lot of explanation for them can be found on the blog. Test the conjecture that folks who rate the HWs High (Low) also rate (i) the blog High (Low) and (ii) R High (Low).
Q3. Basic Regression Modeling: Test the conjecture that overall rating is a function of the component ratings (for quant, quali, R, blog and HWs). Remember to first write a conceptual model that relates the variables by name, then write an econometric model with coefficients and the error term thrown in and then, finally, run the code.
Q3a. The overall test of significance (for the regression as a whole) is given by the F-statistic in the last line of output. Interpret whether the Y actually relates to the set of Xs chosen.
Q3b. The extent to which the X variables explain variation in Y is given by the multiple R squared. What is the extent of unexplained variation in Y in the above regression?
Q3c. Interpret the results (coefficients and inference) of the regression.
Q3d. Factor regression Modeling: Run a variant of the above regression model. Regress overall rating on the high/med/low categorization for the component ratings. That is, regress overall over quant1, quali1, R1 etc. Interpret the results.
Q3e. What practical implications arise for an instructor of MKTR from the above 2 regression results? What should he/she focus on and what should he/she de-emphasize? Is there sufficient information and evidence to warrant your conclusions? Write a few lines about this.
The deadline is Saturday next week, but as that'll be exam week, pls don't wait that long - finish it off, like, tomorrow and submit.
Sudhir