Homework Set 15, Due Friday November 17, 2017.
---------------------------------------------
Assigned 11/8/2017, due 11/17
14 points with 2 Extra-Credit points possible
=============================

Consider the following "empirical Bayes" data setting: the "esoph" data as in
HW 13 (B), with 88 groups, one for each of all but 8 of the 6 x 4 x 4
cross-classified agegp, alcgp and tobgp combinations (i,j,k), i=1:6, j=1:4,
k=1:4. (That is, there are 8 "empty" (i,j,k) combinations for which there are
no cases or controls.)

Regard these data as telling the number of "case" events (ncases = r_{ijk})
among the total number (ncases+ncontrols = n_{ijk}) of (agegp=i, alcgp=j,
tobgp=k) group members studied. Suppose that each group (i,j,k) has its own
separate case-probability p_{ijk}. Think of these as parameter variables
a priori distributed independently with Beta(0.5,0.5) distribution, think of
the group sizes n_{ijk} as fixed, and, given p_{ijk},

    r_{ijk} ~ Binomial(n_{ijk}, p_{ijk}).

(1) Let phat be the vector of Maximum Likelihood estimators for the group
case-probabilities (indexed by agegp, alcgp, tobgp in exactly the same order as
the esoph data-frame), and let ptild be the corresponding vector of Bayesian
estimators ( = posterior expected values) minimizing the expected squared-error
loss-function. Simulate 10,000 batches of data, each consisting of 88 counts
r_{ijk} ~ Binomial(n_{ijk}, p_{ijk}), in three different ways:

    Method 1. p_{ijk} = ijk entry of phat for all 10,000 batches of data.
    Method 2. p_{ijk} = ijk entry of ptild for all 10,000 batches of data.
    Method 3. In the m'th batch of data, draw p_{ijk} from the posterior (Beta)
              distribution given the esoph data.

(2) In each of the three simulated sets of 10,000 data-batches do the
following: first put each dataset into a data-frame with the format of esoph,
but with the agegp, alcgp, tobgp variables replaced by their respective numeric
index-values as.numeric(agegp), as.numeric(alcgp), and as.numeric(tobgp). Then
fit the three models

    Fit1: cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp
    Fit2: cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:tobgp
    Fit3: cbind(r,n-r) ~ agegp + alcgp + tobgp + alcgp:tobgp

and save the resulting fitted value of the interaction coefficient for each
model, for each data-batch. These saved values should go into a
10,000 x 3 x 3 array DatArr: the first index is for simulation-repetition, the
second for Method 1, 2, or 3 of simulation in part (1) above, and the third for
model Fit1, Fit2 or Fit3.

(3) Calculate the values of these interaction coefficients (with the agegp,
alcgp and tobgp variables replaced by their "as.numeric" indices) based on the
original dataset esoph. Use the simulated data in DatArr to create reference
distributions for the three interaction coefficients in models Fit1, Fit2, Fit3
(singly and jointly) and answer the Research Question: are any of the pairwise
interactions between agegp, alcgp and tobgp significant based on these
simulated pseudo-data? Can you detect meaningful differences between the
reference distributions for each of |Fit1 coef|, |Fit2 coef|, |Fit3 coef|
across the different Methods 1, 2, 3 of simulation? Draw conclusions about the
significance of (at least one of) the interactions for these data based on this
simulation study. Can you provide a meaningful p-value for a statistic
measuring this significance?

***======================================================================================
EXTRA-CREDIT 2 points: can you think of a reasonable way to do the pseudo-data
simulation and answer the question using a NONparametric bootstrap?
***======================================================================================
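
A minimal R sketch of the mechanics of parts (1)-(3) follows, offered as one possible setup rather than a prescribed solution. It rests on several assumptions not stated in the assignment: the models are fitted as logistic regressions with glm(family = binomial); the number of batches B is cut to 100 so the sketch runs quickly (the assignment asks for 10,000); and the helper names sim_p, int_coef, base, forms and the seed are hypothetical. The final percentile summary is only one way to begin looking at the reference distributions; it does not by itself answer the Research Question.

```r
# Sketch only, under the assumptions named above.
data(esoph)
r <- esoph$ncases
n <- esoph$ncases + esoph$ncontrols        # group sizes as defined in the text

phat  <- r / n                             # MLE of each group case-probability
ptild <- (r + 0.5) / (n + 1)               # posterior mean under Beta(0.5,0.5)

# Case-probabilities used to generate one batch under Method 1, 2 or 3.
# Method 3 draws from the posterior Beta(r + 0.5, n - r + 0.5).
sim_p <- function(method) switch(method,
    phat,
    ptild,
    rbeta(length(n), r + 0.5, n - r + 0.5))

# Design columns replaced by their numeric index-values.
base <- data.frame(agegp = as.numeric(esoph$agegp),
                   alcgp = as.numeric(esoph$alcgp),
                   tobgp = as.numeric(esoph$tobgp),
                   n = n)

forms <- list(cbind(r, n - r) ~ agegp + alcgp + tobgp + agegp:alcgp,   # Fit1
              cbind(r, n - r) ~ agegp + alcgp + tobgp + agegp:tobgp,   # Fit2
              cbind(r, n - r) ~ agegp + alcgp + tobgp + alcgp:tobgp)   # Fit3

# Interaction coefficient (5th coefficient) of each of the three fits.
# ASSUMPTION: logistic regression via glm(family = binomial).
int_coef <- function(dat) sapply(forms, function(f)
    unname(coef(glm(f, family = binomial, data = dat))[5]))

B <- 100                                   # assignment asks for 10,000
DatArr <- array(NA_real_, dim = c(B, 3, 3))
set.seed(2017)                             # arbitrary seed for reproducibility
for (m in 1:B) for (meth in 1:3) {
    dat <- base
    dat$r <- rbinom(nrow(dat), size = dat$n, prob = sim_p(meth))
    DatArr[m, meth, ] <- int_coef(dat)
}

# Interaction coefficients from the original esoph data.
obs <- int_coef(cbind(base, r = r))

# 2.5% and 97.5% points of each simulated coefficient: a 2 x 3 x 3 array
# indexed by (quantile, Method, Fit).
ref_q <- apply(DatArr, c(2, 3), quantile, probs = c(0.025, 0.975))
```

Here DatArr[, meth, fit] holds the simulated interaction coefficients for one Method and one model, so hist(DatArr[, 3, 1]) with abline(v = obs[1]) would display the Method 3 reference distribution for the Fit1 coefficient against its observed value.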