Homework Set 15, Due Friday November 17, 2017.
---------------------------------------------

Assigned 11/8/2017, due 11/17               14 points with 2 Extra-Credit points possible
=============================

Consider the following "empirical Bayes" data setting: 

"esoph" data as in HW 13 (B)

88 groups, one for each of the 6 x 4 x 4 = 96 cross-classified agegp, alcgp and 
tobgp combinations  (i,j,k), i=1:6, j=1:4, k=1:4, except for the 8 "empty" 
(i,j,k) combinations for which there are no cases or controls.

Regard these data as telling the number of "case" events (ncases = r_{ijk})
among the total number (ncases+ncontrols = n_{ijk}) of (agegp=i, alcgp=j, tobgp=k)
group members studied. Suppose that each group  (i,j,k)  has its own separate 
case-probability  p_{ijk}. Think of these as parameter variables a priori distributed 
independently with Beta(0.5,0.5) distribution, and think of the n_{ijk}  group sizes 
as fixed, and given  p_{ijk} ,         r_{ijk} ~ Binomial(n_{ijk}, p_{ijk}) .

(1) Let  phat  be the vector of Maximum likelihood estimators for the group case-probabilities,
(indexed by agegp,alcgp,tobgp in exactly the same order as the esoph data-frame) and
ptild be the corresponding vector of Bayesian estimators ( = posterior expected values), which minimize 
the expected squared-error loss.
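
A minimal R sketch of the two estimators, under the setup above: with a Beta(0.5,0.5) prior 
and r ~ Binomial(n,p), the posterior of p is Beta(r+0.5, n-r+0.5), whose mean is (r+0.5)/(n+1). 
The names r and n are just local shorthand, and n is taken to be ncases+ncontrols as defined above.

```r
# esoph ships with base R (package 'datasets'), one row per non-empty group.
r <- esoph$ncases
n <- esoph$ncases + esoph$ncontrols   # group size as defined in this assignment

phat  <- r / n                 # binomial MLE for each group (can be exactly 0 or 1)
ptild <- (r + 0.5) / (n + 1)   # posterior mean of Beta(r + 0.5, n - r + 0.5)

head(cbind(r, n, phat, ptild))
```

Note that ptild is always strictly between 0 and 1, while phat is 0 for every group with no cases.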

Simulate 10,000 batches of data, each consisting of 88 counts r_{ijk} ~ Binomial(n_{ijk},p_{ijk}), in three different ways.
    Method 1. p_{ijk} = ijk entry of phat  for all 10,000 batches of data.
    Method 2. p_{ijk} = ijk entry of ptild for all 10,000 batches of data.
    Method 3. In the m'th batch of data, draw  p_{ijk}  from the posterior (Beta) distribution
              given the esoph  data.
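
The three simulation methods might be sketched in R as follows. This is one possible layout, 
not a required one: each method produces a 10,000 x 88 matrix with one row per batch. The names 
draw_counts, const_probs, R1, R2, R3 and P3 are hypothetical, and the seed is arbitrary.

```r
set.seed(1)                            # arbitrary seed, only for reproducibility
r <- esoph$ncases
n <- esoph$ncases + esoph$ncontrols    # group sizes, held fixed
G <- length(n)                         # number of non-empty groups
M <- 10000                             # number of simulated batches

phat  <- r / n
ptild <- (r + 0.5) / (n + 1)

# Draw an M x G matrix of counts from an M x G matrix P of case-probabilities.
# Matrices are filled column by column, so column g always uses group size n[g].
draw_counts <- function(P) {
  matrix(rbinom(M * G, size = rep(n, each = M), prob = as.vector(P)),
         nrow = M, ncol = G)
}
# Same probability vector p in every row (every batch).
const_probs <- function(p) matrix(p, nrow = M, ncol = G, byrow = TRUE)

R1 <- draw_counts(const_probs(phat))   # Method 1: p fixed at the MLE
R2 <- draw_counts(const_probs(ptild))  # Method 2: p fixed at the posterior mean

# Method 3: a fresh posterior draw of every p in every batch.
P3 <- matrix(rbeta(M * G, shape1 = rep(r + 0.5, each = M),
                          shape2 = rep(n - r + 0.5, each = M)),
             nrow = M, ncol = G)
R3 <- draw_counts(P3)
```

Under Method 1, every group with phat = 0 yields zero cases in all batches, which is worth 
keeping in mind when comparing the Methods later.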
				 
(2) In each of the three simulated sets of 10,000 data-batches do the following:

First put each dataset into a data-frame with the format of esoph, but with the agegp, alcgp, tobgp
variables replaced by their respective numeric index-values as.numeric(agegp), as.numeric(alcgp), 
and as.numeric(tobgp).

Then fit the three models,
    Fit1:     cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp
    Fit2:     cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:tobgp
    Fit3:     cbind(r,n-r) ~ agegp + alcgp + tobgp + alcgp:tobgp
and save the resulting fitted values of the interaction coefficient for each model, for each data-batch.
These saved values should go into a 10,000 x 3 x 3 array  DatArr: the first index is for 
simulation-repetition, the second for Method 1,2, or 3 of simulation in part (1) above, and the third
for model Fit1, Fit2 or Fit3.
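
A hedged sketch of the fitting step for one slice of the array. It assumes the models are 
binomial (logistic) glm fits, which the cbind(r,n-r) response suggests but the text does not 
state. The names base, forms, terms3, interaction_coefs, M_demo and DatArr_demo are hypothetical, 
and only a small number of batches under Method 2 is run here to keep the example quick.

```r
# Design columns with factors replaced by their numeric index-values.
base <- data.frame(agegp = as.numeric(esoph$agegp),
                   alcgp = as.numeric(esoph$alcgp),
                   tobgp = as.numeric(esoph$tobgp),
                   n     = esoph$ncases + esoph$ncontrols)

forms <- list(Fit1 = cbind(r, n - r) ~ agegp + alcgp + tobgp + agegp:alcgp,
              Fit2 = cbind(r, n - r) ~ agegp + alcgp + tobgp + agegp:tobgp,
              Fit3 = cbind(r, n - r) ~ agegp + alcgp + tobgp + alcgp:tobgp)
terms3 <- c("agegp:alcgp", "agegp:tobgp", "alcgp:tobgp")

# Fit all three models to one batch of counts; return the three interaction coefficients.
interaction_coefs <- function(r_sim) {
  dat <- base
  dat$r <- r_sim
  mapply(function(f, tm) coef(glm(f, family = binomial, data = dat))[[tm]],
         forms, terms3)
}

set.seed(2)                              # arbitrary seed
M_demo <- 20                             # illustration only; the assignment asks for 10,000
p_demo <- (esoph$ncases + 0.5) / (base$n + 1)   # Method 2 probabilities (ptild)

DatArr_demo <- array(NA_real_, dim = c(M_demo, 3, 3),
                     dimnames = list(NULL, paste0("Method", 1:3), names(forms)))
for (m in seq_len(M_demo)) {
  r_sim <- rbinom(nrow(base), size = base$n, prob = p_demo)
  DatArr_demo[m, 2, ] <- interaction_coefs(r_sim)   # fill the Method 2 slice only
}
```

The full DatArr would repeat the loop for all three Methods and all 10,000 batches, which 
means 90,000 glm fits, so expect it to take noticeably longer than this demonstration.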

(3) Calculate the values for these interaction coefficients (with the agegp, alcgp and tobgp 
variables replaced by their "as.numeric" indices) based on the original dataset esoph.
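
For the original data, the same three fits could be computed along these lines. As before, 
the binomial family is an assumption, and obs_dat, int_coef and obs_coef are hypothetical names.

```r
obs_dat <- data.frame(agegp = as.numeric(esoph$agegp),
                      alcgp = as.numeric(esoph$alcgp),
                      tobgp = as.numeric(esoph$tobgp),
                      r     = esoph$ncases,
                      n     = esoph$ncases + esoph$ncontrols)

# Fit main effects plus one pairwise interaction; return that interaction's coefficient.
int_coef <- function(term) {
  f <- as.formula(paste("cbind(r, n - r) ~ agegp + alcgp + tobgp +", term))
  coef(glm(f, family = binomial, data = obs_dat))[[term]]
}

obs_coef <- c(Fit1 = int_coef("agegp:alcgp"),
              Fit2 = int_coef("agegp:tobgp"),
              Fit3 = int_coef("alcgp:tobgp"))
obs_coef
```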

Use the simulated data in DatArr to create reference distributions for the three interaction coefficients 
in models Fit1, Fit2, Fit3  (singly and jointly) and answer the 

Research Question: are any of the pairwise interactions between agegp, alcgp and tobgp  significant 
based on these simulated pseudo-data ?

Can you detect meaningful differences between the reference distributions for each of 
    |Fit1 coef|,  |Fit2 coef|,  |Fit3 coef|,    
across the different Methods 1,2,3 of simulation ? 

Draw conclusions about the significance of (at least one of the) interactions for these data based 
on this simulation study. Can you provide a meaningful p-value for a statistic measuring this 
significance ?
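
One possible way to summarize a reference distribution, offered only as a sketch and not as 
the intended answer: because the pseudo-data are generated from group-specific probabilities 
rather than from a no-interaction model, the simulated coefficients behave like bootstrap 
replicates scattered around a nonzero center, not like a null distribution. The hypothetical 
helper boot_summary below therefore reports a percentile interval and a p-value based on 
deviations from the observed coefficient. The toy input is made up purely to show the arithmetic.

```r
# obs : interaction coefficient fitted to the original data
# sims: simulated coefficients for the same model under one Method
boot_summary <- function(obs, sims, level = 0.95) {
  a  <- (1 - level) / 2
  ci <- quantile(sims, probs = c(a, 1 - a), names = FALSE)   # percentile interval
  # Two-sided p-value: how often a replicate deviates from obs by at least |obs|.
  p_centered <- (1 + sum(abs(sims - obs) >= abs(obs))) / (length(sims) + 1)
  list(ci = ci,
       covers_zero = (ci[1] <= 0) && (0 <= ci[2]),
       p_centered  = p_centered)
}

# Toy numbers only, not results from esoph.
toy <- boot_summary(obs = 1, sims = c(-1, 0, 1, 2, 3))
toy$p_centered   # (1 + 4) / (5 + 1)
```

Centering at the observed coefficient is itself a choice. Under Methods 2 and 3 the replicates 
need not be centered exactly there, so comparing against the mean of the simulated values is 
another defensible option.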

***======================================================================================
EXTRA-CREDIT 2 points: can you think of a reasonable way to do the pseudo-data simulation 
and answer the question using a NONparametric bootstrap?
***======================================================================================