HW 15 Solution
--------------

## Since the MLE for the groupwise binomial success probability is the number of
## "successes" (cases) over the total number of trials,

> phat = esoph$ncas/(esoph$ncas+esoph$ncontr)

### Under the Jeffreys Beta(0.5, 0.5) prior, the posterior for the groupwise success
### probability is Beta(x+0.5, n-x+0.5), so the posterior mean is (x+0.5)/(n+1):

> ptild = (esoph$ncas+0.5)/(esoph$ncas+esoph$ncontr+1)

> set.seed(7001)

## Part (1): simulation

> newfr = cbind.data.frame(esoph[,1:3], n = esoph$ncas+esoph$ncontr, r = esoph$ncas)
  for (j in 1:3) newfr[,j] = as.numeric(newfr[,j])
### This is the data frame from which we will do all glm analyses, changing only the
### last column according to the different simulation methods, and converting the
### first three columns to numeric.

> Coef.Intx = c( glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp,
                      family=binomial, data=newfr)$coef[5],
                 glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:tobgp,
                      family=binomial, data=newfr)$coef[5],
                 glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + alcgp:tobgp,
                      family=binomial, data=newfr)$coef[5] )
> Coef.Intx
 agegp:alcgp  agegp:tobgp  alcgp:tobgp
-0.097788563 -0.005718686 -0.131830749
## These are the three interaction coefficients, each fitted separately on the original data.

> set.seed(7001)
  tmpArr = array(c(rbinom(88*1e4, newfr$n, phat),
                   rbinom(88*1e4, newfr$n, ptild),
                   rbinom(88*1e4, newfr$n, rbeta(88*1e4, esoph$ncas+0.5, esoph$ncontr+0.5))),
                 c(88,1e4,3), dimnames=list(NULL, NULL, c("MLE","Post.Exp","Bayes")))
### Note that in the "Bayes" method new beta variates p are drawn from the posterior
### in each batch.

> dim(tmpArr)
[1]    88 10000     3

## Part (2): Model fitting

> DatArr = array(0, c(1e4,3,3),
                 dimnames=list(NULL, paste0("Method",1:3), paste0("Fit",1:3)))
  for (i in 1:1e4) for (j in 1:3) {
    newfr$r = tmpArr[,i,j]
    DatArr[i,j,1] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp,
                         family=binomial, data=newfr)$coef[5]
    DatArr[i,j,2] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:tobgp,
                         family=binomial, data=newfr)$coef[5]
    DatArr[i,j,3] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + alcgp:tobgp,
                         family=binomial, data=newfr)$coef[5]
  }

> summary(DatArr[,1,1])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.32394 -0.13534 -0.09777 -0.09683 -0.05929  0.14196
### Compare with -0.097789 above.

### Now the reference distributions for the three coefficients SEPARATELY are obtained, for
### Method j and Fit k, from the empirical distribution of DatArr[,j,k]-Coef.Intx[k].
### We use these reference distributions to find how far out in the tails [either one]
### the original coefficients Coef.Intx - c(0,0,0) are.

## Part (3)

> Pvals = replace(DatArr[1,,], T, 0)
  for (j in 1:3) for (k in 1:3)
    Pvals[j,k] = mean( DatArr[,j,k]-Coef.Intx[k] <= Coef.Intx[k] )
> Pvals
          Fit1   Fit2   Fit3
Method1 0.0370 0.4571 0.0290
Method2 0.0279 0.6571 0.0535
Method3 0.0650 0.6225 0.1045

## Significance of the individual coefficients at the 2-sided .05 level is obtained by
> pmin(Pvals,1-Pvals) <= .025
         Fit1  Fit2  Fit3
Method1 FALSE FALSE FALSE
Method2 FALSE FALSE FALSE
Method3 FALSE FALSE FALSE

### The same question at the 2-sided .10 level:
> pmin(Pvals,1-Pvals) <= .05
         Fit1  Fit2  Fit3
Method1  TRUE FALSE  TRUE
Method2  TRUE FALSE FALSE
Method3 FALSE FALSE FALSE

### Note that whenever a coefficient is significant, Coef.Intx[k] lies in the NEGATIVE
### direction (the significance is always in the lower tail).
### We would have to say that nothing looks significant at the 2-sided .05 level.
### At the 2-sided .10 level, we seem to have significant results in Method1 for the Fit1
### and Fit3 (i.e., the agegp:alcgp and alcgp:tobgp) coefficients, and in Method2 for the
### Fit1 (agegp:alcgp) coefficient.
### Under all simulation methods, the Fit2 (agegp:tobgp) interaction seems to be clearly
### NONsignificant. Also, simulation Method 3 seems to provide LESS significance for all
### three coefficients than the other two methods do.
### HOWEVER, the differences may not be meaningful, because the half-widths of 95% CIs for
### Monte Carlo estimates of probabilities .05 and .10 based on 10,000 Bernoulli trials are
> round(1.96*sqrt(c(.05*.95/1e4, .1*.9/1e4)), 5)
[1] 0.00427 0.00588

THUS EVEN A SIMULATION AS LARGE AS THIS ONE IS VERY CRUDE FOR MEASURING DIFFERENCES OF
SIGNIFICANCE LEVELS ACROSS METHODS, ESPECIALLY AS THE DEPENDENCE OF THE SIMULATED
COEFFICIENTS ACROSS METHOD AND FIT IS NOT VERY STRONG, AS SHOWN BY

> round(cor(array(DatArr,c(1e4,9))), 3)
        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
 [1,]  1.000 -0.006  0.013 -0.019 -0.019  0.002 -0.144  0.004 -0.004
 [2,] -0.006  1.000 -0.006  0.006 -0.048 -0.003 -0.010 -0.147  0.009
 [3,]  0.013 -0.006  1.000 -0.009 -0.009 -0.040  0.001  0.003 -0.162
 [4,] -0.019  0.006 -0.009  1.000  0.013 -0.007 -0.205  0.000 -0.016
 [5,] -0.019 -0.048 -0.009  0.013  1.000  0.011 -0.021 -0.126 -0.010
 [6,]  0.002 -0.003 -0.040 -0.007  0.011  1.000  0.012 -0.004 -0.110
 [7,] -0.144 -0.010  0.001 -0.205 -0.021  0.012  1.000 -0.004  0.002
 [8,]  0.004 -0.147  0.003  0.000 -0.126 -0.004 -0.004  1.000 -0.009
 [9,] -0.004  0.009 -0.162 -0.016 -0.010 -0.110  0.002 -0.009  1.000

THIS ISSUE OF DEPENDENCE ACROSS FIT (COEFFICIENT TYPE, WITHIN DATASET) IS ESPECIALLY
RELEVANT BECAUSE, IN ASKING WHETHER THERE IS AT LEAST ONE SIGNIFICANT INTERACTION
COEFFICIENT, WE ARE ASKING THREE QUESTIONS FOR EACH SIMULATION METHOD, ACTUALLY 9 IF WE
TAKE THE DIFFERENT (INDEPENDENT) SIMULATION METHODS INTO ACCOUNT. BASED ON THESE
CONSIDERATIONS, ALL THAT WE CAN SAY WITH CONFIDENCE IS THAT THE FIT2 COEFFICIENT IS
DEFINITELY NOT SIGNIFICANT, AND THAT THE METHOD 3 SIMULATIONS GIVE LESS SIGNIFICANT FIT1
AND FIT3 COEFFICIENTS.

AS WE SAW ABOVE, THE P-VALUES FOR THE 3 COEFFICIENTS CONSIDERED SEPARATELY, AND SEPARATELY
FOR EACH SIMULATION METHOD, ARE

> 2*pmin(Pvals, 1-Pvals)
          Fit1   Fit2  Fit3
Method1 0.0740 0.9142 0.058
Method2 0.0558 0.6858 0.107
Method3 0.1300 0.7550 0.209

TO OBTAIN A P-VALUE FOR THE 3 COEFFICIENTS CONSIDERED JOINTLY, WE COULD (SEPARATELY FOR
EACH SIMULATION METHOD) PROVIDE A P-VALUE FOR THE MAXIMUM ABSOLUTE COEFFICIENT AS:

> Pjoint = numeric(3)
  for (j in 1:3) Pjoint[j] =
    mean( apply(abs(DatArr[,j,] - rep(1,1e4) %o% Coef.Intx), 1, max) <= max(abs(Coef.Intx)) )
> 2*pmin(Pjoint,1-Pjoint)   ### These are p-values, none close to significant,
[1] 0.1742 0.2604 0.5864
### but Method 1 definitely gives the result closest to significance.

===================================================================================

A nonparametric bootstrap could again be done by randomly and equiprobably resampling the
indices 1:88 with replacement in the original dataset. Because of the granularity of the
data resampled this way, there is no real reason to believe that it would be better than
any of the methods studied here, although the NPB method would have the virtue of requiring
no distributional or prior assumptions at all. Here, though, Method 1 has a claim to be
called "nonparametric" once we assume independence of outcomes for the within-group
Bernoulli trials ...
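
### As a minimal sketch (not part of the original solution) of how that nonparametric
### bootstrap might look, reusing the newfr data frame and the Fit1 formula from above,
### with B, bootfr, CoefNPB, and Pnpb as illustrative names: resample the 88 rows with
### replacement, refit the Fit1 model, and build a reference distribution for the
### agegp:alcgp coefficient.

> set.seed(7001)
  B = 1000
  CoefNPB = numeric(B)
  for (b in 1:B) {
    bootfr = newfr[sample(1:88, 88, replace=TRUE), ]   ### equiprobable resampling of rows
    CoefNPB[b] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp,
                      family=binomial, data=bootfr)$coef[5]
  }
### Two-sided p-value from the recentered reference distribution, as in Part (3):
> Pnpb = mean( CoefNPB - Coef.Intx[1] <= Coef.Intx[1] )
  2*min(Pnpb, 1-Pnpb)
### The same loop, with the Fit2 and Fit3 formulas, would give reference distributions
### for the other two interaction coefficients.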