HW 15 Solution
--------------

## Since the MLE for the groupwise binomial success probability is the number of
## "successes" (cases) over the total number of trials,

> phat = esoph$ncas/(esoph$ncas+esoph$ncontr)

### Under the Jeffreys Beta(0.5, 0.5) prior, the posterior for the groupwise success
### probability is Beta(x+0.5, n-x+0.5), so the posterior mean is (x+0.5)/(n+1):

> ptild = (esoph$ncas+0.5)/(esoph$ncas+esoph$ncontr+1)

> set.seed(7001)

## Part (1): simulation

> newfr = cbind.data.frame(esoph[,1:3], n = esoph$ncas+esoph$ncontr, r = esoph$ncas)
  for (j in 1:3) newfr[,j] = as.numeric(newfr[,j])
### This is the data frame from which we will do all glm analyses, changing only the
### last column according to the different simulation methods, and converting the
### first three columns to numeric.

> Coef.Intx = c( glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp,
                      family=binomial, data=newfr)$coef[5],
                 glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:tobgp,
                      family=binomial, data=newfr)$coef[5],
                 glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + alcgp:tobgp,
                      family=binomial, data=newfr)$coef[5] )
> Coef.Intx
 agegp:alcgp  agegp:tobgp  alcgp:tobgp
-0.097788563 -0.005718686 -0.131830749
## These are the three interaction coefficients, each fitted separately on the original data.

> set.seed(7001)
  tmpArr = array(c(rbinom(88*1e4, newfr$n, phat),
                   rbinom(88*1e4, newfr$n, ptild),
                   rbinom(88*1e4, newfr$n, rbeta(88*1e4, esoph$ncas+0.5, esoph$ncontr+0.5))),
                 c(88,1e4,3), dimnames=list(NULL, NULL, c("MLE","Post.Exp","Bayes")))
### Note that in the "Bayes" method new beta variates p are drawn from the posterior
### in each batch.

> dim(tmpArr)
[1]    88 10000     3

## Part (2): Model fitting

> DatArr = array(0, c(1e4,3,3),
                 dimnames=list(NULL, paste0("Method",1:3), paste0("Fit",1:3)))
  for (i in 1:1e4) for (j in 1:3) {
    newfr$r = tmpArr[,i,j]
    DatArr[i,j,1] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp,
                         family=binomial, data=newfr)$coef[5]
    DatArr[i,j,2] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:tobgp,
                         family=binomial, data=newfr)$coef[5]
    DatArr[i,j,3] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + alcgp:tobgp,
                         family=binomial, data=newfr)$coef[5]
  }

> summary(DatArr[,1,1])
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.32394 -0.13534 -0.09777 -0.09683 -0.05929  0.14196
### Compare with -0.097789 above.

### Now the reference distributions for the three coefficients SEPARATELY are obtained, for
### Method j and Fit k, from the empirical distribution of DatArr[,j,k]-Coef.Intx[k].
### We use these reference distributions to find how far out in the tails [either one]
### the original coefficients Coef.Intx - c(0,0,0) are.

## Part (3)

> Pvals = replace(DatArr[1,,], T, 0)
  for (j in 1:3) for (k in 1:3)
    Pvals[j,k] = mean( DatArr[,j,k]-Coef.Intx[k] <= Coef.Intx[k] )
> Pvals
          Fit1   Fit2   Fit3
Method1 0.0370 0.4571 0.0290
Method2 0.0279 0.6571 0.0535
Method3 0.0650 0.6225 0.1045

## Significance of the individual coefficients at the 2-sided .05 level is obtained by
> pmin(Pvals,1-Pvals) <= .025
         Fit1  Fit2  Fit3
Method1 FALSE FALSE FALSE
Method2 FALSE FALSE FALSE
Method3 FALSE FALSE FALSE

### The same question at the 2-sided .10 level:
> pmin(Pvals,1-Pvals) <= .05
         Fit1  Fit2  Fit3
Method1  TRUE FALSE  TRUE
Method2  TRUE FALSE FALSE
Method3 FALSE FALSE FALSE

### Note that whenever a coefficient is significant, Coef.Intx[k] lies in the NEGATIVE
### direction (the significance is always in the lower tail).
### We would have to say that nothing looks significant at the 2-sided .05 level.
### At the 2-sided .10 level, we seem to have significant results in Method1 for the Fit1
### and Fit3 (i.e., the agegp:alcgp and alcgp:tobgp) coefficients, and in Method2 for the
### Fit1 (agegp:alcgp) coefficient.
### Under all simulation methods, the Fit2 (agegp:tobgp) interaction seems to be clearly
### NONsignificant. Also, simulation Method 3 seems to provide LESS significance for all
### three coefficients than the other two methods do.
### HOWEVER, the differences may not be meaningful, because the half-widths of 95% CIs for
### Monte Carlo estimates of probabilities .05 and .10 based on 10,000 Bernoulli trials are
> round(1.96*sqrt(c(.05*.95/1e4, .1*.9/1e4)), 5)
[1] 0.00427 0.00588

THUS EVEN A SIMULATION AS LARGE AS THIS ONE IS VERY CRUDE FOR MEASURING DIFFERENCES OF
SIGNIFICANCE LEVELS ACROSS METHODS, ESPECIALLY AS THE DEPENDENCE OF THE SIMULATED
COEFFICIENTS ACROSS METHOD AND FIT IS NOT VERY STRONG, AS SHOWN BY

> round(cor(array(DatArr,c(1e4,9))), 3)
        [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]
 [1,]  1.000 -0.006  0.013 -0.019 -0.019  0.002 -0.144  0.004 -0.004
 [2,] -0.006  1.000 -0.006  0.006 -0.048 -0.003 -0.010 -0.147  0.009
 [3,]  0.013 -0.006  1.000 -0.009 -0.009 -0.040  0.001  0.003 -0.162
 [4,] -0.019  0.006 -0.009  1.000  0.013 -0.007 -0.205  0.000 -0.016
 [5,] -0.019 -0.048 -0.009  0.013  1.000  0.011 -0.021 -0.126 -0.010
 [6,]  0.002 -0.003 -0.040 -0.007  0.011  1.000  0.012 -0.004 -0.110
 [7,] -0.144 -0.010  0.001 -0.205 -0.021  0.012  1.000 -0.004  0.002
 [8,]  0.004 -0.147  0.003  0.000 -0.126 -0.004 -0.004  1.000 -0.009
 [9,] -0.004  0.009 -0.162 -0.016 -0.010 -0.110  0.002 -0.009  1.000

THIS ISSUE OF DEPENDENCE ACROSS FIT (COEFFICIENT TYPE, WITHIN DATASET) IS ESPECIALLY
RELEVANT BECAUSE, IN ASKING WHETHER THERE IS AT LEAST ONE SIGNIFICANT INTERACTION
COEFFICIENT, WE ARE ASKING THREE QUESTIONS FOR EACH SIMULATION METHOD, ACTUALLY 9 IF WE
TAKE THE DIFFERENT (INDEPENDENT) SIMULATION METHODS INTO ACCOUNT. BASED ON THESE
CONSIDERATIONS, ALL THAT WE CAN SAY WITH CONFIDENCE IS THAT THE FIT2 COEFFICIENT IS
DEFINITELY NOT SIGNIFICANT, AND THAT THE METHOD 3 SIMULATIONS GIVE LESS SIGNIFICANT FIT1
AND FIT3 COEFFICIENTS.

AS WE SAW ABOVE, THE P-VALUES FOR THE 3 COEFFICIENTS CONSIDERED SEPARATELY, AND SEPARATELY
FOR EACH SIMULATION METHOD, ARE

> 2*pmin(Pvals, 1-Pvals)
          Fit1   Fit2  Fit3
Method1 0.0740 0.9142 0.058
Method2 0.0558 0.6858 0.107
Method3 0.1300 0.7550 0.209

TO OBTAIN A P-VALUE FOR THE 3 COEFFICIENTS CONSIDERED JOINTLY, WE COULD (SEPARATELY FOR
EACH SIMULATION METHOD) PROVIDE A P-VALUE FOR THE MAXIMUM ABSOLUTE COEFFICIENT AS:

> Pjoint = numeric(3)
  for (j in 1:3) Pjoint[j] =
    mean( apply(abs(DatArr[,j,] - rep(1,1e4) %o% Coef.Intx), 1, max) <= max(abs(Coef.Intx)) )
> 2*pmin(Pjoint,1-Pjoint)   ### These are p-values, none close to significant,
[1] 0.1742 0.2604 0.5864
### but Method 1 definitely gives the result closest to significance.

===================================================================================

A nonparametric bootstrap could again be done by randomly and equiprobably resampling the
indices 1:88 with replacement in the original dataset. Because of the granularity of the
data resampled this way, there is no real reason to believe that it would be better than
any of the methods studied here, although the NPB method would have the virtue of requiring
no distributional or prior assumptions at all. Here, though, Method 1 has a claim to be
called "nonparametric" once we assume independence of outcomes for the within-group
Bernoulli trials ...
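
### As a minimal sketch (not part of the original solution) of how that nonparametric
### bootstrap might look, reusing the newfr data frame and the Fit1 formula from above,
### with B, bootfr, CoefNPB, and Pnpb as illustrative names: resample the 88 rows with
### replacement, refit the Fit1 model, and build a reference distribution for the
### agegp:alcgp coefficient.

> set.seed(7001)
  B = 1000
  CoefNPB = numeric(B)
  for (b in 1:B) {
    bootfr = newfr[sample(1:88, 88, replace=TRUE), ]   ### equiprobable resampling of rows
    CoefNPB[b] = glm( cbind(r,n-r) ~ agegp + alcgp + tobgp + agegp:alcgp,
                      family=binomial, data=bootfr)$coef[5]
  }
### Two-sided p-value from the recentered reference distribution, as in Part (3):
> Pnpb = mean( CoefNPB - Coef.Intx[1] <= Coef.Intx[1] )
  2*min(Pnpb, 1-Pnpb)
### The same loop, with the Fit2 and Fit3 formulas, would give reference distributions
### for the other two interaction coefficients.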