Homework 14 Solution
====================

## (A) Recoding factor columns and re-running the model from HW13B.

> esoph2 = esoph
> levels(esoph2$agegp) = c(30, 40, 50, 60, 70, 77)
> levels(esoph2$alcgp) = c(10, 60, 110, 200)
> levels(esoph2$tobgp) = c(4, 15, 28, 42)
> for(i in 1:3) esoph2[[i]] = C(factor(as.character(esoph2[[i]])), contr.helmert)

> sapply(esoph2, class)
    agegp     alcgp     tobgp    ncases ncontrols 
 "factor"  "factor"  "factor" "numeric" "numeric" 

> sapply(esoph2[1:3], levels)
$agegp
[1] "30" "40" "50" "60" "70" "77"

$alcgp
[1] "10"  "110" "200" "60" 

$tobgp
[1] "15" "28" "4"  "42"

### Note: round-tripping the levels through as.character() re-sorts them
### lexicographically, which is why "110" and "200" precede "60", and "4" follows "28".

> sapply(esoph2[1:3], contrasts)
$agegp
   [,1] [,2] [,3] [,4] [,5]
30   -1   -1   -1   -1   -1
40    1   -1   -1   -1   -1
50    0    2   -1   -1   -1
60    0    0    3   -1   -1
70    0    0    0    4   -1
77    0    0    0    0    5

$alcgp
    [,1] [,2] [,3]
10    -1   -1   -1
110    1   -1   -1
200    0    2   -1
60     0    0    3

$tobgp
   [,1] [,2] [,3]
15   -1   -1   -1
28    1   -1   -1
4     0    2   -1
42    0    0    3

> EsophMod2B = step(update(EsophMod1, data=esoph2),
      scope = list(lower = cbind(ncases, ncontrols) ~ 1,
                   upper = cbind(ncases, ncontrols) ~ .^2), k=4)

> summary(EsophMod2B)

Call:
glm(formula = cbind(ncases, ncontrols) ~ agegp + alcgp, family = binomial,
    data = esoph2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8979  -0.5592  -0.1995   0.5029   2.6250  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.91052    0.19391  -9.853  < 2e-16 ***
agegp1       0.76879    0.53231   1.444  0.14867    
agegp2       0.72606    0.18630   3.897 9.73e-05 ***
agegp3       0.45419    0.09635   4.714 2.43e-06 ***
agegp4       0.32568    0.06358   5.123 3.01e-07 ***
agegp5       0.21851    0.06715   3.254  0.00114 ** 
alcgp1       0.74753    0.13002   5.749 8.97e-09 ***
alcgp2       0.49176    0.08058   6.103 1.04e-09 ***
alcgp3      -0.02502    0.04284  -0.584  0.55925    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 227.241  on 87  degrees of freedom
Residual deviance:  64.572  on 79  degrees of freedom
AIC: 230.05

Number of Fisher Scoring iterations: 6

### The model differs in its coefficients from the previously fitted EsophMod2 model from 13B,
### but the anova table, the fitted and residual vectors, and all of the deviances are the same.

> c( sum(abs(EsophMod2B$fit - EsophMod2$fit)),
     sum(abs(EsophMod2B$resid - EsophMod2$resid)) )
[1] 5.970918e-15 9.303669e-14

> data.matrix(anova(EsophMod2B)) - data.matrix(anova(EsophMod2))
      Df      Deviance Resid. Df    Resid. Dev
NULL  NA            NA         0  0.000000e+00
agegp  0 -2.842171e-14         0  2.842171e-14
alcgp  0  4.263256e-14         0 -1.421085e-14

> sum(abs(c(model.matrix(EsophMod2B)[,2:9] -
       cbind(contr.helmert(6)[as.numeric(esoph2$agegp),],
             contr.helmert(4)[as.numeric(esoph2$alcgp),]))))
[1] 0

### This shows that the factor-derived columns of the model.matrix of the fitted model
### EsophMod2B are exactly the Helmert contrast-matrix rows for the corresponding levels.

## (B)

> set.seed(33312)
> xvec = rpois(80, 2.3)

## (i)

> lam.hat = exp(glm(xvec ~ 1, family=poisson)$coef)
> c(lam.hat, mean(xvec))
(Intercept)             
      2.125       2.125 

### The conjugate-family posterior is Gamma(1 + sum(xvec), 80.05), from the prior
### Gamma(1, .05): the shape gains sum(xvec) and the rate gains n = 80.

> lam.tild = (sum(xvec)+1)/80.05
> lam.tild
[1] 2.136165

## (ii)

> pseudo = cbind(PB = rpois(800, lam.hat),
                 NPB = xvec[sample(80, 800, replace=T)],
                 Bayes = rpois(800, rgamma(1, 1+sum(xvec), 80.05)))

## Note that the final Bayes-posterior-sampled lambda can be reproduced by replaying
## the RNG, skipping past the variates consumed by the earlier draws:

> set.seed(33312)
> tmp = runif(80+800+800)
> lam.Bayes = rgamma(1, 1+sum(xvec), 80.05)
> lam.Bayes
[1] 1.895207

### This accords well with the mean of the Bayes sample, found below,
### and we further check that it reproduces the same sample:

> sum(abs(rpois(800, lam.Bayes) - pseudo[,3]))
[1] 0

> apply(pseudo, 2, summary)   ### first descriptive summary
             PB     NPB Bayes
Min.    0.00000 0.00000  0.00
1st Qu. 1.00000 1.00000  1.00
Median  2.00000 2.00000  2.00
Mean    2.09875 2.17125  1.89
3rd Qu. 3.00000 3.00000  3.00
Max.    8.00000 6.00000  8.00

## So the lambda used in the parametric bootstrap, namely lam.hat = 2.125, is well represented
## by the means of both the parametric and nonparametric bootstrap samples, and the 3 quartiles
## of the two bootstrap samples are identical. However, the Bayes re-sampled dataset was built
## from a somewhat different lambda: lam.Bayes = 1.895207 is actually far in the left tail of
## the posterior gamma density from which it was sampled, as can be seen from:

> pgamma(1.895207, sum(xvec)+1, 80.05)
[1] 0.06568474

> apply(pseudo, 2, table)
$PB

  0   1   2   3   4   5   6   7   8 
 97 216 203 149  87  29  17   1   1 

$NPB

  0   1   2   3   4   5   6 
 92 157 219 247  52   5  28 

$Bayes

  0   1   2   3   4   5   6   7   8 
106 242 217 147  57  22   7   1   1 

### From these tables of 800 pseudo-sampled values, we can see that all 3 methods give roughly
### similar outputs. NPB is forced to have the same maximum as the original sample of 80, which
### is not really reasonable as the maximum of a comparable Poisson sample of size 800.

> apply(pseudo, 2, var)
      PB      NPB    Bayes 
2.111638 1.814191 1.755094 

### But we may think that the variances (which reflect lam.hat and lam.Bayes in the PB and
### Bayes cases) are actually a little too low for the NPB and Bayes samples. (The variance
### would get larger and more representative of the original population for Bayes if we
### sampled in multiple batches, taking the variability of lam.Bayes into account.)
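### The last remark can be sketched as follows. This is a minimal illustration, not part
### of the assigned solution: the object names (a.post, b.post, lam.batch, batches) and the
### batch count B = 200 are my own choices; the posterior Gamma(1 + sum(xvec), 80.05) is the
### one derived in part (B)(i). Drawing a fresh lambda for every batch lets the pseudo-data
### reflect the posterior spread of lambda, not just a single (possibly atypical) draw.

```r
set.seed(33312)
xvec   <- rpois(80, 2.3)          # same simulated data as in part (B)
a.post <- 1 + sum(xvec)           # posterior shape: prior shape 1 + total count
b.post <- 0.05 + 80               # posterior rate:  prior rate .05 + n
B <- 200                          # number of batches (an arbitrary illustrative choice)

lam.batch <- rgamma(B, a.post, b.post)                   # one posterior lambda per batch
batches   <- sapply(lam.batch, function(l) rpois(80, l)) # 80 x B matrix of pseudo-samples

c(mean = mean(batches), var = var(as.vector(batches)))
```

### Pooled over batches, the variance now includes the extra Var(lambda) component,
### E[lambda] + Var(lambda) = a.post/b.post + a.post/b.post^2, roughly 2.14 + 0.03,
### rather than hinging on a single lam.Bayes value such as 1.895207.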