Homework 14 Solution
====================

## (A) Recoding factor columns and re-running the model from HW13B.

> esoph2 = esoph
> levels(esoph2$agegp) = c(30, 40, 50, 60, 70, 77)
> levels(esoph2$alcgp) = c(10, 60, 110, 200)
> levels(esoph2$tobgp) = c(4, 15, 28, 42)
> for(i in 1:3) esoph2[[i]] = C(factor(as.character(esoph2[[i]])), contr.helmert)

> sapply(esoph2, class)
    agegp     alcgp     tobgp    ncases ncontrols 
 "factor"  "factor"  "factor" "numeric" "numeric" 

> sapply(esoph2[1:3], levels)
$agegp
[1] "30" "40" "50" "60" "70" "77"

$alcgp
[1] "10"  "110" "200" "60" 

$tobgp
[1] "15" "28" "4"  "42"

### Note: round-tripping the levels through as.character() re-sorts them
### lexicographically, which is why "110" and "200" precede "60", and "4" follows "28".

> sapply(esoph2[1:3], contrasts)
$agegp
   [,1] [,2] [,3] [,4] [,5]
30   -1   -1   -1   -1   -1
40    1   -1   -1   -1   -1
50    0    2   -1   -1   -1
60    0    0    3   -1   -1
70    0    0    0    4   -1
77    0    0    0    0    5

$alcgp
    [,1] [,2] [,3]
10    -1   -1   -1
110    1   -1   -1
200    0    2   -1
60     0    0    3

$tobgp
   [,1] [,2] [,3]
15   -1   -1   -1
28    1   -1   -1
4     0    2   -1
42    0    0    3

> EsophMod2B = step(update(EsophMod1, data=esoph2),
      scope = list(lower = cbind(ncases, ncontrols) ~ 1,
                   upper = cbind(ncases, ncontrols) ~ .^2), k=4)

> summary(EsophMod2B)

Call:
glm(formula = cbind(ncases, ncontrols) ~ agegp + alcgp, family = binomial,
    data = esoph2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.8979  -0.5592  -0.1995   0.5029   2.6250  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.91052    0.19391  -9.853  < 2e-16 ***
agegp1       0.76879    0.53231   1.444  0.14867    
agegp2       0.72606    0.18630   3.897 9.73e-05 ***
agegp3       0.45419    0.09635   4.714 2.43e-06 ***
agegp4       0.32568    0.06358   5.123 3.01e-07 ***
agegp5       0.21851    0.06715   3.254  0.00114 ** 
alcgp1       0.74753    0.13002   5.749 8.97e-09 ***
alcgp2       0.49176    0.08058   6.103 1.04e-09 ***
alcgp3      -0.02502    0.04284  -0.584  0.55925    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 227.241  on 87  degrees of freedom
Residual deviance:  64.572  on 79  degrees of freedom
AIC: 230.05

Number of Fisher Scoring iterations: 6

### The model differs in its coefficients from the previously fitted EsophMod2 model from 13B,
### but the anova table, the fitted and residual vectors, and all of the deviances are the same.

> c( sum(abs(EsophMod2B$fit - EsophMod2$fit)),
     sum(abs(EsophMod2B$resid - EsophMod2$resid)) )
[1] 5.970918e-15 9.303669e-14

> data.matrix(anova(EsophMod2B)) - data.matrix(anova(EsophMod2))
      Df      Deviance Resid. Df    Resid. Dev
NULL  NA            NA         0  0.000000e+00
agegp  0 -2.842171e-14         0  2.842171e-14
alcgp  0  4.263256e-14         0 -1.421085e-14

> sum(abs(c(model.matrix(EsophMod2B)[,2:9] -
       cbind(contr.helmert(6)[as.numeric(esoph2$agegp),],
             contr.helmert(4)[as.numeric(esoph2$alcgp),]))))
[1] 0

### This shows that the factor-derived columns of the model.matrix of the fitted model
### EsophMod2B are exactly the Helmert contrast-matrix rows for the corresponding levels.

## (B)

> set.seed(33312)
> xvec = rpois(80, 2.3)

## (i)

> lam.hat = exp(glm(xvec ~ 1, family=poisson)$coef)
> c(lam.hat, mean(xvec))
(Intercept)             
      2.125       2.125 

### The conjugate-family posterior is Gamma(1 + sum(xvec), 80.05), from the prior
### Gamma(1, .05): the shape gains sum(xvec) and the rate gains n = 80.

> lam.tild = (sum(xvec)+1)/80.05
> lam.tild
[1] 2.136165

## (ii)

> pseudo = cbind(PB = rpois(800, lam.hat),
                 NPB = xvec[sample(80, 800, replace=T)],
                 Bayes = rpois(800, rgamma(1, 1+sum(xvec), 80.05)))

## Note that the final Bayes-posterior-sampled lambda can be reproduced by replaying
## the RNG, skipping past the variates consumed by the earlier draws:

> set.seed(33312)
> tmp = runif(80+800+800)
> lam.Bayes = rgamma(1, 1+sum(xvec), 80.05)
> lam.Bayes
[1] 1.895207

### This accords well with the mean of the Bayes sample, found below,
### and we further check that it reproduces the same sample:

> sum(abs(rpois(800, lam.Bayes) - pseudo[,3]))
[1] 0

> apply(pseudo, 2, summary)   ### first descriptive summary
             PB     NPB Bayes
Min.    0.00000 0.00000  0.00
1st Qu. 1.00000 1.00000  1.00
Median  2.00000 2.00000  2.00
Mean    2.09875 2.17125  1.89
3rd Qu. 3.00000 3.00000  3.00
Max.    8.00000 6.00000  8.00

## So the lambda used in the parametric bootstrap, namely lam.hat = 2.125, is well represented
## by the means of both the parametric and nonparametric bootstrap samples, and the 3 quartiles
## of the two bootstrap samples are identical. However, the Bayes re-sampled dataset was built
## from a somewhat different lambda: lam.Bayes = 1.895207 is actually far in the left tail of
## the posterior gamma density from which it was sampled, as can be seen from:

> pgamma(1.895207, sum(xvec)+1, 80.05)
[1] 0.06568474

> apply(pseudo, 2, table)
$PB

  0   1   2   3   4   5   6   7   8 
 97 216 203 149  87  29  17   1   1 

$NPB

  0   1   2   3   4   5   6 
 92 157 219 247  52   5  28 

$Bayes

  0   1   2   3   4   5   6   7   8 
106 242 217 147  57  22   7   1   1 

### From these tables of 800 pseudo-sampled values, we can see that all 3 methods give roughly
### similar outputs. NPB is forced to have the same maximum as the original sample of 80, which
### is not really reasonable as the maximum of a comparable Poisson sample of size 800.

> apply(pseudo, 2, var)
      PB      NPB    Bayes 
2.111638 1.814191 1.755094 

### But we may think that the variances (which reflect lam.hat and lam.Bayes in the PB and
### Bayes cases) are actually a little too low for the NPB and Bayes samples. (The variance
### would get larger and more representative of the original population for Bayes if we
### sampled in multiple batches, taking the variability of lam.Bayes into account.)
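### The last remark can be sketched as follows. This is a minimal illustration, not part
### of the assigned solution: the object names (a.post, b.post, lam.batch, batches) and the
### batch count B = 200 are my own choices; the posterior Gamma(1 + sum(xvec), 80.05) is the
### one derived in part (B)(i). Drawing a fresh lambda for every batch lets the pseudo-data
### reflect the posterior spread of lambda, not just a single (possibly atypical) draw.

```r
set.seed(33312)
xvec   <- rpois(80, 2.3)          # same simulated data as in part (B)
a.post <- 1 + sum(xvec)           # posterior shape: prior shape 1 + total count
b.post <- 0.05 + 80               # posterior rate:  prior rate .05 + n
B <- 200                          # number of batches (an arbitrary illustrative choice)

lam.batch <- rgamma(B, a.post, b.post)                   # one posterior lambda per batch
batches   <- sapply(lam.batch, function(l) rpois(80, l)) # 80 x B matrix of pseudo-samples

c(mean = mean(batches), var = var(as.vector(batches)))
```

### Pooled over batches, the variance now includes the extra Var(lambda) component,
### E[lambda] + Var(lambda) = a.post/b.post + a.post/b.post^2, roughly 2.14 + 0.03,
### rather than hinging on a single lam.Bayes value such as 1.895207.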