Soutions to HW 16

Assigned 11/13/2017, due 11/22            14 points plus 2 points extra-credit
==============================

## Part (A)

> set.seed(3131)
  gdat = rgamma(87,3,1.2)
  S0 = 0.5*(qgamma(22/87,3,1.2) + qgamma(66/87,3,1.2))
> S0
[1] 2.381622   

## I  parameterize the likelihood by ab = c(a,b), where a = log(alpha) 
##    and b=log(beta), in order to keep the parameters unconstrained.

> NlogLik = function(ab, Datvec) 
     -sum(log(dgamma(Datvec, exp(ab[1]),exp(ab[2]))))

> tmp0 = nlm(NlogLik, c(0,0), hessian=T, Datvec=gdat)
$minimum
[1] 134.5173

$estimate
[1] 1.1802729 0.3523984

$gradient
[1] 1.203670e-06 2.939515e-06

$hessian
          [,1]      [,2]
[1,]  331.1551 -283.2245
[2,] -283.2245  283.2362

$code
[1] 1

$iterations
[1] 11

### Parametric bootstrap:
> NewPBdat = array(rgamma(87*5000, exp(tmp0$est[1]), exp(tmp0$est[2])), c(87,5000))

## Two ways to do it depending on whether we take our statistic to be
##        0.5*(qgamma(22/87,alph.hat,bet.hat) + qgamma(66/87,alph.hat,bet.hat))
## or     mean(sort(gdat)[c(22,66)])

S.hat1 = 0.5*(qgamma(22/87,exp(tmp$est[1]),exp(tmp$est[2])) + 
              qgamma(66/87,exp(tmp$est[1]),exp(tmp$est[2])))    ## =  2.189229

S.hat2 = mean(sort(gdat)[c(22,66)])                             ## =  2.13348

## CI from S.hat1 & S.hat2 definitions
> S1vecPB = S2vecPB = numeric(5000)
  for(i in 1:5000) {
      tmp =  nlm(NlogLik, c(0,0), hessian=T, Datvec=NewPBdat[,i])
      S1vecPB[i] = 0.5*(qgamma(22/87,exp(tmp$est[1]),exp(tmp$est[2])) + 
              qgamma(66/87,exp(tmp$est[1]),exp(tmp$est[2])))
      S2vecPB[i] = mean(sort(NewPBdat[,i])[c(22,66)]) }
  q1 = quantile(S1vecPB, prob=c(.05,.95))
  q2 = quantile(S2vecPB, prob=c(.05,.95))
  CI1 = 2*S.hat1 - q1[2:1]
  CI2 = 2*S.hat2 - q2[2:1]
  round(cbind(rbind(CI1=CI1, CI2=CI2), S.hat=c(S.hat1,S.hat2)),4)
       95%     5%  S.hat
CI1 1.9653 2.4035 2.1892
CI2 1.8521 2.3335 2.1335

## Nonparametric bootstrap intervals make sense (to me) only with the second statistic,
##   the one averaging the 22 and 66 order-statistics.

> NewNPBdat = array(gdat[sample(1:87, 87*5000, replace=T)], c(87,5000))
  q3 = quantile(apply(NewNPBdat,2, function(col) 
                   mean(sort(col)[c(22,66)]))-S.hat2, prob=c(.05,.95))
  CI3 = S.hat2 - q3[2:1]

round(cbind(rbind(CI1=CI1, CI2=CI2, CI3=CI3), S.hat=c(S.hat1,S.hat2, S.hat2)),4)
       95%     5%  S.hat
CI1 1.9653 2.4035 2.1892
CI2 1.8521 2.3335 2.1335
CI3 1.8586 2.3556 2.1335

#------------------------------------------------------------------------------
## Extra-credit: for the Delta method in a studentized version of PB interval, 
##   we use only S1 and S.hat1, using the function of parameters

gfcn = function(alph,bet) 0.5*(qgamma(22/87,alph,bet) + qgamma(66/87,alph,bet))

## We code a numerical approximation to the derivative as

gprim = function(alph,bet, tol=1.e-5) c((gfcn(alph+tol/2,bet)-gfcn(alph-tol/2,bet))/tol,
          gfcn(alph,bet+tol/2)-gfcn(alph,bet-tol/2))

## In terms of these functions, we could approximate via the Delta Method [if we thought 
##   the sample size 87 large enough] the distribution of S1.hat as 
##     N(S0, gprim(alph.hat, bet.hat) %*% solve(tmp0$hessian) %*% gprim(alph.hat, bet.hat))
##   which would justify a "large-sample" confidence interval

alph.hat = exp(tmp0$est[1])
bet.hat = exp(tmp0$est[2])
gpr0 = gprim(alph.hat, bet.hat) 
CI0 = gfcn(alph.hat,bet.hat) + c(-1,1)*qnorm(.95)*
               sqrt(c(gpr0 %*% solve(tmp0$hessian) %*% gpr0))
> CI0
[1] 2.021625 2.356832     ### This is narrower than the other CIs

### If we do not think the large-sample normal approximation is plausible, then
###  we do the studentized version of the parametric bootstrap:

>  SE2B = numeric(5000)
   for(i in 1:5000) {
      tmp =  nlm(NlogLik, c(0,0), hessian=T, Datvec=NewPBdat[,i])
      alph = exp(tmp$est[1])
      bet = exp(tmp$est[2])
      gpr = gprim(alph, bet)
      SE2B[i] = sqrt(c(gpr %*% solve(tmp$hessian) %*% gpr))  }

   q2B = quantile((S2vecPB-S.hat2)/SE2B, prob=c(.05,.95))
   CI2B = S.hat2 - q2B[2:1]*sqrt(c(gpr0 %*% solve(tmp0$hessian) %*% gpr0))

   round(cbind(rbind(CI0=CI0, CI1=CI1, CI2=CI2, CI2B=CI2B, CI3=CI3), 
          S.hat=c(S.hat2,S.hat1,S.hat2, S.hat2, S.hat1)),4)
        95%     5%  S.hat
CI0  2.0216 2.3568 2.1335
CI1  1.9653 2.4035 2.1892
CI2  1.8521 2.3335 2.1335
CI2B 1.8548 2.3528 2.1335
CI3  1.8586 2.3556 2.1892

##---------------------------------------------------------------------------

## part (B) 

fit.oats = lm(Y ~ ., data=oats)
mu.hat = fit.oats$fitted
sig.hat = summary(fit.oats)$sig

> table(oats$V)

Golden.rain  Marvellous     Victory 
         24          24          24 

aggregate(oats$Y, by=list(Variety=oats$V), mean)

      Variety        x
1 Golden.rain 104.5000
2  Marvellous 109.7917
3     Victory  97.6250

aggregate (fit.oats$fitted, by=list(Variety=oats$V), mean)
      Variety        x
1 Golden.rain 104.5000
2  Marvellous 109.7917
3     Victory  97.6250

### This verifies that the variety group means are the same for Y as for fitted Y,
###   as mentioned in class and in the problem statement.

## Our desired statistic is:

> mean(oats$Y[oats$V=="Marvellous"]) - mean(oats$Y[oats$V!="Marvellous"])
[1] 8.729167

## Using the fact that there are 24 indices for which oats$V=="Golden.rain",
##   24 for which $V=="Marvellous" and 24 for which $V=="Victory",
##   we create a vector cvec which has entries 1/24 for the "Marvellous" cases 
##   and -1/48 for all others, so that the desired contrast of variety averages
##   is    sum(cvec*Ycol)  for a given (initial or bootstrapped) Y-column Ycol

> cvec = (c(-1,2,-1)/48)[as.numeric(oats$V)]
  sum(cvec*oats$Y)
[1] 8.729167               ### This is the easiest way to calculate the statistic
                  ### when data are generated as in PB with fixed B,V,N levels

BootPB = array(rnorm(72*5000, mu.hat, sig.hat), c(72,5000))

bootinds = array(sample(1:72,72*5000,replace=T), c(72,5000))
BootNPB = array( oats$Y[bootinds], c(72,5000))

> summary(oats$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   53.0    86.0   102.5   104.0   121.2   174.0 
> summary(c(BootPB))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.909  84.706 103.655 103.978 122.117 216.944 
> summary(c(BootNPB))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   53.0    86.0   102.0   103.9   121.0   174.0 

### So the two bootstrapped arrays appear reasonable with respect to the 
###   overall behavior of the simulated Y's. 

### The desired statistic for the parametric bootstrap is calculated 
###    as   c(cvec %*% BootPB).
### But to calculate the group means for the nonparametric bootstrap 
###    requires a for-loop ! And the group means of the bootstrapped Y's
###    and of the fitted Y's from the model fitted to bootstrapped Y's are 
###    no longer the same, because the group sizes are no longer the same !!

NPBstat = NPBstatB = numeric(5000)
for(i in 1:5000) {
    gpsiz = table(as.numeric(oats$V)[bootinds[,i]])
    cvec2  = c(-0.5/gpsiz[1], 1/gpsiz[2], -0.5/gpsiz[3])
    NPBstat[i] = sum(cvec2*BootNPB[,i]) 
    NPBstatB[i] = sum(cvec2*lm(Y~., data=oats[bootinds[,i],])$fit)  }

qPB = quantile( c(cvec %*% BootPB), prob=c(.05,.95))
qNPB  = quantile( NPBstat, prob=c(.05,.95))

CI.PB = 2*sum(cvec*oats$Y) - qPB[2:1]
> CI.PB
      95%        5% 
 2.644333 15.022155               ### very reasonable

CI.NPB = 2*sum(cvec*oats$Y) - qNPB[2:1]   
> CI.NPB
      95%        5% 
-33.47984  62.77589 

CI.NPB2 = 2*sum(cvec*oats$Y) - quantile( NPBstatB, prob=c(.05,.95))[2:1]   
> CI.NPB2
      95%        5% 
-33.03317  62.44689          ### These are completely UNreasonable, in 
                       ### the sense that they are badly centered and far too wide !!!

#### The reason that this approach to Nonparametric Bootstrap gave absurd results
####   is that we did not maintain the blocking structure of the Y data in the NPB case.
###  How can we do it more reasonably ? The answer [WHICH I DID NOT EXPECT YOU TO 
FIND IN DOING THIS PROBLEM] is to do nonparametric bootstrap not on the original data
but rather on the residuals from the linear-model fit !!

###------------- Everything past here is beyond the scope of the assigned problem.

Yrsd = fit.oats$resid
> sd(Yrsd)
[1] 14.19373
> sig.hat
[1] 15.31302

BootNPB2 = fit.oats$fit + array(Yrsd[bootinds], c(72,5000))

CI.NPB3 = 2*sum(cvec*oats$Y) - quantile( c(cvec* BootNPB2), prob=c(.95,.05))
     95%       5% 
11.73785 20.37326   ### much more reasonable width, but still seems to indicate 
                  ###     lack of  fit of the original model


Consider the dataset "oats", and suppose that the target of inference is
the expected yield of the "Marvellous" variety minus the average of the 
"Victory" and "Golden.Rain" varieties. Let the statistic be the average 
of the fitted values under the model  lm(Y ~ ., data=oats), and let the 
parametric model be the regression model with iid normal mean 0 errors.
Find the 90% two-sided CI for the target using each of the methods

(a) Nonparametric bootstrap, non-studentized 
(b) Parametric bootstrap, non-studentized 

Use a reasonably large number of bootstrap iterations (B >=1000).

NOTE: IN THIS PROBLEM, YOU NEED TO FIT THE MODEL  lm(Y ~ ., data=oats)  
IN ORDER TO ESTIMATE THE PARAMETERS NEEDED FOR THE PARAMETRIC BOOTSTRAP.
HOWEVER, THE STATISTIC I HAVE ASKED YOU TO COMPUTE BOTH ON THE ORIGINAL
oats DATASET AND THE LATER BOOTSTRAP SAMPLES, WHICH IS 

(avg of "Marvellous" fitted Y's) - 0.5*(sum of avg of "Victory" fitted Y's 
      and of avg of "GoldenRain" fitted Y's)

TURNS OUT IN EACH POSSIBLE SAMPLE TO BE EXACTLY THE SAME AS THE 

(avg of "Marvellous" Y's) - 0.5*(sum of avg of "Victory" Y's 
      and of avg of "GoldenRain" Y's)

DO THE CALCULATION ON THE ORIGINAL SAMPLE TO CONVINCE YOURSELF OF THIS FACT!
NOTE THAT IN THE SECOND OF THESE STATISTIC EXPRESSIONS, NO MODEL FITTING 
IS INVOLVED.