Soutions to HW 16 Assigned 11/13/2017, due 11/22 14 points plus 2 points extra-credit ============================== ## Part (A) > set.seed(3131) gdat = rgamma(87,3,1.2) S0 = 0.5*(qgamma(22/87,3,1.2) + qgamma(66/87,3,1.2)) > S0 [1] 2.381622 ## I parameterize the likelihood by ab = c(a,b), where a = log(alpha) ## and b=log(beta), in order to keep the parameters unconstrained. > NlogLik = function(ab, Datvec) -sum(log(dgamma(Datvec, exp(ab[1]),exp(ab[2])))) > tmp0 = nlm(NlogLik, c(0,0), hessian=T, Datvec=gdat) $minimum [1] 134.5173 $estimate [1] 1.1802729 0.3523984 $gradient [1] 1.203670e-06 2.939515e-06 $hessian [,1] [,2] [1,] 331.1551 -283.2245 [2,] -283.2245 283.2362 $code [1] 1 $iterations [1] 11 ### Parametric bootstrap: > NewPBdat = array(rgamma(87*5000, exp(tmp0$est[1]), exp(tmp0$est[2])), c(87,5000)) ## Two ways to do it depending on whether we take our statistic to be ## 0.5*(qgamma(22/87,alph.hat,bet.hat) + qgamma(66/87,alph.hat,bet.hat)) ## or mean(sort(gdat)[c(22,66)]) S.hat1 = 0.5*(qgamma(22/87,exp(tmp$est[1]),exp(tmp$est[2])) + qgamma(66/87,exp(tmp$est[1]),exp(tmp$est[2]))) ## = 2.189229 S.hat2 = mean(sort(gdat)[c(22,66)]) ## = 2.13348 ## CI from S.hat1 & S.hat2 definitions > S1vecPB = S2vecPB = numeric(5000) for(i in 1:5000) { tmp = nlm(NlogLik, c(0,0), hessian=T, Datvec=NewPBdat[,i]) S1vecPB[i] = 0.5*(qgamma(22/87,exp(tmp$est[1]),exp(tmp$est[2])) + qgamma(66/87,exp(tmp$est[1]),exp(tmp$est[2]))) S2vecPB[i] = mean(sort(NewPBdat[,i])[c(22,66)]) } q1 = quantile(S1vecPB, prob=c(.05,.95)) q2 = quantile(S2vecPB, prob=c(.05,.95)) CI1 = 2*S.hat1 - q1[2:1] CI2 = 2*S.hat2 - q2[2:1] round(cbind(rbind(CI1=CI1, CI2=CI2), S.hat=c(S.hat1,S.hat2)),4) 95% 5% S.hat CI1 1.9653 2.4035 2.1892 CI2 1.8521 2.3335 2.1335 ## Nonparametric bootstrap intervals make sense (to me) only with the second statistic, ## the one averaging the 22 and 66 order-statistics. > NewNPBdat = array(gdat[sample(1:87, 87*5000, replace=T)], c(87,5000)) q3 = quantile(apply(NewNPBdat,2, function(col) mean(sort(col)[c(22,66)]))-S.hat2, prob=c(.05,.95)) CI3 = S.hat2 - q3[2:1] round(cbind(rbind(CI1=CI1, CI2=CI2, CI3=CI3), S.hat=c(S.hat1,S.hat2, S.hat2)),4) 95% 5% S.hat CI1 1.9653 2.4035 2.1892 CI2 1.8521 2.3335 2.1335 CI3 1.8586 2.3556 2.1335 #------------------------------------------------------------------------------ ## Extra-credit: for the Delta method in a studentized version of PB interval, ## we use only S1 and S.hat1, using the function of parameters gfcn = function(alph,bet) 0.5*(qgamma(22/87,alph,bet) + qgamma(66/87,alph,bet)) ## We code a numerical approximation to the derivative as gprim = function(alph,bet, tol=1.e-5) c((gfcn(alph+tol/2,bet)-gfcn(alph-tol/2,bet))/tol, gfcn(alph,bet+tol/2)-gfcn(alph,bet-tol/2)) ## In terms of these functions, we could approximate via the Delta Method [if we thought ## the sample size 87 large enough] the distribution of S1.hat as ## N(S0, gprim(alph.hat, bet.hat) %*% solve(tmp0$hessian) %*% gprim(alph.hat, bet.hat)) ## which would justify a "large-sample" confidence interval alph.hat = exp(tmp0$est[1]) bet.hat = exp(tmp0$est[2]) gpr0 = gprim(alph.hat, bet.hat) CI0 = gfcn(alph.hat,bet.hat) + c(-1,1)*qnorm(.95)* sqrt(c(gpr0 %*% solve(tmp0$hessian) %*% gpr0)) > CI0 [1] 2.021625 2.356832 ### This is narrower than the other CIs ### If we do not think the large-sample normal approximation is plausible, then ### we do the studentized version of the parametric bootstrap: > SE2B = numeric(5000) for(i in 1:5000) { tmp = nlm(NlogLik, c(0,0), hessian=T, Datvec=NewPBdat[,i]) alph = exp(tmp$est[1]) bet = exp(tmp$est[2]) gpr = gprim(alph, bet) SE2B[i] = sqrt(c(gpr %*% solve(tmp$hessian) %*% gpr)) } q2B = quantile((S2vecPB-S.hat2)/SE2B, prob=c(.05,.95)) CI2B = S.hat2 - q2B[2:1]*sqrt(c(gpr0 %*% solve(tmp0$hessian) %*% gpr0)) round(cbind(rbind(CI0=CI0, CI1=CI1, CI2=CI2, CI2B=CI2B, CI3=CI3), S.hat=c(S.hat2,S.hat1,S.hat2, S.hat2, S.hat1)),4) 95% 5% S.hat CI0 2.0216 2.3568 2.1335 CI1 1.9653 2.4035 2.1892 CI2 1.8521 2.3335 2.1335 CI2B 1.8548 2.3528 2.1335 CI3 1.8586 2.3556 2.1892 ##--------------------------------------------------------------------------- ## part (B) fit.oats = lm(Y ~ ., data=oats) mu.hat = fit.oats$fitted sig.hat = summary(fit.oats)$sig > table(oats$V) Golden.rain Marvellous Victory 24 24 24 aggregate(oats$Y, by=list(Variety=oats$V), mean) Variety x 1 Golden.rain 104.5000 2 Marvellous 109.7917 3 Victory 97.6250 aggregate (fit.oats$fitted, by=list(Variety=oats$V), mean) Variety x 1 Golden.rain 104.5000 2 Marvellous 109.7917 3 Victory 97.6250 ### This verifies that the variety group means are the same for Y as for fitted Y, ### as mentioned in class and in the problem statement. ## Our desired statistic is: > mean(oats$Y[oats$V=="Marvellous"]) - mean(oats$Y[oats$V!="Marvellous"]) [1] 8.729167 ## Using the fact that there are 24 indices for which oats$V=="Golden.rain", ## 24 for which $V=="Marvellous" and 24 for which $V=="Victory", ## we create a vector cvec which has entries 1/24 for the "Marvellous" cases ## and -1/48 for all others, so that the desired contrast of variety averages ## is sum(cvec*Ycol) for a given (initial or bootstrapped) Y-column Ycol > cvec = (c(-1,2,-1)/48)[as.numeric(oats$V)] sum(cvec*oats$Y) [1] 8.729167 ### This is the easiest way to calculate the statistic ### when data are generated as in PB with fixed B,V,N levels BootPB = array(rnorm(72*5000, mu.hat, sig.hat), c(72,5000)) bootinds = array(sample(1:72,72*5000,replace=T), c(72,5000)) BootNPB = array( oats$Y[bootinds], c(72,5000)) > summary(oats$Y) Min. 1st Qu. Median Mean 3rd Qu. Max. 53.0 86.0 102.5 104.0 121.2 174.0 > summary(c(BootPB)) Min. 1st Qu. Median Mean 3rd Qu. Max. 3.909 84.706 103.655 103.978 122.117 216.944 > summary(c(BootNPB)) Min. 1st Qu. Median Mean 3rd Qu. Max. 53.0 86.0 102.0 103.9 121.0 174.0 ### So the two bootstrapped arrays appear reasonable with respect to the ### overall behavior of the simulated Y's. ### The desired statistic for the parametric bootstrap is calculated ### as c(cvec %*% BootPB). ### But to calculate the group means for the nonparametric bootstrap ### requires a for-loop ! And the group means of the bootstrapped Y's ### and of the fitted Y's from the model fitted to bootstrapped Y's are ### no longer the same, because the group sizes are no longer the same !! NPBstat = NPBstatB = numeric(5000) for(i in 1:5000) { gpsiz = table(as.numeric(oats$V)[bootinds[,i]]) cvec2 = c(-0.5/gpsiz[1], 1/gpsiz[2], -0.5/gpsiz[3]) NPBstat[i] = sum(cvec2*BootNPB[,i]) NPBstatB[i] = sum(cvec2*lm(Y~., data=oats[bootinds[,i],])$fit) } qPB = quantile( c(cvec %*% BootPB), prob=c(.05,.95)) qNPB = quantile( NPBstat, prob=c(.05,.95)) CI.PB = 2*sum(cvec*oats$Y) - qPB[2:1] > CI.PB 95% 5% 2.644333 15.022155 ### very reasonable CI.NPB = 2*sum(cvec*oats$Y) - qNPB[2:1] > CI.NPB 95% 5% -33.47984 62.77589 CI.NPB2 = 2*sum(cvec*oats$Y) - quantile( NPBstatB, prob=c(.05,.95))[2:1] > CI.NPB2 95% 5% -33.03317 62.44689 ### These are completely UNreasonable, in ### the sense that they are badly centered and far too wide !!! #### The reason that this approach to Nonparametric Bootstrap gave absurd results #### is that we did not maintain the blocking structure of the Y data in the NPB case. ### How can we do it more reasonably ? The answer [WHICH I DID NOT EXPECT YOU TO FIND IN DOING THIS PROBLEM] is to do nonparametric bootstrap not on the original data but rather on the residuals from the linear-model fit !! ###------------- Everything past here is beyond the scope of the assigned problem. Yrsd = fit.oats$resid > sd(Yrsd) [1] 14.19373 > sig.hat [1] 15.31302 BootNPB2 = fit.oats$fit + array(Yrsd[bootinds], c(72,5000)) CI.NPB3 = 2*sum(cvec*oats$Y) - quantile( c(cvec* BootNPB2), prob=c(.95,.05)) 95% 5% 11.73785 20.37326 ### much more reasonable width, but still seems to indicate ### lack of fit of the original model Consider the dataset "oats", and suppose that the target of inference is the expected yield of the "Marvellous" variety minus the average of the "Victory" and "Golden.Rain" varieties. Let the statistic be the average of the fitted values under the model lm(Y ~ ., data=oats), and let the parametric model be the regression model with iid normal mean 0 errors. Find the 90% two-sided CI for the target using each of the methods (a) Nonparametric bootstrap, non-studentized (b) Parametric bootstrap, non-studentized Use a reasonably large number of bootstrap iterations (B >=1000). NOTE: IN THIS PROBLEM, YOU NEED TO FIT THE MODEL lm(Y ~ ., data=oats) IN ORDER TO ESTIMATE THE PARAMETERS NEEDED FOR THE PARAMETRIC BOOTSTRAP. HOWEVER, THE STATISTIC I HAVE ASKED YOU TO COMPUTE BOTH ON THE ORIGINAL oats DATASET AND THE LATER BOOTSTRAP SAMPLES, WHICH IS (avg of "Marvellous" fitted Y's) - 0.5*(sum of avg of "Victory" fitted Y's and of avg of "GoldenRain" fitted Y's) TURNS OUT IN EACH POSSIBLE SAMPLE TO BE EXACTLY THE SAME AS THE (avg of "Marvellous" Y's) - 0.5*(sum of avg of "Victory" Y's and of avg of "GoldenRain" Y's) DO THE CALCULATION ON THE ORIGINAL SAMPLE TO CONVINCE YOURSELF OF THIS FACT! NOTE THAT IN THE SECOND OF THESE STATISTIC EXPRESSIONS, NO MODEL FITTING IS INVOLVED.