Homework Problem 11, Due Monday March 3. ---------------------------------------- (This due-date has been extended because of the correction inserted below changing exp(-X) to exp(-|X|), on Feb.28.) (a) Generate 1000 independent batches of 20 i.i.d. random 2-vectors (X_i,Y_i), i=1,..,20 for which X has a standard logistic distribution, and conditionally given X Y - X ~ N( 0, 2*(1+exp(-|X|)) ) [CORRECTION HAS BEEN INSERTED HERE] Give some distributional checks or graphical evidence to show persuasively that you have simulated the joint (X,Y) distribution correctly. (b) The objective of this simulation is to evaluate E( (Y_(10) - X_(10))^2 ) (*) where the subscript _(10) refers to the 10th order statistic (ie the 10th smallest observation) within the batch. One method of evaluating the method (*) is by a direct simulation: perform (and time!) a simulation of 10000 batches and find (*) as a direct empirical average of 10000 quantities (Y_(10) - X_(10))^2 that you simulate. Explore whether you can speed up the simulation (while obtaining accuracy at least as good as with your first method) by using the method of control variates with control variate 2*(1+exp(-|X_(10)|)). SOLUTION: ## Methods will be done two ways, once for original ## form of problem assignment with variance 2*(1+exp(-x)) > xmat = array(rlogis(2e5),c(10000,20)) xvec = apply(xmat,1, function(xrow) sort(xrow)[10]) meanval = 2*cbind(empir = c(mean(1+exp(-xvec)), mean(1+exp(-abs(xvec)))), true = c(20/9,1 + (20/9)*(1-pbeta(.5,9,11)))) empir true [1,] 4.426639 4.444444 [2,] 3.441908 3.439124 ### Check these with numerical integrals: > 2*20*integrate(function(x) (1+exp(-x))*dlogis(x)* dbinom(9,19,plogis(x)), -20,20, abs.tol=1e-8)$val [1] 4.444444 > 2*20*integrate(function(x) (1+exp(-abs(x)))*dlogis(x)* dbinom(9,19,plogis(x)), -20,20, abs.tol=1e-8)$val [1] 3.439124 ### Note that the true values found in "meanval" were ### based on a transformation to the beta density using ### the change of variable z = plogis(x) ## As a further check on the simulation, we can overplot ## a very detailed histogram of the logistic matrix ## elements with the logistic density from -10 to 10: > hist(c(xmat), nclass=100, prob=T, xlim=c(-10,10)) lines(seq(-10,10,by=.2), dlogis(seq(-10,10,by=.2)) ## really perfect agreement. 2nd check is to ## do chi-square on X_(10) with 48 cells: > brkvec = c(-10,-1.3,-1.2, -1.1, seq(-1.0,1.0, by=.05), 1.1, 1.2, 1.3, 10) cvec = hist(xvec, breaks=brkvec, plot=F)$counts truvec = diff(pbeta(plogis(brkvec),10,11)*1e4) > chisq=sum((cvec-truvec)^2/truvec) c(chisq=chisq, pval=1-pchisq(chisq,47) chisq pval 58.9427587 0.1135001 ### So OK. ## Now complete the simulation ymat1 = xmat + rnorm(2e5)*sqrt(2*(1+exp(-xmat))) yvec1 = apply(ymat1, 1, function(yrow) sort(yrow)[10]) > hist(yvec1, nclass=60, prob=T) ## The ymat1 values are have a really extensive range: Min. 1st Qu. Median Mean 3rd Qu. Max. -252.80000 -1.55300 0.44500 0.04179 2.14900 2291.00000 ## But the yvec1 values are on a much more restricted range ## and have a smooth looking unimodal density > qqnorm(c((ymat1-xmat)/sqrt(2*(1+exp(-xmat))))) ### perfect normal fit. ### Now proceed to part (b), first using ymat1 and yvec1 as ### the problem was originally stated > c(mean = mean((yvec1-xvec)^2), se = sqrt(var((yvec1-xvec)^2))) mean se 0.654106 0.955463 ## This means the 95%CI has half-width > .Last.value[2]*qnorm(.975)/sqrt(1e4) ### = 0.01873 ## Now consider the speedup based on regression of ## (yvec1-xvec)^2) on (2*(1+exp(-xvec))-40/9) > CVfit = lm((yvec1-xvec)^2 ~ I(2*(1+exp(-xvec)))-40/9)) > summary(CVfit)$coef[1,1:2] Estimate Std. Error 0.657947449 0.009230393 ### versus .00955 ### So at the end of this we have very little speedup, ### because cor((yvec1-xvec)^2, exp(-xvec)) is 0.259 ### making the factor of improvement in CI width equal to > sqrt(1-.259^2) ### = 0.966 ### So my choice of control variate was not very good. ### Here is the same method applied to the second ## formulation of the problem. ymat2 = xmat + rnorm(2e5)*sqrt(2*(1+exp(-abs(xmat)))) yvec2 = apply(ymat2, 1, function(yrow) sort(yrow)[10]) > c(mean = mean((yvec2-xvec)^2), se = sqrt(var((yvec2-xvec)^2))) mean se 0.3690305 0.5255502 ## se for the average of yvec2 = .005256 > summary(lm((yvec2-xvec)^2 ~ I(2*(1+exp(-abs(xvec)))- meanval[2,2])))$coef[1,1:2] Estimate Std. Error 0.369056903 0.005255822 ### Now no speedup at all !! > cor((yvec2-xvec)^2, exp(-abs(xvec))) [1] -0.006356809 ### and this is why. -------------------------------------------------------------- ### One last try: to achieve larger correlations > 20*integrate(function(x) x^2*dlogis(x)* dbinom(9,19,plogis(x)), -20,20, abs.tol=1e-8)$val [1] 0.2103327 ymat = 5*xmat*(1 + runif(2e5,.5,1.5)) yvec = apply(ymat, 1, function(yrow) sort(yrow)[10]) > cor((yvec-xvec)^2, xvec^2) [1] 0.9490615 > c(mean = mean((yvec-xvec)^2), se = sqrt(var((yvec-xvec)^2)/1e4)) mean se 15.9733322 0.2354217 > summary(lm(I((yvec-xvec)^2) ~ I(xvec^2-0.2103327)))$coef[1,1:2] Estimate Std. Error 16.32143134 0.07419196 ## SO IN THIS FORMULATION THE PROBLEM WORKS WELL: IT IS STILL NOT PSSIBLE TO GET THE EXPECTATION ANALYTICALLY, BUT NOW THE CONTROL VARIATE IS SO WELL CORRELATED THAT WE GET A SPEEDUP BY A FACTOR .074/.235 = .315 which is essentially the same as > sqrt(1-0.9490615^2) ============================================================= Homework Problem 12, Due Wednesday March 5. ------------------------------------------ Consider the not uncommon situation where you may have 25 iid observations X_i , i=1,..,25, which you think are approximately normally distributed, but you are worried that the actual distribution may have heavier tails. A reasonable test statistic to test this is | max_i X_i - min_i X_i | / (*) (sqrt( sum_{i=1}^n (X_i-Xbar)^2 )/24) where Xbar is the sample mean of the 25 observations X_i, and the denominator is the usual sample standard deviation. This statistic is sometimes called the studentized range, and can easily be seen to be invariant under affine changes of variable X_i -> a X_i + b. FIND THE SIGNIFICANCE LEVEL OF THE HYPOTHESIS TEST WHICH REJECTS WHEN THE TEST STATISTIC (*) IS AT LEAST 5. Since this is a relatively rare event and involves a 25-dimensional integral, it is hard to evaluate exactly. So this exercise is to find the probability as accurately as you can by simulation. You can get only a crude evaluation of the probability by doing a direct simulation. So you should try to speed up the simulation to get acceptable accuracy within a feasible sample size. I suggest that you simulate from N(0,1) and EITHER try the method of control variates with control variable .04* sum_i |X_i|^3 (because the exact mean of this variable is easy to find for standard normal, and large values of the variable are associated with large values of the statistic(*)) OR try the method of importance sampling OR BOTH. (For the importance sampling distribution, we want to simulate from a distribution with tails something like the normal but heavier, but which is also easy to simulate. I suggest something like a symmetrized [ie multiplied by + or - 1 with equal probability] r.v. with the distribution of the square root of a chi-squared k degree of freedom variable where k is 2 or 5 or 10.) ## Direct simulation: > xmat = array(rnorm(2.5e5), c(1e4,25)) xvec = apply(xmat, 1, function(xrow) (max(xrow)-min(xrow))/sqrt(var(xrow))) > mean(xvec > 5) [1] 0.0137 ### OK, so this is the estimate ### with SE = sqrt(.0137*(1-.0137)/1e4) = 0.001162 ### Thus a simulation with 1e4 batches still has ### Coefficient of variation = SE/mean = 8.5% ### Now we try to get a couple of control variates. ### Consider : > zvec = .04*apply(xmat,1, function(xrow) sum(abs(xrow)^3)) - 4/sqrt(2*pi) > summary(lm(I(xvec>5) ~ zvec))$coef[1,1:2] Estimate Std. Error 0.013729607 0.001158474 ### almost no speedup because > cor(xvec >5, zvec) ## = 0.08362984 > zvec3 = apply(xmat,1, function(xrow) as.numeric(max(xrow)>2.5 & min(xrow)< -2.5)) - (1-2*pnorm(2.5)^25+(pnorm(2.5)-pnorm(-2.5))^25) > cor(zvec3, xvec>5) [1] 0.1533861 > summary(lm(I(xvec>5) ~ zvec3))$coef[1,1:2] Estimate Std. Error 0.013659746 0.001148787 > zvec6 = apply(xmat,1, function(xrow) .04*sum(xrow^4*(abs(xrow)>2)))- 3*(1-pgamma(2,2.5)) > summary(lm(I(xvec>5) ~ zvec+zvec3+zvec6))$coef ## r.sq=0.0787 Estimate Std. Error t value Pr(>|t|) (Intercept) 0.01375796 0.001115944 12.32854 1.135206e-34 zvec -0.06579192 0.003418092 -19.24814 4.248767e-81 zvec3 0.08048747 0.008465753 9.50742 2.405430e-21 zvec6 0.03095562 0.001279648 24.19073 1.090629e-125 ### Now the estimate and SE are: 0.01375796, 0.001115944 ### Still essentially no improvement. ### So continue by trying importance sampling instead. ### First try: skewed distributions made symmetric > xmat0 = array(sqrt(rgamma(2.5e5,.4))*sample(c(-1,1),2.5e5, rep=T), c(1e4,25)) ### Now try density ratio: > frat = function(x, shap=1) dnorm(x,0,sqrt(shap))/(abs(x)*dgamma(x^2,shap)) > summary(frat(xmat0[,1],.4)) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.1182 0.9215 1.0680 0.9992 1.1340 1.1550 ## NB mean = 1 as desired > xvec0 = apply(xmat0, 1, function(xrow) (max(xrow)-min(xrow))/sqrt(var(xrow))) > tmp = apply(xmat0[xvec0>5,],1,function(xrow) prod(frat(xrow,.4))) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.003734 0.108600 0.228900 0.348100 0.455300 3.122000 > sum(xvec0>5) [1] 370 ## NOTE Prob estimate = .348100*.0370 = 0.0128797 ## with SE = sqrt(var(c(tmp,rep(0,9630)))/1e4) = 0.0009807716 > xmat1 = array(sqrt(rgamma(2.5e5,.37))*sample(c(-1,1),2.5e5, rep=T), c(1e4,25)) xvec1 = apply(xmat1, 1, function(xrow) (max(xrow)-min(xrow))/sqrt(var(xrow))) tmp = apply(xmat1[xvec1>5,],1,function(xrow) prod(frat(xrow,.37))) > c(prob.est=sum(tmp)/1e4, SE=sqrt(var(c(tmp,rep(0,1e4-length(tmp))))/1e4)) prob.est SE 0.014742899 0.001242629 ### Try mixture of truncated normal with smaller variance in ### (-bpt,bpt) range and normal tails with larger variance in ### complement. New weights pwt, 1-pwt and new variance sig^2. > ImpSamp = function(Nrep, pwt, bpt, sig, nsiz=25) { x2mat = array(0, c(Nrep,nsiz)) emat = sample(c(F,T), Nrep*nsiz, rep=T, prob=c(1-pwt,pwt)) x2mat[emat] = qnorm(pnorm(-bpt)+(2*pnorm(bpt)-1)* runif(sum(emat))) x2mat[!emat] = sample(c(-1,1),sum(!emat), rep=T)*sig* qnorm(pnorm(bpt/sig)+pnorm(-bpt/sig)*runif(sum(!emat))) x2vec = apply(x2mat, 1, function(xrow) (max(xrow)-min(xrow))/sqrt(var(xrow))) r1 = (2*pnorm(bpt)-1)/pwt r2 = 2*sig*pnorm(-bpt/sig)/(1-pwt) fac0 = apply(x2mat,1, function(xrow) { ind = (abs(xrow)>bpt) m = sum(ind) r2^m * r1^(25-m) * exp(0.5*(1/sig^2-1)* sum(xrow[ind]^2)) }) list(Count=sum(x2vec>5), Intchk = mean(fac0), Prob.Est = sum(fac0[x2vec>5])/Nrep, facs = fac0, x2vec=x2vec, x2mat=x2mat) } > out1 = ImpSamp(1e4, .9, 1.2, 1.5) ## First check whether x2mat simulated correctly: > mean(abs(out1$x2mat)<1.2) [1] 0.900412 ## OK > hist(out1$x2mat[abs(out1$x2mat) < 1.2], prob=T) > lines(seq(-1.2,1.2,by=.02), dnorm(seq(-1.2,1.2, by=.02))/(2*pnorm(1.2)-1)) ### OK, perfect > hist(out1$x2mat[out1$x2mat > 1.2], xlim=c(1.2001,5.2), prob=T, nclass=20) > lines(seq(1.2,5.2,by=.05), dnorm(seq(1.2,5.2, by=.05), 0, sqrt(2))/pnorm(-1.2/sqrt(2))) ### OK ### Now check average of fac0 > mean(out1$facs) ### = 1.012370 OK !!! > sum(out1$Count) ## 1346 , much larger than for normal > out1$Prob [1] 0.01407891 ### might be right !? > sqrt(var(facs*(x2vec>5))/1e4) [1] 0.001027121 ### initial SE was = 0.001115, 10% improvement > out2 = ImpSamp(1e4, .94, 1.5, 1.5) > mean(out2$facs) [1] 0.9153158 > out2$Count [1] 657 > out2$Prob [1] 0.01146454 > sqrt(var(out2$facs*(out2$x2vec>5))/1e4) [1] 0.0008993807 > out1 = ImpSamp(1e4, .9, 1.3, 1.6) > mean(out1$facs) [1] 0.9937063 > out1$Count [1] 1462 > c(est=out1$Prob, SE=sqrt(var( out1$facs*(out1$x2vec>5))/1e4)) est SE 0.013193861 0.000814589 > c(est=out2$Prob, SE=sqrt(var( out2$facs*(out2$x2vec>5))/1e4)) est SE 0.013078840 0.001184197 > out1 = ImpSamp(1e4, .94, 1.5, 1.5) ## Repeats prev case > c(mean(out1$facs), sqrt(var(out1$facs)/1e4)) [1] 1.1519246 0.1503113 ### too variable !! > out1$Count ## 699 > c(est=out1$Prob, SE=sqrt(var( out1$facs*(out1$x2vec>5))/1e4)) est SE 0.015527429 0.001741408