Homework Problem 11, Due Monday March 3.
----------------------------------------
(This due-date has been extended because of the correction
inserted below changing exp(-X) to exp(-|X|), on Feb.28.)

(a) Generate 1000 independent batches of 
20 i.i.d. random 2-vectors (X_i,Y_i), i=1,..,20 
for which

X has a standard logistic distribution, and 
      conditionally given X

Y - X ~ N( 0, 2*(1+exp(-|X|)) )   [CORRECTION HAS BEEN INSERTED HERE]

Give some distributional checks or graphical 
evidence to show persuasively that you have 
simulated the joint (X,Y) distribution correctly.


(b) The objective of this simulation is to evaluate

E( (Y_(10) - X_(10))^2 )               (*)

where the subscript _(10) refers to the 10th order 
statistic (ie the 10th smallest observation) within 
the batch.

One method of evaluating the method (*) is by a 
direct simulation: perform (and time!) a simulation 
of 10000 batches and find (*) as a direct empirical 
average of 10000 quantities  (Y_(10) - X_(10))^2  that
you simulate.

Explore whether you can speed up the simulation (while 
obtaining accuracy at least as good as with your first
method) by using the method of control variates with 
control variate  2*(1+exp(-|X_(10)|)).

SOLUTION:

## Methods will be done two ways, once for original 
## form of problem assignment with variance 2*(1+exp(-x))
> xmat = array(rlogis(2e5),c(10000,20))
  xvec = apply(xmat,1, function(xrow) sort(xrow)[10])
  meanval = 2*cbind(empir = c(mean(1+exp(-xvec)),
     mean(1+exp(-abs(xvec)))), 
     true = c(20/9,1 + (20/9)*(1-pbeta(.5,9,11))))
        empir     true
[1,] 4.426639 4.444444
[2,] 3.441908 3.439124
### Check these with numerical integrals:
> 2*20*integrate(function(x) (1+exp(-x))*dlogis(x)*
    dbinom(9,19,plogis(x)), -20,20, abs.tol=1e-8)$val
[1] 4.444444
> 2*20*integrate(function(x) (1+exp(-abs(x)))*dlogis(x)*
    dbinom(9,19,plogis(x)), -20,20, abs.tol=1e-8)$val
[1] 3.439124
### Note that the true values found in "meanval" were 
###  based on a transformation to the beta density using 
###  the change of variable z = plogis(x)

## As a further check on the simulation, we can overplot
##  a very detailed histogram of the logistic matrix 
##  elements with the logistic density from -10 to 10:
> hist(c(xmat), nclass=100, prob=T, xlim=c(-10,10))
  lines(seq(-10,10,by=.2), dlogis(seq(-10,10,by=.2))
## really perfect agreement. 2nd check is to
## do chi-square on X_(10) with 48 cells:
> brkvec = c(-10,-1.3,-1.2, -1.1, 
        seq(-1.0,1.0, by=.05), 1.1, 1.2, 1.3, 10)
  cvec = hist(xvec, breaks=brkvec, plot=F)$counts
  truvec = diff(pbeta(plogis(brkvec),10,11)*1e4)
> chisq=sum((cvec-truvec)^2/truvec)
  c(chisq=chisq, pval=1-pchisq(chisq,47)
     chisq       pval 
58.9427587  0.1135001   ### So OK.

## Now complete the simulation
  ymat1 = xmat + rnorm(2e5)*sqrt(2*(1+exp(-xmat))) 
  yvec1 = apply(ymat1, 1, function(yrow) sort(yrow)[10])
> hist(yvec1, nclass=60, prob=T)
## The ymat1 values are have a really extensive range:
      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
-252.80000   -1.55300    0.44500    0.04179    2.14900 2291.00000 
## But the yvec1 values are on a much more restricted range
##  and have a smooth looking unimodal density

> qqnorm(c((ymat1-xmat)/sqrt(2*(1+exp(-xmat)))))
### perfect normal fit.

### Now proceed to part (b), first using ymat1 and yvec1 as 
###  the problem was originally stated

> c(mean = mean((yvec1-xvec)^2), se = sqrt(var((yvec1-xvec)^2))) 
    mean       se 
0.654106 0.955463  ## This means the 95%CI has half-width
> .Last.value[2]*qnorm(.975)/sqrt(1e4)  ### = 0.01873

## Now consider the speedup based on regression of
## (yvec1-xvec)^2) on (2*(1+exp(-xvec))-40/9)

> CVfit = lm((yvec1-xvec)^2 ~ I(2*(1+exp(-xvec)))-40/9))
> summary(CVfit)$coef[1,1:2]
   Estimate  Std. Error 
0.657947449 0.009230393  ### versus .00955

### So at the end of this we have very little speedup,
###   because  cor((yvec1-xvec)^2, exp(-xvec)) is 0.259
###   making the factor of improvement in CI width equal to
> sqrt(1-.259^2)  ### =  0.966
### So my choice of control variate was not very good.

### Here is the same method applied to the second 
##    formulation of the problem.

 ymat2 = xmat + rnorm(2e5)*sqrt(2*(1+exp(-abs(xmat)))) 
 yvec2 = apply(ymat2, 1, function(yrow) sort(yrow)[10])

> c(mean = mean((yvec2-xvec)^2), se = sqrt(var((yvec2-xvec)^2))) 
     mean        se 
0.3690305 0.5255502 ##  se for the average of yvec2 = .005256

> summary(lm((yvec2-xvec)^2 ~ I(2*(1+exp(-abs(xvec)))-
                      meanval[2,2])))$coef[1,1:2]
   Estimate  Std. Error 
0.369056903 0.005255822 ### Now no speedup at all !!
> cor((yvec2-xvec)^2, exp(-abs(xvec))) 
[1] -0.006356809        ### and this is why.
--------------------------------------------------------------
### One last try: to achieve larger correlations

> 20*integrate(function(x) x^2*dlogis(x)*
    dbinom(9,19,plogis(x)), -20,20, abs.tol=1e-8)$val
[1]  0.2103327
  ymat = 5*xmat*(1 + runif(2e5,.5,1.5))
  yvec = apply(ymat, 1, function(yrow) sort(yrow)[10])
> cor((yvec-xvec)^2, xvec^2)
[1] 0.9490615
> c(mean = mean((yvec-xvec)^2), se = sqrt(var((yvec-xvec)^2)/1e4)) 
      mean         se 
15.9733322  0.2354217 
> summary(lm(I((yvec-xvec)^2) ~ I(xvec^2-0.2103327)))$coef[1,1:2]
   Estimate  Std. Error 
16.32143134  0.07419196 

## SO IN THIS FORMULATION THE PROBLEM WORKS WELL: IT IS STILL
NOT PSSIBLE TO GET THE EXPECTATION ANALYTICALLY, BUT NOW THE 
CONTROL VARIATE IS SO WELL CORRELATED  THAT WE GET A SPEEDUP 
BY A FACTOR  .074/.235 = .315  which is essentially the same as
> sqrt(1-0.9490615^2) 

=============================================================

Homework Problem 12, Due Wednesday March 5.
------------------------------------------

Consider the not uncommon situation where you may 
have 25 iid observations  X_i , i=1,..,25, which you 
think are approximately normally distributed, but 
you are worried that the actual distribution may have 
heavier tails. A reasonable test statistic to test 
this is 

 | max_i X_i - min_i X_i | /                   (*)
      (sqrt( sum_{i=1}^n (X_i-Xbar)^2 )/24)

where Xbar is the sample mean of the 25 observations X_i, 
and the denominator is the usual sample standard deviation.
This statistic is sometimes called the studentized range, 
and can easily be seen to be invariant under affine 
changes of variable  X_i -> a X_i + b.

FIND THE SIGNIFICANCE LEVEL OF THE HYPOTHESIS TEST WHICH
REJECTS WHEN THE TEST STATISTIC (*) IS AT LEAST 5.

Since this is a relatively rare event and involves a 
25-dimensional integral, it is hard to evaluate exactly.
So this exercise is to find the probability as accurately 
as you can by simulation. 

You can get only a crude evaluation of the probability by 
doing a direct simulation. 

So you should try to speed up the simulation to get 
acceptable accuracy within a feasible sample size. I suggest 
that you simulate from N(0,1) and EITHER try the method 
of control variates with control variable  .04* sum_i |X_i|^3
(because the exact mean of this variable is easy to find for
standard normal, and large values of the variable are 
associated with large values of the statistic(*)) 
OR try the method of importance sampling OR BOTH. 
(For the importance sampling distribution, we want 
to simulate from a distribution with tails something like 
the normal but heavier, but which is also easy to simulate. 
I suggest something like a symmetrized [ie multiplied by 
+ or - 1 with equal probability] r.v. with the distribution 
of the square root of a chi-squared k degree of freedom 
variable where k is 2 or 5 or 10.)

## Direct simulation:

> xmat = array(rnorm(2.5e5), c(1e4,25))
  xvec = apply(xmat, 1, function(xrow) 
        (max(xrow)-min(xrow))/sqrt(var(xrow)))
> mean(xvec > 5)
[1] 0.0137             ### OK, so this is the estimate
### with SE = sqrt(.0137*(1-.0137)/1e4) = 0.001162
### Thus a simulation with 1e4 batches still has 
### Coefficient of variation = SE/mean = 8.5%

### Now we try to get a couple of control variates. 
### Consider  :

> zvec = .04*apply(xmat,1, function(xrow) 
     sum(abs(xrow)^3)) - 4/sqrt(2*pi)
> summary(lm(I(xvec>5) ~ zvec))$coef[1,1:2]
   Estimate  Std. Error 
0.013729607 0.001158474  ### almost no speedup because
> cor(xvec >5, zvec) ## = 0.08362984

> zvec3 = apply(xmat,1, function(xrow) 
     as.numeric(max(xrow)>2.5 & min(xrow)< -2.5)) -
   (1-2*pnorm(2.5)^25+(pnorm(2.5)-pnorm(-2.5))^25)
> cor(zvec3, xvec>5)
[1] 0.1533861
> summary(lm(I(xvec>5) ~ zvec3))$coef[1,1:2]
   Estimate  Std. Error 
0.013659746 0.001148787 
> zvec6 = apply(xmat,1, function(xrow)
     .04*sum(xrow^4*(abs(xrow)>2)))- 3*(1-pgamma(2,2.5))

> summary(lm(I(xvec>5) ~ zvec+zvec3+zvec6))$coef ## r.sq=0.0787
                Estimate  Std. Error   t value      Pr(>|t|)
(Intercept)  0.01375796 0.001115944  12.32854  1.135206e-34
zvec        -0.06579192 0.003418092 -19.24814  4.248767e-81
zvec3        0.08048747 0.008465753   9.50742  2.405430e-21
zvec6        0.03095562 0.001279648  24.19073 1.090629e-125

### Now the estimate and SE are: 0.01375796, 0.001115944  
### Still essentially no improvement.

### So continue by trying importance sampling instead. 
### First try: skewed distributions made symmetric

> xmat0 = array(sqrt(rgamma(2.5e5,.4))*sample(c(-1,1),2.5e5,
            rep=T), c(1e4,25))
### Now try density ratio:
> frat = function(x, shap=1) 
      dnorm(x,0,sqrt(shap))/(abs(x)*dgamma(x^2,shap))
> summary(frat(xmat0[,1],.4))
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.1182  0.9215  1.0680  0.9992  1.1340  1.1550 
## NB mean = 1 as desired
>  xvec0 = apply(xmat0, 1, function(xrow) 
        (max(xrow)-min(xrow))/sqrt(var(xrow)))
> tmp = apply(xmat0[xvec0>5,],1,function(xrow) prod(frat(xrow,.4)))
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.003734 0.108600 0.228900 0.348100 0.455300 3.122000 
> sum(xvec0>5)
[1] 370
## NOTE Prob estimate = .348100*.0370 = 0.0128797
## with SE = sqrt(var(c(tmp,rep(0,9630)))/1e4) = 0.0009807716

> xmat1 = array(sqrt(rgamma(2.5e5,.37))*sample(c(-1,1),2.5e5,
            rep=T), c(1e4,25))
  xvec1 = apply(xmat1, 1, function(xrow) 
        (max(xrow)-min(xrow))/sqrt(var(xrow)))
  tmp = apply(xmat1[xvec1>5,],1,function(xrow) prod(frat(xrow,.37)))
> c(prob.est=sum(tmp)/1e4, SE=sqrt(var(c(tmp,rep(0,1e4-length(tmp))))/1e4))
   prob.est          SE 
0.014742899 0.001242629 

###   Try mixture of truncated normal with smaller variance in 
###   (-bpt,bpt) range and normal tails with larger variance in
###   complement. New weights  pwt, 1-pwt and new variance sig^2.

> ImpSamp = function(Nrep, pwt, bpt, sig, nsiz=25) {
      x2mat = array(0, c(Nrep,nsiz))
      emat = sample(c(F,T), Nrep*nsiz, rep=T, prob=c(1-pwt,pwt))
      x2mat[emat] = qnorm(pnorm(-bpt)+(2*pnorm(bpt)-1)*
          runif(sum(emat)))
      x2mat[!emat] = sample(c(-1,1),sum(!emat), rep=T)*sig*
       qnorm(pnorm(bpt/sig)+pnorm(-bpt/sig)*runif(sum(!emat)))
      x2vec = apply(x2mat, 1, function(xrow) 
        (max(xrow)-min(xrow))/sqrt(var(xrow)))
      r1 = (2*pnorm(bpt)-1)/pwt
      r2 = 2*sig*pnorm(-bpt/sig)/(1-pwt)
      fac0 = apply(x2mat,1, function(xrow) {
          ind = (abs(xrow)>bpt)
          m = sum(ind)
          r2^m * r1^(25-m) * exp(0.5*(1/sig^2-1)*
             sum(xrow[ind]^2)) })
      list(Count=sum(x2vec>5), Intchk = mean(fac0),
          Prob.Est = sum(fac0[x2vec>5])/Nrep,
          facs = fac0, x2vec=x2vec, x2mat=x2mat) }

> out1 = ImpSamp(1e4, .9, 1.2, 1.5)
## First check whether x2mat simulated correctly:
> mean(abs(out1$x2mat)<1.2)
[1] 0.900412  ## OK
> hist(out1$x2mat[abs(out1$x2mat) < 1.2], prob=T)
> lines(seq(-1.2,1.2,by=.02), dnorm(seq(-1.2,1.2,
      by=.02))/(2*pnorm(1.2)-1))  ### OK, perfect
> hist(out1$x2mat[out1$x2mat > 1.2], xlim=c(1.2001,5.2),
   prob=T, nclass=20)
> lines(seq(1.2,5.2,by=.05), dnorm(seq(1.2,5.2,
   by=.05), 0, sqrt(2))/pnorm(-1.2/sqrt(2))) ### OK
### Now check average of fac0
> mean(out1$facs)  ### = 1.012370   OK !!!
> sum(out1$Count) ## 1346 , much larger than for normal
> out1$Prob
[1] 0.01407891   ### might be right !?
> sqrt(var(facs*(x2vec>5))/1e4)
[1] 0.001027121   ### initial SE was = 0.001115, 10% improvement

> out2 = ImpSamp(1e4, .94, 1.5, 1.5)
> mean(out2$facs)
[1] 0.9153158
> out2$Count
[1] 657
> out2$Prob
[1] 0.01146454
> sqrt(var(out2$facs*(out2$x2vec>5))/1e4)
[1] 0.0008993807

> out1 = ImpSamp(1e4, .9, 1.3, 1.6)
> mean(out1$facs)
[1] 0.9937063
> out1$Count
[1] 1462
> c(est=out1$Prob, SE=sqrt(var(
        out1$facs*(out1$x2vec>5))/1e4))
        est          SE 
0.013193861 0.000814589

> c(est=out2$Prob, SE=sqrt(var(
       out2$facs*(out2$x2vec>5))/1e4))
        est          SE 
0.013078840 0.001184197 

> out1 = ImpSamp(1e4, .94, 1.5, 1.5)  ## Repeats prev case
> c(mean(out1$facs), sqrt(var(out1$facs)/1e4))
[1] 1.1519246 0.1503113   ### too variable !!
> out1$Count  ## 699
> c(est=out1$Prob, SE=sqrt(var(
        out1$facs*(out1$x2vec>5))/1e4))
        est          SE 
0.015527429 0.001741408