Demo for Linear Model Selection in R --- MATH 420 ===================================== 4/4/11 (1) Generate data: 20 columns of 1000 entries all indep N(0,1) coefficients c(2,1.5,1,.8, .6, .5, .4, .2, rep(.1,6), rep(0.5,6)) > Xmat = array(rnorm(2e4), c(1000,20), dimnames=list( NULL, paste("X",1:20,sep=""))) btru = c(2,1.5,1,.8, .6, .5, .4, .2, rep(.1,6), rep(0.05,6)) Yvec = -2.5 + c(Xmat %*% btru) + sqrt(5)*rnorm(1000) > Dfr = cbind.data.frame(Y=Yvec, Xmat) > names(Dfr) [1] "Y" "X1" "X2" "X3" "X4" "X5" "X6" "X7" "X8" "X9" "X10" "X11" [13] "X12" "X13" "X14" "X15" "X16" "X17" "X18" "X19" "X20" (2) Step 1: preliminary screen > round(cor(Yvec, Xmat), 3) X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11 X12 [1,] 0.564 0.438 0.273 0.222 0.178 0.191 0.069 0.021 0.009 0.006 0.026 -0.007 X13 X14 X15 X16 X17 X18 X19 X20 [1,] -0.011 0.036 0.057 0.031 -0.044 0.043 0.026 -0.023 ### Compare with most significant "t-statistics" for combined fitted model > fullmod = lm(Y ~ . , data=Dfr) round(summary(fullmod)$coef[,3],2) (Intercept) X1 X2 X3 X4 X5 -31.96 27.51 21.15 14.97 10.78 8.88 X6 X7 X8 X9 X10 X11 8.50 5.41 2.84 0.84 1.03 2.27 X12 X13 X14 X15 X16 X17 0.41 0.63 1.14 1.87 0.32 -0.69 X18 X19 X20 0.49 0.51 -0.49 ### So let's say initially that we would choose all t-statistic ### values 10 or greater. NOTE that the correlations gowith a ### bottom-up "FORWARD" selection strategy, while the full-model ### looks at a "BACKWARD" selection strategy (we would remove ### columns with very small, insignificant coefficients). (3) Step 2: implement an AUTOMATIC (AIC based) selection. For these models, apart from constant, AIC is N*log(RSS_p) + k*p closely related to Cp (default has k=2) > extractAIC(fullmod) [1] 21.000 1660.105 > nullmod = lm(Y~1, data=Dfr) > extractAIC(nullmod) [1] 1.000 2703.885 > 1000*log(sum(fullmod$resid^2)/sum(nullmod$resid^2))+20*2 [1] -1043.780 ### same as 1660.105-2703.885 > forwd = step(nullmod, scope=list(lower=nullmod$terms, upper= fullmod$terms), direction = "forward", trace = 1) Start: AIC=2703.89 Y ~ 1 ... Step: AIC=2323.04 Y ~ X1 ... Step: AIC=2066.38 Y ~ X1 + X2 Df Sum of Sq RSS AIC + X3 1 1060.04 6788.9 1923.3 + X4 1 639.69 7209.3 1983.4 + X6 1 411.90 7437.1 2014.5 + X5 1 403.20 7445.8 2015.7 + X7 1 144.73 7704.2 2049.8 + X15 1 47.75 7801.2 2062.3 + X8 1 27.92 7821.1 2064.8 7849.0 2066.4 + X11 1 14.98 7834.0 2066.5 + X14 1 14.95 7834.0 2066.5 + X9 1 8.72 7840.3 2067.3 + X18 1 6.81 7842.2 2067.5 + X10 1 5.67 7843.3 2067.7 + X12 1 4.25 7844.7 2067.8 + X16 1 3.17 7845.8 2068.0 + X20 1 0.13 7848.8 2068.4 + X13 1 0.12 7848.8 2068.4 + X17 1 0.05 7848.9 2068.4 + X19 1 0.00 7849.0 2068.4 ... Step: AIC=1650.21 Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 Df Sum of Sq RSS AIC + X11 1 27.5767 5087.6 1646.8 + X15 1 18.1589 5097.0 1648.7 5115.1 1650.2 + X14 1 8.3171 5106.8 1650.6 + X10 1 5.9509 5109.2 1651.0 + X9 1 3.9657 5111.2 1651.4 + X13 1 2.7572 5112.4 1651.7 + X17 1 1.6526 5113.5 1651.9 + X12 1 1.0241 5114.1 1652.0 + X19 1 1.0171 5114.1 1652.0 + X20 1 0.9779 5114.2 1652.0 + X18 1 0.8205 5114.3 1652.0 + X16 1 0.6388 5114.5 1652.1 Step: AIC=1646.8 Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X11 ... Step: AIC=1645.44 Y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X11 + X15 ### so final AIC=1645.44, with p=11, RSS = 5070.5, r.sq=.660 ### full model with p=21 had AIC = 1660.1, RSS=5043.5, r.sq=.662 (4) Similarly could do automatic search "backward" from full model > backwd = step(fullmod, scope=list(lower=nullmod$terms, upper= fullmod$terms), direction = "backward") #### SAME final model !! (5) Try partial correlation by method mentioned in class starting from matrix with column of 1's X1 .. X4 Xtmp = cbind(rep(1,1000), Xmat[,1:4]) projMat = diag(1000) - Xtmp %*% solve(t(Xtmp) %*% Xtmp) %*% t(Xtmp) ### almost instantaneous > newYX = projMat %*% cbind(Y=Yvec, Xmat[,5:20]) > dimnames(newYX)[[2]] [1] "Y" "X5" "X6" "X7" "X8" "X9" "X10" "X11" "X12" "X13" "X14" "X15" [13] "X16" "X17" "X18" "X19" "X20" > round(cor(newYX[,1],newYX[,2:16]),3) X5 X6 X7 X8 X9 X10 X11 X12 X13 X14 X15 X16 [1,] 0.256 0.253 0.15 0.065 0.03 0.036 0.064 -0.001 0.018 0.042 0.077 0.004 X17 X18 X19 [1,] 0.003 0.029 0.019 ### Note that the larger terms here are X5..X8, X11, and X15