Solution to HW 20. ================= > dim(mtcars) [1] 32 11 > names(mtcars) [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" [11] "carb" > maxfit = lm(mpg ~ ., data=mtcars) > extractAIC(maxfit) [1] 11.00000 70.89774 > tmpfit = step(frstfit, list(upper= ~ qsec+vs+gear+carb + cyl+ disp+ hp+ drat+ wt+ am, lower = ~1), dire="for", k=4) Step: AIC= 69.2 mpg ~ wt + cyl Df Sum of Sq RSS AIC 191.172 69.198 + hp 1 14.551 176.621 70.665 > summary(tmpfit)$coef Estimate Std. Error t value Pr(>|t|) (Intercept) 39.686261 1.7149840 23.140893 3.043182e-20 wt -3.190972 0.7569065 -4.215808 2.220200e-04 cyl -1.507795 0.4146883 -3.635972 1.064282e-03 ### Running same "step" function with upper = ~qsec+vs+gear+carb + (cyl+ disp+ hp+ drat+ wt+ am)^2 ### leads to "best" model Step: AIC= 66.89 mpg ~ wt + cyl + wt:cyl ### but improper termination !!? > fitfinal = update(tmpfit, formula=.~. + wt*cyl) > extractAIC(fitfinal, k=4) [1] 4.00000 66.89147 > summary(fitfinal)$coef Estimate Std. Error t value Pr(>|t|) (Intercept) 54.3068062 6.127535 8.862749 1.288912e-09 wt -8.6555590 2.320122 -3.730648 8.609666e-04 cyl -3.8032187 1.005028 -3.784193 7.471791e-04 wt:cyl 0.8083947 0.327322 2.469723 1.988242e-02 ### STILL A VERY STRONG AND SIGNIFICANT MODEL > backfit = step(lm(mpg ~ qsec+vs+gear+carb + cyl+ disp+ hp+ drat+ wt+ am + wt:cyl, data=mtcars), k=4) Step: AIC= 65.75 mpg ~ qsec + cyl + wt + cyl:wt Df Sum of Sq RSS AIC 133.680 65.751 - qsec 1 23.296 156.976 66.891 - cyl:wt 1 46.925 180.605 71.378 > summary(backfit)$coef Estimate Std. Error t value Pr(>|t|) (Intercept) 41.642888 8.2001797 5.078290 2.467874e-05 qsec 0.757161 0.3490572 2.169160 3.904427e-02 cyl -3.392655 0.9632556 -3.522072 1.543295e-03 wt -10.819534 2.3977297 -4.512408 1.128707e-04 cyl:wt 0.976600 0.3172252 3.078570 4.734899e-03 ### STILL VERY STRONG and all coefficients sensible > plot(backfit) ### Residuals plot generall sensible, but a few dramatic outliers! (Toyota Corolla, Corona, and Fiat 128): without these three (17,20,21) get > backf2 = step(lm(mpg ~ qsec+vs+gear+carb + cyl+ disp+ hp+ drat+ wt+ am + wt:cyl, data=mtcars[c(1:17,19,22:32),]), k=4) Step: AIC= 44.67 mpg ~ cyl + wt + am + cyl:wt ### model very similar Df Sum of Sq RSS AIC 67.899 44.671 - am 1 12.966 80.865 45.739 - cyl:wt 1 22.229 90.128 48.884 > summary(backf2)$coef Estimate Std. Error t value Pr(>|t|) (Intercept) 54.382372 5.6124728 9.689556 9.067462e-10 cyl -3.511505 0.7878346 -4.457160 1.651637e-04 wt -8.457864 1.9597174 -4.315859 2.365715e-04 am -2.044559 0.9550441 -2.140801 4.265448e-02 cyl:wt 0.718421 0.2563001 2.803046 9.859160e-03 ## previous bad residuals were + - 4, now (Firebird, Imperial, Camaro) + - 3 ### Sequential selection seems dramatically dependent on the order ### chosen, partly because of outliers. ### NOW SAS data home.mtcars; infile "mtcars.dat"; input carname $ mpg cyl disp hp drat wt qsec vs am gear carb ; V23 = cyl*disp; V24 = cyl*hp; V25 = cyl*drat; V26 = cyl*wt; V29 = cyl*am; V34 = disp*hp; V35 = disp*drat; V36 = disp*wt; V39 = disp*am; V45 = hp*drat; V46 = hp*wt; V49 = hp*am; V56 = drat*wt; V59 = drat*am; V69 = wt*am; run; ### 32 x 27 !!! proc reg data=home.mtcars; model mpg = cyl disp hp drat wt qsec vs am gear carb V23 V24 V25 V26 V29 V34 V35 V36 V39 V45 V46 V49 V56 V59 V69 / selection = stepwise slentry = .1 slstay = .05 ; run; Summary of Stepwise Selection Variable Variable Number Partial Model Step Entered Removed Vars In R-Square R-Square C(p) F Value Pr > F 1 V26 1 0.7821 0.7821 73.8147 107.69 <.0001 ###ONLY variable entered is wt*cyl which for us means that we should ### enter also wt*cyl. But as we saw in R, that is not quite the ### best model !!