PROBS 9.3, 9.4, p. 316, 5th ED
================================

Problem 9.3, p. 316, 5th Ed
-----------------------------------------------

Regress the number of books in college libraries on ENROLLMENT, HIGHEST
DEGREE offered, and AREA of the college. Use FORWARD selection.

  BOOKS in millions, ENROLL(ment) in thousands
  HIGHEST DEGREE: 1=BA, 2=MA, 3=PHD
  AREA in acres

Note: The numeric coding of DEGREE is arbitrary. A different assignment
of numbers to degrees gives a different interpretation of the DEGREE
coefficient. Dummy variables are needed instead, as in Prob. 9.4 below.

OPTION PS=45 LS=70;
DATA LIBRARY;
INPUT BOOKS ENROLL DEGREE AREA;
DATALINES;
4 5 3 20
5 8 3 40
10 40 3 100
1 4 2 50
5 2 1 300
2 8 1 400
7 30 3 40
4 20 2 200
1 10 2 5
1 12 1 100
;
PROC REG DATA=LIBRARY;
MODEL BOOKS = ENROLL DEGREE AREA / SELECTION=FORWARD;
RUN;

The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: BOOKS

Number of Observations Read    10
Number of Observations Used    10

Forward Selection: Step 1

Variable ENROLL Entered: R-Square = 0.5602 and C(p) = 6.0588

                     Analysis of Variance
                         Sum of        Mean
Source            DF    Squares      Square   F Value   Pr > F
Model              1   43.69702    43.69702     10.19   0.0128
Error              8   34.30298     4.28787
Corrected Total    9   78.00000

            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept     1.53094    1.01341      9.78565      2.28   0.1693
ENROLL        0.17763    0.05564     43.69702     10.19   0.0128

Forward Selection: Step 2

Variable DEGREE Entered: R-Square = 0.6370 and C(p) = 5.9529

                     Analysis of Variance
                         Sum of        Mean
Source            DF    Squares      Square   F Value   Pr > F
Model              2   49.68736    24.84368      6.14   0.0288
Error              7   28.31264     4.04466
Corrected Total    9   78.00000

            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept    -0.19936    1.72923      0.05376      0.01   0.9115
ENROLL        0.14147    0.06167     21.28156      5.26   0.0555
DEGREE        1.06332    0.87373      5.99034      1.48   0.2630

Bounds on condition number: 1.3023, 5.2094
----------------------------------------------------------------------

Forward Selection: Step 3

Variable AREA Entered: R-Square = 0.7812 and C(p) = 4.0000

                     Analysis of Variance
                         Sum of        Mean
Source            DF    Squares      Square   F Value   Pr > F
Model              3   60.93207    20.31069      7.14   0.0209
Error              6   17.06793     2.84465
Corrected Total    9   78.00000

            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept    -4.39541    2.56070      8.38127      2.95   0.1369
ENROLL        0.11445    0.05348     13.02949      4.58   0.0761
DEGREE        2.51818    1.03555     16.82124      5.91   0.0510
AREA          0.01208    0.00608     11.24472      3.95   0.0940

Bounds on condition number: 2.6011, 18.158
----------------------------------------------------------------------

All variables have been entered into the model.

                    Summary of Forward Selection

       Variable    Number    Partial     Model
Step   Entered     Vars In   R-Square   R-Square     C(p)   F Value   Pr > F
   1   ENROLL            1     0.5602     0.5602   6.0588     10.19   0.0128
   2   DEGREE            2     0.0768     0.6370   5.9529      1.48   0.2630
   3   AREA              3     0.1442     0.7812   4.0000      3.95   0.0940

=====================================================================

Problem 9.4, p. 316, 5th Ed
-----------------------------------------------

Now replace DEGREE with the dummy variables MASTERS and DOCTORATE. The
reference level is DEGREE=1 (BA). AREA enters as LOG_AREA = LOG(AREA).
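As a rough illustration outside SAS, the 0-1 coding can be sketched in a few lines of Python. The helper name degree_dummies is hypothetical and not part of the problem; it only mirrors the logical expressions (DEGREE EQ 2) and (DEGREE EQ 3) used in the DATA step for this problem, with DEGREE=1 as the reference level.

```python
# DEGREE column of the LIBRARY data, in input order (1=BA, 2=MA, 3=PHD).
DEGREE = [3, 3, 3, 2, 1, 1, 3, 2, 2, 1]


def degree_dummies(degree):
    """Return (MASTERS, DOCTORATE) as 0-1 indicators.

    Hypothetical helper: DEGREE=1 (BA) is the reference level, so it maps
    to (0, 0) and its effect is absorbed into the intercept.
    """
    masters = int(degree == 2)
    doctorate = int(degree == 3)
    return masters, doctorate


coded = [degree_dummies(d) for d in DEGREE]

for d, (masters, doctorate) in zip(DEGREE, coded):
    print(f"DEGREE={d}  MASTERS={masters}  DOCTORATE={doctorate}")
```

Unlike the 1-2-3 coding of Problem 9.3, this coding does not force the BA-to-MA step to equal the MA-to-PHD step: each dummy gets its own coefficient, measured relative to the BA reference.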
OPTION PS=45 LS=70;
DATA LIBRARY;
INPUT BOOKS ENROLL DEGREE AREA;
MASTERS = (DEGREE EQ 2);
***New variable MASTERS with values of 1 when DEGREE=2 and 0 otherwise;
DOCTORATE = (DEGREE EQ 3);
***New variable DOCTORATE with values of 1 when DEGREE=3 and 0 otherwise;
LOG_AREA = LOG(AREA);
DATALINES;
4 5 3 20
5 8 3 40
10 40 3 100
1 4 2 50
5 2 1 300
2 8 1 400
7 30 3 40
4 20 2 200
1 10 2 5
1 12 1 100
;
PROC REG DATA = LIBRARY;
MODEL BOOKS = ENROLL MASTERS DOCTORATE LOG_AREA / SELECTION = FORWARD;
RUN;
QUIT;

The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: BOOKS

Number of Observations Read    10
Number of Observations Used    10

Forward Selection: Step 1

Variable ENROLL Entered: R-Square = 0.5602 and C(p) = 7.7437

                     Analysis of Variance
                         Sum of        Mean
Source            DF    Squares      Square   F Value   Pr > F
Model              1   43.69702    43.69702     10.19   0.0128
Error              8   34.30298     4.28787
Corrected Total    9   78.00000

            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept     1.53094    1.01341      9.78565      2.28   0.1693
ENROLL        0.17763    0.05564     43.69702     10.19   0.0128

Bounds on condition number: 1, 1
----------------------------------------------------------------------

Forward Selection: Step 2

Variable DOCTORATE Entered: R-Square = 0.7420 and C(p) = 4.0619

                     Analysis of Variance
                         Sum of        Mean
Source            DF    Squares      Square   F Value   Pr > F
Model              2   57.87812    28.93906     10.07   0.0087
Error              7   20.12188     2.87455
Corrected Total    9   78.00000

            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept     1.18562    0.84419      5.66994      1.97   0.2030
ENROLL        0.12297    0.05178     16.21145      5.64   0.0493
DOCTORATE     2.76277    1.24387     14.18110      4.93   0.0618

Bounds on condition number: 1.2918, 5.1671
----------------------------------------------------------------------

Forward Selection: Step 3

Variable LOG_AREA Entered: R-Square = 0.8375 and C(p) = 3.0771

NOTE: In Problem 9.3 (DEGREE coded 1-2-3, AREA untransformed) the
three-variable model had R-Square = 0.7812.

                     Analysis of Variance
                         Sum of        Mean
Source            DF    Squares      Square   F Value   Pr > F
Model              3   65.32805    21.77602     10.31   0.0088
Error              6   12.67195     2.11199
Corrected Total    9   78.00000

            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept    -1.97983    1.83418      2.46073      1.17   0.3219
ENROLL        0.10382    0.04554     10.97715      5.20   0.0628
DOCTORATE     3.55501    1.14660     20.30241      9.61   0.0211
LOG_AREA      0.73985    0.39393      7.44993      3.53   0.1094

Bounds on condition number: 1.494, 12.038
----------------------------------------------------------------------

No other variable met the 0.5000 significance level for entry into the
model. MASTERS therefore stays out, and the final model is:

  BOOKS = -1.97983 + 0.10382*ENROLL + 3.55501*DOCTORATE + 0.73985*LOG_AREA

When the highest degree is the doctorate, DOCTORATE=1 and:

  BOOKS = (-1.97983 + 3.55501) + 0.10382*ENROLL + 0.73985*LOG_AREA
        = 1.57518 + 0.10382*ENROLL + 0.73985*LOG_AREA

Otherwise DOCTORATE=0 and:

  BOOKS = -1.97983 + 0.10382*ENROLL + 0.73985*LOG_AREA

                    Summary of Forward Selection

       Variable    Number    Partial     Model
Step   Entered     Vars In   R-Square   R-Square     C(p)   F Value   Pr > F
   1   ENROLL            1     0.5602     0.5602   7.7437     10.19   0.0128
   2   DOCTORATE         2     0.1818     0.7420   4.0619      4.93   0.0618
   3   LOG_AREA          3     0.0955     0.8375   3.0771      3.53   0.1094

======================================================================

Check directly by entering the 0-1 values:

               PHD   MA
  PHD=3 ---->    1    0
  MA =2 ---->    0    1
  BA =1 ---->    0    0   (REFERENCE)

OPTION PS=45 LS=70;
DATA LIBRARY;
INPUT BOOKS ENROLL PHD MA AREA;
LOG_AREA = LOG(AREA);
DATALINES;
4 5 1 0 20
5 8 1 0 40
10 40 1 0 100
1 4 0 1 50
5 2 0 0 300
2 8 0 0 400
7 30 1 0 40
4 20 0 1 200
1 10 0 1 5
1 12 0 0 100
;
PROC REG DATA = LIBRARY;
MODEL BOOKS = ENROLL PHD MA LOG_AREA / SELECTION = FORWARD;
RUN;
QUIT;

We get identical results:

                     Analysis of Variance
                         Sum of        Mean
Source            DF    Squares      Square   F Value   Pr > F
Model              3   65.32805    21.77602     10.31   0.0088
Error              6   12.67195     2.11199
Corrected Total    9   78.00000

            Parameter   Standard
Variable     Estimate      Error   Type II SS   F Value   Pr > F
Intercept    -1.97983    1.83418      2.46073      1.17   0.3219
ENROLL        0.10382    0.04554     10.97715      5.20   0.0628
PHD           3.55501    1.14660     20.30241      9.61   0.0211
LOG_AREA      0.73985    0.39393      7.44993      3.53   0.1094

Bounds on condition number: 1.494, 12.038
----------------------------------------------------------------------

No other variable met the 0.5000 significance level for entry into the
model.

===================================================================

Check AIC:

With both PHD and MA in the model:

PROC MIXED IC DATA = LIBRARY METHOD = ML;
MODEL BOOKS = ENROLL PHD MA LOG_AREA;
RUN;
QUIT;

                   Information Criteria
Neg2LogLike   Parms    AIC   AICC   HQIC    BIC   CAIC
       30.6       6   42.6   70.6   40.6   44.4   50.4

      Type 3 Tests of Fixed Effects
             Num   Den
Effect        DF    DF   F Value   Pr > F
ENROLL         1     5      6.90   0.0467
PHD            1     5      9.11   0.0295
MA             1     5      0.15   0.7108
LOG_AREA       1     5      4.47   0.0881

Now with only PHD in the model we get a better (smaller) AIC, 40.7
versus 42.6:

PROC MIXED IC DATA = LIBRARY METHOD = ML;
MODEL BOOKS = ENROLL PHD LOG_AREA;
RUN;
QUIT;

                   Information Criteria
Neg2LogLike   Parms    AIC   AICC   HQIC    BIC   CAIC
       30.7       5   40.7   55.7   39.1   42.3   47.3

      Type 3 Tests of Fixed Effects
             Num   Den
Effect        DF    DF   F Value   Pr > F
ENROLL         1     6      8.66   0.0258
PHD            1     6     16.02   0.0071
LOG_AREA       1     6      5.88   0.0515
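
The information criteria above can be checked by hand. The Python sketch below is an outside-SAS check, not part of the original solution. It assumes the usual definitions (AIC = -2LL + 2k, AICC = AIC + 2k(k+1)/(n-k-1), HQIC = -2LL + 2k ln(ln n), BIC = -2LL + k ln n, CAIC = -2LL + k(ln n + 1)) with n = 10 observations, and it assumes that Parms counts the regression coefficients including the intercept plus the residual variance. The function names are hypothetical. For the PHD-only model, -2LL is also recomputed from the error sum of squares 12.67195 using the Gaussian ML formula n(ln(2 pi) + ln(SSE/n) + 1).

```python
import math

N = 10  # observations in the LIBRARY data


def neg2ll_from_sse(sse, n=N):
    """-2 log-likelihood of a Gaussian linear model fitted by ML (not REML)."""
    return n * (math.log(2 * math.pi) + math.log(sse / n) + 1)


def info_criteria(neg2ll, k, n=N):
    """Information criteria under the usual definitions (an assumption here)."""
    aic = neg2ll + 2 * k
    return {
        "AIC": aic,
        "AICC": aic + 2 * k * (k + 1) / (n - k - 1),
        "HQIC": neg2ll + 2 * k * math.log(math.log(n)),
        "BIC": neg2ll + k * math.log(n),
        "CAIC": neg2ll + k * (math.log(n) + 1),
    }


# Model with ENROLL, PHD, MA, LOG_AREA: printed -2LL = 30.6, Parms = 6.
full = info_criteria(30.6, 6)

# Model with ENROLL, PHD, LOG_AREA: -2LL recomputed from SSE, Parms = 5.
reduced_neg2ll = neg2ll_from_sse(12.67195)
reduced = info_criteria(reduced_neg2ll, 5)

print(f"-2LL (PHD only, from SSE): {reduced_neg2ll:.4f}")
for name in ("AIC", "AICC", "HQIC", "BIC", "CAIC"):
    print(f"{name:5s} full={full[name]:.2f}  PHD only={reduced[name]:.2f}")
```

Because the listing prints -2LL to one decimal only, criteria recomputed from the printed 30.6 can differ from the listing in the last digit. Dropping MA raises -2LL only slightly while saving one parameter, which is why every criterion favors the PHD-only model.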