Example of Regression in SAS (here) & Splus (in HW22)
=====================================================
                                          Stat 798C   4/7/04

### Get the data for this example from
       http://www.math.umd.edu/~evs/s430/Data/bass.gz
    [Spring 2008 HW 18: the updated link is:
       http://www.math.umd.edu/~evs/s430.old/Data/bass.gz ]

data bass (drop=AUX aux2);
     infile "Bass.Dat";
     input aux2 $ 1-2  AUX $ 3-4  lake $ 7-14 @;
     IF aux = "" and aux2 NE "" and lake ne "Lake" THEN do;
        input @1 ID Lake $ Alk pH Calc Chlor AvMrc nsamp
                 min max Mrc3yr agedat ;
        /* In this version, the dataset must have had blanks in
           columns 1-2 but NOT in columns 3-4 in those 53 records
           which contained the desired numeric data. */
        output;
     end;
run;

/* Note the use of the trailing @ to read the line more than once.
   Also note that SAS truncates the lake-names to their first
   8 characters. */

libname home "." ;

data home.bass (drop = tmpchar);
     infile "Bass.Dat" ;
     length Lake $ 16 ;
     input tmpchar $ 1 @ ;
     if tmpchar > "0" and tmpchar <= "9" then do;
        input @1 ID Lake $ Alk pH Calc Chlor AvMrc nsamp
                 min max Mrc3yr agedat ;
     end;
     if ID ne . ;
run;

data bass; set home.bass; run;

proc gplot ;
     plot AvMrc * Alk ;
run;

data bass2; set bass;
     logAvM = log(AvMrc) ;

proc reg data=bass2;
     model logAvM = Alk ;
     output out=bass3  p=logMprd  r=logMrsd  COOKD=Cookd
            Student=Studnt  LCLM=LowPrd  UCLM=HiPrd ;
run;

                        Analysis of Variance

                                 Sum of        Mean
   Source            DF         Squares      Square   F Value   Pr > F
   Model              1        18.71377    18.71377     53.22   <.0001
   Error             51        17.93193     0.35161
   Corrected Total   52        36.64571

/* NOTE as covered in class: if H = X (X^{tr}X)^{-1} X^{tr} denotes
   the nxn "Hat" matrix, and R the matrix I-H, then the Cook's
   Distance vector has n entries given by

       D_i = (i'th studentized Residual)^2 * (H_{ii}/(1-H_{ii})) * (1/p)

   where p = number of columns of the design matrix. */

                        Parameter     Standard
   Variable    DF       Estimate        Error    t Value    Pr > |t|
   Intercept    1       -0.32110      0.11471      -2.80      0.0072
   Alk          1       -0.01570      0.00215      -7.30      <.0001

(b) goptions colors = (black) ;
    symbol1 v=star;  symbol2 i=J L=3 ;

    proc gplot data=bass3 ;
         plot logAvM * Alk = 1  logMprd * Alk = 2 / overlay ;
    run;
    /* PRODUCES SCATTERPLOT WITH PREDICTED LINE */

    PROC GPLOT data=bass3 ;
         plot logMrsd * Alk = 1 / vref=0;
    run;
    /* vref option produces horizontal `reference line' */

(c) proc means data=bass3 min max qrange std p95 p5 maxdec=3;
         var cookd logMrsd Studnt ;
    run;

                                 Quartile
  Variable      Min      Max      Range    Std Dev   95th Pctl   5th Pctl
  -----------------------------------------------------------------------
  Cookd       0.000    0.264      0.016      0.049       0.140      0.000
  logMrsd    -2.066    1.792      0.572      0.587       0.788     -0.995
  Studnt     -3.522    3.104      0.980      1.011       1.341     -1.695
  -----------------------------------------------------------------------

    proc print data=bass3 ;
         var alk logavM logMrsd logMprd cookd studnt ;
         where cookd > 0.140 or abs(logmrsd) > 1 or ID in (36,52)
               or abs(Studnt) > 1.5;
    run;

  Obs     Alk     logAvM     logMrsd     logMprd      Cookd      Studnt
    3   116.0   -3.21888   -1.07626    -2.14262    0.20336    -1.91323
   18    61.3   -0.26136    1.02231    -1.28368    0.04125     1.74721
   36    25.4   -1.71480   -0.99485    -0.71995    0.03054    -1.69549
   38    53.0   -3.21888   -2.06553    -1.15334    0.13969    -3.52240
   40    87.6    0.09531    1.79197    -1.69666    0.26366     3.10367
   52    17.3   -1.38629   -0.79354    -0.59276    0.02282    -1.35479

/* So the 36 and 52 cases don't seem very outlying !  (But they seem
   more so with STUDENTIZED residuals than with raw residuals.)  In any
   case, the outliers 3, 38 and 40 have the largest Cook's D, but only
   38 and 40 have really large residuals. */

(d) In the original model:   R-Square   0.5107
                             Adj R-Sq   0.5011
        a^ = -0.32110 ,  b^ = -0.01570

data bass4 bass5; set bass3;
     if not(cookd > 0.140 or abs(logmrsd) > 1 or ID in (36,52))
        then output bass4;
     else output bass5;
run;

NOTE: The data set WORK.BASS4 has 47 observations and 16 variables.
NOTE: The data set WORK.BASS5 has 6 observations and 16 variables.
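As a side illustration (not part of the original SAS session), the Cook's
Distance formula from the class note above can be checked numerically.
This is a hedged sketch in Python/numpy on synthetic single-regressor data
(the values are invented, not the bass data); the final line mimics the
spirit of the WHERE-clause screening rule, flagging points beyond the 95th
percentile of Cook's D or with |studentized residual| > 1.5.

```python
import numpy as np

# Synthetic single-regressor data (hypothetical, not the bass data).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 100.0, 25)
y = -0.32 - 0.016 * x + rng.normal(0.0, 0.6, size=x.size)

X = np.column_stack([np.ones_like(x), x])    # n x p design matrix, p = 2
n, p = X.shape
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T         # the "Hat" matrix
h = np.diag(H)                               # leverages H_ii
s2 = resid @ resid / (n - p)                 # mean squared error
student = resid / np.sqrt(s2 * (1.0 - h))    # internally studentized residuals

# Cook's Distance exactly as in the class note:
#   D_i = (i'th studentized residual)^2 * (H_ii / (1 - H_ii)) * (1/p)
cookd = student**2 * (h / (1.0 - h)) / p

# Screening in the spirit of the WHERE clause: flag large-D or
# large-studentized-residual points.
flag = (cookd > np.quantile(cookd, 0.95)) | (np.abs(student) > 1.5)
```

One can verify algebraically that this equals the direct definition
D_i = r_i^2 H_ii / (p * s^2 * (1 - H_ii)^2) with r_i the raw residual.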
proc reg data=bass4;
     model logavM = alk;
     output out=bass6  p=newP  r=newrsd ;
run;

     R-Square   0.7197
     Adj R-Sq   0.7135
     a^ = -0.26569 ,  b^ = -0.01600

Change in intercept is .055 = about .75 `standard errors', while the
change in slope is a small fraction (about 0.2) of a `standard error'.
But the (adjusted) R-square or correlation is now much greater.
(NOTE: the correlation is rho = -sqrt(R^2), negative because the slope is.)

(e) Prediction intervals are the linear predictors plus or minus 1.96
standard deviations FOR A NEW OBSERVATION at the given x-point.  We
didn't cover this yet in class, but Chapter 5 of the Cody & Smith book
treats them in Section 5.I.  We deviate from the book by NOT constructing
plots within PROC REG.  Instead, we use the OUTPUT statement of PROC REG
to construct a file with all of the relevant columns, and then plot those
columns within PROC GPLOT.  [THIS GIVES MORE FLEXIBILITY.]

goptions colors = (black) ;
symbol1 v=star;  symbol2 v=circle;
symbol3 i=J L=3 ;  symbol4 i=J L=4 ;

proc reg data=bass4;
     model logavM = alk;
     output out=basscols  LCL=lowP  UCL=hiP ;
run;

proc sort data=basscols; by alk;

proc gplot data=basscols;
     title "Prediction Interval Plot";
     plot LowP*alk = 3  HiP*alk = 4 / overlay;
run;

THIS PLOT GIVES TWO curves ON THE SAME GRAPH, SHOWING PREDICTION
INTERVALS from the model without outliers.  Can obtain prediction
intervals `near' outliers either from the graph OR from doing PROC PRINT
on the "basscols" dataset just created.

NOTE that LCL and UCL in the OUTPUT statement create prediction-interval
endpoints, while LCLM and UCLM create confidence-interval endpoints for
the expectations of the individual observations !

   FROM MODEL W/O OUTLIERS

  Obs     Alk     logAvM     logMprd     LowP      HiP
    3   116.0   -3.21888   -2.14262    -2.95    -1.33
   18    61.3   -0.26136   -1.28368    -2.01    -0.46
   36    25.4   -1.71480   -0.71995    -1.45    +0.10
   38    53.0   -3.21888   -1.15334    -1.71    -0.16
   40    87.6    0.09531   -1.69666    -2.56    -0.97
   52    17.3   -1.38629   -0.59276    -1.32    +0.23

NOTE THAT THESE OUTLIERS MOSTLY LIE BELOW THE INTERVAL PREDICTED IN
THEIR ABSENCE !!!
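To make the LCL/UCL-versus-LCLM/UCLM distinction concrete, here is a hedged
Python/numpy sketch of both interval half-widths for simple linear
regression, using the 1.96 multiplier from the notes (a t quantile would be
slightly more exact).  The (x, y) values are invented stand-ins, not the
bass4 data.

```python
import numpy as np

# Toy (x, y) sample (hypothetical stand-in for the outlier-free data).
x = np.array([2.0, 9.0, 17.0, 25.0, 40.0, 53.0, 61.0, 75.0, 88.0, 100.0])
y = -0.27 - 0.016 * x + np.array([0.12, -0.20, 0.05, 0.15, -0.10,
                                  0.20, -0.05, 0.10, -0.15, 0.02])

n = x.size
slope, intercept = np.polyfit(x, y, 1)
yhat = intercept + slope * x
s = np.sqrt(np.sum((y - yhat)**2) / (n - 2))      # residual standard error
xbar = x.mean()
sxx = np.sum((x - xbar)**2)

def halfwidths(x0, z=1.96):
    """Half-widths at x0 of the confidence interval for the MEAN response
    (the LCLM/UCLM analogue) and of the prediction interval for a NEW
    observation (the LCL/UCL analogue)."""
    se_mean = s * np.sqrt(1.0/n + (x0 - xbar)**2 / sxx)
    se_pred = s * np.sqrt(1.0 + 1.0/n + (x0 - xbar)**2 / sxx)
    return z * se_mean, z * se_pred

ci, pi = halfwidths(50.0)
```

The prediction half-width always exceeds the confidence half-width, since
the extra "1" under the square root accounts for the noise in a single new
observation; in fact pi^2 - ci^2 = (1.96*s)^2.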
(only obs 18 and 40 lie above).  SO OBSERVE: BY THIS CRITERION, ALL 6 OF
THESE OBSERVATIONS ARE CLEARLY OUTLIERS IN THE SENSE OF BEING BADLY
FITTED !!!

(f) data bass7; set bass2;
         logalk = log(alk);
    proc gplot ;
         plot AvMrc * logalk = 1 / overlay;
    run;

THE KEY POINT ABOUT THIS PLOT IS THAT THE VARIABILITY SEEMS GENERALLY
LARGE AND NOT AT ALL CONSTANT.  However, the largest alkalinity values
seem to produce Average Mercury values in a very tight (narrow) range,
except for one wild outlier.

(g) Based on the very extreme values of t-statistics found in the
regression, both with and without the outliers present, it certainly
looks as though there is a negative slope relating the log of Average
Mercury to alkalinity.  The assumptions of normality and homoscedasticity
of the data may be very questionable, and it is not at all clear how to
regard the outliers, but the highly significant negative CORRELATION
borne out by the scatterplot says that there is definitely a negative
association.  BUT causality is a different story !!  (Acid rain probably
kills lots of fish and thereby affects the food chain through which
larger fish accumulate mercury in their bodies.)

-----------------------------------------------------------
Further illustration related to Stepwise model-selection !

proc reg data=bass2;
     model logAvM = Alk pH Calc Chlor / selection = forward best=2 ;
run;

                    Summary of Forward Selection

        Variable  Number   Partial     Model
  Step  Entered   Vars In  R-Square   R-Square     C(p)    F-Val    Pr>F
    1   Alk          1      0.5107     0.5107   29.7014    53.22   <.0001
    2   Chlor        2      0.1734     0.6841    3.8085    27.45   <.0001
    3   Calc         3      0.0154     0.6995    3.3266     2.52   0.1191
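As a rough illustration of what forward selection does (a hedged Python
sketch, not SAS's actual algorithm), the greedy step is: enter, at each
stage, the candidate column giving the largest model R-square.  PROC REG
additionally applies an entry F-test cutoff, which this sketch omits; the
demo data are synthetic, with a known ordering of importance.

```python
import numpy as np

def r2(X, y):
    """R-square of an OLS fit (X already contains the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean())**2)

def forward(cols, names, y):
    """Greedy forward selection: at each step enter the candidate column
    giving the largest model R-square.  (SAS also applies an entry
    significance test; that refinement is omitted here.)"""
    n = y.size
    in_model = [np.ones(n)]          # start from the intercept-only model
    entered, remaining = [], list(range(len(cols)))
    while remaining:
        best = max(remaining,
                   key=lambda j: r2(np.column_stack(in_model + [cols[j]]), y))
        in_model.append(cols[best])
        entered.append(names[best])
        remaining.remove(best)
    return entered

# Synthetic demo: x1 dominates, x2 matters some, x3 barely matters.
rng = np.random.default_rng(1)
x1, x2, x3 = (rng.normal(size=60) for _ in range(3))
y = 3.0 * x1 + 1.0 * x2 + 0.1 * x3 + rng.normal(scale=0.5, size=60)
order = forward([x1, x2, x3], ["x1", "x2", "x3"], y)
```

With independent regressors and these coefficients, the entry order tracks
the partial R-square contributions, just as in the Summary of Forward
Selection table above.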