Example of Regression in SAS (here) & Splus (in HW22)
=====================================================
                                          Stat 798C   4/7/04

### Get the data for this example from
       http://www.math.umd.edu/~evs/s430/Data/bass.gz
    [Spring 2008 HW 18: the updated link is:
       http://www.math.umd.edu/~evs/s430.old/Data/bass.gz ]

data bass (drop=AUX aux2);
     infile "Bass.Dat";
     input aux2 $ 1-2  AUX $ 3-4  lake $ 7-14 @;
     IF aux = "" and aux2 NE "" and lake ne "Lake" THEN do;
        input @1 ID Lake $ Alk pH Calc Chlor AvMrc nsamp
                 min max Mrc3yr agedat ;
        /* In this version, the dataset must have had blanks in
           columns 1-2 but NOT in columns 3-4 in those 53 records
           which contained the desired numeric data. */
        output;
     end;
run;

/* Note the use of the trailing @ to read the line more than once.
   Also note that SAS truncates the lake-names to their first
   8 characters. */

libname home "." ;

data home.bass (drop = tmpchar);
     infile "Bass.Dat" ;
     length Lake $ 16 ;
     input tmpchar $ 1 @ ;
     if tmpchar > "0" and tmpchar <= "9" then do;
        input @1 ID Lake $ Alk pH Calc Chlor AvMrc nsamp
                 min max Mrc3yr agedat ;
     end;
     if ID ne . ;
run;

data bass; set home.bass; run;

proc gplot ;
     plot AvMrc * Alk ;
run;

data bass2; set bass;
     logAvM = log(AvMrc) ;

proc reg data=bass2;
     model logAvM = Alk ;
     output out=bass3  p=logMprd  r=logMrsd  COOKD=Cookd
            Student=Studnt  LCLM=LowPrd  UCLM=HiPrd ;
run;

                        Analysis of Variance

                                 Sum of        Mean
   Source            DF         Squares      Square   F Value   Pr > F
   Model              1        18.71377    18.71377     53.22   <.0001
   Error             51        17.93193     0.35161
   Corrected Total   52        36.64571

/* NOTE as covered in class: if H = X (X^{tr}X)^{-1} X^{tr} denotes
   the nxn "Hat" matrix, and R the matrix I-H, then the Cook's
   Distance vector has n entries given by

       D_i = (i'th studentized Residual)^2 * (H_{ii}/(1-H_{ii})) * (1/p)

   where p = number of columns of the design matrix. */

                        Parameter     Standard
   Variable    DF       Estimate        Error    t Value    Pr > |t|
   Intercept    1       -0.32110      0.11471      -2.80      0.0072
   Alk          1       -0.01570      0.00215      -7.30      <.0001

(b) goptions colors = (black) ;
    symbol1 v=star;  symbol2 i=J L=3 ;

    proc gplot data=bass3 ;
         plot logAvM * Alk = 1  logMprd * Alk = 2 / overlay ;
    run;
    /* PRODUCES SCATTERPLOT WITH PREDICTED LINE */

    PROC GPLOT data=bass3 ;
         plot logMrsd * Alk = 1 / vref=0;
    run;
    /* vref option produces horizontal `reference line' */

(c) proc means data=bass3 min max qrange std p95 p5 maxdec=3;
         var cookd logMrsd Studnt ;
    run;

                                 Quartile
  Variable      Min      Max      Range    Std Dev   95th Pctl   5th Pctl
  -----------------------------------------------------------------------
  Cookd       0.000    0.264      0.016      0.049       0.140      0.000
  logMrsd    -2.066    1.792      0.572      0.587       0.788     -0.995
  Studnt     -3.522    3.104      0.980      1.011       1.341     -1.695
  -----------------------------------------------------------------------

    proc print data=bass3 ;
         var alk logavM logMrsd logMprd cookd studnt ;
         where cookd > 0.140 or abs(logmrsd) > 1 or ID in (36,52)
               or abs(Studnt) > 1.5;
    run;

  Obs     Alk     logAvM     logMrsd     logMprd      Cookd      Studnt
    3   116.0   -3.21888   -1.07626    -2.14262    0.20336    -1.91323
   18    61.3   -0.26136    1.02231    -1.28368    0.04125     1.74721
   36    25.4   -1.71480   -0.99485    -0.71995    0.03054    -1.69549
   38    53.0   -3.21888   -2.06553    -1.15334    0.13969    -3.52240
   40    87.6    0.09531    1.79197    -1.69666    0.26366     3.10367
   52    17.3   -1.38629   -0.79354    -0.59276    0.02282    -1.35479

/* So the 36 and 52 cases don't seem very outlying !  (But they seem
   more so with STUDENTIZED residuals than with raw residuals.)  In any
   case, the outliers 3, 38 and 40 have the largest Cook's D, but only
   38 and 40 have really large residuals. */

(d) In the original model:   R-Square   0.5107
                             Adj R-Sq   0.5011
        a^ = -0.32110 ,  b^ = -0.01570

data bass4 bass5; set bass3;
     if not(cookd > 0.140 or abs(logmrsd) > 1 or ID in (36,52))
        then output bass4;
     else output bass5;
run;

NOTE: The data set WORK.BASS4 has 47 observations and 16 variables.
NOTE: The data set WORK.BASS5 has 6 observations and 16 variables.
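As a side illustration (not part of the original SAS session), the Cook's
Distance formula from the class note above can be checked numerically.
This is a hedged sketch in Python/numpy on synthetic single-regressor data
(the values are invented, not the bass data); the final line mimics the
spirit of the WHERE-clause screening rule, flagging points beyond the 95th
percentile of Cook's D or with |studentized residual| > 1.5.

```python
import numpy as np

# Synthetic single-regressor data (hypothetical, not the bass data).
rng = np.random.default_rng(0)
x = np.linspace(1.0, 100.0, 25)
y = -0.32 - 0.016 * x + rng.normal(0.0, 0.6, size=x.size)

X = np.column_stack([np.ones_like(x), x])    # n x p design matrix, p = 2
n, p = X.shape
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

H = X @ np.linalg.inv(X.T @ X) @ X.T         # the "Hat" matrix
h = np.diag(H)                               # leverages H_ii
s2 = resid @ resid / (n - p)                 # mean squared error
student = resid / np.sqrt(s2 * (1.0 - h))    # internally studentized residuals

# Cook's Distance exactly as in the class note:
#   D_i = (i'th studentized residual)^2 * (H_ii / (1 - H_ii)) * (1/p)
cookd = student**2 * (h / (1.0 - h)) / p

# Screening in the spirit of the WHERE clause: flag large-D or
# large-studentized-residual points.
flag = (cookd > np.quantile(cookd, 0.95)) | (np.abs(student) > 1.5)
```

One can verify algebraically that this equals the direct definition
D_i = r_i^2 H_ii / (p * s^2 * (1 - H_ii)^2) with r_i the raw residual.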
proc reg data=bass4;
     model logavM = alk;
     output out=bass6  p=newP  r=newrsd ;
run;

     R-Square   0.7197
     Adj R-Sq   0.7135
     a^ = -0.26569 ,  b^ = -0.01600

Change in intercept is .055 = about .75 `standard errors', while the
change in slope is a small fraction (about 0.2) of a `standard error'.
But the (adjusted) R-square or correlation is now much greater.
(NOTE: the correlation is rho = -sqrt(R^2), negative because the slope is.)

(e) Prediction intervals are the linear predictors plus or minus 1.96
standard deviations FOR A NEW OBSERVATION at the given x-point.  We
didn't cover this yet in class, but Chapter 5 of the Cody & Smith book
treats them in Section 5.I.  We deviate from the book by NOT constructing
plots within PROC REG.  Instead, we use the OUTPUT statement of PROC REG
to construct a file with all of the relevant columns, and then plot those
columns within PROC GPLOT.  [THIS GIVES MORE FLEXIBILITY.]

goptions colors = (black) ;
symbol1 v=star;  symbol2 v=circle;
symbol3 i=J L=3 ;  symbol4 i=J L=4 ;

proc reg data=bass4;
     model logavM = alk;
     output out=basscols  LCL=lowP  UCL=hiP ;
run;

proc sort data=basscols; by alk;

proc gplot data=basscols;
     title "Prediction Interval Plot";
     plot LowP*alk = 3  HiP*alk = 4 / overlay;
run;

THIS PLOT GIVES TWO curves ON THE SAME GRAPH, SHOWING PREDICTION
INTERVALS from the model without outliers.  Can obtain prediction
intervals `near' outliers either from the graph OR from doing PROC PRINT
on the "basscols" dataset just created.

NOTE that LCL and UCL in the OUTPUT statement create prediction-interval
endpoints, while LCLM and UCLM create confidence-interval endpoints for
the expectations of the individual observations !

   FROM MODEL W/O OUTLIERS

  Obs     Alk     logAvM     logMprd     LowP      HiP
    3   116.0   -3.21888   -2.14262    -2.95    -1.33
   18    61.3   -0.26136   -1.28368    -2.01    -0.46
   36    25.4   -1.71480   -0.71995    -1.45    +0.10
   38    53.0   -3.21888   -1.15334    -1.71    -0.16
   40    87.6    0.09531   -1.69666    -2.56    -0.97
   52    17.3   -1.38629   -0.59276    -1.32    +0.23

NOTE THAT THESE OUTLIERS MOSTLY LIE BELOW THE INTERVAL PREDICTED IN
THEIR ABSENCE !!!
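To make the LCL/UCL-versus-LCLM/UCLM distinction concrete, here is a hedged
Python/numpy sketch of both interval half-widths for simple linear
regression, using the 1.96 multiplier from the notes (a t quantile would be
slightly more exact).  The (x, y) values are invented stand-ins, not the
bass4 data.

```python
import numpy as np

# Toy (x, y) sample (hypothetical stand-in for the outlier-free data).
x = np.array([2.0, 9.0, 17.0, 25.0, 40.0, 53.0, 61.0, 75.0, 88.0, 100.0])
y = -0.27 - 0.016 * x + np.array([0.12, -0.20, 0.05, 0.15, -0.10,
                                  0.20, -0.05, 0.10, -0.15, 0.02])

n = x.size
slope, intercept = np.polyfit(x, y, 1)
yhat = intercept + slope * x
s = np.sqrt(np.sum((y - yhat)**2) / (n - 2))      # residual standard error
xbar = x.mean()
sxx = np.sum((x - xbar)**2)

def halfwidths(x0, z=1.96):
    """Half-widths at x0 of the confidence interval for the MEAN response
    (the LCLM/UCLM analogue) and of the prediction interval for a NEW
    observation (the LCL/UCL analogue)."""
    se_mean = s * np.sqrt(1.0/n + (x0 - xbar)**2 / sxx)
    se_pred = s * np.sqrt(1.0 + 1.0/n + (x0 - xbar)**2 / sxx)
    return z * se_mean, z * se_pred

ci, pi = halfwidths(50.0)
```

The prediction half-width always exceeds the confidence half-width, since
the extra "1" under the square root accounts for the noise in a single new
observation; in fact pi^2 - ci^2 = (1.96*s)^2.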
(only obs 18 and 40 lie above).  SO OBSERVE: BY THIS CRITERION, ALL 6 OF
THESE OBSERVATIONS ARE CLEARLY OUTLIERS IN THE SENSE OF BEING BADLY
FITTED !!!

(f) data bass7; set bass2;
         logalk = log(alk);
    proc gplot ;
         plot AvMrc * logalk = 1 / overlay;
    run;

THE KEY POINT ABOUT THIS PLOT IS THAT THE VARIABILITY SEEMS GENERALLY
LARGE AND NOT AT ALL CONSTANT.  However, the largest alkalinity values
seem to produce Average Mercury values in a very tight (narrow) range,
except for one wild outlier.

(g) Based on the very extreme values of t-statistics found in the
regression, both with and without the outliers present, it certainly
looks as though there is a negative slope relating the log of Average
Mercury to alkalinity.  The assumptions of normality and homoscedasticity
of the data may be very questionable, and it is not at all clear how to
regard the outliers, but the highly significant negative CORRELATION
borne out by the scatterplot says that there is definitely a negative
association.  BUT causality is a different story !!  (Acid rain probably
kills lots of fish and thereby affects the food chain through which
larger fish accumulate mercury in their bodies.)

-----------------------------------------------------------
Further illustration related to Stepwise model-selection !

proc reg data=bass2;
     model logAvM = Alk pH Calc Chlor / selection = forward best=2 ;
run;

                    Summary of Forward Selection

        Variable  Number   Partial     Model
  Step  Entered   Vars In  R-Square   R-Square     C(p)    F-Val    Pr>F
    1   Alk          1      0.5107     0.5107   29.7014    53.22   <.0001
    2   Chlor        2      0.1734     0.6841    3.8085    27.45   <.0001
    3   Calc         3      0.0154     0.6995    3.3266     2.52   0.1191
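As a rough illustration of what forward selection does (a hedged Python
sketch, not SAS's actual algorithm), the greedy step is: enter, at each
stage, the candidate column giving the largest model R-square.  PROC REG
additionally applies an entry F-test cutoff, which this sketch omits; the
demo data are synthetic, with a known ordering of importance.

```python
import numpy as np

def r2(X, y):
    """R-square of an OLS fit (X already contains the intercept column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean())**2)

def forward(cols, names, y):
    """Greedy forward selection: at each step enter the candidate column
    giving the largest model R-square.  (SAS also applies an entry
    significance test; that refinement is omitted here.)"""
    n = y.size
    in_model = [np.ones(n)]          # start from the intercept-only model
    entered, remaining = [], list(range(len(cols)))
    while remaining:
        best = max(remaining,
                   key=lambda j: r2(np.column_stack(in_model + [cols[j]]), y))
        in_model.append(cols[best])
        entered.append(names[best])
        remaining.remove(best)
    return entered

# Synthetic demo: x1 dominates, x2 matters some, x3 barely matters.
rng = np.random.default_rng(1)
x1, x2, x3 = (rng.normal(size=60) for _ in range(3))
y = 3.0 * x1 + 1.0 * x2 + 0.1 * x3 + rng.normal(scale=0.5, size=60)
order = forward([x1, x2, x3], ["x1", "x2", "x3"], y)
```

With independent regressors and these coefficients, the entry order tracks
the partial R-square contributions, just as in the Summary of Forward
Selection table above.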