Further Examples in SAS for Stat 770,  3/27/03
==============================================

Consider "kyphosis" dataset available on the web-page
for Stat 798C under "kyphosis.gz"  at : 

http://www.math.umd.edu/~evs/s798c/Data/

(Saved as  ASCII file "kyphosis" including header, on 
cluster machine)

libname SASstf "SASproj";
data SASstf.kyph (drop=labels1-labels5);
  infile "kyphosis";
 if _N_=1 then input labels1-labels5 $ ;
 else input Id Kyphos $ Age NVert Nstart ;
   if Id ge 1 ;
run;

## Now try to work with logistic regression model ...


NOTE: The data set SASSTF.KYPH has 81 observations and 5 variables.
(which is correct)

### Now for Splus version:

maclaurin% cp /home1/evspublic_html/s798c/Data/kyphosis.gz .

> kyph <- read.table("../kyphosis",skip=9)[,2:5]
> names(kyph) <- c("Kyphos","Age","NVert","Nstart") 
> kyph[1:6,]
   Kyphos Age NVert Nstart 
1  absent  71     3      5
2  absent 158     3     14
3 present 128     4      5
4  absent   2     5      1
5  absent   1     4     15
6  absent   1     2     16

### Back to SAS  
  libname SASstf "SASproj";
      data kyph;
        set SASstf.kyph ;
      run;
   proc logistic data=kyph;
   title 'Kyphosis Logistic Regression';
     model kyphos = Age Nvert Nstart / Risklimits;
  run;

	Kyphosis Logistic Regression
                   
             The LOGISTIC Procedure

               Response Profile

        Ordered                      Total
      Value     Kyphos       Frequency

          1     absent              64
          2     present             17

    Probability modeled is Kyphos='absent'.


         Model Fit Statistics

                                 Intercept
                   Intercept         and
    Criterion        Only        Covariates

  AIC               85.234         69.380
  SC                87.629         78.958
  -2 Log L          83.234         61.380

    Testing Global Null Hypothesis: BETA=0 ...

 Test                 Chi-Sq   DF    Pr > ChiSq

Likelihood Ratio     21.8545    3      <.0001
Score                20.8548    3      0.0001
Wald                 14.1123    3      0.0028

Analysis of Maximum Likelihood Estimates
                              
		Wald     Standard
Parameter  DF  Estimate    Error    Chi-Sq   Pr > ChiSq
Intercept  1    2.0369    1.4496    1.9744    0.1600
Age        1   -0.0109    0.00645   2.8748    0.0900
NVert      1   -0.4106    0.2249    3.3340    0.0679
Nstart     1    0.2065    0.0677    9.3045    0.0023

Wald Confidence Interval for Adjusted Odds Ratios

Effect      Unit    Estimate   95% Confidence Limits

Age        1.0000     0.989      0.977     1.002
NVert      1.0000     0.663      0.427     1.031
Nstart     1.0000     1.229      1.077     1.404

### Back to Splus

> unlist(lapply(kyph,is.factor))
 Kyphos Age NVert Nstart 
      T   F     F      F
> tmpglm <- glm(Kyphos ~ . , family=binomial, data=kyph)
> tmpglm
Call:
glm(formula = Kyphos ~ ., family = binomial, data = kyph)

Coefficients:
 (Intercept)        Age    NVert   Nstart 
   -2.036932 0.01093048 0.410601 -0.20651

Degrees of Freedom: 81 Total; 77 Residual
Residual Deviance: 61.37993 
> summary(tmpglm)$coef
                  Value Std. Error   t value 
(Intercept) -2.03693225 1.44918287 -1.405573
        Age  0.01093048 0.00644419  1.696175
      NVert  0.41060098 0.22478659  1.826626
     Nstart -0.20651000 0.06768504 -3.051043
> model.matrix(tmpglm)[1:5,]
  (Intercept) Age NVert Nstart 
1           1  71     3      5
2           1 158     3     14
3           1 128     4      5
4           1   2     5      1
5           1   1     4     15
> tmpglm$y[1:5]
[1] 0 0 1 0 0
> anova(tmpglm)
Analysis of Deviance Table

Binomial model

Response: Kyphos

Terms added sequentially (first to last)
       Df Deviance Resid. Df Resid. Dev 
  NULL                    80   83.23447
   Age  1  1.30198        79   81.93249
 NVert  1 10.30593        78   71.62656
Nstart  1 10.24663        77   61.37993

### Let's try to get confidence limits into the picture:
### Recall:
> summary(tmpglm)$coef
                  Value Std. Error   t value 
(Intercept) -2.03693225 1.44918287 -1.405573
        Age  0.01093048 0.00644419  1.696175
      NVert  0.41060098 0.22478659  1.826626
     Nstart -0.20651000 0.06768504 -3.051043

### and note that the Std.Error column can be obtained
### separately as

> sqrt(diag(summary(tmpglm)$cov.unscaled))
[1] 1.44918287 0.00644419 0.22478659 0.06768504

### So the confidence intervals for the coefficient 
###    estimates are

> round(outer(c(1,1),tmpglm$coef,"*") + 1.96*outer(c(-1,1),
    .Last.value,"*"),3)
     (Intercept)    Age  NVert Nstart 
[1,]      -4.877 -0.002 -0.030 -0.339
[2,]       0.803  0.024  0.851 -0.074

### Finally, recalling that (a) SAS was modelling probability 
### that kyph is absent while in Splus we modelled the 
### probability that it is  present, we find:

odds ratio for P(absent | Age=a+1, other vars)
           vs  P(absent | Age=a, other vars)

which is exp(-b_{Age}), falls between: 
> exp( -c(.024,-.002))
[1] 0.9762857 1.0020020

Similarly, odds ratio for P(absent | Nvert=n+1, other vars)
           vs  P(absent | Nvert=n, other vars)
falls between:
> exp(-c(.851,-0.030))
[1] 0.4269877 1.0304545

and odds ratio for P(absent | Nstart=n+1, other vars)
           vs  P(absent | Nstart=n, other vars)
falls between:
> exp(-c(-.074,-0.339))
[1] 1.076807 1.403543

## Thus we have recovered the SAS `adjusted odds ratio'
##   confidence intervals.

------------------------------------------------------

### Now back to Splus to consider logistic 
###    regression residuals.		   4/11/03

> plot(tmpglm$fitted, tmpglm$resid)
### Not very useful, in logistic-regression setting

> summary(abs(tmpglm$resid)/sqrt(tmpglm$fitted*(1-tmpglm$fitted)))
  Min. 1st Qu. Median  Mean 3rd Qu.  Max. 
 3.079   3.461  4.138 6.333   5.554 57.06
> summary(tmpglm$fitted)
     Min. 1st Qu. Median   Mean 3rd Qu.  Max. 
 0.008973 0.05628 0.1215 0.2099  0.3196 0.931
> summary(sqrt(tmpglm$fitted*(1-tmpglm$fitted)))
   Min. 1st Qu. Median   Mean 3rd Qu. Max. 
 0.0943  0.2305 0.3092 0.3218  0.4276  0.5
> summary(summary(tmpglm)$deviance.resid)
   Min. 1st Qu.  Median    Mean 3rd Qu.  Max. 
 -2.312 -0.5484 -0.3632 -0.1328 -0.1659 2.161
> sum(summary(tmpglm)$deviance.resid^2)
[1] 61.37993
> tmpglm$dev
[1] 61.37993
> summary(tmpglm$resid)   ### What are these ????!!!
   Min. 1st Qu. Median    Mean 3rd Qu.  Max. 
 -14.46  -1.162 -1.067 -0.3611  -1.011 10.33
> summary(tmpglm$y-tmpglm$fitted)
   Min. 1st Qu.   Median       Mean  3rd Qu.   Max. 
 -0.931 -0.1396 -0.06382 -1.903e-08 -0.01366 0.9033
> summary((tmpglm$y-tmpglm$fitted)/sqrt(tmpglm$fitted*(1-tmpglm$fitted)))
   Min. 1st Qu.  Median     Mean 3rd Qu.  Max. 
 -3.673 -0.4028 -0.2611 -0.04619 -0.1177 3.056
> sum((tmpglm$y-tmpglm$fitted)^2/(tmpglm$fitted*(1-tmpglm$fitted)))
[1] 70.31516			### OK, reasonable fit !!!

> plot(tmpglm$fitted,(tmpglm$y-tmpglm$fitted)/
      sqrt(tmpglm$fitted*(1-tmpglm$fitted)), xlab="Fitted values",
      ylab="Pearson Residuals", main=
    "Plot of Pearson Residuals vs Fitted Values for Kyphosis Data")
  abline(h=0, lty=5)
  printgraph(file="KyphResd.ps")

> cbind(1:81,kyph, tmpglm$fitted, (tmpglm$y-tmpglm$fitted)/
    sqrt(tmpglm$fitted*(1-tmpglm$fitted)))
         [abs(tmpglm$y-tmpglm$fitted)/sqrt(tmpglm$fitted*
    (1-tmpglm$fitted)) > 3,]
   X1  Kyphos Age NVert Nstart         X3        X4 
23 23 present  96     3     12 0.09674523  3.055557
43 43  absent 143     9      3 0.93099048 -3.672976

### The first of these two standardized residuals is not really
    terribly far out for the largest (in absolute value of 81, 
    but the second one is !!!
> 2*(1-pnorm(c(3.06, 3.67))^81)
[1] 0.17157295 0.01955159

> plot(1:81, (tmpglm$y-tmpglm$fitted)/
      sqrt(tmpglm$fitted*(1-tmpglm$fitted)), xlab="Sequence Number",
      ylab="Pearson Residuals", main=
    "Plot of Pearson Residuals vs SeqNo for Kyphosis Data")
<<< This plot confirms that only the second of those two extreme
      residuals is the real outlier !! The age of 143 months
      is the 13th largest, but far less than the largest (206), and
      in the latter observation also kyphosis is absent.

#### Now let's see what the residuals pictures can tell us about
####     the model without the NVert  covariate:

> tmpkyph <- glm(Kyphos ~ . -NVert, family=binomial, data=kyph)
Coefficients:
 (Intercept)         Age    Nstart 
   0.2250435 0.009507095 -0.237923

Degrees of Freedom: 81 Total; 78 Residual
Residual Deviance: 65.29907 

*** Now try a cumulative-deviance-residual-squared plot, in 
    sequence-number order !

> plot(1:81, cumsum(summary(tmpkyph)$deviance.resid[order(kyph$NVert)]^2),
   xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual",
   main="Plot of Cum Kyphosis Deviance Resid's in NVert Order")

*** Compare this with:

> plot(1:81, cumsum(summary(tmpglm)$deviance.resid[order(kyph$NVert)]^2),
   xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual",
   main="Plot of Cum Kyphosis Deviance ResidSq Versus SeqNo")

*** Surprisingly both of these show more or less the same convex
   rather than linear shape !!


*** Next try with Pearson Residuals:

> PearsRsd1 <- (tmpglm$y-tmpglm$fitted)/
    sqrt(tmpglm$fitted*(1-tmpglm$fitted))
  PearsRsd2 <- (tmpkyph$y-tmpglm$fitted)/
    sqrt(tmpglm$fitted*(1-tmpglm$fitted))

> plot(1:81, cumsum(PearsRsd1[order(kyph$NVert)]),
   xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual",
   main="Plot of Cum Kyphosis Pearson Resid Versus SeqNo")

> plot(1:81, cumsum(PearsRsd2[order(kyph$NVert)]),
   xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual",
   main="Plot of Cum Kyphosis Pearson Resid Versus SeqNo")

### Again, more or less the same pattern exists for the model both
    with and without the NVert predictor !!

### This at least suggests that there is room for additional
    model-fitting, perhaps interaction terms involving NVert !!

> summary(glm(Kyphos ~ . +NVert*Age, family=binomial, data=kyph))$coef   
                Value  Std. Error   t value 
(Intercept) -5.025626202 2.778674087 -1.808642
        Age  0.038532951 0.022034418  1.748762
      NVert  1.035135683 0.534699547  1.935920
     Nstart -0.215785325 0.070369522 -3.066460
  NVert:Age -0.005904498 0.004327807 -1.364316
Coefficients:
 (Intercept)        Age    NVert     Nstart    NVert:Age 
   -5.025626 0.03853295 1.035136 -0.2157853 -0.005904498

Degrees of Freedom: 81 Total; 76 Residual
Residual Deviance: 59.45309 
	 ### Compare with final Residual Deviance 61.38 before.
  ##Not too persuasive.