Further Examples in SAS for Stat 770, 3/27/03 ============================================== Consider "kyphosis" dataset available on the web-page for Stat 798C under "kyphosis.gz" at : http://www.math.umd.edu/~evs/s798c/Data/ (Saved as ASCII file "kyphosis" including header, on cluster machine) libname SASstf "SASproj"; data SASstf.kyph (drop=labels1-labels5); infile "kyphosis"; if _N_=1 then input labels1-labels5 $ ; else input Id Kyphos $ Age NVert Nstart ; if Id ge 1 ; run; ## Now try to work with logistic regression model ... NOTE: The data set SASSTF.KYPH has 81 observations and 5 variables. (which is correct) ### Now for Splus version: maclaurin% cp /home1/evspublic_html/s798c/Data/kyphosis.gz . > kyph <- read.table("../kyphosis",skip=9)[,2:5] > names(kyph) <- c("Kyphos","Age","NVert","Nstart") > kyph[1:6,] Kyphos Age NVert Nstart 1 absent 71 3 5 2 absent 158 3 14 3 present 128 4 5 4 absent 2 5 1 5 absent 1 4 15 6 absent 1 2 16 ### Back to SAS libname SASstf "SASproj"; data kyph; set SASstf.kyph ; run; proc logistic data=kyph; title 'Kyphosis Logistic Regression'; model kyphos = Age Nvert Nstart / Risklimits; run; Kyphosis Logistic Regression The LOGISTIC Procedure Response Profile Ordered Total Value Kyphos Frequency 1 absent 64 2 present 17 Probability modeled is Kyphos='absent'. Model Fit Statistics Intercept Intercept and Criterion Only Covariates AIC 85.234 69.380 SC 87.629 78.958 -2 Log L 83.234 61.380 Testing Global Null Hypothesis: BETA=0 ... Test Chi-Sq DF Pr > ChiSq Likelihood Ratio 21.8545 3 <.0001 Score 20.8548 3 0.0001 Wald 14.1123 3 0.0028 Analysis of Maximum Likelihood Estimates Wald Standard Parameter DF Estimate Error Chi-Sq Pr > ChiSq Intercept 1 2.0369 1.4496 1.9744 0.1600 Age 1 -0.0109 0.00645 2.8748 0.0900 NVert 1 -0.4106 0.2249 3.3340 0.0679 Nstart 1 0.2065 0.0677 9.3045 0.0023 Wald Confidence Interval for Adjusted Odds Ratios Effect Unit Estimate 95% Confidence Limits Age 1.0000 0.989 0.977 1.002 NVert 1.0000 0.663 0.427 1.031 Nstart 1.0000 1.229 1.077 1.404 ### Back to Splus > unlist(lapply(kyph,is.factor)) Kyphos Age NVert Nstart T F F F > tmpglm <- glm(Kyphos ~ . , family=binomial, data=kyph) > tmpglm Call: glm(formula = Kyphos ~ ., family = binomial, data = kyph) Coefficients: (Intercept) Age NVert Nstart -2.036932 0.01093048 0.410601 -0.20651 Degrees of Freedom: 81 Total; 77 Residual Residual Deviance: 61.37993 > summary(tmpglm)$coef Value Std. Error t value (Intercept) -2.03693225 1.44918287 -1.405573 Age 0.01093048 0.00644419 1.696175 NVert 0.41060098 0.22478659 1.826626 Nstart -0.20651000 0.06768504 -3.051043 > model.matrix(tmpglm)[1:5,] (Intercept) Age NVert Nstart 1 1 71 3 5 2 1 158 3 14 3 1 128 4 5 4 1 2 5 1 5 1 1 4 15 > tmpglm$y[1:5] [1] 0 0 1 0 0 > anova(tmpglm) Analysis of Deviance Table Binomial model Response: Kyphos Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 80 83.23447 Age 1 1.30198 79 81.93249 NVert 1 10.30593 78 71.62656 Nstart 1 10.24663 77 61.37993 ### Let's try to get confidence limits into the picture: ### Recall: > summary(tmpglm)$coef Value Std. Error t value (Intercept) -2.03693225 1.44918287 -1.405573 Age 0.01093048 0.00644419 1.696175 NVert 0.41060098 0.22478659 1.826626 Nstart -0.20651000 0.06768504 -3.051043 ### and note that the Std.Error column can be obtained ### separately as > sqrt(diag(summary(tmpglm)$cov.unscaled)) [1] 1.44918287 0.00644419 0.22478659 0.06768504 ### So the confidence intervals for the coefficient ### estimates are > round(outer(c(1,1),tmpglm$coef,"*") + 1.96*outer(c(-1,1), .Last.value,"*"),3) (Intercept) Age NVert Nstart [1,] -4.877 -0.002 -0.030 -0.339 [2,] 0.803 0.024 0.851 -0.074 ### Finally, recalling that (a) SAS was modelling probability ### that kyph is absent while in Splus we modelled the ### probability that it is present, we find: odds ratio for P(absent | Age=a+1, other vars) vs P(absent | Age=a, other vars) which is exp(-b_{Age}), falls between: > exp( -c(.024,-.002)) [1] 0.9762857 1.0020020 Similarly, odds ratio for P(absent | Nvert=n+1, other vars) vs P(absent | Nvert=n, other vars) falls between: > exp(-c(.851,-0.030)) [1] 0.4269877 1.0304545 and odds ratio for P(absent | Nstart=n+1, other vars) vs P(absent | Nstart=n, other vars) falls between: > exp(-c(-.074,-0.339)) [1] 1.076807 1.403543 ## Thus we have recovered the SAS `adjusted odds ratio' ## confidence intervals. ------------------------------------------------------ ### Now back to Splus to consider logistic ### regression residuals. 4/11/03 > plot(tmpglm$fitted, tmpglm$resid) ### Not very useful, in logistic-regression setting > summary(abs(tmpglm$resid)/sqrt(tmpglm$fitted*(1-tmpglm$fitted))) Min. 1st Qu. Median Mean 3rd Qu. Max. 3.079 3.461 4.138 6.333 5.554 57.06 > summary(tmpglm$fitted) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.008973 0.05628 0.1215 0.2099 0.3196 0.931 > summary(sqrt(tmpglm$fitted*(1-tmpglm$fitted))) Min. 1st Qu. Median Mean 3rd Qu. Max. 0.0943 0.2305 0.3092 0.3218 0.4276 0.5 > summary(summary(tmpglm)$deviance.resid) Min. 1st Qu. Median Mean 3rd Qu. Max. -2.312 -0.5484 -0.3632 -0.1328 -0.1659 2.161 > sum(summary(tmpglm)$deviance.resid^2) [1] 61.37993 > tmpglm$dev [1] 61.37993 > summary(tmpglm$resid) ### What are these ????!!! Min. 1st Qu. Median Mean 3rd Qu. Max. -14.46 -1.162 -1.067 -0.3611 -1.011 10.33 > summary(tmpglm$y-tmpglm$fitted) Min. 1st Qu. Median Mean 3rd Qu. Max. -0.931 -0.1396 -0.06382 -1.903e-08 -0.01366 0.9033 > summary((tmpglm$y-tmpglm$fitted)/sqrt(tmpglm$fitted*(1-tmpglm$fitted))) Min. 1st Qu. Median Mean 3rd Qu. Max. -3.673 -0.4028 -0.2611 -0.04619 -0.1177 3.056 > sum((tmpglm$y-tmpglm$fitted)^2/(tmpglm$fitted*(1-tmpglm$fitted))) [1] 70.31516 ### OK, reasonable fit !!! > plot(tmpglm$fitted,(tmpglm$y-tmpglm$fitted)/ sqrt(tmpglm$fitted*(1-tmpglm$fitted)), xlab="Fitted values", ylab="Pearson Residuals", main= "Plot of Pearson Residuals vs Fitted Values for Kyphosis Data") abline(h=0, lty=5) printgraph(file="KyphResd.ps") > cbind(1:81,kyph, tmpglm$fitted, (tmpglm$y-tmpglm$fitted)/ sqrt(tmpglm$fitted*(1-tmpglm$fitted))) [abs(tmpglm$y-tmpglm$fitted)/sqrt(tmpglm$fitted* (1-tmpglm$fitted)) > 3,] X1 Kyphos Age NVert Nstart X3 X4 23 23 present 96 3 12 0.09674523 3.055557 43 43 absent 143 9 3 0.93099048 -3.672976 ### The first of these two standardized residuals is not really terribly far out for the largest (in absolute value of 81, but the second one is !!! > 2*(1-pnorm(c(3.06, 3.67))^81) [1] 0.17157295 0.01955159 > plot(1:81, (tmpglm$y-tmpglm$fitted)/ sqrt(tmpglm$fitted*(1-tmpglm$fitted)), xlab="Sequence Number", ylab="Pearson Residuals", main= "Plot of Pearson Residuals vs SeqNo for Kyphosis Data") <<< This plot confirms that only the second of those two extreme residuals is the real outlier !! The age of 143 months is the 13th largest, but far less than the largest (206), and in the latter observation also kyphosis is absent. #### Now let's see what the residuals pictures can tell us about #### the model without the NVert covariate: > tmpkyph <- glm(Kyphos ~ . -NVert, family=binomial, data=kyph) Coefficients: (Intercept) Age Nstart 0.2250435 0.009507095 -0.237923 Degrees of Freedom: 81 Total; 78 Residual Residual Deviance: 65.29907 *** Now try a cumulative-deviance-residual-squared plot, in sequence-number order ! > plot(1:81, cumsum(summary(tmpkyph)$deviance.resid[order(kyph$NVert)]^2), xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual", main="Plot of Cum Kyphosis Deviance Resid's in NVert Order") *** Compare this with: > plot(1:81, cumsum(summary(tmpglm)$deviance.resid[order(kyph$NVert)]^2), xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual", main="Plot of Cum Kyphosis Deviance ResidSq Versus SeqNo") *** Surprisingly both of these show more or less the same convex rather than linear shape !! *** Next try with Pearson Residuals: > PearsRsd1 <- (tmpglm$y-tmpglm$fitted)/ sqrt(tmpglm$fitted*(1-tmpglm$fitted)) PearsRsd2 <- (tmpkyph$y-tmpglm$fitted)/ sqrt(tmpglm$fitted*(1-tmpglm$fitted)) > plot(1:81, cumsum(PearsRsd1[order(kyph$NVert)]), xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual", main="Plot of Cum Kyphosis Pearson Resid Versus SeqNo") > plot(1:81, cumsum(PearsRsd2[order(kyph$NVert)]), xlab="NVert Sorted Sequence Number", ylab="Cum Deviance Residual", main="Plot of Cum Kyphosis Pearson Resid Versus SeqNo") ### Again, more or less the same pattern exists for the model both with and without the NVert predictor !! ### This at least suggests that there is room for additional model-fitting, perhaps interaction terms involving NVert !! > summary(glm(Kyphos ~ . +NVert*Age, family=binomial, data=kyph))$coef Value Std. Error t value (Intercept) -5.025626202 2.778674087 -1.808642 Age 0.038532951 0.022034418 1.748762 NVert 1.035135683 0.534699547 1.935920 Nstart -0.215785325 0.070369522 -3.066460 NVert:Age -0.005904498 0.004327807 -1.364316 Coefficients: (Intercept) Age NVert Nstart NVert:Age -5.025626 0.03853295 1.035136 -0.2157853 -0.005904498 Degrees of Freedom: 81 Total; 76 Residual Residual Deviance: 59.45309 ### Compare with final Residual Deviance 61.38 before. ##Not too persuasive.