SOLUTION TO HOMEWORK 5 FROM STAT 770 RELATING TO GLM DATA ANALYSES. ================================================== (Assignment can be found on Stat 770 course web-page, http://www.math.umd.edu/~evs/s770 , cf. also http://www.math.umd.edu/~evs/s770/SASlogs/SASlog3.txt First, the standardized coefficients and analysis of deviance table found in Splus for the kyphosis logistic regression (cf SASlog3.txt file indicated above) were: > summary(glm(Kyphosis ~ . , family=binomial, data=kyphosis))$coef Value Std. Error t value (Intercept) -2.03693225 1.44918287 -1.405573 Age 0.01093048 0.00644419 1.696175 Number 0.41060098 0.22478659 1.826626 Start -0.20651000 0.06768504 -3.051043 > anova(glm(Kyphosis ~ . , family=binomial, data=kyphosis)) Analysis of Deviance Table, Binomial model, Response: Kyphosis Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 80 83.23447 Age 1 1.30198 79 81.93249 Number 1 10.30593 78 71.62656 Start 1 10.24663 77 61.37993 Now for the analogous output in probit regression: > summary(glm(Kyphosis ~ . , family=binomial(link=probit), data=kyphosis))$coef Value Std. Error t value (Intercept) -1.063353291 0.809886949 -1.312965 Age 0.005984768 0.003507093 1.706475 Number 0.215179016 0.121687912 1.768286 Start -0.120214682 0.038512786 -3.121423 > anova(glm(Kyphosis ~ . , family=binomial(link=probit), data=kyphosis)) Analysis of Deviance Table, Binomial model, Response: Kyphosis Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 80 83.23447 Age 1 1.41079 79 81.82369 Number 1 10.68480 78 71.13889 Start 1 10.05939 77 61.07950 ### Now let's re-cap it using PROC GENMOD in SAS. proc logistic data=kyph; title 'Kyphosis Logistic Regression'; model kyphos(EVENT='PRESENT') = Age Nvert Nstart / selection=Forward SLentry=.9; /* Default is link=LOGIT */ run; ... Response Profile Ordered Total Value Kyphos Frequency 1 absent 64 2 present 17 Probability modeled is Kyphos='present'. ... Forward Selection Procedure Step 1. Effect Nstart entered: ... Model Fit Statistics Criterion Int. Only Int. & Covs. -2 Log L 83.234 68.072 ... Step 2. Effect NVert entered: ... Criterion Int. Only Int. & Covs. -2 Log L 83.234 64.536 Step 3. Effect Age entered: ... Criterion Int. Only Int.& Covs. -2 Log L 83.234 61.380 ... NOTE: All effects have been entered into the model. ... Analysis of Maximum Likelihood Estimates Parameter DF Estimate St.Err. Wald X-Square P > ChiSq Intercept 1 -2.0369 1.4496 1.974 0.1600 Age 1 0.0109 0.00645 2.875 0.0900 NVert 1 0.4106 0.2249 3.334 0.0679 Nstart 1 -0.2065 0.0677 9.304 0.0023 #### To do the analogous thing with PROBIT regression in SAS, the only change is to insert the model-option LINK=PROBIT after / in the MODEL statement. All of the (edited) results are exactly as in the Splus glm function with `method' anova applied. *** NOTE: as the students in the class found, there is a more direct way in SAS to do the Analysis of Deviance Tables, namely: to apply PROC GENMOD with dist = whatever (binomial in this example) and then in the MODEL statement use the option / TYPE1 ; Recall that in linear models, the Type 1 sum of squares is the set of decrements to sum of squared residuals when one successively enters the terms specified as predictors in the MODEL statement; here, the analogue is to create the decrements to deviance when the predectors are sequentially entered ... (B) Problem 4.11 The SAS code is given at the Agresti website, under the `Primary' data link. After inputting data under data-step and calling it crab and modifying Agresti's SAS code slightly, we get: proc genmod; model satell = width / dist=poi link=log ; proc genmod; model satell = width / dist=negbin link=log ; run; (Poisson/Log) Model Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 171 567.8786 3.3209 Log Likelihood 68.4463 Analysis Of Parameter Estimates Parameter DF Estimate Std.Err Limits Chi-Sq P>ChiSq Intercept 1 -3.3048 0.5422 -4.368,-2.242 37.14 <.0001 width 1 0.1640 0.0200 0.125, 0.203 67.51 <.0001 Scale 0 1.0000 0.0000 (fixed) (NegBin/Log) Model Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 171 195.8112 1.1451 Log Likelihood 154.3889 Analysis Of Parameter Estimates Parameter DF Estimate Std.Err Limits Chi-Sq P>ChiSq Intercept 1 -4.0525 1.2642 -6.530,-1.575 10.28 0.0013 width 1 0.1921 0.0476 0.099, 0.285 16.27 <.0001 Dispersion 1 1.1055 0.1971 0.780, 1.568 NOTE that the models are very different, so that the Intercept and width coefficients need NOT correspond exactly, but as the book points out in problem 4.11, the std.err's of the width- coefficient are VERY different! The main point is this: NegBin has a higher-dimensional parameter, one component of which (1+mu/k) as the same interpretation as SCALE in Poisson log- linear GLM case. We saw in Problem 4.7 (with a different predictor, Weight instead of Width which is what we consider here), that the Poisson log-linear model for Satell is highly over-dispersed. Let us repeat that analysis here. An R model-fit to these data give > Poisfit$call glm(formula = Satell ~ Width, family = poisson, data = Crabs) > sum((Crabs$Satell-Poisfit$fit)^2/Poisfit$fit)/171 [1] 3.182205 ### Thus we have just the same degree of overdispersion as before! The phi-hat estimate is 3.182, which means that all coeff SE's must be scaled up by sqrt(3.1822) = 1.784, and since 1.784 * 0.02 = 0.0357, we find that most of the difference in SE's is due to the fact that the negbin model corrects for over-dispersion while the log-linear Poisson raw output does not. Perhaps we ought not expect a closer correspondence than this (.0357 vs .048) between the two models' SE estimates for beta, especially because the GLM for Poisson treats Var(Y_i)/E(Y_i) as having the constant value phi for all i, while the negative binomial model says Var(Y_i)/E(Y_i) = (mu_i + mu_i^2/k)/mu_i = 1 + mu_i/k varies with i. (C) Problem 5.1. (a) > plogis(-3.7771+ 8*.1449) [1] 0.06799525 (b) > plogis(-3.7771+ 26*.1449) [1] 0.497575 (c) Rate of change (I assume, per unit LI) means > plogis(-3.7771+ 9*.1449)-plogis(-3.7771+ 8*.1449) [1] 0.009777536 ### for a full unit change, but ### using dlogis as the derivative of plogis > dlogis(-3.7771+ 8*.1449)*.1449 [1] 0.009182588 ### for an `infinitesimal' LI change. ## Similarly, the changes from LI=26 are: > c(plogis(-3.7771+ 27*.1449)-plogis(-3.7771+ 26*.1449), .1449*dlogis(-3.7771+ 26*.1449)) [1] 0.03617359 0.03622415 (d) plogis(-3.7771+ c(14,28)*.1449) [1] 0.1482365 0.5695707 (e) Log-Odds add .1449, so Odds multiply by exp(.1449)= 1.1559 (f) Odds ratio is exp(beta), which along with the CI enpts are given (transformation of ordinary Wald CI) by: > exp(.1449+ 1.96*c(0,-1,1)*.0593) [1] 1.155924 1.029087 1.298394 (g) Wald test for (LI-) effect could be done at alpha=.05 by seeing whether 1 falls into the confidence interval. More precisely, the p-value is obtained by: > 2*(1-pnorm(abs(.1449)/.0593)) ### = 0.0145 (h) For LRT of same effect: 34.372-26.073 = 8.2988, which has p-val .004 vs 1 df chi-sq (i) CI for logit(pi) is > -.37771 + 8*.1449 + 1.96*c(-1,1)*(1.900616 - 2*8*.07653+ 8^2*.003521) ## = (-4.479,-0.757) [1](-1.079,2.642) ## So taking plogis gives reported CI (0.0112,0.3193)