Log of Practice SAS Scripts for Categorical & GLM Analyses ================================================ 3/18/03 ff From Unix prompt: sas & (after telnet and xhost sequence on cluster) /* (I) Begin with PROC FREQ on simple contingency tables */ libname SASstf "SASproj" ; data SASstf.temptabl; /* Creates SASproj/temptabl.sas7bdat */ input row col count; cards; 1 1 260 2 1 107 1 2 145 2 2 98 ; run; proc sort data=sasstf.temptabl; by row col; /* usually good practice whenever providing outputs "by" variables. */ proc FREQ data=sasstf.temptabl; title "Row by Column Data Counts"; tables row*col / nocum nopercent ; /* Without these options, get same info in table along with junk re-coding into percents and cum percents */ weight count; /* organizes into table with entries equal to "count" variable: this construction applies to general IxJ and also Multi-way tables */ run; OUTPUT-------------------------------------------------OUTPUT Row by Column Data Counts 10:33 Tuesday, March 18, 2003 The FREQ Procedure Table of row by col row col Frequency| Row Pct | Col Pct | 1| 2| Total ---------+--------+--------+ 1 | 260 | 145 | 405 | 64.20 | 35.80 | | 70.84 | 59.67 | ---------+--------+--------+ 2 | 107 | 98 | 205 | 52.20 | 47.80 | | 29.16 | 40.33 | ---------+--------+--------+ Total 367 243 610 /* Now include analysis options */ proc FREQ data=sasstf.temptabl; title "Chi-Squared Analysis"; weight count; tables row*col / nocum nopercent chisq ; run; OUTPUT-------------------------------------------------OUTPUT Chi-Squared Analysis The FREQ Procedure Statistics for Table of row by col Statistic DF Value Prob ------------------------------------------------------ Chi-Square 1 8.1809 0.0042 Likelihood Ratio Chi-Square 1 8.1201 0.0044 Continuity Adj. Chi-Square 1 7.6878 0.0056 ... Fisher's Exact Test ---------------------------------- Cell (1,1) Frequency (F) 260 Left-sided Pr <= F 0.9983 Right-sided Pr >= F 0.0029 Table Probability (P) 0.0012 Two-sided Pr <= P 0.0051 Sample Size = 610 /* So far these analyses are reproduced in Splus as follows: > ttbl <- matrix(c(260,107,145,98), ncol=2) > etbl <- outer(apply(ttbl,1,sum),apply(ttbl,2,sum),"*")/sum(ttbl) > chsq <- sum((ttbl-etbl)^2/etbl) > CAchsq <- sum((abs(ttbl-etbl)-0.5)^2/etbl) ### cont. corrected > LRchsq <- 2*sum(ttbl*log(ttbl/etbl)) ### = Gsq stat > c(chsq, CAchsq, LRchsq) [1] 8.180940 7.687814 8.120084 ### pval's wrt chisq 1 df ### Now for the Fisher exact test probabilities: > round(c(phyper(260, 405,205,367), 1-phyper(259, 405,205,367), dhyper(260, 405,205,367), 1-phyper(259, 405,205,367)+phyper(227, 405,205,367)),5) [1] 0.99835 0.00286 0.00120 0.00505 This ends our Splus interlude */ /* Now for the options giving odds ratio, risk ratio, CI etc. */ proc FREQ data=sasstf.temptabl; title "Other Analyses"; weight count; tables row*col / nocum nopercent cmh1 fisher Riskdiff; run; OUTPUT-------------------------------------------------OUTPUT Other Analyses The FREQ Procedure Summary Statistics for row by col Estimates of the Common Relative Risk (Row1/Row2) Type of Study Method Value 95% Conf. Limits ------------------------------------------------------------------------- Case-Control Mantel-Haenszel 1.6423 1.1676 2.3099 (Odds Ratio) Logit 1.6423 1.1676 2.3099 Cohort Mantel-Haenszel 1.2300 1.0588 1.4288 (Col1 Risk) Logit 1.2300 1.0588 1.4288 Cohort Mantel-Haenszel 0.7489 0.6171 0.9089 (Col2 Risk) Logit 0.7489 0.6171 0.9089 ... Column 1 Risk Estimates (Asympt) 95% (Exact) 95% Risk ASE Conf Limits Conf Limits ------------------------------------------------------------- Row 1 0.6420 0.0238 0.5953 0.6887 0.5931 0.6887 Row 2 0.5220 0.0349 0.4536 0.5903 0.4512 0.5920 Total 0.6016 0.0198 0.5628 0.6405 0.5616 0.6407 Diff 0.1200 0.0422 0.0372 0.2028 Difference is (Row 1 - Row 2) /* Splus notes: > ttbl[1,1]*ttbl[2,2]/(ttbl[1,2]*ttbl[2,1]) [1] 1.642282 > ttbl[1,1]*sum(ttbl[2,])/(ttbl[2,1]*sum(ttbl[1,])) [1] 1.229953 ### Conf Limits are Wald CI's for log-odds-ratio and log RR, e.g.: > exp(log(1.642282)+1.96*sqrt(sum(1/ttbl))*c(-1,1)) [1] 1.167604 2.309937 > exp(log(1.229953)+1.96*sqrt((1-ttbl[1,1]/sum(ttbl[1,]))/ttbl[1,1] + (1-ttbl[2,1]/sum(ttbl[2,]))/ttbl[2,1])*c(-1,1)) [1] 1.058795 1.428779 ### End Splus */ /* Also available in SAS: EXACT distributional calculation or Monte Carlo approximations for various statistics. */ proc FREQ data=sasstf.temptabl; title "Other Analyses"; weight count; tables row*col / nocum nopercent; EXACT CHISQ OR / MC ; /* default 10000 Monte Carlo iterations!! which takes a little time to compute !! */ run; OUTPUT-------------------------------------------------OUTPUT Pearson Chi-Square Test ---------------------------------- Chi-Square 8.1809 DF 1 Asymptotic Pr > ChiSq 0.0042 Monte Carlo Estimate for the Exact Test Pr >= ChiSq 0.0044 99% Lower Conf Limit 0.0027 99% Upper Conf Limit 0.0061 Number of Samples 10000 Initial Seed 47700 ... Odds Ratio (Case-Control Study) ----------------------------------- Odds Ratio 1.6423 Asymptotic Conf Limits 95% Lower Conf Limit 1.1676 95% Upper Conf Limit 2.3099 Exact Conf Limits 95% Lower Conf Limit 1.1500 95% Upper Conf Limit 2.3427 >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> /* (II) Now proceed with logistic and probit and Poisson regressions ... The relevant SAS procedures are: LOGISTIC, PROBIT, GENMOD . We explore only the GENMOD syntax now: later we will go to PROC LOGISTIC for stepwise model selection. */ %== DATA at http://www.math.umd.edu/~evs/s430/Data/pbcdata.asc ============= OBS IDNUM DTH EVTTIME TREATGP LOGBILI AGEVAR CIRRH CCHOL ALBUMIN 1 180 0 0.00 1 1.00000 36.598 0 0 37.0 2 188 0 0.00 0 1.36173 29.964 0 0 35.0 ... /* Using saved ASCII file "pbcdata.asc" in log-in directory */ data SASstf.PBC (drop = labels1-labels10); infile "pbcdata.asc" ; if _N_ = 1 then input labels1-labels10 $ ; /* Deals with header-line */ else INPUT obs idnum dth evttime treatgp logbili agevar cirrh cchol albumin ; if obs ne . ; /* Skips past blank line */ run; /* Let's start by fitting a logistic regression model with EVTTIME > 41 as response, restricted to individuals with DTH=1, in terms of predictors TREATGP and LOGBILI and AGEVAR */ data SASstf.DthPBC (drop = Dth EVTTIME); set SASstf.PBC (keep= Dth EVTTIME LOGBILI AGEVAR TREATGP rename = (AGEVAR=AGEYRS)); if Dth=1 ; if EVTTIME GE 41 then Surv41=1; else Surv41=0; run; Proc Genmod data=SASstf.DthPBC; class Treatgp; /* used for categorical regressors, not nec here */ model Surv41 = logbili ageyrs treatgp / dist = binomial link = logit; run; OUTPUT-------------------------------------------------OUTPUT Logistic Regression Using Genmod The GENMOD Procedure Model Information ... Class Level Information Class Levels Values treatgp 2 0 1 Response Profile Ordered Total Value Surv41 Frequency 1 0 61 2 1 42 PROC GENMOD is modeling the probability that Surv41='0'. One way to change this to model the probability that Surv41='1' is to specify the DESCENDING option in the PROC statement. Criteria For Assessing Goodness Of Fit Criterion DF Value Value/DF Deviance 99 117.9883 1.1918 Scaled Deviance 99 117.9883 1.1918 Pearson Chi-Square 99 96.9904 0.9797 Scaled Pearson X2 99 96.9904 0.9797 Log Likelihood -58.9941 Algorithm converged. ... Analysis Of Parameter Estimates Standard Wald 95% Conf Chi-sq Parameter DF Estimate Error Limits P > ChiSq Intercept 1 -4.5095 1.2459 -6.9515 -2.0676 13.10 0.0003 logbili 1 2.7456 0.6842 1.4046 4.0865 16.10 <.0001 AGEYRS 1 0.0071 0.0039 -0.0006 0.0148 3.31 0.0690 treatgp 0 1 -0.6894 0.4745 -1.6195 0.2406 2.11 0.1463 treatgp 1 0 0.0000 0.0000 0.0000 0.0000 . . Scale 0 1.0000 0.0000 1.0000 1.0000 NOTE: The scale parameter was held fixed. ## Splus comparison ### > pbcdat <- read.table(file="/home/slud0001/SASproj/pbcdata.asc", header=T) > pbcdth <- pbcdat[pbcdat[,"DTH"]==1,4:7] > pbcdth[,1] <- as.numeric(pbcdt[,1] <= 41) > names(pbcdth)[1] <- "Surv41" > tmpglm <- glm(cbind(Surv41,1-Surv41) ~ ., data=pbcdth, family=binomial) > summary(tmpglm) Call: glm(formula = cbind(Surv41, 1 - Surv41) ~ TREATGP + LOGBILI + AGEVAR, family = binomial, data = pbcdth) Deviance Residuals: Min 1Q Median 3Q Max -1.796847 -1.101684 0.5154959 0.8971313 1.733077 Coefficients: Value Std. Error t value (Intercept) -5.198060233 1.387271601 -3.746967 TREATGP 0.689316661 0.472556817 1.458696 LOGBILI 2.745092979 0.678464250 4.046039 AGEVAR 0.007141237 0.003904377 1.829034 (Dispersion Parameter for Binomial family taken to be 1 ) Null Deviance: 139.2633 on 102 degrees of freedom Residual Deviance: 117.9883 on 99 degrees of freedom Number of Fisher Scoring Iterations: 3 Correlation of Coefficients: (Intercept) TREATGP LOGBILI TREATGP -0.4718359 LOGBILI -0.9613930 0.3374626 AGEVAR -0.4943981 0.1203173 0.3593797 > summary(glm(cbind(Surv41,1-Surv41) ~ ., data=pbcdth, family= binomial(link=probit)))[c(3,9)] $coefficients: Value Std. Error t value (Intercept) -3.221704436 0.807233342 -3.991045 TREATGP 0.425427090 0.283750497 1.499300 LOGBILI 1.702723936 0.392264360 4.340756 AGEVAR 0.004388439 0.002307833 1.901541 $deviance: [1] 117.5646 ... Null deviance same as logistic-regression case /* Back to SAS */ Proc Genmod data=SASstf.DthPBC DESCENDING; title "Probit Regression using GenMod"; class Treatgp; /* used for categorical regressors, not nec here */ model Surv41 = logbili ageyrs treatgp / dist = binomial link = probit; run; Log Likelihood -58.7823 (vs -58.9941 before) Probit Regression Using GenMod Analysis Of Parameter Estimates Standard Wald 95% Conf Chi- Parameter DF Estimate Error Limits Square P > ChiSq Intercept 1 3.2220 0.8235 1.6080 4.8359 15.31 <.0001 logbili 1 -1.7029 0.4014 -2.4896 -0.9162 18.00 <.0001 AGEYRS 1 -0.0044 0.0023 -0.0088 0.0000 3.76 0.0526 treatgp 1 -0.4255 0.2839 -0.9819 0.1310 2.25 0.1340