DATA ANALYSIS LOG IN R USING GLM =================================== (I) (10/28/09) Begin by getting syntax for different "family" and "link" choices. ### Start with basic syntax for "binomial" data, i.e. ### data which comes in batches of "successes", and ### "failures" under each of a number of experimental ### settings summarized through predictor "covariates" ### here written as X1 and X2: > X1 = runif(120); X2 = rpois(120,3) ntrials = rep(c(5,6,3), 40) success = rbinom( 120, ntrials, plogis(-1 + 0.6*X1 + 0.1*X2) ) failure = ntrials - success simlogis = data.frame( X1, X2, success, failure ) ### So the data looks like X1 X2 success failure 1 0.2120602 3 3 2 2 0.7260127 1 2 4 3 0.1604437 3 1 2 4 0.4431811 4 0 5 5 0.8397253 5 5 1 ... ### "Logistic regression" should give intercept -1, and ### coefs for X1 and X2 respectively 0.6, 0.1 > fitB0 = glm ( cbind(success, failure) ~ X1 + X2, family=binomial, data=simlogis ) ### To display coef's, you get much more info from ### summary(fitB0)$coef than from fitB0$coef > round( summary(fitB0)$coef, 4) Estimate Std. Error z value Pr(>|z|) (Intercept) -1.1342 0.2400 -4.7259 0.0000 X1 0.8906 0.3188 2.7940 0.0052 X2 0.0998 0.0566 1.7627 0.0780 #### NOT BAD, but not precise as you can see from Std.Error ### Here the model was the correct one; but for "probit" ### regression which puts success probability equal to ### pnorm( a + b1*X1 + b2*X2 ) , an incorrect model, ### get a, b1 and b2 nearly the same after rescaling, ### since the scale of the logistic is different ### from the scale of the normal by the factor (ratio of sd's) > sqrt( 1 / integrate(function(x) x^2*dlogis(x), -Inf, Inf)$val ) [1] 0.5513289 ### Now fit "probit" regression: > fitB0.probit = glm ( cbind(success, failure) ~ X1 + X2, family=binomial(link="probit"), data=simlogis ) > round( rbind(probit= fitB0.probit$coef, logit.rescal= fitB0$coef*0.5513), 4) (Intercept) X1 X2 probit -0.7045 0.5487 0.0628 logit.rescal -0.6253 0.4910 0.0550 ### Two sets of coef's close after rescaling ------------------------------------------------------------- (II) Binomial data often comes in binary form, with ntrials = 1. ### Use data "rats" from "survival" package in R. ### Data on 150 rats contain identifying "litter", ### "rx" (indicator of injection of drug after initial ### administration of carcinogen), time in days on ### study (ignored initally, and "status" which is ### indicator of tumor, our binary response variable. > library(survival) > fitB1 = glm(cbind(status,1-status) ~ rx, family=binomial, data = rats) ### NOTE you can use the column headers as variable names ### if you specify the data-frame using "data=" > fitB1 Call: glm(formula = cbind(status, 1 - status) ~ rx, family = binomial, data = rats) Coefficients: (Intercept) rx -1.450 1.127 Degrees of Freedom: 149 Total (i.e. Null); 148 Residual Null Deviance: 174 Residual Deviance: 165.3 AIC: 169.3 > summary(fitB1)$coef Estimate Std. Error z value Pr(>|z|) (Intercept) -1.450010 0.2549063 -5.688404 1.282324e-08 rx 1.127237 0.3835089 2.939272 3.289845e-03 ### Coefficients fitted by MLE, link="logit" is default for ### binomial-family data: this is logistic regression ## Std.error found as sqrt of diagonal in variance: > sqrt(diag(summary(fitB1)$cov.scaled)) (Intercept) rx 0.2549063 0.3835089 ### "Deviance" = "Residual Deviance" = -2*logLik > c(fitB1$deviance, -2*logLik(fitB1)) [1] 165.2738 165.2738 ### "Null.deviance" is -2 times the logLik ### for the same model with only a const term > c(fitB1$null.dev, -2*logLik(update(fitB1, formula = .~ 1)) [1] 173.9746 173.9746 ### NOTE the use of the "update" function to refit a model ### of the same type as the one previously done: in the ### model formula, the left-hand side term is a . ### to indicate that it is the same as before. We have ### changed the right-hand side from rx (which automatically ### included intercept along with rx predictor) ### to one with "1" or intercept alone. ### Next use "update" to change the fitB1 model not in its ### model formula but in its specified link within ### the binomial "family". > fitB2 = update(fitB1, family=binomial(link="probit")) rbind(logit= fitB1$coef, probit= fitB2$coef, rescal.probit = fitB2$coef/0.5513) (Intercept) rx logit -1.4500102 1.1272368 probit -0.8778963 0.6760028 rescal.probit -1.5924112 1.2261977 ### RECALL THAT WE USE DEVIANCES AND DIFFERENCES BETWEEN ### THEM BECAUSE WILKS' THEOREM says that 2 times the ### difference between logLik for a model with p extra ### parameter dimensions versus logLik for a base model ### is equal to approximately a chi-square p df variate ### when the base model is the true one ! ### LET's use this idea to examine the quality of the model ### with predictor log(time) in the model along with rx ### NOTE that we must put the expression log(time) ### within I( ) to have it evaluated and constructed as ### a new predictor within the glm fitting function > fitB3 = update(fitB1, formula= . ~ . + I(log(time))) ### the "data" is the same now as in fitB1, so "time" ### is the column in the data-frame, and a log(time) ### predictor is created under "glm" ### It was reasonable to use "time" as a predictor, partly ### because the range of times is not too different for rats who ### generate tumors, so "time" is not really a response variable. > summary(rats$time[rats$status==1]) Min. 1st Qu. Median Mean 3rd Qu. Max. 34.00 66.75 80.00 77.28 92.50 104.00 > summary(rats$time[rats$status==0]) Min. 1st Qu. Median Mean 3rd Qu. Max. 45.00 83.50 104.00 93.85 104.00 104.00 > cbind(rats[1:10,], model.matrix(fitB3)[1:10,]) litter rx time status (Intercept) rx I(log(time)) 1 1 1 101 0 1 1 4.615121 2 1 0 49 1 1 0 3.891820 3 1 0 104 0 1 0 4.644391 4 2 1 104 0 1 1 4.644391 5 2 0 102 0 1 0 4.624973 6 2 0 104 0 1 0 4.644391 7 3 1 104 0 1 1 4.644391 8 3 0 104 0 1 0 4.644391 9 3 0 104 0 1 0 4.644391 10 4 1 77 0 1 1 4.343805 ### the first 4 columns are the original "rats" data ### the last 3 are the design matrix columns created by glm ### So we can look at the new model fit for significance of coefs > summary(fitB3)$coef Estimate Std. Error z value Pr(>|z|) (Intercept) 17.868661 4.4924813 3.977459 6.965559e-05 rx 1.194019 0.4284626 2.786751 5.323935e-03 I(log(time)) -4.355407 1.0180848 -4.278040 1.885462e-05 ### or alternatvely compare deviances or logLik's with fitB1 > c(2*(logLik(fitB3)-logLik(fitB1)), fitB1$dev-fitB3$dev) [1] 24.37307 24.37307 ### this is LRT stat to be compared with chisq 1 df > 1-pchisq(24.373,1) [1] 7.937339e-07 ## still highly significant but ### somewhat different p-value from the I(log(time)) coef ### probably because the model is still far from the ### right one. ### We could try to enter additional terms like log(time)^2 ### or rx * log(time) > fitB4 = update(fitB3, .~. + I(rx*log(time)) + I(log(time)^2)) Estimate Std. Error z value Pr(>|z|) (Intercept) -9.784968 52.690740 -0.1857056 0.85267560 rx -13.501206 8.732837 -1.5460274 0.12209794 I(log(time)) 10.179730 24.364938 0.4178024 0.67609159 I(rx * log(time)) 3.324780 1.973389 1.6848069 0.09202582 I(log(time)^2) -1.871215 2.819823 -0.6635932 0.50695069 ### This time, the new variables do NOT look significant ### which we can check also through deviances: > fitB3$dev - fitB4$dev [1] 3.901065 ### to be compared with chisq 2df, so is ### not at all significant ### We can also do these deviance comparisons all at once ### by looking at an "analysis of deviance" table > anova(fitB4) Analysis of Deviance Table Model: binomial, link: logit Response: cbind(status, 1 - status) Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev NULL 149 173.975 rx 1 8.701 148 165.274 I(log(time)) 1 24.373 147 140.901 I(rx * log(time)) 1 3.478 146 137.422 I(log(time)^2) 1 0.423 145 137.000 ### As in an anova table, in which RSS replaced Deviance ### these "Deviance values" are the amounts by which the ### current model-fitting line decreases the deviance: ### (recall that fitB1 uses rx only, fitB3 augments by log(time), ### and fitB4 by log(time) plus the 2 additional rx*log(tim) ### and log(time)^2 terms > Devs = c(fitB1$null.dev, fitB1$dev, fitB3$dev, update(fitB3, .~.+I(rx*log(time)))$dev, fitB4$dev) > Devs [1] 173.9746 165.2738 140.9007 137.4223 136.9997 > round ( -diff(Devs) , 3 ) ### successive differences of llks [1] 8.701 24.373 3.478 0.423 ### give Deviance col in "anova" ------------------------------------------------ (III) Now look at a descriptive check of model fit within the last series of models. We can plot for each litter the observed versus predicted mean "status) (ie mean tumor proportion): ## First we can group the fitted and observed data into 50x3 matrices ## so that we can create sums over the 3 members of each litter > fitmn = c( matrix(fitB3$fitted, ncol=3) %*% c(1,1,1)/3 ) obsmn = c( matrix(rats$status, ncol=3) %*% c(1,1,1)/3 ) plot(fitmn, obsmn) ### the picture shows generally increasing relationship, but there ### is still only a weak relationship between fitted and obs. > table(obsmn*3) ### there are lots of litter with 0 tumors, 0 1 2 3 ### and 1 litter with 3 19 23 7 1 ### Clearly there is still a lot difference across litters ! ### Here is one more idea for descriptively plotting fit vs observation: ### we can assemble litters into bins according to their predictors ### and then plot predictor versus observed tumor prop for the group. ### hisotgram of logit values qlogis(fitB3$fitted) suggests grouping ### into intervals with breaks c(-2.5,-2.0, -1.5, -1.0, -0.5, 0, 4) > hist(qlogis(fitB3$fitted), breaks=c(-2.5,-2.0, -1.5, -1.0, -0.5, 0, 4), plot=F)$counts [1] 54 10 35 16 16 19 ### Define groups mathematically: > grp = ifelse(qlogis(fitB3$fitted) > 0, 6, 1+trunc(2*qlogis(fitB3$fitted)+5)) grp 1 2 3 4 5 6 54 10 35 16 16 19 ### The following command produces a table of fitted and ### status values summed by group. > aggregate.data.frame(cbind.data.frame(fit=round(fitB3$fitted,2), obs = rats$status), list(grp=grp), sum) grp fit obs ### so the fit is actually not bad !! 1 1 4.95 3 2 2 1.39 0 3 3 8.32 12 ### although we can see in grp 3 that 4 4 5.24 5 ### the model is not very precise. 5 5 6.79 8 6 6 13.53 12 ----------------------------------------------------------------- ### One way to see "litter" differences is enter "litter" as a ### fixed or random effect; another would be to fit a model ### using "conditional logistic regression" (clogit in R) ### which allows the nuisance parameter litter-effect to cancel ### away by conditioning on the sufficient statistic = total number ### of tumors for each litter. > summary(glm(cbind(status,1-status) ~ rx + time + factor(litter), family=binomial, data=rats))$coef[1:5,] Estimate Std. Error z value Pr(>|z|) (Intercept) 9.6333910 3.626144e+00 2.656649127 0.0078921537 rx 3.4784372 9.711567e-01 3.581746504 0.0003413049 time -0.1517933 3.951368e-02 -3.841537689 0.0001222660 factor(litter)2 -16.6354770 9.136262e+03 -0.001820819 0.9985471977 factor(litter)3 -16.6106806 9.070080e+03 -0.001831371 0.9985387783 ... ### estimates wildly different from those before: unreliable because ### the number of parameters is now increased by 49 (= # litters -1) ### Here is the "conditional logistic" fit (conditioned on litter totals) > clogit(status ~ rx + I(log(time)) + strata(litter), data=rats, method="exact") Call: clogit(status ~ rx + I(log(time)) + strata(litter), data = rats, method = "exact") coef exp(coef) se(coef) z p rx 1.97 7.176830 0.68 2.90 0.0038 I(log(time)) -6.91 0.000996 2.50 -2.77 0.0056 Likelihood ratio test=28.6 on 2 df, p=6.2e-07 n= 150 ### Tells that the coef's are significant and roughly like those ### fitted in glm, but still goodness of fit is hard to assess. > fitB3$coef (Intercept) rx I(log(time)) 17.868661 1.194019 -4.355407