DATA ANALYSIS LOG IN  R  USING  GLM 
===================================

(I) (10/28/09)  Begin by getting syntax for different
"family" and "link" choices.

### Start with basic syntax for "binomial" data, i.e. 
###   data which comes in batches of "successes", and 
###   "failures" under each of a number of experimental 
###   settings summarized through predictor "covariates"
###   here written as X1 and X2:

> X1 = runif(120); X2 = rpois(120,3)
  ntrials = rep(c(5,6,3), 40)  
  success = rbinom( 120, ntrials, plogis(-1 + 0.6*X1 + 0.1*X2) )
  failure = ntrials - success
  simlogis = data.frame( X1, X2, success, failure )

### So the data looks like

         X1 X2 success failure
1 0.2120602  3       3       2
2 0.7260127  1       2       4
3 0.1604437  3       1       2
4 0.4431811  4       0       5
5 0.8397253  5       5       1
...

### "Logistic regression" should give intercept -1, and 
###     coefs for X1 and X2 respectively   0.6, 0.1

> fitB0 = glm ( cbind(success, failure) ~ X1 + X2,
                family=binomial, data=simlogis )

### To display coef's, you get much more info from
###  summary(fitB0)$coef  than from     fitB0$coef

> round( summary(fitB0)$coef,  4)
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.1342     0.2400 -4.7259   0.0000
X1            0.8906     0.3188  2.7940   0.0052
X2            0.0998     0.0566  1.7627   0.0780

 #### NOT BAD, but not precise as you can see from Std.Error

### Here the model was the correct one; but for "probit"
###    regression which puts success probability equal to
###    pnorm( a + b1*X1 + b2*X2 ) ,  an incorrect model,
###    get a,  b1  and  b2  nearly the same after rescaling, 
###    since the scale of the logistic  is different 
###    from the scale of the normal by the factor (ratio of sd's)

> sqrt( 1 / integrate(function(x) x^2*dlogis(x), -Inf, Inf)$val )
[1] 0.5513289

### Now fit "probit" regression:
> fitB0.probit = glm ( cbind(success, failure) ~ X1 + X2,
          family=binomial(link="probit"), data=simlogis )

> round( rbind(probit= fitB0.probit$coef, 
             logit.rescal= fitB0$coef*0.5513), 4)
             (Intercept)     X1     X2
probit           -0.7045 0.5487 0.0628
logit.rescal     -0.6253 0.4910 0.0550
         ### Two sets of coef's close after rescaling

-------------------------------------------------------------

(II) Binomial data often comes in binary form, with ntrials = 1.

### Use data "rats" from "survival" package in R.
### Data on 150 rats contain identifying "litter", 
###   "rx" (indicator of injection of drug after initial 
###   administration of carcinogen), time in days on 
###   study (ignored initally, and "status" which is 
###   indicator of tumor, our binary response variable.

> library(survival)

> fitB1 = glm(cbind(status,1-status) ~ rx, family=binomial,
              data = rats)

### NOTE you can use the column headers as variable names
###   if you specify  the data-frame using "data="

> fitB1

Call:  glm(formula = cbind(status, 1 - status) ~ rx, family = binomial, data = rats) 

Coefficients:
(Intercept)           rx  
     -1.450        1.127  

Degrees of Freedom: 149 Total (i.e. Null);  148 Residual
Null Deviance:      174 
Residual Deviance: 165.3        AIC: 169.3 

> summary(fitB1)$coef
             Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -1.450010  0.2549063 -5.688404 1.282324e-08
rx           1.127237  0.3835089  2.939272 3.289845e-03

### Coefficients fitted by MLE, link="logit" is default for
###   binomial-family data: this is logistic regression

## Std.error found as sqrt of diagonal in variance:
> sqrt(diag(summary(fitB1)$cov.scaled))
(Intercept)          rx 
  0.2549063   0.3835089 

### "Deviance" = "Residual Deviance" = -2*logLik
> c(fitB1$deviance, -2*logLik(fitB1))
[1] 165.2738 165.2738

### "Null.deviance" is -2 times the logLik
###    for the same model with only a const term 

> c(fitB1$null.dev, 
    -2*logLik(update(fitB1, formula = .~ 1))
[1] 173.9746 173.9746

### NOTE the use of the "update" function to refit a model
###   of the same type as the one previously done: in the 
###   model formula, the left-hand side term is a   .
###   to indicate that it is the same as before. We have 
###   changed the right-hand side from  rx  (which automatically
###   included intercept along with  rx  predictor)
###   to one with "1"  or intercept alone.

### Next use "update" to change the fitB1 model not in its 
###   model formula but in its specified link within 
###   the binomial "family".

> fitB2 = update(fitB1, family=binomial(link="probit"))
  rbind(logit= fitB1$coef, probit= fitB2$coef, 
      rescal.probit = fitB2$coef/0.5513)
              (Intercept)        rx
logit          -1.4500102 1.1272368
probit         -0.8778963 0.6760028
rescal.probit  -1.5924112 1.2261977


### RECALL THAT WE USE DEVIANCES AND DIFFERENCES BETWEEN 
###   THEM BECAUSE  WILKS' THEOREM says that 2 times the
###   difference between logLik for a model with  p  extra 
###   parameter dimensions versus logLik for a base model
###   is equal to approximately a chi-square  p  df variate
###   when the base model is the true one !

### LET's use this idea to examine the quality of the model
###   with predictor  log(time)  in the model along with rx
###  NOTE that we must put the expression  log(time)  
###   within  I( ) to have it evaluated and constructed as
###   a new predictor within the   glm  fitting function

> fitB3 = update(fitB1, formula= . ~ . + I(log(time)))
       ### the "data" is the same now as in fitB1, so "time"
    ### is the column in the data-frame, and a  log(time) 
    ### predictor is created under "glm"

### It was reasonable to use "time" as a predictor, partly
###  because the range of times is not too different for rats who 
###  generate tumors, so "time" is not really a response variable.

> summary(rats$time[rats$status==1])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  34.00   66.75   80.00   77.28   92.50  104.00 
> summary(rats$time[rats$status==0])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  45.00   83.50  104.00   93.85  104.00  104.00 

> cbind(rats[1:10,], model.matrix(fitB3)[1:10,])
   litter rx time status (Intercept) rx I(log(time))
1       1  1  101      0           1  1     4.615121
2       1  0   49      1           1  0     3.891820
3       1  0  104      0           1  0     4.644391
4       2  1  104      0           1  1     4.644391
5       2  0  102      0           1  0     4.624973
6       2  0  104      0           1  0     4.644391
7       3  1  104      0           1  1     4.644391
8       3  0  104      0           1  0     4.644391
9       3  0  104      0           1  0     4.644391
10      4  1   77      0           1  1     4.343805

### the first 4 columns are the original "rats" data
### the last 3 are the design matrix columns created by glm

### So we can look at the new model fit for significance of coefs
> summary(fitB3)$coef
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept)  17.868661  4.4924813  3.977459 6.965559e-05
rx            1.194019  0.4284626  2.786751 5.323935e-03
I(log(time)) -4.355407  1.0180848 -4.278040 1.885462e-05

### or alternatvely compare deviances or logLik's with fitB1
> c(2*(logLik(fitB3)-logLik(fitB1)), fitB1$dev-fitB3$dev)
[1] 24.37307 24.37307    
  ### this is LRT stat to be compared with chisq 1 df

> 1-pchisq(24.373,1)
[1] 7.937339e-07       ## still highly significant but
     ### somewhat different p-value from the I(log(time)) coef 
     ### probably because the model is still far from the 
     ### right one.

### We could try to enter additional terms like log(time)^2
### or rx * log(time)

> fitB4 = update(fitB3, .~. + I(rx*log(time)) + I(log(time)^2))
                    Estimate Std. Error    z value   Pr(>|z|)
(Intercept)        -9.784968  52.690740 -0.1857056 0.85267560
rx                -13.501206   8.732837 -1.5460274 0.12209794
I(log(time))       10.179730  24.364938  0.4178024 0.67609159
I(rx * log(time))   3.324780   1.973389  1.6848069 0.09202582
I(log(time)^2)     -1.871215   2.819823 -0.6635932 0.50695069

### This time, the new variables do NOT look significant
###   which we can check also through deviances:

> fitB3$dev - fitB4$dev
[1] 3.901065      ### to be compared with chisq 2df, so is
             ### not at all significant

### We can also do these deviance comparisons all at once
###   by looking at an "analysis of deviance" table

> anova(fitB4)
Analysis of Deviance Table
Model: binomial, link: logit

Response: cbind(status, 1 - status)
Terms added sequentially (first to last)

                   Df Deviance Resid. Df Resid. Dev
NULL                                 149    173.975
rx                  1    8.701       148    165.274
I(log(time))        1   24.373       147    140.901
I(rx * log(time))   1    3.478       146    137.422
I(log(time)^2)      1    0.423       145    137.000

### As in an anova table, in which RSS replaced Deviance
###  these "Deviance values" are the amounts by which the
###  current model-fitting line decreases the deviance:
### (recall that fitB1 uses rx only, fitB3 augments by log(time),
###    and fitB4 by  log(time) plus the 2 additional rx*log(tim)
###       and log(time)^2  terms

> Devs = c(fitB1$null.dev, fitB1$dev, fitB3$dev,
             update(fitB3, .~.+I(rx*log(time)))$dev, fitB4$dev)
> Devs
[1] 173.9746 165.2738 140.9007 137.4223 136.9997

> round ( -diff(Devs) , 3 )      ### successive differences of llks
[1]  8.701 24.373  3.478  0.423      ### give Deviance col in "anova"

------------------------------------------------

(III) Now look at a descriptive check of model fit within the last
series of models. We can plot for each litter the observed versus
predicted mean "status) (ie mean tumor proportion):

## First we can group the fitted and observed data into 50x3 matrices
##  so that we can create sums over the 3 members of each litter
> fitmn = c( matrix(fitB3$fitted, ncol=3) %*% c(1,1,1)/3 )
  obsmn = c( matrix(rats$status, ncol=3) %*% c(1,1,1)/3 )
  plot(fitmn, obsmn)

### the picture shows generally increasing relationship, but there 
###   is still only a weak relationship between fitted and obs.

> table(obsmn*3)       ### there are lots of litter with 0 tumors, 
 0  1  2  3            ###    and 1 litter with 3
19 23  7  1 

### Clearly there is still a lot difference across litters !

### Here is one more idea for descriptively plotting fit vs observation:
###  we can assemble litters into bins according to their predictors
###  and then plot predictor versus observed tumor prop for the group.

### hisotgram of logit values  qlogis(fitB3$fitted) suggests grouping 
###   into intervals with breaks c(-2.5,-2.0, -1.5, -1.0, -0.5, 0, 4)

> hist(qlogis(fitB3$fitted), breaks=c(-2.5,-2.0, -1.5, -1.0, -0.5, 0, 4),
         plot=F)$counts
[1] 54 10 35 16 16 19
### Define groups mathematically:
> grp = ifelse(qlogis(fitB3$fitted) > 0, 6, 
                1+trunc(2*qlogis(fitB3$fitted)+5))
grp
 1  2  3  4  5  6 
54 10 35 16 16 19 

### The following command produces a table of fitted and 
###        status values summed by group.
> aggregate.data.frame(cbind.data.frame(fit=round(fitB3$fitted,2), 
       obs = rats$status), list(grp=grp), sum)

  grp   fit obs           ### so the fit is actually not bad !!
1   1  4.95   3
2   2  1.39   0
3   3  8.32  12      ### although we can see in grp 3 that 
4   4  5.24   5        ### the model is not very precise.
5   5  6.79   8
6   6 13.53  12

-----------------------------------------------------------------
###  One way to see "litter" differences is enter "litter" as a 
###  fixed or random effect; another would be to fit a model 
###  using "conditional logistic regression" (clogit in R)
###  which allows the nuisance parameter litter-effect to cancel 
###  away by conditioning on the sufficient statistic = total number
###  of tumors for each litter.

> summary(glm(cbind(status,1-status) ~ rx + time + factor(litter),
    family=binomial, data=rats))$coef[1:5,]
                   Estimate   Std. Error      z value     Pr(>|z|)
(Intercept)       9.6333910 3.626144e+00  2.656649127 0.0078921537
rx                3.4784372 9.711567e-01  3.581746504 0.0003413049
time             -0.1517933 3.951368e-02 -3.841537689 0.0001222660
factor(litter)2 -16.6354770 9.136262e+03 -0.001820819 0.9985471977
factor(litter)3 -16.6106806 9.070080e+03 -0.001831371 0.9985387783
...

### estimates wildly different from those before: unreliable because 
###   the number of parameters is now increased by 49 (= # litters -1)

### Here is the "conditional logistic" fit (conditioned on litter totals)

> clogit(status ~ rx + I(log(time)) + strata(litter), data=rats, method="exact")
Call:
clogit(status ~ rx + I(log(time)) + strata(litter), data = rats, 
    method = "exact")

              coef exp(coef) se(coef)     z      p
rx            1.97  7.176830     0.68  2.90 0.0038
I(log(time)) -6.91  0.000996     2.50 -2.77 0.0056

Likelihood ratio test=28.6  on 2 df, p=6.2e-07  n= 150 

### Tells that the coef's are significant and roughly like those 
###   fitted in glm, but still goodness of fit is hard to assess.

> fitB3$coef
 (Intercept)           rx I(log(time)) 
   17.868661     1.194019    -4.355407