Handout on R functions for analyzing survival data as in Chapter 4 ==================================================== 9/29/21 ===== In this illustrative file, I use the "kidney" dataset within the R "survival" package. Here is the annotated R session log for the Nelson-Aalen and Kaplan-Meier analysis. > library(survival) > names(kidney) #### 76 records, 7 variables ### to find the column names in the dataset [1] "id" "time" "status" "age" "sex" "disease" "frail" > table(kidney$disease, kidney$status) 0 1 ### 0 = censored, 1 = failure Other 6 20 ### There are four disease types: we ignore the GN 4 14 ### distinctions for now. AN 6 18 PKD 2 6 > KidFit = survfit(Surv(time,status)~1, data=kidney) ## Typical command to analyze using failure counts and at-risk, along with Kaplan-Meier curve. ## But note that the "survfit" syntax now requires Surv(..,..) ~ 1. > names(KidFit) ### larger now than 10 years ago ... [1] "n" "time" "n.risk" "n.event" "n.censor" "surv" "std.err" "cumhaz" [9] "std.chaz" "type" "logse" "conf.int" "conf.type" "lower" "upper" "call" > length(KidFit$time) [1] 60 ### only 60 distinct event-times, ordered in this vector ### NOTE: the n.risk component gives the number Y(t) at risk at the times t , and the n.event # gives the number of observed deaths at t, which can be > 1 in case of tied death-times. > table(KidFit$n.event) 0 1 2 4 10 44 5 1 > round(KidFit$surv,3) [1] 0.987 0.987 0.987 0.987 0.959 0.931 0.917 0.888 0.874 0.845 0.830 0.815 [13] 0.801 0.785 0.770 0.755 0.739 0.724 0.708 0.645 0.629 0.613 0.598 0.582 [25] 0.566 0.566 0.550 0.550 0.533 0.517 0.500 0.500 0.483 0.466 0.466 0.466 [37] 0.447 0.428 0.410 0.391 0.372 0.372 0.333 0.314 0.294 0.294 0.273 0.252 [49] 0.231 0.210 0.189 0.168 0.147 0.126 0.105 0.084 0.063 0.042 0.021 0.000 ### Interpret these as the KM survival estimator just AFTER each of ### the ordered event times in "time". Note that many of these are the ### same as the previous entry: these correspond to "censoring" event-times. ### To confirm that the Kaplan-Meier product formula we had in class is ### exactly the one used in R: > sum(abs(cumprod(1-KidFit$n.event/KidFit$n.risk)-KidFit$surv)) [1] 3.243933e-15 ### OK up to machine accuracy. ### As far as I can tell from online help, we cannot get "survfit" to ### calculate the Nelson-Aalen cumulative hazard estimator directly. ### The implementation of the formula from class is: > round(cumsum(KidFit$n.ev/KidFit$n.risk),4) [1] 0.0132 0.0132 0.0132 0.0132 0.0413 0.0703 0.0857 0.1169 0.1331 0.1664 [11] 0.1837 0.2015 0.2197 0.2386 0.2578 0.2778 0.2986 0.3199 0.3416 0.4305 [21] 0.4549 0.4799 0.5056 0.5319 0.5589 0.5589 0.5875 0.5875 0.6178 0.6490 [31] 0.6813 0.6813 0.7158 0.7515 0.7515 0.7515 0.7915 0.8331 0.8766 0.9221 [41] 0.9697 0.9697 1.0750 1.1338 1.1963 1.1963 1.2677 1.3446 1.4280 1.5189 [51] 1.6189 1.7300 1.8550 1.9978 2.1645 2.3645 2.6145 2.9478 3.4478 4.4478 ### We can compare this with the "asymptotically equivalent" version ### obtained as negative log of the KM estimator > round(-log(KidFit$surv), 4) [1] 0.0132 0.0132 0.0132 0.0132 0.0418 0.0712 0.0867 0.1185 0.1347 0.1686 [11] 0.1860 0.2041 0.2224 0.2415 0.2609 0.2811 0.3021 0.3236 0.3456 0.4387 [21] 0.4634 0.4887 0.5147 0.5414 0.5688 0.5688 0.5977 0.5977 0.6285 0.6603 [31] 0.6931 0.6931 0.7281 0.7645 0.7645 0.7645 0.8053 0.8479 0.8923 0.9389 [41] 0.9877 0.9877 1.0989 1.1595 1.2240 1.2240 1.2982 1.3782 1.4652 1.5605 [51] 1.6659 1.7837 1.9172 2.0713 2.2537 2.4768 2.7645 3.1700 3.8631 Inf ### Final value is Inf because the final KM estimated probability is 0. ### Next estimate the median survival time: this is by def'n where KM first dips below the ## level 0.5, at the 31 or 33 observation depending on whether by "below" we mean <= or < . > KidFit$time[min((1:60)[KidFit$surv < 0.5])] [1] 78 ### These are recurrence times to infection of a catheter: ## I suppose the time units are weeks rather than days. -------------------------------------------------------------------- Now we consider standard errors and confidence intervals. ## KidFit$std.err is supposed to be the square root of variance estimator for the cumulative # hazard of -log(surv). Its default is exactly equal to the calculated Greenwood formula: > sum(abs(KidFit$std.err - sqrt(cumsum(KidFit$n.ev/(KidFit$ n.risk*(KidFit$n.risk-KidFit$n.ev))))), na.rm=T) [1] 3.95517e-16 ##----------------- Added 9/29/21 ## In older releases of "survfit", there was an "error=..." argument that allowed the choice of ## calculating variances using d_j/l_j^2 rather than d_j/(l_j*(l_j-d_j)) in the Greenwood formula ## That option is no longer available. If we wanted to do it, we would use the command: Tsiatis.Haz.Var = cumsum(KidFit$n.ev/(KidFit$n.risk^2) Tsiatis.SKM.Var = KidFit$surv^2 * Tsiatis.Haz.Var ### Next come the confidence intervals: these are intervals for the survival curve, ## specified by the argument "conf.type" which takes default value "log", and can otherwise ## be chosen as "plain", "log-log", "logit" or "arcsin" ## The "log" default-type of CI isobtained by exponentiating the CI for cumulative hazard. > round(KidFit$upper[KidFit$n.ev > 0],5)[1:20] [1] 1.00000 1.00000 0.99129 0.98289 0.96441 0.95452 0.93360 0.92270 0.91145 [10] 0.89996 0.88813 0.87608 0.86369 0.85095 0.83800 0.82487 0.77068 0.75675 [19] 0.74268 0.72848 > round((KidFit$surv*exp(1.96*KidFit$std.err))[KidFit$n.ev > 0],5)[1:20] [1] 1.01280 1.00555 0.99129 0.98289 0.96441 0.95452 0.93360 0.92270 0.91145 [10] 0.89996 0.88813 0.87608 0.86369 0.85095 0.83800 0.82487 0.77068 0.75676 [19] 0.74269 0.72848 ### The agreement is perfect if we max out the interval at 1 and use the ### exact normal .975 quantile: > sum(abs(KidFit$upper- pmin(1,KidFit$surv*exp(qnorm(.975)*KidFit$std.err)))[1:59]) [1] 1.915135 ### Similarly for lower confidence band for the survival curve: > sum(abs(KidFit$lower- KidFit$surv*exp(-qnorm(.975)*KidFit$std.err))[1:59]) [1] 1.355253e-15 ### BUT observe that these confidence interval endpoints are not simply the exponentials ### of the Greenwood-formula Confidence Interval for cumulative hazard, although ## that is the way we defined "transformed interval" in the book. ============================================================== There are other options with these functions: (A) suppose we wanted a direct CI for the survival curve using the formula without exponentiating the interval for cum-hazard: then we would use the "upper" and "lower" components after the command > KidFit2 = survfit(Surv(time,status)~1, data=kidney, conf.type="plain") ### all life tables and survival probs are same as before, ### only the CI is different > sum(abs(KidFit2$upper - pmin(1,KidFit2$surv*(1 + qnorm(.975)*KidFit2$std.err)))[1:59]) [1] 1.665335e-15 > sum(abs(KidFit2$lower - pmax(0,KidFit2$surv*(1 - qnorm(.975)*KidFit2$std.err)))[1:59]) [1] 1.168336e-15 (A') Let's do one other "transformed" interval, say the "logit" which is log(x/(1-x)) = qlogis(x) KidFit2B = survfit(Surv(time,status)~1, data=kidney, conf.type="logit") ## undefined at 60th time ## Let's try to reconstruct this from the logit transformed-definition, noting that the inverse # of logit is plogis . > LgtS = qlogis(KidFit2B$surv) ## delta method gives the following as the std error for LgtS: LgtS.se = KidFit2B$std.err/(KidFit2B$surv*(1-KidFit2B$surv)) NewCI = cbind( plogis(LgtS - qnorm(.975)*LgtS.se), plogis(LgtS + qnorm(.975)*LgtS.se) ) > summary(NewCI[1:59,1] - KidFit2B$lower[1:59]) Min. 1st Qu. Median Mean 3rd Qu. Max. -0.135883 -0.115285 -0.081198 -0.072638 -0.030552 -0.002123 ### so here also survfit is doing something slightly different from straight delta-method (B) Suppose we wanted to create separate survival functions and other (std.err, life-table, CI) quantities by some grouping variable like disease-type or sex. Then the command syntax is: > KidFit3 = survfit(Surv(time,status) ~ sex, data=kidney) ## or > KidFit3 = survfit(Surv(time,status) ~ disease, data=kidney) ============================================================== We complete this handout with a calculation of confidence intervals for the median survival time using the Brookmeyer-Crowley (1982) test-based interval and also the "transformed reflected" interval discussed in Slud, Byar and Green (1984, Biometrics): ## for Brookmeyer-Crowley as given in book, we first find ## the indices of those times t for which the quantity ## |S_KM(t)-0.5| <= 1.96* S_KM(t) * sigma_S(t) > BCinds = (abs(KidFit$surv - 0.5) <= qnorm(.975)*KidFit$surv* KidFit$std.err) > (1:59)[BCinds[1:59]] [1] 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 ## So the times contained in the test-based interval range from ## the 22nd time to just before the 41st (left-closed right-open) > KidFit$time[c(22,41)] [1] 38 141 ## this is the CI ## For the analogous "transformed reflected" CI we look at the estimated cumulative-hazard # function (either Nelson-Aalen or -log(S_KM), both given above, and find the times where # the estimated function first crosses the level log(2) + or - 1.96*Greenwood variance at the estimated median (estimated median was 78, at the 33rd event-time) > log(2) + c(-1,1)*1.96*KidFit$std.err[33] [1] 0.4394169 0.9468774 ### These are cum-hazard values, and ## these levels are first crossed by the cum-hazard function > round(-log(KidFit$surv),3) [1] 0.013 0.013 0.013 0.013 0.042 0.071 0.087 0.118 0.135 0.169 0.186 0.204 [13] 0.222 0.241 0.261 0.281 0.302 0.324 0.346 0.439 0.463 0.489 0.515 0.541 [25] 0.569 0.569 0.598 0.598 0.629 0.660 0.693 0.693 0.728 0.765 0.765 0.765 [37] 0.805 0.848 0.892 0.939 0.988 0.988 1.099 1.160 1.224 1.224 1.298 1.378 [49] 1.465 1.561 1.666 1.784 1.917 2.071 2.254 2.477 2.764 3.170 3.863 Inf ## at event times 21 and 41 respectively, giving the CI > KidFit$time[c(21,41)] [1] 34 141 ## differs only a little, by one event-time at ## lower endpoint, from Brookmeyer-Crowley confidence interval