Self-Consistency for Kaplan-Meier
=================================
4/30/08, UPDATED 11/12/2021

Suppose that right-censored survival data are given in a table consisting
of columns time, Dth, Cens:

   time  (column of ordered distinct event-times t_j)
   Dth   (number d_j of observed failures at time t_j)
   Cens  (number c_j of observed right-censored at time t_j)

In addition (as long as time is already ordered increasing), the at-risk
count column Risk is defined to have value at t_j equal to

   sum_k I(t_k >= t_j) * (d_k + c_k)

which can easily be coded in R from the other columns as

   Risk = rev(cumsum(rev(Dth + Cens)))

Now suppose that Shat0 is a survival function estimator (given as a column,
evaluated at the times t_j) which is piecewise constant, with jumps only at
the event times t_j.  The self-consistency idea is to update Shat0 to a new
survival function Shat1 by using Shat0 to attribute to each individual
censored at t_k the conditional survival probability Shat0(t)/Shat0(t_k) at
all times t > t_k.  Thus each death at t_k <= t_j counts as a full failure
by time t_j, and each censoring at t_k <= t_j counts as a partial failure
with mass 1 - Shat0(t_j)/Shat0(t_k), so the definition of Shat1 from Shat0
is:

   Shat1 = 1 - (1/Risk[1])*(cumsum(Dth + Cens) - Shat0*cumsum(Cens/Shat0))

### We code this into a function:

SlfConsUp = function(Shat0, LTable) {
    ### LTable should have columns time, Dth, Cens
    npop = sum(LTable[,2] + LTable[,3])
    Shat1 = 1 - (1/npop)*(cumsum(LTable[,2]+LTable[,3]) -
                          Shat0*cumsum(LTable[,3]/Shat0))
    Shat1 }

### Now consider the life table from the former "gehan" dataset, now called
### "drug6mp" in KMsurv:

library(survival)
library(KMsurv)
data(drug6mp)
tmp = drug6mp[,c("t2","relapse")]

### Note that there are exactly two times, t=6 and t=10, at which
### censoring and failures occur simultaneously.
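The same update can be sketched outside R.  Below is a minimal pure-Python
version (function and variable names are ours, not from the note) of the
self-consistency formula above, iterated to a fixed point on a toy
three-row life table; at the fixed point it should reproduce the
Kaplan-Meier values.

```python
def self_consist_update(shat0, dth, cens):
    """One self-consistency update: a death at t_k <= t_j is a full failure
    by t_j; a censoring at t_k <= t_j contributes failure mass
    1 - shat0[j]/shat0[k]."""
    npop = sum(dth) + sum(cens)
    shat1, cum_events, cum_ratio = [], 0.0, 0.0
    for d, c, s0 in zip(dth, cens, shat0):
        cum_events += d + c
        cum_ratio  += c / s0          # running sum of Cens/Shat0
        shat1.append(1 - (cum_events - s0 * cum_ratio) / npop)
    return shat1

# toy life table: 2 deaths at t1, 1 censoring at t2, 1 death at t3 (n = 4)
dth, cens = [2, 0, 1], [0, 1, 0]
shat = [1 - 2/4, 1 - 2/4, 1 - 3/4]  # start: empirical survival ignoring censoring
for _ in range(50):
    shat = self_consist_update(shat, dth, cens)
# fixed point = Kaplan-Meier for these data: 0.5, 0.5, 0
```

Here the fixed point agrees with the product-limit values (1-2/4) = 0.5 at
t1, unchanged at the censoring time t2, and 0.5*(1 - 1/1) = 0 at t3.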
### So make the censoring times 0.5 later:

tmp[21,1] = 10.5
tmp[20,1] = 6.5
Gehan6MP = cbind(time = as.numeric(levels(factor(tmp$t2))),
                 Dth  = aggregate(tmp$relapse,   by=list(tmp$t2), sum)$x,
                 Cens = aggregate(1-tmp$relapse, by=list(tmp$t2), sum)$x)

> Gehan6MP       ### 21 at risk initially in 6-MP group
      time Dth Cens
 [1,]  6.0   3    0
 [2,]  6.5   0    1
 [3,]  7.0   1    0
 [4,]  9.0   0    1
 [5,] 10.0   1    0
 [6,] 10.5   0    1
 [7,] 11.0   0    1
 [8,] 13.0   1    0
 [9,] 16.0   1    0
[10,] 17.0   0    1
[11,] 19.0   0    1
[12,] 20.0   0    1
[13,] 22.0   1    0
[14,] 23.0   1    0
[15,] 25.0   0    1
[16,] 32.0   0    2
[17,] 34.0   0    1
[18,] 35.0   0    1

### Initial estimator: empirical survival ignoring censoring
> Shat = 1 - cumsum(Gehan6MP[,2])/21
> round(Shat,4)
 [1] 0.8571 0.8571 0.8095 0.8095 0.7619 0.7619 0.7619 0.7143 0.6667 0.6667
[11] 0.6667 0.6667 0.6190 0.5714 0.5714 0.5714 0.5714 0.5714

> for (i in 1:10) {
     Shat = SlfConsUp(Shat, Gehan6MP)
     cat(round(Shat,4),"\n\n") }
0.8571 0.8571 0.8067 0.8067 0.7529 0.7529 0.7529 0.6902 0.6275 0.6275
0.6275 0.6275 0.5378 0.4482 0.4482 0.4482 0.4482 0.4482

...

0.8571 0.8571 0.8067 0.8067 0.7529 0.7529 0.7529 0.6902 0.6275 0.6275
0.6275 0.6275 0.5378 0.4482 0.4482 0.4482 0.4482 0.4482

### Converged to 4 decimal places after 1 iteration !!!

> round(survfit(Surv(tmp$t2, tmp$relapse) ~ 1)$surv, 4)
 [1] 0.8571 0.8571 0.8067 0.8067 0.7529 0.7529 0.7529 0.6902 0.6275 0.6275
[11] 0.6275 0.6275 0.5378 0.4482 0.4482 0.4482 0.4482 0.4482

### Need 20 iterations to get agreement to 7 decimal places.  Could check
### this with other datasets too !!
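The survfit values being matched above are just the product-limit formula
S(t_j) = prod_{k <= j} (1 - d_k/n_k).  As a cross-check, here is a short
Python sketch (our own naming) of that formula applied to the Gehan6MP life
table transcribed from the table above; it reproduces the converged
self-consistency values.

```python
# Gehan6MP life table transcribed from above: (time, Dth, Cens)
table = [(6.0,3,0),(6.5,0,1),(7.0,1,0),(9.0,0,1),(10.0,1,0),(10.5,0,1),
         (11.0,0,1),(13.0,1,0),(16.0,1,0),(17.0,0,1),(19.0,0,1),(20.0,0,1),
         (22.0,1,0),(23.0,1,0),(25.0,0,1),(32.0,0,2),(34.0,0,1),(35.0,0,1)]

def kaplan_meier(table):
    """Product-limit estimator: S(t_j) = prod_{k<=j} (1 - d_k/n_k)."""
    n = sum(d + c for _, d, c in table)   # 21 at risk initially
    surv, s = [], 1.0
    for _, d, c in table:
        s *= 1 - d / n                    # KM factor at this event time
        surv.append(s)
        n -= d + c                        # drop deaths and censorings from risk set
    return surv

km = kaplan_meier(table)
# round(km, 4) -> 0.8571 0.8571 0.8067 ... 0.4482, as in the output above
```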
### We do it next with a randomly generated dataset:

set.seed(2112)
Xvec = rexp(25, .5)
Uvec = runif(25, 1, 5)
Tvec = pmin(Xvec, Uvec)
Dth  = as.numeric(Xvec <= Uvec)
Tabl = cbind(time=Tvec, Dth=Dth, Cens=1-Dth)[order(Tvec),]

> Shat = 1 - .04*cumsum(Tabl[,"Dth"])
> for(i in 1:20) Shat = SlfConsUp(Shat, Tabl)
> round(Shat,4)
 [1] 0.9600 0.9200 0.8800 0.8400 0.8000 0.7600 0.7200 0.6800 0.6400 0.6000
[11] 0.5600 0.5600 0.5169 0.4738 0.4738 0.4738 0.4212 0.4212 0.4212 0.4212
[21] 0.3370 0.2527 0.1685 0.1685 0.1685

> round(survfit(Surv(Tabl[,1], Tabl[,2]) ~ 1)$surv, 4)
 [1] 0.9600 0.9200 0.8800 0.8400 0.8000 0.7600 0.7200 0.6800 0.6400 0.6000
[11] 0.5600 0.5600 0.5169 0.4738 0.4738 0.4738 0.4212 0.4212 0.4212 0.4212
[21] 0.3370 0.2527 0.1685 0.1685 0.1685

### NOTE: the number of iterations needed is of the order of the number of
### death-times!

=================================================================
### Coding for the Redistribute-to-the-Right Algorithm for Kaplan-Meier

### Recall that the idea is to work in a single pass, from left to right,
### through all of the censoring times, successively dividing the mass
### (initially 1/npop at each observation) at each such point equally
### among all observations with greater event times.

Redistr = function(LTable) {
    nevt  = nrow(LTable)
    Dth   = LTable[,2]
    Cens  = LTable[,3]
    nRisk = rev(cumsum(rev(Dth+Cens)))
    npop  = nRisk[1]
    ### censoring times with at least one death at or after them
    inds  = (1:nevt)[Cens > 0 & rev(cumsum(rev(Dth))) > 0]
    for(i in inds)
        Dth[(i+1):nevt] = Dth[(i+1):nevt]*(1 + Cens[i]/nRisk[i+1])
    1 - cumsum(Dth)/npop }

> sum(abs(Redistr(Gehan6MP) - survfit(Surv(tmp$t2, tmp$relapse) ~ 1)$surv))
[1] 1.831868e-15     ### so the Redistribute algorithm gives KM!

> sum(abs(Redistr(Tabl) - survfit(Surv(Tabl[,"time"],Tabl[,"Dth"]) ~ 1)$surv))
[1] 1.665335e-15     ### OK.
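The R function above carries out the redistribution multiplicatively on the
death-mass column.  The algorithm can also be coded literally as
mass-passing, which may make the idea clearer: sweep left to right and hand
each censored observation's accumulated mass to the observations to its
right.  A minimal Python sketch (our own naming, assuming distinct
observation times as in the simulated Tabl):

```python
def redistribute_to_right(times, death):
    """Redistribute-to-the-right: each observation starts with mass 1/n;
    sweeping left to right, the mass at each censored observation is split
    equally over all observations with strictly larger times."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    mass = [1.0 / n] * n
    for pos, i in enumerate(order):
        later = order[pos+1:]
        if not death[i] and later:        # censored, with observations to its right
            share = mass[i] / len(later)
            for j in later:
                mass[j] += share
            mass[i] = 0.0
    surv, s = [], 1.0                     # survival drops by the death masses
    for i in order:
        if death[i]:
            s -= mass[i]
        surv.append(s)
    return surv

s = redistribute_to_right([1, 2, 3, 4], [1, 0, 1, 1])  # one censoring at t=2
# s == [0.75, 0.75, 0.375, 0.0], the Kaplan-Meier curve for these data
```

Because the sweep is left to right, mass handed to a later censored
observation is itself redistributed when the sweep reaches it, exactly as
in the note's description.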