Summary of Steps and R Log in Computing Self-Consistent Estimator
of Survival D.f. for Double-Censored Data
========================================================      10/17/21

Ordinarily we begin with data in the form of a column of n individuals'
(unsorted) event-times, Times, and a column of corresponding indicators,
Dths, equal by definition to 1 if the individual subject was observed to
die, 0 if the subject was left-censored, and 2 if right-censored.

STEP 1.  Sort the DISTINCT event-times to get a column tauvec, along with
columns nDvec, nLvec, nRvec of the numbers respectively dead, left-censored,
and right-censored at the corresponding times in tauvec.

> tauvec = sort(unique(Times))
  nDvec = tapply(Dths==1, Times, sum)
  nLvec = tapply(Dths==0, Times, sum)
  nRvec = tapply(Dths==2, Times, sum)
  Dmat = cbind(nD=nDvec, nL=nLvec, nR=nRvec)

STEP 2.  Now we want an initial survival-distribution estimator.  The
simplest way might be (as suggested in the book's example) to ignore the
left-censored observations and get the Kaplan-Meier d.f. estimator.  (The
functions Surv and survfit require the survival package, loaded first.)

> library(survival)
  fstart = 1-survfit(Surv(Times[Dths>0], 2-Dths[Dths>0])~1)$surv

STEP 3.  Finally, we want to implement the inductive step x_{k+1} = G(x_k)
of the Turnbull self-consistent estimator as a function G.  Computationally,
this is done by re-allocating the left-censored observations fractionally to
all smaller event-times, and then computing a new Kaplan-Meier product-limit
estimator with the resulting fractional numbers of observed deaths.  We
create a matrix DL_{ij} (with nonzero entries only for j <= i) giving the
number of "estimated" deaths at tau_j due to individuals left-censored at
tau_i; the formula is  nLvec[i]*(F.old[j]-F.old[j-1])/F.old[i].

> Gfunc = function(Fvec, Dmat) {
     ## one self-consistency step: re-allocate left-censored counts according
     ## to Fvec, then recompute the product-limit d.f. estimate
     nr = nrow(Dmat)
     DL = array(0, dim=rep(nr,2))
     for(i in 2:nr) if(Dmat[i,"nL"]>0)
        DL[i,1:i] = Dmat[i,"nL"]*diff(c(0,Fvec[1:i]))/Fvec[i]
     newdth = Dmat[,"nD"] + t(DL) %*% rep(1,nr)
     newY = rev(cumsum(rev(newdth+Dmat[,"nR"])))
     1-cumprod(1-newdth/newY) }

------------------------------------------------------------

Now we illustrate/test these steps using the example from the text, on
California high school marijuana use.  Since we are not given the data in
individual records, we omit Step 1 and enter the data directly from Table
5.1 of Klein & Moeschberger.  We use age 19 in place of the ">18" category
from that Table.

> tauvec = c(10:19)
  nLvec = c(0,0,0,1,2,3,2,3,1,0)
  nDvec = c(4,12,19,24,20,13,3,1,0,4)
  nRvec = c(0,0,2,15,24,18,14,6,0,0)
  Yvec = rev(cumsum(rev(nDvec+nRvec)))
  fstart = 1-cumprod(1-nDvec/Yvec)
  Dmat = cbind(nD=nDvec, nL=nLvec, nR=nRvec)
  Sfinal = c(.977,.906,.794,.651,.516,.392,.345,.308,.308,0)   ## book's final survival estimates, for comparison
  Ffinal = 1-Sfinal
  DL = array(0, dim=c(10,10))
  for(i in 2:10) if(nLvec[i]>0)
     DL[i,1:i] = nLvec[i]*diff(c(0,fstart[1:i]))/fstart[i]

### Then the newly allocated number of deaths is

> c(round(nDvec + t(DL) %*% rep(1,10), 3))
 [1]  4.487 13.461 21.313 26.963 22.437 14.714  3.417  1.207  0.000  4.000

### This agrees precisely with the 2nd column of Table 5.3 in the book.
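Aside (an addition to this log, not from the book): since Step 1 was omitted
because Table 5.1 gives only counts, here is a minimal sketch of how Step 1
could still be exercised on this example, by expanding the counts back into
individual-level records.  The names Times2, Dths2, Dmat2 are illustrative
and do not appear elsewhere in the log.

> Times2 = rep(rep(tauvec,3), c(nDvec,nLvec,nRvec))           ## one entry per subject
  Dths2  = rep(rep(c(1,0,2), each=10), c(nDvec,nLvec,nRvec))  ## 1=death, 0=left-, 2=right-censored
  Dmat2  = cbind(nD = tapply(Dths2==1, Times2, sum),          ## Step-1 tabulation
                 nL = tapply(Dths2==0, Times2, sum),
                 nR = tapply(Dths2==2, Times2, sum))
  all(Dmat2 == Dmat)     ## should be TRUE if the expansion is consistent with Dmat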
Next we show explicitly the first 10 iterations, starting from fstart.

> DblCns = array(0, dim=rep(10,2))
  newf = fstart
  for(i in 1:10) DblCns[,i] = newf = Gfunc(newf,Dmat)
  round(DblCns,6)
          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,] 0.023492 0.023497 0.023497 0.023497 0.023497 0.023497 0.023497 0.023497
 [2,] 0.093969 0.093988 0.093989 0.093989 0.093989 0.093989 0.093989 0.093989
 [3,] 0.205557 0.205598 0.205601 0.205602 0.205602 0.205602 0.205602 0.205602
 [4,] 0.348611 0.348694 0.348701 0.348702 0.348702 0.348702 0.348702 0.348702
 [5,] 0.484221 0.484247 0.484248 0.484248 0.484248 0.484248 0.484248 0.484248
 [6,] 0.607949 0.607893 0.607885 0.607884 0.607884 0.607884 0.607884 0.607884
 [7,] 0.654752 0.654646 0.654631 0.654629 0.654629 0.654629 0.654629 0.654629
 [8,] 0.691932 0.692059 0.692082 0.692085 0.692086 0.692086 0.692086 0.692086
 [9,] 0.691932 0.692059 0.692082 0.692085 0.692086 0.692086 0.692086 0.692086
[10,] 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
          [,9]    [,10]
 [1,] 0.023497 0.023497
 [2,] 0.093989 0.093989
 [3,] 0.205602 0.205602
 [4,] 0.348702 0.348702
 [5,] 0.484248 0.484248
 [6,] 0.607884 0.607884
 [7,] 0.654629 0.654629
 [8,] 0.692086 0.692086
 [9,] 0.692086 0.692086
[10,] 1.000000 1.000000

CONVERGENCE IS INCREDIBLY FAST !!!   Suppose we start with a very poor
initial guess, namely

> fstart2 = (1:10)/10
  newf = fstart2
  for(i in 1:10) DblCns[,i] = newf = Gfunc(newf,Dmat)

## Again the convergence is incredibly rapid:

> round(DblCns[,3],4)
 [1] 0.0235 0.0940 0.2056 0.3487 0.4842 0.6079 0.6546 0.6921 0.6921 1.0000

### Exactly the same as before !!!
### To assess convergence at the final (10th) iteration, consider

> sum(abs(DblCns[,9]-DblCns[,10]))
[1] 1.948173e-10

-------------------------------------------------------------------

At this point, we try to relate the Self-Consistency Equation covered in
class on Monday 10/18 to the convergent result of the algorithm just
described and coded.  Recall that the self-consistency equation said that

   F(t) = n^(-1) * [ sum_{i: tau_i <= t} ( I[Dth_i < 2]
                         + I[Dth_i = 2]*(F(t)-F(tau_i))/(1-F(tau_i)) )
                     + sum_{i: tau_i > t} I[Dth_i = 0]*F(t)/F(tau_i) ]

So we code a function to take a vector of F values and calculate the
right-hand side, as follows.

> Hfunc = function(Fvec, Dmat) {
     ## right-hand side of the self-consistency equation at each tau_j
     nr = nrow(Dmat)
     DLR = array(0, dim=rep(nr,2))
     ## fractional allocation of left-censored counts to times <= tau_i
     for(i in 1:nr) if(Dmat[i,"nL"]>0)
        DLR[i,1:i] = Dmat[i,"nL"]*diff(c(0,Fvec[1:i]))/Fvec[i]
     ## fractional allocation of right-censored counts to times > tau_i
     for(i in 1:(nr-1)) if(Dmat[i,"nR"]>0)
        DLR[i,(i+1):nr] <- Dmat[i,"nR"]*diff(Fvec[i:nr])/(1-Fvec[i])
     cumsum(t(DLR) %*% rep(1,nr) + Dmat[,"nD"])/sum(Dmat) }

To check whether the self-consistency equation holds for the Turnbull
estimated d.f. in the double-censoring example above, we check whether the
DblCns[,10] vector is preserved by the function Hfunc(., Dmat):

> sum(abs(DblCns[,10]-Hfunc(DblCns[,10],Dmat)))
[1] 7.436309e-12

### So it IS preserved, and we DO get self-consistency this way!!

As a further check on the self-consistency idea, note that we could also get
our estimator by iterating Hfunc:

> fnew = fstart2
> fnew
 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> for(i in 1:3) {
     fnew = Hfunc(fnew,Dmat)
     cat(round(fnew,5),"\n") }
0.0235 0.09399 0.2056 0.3487 0.48425 0.60788 0.65463 0.69208 0.69208 1
0.0235 0.09399 0.2056 0.3487 0.48425 0.60788 0.65463 0.69208 0.69208 1
0.0235 0.09399 0.2056 0.3487 0.48425 0.60788 0.65463 0.69208 0.69208 1

### Again, very rapid convergence !!!
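Finally, instead of always running a fixed 10 iterations, either map (Gfunc
or Hfunc) can be iterated until successive d.f. estimates change by less
than a tolerance, using the same sum-of-absolute-differences criterion as
above.  The following is a minimal sketch added to this log; the name
SelfConsist and the arguments tol and maxit are our own illustrative choices.

> SelfConsist = function(Fvec, Dmat, mapfun=Gfunc, tol=1e-10, maxit=500) {
     ## iterate the chosen self-consistency map until the L1 change is < tol
     for(k in 1:maxit) {
        Fnew = mapfun(Fvec, Dmat)
        if(sum(abs(Fnew-Fvec)) < tol) break
        Fvec = Fnew }
     list(F=c(Fnew), niter=k) }

### e.g.  SelfConsist(fstart2, Dmat)$niter   or   SelfConsist(fstart2, Dmat, Hfunc)$F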