Summary of Steps and R Log in Computing Self-Consistent Estimator
of Survival D.f. for Double-Censored Data
========================================================      10/17/21

Ordinarily we begin with data in the form of a column of n individuals'
(unsorted) event-times, Times, and a column of corresponding indicators,
Dths, equal by definition to 1 if the individual subject was observed to
die, 0 if the subject was left-censored, and 2 if right-censored.

STEP 1.  Sort the DISTINCT event-times to get a column tauvec, along with
columns nDvec, nLvec, nRvec of the numbers respectively dead, left-censored,
and right-censored at the corresponding times in tauvec.

> tauvec = sort(unique(Times))
  nDvec = tapply(Dths==1, Times, sum)
  nLvec = tapply(Dths==0, Times, sum)
  nRvec = tapply(Dths==2, Times, sum)
  Dmat = cbind(nD=nDvec, nL=nLvec, nR=nRvec)

STEP 2.  Now we want an initial survival-distribution estimator.  The
simplest way might be (as suggested in the book's example) to ignore the
left-censored observations and get the Kaplan-Meier d.f. estimator.  (The
functions Surv and survfit require the survival package, loaded first.)

> library(survival)
  fstart = 1-survfit(Surv(Times[Dths>0], 2-Dths[Dths>0])~1)$surv

STEP 3.  Finally, we want to implement the inductive step x_{k+1} = G(x_k)
of the Turnbull self-consistent estimator as a function G.  Computationally,
this is done by re-allocating the left-censored observations fractionally to
all smaller event-times, and then computing a new Kaplan-Meier product-limit
estimator with the resulting fractional numbers of observed deaths.  We
create a matrix DL_{ij} (with nonzero entries only for j <= i) giving the
number of "estimated" deaths at tau_j due to individuals left-censored at
tau_i; the formula is  nLvec[i]*(F.old[j]-F.old[j-1])/F.old[i].

> Gfunc = function(Fvec, Dmat) {
     ## one self-consistency step: re-allocate left-censored counts according
     ## to Fvec, then recompute the product-limit d.f. estimate
     nr = nrow(Dmat)
     DL = array(0, dim=rep(nr,2))
     for(i in 2:nr) if(Dmat[i,"nL"]>0)
        DL[i,1:i] = Dmat[i,"nL"]*diff(c(0,Fvec[1:i]))/Fvec[i]
     newdth = Dmat[,"nD"] + t(DL) %*% rep(1,nr)
     newY = rev(cumsum(rev(newdth+Dmat[,"nR"])))
     1-cumprod(1-newdth/newY) }

------------------------------------------------------------

Now we illustrate/test these steps using the example from the text, on
California high school marijuana use.  Since we are not given the data in
individual records, we omit Step 1 and enter the data directly from Table
5.1 of Klein & Moeschberger.  We use age 19 in place of the ">18" category
from that Table.

> tauvec = c(10:19)
  nLvec = c(0,0,0,1,2,3,2,3,1,0)
  nDvec = c(4,12,19,24,20,13,3,1,0,4)
  nRvec = c(0,0,2,15,24,18,14,6,0,0)
  Yvec = rev(cumsum(rev(nDvec+nRvec)))
  fstart = 1-cumprod(1-nDvec/Yvec)
  Dmat = cbind(nD=nDvec, nL=nLvec, nR=nRvec)
  Sfinal = c(.977,.906,.794,.651,.516,.392,.345,.308,.308,0)   ## book's final survival estimates, for comparison
  Ffinal = 1-Sfinal
  DL = array(0, dim=c(10,10))
  for(i in 2:10) if(nLvec[i]>0)
     DL[i,1:i] = nLvec[i]*diff(c(0,fstart[1:i]))/fstart[i]

### Then the newly allocated number of deaths is

> c(round(nDvec + t(DL) %*% rep(1,10), 3))
 [1]  4.487 13.461 21.313 26.963 22.437 14.714  3.417  1.207  0.000  4.000

### This agrees precisely with the 2nd column of Table 5.3 in the book.
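Aside (an addition to this log, not from the book): since Step 1 was omitted
because Table 5.1 gives only counts, here is a minimal sketch of how Step 1
could still be exercised on this example, by expanding the counts back into
individual-level records.  The names Times2, Dths2, Dmat2 are illustrative
and do not appear elsewhere in the log.

> Times2 = rep(rep(tauvec,3), c(nDvec,nLvec,nRvec))           ## one entry per subject
  Dths2  = rep(rep(c(1,0,2), each=10), c(nDvec,nLvec,nRvec))  ## 1=death, 0=left-, 2=right-censored
  Dmat2  = cbind(nD = tapply(Dths2==1, Times2, sum),          ## Step-1 tabulation
                 nL = tapply(Dths2==0, Times2, sum),
                 nR = tapply(Dths2==2, Times2, sum))
  all(Dmat2 == Dmat)     ## should be TRUE if the expansion is consistent with Dmat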
Next we show explicitly the first 10 iterations, starting from fstart.

> DblCns = array(0, dim=rep(10,2))
  newf = fstart
  for(i in 1:10) DblCns[,i] = newf = Gfunc(newf,Dmat)
  round(DblCns,6)
          [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]
 [1,] 0.023492 0.023497 0.023497 0.023497 0.023497 0.023497 0.023497 0.023497
 [2,] 0.093969 0.093988 0.093989 0.093989 0.093989 0.093989 0.093989 0.093989
 [3,] 0.205557 0.205598 0.205601 0.205602 0.205602 0.205602 0.205602 0.205602
 [4,] 0.348611 0.348694 0.348701 0.348702 0.348702 0.348702 0.348702 0.348702
 [5,] 0.484221 0.484247 0.484248 0.484248 0.484248 0.484248 0.484248 0.484248
 [6,] 0.607949 0.607893 0.607885 0.607884 0.607884 0.607884 0.607884 0.607884
 [7,] 0.654752 0.654646 0.654631 0.654629 0.654629 0.654629 0.654629 0.654629
 [8,] 0.691932 0.692059 0.692082 0.692085 0.692086 0.692086 0.692086 0.692086
 [9,] 0.691932 0.692059 0.692082 0.692085 0.692086 0.692086 0.692086 0.692086
[10,] 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000
          [,9]    [,10]
 [1,] 0.023497 0.023497
 [2,] 0.093989 0.093989
 [3,] 0.205602 0.205602
 [4,] 0.348702 0.348702
 [5,] 0.484248 0.484248
 [6,] 0.607884 0.607884
 [7,] 0.654629 0.654629
 [8,] 0.692086 0.692086
 [9,] 0.692086 0.692086
[10,] 1.000000 1.000000

CONVERGENCE IS INCREDIBLY FAST !!!   Suppose we start with a very poor
initial guess, namely

> fstart2 = (1:10)/10
  newf = fstart2
  for(i in 1:10) DblCns[,i] = newf = Gfunc(newf,Dmat)

## Again the convergence is incredibly rapid:

> round(DblCns[,3],4)
 [1] 0.0235 0.0940 0.2056 0.3487 0.4842 0.6079 0.6546 0.6921 0.6921 1.0000

### Exactly the same as before !!!
### To assess convergence at the final (10th) iteration, consider

> sum(abs(DblCns[,9]-DblCns[,10]))
[1] 1.948173e-10

-------------------------------------------------------------------

At this point, we try to relate the Self-Consistency Equation covered in
class on Monday 10/18 to the convergent result of the algorithm just
described and coded.  Recall that the self-consistency equation said that

   F(t) = n^(-1) * [ sum_{i: tau_i <= t} ( I[Dth_i < 2]
                         + I[Dth_i = 2]*(F(t)-F(tau_i))/(1-F(tau_i)) )
                     + sum_{i: tau_i > t} I[Dth_i = 0]*F(t)/F(tau_i) ]

So we code a function to take a vector of F values and calculate the
right-hand side, as follows.

> Hfunc = function(Fvec, Dmat) {
     ## right-hand side of the self-consistency equation at each tau_j
     nr = nrow(Dmat)
     DLR = array(0, dim=rep(nr,2))
     ## fractional allocation of left-censored counts to times <= tau_i
     for(i in 1:nr) if(Dmat[i,"nL"]>0)
        DLR[i,1:i] = Dmat[i,"nL"]*diff(c(0,Fvec[1:i]))/Fvec[i]
     ## fractional allocation of right-censored counts to times > tau_i
     for(i in 1:(nr-1)) if(Dmat[i,"nR"]>0)
        DLR[i,(i+1):nr] <- Dmat[i,"nR"]*diff(Fvec[i:nr])/(1-Fvec[i])
     cumsum(t(DLR) %*% rep(1,nr) + Dmat[,"nD"])/sum(Dmat) }

To check whether the self-consistency equation holds for the Turnbull
estimated d.f. in the double-censoring example above, we check whether the
DblCns[,10] vector is preserved by the function Hfunc(., Dmat):

> sum(abs(DblCns[,10]-Hfunc(DblCns[,10],Dmat)))
[1] 7.436309e-12

### So it IS preserved, and we DO get self-consistency this way!!

As a further check on the self-consistency idea, note that we could also get
our estimator by iterating Hfunc:

> fnew = fstart2
> fnew
 [1] 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> for(i in 1:3) {
     fnew = Hfunc(fnew,Dmat)
     cat(round(fnew,5),"\n") }
0.0235 0.09399 0.2056 0.3487 0.48425 0.60788 0.65463 0.69208 0.69208 1
0.0235 0.09399 0.2056 0.3487 0.48425 0.60788 0.65463 0.69208 0.69208 1
0.0235 0.09399 0.2056 0.3487 0.48425 0.60788 0.65463 0.69208 0.69208 1

### Again, very rapid convergence !!!
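Finally, instead of always running a fixed 10 iterations, either map (Gfunc
or Hfunc) can be iterated until successive d.f. estimates change by less
than a tolerance, using the same sum-of-absolute-differences criterion as
above.  The following is a minimal sketch added to this log; the name
SelfConsist and the arguments tol and maxit are our own illustrative choices.

> SelfConsist = function(Fvec, Dmat, mapfun=Gfunc, tol=1e-10, maxit=500) {
     ## iterate the chosen self-consistency map until the L1 change is < tol
     for(k in 1:maxit) {
        Fnew = mapfun(Fvec, Dmat)
        if(sum(abs(Fnew-Fvec)) < tol) break
        Fvec = Fnew }
     list(F=c(Fnew), niter=k) }

### e.g.  SelfConsist(fstart2, Dmat)$niter   or   SelfConsist(fstart2, Dmat, Hfunc)$F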