BASIC COMMANDS IN R FOR DATA ENTRY AND COMPUTATIONS OF LIFE TABLES AND ESTIMATORS, WITH ILLUSTRATIONS ================================================== 9/7/05 PRELIMINARY Start R by simply typing "R" at a unix command-prompt, within the directory where you would like to save your R workspace, or by double-clicking an R icon in Windows. When you exit R, by typing > q() if you answer "yes" or "y" to the save request, the workspace will be saved in ".RData" and will automatically be re-loaded when you start R again within the same directory. (In Windows you can save your R workspace by going to "save" under the File Menu, and choosing a directory and naming a file with suffix ".RData" to save into, which can then re-open in future by double-clicking on it.) Note that R commands are given following command prompt > and that text within a line following "#" is ignored, and that continuation lines begin with "+". > .First <- function() { + options(editor="emacs") + library(survival) + help.start() } ## Putting this function with this name into your saved ## workspace means that it is called immediately whenever ## you re-open the workspace in a new R session. ### To call the initializing function ".First" directly: > .First() Making links in ~/.R ... If netscape is already running, it is *not* restarted, and you must switch to its window. Otherwise, be patient ... --------------------------------------------------- (1) R objects are named and assigned with the assignment operator "<-" . Some examples of R objects are VECTORS > x <- 1:10 > y <- c(3, 7, -4, 2, 4.555, 20) > z <- runif(100) > L <- c("a","B","Wr") Here x is a 10-dimensional vector with entries 1,2,3,...,10; y is a 6-dimensional vector with indicated entries; z is a vector of 100 randomly generated uniform variates; and L is a 3-dimensional vector whose entries are character-strings. Refer to vector entries in square-brackets, eg "x[3]". Define subvectors by referring to a vector of integer entries, e.g. > y[c(1,3,6)] [1] 3 -4 20 MATRICES > M <- matrix( 1:10, ncol=2) ### defines a 5x2 matrix M > M [,1] [,2] [1,] 1 6 [2,] 2 7 [3,] 3 8 [4,] 4 9 [5,] 5 10 > M[3,2] > M[3,2] [1] 8 > M[2:4,] [,1] [,2] [1,] 2 7 [2,] 3 8 [3,] 4 9 LISTS R objects can conveniently be strung together into lists, which means that a number of related vectors or matrices (or other R objects, including lists!) can be given a single name and worked with as a unit. This is useful because the "object-oriented" style of R typically makes the output from any statistical function in R a list which may then serve as input to another statistical function (either a pre-coded one, or one which you may code yourself). Thus: > LL <- list(x=x, y=y, Ran = round(z[1:7],5)) > LL $x [1] 1 2 3 4 5 6 7 8 9 10 $y [1] 3.000 7.000 -4.000 2.000 4.555 20.000 $Ran [1] 0.76233 0.08557 0.35093 0.27004 0.11411 0.50839 0.25646 The "x", "y", and "Ran" names are used to refer to the three list components; thus LL$x is the vector 1:10, LL$y is the 6-vector with same entries as y defined above, and LL$Ran is the vector of the first 10 entries of z rounded to 5 decimal places. DATA FRAMES A data frame is an object like a matrix (with entries that can be referred to in the same way, eg: Dfram[3,8]) but also with the structure of a list. So if the 3rd column of the data-frame Dfram has header or name "ndth", then the whole column can be referred to either as Dfram[,3] or as Dfram$ndth. One useful aspect of data-frames is that, unlike matrices which must be either all numeric or all boolean or all character, you can have a data-fram with different colmuns of different types, e.g. > Dfram <- cbind.data.frame(newx=x[4:9], Ran2= round(z[11:16],4), Good=c(T,T,F,T,F,T), label=c(L,L)) > Dfram newx Ran2 Good label 1 4 0.7302 TRUE a 2 5 0.0851 TRUE B 3 6 0.8456 FALSE Wr 4 7 0.4622 TRUE a 5 8 0.6959 FALSE B 6 9 0.8326 TRUE Wr > Dfram$Ran2 [1] 0.7302 0.0851 0.8456 0.4622 0.6959 0.8326 --------------------------------------------------- (2) DATA ENTRY. See the script at http://www.math.umd.edu/~evs/Data/ImportDat.txt for some illustrations. The R function scan("indata.txt") reads all of the entries (assumed separated by blanks) as a long numeric vector if possible, or as a vector of character-strings if there is any character data in the input file (as there generally will be if you do not skip initial header lines). If the ASCII input file is in a tabular form, with column headers in the first line, and with exactly the same number of entries in each line (and with character data always appearing in the same columns if at all), then the command > newfram <- read.table("indata.txt", header=T) will enter it into a data-frame named "newfram". --------------------------------------------------- (3) LIFE TABLE AND SURVIVAL-RELATED FUNCTIONS Let's make a sample survival dataset using random numbers, beginning with a specified seed so that you can reproduce the commands. > .Random.seed [1] 1 695938663 850815041 > NewSeed <- .Random.seed In order to use the same seed, you should issue the command > .Random.seed <- c(1, 695938663 850815041) Now we create a vector T consisting of 20 unordered exponentially distributed variates, and a vector Dth consisting of Binom(1,.7) variates. > T <- rexp(20) Dth <- rbinom(20,1,.7) > Dth [1] 1 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 > round(T,6) [1] 1.285517 1.217859 0.710261 1.941706 1.493600 0.261126 1.149667 1.223809 [9] 0.519391 1.752205 1.728407 0.671232 0.851777 1.269756 0.394057 1.331550 [17] 0.175630 1.498749 0.001993 0.113999 Two of the very useful R commands related to survival data are Surv and survfit, both within the "survival" package to which you have access after the command > library(survival) which you must either type directly or put into your initializing function .First() as above. "Surv" creates a "survival object" by marking with a "+" those times that are censored, ie those that correspond to death-indicator 0: > Surv(T,Dth) [1] 1.285517101 1.217858786+ 0.710260889+ 1.941706190 1.493600235 [6] 0.261125644 1.149666859+ 1.223808518 0.519391177 1.752204616 [11] 1.728407180 0.671232499+ 0.851776768 1.269756441 0.394056911 [16] 1.331549701 0.175629556 1.498749230 0.001993308+ 0.113998710 Then survfit creates a list with various useful life-table elements and estimates, using the survival object as input: > Slist <- survfit(Surv(T,Dth)) > names(Slist) [1] "n" "time" "n.risk" "n.event" "surv" "type" [7] "std.err" "upper" "lower" "conf.type" "conf.int" "call" At this point, you should use the fact that the $time, $n.event, and $n.risk list components are respectively the ordered distinct survival times, the number of observed deaths (with Dth=1) at those times, and the number-at-risk: > round(Slist$time,5) [1] 0.00199 0.11400 0.17563 0.26113 0.39406 0.51939 0.67123 0.71026 0.85178 [10] 1.14967 1.21786 1.22381 1.26976 1.28552 1.33155 1.49360 1.49875 1.72841 [19] 1.75220 1.94171 > Slist$n.ev [1] 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 1 1 1 1 1 > Slist$n.risk [1] 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 If we want to restrict these vectors to the times at which observed deaths occur, we make an index-vector > ind <- (1:20)[Slist$n.ev==1] > ind [1] 2 3 4 5 6 9 12 13 14 15 16 17 18 19 20 > round(Slist$time[ind],4) [1] 0.1140 0.1756 0.2611 0.3941 0.5194 0.8518 1.2238 1.2698 1.2855 1.3315 [11] 1.4936 1.4987 1.7284 1.7522 1.9417 > Slist$n.ev[ind] [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 > Slist$n.risk[ind] [1] 19 18 17 16 15 12 9 8 7 6 5 4 3 2 1 ## You should check these commands and results to make ## sure that you understand them. NOTE: Slist$surv contains the Kaplan--Meier estimated (right-continuous) survival function estimator at the times in Slist$time. ------------------------------------------------------------- (4) ILLUSTRATION USING SIMULATED SURVIVAL DATA, INCLUDING PLOTTING ============================================== Here are some steps in R to reproduce the Table 1.4 and Figure 1.2 exhibits relating to the data in Table 1.1, Chapter 1 of Kalbfleisch and Prentice. > Tbl1.1fr <- cbind.data.frame( Group=c(rep(1,19), rep(2,21)), Dth=c(rep(1,17),0,0,rep(1,19),0,0), Times= c(143, 164, 188, 188, 190, 192, 206, 209, 213, 216, 220, 227, 230, 234, 246, 265, 304, 216, 244, 142, 156, 163, 198, 205, 232, 232, rep(233, 4), 239, 240, 261, 280, 280, 296, 296, 323, 204, 344)) > Ex1.1Gp1 <- survfit(Surv(Tbl1.1fr$Times[1:19], Tbl1.1fr$Dth[1:19])) Ex1.1Gp2 <- survfit(Surv(Tbl1.1fr$Times[20:40], Tbl1.1fr$Dth[20:40])) ### To get Table 1.4 outputs: first for Group 1 > round(cbind(ti=Ex1.1Gp1$time, di=Ex1.1Gp1$n.ev, ni=Ex1.1Gp1$n.ri, SKMti=Ex1.1Gp1$surv, SKMvar=(Ex1.1Gp1$surv*Ex1.1Gp1$std.err)^2),5) ### and a similar command works for Group 2. ### Now for the over-plotted pictures In general, plot(x,y) sets up axes and plots points (x[i],y[i]) for two vectors of equal lengths. Additional character-string arguments xlab, ylab, and main to the "plot" function respectively control the x-axis label, y-axis label and main title of the graph. If you want to plot lines, use the argument type="l"; if stairsteps type="s" (which is what we want in a survival-function context). Plotting-characters for plotted points are controlled by the pch argument, for plotted lines by lty (=1 for solid line, the default, or =2 or higher for dashed and dotted lines). Finally, to overplot you should issue subsequent commands with "points(x,y)" to plot new points on the same axes, or "lines(x,y)" to plot new lines. Here are the commands used to generate a graph like Figure 1.2, p.17, in R. > plot(Ex1.1Gp1$time, Ex1.1Gp1$surv, xlab="Time", ylab="Surv Prob", type="s", lty=1, xlim=c(0,325), main="Carcinogenesis Data KM Plots") lines(Ex1.1Gp2$time, Ex1.1Gp2$surv, type="s", lty=2) ## To add a legend, you type the following in R, ## then click the middle mouse-button over the graph ## at the point where you want ther upper-left of ## the printed legend box to appear. [On PC, first ## click left mouse-button and then right. > legend(locator(), legend=c("Group 1","Group 2"), lty=1:2) ## After producing a graph, to get rid of the graph window type > dev.off() Finally, to do a log-rank test comparing survival in the two groups: > survdiff(formula = Surv(Times, Dth) ~ Group, data = Tbl1.1fr) Call: survdiff(formula = Surv(Times, Dth) ~ Group, data = Tbl1.1fr) N Observed Expected (O-E)^2/E (O-E)^2/V Group=1 19 17 12.2 1.853 3.12 Group=2 21 19 23.8 0.954 3.12 Chisq= 3.1 on 1 degrees of freedom, p= 0.0772 ## Here the Log-Rank "numerator" O-E for Gp2 vs Gp1 survival ## (with positive numbers indicating "excess deaths in Gp1 ## which implies Gp2 superiority with respect to survival) ## is 17-12.2 = 4.8, and the variance can be found as ## 3.12/(4.8^2) = 0.1354. The signed logrank statistic is ## (O-E)/sqrt(V) = sqrt(3.12) = 1.766352, which is treated as ## a standard normal deviate. We can recover the p-value, ## except for errors due to rounding in the table above, as > 2*(1-pnorm(1.76635)) [1] 0.0773