# Statistics 702: Survival Analysis

### Fall 2021, MWF 9-9:50am, Mth B0423. In-person class, with assignments and additional materials on ELMS

Instructor: Professor Eric Slud,  Statistics Program,  Math Dept.,   Rm 2314, x5-5469,  slud@umd.edu

Office hours: M 1-2, W 11-12 (initially), or email me to make an appointment (can be on Zoom).

Please fill out the on-line Evaluation form on this Course and instructor at http://CourseEvalUM.umd.edu. Thank you.

SAMPLE  PROBLEMS  FOR  IN-CLASS  TEST: from 2005 and 2008 and Practice Problems 2021.

Course Overview: The topic of the course is the statistical analysis of data on lifetimes or durations. Such data are often (right-) censored: subjects may leave the study at random times (and in some cases return), and those who are in the study and have not died by its ending time are simply recorded as being still alive. Another possible data pattern is left truncation (or simply truncation), where subjects enter the study (at a recorded time) only if they meet some criterion, which may involve an age variable, time since diagnosis, or some other preliminary event. Such data arise frequently in clinical trials, epidemiologic studies, reliability tests, and insurance. We first present parameterizations of survival distributions in terms of hazard intensities, which lend themselves to the formulation of parametric models, including regression-type models that relate failure-time distributions to auxiliary biomedical predictors. Truncation and censoring present unique challenges in formulating likelihoods and in efficient estimation and testing when the distributions of arrival times and withdrawal times are unknown and not parametrically modelled. This statistical topic has achieved great prominence in the theoretical statistical literature because it is a particularly good arena for introducing techniques for estimating and testing finite-dimensional parameter values (such as a treatment-effectiveness parameter in clinical studies) in the presence of infinite-dimensional unknown parameters. Such problems are called semiparametric.
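As a reminder of what a hazard-based parameterization looks like, the standard identities for a continuous lifetime are collected below, together with the usual form of a right-censored likelihood. The notation (f, S, h, Λ, and the death indicator δ_i) is chosen here for illustration and need not match the text's, and the likelihood assumes independent, noninformative right censoring.

```latex
% Hazard, cumulative hazard, and survival function of a continuous lifetime
h(t) = \frac{f(t)}{S(t)}, \qquad
\Lambda(t) = \int_0^t h(u)\,du, \qquad
S(t) = \exp\{-\Lambda(t)\}

% Likelihood for right-censored observations (t_i, \delta_i), i = 1, \dots, n,
% where \delta_i = 1 if t_i is an observed death and 0 if it is a censoring time
L(\theta) = \prod_{i=1}^{n} h(t_i;\theta)^{\delta_i}\, S(t_i;\theta)
```

Each subject contributes the density f = h · S when the death is observed, and only the survival probability S when the lifetime is censored.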

### Required Text:

Klein, J. and Moeschberger, M. (2003) Survival Analysis: Techniques for Censored and Truncated Data, 2nd ed. Springer

Prerequisites: Stat 410 and either Stat 420 or Stat 700. The presentation will be geared to second-year Stat grad students.

Grading:  The course grade will be based 50% on 6 or 7 homework problem sets, 25% on an in-class test (tentatively scheduled for October 29), and 25% on a course project or paper at the end. The homework problems will be a mixture of theoretical problems at Stat 410/Stat 700 level, and of computational or data-analysis problems. Grading of these problems will also be based on the quality of verbal descriptions and interpretations of results: submitted analyses presenting only undigested numerical output will be graded down. The in-class test will be designed to test (i) definitions (of models and distributions and statistics), (ii) ability to use model definitions to construct likelihoods (and partial likelihoods) and derive statistics from them, and (iii) basic properties of estimators and test-statistics studied in class.
The course project will be either a paper on a topic not fully covered in class, with illustrative data analysis, or an extended and coherent data analysis and writeup (of about 10-12 pages, not including computer output). Note: homework problem assignments will be due approximately every 2 weeks. The problem sets and due dates will be posted to this web-page and on ELMS and announced in class, and solutions are to be submitted as PDFs on ELMS. Homework is due on the dates announced and will be graded down for lateness unless you have a VERY good excuse.

Homework Guidelines: Homework papers are to be worked on individually, except that you may share verbal hints (or get such hints from me) about how to approach a problem. Working together or sharing computed results or written work is a violation of the Code of Academic Integrity and will be reported. You are to hand in HW papers electronically, as PDFs posted to ELMS. If you create the homework paper by using text files containing R scripts, graphical outputs or scanned files, then I recommend that you import these into MS Word and save the document as a single PDF before submitting it. Multiple-document submissions will not be accepted.

President Pines provided clear expectations to the University about the wearing of masks for students, faculty, and staff. Face coverings over the nose and mouth are required while you are indoors at all times. (The only allowed exception when it comes to classrooms and laboratories is for course instructors while they are teaching and adequately distanced from the class.) Students not wearing a mask will be given a warning and asked to wear one, or will be asked to leave the room immediately. Students who have additional issues with the mask expectation after a first warning will be referred to the Office of Student Conduct for failure to comply with a directive of University officials.

### Recommended Text(s):

(1)     Kalbfleisch, J. and Prentice, R. (2002) The Statistical Analysis of Failure Time Data, 2nd ed. Wiley

This book was used once for the course. Its explanations are harder and less straightforward, but often more intuitive.

(2)   Another very useful and readable recommended text (reissued in 1998 as a paperback and currently as an e-book)

Miller, R., Jr. (1980) Survival Analysis. Wiley-Interscience (reissued 1998).

(3)   An easier book that can be used for self-study and review, free to students as an e-book through the UMD library:

Kleinbaum, D. and Klein, M. (2005), Survival Analysis - A Self-Learning Text, 2nd ed., Springer.

(4)   For the more mathematically inclined, a primarily theoretical text by two former Maryland students:

Fleming, T. and Harrington, D. (1991) Counting Processes and Survival Analysis. Wiley.

Coverage of the Klein & Moeschberger book will be Chapters 1-9, plus a few miscellaneous topics. The main topics are:

• Survival distributions, hazard functions, and the life table method; regression-type models for survival in terms of predictor variables, including the famous Cox model and random-effect or 'frailty' model extensions;
• Formulation of Likelihoods for censored and truncated data;
• Parameter estimation and hypothesis testing in parametric and semiparametric settings, including the Kaplan-Meier survival function estimator and Nelson-Aalen cumulative-hazard function estimator;
• Goodness of fit diagnostics and testing for estimated models;
• EM algorithm and missing-data approaches to censored data; and
• Methods for estimating survival distributions involving smoothing and density-estimation techniques.

Klein & Moeschberger is a very methods-oriented book, and will be covered along with R software implementation with real-data examples. The Miller book explains things well and gives good background and literature references. For additional mathematical justifications, including the connection with counting processes and martingales, I will draw additional material from Fleming and Harrington, my own notes, and the research literature. Other data examples, and more sophisticated data analyses, can be found in the Kalbfleisch and Prentice book (get the data from the R package KMsurv).

Computing in the course can be done with R, SAS, or any other package you are familiar with that also has preprogrammed Survival Analysis modules. However, R is by far the best choice if you want guidance and/or help from me, and if you want access to the newest methods from the research literature. Various datasets can be explored and accessed within existing R packages and libraries, e.g. by issuing the command    > data()    after    > library(survival)   or   [for all datasets from the Klein and Moeschberger book including its exercises]   after   library(KMsurv). Whatever package you choose, you can get computing help, datasets, and further links here.  In particular you can get lots of survival datasets, including some that were in the Kalbfleisch and Prentice book, by clicking here and searching for the keyword "survival".

See the Handouts section below for a link to the "Rbasics" file connected with the data analysis tasks needed for Homeworks. For the systematic Introduction to R and R reference manual distributed with the R software, either download from the R website or simply invoke the command    > help.start()    from within R. For a slightly less extensive introductory tutorial in R, click here. A very handy reference card containing R commands can be found here.

## Lecture Note and Slides Handouts:

• Slides on Survival Data Structure and Hazard Functions
• Slides on Parametric Censored-Data Likelihoods and Weibull MLE's
• Handout on Nelson-Aalen Estimator as a limit of MLE's of piecewise constant hazards when the positive time-line is partitioned into smaller and smaller intervals.
• Interpretation of Kaplan-Meier as Nonparametric MLE.
• One page handout on Partial Likelihood via Marginal Likelihood in the Cox Model.
• Slides of my Stat Seminar 5/3/2012 with Jiraphan Suntornchost, on parametric survival densities from so-called Phase-Type Models.

• You can look at some sample problems for the Fall 05 in-class test here. This year's in-class test will have at most 3 problems and will be oriented as closely to definitions and basic equivalence of models as I can make it. Other sample problems can be found here and also Practice Problems 2021. If you would like to bring a sheet of formulas, that is OK, but the test will otherwise be closed-book.

• You can view the slides in a Survival Analysis short course that I gave in Spring 2005. See especially pages 34-36 and 47-50 and 58-72 for ideas relevant to the data-analyses in your final Projects, and click here for the R scripts I used in analysing the data for that short course.

• I gave three 'Mini-Course' talks on Survival Analysis a couple of years ago which are very relevant to the material of this course. The slides are available in pdf format. They are respectively about Competing Risks, Martingales and Populations, and Semiparametric Models.

## Statistical Computing Handouts:

The following material and handouts were produced 10+ years ago, when the Klein & Moeschberger text had a web-site from which data could be downloaded. In the intervening years that website was taken down, and some of the important R functions in the survival package (such as survfit) have changed in important ways. The datasets can now be found in the R package KMsurv, and the R functions are mostly still usable with minor changes. But these handouts have to be refreshed and updated to be completely current. I am gradually doing that, and will indicate with the label UPDATED those R scripts listed below for which I have made these modifications in Fall 2021.

NOTE: to get started using survival-related functions in R, you need to "load" the R survival package, which is accomplished by the command:    > library(survival)

Handouts can be found at linked pages for each of the following topics:

(0) Basics on R commands for data entry and Life Tables and Life Table construction. In addition, various useful files on Statistical Computing can be found at my course web-page for Stat 705, along with additional relevant links.

(1) UPDATED    Nelson-Aalen & Kaplan Meier calculation

(3) UPDATED   Nelson-Aalen calculation for left-truncated right-censored data.

(4) UPDATED    Script and Illustrative Picture on model fitting of VA Lung-Cancer data in R. This Script and picture also contain material about fitting and plotting the Cox Model for the same dataset and comparing the results to the previous accelerated failure time parametric regression model.

(5) UPDATED    R calculations for weighted logrank (2-sample) test. Also available is a New Script-file on Stratified and K-sample Logrank statistics using "survdiff" UPDATED for F21.

(6) New illustration of Stratified versus interaction-term tests of difference between coefficients in subgroups of a survival dataset. This R script and picture explain in the example of a Mayo lung-cancer study that there are differences between the coefficient for a baseline health index ("Karnofsky score") for the two sexes in the study, but that these differences are obscured if an assumption of common baseline hazard for both sexes is made.

(7) Handout containing UPDATED    R Log on Self-Consistency Property of Kaplan-Meier Estimator and Redistribute-to-the-Right Algorithm and UPDATED Coding for Turnbull (1974) self-consistent estimator of survival-distribution in double-censored survival data.

(8) R Script for Time-dependent Cox-Model fitting, illustrated with data analysis of Mayo-Clinic Lung Cancer Data. An UPDATED version is now available in the RScripts directory, here.

(9) R script for calculating Partial Likelihoods in (non-time-dependent) Cox-model. This includes calculations with risk-groups. The script will later be augmented to include the calculation of score statistics for individual coefficients.
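To make the role of the risk groups in item (9) concrete, here is a minimal sketch of a Cox log partial likelihood. It is written in plain Python rather than R so that every step is explicit, and it is not the linked R script. It assumes a single fixed covariate and no tied event times; the function name and the toy data are hypothetical.

```python
import math

def cox_partial_loglik(beta, times, events, z):
    """Log partial likelihood for one covariate, assuming no tied event times.

    times  : observed times (event or censoring), one per subject
    events : 1 if the time is an observed event, 0 if right-censored
    z      : covariate value for each subject
    """
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            # Risk group at t_i: subjects still under observation at that time.
            risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
            denom = sum(math.exp(beta * z[j]) for j in risk)
            total += beta * z[i] - math.log(denom)
    return total

# Hypothetical toy data, not from the text.
toy_times = [2, 3, 5, 7]
toy_events = [1, 0, 1, 1]
toy_z = [1, 0, 1, 0]
print(cox_partial_loglik(0.0, toy_times, toy_events, toy_z))
```

At beta = 0 every subject in a risk group has equal weight, so each event contributes minus the log of the risk-group size, which gives a quick sanity check on any implementation.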

## Syllabus.

Chapter 1. Introduction: Terminology, data structures & examples.    1 class,   8/30

• Definition of terminology: event-times, censoring (left and right), biomedical covariates, life tables. Inferential problems and objectives.

Chapter 2. Failure Time models.    4 classes,   9/1 - 9/10

• Survival and hazard functions. Parametric distributional models, continuous and discrete regression models. Latent failure time model. Competing risks.

Chapter 3. Censored-Data Parametric Inference & Likelihoods.    4 classes,   9/13 - 9/20

• Parametric likelihoods and parametric inference; truncation and interval censoring, large-sample MLE theory.
• Counting processes and cumulative hazards. Statistics as integrals with respect to compensated counting processes.

Chapter 4. Nonparametric survival-curve estimation.    4 classes,   9/22 - 9/29

• Cumulative-hazard (Nelson-Aalen) estimators for right-censored left-truncated data.
• Kaplan-Meier survival function estimators. Confidence bands.
• Quantile (median) estimates and confidence intervals.
• More on competing risks and attempts to restore semiparametric identifiability in Competing Risk settings.
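The two estimators at the center of this chapter can be sketched in a few lines. The version below is plain Python rather than R, handles right censoring only (no left truncation), and uses hypothetical toy data; it is meant as an illustration of the definitions, not as a replacement for survfit in the R survival package.

```python
from collections import Counter

def km_and_nelson_aalen(times, events):
    """Kaplan-Meier and Nelson-Aalen estimates at each distinct death time.

    times  : observed times (death or censoring), one per subject
    events : 1 if the time is an observed death, 0 if right-censored
    Returns a list of (t_j, S_hat(t_j), H_hat(t_j)).
    """
    deaths = Counter(t for t, d in zip(times, events) if d == 1)
    surv, cumhaz, out = 1.0, 0.0, []
    for t in sorted(deaths):
        # Risk set: subjects neither dead nor censored before t.
        at_risk = sum(1 for u in times if u >= t)
        d = deaths[t]
        surv *= 1.0 - d / at_risk   # Kaplan-Meier product-limit factor
        cumhaz += d / at_risk       # Nelson-Aalen increment
        out.append((t, surv, cumhaz))
    return out

# Hypothetical toy data, not from the text.
toy_times = [1, 2, 2, 3, 4, 5]
toy_events = [1, 1, 0, 1, 0, 1]
for t, s, h in km_and_nelson_aalen(toy_times, toy_events):
    print(f"t={t}  S_hat={s:.4f}  H_hat={h:.4f}")
```

A subject censored at the same time as a death is counted as still at risk at that time, which is the usual convention.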

Chapter 5. Estimates for other censoring schemes. 3   classes,   10/1 - 10/6

• Left, double, and interval censoring. Self consistency property of Kaplan-Meier (right-censored case) and extension to estimation algorithms with other kinds of censoring.

Chapter 7. Rank statistics for 1- and 2-sample Tests.    5 classes,   10/8 - 10/18

• Tests based on scores. Relation to contingency table ideas. Stratified tests. Tests for trend.
• Logrank and weighted-logrank tests. Sample size and power. Relation to simple survival regression models.

Chapter 8. Relative Risk Regression Models.    5 classes,   10/20 - 11/3

• Estimation via maximized Partial Likelihood: regression coefficients and estimation of baseline survival function
• Related likelihoods (marginal and rank). Large sample theory of estimators.
• Associated hypothesis tests. Wald, score and LR test analogues.

Chapter 6. Other estimation techniques.     3   classes,    11/5 - 11/10

• Kernel-based estimation of the hazard intensity. Application to excess mortality
• Bayesian nonparametric methods.
• (*) Methods based on (multiple) imputation of censored lifetimes.

Chapter 9. Stratified & Time-Dependent Covariate Cox models.    3 classes,   11/12 - 11/17

• Time dependent covariate version of proportional hazard model. Application to model checking. Tests of fit related to residuals (material taken also from Chapter 11.)
• R functions for stratified time-dependent Cox regression.

Chapter 10. Extended Survival Regression Models.       As time permits

• Material taken from Chapters 10 and 12 on Additive Hazards regression models, and frailty models.
• Material taken from Kalbfleisch-Prentice and other sources on Accelerated failure and Proportional Odds models.
• Introduction of models with time-varying mechanism from journal literature.

## Homework Problem Sets:

Problem Set 1, Due Monday Sept.13, 2021.   ( 6 Problems in all, worth 10 points each.)
Do problems # 2.3, 2.9 (the times to substitute are 12, 24, and 60 months), 2.10, 2.16, and 2.20 from the end of Chapter 2 in the Klein and Moeschberger text. Also to be handed in: using the data in Table 1.2 of the book, create a life table, with rows corresponding to ordered increasing infection times within each of the two ("Surgically Placed Catheter" and "Percutaneous Placed Catheter") groups, showing the number of "failures" (=infections) occurring at that time, and the number at risk (ie individuals within the group who are neither infected nor censored before that time).

See the R script at LifeTab.RLog for an indication of how to build Life Tables using R commands.
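For readers who want the bookkeeping spelled out before turning to the R script, the following sketch builds the same kind of table in plain Python. It uses hypothetical toy records rather than the Table 1.2 data, and the function name and record layout (group, time, event) are assumptions made for illustration.

```python
from collections import defaultdict

def life_tables(records):
    """Build one life table per group from (group, time, event) records.

    event is 1 for a failure and 0 for a censored time. Each table row is
    (failure time, number at risk, number of failures at that time).
    """
    by_group = defaultdict(list)
    for group, time, event in records:
        by_group[group].append((time, event))
    tables = {}
    for group, obs in by_group.items():
        fail_times = sorted({t for t, d in obs if d == 1})
        tables[group] = [
            (
                t,
                # At risk: neither failed nor censored before time t.
                sum(1 for u, _ in obs if u >= t),
                sum(1 for u, d in obs if u == t and d == 1),
            )
            for t in fail_times
        ]
    return tables

# Hypothetical toy records, not the data of Table 1.2.
records = [
    ("A", 1.5, 1), ("A", 2.5, 0), ("A", 2.5, 1), ("A", 4.0, 1),
    ("B", 0.5, 0), ("B", 3.0, 1), ("B", 3.0, 1), ("B", 6.0, 0),
]
tables = life_tables(records)
for group in sorted(tables):
    for row in tables[group]:
        print(group, row)
```

Rows appear only at failure times; censored times change the number at risk in later rows but do not get rows of their own.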

Problem Set 2, Six Problems in all, Due Tuesday Sept.28, 2021 (11:59pm).
Do #3.6, 3.8, 4.1(a)-(e) and (i), 4.2(a)-(c) and (e)-(f). In addition

( 5th problem to hand in ): read Theoretical Note 1 on pp. 56-57 and show as much as you can of the following statement given there:
if, in a bivariate setting with dependent (X,C) having a joint density, the function ρ(t) defined on p.56 is known to be identically equal to a constant ρ, and also the sub-distribution function $F_1(t)$ and event-time survival function $S_T(t)$ are known, then the marginal survival function $S_X(t)$ is uniquely determined, and this survival function depends in a monotonically decreasing way on ρ.

( 6th problem to hand in ): (a) Suppose that $T = \lambda X^{\alpha}$ for some constants $\alpha > 1$ and $\lambda > 0$. Find a formula for $h_T(t)$ in terms of the function $h_X(\cdot)$ and known functions of t (which depend on λ, α). (b) Use (a) to show that if $X \sim f(x, \beta, \theta)$ has $h_X(t, \beta, \theta) = e^{\theta} h_0(t, \beta)$, then the hazard function of T also factors into a function of θ times a function of (t, β).

Problem Set 3, Due Sunday October 17, 2021, 11:59pm.
(I) Klein & Moeschberger problems:    #4.5, 4.7, 4.9, 10 points each.
(II) (15 points) Kalbfleisch & Prentice Problem (#3.11): Use the famous Freireich et al. (1963) data which can be found as "gehan" within the MASS library,
(a) to test the hypothesis of equality of remission times in the two groups, using Weibull, log-normal, and log-logistic models, and to decide which model fits the data best, and
(b) to test for adequacy of an exponential model relative to a Weibull model.
In the dataset, you should ignore the "pair" information. The last column is "treat" , a factor (categorical) variable.
(III) (15 points) Consider the setting where you have right-censored survival data on a large number n of iid patients, where the underlying and censoring distributions are both Exponential, with respective parameters λ and ρ. Find simplified formulas for the asymptotic variances (proportional to 1/n) for the estimated marginal survival function S(t) at time t=1 based on the parametric Exponential estimator of λ and also based on the Kaplan-Meier estimator, and compare the formulas. (How much larger is the KM variance ?)

Problem Set 4, due Monday October 25, 2021, 11:59pm. (3 Problems, 30 points) plus an optional 5-pt Extra-credit problem.
(I) (10 points) (a) Show that to compute $\sup_t |\hat{S}_{KM,1}(t) - \hat{S}_{KM,2}(t)|$ based on two independent samples (or two different groups of subjects) in a survival study, it suffices to create survival function estimates at the (sorted) union of observed death times for the two samples and take the maximum of absolute differences of the two survival function estimators at those points.
(b) Use this idea to show explicitly how you compute this quantity for the Kaplan-Meier survival functions for the two "types" in the "kidney" dataset (in the KMsurv package).
(II) (10 points) # 5.2 in Klein and Moeschberger. This is a problem about double censoring. Refer to the R handout on this topic on the web-page, in the RScripts directory.
(III) (10 points) # 7.2 in Klein and Moeschberger. This is about valid hypothesis tests for a fixed known hazard rate versus a nonparametric alternative. Note that we do not have a valid test (although we can discuss constructing one, and I will give a journal-paper reference) in case the null hypothesis involves an unknown parameter that must be estimated.

(Extra-credit) (5 points) Recall the discussion in class of how to construct confidence bands for

$$ n^{1/2} \sup_{0 \le t \le b} \frac{|\hat{S}_{KM}(t) - S_X(t)|}{S_X(t)\,(1 + V(t))} $$

for appropriately chosen $V(t) = V_n(t) = n \sum_{j:\, t_j < t} d_j / (l_j (l_j - d_j))$. Now suppose as in (I) that there are two independent samples or groups with separate KM curves $\hat{S}_{KM,g}(t)$, g=1,2. Denote by $t_{jk}, d_{jk}, l_{jk}$ the respective ordered distinct death-times, number of deaths, and numbers at risk for these two samples or groups, and assume (as a null hypothesis $H_0$) that both have the same $S_X(t)$ function. Show how to construct a hypothesis test of $H_0$ at significance level 0.05 based on

$$ \sup_{t \in [0,b]} \frac{\hat{S}_{KM,1}(t) - \hat{S}_{KM,2}(t)}{S(t)\,(1 + V^*(t))} $$

where S(t) is a pooled estimator of $S_X(t)$ obtained e.g. as an average of $\hat{S}_{KM,g}(t)$ for g=1,2, and $V^*(t)$ a suitable variance estimator.

Problem Set 5, due Thursday November 18, 2021, 11:59pm. (6 Problems, 60 points).
Problems 7.4, 7.8, 7.10 Klein & Moeschberger, 10 points each.
(A) (10 points) Prove that the one- or two-sample weighted logrank test statistics with weight functions W(t) that are nonincreasing are consistent against stochastically ordered alternatives (have power converging to 1) as the sample size n grows to ∞. Recall the definition of stochastically ordered. In the one-sample case, this means that $S_X(t) \ge S_0(t)$ for all t or $S_X(t) \le S_0(t)$ for all t, with strict inequality for at least one t. In the two-sample case, either $S_X^{(1)}(t) \ge S_X^{(2)}(t)$ for all t, or $\le$ for all t, with strict inequality for at least one value of t.
(B) (10 points) Problem 8.1: method is illustrated in CoxMod.txt Log-page, including the part about fit of exponential model.
(C) (10 points) (a) Approximately how large a sample would you need to achieve power 0.90 against the alternative with hazard ratio 1.5 using a logrank test, if the sample of size n were randomly allocated with a fair coin-toss to control or treatment group, and control-group survival is Expon(2) while censoring is approximately Expon(1) in the control group and Expon(1.25) in the treatment group? (b) Do the same sample-size calculation if the hypothesis test to use is the Gehan-modified Wilcoxon. (c) Do the same sample-size calculation if the hypothesis test to be used is the Peto-Prentice Wilcoxon ($G^{\rho}$ test with ρ=1).

Problem Set 6, due Saturday December 11, 2021, 11:59pm. (6 Problems, 60 points).
(I) From Chapter 8 of the course text, do problems 8.4, 8.5, 8.8(a) and (c).
(II) Using the data and results of problem 8.8, find and plot estimators for: (a) the baseline cumulative hazard function $\Lambda_0(t)$, and (b) the population summary survival functions for the ALL, AML Low-Risk, and AML Hi-Risk groups.
(III) Do problems 6.3 and 9.3.

## Final Project

I have created a data-file Lymphom.dat which you can use in your project. It is large, with 31689 records of 13 columns each, subsetted and re-coded from the National Cancer Institute's SEER database of Lymphoma cancer cases from 1973-2001. The file can be read in with read.table, and you will get the proper column-headers if you use the option header=T. You may certainly subset it further in any analyses you do and write up. Details concerning the records retained, the variables chosen, their meanings and the way I re-coded them, can be found here.

Guidelines for the Final Project. As will be discussed in class, the culminating work for the course, beyond HW and the in-class Test, is a take-home course project which is to consist of a 10-12 page paper based on an original data analysis using the ideas covered in the course, to be handed in before 11:59pm, Saturday December 18, 2021. You may find data anywhere. I suggest that you find a survival dataset with enough structure (eg, regression variables, clear hypothesis of interest like treatment effectiveness in a two-group clinical trial) and sufficient sample-size so that it would make sense to try a few different survival analyses and compare the results. You will be graded on appropriateness and interest of the analyses and especially on the clarity and reasonableness of the conclusions (and/or comparisons among conclusions from different methods) that you reach. Your 10-12 pages (excluding data and plots) should explain clearly the models and assumptions and conclusions in a readable narrative. You may hand in (but preferably give URL for) data, intermediate statistical results, and summary displays such as plots and/or histograms, but I do not want to be given any undigested outputs. That is, any such computed outputs should be presented as exhibits, with specific references to such material and suitable interpretations given in the text of your paper.

If you want to do anything other than a data analysis and narrative for your paper (eg, simulation study or exploration of theoretical and illustrative material on additional methods not covered in the course), such an alternative may be OK, but you must see me about it to get it approved first!

Important Dates:

• August 30, Monday:   First day of class.
• Sept. 6, Monday:   Labor Day: No Class.
• October 25, Monday:     Review for in-class test.
• October 27, Wednesday:     In-class test.
• November 24-26, Wed.-Fri.:     Thanksgiving Break: No Class
• December 13, Monday:   Last Day of Classes
• December 18, Saturday:   Term Projects Due