Statistics 702: Survival Analysis

Fall 2021 MWF 9-9:50am,    Mth B0423

In-person class with assignments and additional materials on ELMS

Instructor: Professor Eric Slud,  Statistics Program,  Math Dept.,   Rm 2314, x5-5469,  slud@umd.edu

Office hours: M 1-2, W 11-12 (initially), or email me to make an appointment (can be on Zoom).

SAMPLE  PROBLEMS  FOR  IN-CLASS  TEST Old and New.

Syllabus | Lecture Handouts | Statistical Computing (R) Handouts | Homework

Course Overview: The topic of the course is the statistical analysis of data on lifetimes or durations. Such data are often (right-) censored: subjects may leave the study at random times (and in some cases return), and those who are in the study and have not died at the ending time of the study are simply recorded as being still alive. Another possible data pattern is truncated data (for example, left-truncated), where subjects enter the study (at a recorded time) only if they meet some criterion, which may involve an age variable, time since diagnosis, or another preliminary event. Such data arise frequently in clinical trials, epidemiologic studies, reliability tests, and insurance. We first present parameterizations of survival distributions in terms of hazard intensities, which lend themselves to the formulation of parametric models, including regression-type models that relate failure-time distributions to auxiliary biomedical predictors. The special features of truncation and censoring present unique challenges in formulating likelihoods and in efficient estimation and testing when the distributions of arrival times and withdrawal times are unknown and not parametrically modelled. This statistical topic has achieved great prominence in the theoretical statistical literature because it is a particularly good arena for introducing techniques for estimating and testing finite-dimensional parameter values (such as a treatment-effectiveness parameter in clinical studies) in the presence of infinite-dimensional unknown parameters. Such problems are called semiparametric.

Required Text:

Klein, J. and Moeschberger, M. (2003) Survival Analysis: Techniques for Censored and Truncated Data, 2nd ed. Springer



Prerequisites: Stat 410 and either Stat 420 or Stat 700. The presentation will be geared to second-year Stat grad students.

Grading:  The course grade will be based 50% on 6 or 7 homework problem sets, 25% on an in-class test (tentatively scheduled for October 29), and 25% on a course project or paper at the end. The homework problems will be a mixture of theoretical problems at Stat 410/Stat 700 level, and of computational or data-analysis problems. Grading of these problems will also be based on the quality of verbal descriptions and interpretations of results: submitted analyses presenting only undigested numerical output will be graded down. The in-class test will be designed to test (i) definitions (of models and distributions and statistics), (ii) ability to use model definitions to construct likelihoods (and partial likelihoods) and derive statistics from them, and (iii) basic properties of estimators and test-statistics studied in class.
The course project will be either a paper on a topic not fully covered in class, with illustrative data analysis, or an extended and coherent data analysis and writeup (of about 10-12 pages, not including computer output). Note: homework problem assignments will be due approximately every 2 weeks. The problem sets and due dates will be posted to this web-page and on ELMS and announced in class, and will be submitted as pdf's on ELMS. The problems will be due on the dates announced and will be graded down for lateness unless you have a VERY good excuse.


Homework Guidelines: Homework papers are to be worked on individually, except that you may share verbal hints (or get such hints from me) about how to approach a problem. Working together or sharing computed results or written work is a violation of the Code of Academic Integrity and will be reported. You are to hand in HW papers electronically, as pdfs posted to ELMS. If you create the homework paper from text files containing R scripts, graphical outputs, or scanned files, then I recommend that you import these into MS Word and save the document as a single pdf before submitting it. Multiple-document submissions will not be acceptable.


About in-class Masking for Fall 2021

President Pines provided clear expectations to the University about the wearing of masks for students, faculty, and staff. Face coverings over the nose and mouth are required at all times while you are indoors. (The only allowed exception when it comes to classrooms and laboratories is for course instructors while they are teaching and adequately distanced from the class.) Students not wearing a mask will be given a warning and asked to wear one, or will be asked to leave the room immediately. Students who have additional issues with the mask expectation after a first warning will be referred to the Office of Student Conduct for failure to comply with a directive of University officials.


Academic Integrity and HONOR CODE

The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu. To further exhibit your commitment to academic integrity, remember to sign the Honor Pledge on all examinations: "I pledge on my honor that I have not given or received any unauthorized assistance on this examination (assignment)."



Recommended Text(s):

(1)     Kalbfleisch, J. and Prentice, R. (2002) The Statistical Analysis of Failure Time Data, 2nd ed. Wiley

This book was used once for the course. Its explanations are harder and less straightforward, but often more intuitive.

(2)   Another very useful and readable recommended text (reissued in 1998 as a paperback and currently as an e-book)

Miller, R., Jr. (1980) Survival Analysis. Wiley-Interscience, 1998

(3)   An easier book that can be used for self-study and review, free to students as an e-book through the UMD library:

Kleinbaum, D. and Klein, M. (2005), Survival Analysis - A Self-Learning Text, 2nd ed., Springer.

(4)   For the more mathematically inclined, a primarily theoretical text by two former Maryland students:

Fleming, T. and Harrington, D. (1991) Counting Processes and Survival Analysis. Wiley.


Coverage of the Klein & Moeschberger book will be Chapters 1-9, plus a few miscellaneous topics. The main topics are:

  • Survival distributions, hazard functions, and the life table method; regression-type models for survival in terms of predictor variables, including the famous Cox model and random-effect or `frailty' model extensions;
  • Formulation of Likelihoods for censored and truncated data;
  • Parameter estimation and hypothesis testing in parametric and semiparametric settings, including the Kaplan-Meier survival function estimator and Nelson-Aalen cumulative-hazard function estimator;
  • Goodness of fit diagnostics and testing for estimated models;
  • EM algorithm and missing-data approaches to censored data; and
  • Methods for estimating survival distributions involving smoothing and density-estimation techniques.

Klein & Moeschberger is a very methods-oriented book, and will be covered along with R software implementation with real-data examples. The Miller book explains things well and gives good background and literature references. For additional mathematical justifications, including the connection with counting processes and martingales, I will draw additional material from Fleming and Harrington, my own notes, and the research literature. Other data examples, and more sophisticated data analyses, can be found in the Kalbfleisch and Prentice book (get the data from the R package KMsurv).
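As a small illustration of the Kaplan-Meier and Nelson-Aalen estimators named in the topic list above, here is a minimal sketch for right-censored data. It is written in plain Python only to keep it self-contained; course computing is done in R. The function name `km_and_na` and the toy data are hypothetical and are not taken from the text or the handouts.

```python
from collections import Counter

def km_and_na(times, events):
    """Kaplan-Meier and Nelson-Aalen estimates for right-censored data.

    times  : observed times (failure or censoring, whichever came first)
    events : 1 if the observed time is a failure, 0 if it is right-censored
    Returns a list of (t, n_at_risk, d, KM survival, NA cumulative hazard),
    one entry per distinct observed failure time.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    rows = []
    surv, cumhaz = 1.0, 0.0
    for t in sorted(deaths):
        # Subjects still under observation just before time t
        n_at_risk = sum(1 for u in times if u >= t)
        d = deaths[t]
        surv *= 1.0 - d / n_at_risk      # product-limit (Kaplan-Meier) step
        cumhaz += d / n_at_risk          # Nelson-Aalen increment
        rows.append((t, n_at_risk, d, surv, cumhaz))
    return rows

# Hypothetical toy data: 8 subjects, 3 of them censored
times = [2, 3, 3, 5, 7, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]
rows = km_and_na(times, events)
for row in rows:
    print("t=%g  at risk=%d  deaths=%d  S(t)=%.4f  H(t)=%.4f" % row)
```

A subject censored at a failure time is counted as still at risk at that time, which is the usual convention.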


    Computing in the course can be done with R, SAS, or any other package you are familiar with that also has preprogrammed Survival Analysis modules. However, R is by far the best choice if you want guidance and/or help from me, and if you want access to the newest methods from the research literature. Various datasets can be explored and accessed within existing R packages and libraries, e.g. by issuing the command    > data()    after    > library(survival)   or   [for all datasets from the Klein and Moeschberger book including its exercises]   after   library(KMsurv). Whatever package you choose, you can get computing help, datasets, and further links here.  In particular you can get lots of survival datasets, including some that were in the Kalbfleisch and Prentice book, by clicking here and searching for the keyword "survival".

    See the Handouts section below for a link to the "Rbasics" file connected with the data analysis tasks needed for Homeworks. For the systematic Introduction to R and R reference manual distributed with the R software, either download from the R website or simply invoke the command    > help.start()    from within R. For a slightly less extensive introductory tutorial in R, click here. A very handy reference card containing R commands can be found here.


    Lecture Note and Slides Handouts:

  • Slides on Survival Data Structure and Hazard Functions
  • Slides on Parametric Censored-Data Likelihoods and Weibull MLE's
  • Handout on Nelson-Aalen Estimator as a limit of MLE's of piecewise constant hazards when the positive time-line is partitioned into smaller and smaller intervals.
  • Interpretation of Kaplan-Meier as Nonparametric MLE.
  • One page handout on Partial Likelihood via Marginal Likelihood in the Cox Model.
  • Slides of my Stat Seminar 5/3/2012 with Jiraphan Suntornchost, on parametric survival densities from so-called Phase-Type Models.

  • You can look at some sample problems for the Fall 05 in-class test here. This year's in-class test will have at most 3 problems and will be oriented as closely to definitions and basic equivalence of models as I can make it. Other sample problems can be found here. If you would like to bring a sheet of formulas, e.g. stochastic-integral and variance formulas related to our score and Partial Likelihood ML statistics, that is OK, but the test will otherwise be closed-book.

  • You can view the slides in a Survival Analysis short course that I gave in Spring 2005. See especially pages 34-36 and 47-50 and 58-72 for ideas relevant to the data-analyses in your final Projects, and click here for the R scripts I used in analysing the data for that short course.

  • I gave three `Mini-Course' talks on Survival Analysis a couple of years ago which are very relevant to the material of this course. The slides are available in pdf format. They are respectively about Competing Risks,   Martingales and Populations,   and Semiparametric Models.
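The connection mentioned in the Nelson-Aalen handout entry above, between piecewise-constant hazard MLEs and the Nelson-Aalen estimator, can be checked numerically. The sketch below is an illustration under stated assumptions, not the construction used in the handout: it uses equal-width intervals, hypothetical toy data, and exact rational arithmetic, and it is in Python rather than R. On each interval the MLE of a constant hazard is (failures in the interval) / (person-time at risk in the interval). Once the partition is fine enough that every observed time is an interval endpoint, the accumulated hazard agrees exactly with Nelson-Aalen.

```python
from fractions import Fraction

def nelson_aalen(times, events, t_end):
    """Nelson-Aalen cumulative hazard at t_end: sum of d(t)/Y(t) over failure times t <= t_end."""
    total = Fraction(0)
    for t in sorted({t for t, e in zip(times, events) if e == 1 and t <= t_end}):
        d = sum(1 for u, e in zip(times, events) if e == 1 and u == t)
        y = sum(1 for u in times if u >= t)
        total += Fraction(d, y)
    return total

def piecewise_cumhaz(times, events, t_end, n_intervals):
    """Cumulative hazard at t_end from the MLE of a hazard that is constant
    on each of n_intervals equal-width intervals of (0, t_end]."""
    width = Fraction(t_end, n_intervals)
    total = Fraction(0)
    for j in range(n_intervals):
        a, b = j * width, (j + 1) * width
        # Failures observed in (a, b]
        d = sum(1 for t, e in zip(times, events) if e == 1 and a < t <= b)
        # Person-time at risk accumulated in (a, b]
        exposure = sum(min(t, b) - a for t in times if t > a)
        if d > 0:
            total += Fraction(d) / exposure * width   # hazard MLE times interval width
    return total

# Hypothetical toy data with integer observed times
times = [2, 3, 3, 5, 7, 8, 8, 10]
events = [1, 1, 0, 1, 0, 1, 1, 0]
na = nelson_aalen(times, events, 10)
for k in (1, 2, 5, 10, 20):
    print(k, float(piecewise_cumhaz(times, events, 10, k)), float(na))
```

With one interval the fit is a single exponential model; with 10 or 20 intervals every observed time is an endpoint and the two values coincide.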


    Statistical Computing Handouts:

    For a directory of R Scripts relevant to this course, click here.

    NOTE: to get started using survival-related functions in R, you need to "load" the R survival package, which is accomplished by the command:    > library(survival)

    Handouts can be found at linked pages for each of the following topics:

    (0) Basics on R commands for data entry and Life Tables and Life Table construction. In addition, various useful files on Statistical Computing can be found at my course web-page for Stat 705, along with additional relevant links.

    (1) Nelson-Aalen & Kaplan-Meier calculation

    (2) Illustrative R Script for Survival Curves, Hazards, Medians, and SE's.

    (3) Nelson-Aalen calculation for left-truncated right-censored data.

    (4) Script and Illustrative Picture on model fitting of VA Lung-Cancer data in R. This Script and picture also contain material about fitting and plotting the Cox Model for the same dataset and comparing the results to the previous accelerated failure time parametric regression model.

    (5) R calculations for weighted logrank (2-sample) test statistics.

    (6) New illustration of Stratified versus interaction-term tests of difference between coefficients in subgroups of a survival dataset. This R script and picture show, in the example of a Mayo lung-cancer study, that the coefficient for a baseline health index ("Karnofsky score") differs between the two sexes in the study, but that this difference is obscured if a common baseline hazard is assumed for both sexes.

    (7) Handout containing R Log on Self-Consistency Property of Kaplan-Meier Estimator and Redistribute-to-the-Right Algorithm and Coding for Turnbull (1974) self-consistent estimator of survival-distribution in double-censored survival data.

    (8) R Script for Time-dependent Cox-Model fitting, illustrated with data analysis of Mayo-Clinic Lung Cancer Data.

    (9) R script for calculating Partial Likelihoods in (non-time-dependent) Cox-model. This includes calculations with risk-groups. The script will later be augmented to include the calculation of score statistics for individual coefficients.
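To make the partial-likelihood calculation with risk groups in item (9) concrete, here is a minimal sketch for a single non-time-dependent covariate. It is not the course's R script: the function name `cox_partial_loglik`, the toy data, and the use of Python are all assumptions made for illustration. Each observed failure contributes beta*z_i minus the log of the sum of exp(beta*z_j) over the subjects j still at risk at that failure time; tied failure times would be handled in the Breslow manner.

```python
import math

def cox_partial_loglik(beta, times, events, z):
    """Log partial likelihood of the Cox model with one covariate z.

    For each observed failure i, the risk group is every subject j whose
    observed time is at least times[i].
    """
    ll = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i == 1:
            risk_sum = sum(math.exp(beta * z[j])
                           for j, t_j in enumerate(times) if t_j >= t_i)
            ll += beta * z[i] - math.log(risk_sum)
    return ll

# Hypothetical toy data: 5 subjects, binary covariate, one censored time
times = [2, 3, 5, 7, 8]
events = [1, 1, 0, 1, 1]
z = [1, 0, 1, 0, 1]
for beta in (-1.0, 0.0, 1.0):
    print(beta, round(cox_partial_loglik(beta, times, events, z), 4))
```

At beta = 0 every subject in a risk group has weight 1, so the value reduces to minus the sum of the logs of the risk-group sizes, which gives a quick sanity check.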


    Syllabus.

    Chapter 1. Introduction: Terminology, data structures & examples.    1 class,   8/30

  • Definition of terminology: event-times, censoring (left and right), biomedical covariates, life tables. Inferential problems and objectives.

    Chapter 2. Failure Time models.    4 classes,   9/1 - 9/10

  • Survival and hazard functions. Parametric distributional models, continuous and discrete regression models. Latent failure time model. Competing risks.

    Chapter 3. Censored-Data Parametric Inference & Likelihoods.    4 classes,   9/13 - 9/20

  • Parametric likelihoods and parametric inference; truncation and interval censoring, large-sample MLE theory.
  • Counting processes and cumulative hazards. Statistics as integrals with respect to compensated counting processes.

    Chapter 4. Nonparametric survival-curve estimation.    4 classes,   9/22 - 9/29

  • Cumulative-hazard (Nelson-Aalen) estimators for right-censored left-truncated data.
  • Kaplan-Meier survival function estimators. Confidence bands.
  • Quantile (median) estimates and confidence intervals.
  • More on competing risks and attempts to restore semiparametric identifiability in Competing Risk settings.

    Chapter 5. Estimates for other censoring schemes. 3   classes,   10/1 - 10/6

  • Left, double, and interval censoring. Self consistency property of Kaplan-Meier (right-censored case) and extension to estimation algorithms with other kinds of censoring.

    Chapter 7. Rank statistics for 1- and 2-sample Tests.    5 classes,   10/8 - 10/18

  • Tests based on scores. Relation to contingency table ideas. Stratified tests. Tests for trend.
  • Logrank and weighted-logrank tests. Sample size and power. Relation to simple survival regression models.

    Chapter 8. Relative Risk Regression Models.    5 classes,   10/20 - 11/3

  • Estimation via maximized Partial Likelihood: regression coefficients and estimation of baseline survival function
  • Related likelihoods (marginal and rank). Large sample theory of estimators.
  • Associated hypothesis tests. Wald, score and LR test analogues.

    Chapter 6. Other estimation techniques.     3   classes,    11/5 - 11/10

  • Kernel-based estimation of the hazard intensity. Application to excess mortality
  • Bayesian nonparametric methods.
  • (*) Methods based on (multiple) imputation of censored lifetimes.

    Chapter 9. Stratified & Time-Dependent Covariate Cox models.    3 classes,   11/12 - 11/17

  • Time dependent covariate version of proportional hazard model. Application to model checking. Tests of fit related to residuals (material taken also from Chapter 11.)
  • R functions for stratified time-dependent Cox regression.

    Chapter 10. Extended Survival Regression Models.       As time permits

  • Material taken from Chapters 10 and 12 on Additive Hazards regression models, and frailty models.
  • Material taken from Kalbfleisch-Prentice and other sources on Accelerated failure and Proportional Odds models.
  • Introduction of models with time-varying mechanism from journal literature.
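The two-sample logrank statistic listed under Chapter 7 in the syllabus above can be sketched as follows. This is an unweighted version on hypothetical toy data, in Python rather than the course's R, and the function name `logrank` is an assumption. At each distinct failure time it compares the observed number of group-1 failures with the number expected under equal hazards (conditionally on the risk sets), using the hypergeometric variance that links the test to contingency-table ideas.

```python
from fractions import Fraction

def logrank(times, events, group):
    """Return (O1 - E1, V) for the unweighted two-sample logrank test.

    group entries are 0 or 1; O1 and E1 are the observed and expected
    numbers of failures in group 1, and V is the hypergeometric variance.
    """
    data = list(zip(times, events, group))
    o_minus_e = Fraction(0)
    var = Fraction(0)
    for t in sorted({u for u, e, _ in data if e == 1}):
        y = sum(1 for u, _, _ in data if u >= t)                 # at risk, both groups
        y1 = sum(1 for u, _, g in data if u >= t and g == 1)     # at risk, group 1
        d = sum(1 for u, e, _ in data if u == t and e == 1)      # failures at t
        d1 = sum(1 for u, e, g in data if u == t and e == 1 and g == 1)
        o_minus_e += d1 - Fraction(d * y1, y)
        if y > 1:  # the variance term is zero when only one subject is at risk
            var += Fraction(d * y1 * (y - y1) * (y - d), y * y * (y - 1))
    return o_minus_e, var

# Hypothetical toy data: 3 subjects per group, one censored time
times = [2, 5, 8, 3, 6, 9]
events = [1, 1, 0, 1, 1, 1]
group = [1, 1, 1, 0, 0, 0]
o_minus_e, var = logrank(times, events, group)
chi_square = o_minus_e ** 2 / var
print(o_minus_e, var, float(chi_square))
```

The chi-square value is referred to a chi-square distribution with 1 degree of freedom in large samples; with a sample this small it serves only to show the arithmetic.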



    Homework Problem Sets:

    Problem Set 1, Due Monday Sept.13, 2021.   (6 Problems in all, worth 10 points each.)
    Do problems # 2.3, 2.9 (the times to substitute are 12, 24, and 60 months), 2.10, 2.16, and 2.20 from the end of Chapter 2 in the Klein and Moeschberger text. Also to be handed in: using the data in Table 1.2 of the book, create a life table, with rows corresponding to ordered increasing infection times within each of the two ("Surgically Placed Catheter" and "Percutaneous Placed Catheter") groups, showing the number of "failures" (= infections) occurring at that time, and the number at risk (i.e., individuals within the group who are neither infected nor censored before that time).

    See the R script at LifeTab.RLog for an indication of how to build Life Tables using R commands.
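For readers who want to see the bookkeeping behind such a table before turning to the R script, here is a minimal sketch of the row layout described above (failure time, number of failures, number at risk) built separately for each group. It uses Python and made-up numbers, not the data in Table 1.2, and the function name `life_table` is hypothetical.

```python
def life_table(times, events):
    """Rows (time, failures at that time, number at risk) at each distinct failure time.

    "At risk" means neither failed nor censored before that time,
    i.e. the observed time is at least the row's time.
    """
    rows = []
    for t in sorted({u for u, e in zip(times, events) if e == 1}):
        n_fail = sum(1 for u, e in zip(times, events) if e == 1 and u == t)
        n_risk = sum(1 for u in times if u >= t)
        rows.append((t, n_fail, n_risk))
    return rows

# Hypothetical toy data for two groups: (observed times, failure indicators)
groups = {
    "A": ([1, 3, 3, 4, 6], [1, 1, 1, 0, 1]),
    "B": ([2, 2, 5, 7], [0, 1, 1, 0]),
}
for name, (times, events) in groups.items():
    print("Group", name)
    for t, n_fail, n_risk in life_table(times, events):
        print("  time=%g  failures=%d  at risk=%d" % (t, n_fail, n_risk))
```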


    Problem Set 2, Six Problems in all, Due Tuesday Sept.28, 2021 (11:59pm).
    Do #3.6, 3.8, 4.1(a)-(e) and (i), 4.2(a)-(c) and (e)-(f). In addition

    (5th problem to hand in): read Theoretical Note 1 on pp. 56-57 and show as much as you can of the following statement given there:
    if, in a bivariate setting with dependent (X,C) having a joint density, the function ρ(t) defined on p.56 is known to be identically equal to a constant ρ, and also the sub-distribution function F_1(t) and event-time survival function S_T(t) are known, then the marginal survival function S_X(t) is uniquely determined, and this survival function depends in a monotonically decreasing way on ρ.

    (6th problem to hand in): (a) Suppose that T = λ X^α for some constants α > 1 and λ > 0. Find a formula for h_T(t) in terms of the function h_X(.) and known functions of t (which depend on λ, α). (b) Use (a) to show that if X ~ f(x, β, θ) has h_X(t, β, θ) = e^θ h_0(t, β), then the hazard function of T also factors into a function of θ times a function of (t, β).



    Final Project

    I have created a data-file Lymphom.dat (zipped) which you can use in your project. It is large, with 31689 records of 13 columns each, subsetted and re-coded from the National Cancer Institute's SEER database of Lymphoma cancer cases from 1973-2001. You may certainly subset it further in any analyses you do and write up. Details concerning the records retained, the variables chosen, their meanings and the way I re-coded them, can be found here.

    Guidelines for the Final Project. As will be discussed in class, the culminating work for the course, beyond HW and the in-class Test, is a take-home course project which is to consist of a 10-12 page paper based on an original data analysis using the ideas covered in the course, to be handed in before 11:59pm, Friday December 17, 2021. You may find data anywhere. I suggest that you find a survival dataset with enough structure (e.g., regression variables, a clear hypothesis of interest like treatment effectiveness in a two-group clinical trial) and sufficient sample size so that it would make sense to try a few different survival analyses and compare the results. You will be graded on appropriateness and interest of the analyses and especially on the clarity and reasonableness of the conclusions (and/or comparisons among conclusions from different methods) that you reach. Your 10-12 pages (excluding data and plots) should explain clearly the models and assumptions and conclusions in a readable narrative. You may hand in (but preferably give a URL for) data, intermediate statistical results, and summary displays such as plots and/or histograms, but I do not want to be given any undigested outputs. That is, any such computed outputs should be presented as exhibits, with specific references to such material and suitable interpretations given in the text of your paper.

    If you want to do anything other than a data analysis and narrative for your paper (e.g., a simulation study or an exploration of theoretical and illustrative material on additional methods not covered in the course), such an alternative may be OK, but you must see me about it to get it approved first!


    Important Dates:



    My home page.

    The UMCP Math Department home page.

    The University of Maryland home page.

    © Eric V Slud,  September 26, 2021.