**Instructor:** Professor Eric Slud,
Statistics Program, Math Dept.,
Rm 2314, x5-5469, slud@umd.edu

**Office hours:** M 1-2, W 11-12 (initially), or email me to make an appointment (can be on Zoom).

Please fill out the on-line Evaluation form on this Course and instructor at http://CourseEvalUM.umd.edu.
**Thank you.**

**Sample problems for the in-class test:** from 2005 and 2008, and Practice Problems 2021.

Syllabus

**Course Overview:** The topic of the course is the statistical analysis of data on
lifetimes or durations. Such data often have the feature of being (right-)**censored**,
where subjects may leave the study at random times (and in some cases return) and those who
are in the study and have not died at the ending time of the study are simply recorded as
being still alive. Another possible data pattern is (left-)truncated data, where subjects enter
the study (at a recorded time) only if they meet some criterion which may involve an
age-variable or time since diagnosis or other preliminary event. Such data arise
frequently in clinical trials, epidemiologic studies, reliability tests, and insurance. We
first present parameterizations of survival distributions, in terms of hazard intensities,
which lend themselves to the formulation of parametric models, including regression-type
models which relate failure-time distributions to auxiliary biomedical predictors. The
special features of truncation or censoring present unique challenges in the formulation
of likelihoods and efficient estimation and testing in settings where the distributions
of arrival-times and withdrawal-times are unknown and not parametrically modelled.
This statistical topic has achieved great prominence in the theoretical statistical
literature because it is a particularly good arena for the introduction of techniques
of estimating and testing finite-dimensional parameter values --- such as a treatment-
effectiveness parameter in clinical studies --- in the presence of infinite-dimensional
unknown parameters. Such problems are called **Semiparametric**.
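
As a rough sketch of the hazard parameterization and censored-data likelihood mentioned above, in standard textbook notation that may differ from the notation used in lecture: assume a continuous lifetime with density f, survival function S and hazard h, observed times t_i = min(X_i, C_i) with death indicators δ_i, and censoring that is independent and noninformative. Under those assumptions:

```latex
h(t) = \frac{f(t)}{S(t)}, \qquad
S(t) = \exp\Big(-\int_0^t h(s)\,ds\Big),
\qquad
L(\theta) \;\propto\; \prod_{i=1}^{n} f(t_i;\theta)^{\delta_i}\, S(t_i;\theta)^{1-\delta_i}
\;=\; \prod_{i=1}^{n} h(t_i;\theta)^{\delta_i}\, S(t_i;\theta).
```

The last equality uses f = h · S, which is why models stated in terms of hazards lead directly to likelihoods for censored data.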

**Course text:** Klein, J. and Moeschberger, M. (2003) *Survival Analysis: Techniques for Censored and Truncated Data*, 2nd ed. Springer

**Prerequisites:** Stat 410 and either Stat 420 or Stat 700. The
presentation will be geared to second-year Stat grad students.

**Grading:** The course grade will be based 50% on 6 or 7 homework problem sets, 25% on an in-class test (tentatively scheduled for October 29), and 25% on a course project or paper at the end. The homework problems will be a mixture of theoretical problems at Stat 410/Stat 700 level, and of computational or data-analysis problems. *Grading of these problems will also be based on the quality of verbal descriptions and interpretations of results: submitted analyses presenting only undigested numerical output will be graded down.* The in-class test will be designed to test (i) definitions (of models and distributions and statistics), (ii) ability to use model definitions to construct likelihoods (and partial likelihoods) and derive statistics from them, and (iii) basic properties of estimators and test-statistics studied in class.
The course project will be **either** a paper on a topic not fully covered in class, with illustrative data analysis, **or** an extended and coherent data analysis and writeup (of about 10-12 pages, **not including** computer output).

**Note: homework problem assignments will be due approximately every 2 weeks. The problem sets and due dates will be posted to this web-page and on ELMS and announced in class, and solutions are to be submitted as PDFs on ELMS. The problems will be due on the dates announced and will be graded down for lateness unless you have a VERY good excuse.**

**President Pines provided clear expectations to the University about the wearing of masks for students, faculty, and staff. Face coverings over the nose and mouth are required while you are indoors at all times. The only allowed exception when it comes to classrooms and laboratories is for course instructors while they are teaching and adequately distanced from the class. Students not wearing a mask will be given a warning and asked to wear one, or will be asked to leave the room immediately. Students who have additional issues with the mask expectation after a first warning will be referred to the Office of Student Conduct for failure to comply with a directive of University officials.**

**(1)** Kalbfleisch, J. and Prentice, R. (2002) *The Statistical Analysis of Failure Time Data*, 2nd ed. Wiley

This book was used once for the course. Its explanations are harder and less straightforward, but often more intuitive.

**(2)** Another very useful and readable recommended text (reissued in 1998 as a paperback and currently available as an e-book):

Miller, R. G. *Survival Analysis.* Wiley-Interscience, 1998

**(3)** An easier book that can be used for self-study and review, free to students as an e-book through the UMD library:

Kleinbaum, D. and Klein, M. *Survival Analysis - A Self-Learning Text*, 2nd ed., Springer.

**(4)** For the more mathematically inclined, a primarily theoretical text by two former Maryland students:

Fleming, T. and Harrington, D. *Counting Processes and Survival Analysis.* Wiley.

**Coverage** of the Klein & Moeschberger book will be Chapters 1-9, plus a few miscellaneous topics. The main topics are listed by chapter, with dates, in the class schedule further down this page.

Klein & Moeschberger is a very methods-oriented book, and will be covered along with **R** software implementation with real-data examples. The Miller book explains things well and gives good background and literature references. For additional mathematical justifications, including the connection with counting processes and martingales, I will draw additional material from Fleming and Harrington, my own notes, and the research literature. Other data examples, and more sophisticated data analyses, can be found in the Kalbfleisch and Prentice book (get the data from the **R** package **KMsurv**).

**Computing** in the course can be done with **R**, **SAS**, or any other package you are familiar with that also has preprogrammed Survival Analysis modules. However, **R** is by far the best choice if you want guidance and/or help from me, and if you want access to the newest methods from the research literature. Various datasets can be explored and accessed within existing **R** packages and libraries, e.g. by issuing the command **> data()** after **> library(survival)** or [for all datasets from the Klein and Moeschberger book including its exercises] after **> library(KMsurv)**. Whatever package you choose, you can get computing help, datasets, and further links here. In particular you can get lots of survival datasets, including some that were in the Kalbfleisch and Prentice book, by clicking here and searching for the keyword "survival".

See the **Handouts** section below for a link to the "Rbasics" file connected with the data analysis tasks needed for Homeworks. For the systematic **Introduction to R** and **R reference manual** distributed with the R software, either download from the R website **or** simply invoke the command **> help.start()** from within R. For a slightly less extensive introductory tutorial in R, click here. A very handy reference card containing **R** commands can be found here.

For a directory of R Scripts relevant to this course, click here.

NOTE: to get started using survival-related functions in R, you need to "load" the R **survival** package, which is accomplished by the command: > library(survival)
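
As a minimal sketch of a first session after loading the package: the lines below list the datasets bundled with **survival** and fit a Kaplan-Meier curve. The choice of the `veteran` (VA lung-cancer) data and the object names `available` and `km_fit` are illustrative assumptions, not part of any course handout.

```r
# Load the survival package (distributed with R as a recommended package).
library(survival)

# List the names of the datasets that ship with the package.
available <- data(package = "survival")$results[, "Item"]
head(available)

# Kaplan-Meier estimate for the VA lung-cancer data ('veteran'):
# 'time' is follow-up time in days, 'status' is 1 for death, 0 for censoring.
km_fit <- survfit(Surv(time, status) ~ 1, data = veteran)

print(km_fit)  # sample size, number of events, median survival with CI
plot(km_fit, xlab = "Days", ylab = "Estimated survival probability")
```

Replacing `~ 1` with a grouping variable such as `~ trt` would give one curve per treatment group.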

Handouts can be found at linked pages for each of the following topics:

(0) Basics on R commands for data entry and Life Tables and Life Table construction. In addition, various useful files on Statistical Computing can be found at my course web-page for **Stat 705**, along with additional relevant links.

(1) **UPDATED** Nelson-Aalen & Kaplan Meier calculation

(2) **UPDATED** Illustrative R Script for Survival Curves, Hazards, Medians, and SE's.

(3) **UPDATED** Nelson-Aalen calculation for left-truncated right-censored data.

(4) **UPDATED** Script and Illustrative Picture on model fitting of VA Lung-Cancer data in R. This Script and picture also contain material about fitting and plotting the Cox Model for the same dataset and comparing the results to the previous accelerated failure time parametric regression model.

(5) **UPDATED** R calculations for weighted logrank (2-sample) test. Also available is a New Script-file on Stratified and K-sample Logrank statistics using "survdiff" **UPDATED for F21**.

(6) New illustration of Stratified versus interaction-term tests of difference between coefficients in subgroups of a survival dataset. This R script and picture explain in the example of a Mayo lung-cancer study that there are differences between the coefficient for a baseline health index ("Karnofsky score") for the two sexes in the study, but that these differences are obscured if an assumption of common baseline hazard for both sexes is made.

(7) Handout containing **UPDATED** R Log on Self-Consistency Property of Kaplan-Meier Estimator and Redistribute-to-the-Right Algorithm and **UPDATED** Coding for Turnbull (1974) self-consistent estimator of survival-distribution in double-censored survival data.

(8) **R** Script for Time-dependent Cox-Model fitting, illustrated with data analysis of Mayo-Clinic Lung Cancer Data. An **UPDATED** version is now available in the RScripts directory, here.

(9) **R** script for calculating Partial Likelihoods in (non-time-dependent) Cox-model. This includes calculations with risk-groups. The script will later be
augmented to include the calculation of score statistics for individual coefficients.
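
To give a flavor of the kind of comparison described in handout (4), here is a small sketch that fits a Cox model and a Weibull accelerated failure time model to the VA lung-cancer data shipped with the **survival** package. The covariates (`karno`, `trt`) and object names are illustrative assumptions and are not taken from the handout scripts.

```r
library(survival)

# Cox proportional-hazards fit: coefficients are log hazard ratios.
cox_fit <- coxph(Surv(time, status) ~ karno + trt, data = veteran)

# Weibull accelerated-failure-time fit: coefficients act on log survival time.
aft_fit <- survreg(Surv(time, status) ~ karno + trt, data = veteran,
                   dist = "weibull")

summary(cox_fit)
summary(aft_fit)

# For a Weibull model the two parameterizations are linked by
#   log hazard ratio = -(AFT coefficient) / scale,
# so the fits can be put on a common scale and compared side by side.
weibull_log_hr <- -coef(aft_fit)[c("karno", "trt")] / aft_fit$scale
cbind(cox = coef(cox_fit), weibull = weibull_log_hr)
```

A higher Karnofsky score should show up as a negative log hazard ratio in the Cox fit and a positive coefficient in the AFT fit, since the two models describe the same effect on opposite scales.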

**Chapter 1. Introduction: Terminology, data structures & examples. 1 class, 8/30**

**Chapter 2. Failure Time models. 4 classes, 9/1 - 9/10**

**Chapter 3. Censored-Data Parametric Inference & Likelihoods.
4 classes, 9/13 - 9/20**

**Chapter 4. Nonparametric survival-curve estimation. 4 classes, 9/22 - 9/29**

**Chapter 5. Estimates for other censoring schemes. 3 classes, 10/1 - 10/6**

**Chapter 7. Rank statistics for 1- and 2-sample Tests.
5 classes, 10/8 - 10/18**

**Chapter 8. Relative Risk Regression Models
5 classes, 10/20 - 11/3**

**Chapter 6. Other estimation techniques. 3 classes, 11/5 - 11/10**

**Chapter 9. Stratified & Time-Dependent Covariate Cox models.
3 classes, 11/12 - 11/17**

**Chapter 10. Extended Survival Regression Models.**

**Problem Set 1, Due Monday Sept. 13, 2021.** (*6 Problems in all, worth 10 points each.*)

Do problems # 2.3, 2.9 (the times to substitute are 12, 24, and 60 months), 2.10, 2.16, and 2.20 from the end of Chapter 2 in the Klein and Moeschberger text.
**Also to be handed in**: using the data in Table 1.2 of the book, create a life table, with rows corresponding to ordered increasing infection times within each of the two ("Surgically Placed Catheter" and "Percutaneous Placed Catheter") groups, showing the number of "failures" (= infections) occurring at that time, and the number at risk (i.e., individuals within the group who are neither infected nor censored before that time).

See the R script at LifeTab.RLog for an indication of how to build Life Tables using R commands.
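
For orientation only, the following base-R sketch shows one possible way to tabulate failures and numbers at risk at each distinct failure time. The function name `life_table` and the toy data are made up for illustration; they are not the Table 1.2 data and not the code in LifeTab.RLog.

```r
# Build a simple life table from follow-up times and event indicators.
# status: 1 = failure (e.g. infection) observed, 0 = censored.
life_table <- function(time, status) {
  event_times <- sort(unique(time[status == 1]))
  # Number of failures occurring exactly at each distinct failure time.
  n_event <- vapply(event_times,
                    function(t) sum(time == t & status == 1), numeric(1))
  # Number at risk: neither failed nor censored before that time.
  n_risk <- vapply(event_times,
                   function(t) sum(time >= t), numeric(1))
  data.frame(time = event_times, n.risk = n_risk, n.event = n_event)
}

# Toy data (hypothetical), one group only.
toy_time   <- c(2, 3, 3, 5, 7, 8)
toy_status <- c(1, 1, 0, 1, 0, 1)

lt <- life_table(toy_time, toy_status)
print(lt)
```

For the two catheter groups, the same function could be applied separately to the rows of each group.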

**Problem Set 2, Six Problems in all, Due Tuesday Sept. 28, 2021 (11:59pm).**

Do #3.6, 3.8, 4.1(a)-(e) and (i), 4.2(a)-(c) and (e)-(f). In addition:

(*5th problem to hand in*): read Theoretical Note 1 on pp. 56-57 and *show as much as you can of the following statement given there:*

if, in a bivariate setting with dependent (X,C) having a joint density, the
function ρ(t) defined on p.56 is known to be identically equal to a constant ρ,
and also the sub-distribution function F_{1}(t) and event-time survival function S_{T}(t) are known, then the marginal survival function S_{X}(t) is uniquely determined, and this survival function depends in a monotonically decreasing way
on ρ.

(*6th problem to hand in*): (a) Suppose that T = λ X^{α} for some constants α > 1 and λ > 0. Find a formula for h_{T}(t) in terms of the function h_{X}(.) and known functions of t (which depend on λ, α). (b) Use (a) to show that if X ~ f(x,β, θ) has h_{X}(t,β,θ) = e^{θ} h_{0}(t, β), then the hazard function of T also factors into a function of θ times a function of (t,β).

**Problem Set 3, Due Sunday October 17, 2021, 11:59pm.**

**(15 points)** Kalbfleisch & Prentice Problem (#3.11):
Use the famous Freireich et al. (1963) data which can be found as "gehan" within the MASS library,

**Problem Set 4, due Monday October 25, 2021, 11:59pm.** *(3 Problems, 30 points)* plus an optional 5-pt Extra-credit problem.

*(10 points)* (a) Show that to compute sup_{t} |\hat{S}^{KM,1}(t) - \hat{S}^{KM,2}(t)| based on two independent samples (or two different groups of subjects) in a survival study, it suffices to create survival function estimates at the (sorted) union of observed death times for the two samples and take the maximum of absolute differences of the two survival function estimators at those points.

*(10 points)* # 5.2 in Klein and Moeschberger. This is a problem about double censoring. Refer to the R handout on this topic on the web-page, in the RScripts directory.

*(10 points)* # 7.2 in Klein and Moeschberger. This is about valid hypothesis tests for a fixed known hazard rate versus a nonparametric alternative. Note that we do not have a valid test (although we can discuss constructing one, and I will give a journal-paper reference) in case the null hypothesis involves an unknown parameter that must be estimated.

**(Extra-credit)** *(5 points)* Recall the discussion in class of how to construct confidence bands for S_{X}(t) based on the statistic

n^{1/2} sup_{0 ≤ t ≤ b} |\hat{S}^{KM}(t) - S_{X}(t)| / (S_{X}(t) · (1+V(t)))

for appropriately chosen V(t) = V_{n}(t) = n · ∑_{j: t_{j} < t} d_{j}/(l_{j}(l_{j}-d_{j})). Now suppose as in (I) that there are two independent samples or groups with separate KM curves \hat{S}^{KM,g}(t), g=1,2. Denote by t_{j}^{g}, d_{j}^{g}, l_{j}^{g} the respective ordered distinct death-times, number of deaths, and numbers at risk for these two samples or groups, and assume (as a null hypothesis **H_{0}**) that both have the same S

where S(t) is a pooled estimator of S

**Problem Set 5, due Thursday November 18, 2021, 11:59pm.** *(6 Problems, 60 points)*.

**(A)** (10 points) Prove that the one- or two-sample weighted logrank test statistics with weight functions W(t) that are nonincreasing are consistent against stochastically ordered alternatives (have power converging to 1) as the sample size n grows to ∞. Recall the definition of *stochastically ordered*. In the one-sample case, this means that S_{X}(t) ≥ S_{0}(t) for all t or S_{X}(t) ≤ S_{0}(t) for all t, with strict inequality for at least one t. In the two-sample case, either S_{X}^{(1)}(t) ≥ S_{X}^{(2)}(t) for all t, or ≤ for all t, with strict inequality for at least one value of t.

**(B)** (10 points) Problem 8.1: method is illustrated in CoxMod.txt Log-page, including the part about fit of exponential model.

**(C)** (10 points) (a) Approximately how large a sample would you need to achieve power 0.90 against the alternative with hazard ratio 1.5 using a logrank test, if the sample of size n were randomly allocated with a fair coin-toss to control or treatment group, and control-group survival is Expon(2) while censoring is approximately Expon(1) in the control group and Expon(1.25) in the treatment group? (b) Do the same sample-size calculation if the hypothesis test to use is the Gehan-modified Wilcoxon. (c) Do the same sample-size calculation if the hypothesis test to be used is the Peto-Prentice Wilcoxon (G_{ρ} test with ρ=1).

**Problem Set 6, due Saturday December 11, 2021, 11:59pm.** *(6 Problems, 60 points)*.

(I) From Chapter 8 of the course text, do problems **8.4, 8.5, 8.8(a) and (c).**

(II) Using the data and results of problem 8.8, find and plot estimators for:
(a) the baseline cumulative hazard function Λ_{0}(t),

and (b) the population summary survival functions for the ALL, AML Low-Risk, and AML Hi-Risk groups.

(III) Do problems **6.3, 9.3**.

I have created a data-file Lymphom.dat which you can use in your project. It is large, with 31689 records of 13 columns each, subsetted and re-coded from the National Cancer Institute's **SEER** database of Lymphoma cancer cases from 1973-2001. The file can be read with read.table, and you will get the proper column headers if you use the option header=T. You may certainly subset it further in any analyses you do and write up. Details concerning the records retained, the variables chosen, their meanings and the way I re-coded them, can be found here.
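
A minimal sketch of reading the file, assuming it has been saved as `Lymphom.dat` in the R working directory. The helper name `read_lymphoma` is hypothetical, and no column names are assumed here since they come from the file's own header line.

```r
# Read the project data file; 'path' is wherever you saved Lymphom.dat.
read_lymphoma <- function(path = "Lymphom.dat") {
  stopifnot(file.exists(path))
  read.table(path, header = TRUE)  # header = TRUE supplies the column names
}

# Usage, once the file has been downloaded to the working directory:
if (file.exists("Lymphom.dat")) {
  lymph <- read_lymphoma()
  print(dim(lymph))  # the course page describes 31689 records, 13 columns
  str(lymph)         # check variable types before subsetting
}
```

Checking the dimensions and variable types right after reading is a cheap way to confirm that the header line was picked up correctly.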

**Guidelines for the Final Project.**
As will be discussed in class, the culminating work for the course, beyond HW and the in-class Test, is a take-home course project which is to consist of a 10-12 page paper based on an original data analysis using the ideas covered in the course, to be handed in **before** 11:59pm, Saturday December 18, 2021. You may find data anywhere. I suggest that you find a survival dataset with enough structure (e.g., regression variables, clear hypothesis of interest like treatment effectiveness in a two-group clinical trial) and sufficient sample size so that it would make sense to try a few different survival analyses and compare the results. **You will be graded on appropriateness and interest of the analyses and especially on the clarity and reasonableness of the conclusions (and/or comparisons among conclusions from different methods) that you reach. Your 10-12 pages (excluding data and plots) should explain clearly the models and assumptions and conclusions in a readable narrative. You may hand in (but preferably give URL for) data, intermediate statistical results, and summary displays such as plots and/or histograms, but I do not want to be given any undigested outputs. That is, any such computed outputs should be presented as exhibits, with specific references to such material and suitable interpretations given in the text of your paper.**

*If you want to do anything other than a data analysis and narrative for your paper (e.g., simulation study or exploration of theoretical and illustrative material on additional methods not covered in the course), such an alternative may be OK, but you must see me about it to get it approved first!*

**Important Dates:**

- August 30, Monday: First day of class.
- Sept. 6, Monday: Labor Day: No Class.
- October 25, Monday: **Review for in-class test.**
- October 27, Wednesday: **In-class test.**
- November 24-26, Wed.-Fri.: **Thanksgiving Break: No Class**
- December 13, Monday: Last Day of Classes
- December 18, Saturday: Term Projects Due


© Eric V Slud, December 9, 2021.