Instructor:
Professor Eric Slud,
Statistics Program, Math Dept.
Rm 2314, x5-5469, evs@math.umd.edu
Office hours: M 1-2, W 11-12, Th 1-2 (initially), or by appointment.
Sample problems for the in-class test: Old and New.
NOTE: the test date has changed; the test is now on Friday April 11.
The Syllabus, lecture Handouts, Statistical Computing handouts, and Homework are given in the sections below.
The topic of the course is the statistical analysis
of data on lifetimes or durations.
Such data often have the feature of being right-censored,
where subjects may
leave the study at random times and those who
have not died at the ending time of
the study are simply recorded as being still
alive, or truncated, where subjects enter
the study only if they meet some criterion which
may involve an age-variable or
time since diagnosis or other preliminary event.
Such data arise frequently in
clinical trials, epidemiologic studies, reliability
tests, and insurance. We first
present parameterizations of survival distributions,
in terms of hazard intensities,
which lend themselves to the formulation of parametric
models, including
regression-type models which relate failure-time
distributions to auxiliary
biomedical predictors. The special features of
truncation or censoring present
unique challenges in the formulation of likelihoods
and efficient estimation and
testing in settings where the distributions of
arrival-times and withdrawal-times
are unknown and not parametrically modelled.
This statistical topic has achieved great prominence in the theoretical statistical
literature because it is a particularly good arena for introducing techniques
for estimating and testing finite-dimensional parameter values (such as a
treatment-effectiveness parameter in clinical studies) in the presence of
infinite-dimensional unknown parameters. Such problems are referred to as
semiparametric.
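As a rough illustration of the hazard-based parameterization and the censored-data likelihood mentioned above, the standard relations can be written as follows. The notation (T, f, S, h, H, the censoring indicators delta_i, the covariate vector z, and the baseline hazard h_0) is assumed here for illustration and is not fixed by this page; the likelihood line also assumes independent, noninformative right-censoring.

```latex
% Assumed notation: T = failure time, f = density, S = survival function,
% h = hazard intensity, H = cumulative hazard.
\begin{align*}
S(t) &= P(T > t), \qquad
h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\log S(t), \\
H(t) &= \int_0^t h(u)\,du, \qquad S(t) = \exp\{-H(t)\}, \\
% Right-censored sample (t_i, \delta_i), with \delta_i = 1 for an observed failure:
L &\propto \prod_{i=1}^n f(t_i)^{\delta_i}\, S(t_i)^{1-\delta_i}
   = \prod_{i=1}^n h(t_i)^{\delta_i}\, S(t_i), \\
% One regression-type example: proportional hazards with covariate vector z.
h(t \mid z) &= h_0(t)\,\exp(\beta^\top z).
\end{align*}
```

In the last line, beta is the finite-dimensional parameter of interest and the unspecified baseline hazard h_0 plays the role of the infinite-dimensional unknown parameter.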
Prerequisites: The
presentation will be geared to second-year Stat grad students.
Minimum background is Stat 410 and Stat 700.
Required Text: The required text is:
Klein, J. and Moeschberger, M. (2003) Survival
Analysis: Techniques for
Censored and Truncated Data, 2nd ed. Springer-Verlag
ISBN: 038795399X
Data:
Datasets contained in
Appendix A of the Kalbfleisch & Prentice book, except
for Dataset V, can be downloaded in Excel format
from the public ftp site,
linked here .
Two of the datasets (Datasets I and V)
are available in ASCII format,
as rectangular tables,
here.
Recommended Text(s):
The book which we used last time this course was taught will
serve as a useful reference,
but the explanations given
there are harder, less straightforward and often more intuitive.
Kalbfleisch, J. and Prentice, R. (2002) The Statistical Analysis
of Failure Time Data,
2nd ed. John Wiley ISBN: 0-471-36357-X
Another very useful recommended text (a 1980 book reissued as a paperback)
R. Miller, Jr. (1980) Survival Analysis. Wiley-Interscience 1998, ISBN: 0471255483
and, for the more mathematically inclined, a primarily
theoretical text by two
former Maryland students:
Fleming, T. and Harrington, D. (1991) Counting
Processes and Survival Analysis.
ISBN: 047152218X
Coverage of the Klein & Moeschberger book will be Chapters 1-9, plus a few
miscellaneous topics. The main topics are listed in the Syllabus below.
Klein & Moeschberger is a very methods-oriented book,
and will be covered along
with R software
implementation with real-data examples. The Miller book
explains things
well and gives good background and
literature references. For additional mathematical
justifications, including the connection with counting
processes and martingales, I will draw
additional material
from Fleming and Harrington and the research literature. Other data
examples, and more sophisticated data analyses, can be found in the
Kalbfleisch and
Prentice book.
Grading: The
course grade will be based 50% on 7 homework problem sets,
25% on an in-class test, and 25% on a course
project or paper at the end. The homework
problems will be a mixture of theoretical problems
at Stat 410/Stat 700 level, and of
computational or data-analysis problems. The
in-class test will be designed to test
(i) definitions (of models and distributions and statistics),
(ii) ability to use model definitions to construct
likelihoods (and partial likelihoods) and
derive statistics from them, and
(iii) basic properties of estimators and test-statistics
studied in class.
The course project will be either a paper
on a topic not fully covered in class, with
illustrative data analysis, or an extended
and coherent data analysis and writeup (of about
10 pages, not including computer output).
Note: homework problem assignments will be
due approximately every 2 weeks.
The problem sets and due dates will be posted
to this web-page and announced in
class. The problems will be due on the dates
announced and will be graded down for
lateness unless you have a VERY good
excuse.
Computing in the course can be done with Splus, R, SAS, or any other package you
are familiar with that has preprogrammed Survival Analysis modules. However,
Splus and R are the best choices if you want guidance and/or help from me. R is also
the best choice for accommodating the newest methods from the research literature.
Various datasets can be explored and accessed within existing R packages and
libraries, e.g. by loading the survival package and then issuing the data() command:
> library(survival)
> data()
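To restrict the listing to the survival package alone, something like the following sketch may be convenient. The object name `ds` is just an illustrative choice; the `package` argument and the `results` component are part of base R's `data()`.

```r
library(survival)

# data() returns an object whose "results" component is a character matrix
# with one row per dataset; keep only the name and the one-line description.
ds <- data(package = "survival")$results[, c("Item", "Title"), drop = FALSE]

# Show the first few datasets shipped with the survival package
head(ds)
```

Individual datasets listed this way can then be inspected with, e.g., `head()` or `str()`.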
Whatever package you choose, you can get computing
help, datasets, and further links at
StatLib.
All of the datasets for the Klein and Moeschberger book, including its exercises,
can be found here. As mentioned above, all of the datasets in (the Appendix of) the
Kalbfleisch and Prentice text are freely available for download, and Datasets I and V
can be found in ASCII format here.
As an indication of how the datasets were imported using
R into the format given, see the following linked script.
See Handouts
section below for link to "Rbasics" file connected with the
data analysis tasks
needed for Homeworks. For the systematic
Introduction to R and R reference manual
distributed
with the R software, either download from the
R website or simply invoke
the command
> help.start()
from within R. For a slightly
less extensive introductory tutorial in R, click
here .
Lecture Note and Slides Handouts:
Statistical Computing Handouts:
The following material and
handouts were produced in previous terms
when the
course was given using Splus as the computing platform. They
are mostly still usable, since R and Splus are syntactically
the same and
share many older functions, but these handouts will be
updated and
converted to R as the term progresses.
NOTE: to get started using survival-related functions in R, you
need to "load" the
R survival package, which is accomplished by the
command:
> library(survival)
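As a quick check that the package has loaded, one might fit a Kaplan-Meier curve to one of the datasets shipped with it. The sketch below uses the `aml` data frame from the survival package (columns `time`, `status`, `x`); the choice of dataset is only for illustration and is not taken from the handouts.

```r
library(survival)

# Kaplan-Meier estimate of the survival function, pooling both arms of aml;
# Surv() pairs each follow-up time with its event indicator (1 = event).
fit <- survfit(Surv(time, status) ~ 1, data = aml)

# Table of event times, numbers at risk, survival estimates and standard errors
summary(fit)
```

If `library(survival)` has not been run, the calls to `Surv()` and `survfit()` will fail with a "could not find function" error.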
Handouts can
be found at linked pages for each of the following topics:
(0) Basics on R commands for data entry and life table estimates.
(2) Illustrative R Script for Survival Curves, Hazards, Medians, and SEs.
(3) Nelson-Aalen calculation for left-truncated right-censored data.
(4) Script and Illustrative Picture on model fitting of VA Lung-Cancer data in R.
This Script and picture also contain material about fitting and plotting the Cox Model
for the same dataset and comparing the results to the previous accelerated failure time
parametric regression model.
(5) R calculations for weighted logrank (2-sample) test statistics.
(6) Splus log on Stratified Analyses (Survival Curves & Weighted Logrank & Cox Models).
(7) New illustration of Stratified versus interaction-term tests of difference between
coefficients in subgroups of a survival dataset. This R script and picture show, in the
example of a Mayo lung-cancer study, that the coefficients for a baseline health index
("Karnofsky score") actually differ between the two sexes in the study, but that these
differences are obscured if a common baseline hazard is assumed for both sexes.
(8) Log of Splus analyses related to Cox model and comparison
with exponential regression model, along with some pictures.
(9) Handout containing R Log on Self-Consistency Property of Kaplan-Meier Estimator
and Redistribute-to-the-Right Algorithm and Coding for Turnbull (1974) self-consistent
estimator of survival-distribution in double-censored survival data.
(10) Splus algorithm of Turnbull for interval-censored data.
(10) Splus computations for kernel-smoothed Nelson-Aalen estimator.
(11) R Script for Time-dependent Cox-Model fitting, illustrated with data analysis
of Mayo-Clinic Lung Cancer Data.
(12) R script for calculating Partial Likelihoods in (non-time-dependent) Cox-model.
This includes calculations with risk-groups. The script will later be augmented
to include the calculation of score statistics for individual coefficients.
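For orientation on item (12), a Cox partial likelihood can be computed directly from risk sets along the following lines. This is a minimal sketch and not the linked script: the data frame `toy`, the function `log_partial_lik`, and the value `beta0` are all hypothetical, and the toy data are made up with no tied event times.

```r
library(survival)

# Hypothetical toy data with no tied event times (made up for illustration)
toy <- data.frame(
  time   = c(5, 8, 12, 15, 21, 30, 34, 40),
  status = c(1, 0, 1, 1, 0, 1, 1, 0),          # 1 = failure, 0 = censored
  z      = c(1.2, 0.3, -0.5, 0.8, 0.1, -1.1, 0.4, -0.2)
)

# Log partial likelihood at a scalar coefficient beta, using explicit risk sets
log_partial_lik <- function(beta, time, status, z) {
  event_idx <- which(status == 1)
  terms <- vapply(event_idx, function(i) {
    at_risk <- time >= time[i]                 # risk set at the i-th failure time
    beta * z[i] - log(sum(exp(beta * z[at_risk])))
  }, numeric(1))
  sum(terms)
}

beta0  <- 0.5
manual <- log_partial_lik(beta0, toy$time, toy$status, toy$z)

# Same quantity from coxph, evaluated at beta0 with no Newton-Raphson iterations
fit <- coxph(Surv(time, status) ~ z, data = toy, init = beta0,
             control = coxph.control(iter.max = 0))
c(manual = manual, coxph = fit$loglik[1])
```

With tied failure times the two numbers would depend on the tie-handling method chosen in `coxph`, so the agreement shown here relies on the toy data having no ties.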
Syllabus.
Chapter 1. Introduction: Terminology, data structures & examples. 1 class, 1/28
Chapter 2. Failure Time models. 4 classes, 1/30 - 2/6
Chapter 3. Censored-Data Parametric Inference & Likelihoods. 3 classes, 2/8 - 2/13
Chapter 4. Nonparametric survival-curve estimation. 3 classes, 2/15 - 2/20
Chapter 5. Estimates for other censoring schemes. 2 classes, 2/22 - 2/25
Chapter 6. Other estimation techniques. 3 classes, 2/27 - 3/3
Chapter 7. Rank statistics for 1- and 2-sample Tests. 5 classes, 3/5 - 3/14
Chapter 8. Relative Risk Regression Models. 5 classes, 3/24 - 4/2
Chapter 9. Stratified & Time-Dependent Covariate Cox models. 3 classes, 4/9 - 4/14
Chapter 10. Extended Survival Regression Models. As time permits
Homework Problem Sets: For Solutions, click here.
Problem Set 1, Due Wednesday February 6, 2008.
Do # 2.3, 2.9 (the times to substitute are 12, 24,
and 60 months), 2.10, 2.16, and 2.20.
Also to be handed in: using the data in
Table 1.2 of the book, create a life table, with rows
corresponding to ordered increasing infection times within
each of the two ("Surgically Placed Catheter" and
"Percutaneous Placed Catheter") groups,
showing the number of "failures" (= infections) occurring
at that time, and the number at risk (i.e., individuals within
the group who are neither infected nor censored before that
time).
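The bookkeeping for such a life table can be sketched in R as follows. The times and indicators below are made up for illustration and are NOT the Table 1.2 data; the names `time`, `status`, `event_times`, and `life_table` are likewise hypothetical.

```r
library(survival)

# Hypothetical infection times (months) for one group; 1 = infection, 0 = censored.
# These values are invented and are not the Table 1.2 data.
time   <- c(1.5, 3.5, 3.5, 4.5, 5.5, 8.5, 8.5, 9.5, 10.5, 11.5)
status <- c(1,   1,   0,   1,   0,   1,   1,   0,   1,    0)

# One row per distinct failure time, in increasing order
event_times <- sort(unique(time[status == 1]))
life_table <- data.frame(
  time       = event_times,
  # failures occurring exactly at this time
  n_failures = sapply(event_times, function(s) sum(time == s & status == 1)),
  # at risk: neither failed nor censored before this time
  n_at_risk  = sapply(event_times, function(s) sum(time >= s))
)
life_table

# Cross-check against the counts reported by survfit
km <- survfit(Surv(time, status) ~ 1)
summary(km)
```

For the homework, the same steps would be repeated separately within each of the two catheter groups.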
Problem Set 2, Due Wednesday February 27, 2008. Do #3.6,
3.8, 4.1(a)-(f), 4.2(a)-(c) and (e)-(g).
In addition (5th problem to hand in):
read Theoretical Note 1 on pp. 56-57 and
show as much as you can of the following statement given there:
if, in a bivariate setting with dependent (T,C) having a
joint density, the function ρ(t) defined on p. 56 is known along
with the sub-distribution function F_1(t) and event-time
survival function S_T(t), then the marginal
survival function S_X(t) is uniquely determined,
and this survival function depends in a monotonically decreasing way
on ρ(t).
Problem Set 3, Due Friday March 14, 2008.
(I) Klein & Moeschberger problems:
#4.5, 4.7, 4.9.
(II) Kalbfleisch & Prentice Problem (#3.11):
Use the famous Freireich et al. (1963) data
which can be found as
"gehan" within the MASS library,
(a) to test the hypothesis of
equality of remission times in the two groups, using Weibull,
log-normal, and log-logistic models, and to decide which model fits
the data best, and
(b) to test for adequacy of an
exponential model relative to a Weibull model.
In the dataset, you should ignore the "pair" information. The last
column is "treat" , a factor
(categorical) variable.
(III) Consider the setting where you have
right-censored survival data on a large number n
of iid
patients, where the underlying and censoring distributions are both
Exponential, with
respective parameters λ and ρ . Find
simplified formulas for the
asymptotic variances (proportional
to 1/n) for the estimated
marginal survival function S(t) at time t=1 based on the parametric
Exponential estimator of λ and also based on the Kaplan-Meier
estimator, and compare the
formulas. (How much larger is the KM
variance ?)
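For part (II), getting the data and fitting parametric models might start along these lines. This is only a sketch of the mechanics using `survreg` from the survival package, not a prescribed solution; the object names `fit_weib`, `fit_exp`, `lr`, and `p_value` are hypothetical.

```r
library(survival)
library(MASS)    # provides the gehan data frame

# Columns: pair, time (remission time), cens (1 = relapse observed), treat
str(gehan)

# Weibull and exponential regression fits, ignoring the "pair" information
fit_weib <- survreg(Surv(time, cens) ~ treat, data = gehan, dist = "weibull")
fit_exp  <- survreg(Surv(time, cens) ~ treat, data = gehan, dist = "exponential")
summary(fit_weib)

# Likelihood-ratio comparison of exponential (scale fixed at 1) within Weibull;
# loglik[2] is the log-likelihood of the full fitted model.
lr      <- 2 * (fit_weib$loglik[2] - fit_exp$loglik[2])
p_value <- pchisq(lr, df = 1, lower.tail = FALSE)
c(LR = lr, p = p_value)
```

Log-normal and log-logistic fits follow the same pattern with `dist = "lognormal"` and `dist = "loglogistic"`.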
Problem Set 4, Due Friday April 4, 2008.
(I) Problems 7.1, 7.3,
7.9. See the Statistical Computing script (5) above on
weighted logrank statistics.
(II) Problem 8.1: method is
illustrated in CoxMod.txt Log-page, including the part about
fit of exponential model.
(III) (a) Approximately how large
a sample would you need to achieve power 0.90 against
the
alternative with hazard ratio 1.5 using a logrank test, if the sample
of size n were randomly
allocated with a fair coin-toss to control or
treatment group, and control-group survival is
Expon(2) while
censoring is approximately Expon(1) in the control group and
Expon(1.25) in
the treatment group ? (b) Do the same sample-size
calculation if the hypothesis test to use is the
Gehan-modified
Wilcoxon. (c) Do the same sample size calculation if the hypothesis
test to be
used is the Peto-Prentice Wilcoxon (G^ρ
test with ρ=1).
Problem Set 5, Due Wednesday April 30, 2008.
(I) Consider the setting discussed in class, of a large-sample dataset consisting of 2n
observations min(X_i, C_i), where all X_i ~ Expon(λ) and C_i = infinity
for i=1,...,n and C_i = t_max for i=n+1,...,2n. Show explicitly that the
right-censored MLE for λ based on these data is consistent for λ.
Also find the large-sample limit for the MLE of λ under an
Expon(λ) model for only those data-values X_i
which are `observed' in the sense that they are
strictly less than their corresponding C_i censoring-times. (This includes all of the
first n observations, but only approximately n (1-exp(-t_max λ)) of the observations
i=n+1,...,2n.)
(As discussed in class, this second analysis, based only on the "complete cases"
which are uncensored, is misspecified and should not be expected
to be consistent for λ.)
(II) From Chapter 8 of
the course text, do problems 8.4, 8.5, 8.8.
(III) Using the data and results of problem 8.8, find and plot
estimators for: (a) the baseline cumulative hazard
function Λ_0(t), and (b) the population
summary survival functions for the
ALL, AML Low-Risk, and AML Hi-Risk groups.
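The setting of part (I) can also be explored numerically before attempting the derivation. The simulation below is a sketch under assumed values (λ = 1, t_max = 1, n = 10000, and an arbitrary seed), none of which come from the problem statement; it uses the standard right-censored exponential MLE, the number of observed failures divided by the total time at risk.

```r
set.seed(770)                      # arbitrary seed, for reproducibility only
lambda <- 1; tmax <- 1; n <- 10000 # assumed values for illustration

x        <- rexp(2 * n, rate = lambda)      # latent failure times X_i
cens     <- c(rep(Inf, n), rep(tmax, n))    # C_i = infinity, then C_i = t_max
obs_time <- pmin(x, cens)                   # observed min(X_i, C_i)
delta    <- as.numeric(x < cens)            # 1 if the failure was observed

# Right-censored MLE: observed failures / total time at risk
mle_censored <- sum(delta) / sum(obs_time)

# "Complete-case" MLE: treats the uncensored values as a full exponential sample
mle_complete <- sum(delta) / sum(obs_time[delta == 1])

c(censored = mle_censored, complete_case = mle_complete)
```

Comparing the two printed values with the assumed λ gives a numerical check on whatever large-sample limits are derived by hand.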
Problem Set 6, Due Monday May 12, 2008.
Klein & Moeschberger # 5.2, 6.3, 6.6 ((a) and (c) only), and 9.3.
I have created a data-file Lymphom.dat
(zipped) which you can use in your project.
It is very large, with
31689 records of 13 columns each, subsetted and re-coded from
the
National Cancer Institute's SEER database of Lymphoma cancer
cases from
1973-2001. You may certainly subset it (much) further in
any analyses you do and write
up. Details concerning the
records retained, the variables chosen, their meanings and the
way I re-coded them, can be found here .
Guidelines for the Final Project.
As will be discussed in class, the culminating work for
the course, beyond HW and the in-class Test, is a take-home course
project which is to consist
of a 10-page paper based on an original
data analysis using the ideas covered in the course,
to be handed
in before 5pm, Monday December 19, 2005. You
may find data anywhere, but the
StatLib
web-site would be a particularly good place to start. My suggestions were
to find a
survival dataset with enough structure (eg, regression
variables, clear hypothesis of interest like
treatment
effectiveness in a two-group clinical trial) and sufficient
sample-size so that it would
make sense to try a few different
survival analyses and compare the results. You will be graded
on
appropriateness and interest of the analyses and especially on the
clarity and
reasonableness of the conclusions (and/or comparisons
among conclusions from different
methods) which you reach. Your
10 pages (beyond data and plots) should explain clearly
the models, assumptions, and conclusions in a readable narrative. You may
hand in
(but preferably give URL for) data, intermediate
statistical results, and summary displays
such as plots and/or
histograms, but I do not want to be given any undigested
outputs.
That is, any such computed outputs should be presented
as exhibits, with specific references
to such material
and suitable interpretations given in the text of your paper.
Specific URLs at which to look for data are:
http://lib.stat.cmu.edu/datasets/,
http://lib.stat.cmu.edu/disease/,
http://lib.stat.cmu.edu/jasadata/,
http://lib.stat.cmu.edu/DASL/
Also, Splus and R supply several good survival datasets.
If you want to do anything other than a data analysis and
narrative for your paper (e.g., a simulation study or an exploration
of theoretical and illustrative material on additional
methods not covered in the course), such an alternative
may be permissible, but you must see me
about it to get it approved first!
Important Dates:
Other links:
(A) I gave three `Mini-Course' talks on
Survival Analysis a couple of years ago which are
very relevant to the material of this course. The slides are available
in pdf format. They are respectively about
Competing Risks, Martingales & Populations, and Semiparametric Models.
(B) Various useful files on Statistical Computing, in
particular using Splus and R but also
with some material on SAS,
can be found at the Spring '04 course web-page for
Stat 798C, now
renumbered as Stat
705, along with additional relevant links.
(C) StatLib.
Useful repository of downloadable software and datasets, and much more.
(D) R
General source for freely downloadable R packages and related manuals and
datasets.
The UMCP Math Department home page.
The University of Maryland home page.
© Eric V Slud, May 19, 2008.