Statistics 818D  Bootstrap Methods

Spring 2021 MW 9-9:50am,    Online class delivered through ELMS

Instructor: Eric Slud, Statistics program, Math. Dept.

Office:  Mth 2314, email: slud@umd.edu
Office Hours: W 1:30-3pm or by appointment

Primary Course References:

(i) Lecture Notes (adapted by me from input of colleagues including Snigdhansu Chatterjee, U. of Minnesota)

(ii) Freely downloadable (from UMD Library) e-book chapters and e-book texts including:

Efron, B. (1982), The Jackknife, the Bootstrap and Other Resampling Plans. SIAM.
Gentle, J. (2009), Computational Statistics, Springer.
Wasserman, L. (2006), Chapter 3 on The Bootstrap and the Jackknife in: All of Nonparametric Statistics, Springer.
Hall, P. (1994), The Bootstrap and Edgeworth Expansion, Springer.
DasGupta, A. (2008), Chapter 29 on The Bootstrap in: Asymptotic Theory of Statistics and Probability, Springer.
Giné, E. (1997), Lectures on Some Aspects of the Bootstrap, pp.37-151 in: Lec. Notes in Math., vol 1665, Springer.
Good, P. (2005) Permutation, Parametric and Bootstrap Tests of Hypothesis, 3rd ed., Springer.

(iii) Journal articles containing various bootstrap applications.




Overview: The topic of the course is statistical Resampling Methods, with emphasis on Bootstrap Methods. Resampling refers to statistical procedures based on re-use (often repeated) of a dataset through randomly selected subsets of it. These methods are supported by large-sample probability limit theorems, which sometimes apply in surprisingly small-to-moderate size datasets. The goal of such procedures is any or all of: bias reduction, variance estimation and confidence-interval construction for statistical estimators, and calculation of null reference distributions of statistics used for hypothesis testing. In many Data Science applications, these techniques provide sensible approaches to the estimation of reference distributions under minimal modeling assumptions, with mathematically justified properties under broad qualitative assumptions on the data generating mechanism. The course is suitable either for STAT students or for students in Applied Mathematics or Mathematical Data Science.
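
To make the resampling idea concrete, here is a minimal R sketch (not taken from the course scripts) of a nonparametric bootstrap for the sample median. The toy dataset, the seed, and the choice B = 2000 are assumptions made only for illustration.

```r
# Minimal nonparametric bootstrap sketch; data and settings are illustrative.
set.seed(818)
x <- rexp(40, rate = 1)          # assumed toy dataset
B <- 2000                        # number of bootstrap replicates

# Each replicate: resample n points with replacement, recompute the statistic.
boot_medians <- replicate(B, median(sample(x, replace = TRUE)))

theta_hat <- median(x)
boot_bias <- mean(boot_medians) - theta_hat            # bootstrap bias estimate
boot_se   <- sd(boot_medians)                          # bootstrap standard error
boot_ci   <- quantile(boot_medians, c(0.025, 0.975))   # percentile interval

print(c(estimate = theta_hat, bias = boot_bias, se = boot_se))
print(boot_ci)
```

The same pattern (resample, recompute, summarize) underlies the bias, variance, and confidence-interval uses listed above; only the summary step changes.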

The course will be taught at STAT MA level, more or less, with a mixture of theory, software-oriented (primarily R) applications, and data analyses and simulation case-studies.

Special Features of the Course this Term. 50-minute Lectures will be offered live (i.e., synchronously) over Zoom through ELMS and recorded. Each lecture will consist of one segment containing primarily theoretical material, and one providing computational data illustration in R. The theory pieces will be pdf slides with voiceover, and sometimes with handwritten pieces via a document camera. The R pieces will either be slides with R code and pictures, or live demonstration using R or RStudio windows.

THEORETICAL MATERIAL.  Both in homeworks and lectures, there will be theoretical material at the level of probability theory (STAT 410 or sometimes 600-601) related to laws of large numbers and the central limit theorem (sometimes "functional central limit theorems", which I will explain), along with the `delta method' (Taylor linearization). There will be some proofs, mostly at advanced-calculus level but some involving measure-theoretic ideas.

Prerequisite: Stat 410 or 600-601, Stat 420 or 700-701, plus some computing familiarity, preferably including some R.

Course requirements and Grading: there will be 6 graded homework sets (one every 2 - 2.5 weeks), plus a project/paper at the end. Homeworks will be split between theory problems and statistical computations and interpretations with data. The homework will be worth 65% of the grade, the term paper 35%.


Course Coverage & Outline:

(1.) Monte Carlo Simulation versus Resampling/Bootstrap (1/25-27/2021)

Introduction to Monte Carlo Simulation; definitions and objectives; statistical uses; simulations from
real-data (empirical) distributions; resampling and `pseudo-data'; bootstrap.

(2.) Statistical Functionals: Bias Reduction via Jackknife & Bootstrap (1/27-29/2021)

Definition of Statistical Functionals; higher-order Delta Method and asymptotic expansion for bias;
Jackknife for Bias reduction; Bootstrap for Bias Reduction

(3.) Reference Distributions: Bootstrap Hypothesis Tests and Confidence Intervals (2/1-3/2021)

(4.) Bootstrap with Complicated Statistics (sometimes non-smooth) (2/3-5/2021)

(5.) Proof of Consistency of Bootstrap Distributional Estimate for the Mean

(6.) More on Statistical Functionals, "Influence Functions", and Bootstrap

(7.) Enhanced accuracy for the Bootstrap vs. asymptotic normal approximations:
the `Singh Property' (at least 2 lectures)

(8.) Double and Iterated Bootstrap for Higher-order Accuracy

(9.) Bootstrap in Regression Problems -- Bootstrapping Residuals

(10.) Some Settings where Bootstrap does not Work
Extensions with m-out-of-n Bootstrap and (non-bootstrap) "Subsampling"

(11.) Relation between Functional Central Limit (empirical-process) Theory and Bootstrap Limit Theory

(12.) Weighted and Multiplier Bootstraps

(13.) Parametric Bootstrap -- Theory and application in Mixed-Model and Empirical-Bayes Problems

(14.) Bootstrap for Sample-Survey Inference

(15.) Other Applications of Bootstrap (Survival Analysis, possibly others)

(16.) Bootstrap in Problems with Dependent Data (Time Series, Spatial) Idea of "Block Bootstrap"

(17.) Bootstrap Variants in Machine Learning -- Boosting and Bagging
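
As a small illustration of outline item (2.), the following R sketch compares jackknife and bootstrap bias corrections for the plug-in variance (the second central moment with divisor n). The toy data, the helper name plug_in_var, and B = 2000 are assumptions for this example only. For this particular statistic the jackknife-corrected value coincides with the usual sample variance.

```r
# Jackknife vs. bootstrap bias correction for the plug-in variance.
# Data, helper name, and B are illustrative assumptions.
set.seed(42)
x <- rgamma(25, shape = 2)
n <- length(x)

plug_in_var <- function(v) mean((v - mean(v))^2)   # divisor n, biased downward
T_n <- plug_in_var(x)

# Jackknife: recompute the statistic leaving out one observation at a time.
T_loo     <- sapply(seq_len(n), function(i) plug_in_var(x[-i]))
jack_bias <- (n - 1) * (mean(T_loo) - T_n)
T_jack    <- T_n - jack_bias

# Bootstrap: average of resampled statistics minus the original statistic.
B         <- 2000
T_star    <- replicate(B, plug_in_var(sample(x, replace = TRUE)))
boot_bias <- mean(T_star) - T_n
T_boot    <- T_n - boot_bias

print(c(plug_in = T_n, jackknife = T_jack, bootstrap = T_boot, S2 = var(x)))
```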



COMPUTING.  In lectures, the homework sets, and possibly also the course project, you will be doing computations on real and simulated datasets using a statistical computation platform or library. Any of several statistical-computing platforms are suitable for this: R, Matlab, Python, or others. If you are learning one of these packages for the first time, or investing some effort toward deepening your statistical computing skills, I recommend R, which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials and will provide examples and scripts and will offer some R help, as needed.


Getting Started in R. Lots of R introductory materials can be found on my STAT 705 website from several years ago, in particular in these Notes. Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.


R Scripts

A set of R Scripts on many topics related to the course is available in this directory.



Homework

Notes and Guidelines. Homeworks should be handed in as pdf's through ELMS "Assignments". Solutions will usually be posted, and a percentage deduction of the overall HW score will generally be made for late papers.

Homework solutions will be posted to the ELMS Assignments tab for the use of students registered in the course.


Assignment 1. (First 2 weeks of course, HW due Wed., Feb. 10). Reading: Gentle (2009) chapters on simulation and bootstrap, Efron (1982) Ch.2 & 5, Wasserman (2006) Ch.3, plus R scripts from Lectures. Then solve and hand in all of the following 6 problems:

(1). In Lecture 1 slides 3 and 4, two different algorithms are suggested for simulating n=10,000 or 100,000 Poisson(5) random variables Xi: one by finding a large number m such that P(X>m) < 1e-7, and using the R function sample to draw independently from the conditional Poisson(5) probabilities P(X=k | X ≤ m), k=0,...,m; and the other by letting X be the largest k such that V1+...+Vk ≤ 5. Run a simulation study in R, using system.time to calculate running times, to demonstrate which of the two algorithms is faster.
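
A rough R sketch of how the two samplers and the timing comparison might be set up follows. It is not the code from the lecture slides: the function names are hypothetical, and it assumes that the Vi are iid Exponential(1) waiting times, which is the standard construction but is not stated in the problem.

```r
# Two ways to simulate Poisson(lambda) variates; names are hypothetical.
set.seed(1)
lambda <- 5
n <- 1e4

# Algorithm A: truncate the support at m with P(X > m) < 1e-7, then use sample().
k_grid <- 0:100
m <- k_grid[min(which(ppois(k_grid, lambda, lower.tail = FALSE) < 1e-7))]
rpois_trunc <- function(n, lambda, m) {
  # sample() renormalizes prob, giving the conditional pmf P(X = k | X <= m)
  sample(0:m, n, replace = TRUE, prob = dpois(0:m, lambda))
}

# Algorithm B: X = largest k with V1 + ... + Vk <= lambda,
# ASSUMING the Vi are iid Exponential(1).
rpois_exp <- function(n, lambda) {
  x <- integer(n)
  for (i in seq_len(n)) {
    k <- 0L
    total <- rexp(1)
    while (total <= lambda) {
      k <- k + 1L
      total <- total + rexp(1)
    }
    x[i] <- k
  }
  x
}

time_a <- system.time(xa <- rpois_trunc(n, lambda, m))["elapsed"]
time_b <- system.time(xb <- rpois_exp(n, lambda))["elapsed"]
print(c(A = time_a, B = time_b))
print(c(mean_A = mean(xa), mean_B = mean(xb)))   # both should be near 5
```

A single timing is noisy, so a fuller study would repeat each call several times and for both values of n.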

(2). (Exercise adapted from Dekking (2005) book chapter 18 on Bootstrap or Wasserman's (2006) Chap.3)
(a.) Suppose that a bootstrap sample  (x*j, 1 ≤ j ≤ 10) is drawn from the (empirical distribution function of) the dataset
0.49, 0.51, 0.48, 0.54, 0.50, 0.46, 0.44, 0.56, 0.45, 0.47 .   Compute the probability    P(max(x*j,   1 ≤ j ≤ 10) < 0.56).

(b.) Calculate    P(max(x*j,   1 ≤ j ≤ n) < X(n))    for a bootstrap sample    (x*j, 1 ≤ j ≤ n)    drawn from (the e.d.f. of) a sample   (Xj,   1 ≤ j ≤ n)   with distinct elements, where   X(n)   denotes the n'th order-statistic (the largest element) from the original sample.
(c.) Conclude from this that the large-n limiting distribution for   Xi ∼ Uniform(0,θ)   of   n·(X(n) - max(x*j,   1 ≤ j ≤ n))   is very different from that of   n·(θ - X(n)).   (See Ex.11 at the end of the Wasserman bootstrap chapter.)
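
A hand calculation for this problem can be checked by simulation. The sketch below uses the dataset from part (a); the seed and the number of replicates are assumptions. It compares the simulated frequency with the value (1 - 1/n)^n, which follows because each of the n independent draws misses the largest of n distinct data values with probability 1 - 1/n.

```r
# Simulation check of P(bootstrap maximum < sample maximum); settings are illustrative.
set.seed(1)
x <- c(0.49, 0.51, 0.48, 0.54, 0.50, 0.46, 0.44, 0.56, 0.45, 0.47)
n <- length(x)
B <- 1e5

max_star <- replicate(B, max(sample(x, replace = TRUE)))
p_sim   <- mean(max_star < max(x))   # simulated probability
p_exact <- (1 - 1/n)^n               # product of n independent "miss" events

print(c(simulated = p_sim, exact = p_exact))
```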

(3). Show that the limiting bootstrap estimate (when B → ∞) of the bias of the sample second central moment   Tn = ((n-1)/n) S^2   is   ((n-1)/n^2) S^2,   where   S^2  denotes the ordinary sample variance.

(4)-(5). Gentle (2009) Problems 13.1, 13.8.

(6). (Bias Corrections in Misspecified-Model Setting)   Suppose that the model (for a fixed dataset of size   n ) is assumed to be   f(x, θ)   and the Method of Moments estimator   Tn = g(x̄)   is used, where the smooth function   g   satisfies   g(μ) = θ   and where   μ = Eθ(X1). But now suppose that the correct model for the data is actually   h(x, θ, b/n^{1/2})   within a larger smoothly parameterized two-parameter family of densities   h(x,θ,β)   satisfying   h(x,θ,0) ≡ f(x,θ). For large n (and a sufficiently large number B of bootstrap replications), what effect do the Jackknife and Bootstrap bias-correction estimators have on bias?


Homework Assignment 2. (Second 2 weeks of course, HW due Sat., Feb. 27, 11:59pm). Reading: Wasserman (2006) Ch.3, DasGupta (2008) Chapter 29, plus R scripts from Lectures. Then solve and hand in all of the following problems:

(1). (counts as 2 problems) Perform an adequate number of Monte Carlo iterations to distinguish the coverage performance of the sample median estimator with bootstrap confidence interval for the following three data settings, all with sample size n=50 and number of bootstrap replicates B=300.
(a) f(x) = Gamma(2,1), (b) discrete Uniform on the set {1,...,35}, (c) f(x) = 3(1-2x)^2 on [0,1].
You may make your own choice among Bootstrap Confidence Interval type -- basic-pivotal, percentile, or one of the improved percentile intervals -- but use the same type of bootstrap CI throughout your simulation. Also, use R=1000 or more Monte Carlo iterations, but determine this number with the simulation error in mind, to make the simulation adequate to distinguish CI performance clearly.
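
One possible skeleton for such a coverage study is sketched below in R, for setting (a) only and with the percentile interval as the (assumed) choice of CI type. The function names are hypothetical, and R = 200 is deliberately small so the sketch runs quickly; the assignment asks for at least 1000.

```r
# Monte Carlo coverage of a percentile-bootstrap CI for the median.
# Function names and the small R are illustrative assumptions.
set.seed(2021)

# Percentile-bootstrap interval for the median from one sample.
boot_ci_median <- function(x, B = 300, level = 0.95) {
  meds  <- replicate(B, median(sample(x, replace = TRUE)))
  alpha <- 1 - level
  quantile(meds, c(alpha / 2, 1 - alpha / 2), names = FALSE)
}

# Fraction of R simulated intervals that contain the true median,
# together with its binomial Monte Carlo standard error.
coverage <- function(R, n, rdist, true_median, B = 300) {
  hits <- replicate(R, {
    ci <- boot_ci_median(rdist(n), B)
    ci[1] <= true_median && true_median <= ci[2]
  })
  p <- mean(hits)
  c(coverage = p, mc_se = sqrt(p * (1 - p) / R))
}

# Setting (a): Gamma(2,1), whose true median is qgamma(0.5, 2, 1).
res <- coverage(R = 200, n = 50,
                rdist = function(n) rgamma(n, shape = 2, rate = 1),
                true_median = qgamma(0.5, shape = 2, rate = 1))
print(res)
```

For coverage near 0.95 the Monte Carlo standard error is about sqrt(0.95 · 0.05 / R), roughly 0.007 at R = 1000, which indicates how large R must be to separate the three settings.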

(2). If X and Y are independent Gamma(n1, b) and Gamma(n2, b) random variables, use the Multivariate Central Limit Theorem (and multivariate delta method) to prove that n^{1/2} (X/(X+Y) - λ), with n = n1+n2, converges in distribution to N(0, λ(1-λ)) as   n1, n2 → ∞ in such a way that n1/(n1+n2) - λ = o((n1+n2)^{-1/2}), where λ ∈ (0,1). The main hint is that Gamma(n1,b) is the sum of n1 iid Gamma(1,b) random variables, and Gamma(n2,b) the sum of n2 iid Gamma(1,b) r.v.'s.

(3). (Complicated Statistic arising from estimation after testing)   Suppose that we observe a data sample Xn and want to estimate the standard deviation σ of X1. Assume these data are either N(0,σ^2) or Logistic (with density (1/b) e^{x/b}/(1+e^{x/b})^2, which has variance (π b)^2/3). Suppose that we estimate σ with the statistic Tn defined by the following steps:
(i) calculate the MLE   sigma.MLE   and the maximized log-likelihood   L1   for the normal-data case,
(ii) calculate the MLE for b and the maximized log-likelihood   L2   for the logistic-data case; this can be done with two lines of R code (written here on the per-observation scale, so L1 should be averaged in the same way),
b.MLE = exp(nlm(function(logb,dat) -mean(dlogis(dat,0,exp(logb),log=TRUE)), 0, dat=xvec)$estimate)
llk.logis = mean(dlogis(xvec,0,b.MLE,log=TRUE))

(iii) Tn = sigma.MLE if L1 > L2; else = b.MLE * π/sqrt(3)

Find bootstrap estimates for the variance of Tn, and perform a Monte Carlo simulation with sample sizes n = 40, 80 to see how accurate the bootstrap estimates are.
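
The steps (i)-(iii) and a bootstrap variance estimate can be sketched in R as follows. This is only one possible arrangement: the function name Tn_stat, the simulated normal data, and B = 200 are assumptions, and the log-likelihoods are compared on the per-observation (mean) scale.

```r
# Estimation-after-testing statistic Tn and its bootstrap variance.
# Tn_stat, the toy data, and B are illustrative assumptions.
set.seed(3)

Tn_stat <- function(xvec) {
  # (i) normal model with mean fixed at 0: the MLE of sigma is the root mean square
  sigma.MLE <- sqrt(mean(xvec^2))
  L1 <- mean(dnorm(xvec, 0, sigma.MLE, log = TRUE))
  # (ii) logistic model with location 0: maximize over log(b) so that b stays positive
  fit <- nlm(function(logb, dat) -mean(dlogis(dat, 0, exp(logb), log = TRUE)),
             0, dat = xvec)
  b.MLE <- exp(fit$estimate)
  L2 <- mean(dlogis(xvec, 0, b.MLE, log = TRUE))
  # (iii) report the standard deviation implied by the better-fitting model
  if (L1 > L2) sigma.MLE else b.MLE * pi / sqrt(3)
}

xvec <- rnorm(40, 0, 2)          # assumed toy sample of size n = 40
B <- 200
Tn_star  <- replicate(B, Tn_stat(sample(xvec, replace = TRUE)))
boot_var <- var(Tn_star)         # bootstrap estimate of Var(Tn)

print(c(Tn = Tn_stat(xvec), boot_var = boot_var))
```

To judge accuracy, the bootstrap variance would be compared with the variance of Tn across many independently simulated samples from a known model.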

(4). Prove that if Fn and F are strictly increasing and continuous distributions such that Fn=0 and Fn=1 and Fn(x) → F(x) pointwise for all x, as n → ∞, then the Mallows-Wasserstein metric d2(Fn,F) → 0. Do this by defining random variables   Y ~ F,   U = F(Y),   Yn = Fn^{-1}(U).

(5). Define a nonparametric bootstrap estimator Tn of the variance of the sample median for iid samples X1,...,Xn. Also define a parametric-bootstrap estimator Vn of the same quantity for Expon(λ) data-samples. Do the means of these estimates agree for large n when the data-sample they are based on is actually Expon(λ)? What about their variances? What does theory say about the answers to these questions? Also give a computational answer to the question based on R=400 Monte Carlo iterations with B=300 bootstrap samples.
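
For a single data sample, the two estimators might be computed as in the R sketch below. The sample size, the rate λ = 2, and the seed are assumptions, and the outer loop over R = 400 Monte Carlo samples is left out. The last line gives the standard large-sample approximation 1/(n λ^2) to the variance of the median of Expon(λ) data, evaluated at the MLE, as a reference value.

```r
# Nonparametric (Tn) and parametric (Vn) bootstrap variance estimates
# for the sample median; settings are illustrative assumptions.
set.seed(5)
n <- 60
B <- 300
x <- rexp(n, rate = 2)           # assumed toy sample with lambda = 2

# Nonparametric bootstrap: resample the observed data with replacement.
Tn <- var(replicate(B, median(sample(x, replace = TRUE))))

# Parametric bootstrap: simulate from Expon(lambda.hat), lambda.hat = 1/mean(x).
lambda.hat <- 1 / mean(x)
Vn <- var(replicate(B, median(rexp(n, rate = lambda.hat))))

# Large-sample reference: 1 / (4 n f(m)^2) with f(m) = lambda / 2 at the median m.
asym_var <- 1 / (n * lambda.hat^2)

print(c(Tn = Tn, Vn = Vn, asymptotic = asym_var))
```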



FINAL PROJECT ASSIGNMENT, due Monday, May 17, 2021, 11:59pm (uploaded to ELMS as pdf or MS Word document). As a final course project, you are to write a paper including at least 5-10 pages of narrative, plus relevant code and graphical or tabular exhibits, on a statistical journal article related to the course or else a data analysis or case-study [or simulation study] based on a dataset or data structure of your choosing.
The guideline is that the paper should be closely related to the Bootstrap course material, and should be 10 pages (1.5 spaced in a reasonable font 10pt to 12pt) if it is primarily expository based on an article, but could have somewhat fewer pages of narrative if based on a data-analytic case study. However, for the latter kind of paper, all numerical outputs should be accompanied by code used to generate them, plus discussion and interpretation of software outputs and graphical exhibits. For a data-analysis or case study, the paper should present a coherent and reasoned data analysis with discussion of research questions you want to address, theoretical support where possible for the computational analyses you do and the interpretations, and a clear description of the findings and conclusions.
Good sources of data for the paper are the StatLib web-site mentioned below, the
UCI Machine Learning Data Repository, or any other public data source.
Possible topics for the paper include: TBA.


Handouts for Reference

(1). Pre-history of the Bootstrap, a 2003 Statistical Science paper by Peter Hall.

(2). A set of R Scripts on many topics related to the course is available in this directory.

(3). Several R packages related to Bootstrap are: list with reference documentation to be added throughout the term.


Additional Computing Resources.  There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib. Here is another good source. Datasets needed in the course will either be posted to the course web-page, or indicated by links which will be provided here.
A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.


Important Dates


Eric V Slud, Feb. 22, 2021.