**Instructor:** Eric Slud, Statistics Program, Math Dept.

**Office:** Mth 2314, email: slud@umd.edu. **Office Hours:** W 1:30-3pm or by appointment.

**Primary Course References:**

**Overview:** The topic of the course is statistical Resampling Methods, with emphasis on Bootstrap Methods. Resampling refers to statistical procedures based on the (often repeated) re-use of randomly selected subsets of a dataset. These methods are supported by large-sample probability limit theorems, which sometimes apply surprisingly well in small-to-moderate-size datasets. The goal of such procedures is any or all of: bias reduction, variance estimation and Confidence Interval construction for statistical estimators, and calculation of null reference distributions of statistics used for hypothesis testing. In many Data Science applications, these techniques provide sensible approaches to the estimation of reference distributions under minimal modeling assumptions, with mathematically justified properties under broad qualitative assumptions on the data-generating mechanism. The course is suitable either for STAT students or for students in Applied Mathematics or Mathematical Data Science.

The course will be taught at STAT MA level, more or less, with a mixture of theory, software-oriented (primarily **R**) applications, and data analyses and simulation case-studies.

**Special Features of the Course this Term.** 50-minute lectures will be offered live (i.e., synchronously) over Zoom through ELMS and recorded. Each lecture will consist of one segment containing primarily theoretical material and one providing computational data illustration in R. The theory pieces will be pdf slides with voiceover, sometimes with handwritten additions via a document camera. The R pieces will be either slides with R code and pictures, or live demonstration using R or RStudio windows.

**THEORETICAL MATERIAL.** Both in homeworks and lectures, there will be theoretical material at the level of probability theory (STAT 410 or sometimes 600-601) related to laws of large numbers and the central limit theorem (sometimes "functional central limit theorems", which I will explain), along with the "delta method" (Taylor linearization). There will be some proofs, mostly at advanced-calculus level but some involving measure-theoretic ideas.

**Prerequisite:** STAT 410 or 600-601, STAT 420 or 700-701, plus some computing familiarity, preferably including some **R**.

**Course Requirements and Grading:** There will be 6 graded homework sets (one every 2 - 2.5 weeks), plus a project/paper at the end. Homeworks will be split between theory problems and statistical computations and interpretations with data. The homework will be worth 65% of the grade, the term paper 35%.

**(1.)** Monte Carlo Simulation versus Resampling/Bootstrap (1/25-27/2021)

**(2.)** Statistical Functionals: Bias Reduction via Jackknife & Bootstrap (1/27-29/2021)

**(3.)** Reference Distributions: Bootstrap Hypothesis Tests and Confidence
Intervals (2/1-3/2021)

**(4.)** Bootstrap with Complicated Statistics (sometimes non-smooth) (2/3-5/2021)

**(5.)** Proof of Consistency of Bootstrap Distributional Estimate for the Mean

**(6.)** More on Statistical Functionals, "Influence Functions", and Bootstrap

**(7.)** Enhanced accuracy for the Bootstrap vs. asymptotic normal
approximations

**(8.)** Double and Iterated Bootstrap for Higher-order Accuracy

**(9.)** Bootstrap in Regression Problems -- Bootstrapping Residuals

**(10.)** Some Settings where Bootstrap does not Work

**(11.)** Relation between Functional Central Limit (empirical-process) Theory
and Bootstrap Limit Theory

**(12.)** Weighted and Multiplier Bootstraps

**(13.)** Parametric Bootstrap -- Theory and application in
Mixed-Model and Empirical-Bayes Problems

**(14.)** Bootstrap for Sample-Survey Inference

**(15.)** Other Applications of Bootstrap (Survival Analysis, possibly others)

**(16.)** Bootstrap in Problems with Dependent Data (Time Series, Spatial);
the Idea of the "Block Bootstrap"

**(17.)** Bootstrap Variants in Machine Learning -- Boosting and Bagging

**COMPUTING.** In lectures, the homework sets, and possibly also the course project, you will be doing computations on real and simulated datasets using a statistical computation platform or library. Any of several statistical-computing platforms is suitable for this: **R**, Matlab, Python, or others. If you are learning one of these packages for the first time, or investing some effort toward deepening your statistical computing skills, I recommend **R**, which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online **R** tutorials, provide examples and scripts, and offer some **R** help, as needed.
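As a small taste of the core idiom (a sketch for orientation, not part of the course materials; the dataset and the choices n=50, B=300 are illustrative), the nonparametric bootstrap in **R** is built almost entirely on `sample(x, replace=TRUE)`:

```r
# Nonparametric bootstrap estimate of the standard error of the sample median.
# Illustrative sketch: the data, n, and B are arbitrary choices.
set.seed(1)
x <- rexp(50)                                  # an illustrative sample, n = 50
B <- 300                                       # number of bootstrap replicates
med.star <- replicate(B, median(sample(x, replace = TRUE)))
boot.SE <- sd(med.star)                        # bootstrap standard error
```

Everything else in the course -- bias corrections, confidence intervals, hypothesis tests -- layers on this single resampling step.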

**Getting Started in R.** Lots of R introductory materials can be found on my
STAT 705 website from several years ago, in particular in these Notes.
Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.

A set of R scripts on many topics related to the course is available in this directory.

**Notes and Guidelines.** Homeworks should be handed in as pdf's through ELMS "Assignments". Solutions will usually be posted, and a percentage deduction of the overall HW score will generally be made for late papers.

Assignment 1. (First 2 weeks of course, HW due Wed., Feb. 10). **Reading:** Gentle (2009) chapters on simulation and bootstrap, Efron (1982) Ch. 2 & 5, Wasserman (2006) Ch. 3, plus **R** scripts from Lectures.
Then solve and hand in all of the following 6 problems:

**(1).** In Lecture 1 slides 3 and 4, two different algorithms are suggested for simulating n=10,000 or 100,000 Poisson(5) random variables X_{i}: one by finding a large number m such that P(X>m) < 1e-7, and using **sample** to sample independently using the Poisson(5) probabilities P(X=k | X ≤ m), k=0,...,m; and the other by letting X be the largest k such that V_{1}+...+V_{k} ≤ 5, where the V_{i} are *iid* Expon(1) random variables. Run a simulation study in **R**, using **system.time** to calculate running times, to demonstrate which of the two algorithms is faster.
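For orientation only (a skeleton, not a full solution: n = 10,000 and the search range 0:100 for the cutoff m are illustrative choices), the timing comparison might be set up along these lines:

```r
# Skeleton for comparing the two Poisson(5) simulation algorithms.
set.seed(1)
n <- 1e4
# Algorithm 1: truncate the support at the smallest m with P(X > m) < 1e-7,
# then draw n values from the conditional probabilities P(X = k | X <= m).
m <- which(ppois(0:100, 5, lower.tail = FALSE) < 1e-7)[1] - 1
t1 <- system.time(
  X1 <- sample(0:m, n, replace = TRUE, prob = dpois(0:m, 5) / ppois(m, 5))
)
# Algorithm 2: X = largest k with V_1 + ... + V_k <= 5, V_i iid Expon(1).
t2 <- system.time(
  X2 <- replicate(n, {
    s <- 0; k <- -1
    while (s <= 5) { s <- s + rexp(1); k <- k + 1 }
    k
  })
)
```

Both draws should have sample mean near 5; comparing `t1["elapsed"]` with `t2["elapsed"]` answers the speed question.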

**(2).** *(Exercise adapted from Dekking (2005) book chapter 18 on the Bootstrap, or Wasserman's (2006) Chap. 3)*

(a) Suppose a bootstrap sample (x^{*}_{j}, 1 ≤ j ≤ 10) is drawn from the (empirical distribution function of) the dataset

0.49, 0.51, 0.48, 0.54, 0.50, 0.46, 0.44, 0.56, 0.45, 0.47 .
Compute the probability P(max(x^{*}_{j}, 1 ≤ j ≤ 10) < 0.56).

(b) Find a general formula for P(max(x^{*}_{j}, 1 ≤ j ≤ n) < X_{(n)}) for a bootstrap sample (x^{*}_{j}, 1 ≤ j ≤ n) drawn from (the e.d.f. of) a sample (X_{j}, 1 ≤ j ≤ n) with distinct elements, where X_{(n)} denotes the n'th order-statistic (the largest element) from the original sample.

(c) Conclude that the bootstrap distribution for data X_{i} ∼ Uniform(0,θ) of n·(X_{(n)} - max(x^{*}_{j}, 1 ≤ j ≤ n)) is very different from that of n·(θ - X_{(n)}). *(See Ex. 11 at the end of the Wasserman bootstrap chapter.)*

**(3).** Show that the limiting bootstrap estimate (when B → ∞) of the bias of the sample second central moment T_{n} = ((n-1)/n) S^{2} is -((n-1)/n^{2}) S^{2}, where S^{2} denotes the ordinary sample variance.

**(4)-(5).** Gentle (2009) Problems 13.1, 13.8.

**(6).** *(Bias Corrections in Misspecified-Model Setting)*
Suppose that the model (for a fixed dataset of size n) is assumed to be f(x, θ) and the Method of Moments estimator T_{n} = g(x̄) is used, where the smooth function g satisfies g(μ) = θ and where μ = E_{θ}(X_{1}). But now suppose that the correct model for the data is actually h(x, θ, b/n^{1/2}) within a larger smoothly parameterized two-parameter family of densities h(x,θ,β) satisfying h(x,θ,0) ≡ f(x,θ). For large n (and sufficiently large number B of bootstrap replications), what effect do the Jackknife and Bootstrap bias-correction estimators have on bias?

Homework Assignment 2. (Second 2 weeks of course, HW due Sat., Feb. 27, 11:59pm). **Reading:** Wasserman (2006) Ch. 3, DasGupta (2008) Chapter 29, plus **R** scripts from Lectures.
Then solve and hand in all of the following problems:

**(1).** *(counts as 2 problems)* Perform an adequate number of Monte Carlo iterations to distinguish the coverage performance of bootstrap confidence intervals for the sample median in the following three data settings, all with sample size n=50 and number of bootstrap replicates B=300.

(a) f(x) = Gamma(2,1), (b) discrete Uniform on the set {1,...,35}, (c) f(x) = 3(1-2x)^{2} on [0,1].

You may make your own choice among Bootstrap Confidence Interval types -- basic-pivotal, percentile, or one of the improved percentile intervals -- but use the same type of bootstrap CI throughout your simulation. Also, use R=1000 or more Monte Carlo iterations, but determine this number with the simulation error in mind, to make the simulation adequate to distinguish CI performance clearly.
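For orientation, a single Monte Carlo iteration for setting (a) with the percentile interval might look like the following sketch (a sketch only; the full study wraps this in R iterations and tracks the coverage proportion):

```r
# One Monte Carlo iteration of the coverage check, setting (a): Gamma(2,1),
# n = 50, B = 300, percentile bootstrap CI for the median.  Sketch only.
set.seed(7)
n <- 50; B <- 300
true.med <- qgamma(0.5, shape = 2, rate = 1)   # true median of Gamma(2,1)
x <- rgamma(n, shape = 2, rate = 1)
med.star <- replicate(B, median(sample(x, replace = TRUE)))
CI <- quantile(med.star, c(0.025, 0.975))      # percentile interval
covered <- (CI[1] <= true.med) && (true.med <= CI[2])
```

Repeating this R times, the Monte Carlo standard error of the estimated coverage is roughly sqrt(0.95·0.05/R), about 0.007 at R = 1000, which bounds how finely the three settings can be distinguished.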

**(2).** If X and Y are independent Gamma(n_{1}, b) and Gamma(n_{2}, b) random variables, use the Multivariate Central Limit Theorem (and multivariate delta method) to prove that (n_{1}+n_{2})^{1/2} (X/(X+Y) - λ) converges in distribution to *N*(0, λ(1-λ)) as n_{1}, n_{2} → ∞ in such a way that n_{1}/(n_{1}+n_{2}) - λ = o((n_{1}+n_{2})^{-1/2}), where λ ∈ (0,1). The main hint is that Gamma(n_{1},b) is the sum of n_{1} *iid* Gamma(1,b) random variables, and Gamma(n_{2},b) the sum of n_{2} *iid* Gamma(1,b) r.v.'s.

**(3).** *(Complicated Statistic arising from estimation after testing)* Suppose that we observe a data sample __X___{n} and want to estimate the standard deviation σ of X_{1}. Assume these data are either *N*(0,σ^{2}) or Logistic (with density (1/b) e^{x/b}/(1+e^{x/b})^{2}, which has variance (π b)^{2}/3). Suppose that we estimate σ with the statistic T_{n} defined by the following steps:

(i) find the MLE sigma.MLE and the maximized log-likelihood L_{1} for the normal-data case,

(ii) find the MLE b.MLE and the maximized log-likelihood L_{2} for the logistic-data case; this can be done with two lines of **R** code,

(iii) define T_{n} = sigma.MLE if L_{1} > L_{2}; else T_{n} = b.MLE * π/sqrt(3)

Find bootstrap estimates for the variance of T_{n}, and perform a Monte Carlo simulation with sample sizes n = 40, 80 to see how accurate the bootstrap estimates are.

**(4).** Prove that if F_{n} and F are strictly increasing and continuous distribution functions such that F_{n}(0)=0 and F_{n}(1)=1 and F_{n}(x) → F(x) pointwise for all x, as n → ∞, then the Mallows-Wasserstein metric d_{2}(F_{n},F) → 0. Do this by defining random variables Y ~ F, U = F(Y), Y_{n} = F_{n}^{-1}(U).

**(5).** Define a nonparametric bootstrap estimator T_{n} of the variance of the sample median for iid samples X_{1},...,X_{n}. Also define a parametric-bootstrap estimator V_{n} of the same quantity for Expon(λ) data-samples. Do the means of these estimators agree for large n when the data-sample they are based on is actually Expon(λ)? What about their variances? What does theory say about the answers to these questions? Also give a computational answer based on R=400 Monte Carlo iterations with B=300 bootstrap samples.

FINAL PROJECT ASSIGNMENT, due Monday, May 17, 2021, 11:59pm
(uploaded to ELMS as pdf or MS Word document). As a final course project, you are to write a paper including 5-10 pages of narrative, plus relevant code and graphical or tabular exhibits, on a statistical journal article related to the course, or else a data analysis, case-study, or simulation study based on a dataset or data structure of your choosing.


**(1).** *Prehistory of the Bootstrap*, a 2003 *Statistical Science* paper by Peter Hall.

**(2).** A set of R scripts on many topics related to the course is available in this directory.

**(3).** Several **R** packages related to the Bootstrap are: *list with reference documentation to be added throughout the term*.

**Additional Computing Resources.** There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is StatLib. Here is another good source. Datasets needed in the course will either be posted to the course web-page, or indicated by links which will be provided here.

A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.

**First Class: Mon., Jan. 25, 2021**
**Spring Break: week of March 15-19, NO CLASS**
**Last day of classes: Mon., May 10, 2021**

**The UMCP Math Department home page.
The University of Maryland home page.
My home page.
© Eric V Slud, Feb. 22, 2021.**