Statistics 818M Analysis of Missing Data

Spring 2019 MW 5-6:15pm, Mth 0102

Instructor: Eric Slud, Statistics program, Math. Dept.

Office: Mth 2314, x5-5469, email evs@math.umd.edu, Office Hours: M11, W1, or by appointment

Course Text: J.-K. Kim and J. Shao, Statistical Methods for Handling Incomplete Data, CRC 2013.

Recommended Texts:
R. Little and D. Rubin, Statistical Analysis with Missing Data (2002), 2nd edition, Wiley.
Handbook of Missing Data Methodology (2014), Chapman and Hall.

Please fill out the on-line Evaluation form on this Course and instructor at http://CourseEvalUM.umd.edu. Thank you.

Overview: This course covers the statistical analysis of data in which important components are unobservable or missing. Such data arise frequently in large databases, in sample surveys, and even in carefully designed experiments. By their nature, such data must be handled through the use of modeling assumptions, generally of the form that unseen data values or their relationships with observable data must in some way be similar to corresponding observed data values. So one of the first tasks in studying the topic of missing data is to understand various statistical models and concepts for mechanisms of missingness. This is where the well-known terminology of `ignorable' missingness and `missing at random' mechanisms comes in, and where the modeling concepts of `patterns of missingness' and `propensities' to be observed are also directly relevant.
The course will roughly follow Chapters 1-5 and 8 of the text, beginning by discussing modeling aspects of missingness (likelihoods, qualitative concepts like missing at random and propensity); next computational techniques related to EM Algorithm, data augmentation and Monte Carlo; next various techniques and approaches to Imputation; next methods of analysis based on propensity matching and scoring and reweighting of estimating equations. Applications to the analysis of survey sample data will also be covered. As time permits, we will talk about research topics and alternative methods of analysis in contexts with non-ignorable missingness and not-missing-at-random data. The text does not say much about Multiple Imputation as a Missing-Data strategy, but other books such as the recommended texts do, and we will introduce the topic.
I am hoping that people coming to the course will feel comfortable with a bit of theory, to understand why EM works and what sort of other likelihood-based modeling and estimation can be used in understanding and compensating for missingness mechanisms.
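As a small illustration of the kind of EM computation referred to above, here is a minimal sketch (in Python, using only the standard library; the same steps carry over directly to R) for a setting not taken from the text: estimating an exponential rate from right-censored data. The function name, the simulated data, the seed, and the censoring time are all hypothetical choices made for illustration. In this simple case the observed-data MLE has a closed form (number of uncensored observations divided by the total observed time), so the EM iteration can be checked against it.

```python
import random

def em_censored_exponential(times, observed, lam=1.0, n_iter=200):
    """EM estimate of an exponential rate from right-censored data.

    times[i]    : event time if observed, otherwise the censoring time
    observed[i] : True if the event time was seen, False if censored
    """
    n = len(times)
    n_cens = sum(1 for d in observed if not d)
    total = sum(times)
    for _ in range(n_iter):
        # E-step: by memorylessness, E[T | T > c] = c + 1/lam for each
        # censored case, so the expected complete-data total is:
        expected_total = total + n_cens / lam
        # M-step: complete-data MLE of the rate, n / (sum of event times)
        lam = n / expected_total
    return lam

# Hypothetical simulated data: true rate 0.5, fixed censoring time 3.0
rng = random.Random(818)
true_rate, cens_time = 0.5, 3.0
latent = [rng.expovariate(true_rate) for _ in range(500)]
times = [min(t, cens_time) for t in latent]
observed = [t <= cens_time for t in latent]

lam_em = em_censored_exponential(times, observed)
lam_direct = sum(observed) / sum(times)  # closed-form observed-data MLE
print(lam_em, lam_direct)
```

The two printed values should agree to many decimal places; the fixed point of the EM update is exactly the closed-form MLE, and the iteration contracts toward it at a rate equal to the fraction of censored observations.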

NOTE ON USE OF THEORETICAL MATERIAL. Both in homeworks and the in-class test, there will be theoretical material at the level of probability theory needed to apply the law of large numbers and central limit theorem, along with the `delta method' (Taylor linearization) and other manipulations at advanced-calculus level.

Prerequisite: Stat 420 or Stat 700, plus some computing familiarity.

Course requirements and Grading: there will be 5 graded homework sets (one every 2--2.5 weeks) which together will count 2/3 of the course grade, and a final project or presentation (10-12 page paper) that will count 1/3 of the grade.

NOTE ON COMPUTING. Both in the homework-sets and the course project, you will be required to do computations on real datasets well beyond the scope of hand calculation or spreadsheet programs. Any of several statistical-computing platforms can be used to accomplish these: R, SAS, Minitab, Matlab, or SPSS, or others. If you are learning one of these packages for the first time, I recommend R which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials and will provide examples and scripts and will offer some R help.

Homework

Notes and Guidelines. Homeworks should be handed in as hard-copy in-class, except for occasional due-dates on Fridays when you may submit them electronically, via email, in pdf format. Solutions will usually be posted, and a percentage deduction of the overall HW score will generally be made for late papers.

Homework solutions will be emailed to all course registrants shortly after they are due. (For course attendees who are not registered, I will email HW solutions if they are requested.)

Assignment 1. (First 2 weeks of course, HW due Mon., Feb. 11). Read about missing-data likelihoods and the definition of Missing at Random and Missing Completely at Random (material in Chapter 2 of Kim & Shao). Then solve and hand in the following problems (counting as 8 problem parts, worth 10 points for each part):

(1) Simulate X_i ~ N(1,1) independent identically distributed (iid) , i=1,...,200, and Y_i = 2 + 0.6 X_i + ε_i , where ε_i are iid N(0,1) and independent of {X_i}, and retain as observations only those data-pairs (X_i , Y_i) for which X_i > 0 .
(a). What is the density of the retained observations Y_i ?
(b). Do the retained pairs (X_i, Y_i) satisfy a linear-regression relationship ? (To justify your answer, find the joint density of the retained observations.)
(c). If the equation defining Y_i in terms of X_i has 2 replaced by a and 0.6 by b and you estimate (a, b) by least-squares, then what is the large-sample target (ie limit) of your estimator b̂ ?

(2) Suppose that random variables X_i ∈ (0,1] are iid with unknown distribution F, and that observation X_i is observed with probability w(X_i) = X_i².
(a). Show how to estimate the unknown F consistently from the observed X_i.
(b). Also prove that your estimator for F is efficient if X_i is discrete with values { 1/m, 2/m, ..., m/m }.
Hint: because X_i is discrete with m distinct values, you can view the vector of its probability masses at the first m-1 of them as an unknown finite-dimensional parameter, and calculate Fisher-information matrices, etc. Note that to prove efficiency, you must either show that the (large-sample asymptotic) variance of your estimator is the same as the inverse of the Fisher Information, or else show that your estimator is equivalent to (i.e., differs by o(1/sqrt(n)) in probability from) the MLE.

(3) Exercise 7 in Kim and Shao Chapter 2, p.22: Consider a bivariate variable (Y1,Y2) where (Y1,Y2) takes on possible values (1,1), (1,0), (0,1) and (0,0) with respective probabilities π₁₁, π₁₀, π₀₁ and π₀₀, where π₁₁+π₁₀+π₀₁+π₀₀=1. To answer the following questions, it may be helpful to define marginal and conditional probabilities in a 2x2 table by π₁₊=P(Y1 =1), π_1|1=P(Y2 =1|Y1 =1) and π_1|0=P(Y2 =1|Y1 =0). Note that there is a one-to-one correspondence between the two alternate parameterizations θ₁ = (π₀₀, π₀₁, π₁₀) and θ₂ = (π₁₊, π_1|1, π_1|0). The realized sample observations are counts n_ij,H and n_i+,K of configurations (Y1=i,Y2=j) for the combination of two independent data samples H and K of sizes 300 for H and 100 for K:
n_11,H=100 , n_10,H = 50 , n_01,H=75 , n_00,H = 75 , n_1+,K = 40 , n_0+,K=60

(a) Compute the observed information and score functions in terms of θ₂.
(b) Obtain the maximum likelihood estimates for θ₁.
(c) Obtain the observed information matrix for θ₁.
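For Problem (1) of this assignment, the data-generating and retention step can be sketched as follows. This is only one possible setup, written in Python with the standard library (an R version would use rnorm and logical subsetting in the same way); the seed and variable names are arbitrary choices, and the sketch deliberately stops before any of the analysis asked for in parts (a)-(c).

```python
import random

rng = random.Random(2019)  # arbitrary seed, only for reproducibility
n = 200

x = [rng.gauss(1.0, 1.0) for _ in range(n)]      # X_i ~ N(1,1), iid
eps = [rng.gauss(0.0, 1.0) for _ in range(n)]    # eps_i ~ N(0,1), independent of X
y = [2.0 + 0.6 * xi + ei for xi, ei in zip(x, eps)]

# Retain only the pairs (X_i, Y_i) with X_i > 0
retained = [(xi, yi) for xi, yi in zip(x, y) if xi > 0]
print(len(retained), "of", n, "pairs retained")
```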

Assignment 2. (2nd 2 weeks, HW due Fri., March 1). Finish Chapter 2, Sec.2.4, and Chapter 3 through Sec.3.5 on EM Algorithm and Monte Carlo variants in the Kim and Shao book. The problems to solve and hand in are the following (7 parts total) :
(1) #9 on p.23 of Ch.2 (counts as 2),
(2) the problem of identifiability of θ = (a,b,λ₁,λ₂) from observed data (X, R, R T), where X ~ Expon(1),
P(T ≥ t | X=x) = I_[t≤x] exp(-λ₁ t) + I_[t>x] exp(-λ₁ x - λ₂(t-x)) , P(R=1 | T=t, X=x) = a I_[t≤x] + b I_[t>x], and λ₁, λ₂ > 0, a, b ∈ (0,1),
(3) #3 on p.54, Ch.3, and
(4) Exercise I (3 parts) in the handout NewEMhandoutS19.pdf.

See R code in this handout for coding examples related to the EM algorithms in the pdf handout.

Assignment 3. (HW3 due Fri., March 15). Finish Chapter 3 (Sec.3.6), plus Ch.4 on Imputation through Sec.4.5 in the Kim and Shao book. The problems to solve and hand in are the following (8 problem-parts or 80 points total): #4, 5, 7, 12 from Chapter 3, pp.54-58.
There is a new R script related to numerical integration that might be used as an alternative to Monte Carlo in Problem #12: although that is not the way the problem is assigned, you might use the numerical-integration idea given there to check your work.

Assignment 4. (HW4 due Fri., April 12). Ch.4 on Imputation through Sec.4.5 in the Kim and Shao book. The problems to solve and hand in are the following (8 problem-parts or 80 points total): #6, 7, 10 from Chapter 4, pp. 95-97. See coursemail message for formula hints on problems 6, 10.

Assignment 5. (HW5 due Fri., May 10 in class). Ch.5 on Propensity Scoring through Sec.5.5 in the Kim and Shao book. The problems to solve and hand in are the following (8 problem-parts or 80 points total):
(I) Reconstruct the 300 x 6 data arrays Ymat3, Rmon3, Xmat from ImputScript.RLog (or another similar simulated dataset with different parameters or seed) with monotone missing pattern.
(a) (15 points) Impute the missing values multiple times using a randomized hot-deck (within 8 groups defined by cross-classifying observations using the two binary values of X₁ and quartiles of X₂). Find estimates and (using Rubin's rules) standard errors.
(b)(10 points) Impute the missing values column-wise, multiple times using estimated univariate (for each Ymat3[,j]) regression models on Xmat, and show that the correlations among complete-data Y columns are very badly estimated by the multiply imputed data.
(c) (15 points) Show theoretically that the model for Y[,2] on Y[,1] in the data subset consisting of the first 2 columns disregarding Xmat is not MAR. (That is, show that the conditional distribution of Y[i,2] given Y[i,1] is different for the i's with R_i = 1 and those for R_i=0.)
(II) Do problems #6(a),(b) [20 points]
(III) [20 points] Get datasets apipop and apiclus2 within R package survey. Here the apipop dataset gives the whole-population information (a population of California schools with > 100 students) and apiclus2 a survey sample.
(A). Find the whole-population average proportions for the variables meals, ell, col.grad, full (which are given as percentages) after first noting that one of these variables has a few missing values and imputing those missing values using a deterministic regression-imputation in terms of the others.
(B). Using the weights-column wtini = apiclus2$pw for the selected respondents (which are all the records in apiclus2), find a calibrated or propensity-model fit π(X_i,φ) to the weights, of the form
π(X_i,φ) = exp(φ₀ + X_i %*% (φ₁,...,φ₄))/pw[i], solving the weighted-propensity estimating equation (i.e., calibration equation) in terms of the same 4 non-constant predictor variables.
(C). From the apiclus2 outcome data (api99, api00) and whole-population totals found in (A) for the variables meals, ell, col.grad, full, estimate the population-wide api99, api00 averages using the propensity weights you found in (B) together with a regression model with up to 4 additional covariates that you find most useful in the apiclus2 dataset. For your preferred model, give the resulting api99, api00 errors and corresponding standard errors.
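For reference on the combining step mentioned in part (I)(a), here is a minimal sketch of Rubin's rules for a scalar estimand: the pooled point estimate is the average of the m completed-data estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The code is in Python with the standard library (the arithmetic is identical in R); the function name and the five estimate/variance pairs are made-up illustrative numbers, not output from the course datasets.

```python
from math import sqrt
from statistics import mean, variance

def rubin_combine(estimates, variances):
    """Combine m completed-data estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    q_bar = mean(estimates)             # pooled point estimate
    w_bar = mean(variances)             # average within-imputation variance
    b = variance(estimates)             # between-imputation variance (divisor m-1)
    total_var = w_bar + (1 + 1 / m) * b
    return q_bar, sqrt(total_var)

# Hypothetical results from m = 5 imputed datasets
est = [10.2, 9.8, 10.5, 10.1, 9.9]
var = [0.25, 0.24, 0.26, 0.25, 0.25]
q, se = rubin_combine(est, var)
print(q, se)
```

Degrees of freedom for interval estimates are not computed here; they would be an additional step.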

Getting Started in R and SAS. Lots of introductory R materials can be found on last year's STAT 705 website.

Various pieces of information to help you get started in using SAS can be found under an old (F09) course website Stat430. In particular you can find:

--- an overview of the minimum necessary steps to use SAS from Mathnet.

--- a series of SAS logs with edited outputs for illustrative examples.

FINAL PROJECT ASSIGNMENT, due Friday, May 17, 2019, 5pm. As a final course project, you are to write a paper including some 5-10 pages of narrative, plus relevant code and graphical or tabular exhibits, on a statistical journal article related to the course or else a data analysis or case-study based on a dataset of your choosing. The guideline is that the paper should be 10--12 pages if it is primarily expository based on an article, but could have somewhat fewer pages of narrative if based on a data-analytic case study. However, for the latter kind of paper, all numerical outputs should be accompanied by code used to generate them, plus discussion and interpretation of software outputs and graphical exhibits. For a data-analysis or case study, the paper should present a coherent and reasoned data analysis with supporting evidence for the model you choose to fit, the method and approach to handling missing data, and an assessment of the results.

Possible topics for the paper include: implementation and analysis/interpretation of one or more imputation methods on a real dataset (e.g., survey public-use data from American Community Survey) using methods and software discussed in the course; exposition of a journal paper on missing data methods in a subject-matter application, such as educational statistics; exposition of some other missing-data topic, such as double- or interval-censored data, from a paper or book-chapter; or some other topic you propose.

Good topic choices for the paper include: (1) Parts of the documentation or related papers on the mice software by van Buuren, as linked under Handouts (5) below; (2) A highly cited paper by Rebecca Andridge, A Review of Hot Deck Imputation for Survey Nonresponse; or (3) any subject-matter-related paper on Propensity Weighting (e.g. a famous seminal paper by Rosenbaum, 1983 Biometrika, or later papers by various authors) or Causal Inference (many possible sources including a famous 1976 paper by Rubin) or Inverse Probability Weighting methodology.

Handouts

(1) A handout from Stat 705 on ML estimation using the EM (Expectation-Maximization) algorithm along with another on MCMC (Markov Chain Monte Carlo) techniques.
Concerning the EM algorithm and associated computations, two pdf's and an R script show theoretical expressions and computations in R regarding small contingency-table examples related to HW2 Problem (I). The pdf pages are NewEMhandoutS19.pdf and EMinfo-examp.pdf, and the R script is RscriptsEM1.RLog.

(2) R scripts related to various topics in the course can be found in the new web-page directory RScripts.

(3) A journal paper I wrote related to combining estimators from different samples is related also to the "Generalized Least Squares" method cited in Ch.3 of Kim and Shao in missing-data contexts.

(4) A talk I gave in the UMD Statistics Seminar in March 2019 about Bayesian computing in a generalized logistic mixed-model setting may be of interest here in the context of MCEM Metropolis-Hastings algorithms.

(5) The journal paper that started the idea of "Chained Imputation". Additional readings on this topic can be found associated with the MICE R-package, and in the "Fully Conditional Specification" Chapter (Ch.13) in the CRC Missing Data Handbook under the authorship of Stef van Buuren. See an online version of it, especially Chapter 4 containing the essence of the "fully conditional specification" idea, expanding on the SRMI idea of Raghunathan et al. You can look at a hands-on introduction to the MICE software in the pdf of a Journal of Statistical Software article of van Buuren at https://www.jstatsoft.org/article/view/v045i03.

(6) A hands-on, purely applied introduction to propensity weighting and matching using a few R packages can be found here.

Additional Computing Resources. There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib. Datasets needed in the course will either be posted to the course web-page, or indicated by links which will be provided here.

A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.

Important Dates

First Class: Mon., January 28, 2019
Change credit level or drop without W: February 8, 2019
Spring Break: March 17--24, 2019
Last schedule-adjustment Date (for Drop/Withdrawal): April 12, 2019
Last day of classes: Mon. May 13, 2019