Statistics 818M  Analysis of Missing Data

Spring 2019 MW 5-6:15pm,    Mth 0102

Instructor: Eric Slud, Statistics program, Math. Dept.

Office:  Mth 2314, x5-5469, email, Office Hours: M11, W1, or by appointment

Course Text: J.-K. Kim and J. Shao, Statistical Methods for Handling Incomplete Data, CRC Press, 2013.

Recommended Texts:
R. Little and D. Rubin, Statistical Analysis with Missing Data (2002), 2nd edition, Wiley.
Handbook of Missing Data Methodology (2014), Chapman and Hall.

Overview: This course covers the statistical analysis of data in which important components are unobservable or missing. Such data arise frequently in large databases, in sample surveys, and even in carefully designed experiments. By their nature, such data must be handled through modeling assumptions, generally of the form that unseen data values, or their relationships with observable data, are in some way similar to the corresponding observed data values. So one of the first tasks in studying missing data is to understand the various statistical models and concepts for mechanisms of missingness. This is where the well-known terminology of `ignorable' missingness and `missing at random' mechanisms comes in, and where the modeling concepts of `patterns of missingness' and `propensities' to be observed become directly relevant.
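As a rough illustration of `missing at random' and `propensities' to be observed, here is a minimal simulation sketch. It is written in Python using only the standard library, purely for illustration, and is not taken from the course materials. The design is hypothetical: a binary covariate x is always observed, the outcome y has true mean 1.0, and the assumed response propensities (0.9 when x = 0, 0.3 when x = 1) depend only on x. Under this design the unweighted mean of the observed y values tends to about 0.5, while weighting each observed y by the inverse of its propensity recovers a value near 1.0.

```python
import random


def simulate(n=20000, seed=818):
    """Simulate MAR nonresponse and compare two estimators of the mean of y."""
    rng = random.Random(seed)
    # Hypothetical response propensities, depending only on the observed x.
    propensity = {0: 0.9, 1: 0.3}
    cc_sum = cc_n = 0.0      # complete-case (unweighted) totals
    ipw_sum = ipw_w = 0.0    # inverse-propensity-weighted totals
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0   # covariate, always observed
        y = 2.0 * x + rng.gauss(0.0, 1.0)    # outcome; true mean is 1.0
        if rng.random() < propensity[x]:     # y is observed for this unit
            w = 1.0 / propensity[x]
            cc_sum += y
            cc_n += 1
            ipw_sum += w * y
            ipw_w += w
    return cc_sum / cc_n, ipw_sum / ipw_w


cc_mean, ipw_mean = simulate()
print(f"complete-case mean:       {cc_mean:.3f}")
print(f"propensity-weighted mean: {ipw_mean:.3f}")
```

Here the propensities are treated as known; in practice they would have to be estimated from a model, which is one of the topics of the course.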
The course will roughly follow Chapters 1-5 and 8 of the text, beginning with modeling aspects of missingness (likelihoods, and qualitative concepts like missing at random and propensity); next, computational techniques related to the EM algorithm, data augmentation and Monte Carlo; next, various techniques and approaches to imputation; and next, methods of analysis based on propensity matching and scoring and on reweighting of estimating equations. Applications to the analysis of survey sample data will also be covered. As time permits, we will talk about research topics and alternative methods of analysis in contexts with non-ignorable missingness and not-missing-at-random data. The text does not say much about multiple imputation as a missing-data strategy, but other books such as the recommended texts do, and we will introduce the topic.
I am hoping that people coming to the course will feel comfortable with a bit of theory, to understand why EM works and what sort of other likelihood-based modeling and estimation can be used in understanding and compensating for missingness mechanisms.
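To give a flavor of the EM algorithm mentioned above, here is a minimal sketch in Python (standard library only, for illustration; it is not taken from the course handouts). It uses the classic multinomial example from Dempster, Laird and Rubin (1977): observed counts (125, 18, 20, 34) with cell probabilities (1/2 + t/4, (1-t)/4, (1-t)/4, t/4). The first cell is treated as the sum of two unobserved sub-cells with probabilities 1/2 and t/4. The function name and defaults are hypothetical.

```python
def em_linkage(counts=(125, 18, 20, 34), theta=0.5, tol=1e-12, max_iter=1000):
    """EM iterations for the four-cell multinomial with a split first cell."""
    y1, y2, y3, y4 = counts
    for _ in range(max_iter):
        # E-step: expected count in the latent theta/4 sub-cell of cell 1,
        # given the observed y1 and the current value of theta.
        x2 = y1 * (theta / 4) / (0.5 + theta / 4)
        # M-step: complete-data MLE, a binomial proportion of
        # "theta-type" counts among all counts that depend on theta.
        new_theta = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


theta_hat = em_linkage()
print(f"EM estimate of theta: {theta_hat:.6f}")
```

In this example the observed-data likelihood equation reduces to the quadratic 197 t^2 - 15 t - 68 = 0, so the EM limit can be checked against its positive root.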

NOTE ON USE OF THEORETICAL MATERIAL.  Both in homeworks and the in-class test, there will be theoretical material at the level of probability theory needed to apply the law of large numbers and central limit theorem, along with the `delta method' (Taylor linearization) and other manipulations at advanced-calculus level.
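For orientation, the `delta method' referred to above can be stated in its simplest scalar form as follows. This is a standard textbook statement in notation chosen here, not taken from the course materials: $\hat\theta_n$ is an estimator of $\theta$, and $g$ is a function differentiable at $\theta$ with $g'(\theta) \neq 0$.

```latex
\[
\sqrt{n}\,(\hat\theta_n - \theta) \xrightarrow{d} N(0,\sigma^2)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(\hat\theta_n) - g(\theta)\bigr) \xrightarrow{d}
N\bigl(0,\; g'(\theta)^2\,\sigma^2\bigr).
\]
```

The result follows from the first-order Taylor expansion $g(\hat\theta_n) \approx g(\theta) + g'(\theta)\,(\hat\theta_n - \theta)$, which is why the technique is also called Taylor linearization.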

Prerequisite: Stat 420 or Stat 700, plus some computing familiarity.

Course requirements and Grading: there will be 5 graded homework sets (one every 2--2.5 weeks), which together will count for 2/3 of the course grade, and a final project or presentation (10--12 page paper) that will count for 1/3 of the grade.

NOTE ON COMPUTING.  Both in the homework sets and the course project, you will be required to do computations on real datasets well beyond the scope of hand calculation or spreadsheet programs. Any of several statistical-computing platforms can be used for these: R, SAS, Minitab, Matlab, SPSS, or others. If you are learning one of these packages for the first time, I recommend R, which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials, along with examples and scripts, and will offer some R help.

Getting Started in R and SAS. Lots of introductory R materials can be found on my STAT 705 website from last year. Another free and interactive site I recently came across for introducing R to social scientists is:

Various pieces of information to help you get started in using SAS can be found under an old (F09) course website  Stat430.  In particular you can find:

--- an overview of the minimum necessary steps to use SAS from Mathnet.

---  a series of SAS logs with edited outputs for illustrative examples.

FINAL PROJECT ASSIGNMENT, due Friday, May 17, 2019, 5pm. As a final course project, you are to write a paper including some 5-10 pages of narrative, plus relevant code and graphical or tabular exhibits, on a statistical journal article related to the course or else a data analysis or case-study based on a dataset of your choosing. The guideline is that the paper should be 10--12 pages if it is primarily expository based on an article, but could have somewhat fewer pages of narrative if based on a data-analytic case study. However, for the latter kind of paper, all numerical outputs should be accompanied by code used to generate them, plus discussion and interpretation of software outputs and graphical exhibits. For a data-analysis or case study, the paper should present a coherent and reasoned data analysis with supporting evidence for the model you choose to fit, the method and approach to handling missing data, and an assessment of the results.

Possible topics for the paper include: under construction.


(1)  A handout from Stat 705 on ML estimation using the EM (Expectation-Maximization) algorithm along with another on MCMC (Markov Chain Monte Carlo) techniques.

(2)  more under construction, including pointers to an upcoming R Scripts directory.

Additional Computing Resources.  There are many publicly available datasets for practice data analyses. Many of them are taken from journal articles and/or textbooks and are documented or interpreted. A good place to start is Statlib. Datasets needed in the course will either be posted to the course web-page or indicated by links which will be provided here.
A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.

Important Dates

Eric V Slud, Dec. 14, 2018.