Statistics 770  Analysis of Categorical Data

Fall 2020 MW 5-6:30pm,    Online class delivered through ELMS

Instructor: Eric Slud, Statistics program, Math. Dept.

Office:  Mth 2314, x5-5469, email evs@math.umd.edu
Office Hours: W 1:30-3pm or F 2-3pm (one-on-one, using "waiting-room"), or by appointment

Course Text: A. Agresti, Categorical Data Analysis, 3rd ed. 2013.      Find Errata   here.

We will NOT have an in-class test this semester. Current Homework Assignment


Overview: This course covers the statistical analysis of discrete data, cross-classified by and modeled in terms of auxiliary covariate measurements which may be continuous or discrete. Such data structures arise in a wide variety of fields of application, especially in the social and biological sciences. The basic underlying model is the multinomial distribution, with cell-probabilities parametrically restricted according to their array structure, with conditional probability masses for a distinguished response variable often expressed linearly in terms of covariates. Important models of this type (some of which generalize to the case of continuous covariates) include logistic regression, other `generalized linear models', and loglinear models. The modern approach to these topics involves estimation via likelihood-based methods or generalizations to so-called quasilikelihood estimating equations, with emphasis on statistical computing and model diagnostics. In addition, computational advances have made categorical data models with random effects tractable to estimate and interpret, and Bayesian and empirical-Bayes methods are an important part of the material included in the new edition of the Agresti text. Methods covered in the course will be presented in terms of theoretical properties, computational implementation (primarily in R), and real-data application.

Special Features of the Course this Term. Because the course will be offered online this Fall, I plan to offer lectures in 30-minute pieces, live (i.e., synchronously) and recorded. Each lecture will consist of one piece that is primarily a theory lecture, and one piece that is a computational data illustration in R. The theory pieces will be pdf slides with voiceover, and I will experiment with a "whiteboard" format including real-time writing of equations or presentation of graphical pictures, with a document camera. The R pieces will either be slides with R code and pictures, or live demonstrations using R or RStudio windows. Sometimes there will also be a 3rd lecture piece, which may be theory, data illustration, or discussion of models and methods.

NOTE ON USE OF THEORETICAL MATERIAL.  Both in homeworks and the in-class test, there will be theoretical material at the level of probability theory needed to apply the law of large numbers and central limit theorem, along with the `delta method' (Taylor linearization) and other manipulations at advanced-calculus level.
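
For reference, the `delta method' mentioned above can be stated in its simplest one-parameter form as follows. This is the standard textbook statement, written in generic notation (the symbols θ, σ² and g are not taken from the course notes), for a function g that is differentiable at θ with g'(θ) ≠ 0:

```latex
\sqrt{n}\,\bigl(\hat\theta_n - \theta\bigr) \xrightarrow{\;d\;} N(0,\sigma^2)
\quad\Longrightarrow\quad
\sqrt{n}\,\bigl(g(\hat\theta_n) - g(\theta)\bigr) \xrightarrow{\;d\;} N\bigl(0,\; g'(\theta)^2\,\sigma^2\bigr).

% Example: \hat p = Y/n with Y \sim \mathrm{Binom}(n,p), and g(p) = \log p, so g'(p) = 1/p:
\operatorname{Var}\bigl(\log \hat p\bigr) \;\approx\; \frac{1}{p^2}\cdot\frac{p(1-p)}{n} \;=\; \frac{1-p}{n\,p}.
```

The binomial example is the kind of calculation that underlies large-sample confidence intervals for log-scale quantities such as log relative risks and log odds ratios.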

Prerequisite: Stat 420 or Stat 700, plus some computing familiarity, preferably including some R.

Course requirements and Grading: there will be 7 graded homework sets (one every 2 weeks), plus a project/paper at the end. Homeworks will be split between theory problems and statistical computations and interpretations with data. The homework will be worth 65% of the grade, the term paper 35%.

Course Coverage: in the Agresti book:
Chapters 1-6 for basic material (including ROC and prediction/classification topic in Sec.6.3)
Chapter 7 for semiparametric and high-dimensional data-- this is also where SVM and CART belong,
so we may jump a bit into Chapter 15 of Agresti where these topics are presented,
Sections 8.1 (for multicategory response regression models) and 8.6 (for Bayes),
Chapter 9,
Chapter 12 (longitudinal and GEE -- with my 1994 Statistica Sinica paper in Handouts),
Chapter 13 Secs. 13.1-13.4 for small-area-type GLMMs, with 13.7 Bayesian version

NOTE ON COMPUTING.  Both in the homework-sets and the course project, you will be required to do computations on real datasets well beyond the scope of hand calculation or spreadsheet programs. Any of several statistical-computing platforms can be used to accomplish these: R, SAS, Minitab, Matlab, or SPSS, or others. If you are learning one of these packages for the first time, or investing some effort toward deepening your statistical computing skills, I recommend   R   which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials and will provide examples and scripts and will offer some R help. The Agresti book gives scripts and illustrations in SAS.


Getting Started in R and SAS. Lots of R introductory materials can be found on my STAT 705 website from several years ago, in particular in these Notes. Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.

Various pieces of information to help you get started in using SAS can be found under an old (F09) course website  Stat430.  In particular you can find:

--- an overview of the minimum necessary steps to use SAS from Mathnet.

---  a series of SAS logs with edited outputs for illustrative examples.


SAS and R Illustrations and Logs Related to the Agresti book.

The Agresti text has an Appendix A describing software, including SAS scripts, which  can be used to perform categorical data analyses. In addition, datasets can be downloaded from Agresti's website. Several logs in SAS (with some comparative Splus analyses) doing illustrative data analyses and including standard SAS scripts can be found here. There is also a lengthy manual for performing R analyses of examples in (the 2nd edition of) the Agresti book.


R Scripts

A set of R Scripts on many topics related to the course are available in this directory. Those that are specifically cited in Fall 2020 Lectures will be separately located in this sub-directory.



Homework

Notes and Guidelines. Homeworks should be handed in as pdf's through ELMS "Assignments". Solutions will usually be posted, and a percentage deduction of the overall HW score will generally be made for late papers.

Homework solutions will be emailed to all course registrants shortly after the assignments are due. (For course attendees who are not registered, I will email HW solutions if they are requested.)


Assignment 1. (First 2 weeks of course, HW due Wed., Sep. 16). Read all of Chapter 1, plus the first of the sections from the historical notes in Chap. 17. Then solve and hand in all of the following problems:

(A). Calculate the coverage probabilities of Wald, Clopper-Pearson, and Score-statistic (= Wilson) 95% confidence intervals for a binomial parameter π in a sample of size n=40, for at least three different values of the true parameter   π. If you do this with a small program, then try to produce a plot exactly analogous to this one. If you have not yet mastered the capabilities of your software, then you may instead just tabulate several values of the coverage probabilities for the three types of confidence interval. Whichever of these displays you choose for your results: explain in detail why the coverage probabilities are discontinuous functions of  π, piecewise smooth but with discontinuity at finitely many points; and what are the π values of the discontinuity points ?

(B). Suppose that observations X_i are independent and identically distributed (iid) discrete random variables with values {1,...,K} and probability mass function P(X_i = k) = p_k. Find the joint probability distribution of   (N_k,   k=1,...,K),   where  N_k = Σ_{i=1}^{N} I[X_i = k]   and   N ~ Poisson(n) is independent of   {X_i:   i=1,2,...}.

(C). (a) Suppose that N bins are equally likely to be occupied (i.e. are sampled equiprobably) by m black balls and also, independently, are equally likely to be occupied by n white balls. A bin may be occupied by both a black and a white ball, but each bin may contain at most one ball of a given color. Let X be the (random) number of bins holding both a black and a white ball. Find its probability distribution (explicitly, as a function of   m,n,N).
         (b) Let the m black balls position themselves among N bins as in (a). But now suppose that, given the positions of black balls, the positions for the n white balls are chosen in such a way that the odds for each white ball to fall in a bin occupied by a black ball is multiplied by a factor e^θ as compared with the odds of falling in a bin not occupied by a black ball. (A more precise way to say or model this question is as follows: (i) suppose that for some fixed probability   p  , for each bin j=1,...,N independently of the others, a black ball is placed in bin j; but that (ii) we condition on the total number of bins containing black balls being equal to m; and (iii) suppose that each bin not containing a black ball independently of all other bins has probability   q   of receiving a white ball, while each bin containing a black ball independently of all other bins has probability q* [defined to satisfy:   q*/(1-q*) = e^θ q/(1-q)  ],  of receiving a white ball; but that (iv) we condition on the total number of bins containing white balls being equal to n.) Now what is the probability distribution of the number X of bins occupied by both a black and a white ball? Your answer should turn out not to depend on q* and q, i.e. to depend only on m, n, N and θ.

(D). Agresti, Chapter 1: Do # 1.7 but instead of treating the number Y of successes (new drug better) as Y=20 out of n=20 trials as requested in the book, do the exercise based on Y=29 in n=30 trials; also do the exercises numbered 1.8, 1.13, 1.38.

A listing of R functions and commands that can be used to solve problem (A) above can be found here. You can also look at the resulting picture either for sample size n=40 as requested, or for n=100.
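
As a rough illustration of the calculation behind problem (A): for fixed n, the exact coverage probability at a given π is the sum of the Binomial(n, π) probabilities of those outcomes k whose interval contains π. The sketch below is in Python using only the standard library (the course materials themselves are in R), and the function names are illustrative assumptions, not the contents of the listing linked above. The Clopper-Pearson limits are found here by bisection on binomial tail probabilities rather than from Beta quantiles.

```python
import math

Z975 = 1.959963984540054  # 0.975 quantile of the standard normal


def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    return sum(binom_pmf(j, n, p) for j in range(k + 1))


def wald_interval(k, n, z=Z975):
    phat = k / n
    half = z * math.sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half


def wilson_interval(k, n, z=Z975):
    # Interval obtained by inverting the score test.
    phat = k / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * math.sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def clopper_pearson_interval(k, n, alpha=0.05):
    def solve(f, target):
        # f is increasing in p on [0, 1]; bisect for f(p) = target.
        lo, hi = 0.0, 1.0
        for _ in range(60):
            mid = (lo + hi) / 2
            if f(mid) < target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower limit solves P(X >= k) = alpha/2; upper limit solves P(X <= k) = alpha/2.
    lower = 0.0 if k == 0 else solve(lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    upper = 1.0 if k == n else solve(lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


def coverage(interval, n, p):
    # Exact coverage: total probability of outcomes whose interval contains p.
    total = 0.0
    for k in range(n + 1):
        lo, hi = interval(k, n)
        if lo <= p <= hi:
            total += binom_pmf(k, n, p)
    return total


for p in (0.1, 0.3, 0.5):
    print(p,
          round(coverage(wald_interval, 40, p), 4),
          round(coverage(wilson_interval, 40, p), 4),
          round(coverage(clopper_pearson_interval, 40, p), 4))
```

Evaluating `coverage` on a fine grid of π values and plotting the result would give a picture of the kind requested in the problem.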

For an interesting comparison between an `Agresti-Coull (1998) confidence interval ' advocated by the author of our text (see problem 1.25), versus the other standard  intervals we are studying, and also versus a transformed Wald-interval (with 0.5 added  to number of successes and failures) on logit scale, see this picture.


Assignment 2.(Second 2 weeks of course, HW due Thurs., Oct.1 11:59pm). Read all of Chapter 2, plus Chap.3 Sections 3.1-3.3 and 3.5.1-3.5.2. Then solve and hand in the following problems:

(A). Consider the data in Table 2.8 of the Agresti book, on page 57, and assume that the data sampled in each Age group are independent identically distributed among a well-defined population of employed people. Find a 95% confidence interval for the fraction of each age-defined subpopulation that is "fairly satisfied" (i.e., falls in Job-satisfaction category (2)), according to each of the following methods: (i) Wald, (ii) Inverted Score-test, (iii) Clopper-Pearson, (iv) Likelihood Ratio Test, and (v) Bayesian credible interval based on a Beta(1,1) = Uniform prior distribution on the unknown fraction p. Now assume you know that the proportion falling in Job-satisfaction category (2) is the same in all three age-groups, and (vi) find a 95% confidence interval for that proportion.

(B). Establish (formulas for) a large-sample CI for log relative risk log(p1/p2) using the Delta method, in the setting of independent observations Xj ~ Binom(nj, pj) for j=1,2, and apply it to the data in Table 2.1 on Aspirin and Heart Attack Study data on p.38, in two ways: (i) to find a 90% CI for log RR of fatal heart attack for those on placebo relative to those taking aspirin, and (ii) to find a 95% CI for log RR for incidence of Heart Attack (whether fatal or not) for those on placebo relative to those taking aspirin. Assume that the placebo patients and those taking aspirin were sampled independently and equiprobably from large general populations.

(C). The following data from a paper by Helmes and Fekken (1986 Jour. Clin. Psych. 42, 569-576) classifies a sample of psychiatric patients by their diagnoses and whether their treatment prescribed drugs:
            Schizophrenia   Affective Disorder   Neurosis   Personality Disorder   Other
Drugs            105                12              18               47               0
No Drugs           8                 2              19               52              13
Using these data and assuming the rows were sampled independently and iid, (a) conduct a test of (row vs column) independence and interpret the P-value; (b) obtain standardized residuals and interpret them; and (c) partition the LRT (and the approximating chi-squares, if you like) into three components to describe differences and similarities among the diagnoses, by comparing (i) the first two columns, (ii) the 3rd and 4th columns, and (iii) the last column to the combination of the first two columns and the combination of the 3rd and 4th columns.
In this problem each sampled individual (out of the total of 276 in the table) is viewed as an iid random draw from a large population of psychiatric patients, with the facts recorded about the psychiatric disorder affecting each patient and then also the information about whether or not they were prescribed drugs. So this is an overall multinomial table, if you regard the number n=276 as fixed in advance. If you condition further on the number of patients receiving and not receiving drugs, each of the rows of the Table becomes multinomial, but that is not the way the experiment was done. The interpretation of the question is that we are testing whether the type of disorder and the fact of receiving drugs as treatment are dependent as categorical random variables.

(D). Of the 14 candidates for 4 managerial positions, 7 are female and 7 male. Denote the females F1,..,F7 and the males M1,..,M7. The actual result of choosing the managers is (F3,M1,M4,M6). (i) How many possible without-replacement samples were there ? Construct the contingency table for the without-replacement sample actually obtained. (ii) Find and interpret the P-value for the Fisher's Exact Test of the hypothesis that all candidates were equally likely to be selected. (iii) Re-do the same problem if there were 60 candidates (30 male and 30 female) for the four managerial positions.

(E). Find the likelihood ratio test and chi-squared test test-statistics in a 3x3 table for the hypothesis  H_0:   p_{jk} ∝ exp(a_j + b_k + c·j·k)  versus the general alternative. Find and interpret the P-values for these statistics in the 3x3 table with first row (3,7,18), second row (5,18,17), and 3rd row (9,35,42).   Note: you should find the likelihood-maximizing values   â, b̂ and ĉ   under H_0, either using a numerical-maximizing function like nlm or optim in R, or else set up a multiparameter Newton-Raphson iteration to find an iterative numerical solution of the likelihood equations. Then substitute these MLEs to create the likelihood-ratio or chi-square statistics.

(F). Thirty measurements W1,..., W30 of body weight of male students are collected by randomly sampling men from a large population. These are thought to be normally distributed. But the only data we have access to are the numbers of the 30 weights respectively falling into the intervals (0,142], (142,165], (165,180], (180,200] and (200,∞), and those counts are respectively 4,7,9,8,2. Use these data to find a likelihood ratio test of the null hypothesis that the 30 iid observations were normally distributed.
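
As a small illustration of the counting behind problem (D) above: if all candidates are equally likely to be selected, the number of females among the 4 chosen is hypergeometric, and Fisher's Exact Test sums hypergeometric probabilities. The sketch below is in Python with only the standard library (not the course's R), and the function names are hypothetical. The two-sided P-value here sums the probabilities of all outcomes no more likely than the observed one, which is the convention used by R's `fisher.test` for 2x2 tables; other two-sided conventions exist.

```python
import math


def hypergeom_pmf(x, n_female, n_male, n_chosen):
    # P(exactly x females among n_chosen drawn without replacement).
    return (math.comb(n_female, x) * math.comb(n_male, n_chosen - x)
            / math.comb(n_female + n_male, n_chosen))


def fisher_p_values(x_obs, n_female, n_male, n_chosen):
    lo = max(0, n_chosen - n_male)
    hi = min(n_chosen, n_female)
    probs = {x: hypergeom_pmf(x, n_female, n_male, n_chosen)
             for x in range(lo, hi + 1)}
    p_obs = probs[x_obs]
    # One-sided: as few or fewer females than observed.
    one_sided = sum(p for x, p in probs.items() if x <= x_obs)
    # Two-sided: all outcomes with probability no larger than the observed one
    # (small relative tolerance guards against floating-point ties).
    two_sided = sum(p for p in probs.values() if p <= p_obs * (1 + 1e-9))
    return one_sided, two_sided


n_samples = math.comb(14, 4)  # number of without-replacement samples of size 4
one_sided, two_sided = fisher_p_values(x_obs=1, n_female=7, n_male=7, n_chosen=4)
print(n_samples, one_sided, two_sided)
```

Part (iii) corresponds to calling `fisher_p_values(1, 30, 30, 4)`, assuming the observed sample again contains one female and three males.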


Assignment 3. (Third 2 weeks of course, HW due Fri., Oct.16 11:59pm). Read Bayes Sec. 3.6, plus Chapter 4 and the first few sections of Chapter 5. Then solve and hand in the following problems (6 in all):    # 4.10, 4.12, on pages 156-158 plus the following

(I). Do problem # 3.21 in Agresti Chapter 3. But after doing part (b), do two more parts assigned here:
(c) re-do part (b) to find a posterior mean estimate and equal-tail posterior probability interval for the ratio of proportions π1/π2; and
(d) Find the posterior mean estimate in part (b) if the prior distribution is a mixture distribution assigning probability 1/2 to independent and Uniform[0,1] distributed π1 and π2 and probability 1/2 to Uniformly distributed π = π1 = π2.

(II). Fit a logistic regression model to the Crabs mating data with outcome variable (Crabs$sat > 1). Use the predictors: spine (as factor), weight (rounded to nearest 0.2kg), and width (rounded to nearest cm). You may use interactions if they help. Fit the best model you can, and assess the quality of fit of your best model using the techniques in Sec. 5.2.

(III). For the "best" model you fit in problem (II): (a) fit the coefficients directly (by a likelihood calculation that you code in R using "optim" with method="BFGS") and also by coding the Fisher-scoring algorithm and using 5 or 10 iterations (starting from all coef's = 0), and (b) check that the SEs for coefficients found by "glm" are close to those found from your observed information matrix estimates calculated in (a).

(IV). (Compare problem 4.33 on p.162) Use the formulas in the book or class to show how the observed (whole-dataset, not per-observation) information matrix with the probit link depends on the data and differs from the expected (whole-dataset) or Fisher information.
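
For orientation on the Fisher-scoring step mentioned in problem (III) above, here is a minimal sketch for logistic regression. It is written in Python with only the standard library (the assignment itself asks for R), the small dataset is made up purely for illustration and is not the Crabs data, and the function names are hypothetical. Each iteration updates the coefficients by beta + I(beta)^{-1} U(beta), where U is the score vector and I the Fisher information; with the canonical logit link the observed and expected information coincide.

```python
import math


def solve_linear(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def score_and_info(beta, X, y):
    # Score vector and Fisher information for Bernoulli responses, logit link.
    p = len(beta)
    score = [0.0] * p
    info = [[0.0] * p for _ in range(p)]
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        mu = 1.0 / (1.0 + math.exp(-eta))
        w = mu * (1.0 - mu)
        for j in range(p):
            score[j] += (yi - mu) * xi[j]
            for k in range(p):
                info[j][k] += w * xi[j] * xi[k]
    return score, info


def fisher_scoring(X, y, n_iter=10):
    beta = [0.0] * len(X[0])  # start from all coefficients equal to 0
    for _ in range(n_iter):
        score, info = score_and_info(beta, X, y)
        step = solve_linear(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Made-up illustrative data: intercept column plus one covariate.
x_values = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [0, 0, 1, 0, 1, 0, 1, 1]
X = [[1.0, v] for v in x_values]

beta_hat = fisher_scoring(X, y, n_iter=10)
_, info_hat = score_and_info(beta_hat, X, y)
# Standard errors: square roots of the diagonal of the inverse information.
std_errors = []
for j in range(len(beta_hat)):
    unit = [1.0 if k == j else 0.0 for k in range(len(beta_hat))]
    std_errors.append(math.sqrt(solve_linear(info_hat, unit)[j]))
print(beta_hat, std_errors)
```

The standard errors computed this way are the quantities one would compare against the output of a packaged fit such as `glm` in R.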


Assignment 4. HW submissions due by upload to ELMS by Tuesday November 3, 11:59 pm.
Reading: the rest of Chapters 5 and 6 plus Sec.7.1-7.2. Then solve and hand in the following problems (7 in all): 4.20, 5.6, 5.9, 5.30, 5.39, 6.8, 6.14 (ROC and AUC only).



FINAL PROJECT ASSIGNMENT, due Wednesday, December 16, 2020, 11:59pm. As a final course project, you are to write a paper including 5-10 pages of narrative, plus relevant code and graphical or tabular exhibits, on a statistical journal article related to the course or else a data analysis or case-study based on a dataset of your choosing. The guideline is that the paper should be 10 pages if it is primarily expository based on an article, but could have somewhat fewer pages of narrative if based on a data-analytic case study. However, for the latter kind of paper, all numerical outputs should be accompanied by code used to generate them, plus discussion and interpretation of software outputs and graphical exhibits. For a data-analysis or case study, the paper should present a coherent and reasoned data analysis with supporting evidence for the model you choose to fit, the variables you choose to include and exclude, whatever indications you can provide for the adequacy of fit of the models, and a summary of what the model says about the generating mechanism of the data.
Two good sources of data for the paper are the StatLib web-site mentioned below, or
Agresti's book web-site.

Possible topics for the paper include: (a) Zero-inflated Poisson regression models, based on an original paper of Lambert but discussed in connection with the Horseshoe Crabs dataset in a web-page posted by Agresti (indexed also under heading 2. in his book web-page).
(b) The relationship between individual heterogeneity and overdispersion and prediction, in a very nice article connected with a court case in the Netherlands mentioned as a Handout (number (7) under Handouts section below).
(c) Discussion of `raking' in connection with survey-weighted contingency tables, and extension of the Iterative Proportional Fitting Algorithm covered in the loglinear chapter in the course.
(d) I mentioned in class that those of you with interests in Educational Statistics might consider covering some article or book-chapter in categorical data analysis related to Item Response Theory modeling, such as the article Semiparametric Estimation in the Rasch Model and Related Exponential Response Models, Including a Simple Latent Class Model for Item Analysis, by Bruce Lindsay, Clifford C. Clogg, John Grego, in the Journal of the American Statistical Association, Vol. 86, No. 413 (Mar., 1991), pp. 96-107, http://www.jstor.org/stable/2289719.
(e) You might also base your paper on discussion of an R-package, with data illustration to do analyses related to some categorical-data data structure not covered in detail in the class, such as analysis of multicategory generalizations of (fixed-effect) logistic or probit regression, or ordinal-outcome categorical modeling, or `social choice' modeling.
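
To make topic (c) above concrete, the following is a minimal sketch of two-way raking by Iterative Proportional Fitting: the rows and columns of a seed table are rescaled alternately until the table matches target margins. It is in Python with only the standard library (not the course's R); the seed table and target margins are made-up numbers for illustration, and the function name is hypothetical.

```python
def ipf_two_way(seed, row_targets, col_targets, n_iter=50):
    # Alternately rescale rows and columns of a positive table to the target margins.
    fitted = [row[:] for row in seed]
    for _ in range(n_iter):
        for i, row in enumerate(fitted):
            s = sum(row)
            fitted[i] = [v * row_targets[i] / s for v in row]
        for j in range(len(col_targets)):
            s = sum(row[j] for row in fitted)
            for row in fitted:
                row[j] *= col_targets[j] / s
    return fitted


# Made-up seed table and target margins (both sets of margins sum to 100).
seed = [[10.0, 20.0], [30.0, 40.0]]
raked = ipf_two_way(seed, row_targets=[40.0, 60.0], col_targets=[35.0, 65.0])
print(raked)
```

Because each step only multiplies whole rows or whole columns by constants, the raked table keeps the odds ratio of the seed table while matching the new margins.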


Handouts for Reference

(1). Two pdf-handout files contain readings related to (Maximum Likelihood) estimation of parameters in Generalized Linear Mixed Models (GLMM's), specifically in random-intercept logistic regression models:

(i)  A handout from Stat 705 on ML estimation using the EM (Expectation-Maximization) algorithm along with another on MCMC (Markov Chain Monte Carlo) techniques.

(ii) A technical report (written by me for the Small Area Income and Poverty Estimates program at the Census Bureau) on numerical maximization of the random-intercept logistic regression model using the Adaptive Gaussian Quadratures method developed by Pinheiro and Bates (the authors of related nonlinear-model mixed-effect software in R later adapted to NLMIXED in SAS).

(2). A link to a lecture by Agresti in Italy on History of Categorical Data Analysis. Further historical material can be found in an interesting historical article by Stephen Stigler (2002) showing just how recent is the display of data by cross-classification into contingency tables.

(3). Handouts produced for other classes cover Asymptotics relating Wald, Score and LRT Tests, and another Proof of Wilks' Theorem and equivalence of corresponding chi-square statistic with Wald & Rao-Score statistics.

(4). You can get an idea of test topics and course coverage from previous semesters in an old In-Class Test from April 14, 2003. Also see a directory of SASlogs and a Sample Test. A small writeup of computational details related to the first problem of the sample test can be found here.

(5). Proof of limiting distribution of multinomial Chi-square goodness of fit test statistic.

(6). See the directory Survey Confidence Intervals for two papers and a supplement on the extension of binomial confidence intervals to unknown proportions estimated in complex surveys. The JSM 2014 paper, published in the American Statistical Association Proceedings of the Survey Research and Methods Section from the 2014 annual statistical meetings, contains the results of a simulation study showing the relative merits of various binomial-proportion Confidence Intervals adapted to complex survey data. The other paper and supplement, which extends the simulation study and improves the way the intervals are adapted to survey data, has appeared recently in the Journal of Survey Statistics and Methodology.

(7). A very interesting case-study on a criminal case in the Netherlands and the importance of accounting for overdispersion in doing justice to a criminal-defendant. The case study is authored by eminent Dutch statisticians, Gill, Groeneboom and de Jong. The math is very accessible and the point very clear.

(8). A set of R Scripts on many topics related to the course are available in this directory. Those that are specifically cited in Fall 2020 Lectures will be separately located in this sub-directory.

(9). Several R packages for fitting Generalized Linear Mixed Models (particularly, binomial and Poisson family random-intercept models) were mentioned in a script-file covered in class. Some only approximate the GLMM log-likelihood using Monte Carlo techniques, such as glmm or glmmBUGS, while others (which are most useful for the relatively simple random-intercept models arising in the applications in Agresti) calculate the log-likelihoods as accurately as desired using Adaptive Gaussian Quadrature (AGQ): these include lme4, glmmML, or GLMMadaptive, and these can also be checked against my own code in the workspace Rscripts. Also see exposition of AGQ that I wrote in a random-intercept logistic-regression context, which should be accessible and useful to students in this course.

(10). Another package that can be used to fit multiple-outcome ("generalized-logistic" or "multinomial") logistic regression is mlogit. That package, which may be the only R package currently capable of fitting random-intercept models of generalized logistic type, was written not for that purpose but to analyze `social choice' datasets of interest to econometricians.


August 27, 2020

SYLLABUS for Stat 770, based on Agresti 3rd edition, Fall 2020

1. Introduction --- binomial and multinomial probabilities, statistical tests, estimators and confidence intervals. Law of large numbers, central limit theorem, delta method, asymptotic normal distribution of maximum likelihood estimators, Wilks' Theorem. Ch.1 & Ch.16 Review -- 2-3 classes.

2. Computational Techniques -- Numerical Maximization, Fisher Scoring, Iterative Proportional Fitting, EM Algorithm, Bayesian (MCMC) Computation. Partially covered in Asymptotics review, with more material throughout semester, amounting to approximately 2 classes.

3. Describing Contingency Tables --- models and measures of independence vs. association of categorical variables in multiway contingency tables, including tests of independence. Hypotheses equating proportions for different variables. Conditional and marginal odds ratios and relative risks. Confidence intervals for parameters, including Bayesian formulations. Historical notes on contingency table analysis. Ch.2 & Ch.3 -- 4 classes.

4. Generalized linear models. Formulation of conditional response probabilities as linear expressions in terms of covariables. Likelihood and inference. Quasilikelihood and estimating equations.

5. Logistic regression.  Interpretation and inference on model parameters. Model fitting, prediction, and comparison.

6. Model-building including variables selection, diagnostics and inference about variable associations in logistic regression models.

7. Logistic regression extensions: multiple-category responses, weighted observations, and missing data.

8. Loglinear models and their likelihood-based parametric statistical inference.

9. Generalized linear models with random effects. Likelihood and penalized likelihood based inference. Missing-data formulation.

10. Comparison of prediction and classification using various model-fitting strategies. Likelihood, quasilikelihood, penalized likelihood, Bayes. Models & strategies include logistic regression and multicategory extensions, loglinear models, GLMMs, Support Vector Machines, recursive partitioning & decision trees.


Additional Computing Resources.  There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib. Here is another good source. Datasets needed in the course will either be posted to the course web-page, or indicated by links which will be provided here.
A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.


Important Dates


Eric V Slud, Oct. 20, 2020.