Statistics 770  Analysis of Categorical Data

Fall 2018, MW 5-6:15pm, Mth 0303

Instructor: Eric Slud, Statistics program, Math. Dept.

Office:  Mth 2314, x5-5469, email evs@math.umd.edu, Office Hours: M11, W1, or by appointment

Course Text: A. Agresti, Categorical Data Analysis, 3rd ed. 2013.

We will NOT have an in-class test this semester.


Please fill out the on-line Evaluation form on this Course and instructor at http://CourseEvalUM.umd.edu. Thank you.


Overview: This course covers the statistical analysis of discrete data, cross-classified by and modeled in terms of auxiliary covariate measurements which may be continuous or discrete. Such data structures arise in a wide variety of fields of application, especially in the social and biological sciences. The basic underlying model is the multinomial distribution, with cell-probabilities parametrically restricted according to their array structure, with conditional probability masses for a distinguished response variable often expressed linearly in terms of covariates. Important models of this type (some of which generalize to the case of continuous covariates) include logistic regression, other `generalized linear models', and loglinear models. The modern approach to these topics involves estimation via likelihood-based methods or generalizations to so-called quasilikelihood estimating equations, with emphasis on statistical computing and model diagnostics. In addition, computational advances have made categorical data models with random effects tractable to estimate and interpret, and Bayesian and empirical-Bayes methods are an important part of the material included in the new edition of the Agresti text. Methods covered in the course will be presented in terms of theoretical properties, computational implementation (primarily in R), and real-data application.

NOTE ON USE OF THEORETICAL MATERIAL.  Both in homework assignments and on the in-class test, there will be theoretical material at the level of probability theory needed to apply the law of large numbers and central limit theorem, along with the `delta method' (Taylor linearization) and other manipulations at advanced-calculus level.

Prerequisite: Stat 420 or Stat 700, plus some computing familiarity.

Course requirements and Grading: there will be 7 graded homework sets (one every 2 weeks), which together will count for 50% of the course grade. There will also be an in-class test (in early November) and a final data analysis/simulation course project (which you will summarize in a 5-10 page paper plus graphical/numerical exhibits), each of which will count for 25% of the course grade.
NOTE: since the in-class test will not be given this semester, the course grade will be based 2/3 on Homework and 1/3 on the Final Paper.

Course Coverage: Chapters 1-6, 8-9 and 13 of the Agresti book, plus bits and pieces from Chapters 7, 16, and 17.

NOTE ON COMPUTING.  Both in the homework sets and the course project, you will be required to do computations on real datasets well beyond the scope of hand calculation or spreadsheet programs. Any of several statistical-computing platforms can be used to accomplish these: R, SAS, Minitab, Matlab, SPSS, or others. If you are learning one of these packages for the first time, I recommend R, which is free and open-source and is the most flexible and useful for research statisticians. I will provide links to free online R tutorials, along with examples, scripts, and some R help. The Agresti book gives its scripts and illustrations in SAS.


Getting Started in R and SAS. Lots of R introductory materials can be found on last year's STAT 705 website. Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.

Various pieces of information to help you get started in using SAS can be found under an old (F09) course website  Stat430.  In particular you can find:

--- an overview of the minimum necessary steps to use SAS from Mathnet.

---  a series of SAS logs with edited outputs for illustrative examples.


SAS Illustrations and Logs Related to the Agresti book.

The Agresti text has an Appendix A describing software, including SAS scripts, which  can be used to perform categorical data analyses. In addition, datasets can be downloaded from Agresti's website. Several logs in SAS (with some comparative Splus analyses) doing illustrative data analyses and including standard SAS scripts can be found here.


Test Topics and Coverage: you can get an idea of these from the In-Class Test of April 14, 2003. Also see a directory of SAS logs and a Sample Test. A small writeup of computational details related to the first problem of the sample test can be found here.


Homework

Notes and Guidelines. Homeworks should be handed in as hard copy in class, except for occasional Friday due-dates, when you may submit them electronically, via email, in pdf format. Solutions will usually be posted, and a percentage deduction from the overall HW score will generally be made for late papers.

Homework solutions will be emailed to all course registrants shortly after they are due. (For course attendees who are not registered, I will email HW solutions if they are requested.)

Assignment 1. (First 2 weeks of course, HW due Wed., Sep. 12). Read all of Chapter 1, plus the first of the sections from the historical notes in Chap. 17. Then solve and hand in all of the following problems:

(A). Calculate the coverage probabilities of Wald, Clopper-Pearson, and Score-statistic (= Wilson) 95% confidence intervals for a binomial parameter π in a sample of size n=40, for at least three different values of the true parameter   π. If you do this with a small program, then try to produce a plot exactly analogous to this one. If you have not yet mastered the capabilities of your software, then you may instead just tabulate several values of the coverage probabilities for the three types of confidence interval. Whichever of these displays you choose for your results: explain in detail why the coverage probabilities are discontinuous functions of  π, piecewise smooth but with discontinuity at finitely many points; and what are the π values of the discontinuity points ?

(B). Suppose that observations X_i are independent and identically distributed (iid) discrete random variables with values {1,...,K} and probability mass function P(X_i = k) = p_k. Find the joint probability distribution of (N_k, k=1,...,K), where N_k = Σ_{i=1}^{N} I[X_i = k] and N ~ Poisson(n) is independent of {X_i : i=1,2,...}.

(C). (a) Suppose that N bins are equally likely to be occupied (i.e. are sampled equiprobably) by m black balls and also, independently, are equally likely to be occupied by n white balls. A bin may be occupied by both a black and a white ball. Let X be the (random) number of bins holding both a black and a white ball. Find its probability distribution (explicitly, as a function of   m,n,N).
         (b) Let the m black balls position themselves among N bins as in (a). But now suppose that, given the positions of black balls, the positions for the n white balls are chosen in such a way that the odds for each white ball to fall in a bin occupied by a black ball are multiplied by a factor e^θ as compared with the odds of falling in a bin not occupied by a black ball. (A more precise way to say or model this question is as follows: (i) suppose that for some fixed probability p, for each bin j=1,...,N independently of the others, a black ball is placed in bin j; but that (ii) we condition on the total number of bins containing black balls being equal to m; and (iii) suppose that each bin not containing a black ball independently of all other bins has probability q of receiving a white ball, while each bin containing a black ball independently of all other bins has probability q* [defined to satisfy: q*/(1-q*) = e^θ q/(1-q)] of receiving a white ball; but that (iv) we condition on the total number of bins containing white balls being equal to n.) Now what is the probability distribution of the number X of bins occupied by both a black and a white ball ? Your answer should turn out not to depend on p and q, i.e. to depend only on m, n, N and θ.

(D). Agresti, Chapter 1: Do # 1.7 but instead of treating the number Y of successes (new drug better) as Y=20 out of n=20 trials as requested in the book, do the exercise based on Y=29 in n=30 trials; also do the exercises numbered 1.8, 1.13, 1.38.

A listing of R functions and commands that can be used to solve problem (A) above can be found here. You can also look at the resulting picture either for sample size n=40 as requested, or for n=100.
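In the same spirit, here is a minimal sketch (an illustration, not a prescribed solution) of one way to compute the exact coverage probabilities in problem (A), using binom.test for the Clopper-Pearson interval and prop.test for the score (Wilson) interval:

n <- 40; z <- qnorm(0.975)
piv <- seq(0.005, 0.995, by = 0.005)      # grid of true parameter values
cover <- function(pi) {
  x <- 0:n; phat <- x/n; px <- dbinom(x, n, pi)
  wald <- abs(phat - pi) <= z * sqrt(phat * (1 - phat)/n)
  cp <- sapply(x, function(k) { ci <- binom.test(k, n)$conf.int
                                ci[1] <= pi & pi <= ci[2] })
  wilson <- sapply(x, function(k) { ci <- prop.test(k, n, correct = FALSE)$conf.int
                                    ci[1] <= pi & pi <= ci[2] })
  c(Wald = sum(px[wald]), CP = sum(px[cp]), Wilson = sum(px[wilson]))
}
covmat <- t(sapply(piv, cover))           # coverage is a step-like function of pi:
matplot(piv, covmat, type = "l", lty = 1:3, col = 1:3,   # it jumps exactly where pi
        xlab = "true pi", ylab = "coverage probability") # crosses an interval endpoint
legend("bottomright", c("Wald", "Clopper-Pearson", "Wilson"), lty = 1:3, col = 1:3)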

For an interesting comparison between the `Agresti-Coull (1998) confidence interval' advocated by the author of our text (see problem 1.25), versus the other standard intervals we are studying, and also versus a transformed Wald interval (with 0.5 added to the numbers of successes and failures) on the logit scale, see this picture.


Assignment 2. (Second 2 weeks of course, HW due Wed., Sep. 26). Read all of Chapter 2, plus Chap. 3 Sections 3.1-3.2. Then solve and hand in the following problems:

(A). Consider the data in Table 2.8 of the Agresti book, on page 57, and assume that the data sampled in each Age group are independent and identically distributed draws from a well-defined population of employed people. Find a confidence interval for the fraction of each age-defined subpopulation that is not satisfied (i.e., falls in Job-satisfaction category 1), according to each of the following methods: (i) Wald, (ii) Inverted Score-test, (iii) Clopper-Pearson, (iv) Likelihood Ratio Test, and (v) Bayesian credible interval based on a Beta(1,1) = Uniform prior distribution on the unknown fraction p.
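As a hedged illustration (the counts below are placeholders, not the Table 2.8 entries, which are not reproduced here), the five interval types can be computed along these lines:

x <- 10; n <- 60                                   # placeholder count and sample size
phat <- x/n
phat + c(-1, 1) * qnorm(.975) * sqrt(phat * (1 - phat)/n)   # (i) Wald
prop.test(x, n, correct = FALSE)$conf.int          # (ii) inverted score test (Wilson)
binom.test(x, n)$conf.int                          # (iii) Clopper-Pearson
## (iv) inverted LRT: solve 2*[loglik(phat) - loglik(p)] = qchisq(.95, 1) for p
llr <- function(p) 2 * (dbinom(x, n, phat, log = TRUE) - dbinom(x, n, p, log = TRUE))
c(uniroot(function(p) llr(p) - qchisq(.95, 1), c(1e-6, phat))$root,
  uniroot(function(p) llr(p) - qchisq(.95, 1), c(phat, 1 - 1e-6))$root)
qbeta(c(.025, .975), x + 1, n - x + 1)             # (v) Beta(1,1)-prior credible interval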

(B). Establish a large-sample CI for the log relative risk log(p1/p2) using the Delta method, in the setting of independent observations Xj ~ Binom(nj, pj) for j=1,2, and apply it to the Aspirin and Heart Attack Study data in Table 2.1 on p. 38, in two ways: (i) to find a 90% CI for the log RR of fatal heart attack for those on placebo relative to those taking aspirin, and (ii) to find a 95% CI for the log RR for incidence of heart attack (whether fatal or not) for those on placebo relative to those taking aspirin. Assume that the placebo patients and those taking aspirin were sampled independently and equiprobably from large general populations.
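A sketch of the delta-method computation (the counts shown are placeholders to be replaced by the Table 2.1 entries):

x1 <- 30; n1 <- 1000    # placebo: event count, group size (placeholders)
x2 <- 15; n2 <- 1000    # aspirin: event count, group size (placeholders)
p1 <- x1/n1; p2 <- x2/n2
est <- log(p1/p2)                                     # estimated log relative risk
se <- sqrt((1 - p1)/(n1 * p1) + (1 - p2)/(n2 * p2))   # delta-method standard error
est + c(-1, 1) * qnorm(0.95) * se                     # 90% CI; use qnorm(0.975) for 95%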

(C). The following data, from a paper by Helmes and Fekken (1986, Jour. Clin. Psych. 42, 569-576), classify a sample of psychiatric patients by diagnosis and by whether their treatment included prescribed drugs:

           Schizophrenia  Affective Disorder  Neurosis  Personality Disorder  Other
Drugs           105               12             18              47             0
No Drugs          8                2             19              52            13
Using these data and assuming the rows were sampled independently and iid, (a) conduct a test of (row vs. column) independence and interpret the P-value; (b) obtain standardized residuals and interpret them; and (c) partition chi-squared into three components to describe differences and similarities among the diagnoses, by comparing (i) the first two columns, (ii) the 3rd and 4th columns, and (iii) the last column against the combination of the first two columns and the combination of the 3rd and 4th columns.
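For parts (a) and (b), chisq.test in R does most of the bookkeeping; a minimal sketch using the table above (expect a small-expected-count warning driven by the sparse columns):

tab <- rbind(Drugs = c(105, 12, 18, 47, 0),
             NoDrugs = c(8, 2, 19, 52, 13))
colnames(tab) <- c("Schiz", "Affective", "Neurosis", "Personality", "Other")
out <- chisq.test(tab)   # (a) Pearson test of row-column independence
out$stdres               # (b) standardized (adjusted) residuals
out$expected             # expected counts, useful for interpreting (b) and (c)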

(D). Of the 12 candidates for 4 managerial positions, 6 are female and 6 male. Denote the females F1,..,F6 and the males M1,..,M6. The actual result of choosing the managers is (F2,M1,M4,M6). (i) How many possible without-replacement samples were there ? Construct the contingency table for the without-replacement sample actually obtained. (ii) Find and interpret the P-value for the Fisher's Exact Test of the hypothesis that all candidates were equally likely to be selected. (iii) Re-do the same problem if there were 100 candidates (50 male and 50 female) for the four managerial positions.
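A sketch for parts (i) and (ii): the outcome reduces to a 2x2 table of sex by selection status (1 female and 3 males chosen), to which fisher.test applies directly:

choose(12, 4)    # (i) number of possible without-replacement samples of size 4
tab <- matrix(c(1, 5, 3, 3), nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("F", "M"), chosen = c("yes", "no")))
fisher.test(tab) # (ii) exact P-value; consider which alternative is appropriate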

(E). Find the likelihood ratio and chi-squared test statistics in a 3x3 table for the hypothesis  H_0:   p_{jk} ∝ exp(a_j + b_k + c·jk)  versus the general alternative. Find and interpret the P-values for these statistics in the 3x3 table with first row (3,7,18), second row (5,18,17), and third row (9,35,42).   Note: you should find the likelihood-maximizing values   â, b̂ and ĉ   under H_0, either using a numerical-maximizing function like nlm or optim in R, or else set up a multiparameter Newton-Raphson iteration to find an iterative numerical solution of the likelihood equations. Then substitute these MLEs to create the likelihood-ratio or chi-square statistics.
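One possible numerical-maximization sketch for problem (E), using optim with the identifiability convention a_1 = b_1 = 0 (a convention chosen for this sketch; any equivalent normalization works):

nmat <- rbind(c(3, 7, 18), c(5, 18, 17), c(9, 35, 42))
N <- sum(nmat)
negloglik <- function(par) {          # par = (a2, a3, b2, b3, c), with a1 = b1 = 0
  a <- c(0, par[1:2]); b <- c(0, par[3:4])
  eta <- outer(a, b, "+") + par[5] * outer(1:3, 1:3)   # a_j + b_k + c*j*k
  p <- exp(eta)/sum(exp(eta))
  -sum(nmat * log(p))
}
fit <- optim(rep(0, 5), negloglik, method = "BFGS")
a <- c(0, fit$par[1:2]); b <- c(0, fit$par[3:4])
phat <- exp(outer(a, b, "+") + fit$par[5] * outer(1:3, 1:3))
phat <- phat/sum(phat)
G2 <- 2 * sum(nmat * log(nmat/(N * phat)))     # likelihood-ratio statistic
X2 <- sum((nmat - N * phat)^2/(N * phat))      # Pearson chi-square
pchisq(c(G2, X2), df = 3, lower.tail = FALSE)  # df = 8 free (saturated) - 5 free (H0)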


Assignment 3. (Third 2 weeks of course, HW due Mon., Oct. 15). Read additional sections of Chapter 3, plus Chapter 4 and the first few sections of Chapter 5. Then solve and hand in the following problems (7 in all): # 3.21 [also do as part (c) of this problem: same as part (b) but with ratio of proportions]; # 4.10, 4.12, 4.20, 4.33; and in addition:

(I). Fit a logistic regression model to the Crabs mating data with outcome variable (Crabs$sat > 1), using the predictors: spine (as factor), weight (rounded to nearest 0.2 kg), and width (rounded to nearest cm). You may use interactions if they help. Fit the best model you can, and assess the quality of fit of your best model using the techniques in Sec. 5.2.
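A hedged starting point for (I), assuming the Crabs data frame from Agresti's website with columns sat, spine, weight (kg), and width (cm):

Crabs$y <- as.numeric(Crabs$sat > 1)
Crabs$wt <- round(Crabs$weight/0.2) * 0.2     # weight to nearest 0.2 kg
Crabs$wd <- round(Crabs$width)                # width to nearest cm
fit1 <- glm(y ~ factor(spine) + wt + wd, family = binomial, data = Crabs)
summary(fit1)    # a baseline only; compare richer/simpler models via AIC and deviance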

(II). For the "best" model you fit in problem (I): (a) fit the coefficients directly (by a likelihood calculation that you code in R using "optim" with method="BFGS") and also by coding the Fisher-scoring algorithm and using 5 or 10 iterations (starting from all coef's = 0), and (b) check that the SEs for coefficients found by "glm" are close to those found from your observed information matrix estimates calculated in (a).
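A sketch of both routes in (II)(a), with X the model matrix and y the 0/1 outcome from the problem (I) sketch above:

X <- model.matrix(fit1); y <- Crabs$y          # from the problem (I) sketch
negll <- function(beta) -sum(dbinom(y, 1, plogis(drop(X %*% beta)), log = TRUE))
opt <- optim(rep(0, ncol(X)), negll, method = "BFGS", hessian = TRUE)
sqrt(diag(solve(opt$hessian)))                 # SEs from the observed information

beta <- rep(0, ncol(X))                        # Fisher scoring, all-zero start
for (t in 1:10) {
  p <- plogis(drop(X %*% beta))
  info <- crossprod(X, X * (p * (1 - p)))      # Fisher information X' W X
  beta <- drop(beta + solve(info, crossprod(X, y - p)))
}
cbind(est = beta, se = sqrt(diag(solve(info))))   # (b): compare with summary(fit1)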


Assignment 4. (Fourth 2 weeks of course, HW due Mon., Nov. 5). Read the rest of Chapters 5 and 6, plus sections 9.1--9.3. Then solve and hand in the following problems (7 in all): 5.2 (parts f and i only), 5.6, 5.9, 5.10, 5.30 (parts c and d only), 5.39, 6.28.


Assignment 5, due Mon., Nov. 19. Note sections 6.3.3 and 6.3.4 on predictive power, included in the previous reading assignment. The reading for this 2-week period consists of Chapter 9 -- which we have already been covering in class -- along with a comparison of the methods of fitting these and the generalized linear models (Newton-Raphson, Fisher Scoring, Iteratively Reweighted Least Squares) we have studied so far. The computational topics are in sections 9.6, 9.7, and 4.6. We will then discuss some of the models in Chapter 8 (Sec. 8.1, 8.2, 8.5). Then solve and hand in the following problems (7 in all): 6.8, 6.10, 9.2, 9.3, 9.16 (d and e only), 9.30, 9.34.


Assignment 6, due Wed., Dec. 12. The reading for the last part of the course consists of Chapter 13 and Chapter 14 sections 14.3 and 14.5, along with the material on glmmBUGS linked in item (5) of the Handouts section below, and the Mini-Course notes I wrote on MCMC: Lecture 1 and Lecture 2. For a quicker and more focused introduction, here are Notes on MCMC for random-intercept logistic regression.
You are to solve and hand in for HW6 four problems ((I) and (II) counting 15 points each, (III) and (IV) counting 10 points each) relating to the latter part of the course:

(I). (a) Build a simple, reasonable model using fixed-effect logistic regression for the outcome class="malignant" in the dataset "biopsy" in package MASS, using only the first biopsies for the 645 unique IDs, and ignoring variable V6.
(b) Fit a mixed-effect logistic model with random intercept for each subject.
(c) Fit a beta-binomial model to these data with a logistic fixed-effect part.
For all three parts, fit the model and interpret/compare the coefficients and quality of fit.

(II). Consider a dataset with a three-level outcome Y (=0,1 or 2) and a predictor X generated artificially with the model

P(Yi = j | Xi) = pi[j],   i.e.   (I[Yi=0], I[Yi=1], I[Yi=2]) ~ Multinomial(1, (pi[j], j=0,1,2)),
where
pi[j] = exp((a0+b0*Xi)*I[j=0] + (a1+b1*Xi)*I[j=1]) / (1 + exp(a0+b0*Xi) + exp(a1+b1*Xi))
using the commands

set.seed(7010)
Xvec = rnorm(5000)                     # predictor values X_i
Ymat = array(0, c(5000,3))             # columns = indicators I[Y=0], I[Y=1], I[Y=2]
Ymat[,1] = rbinom(5000,1, exp(-1.5+0.6*Xvec)/(1+exp(-1.5+0.6*Xvec)+exp(-.5+0.1*Xvec)))  # P(Y=0|X)
Ymat[,2] = rbinom(5000, 1-Ymat[,1], plogis(-.5+0.1*Xvec))   # P(Y=1 | Y != 0, X) = plogis(a1+b1*X)
Ymat[,3] = 1-Ymat[,1]-Ymat[,2]
(i) Fit the parameters of this model using two successive logistic regressions: for (a0,b0) using the data Xvec[ind0], Ymat[ind0,c(1,3)], where ind0 = which(Ymat[,2]==0), and for (a1,b1) using the data Xvec[ind1], Ymat[ind1,c(2,3)], where ind1 = which(Ymat[,1]==0).
(ii) Calculate the Fisher information for the correct three-outcome model (with a0=-1.5, b0=0.6, a1=-0.5, b1=0.1) from the definitions and use it to approximate the variance-covariance matrix of the MLE's for the 4-dimensional parameter (a0,b0,a1,b1).
(iii) Compare the SEs of the four parameters' MLEs derived from (ii) with the SEs of the parameter estimates obtained in (i). Why are the SEs larger with the method given in (i) ?

(III). Find and display the posterior densities of four coefficients [Intercept and District2, District3, and District4] in the Bayesian Poisson (fixed-effects) regression of Claims in the dataset "Insurance" within the R package MASS, using first-principles MCMC code following the ideas of the logistic-regression Metropolis-Hastings code given in the class handout. Note that you should specify the Poisson regression with an "offset" term of log(Holders) added to the Intercept.
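A minimal random-walk Metropolis sketch for (III); the flat prior and the proposal standard deviation are choices made for this sketch (you would tune the latter), not specifications from the class handout:

library(MASS)                                   # for the Insurance data
X <- model.matrix(~ District, data = Insurance) # Intercept, District2-District4
y <- Insurance$Claims
off <- log(Insurance$Holders)                   # offset added to the Intercept
logpost <- function(beta)                       # flat prior: posterior = likelihood
  sum(dpois(y, exp(off + drop(X %*% beta)), log = TRUE))
set.seed(1); B <- 20000
chain <- matrix(0, B, ncol(X)); curr <- rep(0, ncol(X)); lp <- logpost(curr)
for (b in 1:B) {
  prop <- curr + rnorm(ncol(X), 0, 0.05)        # tuning constant 0.05: an assumption
  lpp <- logpost(prop)
  if (log(runif(1)) < lpp - lp) { curr <- prop; lp <- lpp }
  chain[b, ] <- curr
}
plot(density(chain[-(1:5000), 2]), main = "posterior density: District2")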

(IV). Use a package (glmmBUGS or rjags), or else a direct Metropolis-Hastings MCMC implementation as in the script BayesLgst.RLog covered in class, to do a Bayesian analysis of the random-intercept-effect standard deviation in the "biopsy" dataset (using only the first biopsy for each unique ID, and omitting V6), based on a mixed-effect generalized linear model (logistic regression with fixed effects and a random intercept).



FINAL PROJECT ASSIGNMENT, due Friday, December 14, 2018, 5pm. As a final course project, you are to write a paper including 5-10 pages of narrative, plus relevant code and graphical or tabular exhibits, on a statistical journal article related to the course or else a data analysis or case-study based on a dataset of your choosing. The guideline is that the paper should be 10 pages if it is primarily expository based on an article, but could have somewhat fewer pages of narrative if based on a data-analytic case study. However, for the latter kind of paper, all numerical outputs should be accompanied by code used to generate them, plus discussion and interpretation of software outputs and graphical exhibits. For a data-analysis or case study, the paper should present a coherent and reasoned data analysis with supporting evidence for the model you choose to fit, the variables you choose to include and exclude, whatever indications you can provide for the adequacy of fit of the models, and a summary of what the model says about the generating mechanism of the data.
Two good sources of data for the paper are the StatLib web-site mentioned below, or Agresti's book web-site.

Possible topics for the paper include: (a) Zero-inflated Poisson regression models, based on an original paper of Lambert but discussed in connection with the Horseshoe Crabs dataset in a web-page posted by Agresti (indexed also under heading 2. in his book web-page).
(b) The relationship between individual heterogeneity and overdispersion and prediction, in a very nice article connected with a court case in the Netherlands mentioned as a Handout (number (3) under Handouts section below).
(c) Discussion of `raking' in connection with survey-weighted contingency tables, and extension of the Iterative Proportional Fitting Algorithm covered in the loglinear chapter in the course.
(d) I mentioned in class that those of you with interests in Educational Statistics might consider covering some article or book-chapter in categorical data analysis related to Item Response Theory modeling, such as the article Semiparametric Estimation in the Rasch Model and Related Exponential Response Models, Including a Simple Latent Class Model for Item Analysis, by Bruce Lindsay, Clifford C. Clogg, John Grego, in the Journal of the American Statistical Association, Vol. 86, No. 413 (Mar., 1991), pp. 96-107, http://www.jstor.org/stable/2289719.
(e) You might also base your paper on discussion of an R package, with a data illustration, to do analyses related to some categorical data structure not covered in detail in the class, such as analysis of multicategory generalizations of (fixed-effect) logistic or probit regression, or ordinal-outcome categorical modeling, or `social choice' modeling.


Handouts

Two pdf-handout files contain readings related to (Maximum Likelihood) estimation of parameters in Generalized Linear Mixed Models (GLMM's), specifically in random-intercept logistic regression models:

(i)  A handout from Stat 705 on ML estimation using the EM (Expectation-Maximization) algorithm along with another on MCMC (Markov Chain Monte Carlo) techniques.

(ii) A technical report (written by me for the Small Area Income and Poverty Estimates program at the Census Bureau) on numerical maximization of the random-intercept logistic regression model using the Adaptive Gaussian Quadratures method developed by Pinheiro and Bates (the authors of related nonlinear-model mixed-effect software in R later adapted to NLMIXED in SAS).

Additional Handouts for Reference

(1). Proof of limiting distribution of multinomial Chi-square goodness of fit test statistic.

(2). See the directory Survey Confidence Intervals for two papers and a supplement on the extension of binomial confidence intervals to unknown proportions estimated in complex surveys. The JSM 2014 paper, published in the American Statistical Association Proceedings of the Survey Research and Methods Section from the 2014 annual statistical meetings, contains the results of a simulation study showing the relative merits of various binomial-proportion Confidence Intervals adapted to complex survey data. The other paper and supplement, which extends the simulation study and improves the way the intervals are adapted to survey data, will appear soon in the Journal of Survey Statistics and Methodology.

(3). A very interesting case-study on a criminal case in the Netherlands and the importance of accounting for overdispersion in doing justice to a criminal defendant. The case study is authored by the eminent Dutch statisticians Gill, Groeneboom and de Jong. The math is very accessible and the point very clear.

(4). Over the course of the semester, we will have a series of R scripts in the directory Rscripts. The first of these can already be viewed here.

(5). Several R packages for fitting Generalized Linear Mixed Models (particularly, binomial- and Poisson-family random-intercept models) were mentioned in a script-file covered in class. Some only approximate the GLMM log-likelihood using Monte Carlo techniques, such as glmm or glmmBUGS, while others (which are most useful for the relatively simple random-intercept models arising in the applications in Agresti) calculate the log-likelihoods as accurately as desired using Adaptive Gaussian Quadrature (AGQ): these include lme4, glmmML, and GLMMadaptive, and they can also be checked against my own code in the workspace Rscripts. Also see an exposition of AGQ that I wrote in a random-intercept logistic-regression context, which should be accessible and useful to students in this course.
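For instance, a random-intercept logistic fit by AGQ in lme4 looks like the sketch below, where dat, y, x1, x2 and id are hypothetical names standing in for your own data:

library(lme4)
fit <- glmer(y ~ x1 + x2 + (1 | id), family = binomial, data = dat,
             nAGQ = 10)   # 10 adaptive Gauss-Hermite quadrature points
summary(fit)              # nAGQ = 1 would give the faster Laplace approximation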

(6). Another package that can be used to fit multiple-outcome ("generalized-logistic" or "multinomial") logistic regression is mlogit. That package, which may be the only R package currently capable of fitting random-intercept models of generalized logistic type, was written not for that purpose but to analyze `social choice' datasets of interest to econometricians.


August 27, 2018

SYLLABUS for Stat 770, based on Agresti 3rd edition, Fall 2018

1. Introduction --- binomial and multinomial probabilities, statistical tests, estimators and confidence intervals. Law of large numbers, central limit theorem, delta method, asymptotic normal distribution of maximum likelihood estimators, Wilks' Theorem. Ch.1 & Ch.16 Review -- 2-3 classes.

2. Computational Techniques -- Numerical Maximization, Fisher Scoring, Iterative Proportional Fitting, EM Algorithm, Bayesian (MCMC) Computation. Partially covered in the Asymptotics review, with more material throughout the semester, amounting to approximately 2 classes.

3. Describing Contingency Tables --- models and measures of independence vs. association of categorical variables in multiway contingency tables, including tests of independence. Hypotheses equating proportions for different variables. Conditional and marginal odds ratios and relative risks. Confidence intervals for parameters, including Bayesian formulations. Historical notes on contingency table analysis. Ch.2 & Ch.3 -- 4 classes.

4. Generalized linear models. Formulation of conditional response probabilities as linear expressions in terms of covariables. Likelihood and inference. Quasilikelihood and estimating equations.

5. Logistic regression.  Interpretation and inference on model parameters. Model fitting, prediction, and comparison.

6. Model-building including variables selection, diagnostics and inference about variable associations in logistic regression models.

7. Logistic regression extensions: multiple-category responses, weighted observations, and missing data.

8. Loglinear models and their likelihood-based parametric statistical inference.

9. Generalized linear models with random effects. Likelihood and penalized likelihood based inference. Missing-data formulation.

10. Comparison of model-fitting strategies. Likelihood, quasilikelihood, penalized likelihood, Bayes.


Additional Computing Resources.  There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib. Datasets needed in the course will either be posted to the course web-page or indicated by links provided here.
A good set of links to data sources from various organizations including Federal and international statistical agencies is at Washington Statistical Society links.


Important Dates


The UMCP Math Department home page.
The University of Maryland home page.
My home page.
Eric V Slud, Dec. 7, 2018.