Statistics 770 Analysis of Categorical Data

Spring 2003, MWF 10, Mth 1311

Instructor: Eric Slud, Statistics Program, Math. Dept.

Office: Mth 2314, x5-5469, email evs@math.umd.edu

Course Text: A. Agresti, Categorical Data Analysis, 2nd ed., 2002.

Overview: This course covers the statistical analysis of discrete data, cross-classified by and modelled in terms of auxiliary covariate measurements which may be continuous or discrete. Such data structures arise in a wide variety of fields of application, especially in the social and biological sciences. The basic underlying model is the multinomial distribution, with cell probabilities parametrically restricted according to their array structure, and with conditional probability masses for a distinguished response variable often expressed linearly in terms of covariates. Important models of this type (some of which generalize to the case of continuous covariates) include logistic regression, other `generalized linear models', and loglinear models. The modern approach to these topics involves estimation via likelihood-based methods or generalizations to so-called quasilikelihood estimating equations, with emphasis on statistical computing and model diagnostics. Methods covered in the course will be presented in terms of theoretical properties, computational implementation, and real-data application.

Prerequisite: Stat 420 or Stat 700, plus some computing familiarity.

Course requirements and Grading: There will be 7 or 8 graded homework sets (one every 1½ to 2 weeks) which together will count for 50% of the course grade. There will also be an in-class test and a final data analysis/simulation course project (which you will summarize in a 5-10 page paper plus graphical/numerical exhibits), each of which will count for 25% of the course grade.

NOTE ON USE OF THEORETICAL MATERIAL. Both in homeworks and the in-class test, there will be theoretical material at the level of probability theory needed to apply the law of large numbers and central limit theorem, along with the `delta method' (Taylor linearization) and other manipulations at advanced-calculus level.

Course Coverage: Chapters 1-8 and 14 of the Agresti book, plus bits and pieces from Chapters 9, 11, 12, 15, 16.

NOTE ON COMPUTING. Both in the homework sets and the course project, you will be required to do computations on real datasets well beyond the scope of hand calculation. Any of several statistical-computing platforms can be used to accomplish these: Splus (3.4 or 6.0) or R, SAS, Minitab, Matlab, or SPSS, or others. You probably cannot get by with just a spreadsheet program. Many other perfectly adequate statistical packages exist, although support and help for them may be hard to come by on campus (and is not available from me). If you are learning one of these packages for the first time, I recommend Splus or R (they are syntactically the same, and R is free), and I will provide links to free online materials introducing them. The Agresti book gives scripts and illustrations in SAS. I will give examples and scripts in Splus or R and if necessary will provide individual help in Splus, R, SAS or Matlab. Of all of these packages, the one which would be most generally useful to you in statistical practice would be SAS, while for methodological statistical research the recommended platform is probably Splus or R, unless you are interested in large-dataset numerical and computer-intensive topics, in which case the best choice might be SAS or Matlab.

COMPUTER ACCOUNTS. Math, Stat, and Amsc graduate students have access to Splus and Matlab under Unix through their mathnet accounts, and others can have access through glue accounts. PC-based (student) versions of Minitab, Splus and Matlab are available and not too expensive. R is freely available in Unix or PC form through this link. SAS in a Unix environment is available to you free through a course campus-account on the `cluster' machines which has been set up for you (with individual user-names and initial passwords which I will distribute).

Getting Started in SAS.

Various pieces of information to help you get started in using SAS can be found under an old (F01) course website Stat430. In particular you can find:

--- running SAS under Unix on cluster machines.

Instructions and links are included there concerning a downloadable `script' enabling remote callup of SAS when you are running your cluster account remote from a campus WAM or mathnet or glue workstation.

--- an overview of the minimum necessary steps to use SAS from Mathnet.

--- links to stat430 problem assignments.

--- a series of SAS logs with edited outputs for illustrative examples.

SAS Illustrations and Logs Related to the Agresti book.

The Agresti text has an Appendix A describing software, including SAS scripts, which can be used to perform categorical data analyses. In addition, datasets can be downloaded from Agresti's website. Several logs in SAS (with some comparative Splus analyses) doing illustrative data analyses and providing standard SAS scripts can be found here.

Test Topics and Coverage Related to In-Class Test April 14. Also see the directory of SASlogs, and a Sample Test. A small writeup of computational details related to the first problem of the sample test (a corrected version of the one discussed in class 4/09/03 and 4/11/03) can be found here.

Assignment 1. (First 1½ weeks of course, HW due Wed., Feb. 12). Read all of Chapter 1, plus the first of the sections from the historical notes in Chapter 16. Then solve and hand in all of the following problems:

(A) Calculate the coverage probabilities of Wald, Clopper-Pearson, and Score-statistic 95% confidence intervals for a binomial parameter π in a sample of size n=40, for at least three different values of the true parameter π. If you do this with a small program, then try to produce a plot exactly analogous to Figure 1.3 on page 19 of Agresti. If you have not yet mastered the capabilities of your software, then you may instead just tabulate several values of the coverage probabilities for the three types of confidence interval. Whichever of these displays you choose for your results: explain in detail why the coverage probabilities are discontinuous functions of π, piecewise smooth but with discontinuities at finitely many points; and what are the π values of the discontinuity points?

(B) Agresti: Chapter 1, # 1.7, 1.8, 1.9, 1.12, 1.24.

A listing of Splus functions and commands used to solve problem (A) above can be found here. You can also look at the resulting picture either for sample size n=40 as requested, or for n=100.
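For readers without access to Splus, the calculation in problem (A) can also be sketched in Python. This is only an illustrative sketch and not the Splus listing referred to above: Python is not one of the platforms discussed in the computing note, and every function name below (`interval_table`, `coverage`, and so on) is hypothetical. It uses the usual textbook forms of the three intervals, with the Clopper-Pearson endpoints found by bisection on the binomial tail probabilities rather than from beta quantiles. Because the interval endpoints for x = 0, ..., n do not depend on π, they are computed once and the coverage at any π is the total binomial probability of those x whose interval contains π.

```python
from math import comb, sqrt

Z = 1.959963984540054  # 0.975 quantile of the standard normal


def binom_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)


def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(j, n, p) for j in range(x + 1))


def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def wald_interval(x, n, z=Z):
    """phat +/- z * sqrt(phat (1 - phat) / n)."""
    phat = x / n
    half = z * sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half


def score_interval(x, n, z=Z):
    """Set of p with |phat - p| <= z * sqrt(p (1 - p) / n)."""
    phat = x / n
    denom = 1 + z * z / n
    center = (phat + z * z / (2 * n)) / denom
    half = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def clopper_pearson_interval(x, n, alpha=0.05):
    """Endpoints solve P(X >= x) = alpha/2 and P(X <= x) = alpha/2."""
    lower = 0.0 if x == 0 else bisect(
        lambda p: 1 - binom_cdf(x - 1, n, p) - alpha / 2, 0.0, 1.0)
    upper = 1.0 if x == n else bisect(
        lambda p: binom_cdf(x, n, p) - alpha / 2, 0.0, 1.0)
    return lower, upper


def interval_table(method, n):
    """Intervals for every possible outcome x = 0, ..., n."""
    return [method(x, n) for x in range(n + 1)]


def coverage(table, p):
    """Probability, under true parameter p, that the interval contains p."""
    n = len(table) - 1
    return sum(binom_pmf(x, n, p)
               for x, (lo, hi) in enumerate(table) if lo <= p <= hi)


if __name__ == "__main__":
    n = 40
    methods = {"Wald": wald_interval, "Score": score_interval,
               "Clopper-Pearson": clopper_pearson_interval}
    for name, method in methods.items():
        table = interval_table(method, n)
        row = ", ".join(f"pi={p}: {coverage(table, p):.4f}"
                        for p in (0.05, 0.2, 0.5))
        print(f"{name:16s} {row}")
```

Evaluating `coverage(table, p)` over a fine grid of p values gives the points needed for a plot like the one requested; the values printed here are only a small tabulation.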

For an interesting comparison between an `Agresti-Coull (1998) confidence interval' advocated by the author of our text (see problem 1.24), versus the other standard intervals we are studying, and also versus a transformed Wald-interval (with 0.5 added to number of successes and failures) on logit scale, see figure.

Assignment 2. (HW due Wed., Feb. 26). Read Chapters 2, 3, and Sections 16.2, 16.4. Then solve and hand in all the following problems:

(A) Find and check using Splus or other statistical software an identity connecting the incomplete Gamma function (i.e., the cumulative distribution function of the Gamma distribution) with the tail probabilities of a Poisson distribution, where the Poisson and Gamma distribution parameters are related in a way you should specify. The method is to begin by finding the derivative with respect to the Poisson parameter of the Poisson distribution function, and then choose the Gamma parameters so that the density function is (up to sign) equal to that derivative.

(B) Agresti: Chapter 2, # 2.3, 2.8, 2.12, 2.21, 2.34, 2.37, Chapter 3, # 3.1, 3.6.
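The "check using software" part of problem (A) might look like the following Python sketch. It is an illustration under stated assumptions, not a model solution: the function names are hypothetical, and it assumes the standard form of the identity, namely that for N ~ Poisson(lambda) and integer k >= 0, P(N <= k) = 1 - G(lambda), where G is the distribution function of a Gamma variable with shape k+1 and scale 1. The Gamma distribution function is obtained by Simpson's rule rather than from a library routine, so that only the Python standard library is needed.

```python
from math import exp, factorial


def poisson_cdf(k, lam):
    """P(N <= k) for N ~ Poisson(lam)."""
    return sum(exp(-lam) * lam**j / factorial(j) for j in range(k + 1))


def gamma_density(t, shape):
    """Density at t of Gamma(shape, scale 1), for integer shape >= 1."""
    return t**(shape - 1) * exp(-t) / factorial(shape - 1)


def gamma_cdf(x, shape, steps=2000):
    """Gamma(shape, 1) distribution function at x by composite Simpson's rule.

    `steps` must be even; the integrand is smooth on [0, x] for integer shape.
    """
    h = x / steps
    total = gamma_density(0.0, shape) + gamma_density(x, shape)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * gamma_density(i * h, shape)
    return total * h / 3


def poisson_cdf_derivative(k, lam, h=1e-5):
    """Central-difference derivative of P(N <= k) with respect to lam."""
    return (poisson_cdf(k, lam + h) - poisson_cdf(k, lam - h)) / (2 * h)


if __name__ == "__main__":
    for k, lam in [(0, 0.5), (3, 2.0), (7, 10.0)]:
        lhs = poisson_cdf(k, lam)
        rhs = 1 - gamma_cdf(lam, k + 1)
        print(f"k={k}, lambda={lam}: Poisson cdf={lhs:.10f}, "
              f"1 - Gamma cdf={rhs:.10f}")
        # Derivative step of the method: d/d(lambda) P(N <= k) should equal
        # minus the Gamma(k+1, 1) density at lambda.
        print(f"   derivative={poisson_cdf_derivative(k, lam):.8f}, "
              f"-density={-gamma_density(lam, k + 1):.8f}")
```

The second printed line for each case checks the intermediate step described in the problem: the derivative of the Poisson distribution function in lambda matches the Gamma density up to sign.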

In order to experiment with calculation and comparison of Wald, Score, and Profile likelihood intervals, see Slog.
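A small Python sketch of such a comparison is given here for the simplest case, a single binomial proportion. This is an assumption made for illustration: the log referred to above may treat a different parameter, and the function names are hypothetical. With one parameter and no nuisance parameters, the profile likelihood is just the likelihood, so the third interval is the set of p whose likelihood-ratio statistic is at most the 0.95 chi-square(1) quantile, with endpoints found by bisection.

```python
from math import log, sqrt

Z = 1.959963984540054   # 0.975 quantile of the standard normal
CHISQ1_95 = Z * Z        # 0.95 quantile of chi-square with 1 df


def loglik(p, x, n):
    """Binomial log-likelihood up to a constant, with 0*log(0) taken as 0."""
    out = 0.0
    if x > 0:
        out += x * log(p)
    if x < n:
        out += (n - x) * log(1 - p)
    return out


def lr_stat(p, x, n):
    """Likelihood-ratio statistic 2 [l(phat) - l(p)]."""
    return 2 * (loglik(x / n, x, n) - loglik(p, x, n))


def bisect(f, lo, hi, tol=1e-12):
    """Root of f on [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def wald_interval(x, n):
    phat = x / n
    half = Z * sqrt(phat * (1 - phat) / n)
    return phat - half, phat + half


def score_interval(x, n):
    phat = x / n
    denom = 1 + Z * Z / n
    center = (phat + Z * Z / (2 * n)) / denom
    half = Z * sqrt(phat * (1 - phat) / n + Z * Z / (4 * n * n)) / denom
    return center - half, center + half


def lr_interval(x, n, eps=1e-12):
    """Endpoints where the likelihood-ratio statistic equals CHISQ1_95."""
    phat = x / n

    def g(p):
        return lr_stat(p, x, n) - CHISQ1_95

    lower = 0.0 if x == 0 else bisect(g, eps, phat)
    upper = 1.0 if x == n else bisect(g, phat, 1 - eps)
    return lower, upper


if __name__ == "__main__":
    x, n = 12, 40
    for name, method in [("Wald", wald_interval), ("Score", score_interval),
                         ("Likelihood ratio", lr_interval)]:
        lo, hi = method(x, n)
        print(f"{name:17s} ({lo:.4f}, {hi:.4f})")
```

The Wald interval is symmetric about the sample proportion by construction, while the other two are obtained by inverting a test, which is why they can behave sensibly when x is 0 or n.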

Assignment 3. (HW due Mon., Mar. 10). Finish Chapter 3; also read Chapter 14 through 14.3. Then solve and hand in the following problems: Agresti: Chapter 14, # 14.4, Chapter 2, # 2.30, Chapter 3, # 3.4, 3.14, 3.15, 3.35, 3.38. Also do problem 3.34, and derive the large-sample distribution for the case of Poisson sampling (3.34 counts as 2 problems). For calculations in Splus related to the Score, Profile, and (`exact') Cornfield (1956) confidence intervals, see HW23log. For other aspects of the solutions of the more theoretical problems in HW assignment 3, click here.

Assignment 4. (HW due Mon., Mar. 31). Read Chapter 4 through 4.7, and look at and try out SAS scripts, Appendix A of Agresti. Then solve and hand in the following problems: Agresti: Chapter 3, # 3.10, 3.11, 3.30; Chapter 4, # 4.2, 4.6, 4.7, 4.27.

Assignment 5. (HW due Fri., Apr. 11). Read through the examples of logistic regression fitting in Chapters 5 and 6, and SAS examples. Then solve and hand in the following:

(A) Create an Analysis of Deviance table for (i) Logistic Regression, and (ii) Probit Regression using the kyphosis data which you can find on the /~evs/~s798c/Data web-page. These data were discussed in an in-class handout, which can be found under the SASlogs link to file SASlog3.txt. As far as I know, the most economical way to do this is to do your PROC GENMOD regressions using forward or backward selection (with a very forgiving criterion for variable inclusion if SELECTION=FORWARD and a very restrictive criterion for variable retention if SELECTION=BACKWARD).

(B) Recall that you can get the data examples in the book by downloading them from Agresti's website. Do this for the horseshoe crab satellite data, and do a SAS run to get the fitted coefficient values for the negative binomial model mentioned in Exercise 4.11, p.160. Then answer the questions in this exercise.

(C) Chapter 5, # 5.1.

Assignment 6. (HW due Mon., May 12). Read Chapter 8 in conjunction with the lectures. Solve and hand in the following: Ch. 5 # 5.9, 5.33; Ch. 6 # 6.1, 6.7, 6.33; Ch. 8 # 8.1, 8.5, 8.19.

FINAL PROJECT ASSIGNMENT, due Wed., May 19 by 12 noon. As a final course project, you are to write a paper including at least 3 pages of narrative and at least 5 pages overall on an extended data analysis or case-study, based on a dataset of your choosing. The best sources are the StatLib web-site mentioned under Additional Computing Resources, or Agresti's book web-site. The paper should present a coherent and reasoned data analysis with supporting evidence for the model you choose to fit, the variables you choose to include and exclude, whatever indications you can provide for the adequacy of fit of the models, and a summary of what the model says about the generating mechanism of the data.

For two pdf-files with readings related to (Maximum Likelihood) estimation of parameters in Generalized Linear Mixed Models (GLMM's), specifically in random-intercept logistic regression models, you can look at:

(i) A handout from Stat 798c on ML estimation using the EM (Expectation-Maximization) algorithm along with MCMC (Markov Chain Monte Carlo) techniques.

(ii) A technical report (written by me for the Small Area Income and Poverty Estimates program at the Census Bureau) on numerical maximization of the random-intercept logistic regression model using the Adaptive Gaussian Quadratures method developed by Pinheiro and Bates (the authors of related nonlinear-model mixed-effect software in Splus).

SYLLABUS for Stat 770

1. Introduction--- binomial and multinomial probabilities, statistical tests, estimators and confidence intervals. Law of large numbers, central limit theorem, delta method, asymptotic normal distribution of maximum likelihood estimators, Wilks' Theorem.

2. Describing Contingency Tables--- models and measures of independence vs. association of categorical variables in multiway contingency tables. Hypotheses equating proportions for different variables. Conditional and marginal odds ratios and relative risks. Historical notes on contingency table analysis.

3. Statistics, hypothesis tests of independence, and confidence intervals for parameters in multiway contingency tables.

4. Generalized linear models. Formulation of conditional response probabilities as linear expressions in terms of covariables. Likelihood and inference. Quasilikelihood and estimating equations.

5. Logistic regression. Interpretation and inference on model parameters. Model fitting, prediction, and comparison.

6. Model-building including variable selection, diagnostics, and inference about variable associations in logistic regression models.

7. Logistic regression extensions for multinomial responses.

8. Loglinear models and their likelihood-based parametric statistical inference.

9. Generalized linear models with random effects. Likelihood and penalized likelihood based inference. Missing-data formulation.

10. Comparison of model-fitting strategies. Likelihood, quasilikelihood, penalized likelihood, Bayes.

Additional Computing Resources. There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib. Datasets needed in the course will either be posted to the course web-page, or indicated by links which will be provided here.


© Eric V Slud, Mar. 19, 2003.