Statistics 770  Analysis of Categorical Data
Spring 2003                                                                            MWF 10, Mth 1311
Instructor: Eric Slud, Statistics program, Math. Dept.
Office:  Mth 2314, x5-5469, email evs@math.umd.edu
Course Text: A. Agresti, Categorical Data Analysis, 2nd ed., 2002.

Overview: This course covers the statistical analysis of discrete data, cross-classified by and
modelled in terms of auxiliary covariate measurements which may be continuous or discrete.
Such data structures arise in a wide variety of fields of application, especially in the social and
biological sciences. The basic underlying model is the multinomial distribution, with cell-
probabilities parametrically restricted according to their array structure, with conditional
probability masses for a distinguished response variable often expressed linearly in terms of
covariates. Important models of this type (some of which generalize to the case of continuous
covariates) include logistic regression, other `generalized linear models', and loglinear models.
The modern approach to these topics involves estimation via likelihood-based methods or
generalizations to so-called quasilikelihood estimating equations, with emphasis on statistical
computing and model diagnostics. Methods covered in the course will be presented in terms of
theoretical properties, computational implementation, and real-data application.

Prerequisite: Stat 420 or Stat 700, plus some computing familiarity.
Course requirements and Grading: There will be 7 or 8 graded homework sets (one every
1 to 2 weeks), which together will count for 50% of the course grade. There will also be an
in-class test and a final data analysis/simulation course project (which you will summarize
in a 5-10 page paper plus graphical/numerical exhibits), each of which will count for 25%
of the course grade.
NOTE ON USE OF THEORETICAL MATERIAL.  Both in homeworks and the in-class test,
there will be theoretical material at the level of probability theory needed to apply the law
of large numbers and central limit theorem, along with the `delta method' (Taylor linearization)
and other manipulations at advanced-calculus level.
Course Coverage: Chapters 1-8 and 14 of the Agresti book, plus bits and pieces from
Chapters 9, 11, 12, 15, 16.
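
For reference, the `delta method' mentioned in the note above can be stated in one line; the statement and the logit example below are the standard ones, added here only as a reminder:

```latex
% Delta method: if \sqrt{n}(\hat\theta_n - \theta) converges in distribution
% to N(0,\sigma^2) and g is differentiable at \theta with g'(\theta) \neq 0, then
\[
  \sqrt{n}\,\bigl(g(\hat\theta_n)-g(\theta)\bigr)
  \;\xrightarrow{d}\; N\!\bigl(0,\;[g'(\theta)]^2\,\sigma^2\bigr).
\]
% Example: for the binomial sample proportion \hat p, \sigma^2 = p(1-p);
% taking g(p) = \log\{p/(1-p)\} gives g'(p) = 1/\{p(1-p)\}, so
\[
  \sqrt{n}\,\bigl(\mathrm{logit}(\hat p)-\mathrm{logit}(p)\bigr)
  \;\xrightarrow{d}\; N\!\bigl(0,\;1/\{p(1-p)\}\bigr).
\]
```
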
NOTE ON COMPUTING.  Both in the homework-sets and the course project, you will be
required to do computations on real datasets well beyond the scope of hand calculation.
Any of several statistical-computing platforms can be used to accomplish these:
Splus (3.4 or 6.0) or R, SAS, Minitab, Matlab, or SPSS, among others. You probably cannot
get by with just a spreadsheet program. Many other perfectly adequate statistical packages
exist, although support and help for them may be hard to come by on campus (and is not
available from me). If you are learning one of these packages for the first time, I recommend 
Splus or R (they are syntactically the same, and R is free), and I will provide links to free online
materials introducing them. The Agresti book gives scripts and illustrations in SAS. I will give
examples and scripts in Splus or R and if necessary will provide individual help in Splus, R,
SAS or Matlab. Of all of these packages, the one which would be most generally useful to
you in statistical practice would be SAS, while for methodological statistical research the
recommended platform is probably Splus or R, unless you are interested in large-dataset
numerical and computer-intensive topics, in which case the best choice might be SAS or Matlab.

COMPUTER ACCOUNTS.  Math, Stat, and Amsc graduate students have access to Splus
and Matlab under Unix through their mathnet accounts, and others can have access
through glue accounts. PC-based (student) versions of Minitab, Splus and Matlab are
available and not too expensive. R is freely available in Unix or PC form through this link.
SAS in a Unix environment is available to you free through a course campus-account on
the `cluster' machines which has been set up for you (with individual user-names and
initial passwords which I will distribute).

Getting Started in SAS.
Various pieces of information to help you get started in using SAS can be found under an
old (F01) course website  Stat430.  In particular you can find:
--- running SAS under Unix on cluster machines.
    Instructions and links are included there concerning a downloadable `script' enabling
    remote callup of SAS when you are running your cluster account remote from a campus
    WAM or mathnet or glue workstation.
--- an overview of the minimum necessary steps to use SAS from Mathnet.
--- links to stat430  problem assignments.
---  a series of SAS logs with edited outputs for illustrative examples.

SAS Illustrations and Logs Related to the Agresti book.
The Agresti text has an Appendix A describing software, including SAS scripts, which 
can be used to perform categorical data analyses. In addition, datasets can be downloaded
from Agresti's website. Several logs in SAS (with some comparative Splus analyses)  
doing illustrative data analyses and providing standard SAS scripts can be found here.

Test Topics and Coverage Related to In-Class Test April 14.  Also see  
directory of SASlogs, and a Sample Test. A small writeup of computational details
related to the first problem of the sample test (a corrected version of the one
discussed in class 4/09/03 and 4/11/03) can be found here.

Assignment 1. (First weeks of course, HW due Wed., Feb. 12). Read all of Chapter 1,
plus the first of the sections from the historical notes in Chapter 16. 

Then solve and hand in all of the following problems:
(A) Calculate the coverage probabilities of Wald, Clopper-Pearson, and Score-statistic
95% confidence intervals for a binomial parameter p in a sample of size n=40, for at
least three different values of the true parameter p. If you do this with a small program,
then try to produce a plot exactly analogous to Figure 1.3 on page 19 of Agresti. If you
have not yet mastered the capabilities of your software, then you may instead just tabulate
several values of the coverage probabilities for the three types of confidence interval.
Whichever of these displays you choose for your results: explain in detail why the
coverage probabilities are discontinuous functions of p, piecewise smooth but with
discontinuities at finitely many points, and identify the values of p at the discontinuity points.
(B) Agresti: Chapter 1, # 1.7, 1.8, 1.9, 1.12, 1.24.
A listing of Splus functions and commands used to solve problem (A) above
can be found here . You can also look at the resulting picture either for 
sample size n=40 as requested, or for n=100.
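
The computation in (A) can also be organized as follows: the exact coverage probability at a true value p is the sum of the binomial probabilities of those outcomes x whose interval contains p. A Python sketch of this idea (purely illustrative, since the course software is Splus/R; all function names below are mine):

```python
from math import comb, sqrt

Z = 1.959964      # approximate 97.5% standard normal quantile
ALPHA = 0.05

def pmf(x, n, p):
    """Binomial probability mass function."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def covers_wald(x, n, p):
    ph = x / n
    h = Z * sqrt(ph * (1 - ph) / n)
    return ph - h <= p <= ph + h

def covers_score(x, n, p):
    # the score (Wilson) interval contains p iff the score test accepts p
    ph = x / n
    return (ph - p)**2 <= Z**2 * p * (1 - p) / n

def covers_cp(x, n, p):
    # Clopper-Pearson contains p iff neither exact binomial tail test rejects
    lower_tail = sum(pmf(k, n, p) for k in range(0, x + 1))   # P(X <= x)
    upper_tail = sum(pmf(k, n, p) for k in range(x, n + 1))   # P(X >= x)
    return lower_tail >= ALPHA / 2 and upper_tail >= ALPHA / 2

def coverage(n, p, covers):
    """Exact coverage probability: sum pmf over outcomes whose CI contains p."""
    return sum(pmf(x, n, p) * covers(x, n, p) for x in range(n + 1))

for p in (0.1, 0.3, 0.5):
    print(p, [round(coverage(40, p, f), 4)
              for f in (covers_wald, covers_cp, covers_score)])
```

The score and Clopper-Pearson checks avoid computing interval endpoints explicitly, by testing whether the corresponding test of H0: p accepts; this also shows directly why coverage jumps exactly where an endpoint of some interval crosses p.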
For an interesting comparison between the `Agresti-Coull (1998) confidence interval'
advocated by the author of our text (see problem 1.24), versus the other standard
intervals we are studying, and also versus a transformed Wald interval (with 0.5 added
to the numbers of successes and failures) on the logit scale, see figure.
Assignment 2. (HW due Wed., Feb. 26). Read Chapters 2, 3, and Sections 16.2, 16.4.
              
Then solve and hand in all the following problems:
              
(A) Find, and check using Splus or other statistical software, an identity connecting the
incomplete Gamma function (i.e., the cumulative distribution function of the
Gamma distribution) with the tail probabilities of a Poisson distribution, where the
Poisson and Gamma distribution parameters are related in a way you should specify.
The method is to begin by finding the derivative with respect to the Poisson parameter
of the Poisson distribution function, and then choose the Gamma parameters so that
the Gamma density function is (up to sign) equal to that derivative.
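
The identity obtained this way relates a Poisson lower-tail probability to a Gamma upper-tail probability: P(Poisson(lambda) <= k) = P(Gamma(shape k+1, rate 1) > lambda). A quick numerical check of this form, sketched in Python with Simpson's rule standing in for Splus's pgamma (names and tolerances are mine):

```python
from math import exp, factorial

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), by direct summation."""
    return sum(exp(-lam) * lam**j / factorial(j) for j in range(k + 1))

def gamma_tail(shape, lam, upper=80.0, steps=20000):
    """P(T > lam) for T ~ Gamma(integer shape, rate 1), by Simpson's rule
    on the density over [lam, upper]; steps must be even."""
    f = lambda t: t**(shape - 1) * exp(-t) / factorial(shape - 1)
    h = (upper - lam) / steps
    s = f(lam) + f(upper)
    for i in range(1, steps):
        s += f(lam + i * h) * (4 if i % 2 else 2)
    return s * h / 3

for k, lam in [(3, 2.5), (7, 10.0)]:
    print(k, lam, poisson_cdf(k, lam), gamma_tail(k + 1, lam))
```

The two columns agree to numerical-integration accuracy, which is exactly the check the assignment asks you to carry out in statistical software.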

(B) Agresti: Chapter 2, # 2.3, 2.8, 2.12, 2.21, 2.34, 2.37,   Chapter 3, # 3.1, 3.6.
In order to experiment with calculation and comparison of Wald, Score, and Profile 
likelihood  intervals, see Slog.
Assignment 3. (HW due Mon., Mar. 10). Finish Chapter 3; also read Chapter 14
through 14.3. Then solve and hand in the following problems:
              
Agresti: Chapter 14, # 14.4,  Chapter 2, # 2.30,  Chapter 3, # 3.4, 3.14, 3.15, 
     3.35, 3.38. Also do problem 3.34, and derive the large-sample distribution for 
     the case of Poisson sampling (3.34 counts as 2 problems). 

For calculations in Splus related to the Score, Profile, and (`exact') Cornfield (1956)
confidence intervals, see HW23log. For other aspects of the solutions of the more
theoretical problems in HW assignment 3, click here.
Assignment 4. (HW due Mon., Mar. 31). Read Chapter 4 through 4.7, and look at and try 
out SAS scripts, Appendix A of Agresti. Then solve and hand in the following problems:
              
Agresti: Chapter 3, # 3.10, 3.11, 3.30; Chapter 4, # 4.2, 4.6, 4.7, 4.27.
Assignment 5. (HW due Fri., Apr. 11). Read through the examples of logistic regression  
fitting in Chapters 5 and 6, and SAS examples. Then solve and hand in the following:
(A)  Create an Analysis of Deviance table for (i) Logistic Regression, and (ii) Probit   
Regression using the kyphosis data which you can find on the  /~evs/~s798c/Data   
web-page.  These data were discussed in an in-class handout, which can be found under  
the SASlogs link to file SASlog3.txt. As far as I know, the most economical way to do
this is to do your PROC GENMOD regressions using forward or backward selection (with
a very forgiving criterion for variable inclusion if SELECTION=FORWARD, and a very
restrictive criterion for variable retention if SELECTION=BACKWARD).
(B) Recall that you can get the data examples in the book by downloading them from under  
Agresti's website. Do this for the horseshoe crab satellite data, and do a SAS run to get the   
fitted coefficient values for the negative binomial model mentioned in Exercise 4.11, p.160.   
Then answer the questions in this exercise.
(C) Chapter 5, # 5.1.
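
For those not using SAS, the structure of the analysis-of-deviance comparison in (A) can be sketched in a few lines. The toy data, the Newton-Raphson fitter, and all names below are my own illustration (not the kyphosis data and not PROC GENMOD output):

```python
from math import exp, log

# toy binary data (hypothetical; stands in for a real dataset)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0,   0,   0,   1,   0,   1,   1,   1]

def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson for logit P(y=1) = a + b*x; 2x2 system solved directly."""
    a = b = 0.0
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0     # gradient and Hessian terms
        for x, y in zip(xs, ys):
            p = 1 / (1 + exp(-(a + b * x)))
            w = p * (1 - p)
            ga += y - p
            gb += (y - p) * x
            haa += w; hab += w * x; hbb += w * x * x
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b

def deviance(xs, ys, a, b):
    # for ungrouped binary y the saturated log-likelihood is 0, so D = -2 log L
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1 / (1 + exp(-(a + b * x)))
        ll += y * log(p) + (1 - y) * log(1 - p)
    return -2 * ll

# null model: intercept only, so every fitted probability is the sample mean
pbar = sum(ys) / len(ys)
d_null = -2 * sum(y * log(pbar) + (1 - y) * log(1 - pbar) for y in ys)
a, b = fit_logistic(xs, ys)
d_model = deviance(xs, ys, a, b)
print("null deviance:", d_null, " residual deviance:", d_model,
      " chi-square for x:", d_null - d_model)
```

The drop in deviance from the null model to the model including x is the likelihood-ratio chi-square statistic on 1 degree of freedom; an analysis-of-deviance table simply records this drop for each term as it is added.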
Assignment 6. (HW due Mon., May 12). Read Chapter 8 in conjunction with the lectures.
Solve and hand in the following: Ch. 5 # 5.9, 5.33; Ch. 6 # 6.1, 6.7, 6.33; Ch. 8 # 8.1, 8.5, 8.19.
FINAL PROJECT ASSIGNMENT, due Wed., May 19 by 12 noon. As a final course project,  
you are to write a paper including at least 3 pages of narrative and at least 5 pages overall on    
an extended data analysis or case-study, based on a dataset of your choosing. The best sources   
are the StatLib web-site mentioned below (under Additional Computing Resources), or
Agresti's book web-site. The paper should
present a coherent and reasoned data analysis with supporting evidence for the model you    
choose to fit, the variables you choose to include and exclude, whatever indications you can     
provide for the adequacy of fit of the models, and a summary of what the model says about the   
generating mechanism of the data. 



For two pdf-files with readings related to (Maximum Likelihood) estimation of parameters     
in Generalized Linear Mixed Models (GLMM's), specifically in random-intercept logistic     
regression models, you can look at:
   (i)  A handout from Stat 798c on ML estimation using the EM (Expectation-Maximization)
algorithm along with MCMC (Markov Chain Monte Carlo) techniques.
     (ii) A technical report (written by me for the Small Area Income and Poverty Estimates      
program at the Census Bureau) on numerical maximization of the random-intercept logistic    
regression model using the adaptive Gaussian quadrature method developed by Pinheiro
and Bates (the authors of related nonlinear-model mixed-effect software in Splus).



                    SYLLABUS for Stat 770
1. Introduction --- binomial and multinomial probabilities, statistical tests, estimators
     and confidence intervals. Law of large numbers, central limit theorem, delta method,
     asymptotic normal distribution of maximum likelihood estimators, Wilks' Theorem.
2. Describing Contingency Tables --- models and measures of independence vs.
     association of categorical variables in multiway contingency tables. Hypotheses
     equating proportions for different variables. Conditional and marginal odds ratios
     and relative risks. Historical notes on contingency table analysis.
3.  Statistics, hypothesis tests of independence, and confidence intervals for parameters 
     in multiway contingency tables.
4. Generalized linear models. Formulation of conditional response probabilities as 
    linear expressions in terms of covariables. Likelihood and inference. 
    Quasilikelihood and estimating equations.
5. Logistic regression.  Interpretation and inference on model parameters. Model fitting, 
    prediction, and comparison.
6. Model-building including variable selection, diagnostics and inference about 
    variable associations in logistic regression models.
7. Logistic regression extensions for multinomial responses.
8. Loglinear models and their likelihood-based parametric statistical inference.
9. Generalized linear models with random effects. Likelihood and penalized likelihood 
    based inference. Missing-data formulation.
10. Comparison of model-fitting strategies. Likelihood, quasilikelihood, penalized 
    likelihood, Bayes.

Additional Computing Resources.  There are many publicly available datasets
for practice data-analyses. Many of them are taken from journal articles and/or
textbooks and documented or interpreted. A good place to start is Statlib .
Datasets needed in the course will either be posted to the course web-page,
or indicated by links which will be provided here.

The UMCP Math Department home page.
The University of Maryland home page.
My home page.
 Eric V Slud, Mar. 19, 2003.