**Instructor:** Eric Slud, Statistics program, Math. Dept.

**Office:** Mth 2314, x5-5469, email evs@math.umd.edu, **Office Hours:** M11, W1,
or by appointment

**Course Text:** A. Agresti, *Categorical Data Analysis*, 3rd ed. 2013.

**We will NOT have an in-class test this semester.**

Please fill out the on-line Evaluation form for this
course and instructor at http://CourseEvalUM.umd.edu.
**Thank you.**

**Overview:** This course covers the statistical analysis of discrete data, cross-classified by and modeled in terms of auxiliary covariate measurements which may be continuous or discrete. Such data structures arise in a wide variety of fields of application, especially in the social and biological sciences. The basic underlying model is the multinomial distribution, with cell-probabilities parametrically
restricted according to their array structure, with conditional probability masses for a distinguished
response variable often expressed linearly in terms of covariates. Important models of this type (some
of which generalize to the case of continuous covariates) include logistic regression, other `generalized
linear models', and loglinear models. The modern approach to these topics involves estimation via
likelihood-based methods or generalizations to so-called quasilikelihood estimating equations, with
emphasis on statistical computing and model diagnostics. In addition, computational advances have made
categorical data models with random effects tractable to estimate and interpret, and Bayesian and
empirical-Bayes methods are an important part of the material included in the new edition of the
Agresti text. Methods covered in the course will be presented in terms of theoretical properties,
computational implementation (primarily in **R**), and real-data application.

**NOTE ON USE OF THEORETICAL MATERIAL.** Both in homeworks and the course project, there will
be theoretical material at the level of probability theory needed to apply the law of large numbers and
central limit theorem, along with the `delta method' (Taylor linearization) and other manipulations at
advanced-calculus level.

**Prerequisite:** Stat 420 or Stat 700, plus some computing familiarity.

**Course requirements and Grading:** There will be 7 graded homework sets (one every 2 weeks),
which together will count for 50% of the course grade. There will also be an in-class test (in early November)
and a final data analysis/simulation course project (which you will summarize in a 5-10 page paper plus
graphical/numerical exhibits), each of which will count for 25% of the course grade.

NOTE: Since the in-class test will not be given this semester, the course grade will be based 2/3 on Homework
and 1/3 on the Final Paper.

**Course Coverage:** Chapters 1-6, 8-9 and 13 of the Agresti book, plus bits and pieces from Chapters
7, 16, and 17.

**NOTE ON COMPUTING.** Both in the homework sets and the course project, you will be required
to do computations on real datasets well beyond the scope of hand calculation or spreadsheet programs.
Any of several statistical-computing platforms can be used to accomplish these: **R**, SAS, Minitab,
Matlab, SPSS, or others. If you are learning one of these packages for the first time, I recommend
**R**, which is free and open-source and is the most flexible and useful for research statisticians.
I will provide links to free online **R** tutorials, along with examples and scripts, and will
offer some **R** help. The Agresti book gives scripts and illustrations in SAS.

**Getting Started in R and SAS.** Lots of introductory R materials can be found on last year's
STAT 705 website. Another free and interactive site I
recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.

Various pieces of information to help you get started in using SAS can be found under an old (F09) course
website Stat430. In particular you can find:

--- an overview of the minimum necessary steps to use SAS from Mathnet.

--- a series of SAS logs with edited outputs for illustrative examples.

The Agresti text has an Appendix A describing software, including SAS scripts, which can be used to perform categorical data analyses. In addition, datasets can be downloaded from Agresti's website. Several SAS logs (with some comparative Splus analyses) carrying out illustrative data analyses and including standard SAS scripts can be found here.

**Test Topics and Coverage:** you can get an idea of these from In-Class Test from April 14, 2003. Also see a directory
of SASlogs and a
Sample Test. A small writeup of computational
details related to the first problem of the sample test can be found here.

**Notes and Guidelines.** Homeworks should be handed in as hard copy in class, except for occasional
due-dates on Fridays, when you may submit them electronically, via email, in pdf format. Solutions will
usually be posted, and a percentage deduction of the overall HW score will generally be made for late papers.

Assignment 1. (First 2 weeks of course, HW due Wed., Sep. 12). Read all of Chapter 1, plus the first of the sections from the historical notes in Chap. 17. Then solve and hand in all of the following problems:

**(A).**

**(B).** Suppose X_{i} are independent and identically distributed
(*iid*) discrete random variables with values in {1,...,K} and probability mass function P(X_{i}=k) =
p_{k}. Find the joint probability distribution of (N_{k}, k=1,...,K), where
N_{k} = Σ_{i=1}^{N} I_{[X_{i}=k]} and
N ~ Poisson(n) is independent of {X_{i}: i=1,2,...}.
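As a numerical sanity check on (B) (not a substitute for the derivation), a short R simulation illustrates the Poisson-splitting phenomenon: with N ~ Poisson(n), the counts N_k behave as independent Poisson(n·p_k) variables. The cell probabilities below are illustrative choices, not part of the problem.

```r
set.seed(1)
n <- 1000; p <- c(0.2, 0.3, 0.5)   # illustrative K = 3 cell probabilities
reps <- 2000
counts <- t(replicate(reps, {
  N <- rpois(1, n)                                # random Poisson total
  tabulate(sample(1:3, N, replace = TRUE, prob = p), nbins = 3)
}))
colMeans(counts)        # each close to n * p_k
apply(counts, 2, var)   # also close to n * p_k (Poisson mean = variance)
round(cor(counts), 2)   # off-diagonals near 0, consistent with independence
```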

**(C).**

(b) Let the m black balls position themselves among N bins as in (a). But now suppose that, given the positions
of the black balls, the positions for the n white balls are chosen in such a way that the odds for each white ball
to fall in a bin occupied by a black ball are multiplied by a factor e^{θ} as compared with the
odds of falling in a bin not occupied by a black ball. (*A more precise way to say or model this question is
as follows: (i) suppose that for some fixed probability p , for each bin j=1,...,N independently
of the others, a black ball is placed in bin j; but that (ii) we condition on the total number of bins containing
black balls being equal to m; and (iii) suppose that each bin not containing a black ball independently of all
other bins has probability q of receiving a white ball, while each bin containing a black
ball independently of all other bins has probability q ^{*} [defined to satisfy:
q^{*}/(1-q^{*}) = e^{θ} q/(1-q) ], of receiving a white ball; but
that (iv) we condition on the total number of bins containing white balls being equal to n.*) Now what is
the probability distribution of the number X of bins occupied by both a black and a white ball?

**(D).**

**A listing of R functions and commands that can be used to solve problem (A) above can be found here. You can also look at the resulting picture either for
sample size n=40 as requested, or
for n=100.**

For an interesting comparison between an `Agresti-Coull (1998) confidence interval' advocated by the author of our text (see problem 1.25), versus the other standard intervals we are studying, and also versus a transformed Wald interval (with 0.5 added to the numbers of successes and failures) on the logit scale, see this picture.

Assignment 2. (Second 2 weeks of course, HW due Wed., Sep. 26). Read all of Chapter 2, plus Chap. 3 Sections 3.1-3.2. Then solve and hand in the following problems:

**(A).** Consider the data in Table 2.8 of the Agresti book, on page 57, and assume that the data sampled
in each Age group are independent, identically distributed draws from a well-defined population of employed people.
Find a confidence interval for the fraction of each age-defined subpopulation that is not satisfied (i.e., falls in
Job-satisfaction category 1), according to each of the following methods: (i) Wald, (ii) Inverted Score-test,
(iii) Clopper-Pearson, (iv) Likelihood Ratio Test, and (v) Bayesian credible interval based on a Beta(1,1) =
Uniform prior distribution on the unknown fraction p.
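For orientation, here is a minimal R sketch of the five interval types in (A), applied to hypothetical counts x = 34 "not satisfied" out of n = 100 (these are NOT the Table 2.8 numbers, which you must extract yourself). Note that `prop.test` without continuity correction inverts the score test (Wilson interval), and `binom.test` gives the Clopper-Pearson interval.

```r
x <- 34; n <- 100; phat <- x/n; z <- qnorm(0.975)

# (i) Wald
phat + c(-1, 1) * z * sqrt(phat * (1 - phat) / n)

# (ii) Inverted score test (Wilson)
prop.test(x, n, correct = FALSE)$conf.int

# (iii) Clopper-Pearson ("exact")
binom.test(x, n)$conf.int

# (iv) Likelihood ratio: solve 2*[l(phat) - l(p)] = qchisq(.95, 1) numerically
loglik <- function(p) x * log(p) + (n - x) * log(1 - p)
lrt <- function(p) 2 * (loglik(phat) - loglik(p)) - qchisq(0.95, 1)
c(uniroot(lrt, c(1e-6, phat))$root, uniroot(lrt, c(phat, 1 - 1e-6))$root)

# (v) Bayesian equal-tailed credible interval under a Beta(1,1) prior
qbeta(c(0.025, 0.975), x + 1, n - x + 1)
```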

**(B).** Establish a large-sample CI for log relative risk log(p_{1}/p_{2}) using the Delta
method, in the setting of independent observations X_{j} ~ Binom(n_{j}, p_{j}) for j=1,2, and
apply it to the data in Table 2.1 on Aspirin and Heart Attack Study data on p.38, in two ways: (i) to find a 90% CI
for log RR of fatal heart attack for those on placebo relative to those taking aspirin, and (ii) to find a 95% CI
for log RR for incidence of Heart Attack (whether fatal or not) for those on placebo relative to those taking aspirin.
Assume that the placebo patients and those taking aspirin were sampled independently and equiprobably from large
general populations.
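The delta-method interval in (B) can be sketched as a small R function: by Taylor linearization, Var(log p̂_j) ≈ (1 − p_j)/(n_j p_j). The counts in the example call are placeholders, not the Table 2.1 data, which you should substitute yourself.

```r
# Delta-method CI for log relative risk log(p1/p2), with placeholder counts
# x1/n1 (placebo) and x2/n2 (aspirin); substitute the actual Table 2.1 counts.
logRR.ci <- function(x1, n1, x2, n2, level = 0.90) {
  p1 <- x1 / n1; p2 <- x2 / n2
  est <- log(p1 / p2)
  # Var(log phat_j) ~= (1 - p_j) / (n_j * p_j) by the delta method
  se  <- sqrt((1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2))
  est + c(-1, 1) * qnorm(1 - (1 - level) / 2) * se
}
logRR.ci(30, 1000, 15, 1000)   # hypothetical example counts, 90% CI
```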

**(C).** The following data from a paper by Helmes and Fekken (1986 Jour. Clin. Psych. **42**, 569-576)
classifies a sample of psychiatric patients by their diagnoses and whether their treatment prescribed drugs:

Drugs

No Drugs

Using these data and assuming the rows were sampled independently and iid, (a) conduct a test of (row vs. column)
independence and interpret the P-value; (b) obtain standardized residuals and interpret them; and (c) partition chi-squared
into three components to describe differences and similarities among the diagnoses, by comparing (i) the first two columns,
(ii) the 3rd and 4th columns, and (iii) the last column to the combination of the first two columns and the combination of the 3rd
and 4th columns.
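The R machinery for parts (a) and (b) is essentially two calls; the table below is a hypothetical stand-in, since the Helmes-Fekken counts are not reproduced on this page.

```r
# Hypothetical 2x5 treatment-by-diagnosis table (NOT the Helmes-Fekken data)
tab <- matrix(c(30, 25, 20, 15, 10,
                20, 25, 30, 35, 40),
              nrow = 2, byrow = TRUE,
              dimnames = list(treatment = c("Drugs", "No Drugs"),
                              diagnosis = paste0("D", 1:5)))
out <- chisq.test(tab)
out$statistic; out$p.value   # (a) Pearson chi-squared test of independence
round(out$stdres, 2)         # (b) standardized (adjusted) residuals
```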

**(D).** Of the 12 candidates for 4 managerial positions, 6 are female and 6 male. Denote the females
F1,...,F6 and the males M1,...,M6. The actual result of choosing the managers is (F2,M1,M4,M6). (i) How many possible
without-replacement samples were there? Construct the contingency table for the without-replacement sample actually
obtained. (ii) Find and interpret the P-value for Fisher's Exact Test of the hypothesis that all candidates were
equally likely to be selected. (iii) Re-do the same problem if there were 100 candidates (50 male and 50 female) for
the four managerial positions.
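As a sketch of the machinery for (i)-(iii): the selection (F2,M1,M4,M6) contains 1 female and 3 males, `choose(12, 4)` counts the equally likely without-replacement samples, and `fisher.test` gives the exact P-value. Part (iii) needs only the corresponding 50-female/50-male table.

```r
choose(12, 4)                           # number of without-replacement samples
tab <- matrix(c(1, 5,                   # females: chosen, not chosen
                3, 3),                  # males:   chosen, not chosen
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("F", "M"), chosen = c("yes", "no")))
fisher.test(tab)$p.value                # two-sided Fisher exact P-value
# (iii): 50 females and 50 males, same observed split of the 4 positions
fisher.test(matrix(c(1, 49, 3, 47), nrow = 2, byrow = TRUE))$p.value
```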

**(E).** Find the likelihood ratio and chi-squared test statistics in a 3x3 table
for the hypothesis H_0: p_{jk} ∝ exp(a_j + b_k + c·jk) versus the general alternative.
Find and interpret the P-values for these statistics in the 3x3 table with first row (3,7,18), second row
(5,18,17), and 3rd row (9,35,42). *Note: you should find the likelihood-maximizing values â,
b̂ and ĉ under H_{0}, either using a numerical-maximizing function like*
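One route to the machinery in (E): the null model p_{jk} ∝ exp(a_j + b_k + c·jk) is the loglinear "uniform association" model, which can be fitted as a Poisson GLM; its deviance and Pearson statistics then test it against the saturated (general) alternative. This is a sketch of the approach, not a full write-up of the problem.

```r
y <- c(3, 7, 18,  5, 18, 17,  9, 35, 42)    # table entries, row by row
j <- rep(1:3, each = 3); k <- rep(1:3, times = 3)
fit <- glm(y ~ factor(j) + factor(k) + I(j * k), family = poisson)
c(G2 = deviance(fit),                        # likelihood-ratio statistic
  X2 = sum(residuals(fit, type = "pearson")^2),
  df = df.residual(fit))                     # df = (3-1)(3-1) - 1 = 3
pchisq(deviance(fit), df.residual(fit), lower.tail = FALSE)   # LRT P-value
```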

Assignment 3. (Third 2 weeks of course, HW due Mon.,
Oct. 15). Read additional sections of Chapter 3, plus Chapter 4 and the first few sections of Chapter 5.
Then solve and hand in the following problems (7 in all): # 3.21 [also do as part (c) of this problem: same as
part (b) but with **ratio** of proportions]; # 4.10, 4.12, 4.20, 4.33 and in addition

**(I).** Fit a logistic regression model to the Crabs mating data with outcome variable (Crabs$sat > 1),
using the predictors: spine (as factor), weight (rounded to nearest 0.2 kg), and width (rounded to
nearest cm). You may use interactions if they help. Fit the best model you can, and assess the quality of fit
of your best model using the techniques in Sec. 5.2.

**(II).** For the "best" model you fit in problem (I): (a) fit the coefficients directly (by a likelihood
calculation that you code in R using "optim" with method="BFGS") and also by coding the Fisher-scoring algorithm
and using 5 or 10 iterations (starting from all coef's = 0), and (b) check that the SEs for coefficients found
by "glm" are close to those found from your observed information matrix estimates calculated in (a).

Assignment 4. (Fourth 2 weeks of course, HW due Mon., Nov.5). Read the rest of Chapters 5 and 6, plus sections 9.1--9.3. Then solve and hand in the following problems (7 in all): 5.2 (parts f and i only), 5.6, 5.9, 5.10, 5.30 (parts c and d only), 5.39, 6.28.

Assignment 5, due Mon., Nov. 19. Note the sections 6.3.3 and 6.3.4 on predictive power, included in the previous reading assignment. The reading for this 2-week period consists of Chapter 9 -- which we have already been covering in class -- along with a comparison of the methods of fitting these and the generalized linear models (Newton-Raphson, Fisher Scoring, Iteratively Reweighted Least Squares) we have studied so far. The computational topics are in sections 9.6, 9.7, and 4.6. We will then discuss some of the models in Chapter 8 (Secs. 8.1, 8.2, 8.5). Then solve and hand in the following problems (7 in all): 6.8, 6.10, 9.2, 9.3, 9.16 (d and e only), 9.30, 9.34.

Assignment 6, due Wed., Dec. 12. The reading
for the last part of the course consists of Chapter 13, Chapter 14 sections 14.3 and 14.5, the
material on **glmmBUGS** linked in the Handouts section (5) below, and Mini-Course notes I wrote on
MCMC Lecture 1 and Lecture 2. For a quicker and more focused introduction,
here are Notes on MCMC for random-intercept
logistic regression.

**(I).** (a) Build a simple, reasonable model using fixed-effect logistic regression for the outcome
class="malignant" in the dataset "biopsy" in package MASS, using only the first biopsies for the 645 unique IDs,
and ignoring variable V6.

For all three parts, fit the model and interpret/compare the coefficients and quality of fit.

**(II).** Consider a dataset with a three-level outcome Y (=0,1 or 2) and a predictor X
generated artificially with the model

using the commands

set.seed(7010)
Xvec = rnorm(5000)
Ymat = array(0, c(5000,3))
Ymat[,1] = rbinom(5000, 1, exp(-1.5+0.6*Xvec)/(1+exp(-1.5+0.6*Xvec)+exp(-.5+0.1*Xvec)))
Ymat[,2] = rbinom(5000, 1-Ymat[,1], plogis(-.5+0.1*Xvec))
Ymat[,3] = 1 - Ymat[,1] - Ymat[,2]
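As a check on this construction (not required by the problem): the three steps generate exactly a baseline-category logit model with the third category as baseline, so `multinom` from the recommended package `nnet` should approximately recover the coefficient pairs (-1.5, 0.6) and (-0.5, 0.1). The data are regenerated here so the snippet is self-contained.

```r
library(nnet)
set.seed(7010)
Xvec <- rnorm(5000)
Ymat <- array(0, c(5000, 3))
Ymat[, 1] <- rbinom(5000, 1, exp(-1.5 + 0.6*Xvec) /
                      (1 + exp(-1.5 + 0.6*Xvec) + exp(-0.5 + 0.1*Xvec)))
Ymat[, 2] <- rbinom(5000, 1 - Ymat[, 1], plogis(-0.5 + 0.1*Xvec))
Ymat[, 3] <- 1 - Ymat[, 1] - Ymat[, 2]
Y <- factor(Ymat %*% c(1, 2, 3), labels = 0:2)       # Y = 0, 1, or 2
fit <- multinom(relevel(Y, ref = "2") ~ Xvec, trace = FALSE)
coef(fit)   # rows for categories 0 and 1 vs baseline 2:
            # approximately (-1.5, 0.6) and (-0.5, 0.1)
```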

**(III).** Find and display the posterior densities of four coefficients [Intercept and District2,
District3, and District4] in the Bayesian Poisson (fixed-effects) regression of Claims in the dataset
"Insurance" within the R package MASS, using first-principles MCMC code following the ideas of the
logistic-regression Metropolis-Hastings code given in the class handout. Note that you should specify
the Poisson regression with an "offset" term of log(Holders) added to the Intercept.
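A first-principles random-walk Metropolis sampler for a Poisson regression with offset looks like the sketch below, which uses simulated data and a flat prior on the coefficients; it illustrates the idea only and is NOT the class handout's code or the Insurance analysis itself.

```r
set.seed(3)
n <- 200
X <- cbind(1, rnorm(n))                  # intercept + one covariate
off <- log(runif(n, 50, 150))            # offset, playing the role of log(Holders)
beta.true <- c(-2, 0.5)
y <- rpois(n, exp(off + X %*% beta.true))

logpost <- function(b)                   # log-posterior = log-likelihood (flat prior)
  sum(dpois(y, exp(off + X %*% b), log = TRUE))

B <- 5000
draws <- matrix(NA_real_, B, 2)
beta <- c(0, 0); lp <- logpost(beta)
for (it in 1:B) {
  prop <- beta + rnorm(2, 0, 0.05)       # random-walk proposal (sd tuned by hand)
  lp.prop <- logpost(prop)
  if (log(runif(1)) < lp.prop - lp) {    # Metropolis accept/reject step
    beta <- prop; lp <- lp.prop
  }
  draws[it, ] <- beta
}
post <- draws[-(1:1000), ]               # drop burn-in
colMeans(post)                           # posterior means, near beta.true
quantile(post[, 2], c(0.025, 0.975))     # 95% credible interval for the slope
```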

**(IV).** Use a package (glmmBUGS or rjags), or else a direct Metropolis-Hastings MCMC implementation as
in the script **BayesLgst.RLog** covered in class, to do a Bayesian analysis of the random-intercept-effect standard
deviation in the "biopsy" dataset (using only the first biopsy for each unique ID, and omitting V6), using a
mixed-effect generalized linear model (logistic regression with fixed effects plus a random intercept).

FINAL PROJECT ASSIGNMENT, due Friday, December 14, 2018, 5pm. As a final course
project, you are to write a paper including 5-10 pages of narrative, plus relevant code and graphical or tabular
exhibits, on a statistical journal article related to the course or else a data analysis or case-study based on
a dataset of your choosing. The guideline is that the paper should be 10 pages if it is primarily expository
based on an article, but could have somewhat fewer pages of narrative if based on a data-analytic case study.
However, for the latter kind of paper, all numerical outputs should be accompanied by code used to generate them,
plus discussion and interpretation of software outputs and graphical exhibits. For a data-analysis or case study,
the paper should present a coherent and reasoned data analysis with supporting evidence for the model you choose
to fit, the variables you choose to include and exclude, whatever indications you can provide for the adequacy
of fit of the models, and a summary of what the model says about the generating mechanism of the data.

Two good sources of data for the paper are the StatLib web-site mentioned below,
or Agresti's book web-site.

Possible topics for the paper include: **(a)** Zero-inflated Poisson regression models, based on an original paper
of Lambert but discussed in connection with the Horseshoe Crabs dataset in a web-page posted by Agresti (indexed
also under heading 2. in his book web-page).

**(b)** The relationship between individual heterogeneity and
overdispersion and prediction, in a very nice article connected with a court case in the Netherlands
mentioned as a Handout (number (3) under Handouts section below).

**(c)** Discussion of `raking' in connection with survey-weighted contingency tables, and extension of the Iterative Proportional Fitting Algorithm covered in the loglinear chapter in the course.

**(d)** I mentioned in class that those of you with interests in Educational Statistics might consider covering some article or book-chapter in categorical data analysis related to Item Response Theory modeling, such as the article *Semiparametric Estimation in the Rasch Model and Related Exponential Response Models, Including a Simple Latent Class Model for Item Analysis*, by Bruce Lindsay, Clifford C. Clogg, John Grego, in the Journal of the American Statistical Association, Vol. 86, No. 413 (Mar., 1991), pp. 96-107, http://www.jstor.org/stable/2289719.

**(e)** You might also base your paper on discussion of an R package, with a data
illustration, doing analyses related to some categorical data structure not covered in detail in the class, such as multicategory generalizations of (fixed-effect) logistic or probit regression, ordinal-outcome categorical modeling, or `social choice' modeling.

Two pdf-handout files contain readings related to (Maximum Likelihood) estimation of parameters in Generalized Linear Mixed Models (GLMM's), specifically in random-intercept logistic regression models:

(i) A handout from Stat 705 on ML estimation using the EM (Expectation-Maximization) algorithm along with another on MCMC (Markov Chain Monte Carlo) techniques.

(ii) A technical report (written by me for the Small Area Income and Poverty Estimates program at the Census Bureau) on numerical maximization of the random-intercept logistic regression model using the Adaptive Gaussian Quadratures method developed by Pinheiro and Bates (the authors of related nonlinear-model mixed-effect software in R later adapted to NLMIXED in SAS).

**(1).** Proof of limiting distribution
of multinomial Chi-square goodness of fit test statistic.

**(2).** See the directory Survey Confidence Intervals for two papers
and a supplement on the extension of binomial confidence intervals to unknown proportions estimated in complex
surveys. The JSM 2014 paper, published in the American Statistical Association Proceedings of the Survey Research
and Methods Section from the 2014 annual statistical meetings, contains the results of a simulation study showing
the relative merits of various binomial-proportion Confidence Intervals adapted to complex survey data. The other
paper and supplement, which extends the simulation study and improves the way the intervals are adapted to survey
data, will appear soon in the Journal of Survey Statistics and Methodology.

**(3).** A very interesting case-study on a criminal case in the Netherlands and
the importance of accounting for overdispersion in doing justice to a criminal defendant. The case study is authored
by the eminent Dutch statisticians Gill, Groeneboom, and de Jong. The math is very accessible and the point very clear.

**(4).** Over the course of the semester, we will have a series of R scripts in the directory Rscripts. The first of these can already be viewed here.

**(5).** Several R packages for fitting Generalized Linear Mixed Models (particularly, binomial- and Poisson-family
random-intercept models) were mentioned in a script-file covered in class. Some only
approximate the GLMM log-likelihood using Monte Carlo techniques, such as glmm or glmmBUGS, while others (which are most useful
for the relatively simple random-intercept models arising in the applications in Agresti) calculate the
log-likelihoods as accurately as desired using Adaptive Gaussian Quadrature (AGQ): these include lme4, glmmML, and GLMMadaptive, and these can also be checked against my own code in the workspace Rscripts. Also see an exposition of AGQ that I wrote in a random-intercept logistic-regression context, which should be accessible and useful to students in this course.

**(6).** Another package that can be used to fit multiple-outcome ("generalized-logistic" or "multinomial")
logistic regression is mlogit.
That package, which may be the only R package currently capable of fitting random-intercept models of generalized
logistic type, was written not for that purpose but to analyze `social choice' datasets of interest to
econometricians.

**(7).** A talk I gave about a case-study in generalized (i.e., multi-outcome) logistic regression, in
the context of gross labor-status flows using Current Population Survey data, can be found here.

August 27, 2018

**1. Introduction** --- binomial and multinomial probabilities, statistical tests, estimators
and confidence intervals. Law of large numbers, central limit theorem, delta method,
asymptotic normal distribution of maximum likelihood estimators, Wilks' Theorem.

**2. Computational Techniques** -- Numerical Maximization, Fisher Scoring, Iterative Proportional
Fitting, EM Algorithm, Bayesian (MCMC) Computation.

**3. Describing Contingency Tables** --- models and measures of independence vs. association of
categorical variables in multiway contingency tables, including tests of independence. Hypotheses equating
proportions for different variables. Conditional and marginal odds ratios and relative risks. Confidence
intervals for parameters, including Bayesian formulations. Historical notes on contingency table analysis.

**4. Generalized linear models.** Formulation of conditional response probabilities as
linear expressions in terms of covariables. Likelihood and inference. Quasilikelihood and estimating equations.

**5. Logistic regression.** Interpretation and inference on model parameters. Model
fitting, prediction, and comparison.

**6. Model-building**, including variable selection, diagnostics, and inference about variable
associations in logistic regression models.

**7. Logistic regression extensions**: multiple-category responses, weighted observations,
and missing data.

**8. Loglinear models** and their likelihood-based parametric statistical inference.

**9. Generalized linear models with random effects.** Likelihood and penalized likelihood
based inference. Missing-data formulation.

**10. Comparison of model-fitting strategies.** Likelihood, quasilikelihood, penalized
likelihood, Bayes.

Additional Computing Resources. There are many
publicly available datasets for practice data-analyses. Many of them are taken from journal articles
and/or textbooks and are documented or interpreted there. A good place to start is Statlib. Datasets needed in the course
will either be posted to the course web-page or indicated by links which will be provided here.

A good set of links to data sources from various organizations including Federal
and international statistical agencies is at Washington
Statistical Society links.

**First Class:** Mon., August 27, 2018
**Labor Day Holiday:** Mon., Sept. 3, 2018, NO CLASS
**Rosh Hashanah Holiday:** Mon., Sept. 10, 2018, NO CLASS
**Yom Kippur Holiday:** Wed., Sept. 19, 2018, NO CLASS
**NO In-Class Test** on November 12, 2018
**Last schedule-adjustment Date (for Drop/Withdrawal):** November 5, 2018
**NOTE:** No Class on Wed., November 21, 2018, **Thanksgiving Break**
**Last day of classes:** Mon., December 10, 2018

The UMCP Math Department home page.
The University of Maryland home page.
My home page.

© Eric V Slud, March 11, 2019.