Mon. 3-4, Rm Mth 2400
Eric Slud, Statistics Program, Math Department, Rm 2314, x5-5469
Interested participants should get in touch with
me at firstname.lastname@example.org
Relation to previous RITs
Schedule of Talks
Research Focus: Large datasets arise naturally in many areas of
and business. Typically, as the size of a dataset gets large, the complexity of questions
which one addresses with it also increases. Such problems range from standard parametric
models whose parameters are allowed to vary with cross-classifying variables, to problems
in regression and time series which lead naturally to the development of Order-selection
techniques, to Semiparametric Statistical Inference (problems where the nuisance parameters
are infinite-dimensional but may be approximated in some sense by finite-dimensional
parameters of growing dimension), to Classification and Clustering.
Growing parameter dimension violates the formal setting of most
statistics, in which parameter-dimension is fixed and sample size increases to infinity.
The new setting requires a new Asymptotic Theory which explicitly recognizes the controlled
growth of the parameter-space of a probability model as a function of the dataset size.
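A minimal simulation sketch of why growing parameter dimension breaks the fixed-dimension
theory, using the classical setting of the Neyman and Scott (1948) paper listed among the
readings: n strata, each with its own mean and two normal observations sharing a common
variance. The function name, the uniform distribution used to generate the stratum means,
and the value sigma = 1 are illustrative assumptions, not part of the seminar materials.
Because the number of nuisance means grows with n, the maximum likelihood estimator of the
variance should settle near half the true variance rather than near the true variance.

```python
import random


def neyman_scott_mle(n_strata, sigma=1.0, seed=0):
    """MLE of sigma^2 when each stratum has its own mean and two observations.

    Hypothetical helper for illustration only; the stratum means are drawn
    from an arbitrary uniform distribution (an assumption of this sketch).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_strata):
        mu = rng.uniform(-5.0, 5.0)      # stratum-specific nuisance mean
        x1 = rng.gauss(mu, sigma)
        x2 = rng.gauss(mu, sigma)
        xbar = 0.5 * (x1 + x2)           # MLE of this stratum's mean
        total += (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    # The MLE divides the residual sum of squares by the total number
    # of observations, 2 * n_strata.
    return total / (2 * n_strata)


if __name__ == "__main__":
    # The estimates should approach sigma^2 / 2, not sigma^2, as n grows.
    for n in (100, 1000, 10000, 100000):
        print(n, round(neyman_scott_mle(n), 3))
```

In this particular example, dividing the residual sum of squares by n instead of 2n (that is,
doubling the estimate) removes the asymptotic bias.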
This Research Interaction seminar on mathematical and statistical topics in Large
Cross-Classified Datasets broadly encompasses the overlap of my students' thesis projects and
most of my own current research interests.
The reading list presented below for this RIT has a somewhat more theoretical flavor than
those in related past RITs I have run. However, the problem descriptions and
applications which we discussed in the past are closely related to those we will discuss this
term, and should serve as very good motivation for students in search of good research
problems which mix theory and applications.
Graduate Prerequisites: To benefit from this research activity, a graduate student
should have completed Stat 700 and at least one of Stat 740, 741, 750, or 770,
and have some familiarity with Statistical Computing at the level of Stat 430 (SAS
programming) or Stat 798C (Splus or R).
Undergraduate Prerequisites: An interested undergraduate should have had at least one
course in Mathematical Statistics (e.g. Stat 401 or 420) and experience, either in courses
or projects, with numerical computing or data analysis.
Graduate Program: Graduate students will be involved in reading and presenting
papers from the statistical literature concerning provable properties of models and
statistical inference methods related to large cross-classified data structures,
including longitudinal data and spatial or survey data cross-classified or stratified in
terms of many observed covariates. In some cases, students may explore and
present software for the statistical analysis of some of the data structures studied.
(One example would be GEE or generalized estimating equation methods for
Undergraduate Program: Undergraduate students will be involved primarily in
comparative numerical experiments involving algorithms for simulating and analyzing
the large cross-classified data structures we study.
Work Schedule: We will meet weekly in the fall of 2005. Students will choose
from the following list of Topics and Papers (which will regularly be augmented on
this web-page) and present the material in subsequent weeks, after a couple of weeks
of introductory talks by me. Presentations can be informal, but should be detailed
enough and present enough background that we can understand the issues and ideas
clearly. It is expected that many presentations will extend to a second week.
Topics by Keyword:
misspecified regression models, random-effect GLMs,
regression-variable & model selection,
principal components analysis & factor analysis,
asymptotics for linear models with growing numbers of regressors,
errors-in-variables ('measurement error') models,
longitudinal models & GEE methods (Generalized Estimating Equations),
Panel Data Econometrics,
classification and clustering in large datasets,
experimental design ('response surface methodology').
Topics by Theoretical Idea:
Rates of growth of parameter dimension p(n) compatible with
consistency, asymptotic normality & efficiency of estimators,
Linear model asymptotics; GLM extensions using exponential families,
Profile and adjusted profile likelihoods,
Projected score and Hilbert space techniques,
Other modified profile likelihoods,
Properties of automatic model selection methods,
Methods related to misspecified models.
Relevant papers read in past terms (which might be revisited):
Neyman, J. and E. Scott (1948) Consistent estimates based on
partially consistent observations. Econometrica 16, 1-32.
Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties
of maximum likelihood estimators and likelihood ratio tests under
nonstandard conditions. J. Amer. Statist. Assoc. 82, 605-610.
White, Halbert (1982) Maximum likelihood estimation of misspecified
Econometrica 50, no. 1, 1-25.
Important papers on the focused RIT topic:
Chen, Ru (2005) Misspecified Models with Parameters of
University of Maryland College Park Thesis, Statistics Program, August 2005.
He, X. and Shao, Qi-Man (2004) On parameters of increasing dimensions. Preprint.
Jiang, Jiming (1999) Conditional inference about generalized linear mixed models.
Ann. Statist. 27, 1974-2007.
Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) Efficiency of projected
score methods in rectangular array asymptotics. J. Roy. Statist. Soc. Ser. B 65, 191-208.
Lindsay, Bruce, Clogg, C., and Grego, J. (1991) Semiparametric estimation in the Rasch
model and related exponential response models, including a simple latent class model
for item analysis. J. Amer. Statist. Assoc. 86, 96-107.
McCullagh, P. and Tibshirani, R. (1990) A simple method for the adjustment of profile
likelihoods. J. Roy. Statist. Soc. Ser. B 52, 325-344.
Pfanzagl, J. (1993) Incidental versus random nuisance
parameters. Ann. Statist.
Portnoy, Stephen (1988) Asymptotic behavior of likelihood methods for exponential
families when the number of parameters tends to infinity. Ann. Statist. 16, 356-366.
Sartori, N. (2005) Modifications to the profile likelihood in models
nuisance parameters. Preprint.
Slud, E. and Vonta, F. (2005) Efficient semiparametric estimators
profile likelihood. Jour. Statist. Planning & Inf. 44, 339-367.
Wei, C. Z. (1992) On predictive least squares principles. Ann. Statist. 20, 1-42.
Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property of
selection criteria. IEEE Trans. Inform. Theory 44, no. 1, 95-116.
Yohai, V. and Maronna, R. (1979) Asymptotic behavior of M-estimators
linear model. Ann. Statist. 7, 258-268.
Many additional references on large datasets with particular reference to
Microarrays (used in AMSC seminars in Fall '01 and RIT in Spring '02)
can be found here.
A web-site of microarray references, created by a statistician at LSU named
Barry Moser, may also be helpful. There is also a Special Issue of Statistical Science
(Feb. 2003) on Statistical Challenges and Methods for Microarray Analysis
which contains survey articles and bibliographies with many items of methodological
interest for large cross-classified genomic datasets. Special Issues of other journals
(including Statistica Sinica) also were devoted to the topic. See further 2002 RIT
In 2003 the RIT broadened to encompass non-genomics problems under the
heading of Statistics of Large Cross-Classified Datasets. You can see the background
discussion, readings and talk titles at the RIT '03 web-page. Finally, in
Spring '04, the Large Cross-Classified Datasets considered became (mostly
biomedical) multicenter studies unified by random effects models under the trendy
title of "Meta-Analysis", with web-page here.
Schedule of Talks ---
(1990 JRSSB paper of McCullagh & Tibshirani)
Please get in touch with me to volunteer to read a paper and give a talk.