Large Cross-Classified Dataset RIT Page, Fall '05

Statistics of Large Cross-Classified Datasets

Statistics of Models with Parameter Dimension
Growing as a Function of Sample Size

Mon. 3-4, Rm Mth 2400, Fall '05

Eric Slud, Statistics Program, Math Department, Rm 2314, x5-5469

Interested participants should get in touch with me at evs@math.umd.edu

Reading list

Relation to previous RITs

Schedule of Talks

Research Focus: Large datasets arise naturally in many areas of science, government,
and business. Typically, as the size of a dataset gets large, the complexity of questions
which one addresses with it also increases. Such problems range from standard parametric
models whose parameters are allowed to vary with cross-classifying variables, to problems
in regression and time series which lead naturally to the development of Order-selection
techniques, to Semiparametric Statistical Inference (problems where the nuisance parameters
are infinite-dimensional but may be approximated in some sense by finite-dimensional
parameters of growing dimension), to Classification and Clustering.

Growing parameter dimension violates the formal setting of most textbook mathematical
statistics, in which parameter-dimension is fixed and sample size increases to infinity.
The new setting requires a new Asymptotic Theory which explicitly recognizes the controlled
growth of the parameter-space of a probability model as a function of the dataset size.
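
A simple worked case shows what "controlled growth" can mean. The symbols below (a fixed design matrix X_n, an eigenvalue bound c, an error variance sigma^2) are assumptions introduced only for illustration and are not taken from any particular paper on the reading list.

```latex
% Linear model with p = p(n) regressors, fixed design, uncorrelated errors
Y = X_n \beta + \varepsilon, \qquad \beta \in \mathbb{R}^{p(n)}, \qquad
\mathrm{E}\,\varepsilon = 0, \quad \mathrm{Var}(\varepsilon) = \sigma^2 I_n .

% Exact mean squared error of the least-squares estimator
\mathrm{E}\,\|\hat\beta - \beta\|^2
  = \sigma^2 \, \mathrm{tr}\big( (X_n^\top X_n)^{-1} \big) .

% If the smallest eigenvalue of X_n^\top X_n / n is at least c > 0, then
\mathrm{E}\,\|\hat\beta - \beta\|^2 \;\le\; \frac{\sigma^2 \, p(n)}{c \, n} .
```

Under these assumptions, p(n)/n -> 0 is enough for consistency of least squares in Euclidean norm. Results on asymptotic normality and efficiency, especially in nonlinear models, generally impose stronger restrictions on p(n).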

This Research Interaction seminar on mathematical and statistical topics in Large Cross-
Classified Datasets broadly encompasses the overlap of my students' thesis projects and
most of my own current research interests.

The reading list presented below for this RIT has a somewhat more theoretical flavor than
those in related past RITs I have run. However, the problem descriptions and
applications which we discussed in the past are closely related to those we will discuss this
term, and should serve as very good motivation for students in search of good research
problems which mix theory and applications.

Graduate Prerequisites: To benefit from this research activity, a graduate student
should have completed Stat 700 and at least one of Stat 740, 741, 750, or 770,
and have some familiarity with Statistical Computing at the level of Stat 430 (SAS
programming) or Stat 798C (Splus or R).

Undergraduate Prerequisites: An interested undergraduate should have had at
least one course in Mathematical Statistics (e.g. Stat 401 or 420) and experience ---
either in courses or projects --- with numerical computing or data analysis.

Graduate Program: Graduate students will be involved in reading and presenting
papers from the statistical literature concerning provable properties of models and
statistical inference methods related to large cross-classified data structures,
including longitudinal data and spatial or survey data cross-classified or stratified in
terms of many observed covariates. In some cases, students may explore and
present software for the statistical analysis of some of the data structures studied.
(One example would be GEE or generalized estimating equation methods for
longitudinal data.)

Undergraduate Program: Undergraduate students will be involved primarily in
comparative numerical experiments involving algorithms for simulating and analyzing
the large cross-classified data structures we study.

Work Schedule: We will meet weekly in the fall of 2005. Students will choose
from the following list of Topics and Papers (which will regularly be augmented on
this web-page) and present the material in subsequent weeks, after an introductory
couple of weeks' talks by me. Presentations can be informal, but should be detailed
enough and present enough background that we can understand the issues and ideas
clearly. It is expected that many presentations will extend to a second week.

Topics & Papers

Topics by Keyword:
                 misspecified regression models, random-effect GLM's,
                 regression-variable & model selection,
                 principal components analysis & factor analysis,
                 asymptotics for linear models with growing numbers of regressors,
                 errors-in-variables (`measurement error') models,
                 longitudinal models & GEE methods (Generalized Estimating Equations),
                 Panel Data Econometrics,
                 classification and clustering in large datasets,
                 experimental design (`response surface methodology')

Topics by Theoretical Idea:
                 Rates of growth of parameter dimension p(n) compatible with
                        consistency, asymptotic normality & efficiency of estimators,
                 Linear model asymptotics; GLM extensions using exponential families,
                 Profile and adjusted profile likelihoods,
                 Projected score and Hilbert space techniques,
                 Estimating equations,
                 Other modified profile likelihoods,
                 Properties of automatic model selection methods,
                 Methods related to misspecified models.

Reading List

Relevant papers read in past terms (which might be revisited):

Neyman, J. and E. Scott (1948) Consistent estimates based on
partially consistent observations. Econometrica 16, 1-32.

Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties
of maximum likelihood estimators and likelihood ratio tests under
nonstandard conditions. Jour. Amer. Statist. Assoc. 82, 605-610.

White, Halbert (1982) Maximum likelihood estimation of misspecified models.
Econometrica 50, no. 1, 1-25.

Important papers on the focused RIT topic:

Chen, Ru (2005) Misspecified Models with Parameters of Increasing Dimension.
University of Maryland College Park Thesis, Statistics Program, August 2005.

This thesis of my just-finished student contains some new results on consistency
and asymptotic normality and variances of least-squares estimates within fixed-effect
linear regression models when the data actually are governed by mixed-effect models.

He, X. and Shao, Qi-Man (2004) On parameters of increasing dimensions. Preprint.

Jiang, Jiming (1999) Conditional inference about generalized linear mixed models.
Ann. Statist. 27, 1974-2007.

Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) Efficiency of projected
score methods in rectangular array asymptotics. J. Roy. Statist. Soc. Ser. B 65, 191-208.

This is a really interesting paper showing the phenomenon in "two-index asymptotics"
that when numbers of cluster-level nuisance parameters increase overall as the square root
of data-size, uncorrected MLE's are generally consistent but inefficient while a suitable
"projected score" correction restores efficiency.

Lindsay, Bruce, Clogg, C., and Grego, J. (1991) Semiparametric estimation in the Rasch
model and related exponential response models, including a simple latent class model
for item analysis. J. Amer. Statist. Assoc. 86, 96-107.

Another very interesting paper, this one with a sizeable data-analytic component.
It concerns models that are very important in educational statistics and that also
exhibit the "two-index asymptotics" phenomenon, although in practice the dimension of the
nuisance parameters is usually collapsed by treating them as random effects.

McCullagh, P. and Tibshirani, R. (1990) A simple method for the adjustment of profile
likelihoods. J. Roy. Statist. Soc. Ser. B 52, 325-344.

This paper will be our introduction to "profile likelihood" methods which deal with
(sometimes high-dimensional) nuisance parameters by replacing them in the likelihood
by suitable estimators from the data. Various "adjustments" to the resulting likelihood
may improve the properties of the resulting ML estimator of the parameter of interest.
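
As a minimal illustration of why an adjustment can matter, here is a sketch of the classical paired-observation example of Neyman and Scott (1948): each pair has its own nuisance mean and all pairs share one variance. The simple degrees-of-freedom correction used here is a stand-in chosen for illustration, not the adjustment proposed by McCullagh and Tibshirani, and the function name and simulation settings are hypothetical.

```python
import random

def neyman_scott_variance_estimates(n_pairs, sigma=1.0, seed=0):
    """Simulate pairs X_i1, X_i2 ~ N(mu_i, sigma^2) with pair-specific means
    and return (profile MLE of sigma^2, degrees-of-freedom-adjusted estimate)."""
    rng = random.Random(seed)
    within_ss = 0.0
    for _ in range(n_pairs):
        mu_i = rng.uniform(-10.0, 10.0)   # arbitrary nuisance mean for this pair
        x1 = rng.gauss(mu_i, sigma)
        x2 = rng.gauss(mu_i, sigma)
        pair_mean = 0.5 * (x1 + x2)       # MLE of mu_i, plugged into the likelihood
        within_ss += (x1 - pair_mean) ** 2 + (x2 - pair_mean) ** 2
    profile_mle = within_ss / (2 * n_pairs)  # maximizer of the profile likelihood
    adjusted = within_ss / n_pairs           # divide by residual degrees of freedom
    return profile_mle, adjusted

profile_mle, adjusted = neyman_scott_variance_estimates(n_pairs=20000, sigma=1.0)
print(f"profile MLE: {profile_mle:.3f}   adjusted: {adjusted:.3f}")
```

With the true variance equal to 1, the profile MLE settles near 1/2 however many pairs are observed, because the number of nuisance means grows with the sample size, while the adjusted estimate settles near 1.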

Pfanzagl, J. (1993) Incidental versus random nuisance parameters. Ann. Statist.
21, 1663-1691.

Portnoy, Stephen (1988) Asymptotic behavior of likelihood methods for exponential
families when the number of parameters tends to infinity. Ann. Statist. 16, 356-366.

This widely cited paper deals with extensions to GLM exponential family settings
of consistency and asymptotic normality results for linear regression with number of
predictors growing as a function of sample size.

Sartori, N. (2005) Modifications to the profile likelihood in models with incidental
nuisance parameters. Preprint.

Slud, E. and Vonta, F. (2005) Efficient semiparametric estimators via modified
profile likelihood. Jour. Statist. Planning & Inf. 129, 339-367.

A recent paper with my former student, which gives a general approach to
constructing semiparametric efficient estimators (of finite-dimensional structural
parameters in the presence of infinite-dimensional or functional nuisance
parameters) by means of a 'modified profile likelihood'. One of the ingredients in
this approach is to find consistent preliminary estimators, for which the techniques
of the other papers to be studied in this RIT should be very helpful.

Wei, C. Z. (1992) On predictive least squares principles. Ann. Statist. 20, 1-42.

Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property of model
selection criteria. IEEE Trans. Inform. Theory 44, no. 1, 95-116.

Yohai, V. and Maronna, R. (1979) Asymptotic behavior of M-estimators for the
linear model. Ann. Statist. 7, 258-268.

A paper on "M-estimation", relevant to semiparametric linear
regression (estimation using unknown or mis-specified error distribution) which
shows clearly -- if somewhat technically -- how consistency and asymptotic
normality of estimators result from specific assumptions on the rate of growth
of the number of predictors with sample size. I would like someone to present the
(very accessible!) parts of the proof relating to consistency, i.e., to Theorem 2.
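
For readers new to M-estimation, the following is a minimal sketch of the simplest special case, a Huber-type M-estimate of a single location parameter (an intercept-only linear model). It is not the setting or the algorithm of the Yohai and Maronna paper; the tuning constant, the fixed MAD-based scale, and all names are assumptions made for illustration.

```python
import random
import statistics

def huber_location(xs, c=1.345, tol=1e-10, max_iter=200):
    """Huber M-estimate of location by iteratively reweighted averaging.
    The scale is held fixed at the normalized median absolute deviation."""
    med = statistics.median(xs)
    mad = statistics.median([abs(x - med) for x in xs])
    scale = 1.4826 * mad if mad > 0 else 1.0
    mu = med
    for _ in range(max_iter):
        weights = []
        for x in xs:
            r = abs(x - mu) / scale
            # Full weight for small residuals, downweighting for large ones
            weights.append(1.0 if r <= c else c / r)
        new_mu = sum(w * x for w, x in zip(weights, xs)) / sum(weights)
        if abs(new_mu - mu) < tol:
            return new_mu
        mu = new_mu
    return mu

# 95 standard normal observations contaminated by 5 gross outliers
rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(95)] + [50.0] * 5
mean_est = statistics.mean(sample)
huber_est = huber_location(sample)
print(f"sample mean: {mean_est:.3f}   Huber M-estimate: {huber_est:.3f}")
```

The sample mean is pulled far from 0 by the outliers, while the M-estimate stays close to the center of the uncontaminated data, which is the sense in which such estimators tolerate a mis-specified error distribution.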

Discussion of Relation to Previous RIT's

Many additional references on large datasets with particular reference to DNA
Microarrays (used in AMSC seminars in Fall '01 and RIT in Spring '02)
can be found here.
A web-site of microarray references, created by a statistician at LSU named
Barry Moser, may also be helpful. There is also a Special Issue of Statistical Science
(Feb. 2003) on Statistical Challenges and Methods for Microarray Analysis
which contains survey articles and bibliographies with many items of methodological
interest for large cross-classified genomic datasets. Special Issues of other journals
(including Statistica Sinica) also were devoted to the topic. See further 2002 RIT
materials here.

In 2003 the RIT broadened to encompass non-genomics problems under the
heading of Statistics of Large Cross-Classified Datasets. You can see the
background discussion, readings and talk titles at the RIT '03 web-page. Finally, in
Spring '04, the Large Cross-Classified Datasets considered became (mostly
biomedical) multicenter studies unified by random effects models under the trendy
title of "Meta-Analysis", with web-page here.

Schedule of Talks ---

Organization & Overview on `Two-Index Asymptotics': Sept 12, Eric Slud
Adjusted Profile Likelihood: Sept 19, Nadarajasundaram Ganesh
(1990 JRSSB paper of McCullagh & Tibshirani)
Modified profile likelihood ideas: Sept 26, Eric Slud
Two-Index Asymptotics & Projected Score Methods: Oct 3, Yabing Mai
(2003 JRSSB Paper of Li, Lindsay and Waterman,
spilling over a little into following meeting on Oct. 3)

Cross-Classified SEER survival data: Oct. 10, Hantao Mai
(general intro, related to a Stat.Med. paper of Royston)
Mini-course Lecture on Factor Analysis: Oct. 17, E. Slud, 4pm Mth 3206

Cross-classified cancer survival data, cont'd: Oct. 24, Hantao Mai.

Rasch model paper of Lindsay, Clogg, & Grego (1991): Oct. 31, Ziliang Li

Ziliang Li gave a second talk on Nov. 14 on results from the same paper.

No RIT meeting November 7: the Stat Seminar meets at
an unusual time, 4-5pm, on that day.

Last meeting for term will be Nov. 28, at which I will relate my students' and
my work on cross-classified factor analysis and Two-Index Asymptotics
results we discussed earlier (Oct. 3) in connection with the Li, Lindsay and
Waterman (2003 JRSSB) paper.
Please get in touch with me at evs@math.umd.edu
to volunteer to read a paper and give a talk.

Return to my home page.

© Eric V Slud, November 17, 2005.
