Statistics of Large Cross-Classified Datasets

Statistics of Models with Parameter Dimension
Growing as a Function of Sample Size

Mon. 3-4,  Rm  Mth 2400                                                Fall '05

Eric Slud        Statistics Program , Math Department        Rm 2314       x5-5469

Interested participants should get in touch with me at

Reading list

Relation to previous RITs

Schedule of Talks

Research Focus: Large datasets arise naturally in many areas of science, government,
and business. Typically, as a dataset grows, the complexity of the questions
one addresses with it also increases.  Such problems range from standard parametric
models whose parameters are allowed to vary with cross-classifying variables, to problems
in regression and time series which lead naturally to the development of order-selection
techniques, to Semiparametric Statistical Inference (problems where the nuisance parameters
are infinite-dimensional but may be approximated in some sense by finite-dimensional
parameters of growing dimension), to Classification and Clustering.

Growing parameter dimension violates the formal setting of most textbook mathematical
statistics, in which parameter-dimension is fixed and sample size increases to infinity.
The new setting requires a new Asymptotic Theory which explicitly recognizes the controlled
growth of the parameter-space of a probability model as a function of the dataset size.
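
As a concrete illustration of why growing parameter dimension breaks the textbook theory,
the sketch below simulates the classical example from the Neyman and Scott (1948) paper
on the reading list: n pairs of normal observations, each pair with its own unknown mean
and all sharing one variance. The number of nuisance means grows with the sample size,
and the joint MLE of the variance converges to half the true value instead of the true
value. The function name, the seed, and the range from which the means are drawn are
illustrative assumptions, not taken from any of the papers.

```python
import random

def neyman_scott_mle(n_pairs, sigma=1.0, seed=0):
    """Joint MLE of sigma^2 from n_pairs pairs, each pair with its own mean."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        mu_i = rng.uniform(-5.0, 5.0)        # nuisance mean of this pair (arbitrary range)
        x1 = rng.gauss(mu_i, sigma)
        x2 = rng.gauss(mu_i, sigma)
        xbar = 0.5 * (x1 + x2)               # MLE of mu_i
        total += (x1 - xbar) ** 2 + (x2 - xbar) ** 2
    return total / (2 * n_pairs)             # divide by the number of observations

for n in (100, 10_000, 100_000):
    mle = neyman_scott_mle(n)
    # second column: uncorrected MLE; third column: degrees-of-freedom correction
    print(n, round(mle, 3), round(2 * mle, 3))
```

With true variance 1, the uncorrected column should settle near 0.5 and the corrected
column near 1 as n grows, which is the inconsistency that the "incidental parameter"
literature sets out to repair.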

This Research Interaction seminar on mathematical and statistical topics in Large Cross-
Classified Datasets broadly encompasses the overlap of my students' thesis projects and
most of my own current research interests.

The reading list presented below for this RIT has a somewhat more theoretical flavor than
those in related past RITs I have run. However, the problem descriptions and
applications which we discussed in the past are closely related to those we will discuss this
term, and should serve as very good motivation for students in search of good research
problems which mix theory and applications.

Graduate Prerequisites: To benefit from this research activity, a graduate student
should have completed Stat 700 and at least one of Stat 740, 741, 750, or 770,
and have some familiarity with Statistical Computing at the level of Stat 430 (SAS
programming) or Stat 798C (Splus or R).

Undergraduate Prerequisites: An interested undergraduate should have had at
least one course in Mathematical Statistics (e.g. Stat 401 or 420) and experience ---
either in courses or projects --- with numerical computing or data analysis.

Graduate Program: Graduate students will be involved in reading and presenting
papers from the statistical literature concerning provable properties of models and
statistical inference methods related to large cross-classified data structures,
including longitudinal data and spatial or survey data cross-classified or stratified in
terms of many observed covariates. In some cases, students may explore and
present software for the statistical analysis of some of the data structures studied.
(One example would be GEE, or generalized estimating equation, methods for
longitudinal data.)

Undergraduate Program: Undergraduate students will be involved primarily in
comparative numerical experiments involving algorithms for simulating and analyzing
the large cross-classified data structures we study.

Work Schedule: We will meet weekly in the fall of 2005. After a couple of weeks of
introductory talks by me, students will choose from the following list of Topics and
Papers (which will regularly be augmented on this web-page) and present the material in
subsequent weeks. Presentations can be informal, but should be detailed enough, and give
enough background, that we can understand the issues and ideas clearly. It is expected
that many presentations will extend to a second week.

Topics & Papers

Topics by Keyword:
                 misspecified regression models, random-effect GLM's,
                 regression-variable & model selection,
                 principal components analysis & factor analysis,
                 asymptotics for linear models with growing numbers of regressors,
                 errors-in-variables (`measurement error') models,
                 longitudinal models & GEE methods (Generalized Estimating Equations),
                 Panel Data Econometrics,
                 classification and clustering in large datasets,
                 experimental design (`response surface methodology')

Topics by Theoretical Idea   :
                 Rates of growth of parameter dimension p(n) compatible with
                        consistency, asymptotic normality & efficiency of estimators,
                 Linear model asymptotics; GLM extensions using exponential families,
                 Profile and adjusted profile likelihoods,
                 Projected score and Hilbert space techniques,
                 Estimating equations,
                 Other modified profile likelihoods,
                 Properties of automatic model selection methods,
                 Methods related to misspecified models.

Reading List

Relevant papers read in past terms (which might be revisited):

Neyman, J. and E. Scott (1948) Consistent estimates based on
partially consistent observations. Econometrica 16, 1-32.

Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties
of maximum likelihood estimators and likelihood ratio tests under
nonstandard conditions.  Jour. Amer. Statist. Assoc. 82, 605-610.

White, Halbert (1982) Maximum likelihood estimation of misspecified models.
Econometrica 50, no. 1, 1-25.

Important papers on the focused RIT topic:

Chen, Ru (2005) Misspecified Models with Parameters of Increasing Dimension.
University of Maryland College Park Thesis, Statistics Program, August 2005.

  • This thesis of my just-finished student contains some new results on consistency
    and asymptotic normality and variances of least-squares estimates within fixed-effect
    linear regression models when the data actually are governed by mixed-effect models.

    He, X. and Shao, Qi-Man (2004) On parameters of increasing dimensions. Preprint.

    Jiang, Jiming (1999) Conditional inference about generalized linear mixed models.
    Ann. Statist. 27 , 1974-2007.

    Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) Efficiency of projected
    score methods in rectangular array asymptotics.
    J. Roy. Statist. Soc. Ser. B 65, 191-208.

  • This is a really interesting paper showing the phenomenon in "two-index asymptotics"
    that when numbers of cluster-level nuisance parameters increase overall as the square root
    of data-size, uncorrected MLE's are generally consistent but inefficient while a suitable
    "projected score" correction restores efficiency.

    Lindsay, Bruce, Clogg, C., and Grego, J. (1991) Semiparametric estimation in the Rasch
    model and related exponential response models, including a simple latent class model
    for item analysis.
    J. Amer. Statist. Assoc. 86, 96-107.

  • Another very interesting paper, but this one has a sizeable data-analytic component
    concerning models that are very important in educational statistics that also exhibit the
    "two-index asymptotics" phenomenon but usually collapse the nuisance parameters'
    dimension by treating them as random effects.

    McCullagh, P. and Tibshirani, R. (1990) A simple method for the adjustment of profile
    likelihoods.
    J. Roy. Statist. Soc. Ser. B 52, 325-344.

  • This paper will be our introduction to "profile likelihood" methods which deal with
    (sometimes high-dimensional) nuisance parameters by replacing them in the likelihood
    by suitable estimators from the data. Various "adjustments" to the resulting likelihood
    may improve the properties of the resulting ML estimator of the parameter of interest.

    Pfanzagl, J. (1993) Incidental versus random nuisance parameters. Ann. Statist.
    21, 1663-91.

    Portnoy, Stephen  (1988) Asymptotic behavior of likelihood methods for exponential
    families when the number of parameters tends to infinity.
    Ann. Statist. 16, 356-366.

  • This widely cited paper deals with extensions to GLM exponential family settings
    of consistency and asymptotic normality results for linear regression with number of
    predictors growing as a function of sample size.

    Sartori, N. (2005) Modifications to the profile likelihood in models with incidental
    nuisance parameters.

    Slud, E. and Vonta, F. (2005) Efficient semiparametric estimators via modified
    profile likelihood.
    Jour. Statist. Planning & Inf. 44, 339-367.

  • A recent paper with my former student, which gives a general approach to
    constructing semiparametric efficient estimators (of finite-dimensional structural
    parameters in the presence of infinite-dimensional or functional nuisance
    parameters) by means of a 'modified profile likelihood'. One of the ingredients in
    this approach is to find consistent preliminary estimators, for which the techniques
    of the other papers to be studied in this RIT should be very helpful.

    Wei, C. Z. (1992) On predictive least squares principles. Ann. Statist. 20, 1-42.

    Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property of model
    selection criteria.
    IEEE Trans. Inform. Theory 44, no. 1, 95-116.

    Yohai, V. and Maronna, R. (1979) Asymptotic behavior of M-estimators for the
    linear model.
    Ann. Statist. 7, 258-268.

  • A paper on "M-estimation", relevant to semiparametric linear
    regression (estimation using unknown or mis-specified error distribution) which
    shows clearly -- if somewhat technically -- how consistency and asymptotic
    normality of estimators results from specific assumptions on the rate of growth
    of the number of predictors with sample size. I would like someone to present the
    (very accessible!) parts of the proof relating to consistency, i.e., to Theorem 2.
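
The profile-likelihood idea described in the McCullagh and Tibshirani annotation above
can be sketched numerically. The example below is only a minimal illustration under
stated assumptions: normal data in k strata of size m, each stratum with its own mean
(the nuisance parameters) and a common variance (the parameter of interest). The
profile log-likelihood plugs the stratum sample means into the likelihood. The
"adjusted" version used here is the familiar marginal (REML-type) likelihood, chosen as
a simple stand-in; it is not the specific adjustment proposed in that paper. All
function names, the seed, and the values of k and m are hypothetical.

```python
import math
import random

def profile_loglik(s2, ss, n_total):
    """Log-likelihood in s2 with every stratum mean replaced by its sample mean
    (additive constants dropped)."""
    return -0.5 * n_total * math.log(s2) - ss / (2.0 * s2)

def adjusted_loglik(s2, ss, n_total, n_strata):
    """Marginal (REML-type) version: one degree of freedom removed per stratum mean."""
    return -0.5 * (n_total - n_strata) * math.log(s2) - ss / (2.0 * s2)

rng = random.Random(1)
k, m, sigma = 2000, 3, 1.0          # hypothetical: 2000 strata of size 3, true variance 1

# Within-stratum sum of squares about the stratum sample means.
ss = 0.0
for _ in range(k):
    mu = rng.uniform(-5.0, 5.0)     # nuisance mean of this stratum (arbitrary range)
    xs = [rng.gauss(mu, sigma) for _ in range(m)]
    xbar = sum(xs) / m
    ss += sum((x - xbar) ** 2 for x in xs)

n_total = k * m
grid = [0.2 + 0.001 * i for i in range(1801)]   # candidate values of sigma^2 in [0.2, 2.0]
profile_hat = max(grid, key=lambda s2: profile_loglik(s2, ss, n_total))
adjusted_hat = max(grid, key=lambda s2: adjusted_loglik(s2, ss, n_total, k))
print("profile maximizer :", round(profile_hat, 3))
print("adjusted maximizer:", round(adjusted_hat, 3))
```

Analytically the two maximizers are ss/N and ss/(N-k), where N = km, so with three
observations per stratum the unadjusted profile estimate settles near two thirds of the
true variance while the adjusted one settles near the true variance.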

Discussion of Relation to Previous RIT's

              Many additional references on large datasets with particular reference to DNA
    Microarrays (used in AMSC seminars in Fall '01 and RIT in Spring '02)
    can be found here.
              A web-site of microarray references, created by a statistician at LSU named
    Barry Moser, may also be helpful. There is also a Special Issue of Statistical Science
    (Feb. 2003) on Statistical Challenges and Methods for Microarray Analysis
    which contains survey articles and bibliographies with many items of methodological
    interest for large cross-classified genomic datasets. Special Issues of other journals
    (including Statistica Sinica) also were devoted to the topic. See further 2002 RIT
    materials here.

              In 2003 the RIT broadened to encompass non-genomics problems under the
    heading of Statistics of Large Cross-Classified Datasets. You can see the back-
    ground discussion, readings and talk titles at the RIT '03 web-page. Finally, in
    Spring '04, the Large Cross-Classified Datasets considered became (mostly
    biomedical) multicenter studies unified by random effects models under the trendy
    title of "Meta-Analysis", with web-page here.

    Schedule of Talks ---