Statistics of Large Cross-Classified Datasets

Thur 1-2, Rm Mth 2400, F '03

Eric Slud       Statistics Program, Math Department Rm 2314

Interested participants should get in touch with me at

Research Focus: Large datasets arise naturally in many areas of science,
government, and business. Typically, as the size of a dataset grows, the
complexity of the questions one addresses with it also increases. Such
problems range from Semiparametric Statistical Inference, to Order-selection
problems in regression and time series, to Classification and Clustering as in the
Microarray Data problem-area in which I ran the AMSC seminar in Fall '01
and an RIT in 2002-03. This setting is unlike the formal one of most mathematical
statistics, in which the parameter dimension is fixed and the sample size increases to
infinity. The contrast suggests the need for a new Asymptotics which explicitly
recognizes the growth of the parameter space of a probability model as a function
of the size n of the dataset.
        This `research interaction' seminar on mathematical/statistical topics in
Large Cross-Classified Datasets broadly encompasses the overlap of my students'
thesis projects and most of my own current research interests.

Graduate Prerequisites: To benefit from this research activity, a graduate student
should have completed Stat 700 and at least one of Stat 740, 741, 750, or 770, and
have some familiarity with Statistical Computing at the level of Stat 430 (SAS
programming) or Stat 798C (Splus and SAS).

Undergraduate Prerequisites: An interested undergraduate should have had at
least one course in Mathematical Statistics (e.g. Stat 401 or 420) and considerable
experience -- either in courses or projects -- with numerical computing or
data analysis.

Graduate Program: Graduate students will be involved in reading and presenting
papers from the statistical literature concerning provable properties of models and
statistical-inference methods related to large cross-classified data structures,
including longitudinal data and spatial or survey data cross-classified or stratified in
terms of many observed covariates. In some cases, students may explore and present
software for the statistical analysis of some of the data structures studied. (One
example would be GEE, or generalized estimating equation, methods for
longitudinal data.)

Undergraduate Program: Undergraduate students will be involved primarily in
comparative numerical experiments involving algorithms for simulating and analyzing
the large cross-classified data structures we study.
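
As one illustration of the kind of numerical experiment meant here, the following is a minimal sketch (not part of the seminar materials) of the Neyman-Scott setting from the 1948 paper listed below: each stratum has its own unknown mean, so the number of parameters grows with the number of strata. The function name, the choice of two observations per stratum, and the distribution used to generate the stratum means are all assumptions made for the illustration.

```python
import numpy as np


def neyman_scott_mle(n_strata, sigma=1.0, per_stratum=2, seed=0):
    """MLE of sigma^2 when every stratum has its own unknown mean.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    # One nuisance mean per stratum: the parameter count grows with the data.
    # (The N(0, 5^2) law for the means is an arbitrary choice.)
    mu = rng.normal(0.0, 5.0, size=n_strata)
    x = mu[:, None] + sigma * rng.standard_normal((n_strata, per_stratum))
    # Plug in the stratum sample means, then average the squared
    # residuals over all cells (no degrees-of-freedom correction).
    resid = x - x.mean(axis=1, keepdims=True)
    return float((resid ** 2).sum() / x.size)


if __name__ == "__main__":
    for n in (100, 10_000, 100_000):
        print(n, neyman_scott_mle(n))
```

With k observations per stratum, the estimator converges to sigma^2 (k-1)/k as the number of strata increases, so with sigma = 1 and k = 2 the printed values should settle near 0.5 rather than 1. Comparing this with the estimator that divides by n(k-1) instead of nk is a natural first experiment.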

Work Schedule: Unlike in previous terms, we will meet weekly
in the fall of 2003. Students will choose from the following list of Topics and Papers
(which will regularly be augmented on this web-page) and present the material in
subsequent weeks. Presentations can be informal, but should be detailed enough and
present enough background that we can understand the issues and ideas clearly. It is
expected that many presentations will extend to a second week.

Topics & Papers

Topics:  misspecified models, random-effect GLM's,
                regression-variable & model selection,
                principal components analysis, factor analysis,
                asymptotics for models with numbers of regressors
                        growing with sample size,
                errors-in-variables (`measurement error') models,
                longitudinal models & GEE methods (Generalized
                        Estimating Equations),
                Panel Data Econometrics,
                classification and clustering in large datasets,
                experimental design (`response surface methodology').

Papers read in past terms:

Neyman, J. and E. Scott (1948) Consistent estimates based on
partially consistent observations. Econometrica 16, 1-32.

Pinheiro, J. and Bates, D. (1995) Documentation of Splus functions lme and nlme
(with data illustrations).

Robinson, G. K. (1991) That BLUP is a good thing: the estimation of random effects.
With comments and a rejoinder by the author. Statist. Sci. 6, 15-51.

Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties
of maximum likelihood estimators and likelihood ratio tests under
nonstandard conditions.  Jour. Amer. Statist. Assoc. 82, 605-610.

Shibata, Ritei (1981) An optimal selection of regression variables.
Biometrika 68, 45-54.

Many additional references on large datasets with particular reference to DNA
Microarrays (used in AMSC seminars and RITs in past terms) can be found here.
A web-site of microarray references, created by a statistician at LSU named
Barry Moser, may also be helpful.

Additional papers and books of interest:

Agresti, A. (2002) Categorical Data Analysis, 2nd ed. Chapter 11.

Akaike, Hirotugu (1970) Statistical predictor identification. Ann. Inst. Statist. Math.
22, 203-217.

Akaike, H. (1973) Information theory and an extension of the maximum likelihood
principle. Proc. 2nd Int. Symp. on Information Theory, B. N. Petrov and F. Csaki, eds.
Akademiai Kiado, Budapest, 267-281.

Barron, Andrew, Rissanen, Jorma, and Yu, Bin (1998) The minimum description
length principle in coding and modeling. Information theory: 1948-1998. IEEE
Trans. Inform. Theory 44, no. 6, 2743-2760.

Box, George E. P., Hunter, William G., and Hunter, J. Stuart (1978) Statistics for
experimenters. An introduction to design, data analysis, and model building.
John Wiley & Sons, New York-Chichester-Brisbane.

Ford, Ian, Titterington, D. M., and Kitsos, Christos P. (1989) Recent advances in
nonlinear experimental design. Technometrics 31, 49-60.

Fuller, Wayne (1987) Measurement Error Models. New York: John Wiley.

Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) Efficiency of projected
score methods in rectangular array asymptotics. J. R. Stat. Soc. Ser. B 65, 191-208.

Lindsay, Bruce G. and Lesperance, Mary L. (1995) A review of semiparametric mixture
models. Statistical modelling (Leuven, 1993). J. Statist. Plann. Inference 47, 29-39.

Myers, Raymond H. and Montgomery, Douglas C. (2002) Response Surface Methodology:
process and product optimization using designed experiments. New York: John Wiley.

Rao, C. R. and Wu, Y. (2001) On model selection (with discussion). IMS Lecture
Notes-Monograph Series, Vol. 38, pp. 1-64.

Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist.  6, 461-464.

Slud, E., Stone, M., Smith, P. and Goldstein, M. Jr. (2002) Principal components
representation of the two-dimensional coronal tongue surface, Phonetica 59, 108-133.

White, Halbert (1982) Maximum likelihood estimation of misspecified models.
Econometrica 50, no. 1, 1-25.

Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property of model
selection criteria. IEEE Trans. Inform. Theory 44, no. 1, 95-116.

There is also a Special Issue of Statistical Science (Feb. 2003) on Statistical
Challenges and Methods for Microarray Analysis which contains survey articles
and bibliographies with many items of interest for our RIT. Special Issues of
other journals (including Statistica Sinica) also were devoted to the topic.

Other papers used in Candidacy presentation of Ru Chen:

Freedman, D. A. (1983) A note on screening regression equations. The American
Statistician 37, 152-155.

Freedman, Laurence S. and Pee, David (1989) Return to a note on screening regression
equations. The American Statistician 43, 279-282.

Portnoy, Stephen (1988) Asymptotic behavior of likelihood methods for exponential
families when the number of parameters tends to infinity. Ann. Statist. 16, 356-366.

Shao, Jun (1997) An asymptotic theory for linear model selection. With comments and
a rejoinder by the author. Statist. Sinica 7, 221-264.

Other papers used in Candidacy presentation of Sophie Tsou:

Anderson, T.W. and Rubin, H. (1956) Statistical inference in factor analysis. Proc. 3rd
Berk. Symp. 5, 111-150.

Tucker, L. (1966) Some mathematical notes on 3-mode factor analysis. Psychometrika 31(3).

(Other) Papers covered this term:

Andrews, Donald W. K. and Ploberger, W. (1994) Optimal tests when a nuisance
parameter is present only under the alternative. Econometrica 62(6), 1383-1414.

Ghosh, M. and Rao, J. (1994) Small area estimation: an appraisal. Statist. Sci. 9.

(Other) Papers to be considered for the future:

Liu, M., Taylor, J. and Belin, T. (2000) Multiple imputation and posterior simulation for
multivariate missing data in longitudinal studies. Biometrics 56, 1157-63.

Schedule of Talks --

© Eric V Slud, November 14, 2003.