Thur 1-2, Rm Mth 2400
Eric Slud Statistics Program, Math Department Rm 2314
Interested participants should get in touch with me at firstname.lastname@example.org
Research Focus: Large datasets arise naturally
in many areas of science,
government, and business. Typically, as the size of a dataset gets large, the
complexity of questions which one addresses with it also increases. Such
problems range from Semiparametric Statistical Inference to Order-selection
problems in regression and time series, to Classification and Clustering as in the
Microarray Data problem-area in which I ran the AMSC seminar in Fall '01
and an RIT in 2002-03. This is unlike the formal setting of most mathematical
statistics, in which parameter-dimension is fixed and sample size increases to
infinity. The contrast suggests the need for a new Asymptotics which explicitly
recognizes the growth of the parameter-space of a probability model as a function
of the size n of the dataset.
This `research interaction' seminar on mathematical/statistical topics in
Large Cross-Classified Datasets broadly encompasses the overlap of my students'
thesis projects and most of my own current research interests.
Graduate Prerequisites: To benefit from
this research activity, a graduate student
should have completed Stat 700 and at least one of Stat 740, 741, 750, or 770, and
have some familiarity with Statistical Computing at the level of Stat 430 (SAS
programming) or Stat 798C (Splus and SAS).
Undergraduate Prerequisites: An interested
undergraduate should have had at
least one course in Mathematical Statistics (e.g. Stat 401 or 420) and considerable
experience -- either in courses or projects -- with numerical computing or
Graduate Program: Graduate students will
be involved in reading and presenting
papers from the statistical literature concerning provable properties of models and
statistical-inference methods related to large cross-classified data structures,
including longitudinal data and spatial or survey data cross-classified or stratified in
terms of many observed covariates. In some cases, students may explore and present
software for the statistical analysis of some of the data structures studied. (One
example would be GEE or generalized estimating equation methods for
Undergraduate Program: Undergraduate students
will be involved primarily in
comparative numerical experiments involving algorithms for simulating and analyzing
the large cross-classified data structures we study.
Work Schedule: Unlike the meeting schedule
in previous terms, we will meet weekly
in the fall of 2003. Students will choose from the following list of Topics and Papers
(which will regularly be augmented on this web-page) and present the material in
subsequent weeks. Presentations can be informal, but should be detailed enough and
present enough background that we can understand the issues and ideas clearly. It is
expected that many presentations will extend to a second week.
Topics: misspecified models, random-effect GLM's,
regression-variable & model selection,
principal components analysis, factor analysis,
asymptotics for models with numbers of regressors
growing with sample size,
errors-in-variables (`measurement error') models,
longitudinal models & GEE methods (Generalized Estimating Equations),
Panel Data Econometrics,
classification and clustering in large datasets,
experimental design (`response surface methodology').
Papers read in past terms:
Neyman, J. and E. Scott (1948) Consistent estimates based on
partially consistent observations. Econometrica 16, 1-32.
Pinheiro, J. and Bates, D. (1995) Documentation of Splus functions
lme and nlme
(with data illustrations).
Robinson, G. K. (1991) That BLUP is a good thing: the estimation
of random effects.
With comments and a rejoinder by the author. Statist. Sci. 6, 15-51.
Self, Steven G. and Liang, Kung-Yee (1987) Asymptotic properties
of maximum likelihood estimators and likelihood ratio tests under
nonstandard conditions. Jour. Amer. Statist. Assoc. 82, 605-610.
Shibata, Ritei (1981) An optimal selection of regression variables.
Biometrika 68, 45-54.
Many additional references on large datasets
with particular reference to DNA
Microarrays (used in AMSC seminars and RITs in past terms) can be found here.
A web-site of microarray references, created by a statistician at LSU named
Barry Moser, may also be helpful.
Additional papers and books of interest:
Agresti, A. (2002) Categorical Data Analysis, 2nd ed. Chapter 11.
Akaike, Hirotugu (1970) Statistical predictor identification.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood
principle. Proc. 2nd Int. Symp. on Information Theory, B. N. Petrov and F. Csaki, eds.
Akademia Kiado, Budapest, 267-281.
Barron, Andrew, Rissanen, Jorma, and Yu, Bin (1998) The minimum description
length principle in coding and modeling. Information theory: 1948--1998. IEEE
Trans. Inform. Theory 44, no. 6, 2743--2760.
Box, George E. P., Hunter, William G., and Hunter, J. Stuart (1978) Statistics for
experimenters. An introduction to design, data analysis, and model building.
John Wiley & Sons, New York-Chichester-Brisbane.
Ford, Ian, Titterington, D. M., and Kitsos, Christos P. (1989) Recent advances in
nonlinear experimental design. Technometrics 31, 49-60.
Fuller, Wayne (1987) Measurement Error Models. New York: John Wiley.
Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) Efficiency of projected
score methods in rectangular array asymptotics. J. R. Stat. Soc. Ser. B 65, 191-208.
Lindsay, Bruce G. and Lesperance, Mary L. (1993) A review of semiparametric mixture
models. Statistical modelling (Leuven). J. Statist. Plann. Inference 47 (1995), 29-39.
Myers, Raymond H. and Montgomery, Douglas C. (2002) Response Surface Methodology:
process and product optimization using designed experiments. New York: John Wiley.
Rao, C. R. and Wu, Y. (2001). On model selection (with discussion),
IMS Lecture Notes-Monograph Series, Vol. 38, pp. 1-64.
Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
Slud, E., Stone, M., Smith, P. and Goldstein, M. Jr. (2002) Principal components
representation of the two-dimensional coronal tongue surface, Phonetica 59, 108-133.
White, Halbert (1982) Maximum likelihood estimation of misspecified models.
Econometrica 50, no. 1, 1-25.
Yang, Yuhong, and Barron, Andrew R. (1998) An asymptotic property of model
selection criteria. IEEE Trans. Inform. Theory 44, no. 1, 95-116.
There is also a Special Issue of Statistical
Science (Feb. 2003) on Statistical
Challenges and Methods for Microarray Analysis which contains survey articles
and bibliographies with many items of interest for our RIT. Special Issues of
other journals (including Statistica Sinica) also were devoted to the topic.
Other papers used in Candidacy presentation of Ru Chen:
D. A. Freedman (1983) A Note on Screening Regression Equations, The American
Statistician 37, 152-155.
Laurence S. Freedman & David Pee (1989) Return to a Note on Screening Regression
Equations, The American Statistician 43, 279-282.
Portnoy, Stephen (1988) Asymptotic behavior of likelihood methods for exponential
families when the number of parameters tends to infinity. Ann. Statist. 16, 356-366.
Shao, Jun (1997) An asymptotic theory for linear model selection.
With comments and
a rejoinder by the author. Statist. Sinica 7, 221-264.
Other papers used in Candidacy presentation of Sophie Tsou:
Anderson, T.W. and Rubin, H. (1956) Statistical inference in factor
analysis. Proc. 3rd
Berk. Symp. 5, 111-150.
Tucker, L. (1966) Some mathematical notes on 3-mode factor analysis.
(Other) Papers covered this term:
Andrews, Donald W. K. and Ploberger, W. (1994) Optimal
Tests when a Nuisance
Parameter is Present Only Under the Alternative. Econometrica 62(6), pp. 1383-1414.
Ghosh, M. and Rao, J. (1994) Small area estimation: an appraisal.
Statist. Sci. 9.
(Other) Papers to be considered for the future:
Liu, M., Taylor, J. and Belin, T. (2000) Multiple
imputation and posterior simulation for
multivariate missing data in longitudinal studies. Biometrics 56, 1157-63.