Growing as a Function of Sample Size

**Mon. 3-4, Rm Mth 2400
Fall '05**

**Eric Slud
Statistics Program,
Math Department
Rm 2314 x5-5469**

Interested participants should get in touch with
me at **evs@math.umd.edu**

**Research Focus:** Large datasets arise naturally in many areas of science, government, and business. Typically, as a dataset grows, the complexity of the questions one addresses with it also increases. Such problems range from standard parametric models whose parameters are allowed to vary with cross-classifying variables, to problems in regression and time series which lead naturally to the development of order-selection techniques, to Semiparametric Statistical Inference (problems where the nuisance parameters are infinite-dimensional but may be approximated in some sense by finite-dimensional parameters of growing dimension), to Classification and Clustering.

Growing parameter dimension violates the formal setting of most textbook mathematical statistics, in which parameter-dimension is fixed and sample size increases to infinity. The new setting requires a new Asymptotic Theory which explicitly recognizes the controlled growth of the parameter-space of a probability model as a function of the dataset size.
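
A small simulation can make the failure of fixed-dimension asymptotics concrete. The sketch below uses the classical paired-observations setting of Neyman and Scott (1948, listed among the papers on this page): each pair has its own nuisance mean, so the number of parameters grows in proportion to the sample size. The function name `neyman_scott_mle`, the spread of the nuisance means, and the sample sizes are illustrative assumptions, not part of the seminar materials.

```python
import numpy as np


def neyman_scott_mle(n_pairs, sigma2=1.0, seed=0):
    """MLE of sigma^2 from n_pairs pairs X_i1, X_i2 ~ N(mu_i, sigma^2)."""
    rng = np.random.default_rng(seed)
    # Hypothetical nuisance means, one per pair (their values do not matter).
    mu = rng.normal(0.0, 5.0, size=n_pairs)
    x = mu[:, None] + rng.normal(0.0, np.sqrt(sigma2), size=(n_pairs, 2))
    # The MLE of each mu_i is the pair mean; plug it into the likelihood.
    pair_means = x.mean(axis=1, keepdims=True)
    return np.sum((x - pair_means) ** 2) / (2 * n_pairs)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        est = neyman_scott_mle(n)
        # Doubling the MLE removes the asymptotic bias in this special case.
        print(f"pairs={n:>9d}  MLE={est:.4f}  corrected={2 * est:.4f}")
```

With the true variance set to 1, the uncorrected estimate settles near 1/2 rather than 1 however many pairs are added, because every new pair brings a new nuisance mean along with it.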

This Research Interaction seminar on mathematical and statistical topics in Large Cross-Classified Datasets broadly encompasses the overlap of my students' thesis projects and most of my own current research interests.

The reading list presented below for this RIT has a somewhat more theoretical flavor than those in related past RITs I have run. However, the problem descriptions and applications which we discussed in the past are closely related to those we will discuss this term, and should serve as very good motivation for students in search of good research problems which mix theory and applications.

**Graduate Prerequisites:** To benefit from this research activity, a graduate student should have completed Stat 700 and at least one of Stat 740, 741, 750, or 770, and have some familiarity with Statistical Computing at the level of Stat 430 (SAS programming) or Stat 798C (Splus or R).

**Undergraduate Prerequisites:** An interested undergraduate should have had at least one course in Mathematical Statistics (e.g. Stat 401 or 420) and experience, either in courses or projects, with numerical computing or data analysis.

**Graduate Program:** Graduate students will be involved in reading and presenting papers from the statistical literature concerning provable properties of models and statistical inference methods related to large cross-classified data structures, including longitudinal data and spatial or survey data cross-classified or stratified in terms of many observed covariates. In some cases, students may explore and present software for the statistical analysis of some of the data structures studied. (One example would be **GEE** or *generalized estimating equation* methods for longitudinal data.)

**Undergraduate Program:** Undergraduate students will be involved primarily in comparative numerical experiments involving algorithms for simulating and analyzing the large cross-classified data structures we study.

**Work Schedule:** We will meet weekly in the fall of 2005. Students will choose from the following list of Topics and Papers (which will regularly be augmented on this web-page) and present the material in subsequent weeks, after an introductory couple of weeks' talks by me. Presentations can be informal, but should be detailed enough and present enough background that we can understand the issues and ideas clearly. It is expected that many presentations will extend to a second week.

**Topics by Keyword:
misspecified regression models,
random-effect GLM's,
regression-variable & model selection,
principal components analysis & factor analysis,
asymptotics for linear models with growing numbers of regressors,
errors-in-variables (`measurement error') models,
longitudinal models & GEE methods (Generalized Estimating Equations),
Panel Data Econometrics,
classification and clustering in large datasets,
experimental design (`response surface methodology')**

**Topics by Theoretical Idea:
Rates of growth of parameter dimension p(n) compatible with consistency, asymptotic normality & efficiency of estimators,
Linear model asymptotics; GLM extensions using exponential families,
Profile and adjusted profile likelihoods,
Projected score and Hilbert space techniques,
Estimating equations,
Other modified profile likelihoods,
Properties of automatic model selection methods,
Methods related to misspecified models.**

**Relevant papers read in past terms (which might be revisited):**

Neyman, J. and E. Scott (1948) *Consistent estimates based on partially consistent observations.* Econometrica **16**, 1-32.

Self, Steven G. and Liang, Kung-Yee (1987) *Asymptotic properties of maximum likelihood estimators and likelihood ratio tests under nonstandard conditions.* Jour. Amer. Statist. Assoc. **82**, 605-610.

White, Halbert (1982) *Maximum likelihood estimation of misspecified models.* Econometrica **50**, no. 1, 1-25.

**Important papers on the focused RIT topic:**

Chen, Ru (2005) **Misspecified Models with Parameters of Increasing Dimension.** University of Maryland College Park Thesis, Statistics Program, August 2005.

and asymptotic normality and variances of least-squares estimates within fixed-effect linear regression models when the data actually are governed by mixed-effect models.

He, X. and Shao, Qi-Man (2004) *On parameters of increasing dimensions.* Preprint.

Jiang, Jiming (1999) *Conditional inference about generalized linear mixed models.* Ann. Statist. **27**, 1974-2007.

Li, Haihong, Lindsay, Bruce G. and Waterman, Richard P. (2003) *Efficiency of projected score methods in rectangular array asymptotics.* J. Roy. Statist. Soc. Ser. B

that when numbers of cluster-level nuisance parameters increase overall as the square root of data-size, uncorrected MLE's are generally consistent but inefficient while a suitable "projected score" correction restores efficiency.

Lindsay, Bruce, Clogg, C., and Grego, J. (1991) *Semiparametric estimation in the Rasch model and related exponential response models, including a simple latent class model for item analysis.* J. Amer. Statist. Assoc.

concerning models that are very important in educational statistics that also exhibit the "two-index asymptotics" phenomenon but usually collapse the nuisance parameters' dimension by treating them as random effects.

McCullagh, P. and Tibshirani, R. (1990) *A simple method for the adjustment of profile likelihoods.* J. Roy. Statist. Soc. Ser. B

(sometimes high-dimensional) nuisance parameters by replacing them in the likelihood by suitable estimators from the data. Various "adjustments" to the resulting likelihood may improve the properties of the resulting ML estimator of the parameter of interest.

Pfanzagl, J. (1993) *Incidental versus random nuisance parameters.* Ann. Statist. **21**, 1663-91.

Portnoy, Stephen (1988) *Asymptotic behavior of likelihood methods for exponential families when the number of parameters tends to infinity.* Ann. Statist.

of consistency and asymptotic normality results for linear regression with number of predictors growing as a function of sample size.

Sartori, N. (2005) *Modifications to the profile likelihood in models with incidental nuisance parameters.* Preprint.

Slud, E. and Vonta, F. (2005) *Efficient semiparametric estimators via modified profile likelihood.* Jour. Statist. Planning & Inf.

constructing semiparametric efficient estimators (of finite-dimensional structural parameters in the presence of infinite-dimensional or functional nuisance parameters) by means of a 'modified profile likelihood'. One of the ingredients in this approach is to find consistent preliminary estimators, for which the techniques of the other papers to be studied in this RIT should be very helpful.

Wei, C. Z. (1992) *On predictive least squares principles.*
Ann. Statist. **20**, 1-42.

Yang, Yuhong, and Barron, Andrew R. (1998) *An asymptotic property of model selection criteria.* IEEE Trans. Inform. Theory

Yohai, V. and Maronna, R. (1979) *Asymptotic behavior of M-estimators for the linear model.* Ann. Statist.

regression (estimation using unknown or mis-specified error distribution) which shows clearly -- if somewhat technically -- how consistency and asymptotic normality of estimators result from specific assumptions on the rate of growth of the number of predictors with sample size. I would like someone to present the (very accessible!) parts of the proof relating to consistency, i.e., to Theorem 2.

Many additional references on large datasets with particular reference to DNA Microarrays (used in AMSC seminars in Fall '01 and RIT in Spring '02) can be found here. A web-site of microarray references, created by a statistician at LSU named Barry Moser, may also be helpful. There is also a Special Issue of Statistical Science (Feb. 2003) on *Statistical Challenges and Methods for Microarray Analysis* which contains survey articles and bibliographies with many items of methodological interest for large cross-classified genomic datasets. Special Issues of other journals (including Statistica Sinica) also were devoted to the topic. See further 2002 RIT materials **here**.

In 2003 the RIT broadened to encompass non-genomics problems under the heading of **Statistics of Large Cross-Classified Datasets**. You can see the background discussion, readings and talk titles at the RIT '03 web-page. Finally, in Spring '04, the Large Cross-Classified Datasets considered became (mostly biomedical) multicenter studies unified by random effects models under the trendy title of "Meta-Analysis", with web-page here.

- **Organization & Overview on `Two-Index Asymptotics':** Sept 12, Eric Slud
- **Adjusted Profile Likelihood:** Sept 19, Nadarajasundaram Ganesh (*1990 JRSSB paper of McCullagh & Tibshirani*)
- **Modified profile likelihood ideas:** Sept 26, Eric Slud
- **Two-Index Asymptotics & Projected Score Methods:** Oct 3, Yabing Mai (*2003 JRSSB paper of Li, Lindsay and Waterman*), spilling over a little into following meeting on Oct. 3
- **Cross-Classified SEER survival data:** Oct. 10, Hantao Mai (*general intro, related to a Stat.Med. paper of Royston*)
- **Mini-course Lecture on Factor Analysis:** Oct. 17, E. Slud, 4pm Mth 3206
- **Cross-classified cancer survival data, cont'd:** Oct. 24, Hantao Mai
- **Rasch model paper of Lindsay, Clogg, & Grego (1991):** Oct. 31, Ziliang Li
- No RIT meeting November 7: Stat Seminar meets at unusual time at 4-5pm on that day.
- Last meeting for term will be **Nov. 28**, at which I will relate my students' and my work on **cross-classified factor analysis and Two-Index Asymptotics** to the results we discussed earlier (Oct. 3) in connection with the Li, Lindsay and Waterman (2003 JRSSB) paper.

*Please get in touch with me at* **evs@math.umd.edu** *to volunteer to read a paper and give a talk.*


##### © Eric V Slud, November 17, 2005.