Instructor: Eric Slud, Statistics Program, Math. Dept.
Office: Mth 2314, x5-5469, email evs@math.umd.edu
Office hours: M10, W12, F10
Course Text: K. Mardia, J. Kent, and J. Bibby, Multivariate Analysis, 1980,
Academic Press (paperback).
This text covers both theory and data examples, with ample verbal explanations
and motivation.
Recommended Texts: (i) Anderson, T.W., An Introduction to Multivariate
Statistical Analysis, 3rd ed. 2003, Wiley-Interscience.
This is a standard and authoritative, but very theoretical and fairly dry book,
with much deeper mathematical treatment than the Mardia, Kent and Bibby text.
(ii) Hardle, Wolfgang, and Simar, Leopold, Applied Multivariate Statistical
Analysis, 2003, Springer-Verlag.
This is a much newer text, emphasizing nonparametric techniques and
computational examples, primarily geared towards economics and finance.
Overview:
This course is about statistical models and methods of inference
for multivariate observations with dependent coordinates. Theoretical material
relates to the multivariate normal distribution and to the statistical sampling
behavior of empirical variance-covariance matrices and of various projections
and eigen-decompositions of them. Models studied include regression, principal
components analysis, factor models, and canonical correlations. In addition,
important atheoretical analysis methods such as clustering algorithms will be
discussed. All methods will be illustrated using computational data examples.
Prerequisite: STAT 420 or STAT 700.
Familiarity with some (any) statistical software package would be very helpful.
Probability theory material needed throughout this course includes joint
probability densities and change-of-variable formulas, the law of large numbers,
and the (multivariate) central limit theorem. In addition, the course makes
extensive use of linear algebra, especially eigenvalues and eigenspaces for
symmetric matrices.
The data exercises in the course require that you have access to a reasonably
powerful statistical software package, e.g. Splus/R, SAS, or even Stata,
Minitab, or others like SPSS. Good facility with MATLAB would also be enough.
I will do examples and provide software scripts in Splus/R, and can help you
get past coding difficulties in Splus or R, but probably cannot help much with
programming difficulties if you do your data exercises in other languages.
Course requirements and Grading: there will be 7 or 8 graded homework sets
(one every 1½ to 2 weeks) which together will count for 50% of the course grade.
These will be about evenly divided between theoretical problems and
computational data analysis problems. There will also be an in-class test and a
final examination [which will probably become a take-home or project depending
on class preferences], which will respectively count 20% and 30% toward the
overall course grade.
Some Datasets for the project and homework can be found here.
Also see the Hardle and Simar web-page.
© Eric V Slud, December 6, 2006.
Homework
Assignments, including any changes and hints, will continually be posted here.
The directory in which you can find old homework assignments and selected
problem solutions is Homework.
HW Set 7. This problem set will be due Monday, December 11.
R scripts for the problems have been provided (see below).
Problem 1. Simulate 1000 batches of normal data matrices of n=150 observations
with 6 columns, say with mean 0 and covariance matrix
2*1^{⊗2} + (1,-1,1,-1,1,-1)^{⊗2} + diag(2,3,2,3,2,3),
where v^{⊗2} denotes the outer product v v' and 1 is the 6-vector of ones.
For each such data matrix, perform a maximum likelihood factor analysis (with
q=2), and then use R function "promax" in package "mva" to do a Varimax
rotation (on the orthonormal-column matrix Lambda0 of factor loadings).
Tally for your 1000 datasets the total number of rotated factor coordinates in
the three ranges (0,.2), (.2,.8), and (.8,1), and report the results. See R Log
for a little R code and commentary on how to do this. It seems to me that one
cannot really know what the rotated factors mean statistically without seeing
what they typically do in studies like these.
Problem 2. The second problem is about discriminant analysis with
cross-validated estimation of misclassification probability, for the Swiss
BankNote data. Here are the steps which I would like you to follow. See R
Script on Discrimination methods for R steps, implemented on the Iris data, and
the resulting picture. The progression of steps is:
(a) To find the estimated optimal quadratic discriminant region for
discriminating genuine from forged banknotes, and the (Fisher) linear
discriminant, and code them both.
(b) To estimate the probabilities of misclassification. This is done both
through a theoretical (or simulated) multivariate-normal probability based on
the estimated parameters for the two banknote groups for the linear
discrimination regions in (a), and then also by a cross-validated estimation
procedure. The cross-validated procedure would successively leave out one or a
few observations from one or both of the two groups; re-estimate the
discrimination regions using the retained observations; and record whether the
omitted observations are properly discriminated (i.e. classified) or not,
tallying the overall relative frequencies of misclassification.
Note: for a small Log showing how to discriminate the BankNotes essentially
perfectly using two PC's, click here.
Problem 3. Use the R classification and clustering routines to see
(a) whether the grouping of US companies by industry sector can be reproduced
by a formal classification algorithm, and
(b) whether the unsupervised clustering algorithms ("diana", "agnes", and maybe
"kmeans") can provide any other meaningful grouping.
Use the data on US companies which can be found here. (It was downloaded from
the Hardle & Simar website: you can freely download html text (including verbal
description of data, i.e. descriptions of the variables which are listed in
Appendix B.5, pp. 455-6 of the Hardle & Simar book) and the ASCII datasets
themselves from the Hardle and Simar web-page.)
In this dataset, you might want to explore the possibility of generating new
columns (e.g. ratios or interactions between columns and ratios, formed from
existing columns) before applying the clustering algorithms, and you might want
to reduce the dimension of the resulting explanatory-variable sets via PC's to
try some sort of visual clustering, as was done in the small log referenced at
the end of Problem 2 above. See R Log on Clustering for some examples and
details on using the R clustering software. The "Dendrogram" pictures mentioned
in the Log for the small simulated data example are "agglomerative clustering"
and "divisive clustering".
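A rough sketch of the simulation step in Problem 1 above, written in Python
purely for illustration (the course scripts themselves are in R). It assumes
the covariance matrix is read as 2*1 1' + v v' + diag(2,3,2,3,2,3) with
v = (1,-1,1,-1,1,-1), so each row can be built from two independent
standard-normal factors plus independent noise. The helper names
(target_covariance, simulate_batch, tally_loadings) are hypothetical, and
tallying loadings by absolute value is an assumption about the intended ranges.
The maximum likelihood fit and the rotation are not shown.

```python
import random

V = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]    # alternating-sign loading direction
PSI = [2.0, 3.0, 2.0, 3.0, 2.0, 3.0]      # specific (diagonal) variances


def target_covariance():
    """Return 2*1 1' + v v' + diag(PSI) as a 6x6 nested list."""
    return [[2.0 + V[i] * V[j] + (PSI[i] if i == j else 0.0)
             for j in range(6)] for i in range(6)]


def simulate_batch(n=150, rng=random):
    """Draw n rows X = sqrt(2)*f1*1 + f2*v + eps; Cov(X) is the target matrix."""
    rows = []
    for _ in range(n):
        f1, f2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        rows.append([2.0 ** 0.5 * f1 + V[j] * f2 + rng.gauss(0.0, PSI[j] ** 0.5)
                     for j in range(6)])
    return rows


def tally_loadings(loadings):
    """Count |loading| values in (0,.2), (.2,.8), and (.8,1) (assumed reading)."""
    counts = [0, 0, 0]
    for x in loadings:
        a = abs(x)
        if a < 0.2:
            counts[0] += 1
        elif a < 0.8:
            counts[1] += 1
        else:
            counts[2] += 1
    return counts
```

Generating rows from the factor representation avoids needing a matrix square
root; the tally function would be applied to the 12 rotated loadings of each of
the 1000 fits and the three counts summed across batches.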
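The leave-one-out procedure described in Problem 2(b) above might be sketched
as follows, again in Python only for illustration. This is a simplified
version under stated assumptions: two variables per observation (for example
two PC scores), the Fisher linear rule only (not the quadratic rule), and a
midpoint threshold, which corresponds to equal prior probabilities. The
function names fisher_rule and loo_error are hypothetical.

```python
def mean2(rows):
    """Coordinate-wise mean of a list of 2-dimensional points."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]


def scatter2(rows, m):
    """2x2 matrix of sums of squares and cross-products about m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def fisher_rule(g0, g1):
    """Return a classifier x -> 0 or 1 based on w = S^{-1}(m1 - m0)."""
    m0, m1 = mean2(g0), mean2(g1)
    s0, s1 = scatter2(g0, m0), scatter2(g1, m1)
    dof = len(g0) + len(g1) - 2
    S = [[(s0[i][j] + s1[i][j]) / dof for j in range(2)] for i in range(2)]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    d = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [(S[1][1] * d[0] - S[0][1] * d[1]) / det,
         (S[0][0] * d[1] - S[1][0] * d[0]) / det]
    # Threshold at the projected midpoint of the two group means.
    c = w[0] * (m0[0] + m1[0]) / 2 + w[1] * (m0[1] + m1[1]) / 2
    return lambda x: 1 if w[0] * x[0] + w[1] * x[1] > c else 0


def loo_error(g0, g1):
    """Leave out each observation in turn, refit, and tally misclassifications."""
    errors = 0
    for label, group in ((0, g0), (1, g1)):
        for k in range(len(group)):
            kept = group[:k] + group[k + 1:]
            rule = fisher_rule(kept, g1) if label == 0 else fisher_rule(g0, kept)
            errors += rule(group[k]) != label
    return errors / (len(g0) + len(g1))
```

Calling loo_error(genuine, forged) on two lists of 2-dimensional points returns
the overall cross-validated misclassification frequency; per-group rates could
be tallied the same way by keeping separate counters.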
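For Problem 3(b) above, the idea behind agglomerative clustering can be
illustrated with a small sketch, again in Python rather than the R routines
named in the problem. It is not the "agnes" implementation: it simply merges
the two closest clusters repeatedly, using average linkage with Euclidean
distance (a choice made here for illustration), and records the merge order
that a dendrogram would display. The function names are hypothetical.

```python
import math
from itertools import combinations


def avg_linkage(a, b, pts):
    """Average Euclidean distance between points indexed by clusters a and b."""
    return sum(math.dist(pts[i], pts[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(pts, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(pts))]
    merges = []  # merge history, in the order a dendrogram would show it
    while len(clusters) > k:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: avg_linkage(clusters[ab[0]], clusters[ab[1]], pts))
        merges.append((sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return [sorted(c) for c in clusters], merges
```

The returned index lists can be cross-tabulated against the known industry
grouping to judge whether the clusters are meaningful; divisive clustering
works in the opposite direction, splitting one large cluster step by step.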
SYLLABUS for Stat 750
We will cover Chapters 1-13 of the Mardia, Kent, and Bibby book: topics include
the multivariate normal distribution; Wishart's and Hotelling's distributions;
tests of hypotheses, estimation, distribution of test criteria; generalized
distance, discriminant analysis; regression and correlation; multivariate
analysis of variance; principal components, canonical correlations, factor
analysis, and clustering.
OUTLINE
0. Probability & Linear Algebra Review.
(a) Multivariate normal distribution: alternative characterizations.
1. Wishart distribution; Hotelling T^2; Mahalanobis distance.
2. Statistics based on likelihood for multivariate normal data.
(a) Estimation (likelihood, sufficiency, MLE).
(b) Hypothesis testing techniques, including likelihood ratios; simultaneous
confidence intervals, multivariate parametric and nonparametric tests.
3. Multivariate regression.
(a) MLE, general linear hypothesis, multiple correlation, least squares,
variable selection.
4. Econometrics.
(a) Simultaneous equation and instrumental variables models.
(b) Comparison of estimators.
5. Principal Components Analysis.
(a) Definitions & sampling properties.
(b) Correspondence analysis.
(c) Principal components regression.
6. Factor Analysis.
(a) Definition of models, rotation of factors.
(b) Goodness of fit, relation with PCA.
7. Multivariate Analysis of Variance.
8. Cluster Analysis.
9. Permutation and bootstrap ideas in multivariate analysis.
Important Dates