Statistics 750 Multivariate Analysis

Spring 2022 MWF 9-9:50am,    PHY 2122

In-person class with assignments and additional materials on ELMS

Instructor: Professor Eric Slud, Statistics Program, Math Dept., Rm 2314, x5-5469, slud@umd.edu

Office hours: M 1-2, W 11-12 (initially), or email me to make an appointment (can be on Zoom).

Syllabus | Lecture Handouts | Statistical Computing (R) Handouts | Homework

Course Text: K. Mardia, J. Kent, and J. Bibby, Multivariate Analysis, 1980, Academic Press (paperback, free online).
This text covers both theory and data examples, with ample verbal explanations and motivation.

Recommended Texts: (i) R. Johnson and D. Wichern, Applied Multivariate Statistical Analysis, 6th ed. Available free as pdf online.
This is a popular and good applied book to be used as a source of examples and alternate, intuitive explanations.
(ii) We will also refer to topics from two widely referenced statistical-machine-learning books (both free online):
     C. Bishop (2006), Pattern Recognition and Machine Learning.
     T. Hastie, R. Tibshirani and J. Friedman (2009), The Elements of Statistical Learning, 2nd ed.
(iii) Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed. 2003, Wiley-Interscience.
Standard and authoritative, but theoretical and fairly dry, with deeper mathematical treatment than Mardia, Kent and Bibby.
(iv) Haerdle, W. and Simar, L. (2007) Applied Multivariate Statistical Analysis, 2nd ed., Springer.
Another good applied book, maybe at a slightly higher mathematical level than Johnson-Wichern. Available as free e-book to students through the UMD library.

Overview:    This course is about statistical models and methods of inference for multivariate observations with dependent coordinates. Much of the theoretical material relates to the multivariate normal distribution and to the statistical sampling behavior of empirical variance-covariance matrices and of various projections and eigen-decompositions of them. Models studied include regression, principal components analysis, factor models, and canonical correlations. In addition, important algorithmic or machine-learning methods like Clustering and Support Vector Machines will also be discussed. All methods will be illustrated using computational data examples in R.

Prerequisite:    STAT 420 or STAT 700. Familiarity with some (any) statistical software package would be very helpful, but familiarity with R would be best. The presentation will be geared to second-year Stat grad students.
Probability theory material needed throughout this course includes joint probability densities and change-of-variable formulas, law of large numbers and (multivariate) central limit theorem. In addition, the course makes extensive use of linear algebra, especially including eigenvalues and eigenspaces and singular value decompositions.

The data exercises in the course require that you have familiarity with and access to a reasonably powerful statistical software package, e.g. R, SAS, Python or MATLAB. I will do examples and provide software scripts in R, and can help you get past coding difficulties in R but can probably not help much with programming difficulties if you do your data exercises in other languages.

Course requirements and Grading: There will be 6 graded homework sets (one every 1½ to 2 weeks), which together will count for 45% of the course grade. These will be divided about evenly between theoretical problems and computational data-analysis problems. There will also be an in-class test (tentatively scheduled for Wed., March 16) and a final take-home or project, which will count 25% and 30%, respectively, toward the overall course grade.

The course project will be either a paper on a topic not fully covered in class, with mathematical content related to the material of the course and preferably with an illustrative data analysis, or an extended and coherent data analysis and writeup (of about 10-12 pages, not including computer output). It will be due by midnight, Monday May 16, 2022.
Text-file containing suggested topics, papers and book-chapters you might use in your Term Project.
All homework and take-home work will be handed in as uploaded *.pdf or *.doc files on ELMS.


Some Datasets for the project and homework can be found here. Another good source of larger and more challenging datasets is the UCI Machine Learning Repository. Different directories linked on this web-page contain R Scripts (mostly R Logs containing code, discussion, and interpretations of class material that you should work through yourself) and Handouts (mostly pdfs, with some additional R Scripts).


Statistical Computing Scripts and Handouts

I hope that you all have some familiarity with R. If you need a refresher or want to build your R skills quickly, try this basic intro, or the various useful files and relevant links on Statistical Computing on my course web-page for Stat 705. Another free and interactive site I recently came across for introducing R to social scientists is: https://campus.sagepub.com/blog/beginners-guide-to-r.
In general, I will put R Scripts relevant to the course material, including material from demonstrations, homework, and explorations of various R packages, into the R Scripts Directory.

(1). There are lots of very handy R packages for Multivariate Statistics calculations and displays that you can download directly, and I will tell you about them throughout the term. Some will be used as sources of interesting multivariate data, some for implementing the theory that we cover in class. A few of the packages that you will want to install on your own computers are: mclust, YaleToolkit, ICSNP, and chemometrics. For an overview of all of the R packages doing Multivariate Statistics tasks, try this Multivariate Stats link.
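
These packages can be installed in one step from CRAN; a minimal sketch (note the CRAN capitalization "YaleToolkit"):

> install.packages(c("mclust", "YaleToolkit", "ICSNP", "chemometrics"))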

(2). In the RScripts directory you will find, updated periodically, a set of scripts covering class R demonstrations and illustrating related topics and R packages you can use to solve homework problems. (Of course you can use other packages too, but please cross-check any new ones you find that we have not discussed in class, to make sure they compute things the way you expect and match the computing formulas we develop in class.)


Homework: Assignments, including any changes and hints, will continually be posted here. The most current form of the assignment will be posted also on ELMS. The directory in which you can find old homework assignments and selected problem solutions is Homework.

HW1 due Monday, Feb. 7, 11:59pm (upload to ELMS). Hints for HW1 can be found here.
Read Chapters 1 and 2 of Mardia, Kent and Bibby. Do problems # 2.5.1, 2.6.4, 2.7.1 in MKB, along with 4 additional problems (A), (B), (C) and (D) that are written out in the linked pdf document HW1Spr22.pdf in the Homework folder. All 7 are to be handed in (uploaded) Monday Feb. 7 on ELMS.
NOTE: There is a correction to the statement of Problem B.(iii), given explicitly on ELMS and also linked here. The revised form of the HW1Spr22.pdf document linked above reflects this correction, as of 5pm 2/7/22.


HW2 due Wednesday, Feb. 23, 11:59pm (upload to ELMS)
Reading assignment in MKB text: Chapter 3 omitting 3.4.3, 3.6.2 and 3.7-3.8; Ch. 4, Sec. 4.2 through 4.2.2.2, pp. 102-107;
and Ch. 5, Sections 5.2.1 and 5.3.1.

Some of Ch.3 is difficult reading, and you may find the coverage in Haerdle and Simar (2007, 2nd ed.) Chapter 5, Sec. 6.1, and Sec. 7.1 a little less demanding. If so, the Haerdle and Simar reading (compare the similar treatment in Johnson and Wichern) should be good enough.

Hand in the following 7 problems by 11:59pm Wednesday 2/23/22.
(I) Problems 3.2.6 on p.87, 3.4.2 on p.89, and 3.4.16 on p.92 in Chapter 3 of MKB.
(II) Exercise 5.10 on pp.159-160 of Haerdle and Simar, and Ex.4.21, p.205 in Johnson-Wichern.
(III) Generate 10,000 independent W4(Σ,10) random matrices, where Σ is the 4x4 diagonal matrix diag(1:4). Use this random sample to estimate the 0.5, 0.8, 0.9 and 0.95 quantiles of each of the 4 eigenvalues of a W4(Σ,10) matrix, and look at the histograms of these eigenvalues (10,000 observations for each eigenvalue) to see how each of them differs from a normal distribution.
(IV) In (III), you may use the rWishart function in base-R to do the main part of this simulation. Check as part of your simulation, using the distribution of the largest eigenvalue, that rWishart is giving you the same result as the simWish function coded in classroom demonstration from first principles.
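
As a hedged sketch of what (III) asks for, the following base-R code uses rWishart and eigen (the variable names Sigma, W, evals are my own, and the simWish comparison in (IV) is not reproduced here):

  ## 10,000 draws from W4(diag(1:4), 10), and quantiles of their eigenvalues
  Sigma <- diag(1:4)
  W <- rWishart(10000, df = 10, Sigma = Sigma)    # 4 x 4 x 10000 array
  evals <- apply(W, 3, function(m)
      eigen(m, symmetric = TRUE, only.values = TRUE)$values)  # 4 x 10000, sorted decreasing
  apply(evals, 1, quantile, probs = c(0.5, 0.8, 0.9, 0.95))   # quantiles, per eigenvalue
  hist(evals[1, ], breaks = 50)   # largest eigenvalue; repeat for rows 2:4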


HW3 due Saturday, Mar. 12, 11:59pm (upload to ELMS)

Topics: Power of Tests, UITs, and Regression.
Reading: in-class Chapter 3 material related to the MKB proof of Theorem 3.5.2;
Chapter 4, Section 4.2.2.2, and (if you like) Section 5.3.1b in MKB;
Chapter 7 in Johnson and Wichern, or Ch.6 in Mardia, Kent and Bibby.

(I)(10 points) MKB #3.5.1(c). Problem on power of 1-sample Hotelling T² tests at a particular value   μ1 ≠ μ0 = 0.

(II)(15 points) Recall the dataset "Coated" in the package MVTests (two samples, associated with different "Coatings", each consisting of 15 2-dimensional observations of a Depth and a Number) that we used in class, in the HotellingT2.RLog script in the RScripts directory of the course web-page. We performed a two-sample Hotelling T² test and used the MVTests function "TwoSamplesHT2" to reproduce the exact statistic value and p-value. That function also outputs simultaneous confidence intervals $CI for the differences between the Depth and Number mean-parameters across the two coatings. Those simultaneous confidence intervals are the ones you would get from implementing the two-sample test as a Union-Intersection Test (UIT), as discussed in class on 2/25. Show exactly how you would compute those simultaneous confidence intervals, verifying to 6 decimal-place accuracy that they agree with the confidence intervals (at the default confidence level 0.95) produced by "TwoSamplesHT2".
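
For reference, here is a minimal sketch of the UIT simultaneous confidence intervals in the coordinate directions (simCI is my own helper, not from the class script; X1 and X2 are assumed to be the n1 x p and n2 x p data matrices for the two coatings):

  ## Simultaneous level-`level` CIs from the two-sample Hotelling T^2 UIT
  simCI <- function(X1, X2, level = 0.95) {
    n1 <- nrow(X1); n2 <- nrow(X2); p <- ncol(X1)
    dbar <- colMeans(X1) - colMeans(X2)
    Spool <- ((n1-1)*cov(X1) + (n2-1)*cov(X2)) / (n1+n2-2)    # pooled unbiased covariance
    c2 <- (n1+n2-2)*p/(n1+n2-p-1) * qf(level, p, n1+n2-p-1)   # T^2 critical value
    half <- sqrt(c2 * (1/n1 + 1/n2) * diag(Spool))
    cbind(lower = dbar - half, upper = dbar + half)
  }

Matching these endpoints to the $CI output of "TwoSamplesHT2" to 6 decimals is the verification asked for.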

(III)(15 points) (a) Consider the problem of testing H0: μ = k μ0 (k a scalar) in terms of an n×p Normal(μ,Σ) data-matrix X, where μ0 is known and (k, Σ) are unknown. Find the Likelihood Ratio Test, including the exact distribution of the test statistic. (The MKB book has MLE information in Ch.4 to help you with this.)
(b) Consider the cork tree data from Table 1.4.1 on MKB p.12. (This is available also as dataset box.cork in the R package "agridat".) Apply the test you found in (a) to this dataset with n=28, p=4, μ0 = 1 (vector of 4 1's), and give the p-value of the test.

(IV) This problem investigates the Fisher iris data (which can also be found in the MVTests package), using the R script provided in http://www.math.umd.edu/~evs/s750/Rscript.Iris. (a) Perform and interpret a little Monte-Carlo simulation, using the estimated covariance matrix as the true parameter; and (b) bootstrap the iris data, that is, repeatedly draw iid samples of the same size (n=150) by sampling equiprobably with replacement from the original set of 150 vectors of 5 variables (one of which is the 3-valued categorical label "Species"), to check the deficiencies of the Wishart distribution as the distribution of the sample covariance matrix for the iris data.
I have coded little R functions in Rscript.Iris to do the Simulation and Bootstrap sampling, respectively from the multivariate normal with the covariance of the iris data and from the iris dataset itself.
The point of this exercise is to run these functions to determine whether the Wishart distribution seems adequate for this dataset (whose data are considerably more granular than is strictly compatible with a multivariate normal distribution). Summarize your findings from the simulation concerning the differences between your bootstrapped covariance matrices from the iris data and the corresponding Wishart matrices that would arise if the data were multivariate normal (with the same mean and covariance).
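
A minimal sketch of the bootstrap-vs-Wishart comparison (the instructor's versions are in Rscript.Iris; the names X, boot.cov, wish.cov below are my own, and the covariances are computed from the 4 numeric measurement columns):

  data(iris)
  X <- as.matrix(iris[, 1:4])   # the 4 numeric measurement columns
  B <- 2000
  boot.cov <- replicate(B, cov(X[sample(nrow(X), replace = TRUE), ]))    # 4 x 4 x B
  wish.cov <- rWishart(B, df = nrow(X)-1, Sigma = cov(X)) / (nrow(X)-1)  # normal-theory analog
  ## compare, e.g., the spread of a single covariance entry under the two schemes:
  c(boot = sd(boot.cov[1, 2, ]), wishart = sd(wish.cov[1, 2, ]))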

(V) Do Problems 7.7 (p.421) and 7.21 (p.425) in Johnson and Wichern, 6th ed.

HW4 due Sunday, April 10, 11:59pm (upload to ELMS),   65 points total

Topics: PCA and Factor Analysis.
Reading: Chapters 8 and 9 in MKB; Chapter 8 through 8.4 and Chapter 9 through 9.4 in Johnson and Wichern.

(I). Do problems 8.2.2-8.2.4 combined into one problem (counts 15 points), 8.4.2, 8.8.1 in MKB, and 8.10 in Johnson-Wichern, p.473.

(II). Here are two more problems on Factor Analysis: Johnson and Wichern #9.9 and #9.28. For the data on international Women's track records in problem 9.28, step through the PCA analysis in #8.18 [but do not hand that analysis in, only use it to inform your factor analysis].
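
For the PCA step in #8.18, a minimal base-R sketch (X is assumed to hold the numeric track-records columns):

  pc <- prcomp(X, scale. = TRUE)   # PCA on the correlation scale
  summary(pc)                      # proportion of variance explained per component
  pc$rotation[, 1:2]               # loadings of the first two principal components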


HW5 due Monday, April 25, 11:59pm (upload to ELMS),   60 points, 6 problems total

Topics:   Factor Analysis, Canonical Correlations, Cluster Analysis
Reading: Canonical Correlations, Ch.10 in MKB or Johnson & Wichern
Cluster Analysis, Ch.13 MKB or Ch.12 Johnson & Wichern through 12.6

(I). Do problems 9.6.1, 13.2.1, 13.3.2 in MKB, and 12.7 in Johnson and Wichern.

(II). Do problem 10.2.10 in MKB, but you need not use problem 10.2.9 to find the canonical correlations; use any method you like. As part of the same problem [to be handed in], perform a hypothesis test of whether reading ability is correlated with arithmetic ability (using multivariate-normal assumptions, as in the MKB chapter).

(III). Consider the banknote data (in the mclust package), restricting yourself to the first 100 observations (the genuine banknotes) and omitting the 1st and 6th columns ("Status" and "Top"). A minimal R sketch follows below.
(a) Do a factor analysis with 2 factors and "varimax" rotation, and interpret the factors as well as you can.
(b) In the factor model with 2 factors, general diagonal   Ψ,  and side condition that   Λ' Ψ⁻¹ Λ   is diagonal, test the hypothesis that a 2-factor model is adequate.
(c) In the factor model with 2 factors and Ψ of the form σ² times the identity matrix, estimate the loadings and σ², and test whether this model is adequate.
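
A hedged sketch for parts (a)-(b) using base R's factanal; part (c), with isotropic Ψ = σ²I, needs a custom likelihood and is not shown:

  library(mclust)                       # for the banknote data
  data(banknote)
  X <- banknote[1:100, -c(1, 6)]        # genuine notes; drop Status and Top
  fa <- factanal(X, factors = 2, rotation = "varimax")
  print(fa$loadings, cutoff = 0.3)      # interpret the two rotated factors
  fa$PVAL    # LR-test p-value that 2 factors suffice (under factanal's canonical constraint)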


HW6 due Tuesday, May 10, 11:59pm (upload to ELMS),   4-part problem with point values indicated, totaling 65 pts.

Read in the dataset HCV from the UCI Machine Learning Repository, omitting the first column and deleting the 7 subjects in the "suspect Blood Donor" category, using the following R statements:

> HCV = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00571/hcvdat0.csv")[,-1]
  HCV = HCV[HCV$Category != "0s=suspect Blood Donor",]   # drop the 7 "suspect" donors
  HCV$Category = factor(HCV$Category)                    # re-factor to drop the now-empty level
  HCV$Sex = ifelse(HCV$Sex=="m", 1, 2)                   # recode Sex: m=1, f=2

After these steps, columns 2:13 (Age through PROT) are all numeric, and column 1, the factor Category, is the class label that we hope to reproduce via clustering.

Since the measurements in columns 4:13 are very skewed (positive), convert them to logs before proceeding:
>  for(i in 4:13) HCV[,i] = log(HCV[,i])
You may find that you want to take further transformations of the columns before going on to other steps (informed by how those steps turn out if you do not transform).
  
There are a total of 31 missing values in the remaining data. Replace them by the average of the values in their respective columns within the same "Category" (after doing whatever nonlinear transformations on the columns seem appropriate to you, like the logs above). For example, if ALP is missing in a row with Category = "3=Cirrhosis", then replace it by the average of all ALP values in the "3=Cirrhosis" records.
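
A minimal sketch of this within-Category mean imputation (applied after the log transforms above):

  for (i in 4:13) {
    miss <- is.na(HCV[, i])
    if (any(miss)) {
      catmeans <- tapply(HCV[!miss, i], HCV$Category[!miss], mean)  # per-Category column means
      HCV[miss, i] <- catmeans[as.character(HCV$Category[miss])]
    }
  }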

(I)(20pts) Apply k-means, agnes (separately with method="single", method="complete", and method="average"), diana, and at least one other clustering algorithm (your choice) to these data; a sketch of the basic calls is given after (B) below. You might want to pre-process the data (to make all the measurement units and indicators comparable) by re-scaling so that all the numeric columns have variance 1. Some of these clustering methods will give nonsensical answers (e.g., singleton clusters), so you may want to choose a distance function yourself by creating a "dissimilarity matrix" in some way other than simply using L2 or L1 distances.
    (A) Calculate the confusion matrices for the methods you try that create 4 clusters.
    (B) Calculate the "silhouette" for each clustering method with 3 to 6 clusters, and explain, for each method, what the best number of clusters to choose is based upon the "silhouette".
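
A hedged sketch of the basic calls for part (I), using the cluster package (the variance-1 re-scaling and the 4-cluster choice here are just one option):

  library(cluster)
  Xs <- scale(HCV[, 2:13])                      # numeric columns, each re-scaled to variance 1
  km <- kmeans(Xs, centers = 4, nstart = 25)
  ag <- agnes(Xs, method = "average")           # also try "single" and "complete"
  di <- diana(Xs)
  table(Cluster = km$cluster, Truth = HCV$Category)          # (A): a confusion matrix
  sil <- silhouette(cutree(as.hclust(ag), k = 4), dist(Xs))  # (B): silhouette for k = 4
  summary(sil)$avg.width                                     # repeat for k = 3:6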

(II)(15pts) Apply a PC or Factor Analysis method to the dataset, ignoring the Sex and Category columns, and find the best grouping of columns you can into at most 4 groups for describing these data.

(III)(10pts) Using [only] the PCs or Factors you found in part (II), together with Sex, re-do the clustering methods you tried in part (I). Did you lose clustering information or precision in passing from (I) to these new clusterings?

(IV)(20pts) Pick the single most successful of the clustering methods you tried in (I) (or possibly (III)), and assess its accuracy by computing a Confusion Matrix and the (average) Sensitivity and PPV of the 4 clusters across 2000 resampled datasets created by the Nonparametric Bootstrap. By looking at these same performance metrics for the clusters in each of the 2000 resampled datasets, provide the best summary you can of how stable your clustering method is.
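
One hedged sketch of per-class Sensitivity and PPV from a clustering (clusterPerf is my own helper, using the simple convention of matching each cluster to its majority Category; other matching rules are possible):

  clusterPerf <- function(cluster, truth) {
    tab <- table(cluster, truth)
    majority <- colnames(tab)[apply(tab, 1, which.max)]  # majority class of each cluster
    pred <- factor(majority[cluster], levels = levels(truth))
    cm <- table(pred, truth)                             # square confusion matrix
    list(Sensitivity = diag(cm)/colSums(cm), PPV = diag(cm)/rowSums(cm))
  }
  ## apply this to each of the 2000 bootstrap-resampled datasets and summarize the spread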

NOTE: because the "0=Blood Donor" category is so large compared to the others, you may subset it if you like; i.e., if you think it will clarify your clustering, you may reduce the size of the dataset by retaining a random subset (generated once) of 75 of the 533 "0=Blood Donor" records.




Lecture Handouts: see the directory for handouts    and    the directory for R Scripts,   some dated by Lecture.

(A) File of keywords for topics covered, lecture by lecture.

(B) Handout on Spherical Symmetry cleans up some unfinished business from the January 31 lecture on the Spherical Symmetry topic.

(C) Handout on EM Algorithm to supplement our class coverage of EM and its application to ML estimation in the Factor Model.

(D) Text-file containing suggested topics, papers and book-chapters you might use in your Term Project.

(E) A beautifully written tutorial introduction to Spectral Clustering supplements the material we covered briefly on Spectral Clustering in class on April 22, with a little more illustration on implementation (on an ideal dataset and on the iris data) in the R Script IrisCluster.RLog. That Script contains a quick overview of all the clustering methods we have covered, implemented in R.

(F) Handout summarizing bootstrap ideas for Large-p Clustering, based on background bootstrap material plus several journal papers. References and brief notes on the papers can be found here.

(G) The paper Kernel methods in machine learning (Annals of Statistics, 2008) is a sophisticated reference, serving as further reading on kernel methods in multivariate statistics beyond what we cover in the final 3 lectures.



SYLLABUS for Stat 750

We will cover Chapters 2, 3, 5-10, and 13 of the Mardia, Kent, and Bibby book thoroughly: topics include the multivariate normal distribution, the Wishart and Hotelling distributions; tests of hypotheses, estimation in the general linear model, distributions of test criteria; generalized distance, principal components, canonical correlations, factor analysis, and clustering. Other chapters and topics in MKB will be touched on more lightly, and material will be taken from the other books and some journal papers. Specific references on machine-learning topics and applied examples will be added as the term progresses.

OUTLINE

0. Overview/Introduction: Matrix and Data Structures (2 Lectures)
(a). Data transformations and displays (Ch.1 in MKB and Ch.1 in Johnson-Wichern)
(b). Models; Data Representation versus Sampling Distribution Theory
1. Linear Algebra & Probability Review. (4 Lectures, MKB Ch.2)
(a) Projection matrices and Singular Value Decomposition
(b) Multivariate normal distribution: alternative characterizations
(c) Other multivariate distributions: spherically symmetric and elliptical distributions
2. Wishart distribution; Hotelling T²; Mahalanobis distance. (4 Lectures, MKB Ch.3)
(a) Multivariate data transformations
(b) Distributions of multivariate test statistics
3. Statistics based on likelihood for multivariate normal data (4 Lectures, MKB Ch.4-5)
(a) Likelihood, sufficiency and ML Estimation
(b) Hypothesis testing techniques, including likelihood ratios (Wilks Thm)
    simultaneous confidence intervals, multivariate parametric and nonparametric tests
4. Multivariate regression. (4 Lectures, MKB Ch.6 and 12)
(a) MLE, general linear hypothesis, multiple correlation, least squares, variable selection
(b) Idea of MANOVA
5. Econometric Ideas (3 Lectures, MKB Ch.7)
(a) Simultaneous equation and instrumental variables models
(b) Comparison of estimators.
6. Principal Components Analysis. (4 Lectures, MKB Ch.8)
(a) Definitions & sampling properties
(b) Principal components regression
(c) Extensions (Sparse PCA)
7. Factor Analysis. (3 Lectures, MKB Ch.9)
(a) Definition of models, rotation of factors
(b) Goodness of fit, relation to PCA
8. Cluster Analysis. (4 Lectures, MKB Ch.13)
(a) Divisive and Aggregative Hierarchical Clustering; Dendrograms
(b) Evaluation of Cluster effectiveness by replication methods
9. Miscellaneous Data-Analytic and Machine-Learning Ideas in Multivariate Stats (8 Lectures)
(a) Permutational and Bootstrap ideas in Multivariate Stats
(b) Discrimination and Support Vector Machines
(c) Multidimensional scaling
(d) Sparsity and high-dimensional data
(e) Projection Pursuit



Important Dates: in-class test Wed., March 16; homework due dates as posted in the Homework section above; course project due by midnight, Monday, May 16, 2022.



My home page.

The UMCP Math Department home page.

The University of Maryland home page.

© Eric V Slud,  May 4, 2022.