LECTURE KEYWORD SUMMARY, MULTIVARIATE STATISTICS, STAT 750, Spring 2022
=======================================================================
Lec.1, 1/24/22
Keywords to begin with:
Data structure       ** table n x (p+d), n = #obs, p = #outcomes Y_i, d = #predictors X_j
Data Display         ** summaries of outcomes and predictors by variable, scatterplots of Y's vs X's
Data Transformation  ** linear transformation, projection, centering and rescaling, subsetting by group, conversion to ranks, other nonlinear recoding
Subsetting X's -- Variable Selection
Model Selection      ** simultaneous subsetting of X's and Y's so that groups of X's are suitable for predicting subsets of Y's (examples: recommender systems, genomics)
Statistics           ** sampling distribution (theoretical)
                     ** exact calculation of density under a model versus Monte Carlo empirical distribution (use multivariate t or Wishart as examples)
                     ** reference distribution under the null hypothesis
Univariate models    ** single Y modeled conditionally given multiple X
Multivariate models  ** multiple outcomes Y modeled, maybe conditionally given X
================
Lec.2, 1/26/22
Data Display         ** correlation (Pearson or Spearman?) pairwise within Y's, within X's
Data Transformation  ** conversion to ranks, other nonlinear recoding
Classification/Discrimination ** groups g pre-defined via Y's, mapping to be defined as f(X); primarily "supervised" with true labels, sometimes "semi-supervised"
================
Lec.3, 1/28/22
Matrix algebra (see the Appendix with that title in Mardia, Kent and Bibby)
Definitions of column-space, row-space, rank, nonnegative-definite
Master result: Singular Value Decomposition, which contains the spectral representation of symmetric nonnegative-definite (covariance) matrices
Corollaries: projection matrices via the SVD; symmetric square root of a covariance matrix; verification of the formulas for trace and det respectively as sum and product of eigenvalues
Expression for the joint density f(x) as limiting probability per unit volume for small boxes decreasing to the point x
================
Lec.4, 1/31/22
Review of the Jacobian change-of-variable formula for the probability density of a smooth, smoothly invertible function Y = g(X) of a random vector X with density f(x)
Spherical symmetry (rotational invariance) for a random vector
Examples of spherically symmetric joint densities
Fact: for a rotationally symmetric random p-vector X, R = length(X) and X/R are independent random variables, with X/R uniformly distributed on the surface of the p-dimensional sphere (see pdf handout 2. on this topic).
================
Lec.5, 2/2/22
Conclusion of the rotational-symmetry topic; hints on Exercises
Equivalent definitions of the multivariate normal: via the density, via the ch.f., and as an affine transformation of a vector with iid N(0,1) entries.
=================
Lec.6, 2/4/22
Run-through of properties of the multivariate normal: mean, variance, independence equivalent to uncorrelatedness; generalized inverse of the covariance matrix in the singular case; density of the multivariate normal on an affine subspace in the singular-covariance case; maximum-probability (or minimum-volume for fixed probability) sets as ellipsoids.
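The matrix-algebra corollaries of Lec.3 (spectral representation, symmetric square root, trace and det as sum and product of eigenvalues) can be checked numerically. A minimal sketch in Python/NumPy (used here in place of the course's R; the matrix below is randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a random symmetric positive-definite (covariance-type) matrix S
A = rng.standard_normal((5, 3))
S = A @ A.T + np.eye(5)

# Spectral representation S = V diag(lam) V'  (eigh is for symmetric matrices)
lam, V = np.linalg.eigh(S)

# Symmetric square root: S^{1/2} = V diag(sqrt(lam)) V'
S_half = V @ np.diag(np.sqrt(lam)) @ V.T
```

The same eigenpairs verify trace(S) = sum(lam) and det(S) = prod(lam), and S_half @ S_half recovers S.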
================
Lec.7, 2/7/22
Conditional density of Y given X when these random vectors are jointly multivariate normal
Multivariate CLT as justification for the multivariate normal
Mixtures of multivariate normal densities
Maximum likelihood estimation from iid multivariate normal samples
Sufficient statistics and likelihood ratio tests for the mean in the multivariate-normal setting
===============
Lec.8, 2/9/22
Conditional densities for one multivariate normal subvector given another
===============
Lec.9, 2/11/22
Xbar and S as MLE's
Formulation of the multivariate normal parameter space and hypotheses
===============
Lec.10, 2/14/22
Likelihood ratio test (LRT) and Wilks' Theorem
LRT for the null hypothesis of a specified multivariate normal mean (one-sample case) with unrestricted unknown covariance matrix
Wishart distribution, Mahalanobis distance
===============
Lec.11, 2/16/22
Hotelling T^2 distribution
Independence of Xbar and S based on a multivariate normal data matrix
Independence of weighted combinations of rows of an n x p multivariate normal data matrix based on n-dimensional orthonormal vectors of weights
===============
Lec.12, 2/18/22
Two-sample tests of same versus different means in sampled populations with unknown unrestricted variance matrix assumed to be the same across samples
R script and demonstration of one- and two-sample tests and simulation of p-values
Further distributions arising in multivariate normal hypothesis tests (end of Ch.3 MKB)
----------------
Lec.13, 2/21/22
Accuracy of Monte Carlo calculations of distributional percentage points and p-values
Proof that the Hotelling T^2(p,m) distribution is the same as (mp/(m-p+1)) * F_{p,m-p+1}
----------------
Lec.14, 2/23/22
Catalogue of hypothesis tests we obtain for multivariate normal means and variances using the Likelihood Ratio Test, and also using the Union Intersection Test idea
Template for obtaining new hypothesis tests based on differently constrained parameters
Relationship between UIT's and simultaneous confidence intervals.
------------------
Lec.15, 2/25/22
Two-sample LRT for equality of covariance matrices
More on UITs and simultaneous CIs: derivations in cases
------------------
Lec.16, 2/28/22
Introduction of the Multivariate Regression Model; motivation by comparison with univariate regression models, and derivation of the MLEs for the coefficient matrix B and outcome covariance matrix Sigma
------------------
Lec.17, 3/2/22
Demonstration that B-hat and the residual matrix U-hat are independent in the multivariate-normal regression model, and verification of the Wishart distribution for Sigma-hat
Computational demonstration of model fitting and hypothesis tests for correlation between outcome variables in multivariate regression, and of the relation between the conditional distribution of residuals (one column given the others) and the comprehensive univariate regression model for one column $Y^{(j)}$ in terms of X and of the other outcome columns $Y^{(-j)}$
------------------
Lec.18, 3/4/22
Completion of Ch.6 MKB: covered Sec 6.3 through 6.3.1
LRT hypothesis test for C1 B M1 = D in multivariate regression
plus: Multiple Correlation, Partial Correlation
------------------
Lec.19, 3/7/22
MANOVA as regression, LRT with Wilks' Lambda
def'n of Pillai's Trace as an alternative
-------------------
Lec.20, 3/9/22
MANOVA table demonstration in R
Discussion of Wilks' Lambda and its relationship to a product of independent Betas (Thm 3.7.3), and approximation in the cases k=2 or 3 by F's
--------------------
Lec.21, 3/11/22
Brief discussion of sample test review problems
Introduction to ideal principal components (i.e., the principal-component eigenspaces of the true variance matrix Sigma)
---------------------
Lec.22, 3/14/22
Discussion of HW problem (II): exact T^2(p-1,n-1) distribution using the alternate representation of H0: mu proportional to mu_0 as R mu = 0, where R ((p-1) x p) has rows forming an orthonormal basis for the orthogonal complement of {mu_0} (cf. MKB, pp.132-133)
Extended discussion/hints on problems of the Sample Test
---------------------
Lec.23, 3/16/22
Further discussion of the sample test & review for the in-class test
Further introduction to PCA: sample principal components, general properties, and principal component regression
------------------TEST ON 3/18/22
Lec.24, 3/28/22
Discussion of test solutions and further definitions concerning principal components.
------------------
Lec.25, 3/30/22
Illustration of PC software and R calculations "from scratch" on the Boston Housing data in the R script PrinCompBHous.RLog.
------------------
Lec.26, 4/1/22
Large-sample theory for estimates of PCs.
PC regression to reduce the dimensionality of an outcome dataset.
Use of PCs of one variable-set as predictive variables for a different outcome.
------------------
Lec.27, 4/4/22
Introduction of the Factor Analysis model.
Nonidentifiability due to orthogonal rotations of loadings.
Side conditions (several different versions) to restore identifiability.
Orthogonal-column loadings as one possible side condition for an identifiable loadings matrix.
------------------
Lec.28, 4/6/22
Illustration of Factor Analysis R functions in the 5-company stock-returns example (#9.4 in Johnson & Wichern), including interpretation of loadings, in the FactorExmp.RLog script.
------------------
Lec.29, 4/8/22
Principal Factor Method (3 versions, using the correlation matrix R in place of S):
(i) direct use of PCs with the top-k eigenvectors of S as loadings, then Psi as diag(matrix residual);
(ii) estimate communalities via max correlations (of the j'th variable on the others), then Psi, then Lambda via spectral decomposition of R - Psi;
(iii) same plan as (ii) but communalities estimated via the multiple correlation of the j'th variable on the others.
Contrasted these approximate "principal factor methods" with MLE Factor Model estimates, used in a formal goodness-of-fit test for the model.
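The "from scratch" sample-PC calculation of Lec.25 is done in R (PrinCompBHous.RLog, not reproduced here). A minimal Python/NumPy sketch of the same computation -- center, form the sample covariance, eigendecompose, project -- on simulated data standing in for a real dataset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic n x p data matrix with correlated columns (stand-in for real data)
n, p = 200, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))

# Center, form the sample covariance S, and eigendecompose it
Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / (n - 1)
lam, V = np.linalg.eigh(S)
order = np.argsort(lam)[::-1]        # eigenvalues in decreasing order
lam, V = lam[order], V[:, order]

# Principal component scores and proportion of variance explained
scores = Xc @ V
var_explained = lam / lam.sum()
```

The scores are uncorrelated with variances equal to the eigenvalues, which is the defining property of the sample principal components.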
------------------
Lec.30, 4/11/22
Introduction of the EM Algorithm & Woodbury Identity for the Factor Analysis MLE calculation via EM, following C. Bishop's book, Chapter 12, esp. Sec.12.4
------------------
Lec.31, 4/13/22
Completion of the EM Algorithm implementation for Factor Analysis (Rubin & Thayer 1982)
Computational illustration, on the 103x5 stock-returns dataset, of the LRT goodness-of-fit test for "Probabilistic PCA", which is the factor model with Psi = sigma^2 * I_{pxp}
------------------
Lec.32, 4/15/22
Canonical Correlation: motivation and linear-algebra solution, including a goodness-of-fit test (under normality) for independence of X, Y
------------------
Lec.33, 4/18/22
Introduction/overview of clustering from all 3 books
1. Model-based
   a. Mixture and label-identifier models
   b. Density estimation
2. Criterion/Algorithm-based
3. Hierarchical Agglomerative/Divisive
4. Other (particularly, Spectral Clustering)
Clustering ** grouping or rule-based subsetting, with the general objective (subsetting Y's) that Y observations within a group are more alike (homogeneous) than observations across groups; primarily "unsupervised" without labels, sometimes "semi-supervised"
------------
Lec.34, 4/20
Clustering, continued
Software (library cluster for hierarchical, kmeans, mclust for mixture models)
Dendrogram data representations
Illustration of clustering and "confusion" matrices for (a sample from) the iris data, where the true species-based clusters are known.
-------------
Lec.35, 4/22
More on clustering
Further discussion of the IrisCluster.RLog script showing the software implementation and interpretation of clusters from the methods kmeans, agnes, diana, mclust. The discussion is enriched by the model-based clustering analyses (with parametric mixture-of-normal models).
General question: how to assess clustering reliability or quality. Introduce the idea of clustering data-sample replicates ("bootstrapping clusters") to assess reliability.
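The kmeans method listed under Lec.34 is, at its core, Lloyd's algorithm: alternate assigning points to their nearest center and recomputing each center as the mean of its assigned points. A minimal Python/NumPy sketch on synthetic two-cluster data (not the R cluster-library implementation used in class):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster empties
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

# Two well-separated synthetic clusters in the plane
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers = kmeans(X, 2)
```

With well-separated groups like these, the recovered labels match the true two-group partition, which is what the iris "confusion matrix" comparisons in Lec.34-35 quantify for real data.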
-----------
Lec.36, 4/25
Bootstrapping -- in general and in clustering
Nonparametric vs parametric bootstrapping. Intermediate case of bootstrapping from a "parametric density" defined from a kernel-density estimator fitted to the observed data.
Some illustration using the R script BootMultivar.RLog.
-----------
Lec.37, 4/27
More discussion of bootstrapping specifically related to clustering, using confusion matrices and metrics like Sensitivity and Positive Predictive Value.
Further R script illustration using the iris data, cf. BootClus.RLog.
-----------
Lec.38, 4/29
Illustration of the bootstrapping of clustering with the R script
-----------
Lec.39, 5/2
Kernel methods -- intro of kernels, basic theory
-----------
Lec.40, 5/4
Kernel clustering methods -- radial basis function (Gaussian) kernel and variants
-----------
Lec.41, 5/6
More on kernel clustering, including bootstrapping of the kernel-based clustering, using a script involving the iris data.
-----------
Lec.42, 5/9
Kernel PCA -- with script illustration, KernelMethods.RLog.
---------------
Eventually we left off the Sparse PCA topics, and several students did them for final projects: Sparse PCA, Simultaneous PCA, regularization of PCA in high dimensions
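The nonparametric bootstrap of Lec.36 amounts to: resample the rows of the data matrix with replacement, recompute the statistic on each resample, and summarize the resulting empirical distribution. A minimal Python/NumPy sketch (in place of the BootMultivar.RLog script), with a simulated bivariate sample and the sample correlation as the statistic:

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated bivariate sample with substantial true correlation
n = 300
z = rng.standard_normal(n)
X = np.column_stack([z + 0.4 * rng.standard_normal(n),
                     z + 0.4 * rng.standard_normal(n)])

def sample_corr(data):
    return np.corrcoef(data, rowvar=False)[0, 1]

# Nonparametric bootstrap: resample rows with replacement, recompute statistic
B = 1000
boot = np.array([sample_corr(X[rng.integers(0, n, size=n)]) for _ in range(B)])

# Percentile confidence interval for the correlation
lo, hi = np.quantile(boot, [0.025, 0.975])
```

The same resample-and-recompute template, applied with a clustering rule as the "statistic" and confusion matrices as the summary, gives the cluster-reliability assessments of Lec.37-38.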