NOTES on Bootstrap and Resampling for Clustering in STAT 750
=========================================================== 4/24/22 ff

------------------------------------------------
1. van der Laan & Bryan 2001 Biostatistics paper:
------------------------------------------------
Sec.1.1 expresses the need for statistical theory of consistency etc.

Sec.1.2, p.2 gives the setting: X_i iid copies, i=1,..,n (small, < 100) of p-vectors (p large, > 10,000) where $i$ indexes subjects and the $p$ coordinates index genes, and the X_i measurements are ratios of observed "prevalences" in the subject-sample versus a control.

Goal, p.3: find genes that are differentially expressed (in subject vs control) and especially groups of such genes that are highly correlated [have high differential activity co-occurring].

mu and Sigma (and estimates of them) used to form cluster groups. Consistency results proved in setting of p >> n as long as  log(p)/n --> 0.

Idea is that S = S(mu,Sigma). Might want to know single or joint inclusion probabilities of i,j in S or other criteria of stability, like confusion matrices, along with sensitivity & positive predictive value. Further characteristic is set L [of genes] selected as particularly unsuitable for further study.

sens = |overlap of true and estimated S|/|true S|
ppv  = |overlap of true and estimated S|/|estimated S|
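
The two ratios above can be computed directly from index sets. A minimal sketch follows; the function name and the toy gene indices are made up for illustration and are not from the paper.

```python
def sens_ppv(true_S, est_S):
    """Return (sensitivity, positive predictive value) for two sets of gene indices."""
    true_S, est_S = set(true_S), set(est_S)
    overlap = len(true_S & est_S)
    # The ratios are undefined for an empty denominator set, so return NaN there.
    sens = overlap / len(true_S) if true_S else float("nan")
    ppv = overlap / len(est_S) if est_S else float("nan")
    return sens, ppv

# Toy example (made-up indices): true S has 4 genes, the estimate has 5, 3 in common.
sens, ppv = sens_ppv({1, 2, 3, 4}, {2, 3, 4, 7, 9})
print(sens, ppv)
```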

WHAT DOES THE PARAMETRIC BOOTSTRAP [UNDER NORMALITY] ESTIMATE ??

pp.4-5: The argument in Sec.1.3 that parametric bootstrap using multivariate normal makes sense does not in any way establish that the actual reference distribution for cluster attributes is insensitive to the underlying joint distribution with given mu, Sigma.

However, the asymptotics in the article do show that quantities like individual or pairwise inclusion probabilities CAN be estimated via bootstrap when log(p) << n.

Original reference: Felsenstein, J. (1985), Confidence limits on phylogenies: an approach using the bootstrap, Evolution 39, 783-791.

A later, important reference on phylogenetic (hierarchical-clustering) trees:

Efron,B. et al. (1996) Bootstrap confidence levels for phylogenetic trees. Proc. Natl
Acad. Sci., USA, 93, 13429–13434.

--- Clustering rules in the van der Laan and Bryan paper are all based on distances $D_{ij}$ that are functions of $\mu, \Sigma$. Examples of subset rules operate in stages: e.g. threshold rules $\mu_j  > C_1$ or $|\mu_j| > C_1$, or thresholds on the estimated correlations $\rho_{ij}$ between $i,j$.

Example $S(\mu,\Sigma) = \{ j: \mu_j > C_1, \quad \exists \;\; i \ne j: \quad \mu_i > C_1, \; \rho_{ij} > C_2\}$ and cluster within $S(\mu,\Sigma)$.
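
The example rule can be written out as a two-stage filter. This is only a sketch: the function name `subset_rule`, the thresholds and the toy inputs are assumptions for illustration, and the subsequent clustering within $S$ is not shown.

```python
def subset_rule(mu, rho, c1, c2):
    """Indices j with mu[j] > c1 that have a partner i != j with mu[i] > c1 and rho[i][j] > c2."""
    p = len(mu)
    high = [j for j in range(p) if mu[j] > c1]            # stage 1: mean threshold
    return {
        j for j in high
        if any(i != j and rho[i][j] > c2 for i in high)   # stage 2: correlated high partner
    }

# Toy inputs (made up): genes 0 and 1 are high and correlated; gene 2 is high but isolated.
mu = [2.0, 1.8, 2.5, 0.1]
rho = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
S = subset_rule(mu, rho, c1=1.0, c2=0.5)
print(sorted(S))  # [0, 1]
```

Gene 2 passes the mean threshold but is dropped because it has no highly correlated partner among the high-mean genes.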

NB $\hat{\Sigma}$ based on size-n sample (n < p) is singular, so add $\delta > 0$ to diagonal to preserve nonsingularity ?!
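
A quick numerical check of this point, as a sketch using NumPy: the sizes `n`, `p` and the ridge constant `delta` are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 10, 50                                   # toy sizes with n < p (assumed)
X = rng.normal(size=(n, p))

sigma_hat = np.cov(X, rowvar=False)             # p x p sample covariance
rank = np.linalg.matrix_rank(sigma_hat)         # at most n - 1 after centering, so singular

delta = 0.1                                     # hypothetical ridge constant
sigma_reg = sigma_hat + delta * np.eye(p)
min_eig = np.linalg.eigvalsh(sigma_reg).min()   # bounded below by delta, up to rounding

print(rank, min_eig)
```

Since every eigenvalue is shifted up by `delta`, the regularized matrix is positive definite and can be used as the covariance of a multivariate normal.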

Distinction drawn (Sec.2.2.2) between closeness of distribution of cluster-sets S and $\hat{S}$ -- generally impossible -- and closeness of dist'n of the quantities (P_i, P_{ij}, sens, spec) of interest.
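
To make the quantities $P_i$ concrete, here is a minimal sketch of a multivariate-normal parametric bootstrap for individual inclusion probabilities under the example rule $S(\mu,\Sigma)$. Everything named here (`subset_rule`, `parametric_bootstrap_inclusion`, the thresholds, `delta`, `B`, and the simulated data) is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

def subset_rule(mu, sigma, c1, c2):
    """Boolean mask of j with mu_j > c1 and some i != j with mu_i > c1, rho_ij > c2."""
    sd = np.sqrt(np.diag(sigma))
    rho = sigma / np.outer(sd, sd)
    np.fill_diagonal(rho, -np.inf)               # exclude the case i == j
    high = mu > c1
    has_partner = ((rho > c2) & high[None, :]).any(axis=1)
    return high & has_partner

def parametric_bootstrap_inclusion(X, c1, c2, delta=1e-3, B=200, seed=0):
    """Estimate P_j = P(j in S-hat) by resampling from N(mu-hat, Sigma-hat + delta*I)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    mu_hat = X.mean(axis=0)
    sigma_hat = np.cov(X, rowvar=False) + delta * np.eye(p)   # ridge keeps it nonsingular
    counts = np.zeros(p)
    for _ in range(B):
        Xb = rng.multivariate_normal(mu_hat, sigma_hat, size=n)
        counts += subset_rule(Xb.mean(axis=0), np.cov(Xb, rowvar=False), c1, c2)
    return counts / B

# Simulated toy data: genes 0,1 high and correlated; gene 2 high, independent; genes 3,4 null.
rng = np.random.default_rng(1)
n = 30
z = rng.normal(size=n)
X = np.column_stack([
    3.0 + z,
    3.0 + 0.9 * z + np.sqrt(1 - 0.81) * rng.normal(size=n),
    3.0 + rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
P = parametric_bootstrap_inclusion(X, c1=1.0, c2=0.5)
print(np.round(P, 2))
```

Pairwise probabilities $P_{ij}$ could be accumulated the same way from the outer product of each bootstrap mask with itself.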

Sec.3 Bernstein ineq implies: $\hat{\mu}_j-\mu_j, \; \hat{\Sigma}_{ij} - \Sigma_{ij} \to 0$ uniformly in $i,j$ as long as log(p)/n --> 0.
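
A sketch of why log(p)/n --> 0 is the natural condition, under the assumption (made for this sketch, not quoted from the paper) that coordinates are bounded, $|X_{ij}-\mu_j| \le M$, with variances at most $\sigma^2$. Bernstein's inequality for the sample mean $\hat{\mu}_j$, followed by a union bound over the $p$ coordinates, gives for each fixed $\epsilon > 0$

$$ P\Big( \max_{1 \le j \le p} |\hat{\mu}_j - \mu_j| > \epsilon \Big) \;\le\; 2p \, \exp\Big( - \frac{n \epsilon^2}{2(\sigma^2 + M\epsilon/3)} \Big) \;=\; 2 \exp\Big( \log p - \frac{n \epsilon^2}{2(\sigma^2 + M\epsilon/3)} \Big) $$

and the right side tends to 0 when log(p)/n --> 0. A similar bound over the $p^2$ entries of $\hat{\Sigma}$ roughly replaces log(p) by 2 log(p), which does not change the condition.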

-----------------------------------------------------------------
2. van der Laan & Pollard 2003 JSPI paper -- similar to van der Laan and Bryan, but used to test out a specific new algorithm
-----------------------------------------------------------------

3. Vivjerberg 2007 tutorial preprint (project paper, Pomona College)

pp.8-11 give details of the PAM (Kaufman & Rousseeuw) algorithm, and the van der Laan and Pollard extension of it.

For variable j  (gene, in microarray examples) within a given clustering, let C_j denote cluster containing j
    a_j = mean d(j,i)  over  i in C_j \ {j}
    b_j = min_{k: C_k \ne C_j} mean(d(j,i): i \in C_k)
"silhouette" of variable j is
    s_j = (b_j-a_j)/max(a_j,b_j)

Idea (mentioned also in the van der Laan papers, originated earlier) is to maximize average silhouette.
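
The silhouette definitions above translate directly into code. The sketch below takes a full distance matrix and a label vector; the function name, the toy points, and the convention s_j = 0 for a singleton cluster are choices made here for illustration. It assumes at least two clusters and max(a_j, b_j) > 0.

```python
def silhouettes(d, labels):
    """Silhouette s_j for each item, given a distance matrix d and cluster labels."""
    clusters = {}
    for j, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(j)

    def mean_dist(j, members):
        return sum(d[j][i] for i in members) / len(members)

    s = []
    for j, lab in enumerate(labels):
        own = [i for i in clusters[lab] if i != j]       # C_j \ {j}
        if not own:
            s.append(0.0)                                # singleton-cluster convention
            continue
        a = mean_dist(j, own)
        b = min(mean_dist(j, m) for k, m in clusters.items() if k != lab)
        s.append((b - a) / max(a, b))
    return s

# Toy example: four points on a line, two well-separated pairs.
x = [0.0, 1.0, 10.0, 11.0]
d = [[abs(u - v) for v in x] for u in x]

s_good = silhouettes(d, [0, 0, 1, 1])   # natural grouping
s_bad = silhouettes(d, [0, 1, 0, 1])    # grouping that splits the pairs
avg_good = sum(s_good) / len(s_good)
avg_bad = sum(s_bad) / len(s_bad)
print(avg_good, avg_bad)
```

The natural grouping has average silhouette near 0.9 while the split grouping is negative, which is the sense in which maximizing the average silhouette selects a clustering.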

Also used to assess homogeneity of cluster elements wrt possible splits of those clusters.

This project paper gets wrong what van der Laan and Bryan did in their bootstrap: it was multivariate-normal PARAMETRIC bootstrap, not nonparametric!!

====================================================================================
4. ClusterBootstrap R Package paper (Deen & de Rooij 2020 Behavior Res Meth) -- may be relevant, especially in hierarchical social science data with limited number of clusters (possibly at multiple levels). Also primarily restricted to hierarchical generalized linear models, which is still pretty general.
------------------------------------------------------------------------------------

------------------------------------------------------------------------------------
5. Kerr,M.K. and Churchill,G.A. (2001) Bootstrapping cluster analysis: assessing the
reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci.
USA, 98, 8961–8965.
------------------------------------------------------------------------------------

The bootstrap idea here is specifically for high-throughput microarrays. (EXPLAIN)

A further issue is whether we cluster subjects (i=1,...,n) or variables (k=1,...,p)
and the relative sizes of n and p.

The method here is primarily (ANOVA) model based. This is very general, depending on which interaction terms are retained in the ANOVA model. Errors are still independent mean 0 equal variance $\sigma^2$. Specific interactions ("variety" by "gene") are biologically interesting, others are "technical", related to specifics of the data collection method.

Data example concerning yeast: 7 "varieties", 7 arrays, 2 dyes, 6118 genes, "green" dye only from the time=variety 0 sample; "red" dye only from the 6 other "variety"=time-points. ANOVA model includes pairwise interactions for array x dye, array x gene, dye x gene.
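
For orientation, a model consistent with the terms listed above can be written as follows. The main-effect terms, the index letters, and taking $y$ to be a log intensity are assumptions of this sketch; only the interactions are taken from the notes. With $i$ = array, $j$ = dye, $k$ = variety, $g$ = gene:

$$ y_{ijkg} = \mu + A_i + D_j + (AD)_{ij} + G_g + (AG)_{ig} + (DG)_{jg} + (VG)_{kg} + \epsilon_{ijkg}, \qquad \epsilon_{ijkg} \;\; \mbox{independent, mean 0, variance } \sigma^2 $$

Here $(VG)_{kg}$ is the biologically interesting "variety" by "gene" term, and the remaining interactions are the "technical" ones.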

-------------------------------------------------------------------------------------------
6. Pvclust: an R package for assessing the uncertainty in hierarchical clustering 
Ryota Suzuki and Hidetoshi Shimodaira, Bioinformatics Applications Note 22 (2006), 1540-1542
-------------------------------------------------------------------------------------------

performs bootstrap (B=1000 recommended) at multiple (default K=10) data sizes
bootstrap is NONPARAMETRIC
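
A rough sketch of nonparametric bootstrapping at multiple data sizes, using NumPy. To stay short it does NOT build a dendrogram: the event tracked is a simple stand-in (two variables being mutual nearest neighbours in 1 - correlation distance), and the function names, the grid of relative sizes 0.5 to 1.4, and B=200 (reduced from the recommended 1000 for speed) are assumptions of this sketch. Pvclust's further step of combining the per-size probabilities into approximately unbiased p-values is not shown.

```python
import numpy as np

def mutual_nn(X, a, b):
    """True if columns a and b are each other's nearest neighbour in 1 - correlation."""
    dist = 1.0 - np.corrcoef(X, rowvar=False)
    np.fill_diagonal(dist, np.inf)
    return bool(dist[a].argmin() == b and dist[b].argmin() == a)

def multiscale_bp(X, a, b, scales=np.arange(0.5, 1.45, 0.1), B=200, seed=0):
    """Bootstrap probability of the event at each relative sample size r (n' = round(r*n))."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    bp = {}
    for r in scales:
        m = max(2, int(round(r * n)))
        # Resample m rows (observations) with replacement; columns are the clustered items.
        hits = sum(mutual_nn(X[rng.integers(0, n, size=m)], a, b) for _ in range(B))
        bp[round(float(r), 1)] = hits / B
    return bp

# Simulated toy data: columns 0 and 1 share a common factor; columns 2 and 3 are noise.
rng = np.random.default_rng(2)
n = 40
z = rng.normal(size=n)
X = np.column_stack([
    z + 0.3 * rng.normal(size=n),
    z + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
bp = multiscale_bp(X, 0, 1)
print(bp)
```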