MiniScript related to Swiss BankNotes, using Testing Idea from Lecture ====================================================================== STAT 750 Spring 2022 2/14/22 > bankn = data.matrix(read.table("http://www.math.umd.edu/~evs/s750/data/banknote.dat")[,-1]) mu2 = apply(bankn[101:200,],2,mean) S = 0.99*var(bankn[101:200,]- rep(1,100)%o%mu2) ## for counterfeit banknotes > Salt = t(bankn[101:200,]- rep(1,100)%o%mu2) %*% + (bankn[101:200,]- rep(1,100)%o%mu2) / 100 > sum(abs(Salt-S)) [1] 1.348639e-15 ## Treat mean for genuine banknotes as reference mu1 = apply(bankn[1:100,],2,mean) > rbind(mu1,mu2) Length Left Right Bottom Top Diagonal mu1 214.969 129.943 129.720 8.305 10.168 141.517 mu2 214.823 130.300 130.193 10.530 11.133 139.450 T2 = 100*c((mu1-mu2)%*% solve(S, mu1-mu2)) > T2 [1] 7362.316 ## suggests highly significant because >> qchisq(.95,6) = 12.59159 ## Alternate computation from LRT: > 100*log(1+T2/100) [1] 431.2451 ## But what is the correct reference distribution for this "Hotelling T^2"? ## Install package MVTests > library(MVTests) > OneSampleHT2(bankn[101:200,],mu1,0.95)$HT2 [,1] [1,] 7288.693 > 7288.693/.99 ## indicates that denominator n in S above [1] 7362.316 ## must be replaced by n-1 to reproduce standard HT2 value ### Two-sample Hotelling T2 next ... > TwoSamplesHT2(bankn, rep(1:2,rep(100,2)))$HT2 [1,] 2412.451 > Spool = 0.5*(var(bankn[101:200,]- rep(1,100)%o%mu2) + var(bankn[1:100,]- rep(1,100)%o%mu1)) > TwSampT2 = (100/2)*c((mu1-mu2)%*% solve(Spool, mu1-mu2)) > TwSampT2 [1] 2412.451 ### OK, agrees with MVTests software #--------------- Slightly more interesting examples > OneSampleHT2(bankn[101:200,c(1:3,5)],mu1[c(1:3,5)],0.95)$HT2 [,1] [1,] 667.4621 > TwoSamplesHT2(bankn[,c(1:3,5)], rep(1:2,rep(100,2)))$HT2 [,1] [1,] 261.577 # still wildly significant even without varbls "Bottom, Diagonal" #------------------ names(Coated) = c("","Depth","Number","Depth","Number") ## 15 obs "Depth", "Number", for each of two "Coating" treatments apply(Coated[,2:5],2,mean) Depth Number Depth.1 Number.1 54.20000 26.20000 46.20000 23.13333 > tmpCoated = rbind.data.frame(Coated[,2:3],Coated[4:5]) ## 1st 15 rows "Coating1" next 15 rows "Coating2" > TwoSamplesHT2(tmpCoated, rep(1:2,rep(15,2)))$HT2 [,1] [1,] 4.994308 ## how extreme a value is this for HotellingT2 ? > TwoSamplesHT2(tmpCoated, rep(1:2,rep(15,2)))[4:5] $p.value [,1] [1,] 0.1090703 $CI Lower Upper Important Variables? Depth -2.769019 18.76902 FALSE Number -4.757879 10.89121 FALSE ## Let's verify roughly the HT2 p-value (HT2 with 2 dimensions, 28df) > Ztmp = array(rnorm(2e5), c(1e5,2)) Z2tmp = array(rnorm(56e5), c(1e5,28,2)) tmp = apply(Z2tmp,1, function(mat) t(mat)%*%mat)/28 ## took 3 secs > dim(tmp) [1] 4 100000 tmp2 = cbind(Ztmp, t(tmp)) ## 1e5 x 6, 1st two cols Z, next 4 cols W Tsq = c(apply(tmp2,1, function(row) row[1:2] %*% solve(array(row[3:6], c(2,2)),row[1:2]))) ## took 2 secs > mean(Tsq > 4.99) [1] 0.10958 ### compare with p-value 0.10907 ## on another run: > mean(Tsq > 4.994308) [1] 0.10935 ## closer to the "theoretical" p-value ## given above by MVTests ----------------------------------------------------------------- # Further steps from theory discussed on 2/21/22: ----------------------------------------------------------------- We calculated a 2-sample Hotelling T^2 tests statistic, which we saw was T^2(2,28) in the MKB notation. Moreover (from Thm 3.5.2) T^2(2,28) = (28*2/(28-2+1)) F_{2,28-2+1} = (56/27)*F_{2,27} So the p-value for 4.994308 is P(F_{2,27} > (27/56)*4.994308) > 1- pf((27/56)*4.994308, 2,27) [1] 0.1090703 ## exactly the p-value given by MVTests --------------------------------------------------------------------------