LOG of R Analyses Illustrating CLT for SRS Samples -------------------------------------------------- 9/14/05 > mu284 <- read.table("mu284.dat", header=T, skip=28) > dim(mu284) [1] 284 11 > names(mu284) [1] "LABEL" "P85" "P75" "RMT85" "CS82" "SS82" "S82" "ME84" "REV84" [10] "REG" "CL" > order(mu284$P85)[280:284] [1] 211 29 114 137 16 > order(mu284$P75)[280:284] [1] 47 29 114 137 16 > order(mu284$ME84)[280:284] [1] 199 29 114 16 137 > order(mu284$REV84)[280:284] [1] 47 29 114 137 16 > indLrg <- c(114, 137, 16) > mu284[indLrg,2:9] P85 P75 RMT85 CS82 SS82 S82 ME84 REV84 ### The three largest cities are 114 229 247 3471 20 32 61 24694 17949 ### much larger than the others. 137 424 446 6720 21 35 81 47074 38945 16 653 671 6263 34 41 101 45324 59877 > round(apply(mu284[,2:9],2,mean),2) P85 P75 RMT85 CS82 SS82 S82 ME84 REV84 29.36 28.81 245.09 9.10 22.19 47.54 1778.96 3077.52 ### First we create a histogram of 1000 sampled means from ### a SRS sample of 50 cities out of all 284. > hstA <- numeric(1000) for (i in 1:1000) hstA[i] <- mean(sample(mu284$REV84, 50)) hist(hstA, nclass=16, xlab="REV84", ylab="Rel Freq", prob=T, main="Histogram of SRS Means from Mu284, n=50") > abline(v=mean(mu284$REV84)) lines((18:53)*100, dnorm((18:53)*100, mean(mu284$REV84), sqrt((1-50/284)*var(mu284$REV84)/50)), lty=2) ### The overplotted line is the normal density with the ### theoretical mean and variance of the SRS sample average. ### Picture saved as hst_mu284.pdf > hstB <- numeric(1000) indSm <- setdiff(1:284, indLrg) ### The 281 cities other than the largest 3 for (i in 1:1000) hstB[i] <- mean(sample(mu284$REV84[indSm], 50)) hist(hstB, nclass=16, xlab="REV84", ylab="Rel Freq", prob=T, main="Histogram of SRS Means from Mu281, n=50") lines( > abline(v=mean(mu284$REV84[indSm])) lines((16:40)*100, dnorm((16:40)*100, mean(mu284$REV84[indSm]), sqrt((1-50/281)*var(mu284$REV84[indSm])/50)), lty=2) ### Saved as hst_mu281.pdf THE MORAL IS CLEAR: THE HISTOGRAM LOOKS MUCH MORE NORMAL IN THE SECOND CASE, WHEN THE OUTLYING LARGE CITIES ARE NOT PART OF THE SAMPLING FRAME. ------------------------------------------------------------ FURTHER ILLUSTRATIONS: (i) SRS SAMPLING FROM A BINARY POPULATION AS IN AN OPINION POLL, LEADING TO THEORETICAL HYPERGEOMETRIC DISTRIBUTION, and (ii) SRS SAMPLING FROM A GAMMA-DISTRIBUTED SUPERPOPULATION. (i) Note that if all N attributes Y_i are 1 or 0 corresponding to "Yes" or "No" opinion respectively on an opinion-poll question, then the number of Yes answers in an SRS sample of size n is exactly a Hypergeometric random variable, and the attribute total t is the number in the whole population who would answer Yes if asked. In R we can get normal distribution probabilities using the function "pnorm" and hypergeometric distribution probabilties using "phyper": The SE below refers to the SE of the t-hat estimator which is N/n times the hypergeometrically distributed total within the sample. > Hypfram <- cbind.data.frame(N=c(10000,10000, rep(1000,4)), n=c(rep(200,4),20,20), t=c(3000,3000, rep(300,4)), SE=rep(0,6), Thresh = rep(c(1,1.645),3), Uplim=rep(0,6), Normal=rep(0,6), Actual=rep(0,6)) attach(Hypfram) Hypfram$SE <- round((N/n)*sqrt(n*(N-n)*t*(N-t)/(N^2*(N-1))),1) Hypfram$Uplim <- round(t + Thresh*Hypfram$SE,1) Hypfram$Normal <- round(1-pnorm(Thresh),3) Hypfram$Actual <- round(1-phyper(Hypfram$Uplim*(n/N), t, N-t, n),3) detach(Hypfram) > Hypfram N n t SE Thresh Uplim Normal Actual 1 10000 200 3000 320.8 1.000 3320.8 0.159 0.156 2 10000 200 3000 320.8 1.645 3527.7 0.050 0.052 3 1000 200 300 29.0 1.000 329.0 0.159 0.171 4 1000 200 300 29.0 1.645 347.7 0.050 0.052 5 1000 20 300 101.5 1.000 401.5 0.159 0.111 6 1000 20 300 101.5 1.645 467.0 0.050 0.046 THIS TABLE SHOWS THE VERY CLOSE CORRESPONDENCE BETWEEN HYPERGEOMETRIC ("ACTUAL") AND APPROXIMATE-NORMAL TAIL PROBABILITIES FOR THE T-HAT SAMPLING DISTRIBUTION. For a graphical exhibit, let's consider the setting of the third row, with N=1000, n=200, t=3000. We overplot the discrete masses from the hypergeometic with the approximating normal density with the same mean and variance: > plot( 5*(60+(-30:30)), dhyper(60+(-30:30), 300, 700, 200)/5, xlab="T-hat value", ylab="Probability Density", main= "Discrete Hypergeom Density & Approximating Normal") ### NB: Probability Masses are Divided by Interval Width of 5 lines( 5*(60+(-30:30)), dnorm(5*(60+(-30:30)), 300, Hypfram$SE[3]), lty=2) ### Note: THAT/5 is distributed approximately as normal with ### mean t/5=60 and standard deviation Hypfram$SE[3]/5=(29.0)/5 ### which says : THAT ~ |N( 300, 29.0) ### The picture is saved as "HyperPic.pdf" ---------------------------------------------------------------- (ii) Now to get superpopulation, generate 5000 Gamma variates with shape-parameter 1.5 and scale parameter 1: > set.seed(12345) > Yvec <- rgamma(5000, shape=1.5) ### The population attributes ### Now let's do an SRS sample of size 100 and over-plot ### a histogram of the results with the normal density ### with the theortically valid mean and variance > Ystat <- c(mean= mean(Yvec), var= var(Yvec)) > Ystat <- c(mean= mean(Yvec), var= var(Yvec)) > Ystat mean var 1.484325 1.444006 ### 1.5 is the theoretical value for the ### superpopulation, but now Yvec is regarded as fixed > Thist <- numeric(1000) for(i in 1:1000) Thist[i] <- mean(sample(Yvec, 100)) hist(Thist, nclass=24, xlab="Sample mean", ylab="Prob Density", main="Histogram of SRS Sample Means, N=5000, n=100, Gamma Data", prob=T) ### Now overplot the normal density with proper theoretical mean ### and standard deviation for SRS mean lines(1+(0:50)/50, dnorm(1+(0:50)/50, Ystat[1], sqrt((1-100/5000)*Ystat[2]/100)), lty=2) ### saved as Hist_Gam100.pdf ### Now let's do one final one with a smaller sample size, n=30. for(i in 1:1000) Thist[i] <- mean(sample(Yvec, 30)) hist(Thist, nclass=24, xlab="Sample mean", ylab="Prob Density", main="Histogram of SRS Sample Means, N=5000, n=30, Gamma Data", prob=T) ### Now overplot the normal density with proper theoretical mean ### and standard deviation for SRS mean lines(.8+1.4*(0:50)/50, dnorm(.8+1.4*(0:50)/50, Ystat[1], sqrt((1-30/5000)*Ystat[2]/30)), lty=2) ### saved as Hist_Gam30.pdf