Remarks on HOMEWORK 4, STAT 440, DUE 10/31/05
=============================================

#5.9 In #9, the main idea in proving the hint is to express the double
summation over j,k with j ne k as the unrestricted double summation
minus the summation ("along the diagonal") in which j and k are forced
to be equal. Then substitute the definitions of SSB, SSW, MSW, MSB.

#5.12. N=580, M=24, n=12, m=3 two-stage sample. You are to use sample
estimators for the within and between mean squares (MSW, MSB): the
formulas are just like the population-based ones [and the form of the
ANOVA table in 11.b is supposed to suggest that to you], except that to
combine them into an UNBIASED two-stage variance formula you need to
use the formula given in 11.d. I'll say something about this in class.

In #12b, the relative costs c_1 and c_2 for sampling a PSU (case) and
an SSU (can) are given in terms of the times required; section 5.5.2 of
the Chapter [about which I will also say something in class today] then
gives the optimal number of cans to sample within each case IF you know
MSW and MSB (as you do from part a).

#5.15 Note that there are only 33, not 35, high schools in the region
where the new survey will be conducted.

#5.17 Inputting data:

> measles <- read.table("measles.dat", header=TRUE, sep=",")

NOTE that "no answer" or "missing data" is denoted in this dataset
by 9's.

#5.22. Inputting and examining data:

## Use na.strings to read "." as missing=NA
> ozone.fr <- read.table("ozone.dat", sep=",", header=TRUE, na.strings=".")

## First row is all missing, so delete it.
> ozone.fr <- ozone.fr[-1,]

## Note that the first column is always a time measurement,
## not ozone!! Also, there are some other missing
## observations.
> (1:730)[apply(ozone.fr[,2:25],1, function(row) sum(is.na(row))==24)]
[1] 558
> ozone.fr <- ozone.fr[-558,]

## The following are the column-wise counts of missing values.
> apply(ozone.fr[,2:25],2, function(col) sum(is.na(col)))
 B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y
 7  5  4  5  6  7  4 10  6  8 19 15 14  7 25  6 11 14 17  6  5  5 24 11

There certainly are missing values here, but since the column positions
do have a definite meaning as hours within the day, I would delete only
the complete rows 1 and 559 of the original file (the latter is row 558
once row 1 has been removed), which are all missing. Then I would draw
the systematic sample as instructed. At this point, you could do either
of two things, both of which seem somewhat reasonable (but not a perfect
solution) to me:

  (i) you could reduce N and n by the numbers of missing values in,
      respectively, the whole population and the sample, or
 (ii) you could replace each missing observation by the nearest
      non-missing value (the preceding one, if there are non-missing
      values both 1 hr before and 1 hr after).

If you can think of another sensible approach, that is OK too, but you
should say clearly what you are doing with the data.

NOTE: In #5.22, you should realize that all of the columns other than
the first (which is a Time measurement) are hourly ozone measurements.
So you are asked to plot histograms etc. for the population of these
ozone readings. "Population" means "totality of all observations".

#6.4. In #6.4, unlike Example 6.2, there are no M_i's because you are
directly given the psi_i's. (In the Example, the M_i's had been used to
give psi_i by the formula M_i/647.) What the methods require for their
definition is not the M_i's but the psi_i's, and those are given.
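
A follow-up on #5.22: the nearest-non-missing-value replacement in
option (ii) could be sketched in R roughly as follows. This is only one
possible reading of that rule, not a required method; the function name
fill.nearest and the small test vector are made up for illustration.

```r
## Hypothetical helper (not part of the assignment): replace each NA in a
## time-ordered vector by the nearest non-missing value. When the values
## before and after are equally near, the preceding one is used.
fill.nearest <- function(x) {
  ok <- which(!is.na(x))            # positions of the observed values
  if (length(ok) == 0) return(x)    # nothing to borrow from
  for (i in which(is.na(x))) {
    d <- abs(ok - i)                # distance (in hours) to each observed value
    ## which.min returns the FIRST minimum; since ok is increasing,
    ## a tie is resolved in favor of the earlier (preceding) position.
    x[i] <- x[ok[which.min(d)]]
  }
  x
}

## Made-up example vector, just to show the tie rule at position 3:
x <- c(NA, 1, NA, 3, NA, NA, 7, NA)
fill.nearest(x)
## [1] 1 1 1 3 3 7 7 7
```

To apply this to the ozone data you would first need the readings as one
vector in time order, for instance with
as.vector(t(as.matrix(ozone.fr[,2:25]))), assuming the rows of the data
frame are consecutive days and the columns consecutive hours. Whatever
you do, say clearly in your write-up which rule you used.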