Remarks on HOMEWORK 4, STAT 440, DUE 10/31/05
=============================================

#5.9. The main idea in proving the hint is to express 
the double summation over j,k with j ne k as the unrestricted 
double summation minus the summation ("along the diagonal") in 
which j and k are forced to be equal. Then substitute the 
definitions of SSB, SSW, MSW, MSB.
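
As a sketch of that step for generic terms a_j (the particular summand
comes from the hint in the exercise, which is not reproduced here):

```latex
\sum_{j=1}^{M} \sum_{k \ne j} a_j a_k
  = \sum_{j=1}^{M} \sum_{k=1}^{M} a_j a_k - \sum_{j=1}^{M} a_j^2
  = \Bigl( \sum_{j=1}^{M} a_j \Bigr)^2 - \sum_{j=1}^{M} a_j^2 .
```

ASSUMING equal cluster sizes M and a_j = y_{ij} - \bar{y}_U within
cluster i, the squared sum is M^2 (\bar{y}_{iU} - \bar{y}_U)^2, so
summing over clusters gives M*SSB - SSTO = (M-1)*SSB - SSW. Check this
against the exact form of the hint before relying on it.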


#5.12. N=580, M=24, n=12, m=3 two-stage sample.
You are to use sample estimators for the 
within and between mean squares (MSW, MSB): the 
formulas are just like the population-based ones 
[and the form of the ANOVA table in 11.b is 
supposed to suggest that to you], except that to 
combine them into an UNBIASED two-stage variance 
formula you need to use the formula given in 11.d.
I'll say something about this in class.
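
A minimal sketch of the sample mean squares, assuming n sampled PSUs
with the same number m of sampled SSUs in each. The function name and
the toy numbers are made up for illustration; this does NOT include
the unbiased combination from 11.d.

```r
## Hypothetical helper: sample between/within mean squares for a
## two-stage sample with equal subsample sizes m in each of n PSUs.
sample_ms <- function(y, psu) {
  psu <- factor(psu)
  n <- nlevels(psu)                 # number of sampled PSUs
  m <- length(y) / n                # SSUs per sampled PSU (assumed equal)
  psu.means <- tapply(y, psu, mean)
  msb <- m * sum((psu.means - mean(y))^2) / (n - 1)
  msw <- sum((y - ave(y, psu))^2) / (n * (m - 1))   # ave() gives PSU means
  c(msb = msb, msw = msw)
}

## Toy data (not the homework data): 2 PSUs, 3 SSUs each.
y   <- c(1, 2, 3, 4, 5, 6)
psu <- c(1, 1, 1, 2, 2, 2)
res <- sample_ms(y, psu)
res
```

The same two numbers should appear in the "Mean Sq" column of
anova(lm(y ~ factor(psu))).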

In #12b, the relative costs c_1 and c_2 for sampling PSU 
(case) and SSU (can) are given in terms of times required; 
section 5.5.2 of the Chapter [about which I will also say 
something in class today] then gives the optimal number of 
cans to sample within each case IF you know MSW and MSB (as 
you do from part a.).
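
A rough sketch of that calculation, assuming equal-sized PSUs, total
cost c_1*n + c_2*n*m, and the usual two-stage variance formula, under
which the cost-optimal m works out to sqrt(c_1*M*MSW / (c_2*(MSB - MSW))).
Check this against the formula in section 5.5.2 before using it; the
function name and the inputs below are made up and are NOT the
homework values.

```r
## Hypothetical helper: cost-optimal number of SSUs per sampled PSU.
opt_m <- function(c1, c2, M, MSW, MSB) {
  ## If MSB <= MSW the objective keeps decreasing in m, so take every SSU.
  if (MSB <= MSW) return(M)
  ## Otherwise use the square-root formula, capped at the PSU size M.
  min(M, sqrt(c1 * M * MSW / (c2 * (MSB - MSW))))
}

## Made-up inputs for illustration only:
m.opt <- opt_m(c1 = 10, c2 = 1, M = 24, MSW = 4, MSB = 20)
m.opt
```

In practice you would round m.opt to a nearby whole number of cans.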


#5.15 Note that there are only 33, not 35, high schools in 
  the region where the new survey will be conducted.

#5.17 ## Inputting data:
> measles <- read.table("measles.dat", header=T,sep=",")

NOTE that "no answer" or "missing data" is denoted in this 
dataset by 9's. 
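
One way to handle the 9's, sketched on a toy data frame (the column
names here are hypothetical, not the ones in measles.dat): recode 9 to
NA only in the columns where 9 is a missing-value code. Reading the
file with na.strings="9" would instead blank out every 9 in every
column, including legitimate values.

```r
## Toy data frame standing in for measles; column names are hypothetical.
toy <- data.frame(id = c(7, 8, 9, 10),
                  q1 = c(1, 9, 0, 1),
                  q2 = c(9, 0, 1, 9))
## Columns in which 9 means "no answer" (an assumption for this toy example).
code9 <- c("q1", "q2")
toy[code9] <- lapply(toy[code9], function(x) replace(x, x %in% 9, NA))
toy
## Count of missing values per recoded column.
colSums(is.na(toy[code9]))
```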

#5.22. Inputting and examining data:

## Use na.strings to read "." as missing=NA
> ozone.fr <- read.table("ozone.dat",sep=",", header=T, na.strings=".")
## First row is all missing, so delete it.
> ozone.fr <- ozone.fr[-1,]
## Note that the first column is always a time measurement,
##   not ozone!! Also, there are some other missing 
##   observations. 
> (1:730)[apply(ozone.fr[,2:25],1, function(row) sum(is.na(row))==24)]
[1] 558
> ozone.fr <- ozone.fr[-558,]
## The following are the column-wise counts of missing values.
> apply(ozone.fr[,2:25],2, function(col) sum(is.na(col)))
 B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y 
 7  5  4  5  6  7  4 10  6  8 19 15 14  7 25  6 11 14 17  6  5  5 24 11 

There certainly are missing values here, but since the
column-positions do have a definite meaning as hours within the day, I
would delete only the two rows that are entirely missing (rows 1 and
559 of the original file; the latter is row 558 once row 1 has been
removed). Then I would draw the systematic sample as instructed. At
this point, you could do either of two things, both of which seem
somewhat reasonable (but not perfect solutions) to me:
   (i) you could reduce the N and n by the numbers of missing values
in respectively the whole population and the sample, or 
   (ii) you could replace each missing observation by the nearest
non-missing value (the preceding one, if there are non-missing values
both 1hr before and 1hr after).

If you can think of another sensible approach, that is OK too, but you
should say clearly what you are doing with the data.
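
The two options above can be sketched on a toy series as follows. The
function name fill_nearest is made up, and the vector y is NOT the
ozone data; for the real data you would first string the hourly
readings out in time order.

```r
## Option (ii): fill each NA with the nearest non-missing value,
## taking the preceding one when both neighbours are equally close.
fill_nearest <- function(x) {
  ok <- which(!is.na(x))            # positions of observed values
  if (length(ok) == 0) return(x)    # nothing to fill from
  for (i in which(is.na(x))) {
    ## ok is increasing and which.min() returns the first minimum,
    ## so a tie goes to the preceding observation.
    x[i] <- x[ok[which.min(abs(ok - i))]]
  }
  x
}

y <- c(NA, 2, NA, NA, 5, NA, 7)     # toy hourly series
y.filled <- fill_nearest(y)
y.filled

## Option (i): drop the missing values and use the reduced count.
y.obs <- y[!is.na(y)]
length(y.obs)
```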

NOTE: In #5.22, you should realize that all of the columns other 
than the first (which is a Time measurement) are hourly 
ozone measurements. So you are asked to plot histograms etc. 
for the population of these ozone readings. "Population" 
means "totality of all observations".
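
A small sketch of forming that population as one vector, using a toy
data frame in place of ozone.fr (with the real data the reading
columns would be ozone.fr[,2:25]):

```r
## Toy stand-in for ozone.fr: column 1 is time, the rest are hourly readings.
toy <- data.frame(Time = 1:3,
                  B = c(10, 12, NA),
                  C = c(11, NA, 15),
                  D = c(13, 14, 16))
## t() puts the hours down the rows, so as.vector() runs through
## the readings day by day in time order.
pop <- as.vector(t(as.matrix(toy[, -1])))
pop
summary(pop)
## hist() drops the NA's before plotting.
hist(pop, main = "All hourly readings (toy data)")
```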


#6.4.  Unlike Example 6.2, there are no M_i's here because 
you are directly given the psi_i's. (In the Example, the 
M_i's had been used to give psi_i by the formula M_i/647.)
What the methods require for their definition is not M_i's 
but psi_i's, and those are given.
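
To illustrate that only the psi_i's are needed, here is a sketch
ASSUMING one-stage sampling WITH replacement with probabilities psi_i,
using the estimator that averages t_i/psi_i over the draws. The
totals, probabilities, and function name are made up; they are not the
values from #6.4.

```r
## Hypothetical PSU totals t_i and selection probabilities psi_i (sum to 1).
t.i <- c(12, 18, 33, 37)
psi <- c(0.1, 0.2, 0.3, 0.4)

## Hypothetical helper: with-replacement estimate of the population
## total and its standard error from the sampled t_i and psi_i.
wr_estimate <- function(t.s, psi.s) {
  u <- t.s / psi.s                  # one estimate of the total per draw
  c(total = mean(u), se = sqrt(var(u) / length(u)))
}

set.seed(440)
s <- sample(seq_along(psi), size = 3, replace = TRUE, prob = psi)
wr_estimate(t.i[s], psi[s])
```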