HW 18, assigned 11/28/2016, due Monday 12/5 extended to Wednesday 12/7
--------------------------------------------

Obtain the dataset from package MASS on diabetes in 532 Pima Indian
women, as follows:

> library(MASS)
> pima = rbind(Pima.tr, Pima.te)
> dim(pima)
[1] 532   8
> names(pima)
[1] "npreg" "glu"   "bp"    "skin"  "bmi"   "ped"   "age"   "type"

#------------------------------------
(A) Find the best logistic regression model you can for the response
variable "type" in terms of the other variables in this data-frame (and
any re-codes, dummy variables, and interactions you like). I suggest
using "step" with a value k >= 4 (note that BIC would correspond to
roughly k = 6), with interactions of order at most 3 (e.g. bmi:skin:bp).
Here "best" means best in terms of AIC and also classification accuracy.
In part (A), stick with AIC. In part (B), you are asked to assess
diabetic classification accuracy on various subgroups, and in part (C)
you are asked to assess classification accuracy by cross-validation.

#------------------------------------
(B) For your best fitted model, compare the proportion of women with
diabetes to the average fitted probability of diabetes within each of
the following subgroups:

 (i)  the 15 subgroups defined by glu in [50,59.99), [60,69.99), ...,
      [190,199.99)
 (ii) the 12 subgroups defined by pmin(pima$npreg, 11)

Also display the numbers of women with diabetes correctly and
incorrectly classified as diabetic by your model, according to the
criterion that the fitted probability of diabetes (type=="Yes") is
>= 0.5.

#------------------------------------
(C) As in the third part of (B), suppose that we classify each woman as
diabetic or not according to whether her fitted probability of diabetes
is >= 0.5, and that the "figure of merit" for such a classification
scheme is the proportion of misclassifications. Estimate the probability
of misclassification from your best fitted model on a new set of data by
cross-validation, as follows:

1. Repeatedly, a total of N = 1000 times, split the data randomly into
   a test set of size 200 and the complementary training set of size
   332.

2. For each data-splitting i, fit your best model chosen in part (A) to
   the training set, then use that model to generate the 200 fitted
   probabilities of diabetes for the women in the test set, and
   calculate the proportion of those 200 fitted probabilities, rounded
   to 0 or 1, which disagree with the corresponding (pima$type=="Yes")
   binary indicators; this is the "misclassification rate" for the
   i'th data-splitting.

3. Average your 1000 misclassification rates to get an overall
   misclassification rate.

Compare this value to the misclassification rate you found on the whole
dataset at the end of part (B). Would you have expected the rate in
part (B) to be larger or smaller than the cross-validated rate you just
found in (C)? Why?
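#------------------------------------
For concreteness, here is a minimal R sketch of one way to run the
model search in part (A). The main-effects starting model and the
3-way-interaction upper scope are illustrative assumptions, not the
required answer; your re-codes and final model may well differ.

library(MASS)
pima = rbind(Pima.tr, Pima.te)

fit0 = glm(type ~ ., family = binomial, data = pima)  # main-effects start
fitA = step(fit0,
            scope = list(lower = ~ 1,
                         upper = ~ (npreg + glu + bp + skin + bmi + ped + age)^3),
            k = 4, trace = 0)   # k = 4 penalizes harder than AIC's k = 2
summary(fitA)
AIC(fitA)                       # compare candidate models on AIC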
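#------------------------------------
The next sketch shows the subgroup comparisons and classification table
asked for in part (B), assuming "fitA" is the model chosen in (A);
since glu is integer-valued here, the cut points below reproduce the
intervals listed in (i).

phat = fitted(fitA)                     # fitted P(type == "Yes")
yobs = as.numeric(pima$type == "Yes")

# (i) 15 glucose subgroups [50,60), [60,70), ..., [190,200)
glu.grp = cut(pima$glu, breaks = seq(50, 200, by = 10), right = FALSE)
cbind(observed = tapply(yobs, glu.grp, mean),
      fitted   = tapply(phat, glu.grp, mean))

# (ii) 12 subgroups from npreg truncated at 11
npreg.grp = pmin(pima$npreg, 11)
cbind(observed = tapply(yobs, npreg.grp, mean),
      fitted   = tapply(phat, npreg.grp, mean))

# counts correctly/incorrectly classified at the 0.5 cutoff
table(actual = pima$type, classified = phat >= 0.5)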
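#------------------------------------
Finally, a sketch of the cross-validation loop described in steps 1-3
of part (C). The formula "form" below is a main-effects placeholder;
substitute the formula of your chosen model from (A). The seed is
arbitrary, used only for reproducibility.

form = type ~ npreg + glu + bp + skin + bmi + ped + age   # placeholder
N = 1000
misrate = numeric(N)
set.seed(2016)
for (i in 1:N) {
    test  = sample(1:532, 200)      # random test set of size 200
    fit.i = glm(form, family = binomial, data = pima[-test, ])
    p.i   = predict(fit.i, newdata = pima[test, ], type = "response")
    misrate[i] = mean((p.i >= 0.5) != (pima$type[test] == "Yes"))
}
mean(misrate)     # overall cross-validated misclassification rate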