HW 18, assigned 11/28/2016, due Monday 12/5 extended to Wednesday 12/7
--------------------------------------------

Obtain the dataset from package MASS on diabetes in 532 Pima Indian
women, as follows:

> library(MASS)
> pima = rbind(Pima.tr, Pima.te)
> dim(pima)
[1] 532   8
> names(pima)
[1] "npreg" "glu"   "bp"    "skin"  "bmi"   "ped"   "age"   "type"

#------------------------------------
(A) Find the best logistic regression model you can for the response
variable "type" in terms of the other variables in this data-frame (and
any re-codes, dummy variables, and interactions you like). I suggest
using "step" with a value k >= 4 (note that BIC would correspond to
roughly k = 6), with interactions of order at most 3 (e.g. bmi:skin:bp).
Here "best" means best in terms of AIC and also classification accuracy.
In part (A), stick with AIC. In part (B), you are asked to assess
diabetic classification accuracy on various subgroups, and in part (C)
you are asked to assess classification accuracy by cross-validation.

#------------------------------------
(B) For your best fitted model, compare the proportion of women with
diabetes to the average fitted probability of diabetes within each of
the following subgroups:

 (i)  the 15 subgroups defined by glu in [50,59.99), [60,69.99), ...,
      [190,199.99)
 (ii) the 12 subgroups defined by pmin(pima$npreg, 11)

Also display the numbers of women with diabetes correctly and
incorrectly classified as diabetic by your model, according to the
criterion that the fitted probability of diabetes (type=="Yes") is
>= 0.5.

#------------------------------------
(C) As in the third part of (B), suppose that we classify each woman as
diabetic or not according to whether her fitted probability of diabetes
is >= 0.5, and that the "figure of merit" for such a classification
scheme is the proportion of misclassifications. Estimate the probability
of misclassification from your best fitted model on a new set of data by
cross-validation, as follows:

1. Repeatedly, a total of N = 1000 times, split the data randomly into
   a test set of size 200 and the complementary training set of size
   332.

2. For each data-splitting i, fit your best model chosen in part (A) to
   the training set, then use that model to generate the 200 fitted
   probabilities of diabetes for the women in the test set, and
   calculate the proportion of those 200 fitted probabilities, rounded
   to 0 or 1, which disagree with the corresponding (pima$type=="Yes")
   binary indicators; this is the "misclassification rate" for the
   i'th data-splitting.

3. Average your 1000 misclassification rates to get an overall
   misclassification rate.

Compare this value to the misclassification rate you found on the whole
dataset at the end of part (B). Would you have expected the rate in
part (B) to be larger or smaller than the cross-validated rate you just
found in (C)? Why?
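#------------------------------------
For concreteness, here is a minimal R sketch of one way to run the
model search in part (A). The main-effects starting model and the
3-way-interaction upper scope are illustrative assumptions, not the
required answer; your re-codes and final model may well differ.

library(MASS)
pima = rbind(Pima.tr, Pima.te)

fit0 = glm(type ~ ., family = binomial, data = pima)  # main-effects start
fitA = step(fit0,
            scope = list(lower = ~ 1,
                         upper = ~ (npreg + glu + bp + skin + bmi + ped + age)^3),
            k = 4, trace = 0)   # k = 4 penalizes harder than AIC's k = 2
summary(fitA)
AIC(fitA)                       # compare candidate models on AIC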
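#------------------------------------
The next sketch shows the subgroup comparisons and classification table
asked for in part (B), assuming "fitA" is the model chosen in (A);
since glu is integer-valued here, the cut points below reproduce the
intervals listed in (i).

phat = fitted(fitA)                     # fitted P(type == "Yes")
yobs = as.numeric(pima$type == "Yes")

# (i) 15 glucose subgroups [50,60), [60,70), ..., [190,200)
glu.grp = cut(pima$glu, breaks = seq(50, 200, by = 10), right = FALSE)
cbind(observed = tapply(yobs, glu.grp, mean),
      fitted   = tapply(phat, glu.grp, mean))

# (ii) 12 subgroups from npreg truncated at 11
npreg.grp = pmin(pima$npreg, 11)
cbind(observed = tapply(yobs, npreg.grp, mean),
      fitted   = tapply(phat, npreg.grp, mean))

# counts correctly/incorrectly classified at the 0.5 cutoff
table(actual = pima$type, classified = phat >= 0.5)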
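#------------------------------------
Finally, a sketch of the cross-validation loop described in steps 1-3
of part (C). The formula "form" below is a main-effects placeholder;
substitute the formula of your chosen model from (A). The seed is
arbitrary, used only for reproducibility.

form = type ~ npreg + glu + bp + skin + bmi + ped + age   # placeholder
N = 1000
misrate = numeric(N)
set.seed(2016)
for (i in 1:N) {
    test  = sample(1:532, 200)      # random test set of size 200
    fit.i = glm(form, family = binomial, data = pima[-test, ])
    p.i   = predict(fit.i, newdata = pima[test, ], type = "response")
    misrate[i] = mean((p.i >= 0.5) != (pima$type[test] == "Yes"))
}
mean(misrate)     # overall cross-validated misclassification rate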