Homework Problem 26, Due Monday May 10 or Wednesday May 12.
----------------------------------------------------------

Access the data "Rubber" from the MASS library within R or Splus6.0. From either platform, you must first issue the command

    > library(MASS)

to place the MASS (Venables & Ripley) datasets into your search-path. This dataset has 3 variables: loss (the amount of rubber wear, the response variable) and two predictors, "hard" and "tens". The objective of this exercise is to compare the predictive success of different methods of "predicting" above-median rubber-loss (i.e., loss > 165).

(i) Develop three methods of predicting above-median rubber loss from the variables "hard" and "tens":

   --- using a linear-regression model (normal errors)
   --- using a logistic regression model
   --- using a nonparametric regression model (kernel density estimator with a single bandwidth b=40) for "loss" as a function of the linear-regression fitted linear combination of "hard" and "tens".

You should code a function which does the model-fitting and defines predictions based on a dataset of the same structure as "Rubber", but with size n between 25 and 30. Note that each prediction method is an algorithm mapping the dataset to a logical vector of the same size n (where T corresponds to "> 165" and F to "<= 165"). For convenience, you should probably write your function to calculate the numbers or proportions of correct predictions on a test dataset input to the same function.

(ii) Do a small cross-validation study (say, of 1000 replications), by repeatedly leaving out 5 observations chosen at random from the original dataset of 30, designed to estimate the accuracy of prediction of [loss > 165] by each of the three prediction methods you developed in (i).
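
As a rough illustration of how parts (i) and (ii) might fit together, here is a minimal R sketch. It is only one possible reading of the problem, not a prescribed solution. The function names nw_smooth and predict_high_loss are hypothetical. The kernel method is assumed to be a Nadaraya-Watson smoother with a Gaussian kernel whose standard deviation is the bandwidth b=40; the linear-regression rule is assumed to predict "> 165" whenever the fitted mean exceeds 165, and the logistic rule whenever the fitted probability exceeds 0.5.

```r
library(MASS)  # provides the Rubber data frame (loss, hard, tens)

# Nadaraya-Watson kernel regression estimate at the points x_new.
# ASSUMPTION: Gaussian kernel, with the bandwidth b used as its standard deviation.
nw_smooth <- function(x_train, y_train, x_new, b) {
  sapply(x_new, function(x0) {
    w <- dnorm((x_train - x0) / b)
    sum(w * y_train) / sum(w)
  })
}

# Fit the three models on `train` and return, for each method, the
# proportion of correct predictions of [loss > cutoff] on `test`.
predict_high_loss <- function(train, test, cutoff = 165, b = 40) {
  # Method 1: linear regression; predict TRUE when the fitted mean exceeds the cutoff
  fit_lm <- lm(loss ~ hard + tens, data = train)
  score_train <- fitted(fit_lm)
  score_test <- predict(fit_lm, newdata = test)
  pred_lm <- score_test > cutoff

  # Method 2: logistic regression on the 0/1 indicator of loss > cutoff.
  # Warnings are suppressed because small samples may be (nearly) separable.
  train$high <- as.numeric(train$loss > cutoff)
  fit_glm <- suppressWarnings(
    glm(high ~ hard + tens, family = binomial, data = train))
  pred_glm <- predict(fit_glm, newdata = test, type = "response") > 0.5

  # Method 3: kernel regression of loss on the fitted linear combination
  pred_np <- nw_smooth(score_train, train$loss, score_test, b) > cutoff

  truth <- test$loss > cutoff
  c(lm = mean(pred_lm == truth),
    logistic = mean(pred_glm == truth),
    kernel = mean(pred_np == truth))
}

# Part (ii): leave out 5 randomly chosen observations, 1000 times
set.seed(1)  # arbitrary seed, for reproducibility only
n_rep <- 1000
acc <- replicate(n_rep, {
  held_out <- sample(nrow(Rubber), 5)
  predict_high_loss(Rubber[-held_out, ], Rubber[held_out, ])
})
rowMeans(acc)  # estimated prediction accuracy for each method
```

Here acc is a 3 x 1000 matrix with one row per method, so rowMeans(acc) averages the held-out accuracy over the replications. Other decision thresholds or kernels would be equally defensible choices.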
(iii) Do a small bootstrap study (preferably of many more than 1000 replications) designed to find a 95% confidence interval for the probability P(loss > 165 | hard, tens) by each of your three methods in (i), for several (say, the first 5) of the (hard, tens) combinations actually occurring in the data. There are a few different ways to do such a study. Do the study TWO DIFFERENT WAYS chosen from among the following:

   (a) bootstrap the triples (loss, hard, tens) directly (i.e., directly sample with replacement from the set of 30 triples); OR

   (b) do a parametric bootstrap of the data, by sampling with replacement from only the pairs (hard, tens) and generating the additive regression errors from the normal linear regression model with parameters fitted to the dataset of all 30 points; OR

   (c) form the residuals from the linear regression model (fitted to the original dataset), and bootstrap them (i.e., repeatedly select with replacement), each time adding them back to the original linear-regression predictors to get a `pseudo-data' sample (pseudo-loss, hard, tens) of size 30 on which you can check the behavior of your prediction methods in (i).
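
The resampling schemes (a) and (c) can be sketched in R roughly as follows, using percentile intervals. This is an illustrative outline under stated assumptions rather than the intended answer. The names nw_smooth, prob_high_loss, boot_ci, resample_triples and resample_residuals are all hypothetical. The probability estimates are assumed to be: the normal tail probability from the fitted linear model, the fitted logistic probability, and a Gaussian-kernel (standard deviation b=40) smooth of the indicator [loss > 165] against the fitted linear combination.

```r
library(MASS)  # provides the Rubber data frame (loss, hard, tens)

# Nadaraya-Watson smoother. ASSUMPTION: Gaussian kernel with sd equal to b.
nw_smooth <- function(x_train, y_train, x_new, b) {
  sapply(x_new, function(x0) {
    w <- dnorm((x_train - x0) / b)
    sum(w * y_train) / sum(w)
  })
}

# Estimated P(loss > cutoff | hard, tens) at the rows of `newdata`,
# one column per method, with all models fitted to `train`.
prob_high_loss <- function(train, newdata, cutoff = 165, b = 40) {
  fit_lm <- lm(loss ~ hard + tens, data = train)
  mu_new <- predict(fit_lm, newdata = newdata)
  # Normal-errors model: upper tail probability beyond the cutoff
  p_lm <- pnorm((cutoff - mu_new) / summary(fit_lm)$sigma, lower.tail = FALSE)

  train$high <- as.numeric(train$loss > cutoff)
  fit_glm <- suppressWarnings(
    glm(high ~ hard + tens, family = binomial, data = train))
  p_glm <- predict(fit_glm, newdata = newdata, type = "response")

  # Kernel smooth of the 0/1 indicator against the fitted linear combination
  p_np <- nw_smooth(fitted(fit_lm), train$high, mu_new, b)

  cbind(lm = p_lm, logistic = p_glm, kernel = p_np)
}

n <- nrow(Rubber)
x_new <- Rubber[1:5, c("hard", "tens")]   # first 5 (hard, tens) combinations
fit0 <- lm(loss ~ hard + tens, data = Rubber)

# Scheme (a): resample whole triples with replacement
resample_triples <- function() Rubber[sample(n, n, replace = TRUE), ]

# Scheme (c): resample residuals and add them back to the original fitted values
resample_residuals <- function() {
  pseudo <- Rubber
  pseudo$loss <- fitted(fit0) + sample(resid(fit0), n, replace = TRUE)
  pseudo
}

# Percentile 95% intervals from n_boot bootstrap refits
boot_ci <- function(resample, n_boot = 2000) {
  draws <- replicate(n_boot, prob_high_loss(resample(), x_new))  # 5 x 3 x n_boot
  apply(draws, c(1, 2), quantile, probs = c(0.025, 0.975))
}

set.seed(2)  # arbitrary seed, for reproducibility only
ci_triples <- boot_ci(resample_triples)
ci_residuals <- boot_ci(resample_residuals)
```

Each result is a 2 x 5 x 3 array: lower and upper limits, by (hard, tens) combination, by method. Scheme (b) would follow the same pattern, with a resampling function that draws (hard, tens) pairs with replacement and adds normal errors simulated from the fitted linear model.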