Splus Example, 3/31/03 ---------------------- > steamdat <- matrix(c(10.98, 35.3, 11.13, 29.7, 12.51, 30.8, 8.40, 58.8, + 9.27, 61.4, 8.73, 71.3, 6.36, 74.4, 8.50, 76.7, 7.82, 70.7, 9.14, 57.5, + 8.24, 46.4, 12.19, 28.9, 11.88, 28.1, 9.57, 39.1, 10.94, 46.8, 9.58, + 48.5, 10.09, 59.3, 8.11, 70.0, 6.83, 70.0, 8.88, 74.5, 7.68, 72.1, + 8.47, 58.1, 8.86, 44.6, 10.36, 33.4, 11.08, 28.6), ncol=2, byrow=T, + dimnames=list(NULL,c("steamuse","temp"))) > steamdat steamuse temp [1,] 10.98 35.3 [2,] 11.13 29.7 [3,] 12.51 30.8 [4,] 8.40 58.8 [5,] 9.27 61.4 [6,] 8.73 71.3 [7,] 6.36 74.4 [8,] 8.50 76.7 [9,] 7.82 70.7 [10,] 9.14 57.5 [11,] 8.24 46.4 [12,] 12.19 28.9 [13,] 11.88 28.1 [14,] 9.57 39.1 [15,] 10.94 46.8 [16,] 9.58 48.5 [17,] 10.09 59.3 [18,] 8.11 70.0 [19,] 6.83 70.0 [20,] 8.88 74.5 [21,] 7.68 72.1 [22,] 8.47 58.1 [23,] 8.86 44.6 [24,] 10.36 33.4 [25,] 11.08 28.6 ## This is the dataset used in the little SAS illustration for ## PROC REG which is given in the web-page handout http://www.math.umd.edu/~evs/s798c/Handouts/Lec03Pt5B.pdf > motif() > plot(steamdat[,2],steamdat[,1], xlab="Temp", ylab="Use", main="Draper-Smith Steam Data Example") ## This produces a scatterplot, motivating line-fitting. > lmtmp <- lm(steamuse ~ . , data=data.frame(steamdat)) > lmtmp Call: lm(formula = steamuse ~ ., data = data.frame(steamdat)) Coefficients: (Intercept) temp 13.62299 -0.07982869 Degrees of freedom: 25 total; 23 residual Residual standard error: 0.8901245 Value Std. Error t value Pr(>|t|) (Intercept) 13.62298927 0.58146349 23.428795 0.00000e+00 temp -0.07982869 0.01052358 -7.585697 1.05495e-07 > names(lmtmp) [1] "coefficients" "residuals" "fitted.values" "effects" [5] "R" "rank" "assign" "df.residual" [9] "contrasts" "terms" "call" > lines(steamdat[,2], lmtmp$fitted, lty=3) > names(summary(lmtmp)) [1] "call" "terms" "residuals" "coefficients" "sigma" [6] "df" "r.squared" "fstatistic" "cov.unscaled" "correlation" > dim(model.matrix(lmtmp)) [1] 25 2 ### Next we do two things to show how to `interact' with the plot. ### The first is purely graphical: we highlight a few of the points by ### pointing and clicking at them, using "identify": > identify(steamdat[,2],steamdat[,1]) ### Now click successively with left mouse-button over the three ### uppermost points in the plot and then the three lowermost, ### and then click middle mouse-button [1] 3 15 17 11 19 7 > printgraph(file="Steamplot.ps") ### These are the indices of the points clicked on: a great way ### to identify outliers "visually" More conventionally, we can try to identify outliers according to the "hat matrix": if X denotes the design-matrix (in this case the 25x2 matrix model.matrix(lmtmp)) for a simple linear regression, for which the fitted variance is > summary(lmtmp)$sigma^2 [1] 0.7923217 > sum(lmtmp$residuals^2)/23 [1] 0.7923217 then the theoretical vector of variances for the residuals from the linear-regression fit is > rvar <- { mtmp <- model.matrix(lmtmp) 0.7923217*diag(diag(25)-mtmp %*% solve(t(mtmp) %*% mtmp, t(mtmp))) } So the `standardized residuals' are: > lmtmp$resid/sqrt(rvar) of which only the one with indices 3, 7, 11 seem `significantly' large: > order(lmtmp$resid/sqrt(rvar))[c(1,25)] [1] 11 3 > (lmtmp$resid/sqrt(rvar))[c(3,7,11)] 3 7 11 1.599349 -1.573203 -1.930487