DATA ANALYSIS LOG IN R USING LM
==================================

(I) (10/26/09) Start with the "steamdat" example from the Applied
Regression Analysis text of Draper and Smith.

> steamdat = matrix(c(10.98, 35.3, 11.13, 29.7, 12.51, 30.8,
     8.40, 58.8,  9.27, 61.4,  8.73, 71.3,  6.36, 74.4,  8.50, 76.7,
     7.82, 70.7,  9.14, 57.5,  8.24, 46.4, 12.19, 28.9, 11.88, 28.1,
     9.57, 39.1, 10.94, 46.8,  9.58, 48.5, 10.09, 59.3,  8.11, 70.0,
     6.83, 70.0,  8.88, 74.5,  7.68, 72.1,  8.47, 58.1,  8.86, 44.6,
    10.36, 33.4, 11.08, 28.6), ncol=2, byrow=T,
    dimnames=list(NULL, c("steamuse","temp")))
### 25 (y,x) pairs: Steam Use is the response

> plot(steamdat[,2], steamdat[,1], xlab="Temp", ylab="Use",
    main="Draper-Smith Steam Data Example")
## This produces a scatterplot, motivating line-fitting.

> lmtmp = lm(steamuse ~ . , data=data.frame(steamdat))
> lmtmp

Call:
lm(formula = steamuse ~ ., data = data.frame(steamdat))

Coefficients:
(Intercept)         temp
   13.62299  -0.07982869

> lines(steamdat[,2], lmtmp$fitted, lty=3)

#### To examine and "interact with" the plot:
> identify(steamdat[,2], steamdat[,1])
[1]  7 11 15 17 19
### After clicking near individual points in the graph with the left
### mouse button, end with a right mouse-button click. The output is
### the vector of indices of the identified points. (Could also
### arrange to print the entries of some other vector at these
### indices instead.)

> summary(lmtmp)$coef
              Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 13.6229893  0.58146349 23.428795 1.496788e-17
temp        -0.0798287  0.01052358 -7.585697 1.054950e-07

> anova(lmtmp)
Analysis of Variance Table

Response: steamuse
          Df Sum Sq Mean Sq F value    Pr(>F)
temp       1 45.592  45.592  57.543 1.055e-07 ***
Residuals 23 18.223   0.792
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

### To reconstruct the sums of squares:
> sum((lmtmp$fit - mean(lmtmp$fit))^2)
[1] 45.5924
> 24*var(lmtmp$fit)
[1] 45.5924
> sum(lmtmp$resid^2)
[1] 18.22340

### To reconstruct the F value:
> summary(lmtmp)$sigma^2
[1] 0.7923217
> 24*var(lmtmp$fit)/summary(lmtmp)$sigma^2
[1] 57.5428

> dim(model.matrix(lmtmp))
[1] 25  2
### 1st column is all 1's, 2nd = "temp"
> mtmp = model.matrix(lmtmp)

#### Finally, for the standard set of diagnostic plots, including
#### residual plots:
> plot(lmtmp)

### To calculate standardized residuals:
> sum(abs(summary(lmtmp)$cov.unscaled - solve(t(mtmp) %*% mtmp)))
[1] 2.98985e-15
### shows exactly how to get $cov.unscaled: it is (X'X)^{-1}
> hat.matrix = mtmp %*% summary(lmtmp)$cov.unsc %*% t(mtmp)
> res.sd = sqrt(diag(diag(25) - hat.matrix))*summary(lmtmp)$sigma
> stdized.resid = lmtmp$resid/res.sd    ### standardized residuals
> round(stdized.resid, 3)
     1      2      3      4      5      6      7      8      9     10     11
 0.205 -0.146  1.599 -0.608  0.632  0.940 -1.573  1.198 -0.187  0.123 -1.930
    12     13     14     15     16     17     18     19     20     21     22
 1.046  0.600 -1.083  1.210 -0.197  1.381  0.088 -1.413  1.432 -0.221 -0.592
    23     24     25
-1.385 -0.703 -0.311
### these std resid's are used to highlight "outliers"
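### As a sanity check on the hand-built calculation above, R's
### built-in hatvalues() and rstandard() should reproduce the
### hat-matrix diagonal and the standardized residuals (up to
### numerical rounding); the following is a sketch, outputs not run:
> max(abs(hatvalues(lmtmp) - diag(hat.matrix)))   ## should be near 0
> max(abs(rstandard(lmtmp) - stdized.resid))      ## should be near 0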
---------------------------------------------------------
(II) Now look at a linear regression example with many variables
added successively: the "attitude" data in R.

> names(attitude)
[1] "rating"     "complaints" "privileges" "learning"   "raises"
[6] "critical"   "advance"

> fit0 = lm(rating ~ raises, data=attitude)
> summary(fit0)$coef
              Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 19.9777882 11.6880752 1.709245 0.0984683708
raises       0.6909058  0.1786164 3.868098 0.0005978237
### So "raises" is a highly significant variable.

> fit1 = update(fit0, .~. + learning)
> summary(fit1)$coef
              Estimate Std. Error  t value   Pr(>|t|)
(Intercept) 15.8090710 11.0843525 1.426251 0.16525699
raises       0.3785885  0.2174073 1.741379 0.09300174
learning     0.4320785  0.1925901 2.243513 0.03326766
### "learning" also seems significant, but makes "raises" less so.
### To check the overall improvement in the model, can use the
### "Wilks Theorem" chi-square for the LR test:
> 2*(logLik(fit1) - logLik(fit0))
[1] 5.128222
## Pretty significant for a 1-df chi-square:
> 1 - pchisq(5.128222, 1)
[1] 0.02353983

### Can also see the improvement in SSR from adding the extra term:
> anova(fit1)
Analysis of Variance Table

Response: rating
          Df  Sum Sq Mean Sq F value    Pr(>F)
raises     1 1496.48 1496.48 17.1175 0.0003075 ***
learning   1  440.04  440.04  5.0334 0.0332677 *
Residuals 27 2360.45   87.42

### In going from "raises" to "raises + learning", the SSQ increment
### of 440.04 is obtained in either of two ways:
> 29*var(fit1$fitted) - 29*var(fit0$fitted)
[1] 440.0364
> sum(fit0$resid^2) - sum(fit1$resid^2)
[1] 440.0364
### The F-test has the same p-value .03327 as the t-test in the
### coefficient table; the asymptotic chi-square approximation from
### the Wilks LR test is slightly different.

### Finally, we get R-squared two ways (recall that the response
### is "rating"):
> c(summary(fit1)$r.sq,
    1 - sum(fit1$resid^2)/( (nrow(attitude)-1)*var(attitude$rating) ))
[1] 0.4506703 0.4506703

=============================================================
### For steps to use in fitting "Stepwise", e.g. via AIC, see
### the log: StepExmp.Log
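### A minimal sketch of stepwise AIC fitting with the built-in step()
### function (this is only an illustration on the "attitude" data,
### not the worked example of StepExmp.Log; fitnull and fitfull are
### names introduced here). Starting from the intercept-only model,
### step() adds or drops one term at a time, comparing AIC:
> fitnull = lm(rating ~ 1, data=attitude)
> fitfull = lm(rating ~ ., data=attitude)
> step(fitnull, scope=formula(fitfull), direction="both")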