Spring
2008
MWF 2, Hornbake 0125
Instructor: Eric Slud, Statistics Program, Math. Dept.
Office: Mth 2314, x5-5469;
evs@math.umd.edu
Office hours: tentatively, M 1, W 11, and Th 1. But you
can make an appointment for
office-hour help at other times by emailing
me.
Course Text: Venables, W. N. and Ripley, B. D.
Modern Applied Statistics
with S (4th ed., 2002). New York:
Springer-Verlag.
Recommended:
R. Cody & J. Smith, Applied Statistics and the
SAS Programming Language, 4th ed. Prentice-Hall.
Additional:
R. A. Becker, J. M. Chambers, and A. R. Wilks (1988). The
New S Language. Pacific Grove,
CA: Wadsworth & Brooks/Cole.
J.M. Chambers and T.J. Hastie (1993). Statistical
Models in S.
London: Chapman & Hall.
For information concerning Homework click here.
Overview: Statistical research and application have changed dramatically
because of cheap and powerful computational and graphical tools. This
course presents modern methods of computational statistics and their
application to both practical problems and research. The techniques
covered in STAT 705, which include some numerical-analysis ideas arising
particularly in Statistics, should be part of every statistician's toolbox.
Statistical methodology will be presented informally, with emphasis
on the intuitive basis
for the techniques and brief discussion of
their theoretical pedigree. Implementation of
each method will be given
in R, SAS, or both, and each method will be illustrated by
application to data, often from real datasets but sometimes simulated.
Prerequisite: STAT 420 or STAT 700, and some programming experience
(any language).
Course requirements and Grading: Grading will be based completely
on graded DAILY assignments involving data analysis and statistical
computation (a total of about 40 of them). The homework tasks, of moderate
length and difficulty, will be assigned in each class session, usually to
be handed in 2 classes after the one in which the assignment is given.
1. Introduction to R:
Starting and quitting R, on-line help, R operators
and functions, creating
R objects, data types (vectors,
matrices, factors, functions, lists), managing
data
(combining objects, subsetting, creation of frames), R
graphics.
2. Monte Carlo and Simulation in R:
Basic random number generation, applications of LLN and CLT in
simulations,
numerical integration, importance sampling, empirical distributions,
Markov Chain
Monte Carlo. Managing loops in R.
3. Numerical Optimization in Statistics:
Objective functions in statistics, and managing functions in R.
Linear and nonlinear
least squares, special considerations in maximizing likelihoods, penalized
likelihood,
steepest descent, quasi-Newton-Raphson methods, constrained maximization,
EM
algorithm. Diagnostics for misspecified models.
4. Introduction to SAS:
The SAS environment, SAS data sets, sorting and merging, combining and
subsetting
data, basic statistical procedures.
5. Linear and Generalized Linear Models:
Regression summaries, model fitting, prediction, model updating, analysis
of residuals,
model criticism, ANOVA, generalized linear models, specifying
link and variance
functions, stepwise model selection, deviance analysis.
Comparisons of implementations in R and SAS. Fitting
mixed-effect (generalized)
linear models in R.
6. Bootstrapping Methodology:
Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence
intervals,
estimation of bias, jackknife, application to regression.
7. Smoothing & Nonparametric Regression:
Spline smoothing, kernel smoothing, selecting tuning parameters by
cross-validation.
Graphical aspects of smoothing.
8. MCMC and the Gibbs Sampler:
Definitions and basic ideas of MCMC and Gibbs-Sampler simulation
methodology,
including a brief introduction to `Bayesian Computing'
using BUGS through R.
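As a small illustration of the importance-sampling idea listed under topic 2 above, here is a minimal sketch in R. It is not taken from the course notes: the target quantity (the standard normal tail probability beyond 4), the N(4, 1) proposal distribution, the seed, and the sample size are all assumptions chosen only for the example.

```r
# Sketch: estimate P(Z > 4) for Z ~ N(0,1) by naive Monte Carlo
# and by importance sampling (illustrative choices throughout).
set.seed(705)                       # arbitrary seed, for reproducibility
n <- 1e5                            # number of simulated draws

# Naive Monte Carlo: proportion of standard normal draws exceeding 4.
# The event is so rare that this estimate is very noisy (often exactly 0).
naive <- mean(rnorm(n) > 4)

# Importance sampling: draw from a proposal centered on the rare region,
# then reweight each draw by (target density) / (proposal density).
y <- rnorm(n, mean = 4)
w <- dnorm(y) / dnorm(y, mean = 4)
is.est <- mean((y > 4) * w)

# Exact value for comparison
exact <- pnorm(4, lower.tail = FALSE)
print(c(naive = naive, importance = is.est, exact = exact))
```

The whole computation is vectorized, so no explicit loop is needed; the importance-sampling estimate should land much closer to the exact value than the naive one for the same number of draws.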
Note: The course will concentrate heavily on R, which is a free software
package
syntactically almost identical to Splus, which was the
software emphasized in the course
up to a couple of years ago. If you are
new to R, you should get started as soon as possible,
using it
either on the MathNet or WAM machines (where it is already loaded and installed)
or on your home computer by downloading the software following instructions
at the
R website. For the
systematic Introduction to R and R reference manual distributed
with the R software, either download from the
R website or simply invoke
the command
> help.start()
from within R. For a slightly
less extensive introductory tutorial in R, click
here.
For a quick start, see the Rbasics handout,
and then get started reading about S (or equivalently, R)
syntax in the Venables and Ripley text.
In the middle of the course, we will give a brief introduction to
SAS, primarily in order to
contrast the way
in which linear and generalized-linear models are handled in the two
software packages, but this course is not a general introduction to SAS.
The topics of individual pdf-file note-packets are as follows:
Notes1.pdf
: Overview, Unix & R preliminaries, R language
elements, Vector & Array operations, Inputting Data,
and Lists.
Notes2.pdf
: Functions in R, & how and why to vectorize. Also:
Introduction to Pseudo-Random Number Generation.
Notes3.pdf
: Introduction to Graphics in Splus. Also: Simulation
speedup methods (Accept-Reject and Importance sampling).
Notes4.pdf
: Numerical maximization methods (for likelihoods).
Notes5.pdf
: Miscellanea: subsetting & parallelizing plus:
Introduction to Smoothing Splines (and their use in
quick function-inversion in R).
Notes6.pdf
: EM (Expectation-maximization) Algorithm for ML
estimation with missing data.
Lec03Pt5.pdf
: SAS Introduction.
Lec03Pt5B.pdf
: Linear Regression in SAS (including some graphics).
Lec03Pt5C.pdf
: Factors, ANOVA and Regression in SAS vs. Splus.
Lec03Pt5D.pdf
: Simulation in Splus versus SAS.
Lec03Pt6.pdf
: Markov Chain Monte Carlo: introduction and application
in an EM estimation problem in random-intercept logistic
regression.
HANDOUTS distributed in class are included
for reference here.
The topics treated on these handout R and SAS logs are as follows:
DensNPR.Log
: this log is a condensed version, for handouts 4/28 and 4/30
in Spring '04, of the DensEst.Log and NonPReg.Log below,
illustrating several different density estimation and
nonparametric regression and smoothing techniques. In addition,
the density estimation part has a small section on (Least-Squares)
cross-validated bandwidth selection, and the
nonparametric regression component also has some material on
comparative evaluation of methods using cross-validation.
Factor.Log
: class handout on R handling of Factors and contrasts
(using the Bass data in an illustrative example) within
linear model fitting functions.
Contr.txt
: handout mentioned in 4/4/08 class on defining contrasts in R
for use with Factors in fitting linear models.
BassSAS.txt
: scripts in Splus for an illustrative regression in SAS
on a dataset involving fish (Bass) in polluted lakes.
StepExmp.Log
: gives a script in R and SAS for stepwise (mostly forward)
selection of variables for linear regression within an R
dataset called "attitude" rating places to work in terms of
ratings in various categories reported on numerical scales.
GLMlog.R
: is the record of a small R session showing how the dispersion
and goodness of fit of glm-fitted model objects can be assessed.
Rlog1.txt
: covers an in-class demonstration of random-number
generation and simulation, plus a brief section on
unix.time applied to linear-algebra operations.
Rlog2.txt
: re-caps an in-class demonstration of acceptance/rejection
sampling, with outputs illustrated by
(scatterplot-related) graphics.
Rlog3.txt
: a log related to Maximization, Root-finding, &
vectorization in Splus.
Rfcn.Log
: a log related to simulation of Mixtures and defining
inverse functions via uniroot.
Slog4.txt
: illustration using Steam-Use data from Draper and
Smith regression book, showing PROC REG in SAS and
the Splus steps related to function lm for
reproducing the same computed results.
CrabsLog.pdf
: extended data-fitting example in (Splus and) R for
GLM analysis of Horseshoe Crab data discussed
extensively in Agresti Categorical Data Analysis book.
DensEst.Log
: log illustrating several different density estimation
techniques (kernel-density estimation, splines, and
parametric fitting by a mixture of Gaussian or logistic
components) using the Galaxies data from a 1996 article
by Roeder. Plots can be found in pdf format here.
NonPReg.Log
: log illustrating several methods of nonparametric
regression and smoothing, using artificial (simulated)
data. Methods include kernel-density, lowess, and splines.
Plots can be found in pdf format here.
Bootstr.Log
: log with data examples to illustrate the connections between
and mechanics of: Permutational distributions, p-values and
confidence intervals, Parametric Bootstrap and (a very quick
idea of) Nonparametric Bootstrap.
Steps for analysis of kyphosis dataset
(available both in Splus as
a dataset
and also under ASCII data directory on this web-page) using
Generalized Linear Model modules, glm
in Splus and PROC GENMOD in SAS.
SASlog1.txt
: log of practice scripts for categorical data analysis
(PROC's FREQ
and GENMOD in SAS).
SASlog2.txt
: log on GLM's and deviance, with Analysis of Deviance
Tables and
implementations in both SAS and Splus.
SASlog3.txt
: additional material specifically related to
kyphosis
dataset,
model-fitting and interpretation
in both SAS and R including some material on `deviance'
and `standardized Pearson'
logistic-regression residuals. Some additional material on
stepwise fitting in PROC LOGISTIC and building an
analysis of deviance table from SAS
output can be found in another SASlog.
Finally, a little Splus log summarizing, for some GLM's,
the steps of Fisher scoring versus Newton-Raphson
iterations to calculate Maximum Likelihood Estimates can
be found in NR.FS.Glm.
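To give a feel for the kind of iteration the last log is concerned with, here is a minimal sketch of Fisher scoring for a logistic regression, written in R. It is not a copy of NR.FS.Glm: the simulated data, seed, true coefficients, and stopping tolerance are assumptions made for the example. For the canonical (logit) link used here, the observed and expected information coincide, so the Fisher scoring and Newton-Raphson steps are the same; the two methods differ only for non-canonical links.

```r
# Sketch: Fisher scoring for logistic regression on simulated data,
# checked against glm(). All data-generating choices are illustrative.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))   # assumed true coefficients
X <- cbind(1, x)                            # design matrix with intercept

beta <- c(0, 0)                             # starting value
for (iter in 1:25) {
  p     <- as.vector(plogis(X %*% beta))    # fitted probabilities
  score <- t(X) %*% (y - p)                 # gradient of the log-likelihood
  info  <- t(X) %*% (X * (p * (1 - p)))     # expected information X' W X
  step  <- solve(info, score)               # scoring step
  beta  <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break         # stop when the step is negligible
}

fit <- glm(y ~ x, family = binomial)
print(rbind(scoring = beta, glm = coef(fit)))
```

The hand-coded iterations should reproduce the glm coefficients up to the convergence tolerance, since glm itself fits by iteratively reweighted least squares, which is equivalent to Fisher scoring.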
Listings of all special-purpose R functions
referenced
in the Lecture Notes and Handouts can be found here.
HOMEWORK PROBLEMS and due dates (usually 3 classes
after they
are assigned), can be found
here.
(Occasional solutions will also be
posted to the same place.)
For guidelines on the amount of material
(code & output) to submit with the Homeworks, see the
Instructions.txt
file.
DATA
Several datasets used in the course and handouts can be found here
in ASCII
or text format.
From Mathnet accounts, later in the course you
will be able to find additional datasets in R
workspaces in the directory
/nfs/projects/statdata/SplusCrs/Data
In addition, in any environment supporting R, you have access to
lots of data in pre-supplied
R libraries which you can look at either by issuing the commands
> search()
or > data()
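As a hedged illustration of those two commands, the following short R session sketch lists what is on the search path, pulls up the index of datasets in the standard datasets package, and loads one of them. The choice of the "attitude" data frame (the dataset mentioned under StepExmp.Log) is only for the example.

```r
# Sketch: exploring the datasets that ship with R.
print(search())                       # attached packages and environments

# Index of the datasets provided by the base 'datasets' package;
# the result has a matrix component 'results' with Item and Title columns.
ds <- data(package = "datasets")
print(head(ds$results[, c("Item", "Title")]))

# Load one dataset by name and take a first look at it
data(attitude)
str(attitude)                         # variables and their types
print(summary(attitude$rating))       # summary of the overall rating
```

Calling help(attitude) from within R shows the documentation page for the dataset, including its source.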
COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have
access to R and Matlab under Unix through their mathnet accounts,
and
others can have access through glue accounts. PC-based (student) versions
of Minitab, Splus and Matlab are available and not too expensive. R
is freely
available in Unix or PC form through this link.
SAS in a Unix environment is
available to you free through a WAM account.
Getting Started in SAS.
Various pieces of information to help you get started in using SAS can
be
found under the course website Stat430. In particular you can find:
--- running SAS on University machines.
Instructions and links are included there concerning
a downloadable `script'
enabling remote callup of SAS when you are running your cluster account
remotely
from a campus WAM or mathnet or glue workstation.
--- an overview of the minimum necessary steps to use SAS from Mathnet.
--- links to stat430 problem assignments.
--- a series of SAS
logs with edited outputs for illustrative examples.
Additional Computing Resources.
There are many publicly available
datasets for practice
data-analyses. Many of them are taken from journal articles
and/or textbooks and documented or interpreted. A good place to start
is Statlib .
Datasets needed in the course will either be posted to the course
web-page,
or indicated by links which will be provided here.
The UMCP Math Department home page.
The University of Maryland home page.
My home page.
© Eric V Slud, April 25, 2008.