Spring 2004 MWF 2, Mth 0403
Instructor: Eric Slud, Statistics program, Math. Dept.
Office: Mth 2314, x5-5469, email evs@math.umd.edu
Office hours: tentatively, M 11 and F 1. But you can make an appointment
for office-hour help at other times by emailing me.
Course Text: Venables, W. N. and Ripley, B. D. Modern Applied Statistics
with S-PLUS (4th ed., 2002). New York: Springer-Verlag.
Overview: Statistical research and application have changed dramatically
because of cheap and powerful computational and graphical tools. This course
presents modern methods of computational statistics and their application to
both practical problems and research. The techniques covered in STAT 798C,
which include some numerical-analysis ideas arising particularly in
Statistics, should be part of every statistician's toolbox.
Statistical methodology will be presented informally, with emphasis
on the intuitive basis
for the techniques and brief discussion of their theoretical pedigree.
Implementation of
each method will be given in Splus, SAS, or both, and each method will
be illustrated by
application to real-world data.
In contrast with previous terms in which this course was offered, I plan to
modify the pace of work: one shorter homework task will be assigned in EACH
class session, rather than larger problem sets due only after a couple of
weeks.
Apart from this, the topics and level of the course will be very much the
same as they were in Spring 2003, with the following changes:
(a) I will not spend so much time on numerical maximization of (log-)
likelihoods as a numerical analysis topic.
(b) I will introduce "Bayesian Computing", specifically MCMC and the Gibbs
Sampler, in slightly more detail than I did last year.
(c) In the SAS component of the course, I will introduce the basics of SAS
macros.
Emphasis in the course will continue to be on Splus rather than SAS.
Prerequisite: STAT 420 or STAT 700, and some programming experience (any language).
Course requirements and Grading: Grading will be based completely
on graded DAILY
assignments involving data analysis and statistical computation (a
total of about 40 of them).
For a cumulatively updated file documenting Splus 6.0 differences from the
Splus 3.4 and R notes given below, click here.
The topics of individual pdf-file note-packets are as follows:
Lec03Pt1.pdf: Overview, Unix & Splus preliminaries, Splus language
elements, Vector & Array operations, Inputting Data,
and Lists.
Lec03Pt1B.pdf: Functions in Splus, & how and why to vectorize. Also:
Introduction to Pseudo-Random Number Generation.
Lec03Pt2.pdf: Introduction to Graphics in Splus. Also: Simulation
speedup methods (Accept-Reject and Importance sampling).
Lec03Pt3.pdf: Numerical maximization methods (for likelihoods).
Lec03Pt3B.pdf: Miscellanea: subsetting & parallelizing, plus
Introduction to Smoothing Splines (and their use in
quick function-inversion in Splus).
Lec03Pt4.pdf: EM (Expectation-Maximization) Algorithm for ML
estimation with missing data.
Lec03Pt5.pdf: SAS Introduction.
Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics).
Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.
Lec03Pt5D.pdf: Simulation in Splus versus SAS.
Lec03Pt6.pdf: Markov Chain Monte Carlo: introduction and application
in an EM estimation problem in random-intercept logistic
regression.
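As a minimal illustration of the Accept-Reject idea listed under Lec03Pt2.pdf,
the following sketch (written in Python rather than Splus, and not taken from
the lecture notes) draws from the Beta(2,2) density f(x) = 6x(1-x) using
Uniform(0,1) proposals. The target density, the bound 1.5, the seed, and all
function names are assumptions chosen only for illustration.

```python
import random


def beta22_density(x):
    """Target density f(x) = 6 x (1 - x) on [0, 1], i.e. Beta(2, 2)."""
    return 6.0 * x * (1.0 - x)


def accept_reject(n, rng, bound=1.5):
    """Draw n values from the target using Uniform(0, 1) proposals.

    `bound` must satisfy f(x) <= bound * g(x) for all x, where g is the
    Uniform(0, 1) density; here max f = f(0.5) = 1.5.
    """
    draws = []
    proposals = 0
    while len(draws) < n:
        x = rng.random()   # proposal drawn from g
        u = rng.random()   # independent Uniform(0, 1) variate
        proposals += 1
        # Accept x with probability f(x) / (bound * g(x)).
        if u * bound <= beta22_density(x):
            draws.append(x)
    return draws, proposals


rng = random.Random(798)   # arbitrary seed, for reproducibility only
sample, n_proposed = accept_reject(5000, rng)
sample_mean = sum(sample) / len(sample)
acceptance_rate = len(sample) / n_proposed
print(sample_mean, acceptance_rate)
```

With this bound the long-run acceptance rate is 1/1.5, so about two thirds of
the proposals are kept; a tighter envelope wastes fewer proposals.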
HANDOUTS distributed in class are included for reference here.
The topics treated on these handout Splus and SAS logs are as follows:
DensNPR.Log: this log is a condensed version, for handouts 4/28 and 4/30
in Spring '04, of the DensEst.Log and NonPReg.Log below,
illustrating several different density estimation and
nonparametric regression and smoothing techniques. In addition,
the density estimation part has a small section on (Least-Squares)
cross-validated bandwidth selection, and the
nonparametric regression component also has some material on
comparative evaluation of methods using cross-validation.
Factor.Log: class handout on Splus handling of Factors and contrasts
(using the Bass data in an illustrative example) within
linear model fitting functions.
BassSAS.txt: scripts in Splus for an illustrative regression in SAS
on a dataset involving fish (Bass) in polluted lakes.
Slog1.txt: covers an in-class demonstration of random-number
generation and simulation, plus a brief section on
unix.time applied to linear-algebra operations.
Slog2.txt: recaps an in-class demonstration of acceptance/rejection
sampling, with outputs illustrated by
(scatterplot-related) graphics.
Slog3.txt: a log related to Maximization, Root-finding, &
vectorization in Splus.
Slog4.txt: illustration using Steam-Use data from the Draper and
Smith regression book, showing PROC REG in SAS and
the Splus steps related to function lm for
reproducing the same computed results.
CrabsLog.pdf: extended data-fitting example in (Splus and) R for
GLM analysis of the Horseshoe Crab data discussed
extensively in Agresti's Categorical Data Analysis book.
DensEst.Log: log illustrating several different density estimation
techniques (kernel-density estimation, splines, and
parametric fitting by a mixture of Gaussian or logistic
components) using the Galaxies data from a 1996 article
by Roeder. Plots can be found in pdf format here.
NonPReg.Log: log illustrating several methods of nonparametric
regression and smoothing, using artificial (simulated)
data. Methods include kernel-density, lowess, and splines.
Plots can be found in pdf format here.
Bootstr.Log: log with data examples to illustrate the connections between
and mechanics of: Permutational distributions, p-values and
confidence intervals, Parametric Bootstrap and (a very quick
idea of) Nonparametric Bootstrap.
Steps for analysis of kyphosis dataset (available both in Splus as a dataset
and also under ASCII data directory on this web-page) using Generalized
Linear Model modules, glm in Splus and PROC GENMOD in SAS.
SASlog1.txt: log of practice scripts for categorical data analysis
(PROCs FREQ and GENMOD in SAS).
SASlog2.txt: log on GLMs and deviance, with Analysis of Deviance
Tables and implementations in both SAS and Splus.
SASlog3.txt: additional material specifically related to the kyphosis
dataset, model-fitting and interpretation in both SAS and Splus,
including some material on `deviance' and `standardized Pearson'
logistic-regression residuals. Some additional material on stepwise
fitting in PROC LOGISTIC and building an analysis of deviance table
from SAS output can be found in another SASlog.
Finally, a short Splus log summarizing, for some GLMs, the steps of Fisher
scoring versus Newton-Raphson iterations to calculate Maximum Likelihood
Estimates can be found in NR.FS.Glm.
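To make the Newton-Raphson iteration concrete, here is a small sketch in
Python (not the contents of NR.FS.Glm, and not Splus) that fits a logistic
regression with intercept and slope by repeatedly solving
information * step = score. The eight data points are made up for
illustration, and the function names, starting value and tolerance are
assumptions. For the canonical logit link used here the observed and expected
information coincide, so in this particular model Newton-Raphson and Fisher
scoring take identical steps; with non-canonical links the two generally
differ.

```python
import math

# Toy data, made up for illustration: covariate x and binary response y.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [0, 0, 1, 0, 1, 1, 0, 1]


def score_and_information(beta):
    """Score vector and information matrix of the logistic log-likelihood."""
    b0, b1 = beta
    s0 = s1 = 0.0
    i00 = i01 = i11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # fitted probability
        w = p * (1.0 - p)                            # binomial variance
        s0 += yi - p
        s1 += (yi - p) * xi
        i00 += w
        i01 += w * xi
        i11 += w * xi * xi
    return (s0, s1), (i00, i01, i11)


def newton_raphson(beta=(0.0, 0.0), tol=1e-10, max_iter=50):
    """Iterate beta <- beta + information^{-1} score until steps are tiny."""
    b0, b1 = beta
    for iteration in range(1, max_iter + 1):
        (s0, s1), (i00, i01, i11) = score_and_information((b0, b1))
        det = i00 * i11 - i01 * i01
        # Solve the symmetric 2x2 system explicitly.
        d0 = (i11 * s0 - i01 * s1) / det
        d1 = (-i01 * s0 + i00 * s1) / det
        b0, b1 = b0 + d0, b1 + d1
        if max(abs(d0), abs(d1)) < tol:
            return (b0, b1), iteration
    raise RuntimeError("Newton-Raphson did not converge")


beta_hat, n_iter = newton_raphson()
final_score, _ = score_and_information(beta_hat)
print(beta_hat, n_iter, final_score)
```

At convergence the score is numerically zero, which is a quick check that the
returned value is a stationary point of the log-likelihood.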
A cumulative listing of all special-purpose Splus functions referenced
in the Lecture Notes and Handouts can be found here.
HOMEWORK PROBLEMS and due dates (usually 2 classes after they are assigned)
can be found here.
For guidelines on the amount of material (code & output) to submit with the
Homeworks, see the Instructions.txt file.
DATASETS can be found here in ASCII or text format. From Mathnet accounts,
you can also copy the data as Splus3.4 objects (for inclusion into a .Data
directory) from the directory
/usr/local/StatData/SplusCrs/.Data
In addition, in any environment supporting Splus, you have access to lots of
data in pre-supplied S and Splus directories, which you can look at by
issuing the command
> search()
Then, for example, if you want to see all of the Splus object entries in the
6th directory on the list, which in my (Mathnet, Splus3.4) unix environment
is "/usr/local/splus34/s/.Datasets", use the command
> objects(6)
to see that, among other items, the vectors lottery.number and
lottery.payoff can be found.
TOPICS
1. Introduction to S-Plus:
Starting and quitting S-Plus, on-line help, S-Plus operators and functions,
creating S-Plus objects, data types (vectors, matrices, factors, functions,
lists), managing data (combining objects, subsetting, creation of frames),
S-Plus graphics.
2. Monte Carlo and Simulation in S-Plus:
Basic random number generation, applications of LLN and CLT in simulations,
numerical integration, importance sampling, empirical distributions,
Markov Chain Monte Carlo. Managing loops in Splus.
3. Numerical Optimization in Statistics:
Objective functions in statistics, and managing functions in Splus.
Linear and nonlinear least squares, special considerations in maximizing
likelihoods, penalized likelihood, steepest descent, quasi-Newton-Raphson
methods, constrained maximization, EM algorithm. Diagnostics for
misspecified models.
4. Introduction to SAS:
The SAS environment, SAS data sets, sorting and merging, combining and
subsetting data, basic statistical procedures.
5. Linear and Generalized Linear Models:
Regression summaries, model fitting, prediction, model updating, analysis
of residuals, model criticism, ANOVA, generalized linear models, specifying
link and variance functions, stepwise model selection, deviance analysis.
6. Comparisons of implementations in Splus and SAS.
7. Bootstrapping Methodology:
Parametric bootstrap, empirical CDF, bootstrap standard errors and
confidence intervals, estimation of bias, jackknife, application to
regression.
8. Smoothing & Nonparametric Regression:
Spline smoothing, kernel smoothing, selecting tuning parameters by
cross-validation. Graphical aspects of smoothing.
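As one small example of the bootstrap ideas in topic 7, the sketch below (in
Python rather than Splus; the simulated data, seed, number of resamples and
function names are all assumptions made for illustration) computes a
nonparametric bootstrap standard error and a simple percentile interval for a
sample mean. The mean is used because its bootstrap standard error has a known
limit, the plug-in value sd/sqrt(n), which gives something to compare against.

```python
import random
import statistics


def sample_mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def bootstrap_replicates(data, stat, n_boot, rng):
    """Recompute `stat` on n_boot resamples drawn with replacement."""
    n = len(data)
    reps = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(stat(resample))
    return reps


rng = random.Random(2004)                          # arbitrary seed
data = [rng.gauss(10.0, 2.0) for _ in range(50)]   # simulated, not real data

n_boot = 2000
reps = bootstrap_replicates(data, sample_mean, n_boot, rng)

# Bootstrap standard error: standard deviation of the replicates.
se_boot = statistics.stdev(reps)

# Plug-in standard error of the mean, for comparison.
plug_in_se = statistics.pstdev(data) / len(data) ** 0.5

# Simple 95% percentile interval from the sorted replicates.
reps_sorted = sorted(reps)
ci_lower = reps_sorted[int(0.025 * n_boot)]
ci_upper = reps_sorted[int(0.975 * n_boot) - 1]
print(se_boot, plug_in_se, (ci_lower, ci_upper))
```

Replacing sample_mean by another statistic (for instance a median or a
regression coefficient) reuses the same resampling loop, which is the main
practical appeal of the method.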
REFERENCES
W. N. Venables and B. D. Ripley (2002). Modern Applied Statistics with
S-PLUS, 4th ed. New York: Springer-Verlag.
Recommended:
R. Cody & J. Smith. Applied Statistics and the SAS Programming Language,
4th ed. Prentice-Hall.
Additional:
R. A. Becker, J. M. Chambers, and A. R. Wilks (1988). The New S Language.
Pacific Grove, CA: Wadsworth & Brooks/Cole.
J. M. Chambers and T. J. Hastie (1993). Statistical Models in S.
London: Chapman & Hall.
COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have access to
Splus and Matlab under Unix through their mathnet accounts, and others can
have access through glue accounts. PC-based (student) versions of Minitab,
Splus and Matlab are available and not too expensive. R is freely available
in Unix or PC form through this link.
SAS in a Unix environment is available to you free through a course
campus-account on the `cluster' machines which has been set up for you (with
individual user-names and initial passwords which I will distribute).
Getting Started in SAS.
Various pieces of information to help you get started in using SAS can be
found under an old (F01) course website, Stat430. In particular you can find:
--- running SAS under Unix on cluster machines. Instructions and links are
included there concerning a downloadable `script' enabling remote callup of
SAS when you are running your cluster account remotely from a campus WAM,
mathnet, or glue workstation.
--- an overview of the minimum necessary steps to use SAS from Mathnet.
--- links to stat430 problem assignments.
--- a series of SAS logs with edited outputs for illustrative examples.
Additional Computing Resources.
There are many publicly available datasets for practice data-analyses. Many
of them are taken from journal articles and/or textbooks and are documented
or interpreted. A good place to start is Statlib.
Datasets needed in the course will either be posted to the course web-page
or indicated by links which will be provided here.
© Eric V Slud, Apr. 29, 2004.