Statistics 798C  COMPUTATIONAL METHODS IN STATISTICS

Spring 2004                                                    MWF 2, Mth 0403

Instructor: Eric Slud, Statistics program, Math. Dept.
Office:  Mth 2314, x5-5469, email evs@math.umd.edu
Office hours:  tentatively, M 11 and F 1. You can also make an appointment for
     office-hour help at other times by emailing me.

Course Text:  Venables, W. N. and Ripley, B. D.  Modern Applied Statistics
   with S-PLUS (4th ed., 2002).  New York: Springer-Verlag.

Overview: Statistical research and application have changed dramatically because of
cheap and powerful computational and graphical tools.  This course presents modern
methods of computational statistics and their application to both practical problems
and research.  The techniques covered in STAT 798C, which include some numerical-analysis
ideas arising particularly in Statistics, should be part of every statistician's toolbox.

Statistical methodology will be presented informally, with emphasis on the intuitive basis
for the techniques and brief discussion of their theoretical pedigree. Implementation of
each method will be given in Splus, SAS, or both, and each method will be illustrated by
application to real-world data.

In contrast with previous terms in which this course was offered, I plan to modify
the pace of work:  one shorter homework task will be assigned in EACH class
session, rather than larger problem sets due only after a couple of weeks.

Apart from this, the topics and level of the course will be very much the same as they
were in Spring 2003, with the following changes:

(a) I will not spend so much time on numerical maximization of (log-) likelihoods as
a numerical analysis topic.

(b) I will introduce "Bayesian Computing", specifically MCMC and the Gibbs
Sampler, in slightly more detail than I did last year.

(c) In the SAS component of the course, I will introduce the basics of SAS macros.

Emphasis in the course will continue to be on Splus rather than SAS.

Prerequisite: STAT 420 or STAT 700, and some programming experience (any language).

Course requirements and Grading: Grading will be based completely on graded DAILY
assignments involving data analysis and statistical computation (a total of about 40 of them).


For a cumulatively updated file documenting Splus 6.0 differences
from the Splus 3.4 and R notes given below, click here.



LECTURE NOTES will be available here throughout the semester.
I suggest downloading and reading them as we arrive at each topic, since
they will be updated and modified over the course of the term.

     The topics of individual pdf-file note-packets are as follows

      Lec03Pt1.pdf : Overview, Unix & Splus preliminaries, Splus language
                        elements, Vector & Array operations, Inputting Data,
                        and Lists.

      Lec03Pt1B.pdf: Functions in Splus, & how and why to vectorize. Also:
                        Introduction to Pseudo-Random Number Generation.
 

      Lec03Pt2.pdf : Introduction to Graphics in Splus. Also: Simulation
                       speedup methods (Accept-Reject and Importance sampling).
 

      Lec03Pt3.pdf : Numerical maximization methods (for likelihoods).
 

      Lec03Pt3B.pdf: Miscellanea: subsetting & parallelizing plus:
                        Introduction to Smoothing Splines (and their use in
                        quick function-inversion in Splus).
 

      Lec03Pt4.pdf : EM (Expectation-maximization) Algorithm for ML
                        estimation with missing data.
 

      Lec03Pt5.pdf:  SAS Introduction.
 

      Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics.)
 

      Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.
 

      Lec03Pt5D.pdf: Simulation in Splus versus SAS.
 

      Lec03Pt6.pdf: Markov Chain Monte Carlo: introduction and application
                       in an EM estimation problem in random-intercept logistic
                       regression.
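
As a flavor of the Accept-Reject method named under Lec03Pt2.pdf above, here is a minimal sketch written in R (whose syntax is close to, but not identical with, Splus). It is not taken from the lecture notes: the Beta(2, 2) target, the uniform proposal, the seed, and all object names are assumptions chosen only for illustration. The point is that the accept/reject decision can be vectorized instead of looped.

```r
# Accept-reject sampling from the Beta(2, 2) density f(x) = 6 x (1 - x)
# using a Uniform(0, 1) proposal g(x) = 1. The target, the seed, and all
# object names are illustrative assumptions, not from the course notes.
set.seed(2004)

n.prop <- 20000          # number of proposals
M <- 1.5                 # envelope constant: max of f on [0, 1] is f(0.5) = 1.5

x <- runif(n.prop)       # proposals drawn from g
u <- runif(n.prop)       # independent uniforms for the accept/reject decision

# Accept x when u <= f(x) / (M * g(x)); evaluated for all proposals at once
keep <- u <= dbeta(x, 2, 2) / M
samp <- x[keep]          # accepted values, distributed as Beta(2, 2)

acc.rate <- mean(keep)   # observed acceptance rate, theoretically 1 / M
```

The theoretical acceptance probability is 1/M, so a tighter envelope (smaller M) wastes fewer proposals.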
 

HANDOUTS  distributed in class are included for reference here.

         The topics treated on these handout Splus and SAS logs are as follows:

      DensNPR.Log  :  this log is a condensed version, for handouts 4/28 and 4/30
                      in Spring '04, of the DensEst.Log and NonPReg.Log below,
                      illustrating several different density estimation and
                      nonparametric regression and smoothing techniques. In addition,
                      the density estimation part has a small section on (Least-
                      Squares) cross-validated bandwidth selection, and the
                      nonparametric regression component also has some material on
                      comparative evaluation of methods using cross-validation.

      Factor.Log :  class handout on Splus handling of Factors and contrasts
                    (using the Bass data in an illustrative example) within
                    linear model fitting functions.

      BassSAS.txt :  scripts in Splus for an illustrative regression in SAS
                     on a dataset involving fish (Bass) in polluted lakes.

      Slog1.txt  :  covers an in-class demonstration of random-number
                        generation and simulation, plus a brief section on
                        unix.time  applied to linear-algebra operations.

      Slog2.txt  :  re-caps an in-class demonstration of acceptance/
                        rejection sampling, with outputs illustrated by
                       (scatterplot-related) graphics.

      Slog3.txt  :  a log related to Maximization, Root-finding, &
                        vectorization in Splus.

      Slog4.txt  :  illustration using Steam-Use data from Draper and
                        Smith regression book, showing PROC REG in SAS and
                        the Splus steps related to function  lm  for
                        reproducing the same computed results.

      CrabsLog.pdf : extended data-fitting example in (Splus and) R for
                        GLM analysis of Horseshoe Crab data discussed
                        extensively in Agresti Categorical Data Analysis book.

      DensEst.Log  :  log illustrating several different density estimation
                        techniques (kernel-density estimation, splines, and
                        parametric fitting by a mixture of Gaussian or logistic
                        components) using the Galaxies data from a 1996 article
                        by Roeder. Plots can be found in pdf format here.

      NonPReg.Log  :  log illustrating several methods of nonparametric
                        regression and smoothing, using artificial (simulated)
                        data. Methods include kernel-density, lowess, and splines.
                        Plots can be found in pdf format here.

      Bootstr.Log :  log with data examples to illustrate the connections between
                        and mechanics of: Permutational distributions, p-values and
                        confidence intervals, Parametric Bootstrap and (a very quick
                        idea of) Nonparametric Bootstrap.
 

      Steps for analysis of the kyphosis dataset (available both in Splus as
      a dataset and also under the ASCII data directory on this web-page) using
      Generalized Linear Model modules, glm in Splus and PROC GENMOD in SAS:

      SASlog1.txt  :  log of practice scripts for categorical data analysis
                        (PROCs FREQ and GENMOD in SAS).

      SASlog2.txt  :  log on GLMs and deviance, with Analysis of Deviance Tables
                        and implementations in both SAS and Splus.

      SASlog3.txt  :  additional material specifically related to the kyphosis
                        dataset, model-fitting and interpretation in both SAS and
                        Splus, including some material on `deviance' and
                        `standardized Pearson' logistic-regression residuals. Some
                        additional material on stepwise fitting in PROC LOGISTIC
                        and building an analysis of deviance table from SAS output
                        can be found in another SASlog.

      Finally, a short Splus log comparing the steps of Fisher scoring and
      Newton-Raphson iterations for calculating Maximum Likelihood Estimates in
      some GLMs can be found in NR.FS.Glm.
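
To make the Fisher scoring versus Newton-Raphson comparison concrete, here is a minimal sketch in R for logistic regression. It is not the content of NR.FS.Glm: the simulated data, the seed, and all object names are assumptions made for illustration. For the logit link, which is the canonical link for binomial data, the observed and expected information matrices coincide, so the iteration below is at once a Newton-Raphson step and a Fisher scoring step. The two methods differ only for non-canonical links such as the probit.

```r
# Simulated data (hypothetical; not the kyphosis data used in the course)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + 1.2 * x))
X <- cbind(1, x)                          # design matrix with intercept

beta <- c(0, 0)                           # starting value
for (iter in 1:25) {
  p <- as.vector(plogis(X %*% beta))      # fitted probabilities
  score <- t(X) %*% (y - p)               # gradient of the log-likelihood
  info <- t(X) %*% (X * (p * (1 - p)))    # information matrix X' W X
  step <- solve(info, score)              # Newton / scoring increment
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break       # stop once the update is negligible
}

fit <- glm(y ~ x, family = binomial)      # built-in fit, for comparison
```

Comparing beta with coef(fit) should show agreement to several decimal places, since glm reaches the same maximum by iteratively reweighted least squares.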

A cumulative listing of all special-purpose Splus functions referenced
in the Lecture Notes and Handouts can be found here.

HOMEWORK PROBLEMS and due dates (usually 2 classes after they
are assigned) can be found here. For guidelines on the amount of material
(code & output) to submit with the Homeworks, see the  Instructions.txt  file.

DATASETS can be found here in ASCII or text format. From Mathnet accounts, you can
   also copy the data as Splus3.4 objects (for inclusion into a .Data directory) from the directory

    /usr/local/StatData/SplusCrs/.Data

In addition, in any environment supporting Splus, you have access to lots of data in pre-supplied
S and Splus directories which you can look at by issuing the command

> search()

and then, for example, if you want to see all of the Splus object entries in the 6th directory on
the list, which in my (Mathnet, Splus3.4) unix environment is "/usr/local/splus34/s/.Datasets",
use the following command. Among other items, you will find the vectors lottery.number and
lottery.payoff there.

> objects(6)
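
Because the position of a data directory on the search list differs between installations, it can be safer to look the position up by name than to hard-code a number such as 6. The sketch below does this in R rather than Splus, so the entry name "package:datasets" is specific to R and is an assumption of this example. The lottery vectors mentioned above are Splus objects and are not assumed to exist here.

```r
path <- search()                         # character vector of attached databases
pos <- match("package:datasets", path)   # position of R's built-in datasets
objs <- objects(pos)                     # names of all objects stored there
head(objs)                               # show the first few names
```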


               TOPICS

   1. Introduction to S-Plus:

Starting and quitting S-Plus, on-line help, S-Plus operators and functions, creating
S-Plus objects, data types (vectors, matrices, factors, functions, lists), managing
data (combining  objects, subsetting, creation of frames), S-Plus graphics.

   2. Monte Carlo and Simulation in S-Plus:

Basic random number generation, applications of LLN and CLT  in simulations,
numerical integration, importance sampling, empirical distributions, Markov Chain
Monte Carlo. Managing loops in Splus.

   3. Numerical Optimization in Statistics:

Objective functions in statistics, and managing functions in  Splus. Linear and nonlinear
least squares, special considerations in maximizing likelihoods, penalized likelihood,
steepest descent, quasi-Newton-Raphson methods, constrained maximization, EM
algorithm. Diagnostics for misspecified models.

   4. Introduction to SAS:

The SAS environment, SAS data sets, sorting and merging, combining and subsetting
data, basic statistical procedures.

   5. Linear and Generalized Linear Models:

Regression summaries, model fitting, prediction, model updating, analysis of residuals,
model criticism, ANOVA, generalized linear  models, specifying link and variance
functions, stepwise model selection, deviance analysis.

   6. Comparisons of implementations in Splus and SAS.

   7. Bootstrapping Methodology:

Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence intervals,
estimation of bias, jackknife, application to regression.

   8. Smoothing & Nonparametric Regression:

Spline smoothing, kernel smoothing, selecting tuning parameters by cross-validation.
Graphical aspects of smoothing.
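
As one illustration of the topics above (bootstrap standard errors in topic 7, together with the loop management mentioned in topic 2), here is a minimal nonparametric bootstrap sketch in R. The simulated exponential data, the seed, and all object names are assumptions made for this example. The sample mean is used only because its standard error has a familiar closed form to check against.

```r
set.seed(7)
n <- 50
x <- rexp(n, rate = 1)              # simulated data (hypothetical)
B <- 2000                           # number of bootstrap resamples

# Each column of idx holds one resample (with replacement) of the indices 1..n
idx <- matrix(sample(n, n * B, replace = TRUE), nrow = n)

# Resampled data arranged with one bootstrap sample per column, no explicit loop
boot.means <- colMeans(matrix(x[idx], nrow = n))

se.boot <- sd(boot.means)           # bootstrap standard error of the mean
se.formula <- sd(x) / sqrt(n)       # usual formula, for comparison
```

The two standard errors should agree closely. For a statistic without a closed-form standard error, such as the median, the same index matrix can be reused with apply in place of colMeans.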


        REFERENCES

  W. N. Venables and B. D. Ripley (2002), Modern Applied
    Statistics with S-Plus, 4th ed. Springer-Verlag, New York.

Recommended:  R. Cody & J. Smith, Applied Statistics and the
    SAS Programming Language, 4th ed. Prentice-Hall.

Additional:

   R. A. Becker, J. M. Chambers, and A. R. Wilks (1988). The
     New S Language.  Pacific Grove, CA: Wadsworth & Brooks/Cole.

   J. M. Chambers and T. J. Hastie (1993).  Statistical Models
     in S. London: Chapman & Hall.


COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have
access to Splus and Matlab under Unix through their mathnet accounts, and
others can have access through glue accounts. PC-based (student) versions
of Minitab, Splus and Matlab are available and not too expensive. R is freely
available in Unix or PC form through this link. SAS in a Unix environment is
available to you free through a course campus-account on the `cluster'
machines which has been set up for you (with individual user-names and
initial passwords which I will distribute.)


Getting Started in SAS.

Various pieces of information to help you get started in using SAS can be
found under an old (F01) course website  Stat430. In particular you can find:

--- running SAS under Unix on cluster machines.

    Instructions and links are included there concerning a downloadable `script'
enabling remote callup of SAS when you are running your cluster account remotely
from a campus WAM or mathnet or glue workstation.

--- an overview of the minimum necessary steps to use SAS from Mathnet.

--- links to stat430 problem assignments.

---  a series of SAS logs with edited outputs for illustrative examples.


Additional Computing Resources. There are many publicly available
datasets for practice data-analyses. Many of them are taken from journal articles
and/or textbooks and documented or interpreted. A good place to start is Statlib.
Datasets needed in the course will either be posted to the course web-page,
or indicated by links which will be provided here.



© Eric V Slud,  Apr. 29, 2004.