Statistics 705   COMPUTATIONAL STATISTICS

Spring 2008                                                       MWF 2, Hornbake 0125

Instructor: Eric Slud, Statistics Program, Math. Dept.

Office:  Mth 2314, x5-5469;     evs@math.umd.edu

Office hours:  tentatively, M 1, W 11, and Th 1. But you can make an appointment for
     office-hour help at other times by emailing me.

Course Text:  Venables, W. N. and Ripley, B. D.  Modern Applied Statistics
   with S-PLUS (4th ed., 2002).  New York: Springer-Verlag.

Recommended:  R. Cody & J. Smith, Applied Statistics and the
    SAS Programming Language, 4th ed. Prentice-Hall.

Additional:

   R. A. Becker, J. M. Chambers, and A. R. Wilks (1988). The
     New S Language.  Pacific Grove, CA: Wadsworth & Brooks/Cole.

   J.M. Chambers and T.J. Hastie (1993).  Statistical Models in S.
     London: Chapman & Hall.


For information concerning Homework click here.


Overview: Statistical research and application have changed dramatically because of
cheap and powerful computational and graphical tools.  This course presents modern
methods of computational statistics and their application to both practical problems
and research.  The techniques covered in STAT 705, which include some
numerical-analysis ideas arising particularly in Statistics, should be part of every
statistician's toolbox.

Statistical methodology will be presented informally, with emphasis on the intuitive basis
for the techniques and brief discussion of their theoretical pedigree. Implementation of
each method will be given in R, SAS, or both, and each method will be illustrated by
application to data, often from real datasets but sometimes simulated.

Prerequisite: STAT 420 or STAT 700, and some programming experience (any language).

Course requirements and Grading: Grading will be based entirely on DAILY homework
assignments involving data analysis and statistical computation (about 40 in total).
The homework tasks, of moderate length and difficulty, will be assigned in each class session
and are usually to be handed in 2 classes after the one in which the assignment is given.



TOPICS

   1. Introduction to R:

Starting and quitting R, on-line help, R operators and functions, creating
R objects, data types (vectors, matrices, factors, functions, lists), managing
data (combining  objects, subsetting, creation of frames), R graphics.

   2. Monte Carlo and Simulation in R:

Basic random number generation, applications of LLN and CLT  in simulations,
numerical integration, importance sampling, empirical distributions, Markov Chain
Monte Carlo. Managing loops in R.

   3. Numerical Optimization in Statistics:

Objective functions in statistics, and managing functions in  R. Linear and nonlinear
least squares, special considerations in maximizing likelihoods, penalized likelihood,
steepest descent, quasi-Newton-Raphson methods, constrained maximization, EM
algorithm. Diagnostics for misspecified models.

   4. Introduction to SAS:

The SAS environment, SAS data sets, sorting and merging, combining and subsetting
data, basic statistical procedures.

   5. Linear and Generalized Linear Models:

Regression summaries, model fitting, prediction, model updating, analysis of residuals,
model criticism, ANOVA, generalized linear  models, specifying link and variance
functions, stepwise model selection, deviance analysis.

Comparisons of implementations in R and SAS. Fitting mixed-effect (generalized)
linear models in R.

   6. Bootstrapping Methodology:

Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence intervals,
estimation of bias, jackknife, application to regression.

   7. Smoothing & Nonparametric Regression:

Spline smoothing, kernel smoothing, selecting tuning parameters by cross-validation.
Graphical aspects of smoothing.

   8. MCMC and the Gibbs Sampler:

Definitions and basic ideas of MCMC and Gibbs-Sampler simulation methodology,
including a brief introduction to `Bayesian Computing' using BUGS through R.
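
To give a flavor of the style of computation in topic 2 above, here is a minimal R
sketch of Monte Carlo integration, with the LLN supplying the estimate and the CLT
supplying its approximate standard error. The integrand g, the sample size, and the
seed are illustrative assumptions, not taken from the course notes.

```r
set.seed(705)                          # arbitrary seed, for reproducibility only
n <- 1e5                               # number of simulated draws (assumed)
g <- function(x) exp(-x^2)             # illustrative integrand on [0,1] (assumed)

u    <- runif(n)                       # independent Uniform(0,1) draws
vals <- g(u)                           # vectorized evaluation, no explicit loop
est  <- mean(vals)                     # LLN: sample mean approximates the integral of g over [0,1]
se   <- sd(vals) / sqrt(n)             # CLT: approximate standard error of est
ci   <- est + c(-1, 1) * qnorm(0.975) * se   # approximate 95% confidence interval

exact <- integrate(g, 0, 1)$value      # deterministic quadrature, for comparison
print(c(estimate = est, std.error = se, quadrature = exact))
```

The standard error shrinks like 1/sqrt(n), which is the motivation for variance-reduction
devices such as importance sampling.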



Getting Started with R

Note: The course will concentrate heavily on R, which is a free software package
syntactically almost identical to Splus, which was the software emphasized in the course
up to a couple of years ago. If you are new to R, you should get started as soon as possible,
using it either on the MathNet or WAM machines (where it is already loaded and installed)
or on your home computer by downloading the software following instructions at the
R website. For the systematic Introduction to R and R reference manual distributed
with the R software, either download from the R website or simply invoke the command

> help.start()

from within R. For a slightly less extensive introductory tutorial in R, click here. For a quick
start, see the Rbasics handout, and then start reading about S (or equivalently, R)
syntax in the Venables and Ripley text.

In the middle of the course, we will give a brief introduction to SAS, primarily in order to
contrast the way in which linear and generalized-linear models are handled in the two
software packages, but this course is not a general introduction to SAS.



LECTURE NOTES will be available here throughout the semester. You can also
find at this same location several sets of listings of R functions discussed in these Notes.

I suggest downloading and reading the Notes as we arrive at each topic,
since they will be updated and modified during the course.

     The topics of individual pdf-file note-packets are as follows:

      Notes1.pdf : Overview, Unix & R preliminaries, R language
                         elements, Vector & Array operations, Inputting Data,
                         and Lists.

      Notes2.pdf: Functions in R, & how and why to vectorize. Also:
                         Introduction to Pseudo-Random Number Generation.

      Notes3.pdf : Introduction to Graphics in Splus. Also: Simulation
                        speedup methods (Accept-Reject and Importance sampling).

      Notes4.pdf : Numerical maximization methods (for likelihoods).

      Notes5.pdf: Miscellanea: subsetting & parallelizing plus:
                         Introduction to Smoothing Splines (and their use in
                         quick function-inversion in R).

      Notes6.pdf : EM (Expectation-maximization) Algorithm for ML
                         estimation with missing data.

      Lec03Pt5.pdf:  SAS Introduction.

      Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics.)

      Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.

      Lec03Pt5D.pdf: Simulation in Splus versus SAS.

      Lec03Pt6.pdf: Markov Chain Monte Carlo: introduction and application
                        in an EM estimation problem in random-intercept logistic
                        regression.

HANDOUTS  distributed in class are included for reference here.

The topics treated in these handout R and SAS logs are as follows:

      DensNPR.Log  :  this log is a condensed version, for handouts 4/28 and 4/30
                       in Spring '04, of the DensEst.Log and NonPReg.Log below,
                       illustrating several different density estimation and
                       nonparametric regression and smoothing techniques. In addition,
                       the density estimation part has a small section on (Least-
                       Squares) cross-validated bandwidth selection, and the
                       nonparametric regression component also has some material on
                       comparative evaluation of methods using cross-validation.

      Factor.Log :  class handout on R handling of Factors and contrasts
                     (using the Bass data in an illustrative example) within
                    linear model fitting functions.

      Contr.txt :  handout mentioned in 4/4/08 class on defining contrasts in R
                    for use with Factors in fitting linear models.

      BassSAS.txt :  scripts in Splus for an illustrative regression in SAS
                      on a dataset involving fish (Bass) in polluted lakes.

      StepExmp.Log :  gives a script in R and SAS for stepwise (mostly forward)
                     selection of variables for linear regression within an R
                     dataset called "attitude" rating places to work in terms of
                     ratings in various categories reported on numerical scales.

      GLMlog.R :    is the record of a small R session showing how the dispersion
                     and goodness of fit of glm-fitted model objects can be assessed.

      Rlog1.txt  :  covers an in-class demonstration of random-number
                         generation and simulation, plus a brief section on
                    unix.time  applied to linear-algebra operations.

      Rlog2.txt  :  recaps an in-class demonstration of acceptance/
                         rejection sampling, with outputs illustrated by
                        (scatterplot-related) graphics.

      Rlog3.txt  :  a log related to Maximization, Root-finding, &
                          vectorization in Splus.

      Rfcn.Log   :  a log related to simulation of Mixtures and defining
                         inverse functions via uniroot.

      Slog4.txt  :  illustration using Steam-Use data from Draper and
                         Smith regression book, showing PROC REG in SAS and
                         the Splus steps related to function  lm  for
                         reproducing the same computed results.

      CrabsLog.pdf : extended data-fitting example in (Splus and) R for
                         GLM analysis of Horseshoe Crab data discussed
                         extensively in Agresti Categorical Data Analysis book.

      DensEst.Log  :  log illustrating several different density estimation
                         techniques (kernel-density estimation, splines, and
                         parametric fitting by a mixture of Gaussian or logistic
                         components) using the Galaxies data from a 1996 article
                         by Roeder. Plots can be found in pdf format here.

      NonPReg.Log  :  log illustrating several methods of nonparametric
                         regression and smoothing, using artificial (simulated)
                         data. Methods include kernel-density, lowess, and splines.
                         Plots can be found in pdf format here.

      Bootstr.Log :  log with data examples to illustrate the connections between
                         and mechanics of: Permutational distributions, p-values and
                         confidence intervals, Parametric Bootstrap and (a very quick
                         idea of) Nonparametric Bootstrap.

      Steps for analysis of the kyphosis dataset (available both in Splus as
      a dataset and also under the ASCII data directory on this web-page) using
      Generalized Linear Model modules, glm in Splus and PROC GENMOD in SAS:

      SASlog1.txt  :  log of practice scripts for categorical data analysis
                         (PROCs FREQ and GENMOD in SAS).

      SASlog2.txt  :  log on GLM's and deviance, with Analysis of Deviance Tables
                         and implementations in both SAS and Splus.

      SASlog3.txt  :  additional material specifically related to the kyphosis dataset,
                         model-fitting and interpretation in both SAS and R, including
                         some material on `deviance' and `standardized Pearson'
                         logistic-regression residuals. Some additional material on
                         stepwise fitting in PROC LOGISTIC and building an analysis of
                         deviance table from SAS output can be found in another SASlog.

      Finally, a little Splus log comparing, for some GLM's, the steps of Fisher scoring
      and of Newton-Raphson iterations to calculate Maximum Likelihood Estimates can be
      found in NR.FS.Glm.
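
For readers who want to try the kyphosis analysis before opening the logs, a logistic
regression can be started in R roughly as follows. This is only a sketch under stated
assumptions: it uses the copy of the kyphosis data shipped in the rpart package (with
variables Kyphosis, Age, Number, Start), and the model formula is illustrative, not
necessarily the one fitted in the handouts.

```r
library(rpart)                         # assumed source of the 'kyphosis' data frame in R

# Logistic regression of kyphosis presence on the three covariates (illustrative model)
fit <- glm(Kyphosis ~ Age + Number + Start,
           family = binomial, data = kyphosis)

summary(fit)                           # coefficients, null and residual deviance
anova(fit, test = "Chisq")             # sequential Analysis of Deviance table

# Deviance and Pearson residuals, as discussed in the SAS logs
dev.res  <- residuals(fit, type = "deviance")
pear.res <- residuals(fit, type = "pearson")
```

The residual deviance reported by summary(fit) equals the sum of squared deviance
residuals, which is one quick check when comparing output against PROC GENMOD.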


Listings of all special-purpose R functions referenced
in the Lecture Notes and Handouts can be found here.


HOMEWORK PROBLEMS and due dates (usually 3 classes after they
are assigned) can be found here. (Occasional solutions will also be
posted to the same place.)

For guidelines on the amount of material
(code & output) to submit with the Homeworks, see the Instructions.txt file.


DATA

Several datasets used in the course and handouts can be found here in ASCII or text format.
From Mathnet accounts, later in the course you will be able to find additional datasets in R
workspaces in the directory

    /nfs/projects/statdata/SplusCrs/Data

In addition, in any environment supporting R, you have access to lots of data in pre-supplied
R libraries, which you can look at by issuing either of the commands

> search()     or     > data()
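
For instance, a short session along the following lines lists what is available and
loads one dataset. This is only a sketch: it assumes a standard R installation with the
base `datasets' package, and uses the "attitude" data frame (the one referred to in
StepExmp.Log) purely as an example.

```r
search()                               # attached packages and environments, in search order

ds <- data(package = "datasets")       # index of datasets supplied by the 'datasets' package
head(ds$results[, c("Item", "Title")]) # first few dataset names with their descriptions

data(attitude)                         # make the 'attitude' data frame available by name
str(attitude)                          # variable names and types
summary(attitude$rating)               # quick numerical summary of one column
```

Typing ?attitude (or help(attitude)) then shows the documentation for the dataset.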


COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have
access to R and Matlab under Unix through their mathnet accounts, and
others can have access through glue accounts. PC-based (student) versions
of Minitab, Splus and Matlab are available and not too expensive. R is freely
available in Unix or PC form through this link. SAS in a Unix environment is
available to you free through a WAM account.


Getting Started in SAS.

Various pieces of information to help you get started in using SAS can be
found under the course website Stat430. In particular you can find:

--- running SAS on University machines.

    Instructions and links are included there concerning a downloadable `script'
enabling remote callup of SAS when you are logged into your cluster account remotely,
away from a campus WAM or mathnet or glue workstation.

--- an overview of the minimum necessary steps to use SAS from Mathnet.

--- links to stat430 problem assignments.

---  a series of SAS logs with edited outputs for illustrative examples.


Additional Computing Resources. There are many publicly available
datasets for practice data-analyses. Many of them are taken from journal articles
and/or textbooks and documented or interpreted. A good place to start is Statlib.
Datasets needed in the course will either be posted to the course web-page,
or indicated by links which will be provided here.


The UMCP Math Department home page.

The University of Maryland home page.

My home page.

© Eric V Slud,  April 25, 2008.