Statistics 705   COMPUTATIONAL STATISTICS IN R

Course Evaluation Note

Fall 2017 MW 5-6:15, MATH Building 1308

Instructor: Eric Slud, Statistics Program, Math. Dept.,   evs@math.umd.edu

Office:  MTH 2314, x5-5469

Office hours:  tentatively, M 3, W 1. But you can make an appointment for office-hour help at other times by emailing me.

Course Text (Recommended):  Venables, W. N. and Ripley, B. D. Modern Applied Statistics with S (4th ed., 2002).  New York: Springer-Verlag.

Additional Recommended texts (free to UMCP students through campus library account):
         Spector, P. Data Manipulation with R (2008), Springer e-book.
         Robert, C. and Casella, G. Introducing Monte Carlo Methods with R (2010), Springer e-book, for use with mcsm R package.
         Albert, J. Bayesian Computation with R (2009), Springer e-book.
         Gentle, J. Computational Statistics (2009), Springer e-book.

      There is no required text. Many R introductions are available on the web, and a set of pdf course notes, including an Introduction to R, is available on this web-page. The recommended Venables and Ripley text contains an excellent introduction to R concepts and syntax, but its main value is as a short and insightful introduction to the use of the major base statistical packages, some of which will be introduced in this course.

Some Recommended Online Resources:
An R Introduction is distributed free with the downloaded R software. It gives a first exposure to R concepts and definitions, but is not as useful as the syntax portions (the first 80 pages) of the Venables and Ripley text. Many other introductions and sets of notes can be found online, e.g. the Rodriguez Princeton Notes. The freely downloadable notes for this course, described under Lecture Notes below, are another good source.

A really useful short summary of a lot of R commands can be found here.

Overview of course: Statistical research and application has changed dramatically because of cheap and powerful computational and graphical tools. This course presents modern methods of computational statistics and their application to both practical problems and research. The techniques covered in STAT 705, which include some numerical-analysis ideas arising particularly in Statistics, should be part of every statistician's toolbox.

Statistical methodology in the course will be presented informally, with emphasis on the intuitive basis for the techniques and brief discussion of their theoretical pedigree. Implementation of each method will be given in R, and each method will be illustrated by application to data, often from real datasets but sometimes from datasets simulated from statistical models.


Prerequisite: STAT 420 or STAT 700, and some programming experience (any language).

Course requirements and Grading: Grading will be based completely on graded DAILY assignments involving data analysis and statistical computation (a total of about 20-22 of them). The homework tasks, of moderate length and difficulty, will be assigned in each class session and will usually be due 2 classes after they are assigned.


Homework Guidelines: For Fall 2017, you may [and it is actually preferred that you] hand in your homeworks electronically as single-document pdf's, directed to the grader by the due date at the specially created gmail address stat705.grader@gmail.com. If you create the homework paper by using text files containing R scripts, then I recommend that you import these into MS Word and save the document as a pdf before sending it to the grader. This way of doing it will make it particularly easy to import R exhibits such as Tables and Graphs as part of a single document. Multiple-document submissions will not be acceptable.

Also: the grader will deduct at least 20% credit for late papers, unless you first (before the due-date) get permission from me for lateness.


For information and Directories on the following topics, click these links:

      Homework information     ,       HW Directory
      Data source info     ,       Data Directory  
      Lecture Notes descriptions     ,       Lecture Notes Directory  
      Rlog and Scripts descriptions     ,       Rlog and Scripts Directory  



HONOR CODE

The University of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate and graduate students. As a student you are responsible for upholding these standards for this course. It is very important for you to be aware of the consequences of cheating, fabrication, facilitation, and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council, please visit http://www.shc.umd.edu.

To further exhibit your commitment to academic integrity, remember to sign the Honor Pledge on all examinations and assignments:
"I pledge on my honor that I have not given or received any unauthorized assistance on this examination (assignment)."



OUTLINE of Course TOPICS

   1. Introduction to R:

Starting and quitting R, on-line help, R operators and functions, creating R objects, data types (vectors, matrices, factors, functions, lists), managing data (combining objects, subsetting, creation of frames), R graphics.
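
As a minimal sketch of the object types listed above (all object names here are illustrative and are not taken from the course notes):

```r
# Basic R objects: vector, matrix, factor, list, data frame
x   <- c(2.5, 3.1, 4.7, 5.0)             # numeric vector
m   <- matrix(1:6, nrow = 2)             # 2 x 3 integer matrix, filled by column
g   <- factor(c("a", "b", "a", "b"))     # factor with levels "a" and "b"
lst <- list(values = x, groups = g)      # list holding objects of different types
dat <- data.frame(x = x, g = g)          # data frame built from a vector and a factor

# Subsetting: rows of the frame where x exceeds 3
big <- dat[dat$x > 3, ]

# A summary function applied within each level of the factor
grp.means <- tapply(dat$x, dat$g, mean)
```

Typing the name of any of these objects at the prompt prints it, and str(dat) gives a compact description of its structure.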

   2. Monte Carlo and Simulation in R:

Basic random number generation, applications of LLN and CLT in simulations, numerical integration, importance sampling, empirical distributions, Markov Chain Monte Carlo. Managing loops in R.
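
A small sketch of two of these ideas, plain Monte Carlo and importance sampling, applied to the tail probability P(Z > 3) for a standard normal Z. The target quantity, the shifted-exponential proposal and the sample size are assumptions chosen for illustration only.

```r
set.seed(705)
n <- 1e5

# Plain Monte Carlo: the fraction of N(0,1) draws exceeding 3 (justified by the LLN)
z <- rnorm(n)
p.plain <- mean(z > 3)

# Importance sampling: draw from the shifted exponential 3 + Exp(1), which puts
# all of its mass on the region of interest, and weight each draw by
# (target density) / (proposal density)
y <- 3 + rexp(n)
w <- dnorm(y) / dexp(y - 3)
p.is  <- mean(w)
se.is <- sd(w) / sqrt(n)                 # CLT-based standard error

# Exact answer for comparison
p.true <- pnorm(3, lower.tail = FALSE)
```

Comparing se.is with the binomial standard error sqrt(p.true * (1 - p.true) / n) of the plain estimate shows how much the change of sampling distribution helps for a rare event.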

   3. Numerical Optimization in Statistics:

Objective functions in statistics, and managing functions in R. Linear and nonlinear least squares, special considerations in maximizing likelihoods, penalized likelihood, steepest descent, quasi-Newton-Raphson methods, constrained maximization, EM algorithm. Diagnostics for misspecified models.
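
As a hedged illustration of likelihood maximization by a quasi-Newton method, the sketch below fits a Gamma model to simulated data with optim (method "BFGS"). The data, the true parameter values and the log-scale parametrization are assumptions made for this example.

```r
set.seed(705)
x <- rgamma(500, shape = 2, rate = 0.5)   # simulated data (illustrative values)

# Negative log-likelihood, parametrized on the log scale so that
# shape and rate stay positive without explicit constraints
negloglik <- function(theta) {
  -sum(dgamma(x, shape = exp(theta[1]), rate = exp(theta[2]), log = TRUE))
}

# Method-of-moments starting values, transformed to the log scale
start <- log(c(mean(x)^2 / var(x), mean(x) / var(x)))

fit <- optim(start, negloglik, method = "BFGS", hessian = TRUE)
mle <- exp(fit$par)                       # back-transform to (shape, rate)

# Approximate standard errors of the log-scale parameters,
# from the inverse of the numerically computed Hessian
se.log <- sqrt(diag(solve(fit$hessian)))
```

Since optim minimizes, the likelihood is maximized by passing the negative log-likelihood; fit$convergence equal to 0 indicates that the optimizer reported successful convergence.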

   4. Linear and Generalized Linear Models:

Regression summaries, model fitting, prediction, model updating, analysis of residuals, model criticism, ANOVA, generalized linear models, specifying link and variance functions, stepwise model selection, deviance analysis.

Brief comparisons of implementations in R and SAS. Fitting mixed-effect (generalized) linear models in R.
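
A brief sketch of model fitting, updating and deviance analysis with glm, using simulated Poisson counts. The data frame d, the covariates x1 and x2, and the coefficient values are hypothetical.

```r
set.seed(705)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rpois(n, exp(0.5 + 0.8 * d$x1))    # x2 has no effect in the simulation

# Nested sequence of Poisson regressions with log link
fit0 <- glm(y ~ 1, family = poisson(link = "log"), data = d)
fit1 <- update(fit0, . ~ . + x1)          # model updating
fit2 <- update(fit1, . ~ . + x2)

# Analysis of deviance table with chi-square tests for each added term
aod <- anova(fit0, fit1, fit2, test = "Chisq")

# Prediction on the response (mean) scale at new covariate values
new  <- data.frame(x1 = c(-1, 0, 1), x2 = 0)
pred <- predict(fit1, newdata = new, type = "response")
```

summary(fit1) prints the coefficient table, and residuals(fit1, type = "deviance") gives residuals for model criticism.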

   5. Bootstrap Methodology:

Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence intervals, estimation of bias, jackknife, application to regression.
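
A minimal sketch of nonparametric and parametric bootstrap standard errors for a sample median. The exponential data, the sample size and the number B of bootstrap replicates are assumptions chosen for illustration.

```r
set.seed(705)
x <- rexp(50, rate = 1)                   # simulated sample
B <- 2000                                 # number of bootstrap replicates

# Nonparametric bootstrap: resample the data with replacement,
# i.e. sample from the empirical CDF
med.star  <- replicate(B, median(sample(x, replace = TRUE)))
se.boot   <- sd(med.star)                            # bootstrap standard error
bias.boot <- mean(med.star) - median(x)              # bootstrap estimate of bias
ci.pct    <- quantile(med.star, c(0.025, 0.975))     # percentile interval

# Parametric bootstrap: resample from the fitted Exponential model
rate.hat <- 1 / mean(x)
med.par  <- replicate(B, median(rexp(length(x), rate = rate.hat)))
se.par   <- sd(med.par)
```

The two standard errors differ because the parametric version relies on the assumed Exponential model while the nonparametric version uses only the observed sample.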

   6. Smoothing & Nonparametric Regression:

Spline smoothing, density estimation, local polynomial regression, kernel smoothing, selecting tuning parameters by cross-validation. Graphical aspects of smoothing.
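
The following sketch applies several base-R smoothers to simulated data. The sine-curve regression function, the noise level, and the particular span and bandwidth values are assumptions made only for this illustration.

```r
set.seed(705)
n <- 200
x <- sort(runif(n, 0, 2 * pi))
y <- sin(x) + rnorm(n, sd = 0.3)          # simulated noisy curve

# Smoothing spline: smoothing parameter chosen by generalized
# cross-validation (the default) or by leave-one-out CV (cv = TRUE)
fit.ss <- smooth.spline(x, y)
fit.cv <- smooth.spline(x, y, cv = TRUE)

# Local quadratic regression and Gaussian kernel smoothing
fit.lo <- loess(y ~ x, span = 0.5, degree = 2)
fit.ks <- ksmooth(x, y, kernel = "normal", bandwidth = 0.8)

# Kernel density estimate of the responses, Sheather-Jones bandwidth
dens <- density(y, bw = "SJ")

# Mean squared error of each fitted curve against the true function
mse.ss <- mean((predict(fit.ss, x)$y - sin(x))^2)
mse.lo <- mean((fitted(fit.lo) - sin(x))^2)
```

For a graphical comparison, plot(x, y) followed by lines(fit.ss) and lines(x, fitted(fit.lo), lty = 2) overlays the fitted curves on the data.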

   7. MCMC and the Gibbs Sampler:

Definitions and basic ideas of MCMC and Gibbs-Sampler simulation methodology, possibly including a brief introduction to `Bayesian Computing' using BUGS through R.
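
As a toy illustration of the Gibbs Sampler (a sketch under assumed settings, not an example from the course notes), the loop below samples a standard bivariate normal with correlation rho by alternating draws from the two full conditional distributions, each of which is univariate normal.

```r
set.seed(705)
rho    <- 0.8                             # assumed target correlation
n.iter <- 5000
burn   <- 500                             # initial iterations to discard

draws <- matrix(NA_real_, n.iter, 2, dimnames = list(NULL, c("x", "y")))
x <- 0; y <- 0                            # arbitrary starting point

for (t in seq_len(n.iter)) {
  # Full conditionals: X | Y = y ~ N(rho * y, 1 - rho^2), and symmetrically for Y
  x <- rnorm(1, mean = rho * y, sd = sqrt(1 - rho^2))
  y <- rnorm(1, mean = rho * x, sd = sqrt(1 - rho^2))
  draws[t, ] <- c(x, y)
}

kept    <- draws[-seq_len(burn), ]
rho.hat <- cor(kept[, "x"], kept[, "y"])  # should be close to rho
```

Successive draws are dependent, so acf(kept[, "x"]) is a useful check on how slowly the chain mixes.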



Getting Started with R

Note: This course is about the R language and statistical programming platform. This free software package is syntactically very similar to the older Splus. If you are new to R, you should get started as soon as possible, using it either on your university Glue account in a Linux setting, or on a workstation or PC, either at the University or on your home computer by downloading the software following instructions at the R website. For the systematic Introduction to R and R reference manual distributed with the R software, either download from the R website or simply invoke the command

> help.start()

from within R. For a quick start, see my own Rbasics handout originally intended for a Survival Analysis class, and then read more about R objects and syntax in the Venables and Ripley text, in my Notes, and in the R introduction manual distributed with the R software.

In the middle of the course, we may also mention SAS and other statistical software, primarily in order to contrast the way in which linear and generalized-linear models are handled in the different software packages, but this course will not spend any time introducing SAS or other software.



LECTURE NOTES will be available here throughout the semester. You can also find at this same location two sets of listings of R functions discussed in these Notes. I suggest downloading and reading the Notes as we arrive at each topic, since they will be updated and modified during the course.

     The topics of the individual pdf note-packets are as follows:

      Sec1NotF09.pdf: Overview, Unix & R preliminaries, R language
                      elements, Vector & Array operations, Inputting Data,
                      and Lists. Functions in R, & how and why to vectorize.

      Sec2NotF09.pdf: Introduction to Pseudo-Random Number Generation.

      Sec3NotF09.pdf: Introduction to Graphics in R. Also: Simulation
                    speedup methods (Accept-Reject & Importance sampling).

      Sec4NotF09.pdf: Numerical maximization methods (for likelihoods).

      Sec5NotF09.pdf: Miscellanea: subsetting & parallelizing, plus
                    Introduction to Smoothing Splines (and their use in
                    quick function-inversion in R).

      Sec6NotF16.pdf: EM (Expectation-maximization) Algorithm for ML
                    estimation with missing data.

      Sec7NotF09.pdf: Markov Chain Monte Carlo: introduction and application
                   in an EM estimation problem in random-intercept logistic
                   regression. For additional pdf files of "Mini-Course"
                     Lectures, see MCMC Mini-Course.

      BayesConjug.pdf: Conjugate priors for Bayesian inference from data
                    assumed to follow Exponential Family distributions.

The remaining Handouts/Notes date from previous years and relate to
comparisons of Splus (which apply also to R) with SAS.

      Lec03Pt5.pdf:  SAS Introduction.

      Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics).

      Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.

      Lec03Pt5D.pdf: Simulation in Splus versus SAS.


HANDOUTS  distributed in class are included for reference here.

The topics treated on these handout logs are as follows:

      Explaining the Gibbs Sampler :  This is a readable, well written introduction
                      to the idea of the Gibbs Sampler, a good choice for reading material
                      to go with the lectures and HW on the Gibbs Sampler and MCMC.
                      For Background on Markov Chain Monte Carlo: First see
                      Introduction and application of MCMC within an EM estimation problem
                      in random-intercept logistic regression. For additional pdf files of
                      "Mini-Course" Lectures, including computer-generated figures, see
                      Lec.1 on Metropolis-Hastings Algorithm, and Lec.2 on Gibbs Sampler,
                      with Figures that can be found in Mini-Course Figure Folders.

      EM example on Random Effects ANOVA:   this is a pdf associated with an old HW
                      problem, not assigned this year, working out the EM iteration
                      for likelihood maximization in a Balanced Two-Way Random
                      Effects Analysis of Variance (ANOVA) setting like the one
                      treated in the Class R Log for 10/22/2015.

      DensNPR.Log :  this log, condensed in Spring '04 from the DensEst.Log and
                      NonPReg.Log below, illustrates several different density
                      estimation, nonparametric regression and smoothing techniques.
                      In addition, the density estimation part has a small section on
                      (Least-Squares) cross-validated bandwidth selection, and the
                      nonparametric regression component also has some material on
                      comparative evaluation of methods using cross-validation.

      Factor.Log :  class handout on R handling of Factors and contrasts
                    (using the Bass data in an illustrative example) within
                    linear model fitting functions.

      Contrasts.txt :  handout mentioned in 4/4/08 class on defining contrasts in R
                   for use with Factors in fitting linear models.
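
A short sketch of how the choice of contrasts changes the coefficients, but not the fitted values, of a one-factor linear model. The data are simulated here (the Bass data of the handout are not used), and the factor g and its group means are hypothetical.

```r
set.seed(705)
g <- factor(rep(c("a", "b", "c"), each = 4))        # balanced three-level factor
y <- c(10, 12, 15)[as.integer(g)] + rnorm(12)       # simulated responses

# Default treatment contrasts: the intercept is the mean of the baseline
# level "a", and the other coefficients are differences from that level
fit.trt <- lm(y ~ g)

# Sum-to-zero contrasts: in this balanced design the intercept is the
# average of the group means, and the coefficients are deviations from it
fit.sum <- lm(y ~ g, contrasts = list(g = "contr.sum"))

grp.means <- tapply(y, g, mean)
```

contrasts(g) displays the coding matrix currently attached to the factor, and model.matrix(fit.sum) shows the resulting design matrix.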

      StepExmp.Log :  gives a script in R and SAS for stepwise (mostly forward)
                    selection of variables for linear regression within an R
                    dataset called "attitude" rating places to work in terms of
                    ratings in various categories reported on numerical scales.
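
A hedged sketch of forward selection on the built-in "attitude" data using the base-R function step, which selects by AIC; the script in the handout itself may use a different criterion or different steps.

```r
# Smallest and largest models under consideration
null.fit <- lm(rating ~ 1, data = attitude)
full.fit <- lm(rating ~ ., data = attitude)

# Forward stepwise selection by AIC, starting from the intercept-only model
fwd <- step(null.fit,
            scope = list(lower = ~ 1, upper = formula(full.fit)),
            direction = "forward", trace = 0)

# Predictors retained in the selected model
chosen <- attr(terms(fwd), "term.labels")
```

Setting trace = 1 prints the candidate terms and their AIC values at each step, and fwd$anova records the sequence of terms added.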

      GLMdispersF08.Log :   is the record of a small R session showing how the dispersion
                    and goodness of fit of glm-fitted model objects can be assessed.

      RNGdemoF08.Log  :  covers an in-class demonstration of random-number generation
                   and simulation, plus a brief section on unix.time  applied to
                    linear-algebra operations.

      Graphics_Rejection.Log :  re-caps an in-class demonstration of acceptance/rejection
                   sampling, with outputs illustrated by graphics.
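
A minimal acceptance/rejection sketch, assuming a Beta(2,2) target density and Uniform(0,1) proposals (both chosen for illustration, not taken from the Log):

```r
set.seed(705)
n.prop <- 20000
M <- 1.5                                  # sup of dbeta(x, 2, 2) / dunif(x), attained at x = 0.5

u <- runif(n.prop)                        # proposals from Uniform(0,1)
v <- runif(n.prop)                        # independent uniforms for the accept/reject decision

# Accept a proposal u with probability f(u) / (M * g(u)), where g = 1 on (0,1)
accept <- v <= dbeta(u, 2, 2) / M
samp   <- u[accept]

acc.rate <- mean(accept)                  # theoretical acceptance rate is 1 / M
```

To illustrate the output graphically, hist(samp, freq = FALSE) followed by curve(dbeta(x, 2, 2), add = TRUE) overlays the target density on a histogram of the accepted draws.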

      ImportSamp.Log :   gives the Log covered in class on Importance Sampling.

      Antith_Contr09: is a Log covered in class about the methods of Antithetic
                   Variables and Control Variates for speeding up Monte Carlo.
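
A small sketch of both variance-reduction ideas, applied to estimating E[exp(U)] = e - 1 for U uniform on (0,1). The integrand and sample sizes are assumptions for illustration.

```r
set.seed(705)
n <- 10000

# Plain Monte Carlo
u     <- runif(n)
plain <- exp(u)

# Antithetic variables: pair each U with 1 - U; n/2 pairs use the
# same number of function evaluations as the plain estimate
u2   <- runif(n / 2)
anti <- (exp(u2) + exp(1 - u2)) / 2

# Control variate: U itself, whose mean 1/2 is known exactly;
# b is the estimated regression coefficient of exp(U) on U
b    <- cov(plain, u) / var(u)
ctrl <- plain - b * (u - 0.5)

est <- c(plain = mean(plain), antithetic = mean(anti), control = mean(ctrl))
se  <- c(plain      = sd(plain) / sqrt(n),
         antithetic = sd(anti)  / sqrt(n / 2),
         control    = sd(ctrl)  / sqrt(n))
```

Both methods work here because exp is monotone: exp(U) and exp(1 - U) are negatively correlated, and exp(U) is strongly correlated with U.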

      Minimiz.Log : is a Log combining two parts: one about numerical minimization using
                   "nlm" with and without supplying "gradient" and "hessian"
                   attributes for the values of the function being minimized.
                   The second part is a log involving Maximization, Root-finding, & vectorization in R.
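
A sketch of calling nlm with an analytic "gradient" attribute, using a normal negative log-likelihood on simulated data. The function name nll, the data and the (mu, log sigma) parametrization are assumptions for this example and are not taken from the Log.

```r
set.seed(705)
x <- rnorm(200, mean = 5, sd = 2)         # simulated data
n <- length(x)

# Negative log-likelihood in (mu, log sigma), up to an additive constant.
# The "gradient" attribute supplies analytic first derivatives to nlm.
nll <- function(th) {
  mu <- th[1]
  s2 <- exp(2 * th[2])
  val <- n * th[2] + sum((x - mu)^2) / (2 * s2)
  attr(val, "gradient") <- c(-sum(x - mu) / s2,
                             n - sum((x - mu)^2) / s2)
  val
}

# nlm minimizes, so maximizing a likelihood means minimizing its negative
fit <- nlm(nll, c(median(x), log(mad(x))), hessian = TRUE)

mu.hat    <- fit$estimate[1]
sigma.hat <- exp(fit$estimate[2])
```

Dropping the attr line makes nlm fall back on numerical derivatives, which allows the two approaches to be compared on the same function.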

      Rfcn.Log  : a log on simulation of Mixtures and inverse functions via uniroot.
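
A hedged sketch of the two ideas named in this entry: simulating from a mixture directly, and inverting its distribution function numerically with uniroot. The two-component normal mixture and the function names pmix and qmix are hypothetical.

```r
# CDF of the mixture 0.3 * N(0, 1) + 0.7 * N(3, 0.5^2)
pmix <- function(x) 0.3 * pnorm(x, 0, 1) + 0.7 * pnorm(x, 3, 0.5)

# Quantile function obtained by solving pmix(x) = p for x
qmix <- function(p) {
  uniroot(function(x) pmix(x) - p, interval = c(-10, 10), tol = 1e-9)$root
}

set.seed(705)
n <- 2000

# Direct simulation: choose a component, then draw from it
comp     <- rbinom(n, 1, 0.7)
x.direct <- rnorm(n, mean = ifelse(comp == 1, 3, 0),
                     sd   = ifelse(comp == 1, 0.5, 1))

# Inverse-CDF simulation: apply the numerical quantile function to uniforms
x.inv <- vapply(runif(n), qmix, numeric(1))
```

uniroot handles one equation at a time, which is why qmix is applied element by element through vapply rather than called on a whole vector.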

      RlogF09.LinRegr.txt: an R log covered in class 10/26/09 about using and
                interpreting the R linear model-fitting function "lm".

      RlogF09.GLM.txt: an R log from 10/28/09 about fitting and comparison of
                 generalized linear models using the R model-fitting function "glm".

      PredSamp.LM: an R log covered in class Nov. 2009 about Bayesian posterior
                 and predictive sampling in normal linear regression (related to
                 "bass" data of Fall 2009 HW 14 and BayesConjug.pdf Lec-Notes file).

      SteamDat.Exmp : illustration using Steam-Use data from Draper and Smith
                regression book, showing PROC REG in SAS and the R steps related
               to function  lm  for reproducing the same computed results.

      CrabsLog.pdf: extended data-fitting example in (Splus and) R for
                        GLM analysis of Horseshoe Crab data discussed
                        extensively in Agresti Categorical Data Analysis book.

      DensEst.Log :  log illustrating several different density estimation
                        techniques (kernel-density estimation, splines, and
                        parametric fitting by a mixture of Gaussian or logistic
                        components) using the Galaxies data from a 1996 article
                        by Roeder. Plots can be found in pdf format here.

      NonPReg.Log :  log illustrating several methods of nonparametric regression and smoothing, using artificial (simulated)
                        data. Methods include kernel-density, lowess, and splines. Plots can be found in pdf format here.

      Bootstr.Log:  log with data examples to illustrate the connections between and mechanics of: Permutational distributions,
                          p-values & confidence intervals, Parametric Bootstrap and (a very quick idea of) Nonparametric Bootstrap.

      A technical report which explains in some detail the idea of "adaptive Gaussian quadrature" related to the topic of
      "Laplace Approximation" covered in Stat 705 can be found here.

        Steps for analysis of kyphosis   dataset (available both in R as a dataset and also under ASCII data directory
         on this web-page) using Generalized Linear Model modules, glm  in  R  and  PROC GENMOD in SAS.

              SASlog1.txt  :     log of practice scripts for categorical data analysis (PROC's FREQ and GENMOD in SAS).

              SASlog2.txt :      log on GLM's and deviance, with Analysis of Deviance Tables and implementations in both SAS and Splus.

              SASlog3.txt :      additional material specifically related to the kyphosis dataset, model-fitting and interpretation in
                  both SAS and R, including some material on `deviance' and `standardized Pearson' logistic-regression residuals.
                  Some additional material on stepwise fitting in PROC LOGISTIC and building an analysis of deviance table from SAS
                  output can be found in another   SASlog .

             Finally, an R log summarizing, for some GLM's, the steps of Fisher scoring versus Newton-Raphson iterations to calculate
             Maximum Likelihood Estimates can be found in NR.FS.Glm .
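
As an illustration of such iterations (a sketch on simulated data, not the contents of NR.FS.Glm), the loop below carries out Newton-Raphson for logistic regression and compares the result with glm. For the canonical logit link the observed and expected information matrices are equal, so Newton-Raphson and Fisher scoring coincide in this particular case.

```r
set.seed(705)
n <- 200
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))       # simulated binary responses
X <- cbind(1, x)                          # design matrix with intercept

beta <- c(0, 0)                           # starting value
for (iter in 1:25) {
  p     <- as.numeric(plogis(X %*% beta)) # fitted probabilities
  score <- crossprod(X, y - p)            # gradient of the log-likelihood
  info  <- crossprod(X, X * (p * (1 - p)))# information matrix X' W X
  delta <- solve(info, score)             # Newton-Raphson step
  beta  <- beta + as.numeric(delta)
  if (max(abs(delta)) < 1e-10) break      # stop when the step is negligible
}

fit <- glm(y ~ x, family = binomial)      # reference fit for comparison
```

The inverse of info at convergence, solve(info), estimates the covariance matrix of the coefficients and can be compared with vcov(fit).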


Listings of special-purpose R functions referenced in Lec-Notes and Handouts can be found here.


HOMEWORK PROBLEMS and due dates (usually 2, sometimes 3 classes after they are assigned),  can be found here. (Occasional solutions will also be posted to the same place.) For guidelines on the amount of material (code & output) to submit with the Homeworks, see the  Instructions.txt file. As described in the Instructions file, Homeworks are to be handed in as hard-copy in class on the due-date.


DATA

Several datasets used in the course and handouts can be found here in ASCII or text format. Later in the course, I may post additional large datasets to shared drive space available on University accounts.

In addition, in any environment supporting R, you have access to lots of data in pre-supplied R libraries, which you can look at by issuing either of the commands

> search()    or     > data()


COMPUTER ACCOUNTS. MATH, STAT, and AMSC graduate students have access to R, SAS and Matlab under Unix through their University glue accounts. R is freely available in Unix or PC form through this link.


Additional Computing Resources. There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib, and additional sources can be found here.

Datasets needed in the course will either be available in indicated R packages, posted to the Data Directory linked to this web-page, or indicated by links which will be provided in this space.



  • The Campus Course Evaluation Website https://www.courseEvalum.umd.edu is open from Dec. 1 to Dec. 12 for you to submit your evaluation of this course. Please take this opportunity to evaluate me and the course during this period!

    CourseEvalUM main page: https://www.CourseEvalUM.umd.edu (top button)


  • Important Dates


    The UMCP Math Department homepage.

    The University of Maryland home page.

    My home page.

    © Eric V Slud,  August 28, 2017.