**Statistics 705
COMPUTATIONAL STATISTICS IN R**

Fall 2017, MW 5:00-6:15, MATH Building 1308

**Instructor:** Eric Slud, Statistics Program, Math. Dept., evs@math.umd.edu
**Office:** MTH 2314, x5-5469
**Office hours:** tentatively M 3, W 1; but you can make an appointment for office-hour help at other times by emailing me.

**Course Text (Recommended):** Venables, W. N. and Ripley, B. D., *Modern Applied Statistics with S-PLUS* (4th ed., 2002). New York: Springer-Verlag.

**Additional Recommended texts (free to UMCP students through campus library account):**

Spector, P. *Data Manipulation with R*
(2008), Springer e-book.

Robert, C. and Casella, G.
*Introducing Monte Carlo Methods with R* (2010), Springer e-book, for use with
**mcsm** R package.

Albert, J. *Bayesian Computation with R*
(2009), Springer e-book.

Gentle, J. *Computational Statistics*
(2009), Springer e-book.

There is no required text. Many **R** introductions are available on the web, and a set of pdf course notes, including an Introduction to **R**, is available on this web page. An excellent introduction to **R** concepts and syntax can be found in the recommended Venables and Ripley text, but the main value you will derive from that excellent book is its short and insightful introduction to the use of the major base statistical packages, some of which will be introduced in this course.

Some Recommended Online Resources:

The **R** Introduction that is distributed free with the downloaded **R** software is at this link. It gives a first exposure to **R** concepts and definitions, but is not as useful as the syntax portions (the first 80 pages) of the Venables and Ripley text. Many other introductions and sets of notes can be found online, e.g. the Rodriguez Princeton Notes. The freely downloadable notes for this course, under the Lecture Notes descriptions below, are another good source.

A really useful short summary of a lot of R commands can be found here.

**Overview of course:** Statistical research and application have changed dramatically because of cheap, powerful computational and graphical tools. This course presents modern methods of computational statistics and their application to both practical problems and research. The techniques covered in STAT 705, which include some numerical-analysis ideas arising particularly in Statistics, should be part of every statistician's toolbox.

Statistical
methodology in the course will be presented informally, with emphasis on the intuitive basis for
the techniques and brief discussion of their theoretical pedigree. Implementation of each method
will be given in **R**, and each method will be illustrated by application to data, often from
real datasets but sometimes from datasets simulated from statistical models.

**Prerequisite:** STAT 420 or STAT 700, and some programming experience (in any language).

**Course requirements and Grading:** Grading will be based entirely on graded **DAILY** assignments involving data analysis and statistical computation (a total of about 20-22 of them). Homework tasks of moderate length and difficulty will be assigned in *each* class session, usually due 2 classes after they are assigned.

**Homework Guidelines:** For Fall 2017, you may [and it is actually **preferred** that you] hand in your homeworks electronically as **single-document pdf's**, directed to the grader by the due date at the specially created *gmail* address **stat705.grader@gmail.com**. If you create the homework paper from text files containing R scripts, then I recommend that you import these into MS Word and save the document as a pdf before sending it to the grader. Doing it this way makes it particularly easy to include R exhibits such as Tables and Graphs within a single document. **Multiple-document submissions will not be acceptable.**

**Also:** the grader will deduct at least 20% credit for late papers, unless you first (before the due date) get permission from me for lateness.

**For information and Directories on the following topics, click these links:**

Homework information, HW Directory
Data source info, Data Directory
Lecture Notes descriptions, Lecture Notes Directory
Rlog and Scripts descriptions, Rlog and Scripts Directory

HONOR CODE

The University
of Maryland, College Park has a nationally recognized Code of Academic Integrity, administered by the
Student Honor Council. This Code sets standards for academic integrity at Maryland for all undergraduate
and graduate students. As a student you are responsible for upholding these standards for this course.
It is very important for you to be aware of the consequences of cheating, fabrication, facilitation,
and plagiarism. For more information on the Code of Academic Integrity or the Student Honor Council,
please visit http://www.shc.umd.edu.

To further exhibit your commitment to academic integrity,
remember to sign the Honor Pledge on all examinations and assignments:

"I pledge on my honor that
I have not given or received any unauthorized assistance on this examination (assignment)."

**1. Introduction to R:**

Starting and quitting **R**, on-line help,
**R** operators and functions, creating **R** objects, data types (vectors, matrices, factors,
functions, lists), managing data (combining objects, subsetting, creation of frames), **R**
graphics.
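As a minimal illustrative sketch of these basics (the object names below are invented for the example, not taken from the course notes):

```r
x <- c(2.5, 7, 1, 4)                 # numeric vector
x[x > 2]                             # logical subsetting: keeps 2.5, 7, 4
m <- matrix(1:6, nrow = 2)           # 2 x 3 matrix, filled column-by-column
m[2, ]                               # second row: 2 4 6
f <- factor(c("a", "b", "a"))        # factor with levels "a" and "b"
table(f)                             # counts per level
L <- list(vec = x, mat = m, fac = f) # list combining heterogeneous objects
L$vec                                # extract a component by name
d <- data.frame(y = x, g = rep(c("a", "b"), 2))  # a small data frame
subset(d, g == "a")                  # row subsetting of a data frame
```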

**2. Monte Carlo and Simulation in R:**

Basic random number generation, applications of the LLN and CLT in simulations, numerical integration, importance sampling, empirical distributions, Markov Chain Monte Carlo. Managing loops in **R**.
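As a small hedged sketch of these ideas (the target quantity, the normal tail probability P(Z > 2), and the shifted-exponential proposal are chosen purely for illustration):

```r
set.seed(705)                        # for reproducibility
N <- 1e5
z <- rnorm(N)                        # basic random number generation
phat <- mean(z > 2)                  # LLN: sample proportion -> P(Z > 2)
se <- sqrt(phat * (1 - phat) / N)    # CLT-based standard error
c(estimate = phat, exact = pnorm(2, lower.tail = FALSE), SE = se)

# Importance sampling: draw from a shifted exponential supported on (2, Inf)
y <- 2 + rexp(N)                     # proposal density g(y) = exp(-(y - 2))
w <- dnorm(y) / exp(-(y - 2))        # weights f(y)/g(y)
c(IS.estimate = mean(w), IS.SE = sd(w) / sqrt(N))  # much smaller SE
```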

**3. Numerical Optimization in Statistics:**

Objective functions in statistics, and managing functions in **R**. Linear and nonlinear least squares, special considerations in maximizing likelihoods, penalized likelihood, steepest descent, quasi-Newton-Raphson methods, constrained maximization, EM algorithm. Diagnostics for misspecified models.
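A hedged sketch of quasi-Newton likelihood maximization in **R** (the Gamma model, sample size, and starting values are illustrative choices, not taken from the course notes):

```r
set.seed(705)
x <- rgamma(200, shape = 3, rate = 2)       # simulated data from known truth
# negative log-likelihood; the log-parameterization keeps shape and rate
# positive -- one simple way to handle a constrained maximization
negllk <- function(th)
  -sum(dgamma(x, shape = exp(th[1]), rate = exp(th[2]), log = TRUE))
fit <- optim(c(0, 0), negllk, method = "BFGS", hessian = TRUE)  # quasi-Newton
exp(fit$par)                                # MLEs of (shape, rate), near (3, 2)
```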

**4. Linear and Generalized Linear Models:**

Regression summaries, model fitting, prediction, model updating, analysis of residuals, model criticism, ANOVA, generalized linear models, specifying link and variance functions, stepwise model selection, deviance analysis.

Brief comparisons of implementations in **R** and SAS. Fitting mixed-effect
(generalized) linear models in **R**.
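A hedged sketch of this model-fitting cycle in **R** (a logistic regression on the built-in mtcars data, chosen only for illustration):

```r
fit <- glm(am ~ wt + hp, family = binomial, data = mtcars)  # logit link
summary(fit)                             # coefficients, deviances, residual df
anova(fit, test = "Chisq")               # sequential analysis-of-deviance table
head(predict(fit, type = "response"))    # fitted probabilities
fit2 <- update(fit, . ~ . - hp)          # model updating: drop a term
plot(residuals(fit, type = "deviance"))  # one simple residual diagnostic
```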

**5. Bootstrap Methodology:**

Parametric bootstrap, empirical CDF, bootstrap standard errors and confidence intervals, estimation of bias, jackknife, application to regression.
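A hedged sketch of the nonparametric bootstrap for a sample median (the data are simulated here just for illustration):

```r
set.seed(705)
x <- rexp(50)                              # observed sample (simulated here)
B <- 2000                                  # number of bootstrap resamples
boot.med <- replicate(B, median(sample(x, replace = TRUE)))
se.boot <- sd(boot.med)                    # bootstrap standard error
bias.boot <- mean(boot.med) - median(x)    # bootstrap estimate of bias
quantile(boot.med, c(0.025, 0.975))        # percentile confidence interval
```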

**6. Smoothing & Nonparametric Regression:**

Spline smoothing, density estimation, local polynomial regression, kernel smoothing, selecting tuning parameters by cross-validation. Graphical aspects of smoothing.
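A hedged sketch of several of these smoothers on the built-in faithful data (the dataset and tuning choices are illustrative):

```r
data(faithful)                                    # Old Faithful geyser data
dens <- density(faithful$eruptions, bw = "SJ")    # kernel density estimate
sp <- smooth.spline(faithful$waiting, faithful$eruptions)  # smoothing spline
sp$lambda                                 # tuning parameter chosen by GCV
lo <- lowess(faithful$waiting, faithful$eruptions)  # local (lowess) smoother
plot(faithful$waiting, faithful$eruptions)          # graphical comparison
lines(lo, col = "red")
```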

**7. MCMC and the Gibbs Sampler:**

Definitions and basic ideas of MCMC and
Gibbs-Sampler simulation methodology, possibly including a brief introduction to `Bayesian Computing'
using BUGS through **R**.
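As a hedged sketch of the basic idea, a two-variable Gibbs sampler for a bivariate normal with correlation rho alternates draws from the two full conditionals, each of which is N(rho * (other variable), 1 - rho^2) (the rho value and chain length below are arbitrary):

```r
set.seed(705)
rho <- 0.8; n.iter <- 5000
x <- y <- numeric(n.iter)                 # both chains started at 0
for (t in 2:n.iter) {
  x[t] <- rnorm(1, rho * y[t - 1], sqrt(1 - rho^2))  # draw X | Y = y[t-1]
  y[t] <- rnorm(1, rho * x[t],     sqrt(1 - rho^2))  # draw Y | X = x[t]
}
cor(x[-(1:500)], y[-(1:500)])             # after burn-in, close to rho = 0.8
```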

Getting Started with **R**

**Note:** This course is about the **R** language and statistical programming platform. This free software package is syntactically very similar to the older **Splus**. If you are new to **R**, you should get started as soon as possible, using it either on your university *Glue account* in a *Linux* setting, or on a workstation or PC, either at the University or at home, by downloading the software following instructions at the **R website**.
For the systematic **Introduction to R** and **R reference manual** distributed with the R
software, either download from the R website or simply
invoke the command

> help.start()

from within R. For a quick start, see my own
Rbasics handout originally
intended for a Survival Analysis class, and then read more about **R** objects and syntax in the
Venables and Ripley text, in my Notes, and in the R introduction manual distributed with
the R software.

In the middle of the course, we may also mention **SAS** and other statistical software,
primarily in order to contrast the way in which linear and generalized-linear models are handled
in the different software packages, but this course will **not** spend any time introducing SAS
or other software.

The topics of the individual pdf note-packets are as follows:

Sec1NotF09.pdf: Overview, Unix & R preliminaries, R language elements, Vector & Array operations, Inputting Data, and Lists. Functions in **R**, & how and why to vectorize.

Sec2NotF09.pdf: Introduction to Pseudo-Random Number Generation.

Sec3NotF09.pdf: Introduction to Graphics in R. Also: Simulation speedup methods (Accept-Reject & Importance sampling).

Sec4NotF09.pdf: Numerical maximization methods (for likelihoods).

Sec5NotF09.pdf: Miscellanea: subsetting & parallelizing, plus an Introduction to Smoothing Splines (and their use in quick function-inversion in **R**).

Sec6NotF16.pdf: EM (Expectation-Maximization) Algorithm for ML estimation with missing data.

Sec7NotF09.pdf: Markov Chain Monte Carlo: introduction and application in an EM estimation problem in random-intercept logistic regression.

For additional pdf files of "Mini-Course" Lectures, see:

BayesConjug.pdf: Conjugate priors for Bayesian inference from data assumed to follow Exponential Family distributions.

Comparisons between Splus (which apply also to R) and SAS:

Lec03Pt5B.pdf: Linear Regression in SAS (including some graphics).

Lec03Pt5C.pdf: Factors, ANOVA and Regression in SAS vs. Splus.

Lec03Pt5D.pdf: Simulation in Splus versus SAS.

**HANDOUTS** distributed in class are included for reference here.

The topics treated on these handout logs are as follows:

Explaining the Gibbs Sampler: This is a readable, well-written introduction to the idea of the Gibbs Sampler, a good choice for reading material to go with the lectures and HW on the Gibbs Sampler and MCMC.

For **Background on Markov Chain Monte Carlo:** First see the Introduction and application of MCMC within an EM estimation problem in random-intercept logistic regression. For additional pdf files of "Mini-Course" Lectures, including computer-generated figures, see Lec.1 on the Metropolis-Hastings Algorithm and Lec.2 on the Gibbs Sampler, with Figures that can be found in the Mini-Course Figure Folders.

EM example on Random Effects ANOVA: this is a pdf associated with an old HW problem, not assigned this year, working out the EM-algorithm likelihood maximization in a Balanced Two-Way Random Effects Analysis of Variance (ANOVA) setting like the one treated in the Class R Log for 10/22/2015.

DensNPR.Log: this log, from Spring '04, is a condensed version of the DensEst.Log and NonPReg.Log below, illustrating several different density estimation and nonparametric regression and smoothing techniques. In addition, the density estimation part has a small section on (Least-Squares) cross-validated bandwidth selection, and the nonparametric regression component also has some material on comparative evaluation of methods using cross-validation.

Factor.Log: class handout on R handling of Factors and contrasts within linear model fitting functions (using the Bass data in an illustrative example).

Contrasts.txt: handout mentioned in the 4/4/08 class on defining contrasts in R for use with Factors in fitting linear models.

StepExmp.Log: gives a script in R and SAS for stepwise (mostly forward) selection of variables for linear regression within an R dataset called "attitude", which rates places to work in terms of ratings in various categories reported on numerical scales.

GLMdispersF08.Log: the record of a small R session showing how the dispersion and goodness of fit of glm-fitted model objects can be assessed.

RNGdemoF08.Log: covers an in-class demonstration of random-number generation and simulation, plus a brief section on **unix.time** applied to linear-algebra operations.

Graphics_Rejection.Log: recaps an in-class demonstration of acceptance/rejection sampling, with outputs illustrated by graphics.

ImportSamp.Log: gives the Log covered in class on Importance Sampling.

Antith_Contr09: a Log covered in class about the methods of Antithetic Variables and Control Variates for speeding up Monte Carlo.

Minimiz.Log: a Log combining two parts: one about numerical maximization using "nlm" with and without supplying "gradient" and "hessian" attributes for the values of the function being minimized; the second involving Maximization, Root-finding, & vectorization in **R**.

Rfcn.Log: a log on simulation of Mixtures and inverse functions via uniroot.

RlogF09.LinRegr.txt: an R log covered in class 10/26/09 about using and interpreting the R linear model-fitting function "lm".

RlogF09.GLM.txt: an R log from 10/28/09 about fitting and comparison of generalized linear models using the R model-fitting function "glm".

PredSamp.LM: an R log covered in class in Nov. 2009 about Bayesian posterior and predictive sampling in normal linear regression (related to the "bass" data of Fall 2009 HW 14 and the BayesConjug.pdf Lec-Notes file).

SteamDat.Exmp: illustration using Steam-Use data from the Draper and Smith regression book, showing PROC REG in SAS and the **R** steps related to function **lm** for reproducing the same computed results.

CrabsLog.pdf: extended data-fitting example in (Splus and) R for GLM analysis of the Horseshoe Crab data discussed extensively in Agresti's **Categorical Data Analysis** book.

DensEst.Log: log illustrating several different density estimation techniques (kernel-density estimation, splines, and parametric fitting by a mixture of Gaussian or logistic components) using the Galaxies data from a 1996 article by Roeder. Plots can be found in pdf format here.

NonPReg.Log: log illustrating several methods of nonparametric regression and smoothing, using artificial (simulated) data. Methods include kernel-density, lowess, and splines. Plots can be found in pdf format here.

Bootstr.Log: log with data examples to illustrate the connections between, and mechanics of: Permutational distributions, p-values & confidence intervals, Parametric Bootstrap, and (a very quick idea of) Nonparametric Bootstrap.

**A technical report which explains in some detail the idea of "adaptive
Gaussian quadrature" related to the topic of "Laplace Approximation"
covered in Stat 705 can be found here. **

Steps for analysis of the kyphosis dataset (available both in R as a dataset and also under the ASCII data directory on this web page) using Generalized Linear Model modules, **glm** in **R** and PROC GENMOD in SAS.

SASlog1.txt: log of practice scripts for categorical data analysis (PROC's FREQ and GENMOD in SAS).

SASlog2.txt: log on GLM's and deviance, with Analysis of Deviance Tables and implementations in both SAS and **Splus**.

SASlog3.txt: additional material specifically related to the kyphosis dataset, model-fitting and interpretation in both SAS and **R**, including some material on `deviance' and `standardized Pearson' logistic-regression residuals.

Some additional material on stepwise fitting in PROC LOGISTIC and building an analysis of deviance table from SAS output can be found in another SASlog.

Finally, an R log summarizing the steps, in some GLM's, of Fisher scoring versus Newton-Raphson iterations to calculate Maximum Likelihood Estimates can be found in NR.FS.Glm.

**Listings of special-purpose R functions referenced in Lec-Notes and Handouts
can be found here.**

**HOMEWORK PROBLEMS** and due dates (usually 2, sometimes 3 classes after they are assigned) can be found here. (*Occasional solutions will also be posted to the same place.*) For guidelines on the amount of material (code & output) to submit with the Homeworks, see the Instructions.txt file. *As described in the Instructions file, Homeworks may also be handed in as hard-copy in class on the due date, although electronic submission (see the Homework Guidelines above) is preferred.*

**DATA**

Several datasets used in the course and handouts can be found here in ASCII or text format. Later in the course, I may post additional large datasets to shared drive space available on University accounts.

In addition, in any environment supporting R, you have access to lots of data in pre-supplied R libraries, which you can look at by issuing either of the commands

> search()
> data()
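For example (the attitude dataset and the MASS package are just illustrations of what these commands reveal):

```r
data(attitude)     # load a built-in dataset into the workspace
head(attitude)     # first six rows of the ratings data
library(MASS)      # attaching a package adds its datasets to the search path
head(data(package = "MASS")$results[, c("Item", "Title")])  # list them
```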

**COMPUTER ACCOUNTS.** MATH, STAT, and AMSC graduate students have access to R, SAS and Matlab under Unix through their University glue accounts. **R** is freely available in Unix or PC form through this link.

**Additional Computing Resources.** There are many publicly available datasets for practice data-analyses. Many of them are taken from journal articles and/or textbooks and documented or interpreted. A good place to start is Statlib, and additional sources can be found here.

Datasets needed in the course will either be available in indicated **R** packages, posted to the Data Directory linked to this web page, or indicated by links which will be provided in this space.

**CourseEvalUM main page: https://www.CourseEvalUM.umd.edu (top button)**

**First Class:** Mon., August 28, 2017
**Labor Day Holiday:** Mon., Sept. 4, 2017, NO CLASS
**Last schedule-adjustment Date (for Drop/Withdrawal):** November 6, 2017
**NOTE:** No Class on November 22, 2017, **Thanksgiving Break**
**Last day of classes:** Mon., December 11, 2017

The UMCP Math Department homepage.

The University of Maryland home page.

My home page.

© Eric V Slud, August 28, 2017.