Statistics 401 Applied Probability & Statistics II

MWF 2, Rm B0429                                                                 Spring 2011

This course develops standard statistical procedures based on the conceptual introduction
to probability and statistics provided in STAT 400. The mathematical level is that of single-
variable calculus (with some material involving double integrals) and beginning linear
algebra (matrix and vector notation). Topics covered include confidence intervals and
hypothesis tests for means, variances and proportions based on one and two samples;
analysis of variance and linear regression; goodness of fit and categorical data.

For sample problems you can use to practice for the in-class tests or the Final Exam,
including last year's Final and an answer key for the Sample Problems for Test 2, click here.

An answer key for the Sample Final Exam Problems is now also included.

For the current Homework Assignment, click here.

For a cumulatively updated directory of Rscripts, click here.

For a cumulatively updated directory of homework solutions, click here.

Instructor:    Eric Slud, Math. Dept. Rm. 2314,  X 5-5469,  evs@math.umd.edu

Office hours: Monday 4, Th 2, or by appointment.

Prerequisite:   Math 140-141   & Stat 400.

Text:   Probability & Statistics for Engineering and the Sciences with Minitab 14, 7th ed. (2008),
              by J. L. Devore, Duxbury Press.

Coverage:   In the first 2 weeks, we will review Stat 400 ideas and techniques. Afterwards, we will
cover Chapters 7-10, 12, 14 and parts of 11 and 13 of the Devore text, plus some extra handouts
on statistical computing and simulation. For more detailed topics, see the Chapter tables of contents
and the official course syllabus, together with the handouts below.

Grading:   The grade in the course will be based 20% on homeworks (about 8, graded, drawn from the book
plus some supplementary problems of mine), 10% on data-project homeworks, 40% on 2 in-class
tests, and 30% on a comprehensive final.

Computing:   You will need to learn to work with some statistical computing platform to do simple statistical
calculations on moderate to large datasets in the course, and to do data simulations. A calculator or spreadsheet
will not be enough. You may use Minitab or Matlab or R or a standard statistical package like SAS or Stata.
However, I will be providing information and help (and web-posted scripts) only with R. To find
information about which computer labs on campus have which of these kinds of software loaded, click here.

You can find information on getting started with R on the CD that comes with the book, or by visiting the
R web-site, from which you can freely download R software (very similar to Splus) including miscellaneous
packages and datasets. For an introductory tutorial in R, click here. For a quick start, see the Rbasics handout,
and then consider reading more about syntax in a book, such as the early chapters of
W. Venables and Brian Ripley, "Modern Applied Statistics With S" (Springer, currently 4th ed.).

As indicated in the "R_Manual" section of the Devore text's accompanying CD, you can get a special
R package containing all of the book's datasets, from a network of web sites called CRAN that
contain R add-on packages. You do this by the command
> install.packages("Devore7")

To load the files within an R session you type:
> library(Devore7)
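
Once the package is loaded, it can help to see which datasets it provides. The following is only a minimal sketch, assuming Devore7 has been installed from CRAN as described above; the guard with requireNamespace and the variable name devore_sets are my own additions for illustration.

```r
# Sketch: list the datasets shipped with the Devore7 package, if it is installed.
# Assumption: the package was installed earlier with install.packages("Devore7").
if (requireNamespace("Devore7", quietly = TRUE)) {
  library(Devore7)
  # data(package = ...) returns an object whose $results matrix has one row per dataset
  devore_sets <- data(package = "Devore7")$results
  print(head(devore_sets[, c("Item", "Title")]))
} else {
  devore_sets <- NULL
  message("Devore7 is not installed; run install.packages(\"Devore7\") first.")
}
```

An individual dataset can then be inspected by typing its name (as shown in the Item column) at the R prompt.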


Homework Assignments

HW1, due Fri., Feb. 4.
Reading:   Read and Review Chapters 5 and 6 in Devore. Also read the Handout (1) below on Simulation.

#1 Suppose that the independent random variables   Xi   for   i=1,..,100   have density   f(x) = 2x   for 0 < x < 1.
(a) Find the approximate probabilities   P( 45+10j < S < 55+10j)   for   j=1,2,3,4,   where S = X1+...+X100.
(b) Find the expectation and variance of the number of indices   i   for which   Xi > 0.6.
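
A simulation can serve as a sanity check on a Central Limit Theorem calculation of this kind. The sketch below is not a solution: the interval endpoints a and b, the seed and the number of replications are hypothetical choices of mine. It uses the inverse-CDF method (F(x) = x^2 on (0,1), so X = sqrt(U)) to simulate from the density f(x) = 2x.

```r
set.seed(401)                       # arbitrary seed, for reproducibility only
n_rep <- 10000                      # number of simulated sums (my choice)

# Inverse-CDF method: F(x) = x^2 on (0,1), so X = sqrt(U) with U ~ Uniform(0,1)
x <- matrix(sqrt(runif(n_rep * 100)), nrow = n_rep, ncol = 100)
s <- rowSums(x)                     # one simulated value of S per row

# Theoretical moments of a single X: E(X) = 2/3, Var(X) = 1/2 - (2/3)^2 = 1/18
mu_s <- 100 * 2 / 3
sd_s <- sqrt(100 / 18)

# Compare simulated and CLT-approximate probabilities for an illustrative interval
a <- 64; b <- 68                    # hypothetical endpoints, not those of the problem
p_sim <- mean(s > a & s < b)
p_clt <- pnorm(b, mu_s, sd_s) - pnorm(a, mu_s, sd_s)
print(c(simulated = p_sim, clt = p_clt))
```

The two numbers should agree to within simulation error, which shrinks as n_rep grows.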

#2 Read the Simulation of Random Variables Handout and do Problem Sim.3 on page 4 of that handout.

#3 Suppose that   U1,...,U40   are   Uniform[0,θ] random variables, observed as data.
(a). Show that the scaled average S1 = (U1+...+U40)/20 is an unbiased estimator of   θ.
(b). Show that for some constant c,   c*S2   is an unbiased estimator of   θ , where   S2 = max(U1,...,U40).
Hint: Check that   P(S2 < x) = (x/θ)^40   for   0 < x < θ .

#4 Find the standard error of the two estimators   S1   and   c*S2   appearing respectively in
parts (a) and (b) of Problem #3.

#5 Suppose that   Y1,...,Y1000 are independent identically distributed observations with density   f(y) = 1/3  
for   0 < y < 1   and   f(y) = 2/3   for   1 < y < 2, and for k=1,2,3,4   let   Nk =
(# of indices i in 1..1000 with   (k-1)/2 < Yi < k/2). Find the means and variances of each of the relative
frequencies   Nk/1000, for k=1,2,3,4.

HW2, due Wed., Feb. 18.
Complete your review of Chapter 6 (ML Estimators), and read Sections 7.1 through 7.3 in Ch.7 of Devore.
Then solve and hand in the following problems:

#1, 2 Problems 20 and 28 in Sec. 6.2, p.251.

#3 (Do #11 on p.263 for practice and look at its solution in the Solutions manual. Then do and hand in the following problem.)
Suppose that you learn a new method of generating 90% two-sided confidence intervals   (L(X), U(X))   for the unknown mean
μ   for samples   X1, ..., Xn   of data in which the individual values   Xi   are approximately normally distributed, where the sample
size   n   is between 35 and 50. Suppose also that you have a method of simulating independent samples   X(r) = (X_{1,r}, ..., X_{42,r})
for   r=1,...,2000,   on each of which you can calculate the confidence interval   Ir = (L(X(r)), U(X(r))).
(For these simulated intervals, you will know the mean parameter   μ0.)
          (a) What is the approximate number of these confidence intervals   I1, ..., I2000   that you expect to contain the true mean   μ0   ?
          (b) What kind of random variable is the number   N   of samples   r=1,..,2000   for which   μ0   falls outside   Ir   ?
          (c) What is the approximate probability that   N   in (b) is between   185 and 220 , inclusive of endpoints ?
          (d) Approximately how likely is it that of the first   20   of these samples  X(r)   and   intervals   Ir  ,   r=1,..,20,   all contain   μ0   ?
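
To see concretely what such a simulation experiment looks like, here is a rough sketch in R. The "new method" in the problem is unspecified, so I substitute the standard 90% two-sided t interval as a stand-in; the values of mu0 and the standard deviation are hypothetical.

```r
set.seed(42)                          # arbitrary seed
mu0 <- 10; n <- 42; n_rep <- 2000     # mu0 is a hypothetical true mean

# Each row of x is one simulated sample X(r) of size 42
x <- matrix(rnorm(n_rep * n, mean = mu0, sd = 2), nrow = n_rep, ncol = n)
xbar <- rowMeans(x)
s <- apply(x, 1, sd)

# Stand-in interval: the usual 90% two-sided t interval for each sample
half <- qt(0.95, df = n - 1) * s / sqrt(n)
covers <- (xbar - half < mu0) & (mu0 < xbar + half)

n_miss <- sum(!covers)                # the count N of intervals missing mu0
print(c(contain = sum(covers), miss = n_miss))
```

Re-running with different seeds shows how much N varies from one batch of 2000 samples to the next.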

#4, 5 Problems 8 and 10 in Sec. 7.1, p.262.

#6, 7 Problems 18 and 20 in Sec.7.2, p.269.

#8 You can find by clicking here a dataset consisting of the logarithms of the average annual rainfall in inches from 70 US and
Puerto Rico cities (data from the 1975 Statistical Abstracts of the United States).    (a).   Compute a few scaled relative frequency
histograms of these data (with different numbers L of class intervals), and hand in the one that you think best shows the shape of
the underlying density. Overlay on the same histogram plot (by hand if necessary) a graph of the normal density curve with the
same mean and variance as the sample mean and variance of your data. Use this plot and histogram to comment briefly on whether
you think the assumption of normal distribution for these data is tenable.
        (b).   Give a 95% two-sided confidence interval for the mean of these data, using an assumed-known value 0.25 for the
variance and an assumption of normality for the individual data points.
        (c).   Give a 95% two-sided confidence interval for the mean of these data, assuming normality, if the variance is unknown.
        (d).   Re-do parts (b) and (c), giving approximate large-sample confidence intervals dispensing with the assumption of
normality for the data.
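
The mechanics of the plot and the three kinds of interval can be sketched as follows. The rainfall file is not reproduced here, so the vector lograin below is simulated placeholder data with made-up mean and spread; substitute the real values once you have read them in. The number of class intervals is also just an illustrative choice.

```r
set.seed(1975)                        # arbitrary seed
# Placeholder data: 70 simulated values standing in for the log-rainfall dataset
lograin <- rnorm(70, mean = 3.4, sd = 0.5)

n <- length(lograin); xbar <- mean(lograin); s <- sd(lograin)

# Scaled relative-frequency histogram (total area 1) with a normal density overlay
hist(lograin, breaks = 8, freq = FALSE, main = "Histogram with normal overlay")
curve(dnorm(x, mean = xbar, sd = s), add = TRUE, lty = 2)

# 95% interval with variance assumed known (0.25): z quantile, known sigma
ci_known <- xbar + c(-1, 1) * qnorm(0.975) * sqrt(0.25 / n)
# 95% interval with unknown variance, normal data: t quantile with n - 1 df
ci_t <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)
# Large-sample 95% interval without normality: z quantile with the sample sd
ci_large <- xbar + c(-1, 1) * qnorm(0.975) * s / sqrt(n)

print(rbind(ci_known, ci_t, ci_large))
```

Changing the breaks argument of hist is the quickest way to try different numbers L of class intervals.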

HW3, due Wednesday, March 2.   Read the rest of Chapter 7, and the first 2 sections of Ch.8 of Devore.
Solve and hand in the following eight problems:

#1 Use R or other statistical software to simulate 100 samples of size 40 of Gamma(1.3,2.6) data-values   Xi
(i.e., random variables with density   f(x) = (2.6)^1.3   x^0.3   e^(-2.6x) / Γ(1.3)   for x > 0,   for which the mean is   μ = 1.3/2.6 = 0.5).
        (a). For each sample (i.e. each row of a rectangular   100 x 40   array), calculate Xbar, and use it to
define a large-sample 95% CI for   μ   .
        (b). Plot in some form (or print out) the confidence intervals calculated on your 100 samples of size 40,
indicating whether each CI contains the true value 0.5. (Each CI is a function only of Xbar for that sample.)
        (c). How many times did your CI fail to capture the true value   μ = 0.5   ? What is the expected number of
times (out of 100) for this to happen ? Should you have been surprised if this happened as few as 1 time ?
Or if it happened 9 or more times ?

#2 Do #27, p.270 for practice and then hand in the following: Give some numerical computations (in R or
other computing platform) showing what the 95% confidence intervals would be (for some specific
examples of values   X1+...+Xn = k)   and what their actual coverage probabilities would be
under exact Binomial(n,p) probability distributions, for the parameter values
(n=78, p=.57), (n=47, p=.53), (n=46, p=.16) and for each of three confidence intervals: (7.11), (7.10), and the
one given in problem 27. See the Rscript/Coverage.RLog   script for the necessary R coding.
ALSO: for each of these (n,p) parameter combinations, give at least one nearby value of n (for same p) for
which the ordering of "best performance" among the three intervals is altered.
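
The idea of an exact coverage probability can be sketched independently of the posted script: sum the Binomial(n,p) probabilities of those counts k whose interval contains p. The code below is my own illustration, not the Coverage.RLog script. It uses the Wald interval and the Wilson score interval as two familiar examples; which of these, if either, corresponds to (7.10) or (7.11) should be checked against the text. The (n,p) pair in the usage lines is hypothetical.

```r
# Exact coverage probability of an interval rule for a binomial proportion.
# ci_fun(k, n) must return a 2-column matrix (lower, upper), one row per value of k.
coverage <- function(n, p, ci_fun) {
  k <- 0:n
  ci <- ci_fun(k, n)
  inside <- ci[, 1] <= p & p <= ci[, 2]
  sum(dbinom(k, n, p)[inside])       # total probability of counts whose CI covers p
}

# Wald interval: phat +/- z * sqrt(phat * (1 - phat) / n)
wald_ci <- function(k, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2); phat <- k / n
  half <- z * sqrt(phat * (1 - phat) / n)
  cbind(phat - half, phat + half)
}

# Wilson score interval
wilson_ci <- function(k, n, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2); phat <- k / n
  center <- (phat + z^2 / (2 * n)) / (1 + z^2 / n)
  half <- z * sqrt(phat * (1 - phat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
  cbind(center - half, center + half)
}

# Hypothetical (n, p), not one of the assigned pairs
cov_wald <- coverage(30, 0.4, wald_ci)
cov_wilson <- coverage(30, 0.4, wilson_ci)
print(c(wald = cov_wald, wilson = cov_wilson))
```

Because coverage takes the interval rule as an argument, a third rule can be compared by writing one more function of the same shape.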

#3--#4 Do #22, #26, p.269.

#5 Do #38, p.278.

#6 Do #44, p.280.

#7 Do #52, p.281.

#8 Do #12, p.294.

HW4, due Wednesday, March 16.   Read the rest of Chapter 8, and sections 9.1, 9.2 and 9.4
of Ch.9 of Devore. Solve and hand in the following eight problems:

Do #10, pp.293-4,   #20, p.304,   #32, p.306,   #42, p.311,   #52 and #54, p.317-8.

Do #2, p.334, three ways: using a large-sample Z-approximation as covered in Sec.9.1; using a pooled
t-test as in Sec. 9.2; and with the Satterthwaite-Welch approximation as on pp.336-337.
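
In R the three approaches differ only slightly in code. The following sketch uses simulated data with hypothetical means and standard deviations, not the data of the problem, and assumes a two-sided alternative for illustration.

```r
set.seed(9)                               # arbitrary seed
x <- rnorm(40, mean = 5.0, sd = 1.0)      # hypothetical sample 1
y <- rnorm(45, mean = 5.4, sd = 1.5)      # hypothetical sample 2

# (1) Large-sample Z test: standardize the mean difference by its estimated SE
z <- (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))
p_z <- 2 * pnorm(-abs(z))

# (2) Pooled t test: assumes equal variances, df = n1 + n2 - 2
pooled <- t.test(x, y, var.equal = TRUE)

# (3) Satterthwaite-Welch approximation: unequal variances, approximate df
welch <- t.test(x, y, var.equal = FALSE)

print(c(z = p_z, pooled = pooled$p.value, welch = welch$p.value))
print(c(df_pooled = unname(pooled$parameter), df_welch = unname(welch$parameter)))
```

Note that the Welch test statistic is the same number as z in (1); only the reference distribution changes.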

Do #28, p.342.

HW5, due Monday, April 11.   Read Section 9.5. Read Chapter 14 through Section 14.2. Then read
Section 4.6 plus the handout on Empirical Distribution Functions.

#1:   Problem on power and p-value: suppose that you see data values   X1,...,X31   which can be assumed
to be iid normally distributed, with   Xbar = 24.0   and S = 8.0. Suppose that these data were collected
to test the hypothesis   H0: μ ≤ 22.7   versus   HA: μ > 22.7.
        (a).   Give the p-value for the test in which you treat this as a large-sample test (or equivalently,
             where you take σ0=8.0   as known).
        (b).   Find the power of the size .05 test versus   μ1  = 25,   again treating the test as a large-sample test.
        (c).   Re-do part (a), this time treating the test as a small-n one-sample t test. This part
             of the problem requires you to use a calculator or PC to calculate the p-value using a
             program for the t-distribution cumulative distribution function in place of a table.
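
For reference, the R functions pnorm, qnorm and pt do the work that tables would otherwise do. Here is a small sketch of upper-tailed p-value and power calculations; the function names are my own and the numbers in the usage lines are hypothetical, deliberately different from those in the problem.

```r
# Upper-tailed test of H0: mu <= mu0 versus HA: mu > mu0

# p-value treating sigma as known (large-sample z test)
z_pvalue <- function(xbar, mu0, sigma, n) {
  1 - pnorm((xbar - mu0) / (sigma / sqrt(n)))
}

# p-value for the one-sample t test with n - 1 degrees of freedom
t_pvalue <- function(xbar, mu0, s, n) {
  1 - pt((xbar - mu0) / (s / sqrt(n)), df = n - 1)
}

# Power of the size-alpha z test against the alternative mean mu1 > mu0
z_power <- function(mu0, mu1, sigma, n, alpha = 0.05) {
  1 - pnorm(qnorm(1 - alpha) - (mu1 - mu0) / (sigma / sqrt(n)))
}

# Hypothetical numbers, for illustration only
p_z <- z_pvalue(10.5, 10, 2, 25)
p_t <- t_pvalue(10.5, 10, 2, 25)
pow <- z_power(10, 11, 2, 25)
print(c(p_z = p_z, p_t = p_t, power = pow))
```

The t-based p-value comes out larger than the z-based one for the same data, reflecting the heavier tails of the t distribution.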

#2-3:   Ch.9, #62, 64, pp.363-364.

#4:   Ch.9, #68. Do a preliminary test for equality of variances before you decide which two-sample
t interval to use for the mean difference.

#5-#6:   Ch.14, #6, 9, p.575.

#7:   Ch. 9, #72, p.365.

#8:   Ch.4, #94, p.179. Use R to create probability plots using   qqnorm   or   qqplot.
R scripts will be provided.
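
In the meantime, here is a minimal sketch of the two functions, using a simulated skewed sample of my own choosing rather than the data of the problem.

```r
set.seed(4)                      # arbitrary seed
x <- rexp(50, rate = 1)          # hypothetical skewed sample, for illustration

# Normal probability plot: sample quantiles against standard normal quantiles
qqnorm(x, main = "Normal probability plot")
qqline(x)                        # reference line through the quartiles

# qqplot compares two sets of quantiles; here, theoretical exponential quantiles
# at the plotting positions ppoints(n) against the sample
theo <- qexp(ppoints(length(x)), rate = 1)
qqplot(theo, x, xlab = "Exponential quantiles", ylab = "Sample quantiles")
abline(0, 1, lty = 2)            # points near this line indicate a good fit
```

Curvature in the first plot and near-linearity in the second is what one would expect for exponential data.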

HW6, due Wednesday, April 20.   In Devore, finish reading Sec. 14.2, and read Ch. 10,
Sections 1 and 2. Solve and hand in the following six problems:
Ch.14, #8, p.575.
Ch. 4, #92(a), pp.178-179.
Ch.10, #2, 8, pp.378-379;   and #12, 16, pp.384-385.

HW7, due Wednesday, May 4.   Read Sections 12.1 to 12.3, and solve and hand in
the following problems:

Ch.10, p. 385: #18, 20.
Ch.12: #6, p.453;   #12, p.465;   #20, p.466;   #34, 36, p.476.



GENERAL GUIDELINES ON HOMEWORK.

1. Academic Dishonesty. You may ask questions of each other and of me to get hints on how to solve the
various assigned homework problems. However, you may not share computations and written work: you
must each do that work and write it up individually. Homework papers which have identically copied
segments will be regarded as a violation of the campus honor code.

2. Late Homework and Test Make-Ups. The course policy on late homeworks is that they will be accepted
but graded down, by 10 percent if past due by no more than one class session and by 25 percent if later than
that. These penalties will be waived only for medical excuses or valid University-recognized holidays.
Regarding test make-ups, we will adhere to campus policy.



Sample Problems for Tests and Exams

(1) To practice for Test 1, a series of 10 relevant applied/computational problems drawn (selected and
edited) from the Devore "Testbank" (on the CD-ROM coming with the book's 7th edition) can be found
here.
(You may have to zoom in with your browser or MS Word reader to read some of the technical formula
elements in this document.)

(2) A sheet of additional problems relevant to Test 1 can be found here. These are modified from similar
problems that I have given in the past, which would call for a little more theoretical interpretation than
the mechanical `applied' problems coming from the TestBank problem-sheet in (1).

(3) Practice problems for In-Class Test 2 (Wed., Dec. 1) can be found here, along with an answer key.
For a list of topics and problem types, click here.

(4) To see last Fall's Stat 401 Final Exam, from another instructor, click here. In this Exam, the MLE in
Problem 1 is not a topic that we spent much time on, but we did spend some time on it and you should be
able to do it. The other problems are well in the mainstream of what we covered this semester.
Answer key is included here.

(5) Try the sample-exam from 1995 which I have adapted to conform more closely to what we studied
this term. An answer key is included here. (Scroll down a bit in the document to find these Sample Exam answers.)



Handouts (some from Stat 400, and some from John Millson):

(1) 10/20/03 There are two handouts here, respectively on   Transformation of Random
Variables  and on Random Number Generation and Simulation. These topics are very
important for the rest of the course, as they allow us to generate and interpret `artificial data'
to illustrate the meaning of our Probability Limit Theorems (Law of Large Numbers, Central
Limit Theorem) and later statistical results (Consistent Statistical estimators, Confidence
Intervals). In addition, Simulation gives us an `experimental' avenue to calculate via artificial
data probabilities which may be too difficult to figure analytically.

(2) As of 8/23/10 See John Millson's Stat 401 page for handouts on diverse topics related to the course.

(3) 10/22/03 The handout on Normal Approximation to Binomial Distribution contains a
word-problem worked example, as well as some numerical examples of the quality of the
normal approximation to the Binomial.  This example is continued, in a statistical
setting (confidence interval for estimate of a population proportion in a political opinion poll),
in handout (8) below, dated 11/19/03.
     A graph comparing the distribution function values of Binom(100,.3) with its
approximating normal distribution N(30,21) can be found here.
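
A comparison of this kind is easy to reproduce in R. The sketch below is my own, not the code behind the linked graph; the range of k values and the use of a continuity correction (evaluating the normal cdf at k + 0.5) are choices I have made for illustration. Note that N(30,21) means variance 21, so the standard deviation is sqrt(21).

```r
k <- 15:45                                    # range of counts to compare (my choice)
cdf_binom <- pbinom(k, size = 100, prob = 0.3)
# Normal approximation with mean 30 and variance 21, continuity-corrected
cdf_norm <- pnorm(k + 0.5, mean = 30, sd = sqrt(21))

plot(k, cdf_binom, type = "s", xlab = "k", ylab = "P(X <= k)",
     main = "Binom(100, .3) vs N(30, 21)")
lines(k, cdf_norm, lty = 2)

max_diff <- max(abs(cdf_binom - cdf_norm))    # largest discrepancy over this range
print(max_diff)
```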

(4) 9/29/03 This handout concerns numerical calculations for the Binomial approximation to
Hypergeometric random variables, and the Poisson approximation to the Binomial. In addition,
some simulated-data results are given to show that the expectations and probability mass functions
behave as they should according to the relative-frequency interpretation of probabilities.
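
Both approximations can be checked numerically with the built-in probability mass functions. The population sizes and parameters below are hypothetical values chosen for illustration, not those used in the handout.

```r
# Binomial approximation to the hypergeometric: 10 draws without replacement
# from a population of 1000 containing 300 "successes" (hypothetical numbers)
k <- 0:10
p_hyper <- dhyper(k, m = 300, n = 700, k = 10)
p_binom <- dbinom(k, size = 10, prob = 0.3)
print(round(cbind(k, p_hyper, p_binom), 4))

# Poisson approximation to the binomial: large n, small p, lambda = n * p = 2
j <- 0:10
q_binom <- dbinom(j, size = 500, prob = 0.004)
q_pois <- dpois(j, lambda = 2)
print(round(cbind(j, q_binom, q_pois), 4))

gap_hyper <- max(abs(p_hyper - p_binom))
gap_pois <- max(abs(q_binom - q_pois))
```

Shrinking the population in the first comparison, or increasing p in the second, shows where each approximation starts to break down.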

(5) 10/27/03 Example of Simulation for Calculating Probability and Expectation.

(6) 11/3/03 Picture showing the behavior of sample averages   Sn/n as a function of   n   from 
1,...,2000   on each of four sets of simulated data, from different types of random variables.
Within each picture, the sample averages Sn/n are based on progressively larger segments of the
same 2000 data-values, and the point is to see that these averages settle down to the place where
the Law of Large Numbers guarantees they should for large enough   n,   namely the theoretical
expectation of the individual r.v.'s.
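
A picture of this type takes only a few lines of R. The sketch below shows one panel only, with an exponential distribution that I have picked for illustration; the handout's four distributions are not specified here.

```r
set.seed(2000)                       # arbitrary seed
n <- 1:2000
x_exp <- rexp(2000, rate = 2)        # hypothetical choice: Exponential, expectation 0.5

# Running sample averages S_n / n based on the first n of the same 2000 values
run_avg <- cumsum(x_exp) / n

plot(n, run_avg, type = "l", xlab = "n", ylab = "S_n / n",
     main = "Running averages, Exponential(rate 2)")
abline(h = 0.5, lty = 2)             # theoretical expectation
```

Replacing rexp by another generator (and the horizontal line by its expectation) gives the other panels.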

(7) 11/12/03  Pictures showing behavior of scaled relative frequency histograms compared with densities.
The document shows plots of histograms in large simulated samples overlaid with the theoretical densities
they are supposed to represent, and of empirical distribution functions overlaid with the theoretical cdf's
the data in large simulated datasets are supposed to represent. The latter are available in two settings:
         (i) The overlaid empirical and theoretical cdf's for 1000 simulated values of Z1+Z2 (sum of two
independent standard normal deviates) can be found here.
         (ii) The overlaid empirical and theoretical cdf's for 1000 simulated values of U1+...+U100
(sum of 100 independent Uniform[0,1] deviates) can be found here.
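
Overlays like these can be reproduced with the ecdf function. The sketch below is my own and not the code behind the linked pictures. For Z1+Z2 the theoretical cdf is exactly N(0,2); for U1+...+U100 I overlay the normal approximation N(50, 100/12), which is an assumption about what "theoretical cdf" means in that setting.

```r
set.seed(1000)                                   # arbitrary seed

# (i) Z1 + Z2 with Z's independent standard normal: exactly N(0, 2)
z_sum <- rnorm(1000) + rnorm(1000)
plot(ecdf(z_sum), main = "Z1 + Z2: empirical vs theoretical cdf")
curve(pnorm(x, mean = 0, sd = sqrt(2)), add = TRUE, lty = 2)

# (ii) Sum of 100 independent Uniform[0,1]: mean 50, variance 100/12
u_sum <- rowSums(matrix(runif(1000 * 100), nrow = 1000, ncol = 100))
plot(ecdf(u_sum), main = "U1 + ... + U100: empirical vs normal cdf")
curve(pnorm(x, mean = 50, sd = sqrt(100 / 12)), add = TRUE, lty = 2)

# Largest gap between empirical and theoretical cdf at the data points, case (i)
zs <- sort(z_sum)
ks_gap <- max(abs(ecdf(z_sum)(zs) - pnorm(zs, mean = 0, sd = sqrt(2))))
print(ks_gap)
```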

(8) 11/19/03  The word-problem on political opinion polling begun in handout (3) above,
dated 10/22/03, is continued here from the vantage point of statistics, particularly
confidence intervals for estimates of a population proportion in a political opinion poll.


Important Dates


© Eric V Slud, May 6, 2011.