Stat 401 Home-Page Spring 2011

Statistics 401 Applied Probability & Statistics II

MWF 2, Rm B0429 Spring 2011

This course develops standard statistical procedures based on the conceptual introduction
to probability and statistics provided in STAT 400. The mathematical level is that of single-
variable calculus (with some material involving double integrals) and beginning linear
algebra (matrix and vector notation). Topics covered include confidence intervals and
hypothesis tests for means, variances and proportions based on one and two samples;
analysis of variance and linear regression; goodness of fit and categorical data.

For sample problems you can use to practice for the in-class tests or the Final Exam,
including last year's Final and an answer key for the Sample Problems for Test 2, click here.
An answer key for the Sample Final Exam Problems is now also included.

For the current Homework Assignment, click here.

For a cumulatively updated directory of Rscripts, click here.

For a cumulatively updated directory of homework solutions, click here.

Instructor: Eric Slud, Math. Dept. Rm. 2314, X 5-5469, evs@math.umd.edu

Office hours: Monday 4, Th 2, or by appointment.

Prerequisite: Math 140-141 & Stat 400.

Text: Probability & Statistics for Engineering and the Sciences with Minitab 14, 7th ed. (2008),
by J. L. Devore, Duxbury Press.

Coverage: In the first 2 weeks, we will review Stat 400 ideas and techniques. Afterwards, we will
cover Chapters 7-10, 12, 14 and parts of 11 and 13 of the Devore text, plus some extra handouts
on statistical computing and simulation. For more detailed topics, see the Chapter tables of contents
and the official course syllabus, together with the handouts below.

Grading: The course grade will be based 20% on homeworks (about 8, graded; from the book, with
some supplementary problems of mine), 10% on data-project homeworks, 40% on 2 in-class
tests, and 30% on a comprehensive final.

Computing: You will need to learn to work with some statistical computing platform to do simple statistical
calculations on moderate to large datasets in the course, and to do data simulations. A calculator or spreadsheet
will not be enough. You may use Minitab or Matlab or R or a standard statistical package like SAS or Stata.
However, I will be providing information and help (and web-posted scripts) only with R. To find
information about which computer labs on campus have which of these kinds of software loaded, click here.

You can find information on getting started with R on the CD that comes with the book, or by visiting the
R web-site from which you can freely download R software (very similar to Splus) including miscellaneous
packages and datasets. For an introductory tutorial in R, click here. For a quick start, see Rbasics handout,
and then consider reading more about syntax in a book, like the (early chapters of the) book of
W. Venables and Brian Ripley, "Modern Applied Statistics With S" (Springer, currently 4th ed.).

As indicated in the "R_Manual" section of the Devore text's accompanying CD, you can get a special
R package containing all of the book's datasets, from a network of web sites called CRAN that
contain R add-on packages. You do this by the command
> install.packages("Devore7")

To load the files within an R session you type:
> library(Devore7)
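
Once the package is installed, a check along the following lines can confirm that it is available and list the datasets it contains. This is only a minimal sketch: it assumes nothing beyond the package name Devore7 given above, and the variable names has_devore and ds are introduced here purely for illustration.

```r
# Check whether the Devore7 package is installed
# (has_devore is an illustrative name, not part of the package).
has_devore <- requireNamespace("Devore7", quietly = TRUE)

if (has_devore) {
  # List the datasets shipped with the package, by item name and title.
  ds <- data(package = "Devore7")$results[, c("Item", "Title")]
  print(head(ds))
} else {
  message("Devore7 is not installed; run install.packages(\"Devore7\") first.")
}
```

If the package is present, the first few dataset names and titles are printed; otherwise a reminder to install it is shown.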

Homework Assignments

HW1, due Fri., Feb. 4.
Reading: Read and review Chapters 5 and 6 in Devore. Also read the Handout (1) below on Simulation.

#1 Suppose that the independent random variables X_i for i=1,..,100 have density f(x) = 2x for 0 < x < 1.
(a) Find the approximate probabilities P( 45+10j < S < 55+10j) for j=1,2,3,4, where S = X₁+...+X₁₀₀.
(b) Find the expectation and variance of the number of indices i for which X_i > 0.6.

#2 Read the Simulation of Random Variables Handout and do Problem Sim.3 on page 4 of that handout.

#3 Suppose that U₁,...,U₄₀ are Uniform[0,θ] random variables, observed as data.
(a). Show that the scaled average S₁ = (U₁+...+U₄₀)/20 is an unbiased estimator of θ.
(b). Show that for some constant c, c*S₂ is an unbiased estimator of θ, where S₂ = max(U₁,...,U₄₀).
Hint: Check that P(S₂ < x) = (x/θ)⁴⁰ for 0 < x < θ.

#4 Find the standard error of the two estimators S₁ and c*S₂ appearing respectively in
parts (a) and (b) of Problem #3.

#5 Suppose that Y₁,...,Y₁₀₀₀ are independent identically distributed observations with density f(y) = 1/3
for 0 < y < 1 and f(y) = 2/3 for 1 < y < 2, and for k=1,2,3,4 let N_k =
(# of indices i in 1..1000 with (k-1)/2 < Y_i < k/2). Find the means and variances of each of the relative
frequencies N_k/1000, for k=1,2,3,4.

HW2, due Wed., Feb. 18.
Complete your review of Chapter 6 (ML Estimators), and read Sections 7.1 through 7.3 in Ch.7 of Devore.
Then solve and hand in the following problems:

#1, 2 Problems 20 and 28 in Sec. 6.2, p.251.

#3 (Do #11 on p.263 for practice and look at its solution in the Solutions manual. Then do and hand in the following problem.)
Suppose that you learn a new method of generating 90% two-sided confidence intervals (L(X), U(X)) for the unknown mean
μ for samples X₁, ..., X_n of data in which the individual values X_i are approximately normally distributed, where the sample
size n is between 35 and 50. Suppose also that you have a method of simulating independent samples X^(r) = (X_{1,r}, ..., X_{42,r})
for r=1,...,2000, on each of which you can calculate the confidence interval I_r = (L(X^(r)), U(X^(r))).
(For these simulated intervals, you will know the mean parameter μ₀.)
          (a) What is the approximate number of these confidence intervals I₁, ..., I₂₀₀₀ that you expect to contain the true mean μ₀ ?
          (b) What kind of random variable is the number N of samples r=1,..,2000 for which μ₀ falls outside I_r ?
          (c) What is the approximate probability that N in (b) is between 185 and 220 , inclusive of endpoints ?
          (d) Approximately how likely is it that of the first 20 of these samples X^(r) and intervals I_r , r=1,..,20, all contain μ₀ ?

#4, 5 Problems 8 and 10 in Sec. 7.1, p.262.

#6, 7 Problems 18 and 20 in Sec.7.2, p.269.

#8 You can find by clicking here a dataset consisting of the logarithms of the average annual rainfall in inches from 70 US and
Puerto Rico cities (data from the 1975 Statistical Abstracts of the United States). (a). Compute a few scaled relative frequency
histograms of these data (with different numbers L of class intervals), and hand in the one that you think best shows the shape of
the underlying density. Overlay on the same histogram plot (by hand if necessary) a graph of the normal density curve with the
same mean and variance as the sample mean and variance of your data. Use this plot and histogram to comment briefly on whether
you think the assumption of normal distribution for these data is tenable.
        (b). Give a 95% two-sided confidence interval for the mean of these data, using an assumed-known value 0.25 for the
variance and an assumption of normality for the individual data points.
        (c). Give a 95% two-sided confidence interval for the mean of these data, assuming normality, if the variance is unknown.
        (d). Re-do parts (b) and (c), giving approximate large-sample confidence intervals dispensing with the assumption of
normality for the data.
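
For the histogram overlay in #8(a) and the interval formulas in parts (b) and (c), the following is a minimal R sketch run on simulated stand-in data, not on the rainfall dataset. The seed, the simulated mean 3.4, the choice L = 10, and all variable names are assumptions made for illustration; only the assumed-known variance 0.25 and the sample size 70 come from the problem statement.

```r
# Simulated stand-in data (NOT the rainfall data): 70 normal values.
set.seed(401)
y <- rnorm(70, mean = 3.4, sd = 0.5)
n <- length(y)
ybar <- mean(y)
s <- sd(y)

# Scaled relative frequency histogram (total area 1).
# 'breaks = L' is only a suggestion; R may choose a nearby number of classes.
L <- 10
hist(y, breaks = L, freq = FALSE, main = "Histogram with normal overlay")
# Overlay the normal density with the sample mean and standard deviation.
curve(dnorm(x, mean = ybar, sd = s), add = TRUE, lwd = 2)

# 95% two-sided CI for the mean with variance assumed known (0.25).
sigma_known <- sqrt(0.25)
ci_z <- ybar + c(-1, 1) * qnorm(0.975) * sigma_known / sqrt(n)

# 95% two-sided CI for the mean with variance unknown (t interval).
ci_t <- ybar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)

print(rbind(known_variance = ci_z, unknown_variance = ci_t))
```

Trying several values of L on the same data shows how strongly the apparent shape of the histogram can depend on the number of class intervals.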

HW3, due Wednesday, March 2. Read the rest of Chapter 7, and the first 2 sections of Ch.8 of Devore.
Solve and hand in the following eight problems:

#1 Use R or other statistical software to simulate 100 samples of size 40 of Gamma(1.3,2.6) data-values X_i
(i.e., random variables with density f(x) = (2.6)^1.3 x^0.3 e^(-2.6x) / Γ(1.3) for x > 0, for which the mean is μ = 1.3/2.6 = 0.5).
        (a). For each sample (i.e. each row of a rectangular 100 x 40 array), calculate Xbar, and use it to
define a large-sample 95% CI for μ .
        (b). Plot in some form (or print out) the confidence intervals calculated on your 100 samples of size 40,
indicating whether each CI contains the true value 0.5. (Each CI is a function only of Xbar for that sample.)
        (c). How many times did your CI fail to capture the true value μ = 0.5? What is the expected number of
times (out of 100) for this to happen? Should you have been surprised if this happened as few as 1 time,
or if it happened 9 or more times?

#2 Do #27, p.270 for practice and then hand in the following: Give some numerical computations (in R or
other computing platform) showing what the 95% confidence intervals would be (for some specific
examples of values X₁+...+X_n = k) and what their actual coverage probabilities would be
under exact Binomial(n,p) probability distributions with
(n=78, p=.57), (n=47,p=.53), (n=46,p=.16), for the confidence intervals (7.11), (7.10), and the
one given in problem 27. See the Rscript/Coverage.RLog script for the necessary R coding.
ALSO: for each of these (n,p) parameter combinations, give at least one nearby value of n (for same p) for
which the ordering of "best performance" among the three intervals is altered.

#3--#4 Do #22, #26, p.269.

#5 Do #38, p.278.

#6 Do #44, p.280.

#7 Do #52, p.281.

#8 Do #12, p.294.

HW4, due Wednesday, March 16. Read the rest of Chapter 8, and sections 9.1, 9.2 and 9.4
of Ch.9 of Devore. Solve and hand in the following eight problems:

Do #10, pp.293-4, #20, p.304, #32, p.306, #42, p.311, #52 and #54, p.317-8.

Do #2, p.334, three ways: using a large-sample Z-approximation as covered in Sec.9.1; using a pooled
t-test as in Sec. 9.2; and with the Satterthwaite-Welch approximation as on pp.336-337.

Do #28, p.342.

HW5, due Monday, April 11. Read Section 9.5. Read Chapter 14 through Section 14.2. Then read
Section 4.6 plus the handout on Empirical Distribution Functions.

#1: Problem on power and p-value: suppose that you see data values X₁,...,X₃₁ which can be assumed
to be iid normally distributed, with Xbar = 24.0 and S = 8.0. Suppose that these data were collected
to test the hypothesis H₀: μ ≤ 22.7 versus H_A: μ > 22.7.
        (a). Give the p-value for the test in which you treat this as a large-sample test (or equivalently,
             where you take σ₀=8.0 as known).
        (b). Find the power of the size .05 test versus μ₁ = 25, again treating the test as a large-sample test.
        (c). Re-do part (a), this time treating the test as a small-n one-sample t test. This part
             of the problem requires you to use a calculator or PC to calculate the p-value using a
             t-distribution cumulative distribution function routine in place of a table.

#2-3: Ch.9, #62, 64, pp.363-364.

#4: Ch.9, #68. Do a preliminary test for equality of variances before you decide which two-sample
t interval to use for the mean difference.

#5-#6: Ch.14, #6, 9, p.575.

#7: Ch. 9, #72, p.365.

#8: Ch.4, #94, p.179. Use R to create probability plots using qqnorm or qqplot.
R scripts will be provided.
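
Until those scripts are posted, here is a minimal sketch of how qqnorm and qqline can be used for a normal probability plot. It runs on simulated samples, not on the data of the textbook problem; the seed, the sample sizes, and the names z, w, qq and r_z are assumptions made for illustration.

```r
set.seed(1)
z <- rnorm(50)            # simulated normal sample (illustrative)
w <- rexp(50, rate = 1)   # simulated right-skewed sample (illustrative)

# Side-by-side normal probability plots with reference lines.
op <- par(mfrow = c(1, 2))
qqnorm(z, main = "Normal sample")
qqline(z)
qqnorm(w, main = "Exponential sample")
qqline(w)
par(op)

# qqnorm can also return the plotted coordinates without drawing:
# x = theoretical normal quantiles, y = the data values.
qq <- qqnorm(z, plot.it = FALSE)
r_z <- cor(qq$x, qq$y)    # near 1 when the plot is close to a straight line
print(r_z)
```

Points falling close to the reference line are consistent with normality, while systematic curvature, as in the exponential panel, suggests skewness.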

HW6, due Wednesday, April 20. In Devore, finish reading Sec. 14.2, and read Ch. 10,
Sections 1 and 2. Solve and hand in the following six problems:
Ch.14, #8, p.575.
Ch. 4, #92(a), pp.178-179.
Ch.10, #2, 8, pp.378-379; and #12, 16, pp.384-385.

HW7, due Wednesday, May 4. Read Sections 12.1 to 12.3, and solve and hand in
the following problems:

Ch.10, p. 385: #18, 20.
Ch.12: #6, p.453; #12, p.465; #20, p.466; #34, 36, p.476.

GENERAL GUIDELINES ON HOMEWORK.

1. Academic Dishonesty. You may ask questions of each other and of me to get hints on how to solve the
various assigned homework problems. However, you may not share computations and written work: you
must each do that work and write it up individually. Homework papers which have identically copied
segments will be regarded as a violation of the campus honor code.

2. Late Homework and Test Make-Ups. The course policy on late homeworks is that they will be accepted
but graded down, by 10 percent if past due by no more than one class session and by 25 percent if later than
that. These penalties will be waived only for medical excuses or valid University-recognized holidays.
Regarding test make-ups, we will adhere to campus policy.

Sample Problems for Tests and Exams

(1) To practice for Test 1, a series of 10 relevant applied/computational problems drawn (selected and
edited) from the Devore "Testbank" (on the CD-ROM coming with the book's 7th edition) can be found here.
(You may have to zoom in with your browser or MS Word reader to read some of the technical formula
elements in this document.)

(2) A sheet of additional problems relevant to Test 1 can be found here. These are modified from similar
problems that I have given in the past, which would call for a little more theoretical interpretation than
the mechanical `applied' problems coming from the TestBank problem-sheet in (1).

(3) Practice problems for In-Class Test 2 (Monday, May 2) can be found here, along with an answer key.
For a list of topics and problem types, click here.

(4) To see last Fall's Stat 401 Final Exam, from another instructor, click here. In this Exam, the MLE in
Problem 1 is not a topic that we spent much time on, but we did spend some time on it and you should be
able to do it. The other problems are well in the mainstream of what we covered this semester.
Answer key is included here.

(5) Try the sample exam from 1995 which I have adapted to conform more closely to what we studied
this term. An answer key is included here. (Scroll down a bit in the document to find these Sample Exam answers.)

Handouts (some from Stat 400, and some from John Millson):

(1) 10/20/03 There are two handouts here, respectively on Transformation of Random
Variables and on Random Number Generation and Simulation. These topics are very
important for the rest of the course, as they allow us to generate and interpret `artificial data'
to illustrate the meaning of our Probability Limit Theorems (Law of Large Numbers, Central
Limit Theorem) and later statistical results (Consistent Statistical estimators, Confidence
Intervals). In addition, Simulation gives us an `experimental' avenue to calculate via artificial
data probabilities which may be too difficult to figure analytically.

(2) As of 8/23/10 See John Millson's Stat 401 page for handouts on diverse topics related to the course.

(3) 10/22/03 The handout on Normal Approximation to Binomial Distribution contains a
word-problem worked example, as well as some numerical examples of the quality of the
normal approximation to the Binomial. This example is continued in a statistical
setting (confidence interval for estimate of a population proportion in a political opinion poll)
in handout (8) below, dated 11/19/03.
A graph comparing the distribution function values of Binom(100,.3) with its
approximating normal distribution N(30,21) can be found here.

(4) 9/29/03 This handout concerns numerical calculations for the Binomial approximation to
Hypergeometric random variables, and the Poisson approximation to the Binomial. In addition,
some simulated-data results are given to show that the expectations and probability mass functions
behave as they should according to the relative-frequency interpretation of probabilities.

(5) 10/27/03 Example of Simulation for Calculating Probability and Expectation.

(6) 11/3/03 Picture showing the behavior of sample averages Sn/n as a function of n from
1,...,2000 on each of four sets of simulated data, from different types of random variables.
Within each picture, the sample averages Sn/n are based on progressively larger segments of the
same 2000 data-values, and the point is to see that these averages settle down to the place where
the Law of Large Numbers guarantees they should for large enough n, namely the theoretical
expectation of the individual r.v.'s.

(7) 11/12/03 Pictures showing behavior of scaled relative frequency histograms compared with densities.
The document shows plots of histograms in large simulated samples overlaid with the theoretical densities
they are supposed to represent, and of empirical distribution functions overlaid with the theoretical cdf's
the data in large simulated datasets are supposed to represent. The latter are available in two settings:
(i) The overlaid empirical and theoretical cdf's for 1000 simulated values of Z₁+Z₂ (sum of two
independent standard normal deviates) can be found here.
(ii) The overlaid empirical and theoretical cdf's for 1000 simulated values of U₁+...+U₁₀₀
(sum of 100 independent Uniform[0,1] deviates) can be found here.

(8) 11/19/03 The word-problem on political opinion polling begun in handout (3) above,
dated 10/22/03, is continued here from the vantage point of statistics, particularly
confidence intervals for estimates of a population proportion in a political opinion poll.

Important Dates

First Class: Mon., Jan. 24
First Mid-Term Exam: in-class, Fri., March 18 on material through Ch.9 Sec. 4.
(Review session in class Wed., March 16.)
Second Mid-Term Exam: in-class, Monday, May 2.
Review Session for Final Exam, Wednesday May 11, 2pm-3:30pm in regular classroom.
Final Exam: Monday, May 16, 1:30pm-3:30pm, in class-room.