Fall 2025
Instructor: Professor Eric Slud
Statistics Program, Math Dept.
Office hours: Tu 11, Th 2 (initially), or email me to make an appointment (can be on Zoom).
There is no STAT 440 class on Thursday 10/2/2025. Instead there is an asynchronous lecture that will be posted during the evening of 10/1/25 to ELMS continuing the course material. The slide deck for the lecture can be accessed here and under the 3rd Course-Handout bullet.
Course text: Steven K. Thompson (2012), Sampling, 3rd ed., Wiley.
Recommended: Lohr, S. L. (2010, 2019). Sampling: Design and Analysis, Duxbury, 2nd or 3rd ed.
Older References:
Cochran, W. G. (1977). Sampling Techniques, 3rd ed., Wiley.
Sarndal, C.-E., Swensson, B., and Wretman, J. (1992). Model Assisted Survey Sampling. Springer.
Prerequisite: A semester of statistics at the level of
STAT 401 or 420, including basic probability.
The book and course handouts will also
use the free R statistical programming language; the book provides some code, and I will
provide scripts to get you started on implementing formulas in R, but if you are not familiar
with R you should spend some time right away to get used to the syntax, and work through some
tutorial on objects (vectors, arrays, lists, functions, etc.) in R.
Course Description:
Sampling refers to the statistical techniques used in political polls, marketing surveys, federal data gathering and many areas of social science and public health. It is also used (somewhat differently) in the analysis of engineering or computer-generated data.
This course provides an introduction to methods of sampling and analyzing data from finite populations from both a theoretical and applied perspective. It is intended for math and science students interested in applications and for students in the Applied Statistics track of the Survey Methodology program, as well as students in disciplines such as business, life science or social science who need sampling in their research. Because of the audience in this course from Computer Science and Engineering, the course will also discuss applications to computer- or laboratory-generated data.
The Thompson text emphasizes both mathematical theory and real data applications, often from the viewpoint of sampling animal populations or hard-to-reach human populations. The recommended Lohr text aims to present the sampling building blocks relevant to complex hierarchical sampling designs such as those used in large-scale or statistical agency data collections. The course material requires that you understand basic statistical concepts such as point estimation, confidence intervals, and the central limit theorem.
Many of the course concepts and theoretical assertions will be illustrated and numerically verified using R simulations (via multiple samples drawn from real population datasets).
STAT 440 is part of the required material for the MATH/STAT/AMSC MA and PhD Written Examinations in Applied Statistics.
Course Requirements and Grading:
There will be an in-class midterm, 6 or 7 homework sets which will either be graded directly or assessed through a quiz consisting of a problem drawn from the HW set, and either a final in-class exam or a culminating course project. Grades will be based on the midterm (30%), homework grades (25%), quizzes (10%), and the final exam or project (35%).
Course Policies:
(i) As part of the applied homework assignments, students will be expected to do arithmetic calculations on the computer, which will sometimes involve a small amount of programming. Students may choose the language or platform, which may range from spreadsheets to SAS to R or C++ or Python. However, all computational illustrations in the course and all computer help offered in an office-hour setting will be restricted to R. For the systematic Introduction to R and R reference manual distributed with the R software,
(ii) With every homework problem handed in, you must include a page and line reference from the textbook saying what assumptions and formulas you are using. Failure to include this will lose you points. When you solve a problem via numerical calculation, you must supply the computer code, preferably in R, that you used to solve it. This must be programming-language code, including only basic built-in functions of the programming language, not a call to a package.
(iii) All homeworks must be handed in as pdf or doc files on ELMS on or before the due date.
Late homework will be accepted, but without a very good excuse the grade will always be reduced.
(iv) The policy on excused absences and missed exams and assignments will follow the University policy explained in the University's policy document. In case of a missed in-class test or exam, an alternate exam will be arranged between student and instructor as soon as possible.
(v) Course policies on academic integrity and use of technology also follow the University's policy document. In addition, with respect to the homework assignments in this course, the work you hand in must be your own, written without the aid of technology. You may receive hints on how to solve problems from me, each other, or online sources, but the words and formulas you write up and submit as homework must be entirely your own.
Final Project Papers, and Some Possible Topics
The final project is a paper of ~12 pages of text (Font 11 point or smaller, with at most 1.5 line spacing) plus data exhibits or pictures, on some topic related to the course material, drawn from a book chapter, journal article, data analysis, or simulation. The topic must be approved by the instructor. Suggested topics and resources (such as chapters and/or journal articles, if requested) will be listed in this space throughout the semester. I will be happy to add topics based on students' interests expressed to me.
(1). Sampling of inputs in engineering or computer experiments, e.g. "Latin Hypercube Sampling"
(2). Spatial sampling for environmental monitoring
Homework Assignments.
HW Set 1 Instructions. Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson book for where the topic is covered.
HW set 1 consists of 7 problems worth a total of 90 points, due 9/15/25, 11:59pm on ELMS.
#1. (10 points) (a) You randomly sample three elements from a numbered frame list {1,...,6}. What is the probability that unit number 5 is in the sample ? What is the probability that 2 and 4 are both in the sample ? What is the probability that the sample contains {3,4,5} ?
(b) Same questions if your sample is drawn from {1,...,16}, with N=16.
(c) Same questions as (b) if you drew the sample (a "Poisson" sample) with items chosen independently, with probabilities 1/6 for k=1,..,5, and 1/4 for k=6,...,11, and 1/5 for k=12,...,16.
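Hand calculations of this kind can be checked by brute force. The sketch below is in Python rather than the course's R, and its function names and toy case (N = 5, n = 2) are illustrative assumptions, not course material. It enumerates every equally likely simple random sample, counts those containing a given set of units, and compares the result with the closed form $\binom{N-k}{n-k}/\binom{N}{n}$ for a fixed set of $k$ units. Under independent (Poisson) selection, the probability that a given set of units all appear is the product of their selection probabilities.

```python
from fractions import Fraction
from itertools import combinations
from math import comb, prod

def srs_prob_contains(N, n, units):
    """P(every unit in `units` is in an SRS of size n from {1..N}), by enumeration."""
    target = set(units)
    samples = list(combinations(range(1, N + 1), n))   # all equally likely samples
    hits = sum(1 for s in samples if target <= set(s))
    return Fraction(hits, len(samples))

def srs_prob_closed_form(N, n, k):
    """Same probability for any fixed set of k units: C(N-k, n-k) / C(N, n)."""
    return Fraction(comb(N - k, n - k), comb(N, n))

def poisson_prob_contains(pi, units):
    """P(every unit in `units` is selected) under independent (Poisson) selection.
    `pi` maps unit label -> selection probability."""
    return prod(pi[u] for u in units)

# Hypothetical toy case (not the homework numbers): N = 5, n = 2
print(srs_prob_contains(5, 2, [3]))        # 2/5
print(srs_prob_contains(5, 2, [1, 2]))     # 1/10
print(srs_prob_closed_form(5, 2, 2))       # 1/10
```

Exact fractions are used so that the enumerated and closed-form answers can be compared without rounding error.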
#2. (10 points) You have a population numbered i=1,..,6, with attributes yi respectively 3.7, 14.2, 10.3, 8.5, 7.0, 6.6. You draw a simple random sample of size 3. What is the probability that your sample attribute total is ≤ 20 ? What is the probability that your estimated population attribute total (using the Horvitz-Thompson estimator) is greater than 40 ?
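Under simple random sampling every unit has inclusion probability n/N, so the Horvitz-Thompson estimate of the population total is (N/n) times the sample total, and for a small population its whole sampling distribution can be tabulated by enumeration. A minimal Python sketch of that idea follows; the population values and helper names are hypothetical and are not the homework data.

```python
from fractions import Fraction
from itertools import combinations

def ht_total_estimates(y, n):
    """Horvitz-Thompson estimates (N/n) * (sample total) for every SRS of size n."""
    N = len(y)
    return [Fraction(N, n) * sum(y[i] for i in s)
            for s in combinations(range(N), n)]

def prob(event, values):
    """Probability of `event` when all entries of `values` are equally likely."""
    return Fraction(sum(1 for v in values if event(v)), len(values))

y = [2, 5, 9, 4]                      # hypothetical attributes, not the homework data
est = ht_total_estimates(y, 2)
print(sum(est) / len(est), sum(y))    # 20 20  (the estimator is design-unbiased)
print(prob(lambda t: t > 25, est))    # 1/3
```

The same two helpers answer any "probability that the estimate exceeds a threshold" question for a small population by swapping in a different `event`.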
#3. (10 points) You draw two independent samples of size 3 as in #2. What is the probability that the total of sampled element attributes in the second one is greater than the total of sampled element attributes in the first one by at least 3 ?
#4. (20 points) How would your answers in #1(a)-(b), #2, and #3 change if the samples were with replacement ?
#5. (10 points) Suppose we draw 1,000 SRS samples of size n=3 independently from the population of N=6 in problem #2. What is the expectation of the average sample total from these 1,000 samples ? What is the standard deviation ? If you were given the average and standard deviation from the 1,000 samples (and nothing else, in particular not the values yi in either the population or the samples), could you construct a confidence interval with approximate 95% level for the unknown population total ?
#6. (15 points) A newspaper wants to know the likely outcome of an election for mayor to be held in 1 week in a town with a population of 10,000 registered voters, between candidates A and B. The newspaper selects a simple random sample of 300 of the registered voters, and miraculously all 300 respond to the survey question of which candidate they prefer. (The values yi=1 for candidate A and =0 for candidate B are the attributes in this finite population.) If the proportion favoring A is actually 45%, then what is the probability that the sample estimate of that proportion is ≥ 0.5 ? and what is the probability that the proportion (favoring A) estimated from the survey is ≥ 0.06 larger than the true proportion ? Does your answer to this 2nd question vary much if the true proportion favoring A is somewhere between 40% and 50% ? Does this answer vary much if the sample size n=300 is held fixed but the town population is 30,000 instead ? 100,000 ?
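For a 0/1 attribute, the number of sampled units with y = 1 under SRS is hypergeometric, so tail probabilities for a sample proportion can be computed exactly. The design-based standard deviation of the proportion, $\sqrt{\frac{N-n}{N-1}\,\frac{p(1-p)}{n}}$, shows how little the population size matters once N is much larger than n. The Python sketch below illustrates both; the function names and the toy check are assumptions made for illustration.

```python
from math import comb, sqrt

def hypergeom_tail(N, K, n, k_min):
    """P(X >= k_min), where X counts the 'successes' in an SRS of n units
    drawn from N units of which K are successes."""
    lo = max(k_min, 0, n - (N - K))     # smallest feasible count that is >= k_min
    hi = min(n, K)                      # largest feasible count
    favourable = sum(comb(K, k) * comb(N - K, n - k) for k in range(lo, hi + 1))
    return favourable / comb(N, n)

def srs_proportion_sd(N, n, p):
    """Design-based SD of the sample proportion under SRS without replacement."""
    return sqrt((N - n) / (N - 1) * p * (1 - p) / n)

# Toy check: N = 10, K = 4, n = 3 gives P(X >= 2) = (36 + 4) / 120 = 1/3
print(hypergeom_tail(10, 4, 3, 2))

# Effect of the population size with n and p held fixed
for N in (10_000, 30_000, 100_000):
    print(N, srs_proportion_sd(N, 300, 0.45))
```

Python integers are exact, so the large binomial coefficients cause no overflow; only the final division is done in floating point.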
All the probability questions and answers so far are design-based in the sense that the population composition is taken as a fixed nonrandom fact and all randomness comes from the sampling mechanism. Now we have a "theoretical" problem in which both the underlying attributes and the sampling mechanism are treated as random.
#7. (15 points) Suppose that 10 population elements numbered 1:10 are independent and identically distributed with attributes yi distributed as Gamma(2,1/2) (same as chi-squared with 4 degrees of freedom, which has mean 4 and variance 8). In this problem, the population size is N=10, and it is intended to draw a Simple Random Sample with n=5. What is the probability that the true population mean is at least 5 ? What is the probability that the estimated population mean from the sample is at least 5 ? What is the probability that the estimated population mean differs from the true population mean by at least 1 ?
HW Set 2 consists of 6 problems worth a total of 70 points, due 9/29/25, 11:59pm on ELMS. Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson book or other book- (not internet-) source for where the topic is covered.
(1.) (15 points) Suppose you are interested in learning information about a finite population of size $N$ by drawing a sample of $n$ people. Suppose that numerical attributes (say, years of schooling) are denoted $y_i$ and that you are told $N^{-1} \sum_{i=1}^N y_i^2 = m_2$ and $N^{-1} \sum_{i=1}^N y_i = m_1$, where $m_2 > m_1^2$.
(a). In terms of $m_1$, $m_2$, give formulas for the theoretical variances of an unbiased estimate of $m_1$ if the samples are drawn by SRS or by equiprobable selection with replacement.
(b). Suppose $N=800$, $m_1 = 13.3$, $m_2 = 192.9$, and that an investigator who wants to estimate $m_1$ intends to draw a SRS sample from the population. What is the smallest sample size $n$ for which the sample-mean estimate of $m_1$ differs from $m_1$ by at most 2, with probability at least 0.95 ?
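Sample-size questions of this type can be explored numerically by searching for the smallest n whose approximate 95% half-width, $z\sqrt{(1-n/N)\,S^2/n}$, is within the tolerance. The Python sketch below assumes the normal approximation for the sample mean and uses hypothetical inputs. The helper names are illustrative; $S^2$ denotes the population variance with divisor $N-1$, obtained from the two moments as $\frac{N}{N-1}(m_2 - m_1^2)$.

```python
from math import sqrt

def s2_from_moments(N, m1, m2):
    """Population variance with divisor N-1, from the first two population moments."""
    return N / (N - 1) * (m2 - m1 ** 2)

def srs_half_width(N, n, S2, z=1.96):
    """Normal-approximation half-width for the SRS sample mean."""
    return z * sqrt((1 - n / N) * S2 / n)

def srs_sample_size(N, S2, d, z=1.96):
    """Smallest n with half-width <= d (the half-width decreases as n grows)."""
    for n in range(1, N + 1):
        if srs_half_width(N, n, S2, z) <= d:
            return n
    return N

# Hypothetical inputs: N = 1000, S^2 = 100, tolerance d = 1
print(srs_sample_size(1000, 100.0, 1.0))   # 278
```

A direct search is used instead of the closed-form solution so that the inequality being satisfied is visible in the code.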
(2.) Do the following 5 problems from the Thompson text, worth 10 points each: Ch.2 #1, Ch.4 #1, #2, Ch.5 #2, #3
(3.) (15 points) A population $U$ of 20,000 people is known to consist of subpopulations $U_M$ consisting of 10,500 men and $U_F$ consisting of 9,500 women, but a SRS sample is to be drawn from $U$ without regard to sex, and we are interested in estimating the average of an attribute $y$ accurately in each of the two subpopulations. Suppose you know that the population standard deviations are $s_{y,U_M} \approx 2.7$ and $s_{y,U_F} \approx 3.0$. How large must the sample size $n$ of the SRS be in order for each of the estimates of total $y$ among the men and total $y$ among the women to be accurate within ± 800 with probability at least 0.95 ? In order to estimate variances of the attributes times domain indicators, you may assume that the domain means of $y$ on $U_M$ and $U_F$ are approximately equal to each other and to 5.
HW Set 3 consists of 4 problems worth a total of 85 points, due 10/20/25, 11:59pm on ELMS. Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson book or other book source (or my web-page Handouts or slides) for where the topic is covered.
(1.) (15 points) Look at the general variance formula for the Horvitz-Thompson estimator of a total, given as formula (6.5) at page 70 in the Thompson textbook. Substitute the single-inclusion probabilities $\pi_i = n/N$ and joint-inclusion probabilities $\pi_{ij} = \frac{n(n-1)}{N(N-1)}$ for $i \ne j$ that we derived for Simple Random Samples into that formula, and prove that it gives the variance $N^2 \left(1 - \frac{n}{N}\right) \frac{s^2_{y,U}}{n}$ that we have been using for SRS. Also verify that the variance estimator (6.7) simplifies in the SRS case to $N^2 \left(1 - \frac{n}{N}\right) \frac{s^2_{y,S}}{n}$.
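A numerical check can build confidence before attempting the algebra. The Python sketch below evaluates the standard Horvitz-Thompson variance expression, $\sum_i \frac{1-\pi_i}{\pi_i} y_i^2 + \sum_{i \ne j} \frac{\pi_{ij} - \pi_i \pi_j}{\pi_i \pi_j} y_i y_j$ (assumed here to match Thompson's (6.5)), with the SRS values $\pi_i = n/N$ and $\pi_{ij} = n(n-1)/(N(N-1))$, and compares it with $N^2 (1 - n/N) S^2 / n$. The population values are hypothetical.

```python
from fractions import Fraction

def ht_variance(y, pi, pij):
    """General HT variance of the estimated total:
    sum_i (1 - pi_i)/pi_i * y_i^2
      + sum_{i != j} (pi_ij - pi_i pi_j)/(pi_i pi_j) * y_i y_j."""
    N = len(y)
    v = sum((1 - pi[i]) / pi[i] * y[i] ** 2 for i in range(N))
    v += sum((pij[i][j] - pi[i] * pi[j]) / (pi[i] * pi[j]) * y[i] * y[j]
             for i in range(N) for j in range(N) if i != j)
    return v

def srs_total_variance(y, n):
    """N^2 (1 - n/N) S^2 / n, with S^2 the population variance (divisor N-1)."""
    N = len(y)
    ybar = Fraction(sum(y), N)
    S2 = sum((v - ybar) ** 2 for v in y) / (N - 1)
    return N ** 2 * (1 - Fraction(n, N)) * S2 / n

y = [2, 5, 9, 4]                        # hypothetical population
n = 2
N = len(y)
pi = [Fraction(n, N)] * N               # SRS first-order inclusion probabilities
p2 = Fraction(n * (n - 1), N * (N - 1)) # SRS joint inclusion probability
pij = [[p2] * N for _ in range(N)]      # diagonal entries are never used
print(ht_variance(y, pi, pij), srs_total_variance(y, n))   # 104/3 104/3
```

Agreement on one example is not a proof, but a mismatch would point to an algebra or indexing slip.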
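(2.) (15 points) A population with index labels $i = 1, \dots, N$ and attributes $y_i$ is sampled by a probability design which has nonrandom sample size $n$, with first- and second-order inclusion probabilities $\pi_i$ and $\pi_{ij}$ respectively, where $\pi_i > 0$ and $\pi_{ij} > 0$ for all $i \ne j$.
(a) Prove that $\sum_{j \ne i} \pi_{ij} = (n-1)\,\pi_i$ for every $i$.
(b) Using (a), and formula (6.5) in Thompson, prove that the formula $\hat{V} = \frac{1}{2} \sum_{i \in S} \sum_{j \in S,\, j \ne i} \frac{\pi_i \pi_j - \pi_{ij}}{\pi_{ij}} \left( \frac{y_i}{\pi_i} - \frac{y_j}{\pi_j} \right)^2$ is an unbiased estimate of the variance of the Horvitz-Thompson estimator $\hat{t}_{HT} = \sum_{i \in S} y_i / \pi_i$ of the total $t = \sum_{i=1}^N y_i$.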
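Claims of this kind can be sanity-checked on a tiny fixed-size design by exact enumeration. The Python sketch below assumes the identity in (a) is $\sum_{j \ne i} \pi_{ij} = (n-1)\pi_i$ and that the estimator in (b) is the Sen-Yates-Grundy form $\frac{1}{2}\sum_{i \ne j \in S} \frac{\pi_i\pi_j - \pi_{ij}}{\pi_{ij}} \left(\frac{y_i}{\pi_i} - \frac{y_j}{\pi_j}\right)^2$. The design (unequal probabilities on the six samples of size 2 from four units) and the attribute values are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

y = [Fraction(v) for v in (2, 5, 9, 4)]          # hypothetical attributes
N, n = len(y), 2
# Hypothetical fixed-size design: probability of each sample proportional to a weight
weights = {(0, 1): 1, (0, 2): 2, (0, 3): 3, (1, 2): 2, (1, 3): 1, (2, 3): 3}
total_w = sum(weights.values())
design = {s: Fraction(w, total_w) for s, w in weights.items()}

# First- and second-order inclusion probabilities implied by the design
pi = [sum(p for s, p in design.items() if i in s) for i in range(N)]
pij = {(i, j): sum(p for s, p in design.items() if i in s and j in s)
       for i in range(N) for j in range(N) if i != j}

# Identity assumed for part (a)
identity_holds = all(sum(pij[i, j] for j in range(N) if j != i) == (n - 1) * pi[i]
                     for i in range(N))

def ht_total(s):
    """Horvitz-Thompson estimate of the total from sample s."""
    return sum(y[i] / pi[i] for i in s)

def syg_estimate(s):
    """Sen-Yates-Grundy variance estimate; summing over unordered pairs i < j
    equals one half of the sum over ordered pairs."""
    return sum((pi[i] * pi[j] - pij[i, j]) / pij[i, j]
               * (y[i] / pi[i] - y[j] / pi[j]) ** 2
               for i, j in combinations(s, 2))

t = sum(y)
mean_ht = sum(p * ht_total(s) for s, p in design.items())
var_ht = sum(p * (ht_total(s) - t) ** 2 for s, p in design.items())
mean_syg = sum(p * syg_estimate(s) for s, p in design.items())
print(identity_holds, mean_ht == t, mean_syg == var_ht)   # True True True
```

Individual Sen-Yates-Grundy estimates can be negative for some samples; unbiasedness concerns only their design expectation.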
(3.) (4 parts, 10pts each, 40 total)
(a) Draw a SRS sample of size 300 to estimate average county pop2010 overall and within the domains of metro and non-metro counties (3 separate answers). Give your estimates along with your estimates of SE.
(b) Draw a PPSWR sample of size 300 with inclusion probabilities that are 3 times as large for metro=yes counties as for metro=no counties, and give your estimate of average county pop2010 overall and its SE. What are the numbers of metro and non-metro counties you got in your sample ? What were the expected numbers of metro and non-metro counties that you would get based on the known numbers in each domain of counties ?
(c) Draw a Poisson sample of size 300 with inclusion probabilities the same as in (b), and give your estimate of average county pop2010 overall and within the domains of metro and non-metro counties. Also give the SEs for your estimates. In this problem, find the expectations and standard deviations of the numbers of sampled counties you would get in each of the domains and in the sample overall.
(d) Can you reach any tentative conclusions about whether it was helpful to use the metro information in inclusion probabilities or domain structure if you wanted an accurate estimate of average county population overall ? Can you form any tentative conclusions about which of these designs is best for the purpose ?
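When inclusion probabilities take one value in each of two domains, with a fixed ratio between them and a target expected sample size, the probabilities and the moments of the random Poisson sample size follow from $E(n_S) = \sum_i \pi_i$ and $\mathrm{Var}(n_S) = \sum_i \pi_i (1 - \pi_i)$. The Python sketch below illustrates this with hypothetical domain counts (1000 and 2000 units), which are not the actual metro and non-metro county counts; the helper names are also assumptions.

```python
import random
from math import sqrt

def two_level_probs(n_expected, N_hi, N_lo, ratio):
    """Inclusion probabilities (pi_hi, pi_lo) with pi_hi = ratio * pi_lo and
    expected total sample size N_hi * pi_hi + N_lo * pi_lo = n_expected."""
    pi_lo = n_expected / (ratio * N_hi + N_lo)
    return ratio * pi_lo, pi_lo

def poisson_size_moments(pi):
    """Mean and SD of the sample size when unit i is included independently
    with probability pi[i]."""
    return sum(pi), sqrt(sum(p * (1 - p) for p in pi))

def poisson_sample(pi, rng):
    """Indices selected in one Poisson sample."""
    return [i for i, p in enumerate(pi) if rng.random() < p]

# Hypothetical domain sizes, chosen only for illustration
N_hi, N_lo = 1000, 2000
pi_hi, pi_lo = two_level_probs(300, N_hi, N_lo, 3)
pi = [pi_hi] * N_hi + [pi_lo] * N_lo
print(N_hi * pi_hi, N_lo * pi_lo)        # expected counts in each domain
print(poisson_size_moments(pi))          # mean and SD of the overall sample size
print(len(poisson_sample(pi, random.Random(440))))
```

Applying `poisson_size_moments` to the probabilities of one domain alone gives the mean and SD of that domain's sample count.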
(4). (20 points) Draw 1000 PPSWR samples of size 200 from the N = 3097 pop2010 "counties" dataset, with inclusion probabilities proportional to the pop2000 column entries. (i) First, find the theoretical mean and variance for the mean pop2010 value in the dataset. (ii) Second, find for each of the 1000 samples the PPSWR estimate of the mean pop2010 and of its variance, and store these results in a 1000 x 2 array. (iii) Use the array found in part (ii) to find an empirical estimate of the mean county pop in 2010, and a 95% confidence interval for it, as well as an empirical average variance estimate and empirical variance estimate. Your empirical average variance estimate and empirical variance estimate should be fairly close to each other and to the theoretical variance estimate that you found in part (i), and your confidence interval should contain the true average value 73421.19 for pop2010 in this dataset.
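For sampling with replacement with draw probabilities $p_i$, the usual estimator is the Hansen-Hurwitz estimator: each draw gives $z = y_i / p_i$, the estimated total is the mean of the $z$ values, and its estimated variance is their sample variance divided by $n$. The Python sketch below assumes that this is the intended PPSWR estimator; the small population, the size measure, and the function names are hypothetical and are not the counties data.

```python
import random
from statistics import mean, variance

def ppswr_sample(p, n, rng):
    """Draw n indices with replacement; index i has draw probability p[i]."""
    return rng.choices(range(len(p)), weights=p, k=n)

def hansen_hurwitz(y, p, idx):
    """Hansen-Hurwitz estimate of the population total and its estimated variance."""
    z = [y[i] / p[i] for i in idx]       # each draw is an unbiased estimate of the total
    return mean(z), variance(z) / len(z)

rng = random.Random(2025)
y = [120.0, 45.0, 300.0, 80.0, 15.0]     # hypothetical attribute values
x = [100.0, 50.0, 280.0, 90.0, 20.0]     # hypothetical size measure
p = [v / sum(x) for v in x]              # draw probabilities proportional to size
idx = ppswr_sample(p, 200, rng)
t_hat, v_hat = hansen_hurwitz(y, p, idx)
# Convert from the total to the mean: divide by N, and the variance by N^2
print(t_hat / len(y), v_hat / len(y) ** 2)
```

Repeating the last four lines in a loop and storing the two printed quantities row by row gives the kind of replications-by-2 array used to compare empirical and theoretical variances.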
Some homework solutions, including numerical answers, some discussion, and R scripts, will be posted to directories under "Handouts".
Topics: for course outline indexed by chapters in Thompson and Lohr books, click here
- Basic concepts: sample design, bias, sampling and nonsampling errors.
- Anticipated variance, relation between design- and model-based estimators.
- Simple random sampling: statistics, estimates, confidence limits, ratio and regression estimators.
- Poisson, PPS, and Systematic Sampling.
- Stratified sampling, optimal (Neyman) allocation of stratum samples.
- Cluster sampling and complex surveys.
- Nonresponse: ratio adjustments and raking.
- Regression and Calibration under common survey designs.
- Advanced topics: variance estimation, linearization methods, and regression in complex surveys.
Course Handouts including some HW Solutions:
New handout 9/16/25 containing proof of unbiased estimation of $s^2_{y,U}$ by $s^2_{y,S}$
Directory of R scripts and pictures for Fall 2025 class demonstrations and occasional HW solutions can be found here
Lecture Slides for 9/23/25, Lecture Slides for 10/2/25
Handout on PPS Sampling Using R.
Handout on Probability Sampling using R based on Example in Chapter 2 of Sarndal et al.(1992)
R Script and Directory of Pictures for Classroom Demo on CLT for SRS Sampling.
Handout on Ratio Estimation in the Mu281 Dataset of Sarndal et al., which is
just the Mu284 dataset linked below omitting the records for the three largest cities
(numbers 16, 114, 137).
Handout illustrating Ratio and Regression estimation on the dataset counties.dat
from the Lohr text, consisting of summary variables from a SRS of n=100 out of the
N=3141 US Counties.
Handout illustrating Regression versus Stratified-Sample estimation on
a simulated dataset with binary attributes and 5 strata.
Handout illustrating Regression Estimation within a Stratified Design
and comparison with other estimators.
Handout on Stratified-Sample Estimation relating to Example 4.3, Table 4.2
in the Lohr book.
Handout on biased estimation of Variance in Two-Stage Cluster estimation.
Sample Problems for In-Class Test in 2005.
Fall 2005 In-Class Test and Solutions to Fall 2005 In-Class Test.
If you want to see a brief R script for doing the raking example covered in
the book (Lohr, Sec. 8.5.2), click here.
Click here to see Sample Problems for the Stat 440 In-Class Final Exam, along
with brief Solutions (to all problems except 9b and 10). Another pdf
handout of sample problems for the Exam can be found here: its solutions can be
found at the end of the same file of Sample Final problem solutions.
Datasets
The Mu284 Dataset of Sarndal et al., "The MU284 Population" from Appendix B of the book "Model Assisted Survey Sampling" by Sarndal, Swensson and Wretman, available as a built-in dataset in the R package "surveybootstrap".
Other datasets will be added throughout the semester.
Important Dates
- First Class: September 2 (Tues.), 2025
- No Class: Yom Kippur, October 2 (Thurs.), 2025 (Asynchronous Lecture on ELMS instead)
- No Class: Fall Break, October 14 (Tues.), 2025
- Mid-Term Exam: In class, ~Thursday, Nov. 6.
- Thanksgiving: November 27, 2025
- Final Exam or Project Due: Friday December 17, 10:30am
Other Links
Datasets in R: extensive list of built-in datasets in R accessible through the "datasets" R package.
ICPSR Data Archives: extensive data archives indexed by topic.
Many other dataset repositories exist, as well as hard-to-analyze datasets collected for various reasons such as advancing machine learning methodology, e.g. the University of California Irvine Machine Learning Repository.
"What is a Survey?": well-known article produced under the auspices of the ASA by the well-known statistician Fritz Scheuren.
Eric Slud home page.
Main departmental page.
Statistics Program page.
© Eric V Slud, October 6, 2025.