STAT/SURV 440 Web-page

STAT/SURV 440 SAMPLING THEORY

Fall 2025 TuTh 12:30pm-1:45pm, CHE2136

Instructor: Professor Eric Slud, Statistics Program, Math Dept.,
Rm 2314, x5-5469, slud@umd.edu (the best way to reach me)

Office hours: Tu 11, Th 2 (initially), or email me to make an appointment (can be on Zoom).

Please fill out a survey evaluating me and providing feedback on the STAT 440 Course at https://CourseExp.umd.edu by December 13, 2025.

Click links here for:

Getting Started in R
Course topics
Requirements, Grading & Policies reproduced here on dated Course Syllabus.
Course handouts
Homework Assignments
Final Projects
Other links

Course text: Steven Thompson (2012), Sampling, 3rd ed., Wiley.

Recommended: Lohr, S. L. (2010, 2019). Sampling: Design and Analysis, Duxbury, 2nd or 3rd ed.

For the current Homework Assignment, click here or go to Assignments on ELMS course-pages.
For the dated Course Syllabus, click here.

Older References:

Cochran, W. G. (1977). Sampling Techniques, 3rd ed., Wiley.
Sarndal, C.-E., Swensson, B., and Wretman, J. (1992). Model Assisted Survey Sampling. Springer.

Prerequisite: A semester of statistics at the level of STAT 401 or 420, including basic probability.
The book and course handouts will also use the free R statistical programming language; the book provides some code, and I will provide scripts to get you started on implementing formulas in R, but if you are not familiar with R you should spend some time right away to get used to the syntax, and work through some tutorial on objects (vectors, arrays, lists, functions, etc.) in R.
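As a quick orientation to the objects mentioned above, here is a minimal sketch in base R. The values and the names y, A, info and pop_var are made up for illustration and are not taken from the course scripts.

```r
# Vector: an ordered collection of values of one type (values here are made up)
y <- c(2, 5, 9, 4, 7, 3)
y[2:3]                      # index by position
y[y > 4]                    # index by logical condition

# Array: a vector with a dim attribute, filled column by column
A <- array(1:6, dim = c(2, 3))
A[2, ]                      # second row

# List: named components that may have different types and lengths
info <- list(N = length(y), total = sum(y), label = "toy population")
info$total                  # component by name

# Function: variance with divisor N - 1, written out by hand
pop_var <- function(x) sum((x - mean(x))^2) / (length(x) - 1)
pop_var(y)                  # agrees with the built-in var(y)
```

Typing the name of any of these objects at the R prompt prints it, and str(info) shows the structure of the list.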

Course Description:

Sampling refers to the statistical techniques used in political polls, marketing surveys, federal data gathering and many areas of social science and public health. It is also used (somewhat differently) in the analysis of engineering or computer-generated data.

This course provides an introduction to methods of sampling and analyzing data from finite populations from both a theoretical and applied perspective. It is intended for math and science students interested in applications and for students in the Applied Statistics track of the Survey Methodology program, as well as students in disciplines such as business, life science or social science who need sampling in their research. Because of the audience in this course from Computer Science and Engineering, the course will also discuss applications to computer- or laboratory-generated data.

The Thompson text emphasizes both mathematical theory and real data applications, often from the viewpoint of sampling animal populations or hard-to-reach human populations. The recommended Lohr text aims to present the sampling building blocks relevant to complex hierarchical sampling designs such as those used in large-scale or statistical agency data collections. The course material requires that you understand basic statistical concepts such as point estimation, confidence intervals, and the central limit theorem.

Many of the course concepts and theoretical assertions will be illustrated and numerically verified using R simulations (via multiple samples drawn from real population datasets).
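A rough sketch of what such a numerical verification can look like is given here. It assumes a synthetic Gamma-distributed population as a stand-in for a real dataset, and the seed, sizes and variable names are arbitrary choices for illustration. It checks that the mean of repeated SRS sample means is close to the population mean, and that their variance is close to (1/n - 1/N) s²_{y,U}.

```r
set.seed(440)                              # arbitrary seed, for reproducibility
N <- 500; n <- 50; R <- 5000               # population size, sample size, replications
ypop <- rgamma(N, shape = 2, rate = 0.5)   # synthetic stand-in for a real population

# Theoretical design-based variance of the SRS sample mean
# (var() uses divisor N - 1, which matches s^2_{y,U})
v_theory <- (1/n - 1/N) * var(ypop)

# Repeated SRS draws without replacement, keeping each sample mean
ybar <- replicate(R, mean(sample(ypop, n)))

c(pop_mean = mean(ypop), sim_mean = mean(ybar))
c(v_theory = v_theory, v_sim = var(ybar))
```

With a real population dataset, only the line defining ypop (and N) would change.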

STAT 440 is part of the syllabus for the MATH/STAT/AMSC MA and PhD Written Examinations in Applied Statistics.

Course Requirements and Grading:

There will be an in-class midterm, 6 or 7 homework sets which will either be graded directly or assessed through a quiz consisting of a problem drawn from the HW set, and either a final in-class exam or a culminating course project. Grades will be based on the midterm (30%), homework grades (25%), quizzes (10%) and the final exam or project (35%).

Course Policies:

(i) As part of the applied homework assignments, students will be expected to do arithmetic calculations on the computer, which will sometimes involve a small amount of programming. Students may choose the language or platform, which may range from Spreadsheets to SAS to R or C++ or Python. However, all computational illustrations in the course and all computer help offered in an office-hour setting will be restricted to R.

For the systematic Introduction to R and R reference manual distributed with the R software,
either download from the R website or simply invoke the command

> help.start()

from within R. For introductory tutorials in R, click CUNY or University of Illinois.

I will post Scripts for class demonstrations, simulations and examples in the Handouts section.
Other introductory R scripts are given in the Thompson text.

(ii) With every homework problem handed in, you must include a page and line reference from the textbook saying what assumptions and formulas you are using. Failure to include this will lose you points. When you solve a problem via numerical calculation, you must supply the computer code, preferably in R, that you used to solve it. This must be programming-language code, including only basic built-in functions of the programming language, not a call to a package.
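To illustrate the kind of code intended in (ii), here is a small sketch that uses only built-in R functions to compute an estimated total and its standard error from a simple random sample, using the SRS variance formula N² (1/n - 1/N) s². The sample values, N and all variable names are hypothetical. A one-line call to a survey-analysis package giving the same numbers would not meet the requirement.

```r
# Hypothetical SRS of n = 5 attribute values from a population of N = 40 units
ysamp <- c(12, 7, 9, 15, 10)
N <- 40
n <- length(ysamp)

pi_i  <- rep(n / N, n)                           # SRS inclusion probabilities n/N
t_hat <- sum(ysamp / pi_i)                       # Horvitz-Thompson estimate of the total
s2    <- sum((ysamp - mean(ysamp))^2) / (n - 1)  # sample variance, written out
se    <- sqrt(N^2 * (1/n - 1/N) * s2)            # estimated SE of t_hat under SRS

c(total = t_hat, se = se)
```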

(iii) All homeworks must be handed in as pdf or doc files on ELMS on or before the due date.
Late homework will be accepted, but without a very good excuse the grade will always be reduced.

(iv) The policy on excused absences and missed exams and assignments will follow the University policy explained in the University's policy document. In case of a missed in-class test or exam, an alternate exam will be arranged between student and instructor as soon as possible.

(v) Course policies on academic integrity and use of technology also follow the University's policy document. In addition, with respect to the homework assignments in this course, the work you hand in must be your own, written without the aid of technology. You may receive hints on how to solve problems from me, each other, or online sources, but the words and formulas you write up and submit as homework must be entirely your own.

Final Project Papers, and Some Possible Topics

The final project is a paper of ~12 pages of text (Font 11 point or smaller, with at most 1.5 line spacing) plus data exhibits or pictures, on some topic related to the course material, either from a book chapter or journal article or data analysis or simulation. Drawing connections between your topic and the mathematical course material will be an important criterion on which your paper will be evaluated. The topic must be approved by the instructor. Suggestions will be provided (along with resources like chapters and/or journal articles, if requested). Suggested topics and resources will be listed in this space. I will be happy to add topics based on students' interests expressed to me.

GENERAL CONSIDERATIONS. If you write a paper about a particular subject-matter topic [e.g., environmental monitoring or financial auditing in topics (2) or (5) in the list below], you might base your paper on a book-chapter or article to explain the topic, including the kind of survey data-collection used; but then, ideally, you would illustrate the survey sampling method or analysis (or consider alternative analyses) based on some actual data that the book or journal article directs you to, or that you can find in some data repository. Or if you do a paper about collecting and analyzing some survey sampling design not fully covered in the course, then you might illustrate the method covered by implementing the survey design and analyzing it (multiple times, i.e. with repeated survey samples done in a loop) using an available dataset serving as the whole "population".
DATA SOURCES. There are many online sources for datasets to use as illustrative "populations" to sample from. A few are given in the Other links section of this web-page. For example, you can find MANY survey datasets by searching on "survey datasets" under ICPSR Data Archives. You can also find survey-related datasets in the R packages survey or surveybootstrap.

EXAMPLES OF POSSIBLE TOPICS
(1). Sampling of inputs in engineering or computer experiments, e.g. "Latin Hypercube Sampling".
(2). Spatial sampling for environmental monitoring.
(3). Adaptive or "network" or "respondent-driven" sampling for hard-to-reach populations, as covered in the Thompson book or elsewhere.
(4). Raking or calibration of weights to externally known totals, and application to social science surveys. The Lohr book is a good source for this, and there is an interesting optimization angle about which I can offer additional readings.
(5). Sampling as a topic in the auditing of financial records. I am not an expert in this, but some combination of stratified SRS and PPS seems to be state of the art. Here the interesting aspect is the application, making good use of the sampling methods we covered in the course.
(6). Simulation Study comparing Survey Designs. Another possibility would be to conduct and report the results of a simulation study contrasting the effectiveness (bias and variance, for estimated averages of specific attribute variables y_i) of various survey designs based on an existing real (large) dataset which would serve as the true population U, assuming complete response and different possibilities for previous knowledge of population variables x_i either at individual level or as population totals.
(7). Election Data, analyzing or comparing Political Opinion Polls. This is a very big topic, but you can find lots of data and various methods and reports at the American National Election Studies website. There are (many) statistical and political science journal articles under this heading. Two examples from the journal Public Opinion Quarterly, on the general topic of comparing the validity of probability samples versus online opt-in (nonprobability) samples, are this one and this.
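For a simulation study of the kind described in topic (6), the basic loop might be organized as in the following sketch. It is only an outline under stated assumptions: the population is synthetic with two made-up strata, the designs compared are SRS and stratified SRS with proportional allocation, and the names srs_est and str_est are hypothetical. A project would use a real dataset as the population U and possibly other designs.

```r
set.seed(2025)                          # arbitrary seed, for reproducibility
# Synthetic population U with two strata (hypothetical, not course data)
N_h   <- c(700, 300)
strat <- rep(1:2, times = N_h)
y     <- c(rnorm(N_h[1], mean = 20, sd = 5), rnorm(N_h[2], mean = 60, sd = 15))
N <- length(y); n <- 100
n_h <- round(n * N_h / N)               # proportional allocation: 70 and 30

# Design 1: SRS of size n, estimate the population mean by the sample mean
srs_est <- function() mean(sample(y, n))

# Design 2: stratified SRS, estimate by the N_h/N-weighted stratum sample means
str_est <- function() {
  means <- sapply(1:2, function(h) mean(sample(y[strat == h], n_h[h])))
  sum(N_h / N * means)
}

R <- 2000                               # number of replicated samples per design
est <- cbind(SRS = replicate(R, srs_est()), StratSRS = replicate(R, str_est()))
rbind(bias = colMeans(est) - mean(y), variance = apply(est, 2, var))
```

The last line tabulates the empirical bias and variance of each design's estimator over the R replications, which is the comparison the paper would then discuss.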

HOMEWORK ASSIGNMENTS.

HW Set 1 Instructions. Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson book for where the topic is covered.

HW set 1 consists of 7 problems worth a total of 90 points, due 9/15/25, 11:59pm on ELMS.

#1. (10 points) (a) You randomly sample three elements from a numbered frame list {1,...,6}. What is the probability that unit number 5 is in the sample ? What is the probability that 2 and 4 are both in the sample ? What is the probability that the sample contains {3,4,5} ?
(b) Same questions if your sample is drawn from {1,...,16}, with N=16.
(c) Same questions as (b) if you drew the sample (a "Poisson" sample) with items chosen independently, with probabilities 1/6 for k=1,..,5, and 1/4 for k=6,...,11, and 1/5 for k=12,...,16.

#2. (10 points) You have a population numbered i=1,..,6, with attributes y_i respectively 3.7, 14.2, 10.3, 8.5, 7.0, 6.6. You draw a simple random sample of size 3. What is the probability that your sample attribute total is ≤ 20 ? What is the probability that your estimated population attribute total (using the Horvitz-Thompson estimator) is greater than 40 ?

#3. (10 points) You draw two independent samples of size 3 as in #2. What is the probability that the total of sampled element attributes in the second one is greater than the total of sampled element attributes in the first one by at least 3 ?

#4. (20 points) How would your answers in #1(a)-(b), #2, and #3 change if the samples were with replacement ?

#5. (10 points) Suppose we draw 1,000 SRS samples of size n=3 independently from the population of N=6 in problem #2. What is the expectation of the average sample total from these 1,000 samples ? What is the standard deviation ? If you were given the average and standard deviation from the 1,000 samples (and nothing else, in particular not the values y_i in either the population or the samples), could you construct a confidence interval with approximate 95% level for the unknown population total ?

#6. (15 points) A newspaper wants to know the likely outcome of an election for mayor to be held in 1 week in a town with a population of 10,000 registered voters, between candidates A and B. The newspaper selects a simple random sample of 300 of the registered voters, and miraculously all 300 respond to the survey question of which candidate they prefer. (The values y_i=1 for candidate A and =0 for candidate B are the attributes in this finite population.) If the proportion favoring A is actually 45%, then what is the probability that the sample estimate of that proportion is ≥ 0.5 ? and what is the probability that the proportion (favoring A) estimated from the survey is ≥ 0.06 larger than the true proportion ? Does your answer to this 2nd question vary much if the true proportion favoring A is somewhere between 40% and 50% ? Does this answer vary much if the sample size n=300 is held fixed but the town population is 30,000 instead ? 100,000 ?

All the probability questions and answers so far are design-based in the sense that the population composition is taken as a fixed nonrandom fact and all randomness comes from the sampling mechanism. Now we have a "theoretical" problem in which both the underlying attributes and the sampling mechanism are treated as random.

#7. (15 points) Suppose that 10 population elements numbered 1:10 are independent and identically distributed with attributes y_i distributed as Gamma(2,1/2) (same as chi-squared with 4 degrees of freedom, which has mean 4 and variance 8). In this problem, the population size is N=10, and it is intended to draw a Simple Random Sample with n=5. What is the probability that the true population mean is at least 5 ? What is the probability that the estimated population mean from the sample is at least 5 ? What is the probability that the estimated population mean differs from the true population mean by at least 1 ?

HW Set 2 consists of 6 problems worth a total of 70 points, due 9/29/25, 11:59pm on ELMS. Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson book or other book- (not internet-) source for where the topic is covered.

#1. (15 points) Suppose you are interested in learning information about a finite population of size N by drawing a sample of n people. Suppose that numerical attributes (say, years of schooling) are denoted y_i and that you are told N^{-1} ∑_{i=1}^N y_i² = m₂ and N^{-1} ∑_{i=1}^N y_i = m₁, where m₂ > (m₁)².
(a). In terms of m₁, m₂, give formulas for the theoretical variances of an unbiased estimate of m₁ if the samples are drawn by SRS or by equiprobable selection with replacement.
(b). Suppose N=800, m₁ = 13.3, m₂ = 192.9, and that an investigator who wants to estimate m₁ intends to draw a SRS sample from the population. What is the smallest sample size n for which the sample-mean estimate of m₁ differs from m₁ by at most 2, with probability at least 0.95 ?

(2.) Do the following 5 problems from the Thompson text, worth 10 points each: Ch.2 #1, Ch.4 #1, #2, Ch.5 #2, #3

(3.) (15 points) A population U of 20000 people is known to consist of subpopulations U_M consisting of 10,500 men and U_F consisting of 9,500 women, but a SRS sample is to be drawn from U without regard to sex, and we are interested in estimating the average of an attribute y accurately in each of the two subpopulations. Suppose you know that the population standard deviation s_{y,U_M} ≅ 2.7 and s_{y,U_F} ≅ 3.0. How large must the sample size n of the SRS be in order for each of the estimates of total y among the men and total y among the women to be accurate within ± 800 with probability at least 0.95 ? In order to estimate variances of the attributes times domain indicators, you may assume that the domain means of y on U_M and U_F are approximately equal to each other and to 5.

HW Set 3 consists of 4 problems worth a total of 85 points, due 10/22/25, 11:59pm on ELMS. (Note extended due date.) Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson book or other book source (or my web-page Handouts or slides) for where the topic is covered.

(1.) (15 points) Look at the general variance formula for Horvitz-Thompson Estimator of a total given as formula (6.5) at page 70 in the Thompson textbook. Substitute the single-inclusion probabilities $\pi_i$ and joint-inclusion probabilities $\pi_{ij}$ that we derived for Simple Random Samples in that formula, and prove that it gives the variance $N^2 \, (1/n - 1/N) \, s_{y,U}^2$ that we have been using for SRS. Also verify that the variance estimator (6.6) simplifies in the SRS case to $N^2 (1/n - 1/N) s^2_{y,S}$.

(2.) (15 points) A population with index labels $U = \{1,2,\ldots,N\}$ and attributes $y_1,\ldots, y_N$ is sampled by a probability design which has nonrandom sample size $n$, with first- and second-order inclusion probabilities $\pi_i > 0$ and $\pi_{i,j} > 0$ respectively, where $i,j \in U$ and $\pi_{i,i} = \pi_i$ for all $i$.
(a) Prove that for every $i \in S$, $\sum_{j=1}^N \, \pi_{i,j} \, = \, n \, \pi_i$.
(b) Using (a), and formula (6.5) in Thompson, prove that the formula

$\hat{V}(\hat{t}_y) \; = \; \frac{1}{2} \, \sum_{i=1}^N \, \sum_{j=1}^N \, \frac{\pi_i \pi_j \, - \, \pi_{i,j}}{\pi_{i,j}} \, \Big( \frac{y_i}{\pi_i} \, - \, \frac{y_j}{\pi_j}\Big)^2 \, I_{[i,j \in S]}$

is an unbiased estimate of the variance of the Horvitz-Thompson estimator $\hat{t}_y \, = \, \sum_{i \in S} \, y_i/\pi_i$ of the total $t_y \, = \, \sum_{i=1}^N \, y_i$.

Start the next two problems with the 3097 x 3 data-frame "counties" stored in "Counties.rds" in the Scripts folder of the Stat 440 course web-page. (Note that you input these data into an R workspace using the readRDS() function.) This dataset was extracted from the "county" dataset included in the "usdata" R-package by deleting 3 tiny counties with missing "metro" indicator, 3 other counties with missing pop2000, and also dropping the 39 large counties with more than 1,000,000 2010 population contained in the pop2010 column. Here is the code for this preprocessing:

> library(usdata)
> counties = county[,c(4,11,3)]   ### columns: pop2010, indicator of metro, pop2000
> ind = which(counties$pop2010 <= 1e6 & !is.na(counties$metro) & !is.na(counties$pop2000))
> counties = counties[ind,]   ### now 3097 x 3

> aggregate(counties$pop2010, by=list(counties$metro), function(x) c(mean=mean(x), sd=sd(x)))
  Group.1    x.mean      x.sd
1      no  23462.99  21922.36
2     yes 160992.35 199057.85

> table(counties$metro)
  no  yes
1972 1125

There are two problems to solve with these data, based on code that you are to write from scratch, using only base R (or Python, if you prefer), NOT any other packages. In your code containing solutions to these problems, specify your starting seed, so that your code can be run and results replicated. Your seed should be chosen individually, and should not be the same as anybody else's.
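The mechanics of reading an .rds file and fixing a seed can be sketched as follows. This is only an illustration: the data frame toy is a made-up stand-in with the same column names as "counties", it is written to a temporary file so that the snippet runs anywhere, and the seed 90210 is an arbitrary placeholder that should be replaced by your own.

```r
# Stand-in data frame with the same column names as "counties" (values are made up)
toy <- data.frame(pop2010 = c(12000, 54000, 230000, 8000, 410000, 31000),
                  metro   = c("no", "no", "yes", "no", "yes", "no"),
                  pop2000 = c(11500, 50000, 210000, 8500, 380000, 30000))
f <- tempfile(fileext = ".rds")
saveRDS(toy, f)                 # only needed for this stand-in file
dat <- readRDS(f)               # with the course data: readRDS on your copy of Counties.rds

set.seed(90210)                 # placeholder seed, choose your own
s1 <- sample(nrow(dat), 3)      # SRS of row indices, without replacement
set.seed(90210)
s2 <- sample(nrow(dat), 3)      # same seed, so the same sample is reproduced
identical(s1, s2)
dat[s1, ]                       # the sampled rows
```

Setting the seed once at the top of a script is enough for the whole script to be replicable, provided the code is run in the same order each time.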

(3.) (4 parts, 10pts each, 40 total)
(a) Draw a SRS sample of size 300 to estimate average county pop2010 overall and within the domains of metro and non-metro counties (3 separate answers). Give your estimates along with your estimates of SE.

(b) Draw a PPSWR sample of size 300 with inclusion probabilities that are 3 times as large for metro=yes counties as for metro=no counties, and give your estimate of average county pop2010 overall and its SE. What are the numbers of metro and non-metro counties you got in your sample ? What were the expected numbers of metro and non-metro counties that you would get based on the known numbers in each domain of counties ?

(c) Draw a Poisson sample of size 300 with inclusion probabilities the same as in (b), and give your estimate of average county pop2010 overall and within the domains of metro and non-metro counties. Also give the SEs for your estimates. In this problem, find the expectations and standard deviations of the numbers of sampled counties you would get in each of the domains and in the sample overall.

(d) Can you reach any tentative conclusions about whether it was helpful to use the metro information in inclusion probabilities or domain structure if you wanted an accurate estimate of average county population overall ? Can you form any tentative conclusions about which of these designs is best for the purpose ?

(4.) (20 points) Draw 1000 PPSWR samples of size 200 from the N = 3097 pop2010 "counties" dataset, with inclusion probabilities proportional to the pop2000 column entries. (i) First, find the theoretical mean and variance for the mean pop2010 value in the dataset. (ii) Second, find for each of the 1000 samples the PPSWR estimate for the mean pop2010 and for its variance, and store these results in a 1000 x 2 array. (iii) Use the array found in part (ii) to find an empirical estimate of the mean county pop in 2010, and a 95% confidence interval for it, as well as an empirical average variance estimate and empirical variance estimate. Your empirical average variance estimate and empirical variance estimate should be fairly close to each other and to the theoretical variance that you found in part (i), and your confidence interval should contain the true average value 73421.19 for pop2010 in this dataset.

HW Set 4 consists of 4 problems worth a total of 80 points, due 11/06/25, 11:59pm on ELMS. Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson or other book- (not internet-) source for where the topic is covered. Three of the problems are # 1, 2 in Chapter 11 and #1 in Chapter 8 of the Thompson book. The full assignment, including a longer 4th computational problem, is linked here.

HW Set 5 consists of 6 problems worth a total of 85 points, due 11/25/25, 11:59pm on ELMS. (OK to hand in by 11:59pm 11/26.)
(A) Four of the problems, each worth 10 points, are # 1, 3 and 5 in Chapter 12 and #1 in Chapter 13 of the Thompson book.
(B) (20 points) One extra problem is (a) to use the data given in Problem #2 on the Midterm to provide a post-stratification estimate, with a 95% confidence interval, for the proportion of the whole population favoring candidate Jones, and (b) to compare that estimator and its variance with the one based on a multiple-regression estimator for the average of y_i = I_{[i favors Jones]} with predictor variables x_i = (I_{[i is Democrat]}, I_{[i is Independent]}).
You could either derive and use formulas for the regression estimator in part (b), or else you could create a dataset of 600 units with filled-in columns x_i and y_i according to the Table data and use (unweighted) linear-regression software on that dataset.
(C) (25 points) A second extra problem is computational, about a ratio estimator and its variance based on a Stratified SRS sample. Along with the full assignment, it is linked here.

HW Set 6 consists of 4 problems worth a total of 70 points, due 12/10/25, 11:59pm on ELMS. Your answers to the submitted HW problems should show complete reasoning and (wherever you used it) computer code, along with page and line references in the Thompson or Lohr book- (not internet-) source for where the topic is covered. Here is a pdf of the full assignment.

Some homework solutions including numerical answers, some discussion, and R scripts will be posted to directories under "Handouts".

Topics: for course outline indexed by chapters in Thompson and Lohr books, click here

Basic concepts: sample design, bias, sampling and nonsampling errors.

Anticipated variance, relation between design- and model-based estimators.

Simple random sampling: statistics, estimates, confidence limits, ratio and regression estimators.

Poisson, PPS, and Systematic Sampling.

Stratified sampling, optimal (Neyman) allocation of stratum samples.

Cluster sampling and complex surveys.

Nonresponse: ratio adjustments and raking.

Regression and Calibration under common survey designs.

Advanced topics: variance estimation, linearization methods, and regression in complex surveys.

Course Handouts including some HW Solutions:

New handout 12/2/25 containing derivation and formulas for variances of 2-stage cluster survey estimators for discussion in class and for use on HW6.

New handout 9/16/25 containing proof of unbiased estimation of s²_y,U by s²_y,S

Directory of R scripts and pictures for Fall 2025 class demonstrations and occasional HW solutions can be found here

Lecture Slides for 9/23/25, Lecture Slides for 10/2/25

Handout on Survey Regression for Fall 2025 including formulas and sketch-proof for Variance Formula of Survey Regression Estimator of population total.

Here is a link for Topics and Problem Types for STAT 440 In-Class Test F25. Additional problems from previous semesters can be found at the end of this bulleted list of handouts.

Sample and Practice Problems for STAT 440 In-Class Test F25

R Script and Directory of Pictures for Classroom Demo on CLT for SRS Sampling.

Handout illustrating Ratio and Regression estimation on the dataset counties.dat
from the Lohr text, consisting of summary variables from a SRS of n=100 out of the N=3141 US Counties.

Handout illustrating Regression versus Stratified-Sample estimation on a simulated dataset with binary attributes and 5 strata.

Handout illustrating Regression Estimation within a Stratified Design and comparison with other estimators.

Handout on Stratified-Sample Estimation relating to Example 4.3, Table 4.2 in the Lohr book.

Sample Problems for In-Class Test in 2005.

Fall 2005 In-Class Test and Solutions to Fall 2005 In-Class Test.

Click here to see Sample Problems for the Stat 440 In-Class Final Exam, along with brief Solutions (to all problems except 9b and 10). Another pdf handout of sample problems (for the Exam in 2007) can be found here: its solutions can be found at the end of the same file of Sample Final problem solutions.

Datasets
The Mu 284 Dataset of Sarndal et al, "The MU284 Population" from Appendix B of the book "Model Assisted Survey Sampling" by Sarndal, Swensson and Wretman, available as a built-in dataset in the R package "surveybootstrap".

Various datasets used in the Fall 2025 semester related to US counties are built-in datasets in the R package "usdata".

Important Dates

First Class: September 2 (Tues.), 2025

No Class: Yom Kippur, October 2 (Thurs.), 2025 (Asynchronous Lecture on ELMS instead)

No Class: Fall Break, October 14 (Tues.), 2025

Mid-Term Exam: In class, Tuesday, Nov. 11, 2025.

Thanksgiving: November 27, 2025

Last class: Thursday Dec. 11, 2025

Final Exam or Project Due: Friday December 19, 11:59pm


Other Links

Datasets in R: extensive list of built-in datasets in R accessible through the "datasets" R-package.
ICPSR Data Archives extensive data archives indexed by topic
Many other dataset repositories exist, as well as hard-to-analyze datasets collected for various reasons such as advancing machine learning methodology, e.g. the University of California Irvine Machine Learning Repository
"What is a Survey?" well-known article produced under the auspices of the ASA by the well-known statistician Fritz Scheuren.

Eric Slud home page.

Main departmental page.

Statistics Program page.

© Eric V Slud, December 9, 2025.