Stat 430, Fall 2001                                        10/15/01

Topics for Stat 430 Test, along with Sample Problems

INSTRUCTIONS: The in-class test on Friday, October 19, will be closed-book, but you should bring (or arrange to share) a calculator, and you may use one 8.5" by 11" sheet of notes, one- or two-sided (formulas, summaries of SAS PROC's, etc.), which you prepare in advance.

Topic I. Familiarity with SAS PROC's.

For each of the following, explain how you would use SAS to produce the statistical or graphical output requested. Give the exact PROC's and OPTIONS you would use, along with any associated DATA steps you would need to prepare them. Give exact SAS statements if you can, but clear explanations with references to specific PROC's, statements, and OPTIONS will also be good enough.

Assume that you are in the midst of a SAS session and have already declared

      libname home "." ;

and that you have a SAS dataset times.ssd04 in your home directory, consisting of 200 records with numerical columns LIFTIM, AMT, GRP, where LIFTIM is a survival time in days, AMT is a blood level of some chemical (rounded to one of 5 discrete levels), and GRP = 0 or 1 is a group indicator for the placebo and treatment groups, respectively.

Here are some sample questions:

a. How would you produce a table showing the number of observations with each possible (two-way) combination of values of the variables AMT and GRP?

b. How would you calculate the partial correlation between LIFTIM and AMT adjusted (linearly) for the variable GRP, i.e. for group membership?

c. How would you calculate and display the ranges (distance from smallest to largest observed value) within each of the 10 cross-classified groups defined by levels of AMT and GRP?

d. How would you create side-by-side histograms of LIFTIM values by GRP?

e. How would you check whether the values of LIFTIM are approximately normally distributed within each of the two treatment groups?

II. Understanding of Statistic Definitions and Relationships.

a. For data defined by the following data step, calculate the value and degrees of freedom of the chi-squared test statistic for association between row (Risk) and column (Disease) categories:

      data expose ;
        input Risk $ Disease $ Count ;
        datalines;
      N Y 200
      N N 300
      Y Y 300
      Y N 200
      ;
      run;

b. Calculate the odds ratio and risk ratio for the dataset given in (a).

Suppose that you are given the following outputs from PROC MEANS and PROC REG applied to the data times.ssd04:

      Variable      N      Mean   Std Dev   Minimum   Maximum
      LIFTIM      200    534.60     495.8        31      1020
      AMT         200     1.659     0.547     0.487     2.743
      GRP         200     0.437     0.496         0       1.0

from PROC REG statement:  MODEL LIFTIM AMT = GRP ;

      The REG Procedure ...

      Dependent Variable: LIFTIM

                         Parameter   Standard
      Variable      DF    Estimate      Error   t Value   Pr > |t|
      Intercept      1      359.87      8.536     42.16     <.0001
      GRP            1      399.84     91.496      4.37     <.0001

      Dependent Variable: AMT

                         Parameter   Standard
      Variable      DF    Estimate      Error   t Value   Pr > |t|
      Intercept      1      1.5140     0.0423     35.78     <.0001
      GRP            1     -0.3308     0.1203     -2.75     0.0065

from statement:  MODEL LIFTIM = AMT GRP ;

      Parameter Estimates

                         Parameter   Standard
      Variable      DF    Estimate      Error   t Value   Pr > |t|
      Intercept      1     -333.99      4.752    -70.49     <.0001
      AMT            1      453.20     184.97      2.45     0.0152
      GRP            1      267.12      68.67      3.89     0.0001

c. What is the value of the t-test statistic for testing whether the average LIFTIM is different in GRP=1 than in GRP=0?

d. What is the correlation between AMT and GRP?

e. What is the partial correlation between LIFTIM and AMT adjusted for GRP?

f. What is the variance of the vector of residuals from the linear regression of LIFTIM on GRP? What is the sum of squared errors from this regression?

III. Statistical Interpretations of SAS Output

Using the data in part (a) of II above:

a. What would you conclude about association of row and column categories, if the data were generated by independently sampling 500 individuals with and 500 without a known risk factor and recording whether they later developed the disease under study?
   Would you expect the Fisher Exact test, the Mantel-Haenszel chi-square, and the continuity-adjusted chi-square test to give substantially the same results? (Why or why not?)

b. For the dataset in part I (also II(b)), suppose that the output from

      PROC REG data=home.times ;
        MODEL LIFTIM = AMT ;
      run;

   includes:

      The REG Procedure ...

      Dependent Variable: LIFTIM

                         Parameter   Standard
      Variable      DF    Estimate      Error   t Value   Pr > |t|
      Intercept      1     -367.63      7.772     -47.3     <.0001
      AMT            1      543.84     120.32      4.52     <.0001

   (i) Interpret the t-statistic and p-value for the AMT coefficient.

   (ii) Give the assumptions which the data ought to satisfy for these conclusions to be valid.

   (iii) Looking at the outputs under the second MODEL statement in part II(b), is there any indication that some part of the assumptions you gave in (ii) is NOT valid? Explain.

   (iv) What SAS analyses could you perform on the dataset times.ssd04 to check some other aspect(s) of the assumptions in (ii)?
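
A hand calculation for the 2x2 table in II(a) and II(b) can be checked numerically. The following is a minimal sketch in Python rather than SAS, chosen only for convenience and not part of the course materials. The function name and argument layout are hypothetical, and the odds ratio and risk ratio are oriented as Risk = Y relative to Risk = N for the outcome Disease = Y, which is one common convention (the reciprocal orientation is equally legitimate).

```python
def two_by_two_summary(a, b, c, d):
    """Pearson chi-squared, odds ratio and risk ratio for a 2x2 table.

    Assumed layout (hypothetical, chosen for this sketch):
                 Disease=Y   Disease=N
        Risk=Y       a           b
        Risk=N       c           d
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))

    # Pearson statistic: sum over cells of (observed - expected)^2 / expected,
    # where expected = (row total) * (column total) / n under independence.
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected

    df = (2 - 1) * (2 - 1)                      # (rows - 1) * (columns - 1)
    odds_ratio = (a * d) / (b * c)              # odds of disease, Risk=Y vs Risk=N
    risk_ratio = (a / (a + b)) / (c / (c + d))  # P(disease | Risk=Y) / P(disease | Risk=N)
    return chi2, df, odds_ratio, risk_ratio


# Counts from the data step in II(a): Risk=Y: 300 Y, 200 N; Risk=N: 200 Y, 300 N.
chi2, df, odds_ratio, risk_ratio = two_by_two_summary(300, 200, 200, 300)
print(chi2, df, odds_ratio, risk_ratio)
```

With these counts every expected cell count is 250, so the statistic works out to 40 on 1 degree of freedom, with an odds ratio of 2.25 and a risk ratio of 1.5 in the orientation assumed above.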