Log to Summarize Data-File Generation and Subsetting for Stat 798S Project File
=====================================================

Lymphoma/Leukemia data, pulled from SEER Registry, Leukemia/Lymphoma Cases 1973-2001

Initially: Rec. length 166, total 226835 patient-records

Fields to keep in initial re-coding/analysis:

Reg   = SEER Registry, cols 1-2, regional code (place)
Case  = Case number, cols 3-10, case code to check for duplicates
        (ideally unique for each rec within Reg)
Src   = Type of reporting source, col 13, 1-digit code, e.g. 1=Hosp, 3=Lab
BYr   = year of birth, cols 20-23
Age   = age at diagnosis, cols 24-26
Sex   = sex (1=M, 2=F), col 30
SqNo  = sequence number, 00=1 Primary only, 01=first of 2 or more, ..
        04=4th of 4 or more, ... 99 unspecified, cols 32-33
DgYr  = year of diagnosis (1973-2001), cols 36-39
Grad  = Grade, col. 50
Dconf = diagnostic confirmation (<5 if microscopically confirmed,
        5-7 if by lab test, 8 "clinical only", 9 unknown), col. 51
Surg  = Site-specific surgery code, cols 64-65
Rad   = radiation therapy codes, col 67 (1-6 Yes, 0 & 7 No, 8-9 Unknown)
RdSrg = radiation sequence with surgery, col 69
Dth   = vital status indicator, 1=Alive, 4=Dead, col. 70
Site  = Site recode, cols 76-80
RacB  = race recode B, cols 82-83, 01=White nonHisp, 02=Black,
        11=WhiteHisp plus several others
Stag  = SEER Historic Stage A, 0=Noninvasive, 1=Localized, 2=Regional,
        4=Distant, 9=Unk, col. 86
        [Ignore this for Leukemias/Lymphomas because all these are
         considered unstaged=9 but check!]
Tim   = survival time recode, cols 91-94 (yymm for years & months)
Caus  = Cause of Death Recode, cols. 137-141
Frst  = first malignant primary indicator, 1=Yes, 0=No, col. 152
STCO  = state county code, cols. 161-165
FTYP  = type of followup expected, 1=Autopsy or Dth-cert only,
        2 or 4 active followup, 3 should not apply (in situ
        cervix uteri), col. 166
----------------------------------------------------------------------
## Codes for SITE are given under Cause-of-Death codes for Malignant
## Cancers, as follows:
   33010 = Hodgkin Lymphoma, 33040 = Non-Hodgkin Lymphoma
   34000 = Myeloma
   35011 = ALL (Acute Lymphocytic Leukemia), 35012=Chronic, 35013=Other
   35021 = AML (Acute Myeloid Leukemia), 35022=Chronic, 35023=Other
   35031 = Acute Monocytic Leukemia, 35032=Chronic, 35033=Other
   35041 = Other Acute Leukemia, 35043 = Aleukemic, Subleukemic and NOS
## Most frequent sites are: Hodgkin, non-Hodgkin, Myeloma, and
## Lymphocytic or Myeloid Leukemia
========================================================================

Subsetting and Recoding Steps

(1) Delete cases without active followup, then drop the FTYP column.
(2) Delete DgYr (because = Age + BYr)
(3) Recode Dth=1 if dead (value=4), 0 if alive (<>4)
(4) Recode Tim to numeric time in months, recalling that "9999"
    becomes NA (using ifelse with yrs > 30), but note that there
    are no longer any 9999's
(5) Only NON-factors will be: Case, BYr, SqNo, Age, Dth, Tim, Frst

Remaining data-set

> dim(LYMleu)
[1] 223012     20
> names(LYMleu)
 [1] "Reg"   "Case"  "Src"   "BYr"   "Age"   "Sex"   "SqNo"  "Grad"  "Dconf"
[10] "Surg"  "Rad"   "RdSrg" "Dth"   "Site"  "RacB"  "Stag"  "Tim"   "Caus"
[19] "Frst"  "STCO"

NOTE: there are 1694 duplicated case numbers, which seem to correspond
      to individuals diagnosed and coded for two separate cancers
      (multiple tumors, usually with different Site).
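
Recoding steps (3) and (4) above can be sketched on toy values as follows.
This is only a minimal sketch, not the code actually run for this log: the
vectors dth.raw and tim.raw are hypothetical stand-ins for the raw Dth and
Tim fields, and splitting yymm with %/% and %% is an assumption about how
the recode was carried out.

```r
## Toy stand-ins for the raw fields (hypothetical values, not SEER data)
dth.raw <- c(1, 4, 4, 1)                       # 1=Alive, 4=Dead
tim.raw <- c("0105", "1211", "9999", "0000")   # yymm strings

## Step (3): Dth = 1 if dead (raw value 4), 0 otherwise
Dth <- as.numeric(dth.raw == 4)

## Step (4): split yymm into years and months, convert to months;
## any value with yrs > 30 (such as "9999") is set to NA
tim.num <- as.numeric(tim.raw)
yrs <- tim.num %/% 100
mos <- tim.num %% 100
Tim <- ifelse(yrs > 30, NA, 12 * yrs + mos)

print(Dth)
print(Tim)
```

The yrs > 30 cutoff is safe here only because diagnoses span 1973-2001, so
no genuine survival time can exceed 30 years.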

RESTRICT ATTENTION TO LYMPHOMAS, SITES 33011, 33041 and REGISTRIES 1,2,20

> LYMleu.sub <- LYMleu[(as.numeric(LYMleu$Site)==1 | as.numeric(LYMleu$Site)==3) &
+                      as.numeric(LYMleu$Reg) < 4,]
> dim(LYMleu.sub)
[1] 45507    20

## Almost all rec's from Src=1, so omit that column and other Src values
> LYMleu.sub <- LYMleu.sub[LYMleu.sub$Src==1,]

## Next delete all records with duplicated case-numbers, and remove Case column
> LYMleu.sub <- LYMleu.sub[!duplicated(LYMleu.sub$Case),]

## Next delete all records with SqNo != 0 and remove SqNo and Frst columns
> LYMleu.sub <- LYMleu.sub[LYMleu.sub$SqNo==0,]

## Restrict to RacB = 1 and 2 (WhiteNonHisp and Black) only, delete STCO column
> LYMleu.sub <- LYMleu.sub[as.numeric(LYMleu.sub$RacB) < 3,]

## Also omit column Surg, and restrict to Rad==0 or 1
> LYMleu.sub <- LYMleu.sub[as.numeric(LYMleu.sub$Rad) < 3,]

## Keep only the records with confirmed diagnosis, and delete Dconf column
> LYMleu.sub <- LYMleu.sub[LYMleu.sub$Dconf==1,]

## We recode the Cause-of-Death Info to variable Caus:
##   0 = Alive, 1 = Death (Lymphoma), 2 = Death (Cancer, non-lymphoma),
##   3 = Death (Non-cancer)
> COD <- as.numeric(levels(LYMleu.sub$Caus)[as.numeric(LYMleu.sub$Caus)]) %/% 1000
> COD <- ifelse( COD==33, 1, ifelse(COD > 49, 3, ifelse(COD==0, 0, 2)))
> table(COD)
COD
    0     1     2     3
12026 12627  1830  5206
> LYMleu.sub$Caus <- COD

## Recode RdSrg to be 1 if Both, 0 if not
> LYMleu.sub$RdSrg <- as.numeric(LYMleu.sub$RdSrg != 0)
> table(LYMleu.sub$RdSrg)

    0     1
29485  2204

## Recode Site to 1 if Hodgkin, 0 if non-Hodgkin Lymphoma
> LYMleu.sub$Site <- as.numeric(LYMleu.sub$Site==33011)
> table(LYMleu.sub$Site)

    0     1
24161  7528

## Now we are down to 31689 x 13
> names(LYMleu.sub)
 [1] "Reg"   "BYr"   "Age"   "Sex"   "Grad"  "Rad"   "RdSrg" "Dth"   "Site"
[10] "RacB"  "Stag"  "Tim"   "Caus"
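
The Caus recode above depends on turning a factor into the numeric values
of its labels. Applying as.numeric() directly to a factor returns the
internal level positions instead. A minimal sketch on a toy factor follows;
the object caus and its five codes are hypothetical and chosen only to hit
each branch of the recode.

```r
## Toy factor of 5-digit codes (hypothetical, for illustration only)
caus <- factor(c("00000", "33010", "35011", "50060", "33040"))

## Level positions in sorted-label order, NOT the codes themselves
idx <- as.numeric(caus)
print(idx)

## Look up each label, convert to a number, keep the leading two digits
COD <- as.numeric(levels(caus)[as.numeric(caus)]) %/% 1000

## 0 = Alive, 1 = Death (Lymphoma), 2 = Death (other cancer),
## 3 = Death (Non-cancer), using the same nested ifelse as in the log
COD <- ifelse(COD == 33, 1, ifelse(COD > 49, 3, ifelse(COD == 0, 0, 2)))
print(COD)
```

The same distinction presumably explains why the subsetting steps compare
as.numeric() of Site, Reg, RacB and Rad against small integers: those
comparisons act on level positions, so they depend on the sorted order of
the factor labels.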