Dear Terry,

Thank you very much for taking the time to address this problem!
I did check the data in F&H. I couldn't detect any differences between the R data set and the one in the Appendix. The preface in F&H acknowledges that the data set was obtained from Roland Dickinson.

Was the data set in R created by Tom Fleming based on the original Mayo data? Where do the papers that reference this data set get their data from? Do they get it from the URL that you gave me? It is impossible to tell from the papers, because they just cite the F&H appendix as the source of the data, but obviously they must have gotten an electronic version from somewhere. If so, is that electronic version the same as the R data set?

This is relevant for me because I am trying to compare the results of my estimation algorithm to those in another paper (which, of course, simply cites F&H for the data).

Best regards,
Ravi.

-----------------------------------------------------------------------------------
Ravi Varadhan, Ph.D.
Assistant Professor, The Center on Aging and Health
Division of Geriatric Medicine and Gerontology
Johns Hopkins University
Ph: (410) 502-2619
Fax: (410) 614-9625
Email: [EMAIL PROTECTED]
Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty/Varadhan.html
-----------------------------------------------------------------------------------

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Terry Therneau
Sent: Monday, November 24, 2008 8:40 AM
To: [EMAIL PROTECTED]
Cc: r-help@r-project.org
Subject: Re: [R] Discrepancy in the PBC data set

The data set in R is wrong. I've found mistakes on 2 lines in a quick look. I don't know if the data is incorrect in the Appendix of Fleming and Harrington as well (someone seems to have borrowed my copy), which is where the data set appears to have been taken from, given all the "-9" codes in it. (Note, Tom Fleming originally got the data from me, so I'm fairly confident in calling my Mayo version the authoritative one.) I'll make sure this gets fixed.
You can grab a correct data set from our department web page. Code is below.

    Terry Therneau

library(survival)   # provides Surv() and coxph()

pbcurl <- "http://mayoresearch.mayo.edu/mayo/research/biostat/upload/therneau_upload/pbc.dat"

# The file has no header row; missing values are coded as '.'
pbc <- read.table(pbcurl, header=FALSE,
                  col.names=c('id', 'time', 'status', 'trt', 'age', 'sex',
                              'ascites', 'hepato', 'spiders', 'edema', 'bili',
                              'chol', 'albumin', 'copper', 'alk.phos', 'ast',
                              'trig', 'platelet', 'protime', 'stage'),
                  na.strings='.')
pbc$age <- pbc$age/365.25   # age is stored in days; convert to years

newfit <- coxph(Surv(time, status==2) ~ age + edema + log(bili) +
                    log(protime) + log(albumin), data=pbc)
newfit

                coef exp(coef) se(coef)     z       p
age           0.0396    1.0404  0.00767  5.16 2.4e-07
edema         0.8963    2.4505  0.27141  3.30 9.6e-04
log(bili)     0.8636    2.3716  0.08294 10.41 0.0e+00
log(protime)  2.3868   10.8791  0.76851  3.11 1.9e-03
log(albumin) -2.5069    0.0815  0.65292 -3.84 1.2e-04

Likelihood ratio test=231  on 5 df, p=0  n=416 (2 observations deleted due to missingness)

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
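
The question raised in this thread, whether two electronic copies of the same data set actually agree, can be checked cell by cell. The following is only a minimal sketch under stated assumptions: the function name diff_cells, its missing_code argument, and the toy data frames are all made up for illustration (the values are not PBC data), and both copies are assumed to have the same rows in the same order and the same column names. It treats the "-9" code and NA as the same thing, so that a difference in missing-value coding is not reported as a data discrepancy.

```r
# Hypothetical helper: list every cell where two copies of a data set disagree.
# Assumes identical dimensions, column names, and row order in both copies.
diff_cells <- function(a, b, missing_code = -9) {
  stopifnot(identical(dim(a), dim(b)), identical(names(a), names(b)))
  out <- list()
  for (v in names(a)) {
    x <- a[[v]]
    y <- b[[v]]
    # Treat the numeric missing-value code and NA as equivalent
    x[!is.na(x) & x == missing_code] <- NA
    y[!is.na(y) & y == missing_code] <- NA
    # A cell differs if exactly one side is missing,
    # or both are present and unequal
    differs <- xor(is.na(x), is.na(y)) | (!is.na(x) & !is.na(y) & x != y)
    if (any(differs)) {
      rows <- which(differs)
      # Store values as character so columns of different types can be stacked
      out[[v]] <- data.frame(row = rows, column = v,
                             a = as.character(x[rows]),
                             b = as.character(y[rows]),
                             stringsAsFactors = FALSE)
    }
  }
  if (length(out) == 0) {
    return(data.frame(row = integer(0), column = character(0),
                      a = character(0), b = character(0),
                      stringsAsFactors = FALSE))
  }
  res <- do.call(rbind, unname(out))
  rownames(res) <- NULL
  res
}

# Toy data, made up for illustration: the copies differ in two cells of row 2,
# and row 3 differs only in how the missing value is coded (-9 versus NA).
appendix <- data.frame(id = 1:4, bili = c(1.4, 0.7, -9, 3.2),
                       sex = c("f", "f", "m", "f"), stringsAsFactors = FALSE)
mayo     <- data.frame(id = 1:4, bili = c(1.4, 0.8, NA, 3.2),
                       sex = c("f", "m", "m", "f"), stringsAsFactors = FALSE)

diffs <- diff_cells(appendix, mayo)
print(diffs)
```

With real data, the two arguments would be the two versions being compared, after making sure they share column names and are sorted by id; an empty result would mean the copies agree up to missing-value coding.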