Dear Thomas & David That makes sense! If I wanted to use survey on the summarized data, I suppose that I could 'de-summarize' or 're-individualize' the data to give the design object the correct information on the number of observations. Or I could revert to using the actual individual-level data.
Thanks a lot, your input has been very helpful. Laust **** Post doc. Laust Mortensen, PhD Epidemiology Unit University of Southern Denmark 2009/10/13 Thomas Lumley <tlum...@u.washington.edu>: > > I think there is a much simpler explanation. > > The survey design object has eight observations, two per country. With a > sample size of two per country it is hardly surprising that country-specific > estimates are not very precise. The actual data has hundreds of thousands of > observations per country, so it will have more precise estimates. > > Grouping the data doesn't make a difference for model-based glm estimation, > where it is simply a computational convenience. It *does* make a difference > for design-based estimation, because it changes the design. > > -thomas > > > On Tue, 13 Oct 2009, Laust wrote: > >> Dear David, >> >> Thanks again for your input! I realize that I did a bad job of >> explaining this in my first email, but the setup is that in Finland >> persons who die are sampled with a different probability (1) from >> those who live (.5). This was done by the Finnish data protection >> authorities to protect individuals against identification. In the rest >> of the countries everyone is sampled with a probability of 1. The data >> that I am supplying to R is summarized data for each country >> stratified by case status. Another way of organizing the data would >> be: >> >> # creating data >> listc <- c("Denmark","Finland","Norway","Sweden") >> listw <- c(1,2,1,1) >> listd <- c(1000,1000,1000,2000) >> listt <- c(755000,505000,905000,1910000) >> list.cwdt <- c(listc, listw, listd, listt) >> country2 <- data.frame(country=listc,weight=listw,deaths=listd,time=listt) >> >> I hope that it is clearer now that for no value of the independent >> variable 'country' is the rate going to be zero. I think this was also >> not the case in my original example, but this was obscured by my poor >> communication- & R-skills. But if data is organized this way then >> sampling weight of 2 for Finland should only be applied to the >> time-variable that contains person years at risk and *not* to the >> number of deaths, which would complicate matters further. I would know >> how to get this to work in R or in any other statistical package. >> Perhaps it is - as Peter Dalgaard suggested - the estimation of the >> dispersion parameter by the survey package that is causing trouble, >> not the data example eo ipso. Or perhaps I am just using survey in a >> wrong way. >> >> Best >> Laust >> >> **** >> Post doc. Laust Mortensen, PhD >> Epidemiology Unit >> University of Southern Denmark >> >> On Mon, Oct 12, 2009 at 3:32 PM, David Winsemius <dwinsem...@comcast.net> >> wrote: >>> >>> I think you are missing the point. You have 4 zero death counts associated >>> with much higher person years of exposure followed by 4 death counts in the >>> thousands associated with lower degrees of exposures. It seems unlikely that >>> these are real data as there are not cohorts that would exhibit such lower >>> death-rates. So it appears that in setting up your test case, you have >>> created an impossibly unrealistic test problem. >>> >>> -- >>> David >>> >>> >>> On Oct 12, 2009, at 9:12 AM, Laust wrote: >>> >>>> Dear Peter, >>>> >>>> Thanks for the input. The zero rates in some strata occurs because >>>> sampling depended on case status: In Finland only 50% of the non-cases >>>> were sampled, while all others were sampled with 100% probability. >>>> >>>> Best >>>> Laust >>>> >>>> On Sat, Oct 10, 2009 at 11:02 AM, Peter Dalgaard >>>> <p.dalga...@biostat.ku.dk> wrote: >>>>> >>>>> Sorry, forgot to "reply all"... >>>>> >>>>> Laust wrote: >>>>>> >>>>>> Dear list, >>>>>> >>>>>> I am trying to set up a propensity-weighted regression using the >>>>>> survey package. Most of my population is sampled with a sampling >>>>>> probability of one (that is, I have the full population). However, for >>>>>> a subset of the data I have only a 50% sample of the full population. >>>>>> In previous work on the data, I analyzed these data using SAS and >>>>>> STATA. In those packages I used a propensity weight of 1/[sampling >>>>>> probability] in various generalized linear regression-procedures, but >>>>>> I am having trouble setting this up. I bet the solution is simple, but >>>>>> I’m a R newbie. Code to illustrate my problem below. >>>>> >>>>> Hi Laust, >>>>> >>>>> You probably need the package author to explain fully, but as far as I >>>>> can see, the crux is that a dispersion parameter is being used, based on >>>>> Pearson residuals, even in the Poisson case (i.e. you effectively get >>>>> the same result as with quasipoisson()). >>>>> >>>>> I don't know what the rationale is for this, but it is clear that with >>>>> your data, an estimated dispersion parameter is going to be large. E.g. >>>>> the data has both 0 cases in 750000 person-years and 1000 cases in 5000 >>>>> person-years for Denmark, and in your model they are supposed to have >>>>> the same Poisson rate. >>>>> >>>>> summary.svyglm starts off with >>>>> >>>>> est.disp <- TRUE >>>>> >>>>> and AFAICS there is no way it can get set to FALSE. Knowing Thomas, >>>>> there is probably a perfectly good reason not to just set the dispersion >>>>> to 1, but I don't get it either... >>>>> >>>>>> >>>>>> Thanks >>>>>> Laust >>>>>> >>>>>> # loading survey >>>>>> library(survey) >>>>>> >>>>>> # creating data >>>>>> listc <- >>>>>> >>>>>> c("Denmark","Finland","Norway","Sweden","Denmark","Finland","Norway","Sweden") >>>>>> listw <- c(1,2,1,1,1,1,1,1) >>>>>> listd <- c(0,0,0,0,1000,1000,1000,2000) >>>>>> listt <- c(750000,500000,900000,1900000,5000,5000,5000,10000) >>>>>> list.cwdt <- c(listc, listw, listd, listt) >>>>>> country <- >>>>>> data.frame(country=listc,weight=listw,deaths=listd,yrs_at_risk=listt) >>>>>> >>>>>> # running a frequency weighted regression to get the correct point >>>>>> estimates for comparison >>>>>> glm <- glm(deaths ~ country + offset(log(yrs_at_risk)), >>>>>> weights=weight, data=country, family=poisson()) >>>>>> summary(glm) >>>>>> regTermTest(glm, ~ country) >>>>>> >>>>>> # running survey weighted regression >>>>>> svy <- svydesign(~0,,data=country, weight=~weight) >>>>>> svyglm <- svyglm(deaths ~ country + offset(log(yrs_at_risk)), >>>>>> design=svy, data=country, family=poisson()) >>>>>> summary(svyglm) >>>>>> # point estimates are correct, but standard error is way too large >>>>>> regTermTest(svyglm, ~ country) >>>>>> # test indicates no country differences >>>>>> >>>>>> ______________________________________________ >>>>>> R-help@r-project.org mailing list >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>>>> PLEASE do read the posting guide >>>>>> http://www.R-project.org/posting-guide.html >>>>>> and provide commented, minimal, self-contained, reproducible code. >>>>> >>>>> >>>>> -- >>>>> O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B >>>>> c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K >>>>> (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 >>>>> ~~~~~~~~~~ - (p.dalga...@biostat.ku.dk) FAX: (+45) 35327907 >>>>> >>>>> >>>> >>>> ______________________________________________ >>>> R-help@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>> >>> David Winsemius, MD >>> Heritage Laboratories >>> West Hartford, CT >>> >>> >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> > > Thomas Lumley Assoc. Professor, Biostatistics > tlum...@u.washington.edu University of Washington, Seattle > > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.