> On 3 Dec 2017, at 16:31 , Arie ten Cate <arietenc...@gmail.com> wrote:
> 
> Peter,
> 
> This is a highly structured text. Just for the discussion, I separate
> the building blocks, where (D) and (E) and (F) are new:
> 
> BEGIN OF TEXT --------------------
> 
> (A)
> 
> Non-‘NULL’ ‘weights’ can be used to indicate that different
> observations have different variances (with the values in ‘weights’
> being inversely proportional to the variances);
> 
> (B)
> 
> or equivalently, when the elements of ‘weights’ are positive integers
> w_i, that each response y_i is the mean of w_i unit-weight
> observations
> 
> (C)
> 
> (including the case that there are w_i observations equal to y_i and
> the data have been summarized).
> 
> (D)
> 
> However, in the latter case, notice that within-group variation is not
> used. Therefore, the sigma estimate and residual degrees of freedom
> may be suboptimal;
> 
> (E)
> 
> in the case of replication weights, even wrong.
> 
> (F)
> 
> Hence, standard errors and analysis of variance tables should be
> treated with care.
> 
> END OF TEXT --------------------
> 
> I don't understand (D), partly because it is unclear to me whether (D)
> refers to (C) or to (B)+(C):
B, including C, is "the latter case".

> If (D) refers only to (C), as the reader might automatically think
> with the repetition of the word "case", then it is unclear to me to
> what block does (E) refer.

Not so. If it did, it should go inside the parentheses.

> If, on the other hand, (D) refers to (B)+(C) then (E) probably
> refers to (C) and then I suggest to make this more clear by replacing
> "in the case of replication weights" in (E) by "in the case of
> summarized data".

That would be wrong. Data can be summarized by means of groups (and
SDs, which are unused, hence the suboptimality), _including_ the case
where all elements are identical.

> I suggest to change "even wrong" in (E) into the more down-to-earth "wrong".

That would seem to be a matter of taste. However, "equivalently" in (B)
does not look right.

> (For the record: I prefer something like my original explanation of
> the problem with (C), instead of (D)+(E)+(F):
> "With summarized data the standard errors get smaller with
> increasing numbers of observations w_i. However, when for instance all
> w_i are multiplied with the same constant larger than one, the
> reported standard errors do not get smaller since the w_i are defined
> apart from an arbitrary positive multiplicative constant. Hence the
> reported standard errors tend to be too large and the reported t
> values and the reported number of significance stars too small.
> Obviously, also the reported number of observations and the reported
> number of degrees of freedom are too small."
> Note that with heteroskedasticity, _the_ residual standard error
> has no meaning.)
> 
> Finally, about the original text: (B) and (C) mention only y_i, not
> x_i, while this is about entire observations. Maybe this can be
> remedied also?
> 
> Arie
> 
> On Tue, Nov 28, 2017 at 1:01 PM, peter dalgaard <pda...@gmail.com> wrote:
>> My local R-devel version now has (in ?lm)
>> 
>> Non-‘NULL’ ‘weights’ can be used to indicate that different
>> observations have different variances (with the values in
>> ‘weights’ being inversely proportional to the variances); or
>> equivalently, when the elements of ‘weights’ are positive integers
>> w_i, that each response y_i is the mean of w_i unit-weight
>> observations (including the case that there are w_i observations
>> equal to y_i and the data have been summarized). However, in the
>> latter case, notice that within-group variation is not used.
>> Therefore, the sigma estimate and residual degrees of freedom may
>> be suboptimal; in the case of replication weights, even wrong.
>> Hence, standard errors and analysis of variance tables should be
>> treated with care.
>> 
>> OK?
>> 
>> -pd
>> 
>>> On 12 Oct 2017, at 13:48 , Arie ten Cate <arietenc...@gmail.com> wrote:
>>> 
>>> OK. We have now three suggestions to repair the text:
>>> - remove the text
>>> - add "not" at the beginning of the text
>>> - add at the end of the text a warning; something like:
>>> 
>>> "Note that in this case the standard estimates of the parameters are
>>> in general not correct, and hence also the t values and the p value.
>>> Also the number of degrees of freedom is not correct. (The parameter
>>> values are correct.)"
>>> 
>>> A remark about the glm example: the Reference manual says: "For a
>>> binomial GLM prior weights are used to give the number of trials when
>>> the response is the proportion of successes ....". Hence in the
>>> binomial case the weights are frequencies.
>>> With y <- 0.51 and w <- 100 you get the same result.
>>> 
>>> Arie
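
[For concreteness, a minimal sketch of the equivalence mentioned just above;
this is an illustrative aside, not part of the original messages:

  y1 <- c(0, 1); w1 <- c(49, 51)   # 0/1 responses with frequency weights
  y2 <- 0.51;    w2 <- 100         # proportion of successes, weight = number of trials
  fit1 <- glm(y1 ~ 1, weights = w1, family = binomial)
  fit2 <- glm(y2 ~ 1, weights = w2, family = binomial)
  coef(summary(fit1))              # intercept qlogis(0.51) with its standard error
  coef(summary(fit2))              # identical coefficient table

Both calls describe 51 successes in 100 trials, so the two fits coincide.]
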
>>> 
>>> On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard <pda...@gmail.com> wrote:
>>>> AFAIR, it is a little more subtle than that.
>>>> 
>>>> If you have replication weights, then the estimates are right, it is
>>>> "just" that the SEs from summary.lm() are wrong. Somehow, the text
>>>> should reflect this.
>>>> 
>>>> It is of some importance when you put glm() into the mix, because you
>>>> can in fact get correct results from things like
>>>> 
>>>> y <- c(0,1)
>>>> w <- c(49,51)
>>>> glm(y~1, weights=w, family=binomial)
>>>> 
>>>> -pd
>>>> 
>>>>> On 9 Oct 2017, at 07:58 , Arie ten Cate <arietenc...@gmail.com> wrote:
>>>>> 
>>>>> Yes. Thank you; I should have quoted it.
>>>>> I suggest to remove this text or to add the word "not" at the beginning.
>>>>> 
>>>>> Arie
>>>>> 
>>>>> On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
>>>>> <wolfgang.viechtba...@maastrichtuniversity.nl> wrote:
>>>>>> Ah, I think you are referring to this part from ?lm:
>>>>>> 
>>>>>> "(including the case that there are w_i observations equal to y_i and
>>>>>> the data have been summarized)"
>>>>>> 
>>>>>> I see; indeed, I don't think this is what 'weights' should be used for
>>>>>> (the other part before that is correct). Sorry, I misunderstood the
>>>>>> point you were trying to make.
>>>>>> 
>>>>>> Best,
>>>>>> Wolfgang
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Arie
>>>>>> ten Cate
>>>>>> Sent: Sunday, 08 October, 2017 14:55
>>>>>> To: r-devel@r-project.org
>>>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>>> 
>>>>>> Indeed: Using 'weights' is not meant to indicate that the same
>>>>>> observation is repeated 'n' times. As I showed, this gives erroneous
>>>>>> results. Hence I suggested that it is discouraged rather than
>>>>>> encouraged in the Details section of lm in the Reference manual.
>>>>>> 
>>>>>> Arie
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> On Sat, 7 Oct 2017, wolfgang.viechtba...@maastrichtuniversity.nl wrote:
>>>>>> 
>>>>>> Using 'weights' is not meant to indicate that the same observation is
>>>>>> repeated 'n' times. It is meant to indicate different variances (or to
>>>>>> be precise, that the variance of the last observation in 'x' is
>>>>>> sigma^2 / n, while the first three observations have variance
>>>>>> sigma^2).
>>>>>> 
>>>>>> Best,
>>>>>> Wolfgang
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Arie
>>>>>> ten Cate
>>>>>> Sent: Saturday, 07 October, 2017 9:36
>>>>>> To: r-devel@r-project.org
>>>>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>>>> 
>>>>>> In the Details section of lm (linear models) in the Reference manual,
>>>>>> it is suggested to use the weights= option for summarized data. This
>>>>>> must be discouraged rather than encouraged. The motivation for this is
>>>>>> as follows.
>>>>>> 
>>>>>> With summarized data the standard errors get smaller with increasing
>>>>>> numbers of observations. However, the standard errors in lm do not get
>>>>>> smaller when for instance all weights are multiplied with the same
>>>>>> constant larger than one, since the inverse weights are merely
>>>>>> proportional to the error variances.
>>>>>> 
>>>>>> Here is an example of the estimated standard errors being too large
>>>>>> with the weights= option. The p value and the number of degrees of
>>>>>> freedom are also wrong. The parameter estimates are correct.
>>>>>> 
>>>>>> n <- 10
>>>>>> x <- c(1,2,3,4)
>>>>>> y <- c(1,2,5,4)
>>>>>> w <- c(1,1,1,n)
>>>>>> xb <- c(x,rep(x[4],n-1)) # restore the original data
>>>>>> yb <- c(y,rep(y[4],n-1))
>>>>>> print(summary(lm(yb ~ xb)))
>>>>>> print(summary(lm(y ~ x, weights=w)))
>>>>>> 
>>>>>> Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
>>>>>> FREQ statement (for summarized data).
>>>>>> 
>>>>>> Arie

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk Priv: pda...@gmail.com
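
[A small illustrative sketch of the scale-invariance point made in the quoted
example above; this is an addition, not part of the original messages.
Multiplying all weights by a constant leaves the coefficient table from lm()
unchanged, which is why integer weights cannot carry information about the
number of observations:

  x <- c(1, 2, 3, 4)
  y <- c(1, 2, 5, 4)
  w <- c(1, 1, 1, 10)
  s1 <- summary(lm(y ~ x, weights = w))
  s2 <- summary(lm(y ~ x, weights = 10 * w))  # same weights, up to a constant factor
  all.equal(coef(s1), coef(s2))               # TRUE: estimates, SEs, t and p unchanged
]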