Re: [Rd] Discourage the weights= option of lm with summarized data

2017-10-12 Thread Arie ten Cate
 OK. We now have three suggestions for repairing the text:
 - remove the text
 - add "not" at the beginning of the text
 - add a warning at the end of the text; something like:

  "Note that in this case the standard estimates of the parameters are
in general not correct, and hence also the t values and the p value.
Also the number of degrees of freedom is not correct. (The parameter
values are correct.)"

A remark about the glm example: the Reference manual says: "For a
binomial GLM prior weights are used to give the number of trials when
the response is the proportion of successes". Hence in the
binomial case the weights are frequencies.
With y <- 0.51 and w <- 100 you get the same result.
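For instance, a minimal sketch of the two equivalent specifications
(the variable names are only illustrative):

  y1 <- c(0, 1)
  w1 <- c(49, 51)            # 49 failures and 51 successes
  fit1 <- glm(y1 ~ 1, weights = w1, family = binomial)

  y2 <- 0.51                 # proportion of successes
  w2 <- 100                  # number of trials
  fit2 <- glm(y2 ~ 1, weights = w2, family = binomial)

  coef(fit1)                 # both intercepts equal qlogis(0.51)
  coef(fit2)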

   Arie

On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard wrote:
> AFAIR, it is a little more subtle than that.
>
> If you have replication weights, then the estimates are right; it is "just"
> that the SEs from summary.lm() are wrong. Somehow, the text should reflect
> this.
>
> It is of some importance when you put glm() into the mix, because you can in 
> fact get correct results from things like
>
> y <- c(0,1)
> w <- c(49,51)
> glm(y~1, weights=w, family=binomial)
>
> -pd
>
>> On 9 Oct 2017, at 07:58, Arie ten Cate wrote:
>>
>> Yes.  Thank you; I should have quoted it.
>> I suggest removing this text or adding the word "not" at the beginning.
>>
>>   Arie
>>
>> On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP) wrote:
>>> Ah, I think you are referring to this part from ?lm:
>>>
>>> "(including the case that there are w_i observations equal to y_i and the 
>>> data have been summarized)"
>>>
>>> I see; indeed, I don't think this is what 'weights' should be used for (the 
>>> other part before that is correct). Sorry, I misunderstood the point you 
>>> were trying to make.
>>>
>>> Best,
>>> Wolfgang
>>>
>>> -----Original Message-----
>>> From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Arie ten 
>>> Cate
>>> Sent: Sunday, 08 October, 2017 14:55
>>> To: r-devel@r-project.org
>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>
>>> Indeed: Using 'weights' is not meant to indicate that the same
>>> observation is repeated 'n' times.  As I showed, this gives erroneous
>>> results. Hence I suggested that it be discouraged rather than
>>> encouraged in the Details section of lm in the Reference manual.
>>>
>>>   Arie
>>>
>>> -----Original Message-----
>>> On Sat, 7 Oct 2017, wolfgang.viechtba...@maastrichtuniversity.nl wrote:
>>>
>>> Using 'weights' is not meant to indicate that the same observation is
>>> repeated 'n' times. It is meant to indicate different variances (or to
>>> be precise, that the variance of the last observation in 'x' is
>>> sigma^2 / n, while the first three observations have variance
>>> sigma^2).
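>>>
>>> For instance, a minimal sketch in which the fourth observation is the
>>> mean of n replicate measurements, so that its variance is sigma^2 / n
>>> (the simulated data here are only illustrative):
>>>
>>>  set.seed(1)
>>>  n <- 10
>>>  x <- c(1, 2, 3, 4)
>>>  reps <- rnorm(n, mean = 4)     # n replicate measurements at x = 4
>>>  y <- c(1, 2, 5, mean(reps))    # the fourth value is their mean
>>>  w <- c(1, 1, 1, n)             # its variance is sigma^2 / n
>>>  summary(lm(y ~ x, weights = w))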
>>>
>>> Best,
>>> Wolfgang
>>>
>>> -----Original Message-----
>>> From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Arie ten 
>>> Cate
>>> Sent: Saturday, 07 October, 2017 9:36
>>> To: r-devel@r-project.org
>>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>>>
>>> In the Details section of lm (linear models) in the Reference manual,
>>> it is suggested to use the weights= option for summarized data. This
>>> must be discouraged rather than encouraged. The motivation for this is
>>> as follows.
>>>
>>> With summarized data the standard errors get smaller with increasing
>>> numbers of observations. However, the standard errors in lm do not get
>>> smaller when, for instance, all weights are multiplied by the same
>>> constant larger than one, since the inverse weights are merely
>>> proportional to the error variances.
>>>
>>> Here is an example of the estimated standard errors being too large
>>> with the weights= option. The p value and the number of degrees of
>>> freedom are also wrong. The parameter estimates are correct.
>>>
>>>  n <- 10
>>>  x <- c(1,2,3,4)
>>>  y <- c(1,2,5,4)
>>>  w <- c(1,1,1,n)                       # the last observation occurs n times
>>>  xb <- c(x, rep(x[4], n-1))            # restore the original data
>>>  yb <- c(y, rep(y[4], n-1))
>>>  print(summary(lm(yb ~ xb)))           # correct standard errors
>>>  print(summary(lm(y ~ x, weights=w)))  # same estimates, wrong SEs
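>>>
>>> That the standard errors do not change when all weights are rescaled
>>> can be checked directly with the variables above (the factor 10 is
>>> arbitrary):
>>>
>>>  print(summary(lm(y ~ x, weights=10*w)))  # identical SEs, t and p values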
>>>
>>> Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
>>> FREQ statement (for summarized data).
>>>
>>>Arie
>>>
>>
>
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Office: A 4.23
> Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing

2017-10-12 Thread Tyler
Hi,

I recently ran into an inconsistency in the way model.matrix.default
handles factor encoding for higher-order interactions with categorical
variables when the full hierarchy of effects is not present. Depending on
which lower-order interactions are specified, the factor encoding changes
for a higher-order interaction. Consider the following minimal reproducible
example:

--

> runmatrix = expand.grid(X1=c(1,-1),X2=c(1,-1),X3=c("A","B","C"))
> model.matrix(~(X1+X2+X3)^3,data=runmatrix)
   (Intercept) X1 X2 X3B X3C X1:X2 X1:X3B X1:X3C X2:X3B X2:X3C X1:X2:X3B X1:X2:X3C
1            1  1  1   0   0     1      0      0      0      0         0         0
2            1 -1  1   0   0    -1      0      0      0      0         0         0
3            1  1 -1   0   0    -1      0      0      0      0         0         0
4            1 -1 -1   0   0     1      0      0      0      0         0         0
5            1  1  1   1   0     1      1      0      1      0         1         0
6            1 -1  1   1   0    -1     -1      0      1      0        -1         0
7            1  1 -1   1   0    -1      1      0     -1      0        -1         0
8            1 -1 -1   1   0     1     -1      0     -1      0         1         0
9            1  1  1   0   1     1      0      1      0      1         0         1
10           1 -1  1   0   1    -1      0     -1      0      1         0        -1
11           1  1 -1   0   1    -1      0      1      0     -1         0        -1
12           1 -1 -1   0   1     1      0     -1      0     -1         0         1
attr(,"assign")
 [1] 0 1 2 3 3 4 5 5 6 6 7 7
attr(,"contrasts")
attr(,"contrasts")$X3
[1] "contr.treatment"

--

Specifying the full hierarchy gives us what we expect: the interaction
columns are simply calculated from the product of the main effect columns.
The interaction term X1:X2:X3 gives us two columns in the model matrix,
X1:X2:X3B and X1:X2:X3C, matching the products of the main effects.
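This can be checked directly; a small sketch (mm is just a convenience
name for the model matrix above):

  mm <- model.matrix(~(X1+X2+X3)^3, data=runmatrix)
  all(mm[, "X1:X2:X3B"] == mm[, "X1"] * mm[, "X2"] * mm[, "X3B"])  # TRUE
  all(mm[, "X1:X2:X3C"] == mm[, "X1"] * mm[, "X2"] * mm[, "X3C"])  # TRUE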

If we remove either the X2:X3 interaction or the X1:X3 interaction, we get
what we would expect for the X1:X2:X3 interaction, but when we remove the
X1:X2 interaction the encoding for X1:X2:X3 changes completely:

--

> model.matrix(~(X1+X2+X3)^3-X1:X3,data=runmatrix)
   (Intercept) X1 X2 X3B X3C X1:X2 X2:X3B X2:X3C X1:X2:X3B X1:X2:X3C
1            1  1  1   0   0     1      0      0         0         0
2            1 -1  1   0   0    -1      0      0         0         0
3            1  1 -1   0   0    -1      0      0         0         0
4            1 -1 -1   0   0     1      0      0         0         0
5            1  1  1   1   0     1      1      0         1         0
6            1 -1  1   1   0    -1      1      0        -1         0
7            1  1 -1   1   0    -1     -1      0        -1         0
8            1 -1 -1   1   0     1     -1      0         1         0
9            1  1  1   0   1     1      0      1         0         1
10           1 -1  1   0   1    -1      0      1         0        -1
11           1  1 -1   0   1    -1      0     -1         0        -1
12           1 -1 -1   0   1     1      0     -1         0         1
attr(,"assign")
 [1] 0 1 2 3 3 4 5 5 6 6
attr(,"contrasts")
attr(,"contrasts")$X3
[1] "contr.treatment"



> model.matrix(~(X1+X2+X3)^3-X2:X3,data=runmatrix)
   (Intercept) X1 X2 X3B X3C X1:X2 X1:X3B X1:X3C X1:X2:X3B X1:X2:X3C
1            1  1  1   0   0     1      0      0         0         0
2            1 -1  1   0   0    -1      0      0         0         0
3            1  1 -1   0   0    -1      0      0         0         0
4            1 -1 -1   0   0     1      0      0         0         0
5            1  1  1   1   0     1      1      0         1         0
6            1 -1  1   1   0    -1     -1      0        -1         0
7            1  1 -1   1   0    -1      1      0        -1         0
8            1 -1 -1   1   0     1     -1      0         1         0
9            1  1  1   0   1     1      0      1         0         1
10           1 -1  1   0   1    -1      0     -1         0        -1
11           1  1 -1   0   1    -1      0      1         0        -1
12           1 -1 -1   0   1     1      0     -1         0         1
attr(,"assign")
 [1] 0 1 2 3 3 4 5 5 6 6
attr(,"contrasts")
attr(,"contrasts")$X3
[1] "contr.treatment"


> model.matrix(~(X1+X2+X3)^3-X1:X2,data=runmatrix)
   (Intercept) X1 X2 X3B X3C X1:X3B X1:X3C X2:X3B X2:X3C X1:X2:X3A X1:X2:X3B X1:X2:X3C
1            1  1  1   0   0      0      0      0      0         1         0         0
2            1 -1  1   0   0      0      0      0      0        -1         0         0
3            1  1 -1   0   0      0      0      0      0        -1         0         0
4            1 -1 -1   0   0      0      0      0      0         1         0         0
5            1  1  1   1   0      1      0      1      0         0         1         0
6            1 -1  1   1   0     -1      0      1      0         0        -1         0
7            1  1 -1   1   0      1      0     -1      0         0        -1         0
8            1 -1 -1   1