[Rd] Discourage the weights= option of lm with summarized data
In the Details section of lm (linear models) in the Reference manual, it is
suggested to use the weights= option for summarized data. This must be
discouraged rather than encouraged. The motivation is as follows.

With summarized data the standard errors should get smaller with increasing
numbers of observations. However, the standard errors reported by lm do not
get smaller when, for instance, all weights are multiplied by the same
constant larger than one, since the inverse weights are merely proportional
to the error variances.

Here is an example of the estimated standard errors being too large with the
weights= option. The p value and the number of degrees of freedom are also
wrong. The parameter estimates are correct.

n <- 10
x <- c(1,2,3,4)
y <- c(1,2,5,4)
w <- c(1,1,1,n)
xb <- c(x,rep(x[4],n-1))  # restore the original data
yb <- c(y,rep(y[4],n-1))
print(summary(lm(yb ~ xb)))
print(summary(lm(y ~ x, weights=w)))

Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a FREQ
statement (for summarized data).

   Arie
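A minimal sketch of the scale-invariance point above, using the same data
(the factor 5 is arbitrary): multiplying all weights by a constant leaves the
reported standard errors unchanged, so they cannot reflect the number of
underlying observations.

x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, 10)
se1 <- coef(summary(lm(y ~ x, weights = w)))[, "Std. Error"]
se2 <- coef(summary(lm(y ~ x, weights = 5 * w)))[, "Std. Error"]
all.equal(se1, se2)   # TRUE: only the relative sizes of the weights matter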
[Rd] Discourage the weights= option of lm with summarized data
Indeed: Using 'weights' is not meant to indicate that the same observation is
repeated 'n' times. As I showed, this gives erroneous results. Hence I
suggested that it is discouraged rather than encouraged in the Details
section of lm in the Reference manual.

   Arie

-----Original Message-----
On Sat, 7 Oct 2017, wolfgang.viechtba...@maastrichtuniversity.nl wrote:

Using 'weights' is not meant to indicate that the same observation is
repeated 'n' times. It is meant to indicate different variances (or to be
precise, that the variance of the last observation in 'x' is sigma^2 / n,
while the first three observations have variance sigma^2).

Best,
Wolfgang
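A small sketch of the variance interpretation described above: with that
meaning, the weighted fit is simply ordinary least squares on rows scaled by
sqrt(w). The data are those of the original example; only the coefficients
are compared here.

x <- c(1, 2, 3, 4)
y <- c(1, 2, 5, 4)
w <- c(1, 1, 1, 10)
fit_w  <- lm(y ~ x, weights = w)                             # weighted least squares
fit_eq <- lm(I(sqrt(w) * y) ~ 0 + sqrt(w) + I(sqrt(w) * x))  # scaled OLS, no intercept
rbind(wls = coef(fit_w), scaled_ols = coef(fit_eq))          # same two coefficients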
Re: [Rd] Discourage the weights= option of lm with summarized data
Yes. Thank you; I should have quoted it.
I suggest to remove this text or to add the word "not" at the beginning.

   Arie

On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP) wrote:
> Ah, I think you are referring to this part from ?lm:
>
> "(including the case that there are w_i observations equal to y_i and the
> data have been summarized)"
>
> I see; indeed, I don't think this is what 'weights' should be used for (the
> other part before that is correct). Sorry, I misunderstood the point you
> were trying to make.
>
> Best,
> Wolfgang
Re: [Rd] Discourage the weights= option of lm with summarized data
OK. We have now three suggestions to repair the text:
- remove the text
- add "not" at the beginning of the text
- add at the end of the text a warning; something like:

"Note that in this case the standard errors of the parameter estimates are
in general not correct, and hence also the t values and the p value. Also
the number of degrees of freedom is not correct. (The parameter estimates
are correct.)"

A remark about the glm example: the Reference manual says: "For a binomial
GLM prior weights are used to give the number of trials when the response is
the proportion of successes ...". Hence in the binomial case the weights are
frequencies. With y <- 0.51 and w <- 100 you get the same result.

   Arie

On Mon, Oct 9, 2017 at 5:22 PM, peter dalgaard wrote:
> AFAIR, it is a little more subtle than that.
>
> If you have replication weights, then the estimates are right, it is "just"
> that the SE from summary.lm() are wrong. Somehow, the text should reflect
> this.
>
> It is of some importance when you put glm() into the mix, because you can
> in fact get correct results from things like
>
> y <- c(0,1)
> w <- c(49,51)
> glm(y~1, weights=w, family=binomial)
>
> -pd
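A sketch of the binomial point above: the three equivalent specifications
(0/1 responses with frequency weights, a proportion with the number of trials
as weight, and the fully expanded data) should give the same coefficient
table.

y1 <- c(0, 1); w1 <- c(49, 51)
f1 <- glm(y1 ~ 1, weights = w1, family = binomial)   # successes/failures with frequencies
y2 <- 0.51; w2 <- 100
f2 <- glm(y2 ~ 1, weights = w2, family = binomial)   # proportion of successes in 100 trials
yb <- rep(c(0, 1), times = c(49, 51))                # 100 individual Bernoulli outcomes
f3 <- glm(yb ~ 1, family = binomial)
rbind(coef(summary(f1)), coef(summary(f2)), coef(summary(f3)))   # three identical rows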
Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing
I think it is not a bug. It is a general property of interactions. This
property is best observed if all variables are factors (qualitative).

For example, you have three variables (factors). You ask for as many
interactions as possible, except an interaction term between two particular
variables. When this interaction is not a constant, it is different for
different values of the remaining variable. More precisely: for all values of
that variable. In other words: you have a three-way interaction, with all
values of that variable.

An even smaller example is the following script with only two variables, each
being a factor:

df <- expand.grid(X1=c("p","q"), X2=c("A","B","C"))
print(model.matrix(~(X1+X2)^2,data=df))
print(model.matrix(~(X1+X2)^2 -X1,data=df))
print(model.matrix(~(X1+X2)^2 -X2,data=df))

The result is:

  (Intercept) X1q X2B X2C X1q:X2B X1q:X2C
1           1   0   0   0       0       0
2           1   1   0   0       0       0
3           1   0   1   0       0       0
4           1   1   1   0       1       0
5           1   0   0   1       0       0
6           1   1   0   1       0       1

  (Intercept) X2B X2C X1q:X2A X1q:X2B X1q:X2C
1           1   0   0       0       0       0
2           1   0   0       1       0       0
3           1   1   0       0       0       0
4           1   1   0       0       1       0
5           1   0   1       0       0       0
6           1   0   1       0       0       1

  (Intercept) X1q X1p:X2B X1q:X2B X1p:X2C X1q:X2C
1           1   0       0       0       0       0
2           1   1       0       0       0       0
3           1   0       1       0       0       0
4           1   1       0       1       0       0
5           1   0       0       0       1       0
6           1   1       0       0       0       1

Thus, in the second result, we have no main effect of X1. Instead, the effect
of X1 depends on the value of X2: either A or B or C. In fact, this is a
two-way interaction, including all three values of X2. In the third result,
we have no main effect of X2. Instead, the effect of X2 depends on the value
of X1: either p or q.

A complicating element with your example seems to be that your X1 and X2 are
not factors.

   Arie

On Thu, Oct 12, 2017 at 7:12 PM, Tyler wrote:
> Hi,
>
> I recently ran into an inconsistency in the way model.matrix.default
> handles factor encoding for higher level interactions with categorical
> variables when the full hierarchy of effects is not present. Depending on
> which lower level interactions are specified, the factor encoding changes
> for a higher level interaction. Consider the following minimal reproducible
> example:
>
> --
>
>> runmatrix = expand.grid(X1=c(1,-1),X2=c(1,-1),X3=c("A","B","C"))
>> model.matrix(~(X1+X2+X3)^3,data=runmatrix)
>    (Intercept) X1 X2 X3B X3C X1:X2 X1:X3B X1:X3C X2:X3B X2:X3C X1:X2:X3B X1:X2:X3C
> 1            1  1  1   0   0     1      0      0      0      0         0         0
> 2            1 -1  1   0   0    -1      0      0      0      0         0         0
> 3            1  1 -1   0   0    -1      0      0      0      0         0         0
> 4            1 -1 -1   0   0     1      0      0      0      0         0         0
> 5            1  1  1   1   0     1      1      0      1      0         1         0
> 6            1 -1  1   1   0    -1     -1      0      1      0        -1         0
> 7            1  1 -1   1   0    -1      1      0     -1      0        -1         0
> 8            1 -1 -1   1   0     1     -1      0     -1      0         1         0
> 9            1  1  1   0   1     1      0      1      0      1         0         1
> 10           1 -1  1   0   1    -1      0     -1      0      1         0        -1
> 11           1  1 -1   0   1    -1      0      1      0     -1         0        -1
> 12           1 -1 -1   0   1     1      0     -1      0     -1         0         1
> attr(,"assign")
>  [1] 0 1 2 3 3 4 5 5 6 6 7 7
> attr(,"contrasts")
> attr(,"contrasts")$X3
> [1] "contr.treatment"
>
> --
>
> Specifying the full hierarchy gives us what we expect: the interaction
> columns are simply calculated from the product of the main effect columns.
> The interaction term X1:X2:X3 gives us two columns in the model matrix,
> X1:X2:X3B and X1:X2:X3C, matching the products of the main effects.
>
> If we remove either the X2:X3 interaction or the X1:X3 interaction, we get
> what we would expect for the X1:X2:X3 interaction, but when we remove the
> X1:X2 interaction the encoding for X1:X2:X3 changes completely:
>
> --
>
>> model.matrix(~(X1+X2+X3)^3-X1:X3,data=runmatrix)
>   (Intercept) X1 X2 X3B X3C X1:X2 X2:X3B X2:X3C X1:X2:X3B X1:X2:X3C
> 1           1  1  1   0   0     1      0      0         0         0
> 2           1 -1  1   0   0    -1      0      0         0         0
> 3           1  1 -1   0   0    -1      0      0         0         0
> 4           1 -1 -1   0   0     1      0      0         0         0
> 5           1  1  1   1   0     1      1      0         1         0
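A sketch supporting the reading above that no degrees of freedom are lost:
after dropping X1, the dummy-coded matrix spans exactly the same column space
as the full-model matrix, i.e. it is a reparameterisation rather than a
smaller model.

df <- expand.grid(X1 = c("p", "q"), X2 = c("A", "B", "C"))
m_full <- model.matrix(~ (X1 + X2)^2,      data = df)
m_drop <- model.matrix(~ (X1 + X2)^2 - X1, data = df)
c(full = qr(m_full)$rank,
  drop = qr(m_drop)$rank,
  both = qr(cbind(m_full, m_drop))$rank)   # all three equal 6: same column space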
Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing
>> mm = mm[,-2]
>> solve(t(mm) %*% mm)
>             (Intercept)  X2B  X2C X1:X2B X1:X2C
> (Intercept)           1 -1.0 -1.0    0.0    0.0
> X2B                  -1  1.5  1.0    0.0    0.0
> X2C                  -1  1.0  1.5    0.0    0.0
> X1:X2B                0  0.0  0.0    0.5    0.0
> X1:X2C                0  0.0  0.0    0.0    0.5
>
> Here, we see the actual behavior for model.matrix. The undesired re-coding
> of the model matrix interaction term makes the information matrix singular.
>
>> mm = model.matrix(~(X1+X2)^2-X1,data=design)
>> mm
>   (Intercept) X2B X2C X1:X2A X1:X2B X1:X2C
> 2           1   0   0     -1      0      0
> 3           1   1   0      0      1      0
> 4           1   1   0      0     -1      0
> 5           1   0   1      0      0      1
> 6           1   0   1      0      0     -1
>
>> solve(t(mm) %*% mm)
> Error in solve.default(t(mm) %*% mm) : system is computationally singular:
> reciprocal condition number = 5.55112e-18
>
> I still believe this is a bug.
>
> Best regards,
> Tyler Morgan-Wall
Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing
> [...] contrasts as long as X1:X2, X1:X3, and X2:X3 exist in the formula.
> When any of the preceding terms do not exist, this heuristic tells us to
> use dummy variables to encode the interaction (e.g. "F_j [the interaction
> term] is coded ... by dummy variables if it [any of the marginal terms
> obtained by dropping a single factor in the interaction] has not [appeared
> in the formula]"). However, the example I gave demonstrated that this dummy
> variable encoding only occurs for the model where the missing term is the
> numeric-numeric interaction, "~(X1+X2+X3)^3-X1:X2". Otherwise, the
> interaction term X1:X2:X3 is encoded by contrasts, not dummy variables.
> This is inconsistent with the description of the intended behavior given in
> the book.
>
> Best regards,
> Tyler
>
> On Fri, Oct 27, 2017 at 2:18 PM, Arie ten Cate wrote:
>> Hello Tyler,
>>
>> I want to bring to your attention the following document: "What
>> happens if you omit the main effect in a regression model with an
>> interaction?"
>> (https://stats.idre.ucla.edu/stata/faq/what-happens-if-you-omit-the-main-effect-in-a-regression-model-with-an-interaction).
>> This gives a useful review of the problem. Your example is Case 2: a
>> continuous and a categorical regressor.
>>
>> The numerical examples are coded in Stata, and they give the same
>> result as in R. Hence, if this is a bug in R then it is also a bug in
>> Stata. That seems very unlikely.
>>
>> Here is a simulation in R of the above mentioned Case 2 in Stata:
>>
>> df <- expand.grid(socst=c(-1:1),grp=c("1","2","3","4"))
>> print("Full model")
>> print(model.matrix(~(socst+grp)^2 ,data=df))
>> print("Example 2.1: drop socst")
>> print(model.matrix(~(socst+grp)^2 -socst ,data=df))
>> print("Example 2.2: drop grp")
>> print(model.matrix(~(socst+grp)^2 -grp ,data=df))
>>
>> This gives indeed the following regressors:
>>
>> "Full model"
>>   (Intercept) socst grp2 grp3 grp4 socst:grp2 socst:grp3 socst:grp4
>> "Example 2.1: drop socst"
>>   (Intercept) grp2 grp3 grp4 socst:grp1 socst:grp2 socst:grp3 socst:grp4
>> "Example 2.2: drop grp"
>>   (Intercept) socst socst:grp2 socst:grp3 socst:grp4
>>
>> There is a little bit of R documentation about this, based on the
>> concept of marginality, which typically forbids a model having an
>> interaction but not the corresponding main effects. (You might see the
>> references in https://en.wikipedia.org/wiki/Principle_of_marginality )
>>
>> See "An Introduction to R", by Venables and Smith and the R Core Team.
>> At the bottom of page 52 (PDF: 57) it says: "Although the details are
>> complicated, model formulae in R will normally generate the models that
>> an expert statistician would expect, provided that marginality is
>> preserved. Fitting, for [a contrary] example, a model with an
>> interaction but not the corresponding main effects will in general lead
>> to surprising results ...".
>>
>> The Reference Manual states that the R functions dropterm() and
>> addterm() resp. drop or add only terms such that marginality is
>> preserved.
>>
>> Finally, about your singular matrix t(mm)%*%mm. This is in fact
>> Example 2.1 in Case 2 discussed above. As discussed there, in Stata
>> and in R the drop of the continuous variable has no effect on the
>> degrees of freedom here: it is just a reparameterisation of the full
>> model, protecting you against losing marginality... Hence the
>> model.matrix 'mm' is still square and nonsingular after the drop of
>> X1, unless of course a row is removed from the matrix 'design' before
>> creating 'mm'.
>>
>>    Arie
>>
>> On Sun, Oct 15, 2017 at 7:05 PM, Tyler wrote:
>> > You could possibly try to explain away the behavior for a missing main
>> > effects term, since without the main effects term we don't have main
>> > effect columns in the model matrix used to compute the interaction
>> > columns (At best this is undocumented behavior--I still think it's a
>> > bug, as we know how we would encode the categorical factors if they
>> > were in fact present. It's either specified in contrasts.arg or using
>> > the default set in options). However, when all the main effects are
>> > present, why would the three-factor interaction column not simply be
>> > the product of the main effect colum[...]
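A small sketch of the marginality point quoted above, using drop1() (the
stats analogue of the dropterm() mentioned there): only terms whose removal
preserves marginality are offered for dropping, so main effects underneath an
interaction are not listed.

df <- expand.grid(socst = -1:1, grp = c("1", "2", "3", "4"))
set.seed(1)
df$y <- rnorm(nrow(df))                # arbitrary response, for illustration only
drop1(lm(y ~ socst * grp, data = df))  # offers only socst:grp, not the main effects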
Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing
Hello Tyler,

I rephrase my previous mail, as follows:

In your example, T_i = X1:X2:X3. Let F_j = X3. (The numerical variables X1
and X2 are not encoded at all.) Then T_{i(j)} = X1:X2, which in the example
is dropped from the model. Hence the X3 in T_i must be encoded by dummy
variables, as indeed it is.

   Arie

On Thu, Nov 2, 2017 at 4:11 PM, Tyler wrote:
> Hi Arie,
>
> The book out of which this behavior is based does not use factor (in this
> section) to refer to categorical factor. I will again point to this
> sentence, from page 40, in the same section and referring to the behavior
> under question, that shows F_j is not limited to categorical factors:
> "Numeric variables appear in the computations as themselves, uncoded.
> Therefore, the rule does not do anything special for them, and it remains
> valid, in a trivial sense, whenever any of the F_j is numeric rather than
> categorical."
>
> Note the "... whenever any of the F_j is numeric rather than categorical."
> Factor here is used in the more general sense of the word, not referring to
> the R type "factor." The behavior of R does not match the heuristic that
> it's citing.
>
> Best regards,
> Tyler
>
> On Thu, Nov 2, 2017 at 2:51 AM, Arie ten Cate wrote:
>> Hello Tyler,
>>
>> Thank you for searching for, and finding, the basic description of the
>> behavior of R in this matter.
>>
>> I think your example is in agreement with the book.
>>
>> But let me first note the following. You write: "F_j refers to a
>> factor (variable) in a model and not a categorical factor". However:
>> "a factor is a vector object used to specify a discrete
>> classification" (start of chapter 4 of "An Introduction to R".) You
>> might also see the description of the R function factor().
>>
>> You note that the book says about a factor F_j:
>> "... F_j is coded by contrasts if T_{i(j)} has appeared in the
>> formula and by dummy variables if it has not"
>>
>> You find:
>> "However, the example I gave demonstrated that this dummy variable
>> encoding only occurs for the model where the missing term is the
>> numeric-numeric interaction, ~(X1+X2+X3)^3-X1:X2."
>>
>> We have here T_i = X1:X2:X3. Also: F_j = X3 (the only factor) [...]
Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing
Hello Tyler,

You write that you understand what I am saying. However, I am now at a loss
about what exactly is the problem with the behavior of R. Here is a script
which reproduces your experiments with three variables (excluding the full
model):

m=expand.grid(X1=c(1,-1),X2=c(1,-1),X3=c("A","B","C"))
model.matrix(~(X1+X2+X3)^3-X1:X3,data=m)
model.matrix(~(X1+X2+X3)^3-X2:X3,data=m)
model.matrix(~(X1+X2+X3)^3-X1:X2,data=m)

Below are the three results, similar to your first mail. (The first two are
basically the same, of course.) Please pick one result which you think is not
consistent with the heuristic and please give what you think is the correct
result:

model.matrix(~(X1+X2+X3)^3-X1:X3)
  (Intercept) X1 X2 X3B X3C X1:X2 X2:X3B X2:X3C X1:X2:X3B X1:X2:X3C

model.matrix(~(X1+X2+X3)^3-X2:X3)
  (Intercept) X1 X2 X3B X3C X1:X2 X1:X3B X1:X3C X1:X2:X3B X1:X2:X3C

model.matrix(~(X1+X2+X3)^3-X1:X2)
  (Intercept) X1 X2 X3B X3C X1:X3B X1:X3C X2:X3B X2:X3C X1:X2:X3A X1:X2:X3B X1:X2:X3C

(I take it that the combination of X3A and X3B and X3C implies dummy
encoding, and the combination of only X3B and X3C implies contrasts encoding,
with respect to X3A.)

Thanks in advance,

   Arie

On Sat, Nov 4, 2017 at 5:33 PM, Tyler wrote:
> Hi Arie,
>
> I understand what you're saying. The following excerpt out of the book
> shows that F_j does not refer exclusively to categorical factors: "...the
> rule does not do anything special for them, and it remains valid, in a
> trivial sense, whenever any of the F_j is numeric rather than categorical."
> Since F_j refers to both categorical and numeric variables, the behavior of
> model.matrix is not consistent with the heuristic.
>
> Best regards,
> Tyler
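One way to see directly which coding terms() has chosen is the "factors"
attribute of the terms object: an entry of 1 means the variable is coded by
contrasts in that term, 2 means it is coded by dummy variables (for factor
variables; numeric variables enter uncoded either way). A sketch, using the
formulas above:

m <- expand.grid(X1 = c(1, -1), X2 = c(1, -1), X3 = c("A", "B", "C"))
attr(terms(~ (X1 + X2 + X3)^3 - X1:X3, data = m), "factors")  # X3 coded 1 in X1:X2:X3
attr(terms(~ (X1 + X2 + X3)^3 - X1:X2, data = m), "factors")  # X3 coded 2 in X1:X2:X3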
Re: [Rd] Bug in model.matrix.default for higher-order interaction encoding when specific model terms are missing
Hello Tyler,

model.matrix(~(X1+X2+X3)^3-X1:X3)

T_i = X1:X2:X3. Let F_j = X3. (The numerical variables X1 and X2 are not
encoded at all.) Then, again, T_{i(j)} = X1:X2, which in this example is NOT
dropped from the model. Hence the X3 in T_i must be encoded by contrasts, as
indeed it is.

   Arie

On Mon, Nov 6, 2017 at 5:09 PM, Tyler wrote:
> Hi Arie,
>
> Given the heuristic, in all of my examples with a missing two-factor
> interaction the three-factor interaction should be coded with dummy
> variables. In reality, it is encoded by dummy variables only when the
> numeric:numeric interaction is missing, and by contrasts for the other two.
> The heuristic does not specify separate behavior for numeric vs categorical
> factors (When the author of Statistical Models in S refers to F_j as a
> "factor", it is a more general usage than the R type "factor" and includes
> numeric variables--the language used later on in the chapter on page 40
> confirms this): when there is a missing marginal term in the formula, the
> higher-order interaction should be coded by dummy variables, regardless of
> type. Thus, the terms() function is only following the cited behavior 1/3rd
> of the time.
>
> Best regards,
> Tyler
Re: [Rd] Discourage the weights= option of lm with summarized data
Since the three posters agree (only) that there is a bug, I propose to file
it as a bug, which is the least we can do now.

There is more to it: the only other case of a change in the Reference Manual
which I know of is also about the weights option! This is in coxph. The
Reference Manual version 3.0.0 (2013) says about coxph:

" ... If weights is a vector of integers, then the estimated coefficients
are equivalent to estimating the model from data with the individual cases
replicated as many times as indicated by weights."

This is not true, as can be seen from the following code, which uses the data
from the first example in the Reference Manual of coxph:

library(survival)
print(df1 <- as.data.frame(list(
  time=c(4,3,1,1,2,2,3),
  status=c(1,1,1,0,1,1,0),
  x=c(0,2,1,1,1,0,0),
  sex=c(0,0,0,0,1,1,1)
)))
print(w <- rep(2,7))
print(coxph(Surv(time,status) ~ x + strata(sex),data=df1,weights=w))
# manually doubling the data:
print(df2 <- rbind(df1,df1))
print(coxph(Surv(time,status) ~ x + strata(sex), data=df2))

This should not come as a surprise, since with coxph the computation of the
likelihood (given the parameters) for a single observation uses also the
other observations.

This bug has been repaired. The present Reference Manual of coxph says that
the weights option specifies a vector of case weights, to which is added
only: "For a thorough discussion of these see the book by Therneau and
Grambsch."

Let us repair the other bug also.

   Arie
Re: [Rd] Discourage the weights= option of lm with summarized data
Peter,

This is a highly structured text. Just for the discussion, I separate the
building blocks, where (D) and (E) and (F) are new:

BEGIN OF TEXT

(A) Non-‘NULL’ ‘weights’ can be used to indicate that different observations
have different variances (with the values in ‘weights’ being inversely
proportional to the variances);

(B) or equivalently, when the elements of ‘weights’ are positive integers
w_i, that each response y_i is the mean of w_i unit-weight observations

(C) (including the case that there are w_i observations equal to y_i and the
data have been summarized).

(D) However, in the latter case, notice that within-group variation is not
used. Therefore, the sigma estimate and residual degrees of freedom may be
suboptimal;

(E) in the case of replication weights, even wrong.

(F) Hence, standard errors and analysis of variance tables should be treated
with care.

END OF TEXT

I don't understand (D), partly because it is unclear to me whether (D) refers
to (C) or to (B)+(C):

If (D) refers only to (C), as the reader might automatically think with the
repetition of the word "case", then it is unclear to me to which block (E)
refers.

If, on the other hand, (D) refers to (B)+(C), then (E) probably refers to (C)
and then I suggest to make this more clear by replacing "in the case of
replication weights" in (E) by "in the case of summarized data".

I suggest to change "even wrong" in (E) into the more down-to-earth "wrong".

(For the record: I prefer something like my original explanation of the
problem with (C), instead of (D)+(E)+(F):

"With summarized data the standard errors get smaller with increasing numbers
of observations w_i. However, when for instance all w_i are multiplied by the
same constant larger than one, the reported standard errors do not get
smaller, since the w_i are defined apart from an arbitrary positive
multiplicative constant. Hence the reported standard errors tend to be too
large and the reported t values and the reported number of significance stars
too small. Obviously, also the reported number of observations and the
reported number of degrees of freedom are too small."

Note that with heteroskedasticity, _the_ residual standard error has no
meaning.)

Finally, about the original text: (B) and (C) mention only y_i, not x_i,
while this is about entire observations. Maybe this can be remedied also?

   Arie

On Tue, Nov 28, 2017 at 1:01 PM, peter dalgaard wrote:
> My local R-devel version now has (in ?lm)
>
>     Non-‘NULL’ ‘weights’ can be used to indicate that different
>     observations have different variances (with the values in
>     ‘weights’ being inversely proportional to the variances); or
>     equivalently, when the elements of ‘weights’ are positive integers
>     w_i, that each response y_i is the mean of w_i unit-weight
>     observations (including the case that there are w_i observations
>     equal to y_i and the data have been summarized). However, in the
>     latter case, notice that within-group variation is not used.
>     Therefore, the sigma estimate and residual degrees of freedom may
>     be suboptimal; in the case of replication weights, even wrong.
>     Hence, standard errors and analysis of variance tables should be
>     treated with care.
>
> OK?
>
>
> -pd
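For completeness, a sketch of how a corrected summary could be computed from
a frequency-weighted lm fit (freq_summary is a hypothetical helper, not part
of R): sigma is rescaled using sum(w) - p residual degrees of freedom and the
t and p values are recomputed accordingly.

freq_summary <- function(fit) {
  w   <- weights(fit)                    # the prior (frequency) weights
  p   <- length(coef(fit))
  dfr <- sum(w) - p                      # residual df of the expanded data
  sigma2 <- sum(weighted.residuals(fit)^2) / dfr
  se  <- sqrt(sigma2 * diag(summary(fit)$cov.unscaled))
  tv  <- coef(fit) / se
  cbind(Estimate = coef(fit), `Std. Error` = se, `t value` = tv,
        `Pr(>|t|)` = 2 * pt(abs(tv), dfr, lower.tail = FALSE))
}
n <- 10
x <- c(1, 2, 3, 4); y <- c(1, 2, 5, 4); w <- c(1, 1, 1, n)
freq_summary(lm(y ~ x, weights = w))     # agrees with summary() of the fit on the expanded data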