date:20171009

Re: [Rd] Discourage the weights= option of lm with summarized data

2017-10-09 Thread peter dalgaard

AFAIR, it is a little more subtle than that. 

If you have replication weights, then the estimates are right, it is "just" 
that the SE from summary.lm() are wrong. Somehow, the text should reflect this.

It is of some importance when you put glm() into the mix, because you can in 
fact get correct results from things like

y <- c(0,1)
w <- c(49,51)
glm(y~1, weights=w, family=binomial)

-pd
 


> On 9 Oct 2017, at 07:58 , Arie ten Cate  wrote:
> 
> Yes.  Thank you; I should have quoted it.
> I suggest to remove this text or to add the word "not" at the beginning.
> 
>   Arie
> 
> On Sun, Oct 8, 2017 at 4:38 PM, Viechtbauer Wolfgang (SP)
>  wrote:
>> Ah, I think you are referring to this part from ?lm:
>> 
>> "(including the case that there are w_i observations equal to y_i and the 
>> data have been summarized)"
>> 
>> I see; indeed, I don't think this is what 'weights' should be used for (the 
>> other part before that is correct). Sorry, I misunderstood the point you 
>> were trying to make.
>> 
>> Best,
>> Wolfgang
>> 
>> -Original Message-
>> From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Arie ten 
>> Cate
>> Sent: Sunday, 08 October, 2017 14:55
>> To: r-devel@r-project.org
>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>> 
>> Indeed: Using 'weights' is not meant to indicate that the same
>> observation is repeated 'n' times.  As I showed, this gives erroneous
>> results. Hence I suggested that it is discouraged rather than
>> encouraged in the Details section of lm in the Reference manual.
>> 
>>   Arie
>> 
>> ---Original Message-
>> On Sat, 7 Oct 2017, wolfgang.viechtba...@maastrichtuniversity.nl wrote:
>> 
>> Using 'weights' is not meant to indicate that the same observation is
>> repeated 'n' times. It is meant to indicate different variances (or to
>> be precise, that the variance of the last observation in 'x' is
>> sigma^2 / n, while the first three observations have variance
>> sigma^2).
>> 
>> Best,
>> Wolfgang
>> 
>> -Original Message-
>> From: R-devel [mailto:r-devel-boun...@r-project.org] On Behalf Of Arie ten 
>> Cate
>> Sent: Saturday, 07 October, 2017 9:36
>> To: r-devel@r-project.org
>> Subject: [Rd] Discourage the weights= option of lm with summarized data
>> 
>> In the Details section of lm (linear models) in the Reference manual,
>> it is suggested to use the weights= option for summarized data. This
>> must be discouraged rather than encouraged. The motivation for this is
>> as follows.
>> 
>> With summarized data the standard errors get smaller with increasing
>> numbers of observations. However, the standard errors in lm do not get
>> smaller when for instance all weights are multiplied with the same
>> constant larger than one, since the inverse weights are merely
>> proportional to the error variances.
>> 
>> Here is an example of the estimated standard errors being too large
>> with the weights= option. The p value and the number of degrees of
>> freedom are also wrong. The parameter estimates are correct.
>> 
>>  n <- 10
>>  x <- c(1,2,3,4)
>>  y <- c(1,2,5,4)
>>  w <- c(1,1,1,n)
>>  xb <- c(x,rep(x[4],n-1))  # restore the original data
>>  yb <- c(y,rep(y[4],n-1))
>>  print(summary(lm(yb ~ xb)))
>>  print(summary(lm(y ~ x, weights=w)))
>> 
>> Compare with PROC REG in SAS, with a WEIGHT statement (like R) and a
>> FREQ statement (for summarized data).
>> 
>>Arie
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Using response variable in interaction as explanatory variable in glm crashes R

2017-10-09 Thread Martin Maechler

> Jan van der Laan 
> on Fri, 6 Oct 2017 12:13:39 +0200 writes:

> It is actually model.matrix that crashes, not glm. Same
> crash occurs with e.g. lm.

> model.matrix(dob_mon ~ dob_day*dob_mon, data = tab)

> also crashes R.

Yes, segmentation fault.

It only happens when these are *logical* variables, not, e.g., when
transformed to integer.

The C code in src/library/stats/src/model.c  tries to eliminate
occurances of the LHS of the formula from the RHS when building
the model matrix and it does work fine in the integer case.

Part of the culprit code may be this (from line 717),
with the  isLogical(.) which in our case, shifts the pointer by
1  in the call to firstfactor() :

int adj = isLogical(var_i)?1:0;
// avoid overflow of jstart * nn PR#15578
firstfactor(&rx[jstart * nn], n, jnext - jstart,
REAL(contrast), nrows(contrast),
ncols(contrast), INTEGER(var_i)+adj);

then in firstfactor(), we see the segfault (when running R with
'-d gdb') :

> model.matrix(dob_mon ~ dob_day*dob_mon, data = tab)

  Program received signal SIGSEGV, Segmentation fault.
  0x7fffeafa76b5 in firstfactor (ncx=0, v=0x5c3b37c, ncc=1, nrc=2, 
c=0x5c90008, 
   nrx=8, x=0x5cbf150) at ../../../../../R/src/library/stats/src/model.c:252
252 else xj[i] = cj[v[i]-1];
Missing separate debuginfos, .
(gdb) list
247 for (int j = 0; j < ncc; j++) {
248 xj = &x[j * (R_xlen_t)nrx];
249 cj = &c[j * (R_xlen_t)nrc];
250 for (int i = 0; i < nrx; i++)
251 if(v[i] == NA_INTEGER) xj[i] = NA_REAL;
252 else xj[i] = cj[v[i]-1];
253 }
254 }
255 

and indeed in the debugger,  i=7  and  v[i] is "outside", v[]
being of length 7, hence indexed 0:6.


> Jan



> On 06-10-17 12:08, Jan van der Laan wrote:
>> 
>> The following code crashes R (I know I shouldn't try to
>> estimate such a model; this was a bug in some code of
>> mine). I also tried with R-devel; same result.
>> 
>> 
>> tab <- structure(list(dob_day = c(FALSE, FALSE, FALSE,
>> FALSE, TRUE, TRUE, TRUE, TRUE), dob_mon = c(FALSE, FALSE,
>> TRUE, TRUE, FALSE, FALSE, TRUE, TRUE), dob_year =
>> c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE), n
>> = c(1489634L, 17491L, 134985L, 1639L, 47892L, 611L,
>> 4365L, 750L), pred1 = c(1488301, 18187, 135605, 1657,
>> 48547, 593, 4423, 54)), .Names = c("dob_day", "dob_mon",
>> "dob_year", "n", "pred1"), row.names = c(NA, -8L), class
>> = "data.frame")
>> 
>> m <- glm(dob_mon ~ dob_day*dob_mon, data = tab, family =
>> binomial())
>> 
>> 
>> The crash doesn't when the variables are added just as
>> main effects (dob_day+dob_mon): this results in a warning
>> and the removal of dob_mon from the formula.
>> 
>> -- 
>> 
>> Jan
>> 
>> 
>> 
>> > R.version    _ platform  
>> x86_64-pc-linux-gnu arch   x86_64 os
>> linux-gnu system x86_64, linux-gnu status
>> major  3 minor  4.1 year   2017
>> month  06 day    30 svn rev    72865
>> language   R version.string R version 3.4.1
>> (2017-06-30) nickname   Single Candle
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Discourage the weights= option of lm with summarized data

Re: [Rd] Using response variable in interaction as explanatory variable in glm crashes R

2 matches

Site Navigation

Mail list logo

Footer information