[Rd] model frames and update()

2015-04-23 Thread Therneau, Terry M., Ph.D.

This issue has arisen within my anova.coxph routine, but is as easily 
illustrated with glm.

testdata <- data.frame(y= 1:5,
   n= c(8,10,6,20,14),
   sex = c(0,1,0,1,1),
   age = c(30,20,35,25,40))

fit <- glm(cbind(y,n) ~ age + sex, binomial, data=testdata, model=TRUE)
saveit <- fit$model

update(fit, .~. - age, data=saveit)
Error in cbind(y, n) : object 'y' not found

One would hope that a saved model frame is precisely the thing that would work best. The 
issue of course is that "cbind(y, n)" is the name of the first variable in saveit, and it 
is not being properly quoted somewhere down the line.  The same issue can occur on the 
right hand side.  "Save the model frame in case you need to refit something next month" is 
does not appear to be a safe approach to reproducable research.


fit2 <- glm(y ~ sex + log(age), poisson, testdata)
save2 <- fit2$model
update(fit2, . ~ . - sex, data=save2)  # fails
glm(y ~ log(age), poisson, save2)  # fails


I can work around this in my anova, but I wanted to not rebuild the frame if it is already 
present.   It looks like model.matrix plus attr(x, 'assign') time -- a bit harder to read, 
but that looks like what anova.glm is doing.  Is there a way to make update work?


The current code, BTW, starts by building its own frame using results of terms.inner, 
which solves the above issue nicely and update() works as expected.  But it isn't robust 
to scoping issues.  (As pointed out yesterday by a user: lapply of a function that 
contained coxph followed by anova gives a variable not found error.)  It also ignores 
saved model frames; thus the rewrite.


Terry T

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] CRAN submit page down

2015-04-23 Thread Dirk Eddelbuettel

Andrie noticed that first, and I can confirm: from our end, it looks as if
the backend to http://cran.r-project.org/submit.html is currently down.

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] model frames and update()

2015-04-23 Thread William Dunlap
> "Save the model frame in case you need to refit something next month"
> does not appear to be a safe approach to reproducible research.

Is this a standard recommendation?  It will not work in many cases.  E.g.,
if
you use lm() to model the sum of some variables the model.frame contains
only the sum, not the addends so you cannot later change an addend and refit
the model.
  > d <- data.frame(y1=1:5,y2=sin(1:5),x1=log(1:5))
  > fit <- lm(y1+y2 ~ x1, data=d, model=TRUE)
  > fit$model
 y1 + y2x1
  1 1.841471 0.000
  2 2.909297 0.6931472
  3 3.141120 1.0986123
  4 3.243198 1.3862944
  5 4.041076 1.6094379
(The same happens if you use a function like abs(x) on
the right side of the formula.)


Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Thu, Apr 23, 2015 at 9:58 AM, Therneau, Terry M., Ph.D. <
thern...@mayo.edu> wrote:

> This issue has arisen within my anova.coxph routine, but is as easily
> illustrated with glm.
>
> testdata <- data.frame(y= 1:5,
>n= c(8,10,6,20,14),
>sex = c(0,1,0,1,1),
>age = c(30,20,35,25,40))
>
> fit <- glm(cbind(y,n) ~ age + sex, binomial, data=testdata, model=TRUE)
> saveit <- fit$model
>
> update(fit, .~. - age, data=saveit)
> Error in cbind(y, n) : object 'y' not found
>
> One would hope that a saved model frame is precisely the thing that would
> work best. The issue of course is that "cbind(y, n)" is the name of the
> first variable in saveit, and it is not being properly quoted somewhere
> down the line.  The same issue can occur on the right hand side.  "Save the
> model frame in case you need to refit something next month" is does not
> appear to be a safe approach to reproducable research.
>
> fit2 <- glm(y ~ sex + log(age), poisson, testdata)
> save2 <- fit2$model
> update(fit2, . ~ . - sex, data=save2)  # fails
> glm(y ~ log(age), poisson, save2)  # fails
>
>
> I can work around this in my anova, but I wanted to not rebuild the frame
> if it is already present.   It looks like model.matrix plus attr(x,
> 'assign') time -- a bit harder to read, but that looks like what anova.glm
> is doing.  Is there a way to make update work?
>
> The current code, BTW, starts by building its own frame using results of
> terms.inner, which solves the above issue nicely and update() works as
> expected.  But it isn't robust to scoping issues.  (As pointed out
> yesterday by a user: lapply of a function that contained coxph followed by
> anova gives a variable not found error.)  It also ignores saved model
> frames; thus the rewrite.
>
> Terry T
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] model frames and update()

2015-04-23 Thread Achim Zeileis

On Thu, 23 Apr 2015, Therneau, Terry M., Ph.D. wrote:

This issue has arisen within my anova.coxph routine, but is as easily 
illustrated with glm.


testdata <- data.frame(y= 1:5,
  n= c(8,10,6,20,14),
  sex = c(0,1,0,1,1),
  age = c(30,20,35,25,40))

fit <- glm(cbind(y,n) ~ age + sex, binomial, data=testdata, model=TRUE)
saveit <- fit$model

update(fit, .~. - age, data=saveit)
Error in cbind(y, n) : object 'y' not found

One would hope that a saved model frame is precisely the thing that 
would work best. The issue of course is that "cbind(y, n)" is the name 
of the first variable in saveit, and it is not being properly quoted 
somewhere down the line.  The same issue can occur on the right hand 
side.  "Save the model frame in case you need to refit something next 
month" is does not appear to be a safe approach to reproducable 
research.


If you want to re-run the same call (which is what the default update 
method does), then you need to save the original data not the 
pre-processed model frame. However, the model frame still has all the 
information you need - but has already evaluated all transformations 
(cbind, log, etc.).



fit2 <- glm(y ~ sex + log(age), poisson, testdata)
save2 <- fit2$model
update(fit2, . ~ . - sex, data=save2)  # fails
glm(y ~ log(age), poisson, save2)  # fails

I can work around this in my anova, but I wanted to not rebuild the 
frame if it is already present.  It looks like model.matrix plus attr(x, 
'assign') time -- a bit harder to read, but that looks like what 
anova.glm is doing. Is there a way to make update work?


If you have the model.frame() plus an update formula you could internally 
extract the response, model matrix and weights, e.g.,


## original fit
fit <- glm(cbind(y,n) ~ log(age) + sex, family = binomial,
  data = testdata)

## original model frame
mf <- fit$model

## update formula
up <- . ~ . - sex

## new terms
mt <- update(terms(mf), up)

## response y, matrix x, weights w (NULL here)
y <- model.response(mf)
x <- model.matrix(mt, mf)
w <- model.weights(mf)
## offset could be added similarly

And then you can call glm.fit() or coxph.fit() if you can get the family 
from the original fit.


One should still check that the update formula does not change the 
response (to something which may or may not exist in the model frame). 
Possibly, one could try to get certain control arguments from the original 
glm fit (but probably not 'start').


Maybe this helps... (But I can understand that the issues of data frame 
vs. model frame can be quite a nuisance when programming model utilities 
:))


The current code, BTW, starts by building its own frame using results of 
terms.inner, which solves the above issue nicely and update() works as 
expected.  But it isn't robust to scoping issues.  (As pointed out 
yesterday by a user: lapply of a function that contained coxph followed 
by anova gives a variable not found error.)  It also ignores saved model 
frames; thus the rewrite.


Terry T

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] model frames and update()

2015-04-23 Thread Ben Bolker
Therneau, Terry M., Ph.D.  mayo.edu> writes:

>  This issue has arisen within my anova.coxph routine, but is as
> easily illustrated with glm.
 
> testdata <- data.frame(y= 1:5,
> n= c(8,10,6,20,14),
> sex = c(0,1,0,1,1),
> age = c(30,20,35,25,40))
> 
> fit <- glm(cbind(y,n) ~ age + sex, binomial, data=testdata, model=TRUE)
> saveit <- fit$model
> 
> update(fit, .~. - age, data=saveit)
> Error in cbind(y, n) : object 'y' not found
 
> One would hope that a saved model frame is precisely the thing that
> would work best. The issue of course is that "cbind(y, n)" is the
> name of the first variable in saveit, and it is not being properly
> quoted somewhere down the line.  The same issue can occur on the
> right hand side.  "Save the model frame in case you need to refit
> something next month" is does not appear to be a safe approach to
> reproducable research.
 
> fit2 <- glm(y ~ sex + log(age), poisson, testdata)
> save2 <- fit2$model
> update(fit2, . ~ . - sex, data=save2)  # fails
> glm(y ~ log(age), poisson, save2)  # fails
 
> I can work around this in my anova, but I wanted to not rebuild the
> frame if it is already present.  It looks like model.matrix plus
> attr(x, 'assign') time -- a bit harder to read, but that looks like
> what anova.glm is doing.  Is there a way to make update work?
 
> The current code, BTW, starts by building its own frame using
> results of terms.inner, which solves the above issue nicely and
> update() works as expected.  But it isn't robust to scoping issues.
> (As pointed out yesterday by a user: lapply of a function that
> contained coxph followed by anova gives a variable not found error.)
> It also ignores saved model frames; thus the rewrite.  Terry T
 
I started to complain about this sort of thing last month, at

http://article.gmane.org/gmane.comp.lang.r.devel/37805

I was speaking of updating more generally (which might change
anything, not just the formula), but I complained that

   * we could try to use the model frame (which is stored already),
but there are issues with this (the basis of a whole separate rant)
because the model frame stores something in between predictor
variables and input variables. For example

   d <- data.frame(y=1:10,x=runif(10))
   names(model.frame(lm(y~log(x),data=d)))
   ## "y" "log(x)"

So if we wanted to do something like update to "y ~ sqrt(x)",
it wouldn't work ...

  A mini-version of that rant is that I find the model frame structure
quite awkward -- it contains neither what I would call _input_
variables (actual measured things that live in columns in the original
data frame) nor _predictor_ variables (columns associated with
parameters in the model, i.e. columns of the model matrix).  It seems
to me that life would be much easier if the model frame contained just
the intersection of all.vars(formula) with the column names of the
original data set (and with NA values dropped according to
na.action()). I appreciate that this is a potentially difficult design
problem with ramifications that I don't understand, but ... Martin
Maechler tried to explain the rationale for the design to me once, but
I didn't manage to understand his argument (so I have since forgotten
it).

On the other end, it would be nice (in some dream world) to have
the capability to associate a model with a *reference* to a model
frame; consider the situation where one is fitting 10 or 20 different
models to a large data set, ending up with many copies (I'm not
100% sure, but I think that using model.frame() will end up creating
an internal copy of the data even if it's not technically modified)
of the same gigantic data ...

  Ben Bolker

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel