On Wed, 11 Aug 2021, Martin Maechler wrote:

I'm diverting this from R-help to R-devel,

because I'm asking / musing if and if where we should / could
change R here (see below).

Martin Maechler on 11 Aug 2021 11:51:25 +0200

Tim Taylor .. on 08:45:48 +0000 writes:

   >> Manipulating formulas within different models I notice the following:

   >> m1 <- lm(formula = hp ~ cyl, data = mtcars)
   >> m2 <- update(m1, formula. = hp ~ cyl)
   >> all.equal(m1, m2)
   >> #> [1] TRUE
   >> identical(m1, m2)
   >> #> [1] FALSE
   >> waldo::compare(m1, m2)
   >> #> `old$call[[2]]` is a call
   >> #> `new$call[[2]]` is an S3 object of class <formula>, a call

   >> I'm aware formulas are a form of call but what I'm unsure
   >> of is whether there is meaningful difference between the
   >> two versions of the models?

   > A good question.
   > In principle, the promise of an update()  method should be to
   > produce the *same* result as calling the original model-creation
   > (or more generally object-creation) function call.

   > So, already with identical(), you've shown that this is not
   > quite the case for simple lm(),
   > and indeed that is a bit undesirable.

   > To answer your question re "meaningful" difference,
   > given what I say above is:
   > No, there shouldn't be any relevant difference, and if there is,
   > that may considered a bug in the respective update() method,
   > here update.lm.

   > More about this in the following  R code snippet :

Again, a repr.ex.:

---0<-------0<-------0<-------0<-------0<-------0<-------0<----

m1 <- lm(formula = hp ~ cyl, data = mtcars)
m2  <- update(m1, formula. = hp ~ cyl)
m2a <- update(m1)
identical(m1, m2a)#>  TRUE !
## ==> calling update() & explicitly specifying the formula is "the problem"

identical(m1$call, m2$call) #> [1] FALSE
noCall <- function(x) x[setdiff(names(x), "call")]
identical(noCall(m1), noCall(m2))# TRUE!
## look closer:
c1 <- m1$call
c2 <- m2$call
str(as.list(c1))
## List of 3
##  $        : symbol lm
##  $ formula: language hp ~ cyl
##  $ data   : symbol mtcars

str(as.list(c2))
## List of 3
##  $        : symbol lm
##  $ formula:Class 'formula'  language hp ~ cyl
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
##  $ data   : symbol mtcars

identical(c1[-2], c2[-2]) # TRUE ==> so, indeed the difference is *only* in the 
formula ( = [2]) component
f1 <- c1$formula
f2 <- c2$formula
all.equal(f1,f2) # TRUE
identical(f1,f2) # FALSE

## Note that this is typically *not* visible if the user uses
## the accessor functions they should :
identical(formula(m1), formula(m2)) # TRUE !
## and indeed, the formula() method for 'lm'  does set the environment:
stats:::formula.lm

---0<-------0<-------0<-------0<-------0<-------0<-------0<----

We know that it has been important in  R  the formulas have an
environment and that's been the only R-core recommended way to
do non-standard evaluation (!! .. but let's skip that for now !!).

OTOH we have also kept the convention that a formula without
environment implicitly means its environment
is .GlobalEnv aka globalenv().

Currently, I think formula() methods then *should* always return
a formula *with* an environment .. even though that's not
claimed in the reference, i.e., ?formula.

Also, the print() method for formulas by default does *not* show the
environment if it is .GlobalEnv, as you can see on that help
already in the "Usage" section:

    ## S3 method for class 'formula'
    print(x, showEnv = !identical(e, .GlobalEnv), ...)

Now, I've looked at the update() here, which is update.default()
and the source code of that currently is

update.formula <- function (old, new, ...)
{
   tmp <- .Call(C_updateform, as.formula(old), as.formula(new))
   ## FIXME?: terms.formula() with "large" unneeded attributes:
   formula(terms.formula(tmp, simplify = TRUE))
}

where the important part is the "FIXME" comment (seen in the R
sources, but no longer in the R function after installation).

My current "idea" is to formalize what we see working here:
namely allow  update.formula() to *not* set the environment of
its result *if* that environment would be .GlobalEnv ..

--> I'm starting to test my proposal
but would still be *very* glad for comments, also contradicting
ones!

m1$call is the parsed expression for the call to lm(), so
m1$call$formula is the expression that was evaluated to produce the formula.
Typically this will be a call expression with `~` as the function.
It could also be a symbol:

frm <- hp ~ cyl
m3 <- lm(formula = frm, data = mtcars)
m3$call$formula
frm

update.default is creating a new call object by putting in a new
expression for the formula argument. It so happens that putting in a
formula object actually works: The only difference between the AST for
a call of `~` and the formula such a call produces when evaluated is
the class and environment attributes the call adds, and most code that
works with expressions, like eval(), ignores attributes.

It would seem somewhat more consistent if update.default put the
expression that would produce the formula into the call (i.e. stripped
out the two attributes).

But I do not know if there is logic in base R code, never mind package
code, that takes advantage of the attributes on the formula expression
in if they are found. formula() looks in the 'terms' component so would
not be affects, but I don't know if something else might be.

Best,

luke



Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to