[Rd] assignment

2021-12-27 Thread Gabor Grothendieck
This came up in a recent SO post (I have simplified the example
here).  After the assignment below, `test` still seems to have the
value `sin`.

  test <- sin
  environment(test)$test <- cos
  test(0)
  ## [1] 0

It appears to be related to the double use of `test` in `$<-` since if
we break it up it works as expected:

  test <- sin
  e <- environment(test)
  e$test <- cos
  test(0)
  ## [1] 1

`assign` also works:

  test <- sin
  assign("test", cos, environment(test))
  test(0)
  ## [1] 1

Can anyone shed some light on this?
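One possible reading, sketched below, is that the one-step form is a nested replacement call. The R Language Definition describes such calls as being expanded through a temporary copy of the target, and the final step rebinds `test` to that copy. The sketch assumes this documented expansion applies here. The closures `f_old` and `f_new` are hypothetical stand-ins for `sin` and `cos`, chosen because `environment()` of a primitive is NULL, and `tmp` and `env` are illustrative names only.

```r
# Hypothetical stand-ins for sin and cos: ordinary closures, so that
# environment() returns a real environment rather than NULL.
f_old <- function(x) "old"
f_new <- function(x) "new"

# One-step form from the post: the binding ends up unchanged.
test <- f_old
environment(test)$test <- f_new
res_nested <- test(0)

# Manual expansion of the same statement, following the description of
# nested replacement calls in the R Language Definition.
test <- f_old
tmp <- test                                    # copy of the old function
env <- `$<-`(environment(tmp), "test", f_new)  # side effect: test is now f_new
res_midway <- test(0)
test <- `environment<-`(tmp, value = env)      # rebinds test to the old copy
res_expanded <- test(0)

print(c(nested = res_nested, midway = res_midway, expanded = res_expanded))
```

Under this reading the inner `$<-` does change the binding, but the outer assignment then overwrites it with the saved copy. The two-step and `assign` forms have no such outer assignment.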


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?

2021-12-27 Thread Balise, Raymond R
Hello R folks,
Today I noticed that using the subset argument in lm() with a polynomial gives 
a different result than using the polynomial when the data has already been 
subsetted. This was not at all intuitive for me. You can see an example 
here: 
https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i

If this is a design feature that you don’t think should be 
fixed, can you please include it in the documentation and explain why it makes 
sense to compute the orthogonal polynomials on the entire dataset?  This 
feels like a serious leak of information when evaluating train and test datasets 
in a statistical learning framework.
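A minimal sketch of the behaviour described above, using simulated data rather than the example from the linked post. The data frame `d`, the logical index `keep`, and the two fit names are illustrative assumptions.

```r
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 1 + 0.5 * d$x + 0.1 * d$x^2 + rnorm(20)
keep <- d$x <= 10

# With the subset argument, poly() is evaluated on all 20 rows and the
# rows are dropped afterwards.
fit_subset_arg <- lm(y ~ poly(x, 2), data = d, subset = keep)

# With pre-subsetted data, poly() only ever sees the 10 retained rows.
fit_presubset <- lm(y ~ poly(x, 2), data = d[keep, ])

# Side-by-side coefficients for the two fits.
print(cbind(subset_arg = coef(fit_subset_arg),
            presubset  = coef(fit_presubset)))

coefs_equal <- isTRUE(all.equal(unname(coef(fit_subset_arg)),
                                unname(coef(fit_presubset))))
print(coefs_equal)
```

The two coefficient vectors differ because each fit uses a different orthogonal basis for the same quadratic.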

Ray

Raymond R. Balise, PhD
Assistant Professor
Department of Public Health Sciences, Biostatistics

University of Miami, Miller School of Medicine
1120 N.W. 14th Street
Don Soffer Clinical Research Center - Room 1061
Miami, Florida 33136





Re: [Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?

2021-12-27 Thread Ben Bolker
  I agree that it seems non-intuitive (I can't think of a design reason 
for it to look this way), but I'd like to stress that it's *not* an 
information leak; the predictions of the model are independent of the 
parameterization, which is all this issue affects. In a worst case there 
might be some unfortunate effects on numerical stability if the 
data-dependent bases are computed on a very different set of data than 
the model fitting actually uses.
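A small check of the point about parameterization, again on simulated data. The names `d`, `keep`, `new_x` and the fit objects are illustrative. Both fits span the same quadratic in `x`, so their fitted values and predictions should agree up to numerical tolerance even though the coefficients do not.

```r
set.seed(1)
d <- data.frame(x = 1:20)
d$y <- 1 + 0.5 * d$x + 0.1 * d$x^2 + rnorm(20)
keep <- d$x <= 10

fit_subset_arg <- lm(y ~ poly(x, 2), data = d, subset = keep)
fit_presubset  <- lm(y ~ poly(x, 2), data = d[keep, ])

# Fitted values on the 10 rows used by both fits.
fitted_equal <- isTRUE(all.equal(unname(fitted(fit_subset_arg)),
                                 unname(fitted(fit_presubset))))

# Predictions at new x values, including some outside the fitted range.
new_x <- data.frame(x = c(2.5, 11, 15, 20))
preds_equal <- isTRUE(all.equal(unname(predict(fit_subset_arg, new_x)),
                                unname(predict(fit_presubset, new_x))))

print(c(fitted_equal = fitted_equal, preds_equal = preds_equal))
```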


  I've attached a suggested documentation patch. (I hope it makes it 
through to the list; if not, I can add it to the body of a message.)







--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
Graduate chair, Mathematics & Statistics

Index: lm.Rd
===
--- lm.Rd   (revision 81416)
+++ lm.Rd   (working copy)
@@ -33,7 +33,9 @@
 typically the environment from which \code{lm} is called.}
 
   \item{subset}{an optional vector specifying a subset of observations
-to be used in the fitting process.}
+to be used in the fitting process. (See additional details about how
+this argument interacts with data-dependent bases in the
+\sQuote{Details} section of the \code{\link{model.frame}} documentation.)}
 
   \item{weights}{an optional vector of weights to be used in the fitting
 process.  Should be \code{NULL} or a numeric vector.
Index: model.frame.Rd
===
--- model.frame.Rd  (revision 81416)
+++ model.frame.Rd  (working copy)
@@ -38,7 +38,9 @@
   \item{subset}{a specification of the rows to be used: defaults to all
 rows. This can be any valid indexing vector (see
 \code{\link{[.data.frame}}) for the rows of \code{data} or if that is not
-supplied, a data frame made up of the variables used in \code{formula}.}
+supplied, a data frame made up of the variables used in
+\code{formula}. (See additional details about how this argument
+interacts with data-dependent bases under \sQuote{Details} below.)}
 
   \item{na.action}{how \code{NA}s are treated.  The default is first,
 any \code{na.action} attribute of \code{data}, second
@@ -103,6 +105,12 @@
   character variable is found, it is converted to a factor (as from \R
   2.10.0).
 
+  Because variables in the formula are evaluated before rows are dropped
+  based on \code{subset}, the characteristics of data-dependent bases
+  such as orthogonal polynomials (i.e. from terms using
+  \code{\link{poly}}) or splines will be computed based on the full data
+  set rather than the subsetted data set.
+
   Unless \code{na.action = NULL}, time-series attributes will be removed
   from the variables found (since they will be wrong if \code{NA}s are
   removed).


Re: [Rd] [External] assignment

2021-12-27 Thread luke-tierney

See my response in

https://bugs.r-project.org/show_bug.cgi?id=18269

Best,

luke

--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu
