Re: [Rd] Why does lm() with the subset argument give a different answer than subsetting in advance?

2022-01-03 Thread Martin Maechler
> Ben Bolker 
> on Mon, 27 Dec 2021 09:43:42 -0500 writes:

> I agree that it seems non-intuitive (I can't think of a
> design reason for it to look this way), but I'd like to
> stress that it's *not* an information leak; the
> predictions of the model are independent of the
> parameterization, which is all this issue affects. In a
> worst case there might be some unfortunate effects on
> numerical stability if the data-dependent bases are
> computed on a very different set of data than the model
> fitting actually uses.

> I've attached a suggested documentation patch (I hope
> it makes it through to the list; if not, I can add it to
> the body of a message.)

It did make it through; thank you, Ben!
(After adding two forgotten '}') I've committed the help file
additions to the R sources (R-devel) in svn r81434.

Thanks again and

   "Happy New Year"

to all readers,

Martin




> On 12/26/21 8:35 PM, Balise, Raymond R wrote:
>> Hello R folks, Today I noticed that using the subset
>> argument in lm() with a polynomial gives a different
>> result than using the polynomial when the data has
>> already been subsetted. This was not at all intuitive for
>> me.  You can see an example here:
>> 
>> https://stackoverflow.com/questions/70490599/why-does-lm-with-the-subset-argument-give-a-different-answer-than-subsetting-i
>> 
>> If this is a design feature that you don’t think should
>> be fixed, can you please include it in the documentation
>> and explain why it makes sense to figure out the
>> orthogonal polynomials on the entire dataset?  This feels
>> like a serious leak of information when evaluating train
>> and test datasets in a statistical learning framework.
>> 
>> Ray
>> 
>> Raymond R. Balise, PhD Assistant Professor Department of
>> Public Health Sciences, Biostatistics
>> 
>> University of Miami, Miller School of Medicine 1120
>> N.W. 14th Street Don Soffer Clinical Research Center -
>> Room 1061 Miami, Florida 33136
>> 
>> 
>> 
>> [[alternative HTML version deleted]]
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
>> 

> -- 
> Dr. Benjamin Bolker Professor, Mathematics & Statistics
> and Biology, McMaster University Director, School of
> Computational Science and Engineering Graduate chair,
> Mathematics & Statistics [DELETED ATTACHMENT external:
> BenB_lm-subset.patch, plain text]


[Rd] trivial typo in NEWS file

2022-01-03 Thread Ben Bolker



  Index: doc/NEWS.Rd
===
--- doc/NEWS.Rd (revision 81435)
+++ doc/NEWS.Rd (working copy)
@@ -425,7 +425,7 @@
   data frames with default row names (Thanks to Charlie Gao's
   \PR{18179}).

-  \item \code{txtProgresBar()} now enforces a non-zero width for
+  \item \code{txtProgressBar()} now enforces a non-zero width for
   \code{char}, without which no progress can be visible.

   \item \code{dimnames(table(d))} is more consistent in the case where



Re: [Rd] trivial typo in NEWS file

2022-01-03 Thread Martin Maechler
> Ben Bolker 
> on Mon, 3 Jan 2022 11:04:48 -0500 writes:

> Index: doc/NEWS.Rd
> ===
> --- doc/NEWS.Rd   (revision 81435)
> +++ doc/NEWS.Rd   (working copy)
> @@ -425,7 +425,7 @@
> data frames with default row names (Thanks to Charlie Gao's
> \PR{18179}).

> -  \item \code{txtProgresBar()} now enforces a non-zero width for
> +  \item \code{txtProgressBar()} now enforces a non-zero width for
> \code{char}, without which no progress can be visible.

> \item \code{dimnames(table(d))} is more consistent in the case where


Thank you, Ben!

I will take care of this with my next commit (dealing with R's
bugzilla PR#18272).

Martin



Re: [Rd] "getOption(max.print) omitted %d entries" may be negative

2022-01-03 Thread Martin Maechler
> Hugh Parsonage 
> on Wed, 29 Dec 2021 00:36:51 +1100 writes:

> In src/main/printvector.c in the definition of printVector and
> printNamedVector  (and elsewhere):

> Rprintf(" [ reached getOption(\"max.print\") -- omitted %d entries ]\n",
> n - n_pr);

> Though n - n_pr is of type R_xlen_t so may not be representable as
> int. In practice negative values may be observed for long vectors.

> Rprintf(" [ reached getOption(\"max.print\") -- omitted %lld entries ]\n",
> n - n_pr);


Thank you Hugh, for finding and reporting this,
including a proposed remedy. 

At some point in time, I think the   %lld   format specifier was
not portable enough to all versions of C compiler / libraries
that were considered valid for compiling R.

See e.g.,

   https://stackoverflow.com/questions/462345/format-specifier-for-long-long

which says that "it" does not work on Windows.

Maybe this has changed now that we require C99 and, since R
version 4.0.0 (or 4.0.1), use a somewhat more recent version of
gcc on Windows as well?

... ah, searching the R sources reveals uses of %lld
*plus*

#ifdef Win32
#include  /* for %lld */
#endif

so it seems we can and should probably change this ...

[Please, C  compiler / library standard experts, chime in !]

Martin Maechler
ETH Zurich  and  R core team



Re: [Rd] "getOption(max.print) omitted %d entries" may be negative

2022-01-03 Thread Tomas Kalibera



On 1/3/22 6:15 PM, Martin Maechler wrote:

> Hugh Parsonage
> on Wed, 29 Dec 2021 00:36:51 +1100 writes:
>
> > In src/main/printvector.c in the definition of printVector and
> > printNamedVector  (and elsewhere):
> >
> > Rprintf(" [ reached getOption(\"max.print\") -- omitted %d entries ]\n",
> > n - n_pr);
> >
> > Though n - n_pr is of type R_xlen_t so may not be representable as
> > int. In practice negative values may be observed for long vectors.
> >
> > Rprintf(" [ reached getOption(\"max.print\") -- omitted %lld entries ]\n",
> > n - n_pr);
>
> Thank you Hugh, for finding and reporting this,
> including a proposed remedy.
>
> At some point in time, I think the   %lld   format specifier was
> not portable enough to all versions of C compiler / libraries
> that were considered valid for compiling R.
>
> See e.g.,
>
> https://stackoverflow.com/questions/462345/format-specifier-for-long-long
>
> which says that "it" does not work on Windows.
>
> Maybe this has changed now that we require C99 and also that
> since R version 4.0.0 (or 4.0.1) we also use a somewhat more
> recent version of gcc also on Windows?
>
> ... ah, searching the R sources reveals uses of %lld
> *plus*
>
> #ifdef Win32
> #include  /* for %lld */
> #endif
>
> so it seems we can and should probably change this ...


UCRT on Windows supports the C99 format, so %lld works, but there is a 
bug in GCC which causes a compilation warning to appear for %lld.


There is an open GCC bug report with a patch. It has not been adopted
yet, but I got reviews from two people and patched the build of GCC in
Rtools42. So %lld etc. now works without a warning for us on Windows and
can certainly be used in package code.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95130

For base R, as we have been using the trio remap to get rid of the
warning with %lld, it would make sense to keep doing this for
consistency. Eventually we might be able to remove the dependency on
trio, after checking that the other problems that made us use it have
been resolved in UCRT.


Tomas



> [Please, C  compiler / library standard experts, chime in !]
>
> Martin Maechler
> ETH Zurich  and  R core team



[Rd] A patchwork indeed

2022-01-03 Thread Avi Gross via R-devel
Let me be clear up front that I do not want to start any major discussions,
merely to share some observations.

We discussed at length what it would mean if R were extended to allow a plus
sign to concatenate text when both operands are of a suitable type, so that,
as in a language like Python:

"Hello " + "World!"

would result in the obvious concatenation rather than an error. It might,
for example, be a way to call a limited form of paste0().

So, I was studying an R package called patchwork, which slightly extends the
way ggplot uses the plus sign when applied to objects of certain classes.
Patchwork allows many types of graphic objects to be displayed left to right
(or in a grid) by just typing

p1 + p2 + p3

BUT it goes a bit nuts and overloads lots of operators, so that:

(p1 | p2) / p3

results in the first two each taking up half of the top row and the third
spanning the full width of the next row. You can of course make all kinds of
adjustments, but the point is that those symbols are in a sense overloaded
beyond their default meanings. There is also a meaning (a tad obscure) for a
unary minus sign, as in

- p1

And, without explanation here, the symbols * and & are also used in new
ways.

I note the obvious: the normal precedence rules in R for these
symbols/operators are NOT changed, so you often need extra levels of
parentheses to guarantee the order of evaluation.

Clearly anyone reading your code who has not thoroughly read the manual for
the package will be even more mystified than people are by ggplot and the
plus sign, or by the pipe symbols used in the tidyverse and even the new one
now in base R.

But my point is that doing this is clearly possible, and small isolated
worlds can benefit from the notational simplicity. Having said that, this
package also allows you to bypass all of this and use more standard
functions that generally get you the same results. Since manipulating graphs
and subgraphs generally does not require combining the above symbols with
their normal usage, this may look harmless and, if you come from languages
that routinely allow operators to be overloaded or polymorphic, it looks
fine.

I am providing this info not to make a case for doing anything, but to ask
whether it makes sense to document acceptable methods by which others,
perhaps using objects they create themselves, can achieve such effects.

In case anyone is curious, start here for a sort of tutorial:

https://patchwork.data-imaginist.com/

Again, not advocating, just providing an example, no doubt among many
others, where R is used in an extended way that can be useful. But of course
moving R to be fully object-oriented in the same way as some other specific
language is not a valid goal.

