[Rd] GC: speeding-up the CHARSXP cache maintenance, 2nd try

2021-11-04 Thread Andreas Kersting
Hi,

In https://stat.ethz.ch/pipermail/r-devel/2021-October/081147.html I proposed 
to speed up the CHARSXP cache maintenance during GC using threading. This was 
rejected by Luke in 
https://stat.ethz.ch/pipermail/r-devel/2021-October/081172.html.

Here I want to propose an alternative approach to significantly speed up 
CHARSXP cache maintenance during partial GCs. A patch which passes `make 
check-devel` is attached. Compared to R devel (revision 81110) I get the 
following performance improvements on my system:

Elapsed time for five non-full gc in a session after

x <- as.character(runif(5e7))[]
gc(full = TRUE)

+20sec -> ~1sec.


This patch introduces (theoretical) overheads to mkCharLenCE() and full GCs. 
However, I did not measure dramatic differences:

y <- "old_CHARSXP" 

after

x <- "old_CHARSXP"; gc(); gc()

takes a median 32 nanoseconds with and without the patch.


gc(full = TRUE)

in a new session takes a median 16 milliseconds with and 14 without the patch.


The basic idea is to maintain the CHARSXP cache using subtables in 
R_StringHash, one for each of the (NUM_GC_GENERATIONS := NUM_OLD_GENERATIONS + 
1) GC generations. New CHARSXPs are added by mkCharLenCE() to the subtable of 
the youngest generation. After a partial GC, only the chains anchored at the 
subtables of the youngest (num_old_gens_to_collect + 1) generations need to be 
searched for and cleaned of unmarked nodes. Afterwards, these chains need to be 
merged into those of the respective next generation, if any. This approach 
relies on the fact that an object/CHARSXP can never become younger again. It is 
OK though if an object/CHARSXP "skips" a GC generation.

R_StringHash, which is now of length (NUM_GC_GENERATIONS * char_hash_size), is 
structured such that the chains for the same hashcode but for different 
generations are anchored at slots of R_StringHash which are next to each other 
in memory. This is because we often need to access two or more (i.e. currently 
all three) of them for one operation and this avoids cache misses.

HASHPRI, i.e. the number of occupied primary slots, is computed and stored as 
NUM_GC_GENERATIONS times the number of slots which are occupied in at least one 
of the subtables. This is done because in mkCharLenCE() we need to iterate 
through one or more chains if and only if there is a chain for the particular 
hashcode in at least one subtable.

I tried to keep the patch as minimal as possible. In particular, I did not add 
long vector support to R_StringHash. I rather reduced the max value of 
char_hash_size from 2^30 to 2^29, assuming that NUM_OLD_GENERATIONS is (not 
larger than) 2. I also did not yet adjust do_show_cache() and do_write_cache(), 
but I could do so if the patch is accepted.

Thanks for your consideration and feedback.

Regards,
Andreas


P.S. I had a hard time to get the indentation right in the patch due the mix of 
tabs and spaces. Sorry, if I screwed this up.Index: src/include/Defn.h
===
--- src/include/Defn.h	(revision 81110)
+++ src/include/Defn.h	(working copy)
@@ -94,6 +94,7 @@
  */
 
 #define NAMED_BITS 16
+#define NUM_GC_GENERATIONS 3
 
 /* Flags */
 
Index: src/main/envir.c
===
--- src/main/envir.c	(revision 81110)
+++ src/main/envir.c	(working copy)
@@ -4079,7 +4079,7 @@
 
 void attribute_hidden InitStringHash()
 {
-R_StringHash = R_NewHashTable(char_hash_size);
+R_StringHash = R_NewHashTable(char_hash_size * NUM_GC_GENERATIONS);
 }
 
 /* #define DEBUG_GLOBAL_STRING_HASH 1 */
@@ -4089,7 +4089,7 @@
 {
 SEXP old_table = R_StringHash;
 SEXP new_table, chain, new_chain, val, next;
-unsigned int counter, new_hashcode, newmask;
+unsigned int counter, new_slot, newmask;
 #ifdef DEBUG_GLOBAL_STRING_HASH
 unsigned int oldsize = HASHSIZE(R_StringHash);
 unsigned int oldpri = HASHPRI(R_StringHash);
@@ -4103,27 +4103,38 @@
 /* When using the ATTRIB fields to maintain the chains the chain
moving is destructive and does not involve allocation.  This is
therefore the only point where GC can occur. */
-new_table = R_NewHashTable(newsize);
+new_table = R_NewHashTable(newsize * NUM_GC_GENERATIONS);
 newmask = newsize - 1;
 
-/* transfer chains from old table to new table */
-for (counter = 0; counter < LENGTH(old_table); counter++) {
-	chain = VECTOR_ELT(old_table, counter);
-	while (!ISNULL(chain)) {
-	val = CXHEAD(chain);
-	next = CXTAIL(chain);
-	new_hashcode = char_hash(CHAR(val), LENGTH(val)) & newmask;
-	new_chain = VECTOR_ELT(new_table, new_hashcode);
-	/* If using a primary slot then increase HASHPRI */
-	if (ISNULL(new_chain))
-		SET_HASHPRI(new_table, HASHPRI(new_table) + 1);
-	/* move the current chain link to the new chain */
-	/* this is a destructive modification */
-	new_chain = SET_CXTAIL(val, new_chain);
-	SET_VECTOR_ELT(new_tab

Re: [Rd] Fwd: Using existing envars in Renviron on friendly Windows

2021-11-04 Thread Tomas Kalibera

Thanks for the suggestions, I've updated the documentation.
Tomas

On 11/3/21 11:30 AM, Tomas Kalibera wrote:


On 11/3/21 1:37 AM, Henrik Bengtsson wrote:

Oh, I see, I misunderstood.  Thanks for clarifying.

One more thing, to mix-and-match environment variables and strings
with escaped characters, while mimicking how POSIX shells does it, by
using strings with double and single quotes. For example, with:

$ cat .Renviron
APPDATA='C:\Users\foobar\AppData\Roaming'
R_LIBS_USER="${APPDATA}"'\R-library'

we get:

$ Rscript --no-init-file --quiet -e 'cat(sprintf("R_LIBS_USER=[%s]\n",
Sys.getenv("R_LIBS_USER")))'
R_LIBS_USER=[C:\Users\foobar\AppData\Roaming\R-library]

and

$ source .Renviron
$ echo "R_LIBS_USER=[${R_LIBS_USER}]"
R_LIBS_USER=[C:\Users\foobar\AppData\Roaming\R-library]

Yes, that could be mentioned explicitly as well.

Tomas




/Henrik

On Sun, Oct 31, 2021 at 2:59 AM Tomas Kalibera 
 wrote:


On 10/31/21 2:55 AM, Henrik Bengtsson wrote:

... If one still needed backslashes,
they could then be entered in single quotes, e.g. VAR='c:\users'.

I don't think it matters whether you use single or double quotes -
both will work.  Here's a proof of concept on Linux with R 4.1.1:

$ cat ./.Renviron
A=C:\users
B='C:\users'
C="C:\users"

$ Rscript -e "Sys.getenv(c('A', 'B', 'C'))"
    A   B   C
    "C:users" "C:\\users" "C:\\users"

Yes, but as I wrote "I think the Renviron files should be written in a
way so that they would work the same in a POSIX shell". This is why
single quotes. With double quotes, backslashes are interpreted
differently from a POSIX shell.

Tomas



/Henrik

On Wed, Oct 27, 2021 at 11:45 AM Tomas Kalibera
 wrote:

On 10/21/21 5:18 PM, Martin Maechler wrote:

Michał Bojanowski
   on Wed, 20 Oct 2021 16:31:08 +0200 writes:

   > Hello Tomas,
   > Yes, that's accurate although rather terse, which is 
perhaps the

   > reason why I did not realize it applies to my case.

   > How about adding something in the direction of:

   > 1. Continuing the cited paragraph with:
   > In particular, on Windows it may be necessary to quote 
references to
   > existing environment variables, especially those 
containing file paths

   > (which include backslashes). For example: `"${WINVAR}"`.

   > 2. Add an example (not run):

   > # On Windows do quote references to variables containing 
paths, e.g.:

   > # If APPDATA=C:\Users\foobar\AppData\Roaming
   > # to point to a library tree inside APPDATA in .Renviron 
use

   > R_LIBS_USER="${APPDATA}"/R-library

   > Incidentally the last example is on backslashes too.


   > What do you think?

I agree that adding an example really helps a lot in such cases,
in my experience, notably if it's precise enough to be used +/- 
directly.
Yes, I agree as well. I think the Renviron files should be written 
in a

way so that they would work the same in a POSIX shell, so e.g.
VAR="${VAR0}" or VAR="${VAR0}/subdir" are the recommended ways to
preserve backslashes in VAR0. It is better to use forward slashes in
string literals, e.g. VAR="c:/users". If one still needed 
backslashes,

they could then be entered in single quotes, e.g. VAR='c:\users'.

The currently implemented parsing of Renviron files differs in a 
number

of details from POSIX shells, some are documented and some are not.
Relying only on the documented behavior that is the same as in POSIX
shells is the best choice for future compatibility.

Tomas

   > On Mon, Oct 18, 2021 at 5:02 PM Tomas Kalibera 
 wrote:

   >>
   >>
   >> On 10/15/21 6:44 PM, Michał Bojanowski wrote:
   >> > Perhaps a small update to ?.Renviron would be in 
order to mention that...

   >>
   >> Would you have a more specific suggestion how to update 
the

   >> documentation? Please note that it already says
   >>
   >> "‘value’ is then processed in a similar way to a Unix 
shell: in
   >> particular the outermost level of (single or double) 
quotes is stripped,

   >> and backslashes are removed except inside quotes."
   >>
   >> Thanks,
   >> Tomas
   >>
   >> > On Fri, Oct 15, 2021 at 6:43 PM Michał Bojanowski 
 wrote:
   >> >> Indeed quoting works! Kevin suggested the same, but 
he didnt reply to the list.

   >> >> Thank you all!
   >> >> Michal
   >> >>
   >> >> On Fri, Oct 15, 2021 at 6:40 PM Ivan Krylov 
 wrote:
   >> >>> Sorry for the noise! I wasn't supposed to send my 
previous message.

   >> >>>
   >> >>> On Fri, 15 Oct 2021 16:44:28 +0200
   >> >>> Michał Bojanowski  wrote:
   >> >>>
   >>  AVAR=${APPDATA}/foo/bar
   >> 
   >>  Which is a documented way of referring to existing 
environment

   >>  variables. Now, with that in R I'm getting:
   >> 
   >>  Sys.getenv("APPDATA")    # That works OK
   >>  [1] "C:\\Users\\mbojanowski\\AppData\\Roaming"
   >> 
 

Re: [Rd] [External] GC: speeding-up the CHARSXP cache maintenance, 2nd try

2021-11-04 Thread luke-tierney

Can you please submit this as a wishlist item to bugzilla? it is
easier to keep track of there. You could also submit your threads
based suggestion there, again to keep it easier to keep track of and
possibly get back to in the future.

I will have a look at your approach when I get a chance, but I am
exploring a different approach to avoid scanning old generations that
may be simpler.

Best,

luke

On Wed, 3 Nov 2021, Andreas Kersting wrote:


Hi,

In https://stat.ethz.ch/pipermail/r-devel/2021-October/081147.html I proposed 
to speed up the CHARSXP cache maintenance during GC using threading. This was 
rejected by Luke in 
https://stat.ethz.ch/pipermail/r-devel/2021-October/081172.html.

Here I want to propose an alternative approach to significantly speed up 
CHARSXP cache maintenance during partial GCs. A patch which passes `make 
check-devel` is attached. Compared to R devel (revision 81110) I get the 
following performance improvements on my system:

Elapsed time for five non-full gc in a session after

x <- as.character(runif(5e7))[]
gc(full = TRUE)

+20sec -> ~1sec.


This patch introduces (theoretical) overheads to mkCharLenCE() and full GCs. 
However, I did not measure dramatic differences:

y <- "old_CHARSXP"

after

x <- "old_CHARSXP"; gc(); gc()

takes a median 32 nanoseconds with and without the patch.


gc(full = TRUE)

in a new session takes a median 16 milliseconds with and 14 without the patch.


The basic idea is to maintain the CHARSXP cache using subtables in R_StringHash, one for 
each of the (NUM_GC_GENERATIONS := NUM_OLD_GENERATIONS + 1) GC generations. New CHARSXPs 
are added by mkCharLenCE() to the subtable of the youngest generation. After a partial 
GC, only the chains anchored at the subtables of the youngest (num_old_gens_to_collect + 
1) generations need to be searched for and cleaned of unmarked nodes. Afterwards, these 
chains need to be merged into those of the respective next generation, if any. This 
approach relies on the fact that an object/CHARSXP can never become younger again. It is 
OK though if an object/CHARSXP "skips" a GC generation.

R_StringHash, which is now of length (NUM_GC_GENERATIONS * char_hash_size), is 
structured such that the chains for the same hashcode but for different 
generations are anchored at slots of R_StringHash which are next to each other 
in memory. This is because we often need to access two or more (i.e. currently 
all three) of them for one operation and this avoids cache misses.

HASHPRI, i.e. the number of occupied primary slots, is computed and stored as 
NUM_GC_GENERATIONS times the number of slots which are occupied in at least one 
of the subtables. This is done because in mkCharLenCE() we need to iterate 
through one or more chains if and only if there is a chain for the particular 
hashcode in at least one subtable.

I tried to keep the patch as minimal as possible. In particular, I did not add 
long vector support to R_StringHash. I rather reduced the max value of 
char_hash_size from 2^30 to 2^29, assuming that NUM_OLD_GENERATIONS is (not 
larger than) 2. I also did not yet adjust do_show_cache() and do_write_cache(), 
but I could do so if the patch is accepted.

Thanks for your consideration and feedback.

Regards,
Andreas


P.S. I had a hard time to get the indentation right in the patch due the mix of 
tabs and spaces. Sorry, if I screwed this up.


--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa  Phone: 319-335-3386
Department of Statistics andFax:   319-335-3017
   Actuarial Science
241 Schaeffer Hall  email:   luke-tier...@uiowa.edu
Iowa City, IA 52242 WWW:  http://www.stat.uiowa.edu

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] .onLoad, packageStartupMessage, and R CMD check

2021-11-04 Thread Michael Chirico via R-devel
I wrote a linter to stop users from using packageStartupMessage() in
their .onLoad() hook because of the R CMD check warning it triggers:

https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167

However, this received some pushback which I ultimately agree with,
and moreover ?.onLoad seems to agree as well:

> Loading a namespace should where possible be silent, with startup
messages given by \code{.onAttach}. These messages (**and any essential
ones from \code{.onLoad}**) should use \code{\link{packageStartupMessage}}
so they can be silenced where they would be a distraction.

**emphasis** mine. That is, if we think some message is _essential_ to
print during loadNamespace(), we are told to use
packageStartupMessage().

Should we remove this R CMD check warning?

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] .onLoad, packageStartupMessage, and R CMD check

2021-11-04 Thread Duncan Murdoch

On 04/11/2021 2:50 p.m., Michael Chirico via R-devel wrote:

I wrote a linter to stop users from using packageStartupMessage() in
their .onLoad() hook because of the R CMD check warning it triggers:

https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167

However, this received some pushback which I ultimately agree with,
and moreover ?.onLoad seems to agree as well:


Loading a namespace should where possible be silent, with startup

messages given by \code{.onAttach}. These messages (**and any essential
ones from \code{.onLoad}**) should use \code{\link{packageStartupMessage}}
so they can be silenced where they would be a distraction.

**emphasis** mine. That is, if we think some message is _essential_ to
print during loadNamespace(), we are told to use
packageStartupMessage().

Should we remove this R CMD check warning?


The help page doesn't define what an "essential" message would be, but I 
would assume it's a message about some dire condition, not just "Hi! I 
just got loaded!".  So I think a note or warning would be appropriate, 
but not an error.


Do you have an example of something that should routinely print, but 
that triggers a warning when checked?


Duncan Murdoch

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] .onLoad, packageStartupMessage, and R CMD check

2021-11-04 Thread Gabriel Becker
Hi Michael,

Indeed, just to elaborate further on what I believe Duncan's point is, can
you give any examples, "dire" or not, that are appropriate when the package
is loaded but not attached (ie none of its symbols are visible to the user
without using :::)?

The only things I can think of are a package that changes the behavior of
other, attached package code, such as conflicted. Doing so is very much an
anti-pattern imo generally, with something like conflicted being an
(arguable) exception. And that's assuming conflicted even works/does
anything when loaded but not attached (I have not confirmed whether thats
the case or not). That or a package that is at end-of-life and is or soon
will be unsupported entirely.

The examples don't need to be yours, per se, if you know what those pushing
back against your linter were using messages from .onLoad for...

Best,
~G



On Thu, Nov 4, 2021 at 12:37 PM Duncan Murdoch 
wrote:

> On 04/11/2021 2:50 p.m., Michael Chirico via R-devel wrote:
> > I wrote a linter to stop users from using packageStartupMessage() in
> > their .onLoad() hook because of the R CMD check warning it triggers:
> >
> >
> https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167
> >
> > However, this received some pushback which I ultimately agree with,
> > and moreover ?.onLoad seems to agree as well:
> >
> >> Loading a namespace should where possible be silent, with startup
> > messages given by \code{.onAttach}. These messages (**and any essential
> > ones from \code{.onLoad}**) should use
> \code{\link{packageStartupMessage}}
> > so they can be silenced where they would be a distraction.
> >
> > **emphasis** mine. That is, if we think some message is _essential_ to
> > print during loadNamespace(), we are told to use
> > packageStartupMessage().
> >
> > Should we remove this R CMD check warning?
>
> The help page doesn't define what an "essential" message would be, but I
> would assume it's a message about some dire condition, not just "Hi! I
> just got loaded!".  So I think a note or warning would be appropriate,
> but not an error.
>
> Do you have an example of something that should routinely print, but
> that triggers a warning when checked?
>
> Duncan Murdoch
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel