[Rd] GC: speeding-up the CHARSXP cache maintenance, 2nd try
Hi, In https://stat.ethz.ch/pipermail/r-devel/2021-October/081147.html I proposed to speed up the CHARSXP cache maintenance during GC using threading. This was rejected by Luke in https://stat.ethz.ch/pipermail/r-devel/2021-October/081172.html. Here I want to propose an alternative approach to significantly speed up CHARSXP cache maintenance during partial GCs. A patch which passes `make check-devel` is attached. Compared to R devel (revision 81110) I get the following performance improvements on my system: Elapsed time for five non-full gc in a session after x <- as.character(runif(5e7))[] gc(full = TRUE) +20sec -> ~1sec. This patch introduces (theoretical) overheads to mkCharLenCE() and full GCs. However, I did not measure dramatic differences: y <- "old_CHARSXP" after x <- "old_CHARSXP"; gc(); gc() takes a median 32 nanoseconds with and without the patch. gc(full = TRUE) in a new session takes a median 16 milliseconds with and 14 without the patch. The basic idea is to maintain the CHARSXP cache using subtables in R_StringHash, one for each of the (NUM_GC_GENERATIONS := NUM_OLD_GENERATIONS + 1) GC generations. New CHARSXPs are added by mkCharLenCE() to the subtable of the youngest generation. After a partial GC, only the chains anchored at the subtables of the youngest (num_old_gens_to_collect + 1) generations need to be searched for and cleaned of unmarked nodes. Afterwards, these chains need to be merged into those of the respective next generation, if any. This approach relies on the fact that an object/CHARSXP can never become younger again. It is OK though if an object/CHARSXP "skips" a GC generation. R_StringHash, which is now of length (NUM_GC_GENERATIONS * char_hash_size), is structured such that the chains for the same hashcode but for different generations are anchored at slots of R_StringHash which are next to each other in memory. This is because we often need to access two or more (i.e. currently all three) of them for one operation and this avoids cache misses. HASHPRI, i.e. the number of occupied primary slots, is computed and stored as NUM_GC_GENERATIONS times the number of slots which are occupied in at least one of the subtables. This is done because in mkCharLenCE() we need to iterate through one or more chains if and only if there is a chain for the particular hashcode in at least one subtable. I tried to keep the patch as minimal as possible. In particular, I did not add long vector support to R_StringHash. I rather reduced the max value of char_hash_size from 2^30 to 2^29, assuming that NUM_OLD_GENERATIONS is (not larger than) 2. I also did not yet adjust do_show_cache() and do_write_cache(), but I could do so if the patch is accepted. Thanks for your consideration and feedback. Regards, Andreas P.S. I had a hard time to get the indentation right in the patch due the mix of tabs and spaces. Sorry, if I screwed this up.Index: src/include/Defn.h === --- src/include/Defn.h (revision 81110) +++ src/include/Defn.h (working copy) @@ -94,6 +94,7 @@ */ #define NAMED_BITS 16 +#define NUM_GC_GENERATIONS 3 /* Flags */ Index: src/main/envir.c === --- src/main/envir.c (revision 81110) +++ src/main/envir.c (working copy) @@ -4079,7 +4079,7 @@ void attribute_hidden InitStringHash() { -R_StringHash = R_NewHashTable(char_hash_size); +R_StringHash = R_NewHashTable(char_hash_size * NUM_GC_GENERATIONS); } /* #define DEBUG_GLOBAL_STRING_HASH 1 */ @@ -4089,7 +4089,7 @@ { SEXP old_table = R_StringHash; SEXP new_table, chain, new_chain, val, next; -unsigned int counter, new_hashcode, newmask; +unsigned int counter, new_slot, newmask; #ifdef DEBUG_GLOBAL_STRING_HASH unsigned int oldsize = HASHSIZE(R_StringHash); unsigned int oldpri = HASHPRI(R_StringHash); @@ -4103,27 +4103,38 @@ /* When using the ATTRIB fields to maintain the chains the chain moving is destructive and does not involve allocation. This is therefore the only point where GC can occur. */ -new_table = R_NewHashTable(newsize); +new_table = R_NewHashTable(newsize * NUM_GC_GENERATIONS); newmask = newsize - 1; -/* transfer chains from old table to new table */ -for (counter = 0; counter < LENGTH(old_table); counter++) { - chain = VECTOR_ELT(old_table, counter); - while (!ISNULL(chain)) { - val = CXHEAD(chain); - next = CXTAIL(chain); - new_hashcode = char_hash(CHAR(val), LENGTH(val)) & newmask; - new_chain = VECTOR_ELT(new_table, new_hashcode); - /* If using a primary slot then increase HASHPRI */ - if (ISNULL(new_chain)) - SET_HASHPRI(new_table, HASHPRI(new_table) + 1); - /* move the current chain link to the new chain */ - /* this is a destructive modification */ - new_chain = SET_CXTAIL(val, new_chain); - SET_VECTOR_ELT(new_tab
Re: [Rd] Fwd: Using existing envars in Renviron on friendly Windows
Thanks for the suggestions, I've updated the documentation. Tomas On 11/3/21 11:30 AM, Tomas Kalibera wrote: On 11/3/21 1:37 AM, Henrik Bengtsson wrote: Oh, I see, I misunderstood. Thanks for clarifying. One more thing, to mix-and-match environment variables and strings with escaped characters, while mimicking how POSIX shells does it, by using strings with double and single quotes. For example, with: $ cat .Renviron APPDATA='C:\Users\foobar\AppData\Roaming' R_LIBS_USER="${APPDATA}"'\R-library' we get: $ Rscript --no-init-file --quiet -e 'cat(sprintf("R_LIBS_USER=[%s]\n", Sys.getenv("R_LIBS_USER")))' R_LIBS_USER=[C:\Users\foobar\AppData\Roaming\R-library] and $ source .Renviron $ echo "R_LIBS_USER=[${R_LIBS_USER}]" R_LIBS_USER=[C:\Users\foobar\AppData\Roaming\R-library] Yes, that could be mentioned explicitly as well. Tomas /Henrik On Sun, Oct 31, 2021 at 2:59 AM Tomas Kalibera wrote: On 10/31/21 2:55 AM, Henrik Bengtsson wrote: ... If one still needed backslashes, they could then be entered in single quotes, e.g. VAR='c:\users'. I don't think it matters whether you use single or double quotes - both will work. Here's a proof of concept on Linux with R 4.1.1: $ cat ./.Renviron A=C:\users B='C:\users' C="C:\users" $ Rscript -e "Sys.getenv(c('A', 'B', 'C'))" A B C "C:users" "C:\\users" "C:\\users" Yes, but as I wrote "I think the Renviron files should be written in a way so that they would work the same in a POSIX shell". This is why single quotes. With double quotes, backslashes are interpreted differently from a POSIX shell. Tomas /Henrik On Wed, Oct 27, 2021 at 11:45 AM Tomas Kalibera wrote: On 10/21/21 5:18 PM, Martin Maechler wrote: Michał Bojanowski on Wed, 20 Oct 2021 16:31:08 +0200 writes: > Hello Tomas, > Yes, that's accurate although rather terse, which is perhaps the > reason why I did not realize it applies to my case. > How about adding something in the direction of: > 1. Continuing the cited paragraph with: > In particular, on Windows it may be necessary to quote references to > existing environment variables, especially those containing file paths > (which include backslashes). For example: `"${WINVAR}"`. > 2. Add an example (not run): > # On Windows do quote references to variables containing paths, e.g.: > # If APPDATA=C:\Users\foobar\AppData\Roaming > # to point to a library tree inside APPDATA in .Renviron use > R_LIBS_USER="${APPDATA}"/R-library > Incidentally the last example is on backslashes too. > What do you think? I agree that adding an example really helps a lot in such cases, in my experience, notably if it's precise enough to be used +/- directly. Yes, I agree as well. I think the Renviron files should be written in a way so that they would work the same in a POSIX shell, so e.g. VAR="${VAR0}" or VAR="${VAR0}/subdir" are the recommended ways to preserve backslashes in VAR0. It is better to use forward slashes in string literals, e.g. VAR="c:/users". If one still needed backslashes, they could then be entered in single quotes, e.g. VAR='c:\users'. The currently implemented parsing of Renviron files differs in a number of details from POSIX shells, some are documented and some are not. Relying only on the documented behavior that is the same as in POSIX shells is the best choice for future compatibility. Tomas > On Mon, Oct 18, 2021 at 5:02 PM Tomas Kalibera wrote: >> >> >> On 10/15/21 6:44 PM, Michał Bojanowski wrote: >> > Perhaps a small update to ?.Renviron would be in order to mention that... >> >> Would you have a more specific suggestion how to update the >> documentation? Please note that it already says >> >> "‘value’ is then processed in a similar way to a Unix shell: in >> particular the outermost level of (single or double) quotes is stripped, >> and backslashes are removed except inside quotes." >> >> Thanks, >> Tomas >> >> > On Fri, Oct 15, 2021 at 6:43 PM Michał Bojanowski wrote: >> >> Indeed quoting works! Kevin suggested the same, but he didnt reply to the list. >> >> Thank you all! >> >> Michal >> >> >> >> On Fri, Oct 15, 2021 at 6:40 PM Ivan Krylov wrote: >> >>> Sorry for the noise! I wasn't supposed to send my previous message. >> >>> >> >>> On Fri, 15 Oct 2021 16:44:28 +0200 >> >>> Michał Bojanowski wrote: >> >>> >> AVAR=${APPDATA}/foo/bar >> >> Which is a documented way of referring to existing environment >> variables. Now, with that in R I'm getting: >> >> Sys.getenv("APPDATA") # That works OK >> [1] "C:\\Users\\mbojanowski\\AppData\\Roaming" >>
Re: [Rd] [External] GC: speeding-up the CHARSXP cache maintenance, 2nd try
Can you please submit this as a wishlist item to bugzilla? it is easier to keep track of there. You could also submit your threads based suggestion there, again to keep it easier to keep track of and possibly get back to in the future. I will have a look at your approach when I get a chance, but I am exploring a different approach to avoid scanning old generations that may be simpler. Best, luke On Wed, 3 Nov 2021, Andreas Kersting wrote: Hi, In https://stat.ethz.ch/pipermail/r-devel/2021-October/081147.html I proposed to speed up the CHARSXP cache maintenance during GC using threading. This was rejected by Luke in https://stat.ethz.ch/pipermail/r-devel/2021-October/081172.html. Here I want to propose an alternative approach to significantly speed up CHARSXP cache maintenance during partial GCs. A patch which passes `make check-devel` is attached. Compared to R devel (revision 81110) I get the following performance improvements on my system: Elapsed time for five non-full gc in a session after x <- as.character(runif(5e7))[] gc(full = TRUE) +20sec -> ~1sec. This patch introduces (theoretical) overheads to mkCharLenCE() and full GCs. However, I did not measure dramatic differences: y <- "old_CHARSXP" after x <- "old_CHARSXP"; gc(); gc() takes a median 32 nanoseconds with and without the patch. gc(full = TRUE) in a new session takes a median 16 milliseconds with and 14 without the patch. The basic idea is to maintain the CHARSXP cache using subtables in R_StringHash, one for each of the (NUM_GC_GENERATIONS := NUM_OLD_GENERATIONS + 1) GC generations. New CHARSXPs are added by mkCharLenCE() to the subtable of the youngest generation. After a partial GC, only the chains anchored at the subtables of the youngest (num_old_gens_to_collect + 1) generations need to be searched for and cleaned of unmarked nodes. Afterwards, these chains need to be merged into those of the respective next generation, if any. This approach relies on the fact that an object/CHARSXP can never become younger again. It is OK though if an object/CHARSXP "skips" a GC generation. R_StringHash, which is now of length (NUM_GC_GENERATIONS * char_hash_size), is structured such that the chains for the same hashcode but for different generations are anchored at slots of R_StringHash which are next to each other in memory. This is because we often need to access two or more (i.e. currently all three) of them for one operation and this avoids cache misses. HASHPRI, i.e. the number of occupied primary slots, is computed and stored as NUM_GC_GENERATIONS times the number of slots which are occupied in at least one of the subtables. This is done because in mkCharLenCE() we need to iterate through one or more chains if and only if there is a chain for the particular hashcode in at least one subtable. I tried to keep the patch as minimal as possible. In particular, I did not add long vector support to R_StringHash. I rather reduced the max value of char_hash_size from 2^30 to 2^29, assuming that NUM_OLD_GENERATIONS is (not larger than) 2. I also did not yet adjust do_show_cache() and do_write_cache(), but I could do so if the patch is accepted. Thanks for your consideration and feedback. Regards, Andreas P.S. I had a hard time to get the indentation right in the patch due the mix of tabs and spaces. Sorry, if I screwed this up. -- Luke Tierney Ralph E. Wareham Professor of Mathematical Sciences University of Iowa Phone: 319-335-3386 Department of Statistics andFax: 319-335-3017 Actuarial Science 241 Schaeffer Hall email: luke-tier...@uiowa.edu Iowa City, IA 52242 WWW: http://www.stat.uiowa.edu __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] .onLoad, packageStartupMessage, and R CMD check
I wrote a linter to stop users from using packageStartupMessage() in their .onLoad() hook because of the R CMD check warning it triggers: https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167 However, this received some pushback which I ultimately agree with, and moreover ?.onLoad seems to agree as well: > Loading a namespace should where possible be silent, with startup messages given by \code{.onAttach}. These messages (**and any essential ones from \code{.onLoad}**) should use \code{\link{packageStartupMessage}} so they can be silenced where they would be a distraction. **emphasis** mine. That is, if we think some message is _essential_ to print during loadNamespace(), we are told to use packageStartupMessage(). Should we remove this R CMD check warning? __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] .onLoad, packageStartupMessage, and R CMD check
On 04/11/2021 2:50 p.m., Michael Chirico via R-devel wrote: I wrote a linter to stop users from using packageStartupMessage() in their .onLoad() hook because of the R CMD check warning it triggers: https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167 However, this received some pushback which I ultimately agree with, and moreover ?.onLoad seems to agree as well: Loading a namespace should where possible be silent, with startup messages given by \code{.onAttach}. These messages (**and any essential ones from \code{.onLoad}**) should use \code{\link{packageStartupMessage}} so they can be silenced where they would be a distraction. **emphasis** mine. That is, if we think some message is _essential_ to print during loadNamespace(), we are told to use packageStartupMessage(). Should we remove this R CMD check warning? The help page doesn't define what an "essential" message would be, but I would assume it's a message about some dire condition, not just "Hi! I just got loaded!". So I think a note or warning would be appropriate, but not an error. Do you have an example of something that should routinely print, but that triggers a warning when checked? Duncan Murdoch __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] .onLoad, packageStartupMessage, and R CMD check
Hi Michael, Indeed, just to elaborate further on what I believe Duncan's point is, can you give any examples, "dire" or not, that are appropriate when the package is loaded but not attached (ie none of its symbols are visible to the user without using :::)? The only things I can think of are a package that changes the behavior of other, attached package code, such as conflicted. Doing so is very much an anti-pattern imo generally, with something like conflicted being an (arguable) exception. And that's assuming conflicted even works/does anything when loaded but not attached (I have not confirmed whether thats the case or not). That or a package that is at end-of-life and is or soon will be unsupported entirely. The examples don't need to be yours, per se, if you know what those pushing back against your linter were using messages from .onLoad for... Best, ~G On Thu, Nov 4, 2021 at 12:37 PM Duncan Murdoch wrote: > On 04/11/2021 2:50 p.m., Michael Chirico via R-devel wrote: > > I wrote a linter to stop users from using packageStartupMessage() in > > their .onLoad() hook because of the R CMD check warning it triggers: > > > > > https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167 > > > > However, this received some pushback which I ultimately agree with, > > and moreover ?.onLoad seems to agree as well: > > > >> Loading a namespace should where possible be silent, with startup > > messages given by \code{.onAttach}. These messages (**and any essential > > ones from \code{.onLoad}**) should use > \code{\link{packageStartupMessage}} > > so they can be silenced where they would be a distraction. > > > > **emphasis** mine. That is, if we think some message is _essential_ to > > print during loadNamespace(), we are told to use > > packageStartupMessage(). > > > > Should we remove this R CMD check warning? > > The help page doesn't define what an "essential" message would be, but I > would assume it's a message about some dire condition, not just "Hi! I > just got loaded!". So I think a note or warning would be appropriate, > but not an error. > > Do you have an example of something that should routinely print, but > that triggers a warning when checked? > > Duncan Murdoch > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel