[Rd] Objectsize function visiting every element for alt-rep strings
I have a toy alt-rep string package that generates randomly seeded strings.

Example:

library(altstringisode)
x <- altrandomStrings(1e8)
head(x)
[1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
object.size(x)

object.size will call the set_altstring_Elt_method for every single element, materializing (slowly) every element of the vector. This is a problem mostly in RStudio, since object.size is called automatically, defeating the purpose of alt-rep.

Is there a way to avoid the problem of forced materialization in RStudio?

PS: Is there a way to tell if a post has been received by the mailing list? How long does it take to show up in the archives?

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
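A way to see the same effect without the toy package is the deferred string conversion built into base R. This is a sketch only: it assumes R >= 3.5, where as.character() on a compact integer sequence returns an ALTREP "deferred string", and the object.size() behaviour described above.

x <- as.character(1:1e7)      # deferred-string ALTREP: returns immediately,
                              # no character data generated yet
system.time(object.size(x))   # forces STRING_ELT() on every element, so the
                              # whole vector gets materialized here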
Re: [Rd] Objectsize function visiting every element for alt-rep strings
Thanks for the detailed response, Gabriel! I think an object_size alt-rep method that package developers need to implement themselves might be hard to get right. One alternative could be an alt-rep method that returns the number of bytes/characters in a given string element, since I believe the object size of a CHARSXP depends only on string length? I think two optional alt-string methods would be nice:

`alt_string_elt_nchars` -- for the `nchar` function in R
`alt_string_elt_nbytes` -- for `object.size` (which might be different than nchars due to encoding)

(A short nchar example illustrating the current per-element behaviour appears after this message.)

Also, since it's an issue that mainly affects RStudio, I started an issue on their GitHub, and it sounds like they'll avoid calling object.size on alt-rep objects automatically. That would fix the main problem I've been having.

Thanks,
Travers

On Fri, Jan 18, 2019 at 2:49 PM Gabriel Becker wrote:
>
> Travers,
>
> Great to hear you're trying out the ALTREP stuff, good on you :).
>
> Did you mean the get_altstring_Elt_method? I see the code in size.c within utils that grabs each element, but I don't see any setting (and the setters are no-ops currently anyway; they just do things the old way).
>
> One thing we have to decide is what object.size means for an altrep. I tend to think it should mean the size of the alternative representation currently in use in memory, but I see that a small note in ?object.size indicates that size of objects with compact internal representations may be overestimated, so technically this is "as currently documented". The "we" here, of course, is the R-core team so we'll have to see how they feel on the matter.
>
> As for what to do about it, one possibility is to add an object.size method to the ALTREP method table that gets called if object.size is called on an ALTREP object. In this case, it would be up to the class to define an appropriate object.size method. That would be relatively easy to do from a technical standpoint on R's side, but what comes out of object.size would be a bit "Wild West-y", without the consistency and correctness guarantees one might expect from a function in utils.
>
> Another option is to have object.size recurse to calling object.size on the two parts (SEXPs which together make up a CONS cell, I believe) that make up an ALTREP internally. Roughly speaking, one of these is usually the alternative representation while the other is the spot to put an object with the traditional representation if the payload is ever fully materialized in an altrep-unsafe way - e.g., C code grabs a writable dataptr via INTEGER, REAL, DATAPTR, etc. Note there are exceptions to what I said above, though, such as the wrapper ALTREP classes which always have the parent object (typically a traditionally laid-out vector), because the "alternative representation" part is strictly a metadata annotation in that case and contains no representation of the payload data for those classes.
>
> In this second case the result of object.size would be consistent across all ALTREP classes, but in both cases the result of object.size would no longer give any information about the size of a vector payload. This is consistent with how object.size deals with external pointers now, but could lead to some surprise in the case of vectors which the end user may not even know are ALTREPs.
>
> Thoughts from anyone else on this list?
>
> Anyway, thanks for pointing this out. I'll talk with Luke and see what makes sense to do here.
>
> Best,
> ~G
>
> On Wed, Jan 16, 2019 at 3:49 AM Travers Ching wrote:
>>
>> I have a toy alt-rep string package that generates randomly seeded strings.
>>
>> example:
>> library(altstringisode)
>> x <- altrandomStrings(1e8)
>> head(x)
>> [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc
>> object.size(x)
>>
>> Object.size will call the set_altstring_Elt_method for every single element, materializing (slowly) every element of the vector. This is a problem mostly in RStudio since object.size is called automatically, defeating the purpose of alt-rep.
>>
>> Is there a way to avoid the problem of forced materialization in RStudio?
>>
>> PS: Is there a way to tell if a post has been received by the mailing list? How long does it take to show up in the archives?
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
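To make the nchar() half of the proposal above concrete, here is the same kind of demonstration with base R's own deferred string ALTREP standing in for the toy class. Again, this is a sketch assuming R >= 3.5 and the behaviour current at the time of this thread.

x <- as.character(1:1e7)   # deferred-string ALTREP, nothing materialized yet
system.time(nchar(x))      # nchar() fetches every CHARSXP and so materializes
                           # the whole vector, even though the lengths could be
                           # computed without ever generating the strings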
Re: [Rd] Objectsize function visiting every element for alt-rep strings
It should be possible to calculate object.size in the presence of sharing, at least with respect to all sub-nodes of a SEXP. E.g., during calculation, keep a hash of all SEXP pointers visited. If a pointer has already been visited, add only the size of the pointer to the total object size. Travers On Wed, Jan 23, 2019 at 1:33 AM Tomas Kalibera wrote: > > On 1/22/19 6:17 PM, Kevin Ushey wrote: > > I think that object.size() is most commonly used to answer the question, > > "what R objects are consuming the most memory currently in my R session?" > > and for that reason I think returning the size of the internal > > representations of objects (for e.g. ALTREP objects; unevaluated promises) > > is the right default behavior. > > I don't think one could answer that question at all in the presence of > sharing (of objects with value semantics due to copy on write, string > cache or other caches, sharing of objects with referential semantics > such as environments, etc). Also the mapping from R objects (SEXPs) to > what users might understand as objects would not be clear (which SEXPs > belong to which "object", which SEXPs are too low-level for the user to > be considered, etc). In principle, there could be a memory profiler > working at SEXP level and exposing all the intricacies of the memory > layout, answering reachability questions on a heap dump (so one could > find out about a 1G integer vector and then list all bindings say in > namespace environments from which it is reachable), but of course that > would be a lot of work to implement and to maintain. The problem is not > unique to R (e.g. see Java with the same problems of sharing that > prevent meaningful definition for object size). I am not persuaded it > makes sense to add more options to a function that does not have and > cannot have a well defined user-level semantics, and I would discourage > writing code that is trying to build on that function as I think that it > might lead to confusion and frustration. I think equality for example is > easier to define (just that one could come up with multiple meaningful > definitions, so it makes sense to have multiple options). > > Best > Tomas > > > > I also agree it would be worth considering adding arguments that control > > how object.size() is computed for different kinds of R objects, since users > > might want to use object.size() to answer different types of questions. > > > > All that said, if the ultimate goal here is to avoid having RStudio > > materialize ALTREP objects in the background, then perhaps that change > > should happen in RStudio :-) > > > > Best, > > Kevin > > > > On Tue, Jan 22, 2019 at 8:21 AM Tierney, Luke > > wrote: > > > >> On Mon, 21 Jan 2019, Martin Maechler wrote: > >> > >>>>>>>> Travers Ching > >>>>>>>> on Tue, 15 Jan 2019 12:50:45 -0800 writes: > >>> > I have a toy alt-rep string package that generates > >>> > randomly seeded strings. example: library(altstringisode) > >>> > x <- altrandomStrings(1e8) head(x) [1] > >>> > "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" > >>> > "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ... etc object.size(1e8) > >>> > >>> > Object.size will call the set_altstring_Elt_method for > >>> > every single element, materializing (slowly) every element > >>> > of the vector. This is a problem mostly in R-studio since > >>> > object.size is called automatically, defeating the purpose > >>> > of alt-rep. > >> There is no sensible way in general to figure out how large the > >> strings would be without computing them. 
There might be specifically > >> for a deferred sequence conversion but it would require a fair bit of > >> effort to figure out that would be better spent elsewhere. > >> > >> I've never been a big fan of object.size since what it is trying to > >> compute isn't very well defined in the context of sharing and possible > >> internal state changes (even before ALTREP byte code compilation could > >> change the internals of a function [which object.size sees] and > >> assigning into environments or evaluating promises can change > >> environments [which object.size ignores]). The issue is not unlike the > >> one faced by identical(), which has a bunch of options for the > >> different ways objects can be identical, and might need even more. > >> > >> We could in general have object.size for a
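To make the sharing concern in this thread concrete: object.size() (as of this discussion) simply sums over the elements it reaches and does not track visited nodes, so a vector shared by two list elements is counted twice.

x <- rnorm(1e6)
object.size(x)             # about 8 MB
object.size(list(x, x))    # reported as about 16 MB, although both elements
                           # point at the same 8 MB vector in memory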
[Rd] Object.size() should not visit every element for alt-rep strings, or there should be an altstring_objectsize_method
Below is a toy alt-rep string example that generates N random strings:

https://gist.github.com/traversc/a48a504eb062554f2d6ff8043ca16f9c

Example:

`x <- altrandomStrings(1e8)`
`head(x)`
[1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ...
`object.size(x)`

object.size will call the `set_altstring_Elt_method` for every single element, materializing (slowly) every element of the vector. This is a problem mostly in RStudio, since object.size is called automatically, defeating the purpose of alt-rep entirely.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Object.size() should not visit every element for alt-rep strings, or there should be an altstring_objectsize_method
Hi Luke,

Thanks for the response. This is a duplicate of a post I had sent weeks ago, but for some reason it is only showing up now. I initially thought it was filtered out and detected as spam because of the GitHub link, so I re-wrote the email (several times, in fact), and you can see the other thread. Very weird.

Also, the good people at RStudio seem to have fixed the issue!

Thanks,
Travers

On Thu, Jan 31, 2019 at 5:35 AM Tierney, Luke wrote:
>
> You should really take this up with RStudio. Calling object.size on every top level assignment as they appear to do is a bad idea, even without ALTREP. object.size is only a cheap operation for simple atomic vectors. For anything with recursive structure it needs to walk the object, so the effort is proportional to object size:
>
> > x <- rep("A", 1e8)
> > system.time(object.size(x))
>    user  system elapsed
>   1.222   0.624   1.850
> > x <- rep(list(1), 1e8)
> > system.time(object.size(x))
>    user  system elapsed
>   1.247   0.022   1.273
>
> The current help for object.size says
>
>     Provides an estimate of the memory that is being used to store an R object.
>
> If this is interpreted as the current memory use, which could change in the ALTREP context (or for environments, though there the changes are ignored), then we could define object.size for ALTREP objects to avoid any ALTREP-specific computation. I'm not convinced yet that this is a good idea, but even if we do change this at the R level, RStudio would still be well-advised to have another look at what they are doing.
>
> Best,
>
> luke
>
> On Tue, 15 Jan 2019, Travers Ching wrote:
> >
> > Below is a toy alt-rep string example that generates N random strings:
> >
> > https://gist.github.com/traversc/a48a504eb062554f2d6ff8043ca16f9c
> >
> > example:
> > `x <- altrandomStrings(1e8)`
> > `head(x)`
> > [1] "2PN0bdwPY7CA8M06zVKEkhHgZVgtV1" "5PN2qmWqBlQ9wQj99nsQzldVI5ZuGX" ...
> > `object.size(x)`
> >
> > Object.size will call the `set_altstring_Elt_method` for every single element, materializing (slowly) every element of the vector. This is a problem mostly in RStudio since object.size is called automatically, defeating the purpose of alt-rep entirely.
> >
> > __
> > R-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone: 319-335-3386
> Department of Statistics and        Fax:   319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email: luke-tier...@uiowa.edu
> Iowa City, IA 52242                 WWW:   http://www.stat.uiowa.edu

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Intermittent crashes with inset `[<-` command
On an azure centos VM, I can reproduce this bug which reports either: *** caught segfault *** address 0x7006a, cause 'memory not mapped' (crash) Or incompatible types (from builtin to integer) in subassignment type fix (no crash) Like Gabriel, I could not reproduce the bug on a mac laptop. Both R versions 3.5.1. Travers On Wed, Feb 27, 2019 at 9:08 AM William Dunlap via R-devel wrote: > > Valgrind (without gctorture) reports memory misuse: > > % R --debugger=valgrind --debugger-args="--leak-check=full --num-callers=18" > ... > > x <- 1:20 > > y <- rep(letters[1:5], length(x) / 5L) > > for (i in 1:1000) { > + # x[y == 'a'] <- x[y == 'b'] > + x <- `[<-`(x, y == 'a', x[y == 'b']) > + cat(i, '') > + } > 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 > 29 30 31 32 33 34 35 36 37 ==4711== Invalid read of size 1 > ==4711==at 0x501A40F: Rf_xlength (Rinlinedfuns.h:542) > ==4711==by 0x501A40F: VectorAssign (subassign.c:658) > ==4711==by 0x501CDFE: do_subassign_dflt (subassign.c:1641) > ==4711==by 0x5020100: do_subassign (subassign.c:1571) > ==4711==by 0x4F66398: bcEval (eval.c:6795) > ==4711==by 0x4F7D86D: R_compileAndExecute (eval.c:1407) > ==4711==by 0x4F7DA70: do_for (eval.c:2185) > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > ==4711==by 0x4FA7181: Rf_ReplIteration (main.c:258) > ==4711==by 0x4FA7570: R_ReplConsole (main.c:308) > ==4711==by 0x4FA760E: run_Rmainloop (main.c:1082) > ==4711==by 0x40075A: main (Rmain.c:29) > ==4711== Address 0x19b3ab90 is 0 bytes inside a block of size 160,048 > free'd > ==4711==at 0x4C2ACBD: free (vg_replace_malloc.c:530) > ==4711==by 0x4FAFCB2: ReleaseLargeFreeVectors (memory.c:1055) > ==4711==by 0x4FAFCB2: RunGenCollect (memory.c:1825) > ==4711==by 0x4FAFCB2: R_gc_internal (memory.c:2998) > ==4711==by 0x4FB166F: Rf_allocVector3 (memory.c:2682) > ==4711==by 0x4FB2310: Rf_allocVector (Rinlinedfuns.h:577) > ==4711==by 0x4FB2310: R_alloc (memory.c:2197) > ==4711==by 0x5023F7A: logicalSubscript (subscript.c:575) > ==4711==by 0x5026DA3: Rf_makeSubscript (subscript.c:994) > ==4711==by 0x501A2F3: VectorAssign (subassign.c:656) > ==4711==by 0x501CDFE: do_subassign_dflt (subassign.c:1641) > ==4711==by 0x5020100: do_subassign (subassign.c:1571) > ==4711==by 0x4F66398: bcEval (eval.c:6795) > ==4711==by 0x4F7D86D: R_compileAndExecute (eval.c:1407) > ==4711==by 0x4F7DA70: do_for (eval.c:2185) > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > ==4711==by 0x4FA7181: Rf_ReplIteration (main.c:258) > ==4711==by 0x4FA7570: R_ReplConsole (main.c:308) > ==4711==by 0x4FA760E: run_Rmainloop (main.c:1082) > ==4711==by 0x40075A: main (Rmain.c:29) > ==4711== Block was alloc'd at > ==4711==at 0x4C29BC3: malloc (vg_replace_malloc.c:299) > ==4711==by 0x4FB1B04: Rf_allocVector3 (memory.c:2712) > ==4711==by 0x5027574: Rf_allocVector (Rinlinedfuns.h:577) > ==4711==by 0x5027574: Rf_ExtractSubset (subset.c:115) > ==4711==by 0x502ADCD: VectorSubset (subset.c:198) > ==4711==by 0x502ADCD: do_subset_dflt (subset.c:823) > ==4711==by 0x502BE90: do_subset (subset.c:661) > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > ==4711==by 0x4F7BAC3: Rf_evalListKeepMissing (eval.c:2955) > ==4711==by 0x50200CB: R_DispatchOrEvalSP (subassign.c:1535) > ==4711==by 0x50200CB: do_subassign (subassign.c:1567) > ==4711==by 0x4F66398: bcEval (eval.c:6795) > ==4711==by 0x4F7D86D: R_compileAndExecute (eval.c:1407) > ==4711==by 0x4F7DA70: do_for (eval.c:2185) > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > ==4711==by 0x4FA7181: Rf_ReplIteration (main.c:258) > ==4711==by 0x4FA7570: R_ReplConsole (main.c:308) 
> ==4711==by 0x4FA760E: run_Rmainloop (main.c:1082) > ==4711==by 0x40075A: main (Rmain.c:29) > ==4711== > ==4711== Invalid read of size 8 > ==4711==at 0x501A856: XLENGTH_EX (Rinlinedfuns.h:189) > ==4711==by 0x501A856: Rf_xlength (Rinlinedfuns.h:554) > ==4711==by 0x501A856: VectorAssign (subassign.c:658) > ==4711==by 0x501CDFE: do_subassign_dflt (subassign.c:1641) > ==4711==by 0x5020100: do_subassign (subassign.c:1571) > ==4711==by 0x4F66398: bcEval (eval.c:6795) > ==4711==by 0x4F7D86D: R_compileAndExecute (eval.c:1407) > ==4711==by 0x4F7DA70: do_for (eval.c:2185) > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > ==4711==by 0x4FA7181: Rf_ReplIteration (main.c:258) > ==4711==by 0x4FA7570: R_ReplConsole (main.c:308) > ==4711==by 0x4FA760E: run_Rmainloop (main.c:1082) > ==4711==by 0x40075A: main (Rmain.c:29) > ==4711== Address 0x19b3abb0 is 32 bytes inside a block of size 160,048 > free'd > ==4711==at 0x4C2ACBD: free (vg_replace_malloc.c:530) > ==4711==by 0x4FAFCB2: ReleaseLargeFreeVectors (memory.c:1055) > ==4711==by 0x4FAFCB2: RunGenCollect (memory.c:1825) > ==4711==by 0x4FAFCB2: R_gc_internal (memory.c:2998) > ==4711==by 0x4FB16
Re: [Rd] Intermittent crashes with inset `[<-` command
Some testing: Adding `gc()` inside the for loop prevented a crash for 10,000+ iterations, whereas adding `Sys.sleep(.2)` (which takes longer) did not. I couldn't wrap my head around the `vectorAssign` source code, but I suspect it is a matter of an intermediate object not being protected and being gc'ed. Hope that helps someone Travers Travers On Wed, Feb 27, 2019 at 11:48 AM Travers Ching wrote: > > On an azure centos VM, I can reproduce this bug which reports either: > > *** caught segfault *** > address 0x7006a, cause 'memory not mapped' (crash) > > Or > > incompatible types (from builtin to integer) in subassignment type fix > (no crash) > > Like Gabriel, I could not reproduce the bug on a mac laptop. Both R > versions 3.5.1. > > Travers > > On Wed, Feb 27, 2019 at 9:08 AM William Dunlap via R-devel > wrote: > > > > Valgrind (without gctorture) reports memory misuse: > > > > % R --debugger=valgrind --debugger-args="--leak-check=full --num-callers=18" > > ... > > > x <- 1:20 > > > y <- rep(letters[1:5], length(x) / 5L) > > > for (i in 1:1000) { > > + # x[y == 'a'] <- x[y == 'b'] > > + x <- `[<-`(x, y == 'a', x[y == 'b']) > > + cat(i, '') > > + } > > 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 > > 29 30 31 32 33 34 35 36 37 ==4711== Invalid read of size 1 > > ==4711==at 0x501A40F: Rf_xlength (Rinlinedfuns.h:542) > > ==4711==by 0x501A40F: VectorAssign (subassign.c:658) > > ==4711==by 0x501CDFE: do_subassign_dflt (subassign.c:1641) > > ==4711==by 0x5020100: do_subassign (subassign.c:1571) > > ==4711==by 0x4F66398: bcEval (eval.c:6795) > > ==4711==by 0x4F7D86D: R_compileAndExecute (eval.c:1407) > > ==4711==by 0x4F7DA70: do_for (eval.c:2185) > > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > > ==4711==by 0x4FA7181: Rf_ReplIteration (main.c:258) > > ==4711==by 0x4FA7570: R_ReplConsole (main.c:308) > > ==4711==by 0x4FA760E: run_Rmainloop (main.c:1082) > > ==4711==by 0x40075A: main (Rmain.c:29) > > ==4711== Address 0x19b3ab90 is 0 bytes inside a block of size 160,048 > > free'd > > ==4711==at 0x4C2ACBD: free (vg_replace_malloc.c:530) > > ==4711==by 0x4FAFCB2: ReleaseLargeFreeVectors (memory.c:1055) > > ==4711==by 0x4FAFCB2: RunGenCollect (memory.c:1825) > > ==4711==by 0x4FAFCB2: R_gc_internal (memory.c:2998) > > ==4711==by 0x4FB166F: Rf_allocVector3 (memory.c:2682) > > ==4711==by 0x4FB2310: Rf_allocVector (Rinlinedfuns.h:577) > > ==4711==by 0x4FB2310: R_alloc (memory.c:2197) > > ==4711==by 0x5023F7A: logicalSubscript (subscript.c:575) > > ==4711==by 0x5026DA3: Rf_makeSubscript (subscript.c:994) > > ==4711==by 0x501A2F3: VectorAssign (subassign.c:656) > > ==4711==by 0x501CDFE: do_subassign_dflt (subassign.c:1641) > > ==4711==by 0x5020100: do_subassign (subassign.c:1571) > > ==4711==by 0x4F66398: bcEval (eval.c:6795) > > ==4711==by 0x4F7D86D: R_compileAndExecute (eval.c:1407) > > ==4711==by 0x4F7DA70: do_for (eval.c:2185) > > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > > ==4711==by 0x4FA7181: Rf_ReplIteration (main.c:258) > > ==4711==by 0x4FA7570: R_ReplConsole (main.c:308) > > ==4711==by 0x4FA760E: run_Rmainloop (main.c:1082) > > ==4711==by 0x40075A: main (Rmain.c:29) > > ==4711== Block was alloc'd at > > ==4711==at 0x4C29BC3: malloc (vg_replace_malloc.c:299) > > ==4711==by 0x4FB1B04: Rf_allocVector3 (memory.c:2712) > > ==4711==by 0x5027574: Rf_allocVector (Rinlinedfuns.h:577) > > ==4711==by 0x5027574: Rf_ExtractSubset (subset.c:115) > > ==4711==by 0x502ADCD: VectorSubset (subset.c:198) > > ==4711==by 0x502ADCD: do_subset_dflt (subset.c:823) > > ==4711==by 
0x502BE90: do_subset (subset.c:661) > > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > > ==4711==by 0x4F7BAC3: Rf_evalListKeepMissing (eval.c:2955) > > ==4711==by 0x50200CB: R_DispatchOrEvalSP (subassign.c:1535) > > ==4711==by 0x50200CB: do_subassign (subassign.c:1567) > > ==4711==by 0x4F66398: bcEval (eval.c:6795) > > ==4711==by 0x4F7D86D: R_compileAndExecute (eval.c:1407) > > ==4711==by 0x4F7DA70: do_for (eval.c:2185) > > ==4711==by 0x4F7741C: Rf_eval (eval.c:691) > > ==4711==by 0x4FA7181: Rf_ReplIteration (main.c:258) > > ==4711==by 0x4FA7570: R_ReplConsole (main.c:308) > > ==4711==by 0x4FA760E: run_Rmainloop (main.c:1082) > > ==4711
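For anyone who wants to reproduce this without wading through the valgrind transcript, here is the reproducer from Bill Dunlap's session with the gc() modification described above added (that was the workaround observed, not a fix):

x <- 1:20
y <- rep(letters[1:5], length(x) / 5L)
for (i in 1:1000) {
  # x[y == 'a'] <- x[y == 'b']
  x <- `[<-`(x, y == 'a', x[y == 'b'])
  gc()        # adding this prevented the crash for 10,000+ iterations in the test above
  cat(i, '')
}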
Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()
Just throwing my two cents in: I think removing/deprecating fork would be a bad idea for two reasons: 1) There are no performant alternatives 2) Removing fork would break existing workflows Even if replaced with something using the same interface (e.g., a function that automatically detects variables to export as in the amazing `future` package), the lack of copy-on-write functionality would cause scripts everywhere to break. A simple example illustrating these two points: `x <- 5e8; mclapply(1:24, sum, x, 8)` Using fork, `mclapply` takes 5 seconds. Using "psock", `clusterApply` does not complete. Travers On Fri, Apr 12, 2019 at 2:32 AM Iñaki Ucar wrote: > > On Thu, 11 Apr 2019 at 22:07, Henrik Bengtsson > wrote: > > > > ISSUE: > > Using *forks* for parallel processing in R is not always safe. > > [...] > > Comments? > > Using fork() is never safe. The reference provided by Kevin [1] is > pretty compelling (I kindly encourage anyone who ever forked a process > to read it). Therefore, I'd go beyond Henrik's suggestion, and I'd > advocate for deprecating fork clusters and eventually removing them > from parallel. > > [1] > https://www.microsoft.com/en-us/research/uploads/prod/2019/04/fork-hotos19.pdf > > -- > Iñaki Úcar > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
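A sketch of the comparison being made above, with the shared object written out as an explicit large vector. The object, sizes, and call shapes here are illustrative only, and this assumes a Unix-alike where mclapply() can actually fork:

library(parallel)
x <- rnorm(5e7)    # roughly 400 MB, created once in the parent process

# fork: workers see x through copy-on-write pages; x is never serialized
system.time(mclapply(1:8, function(i) sum(x) + i, mc.cores = 8))

# PSOCK: x must be serialized and sent to every worker before any work starts
cl <- makeCluster(8)
system.time(parLapply(cl, 1:8, function(i, x) sum(x) + i, x))
stopCluster(cl)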
Re: [Rd] SUGGESTION: Settings to disable forked processing in R, e.g. parallel::mclapply()
Hi Inaki, > "Performant"... in terms of what. If the cost of copying the data > predominates over the computation time, maybe you didn't need > parallelization in the first place. Performant in terms of speed. There's no copying in that example using `mclapply` and so it is significantly faster than other alternatives. It is a very simple and contrived example, but there are lots of applications that depend on processing of large data and benefit from multithreading. For example, if I read in large sequencing data with `Rsamtools` and want to check sequences for a set of motifs. > I don't see why mclapply could not be rewritten using PSOCK clusters. Because it would be much slower. > To implement copy-on-write, Linux overcommits virtual memory, and this > is what causes scripts to break unexpectedly: everything works fine, > until you change a small unimportant bit and... boom, out of memory. > And in general, running forks in any GUI would cause things everywhere > to break. > I'm not sure how did you setup that, but it does complete. Or do you > mean that you ran out of memory? Then try replacing "x" with, e.g., > "x+1" in your mclapply example and see what happens (hint: save your > work first). Yes, I meant that it ran out of memory on my desktop. I understand the limits, and it is not perfect because of the GUI issue you mention, but I don't see a better alternative in terms of speed. Regards, Travers On Fri, Apr 12, 2019 at 3:45 PM Iñaki Ucar wrote: > > On Fri, 12 Apr 2019 at 21:32, Travers Ching wrote: > > > > Just throwing my two cents in: > > > > I think removing/deprecating fork would be a bad idea for two reasons: > > > > 1) There are no performant alternatives > > "Performant"... in terms of what. If the cost of copying the data > predominates over the computation time, maybe you didn't need > parallelization in the first place. > > > 2) Removing fork would break existing workflows > > I don't see why mclapply could not be rewritten using PSOCK clusters. > And as a side effect, this would enable those workflows on Windows, > which doesn't support fork. > > > Even if replaced with something using the same interface (e.g., a > > function that automatically detects variables to export as in the > > amazing `future` package), the lack of copy-on-write functionality > > would cause scripts everywhere to break. > > To implement copy-on-write, Linux overcommits virtual memory, and this > is what causes scripts to break unexpectedly: everything works fine, > until you change a small unimportant bit and... boom, out of memory. > And in general, running forks in any GUI would cause things everywhere > to break. > > > A simple example illustrating these two points: > > `x <- 5e8; mclapply(1:24, sum, x, 8)` > > > > Using fork, `mclapply` takes 5 seconds. Using "psock", `clusterApply` > > does not complete. > > I'm not sure how did you setup that, but it does complete. Or do you > mean that you ran out of memory? Then try replacing "x" with, e.g., > "x+1" in your mclapply example and see what happens (hint: save your > work first). > > -- > Iñaki Úcar __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] Open a file which name contains a tilde
Hi Gabriel, It may be bad practice, but you don't always have control over the file name. E.g. if someone shares a file with a tilde in it -- yes it is simple to rename but it is extra time, and you might not bother to rename a file without foreknowledge of this bug in the first place. Even worse, if someone points you to a read only location on a shared server, you won't even be able to rename the file, and copying might be prohibitive if it's a large file. There are also tilde files created automatically by other programs, notably microsoft office. Travers On Tue, Jun 11, 2019 at 9:49 AM Gabriel Becker wrote: > Hi Frank, > > I'm hesitant to be "that guy", but in case no one else has brought this up > to you, having files with a tilde in their names (generally but especially > on a linux system, where ~ in file names has a very important special > meaning in some cases, as we know) strikes me as an exceptionally bad > practice anyway. In light of that, the solution with the smallest amount of > pain for you is almost surely to just... not do that. Your filenames will > be better for it anyway. > > There is a reason no one has complained about this before, and while I > haven't run a study or anything, I strongly suspect its that "everyone" > else is already on the "no tildes in filenames" bandwagon, so this > behavior, even if technically a bug, has no ability to cause them problems. > > Best, > ~G > > On Tue, Jun 11, 2019 at 8:25 AM Frank Schwidom wrote: > > > Hi, > > > > yes, I have seen this package and it has the same tilde expanding > problem. > > > > Please excuse me I will cc this answer to r-help and r-devel to keep the > > discussion running. > > > > Kind regards, > > Frank Schwidom > > > > On 2019-06-11 09:12:36, Gábor Csárdi wrote: > > > Just in case, have you seen the fs package? > > > https://fs.r-lib.org/ > > > > > > Gabor > > > > > > On Tue, Jun 11, 2019 at 7:51 AM Frank Schwidom > wrote: > > > > > > > > Hi, > > > > > > > > to get rid of any possible filename modification I started a little > > project to cover my usecase: > > > > > > > > https://github.com/schwidom/simplefs > > > > > > > > This is my first R package, suggestions and a review are welcome. > > > > > > > > Thanks in advance > > > > Frank Schwidom > > > > > > > > On 2019-06-07 09:04:06, Richard O'Keefe wrote: > > > > >How can expanding tildes anywhere but the beginning of a file > > name NOT be > > > > >considered a bug? > > > > >On Thu, 6 Jun 2019 at 23:04, Ivan Krylov <[1] > > krylov.r...@gmail.com> wrote: > > > > > > > > > > On Wed, 5 Jun 2019 18:07:15 +0200 > > > > > Frank Schwidom <[2]schwi...@gmx.net> wrote: > > > > > > > > > > > +> path.expand("a ~ b") > > > > > > [1] "a /home/user b" > > > > > > > > > > > How can I switch off any file crippling activity? > > > > > > > > > > It doesn't seem to be possible if readline is enabled and > works > > > > > correctly. > > > > > > > > > > Calls to path.expand [1] end up [2] in R_ExpandFileName [3], > > which > > > > > calls R_ExpandFileName_readline [4], which uses libreadline > > function > > > > > tilde_expand [5]. tilde_expand seems to be designed to expand > > '~' > > > > > anywhere in the string it is handed, i.e. operate on whole > > command > > > > > lines, not file paths. > > > > > > > > > > I am taking the liberty of Cc-ing R-devel in case this can be > > > > > considered a bug. 
> > > > > > > > > > -- > > > > > Best regards, > > > > > Ivan > > > > > > > > > > [1] > > > > > [3] > > > https://github.com/wch/r-source/blob/12d1d2d232d84aa355e48b81180a0e2c6f2f/src/main/names.c#L807 > > > > > > > > > > [2] > > > > > [4] > > > https://github.com/wch/r-source/blob/12d1d2d232d84aa355e48b81180a0e2c6f2f/src/main/platform.c#L1915 > > > > > > > > > > [3] > > > > > [5] > > > https://github.com/wch/r-source/blob/12d1d2d232d84aa355e48b81180a0e2c6f2f/src/unix/sys-unix.c#L147 > > > > > > > > > > [4] > > > > > [6] > > > https://github.com/wch/r-source/blob/12d1d2d232d84aa355e48b81180a0e2c6f2f/src/unix/sys-std.c#L494 > > > > > > > > > > [5] > > > > > [7] > > https://git.savannah.gnu.org/cgit/readline.git/tree/tilde.c?h=devel#n187 > > > > > > > > > > __ > > > > > [8]r-h...@r-project.org mailing list -- To UNSUBSCRIBE and > > more, see > > > > > [9]https://stat.ethz.ch/mailman/listinfo/r-help > > > > > PLEASE do read the posting guide > > > > > [10]http://www.R-project.org/posting-guide.html > > > > > and provide commented, minimal, self-contained, reproducible > > code. > > > > > > > > > > References > > > > > > > > > >Visible links > > > > >1. mailto:krylov.r...@gmail.com > > > > >2. mailto:schwi...@gmx.net > > > > >3. > > > https://github.com/wch/r-source/blob/12d1d2d232d84aa355e48b81180a0e2c6f2f/src/main/names.c#L807 > > >
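To make the failure mode described above concrete, a small sketch with a hypothetical file name; this assumes an R built with readline, where path.expand() performs the mid-string expansion reported in this thread:

f <- "quarterly report ~ draft.csv"   # e.g. a file someone shared with you
path.expand(f)
# [1] "quarterly report /home/user draft.csv"
read.csv(f)   # fails even if f exists, because the connection is opened on
              # the expanded path rather than on f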
[Rd] Use of restricted c++ keywords as variable names in headers
I was trying to use one of the headers in R_ext/ from C++, but had trouble. I determined that it was due to the header using reserved C++ keywords as variable names. So to include the header, I needed to do this:

#define class klass
#define private krivate
#include
#undef class
#undef private

I know that the altrep.h header previously had the same issue, but was fixed. Could this be changed as well?

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] S4SXP type vs S4 object bit?
I'm trying to understand the R internals a bit better and reading over the documentation. I see that there is a bit related to whether an object is S4 (S4_OBJECT_MASK), and also the object type S4SXP (25). The documentation makes clear that these two things aren't the same. But in practice, will the S4-bit and object type ever disagree for S4 objects? I know that one can set the bit manually in C; are there any practical applications for doing so? Thank you Travers [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] S4SXP type vs S4 object bit?
Thank you, Jiefei and Michael!

Travers

On Tue, Oct 22, 2019 at 8:14 AM Wang Jiefei wrote:
> Hi Travers,
>
> Just an additional remark to Michael's answer: if your S4 class inherits from one of R's basic types, say integer, the resulting object will be an INTSXP. If your S4 class does not inherit from any class, it will be an S4SXP. You can think about this question from the object-oriented framework: if one class inherits the integer class, what should R do to make all the integer-related functions compatible with the new class at the C level?
>
> Best,
> Jiefei
>
> On Tue, Oct 22, 2019 at 4:28 AM Travers Ching wrote:
>>
>> I'm trying to understand the R internals a bit better and reading over the documentation.
>>
>> I see that there is a bit related to whether an object is S4 (S4_OBJECT_MASK), and also the object type S4SXP (25). The documentation makes clear that these two things aren't the same.
>>
>> But in practice, will the S4-bit and object type ever disagree for S4 objects? I know that one can set the bit manually in C; are there any practical applications for doing so?
>>
>> Thank you
>> Travers
>>
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
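Jiefei's point can be checked from the R prompt. A small sketch (the class names here are made up for illustration):

library(methods)
setClass("Plain", representation(x = "numeric"))   # no basic-type data part
setClass("IntVec", contains = "integer")           # inherits a basic type

p <- new("Plain", x = 1)
i <- new("IntVec", 1:3)

typeof(p); isS4(p)   # "S4"      TRUE  -- an S4SXP with the S4 bit set
typeof(i); isS4(i)   # "integer" TRUE  -- an INTSXP with the S4 bit set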
[Rd] Apple M1 CRAN checks
I noticed CRAN is now doing checks against Apple M1, and some packages are failing including a dependency I use. Is building on M1 now a requirement, or can the check be ignored? If it's a requirement, how can one test it out? Travers [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Apple M1 CRAN checks
Hi Prof Ripley, Here is the automated message from CRAN which I thought meant needing to fix an M1 issue: "The auto-check found additional issues for the *last* version released on CRAN: M1mac <https://www.stats.ox.ac.uk/pub/bdr/M1mac/stringfish.out> CRAN incoming checks do not test for these additional issues and you will need an appropriately instrumented build of R to reproduce these. Hence please reply-all and explain: Have these been fixed? " However, RcppParallel (a dependency) isn't building on M1: https://www.stats.ox.ac.uk/pub/bdr/M1mac/RcppParallel.out If I understand you correctly, I can ignore the M1 "Additional issues" until official R support? Thank you, Travers On Mon, Feb 22, 2021 at 11:25 PM Prof Brian Ripley wrote: > On 22/02/2021 08:30, Travers Ching wrote: > > I noticed CRAN is now doing checks against Apple M1, and some packages > are > > failing including a dependency I use. > > I don't know what this refers to: M1 Mac CRAN checks are planned but > AFAICS not yet included in the main results tables. > > OTOH, 'Additional issues' on M1 Mac have been reported on the results > pages since early December. > > > Is building on M1 now a requirement, or can the check be ignored? If > it's a > > requirement, how can one test it out? > > 'requirement' for what? > > I am not aware of any CRAN package for which 'R CMD build' does not work > on an M1 Mac. > > *Checking* might need an M1 Mac machine. CRAN has only been notifying > issues which can easily be corrected without access to M1 hardware (such > as using suggested packages unconditionally or using optional > capabilities without checking). > > > Travers > > > > [[alternative HTML version deleted]] > > Please do re-read the posting guide (and 'Writing R Extensions'). > Also, this is not r-package-devel > > -- > Brian D. Ripley, rip...@stats.ox.ac.uk > Emeritus Professor of Applied Statistics, University of Oxford > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Crash/bug when calling match on latin1 strings
Here's a brief example:

# A bunch of words in UTF-8; replace the *'s
words <- readLines("h://pastebin.c**/raw/MFCQfhpY", encoding = "UTF-8")
words2 <- iconv(words, "utf-8", "latin1")
gctorture(TRUE)
y <- match(words2, words2)

I searched bugzilla but didn't see anything. Apologies if this is already reported. The bug appears in both R-devel and the release, but doesn't seem to affect R 4.0.5.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
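For anyone who doesn't want to fetch the pastebin, here is an untested, self-contained variant of the above, on the assumption that any non-ASCII UTF-8 strings re-encoded to latin1 exercise the same translation path in match():

words <- paste0("caf\u00e9 ", seq_len(5000))   # hypothetical stand-in data
words2 <- iconv(words, "utf-8", "latin1")
gctorture(TRUE)
y <- match(words2, words2)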
[Rd] Minor bug with stats::isoreg
Hello,

I'd like to file a small bug report. I searched and didn't find a duplicate report.

Calling isoreg with an Inf value causes a segmentation fault, tested on R 4.3.1 and R 4.2. A reproducible example is:

`isoreg(c(0,Inf))`

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Bug report: parLapply with capture.output(type="message") produces an error
Hello,

I have tested this on a fresh Ubuntu image with R 4.3.1.

Rscript -e 'library(parallel)
cl = makeCluster(2)
x = parLapply(cl, 1:100, function(i) {
  capture.output(message("hello"), type = "message")
})
print("bye")'

This produces the following output:

[1] "bye"
Error in unserialize(node$con) : error reading from connection
Calls: ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted
Error in unserialize(node$con) : error reading from connection
Calls: ... doTryCatch -> recvData -> recvData.SOCKnode -> unserialize
Execution halted

The error does not occur interactively or if stopCluster gets called at the end.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
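For completeness, the same script with the cluster shut down explicitly; per the last sentence above, this variant exits cleanly:

Rscript -e 'library(parallel)
cl = makeCluster(2)
x = parLapply(cl, 1:100, function(i) {
  capture.output(message("hello"), type = "message")
})
stopCluster(cl)
print("bye")'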
Re: [Rd] Bug report: parLapply with capture.output(type="message") produces an error
e. If we use the latter, we will see all output from > the parallel worker(s). Let's try that: > > $ Rscript --vanilla -e 'library(parallel); cl <- makeCluster(1, > outfile = ""); x <- clusterEvalQ(cl, { })' > starting worker pid=349252 on localhost:11036 at 17:45:05.125 > Error in unserialize(node$con) : error reading from connection > Calls: ... doTryCatch -> recvData -> recvData.SOCKnode -> > unserialize > Execution halted > > You see. There's a "starting worker ..." output that we now see. But > more importantly, we now also see that "error reading from connection" > message. So, as you see, that error message is there regardless of us > capturing or sinking the "message" output. Instead, what it tells us > is that there is an error taking place at the very end, but we > normally don't see it. > > This error is because when the main R session shuts down, the parallel > workers are still running and trying to listen to the socket > connection that they use to communicate with the main R session. But > that is now broken, so each parallel worker will fail when it tries to > communicate. > > How to fix it? Make sure to close the 'cl' cluster before exiting the > main R session, i.e. > > $ Rscript --vanilla -e 'library(parallel); cl <- makeCluster(1, > outfile = ""); x <- clusterEvalQ(cl, { }); stopCluster(cl)' > starting worker pid=349703 on localhost:11011 at 17:50:20.357 > > The error is no longer there, because the main R session will tell the > parallel workers to shut down *before* terminating itself. This means > there are no stray parallel workers trying to reach a non-existing > main R session. > > In a way, your example revealed that you forgot to call > stopCluster(cl) at the end. > > But, the real message here is: Do not mess with the "message" output in R! > > I'll take the moment to rant about this: I think sink(..., type = > "message") should not be part of the public R API; it's simply > impossible to use safely, because there is no one owner controlling > it. To prevent it being used by mistake, at least it could throw an > error if there's already an active "message" sink. Oh, well ... > > > Almost finally, do what you're probably trying to achieve here, when you > call: > > out <- capture.output({ message("hello"); message("world") }, type = > "message") > > What you really want to do is: > > capture_messages <- function(expr, envir = parent.frame()) { > msgs <- list() > withCallingHandlers({ > eval(expr, envir = envir) > }, message = function(m) { > msgs <<- c(msgs, list(m)) > invokeRestart("muffleMessage") > }) > msgs > } > > msgs <- capture_messages({ message("hello"); message("world") }) > > When you capture 'message' conditions this way, you can decide to > resignal then later, e.g. > > > void <- lapply(msgs, message) > hello > world > > You can capture 'warning' conditions in the same way. > > > > Finally, if you've got to this because you wanted to > capture/see/display/view output that is taking place on parallel > workers, I recommend using the Futureverse (https://futureverse.org) > for parallelization. Disclaimer, I'm the author. The Futureverse > takes care of relaying stdout, messages, warnings, errors, and other > types of conditions automatically. 
Here's an example that resembles > your original example: > > > cl <- parallel::makeCluster(2) > > future::plan("cluster", workers = cl) > > y <- future.apply::future_lapply(1:3, function(i) message("hello")) > hello > hello > hello > > parallel::stopCluster(cl) > > Note that those "hello" messages are truly relayed versions of the > original 'message' conditions. Warnings works the same way. > > A cleaner and slightly better version of the above example is: > > > library(future.apply) > > plan(multisession, workers = 2) > > y <- future.apply::future_lapply(1:3, function(i) message("hello")) > hello > hello > hello > > plan(sequential) > > Over and out, > > Henrik > > On Thu, Oct 5, 2023 at 4:07 PM Travers Ching wrote: > > > > Hello, I have tested this on a fresh ubuntu image with R 4.3.1. > > > > Rscript -e 'library(parallel) > > cl = makeCluster(2) > > x = parLapply(cl, 1:100, function(i) { > > capture.output(message("hello"), type = "message") > > }) > > print("bye")' > > > > This produces the following output: > > > > [1] "bye" > > Error in unserialize(node$con) : error reading from connection > > Calls: ... doTryCatch -> recvData -> recvData.SOCKnode -> > > unserialize > > Execution halted > > Error in unserialize(node$con) : error reading from connection > > Calls: ... doTryCatch -> recvData -> recvData.SOCKnode -> > > unserialize > > Execution halted > > > > The error does not occur interactively or if stopCluster gets called at > the > > end. > > > > [[alternative HTML version deleted]] > > > > __ > > R-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R hang/bug with circular references and promises
The following code snippet causes R to hang. This example might be a bit contrived, as I was experimenting and trying to understand promises, but it uses only base R. It looks like the evaluator searches for "not_a_variable" recursively, and since it doesn't exist and the two environments are each other's parents, the search goes on indefinitely.

x0 <- new.env()
x1 <- new.env(parent = x0)
parent.env(x0) <- x1
delayedAssign("v", not_a_variable, eval.env = x1)
delayedAssign("w", v, assign.env = x1, eval.env = x0)
x1$w

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
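A quick check of the cycle itself, run before the final x1$w line: because the parent chain never reaches the empty environment, an unsuccessful lookup can never stop with the usual "object not found" error.

identical(parent.env(x1), x0)   # TRUE
identical(parent.env(x0), x1)   # TRUE -- the parent chain is now a cycle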
Re: [Rd] clarifying and adjusting the C API for R
Hi Luke, thanks for all your work on R! I'd like to ask specifically about R_serialize / R_unserialize (and associated helper functions). These are used by at least a handful of packages and I don't see them in the list from Yutani. Are these API functions considered "stable"? Best, Travers On Sat, Jun 8, 2024 at 9:29 PM Hiroaki Yutani wrote: > > Thanks so much for your wonderful work, Luke! > I didn't expect such a clarification to happen this soon. This is really > great. > > For convenience, I created a quick web page to search the result of > tools:::funAPI(). > > https://yutannihilation.github.io/R-fun-API/ > > Hope this helps those who are too lazy to install R-devel to check. > > Best, > Yutani > > 2024年6月6日(木) 23:47 luke-tierney--- via R-devel : > > > This is an update on some current work on the C API for use in R > > extensions. > > > > The internal R implementation makes use of tens of thousands of C > > entry points. On Linux and Windows, which support visibility > > restrictions, most of these are visible only within the R executble or > > shared library. About 1500 are not hidden and are visible to > > dynamically loaded shared libraries, such as ones in packages, and to > > embedding applications. > > > > There are two main reasons for limiting access to entry points in a > > software framework: > > > > - Some entry points are very easy to use in ways that corrupt internal > >data, leading to segfaults or, worse, incorrect computations without > >segfaults. > > > > - Some entry point expose internal structure and other implementation > >details, which makes it hard to make improvements without breaking > >client code that has come to depend on these details. > > > > The API of C entry points that can be used in R extensions, both for > > packages and embedding, has evolved organically over many years. The > > definition for the current release expressed in the Writing R > > Extensions manual (WRE) is roughly: > > > > An entry point can be used if (1) it is declared in a header file > > in R.home("include"), and (2) if it is documented for use in WRE. > > > > Ideally, (1) would be necessary and sufficient, but for a variety of > > reasons that isn't achievable, at least not in the near term. (2) can > > be challenging to determine; in particular, it is not amenable to a > > computational answer. > > > > An experimental effort is underway to add annotations to the WRE > > Texinfo source to allow (2) to be answered unambiguously. The > > annotations so far mostly reflect my reading or WRE and may be revised > > as they are reviewed by others. The annotated document can be used for > > programmatically identifying what is currently considered part of the C > > API. The result so far is an experimental function tools:::funAPI(): > > > > > head(tools:::funAPI()) > > nameloc apitype > > 1 Rf_AdobeSymbol2utf8 R_ext/GraphicsDevice.heapi > > 2alloc3DArrayWRE api > > 3 allocArrayWRE api > > 4 allocLangWRE api > > 5 allocListWRE api > > 6 allocMatrixWRE api > > > > The 'apitype' field has three possible levels > > > > | api | stable (ideally) API | > > | eapi | experimental API | > > | emb | embedding API| > > > > Entry points in the embedded API would typically only be used in > > applications embedding R or providing new front ends, but might be > > reasonable to use in packages that support embedding. 
> > > > The 'loc' field indicates how the entry point is identified as part of > > an API: explicit mention in WRE, or declaration in a header file > > identified as fully part of an API. > > > > [tools:::funAPI() may not be completely accurate as it relies on > > regular expressions for examining header files considered part of the > > API rather than proper parsing. But it seems to be pretty close to > > what can be achieved with proper parsing. Proper parsing would add > > dependencies on additional tools, which I would like to avoid for > > now. One dependency already present is that a C compiler has to be on > > the search path and cc -E has to run the C pre-processor.] > > > > Two additional experimental functions are available for analyzing > > package compliance: tools:::checkPkgAPI and tools:::checkAllPkgsAPI. > > These examine installed packages. > > > > [These may produce some false positives on macOS; they may or may not > > work on Windows at this point.] > > > > Using these tools initially showed around 200 non-API entry points > > used across packages on CRAN and BIOC. Ideally this number should be > > reduced to zero. This will require a combination of additions to the > > API and changes in packages. > > > > Some entry points can safely be added to the API. Around 40 have > > already be
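Coming back to the R_serialize / R_unserialize question asked at the top of this message: those entry points are the C-level counterparts of the round trip sketched below. Packages typically reach for the C interface in order to redirect the serialized byte stream to their own sinks rather than to a raw vector or connection; the R-level code here is only an illustration of the operation being asked about, not of the C API itself.

x <- list(a = 1:3, b = "hello")
raw_repr <- serialize(x, connection = NULL)   # SEXP -> raw vector
y <- unserialize(raw_repr)                    # raw vector -> SEXP
identical(x, y)                               # TRUE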