Re: [Rd] [parallel] fixes load balancing of parLapplyLB
Dear Henrik,

The rationale is just that it lies between these two extremes and that it is
really simple to calculate, without making any assumptions and knowing that it
won't be perfect.

The extremes A and B you mention are special cases based on assumptions. Case
A is based on the assumption that the function has a long or varying runtime;
in that case you are likely to get the best load balancing with really small
chunks. Case B is based on the assumption that the function runtime is the
same for every list element, i.e. you don't actually need load balancing and
can just use `parLapply` without it.

This new default is **not the best one**. It's just a better one than we had
before. There is no best one we can use as the default because **we don't know
the function runtime and how it varies**. The user needs to decide that
because he/she knows the function. As mentioned before, I will write a patch
that makes the chunk size an optional argument, so the user can decide,
because only he/she has all the information needed to choose the best chunk
size -- just like you did with the `future.scheduling` parameter.

Best Regards

On February 19, 2018 10:11:04 PM GMT+01:00, Henrik Bengtsson wrote:
>Hi, I'm trying to understand the rationale for your proposed amount of
>splitting and more precisely why that one is THE one.
>
>If I put labels on your example numbers in one of your previous posts:
>
>  nbrOfElements <- 97
>  nbrOfWorkers <- 5
>
>With these, there are two extremes in how you can split up the
>processing in chunks such that all workers are utilized:
>
>(A) Each worker, called multiple times, processes one element each time:
>
>> nbrOfElements <- 97
>> nbrOfWorkers <- 5
>> nbrOfChunks <- nbrOfElements
>> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>[30] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>[59] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>[88] 1 1 1 1 1 1 1 1 1 1
>
>(B) Each worker, called once, processes multiple elements:
>
>> nbrOfElements <- 97
>> nbrOfWorkers <- 5
>> nbrOfChunks <- nbrOfWorkers
>> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
>[1] 20 19 19 19 20
>
>I understand that neither of these two extremes may be the best when
>it comes to orchestration overhead and load balancing. Instead, the
>best might be somewhere in-between, e.g.
>
>(C) Each worker, called multiple times, processes multiple elements:
>
>> nbrOfElements <- 97
>> nbrOfWorkers <- 5
>> nbrOfChunks <- nbrOfElements / nbrOfWorkers
>> sapply(parallel:::splitList(1:nbrOfElements, nbrOfChunks), length)
> [1] 5 5 5 5 4 5 5 5 5 5 4 5 5 5 5 4 5 5 5 5
>
>However, there are multiple alternatives between the two extremes, e.g.
>
>> nbrOfChunks <- scale * nbrOfElements / nbrOfWorkers
>
>So, is there a reason why you argue for scale = 1.0 to be the optimal?
>
>FYI, in future.apply::future_lapply(X, FUN, ...) there is a
>'future.scheduling' scale factor(*) argument where the default
>future.scheduling = 1 corresponds to (B) and future.scheduling = +Inf
>to (A). Using future.scheduling = 4 achieves the amount of
>load balancing you propose in (C). (*) Different definition from the
>above 'scale'.
>(Disclaimer: I'm the author)
>
>/Henrik
>
>On Mon, Feb 19, 2018 at 10:21 AM, Christian Krause wrote:
>> Dear R-Devel List,
>>
>> I have installed R 3.4.3 with the patch applied on our cluster and ran a
>> *real-world* job of one of our users to confirm that the patch works to my
>> satisfaction. Here are the results.
>>
>> The original was a series of jobs, all essentially doing the same stuff
>> using bootstrapped data, so for the original there is more data and I show
>> the arithmetic mean with standard deviation. The confirmation with the
>> patched R was only a single instance of that series of jobs.
>>
>> ## Job Efficiency
>>
>> The job efficiency is defined as (this is what the `qacct-efficiency` tool
>> below does):
>>
>> ```
>> efficiency = cputime / cores / wallclocktime * 100%
>> ```
>>
>> In simpler words: how well did the job utilize its CPU cores. It shows the
>> percentage of time the job was actually doing stuff, as opposed to the
>> difference:
>>
>> ```
>> wasted = 100% - efficiency
>> ```
>>
>> ... which, essentially, tells us how much of the resources were wasted,
>> i.e. CPU cores just idling without being used by anyone. We care a lot
>> about that because, for our scientific computing cluster, wasted resources
>> are like burning money.
>>
>> ### original
>>
>> This takes the entire series from our job accounting database, filters the
>> successful jobs, calculates the efficiency and then shows the average and
>> standard deviation of the efficiency:
>>
>> ```
>> $ qacct -j 4433299 | qacct-success | qacct-efficiency | meansd
>> n=945 ∅ 61.7276 ± 7.78719
>> ```
>>
>> This takes the entire series from our job accounting database, filters the
>> successful jobs, calculates the efficiency and does sort
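To make the chunking trade-off discussed in this thread concrete, here is a small R sketch, not part of the patch or the thread; it just replays the three splits above through the same internal helper used in the quoted examples (`chunkSizes` is a hypothetical wrapper, not an existing function):

```
## Hypothetical wrapper around the internal helper used in the quoted examples;
## parallel:::splitList() is not part of the public API.
chunkSizes <- function(nbrOfElements, nbrOfChunks) {
  sapply(parallel:::splitList(seq_len(nbrOfElements), nbrOfChunks), length)
}

nbrOfElements <- 97
nbrOfWorkers  <- 5

chunkSizes(nbrOfElements, nbrOfElements)                 # (A) 97 chunks of 1 element each
chunkSizes(nbrOfElements, nbrOfWorkers)                  # (B) 5 chunks of ~20 elements
chunkSizes(nbrOfElements, nbrOfElements / nbrOfWorkers)  # (C) ~20 chunks of ~5 elements, the proposed default
```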
Re: [Rd] Unnecessary lines in stem.c?
Thanks! Cleaned up in R-devel,

Tomas

On 02/16/2018 05:03 PM, S Ellison wrote:

A discussion on r-help led me to look at stem.c at
https://github.com/wch/r-source/blob/trunk/src/library/graphics/src/stem.c

Lines 76-77 appear superfluous. They sit inside a condition, and set mu, as
follows:

    if (k*(k-4)*(k-8) == 0) mu = 5;
    if ((k-1)*(k-5)*(k-6) == 0) mu = 20;

But mu is set unconditionally to 10 on line 84, and that is followed by
conditional assignments (on lines 85-86) identical to lines 76-77. It looks
like a couple of lines got left inside a condition where they are no longer
needed. If that is correct, is it worth removing the superfluous lines, for
future coders' benefit?

S Ellison

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] How to modify dots and dispatch NextMethod
Hi all,

Not sure if this belongs to R-devel or R-package-devel. Anyways... Suppose we
have objects of class c("foo", "bar"), and there are two S3 methods c.foo,
c.bar. In c.foo, I'm trying to modify the dots and forward the dispatch using
NextMethod without any success. This is what I've tried so far:

c.foo <- function(..., recursive=FALSE) {
  dots <- list(...)
  # inspect and modify dots
  # ...
  do.call(
    function(..., recursive=FALSE) structure(NextMethod("c"), class="foo"),
    c(dots, recursive=recursive)
  )
}

foobar <- 1
class(foobar) <- c("foo", "bar")
c(foobar, foobar)

Error: C stack usage 7970788 is too close to the limit

There's recursion (!). But the funniest thing is that if c.foo is exported by
one package and c.bar is exported by another one, there's no recursion, but
c.bar is never called (!!). Why is the same code behaving in a different way
depending on whether these functions are defined in the .GlobalEnv or in two
attached packages? (BTW, isS3method() is TRUE for c.foo and c.bar in both
cases).

I'm blocked here. Am I missing something? Is there a way of doing this?

Thanks in advance. Regards,

Iñaki

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
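An aside not from the thread, and not an answer to why the NextMethod-based version recurses: one hedged workaround sketch is to sidestep NextMethod entirely and re-dispatch explicitly by dropping the "foo" class before concatenating. The `c.bar` stub below is invented purely to make the example self-contained.

```
## Hedged workaround sketch (not from the thread): drop "foo" from each
## element so a fresh call to the generic c() dispatches to c.bar, then
## restore the "foo" class on the result.
c.foo <- function(..., recursive = FALSE) {
  dots <- lapply(list(...), function(x) {
    class(x) <- setdiff(class(x), "foo")  # keep "bar" so c.bar is still found
    x
  })
  # ... inspect and modify dots here ...
  res <- do.call("c", c(dots, list(recursive = recursive)))
  structure(res, class = "foo")
}

## Minimal stand-in for c.bar so the sketch runs on its own:
c.bar <- function(..., recursive = FALSE) {
  message("c.bar was called")
  unlist(lapply(list(...), unclass))
}

foobar <- structure(1, class = c("foo", "bar"))
c(foobar, foobar)  # no infinite recursion; c.bar's message is printed
```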
Re: [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
Hi Scott, I think that's a good idea and I tried your patch on my copy of the repository. But it looks to me like the recent patch is identical to the previous one, can you confirm this? Frederick On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote: > Thanks Gabriel, > > I think your suggested approach is 100% backwards compatible > > Currently in the case of duplicate column names only the first can be > indexed by its name. This will always be the column appearing in by.x, > meaning the column in y with the same name cannot be accessed. Appending > ".y" (suffixes[2L]) to this column means it can now be accessed, while > keeping the current behaviour of making the key columns always accessible > by using the names provided to by.x. > > I've attached a new patch that has this behaviour. > > Best, > > Scott > > On 19 February 2018 at 05:08, Gabriel Becker wrote: > > > It seems like there is a way that is backwards compatible-ish in the sense > > mentioned and still has the (arguably, but a good argument I think) better > > behavior: > > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' column > > is called name and y's 'name' column (not used int he merge) is changed to > > name.y. > > > > Now of course this would still change output, but it would change it to > > something I think would be better, while retaining the 'merge columns > > retain their exact names' mechanic as documented. > > > > ~G > > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie > > wrote: > > > >> Thanks Duncan and Frederick, > >> > >> I suspected as much - there doesn't appear to be any reason why conflicts > >> between by.x and names(y) shouldn't and cannot be checked, but I can see > >> how this might be more trouble than its worth given it potentially may > >> break downstream packages (i.e. any cases where this occurs but they > >> expect > >> the name of the key column(s) to remain the same). > >> > >> Best, > >> > >> Scott > >> > >> On 18 February 2018 at 11:48, Duncan Murdoch > >> wrote: > >> > >> > On 17/02/2018 6:36 PM, frede...@ofb.net wrote: > >> > > >> >> Hi Scott, > >> >> > >> >> Thanks for the patch. I'm not really involved in R development; it > >> >> will be up to someone in the R core team to apply it. I would hazard > >> >> to say that even if correct (I haven't checked), it will not be > >> >> applied because the change might break existing code. For example it > >> >> seems like reasonable code might easily assume that a column with the > >> >> same name as "by.x" exists in the output of 'merge'. That's just my > >> >> best guess... I don't participate on here often. > >> >> > >> > > >> > > >> > I think you're right. If I were still a member of R Core, I would want > >> to > >> > test this against all packages on CRAN and Bioconductor, and since that > >> > test takes a couple of days to run on my laptop, I'd probably never get > >> > around to it. > >> > > >> > There are lots of cases where "I would have done that differently", but > >> > most of them are far too much trouble to change now that R is more than > >> 20 > >> > years old. And in many cases it will turn out that the way R does it > >> > actually does make more sense than the way I would have done it. 
> >> > > >> > Duncan Murdoch > >> > > >> > > >> > > >> >> Cheers, > >> >> > >> >> Frederick > >> >> > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote: > >> >> > >> >>> The attached patch.diff will make merge.data.frame() append the > >> suffixes > >> >>> to > >> >>> columns with common names between by.x and names(y). > >> >>> > >> >>> Best, > >> >>> > >> >>> Scott Ritchie > >> >>> > >> >>> On 17 February 2018 at 11:15, Scott Ritchie > >> >>> wrote: > >> >>> > >> >>> Hi Frederick, > >> > >> I would expect that any duplicate names in the resulting data.frame > >> would > >> have the suffixes appended to them, regardless of whether or not they > >> are > >> used as the join key. So in my example I would expect "names.x" and > >> "names.y" to indicate their source data.frame. > >> > >> While careful reading of the documentation reveals this is not the > >> case, I > >> would argue the intent of the suffixes functionality should equally > >> be > >> applied to this type of case. > >> > >> If you agree this would be useful, I'm happy to write a patch for > >> merge.data.frame that will add suffixes in this case - I intend to do > >> the > >> same for merge.data.table in the data.table package where I initially > >> encountered the edge case. > >> > >> Best, > >> > >> Scott > >> > >> On 17 February 2018 at 03:53, wrote: > >> > >> Hi Scott, > >> > > >> > It seems like reasonable behavior to me. What result would you > >> expect? > >> > That the second "name" should be called "name.y"? > >> > > >> > The "merge" documentation says: > >> > > >> >>>
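For readers skimming the digest, here is a small self-contained R illustration of the behaviour under discussion; the column names and data are invented, since Scott's original example is not included above, and the "patched" result restates what the thread describes rather than new output.

```
## x's join key "name" also exists as a non-key column of y, so the unpatched
## merge() returns two columns literally called "name".
x <- data.frame(score = 1:3, name = c("a", "b", "c"))
y <- data.frame(id = c("a", "b", "c"), name = c("A", "B", "C"))

m <- merge(x, y, by.x = "name", by.y = "id")
names(m)
## unpatched:            "name" "score" "name"
## with Scott's patch:   "name" "score" "name.y"   (y's column gets suffixes[2L])
```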
Re: [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
Hi Frederick,

It looks like I didn't overwrite the patch.diff file after the last edits.
Here's the correct patch (attached and copied below):

Index: src/library/base/R/merge.R
===
--- src/library/base/R/merge.R (revision 74280)
+++ src/library/base/R/merge.R (working copy)
@@ -157,6 +157,14 @@
 }
 
 if(has.common.nms) names(y) <- nm.y
+## If by.x %in% names(y) then duplicate column names still arise,
+## apply suffixes to just y - this keeps backwards compatibility
+## when referring to by.x in the resulting data.frame
+dupe.keyx <- intersect(nm.by, names(y))
+if(length(dupe.keyx)) {
+  if(nzchar(suffixes[2L]))
+    names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, suffixes[2L], sep="")
+}
 nm <- c(names(x), names(y))
 if(any(d <- duplicated(nm)))
 if(sum(d) > 1L)

Best,

Scott

On 21 February 2018 at 08:23, wrote:
> Hi Scott,
>
> I think that's a good idea and I tried your patch on my copy of the
> repository. But it looks to me like the recent patch is identical to
> the previous one, can you confirm this?
>
> Frederick
>
> On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote:
> > Thanks Gabriel,
> >
> > I think your suggested approach is 100% backwards compatible.
> >
> > Currently, in the case of duplicate column names only the first can be
> > indexed by its name. This will always be the column appearing in by.x,
> > meaning the column in y with the same name cannot be accessed. Appending
> > ".y" (suffixes[2L]) to this column means it can now be accessed, while
> > keeping the current behaviour of making the key columns always accessible
> > by using the names provided to by.x.
> >
> > I've attached a new patch that has this behaviour.
> >
> > Best,
> >
> > Scott
> >
> > On 19 February 2018 at 05:08, Gabriel Becker wrote:
> > > It seems like there is a way that is backwards compatible-ish in the
> > > sense mentioned and still has the (arguably, but a good argument I
> > > think) better behavior:
> > >
> > > if by.x is 'name' (AND by.y is not also 'name'), then x's 'name' column
> > > is called name and y's 'name' column (not used in the merge) is changed
> > > to name.y.
> > >
> > > Now of course this would still change output, but it would change it to
> > > something I think would be better, while retaining the 'merge columns
> > > retain their exact names' mechanic as documented.
> > >
> > > ~G
> > >
> > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie wrote:
> > >> Thanks Duncan and Frederick,
> > >>
> > >> I suspected as much - there doesn't appear to be any reason why
> > >> conflicts between by.x and names(y) shouldn't and cannot be checked,
> > >> but I can see how this might be more trouble than it's worth given it
> > >> potentially may break downstream packages (i.e. any cases where this
> > >> occurs but they expect the name of the key column(s) to remain the
> > >> same).
> > >>
> > >> Best,
> > >>
> > >> Scott
> > >>
> > >> On 18 February 2018 at 11:48, Duncan Murdoch <murdoch.dun...@gmail.com>
> > >> wrote:
> > >> > On 17/02/2018 6:36 PM, frede...@ofb.net wrote:
> > >> >> Hi Scott,
> > >> >>
> > >> >> Thanks for the patch. I'm not really involved in R development; it
> > >> >> will be up to someone in the R core team to apply it. I would hazard
> > >> >> to say that even if correct (I haven't checked), it will not be
> > >> >> applied because the change might break existing code. For example it
> > >> >> seems like reasonable code might easily assume that a column with
> > >> >> the same name as "by.x" exists in the output of 'merge'. That's just
> > >> >> my best guess... I don't participate on here often.
> > >> >
> > >> > I think you're right. If I were still a member of R Core, I would
> > >> > want to test this against all packages on CRAN and Bioconductor, and
> > >> > since that test takes a couple of days to run on my laptop, I'd
> > >> > probably never get around to it.
> > >> >
> > >> > There are lots of cases where "I would have done that differently",
> > >> > but most of them are far too much trouble to change now that R is
> > >> > more than 20 years old. And in many cases it will turn out that the
> > >> > way R does it actually does make more sense than the way I would have
> > >> > done it.
> > >> >
> > >> > Duncan Murdoch
> > >> >
> > >> >> Cheers,
> > >> >>
> > >> >> Frederick
> > >> >>
> > >> >> On Sat, Feb 17, 2018 at 04:42:21PM +1100, Scott Ritchie wrote:
> > >> >>> The attached patch.diff will make merge.data.frame() append the
> > >> >>> suffixes to columns with common names between by.x and names(y).
> > >> >>>
> > >> >>> Best,
> > >> >>>
> > >> >>> Scott Ritchie
> > >> >>>
> > >> >>> On 17 February 2018 at 11:15, Scott Ritchie <
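As a quick illustration of what the added lines do, here is the patch's renaming step run on toy vectors outside of merge.data.frame; `nm.y` stands in for `names(y)` at that point in the function, and "city" is an invented column name.

```
## Toy stand-ins for the objects inside merge.data.frame():
nm.by    <- "name"              # the by.x column name(s)
nm.y     <- c("name", "city")   # y's remaining column names ("city" is made up)
suffixes <- c(".x", ".y")

## The renaming step added by the patch, applied to the toy vectors:
dupe.keyx <- intersect(nm.by, nm.y)
if (length(dupe.keyx)) {
  if (nzchar(suffixes[2L]))
    nm.y[match(dupe.keyx, nm.y, 0L)] <- paste(dupe.keyx, suffixes[2L], sep = "")
}
nm.y
## [1] "name.y" "city"
```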
[Rd] Unwanted behaviour of bw.nrd: sometimes, zero is returned as a valid bandwidth
Dear all,

Sorry if I am posting to the wrong place, but I could not find the link for
registration on the bug tracker, that's why I am writing here.

I think there is an inconsistency between two R functions from the stats
package, bw.nrd0 and bw.nrd. Consider the following vector:

D <- c(0, 1, 1, 1, 1)

bw.nrd(D) returns a zero bandwidth for this object, without even a warning.
Considering the fact that in most cases one divides something by the
bandwidth, it is highly undesirable that this function returns a zero
bandwidth without a warning.

Contrast bw.nrd0: it has failsafes. If the minimum of the SD and IQR/1.34 is
0, it tries three things in turn: setting the base multiplier to the SD, to
|x[1]|, or to 1.

In my opinion, bw.nrd should have either one of these failsafes or a warning
that the vanilla formula equals zero and that the user should try another
method of bandwidth selection (just like bw.ucv warns if the minimum occurred
at one end). That would be better than suddenly discovering NaNs or NAs
without even knowing the possible cause.

--
Bien cordialement, | Yours sincerely,
Andreï V. Kostyrka.
http://kostyrka.ru, http://kostyrka.ru/blog

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
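A short R snippet reproducing the report; the comments restate the behaviour described above rather than new measurements.

```
## bw.nrd() scales min(sd(x), IQR(x)/1.34); for this vector the IQR is 0,
## so the returned bandwidth is 0 and there is no warning.
D <- c(0, 1, 1, 1, 1)
bw.nrd(D)   # 0 -- silently unusable wherever the bandwidth ends up in a denominator

## bw.nrd0() hits the same degenerate minimum but falls back to the SD
## (then |x[1]|, then 1), so it returns a positive bandwidth.
bw.nrd0(D)  # > 0
```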
Re: [Rd] Duplicate column names created by base::merge() when by.x has the same name as a column in y
Hi Scott, I tried the new patch and can confirm that it has the advertised behavior on a couple of test cases. I think it makes sense to apply it, because any existing code which refers to a second duplicate data.frame column by name is already broken, while if the reference is by numerical index then changing the column name shouldn't break it. I don't know if you need to update the documentation as part of your patch, or if whoever applies it would be happy to do that. Somebody from R core want to weigh in on this? I attach a file with the test example from your original email as well as a second test case I added with two "by" columns. Thanks, Frederick On Wed, Feb 21, 2018 at 10:06:21AM +1100, Scott Ritchie wrote: > Hi Frederick, > > It looks like I didn't overwrite the patch.diff file after the last edits. > Here's the correct patch (attached and copied below): > > Index: src/library/base/R/merge.R > === > --- src/library/base/R/merge.R (revision 74280) > +++ src/library/base/R/merge.R (working copy) > @@ -157,6 +157,14 @@ > } > > if(has.common.nms) names(y) <- nm.y > +## If by.x %in% names(y) then duplicate column names still arise, > +## apply suffixes to just y - this keeps backwards compatibility > +## when referring to by.x in the resulting data.frame > +dupe.keyx <- intersect(nm.by, names(y)) > +if(length(dupe.keyx)) { > + if(nzchar(suffixes[2L])) > +names(y)[match(dupe.keyx, names(y), 0L)] <- paste(dupe.keyx, > suffixes[2L], sep="") > +} > nm <- c(names(x), names(y)) > if(any(d <- duplicated(nm))) > if(sum(d) > 1L) > > Best, > > Scott > > On 21 February 2018 at 08:23, wrote: > > > Hi Scott, > > > > I think that's a good idea and I tried your patch on my copy of the > > repository. But it looks to me like the recent patch is identical to > > the previous one, can you confirm this? > > > > Frederick > > > > On Mon, Feb 19, 2018 at 07:19:32AM +1100, Scott Ritchie wrote: > > > Thanks Gabriel, > > > > > > I think your suggested approach is 100% backwards compatible > > > > > > Currently in the case of duplicate column names only the first can be > > > indexed by its name. This will always be the column appearing in by.x, > > > meaning the column in y with the same name cannot be accessed. Appending > > > ".y" (suffixes[2L]) to this column means it can now be accessed, while > > > keeping the current behaviour of making the key columns always accessible > > > by using the names provided to by.x. > > > > > > I've attached a new patch that has this behaviour. > > > > > > Best, > > > > > > Scott > > > > > > On 19 February 2018 at 05:08, Gabriel Becker > > wrote: > > > > > > > It seems like there is a way that is backwards compatible-ish in the > > sense > > > > mentioned and still has the (arguably, but a good argument I think) > > better > > > > behavior: > > > > > > > > if by.x is 'name', (AND by.y is not also 'name'), then x's 'name' > > column > > > > is called name and y's 'name' column (not used int he merge) is > > changed to > > > > name.y. > > > > > > > > Now of course this would still change output, but it would change it to > > > > something I think would be better, while retaining the 'merge columns > > > > retain their exact names' mechanic as documented. 
> > > > > > > > ~G > > > > > > > > On Sat, Feb 17, 2018 at 6:50 PM, Scott Ritchie > > > > wrote: > > > > > > > >> Thanks Duncan and Frederick, > > > >> > > > >> I suspected as much - there doesn't appear to be any reason why > > conflicts > > > >> between by.x and names(y) shouldn't and cannot be checked, but I can > > see > > > >> how this might be more trouble than its worth given it potentially may > > > >> break downstream packages (i.e. any cases where this occurs but they > > > >> expect > > > >> the name of the key column(s) to remain the same). > > > >> > > > >> Best, > > > >> > > > >> Scott > > > >> > > > >> On 18 February 2018 at 11:48, Duncan Murdoch < > > murdoch.dun...@gmail.com> > > > >> wrote: > > > >> > > > >> > On 17/02/2018 6:36 PM, frede...@ofb.net wrote: > > > >> > > > > >> >> Hi Scott, > > > >> >> > > > >> >> Thanks for the patch. I'm not really involved in R development; it > > > >> >> will be up to someone in the R core team to apply it. I would > > hazard > > > >> >> to say that even if correct (I haven't checked), it will not be > > > >> >> applied because the change might break existing code. For example > > it > > > >> >> seems like reasonable code might easily assume that a column with > > the > > > >> >> same name as "by.x" exists in the output of 'merge'. That's just my > > > >> >> best guess... I don't participate on here often. > > > >> >> > > > >> > > > > >> > > > > >> > I think you're right. If I were still a member of R Core, I would > > want > > > >> to > > > >> > test this against all packages on CRAN and Bi
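The test file Frederick mentions is not included in this digest. Purely as an illustration, here is a hedged sketch of what a second test case with two "by" columns along those lines might look like; all names and data are invented, and the "patched" result restates the behaviour the thread reports.

```
## Two join keys; x's key "name" also appears as a non-key column of y.
x <- data.frame(name = c("a", "a", "b"), year = c(1, 2, 1), score = 1:3)
y <- data.frame(id   = c("a", "a", "b"), yr   = c(1, 2, 1), name = c("A", "B", "C"))

m <- merge(x, y, by.x = c("name", "year"), by.y = c("id", "yr"))
names(m)
## unpatched:        "name" "year" "score" "name"
## with the patch:   "name" "year" "score" "name.y"
```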