Re: [Rd] Potential improvements of ave?
> Gabriel Becker > on Mon, 15 Mar 2021 15:08:44 -0700 writes: > Abby, > Vectors do have an internal mechanism for knowing that they are sorted via > ALTREP (it was one of 2 core motivating features for 'smart vectors' the > other being knowledge about presence of NAs). > Currently I don't think we expose it at the R level, though it is part of > the official C API. I don't know of any plans for this to change, but I > suppose it could. Plus for functions in R itself, we could even use it > without exposing it more widely. A number of functions, including sort > itself, already do this in fact, but more could. I'd be interested in > hearing which functions you think would particularly benefit from this. Thank you Gabe. > ~G I vaguely remember (from Luke's docs/presentation on ALTREP) that there are some "missing parts" here. One of them the not-existing R level functionality, another may be the C code below R's is.unsorted() ... maybe is.unsorted() could get a new argument and or be re-written, moving the NA handling also to C and have that happen *after* the C code looks if it's an ALTREP object and if that "knows it's sorted". Martin > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas > wrote: >> Hi Abby, >> >> Thank you for your positive feedback. >> >> I agree for your general comment about sorting. >> >> For ave specifically, ordering may not help because the output must >> maintain the order of the input (as ave returns only x and not the entiere >> data.frame). >> >> Thanks, >> >> Thomas >> >> De : Abby Spurdle >> Envoyé : lundi 15 mars 2021 10:22 >> À : SOEIRO Thomas >> Cc : r-devel@r-project.org >> Objet : Re: [Rd] Potential improvements of ave? >> >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS >> >> Hi Thomas, >> >> These are some great suggestions. >> But I can't help but feel there's a much bigger problem here. >> >> Intuitively, the ave function could (or should) sort the data. >> Then the indexing step becomes almost trivial, in terms of both time >> and space complexity. >> And the ave function is not the only example of where a problem >> becomes much simpler, if the data is sorted. >> >> Historically, I've never found base R functions user-friendly for >> aggregation purposes, or for sorting. >> (At least, not by comparison to SQL). >> >> But that's not the main problem. >> It would seem preferable to sort the data, only once. >> (Rather than sorting it repeatedly, or not at all). >> >> Perhaps, objects such as vectors and data.frame(s) could have a >> boolean attribute, to indicate if they're sorted. >> Or functions such as ave could have a sorted argument. >> In either case, if true, the function assumes the data is sorted and >> applies a more efficient algorithm. >> >> >> B. >> >> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas >> wrote: >> > >> > Dear all, >> > >> > I have two questions/suggestions about ave, but I am not sure if it's >> relevant for bug reports. >> > >> > >> > >> > 1) I have performance issues with ave in a case where I didn't expect >> it. The following code runs as expected: >> > >> > set.seed(1) >> > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> > id2 = sample(1:3, 5e2, TRUE), >> > id3 = sample(1:5, 5e2, TRUE), >> > val = sample(1:300, 5e2, TRUE)) >> > >> > df1$diff <- ave(df1$val, >> > df1$id1, >> > df1$id2, >> > df1$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > head(df1[order(df1$id1, >> >df1$id2, >> >df1$id3), ]) >> > >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot >> allocate vector of size 1110.0 Gb): >> > >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> > id2 = sample(1:3, 5e2 * 1e4, TRUE), >> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> > val = sample(1:300, 5e2 * 1e4, TRUE)) >> > >> > df2$diff <- ave(df2$val, >> > df2$id1, >> > df2$id2, >> > df2$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > This use case does not seem extreme to me (e.g. aggregate et al work >> perfectly on this data.frame). >> > So my question is: Is this expected/intended/reasonable? i.e. Does ave >> need to be optimized? >> > >> > >> > >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TR
Re: [Rd] Potential improvements of ave?
On 16 March 2021 at 10:50, Martin Maechler wrote: | I vaguely remember (from Luke's docs/presentation on ALTREP) | that there are some "missing parts" here. | One of them the not-existing R level functionality, another may be | the C code below R's is.unsorted() ... maybe is.unsorted() | could get a new argument and or be re-written, moving the NA | handling also to C and have that happen *after* the C code looks | if it's an ALTREP object and if that "knows it's sorted". Somewhat related: I recently tried to demonstrate that, say, a 1e2 and a 1e6 vector each with one added NA should have equal "O(1)" time under anyNA(), yet apparently they do not---so the other magic bit (about knowing NAness) may not propagate. (That was with R 4.0.4.) I may also have missed a crucial bit about enabling ALTREPness here, yet the "gist of it" always was (AFAIK) that it is supposedly seamless. Dirk -- https://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R.sh and argument escaping
Hello R-devel! The following sequence of commands results in an error message on a POSIX system: tab="`echo -ne "\t"`" LC_ALL=C Rscript -e " $tab 1" # ARGUMENT '~+~1' __ignored__ Tabs can sneak into the -e argument from indented multi-line arguments in shell scripts: Rscript -e ' foo() bar() ... ' R.sh does a good job of escaping spaces and newlines, but since shells are also supposed to split on a tab [*], it's a good idea to escape tabs too: Index: src/scripts/R.sh.in === --- src/scripts/R.sh.in (revision 80090) +++ src/scripts/R.sh.in (working copy) @@ -192,7 +192,7 @@ -e) if test -n "`echo ${2} | ${SED} 's/^-.*//'`"; then a=`(echo "${2}" && echo) | ${SED} -e 's/ /~+~/g' | \ - ${SED} -e :a -e N -e '$!ba' -e 's/\n/~n~/g' -e 's/~n~$//g'` + ${SED} -e :a -e N -e '$!ba' -e 's/\n/~n~/g' -e 's/~n~$//g' -e 's/\t/~t~/g'` shift else error "option '${1}' requires a non-empty argument" Index: src/unix/system.c === --- src/unix/system.c (revision 80090) +++ src/unix/system.c (working copy) @@ -170,6 +170,9 @@ } else if(*q == '~' && *(q+1) == 'n' && *(q+2) == '~') { q += 2; *p++ = '\n'; + } else if(*q == '~' && *(q+1) == 't' && *(q+2) == '~') { + q += 2; + *p++ = '\t'; } else *p++ = *q; } return p; I have verified that with the patch above, Rscript -e " $tab 1" no longer fails. While we're at it, perhaps it could be a good idea to replace the magic number 1 with a the size of the character array above it: Index: src/unix/system.c === --- src/unix/system.c (revision 80090) +++ src/unix/system.c (working copy) @@ -429,7 +432,7 @@ } else if(!strcmp(*av, "-e")) { ac--; av++; Rp->R_Interactive = FALSE; - if(strlen(cmdlines) + strlen(*av) + 2 <= 1) { + if(strlen(cmdlines) + strlen(*av) + 2 <= sizeof(cmdlines)) { char *p = cmdlines+strlen(cmdlines); p = unescape_arg(p, *av); *p++ = '\n'; *p = '\0'; It might also be a good idea to make it possible to represent the escape sequences themselves in the unescaped stream in a fully reversible transformation ('~' <-> '~~~', ' ' <-> '~+~', '\n' <-> '~n~', '\t' <-> '~t~'), making it possible to round-trip character sequences like '~+~' through the escaping and unescaping process (thankfully, '~+~' is not frequently needed in R programs), though expressing that as a sed command is beyond me. Right now, Rscript -e '"~+~"' doesn't print "~+~". Perhaps the bigger question to ask is whether this escaping is unavoidable. Is it documented? Since the args variable is only appended (not prepended), it is likely possible to rewrite the 'while test -n "${1}"; do' loop in terms of 'set -- "$@" ...', which is POSIX-compatible and doesn't require any escaping: set -- "${@}" dummy # append one argument to skip it later for arg in "${@}"; do # it's safe to modify $@ in the for loop [**] # TODO: on first iteration only, empty the $@ and don't check $prev_arg case "${prev_arg}" in # ... -g|--gui) if test -n "`echo "${arg}" | ${SED} 's/^-.*//'`"; then gui="${arg}" set -- "${@}" "${prev_arg}" "${arg}" else error "option '${prev_arg}' requires an argument" fi ;; # ... -e) if ! test -n "`echo "${arg}" | ${SED} 's/^-.*//'`"; then error "option '${prev_arg}' requires a non-empty argument" fi set -- "${@}" -e "${arg}" ;; # ... esac prev_arg="${arg}" # no shift needed done # Later: use "${@}" instead of ${args} Or is it documented behaviour that arguments following an empty argument are not escaped by the shell script but are passed to "${R_HOME}/bin/exec${R_ARCH}/R"? LC_ALL=C R -q -e " $tab 1" # ARGUMENT '~+~1' __ignored__ # # > # > # > LC_ALL=C R '' -q -e " $tab 1" # ARGUMENT '' __ignored__ # # > 1 # [1] 1 # > # > -- Best regards, Ivan [*] https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_05 [**] "First, the list of words following in shall be expanded to generate a list of items..." https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_09_04_03 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Undefined (so far as I can tell?) behavior of browser when called at top level of sourced script?
Hi all, I was asked a question about why browser() was behaving a specific way, and it turned out that it was being called in a script (rather than in a function). Putting aside the design considerations that lead to that, the behavior is actually a bit puzzling, and so far as I have been able to see, completely undocumented. My suspicion is that this behavior should be considered undefined, but I wanted to make sure I wasn't missing something. (To be perfectly honest I was a bit surprised it wasn't an error). Some experimentation (done in 4.0.1 because that is what I have available, R script attached) has lead me to conclude that if browser is called at the top level, 'n' will just continue to the end, *except* in the case where the next expression is a conditional ** which has a consequent that is evaluated** or a loop, in which case it walks through the consequent/loop body however many number of times and then the 'n' that steps out of that continues on. Should something be added to the documentation that either describes this behavior, declares explicitly that using browser at the top level leads to undefined behavior, or both? I can prepare a patch to that effect if desired. ~G __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Potential improvements of ave?
There are some relatively obvious examples: unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split Also, many timeseries, graphics and spline functions are dependent on the order. In the case of data.frame(s), a boolean flag would probably need to be extended to allow for multiple column sorting, and ascending/descending options. On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker wrote: > > Abby, > > Vectors do have an internal mechanism for knowing that they are sorted via > ALTREP (it was one of 2 core motivating features for 'smart vectors' the > other being knowledge about presence of NAs). > > Currently I don't think we expose it at the R level, though it is part of the > official C API. I don't know of any plans for this to change, but I suppose > it could. Plus for functions in R itself, we could even use it without > exposing it more widely. A number of functions, including sort itself, > already do this in fact, but more could. I'd be interested in hearing which > functions you think would particularly benefit from this. > > ~G > > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas wrote: >> >> Hi Abby, >> >> Thank you for your positive feedback. >> >> I agree for your general comment about sorting. >> >> For ave specifically, ordering may not help because the output must maintain >> the order of the input (as ave returns only x and not the entiere >> data.frame). >> >> Thanks, >> >> Thomas >> >> De : Abby Spurdle >> Envoyé : lundi 15 mars 2021 10:22 >> À : SOEIRO Thomas >> Cc : r-devel@r-project.org >> Objet : Re: [Rd] Potential improvements of ave? >> >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS >> >> Hi Thomas, >> >> These are some great suggestions. >> But I can't help but feel there's a much bigger problem here. >> >> Intuitively, the ave function could (or should) sort the data. >> Then the indexing step becomes almost trivial, in terms of both time >> and space complexity. >> And the ave function is not the only example of where a problem >> becomes much simpler, if the data is sorted. >> >> Historically, I've never found base R functions user-friendly for >> aggregation purposes, or for sorting. >> (At least, not by comparison to SQL). >> >> But that's not the main problem. >> It would seem preferable to sort the data, only once. >> (Rather than sorting it repeatedly, or not at all). >> >> Perhaps, objects such as vectors and data.frame(s) could have a >> boolean attribute, to indicate if they're sorted. >> Or functions such as ave could have a sorted argument. >> In either case, if true, the function assumes the data is sorted and >> applies a more efficient algorithm. >> >> >> B. >> >> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas wrote: >> > >> > Dear all, >> > >> > I have two questions/suggestions about ave, but I am not sure if it's >> > relevant for bug reports. >> > >> > >> > >> > 1) I have performance issues with ave in a case where I didn't expect it. >> > The following code runs as expected: >> > >> > set.seed(1) >> > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> > id2 = sample(1:3, 5e2, TRUE), >> > id3 = sample(1:5, 5e2, TRUE), >> > val = sample(1:300, 5e2, TRUE)) >> > >> > df1$diff <- ave(df1$val, >> > df1$id1, >> > df1$id2, >> > df1$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > head(df1[order(df1$id1, >> >df1$id2, >> >df1$id3), ]) >> > >> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot >> > allocate vector of size 1110.0 Gb): >> > >> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> > id2 = sample(1:3, 5e2 * 1e4, TRUE), >> > id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> > val = sample(1:300, 5e2 * 1e4, TRUE)) >> > >> > df2$diff <- ave(df2$val, >> > df2$id1, >> > df2$id2, >> > df2$id3, >> > FUN = function(i) c(diff(i), 0)) >> > >> > This use case does not seem extreme to me (e.g. aggregate et al work >> > perfectly on this data.frame). >> > So my question is: Is this expected/intended/reasonable? i.e. Does ave >> > need to be optimized? >> > >> > >> > >> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to >> > avoid warnings in case of unused levels >> > (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$ >> > ). >> > Is it relevant/possible to expose the drop argument explicitly? >> > >> > >> > >> > Thanks, >> > >> > Thomas >> > __ >> > R-devel@r-project.org mailing list >> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6L
Re: [Rd] Potential improvements of ave?
Dear all, Thank you for your consideration on this topic. I do not have enough knowledge of R internals to join the discussion about sorting mechanisms. In fact, I did not get how ordering could help for ave as the output must maintain the order of the input (because ave returns only x and not the entiere data.frame). However, while the proposed workaround (i.e. paste0 instead of interaction, cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solves the "bigger problem" of sorting, it is usable as is and solves the issue. Therefore, what do you think about it? (i.e is it relevant for a patch?) Thanks, Thomas > > De : Abby Spurdle > Envoyé : lundi 15 mars 2021 10:22 > À : SOEIRO Thomas > Cc : r-devel@r-project.org > Objet : Re: [Rd] Potential improvements of ave? > > Hi Thomas, > > These are some great suggestions. > But I can't help but feel there's a much bigger problem here. > > Intuitively, the ave function could (or should) sort the data. > Then the indexing step becomes almost trivial, in terms of both time > and space complexity. > And the ave function is not the only example of where a problem > becomes much simpler, if the data is sorted. > > Historically, I've never found base R functions user-friendly for > aggregation purposes, or for sorting. > (At least, not by comparison to SQL). > > But that's not the main problem. > It would seem preferable to sort the data, only once. > (Rather than sorting it repeatedly, or not at all). > > Perhaps, objects such as vectors and data.frame(s) could have a > boolean attribute, to indicate if they're sorted. > Or functions such as ave could have a sorted argument. > In either case, if true, the function assumes the data is sorted and > applies a more efficient algorithm. > > > B. > > > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas wrote: >> >> Dear all, >> >> I have two questions/suggestions about ave, but I am not sure if it's >> relevant for bug reports. >> >> >> >> 1) I have performance issues with ave in a case where I didn't expect it. >> The following code runs as expected: >> >> set.seed(1) >> >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), >> id2 = sample(1:3, 5e2, TRUE), >> id3 = sample(1:5, 5e2, TRUE), >> val = sample(1:300, 5e2, TRUE)) >> >> df1$diff <- ave(df1$val, >> df1$id1, >> df1$id2, >> df1$id3, >> FUN = function(i) c(diff(i), 0)) >> >> head(df1[order(df1$id1, >>df1$id2, >>df1$id3), ]) >> >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate >> vector of size 1110.0 Gb): >> >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), >> id2 = sample(1:3, 5e2 * 1e4, TRUE), >> id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), >> val = sample(1:300, 5e2 * 1e4, TRUE)) >> >> df2$diff <- ave(df2$val, >> df2$id1, >> df2$id2, >> df2$id3, >> FUN = function(i) c(diff(i), 0)) >> >> This use case does not seem extreme to me (e.g. aggregate et al work >> perfectly on this data.frame). >> So my question is: Is this expected/intended/reasonable? i.e. Does ave need >> to be optimized? >> >> >> >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to >> avoid warnings in case of unused levels >> (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). >> Is it relevant/possible to expose the drop argument explicitly? >> >> >> >> Thanks, >> >> Thomas __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Potential improvements of ave?
Hi Abby, I actually have a patch submitted that does this for unique/duplicated (only numeric cases I think) but it is, as patches from external contributors go, quite sizable which means it requires a correspondingly large amount of an R-core member's time and energy to vet and consider. It is in the queue, and so, I expect (/hope, provided I didn't make a mistake) it will be incorporated at some point. ( https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17993) You are correct that the speedups are quite significant for calling unique/duplicated on large vectors that know they are sorted: Speedup on my machine for a fairly sizable vector (length 1e7) ranges from about ~10x in the densely duplicated case up to ~60-70x in the sparsely duplicated case for duplicated(). For unique() it seems to range from ~10x in the densely duplicated case to ~15 in the spare case. I had thought that min and max already did this, but looking now, they don't seem to by default, thought ALTREP classes themselves do have an option of setting a min/max method, which would be hit. That does seem like low-hanging fruit, I agree, though in many cases the slow down from a single pass over the data to get a min probably isn't earthshattering. The others do seem like they could benefit as well. Best, ~G On Tue, Mar 16, 2021 at 2:54 PM Abby Spurdle wrote: > There are some relatively obvious examples: > unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split > > Also, many timeseries, graphics and spline functions are dependent on the > order. > > In the case of data.frame(s), a boolean flag would probably need to be > extended to allow for multiple column sorting, and > ascending/descending options. > > On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker > wrote: > > > > Abby, > > > > Vectors do have an internal mechanism for knowing that they are sorted > via ALTREP (it was one of 2 core motivating features for 'smart vectors' > the other being knowledge about presence of NAs). > > > > Currently I don't think we expose it at the R level, though it is part > of the official C API. I don't know of any plans for this to change, but I > suppose it could. Plus for functions in R itself, we could even use it > without exposing it more widely. A number of functions, including sort > itself, already do this in fact, but more could. I'd be interested in > hearing which functions you think would particularly benefit from this. > > > > ~G > > > > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas > wrote: > >> > >> Hi Abby, > >> > >> Thank you for your positive feedback. > >> > >> I agree for your general comment about sorting. > >> > >> For ave specifically, ordering may not help because the output must > maintain the order of the input (as ave returns only x and not the entiere > data.frame). > >> > >> Thanks, > >> > >> Thomas > >> > >> De : Abby Spurdle > >> Envoyé : lundi 15 mars 2021 10:22 > >> À : SOEIRO Thomas > >> Cc : r-devel@r-project.org > >> Objet : Re: [Rd] Potential improvements of ave? > >> > >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS > >> > >> Hi Thomas, > >> > >> These are some great suggestions. > >> But I can't help but feel there's a much bigger problem here. > >> > >> Intuitively, the ave function could (or should) sort the data. > >> Then the indexing step becomes almost trivial, in terms of both time > >> and space complexity. > >> And the ave function is not the only example of where a problem > >> becomes much simpler, if the data is sorted. > >> > >> Historically, I've never found base R functions user-friendly for > >> aggregation purposes, or for sorting. > >> (At least, not by comparison to SQL). > >> > >> But that's not the main problem. > >> It would seem preferable to sort the data, only once. > >> (Rather than sorting it repeatedly, or not at all). > >> > >> Perhaps, objects such as vectors and data.frame(s) could have a > >> boolean attribute, to indicate if they're sorted. > >> Or functions such as ave could have a sorted argument. > >> In either case, if true, the function assumes the data is sorted and > >> applies a more efficient algorithm. > >> > >> > >> B. > >> > >> > >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas > wrote: > >> > > >> > Dear all, > >> > > >> > I have two questions/suggestions about ave, but I am not sure if it's > relevant for bug reports. > >> > > >> > > >> > > >> > 1) I have performance issues with ave in a case where I didn't expect > it. The following code runs as expected: > >> > > >> > set.seed(1) > >> > > >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > >> > id2 = sample(1:3, 5e2, TRUE), > >> > id3 = sample(1:5, 5e2, TRUE), > >> > val = sample(1:300, 5e2, TRUE)) > >> > > >> > df1$diff <- ave(df1$val, > >> > df1$id1, > >> > df1$id2, > >> > df1$id3, > >> > FUN = function(i) c(
Re: [Rd] Potential improvements of ave?
Your proposed change (roughly, replacing interaction() by unique(paste())) slows down ave() considerably when there are long columns with lots of repeated rows. I think that interaction(drop=TRUE, ...) can be changed to use less memory and be faster by making a separate branch for drop=TRUE that uses the following idiom for finding the unique rows in a data.frame: new.duplicated.data.frame <- function (x, incomparables = FALSE, fromLast = FALSE, ...) { dup <- !logical(nrow(x)) # all entries considered duplicated until proven otherwise for(column in x) { dup <- dup & duplicated(column, incomparables = incomparables, fromLast = fromLast) } dup } ave() could use the above directly or it could call interaction(drop=TRUE,...). On Tue, Mar 16, 2021 at 3:50 PM SOEIRO Thomas wrote: > > Dear all, > > Thank you for your consideration on this topic. > > I do not have enough knowledge of R internals to join the discussion about > sorting mechanisms. In fact, I did not get how ordering could help for ave as > the output must maintain the order of the input (because ave returns only x > and not the entiere data.frame). > > However, while the proposed workaround (i.e. paste0 instead of interaction, > cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not > solves the "bigger problem" of sorting, it is usable as is and solves the > issue. Therefore, what do you think about it? (i.e is it relevant for a > patch?) > > Thanks, > > Thomas > > > > > > De : Abby Spurdle > > Envoyé : lundi 15 mars 2021 10:22 > > À : SOEIRO Thomas > > Cc : r-devel@r-project.org > > Objet : Re: [Rd] Potential improvements of ave? > > > > Hi Thomas, > > > > These are some great suggestions. > > But I can't help but feel there's a much bigger problem here. > > > > Intuitively, the ave function could (or should) sort the data. > > Then the indexing step becomes almost trivial, in terms of both time > > and space complexity. > > And the ave function is not the only example of where a problem > > becomes much simpler, if the data is sorted. > > > > Historically, I've never found base R functions user-friendly for > > aggregation purposes, or for sorting. > > (At least, not by comparison to SQL). > > > > But that's not the main problem. > > It would seem preferable to sort the data, only once. > > (Rather than sorting it repeatedly, or not at all). > > > > Perhaps, objects such as vectors and data.frame(s) could have a > > boolean attribute, to indicate if they're sorted. > > Or functions such as ave could have a sorted argument. > > In either case, if true, the function assumes the data is sorted and > > applies a more efficient algorithm. > > > > > > B. > > > > > > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas > > wrote: > >> > >> Dear all, > >> > >> I have two questions/suggestions about ave, but I am not sure if it's > >> relevant for bug reports. > >> > >> > >> > >> 1) I have performance issues with ave in a case where I didn't expect it. > >> The following code runs as expected: > >> > >> set.seed(1) > >> > >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE), > >> id2 = sample(1:3, 5e2, TRUE), > >> id3 = sample(1:5, 5e2, TRUE), > >> val = sample(1:300, 5e2, TRUE)) > >> > >> df1$diff <- ave(df1$val, > >> df1$id1, > >> df1$id2, > >> df1$id3, > >> FUN = function(i) c(diff(i), 0)) > >> > >> head(df1[order(df1$id1, > >>df1$id2, > >>df1$id3), ]) > >> > >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot > >> allocate vector of size 1110.0 Gb): > >> > >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE), > >> id2 = sample(1:3, 5e2 * 1e4, TRUE), > >> id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE), > >> val = sample(1:300, 5e2 * 1e4, TRUE)) > >> > >> df2$diff <- ave(df2$val, > >> df2$id1, > >> df2$id2, > >> df2$id3, > >> FUN = function(i) c(diff(i), 0)) > >> > >> This use case does not seem extreme to me (e.g. aggregate et al work > >> perfectly on this data.frame). > >> So my question is: Is this expected/intended/reasonable? i.e. Does ave > >> need to be optimized? > >> > >> > >> > >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to > >> avoid warnings in case of unused levels > >> (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html). > >> Is it relevant/possible to expose the drop argument explicitly? > >> > >> > >> > >> Thanks, > >> > >> Thomas > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Faster sorting algorithm...
Those interested in faster sorting may want to look at the merge sort implemented in pqR (see pqR-project.org). It's often used as the default, because it is stable, and does different collations, while being faster than shell sort (except for small vectors). Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2, compiled identically: - pqR-2020-07-23 in C locale: > set.seed(1) > N <- 100 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) user system elapsed 1.332 0.000 1.334 > print(system.time (or <- order(x,method="radix"))) user system elapsed 0.092 0.004 0.096 > print(system.time (om <- order(x,method="merge"))) user system elapsed 0.363 0.000 0.363 > print(identical(os,or)) [1] TRUE > print(identical(os,om)) [1] TRUE > > x <- c("a","~") > print(order(x,method="shell")) [1] 1 2 > print(order(x,method="radix")) [1] 1 2 > print(order(x,method="merge")) [1] 1 2 - R-4.0.2 in C locale: > set.seed(1) > N <- 100 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) user system elapsed 2.381 0.004 2.387 > print(system.time (or <- order(x,method="radix"))) user system elapsed 0.138 0.000 0.137 > #print(system.time (om <- order(x,method="merge"))) > print(identical(os,or)) [1] TRUE > #print(identical(os,om)) > > x <- c("a","~") > print(order(x,method="shell")) [1] 1 2 > print(order(x,method="radix")) [1] 1 2 > #print(order(x,method="merge")) pqR-2020-07-23 in fr_CA.utf8 locale: > set.seed(1) > N <- 100 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) utilisateur système écoulé 2.960 0.000 2.962 > print(system.time (or <- order(x,method="radix"))) utilisateur système écoulé 0.083 0.008 0.092 > print(system.time (om <- order(x,method="merge"))) utilisateur système écoulé 1.143 0.000 1.142 > print(identical(os,or)) [1] TRUE > print(identical(os,om)) [1] TRUE > > x <- c("a","~") > print(order(x,method="shell")) [1] 2 1 > print(order(x,method="radix")) [1] 1 2 > print(order(x,method="merge")) [1] 2 1 R-4.0.2 in fr_CA.utf8 locale: > set.seed(1) > N <- 100 > x <- as.character (sample(N,N,replace=TRUE)) > print(system.time (os <- order(x,method="shell"))) utilisateur système écoulé 4.222 0.016 4.239 > print(system.time (or <- order(x,method="radix"))) utilisateur système écoulé 0.136 0.000 0.137 > #print(system.time (om <- order(x,method="merge"))) > print(identical(os,or)) [1] TRUE > #print(identical(os,om)) > > x <- c("a","~") > print(order(x,method="shell")) [1] 2 1 > print(order(x,method="radix")) [1] 1 2 > #print(order(x,method="merge")) pqR is faster in all the tests, but more relevant to this discussion is that the "merge" method is substantially faster than "shell" for these character vectors with a million elements, while retaining the stability and collation properties of "shell" (whereas "radix" only does C collation). It would probably not be too hard to take the merge sort code from pqR and use it in R core's implementation. The merge sort in pqR doesn't exploit parallelism at the moment, but merge sort is potentially quite parallelizable (though I think the storage allocation strategy I use would have to be modified). Regards, Radford Neal __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel