Re: [Rd] Potential improvements of ave?

2021-03-16 Thread Martin Maechler
> Gabriel Becker 
> on Mon, 15 Mar 2021 15:08:44 -0700 writes:

> Abby,
> Vectors do have an internal mechanism for knowing that they are sorted via
> ALTREP (it was one of 2 core motivating features for 'smart vectors' the
> other being knowledge about presence of NAs).

> Currently I don't think we expose it at the R level, though it is part of
> the official C API. I don't know of any plans for this to change, but I
> suppose it could. Plus for functions in R itself, we could even use it
> without exposing it more widely. A number of functions, including sort
> itself, already do this in fact, but more could. I'd be interested in
> hearing which functions you think would particularly benefit from this.

Thank you Gabe.

> ~G

I vaguely remember (from Luke's docs/presentation on ALTREP)
that there are some "missing parts" here.
One of them the not-existing R level functionality, another may be
the C code below R's  is.unsorted()  ... maybe  is.unsorted()
could get a new argument and or be re-written, moving  the NA
handling also to C and have that happen *after* the C code looks
if it's an ALTREP object and if that "knows it's sorted".

Martin


> On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas 
> wrote:

>> Hi Abby,
>> 
>> Thank you for your positive feedback.
>> 
>> I agree for your general comment about sorting.
>> 
>> For ave specifically, ordering may not help because the output must
>> maintain the order of the input (as ave returns only x and not the 
entiere
>> data.frame).
>> 
>> Thanks,
>> 
>> Thomas
>> 
>> De : Abby Spurdle 
>> Envoyé : lundi 15 mars 2021 10:22
>> À : SOEIRO Thomas
>> Cc : r-devel@r-project.org
>> Objet : Re: [Rd] Potential improvements of ave?
>> 
>> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
>> 
>> Hi Thomas,
>> 
>> These are some great suggestions.
>> But I can't help but feel there's a much bigger problem here.
>> 
>> Intuitively, the ave function could (or should) sort the data.
>> Then the indexing step becomes almost trivial, in terms of both time
>> and space complexity.
>> And the ave function is not the only example of where a problem
>> becomes much simpler, if the data is sorted.
>> 
>> Historically, I've never found base R functions user-friendly for
>> aggregation purposes, or for sorting.
>> (At least, not by comparison to SQL).
>> 
>> But that's not the main problem.
>> It would seem preferable to sort the data, only once.
>> (Rather than sorting it repeatedly, or not at all).
>> 
>> Perhaps, objects such as vectors and data.frame(s) could have a
>> boolean attribute, to indicate if they're sorted.
>> Or functions such as ave could have a sorted argument.
>> In either case, if true, the function assumes the data is sorted and
>> applies a more efficient algorithm.
>> 
>> 
>> B.
>> 
>> 
>> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas 
>> wrote:
>> >
>> > Dear all,
>> >
>> > I have two questions/suggestions about ave, but I am not sure if it's
>> relevant for bug reports.
>> >
>> >
>> >
>> > 1) I have performance issues with ave in a case where I didn't expect
>> it. The following code runs as expected:
>> >
>> > set.seed(1)
>> >
>> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>> >   id2 = sample(1:3, 5e2, TRUE),
>> >   id3 = sample(1:5, 5e2, TRUE),
>> >   val = sample(1:300, 5e2, TRUE))
>> >
>> > df1$diff <- ave(df1$val,
>> > df1$id1,
>> > df1$id2,
>> > df1$id3,
>> > FUN = function(i) c(diff(i), 0))
>> >
>> > head(df1[order(df1$id1,
>> >df1$id2,
>> >df1$id3), ])
>> >
>> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot
>> allocate vector of size 1110.0 Gb):
>> >
>> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>> >   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>> >   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>> >   val = sample(1:300, 5e2 * 1e4, TRUE))
>> >
>> > df2$diff <- ave(df2$val,
>> > df2$id1,
>> > df2$id2,
>> > df2$id3,
>> > FUN = function(i) c(diff(i), 0))
>> >
>> > This use case does not seem extreme to me (e.g. aggregate et al work
>> perfectly on this data.frame).
>> > So my question is: Is this expected/intended/reasonable? i.e. Does ave
>> need to be optimized?
>> >
>> >
>> >
>> > 2) Gabor Grothendieck pointed out in 2011 that drop = TR

Re: [Rd] Potential improvements of ave?

2021-03-16 Thread Dirk Eddelbuettel


On 16 March 2021 at 10:50, Martin Maechler wrote:
| I vaguely remember (from Luke's docs/presentation on ALTREP)
| that there are some "missing parts" here.
| One of them the not-existing R level functionality, another may be
| the C code below R's  is.unsorted()  ... maybe  is.unsorted()
| could get a new argument and or be re-written, moving  the NA
| handling also to C and have that happen *after* the C code looks
| if it's an ALTREP object and if that "knows it's sorted".

Somewhat related: I recently tried to demonstrate that, say, a 1e2 and a 1e6
vector each with one added NA should have equal "O(1)" time under anyNA(),
yet apparently they do not---so the other magic bit (about knowing NAness)
may not propagate. (That was with R 4.0.4.) I may also have missed a crucial
bit about enabling ALTREPness here, yet the "gist of it" always was (AFAIK)
that it is supposedly seamless.

Dirk

-- 
https://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] R.sh and argument escaping

2021-03-16 Thread Ivan Krylov
Hello R-devel!

The following sequence of commands results in an error message on a
POSIX system:

tab="`echo -ne "\t"`"
LC_ALL=C Rscript -e " $tab 1"
# ARGUMENT '~+~1' __ignored__

Tabs can sneak into the -e argument from indented multi-line arguments
in shell scripts:

Rscript -e '
foo()
bar()
...
'

R.sh does a good job of escaping spaces and newlines, but since shells
are also supposed to split on a tab [*], it's a good idea to escape
tabs too:

Index: src/scripts/R.sh.in
===
--- src/scripts/R.sh.in (revision 80090)
+++ src/scripts/R.sh.in (working copy)
@@ -192,7 +192,7 @@
 -e)
   if test -n "`echo ${2} | ${SED} 's/^-.*//'`"; then
a=`(echo "${2}" && echo) | ${SED} -e 's/ /~+~/g' | \
-  ${SED} -e :a -e N -e '$!ba' -e 's/\n/~n~/g' -e 's/~n~$//g'`
+  ${SED} -e :a -e N -e '$!ba' -e 's/\n/~n~/g' -e 's/~n~$//g' -e 
's/\t/~t~/g'`
 shift
   else
error "option '${1}' requires a non-empty argument"
Index: src/unix/system.c
===
--- src/unix/system.c   (revision 80090)
+++ src/unix/system.c   (working copy)
@@ -170,6 +170,9 @@
} else if(*q == '~' && *(q+1) == 'n' && *(q+2) == '~') {
q += 2;
*p++ = '\n';
+   } else if(*q == '~' && *(q+1) == 't' && *(q+2) == '~') {
+   q += 2;
+   *p++ = '\t';
} else *p++ = *q;
 }
 return p;

I have verified that with the patch above, Rscript -e " $tab 1" no
longer fails.

While we're at it, perhaps it could be a good idea to replace the magic
number 1 with a the size of the character array above it:

Index: src/unix/system.c
===
--- src/unix/system.c   (revision 80090)
+++ src/unix/system.c   (working copy)
@@ -429,7 +432,7 @@
} else if(!strcmp(*av, "-e")) {
ac--; av++;
Rp->R_Interactive = FALSE;
-   if(strlen(cmdlines) + strlen(*av) + 2 <= 1) {
+   if(strlen(cmdlines) + strlen(*av) + 2 <= sizeof(cmdlines)) {
char *p = cmdlines+strlen(cmdlines);
p = unescape_arg(p, *av);
*p++ = '\n'; *p = '\0';

It might also be a good idea to make it possible to represent the escape
sequences themselves in the unescaped stream in a fully reversible
transformation ('~' <-> '~~~', ' ' <-> '~+~', '\n' <-> '~n~',
'\t' <-> '~t~'), making it possible to round-trip character sequences
like '~+~' through the escaping and unescaping process (thankfully,
'~+~' is not frequently needed in R programs), though expressing that
as a sed command is beyond me. Right now, Rscript -e '"~+~"' doesn't
print "~+~".

Perhaps the bigger question to ask is whether this escaping is
unavoidable. Is it documented? Since the args variable is only appended
(not prepended), it is likely possible to rewrite the
'while test -n "${1}"; do' loop in terms of 'set -- "$@" ...', which is
POSIX-compatible and doesn't require any escaping:

set -- "${@}" dummy # append one argument to skip it later

for arg in "${@}"; do # it's safe to modify $@ in the for loop [**]
  # TODO: on first iteration only, empty the $@ and don't check $prev_arg
  case "${prev_arg}" in
# ...
-g|--gui)
  if test -n "`echo "${arg}" | ${SED} 's/^-.*//'`"; then
gui="${arg}"
set -- "${@}" "${prev_arg}" "${arg}"
  else
error "option '${prev_arg}' requires an argument"
  fi
  ;;
# ...
-e)
  if ! test -n "`echo "${arg}" | ${SED} 's/^-.*//'`"; then
error "option '${prev_arg}' requires a non-empty argument"
  fi
  set -- "${@}" -e "${arg}"
  ;;
# ...
  esac
  prev_arg="${arg}"
  # no shift needed
done

# Later: use "${@}" instead of ${args}

Or is it documented behaviour that arguments following an empty
argument are not escaped by the shell script but are passed to
"${R_HOME}/bin/exec${R_ARCH}/R"?

LC_ALL=C R -q -e " $tab 1"
# ARGUMENT '~+~1' __ignored__
# 
# >
# >
# >
LC_ALL=C R '' -q -e " $tab 1"
# ARGUMENT '' __ignored__
# 
# > 1
# [1] 1
# >
# >


-- 
Best regards,
Ivan

[*]
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_06_05

[**]
"First, the list of words following in shall be expanded to generate a
list of items..."
https://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_09_04_03

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Undefined (so far as I can tell?) behavior of browser when called at top level of sourced script?

2021-03-16 Thread Gabriel Becker
Hi all,

I was asked a question about why browser() was behaving a specific way, and
it turned out that it was being called in a script (rather than in a
function).

Putting aside the design considerations that lead to that, the behavior is
actually a bit puzzling, and so far as I have been able to see, completely
undocumented. My suspicion is that this behavior should be considered
undefined, but I wanted to make sure I wasn't missing something. (To be
perfectly honest I was a bit surprised it wasn't an error).

Some experimentation (done in 4.0.1 because that is what I have available,
R script attached) has lead me to conclude that if browser is called at the
top level, 'n' will just continue to the end, *except* in the case where
the next expression is a conditional ** which has a consequent that is
evaluated** or a loop, in which case it walks through the consequent/loop
body however many number of times and then the 'n' that steps out of that
continues on.

Should something be added to the documentation that either describes this
behavior, declares explicitly that using browser at the top level leads to
undefined behavior, or both?

I can prepare a patch to that effect if desired.

~G
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Potential improvements of ave?

2021-03-16 Thread Abby Spurdle
There are some relatively obvious examples:
unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split

Also, many timeseries, graphics and spline functions are dependent on the order.

In the case of data.frame(s), a boolean flag would probably need to be
extended to allow for multiple column sorting, and
ascending/descending options.

On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker  wrote:
>
> Abby,
>
> Vectors do have an internal mechanism for knowing that they are sorted via 
> ALTREP (it was one of 2 core motivating features for 'smart vectors' the 
> other being knowledge about presence of NAs).
>
> Currently I don't think we expose it at the R level, though it is part of the 
> official C API. I don't know of any plans for this to change, but I suppose 
> it could. Plus for functions in R itself, we could even use it without 
> exposing it more widely. A number of functions, including sort itself, 
> already do this in fact, but more could. I'd be interested in hearing which 
> functions you think would particularly benefit from this.
>
> ~G
>
> On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas  wrote:
>>
>> Hi Abby,
>>
>> Thank you for your positive feedback.
>>
>> I agree for your general comment about sorting.
>>
>> For ave specifically, ordering may not help because the output must maintain 
>> the order of the input (as ave returns only x and not the entiere 
>> data.frame).
>>
>> Thanks,
>>
>> Thomas
>> 
>> De : Abby Spurdle 
>> Envoyé : lundi 15 mars 2021 10:22
>> À : SOEIRO Thomas
>> Cc : r-devel@r-project.org
>> Objet : Re: [Rd] Potential improvements of ave?
>>
>> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
>>
>> Hi Thomas,
>>
>> These are some great suggestions.
>> But I can't help but feel there's a much bigger problem here.
>>
>> Intuitively, the ave function could (or should) sort the data.
>> Then the indexing step becomes almost trivial, in terms of both time
>> and space complexity.
>> And the ave function is not the only example of where a problem
>> becomes much simpler, if the data is sorted.
>>
>> Historically, I've never found base R functions user-friendly for
>> aggregation purposes, or for sorting.
>> (At least, not by comparison to SQL).
>>
>> But that's not the main problem.
>> It would seem preferable to sort the data, only once.
>> (Rather than sorting it repeatedly, or not at all).
>>
>> Perhaps, objects such as vectors and data.frame(s) could have a
>> boolean attribute, to indicate if they're sorted.
>> Or functions such as ave could have a sorted argument.
>> In either case, if true, the function assumes the data is sorted and
>> applies a more efficient algorithm.
>>
>>
>> B.
>>
>>
>> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas  wrote:
>> >
>> > Dear all,
>> >
>> > I have two questions/suggestions about ave, but I am not sure if it's 
>> > relevant for bug reports.
>> >
>> >
>> >
>> > 1) I have performance issues with ave in a case where I didn't expect it. 
>> > The following code runs as expected:
>> >
>> > set.seed(1)
>> >
>> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>> >   id2 = sample(1:3, 5e2, TRUE),
>> >   id3 = sample(1:5, 5e2, TRUE),
>> >   val = sample(1:300, 5e2, TRUE))
>> >
>> > df1$diff <- ave(df1$val,
>> > df1$id1,
>> > df1$id2,
>> > df1$id3,
>> > FUN = function(i) c(diff(i), 0))
>> >
>> > head(df1[order(df1$id1,
>> >df1$id2,
>> >df1$id3), ])
>> >
>> > But when expanding the data.frame (* 1e4), ave fails (Error: cannot 
>> > allocate vector of size 1110.0 Gb):
>> >
>> > df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>> >   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>> >   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>> >   val = sample(1:300, 5e2 * 1e4, TRUE))
>> >
>> > df2$diff <- ave(df2$val,
>> > df2$id1,
>> > df2$id2,
>> > df2$id3,
>> > FUN = function(i) c(diff(i), 0))
>> >
>> > This use case does not seem extreme to me (e.g. aggregate et al work 
>> > perfectly on this data.frame).
>> > So my question is: Is this expected/intended/reasonable? i.e. Does ave 
>> > need to be optimized?
>> >
>> >
>> >
>> > 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to 
>> > avoid warnings in case of unused levels 
>> > (https://urldefense.com/v3/__https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html__;!!JQ5agg!J2AUFbQr31F2c6LUpTnyc5TX2Kh1bJ-VqhMND1c0N5axWO_tQl0pCJhtucPfjU7NXrBO$
>> >  ).
>> > Is it relevant/possible to expose the drop argument explicitly?
>> >
>> >
>> >
>> > Thanks,
>> >
>> > Thomas
>> > __
>> > R-devel@r-project.org mailing list
>> > https://urldefense.com/v3/__https://stat.ethz.ch/mailman/listinfo/r-devel__;!!JQ5agg!J2AUFbQr31F2c6L

Re: [Rd] Potential improvements of ave?

2021-03-16 Thread SOEIRO Thomas
Dear all,

Thank you for your consideration on this topic.

I do not have enough knowledge of R internals to join the discussion about 
sorting mechanisms. In fact, I did not get how ordering could help for ave as 
the output must maintain the order of the input (because ave returns only x and 
not the entiere data.frame).

However, while the proposed workaround (i.e. paste0 instead of interaction, cf 
https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not solves 
the "bigger problem" of sorting, it is usable as is and solves the issue. 
Therefore, what do you think about it? (i.e is it relevant for a patch?)

Thanks,

Thomas


> 
> De : Abby Spurdle 
> Envoyé : lundi 15 mars 2021 10:22
> À : SOEIRO Thomas
> Cc : r-devel@r-project.org
> Objet : Re: [Rd] Potential improvements of ave?
>
> Hi Thomas,
>
> These are some great suggestions.
> But I can't help but feel there's a much bigger problem here.
>
> Intuitively, the ave function could (or should) sort the data.
> Then the indexing step becomes almost trivial, in terms of both time
> and space complexity.
> And the ave function is not the only example of where a problem
> becomes much simpler, if the data is sorted.
>
> Historically, I've never found base R functions user-friendly for
> aggregation purposes, or for sorting.
> (At least, not by comparison to SQL).
>
> But that's not the main problem.
> It would seem preferable to sort the data, only once.
> (Rather than sorting it repeatedly, or not at all).
>
> Perhaps, objects such as vectors and data.frame(s) could have a
> boolean attribute, to indicate if they're sorted.
> Or functions such as ave could have a sorted argument.
> In either case, if true, the function assumes the data is sorted and
> applies a more efficient algorithm.
>
>
> B.
>
>
> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas  wrote:
>>
>> Dear all,
>>
>> I have two questions/suggestions about ave, but I am not sure if it's 
>> relevant for bug reports.
>>
>>
>>
>> 1) I have performance issues with ave in a case where I didn't expect it. 
>> The following code runs as expected:
>>
>> set.seed(1)
>>
>> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
>>   id2 = sample(1:3, 5e2, TRUE),
>>   id3 = sample(1:5, 5e2, TRUE),
>>   val = sample(1:300, 5e2, TRUE))
>>
>> df1$diff <- ave(df1$val,
>> df1$id1,
>> df1$id2,
>> df1$id3,
>> FUN = function(i) c(diff(i), 0))
>>
>> head(df1[order(df1$id1,
>>df1$id2,
>>df1$id3), ])
>>
>> But when expanding the data.frame (* 1e4), ave fails (Error: cannot allocate 
>> vector of size 1110.0 Gb):
>>
>> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
>>   id2 = sample(1:3, 5e2 * 1e4, TRUE),
>>   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
>>   val = sample(1:300, 5e2 * 1e4, TRUE))
>>
>> df2$diff <- ave(df2$val,
>> df2$id1,
>> df2$id2,
>> df2$id3,
>> FUN = function(i) c(diff(i), 0))
>>
>> This use case does not seem extreme to me (e.g. aggregate et al work 
>> perfectly on this data.frame).
>> So my question is: Is this expected/intended/reasonable? i.e. Does ave need 
>> to be optimized?
>>
>>
>>
>> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to 
>> avoid warnings in case of unused levels 
>> (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
>> Is it relevant/possible to expose the drop argument explicitly?
>>
>>
>>
>> Thanks,
>>
>> Thomas
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Potential improvements of ave?

2021-03-16 Thread Gabriel Becker
Hi Abby,

I actually have a patch submitted that does this for unique/duplicated
(only numeric cases I think) but it is, as patches from external
contributors go, quite sizable which means it requires a correspondingly
large amount of an R-core member's time and energy to vet and consider. It
is in the queue, and so, I expect (/hope, provided I didn't make a mistake)
it will be incorporated at some point. (
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17993)

You are correct that the speedups are quite significant for calling
unique/duplicated on large vectors that know they are sorted: Speedup on my
machine for a fairly sizable vector (length 1e7) ranges from about ~10x in
the densely duplicated case up to ~60-70x in the sparsely duplicated case
for duplicated(). For unique() it seems to range from ~10x in the densely
duplicated case to ~15 in the spare case.

I had thought that min and max already did this, but looking now, they
don't seem to by default, thought ALTREP classes themselves do have an
option of setting a min/max method, which would be hit. That does seem like
low-hanging fruit, I agree, though in many cases the slow down from a
single pass over the data to get a min probably isn't earthshattering.

The others do seem like they could benefit as well.

Best,
~G

On Tue, Mar 16, 2021 at 2:54 PM Abby Spurdle  wrote:

> There are some relatively obvious examples:
> unique, which.min/which.max/etc, range/min/max, quantile, aggregate/split
>
> Also, many timeseries, graphics and spline functions are dependent on the
> order.
>
> In the case of data.frame(s), a boolean flag would probably need to be
> extended to allow for multiple column sorting, and
> ascending/descending options.
>
> On Tue, Mar 16, 2021 at 11:08 AM Gabriel Becker 
> wrote:
> >
> > Abby,
> >
> > Vectors do have an internal mechanism for knowing that they are sorted
> via ALTREP (it was one of 2 core motivating features for 'smart vectors'
> the other being knowledge about presence of NAs).
> >
> > Currently I don't think we expose it at the R level, though it is part
> of the official C API. I don't know of any plans for this to change, but I
> suppose it could. Plus for functions in R itself, we could even use it
> without exposing it more widely. A number of functions, including sort
> itself, already do this in fact, but more could. I'd be interested in
> hearing which functions you think would particularly benefit from this.
> >
> > ~G
> >
> > On Mon, Mar 15, 2021 at 12:01 PM SOEIRO Thomas 
> wrote:
> >>
> >> Hi Abby,
> >>
> >> Thank you for your positive feedback.
> >>
> >> I agree for your general comment about sorting.
> >>
> >> For ave specifically, ordering may not help because the output must
> maintain the order of the input (as ave returns only x and not the entiere
> data.frame).
> >>
> >> Thanks,
> >>
> >> Thomas
> >> 
> >> De : Abby Spurdle 
> >> Envoyé : lundi 15 mars 2021 10:22
> >> À : SOEIRO Thomas
> >> Cc : r-devel@r-project.org
> >> Objet : Re: [Rd] Potential improvements of ave?
> >>
> >> EMAIL EXTERNE - TRAITER AVEC PRÉCAUTION LIENS ET FICHIERS
> >>
> >> Hi Thomas,
> >>
> >> These are some great suggestions.
> >> But I can't help but feel there's a much bigger problem here.
> >>
> >> Intuitively, the ave function could (or should) sort the data.
> >> Then the indexing step becomes almost trivial, in terms of both time
> >> and space complexity.
> >> And the ave function is not the only example of where a problem
> >> becomes much simpler, if the data is sorted.
> >>
> >> Historically, I've never found base R functions user-friendly for
> >> aggregation purposes, or for sorting.
> >> (At least, not by comparison to SQL).
> >>
> >> But that's not the main problem.
> >> It would seem preferable to sort the data, only once.
> >> (Rather than sorting it repeatedly, or not at all).
> >>
> >> Perhaps, objects such as vectors and data.frame(s) could have a
> >> boolean attribute, to indicate if they're sorted.
> >> Or functions such as ave could have a sorted argument.
> >> In either case, if true, the function assumes the data is sorted and
> >> applies a more efficient algorithm.
> >>
> >>
> >> B.
> >>
> >>
> >> On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas 
> wrote:
> >> >
> >> > Dear all,
> >> >
> >> > I have two questions/suggestions about ave, but I am not sure if it's
> relevant for bug reports.
> >> >
> >> >
> >> >
> >> > 1) I have performance issues with ave in a case where I didn't expect
> it. The following code runs as expected:
> >> >
> >> > set.seed(1)
> >> >
> >> > df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >> >   id2 = sample(1:3, 5e2, TRUE),
> >> >   id3 = sample(1:5, 5e2, TRUE),
> >> >   val = sample(1:300, 5e2, TRUE))
> >> >
> >> > df1$diff <- ave(df1$val,
> >> > df1$id1,
> >> > df1$id2,
> >> > df1$id3,
> >> > FUN = function(i) c(

Re: [Rd] Potential improvements of ave?

2021-03-16 Thread Bill Dunlap
Your proposed change (roughly, replacing interaction() by
unique(paste())) slows down ave() considerably when there are long
columns with lots of repeated rows.

I think that interaction(drop=TRUE, ...) can be changed to use less
memory and be faster by making a separate branch for drop=TRUE that
uses the following idiom for finding the unique rows in a data.frame:

new.duplicated.data.frame <- function (x, incomparables = FALSE,
fromLast = FALSE, ...)
{
dup <- !logical(nrow(x)) # all entries considered duplicated until
proven otherwise
for(column in x) {
dup <- dup & duplicated(column, incomparables = incomparables,
fromLast = fromLast)
}
dup
}

ave() could use the above directly or it could call interaction(drop=TRUE,...).

On Tue, Mar 16, 2021 at 3:50 PM SOEIRO Thomas  wrote:
>
> Dear all,
>
> Thank you for your consideration on this topic.
>
> I do not have enough knowledge of R internals to join the discussion about 
> sorting mechanisms. In fact, I did not get how ordering could help for ave as 
> the output must maintain the order of the input (because ave returns only x 
> and not the entiere data.frame).
>
> However, while the proposed workaround (i.e. paste0 instead of interaction, 
> cf https://stat.ethz.ch/pipermail/r-devel/2021-March/080509.html) does not 
> solves the "bigger problem" of sorting, it is usable as is and solves the 
> issue. Therefore, what do you think about it? (i.e is it relevant for a 
> patch?)
>
> Thanks,
>
> Thomas
>
>
> > 
> > De : Abby Spurdle 
> > Envoyé : lundi 15 mars 2021 10:22
> > À : SOEIRO Thomas
> > Cc : r-devel@r-project.org
> > Objet : Re: [Rd] Potential improvements of ave?
> >
> > Hi Thomas,
> >
> > These are some great suggestions.
> > But I can't help but feel there's a much bigger problem here.
> >
> > Intuitively, the ave function could (or should) sort the data.
> > Then the indexing step becomes almost trivial, in terms of both time
> > and space complexity.
> > And the ave function is not the only example of where a problem
> > becomes much simpler, if the data is sorted.
> >
> > Historically, I've never found base R functions user-friendly for
> > aggregation purposes, or for sorting.
> > (At least, not by comparison to SQL).
> >
> > But that's not the main problem.
> > It would seem preferable to sort the data, only once.
> > (Rather than sorting it repeatedly, or not at all).
> >
> > Perhaps, objects such as vectors and data.frame(s) could have a
> > boolean attribute, to indicate if they're sorted.
> > Or functions such as ave could have a sorted argument.
> > In either case, if true, the function assumes the data is sorted and
> > applies a more efficient algorithm.
> >
> >
> > B.
> >
> >
> > On Sat, Mar 13, 2021 at 1:07 PM SOEIRO Thomas  
> > wrote:
> >>
> >> Dear all,
> >>
> >> I have two questions/suggestions about ave, but I am not sure if it's 
> >> relevant for bug reports.
> >>
> >>
> >>
> >> 1) I have performance issues with ave in a case where I didn't expect it. 
> >> The following code runs as expected:
> >>
> >> set.seed(1)
> >>
> >> df1 <- data.frame(id1 = sample(1:1e2, 5e2, TRUE),
> >>   id2 = sample(1:3, 5e2, TRUE),
> >>   id3 = sample(1:5, 5e2, TRUE),
> >>   val = sample(1:300, 5e2, TRUE))
> >>
> >> df1$diff <- ave(df1$val,
> >> df1$id1,
> >> df1$id2,
> >> df1$id3,
> >> FUN = function(i) c(diff(i), 0))
> >>
> >> head(df1[order(df1$id1,
> >>df1$id2,
> >>df1$id3), ])
> >>
> >> But when expanding the data.frame (* 1e4), ave fails (Error: cannot 
> >> allocate vector of size 1110.0 Gb):
> >>
> >> df2 <- data.frame(id1 = sample(1:(1e2 * 1e4), 5e2 * 1e4, TRUE),
> >>   id2 = sample(1:3, 5e2 * 1e4, TRUE),
> >>   id3 = sample(1:(5 * 1e4), 5e2 * 1e4, TRUE),
> >>   val = sample(1:300, 5e2 * 1e4, TRUE))
> >>
> >> df2$diff <- ave(df2$val,
> >> df2$id1,
> >> df2$id2,
> >> df2$id3,
> >> FUN = function(i) c(diff(i), 0))
> >>
> >> This use case does not seem extreme to me (e.g. aggregate et al work 
> >> perfectly on this data.frame).
> >> So my question is: Is this expected/intended/reasonable? i.e. Does ave 
> >> need to be optimized?
> >>
> >>
> >>
> >> 2) Gabor Grothendieck pointed out in 2011 that drop = TRUE is needed to 
> >> avoid warnings in case of unused levels 
> >> (https://stat.ethz.ch/pipermail/r-devel/2011-February/059947.html).
> >> Is it relevant/possible to expose the drop argument explicitly?
> >>
> >>
> >>
> >> Thanks,
> >>
> >> Thomas
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Faster sorting algorithm...

2021-03-16 Thread Radford Neal
Those interested in faster sorting may want to look at the merge sort
implemented in pqR (see pqR-project.org).  It's often used as the
default, because it is stable, and does different collations, while
being faster than shell sort (except for small vectors).

Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2, 
compiled identically:

-
pqR-2020-07-23 in C locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
   user  system elapsed 
  1.332   0.000   1.334 
> print(system.time (or <- order(x,method="radix")))
   user  system elapsed 
  0.092   0.004   0.096 
> print(system.time (om <- order(x,method="merge")))
   user  system elapsed 
  0.363   0.000   0.363 
> print(identical(os,or))
[1] TRUE
> print(identical(os,om))
[1] TRUE
> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2
> print(order(x,method="radix"))
[1] 1 2
> print(order(x,method="merge"))
[1] 1 2

-
R-4.0.2 in C locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
   user  system elapsed 
  2.381   0.004   2.387 
> print(system.time (or <- order(x,method="radix")))
   user  system elapsed 
  0.138   0.000   0.137 
> #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE
> #print(identical(os,om))
> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2
> print(order(x,method="radix"))
[1] 1 2
> #print(order(x,method="merge"))


pqR-2020-07-23 in fr_CA.utf8 locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
utilisateur système  écoulé 
  2.960   0.000   2.962 
> print(system.time (or <- order(x,method="radix")))
utilisateur système  écoulé 
  0.083   0.008   0.092 
> print(system.time (om <- order(x,method="merge")))
utilisateur système  écoulé 
  1.143   0.000   1.142 
> print(identical(os,or))
[1] TRUE
> print(identical(os,om))
[1] TRUE
> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1
> print(order(x,method="radix"))
[1] 1 2
> print(order(x,method="merge"))
[1] 2 1


R-4.0.2 in fr_CA.utf8 locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
utilisateur système  écoulé 
  4.222   0.016   4.239 
> print(system.time (or <- order(x,method="radix")))
utilisateur système  écoulé 
  0.136   0.000   0.137 
> #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE
> #print(identical(os,om))
> 
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1
> print(order(x,method="radix"))
[1] 1 2
> #print(order(x,method="merge"))


pqR is faster in all the tests, but more relevant to this discussion
is that the "merge" method is substantially faster than "shell" for
these character vectors with a million elements, while retaining the
stability and collation properties of "shell" (whereas "radix" only
does C collation).

It would probably not be too hard to take the merge sort code from pqR
and use it in R core's implementation.

The merge sort in pqR doesn't exploit parallelism at the moment, but
merge sort is potentially quite parallelizable (though I think the
storage allocation strategy I use would have to be modified).

Regards,

   Radford Neal

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel