Re: [Rd] Faster sorting algorithm...

2021-03-21 Thread frederik

I think it is "Professor Neal" :)

I also appreciate the pqR comparisons.

On Wed, Mar 17, 2021 at 09:23:15AM +, Morgan Morgan wrote:

Thank you Neal. This is interesting. I will have a look at pqR.
Indeed radix only does C collation, I believe that is why it is not the
default choice for character ordering and sorting.
Not sure but I believe it can help address the following bugzilla item:
https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400

On the same topic of collation, there is an experimental sorting function
"psort" in package kit that might help address this issue.


library(kit)

Attaching kit 0.0.7 (OPENMP enabled using 1 thread)

x <- c("b","A","B","a","\xe4")
Encoding(x) <- "latin1"
identical(psort(x, c.locale=FALSE), sort(x))

[1] TRUE

identical(psort(x, c.locale=TRUE), sort(x, method="radix"))

[1] TRUE

Coming back to the topic of fsort, I have just finished the implementation
for double, integer, factor and logical.
The implementation takes into account NA, Inf.. values. Values can be
sorted in a decreasing order or increasing order.
Comparing benchmark with the current implementation in data.table, it is
currently over 30% faster.
There might bugs but I am sure performance can be further improved as I did
not really try hard.
If there is interest in both the implementation and cross community
sharing, please let know

Best regards,
Morgan

On Wed, 17 Mar 2021, 00:37 Radford Neal,  wrote:


Those interested in faster sorting may want to look at the merge sort
implemented in pqR (see pqR-project.org).  It's often used as the
default, because it is stable, and does different collations, while
being faster than shell sort (except for small vectors).

Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
compiled identically:

-
pqR-2020-07-23 in C locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
   user  system elapsed
  1.332   0.000   1.334
> print(system.time (or <- order(x,method="radix")))
   user  system elapsed
  0.092   0.004   0.096
> print(system.time (om <- order(x,method="merge")))
   user  system elapsed
  0.363   0.000   0.363
> print(identical(os,or))
[1] TRUE
> print(identical(os,om))
[1] TRUE
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2
> print(order(x,method="radix"))
[1] 1 2
> print(order(x,method="merge"))
[1] 1 2

-
R-4.0.2 in C locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
   user  system elapsed
  2.381   0.004   2.387
> print(system.time (or <- order(x,method="radix")))
   user  system elapsed
  0.138   0.000   0.137
> #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE
> #print(identical(os,om))
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 1 2
> print(order(x,method="radix"))
[1] 1 2
> #print(order(x,method="merge"))


pqR-2020-07-23 in fr_CA.utf8 locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
utilisateur système  écoulé
  2.960   0.000   2.962
> print(system.time (or <- order(x,method="radix")))
utilisateur système  écoulé
  0.083   0.008   0.092
> print(system.time (om <- order(x,method="merge")))
utilisateur système  écoulé
  1.143   0.000   1.142
> print(identical(os,or))
[1] TRUE
> print(identical(os,om))
[1] TRUE
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1
> print(order(x,method="radix"))
[1] 1 2
> print(order(x,method="merge"))
[1] 2 1


R-4.0.2 in fr_CA.utf8 locale:

> set.seed(1)
> N <- 100
> x <- as.character (sample(N,N,replace=TRUE))
> print(system.time (os <- order(x,method="shell")))
utilisateur système  écoulé
  4.222   0.016   4.239
> print(system.time (or <- order(x,method="radix")))
utilisateur système  écoulé
  0.136   0.000   0.137
> #print(system.time (om <- order(x,method="merge")))
> print(identical(os,or))
[1] TRUE
> #print(identical(os,om))
>
> x <- c("a","~")
> print(order(x,method="shell"))
[1] 2 1
> print(order(x,method="radix"))
[1] 1 2
> #print(order(x,method="merge"))


pqR is faster in all the tests, but more relevant to this discussion
is that the "merge" method is substantially faster than "shell" for
these character vectors with a million elements, while retaining the
stability and collation properties of "shell" (whereas "radix" only
does C collation).

It would probably not be too hard to take the merge sort code from pqR
and use it in R core's implementation.

The merge sort in pqR doesn't exploit parallelism at the moment, but
merge sort is potentially quite parallelizable (though I think the
storage allocation strategy I use would

Re: [Rd] Faster sorting algorithm...

2021-03-21 Thread Morgan Morgan
My apologies to Professor Neal.
Thank you for correcting me.
Best regards
Morgan


On Mon, 22 Mar 2021, 05:05 ,  wrote:

> I think it is "Professor Neal" :)
>
> I also appreciate the pqR comparisons.
>
> On Wed, Mar 17, 2021 at 09:23:15AM +, Morgan Morgan wrote:
> >Thank you Neal. This is interesting. I will have a look at pqR.
> >Indeed radix only does C collation, I believe that is why it is not the
> >default choice for character ordering and sorting.
> >Not sure but I believe it can help address the following bugzilla item:
> >https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17400
> >
> >On the same topic of collation, there is an experimental sorting function
> >"psort" in package kit that might help address this issue.
> >
> >> library(kit)
> >Attaching kit 0.0.7 (OPENMP enabled using 1 thread)
> >> x <- c("b","A","B","a","\xe4")
> >> Encoding(x) <- "latin1"
> >> identical(psort(x, c.locale=FALSE), sort(x))
> >[1] TRUE
> >> identical(psort(x, c.locale=TRUE), sort(x, method="radix"))
> >[1] TRUE
> >
> >Coming back to the topic of fsort, I have just finished the implementation
> >for double, integer, factor and logical.
> >The implementation takes into account NA, Inf.. values. Values can be
> >sorted in a decreasing order or increasing order.
> >Comparing benchmark with the current implementation in data.table, it is
> >currently over 30% faster.
> >There might bugs but I am sure performance can be further improved as I
> did
> >not really try hard.
> >If there is interest in both the implementation and cross community
> >sharing, please let know
> >
> >Best regards,
> >Morgan
> >
> >On Wed, 17 Mar 2021, 00:37 Radford Neal,  wrote:
> >
> >> Those interested in faster sorting may want to look at the merge sort
> >> implemented in pqR (see pqR-project.org).  It's often used as the
> >> default, because it is stable, and does different collations, while
> >> being faster than shell sort (except for small vectors).
> >>
> >> Here are examples, with timings, for pqR-2020-07-23 and R-4.0.2,
> >> compiled identically:
> >>
> >> -
> >> pqR-2020-07-23 in C locale:
> >>
> >> > set.seed(1)
> >> > N <- 100
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >>user  system elapsed
> >>   1.332   0.000   1.334
> >> > print(system.time (or <- order(x,method="radix")))
> >>user  system elapsed
> >>   0.092   0.004   0.096
> >> > print(system.time (om <- order(x,method="merge")))
> >>user  system elapsed
> >>   0.363   0.000   0.363
> >> > print(identical(os,or))
> >> [1] TRUE
> >> > print(identical(os,om))
> >> [1] TRUE
> >> >
> >> > x <- c("a","~")
> >> > print(order(x,method="shell"))
> >> [1] 1 2
> >> > print(order(x,method="radix"))
> >> [1] 1 2
> >> > print(order(x,method="merge"))
> >> [1] 1 2
> >>
> >> -
> >> R-4.0.2 in C locale:
> >>
> >> > set.seed(1)
> >> > N <- 100
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >>user  system elapsed
> >>   2.381   0.004   2.387
> >> > print(system.time (or <- order(x,method="radix")))
> >>user  system elapsed
> >>   0.138   0.000   0.137
> >> > #print(system.time (om <- order(x,method="merge")))
> >> > print(identical(os,or))
> >> [1] TRUE
> >> > #print(identical(os,om))
> >> >
> >> > x <- c("a","~")
> >> > print(order(x,method="shell"))
> >> [1] 1 2
> >> > print(order(x,method="radix"))
> >> [1] 1 2
> >> > #print(order(x,method="merge"))
> >>
> >> 
> >> pqR-2020-07-23 in fr_CA.utf8 locale:
> >>
> >> > set.seed(1)
> >> > N <- 100
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >> utilisateur système  écoulé
> >>   2.960   0.000   2.962
> >> > print(system.time (or <- order(x,method="radix")))
> >> utilisateur système  écoulé
> >>   0.083   0.008   0.092
> >> > print(system.time (om <- order(x,method="merge")))
> >> utilisateur système  écoulé
> >>   1.143   0.000   1.142
> >> > print(identical(os,or))
> >> [1] TRUE
> >> > print(identical(os,om))
> >> [1] TRUE
> >> >
> >> > x <- c("a","~")
> >> > print(order(x,method="shell"))
> >> [1] 2 1
> >> > print(order(x,method="radix"))
> >> [1] 1 2
> >> > print(order(x,method="merge"))
> >> [1] 2 1
> >>
> >> 
> >> R-4.0.2 in fr_CA.utf8 locale:
> >>
> >> > set.seed(1)
> >> > N <- 100
> >> > x <- as.character (sample(N,N,replace=TRUE))
> >> > print(system.time (os <- order(x,method="shell")))
> >> utilisateur système  écoulé
> >>   4.222   0.016   4.239
> >> > print(system.time (or <- order(x,method="radix")))
> >> utilisateur système  écoulé
> >>   0.136   0.000   0.137
> >> > #print(system.time (om <- order(x,method="merge")))
> >> > print(identical(os,or))
> >> [1] TRUE
> >>