Re: [Rd] split() - unexpected sorting of results

Rui Barradas Fri, 20 Oct 2017 22:36:06 -0700

Hello,

To solve the problem of sorting numbers that have been converted to character, there is the package stringr, with functions str_sort and str_order.


library(stringr)

set.seed(2447)

x <- sample(11L)
sort(as.character(x))
#[1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

str_sort(as.character(x), numeric = TRUE)
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

str_order(as.character(x), numeric = TRUE)
#[1]  1  4 11  8  6  5  3 10  9  7  2

i <- str_order(as.character(x), numeric = TRUE)
as.character(x)[i]
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

Unfortunately this does not solve the OP's question: factor(), as.factor(), split() and others use the base R sorter, and this default can only be changed by changing their sources.
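The default cannot be switched, but the levels can still be supplied explicitly at each call. A minimal sketch in base R, assuming every label parses as a number; the names ch, u, lev and res are illustrative only:

```r
# Illustrative names; assumes every label can be parsed as a number.
ch  <- as.character(c(10, 2, 1, 11, 3))
u   <- unique(ch)
lev <- u[order(as.numeric(u))]     # numeric order, base R only
# With stringr loaded, str_sort(u, numeric = TRUE) could be used for lev instead.
f   <- factor(ch, levels = lev)    # explicit levels bypass the default sort
res <- split(seq_along(ch), f)
names(res)                         # groups come out in the order of lev
```

This only affects calls where the levels are passed; any code that calls factor() or split() on the bare character vector still gets the default ordering.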


Hope this helps,

Rui Barradas

On 21-10-2017 00:32, Hervé Pagès wrote:

Hi,

On 10/20/2017 12:53 PM, Peter Meissner wrote:

Thanks, for the explanation.

Still, I think this is surprising behaviour which might be handled
better.


Maybe a little surprising, but no more than:

 > x <- sample(11L)

 > sort(x)
  [1]  1  2  3  4  5  6  7  8  9 10 11

 > sort(as.character(x))
  [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

The fact that sort(), as.factor(), split() and many other things behave
consistently with respect to the underlying order of character vectors
avoids other even bigger surprises.

Also note that the underlying order of character vectors actually
depends on your locale. One way to guarantee consistent results across
platforms/locales is by explicitly specifying the levels when making
a factor e.g.

   f <- factor(x, levels=unique(x))
   split(1:11, f)

This is particularly sensible when writing unit tests.
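A small sketch of the same idea on a character vector, with made-up input, showing that the groups then follow the order of first appearance rather than the locale's collation:

```r
# Made-up input; ch, f and res are illustrative names.
ch  <- as.character(c(3, 10, 1, 2))
f   <- factor(ch, levels = unique(ch))  # levels in order of first appearance
res <- split(seq_along(ch), f)
names(res)                              # "3" "10" "1" "2"
```

Because the levels are given explicitly, no sorting of the character values takes place, so the result does not depend on the locale.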

Cheers,
H.


Best, Peter

On 20.10.2017 at 9:49 PM, "Iñaki Úcar" <[email protected]> wrote:

Hi Peter,

2017-10-20 21:33 GMT+02:00 Peter Meissner <[email protected]>:

Hey,

I found this - for me - quite surprising and puzzling behaviour of
split().

split(1:11, as.character(1:11))
split(1:11, 1:11)

When splitting by numerics everything works as expected - sorting of
input == sorting of output - but when using a character vector
everything gets re-sorted alphabetically.

Although there are some references in the help files to what happens
when using split, I did not find any note on this - for me - rather
unexpected behaviour.


As the documentation states,

        f: a ‘factor’ in the sense that ‘as.factor(f)’ defines the
           grouping, or a list of such factors in which case their
           interaction is used for the grouping.

And, in fact,

as.factor(1:11)

  [1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 2 3 4 5 6 7 8 9 10 11

as.factor(as.character(1:11))

  [1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 10 11 2 3 4 5 6 7 8 9
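A quick way to check the connection, as a sketch with illustrative names: the names of the list returned by split() are the levels that as.factor() produces for the same grouping vector.

```r
# ch and res are illustrative names.
ch  <- as.character(1:11)
res <- split(1:11, ch)
identical(names(res), levels(as.factor(ch)))  # TRUE
```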

Regards,
Iñaki

I would like it best if the sorting of split results stayed the same
no matter the input (sorting of input == sorting of output).

If that is not possible, a note of caution in the help pages and
maybe an example might be valuable.


Best, Peter

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
