Re: [Rd] Another issue with Sys.timezone

2017-10-20 Thread Martin Maechler
> Stephen Berman 
> on Thu, 19 Oct 2017 17:12:50 +0200 writes:

> On Wed, 18 Oct 2017 18:09:41 +0200 Martin Maechler wrote:
>>> Martin Maechler 
>>> on Mon, 16 Oct 2017 19:13:31 +0200 writes:

> (I also included a reply to part of this response of yours below.)

>>> Stephen Berman 
>>> on Sun, 15 Oct 2017 01:53:12 +0200 writes:
>> 
>>> > (I reported the test failure mentioned below to R-help but was advised
>>> > that this list is the right one to address the issue; in the meantime I
>>> > investigated the matter somewhat more closely, including searching
>>> > recent R-devel postings, since I haven't been following this list.)
>>> 
>>> > Last May there were two reports here of problems with Sys.timezone, one
>>> > where the zoneinfo directory is in a nonstandard location
>>> > (https://stat.ethz.ch/pipermail/r-devel/2017-May/074267.html) and the
>>> > other where the system lacks the file /etc/localtime
>>> > (https://stat.ethz.ch/pipermail/r-devel/2017-May/074275.html).  My
>>> > system exhibits a third case: it lacks /etc/timezone and does not set TZ
>>> > systemwide, but it does have /etc/localtime, which is a copy of, rather
>>> > than a symlink to, a file under zoneinfo.  On this system Sys.timezone()
>>> > returns NA and the Sys.timezone test in reg-tests-1d fails.  However, on
>>> > my system I can get the (abbreviated) timezone in R by using as.POSIXlt,
>>> > e.g. as.POSIXlt(Sys.time())$zone.  If Sys.timezone took advantage of
>>> > this, e.g. as below, it would be useful on such systems as mine and the
>>> > regression test would pass.
>>> 
>>> > my.Sys.timezone <-
>>> >   function (location = TRUE)
>>> > {
>>> >   tz <- Sys.getenv("TZ", names = FALSE)
>>> >   if (!location || nzchar(tz))
>>> >     return(Sys.getenv("TZ", unset = NA_character_))
>>> >   lt <- normalizePath("/etc/localtime")
>>> >   if (grepl(pat <- "^/usr/share/zoneinfo/", lt) ||
>>> >       grepl(pat <- "^/usr/share/zoneinfo.default/", lt))
>>> >     sub(pat, "", lt)
>>> >   else if (lt == "/etc/localtime")
>>> >     if (!file.exists("/etc/timezone"))
>>> >       return(as.POSIXlt(Sys.time())$zone)
>>> >     else if (dir.exists("/usr/share/zoneinfo") && {
>>> >       info <- file.info(normalizePath("/etc/timezone"),
>>> >                         extra_cols = FALSE)
>>> >       (!info$isdir && info$size <= 200L)
>>> >     } && {
>>> >       tz1 <- tryCatch(readBin("/etc/timezone", "raw", 200L),
>>> >                       error = function(e) raw(0L))
>>> >       length(tz1) > 0L && all(tz1 %in% as.raw(c(9:10, 13L, 32:126)))
>>> >     } && {
>>> >       tz2 <- gsub("^[[:space:]]+|[[:space:]]+$", "", rawToChar(tz1))
>>> >       tzp <- file.path("/usr/share/zoneinfo", tz2)
>>> >       file.exists(tzp) && !dir.exists(tzp) &&
>>> >         identical(file.size(normalizePath(tzp)), file.size(lt))
>>> >     })
>>> >       tz2
>>> >     else NA_character_
>>> > }
>>> 
>>> > One problem with this is that the zone component of as.POSIXlt only
>>> > holds the abbreviated timezone, not the Olson name.  
>>> 
>>> Yes, indeed.  So, really only for  Sys.timezone(location = FALSE)  this
>>> should be given, for the default  location = TRUE   it should
>>> still give NA (i.e. NA_character_)  in your setup.
>>> 
>>> Interestingly, the Windows version of Sys.timezone(location =
>>> FALSE) uses something like your proposal,  and I tend to think that
>>> -- again only for location=FALSE -- this should be used on
>>> non-Windows as well, at least instead of returning  NA  then.
>>> 
>>> Also for me on 3 different Linuxen (Fedora 24, F. 26, and ubuntu
>>> 14.04 LTS), I get
>>> 
>>> > Sys.timezone()
>>> [1] "Europe/Zurich"
>>> > Sys.timezone(FALSE)
>>> [1] NA
>>> > 
>>> 
>>> whereas on Windows I get Europe/Berlin for the first (why on
>>> earth - I'm really in Zurich) and get  "CEST" ("Central European
>>> Summer Time") for the 2nd one instead of NA ... simply using a
>>> smarter version of your proposal.   The windows source is
>>> in R's source at  src/library/base/R/windows/system.R :
>>> 
>>> Sys.timezone <- function(location = TRUE)
>>> {
>>> tz <- Sys.getenv("TZ", names = FALSE)
>>> if(nzchar(tz)) return(tz)
>>> if(location) return(.Internal(tzone_name()))
>>> z <- as.POSIXlt(Sys.time())
>>> zz <- attr(z, "tzone")
>>> if(length(zz) == 3L) zz[2L + z$isdst] else zz[1L]
>>> }
>>> 
>>> From what I read, the last three lines also work in your setup
>>> where it seems zz would be of length 1, right ?

[Rd] Illegal Logical Values

2017-10-20 Thread brodie gaslam via R-devel
I'm wondering if WRE Section 5.2 should be a little more explicit about misuse 
of integer values other than NA, 0, and 1 in LGLSXPs.  I'm thinking of this 
passage:

> Logical values are sent as 0 (FALSE), 1 (TRUE) or INT_MIN = -2147483648 (NA, 
> but only if NAOK is true), and the compiled code should return one of these 
> three values. (Non-zero values other than INT_MIN are mapped to TRUE.) 

The parenthetical seems to suggest that something like 'LOGICAL(x)[0] = 2;' 
will be treated as TRUE, which it sometimes is, and sometimes isn't:

not.true <- inline::cfunction(body='
  SEXP res = allocVector(LGLSXP, 1);
  LOGICAL(res)[0] = 2;
  return res;'
)()
not.true
## [1] TRUE
not.true == TRUE
## [1] FALSE
not.true[1] == TRUE  # due to scalar subset handling
## [1] TRUE
not.true == 2L
## [1] TRUE


Perhaps a more explicit warning that using anything other than 0, 1, or NA is 
undefined behavior is warranted?  Obviously people should know better than to 
expect correct behavior, but the fact that the behavior is correct in some 
cases (e.g. printing, scalar subsetting) might be confusing.

This is based off of Drew Schmidt's accidental discovery yesterday: 
.
Best,
Brodie.

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [Rd] Bug: Issues on Windows with SFN disabled

2017-10-20 Thread Tomas Kalibera


This has now been mostly fixed in R-devel. What remains to be resolved 
is that some packages with custom make files cannot be installed from 
source when R is installed into a directory with a space in its name and 
short file names are not available.


Tomas



On 10/17/2017 10:37 AM, Tomas Kalibera wrote:

Hi Zach,

thanks for the report. I can reproduce the problem and confirm that it 
is a bug in R; it will be fixed.


Hopefully it impacts only a few users now. The workaround is to create 
a short name for the directory where R is installed, using "fsutil 
file setshortname" (for all elements of the path that contain a space in 
their name). One can revert this by setting the short name to an empty 
string (""). At least for the latter, one may need to boot in safe mode.


Best
Tomas


On 09/17/2017 08:23 PM, Zach Bjornson wrote:

Hello,

R appears to assume that Windows drives have short file names (SFN, 8.3)
enabled; for example, that "C:/Program Files/..." is addressable as
"C:/Progra~1/...". Newer versions of Windows have SFN disabled on non-OS
drives, however.

This means that if you install R on a non-OS drive, you
- can't start R.exe from the command line.
- consequently, anything that attempts to spawn a new R process also fails.
  This includes a lot of the commands from the popular devtools package.

More discussion and background:
https://github.com/hadley/devtools/issues/1514


I don't have access to bugzilla to file this there.

Thanks and best,
Zach


Re: [Rd] Illegal Logical Values

2017-10-20 Thread Martyn Plummer
On Fri, 2017-10-20 at 14:01 +, brodie gaslam via R-devel wrote:
> I'm wondering if WRE Section 5.2 should be a little more explicit
> about misuse of integer values other than NA, 0, and 1 in LGLSXPs. 
> I'm thinking of this passage:
> 
> > Logical values are sent as 0 (FALSE), 1 (TRUE) or INT_MIN =
> > -2147483648 (NA, but only if NAOK is true), and the compiled code
> > should return one of these three values. (Non-zero values other
> > than INT_MIN are mapped to TRUE.) 
> 
> The parenthetical seems to suggest that something like 'LOGICAL(x)[0]
> = 2;' will be treated as TRUE, which it sometimes is, and sometimes
> isn't:

The title of Section 5.2 is "Interface functions .C and .Fortran" and
the text above refers to those interfaces. It explains how logical
vectors are mapped to C integer arrays on entry and back again on exit.

This does work as advertised. Here is a simple example. File
"nottrue.c" contains the text

void nottrue(int *x)
{
   x[0] = 2;
}

This is compiled with "R CMD SHLIB nottrue.c" to create the shared
object "nottrue.so"

> dyn.load("nottrue.so")
> a <- .C("nottrue", x=integer(1))
> a
$x
[1] 2

> a <- .C("nottrue", x=logical(1))
> a
$x
[1] TRUE

> isTRUE(a$x)
[1] TRUE
> as.integer(a)
[1] 1

So for a logical argument, the integer value 2 is mapped back to a
valid value on return.

> not.true <- inline::cfunction(body='
>   SEXP res = allocVector(LGLSXP, 1);
>   LOGICAL(res)[0] = 2;
>   return res;'
> )()
> not.true
> ## [1] TRUE
> not.true == TRUE
> ## [1] FALSE
> not.true[1] == TRUE  # due to scalar subset handling
> ## [1] TRUE
> not.true == 2L
> ## [1] TRUE

In your last example, not.true is coerced to integer (as explained in the
help for "==") and its integer value of 2 is recovered.

> Perhaps a more explicit warning that using anything other than 0, 1,
> or NA is undefined behavior is warranted?  Obviously people should
> know better than to expect correct behavior, but the fact that the
> behavior is correct in some cases (e.g. printing, scalar subsetting)
> might be confusing.

Yes if people are tripping up on this then we could clarify that the
.Call interface does not remap logical vectors on exit. Hence 
assignment of any value other than 0, 1 or INT_MIN to the elements of a
logical vector may cause unexpected behaviour when this vector is
returned to R.
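
As a rough illustration of a defensive measure on the R side, a vector coming back from .Call could be round-tripped through integer. This is only a sketch: `normalize_lgl` is a hypothetical helper name, and it assumes that as.integer() exposes the stored value (as the `not.true == 2L` comparison above suggests) and that as.logical() then maps any non-zero, non-NA integer to TRUE.

```r
# Hypothetical helper, not part of R: re-map a logical vector whose
# elements may hold stored values other than 0, 1 or NA.
# Assumption: as.integer() exposes the stored value and as.logical()
# maps every non-zero, non-NA integer to TRUE.
normalize_lgl <- function(x) {
    stopifnot(is.logical(x))
    out <- as.logical(as.integer(x))
    attributes(out) <- attributes(x)  # restore names, dim, etc.
    out
}

ok <- c(a = TRUE, b = FALSE, c = NA)
identical(normalize_lgl(ok), ok)  # valid vectors pass through unchanged
```

Fixing the C code so that it only stores 0, 1 or NA_LOGICAL remains the cleaner option; the helper above merely limits the damage when the compiled code is not under one's control.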

Martyn

> Best,
> Brodie.


Re: [Rd] Another issue with Sys.timezone

2017-10-20 Thread Stephen Berman
On Fri, 20 Oct 2017 09:15:42 +0200 Martin Maechler wrote:

>> Stephen Berman 
>> on Thu, 19 Oct 2017 17:12:50 +0200 writes:
>
> > On Wed, 18 Oct 2017 18:09:41 +0200 Martin Maechler wrote:
> >>> Martin Maechler 
> >>> on Mon, 16 Oct 2017 19:13:31 +0200 writes:
>
[...]
> >>> whereas on Windows I get Europe/Berlin for the first (why on
> >>> earth - I'm really in Zurich) and get "CEST" ("Central European
> >>> Summer Time") for the 2nd one instead of NA ... simply using a
> >>> smarter version of your proposal.   The windows source is
> >>> in R's source at  src/library/base/R/windows/system.R :
> >>> 
> >>> Sys.timezone <- function(location = TRUE)
> >>> {
> >>> tz <- Sys.getenv("TZ", names = FALSE)
> >>> if(nzchar(tz)) return(tz)
> >>> if(location) return(.Internal(tzone_name()))
> >>> z <- as.POSIXlt(Sys.time())
> >>> zz <- attr(z, "tzone")
> >>> if(length(zz) == 3L) zz[2L + z$isdst] else zz[1L]
> >>> }
> >>> 
> >>> From what I read, the last three lines also work in your setup
> >>> where it seems zz would be of length 1, right ?
>
> > Those lines do indeed work here, but zz has three elements:
>
> >> attributes(as.POSIXlt(Sys.time()))$tzone
> > [1] "" "CET"  "CEST"
>
> { "but" ??   yes, three elements is what I see too, but for that
>   reason there's the  if(length(zz) == 3L) ... }

The "but" was in response to "it seems zz would be of length 1", but
perhaps I misunderstood you.
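
For readers who want to try the selection logic quoted above, here is a minimal sketch that wraps those three lines in a function. The name `tz_abbrev` and its arguments are made up for illustration; the body is the quoted Windows code with a `tz` argument added.

```r
# Hypothetical wrapper around the three lines quoted from system.R.
tz_abbrev <- function(time = Sys.time(), tz = "") {
    z  <- as.POSIXlt(time, tz = tz)
    zz <- attr(z, "tzone")
    # With three elements (zone name, standard abbreviation, DST
    # abbreviation), z$isdst of 0 or 1 selects element 2 or 3.
    if (length(zz) == 3L) zz[2L + z$isdst] else zz[1L]
}

tz_abbrev()                         # abbreviation for the session time zone
tz_abbrev(Sys.time(), tz = "UTC")   # UTC has no daylight saving time
```

On a system like the one described here, where "tzone" is c("", "CET", "CEST"), the first call would be expected to pick "CET" or "CEST" depending on the date.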

[...]
> >> As you say yourself, the above system("... xargs md5sum ...")
> >> using workaround is really too platform specific  but I'd guess
> >> there should be a less error prone way to get the long timezone
> >> name on your system ...
>
> > If I understand the zic(8) man page, the files in /usr/share/zoneinfo
> > should contain this information, but I don't know how to extract it,
> > since these are compiled files.  And since on my system /etc/localtime
> > is a copy of one of these compiled files, I don't know of any other way
> > to recover the location name without comparing it to those files.
>
> >> If that remains "contained" (i.e. small) and works with files
> >> and R's files tools -- e.g. file.*() ones [but not system()],
> >> I'd consider a patch to the above source file
> >> (sent by you to the R-devel mailing list --- or after having
> >> gotten an account there by asking, via bug report & patch
> >> attachment at https://bugs.r-project.org/ )
>
> > If comparing file size sufficed, that would be easy to do in R;
> > unfortunately, it is not sufficient, since some files designating
> > different time zones in /usr/share/zoneinfo do have the same size.  So
> > the only alternative I can think of is to compare bytes, e.g. with
> > md5sum or with cmp.  Is there some way to do this in R without using
> > system()?
>
> Can't you use
>   tz1 <- readBin("/etc/localtime", "raw", 200L)
> plus later
>   tz2 <- gsub(...,  rawToChar(tz1))
>
> on your  /etc/localtime file 
> almost identically as the current code does for "/etc/timezone" ?

Oh, thanks.  I've looked at this code over and over again in the last
few days and somehow still didn't see its usefulness (maybe because I
haven't had occasion to deal with binary data in R till now).  Anyway,
just substituting "/etc/localtime" for "/etc/timezone" doesn't work,
since my /etc/localtime file seems not to hold the timezone location
name in a form recoverable with rawToChar() (all I see are the
abbreviated timezones CEST, CEMT and CET-1CEST); but I can use the raw
bytes to make the comparison with files in /usr/share/zoneinfo.  With
the attached patch, I get both the timezone location name (with
location=TRUE) and the abbreviated timezone (with location=FALSE).  One
thing I wonder about: is looking at just the first 200 bytes guaranteed
to be sufficient, or would it be better to use n=file.size() to examine
the whole file?
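
To make the byte-comparison idea concrete, here is a small sketch. It is not the attached patch: `match_localtime` is a hypothetical name, both paths are parameters so that it can be tried on test directories, and it reads each whole file via file.size() rather than only the first 200 bytes.

```r
# Hypothetical helper: find the zoneinfo file whose bytes are identical
# to those of a copied (not symlinked) /etc/localtime.
match_localtime <- function(localtime = "/etc/localtime",
                            zoneinfo = "/usr/share/zoneinfo") {
    if (!file.exists(localtime) || !dir.exists(zoneinfo))
        return(NA_character_)
    size <- file.size(localtime)
    target <- readBin(localtime, "raw", n = size)
    candidates <- list.files(zoneinfo, recursive = TRUE)
    sizes <- file.size(file.path(zoneinfo, candidates))
    # Cheap filter first: only files of equal size can be byte-identical.
    # which() also drops entries whose size is NA.
    for (name in candidates[which(sizes == size)]) {
        bytes <- readBin(file.path(zoneinfo, name), "raw", n = size)
        if (identical(bytes, target)) return(name)
    }
    NA_character_
}
```

Several names under zoneinfo can share identical contents, so the first match in list.files() order is returned; that name is not necessarily the one the system administrator originally copied.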

Steve Berman

*** datetime.R.orig	2017-10-20 17:15:05.147093873 +0200
--- datetime.R.new	2017-10-20 18:18:58.598972383 +0200
***************
*** 30,54 ****
  lt <- normalizePath("/etc/localtime") # most Linux, macOS, ...
  if (grepl(pat <- "^/usr/share/zoneinfo/", lt) ||
  grepl(pat <- "^/usr/share/zoneinfo.default/", lt)) sub(pat, "", lt)
! else if (lt == "/etc/localtime" && file.exists("/etc/timezone") &&
!  dir.exists("/usr/share/zoneinfo") &&
!  { # Debian etc.
!  info <- file.info(normalizePath("/etc/timezone"),
!extra_cols = FALSE)
!  (!info$isdir && info$size <= 200L)
!  } && {
!  tz1 <- tryCatch(readBin("/etc/timezone", "raw", 200L),
!  error = functio

[Rd] Rscript Bug Report (improper parsing of [args])

2017-10-20 Thread Trevor Davis
Hi,

A user of my `optparse` package discovered a bug in Rscript's parsing of
[args]. (https://github.com/trevorld/optparse/issues/24)

I've reproduced the bug on my machine including compiling and checking the
development version of R.  I couldn't find a mention of it in the Bug
Tracker or New Features.

It can be minimally reproduced on the UNIX command line with the following
commands:

bash$ touch test.R
bash$ Rscript test.R -g 5

WARNING: unknown gui '5', using X11

This is a bug because, according to the documentation in ?Rscript, besides
`-e` the only [options] Rscript should attempt to parse should

1) Come before the file, i.e. `Rscript -g X11 test.R` and not `Rscript
test.R -g X11`
2) Begin with two dashes and not one, i.e. `--` and not `-`, i.e. `Rscript
--gui=X11 test.R` and not `Rscript -g X11 test.R` (although I'm not sure if
the command-line Rscript even needs to support the gui option).
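
The expected behaviour can be sketched in R itself. The function below is purely illustrative: `split_rscript_args` is a made-up name, it is not how Rscript is implemented, and it covers only the `Rscript [options] file [args]` form (it does not model `-e`). It treats leading tokens that start with a dash as options, the first other token as the file, and everything after the file as arguments to be passed through untouched.

```r
# Hypothetical model of the documented rule, file form only.
split_rscript_args <- function(argv) {
    i <- 1L
    options <- character(0)
    # Options are only recognised before the file name.
    while (i <= length(argv) && startsWith(argv[i], "-")) {
        options <- c(options, argv[i])
        i <- i + 1L
    }
    file <- if (i <= length(argv)) argv[i] else NA_character_
    # Everything after the file belongs to the script, dashes or not.
    args <- if (i < length(argv)) argv[(i + 1L):length(argv)] else character(0)
    list(options = options, file = file, args = args)
}

str(split_rscript_args(c("test.R", "-g", "5")))
str(split_rscript_args(c("--vanilla", "test.R", "-g", "5")))
```

Under this reading, `-g 5` after `test.R` ends up in `args` in both calls and would never be interpreted as a gui setting.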

Thanks,

Trevor



[Rd] split() - unexpected sorting of results

2017-10-20 Thread Peter Meissner
Hey,

I found this - for me - quite surprising and puzzling behaviour of split().


split(1:11, as.character(1:11))
split(1:11, 1:11)


When splitting by numerics everything works as expected - sorting of input
== sorting of output -- but when using a character vector everything gets
re-sorted alphabetically.


Although there are some references in the help files to what happens when
using split, I did not find any note on this - for me - rather unexpected
behaviour.


I would like it best if the sorting of split results stayed the same no
matter the input (sorting of input == sorting of output).

If that is not possible, a note of caution in the help pages and maybe an
example might be valuable.


Best, Peter



Re: [Rd] split() - unexpected sorting of results

2017-10-20 Thread Iñaki Úcar
Hi Peter,

2017-10-20 21:33 GMT+02:00 Peter Meissner :
> Hey,
>
> I found this - for me - quite surprising and puzzling behaviour of split().
>
>
> split(1:11, as.character(1:11))
> split(1:11, 1:11)
>
>
> When splitting by numerics everything works as expected - sorting of input
> == sorting of output -- but when using a character vector everything gets
> re-sorted alphabetically.
>
>
> Although, there are some references in the help files to what happens when
> using split, I did not find any note on this - for me - rather unexpected
> behaviour.

As the documentation states,

   f: a ‘factor’ in the sense that ‘as.factor(f)’ defines the
  grouping, or a list of such factors in which case their
  interaction is used for the grouping.

And, in fact,

> as.factor(1:11)
 [1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 2 3 4 5 6 7 8 9 10 11

> as.factor(as.character(1:11))
 [1] 1  2  3  4  5  6  7  8  9  10 11
Levels: 1 10 11 2 3 4 5 6 7 8 9
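
A short check, as a sketch in base R only, ties the two outputs above to split(): the names of the result are the levels of as.factor(f), and for a character vector those levels are the sorted unique values.

```r
x <- as.character(1:11)
f <- as.factor(x)

# For a character vector the levels are the sorted unique strings ...
identical(levels(f), sort(unique(x)))

# ... and split() names its result by those levels, in that order.
identical(names(split(1:11, x)), levels(f))
```

Both comparisons should give TRUE, which is why the character case comes back in string order rather than input order.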

Regards,
Iñaki

> I would like it best if the sorting of split results stayed the same no
> matter the input (sorting of input == sorting of output)
>
> If that is not possible, a note of caution in the help pages and maybe an
> example might be valuable.
>
>
> Best, Peter
>

Re: [Rd] split() - unexpected sorting of results

2017-10-20 Thread Peter Meissner
Thanks, for the explanation.

Still, I think this is surprising behaviour which might be handled better.

Best, Peter

On 20.10.2017 at 9:49 PM, "Iñaki Úcar" wrote:

> Hi Peter,
>
> 2017-10-20 21:33 GMT+02:00 Peter Meissner :
> > Hey,
> >
> > I found this - for me - quite surprising and puzzling behaviour of
> > split().
> >
> >
> > split(1:11, as.character(1:11))
> > split(1:11, 1:11)
> >
> >
> > When splitting by numerics everything works as expected - sorting of
> > input == sorting of output -- but when using a character vector
> > everything gets re-sorted alphabetically.
> >
> >
> > Although, there are some references in the help files to what happens
> > when using split, I did not find any note on this - for me - rather
> > unexpected behaviour.
>
> As the documentation states,
>
>    f: a ‘factor’ in the sense that ‘as.factor(f)’ defines the
>       grouping, or a list of such factors in which case their
>       interaction is used for the grouping.
>
> And, in fact,
>
> > as.factor(1:11)
>  [1] 1  2  3  4  5  6  7  8  9  10 11
> Levels: 1 2 3 4 5 6 7 8 9 10 11
>
> > as.factor(as.character(1:11))
>  [1] 1  2  3  4  5  6  7  8  9  10 11
> Levels: 1 10 11 2 3 4 5 6 7 8 9
>
> Regards,
> Iñaki
>
> > I would like it best if the sorting of split results stayed the same no
> > matter the input (sorting of input == sorting of output)
> >
> > If that is not possible, a note of caution in the help pages and maybe an
> > example might be valuable.
> >
> >
> > Best, Peter
> >

Re: [Rd] split() - unexpected sorting of results

2017-10-20 Thread Hervé Pagès

Hi,

On 10/20/2017 12:53 PM, Peter Meissner wrote:

> Thanks, for the explanation.
>
> Still, I think this is surprising behaviour which might be handled better.


Maybe a little surprising, but no more than:

> x <- sample(11L)

> sort(x)
 [1]  1  2  3  4  5  6  7  8  9 10 11

> sort(as.character(x))
 [1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

The fact that sort(), as.factor(), split() and many other things behave
consistently with respect to the underlying order of character vectors
avoids other even bigger surprises.

Also note that the underlying order of character vectors actually
depends on your locale. One way to guarantee consistent results across
platforms/locales is by explicitly specifying the levels when making
a factor e.g.

  f <- factor(x, levels=unique(x))
  split(1:11, f)

This is particularly sensible when writing unit tests.
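
As a sketch of what such a unit test might look like (the sample vector is made up for illustration), the check below passes whatever the collation rules of the current locale are, because the level order is given explicitly instead of being derived by sorting:

```r
# Mixed-case letters and digit strings sort differently across locales,
# so they make a good test case for order-sensitive code.
x <- c("b", "B", "a", "10", "9")

groups <- split(seq_along(x), factor(x, levels = unique(x)))

# Groups come back in order of first appearance, independent of locale.
stopifnot(identical(names(groups), x))
```

With the default `split(seq_along(x), x)` the same assertion could pass or fail depending on the locale the test happens to run in.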

Cheers,
H.




--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319


Re: [Rd] split() - unexpected sorting of results

2017-10-20 Thread Rui Barradas

Hello,

To solve the problem of sorting numbers that are stored as character 
strings, the stringr package provides the functions str_sort and str_order.


library(stringr)

set.seed(2447)

x <- sample(11L)
sort(as.character(x))
#[1] "1"  "10" "11" "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"

str_sort(as.character(x), numeric = TRUE)
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"

str_order(as.character(x), numeric = TRUE)
#[1]  1  4 11  8  6  5  3 10  9  7  2

i <- str_order(as.character(x), numeric = TRUE)
as.character(x)[i]
#[1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11"


Unfortunately this does not answer the OP's question: factor(), 
as.factor(), split() and others use the base R sorter, and this can only 
be changed by changing their sources.
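
That said, the default order can be overridden for an individual call by passing explicit levels, as suggested earlier in this thread. A minimal base R sketch, assuming every string in x parses as a number:

```r
set.seed(2447)
x <- as.character(sample(11L))

# Order the distinct strings by their numeric value and use that
# as the level order instead of the default string sort.
ux  <- unique(x)
lev <- ux[order(as.numeric(ux))]

groups <- split(seq_along(x), factor(x, levels = lev))
names(groups)
```

The names then run from "1" to "11" in numeric order; str_order(ux, numeric = TRUE) could be used in place of order(as.numeric(ux)) when the strings are not purely numeric.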


Hope this helps,

Rui Barradas
