Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?

2021-04-28 Thread Tomas Kalibera

Hi Toby,

a defensive, portable approach would be to use only file names regarded 
portable by POSIX, so characters including ASCII letters, digits, 
underscore, dot, hyphen (but hyphen should not be the first character). 
That would always work on all systems and this is what I would use.
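
A minimal sketch of that rule, assuming one wants to check names before using them. The helper name is_portable_name is hypothetical and not part of R:

```r
# Hypothetical helper: TRUE when every character of the file name is in the
# POSIX portable set (ASCII letters, digits, underscore, dot, hyphen) and the
# name does not start with a hyphen. perl = TRUE is used so that the ranges
# A-Z and a-z are interpreted by code point rather than by locale collation.
is_portable_name <- function(x) {
  grepl("^[A-Za-z0-9._][A-Za-z0-9._-]*$", basename(x), perl = TRUE)
}

is_portable_name(c("data_01.csv", "-bad", "report final.txt"))
```

Only the last path component is checked here; directory names would need the same treatment.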


Individual operating systems and file systems and their configurations 
differ in which additional characters they support and how. On some, 
file names are just sequences of bytes, on some, they have to be valid 
strings in certain encoding (and then with certain exceptions).


On Windows, file names are at the lowest level in UTF-16LE encoding (and 
admitting unpaired surrogates for historical reasons). R stores strings 
in other encodings (UTF-8, native, Latin-1), so file names have to be 
translated to/from UTF-16LE, either directly by R or by Windows.
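
To illustrate the translation step, here is a small sketch that converts the first emoji of the example from UTF-8 to UTF-16LE by hand. It assumes the platform's iconv supports the name "UTF-16LE", and it only mimics the kind of conversion described above, not the code R uses internally:

```r
# The first emoji from the example, written as its UTF-8 bytes.
x <- "\360\237\247\222"
Encoding(x) <- "UTF-8"   # declare the bytes to be UTF-8

charToRaw(x)             # the four UTF-8 bytes: f0 9f a7 92

# Translate to UTF-16LE; toRaw = TRUE returns the bytes as a raw vector
# (inside a list), because UTF-16 text generally contains zero bytes.
utf16 <- iconv(x, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
utf16                    # 3e d8 d2 dd
```

The conversion is only possible because the string is declared to be UTF-8; without a usable declared encoding there is nothing to convert from.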


However, there is no way to convert non-ASCII strings in "C" encoding to 
UTF-16LE, so the examples cannot be made to work on Windows.


When the translation is left to Windows, it assumes that strings not in 
UTF-16LE are in the Active Code Page encoding (shown as "system code page" 
in sessionInfo() in R, Latin-1 in your example) instead of the current C 
library encoding ("C" in your example). So, file names coming from 
Windows will be either the bytes of their UTF-16LE representation or the 
bytes of their Latin-1 representation, but which one depends on 
implementation details, so the result is really unusable.


I would say using "C" as encoding in R is not a good idea, and 
particularly not on Windows.


I would say that what happens with such file names in "C" encoding is 
unspecified behavior, which is subject to change at any time without 
notice, and that both the R 4.0.5 and R-devel behavior you are observing 
are acceptable. I don't think it should be mentioned in the NEWS. 
Personally, I would prefer some stricter checks of string validity and 
perhaps disallowing the "C" encoding in R. That would be yet another 
behavior, but one where it would be clearer that this cannot really work. 
It would require more thought and effort, though.


Best
Tomas


On 4/27/21 9:53 PM, Toby Hocking wrote:


Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in
R-devel already. I checked on
https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no
mention of these changes, so I'm wondering if they are intentional? If so,
could someone please add a mention of the bugfix in the NEWS?

The problem involves file.exists on Windows, when the Encoding of a
long/strange input file name is "unknown", in the C locale. I expected FALSE
to be returned (and it is on R-devel), but I got an error in R-4.0.5. Code to
reproduce is:

x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
\360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n|
\360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
Encoding(x) <- "unknown"
Sys.setlocale(locale="C")
sessionInfo()
file.exists(x)

Output I got from R-4.0.5 was


sessionInfo()

R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] C
system code page: 1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.0.5

file.exists(x)

Error in file.exists(x) : file name conversion problem -- name too long?
Execution halted

Output I got from R-devel was


sessionInfo()

R Under development (unstable) (2021-04-26 r80229)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.2.0

file.exists(x)

[1] FALSE

I also observed similar results when using normalizePath instead of
file.exists (error in R-4.0.5, no error in R-devel).


normalizePath(x) #R-4.0.5

Error in path.expand(path) : unable to translate 'p'
| p'p;
| p'p<
| p'p=
| p'p>
| p'p
' to UTF-8
Calls: normalizePath -> path.expand
Execution halted


normalizePath(x) #R-devel

[1] "C:\\Users\\th798\\R\\\360\237\247\222\n|
\360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n|
\360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n|
\360\237\247\222\360\237\217\277\n"
Warning message:
In normalizePath(path.expand(path), winslash, mustWork) : path[1]="🧒
| 🧒🏻
| 🧒🏼
| 🧒🏽
| 🧒🏾
| 🧒🏿
": The filename, directory name, or volume label syntax is incorrect

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




[Rd] New post not readable

2021-04-28 Thread Lluís Revilla
Hi all,

It has come to my attention that there is a new post on The R blog: "R
Can Use Your Help: Testing R Before Release".
However, the link returns an error "Not found":
https://developer.r-project.org/Blog/public/2021/04/28/r-can-use-your-help-testing-r-before-release/index.html
Hope this mailing list is the right place to make it known to the authors.

Maybe this new content could be announced on the R-announcement
mailing list? For others interested I created a Twitter account that
uses The R blog's RSS feed to announce new entries: R_dev_news.

Looking forward to reading the new post.
Cheers,

Lluís



Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?

2021-04-28 Thread Toby Hocking
Hi Tomas, thanks for the thoughtful reply. That makes sense about the
problems with C locale on windows. Actually I did not choose to use C
locale, but instead it was invoked automatically during a package check.
To be clear, I do NOT have a file with that name, but I do want file.exists
to return a reasonable value, FALSE (with no error). If that behavior is
unspecified, then should I use something like tryCatch(file.exists(x),
error=function(e)FALSE) instead of assuming that file.exists will always
return a logical vector without error? For my particular application that
work-around should probably be sufficient, but one may imagine a situation
where you want to do

x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
\360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n|
\360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
Encoding(x) <- "unknown"
Sys.setlocale(locale="C")
f <- tempfile()
cat("", file = f)
two <- c(x, f)
file.exists(two)

and in that case the correct response from R, in my opinion, would be
c(FALSE, TRUE) -- not an error.
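
A minimal sketch of such a work-around for the vector case, assuming that mapping any error to FALSE is acceptable for the application. The name file_exists_safe is hypothetical:

```r
# Hypothetical wrapper: call file.exists() on one path at a time and map any
# error to FALSE, so that one problematic name cannot abort the whole call.
file_exists_safe <- function(paths) {
  vapply(
    paths,
    function(p) tryCatch(file.exists(p), error = function(e) FALSE),
    logical(1),
    USE.NAMES = FALSE
  )
}

f <- tempfile()
cat("", file = f)          # create an empty file that does exist
missing <- tempfile()      # a fresh name that has not been created
file_exists_safe(c(f, missing))
```

Calling file.exists() once per element is slower than a single vectorized call, which should not matter for a handful of paths.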
Toby


Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?

2021-04-28 Thread Martin Maechler
> Toby Hocking 
> on Wed, 28 Apr 2021 07:21:05 -0700 writes:

> Hi Tomas, thanks for the thoughtful reply. That makes sense about the
> problems with C locale on windows. Actually I did not choose to use C
> locale, but instead it was invoked automatically during a package check.
> To be clear, I do NOT have a file with that name, but I do want file.exists
> to return a reasonable value, FALSE (with no error). If that behavior is
> unspecified, then should I use something like tryCatch(file.exists(x),
> error=function(e)FALSE) instead of assuming that file.exists will always
> return a logical vector without error? For my particular application that
> work-around should probably be sufficient, but one may imagine a situation
> where you want to do

> x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
> \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n|
> \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> Encoding(x) <- "unknown"
> Sys.setlocale(locale="C")
> f <- tempfile()
> cat("", file = f)
> two <- c(x, f)
> file.exists(two)

> and in that case the correct response from R, in my opinion, would be
> c(FALSE, TRUE) -- not an error.
> Toby

Indeed, thanks a lot to Tomas!

# A remark 
We *could* -- and according to my taste should -- try to have file.exists()
return a logical vector in almost all cases, namely, e.g., still give an
error for file.exists(pi) :
Notably  if  `c(...)`  {for the  `...`  arguments of file.exists() }
is a character vector, always return a logical vector of the same
length, *and* we could notably make use of the fact that R's
logical type is not binary but ternary, and hence that return
value could contain values from {TRUE, NA, FALSE}  and interpret NA
as "don't know" in all cases where the corresponding string in
the input had an Encoding(.) that was "fishy" in some sense
given the "context" (OS, locale, OS_version, ICU-presence, ...).

In particular, when the underlying code sees encoding-translation issues
for a string,  NA  would be returned instead of an error.
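
A rough sketch of these semantics, written as a wrapper around the existing function rather than as a change to file.exists() itself. The name file_exists_na is hypothetical, and validEnc() is only one possible notion of "fishy": it treats unmarked strings as valid unless the session is in a multi-byte locale, so it would not flag every case discussed in this thread:

```r
# Hypothetical sketch: non-character input is still an error, strings that
# are invalid in their declared encoding give NA ("don't know"), and all
# other strings are passed on to file.exists().
file_exists_na <- function(paths) {
  if (!is.character(paths)) stop("invalid 'paths' argument")
  res <- rep(NA, length(paths))
  ok <- validEnc(paths)
  res[ok] <- file.exists(paths[ok])
  res
}

bad <- rawToChar(as.raw(c(0xff, 0xfe)))  # bytes that are not valid UTF-8
Encoding(bad) <- "UTF-8"                 # ... but declared to be UTF-8
f <- tempfile()
cat("", file = f)
file_exists_na(c(bad, f))
```

The result always has the same length as the input, which is the property asked for above.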

Martin


Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?

2021-04-28 Thread Tomas Kalibera
Hi Toby,

On 4/28/21 4:21 PM, Toby Hocking wrote:
> Hi Tomas, thanks for the thoughtful reply. That makes sense about the 
> problems with C locale on windows. Actually I did not choose to use C 
> locale, but instead it was invoked automatically during a package check.

I see, as long as the tests only have ASCII strings, the encoding does 
not matter, but once there are also other characters, I think we should 
be running with some real encoding, and one where the characters can be 
represented.
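
One way to act on this in a package, sketched under the assumption that non-ASCII checks should simply be skipped when the native encoding is not UTF-8. The helper name native_is_utf8 is hypothetical:

```r
# Hypothetical guard for tests or examples that contain non-ASCII characters:
# run them only when the session's native encoding is UTF-8, where every
# Unicode character can be represented.
native_is_utf8 <- function() isTRUE(l10n_info()[["UTF-8"]])

if (native_is_utf8()) {
  message("native encoding is UTF-8: running non-ASCII checks")
} else {
  message("skipping non-ASCII checks in locale: ", Sys.getlocale("LC_CTYPE"))
}
```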

Best,
Tomas


Re: [Rd] New post not readable

2021-04-28 Thread Martin Maechler
> Lluís Revilla 
> on Wed, 28 Apr 2021 15:19:53 +0200 writes:

> Hi all,

> It has come to my attention that there is a new post on The R blog: "R
> Can Use Your Help: Testing R Before Release".
> However, the link returns an error "Not found":
> https://developer.r-project.org/Blog/public/2021/04/28/r-can-use-your-help-testing-r-before-release/index.html
> Hope this mailing list is the right place to make it known to the authors.

yes

> Maybe this new content could be announced on the R-announcement
> mailing list?

> For others interested I created a Twitter account that
> uses The R blog's RSS feed to announce new entries: R_dev_news.

Well, there's @_R_Foundation, the posts of which are
automatically embedded / mirrored on R's home page https://www.r-project.org/
and where you can see the previous Blog post still being
announced ... but not the new one, probably because of the
error (some files not committed, I guess) that you mentioned
above.

So, maybe you should remove R_dev_news or at least mention that
@_R_Foundation is the official 'R Foundation' Twitter account
*and* that it also uses the Blog feed.

> Looking forward to reading the new post.

Me too,  thank you  Lluís , for the "heads up"!
Martin

> Cheers,
> Lluís



Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?

2021-04-28 Thread Tomas Kalibera



On 4/28/21 5:22 PM, Martin Maechler wrote:


> Indeed, thanks a lot to Tomas!
>
> # A remark
> We *could* -- and according to my taste should -- try to have file.exists()
> return a logical vector in almost all cases, namely, e.g., still give an
> error for file.exists(pi) :
> Notably  if  `c(...)`  {for the  `...`  arguments of file.exists() }
> is a character vector, always return a logical vector of the same
> length, *and* we could notably make use of the fact that R's
> logical type is not binary but ternary, and hence that return
> value could contain values from {TRUE, NA, FALSE}  and interpret NA
> as "don't know" in all cases where the corresponding string in
> the input had an Encoding(.) that was "fishy" in some sense
> given the "context" (OS, locale, OS_version, ICU-presence, ...).
>
> In particular, when the underlying code sees encoding-translation issues
> for a string,  NA  would be returned instead of an error.


Yes, I agree with Toby and you that there is benefit in allowing 
per-element, vectorized use of file.exists(), and, well, that is already 
the case now: we just fall back to FALSE. NA might be better in case of an 
error that prevents the function from deciding whether the file exists or 
not (e.g. an invalid name in a form that makes it clear that such a file 
cannot exist might be a different case...).


But the only way to get a translation error is by passing a string to 
file.exists() which is invalid in its declared encoding (or which is in 
"C" encoding). I would hope that we could get to the point where such a 
situation is prevented (we only allow creation of strings that can be 
translated to Unicode). If we get there, the example would fail with an 
error (and already before getting to file.exists()).


My point that I would not write tests of this behavior stands. One 
should not use such file names: after the change Toby reported from 
ERROR to FALSE, Martin's proposal would change the result to NA, mine 
eventually back to ERROR, etc. So it is best for now to leave it 
unspecified and not trigger it, I think.


Tomas




Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?

2021-04-28 Thread Toby Hocking
+1 for Martin's proposal, that makes sense to me too.
About Tomas' idea to immediately stop with an error when the user tries to
create a string which is invalid in its declared encoding, that sounds
great. I'm just wondering if that would break my application. My package is
running an example during a check, in which the unicode/emoji is read into
R using readLines from a file under inst/extdata, so presumably it should
work as long as readLines handles the encoding correctly and/or the locale
during package check is changed to something more reasonable on windows?


Re: [Rd] NEWS item for bugfix in normalizePath and file.exists?

2021-04-28 Thread Tomas Kalibera

On 4/28/21 6:20 PM, Toby Hocking wrote:

> +1 for Martin's proposal, that makes sense to me too.
> About Tomas' idea to immediately stop with an error when the user tries to
> create a string which is invalid in its declared encoding, that sounds
> great. I'm just wondering if that would break my application. My package is
> running an example during a check, in which the unicode/emoji is read into
> R using readLines from a file under inst/extdata, so presumably it should
> work as long as readLines handles the encoding correctly and/or the locale
> during package check is changed to something more reasonable on windows?


Once we have UTF-8 as native encoding on Windows, things like this 
should work reliably. It should be already the case with the 
experimental UCRT builds.


Even in the MSVCRT/official builds, things like this can work on Windows 
in some cases, depending on whether they trigger translation to native 
encoding. E.g. readLines() with the encoding="UTF-8" argument would 
produce strings flagged as UTF-8, which can be translated to UTF-16LE 
if they are valid. Some file operations on Windows work with UTF-8 
pathnames and avoid translation to native encoding, but not all, and 
instead of investing effort into fixing more of them, I think we should 
invest in switching to UTF-8 as the native encoding.
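
As a rough illustration of the readLines() point, here is a minimal sketch. 
The temporary file and the choice of the UTF-8 bytes of U+1F9D2 are 
assumptions made for this example only; they are not taken from Toby's 
package.

```r
# Write the four UTF-8 bytes of U+1F9D2 plus a newline to a temporary file
# (hypothetical test data, written as raw bytes to avoid any translation).
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0xf0, 0x9f, 0xa7, 0x92, 0x0a)), tmp)

# Declare the input as UTF-8 so the resulting string is flagged accordingly.
x <- readLines(tmp, encoding = "UTF-8")

Encoding(x)   # the declared encoding of the string
validUTF8(x)  # whether the bytes form valid UTF-8
unlink(tmp)
```

A string that is both flagged as UTF-8 and valid is the kind that can be 
translated to UTF-16LE without involving the native encoding.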


Actually, by using emoji you may also trigger bugs where supplementary 
characters are not supported on Windows. This remains relevant after 
the switch to UTF-8 as native encoding, so it needs to be fixed fully, 
and there have been some improvements recently.
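
To make "supplementary character" concrete, the following sketch (an 
illustration only, using the first emoji from Toby's example string as an 
assumed input) shows that its code point lies above U+FFFF and therefore 
needs a surrogate pair in UTF-16:

```r
# The bytes \360\237\247\222 from the example, i.e. F0 9F A7 92 in hex.
child <- rawToChar(as.raw(c(0xf0, 0x9f, 0xa7, 0x92)))
Encoding(child) <- "UTF-8"

cp <- utf8ToInt(child)   # Unicode code point as an integer
cp > 0xFFFF              # TRUE means it is outside the Basic Multilingual Plane

# Standard UTF-16 surrogate pair computation for code points above U+FFFF.
offset <- cp - 0x10000L
high <- 0xD800L + bitwShiftR(offset, 10L)
low  <- 0xDC00L + bitwAnd(offset, 0x3FFL)
sprintf("U+%X is encoded in UTF-16 as %04X %04X", cp, high, low)
```

Code that assumes one UTF-16 unit per character mishandles such pairs.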


Tomas



On Wed, Apr 28, 2021 at 9:04 AM Tomas Kalibera 
wrote:


On 4/28/21 5:22 PM, Martin Maechler wrote:

Toby Hocking
  on Wed, 28 Apr 2021 07:21:05 -0700 writes:

  > Hi Tomas, thanks for the thoughtful reply. That makes sense about the
  > problems with C locale on windows. Actually I did not choose to use C
  > locale, but instead it was invoked automatically during a package check.
  > To be clear, I do NOT have a file with that name, but I do want file.exists
  > to return a reasonable value, FALSE (with no error). If that behavior is
  > unspecified, then should I use something like tryCatch(file.exists(x),
  > error=function(e)FALSE) instead of assuming that file.exists will always
  > return a logical vector without error? For my particular application that
  > work-around should probably be sufficient, but one may imagine a situation
  > where you want to do

  > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
  > \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n|
  > \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
  > Encoding(x) <- "unknown"
  > Sys.setlocale(locale="C")
  > f <- tempfile()
  > cat("", file = f)
  > two <- c(x, f)
  > file.exists(two)

  > and in that case the correct response from R, in my opinion, would be
  > c(FALSE, TRUE) -- not an error.
  > Toby

Indeed, thanks a lot to Tomas!

# A remark
We *could* -- and according to my taste should -- try to have
file.exists() return a logical vector in almost all cases, namely, e.g.,
still give an error for file.exists(pi) :
Notably  if  `c(...)`  {for the  `...`  arguments of file.exists() }
is a character vector, always return a logical vector of the same
length, *and* we could notably make use of the fact that R's
logical type is not binary but ternary, and hence that return
value could contain values from {TRUE, NA, FALSE}  and interpret NA
as "don't know" in all cases where the corresponding string in
the input had an Encoding(.) that was "fishy" in some sense
given the "context" (OS, locale, OS_version, ICU-presence, ...).

In particular, when the underlying code sees encoding-translation issues
for a string,  NA  would be returned instead of an error.
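
A user-level approximation of this idea can be sketched as below. This is 
not how file.exists() is or would be implemented; safe_file_exists() is a 
hypothetical helper name, and it assumes a version of R in which 
file.exists() signals an error for strings it cannot translate.

```r
# Hypothetical wrapper: one logical value per input path, NA for "don't know".
safe_file_exists <- function(paths) {
  # Keep the error for non-character input such as file.exists(pi).
  stopifnot(is.character(paths))
  vapply(
    paths,
    function(p) tryCatch(file.exists(p), error = function(e) NA),
    logical(1),
    USE.NAMES = FALSE
  )
}

f <- tempfile()
cat("", file = f)
safe_file_exists(c(f, tempfile()))
unlink(f)
```

Checking each element separately means one problematic string does not 
prevent answers for the other elements, at the cost of one call per path.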

Yes, I agree with Toby and you that there is benefit in allowing
per-element, vectorized use of file.exists(), and well, that is the case
now, we just fall back to FALSE. NA might be better in case of an error
that prevents the function from deciding whether the file exists or not
(e.g. an invalid name in a form that makes it clear such a file cannot
exist might be a different case...).

But, the only way to get a translation error is by passing a string to
file.exists() which is invalid in its declared encoding (or which is in
"C" encoding). I would hope that we could get to the point where such
situation is prevented (we only allow creation of strings that can be
translated to Unicode). If we get there, the example would fail with
error (yet, right, before getting to file.exists()).

My point that I would not write tests of this behavior stands. One
should not use such file names, and after the change Toby reported from
ERROR to FALSE, Martin's proposal would change to NA, mine eventually to
ERROR, 

Re: [Rd] reshape documentation

2021-04-28 Thread Deepayan Sarkar
On Sat, Apr 17, 2021 at 7:07 PM SOEIRO Thomas  wrote:
>
> Dear Deepayan,
>
> I do not have further suggestions, but I just wanted to thank you for taking 
> the time to
> improve the documentation so much! (and for adding support for specifying 
> "varying" as
> a vector)
>
> Both "Typical usage" and the details are useful additions. Adding a vignette 
> also seems
> an excellent idea.

Thanks for checking. I have also finally added a vignette, do let me
know if you see anything that can be improved.

Best,
-Deepayan

>
> These changes will probably help numerous users.
>
> Best,
>
> Thomas
>
>
>
>
> On Wed, Mar 17, 2021 at 7:55 PM Michael Dewey  
> wrote:
> >
> > Comments in line
> >
> > On 13/03/2021 09:50, SOEIRO Thomas wrote:
> > > Dear list,
> > >
> > > I have some questions/suggestions about reshape.
> > >
> > > 1) I think a good amount of the popularity of base::reshape alternatives 
> > > is due to the complexity of the reshape documentation. It is quite hard 
> > > (at least it is for me) to figure out which arguments are needed for 
> > > "long to wide" and "wide to long" respectively, because reshapeWide and 
> > > reshapeLong are documented together.
> > > - Do you agree with this?
> > > - Would you consider a proposal to modify the documentation?
> > > - If yes, what approach do you suggest? e.g. split in two pages?
> >
> > The current documentation is much clearer than it was when I first
> > started using R but we should always strive for more.
> >
> > I would suggest leaving the documentation in one place but it might be
> > helpful to add which direction is relevant for each parameter by placing
> > (to wide) or (to long) as appropriate. I think having completely
> > separate lists is not needed
>
> I have just checked in some updates to the documentation (in R-devel)
> which hopefully makes usage clearer. Any further suggestions are
> welcome. We are planning to add a short vignette as well, hopefully in
> time for R 4.1.0.
>
> > > 2) I do not think the documentation indicates that we can use varying 
> > > argument to rename variables in reshapeWide.
> > > - Is this worth documenting?
> > > - Is the construct list(c()) really needed?
> >
> > Yes, because you may have more than one set of variables which need to
> > correspond to a single variable in long format. So in your example if
> > you also had 11 variables for the temperature as well as the
> > concentration each would need specifying as a separate vector in the list.
>
> That's a valid point, but on the other hand, direction="long" already
> supports specifying 'varying' as a vector, and it does simplify the
> single variable case. So we decided to be consistent and allow it for
> direction="wide" too, hopefully with loud enough warnings in the
> documentation about using the feature carelessly.
>
> Best,
> -Deepayan
>
> > Michael
> >
> > >
> > > reshape(Indometh,
> > >  v.names = "conc",
> > >  idvar = "Subject",
> > >  timevar = "time",
> > >  direction = "wide",
> > >  varying = list(c("conc_0.25hr",
> > >   "conc_0.5hr",
> > >   "conc_0.75hr",
> > >   "conc_1hr",
> > >   "conc_1.25hr",
> > >   "conc_2hr",
> > >   "conc_3hr",
> > >   "conc_4hr",
> > >   "conc_5hr",
> > >   "conc_6hr",
> > >   "conc_8hr")))
> > >
> > > Thanks,
> > >
> > > Thomas
> > > __
> > > R-devel using r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> > >
> >
> > --
> > Michael
> > http://www.dewey.myzen.co.uk/home.html
> >
> > __
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
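
For readers following the quoted discussion of 'varying' with 
direction="wide", here is a compact sketch of the renaming use case. It 
uses the list(c()) form from Thomas's example, since the vector form is 
only described above as a recent R-devel change. The new_names vector is 
built programmatically as an assumption of this sketch, and as.data.frame() 
is used only to drop the groupedData classes of Indometh.

```r
# Indometh has one row per Subject and time; build one column name per time.
times <- sort(unique(Indometh$time))
new_names <- paste0("conc_", times, "hr")  # e.g. "conc_0.25hr", "conc_1hr"

wide <- reshape(as.data.frame(Indometh),
                v.names   = "conc",
                idvar     = "Subject",
                timevar   = "time",
                direction = "wide",
                varying   = list(new_names))

names(wide)
dim(wide)
```

Each element of the list corresponds to one variable in v.names, which is 
why a second measured variable would need its own vector of names.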

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] as.list fails on functions with S3 classes

2021-04-28 Thread Antoine Fabri
Dear R devel,

as.list() can be used on functions, but not if they have an S3 class that
doesn't include "function".

See below :

```r
add1 <- function(x) x+1

as.list(add1)
#> $x
#>
#>
#> [[2]]
#> x + 1

class(add1) <- c("function", "foo")

as.list(add1)
#> $x
#>
#>
#> [[2]]
#> x + 1

class(add1) <- "foo"

as.list(add1)
#> Error in as.vector(x, "list"): cannot coerce type 'closure' to vector of type 'list'

as.list.function(add1)
#> $x
#>
#>
#> [[2]]
#> x + 1
```

In the failing case the argument is dispatched to as.list.default instead
of as.list.function.

(1) Shouldn't it be dispatched to as.list.function ?

(2) Shouldn't all generics when applied on an object of type closure fall
back to the `fun.function` method  before falling back to the `fun.default`
method ?

Best regards,

Antoine

[[alternative HTML version deleted]]



Re: [Rd] as.list fails on functions with S3 classes

2021-04-28 Thread Gabriel Becker
Hi Antoine,

I would say this is the correct behavior. S3 dispatch is solely (so far as
I know?) concerned with the "actual classes" on the object. This is because
S3 classes act as labels that inform dispatch what, and in what order,
methods should be applied. You took the function class (ie label) off of
your object, which means that in the S3 sense, that object is no longer a
function and dispatching to function methods for it would be incorrect.
This is independent of whether the object is still callable "as a function".

The analogous case for non-closures to what you are describing would be for
S3 to check mode(x) after striking out with class(x) to find relevant
methods. I don't think that would be appropriate.

Also, as an aside, if you want your class to override methods that exist
for function you would want to set the class to c("foo", "function"), not
c("function", "foo"), as you had it in your example.
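
A small sketch of that ordering follows. The method as.list.foo is 
hypothetical and exists only to show which method is reached first; it is 
not part of Antoine's code.

```r
add1 <- function(x) x + 1

# "foo" comes first so its methods take precedence; "function" is kept so
# that methods for functions remain reachable.
class(add1) <- c("foo", "function")

# Hypothetical method for the new class that defers to the next class.
as.list.foo <- function(x, ...) {
  message("as.list.foo called, delegating with NextMethod()")
  NextMethod()
}

res <- as.list(add1)  # runs as.list.foo, then as.list.function
str(res)
```

With class(add1) <- "foo" alone, the NextMethod() call would go to 
as.list.default instead and fail as in the original report.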

Best,
~G



On Wed, Apr 28, 2021 at 1:45 PM Antoine Fabri 
wrote:

> Dear R devel,
>
> as.list() can be used on functions, but not if they have a S3 class that
> doesn't include "function".
>
> See below :
>
> ```r
> add1 <- function(x) x+1
>
> as.list(add1)
> #> $x
> #>
> #>
> #> [[2]]
> #> x + 1
>
> class(add1) <- c("function", "foo")
>
> as.list(add1)
> #> $x
> #>
> #>
> #> [[2]]
> #> x + 1
>
> class(add1) <- "foo"
>
> as.list(add1)
> #> Error in as.vector(x, "list"): cannot coerce type 'closure' to vector of type 'list'
>
> as.list.function(add1)
> #> $x
> #>
> #>
> #> [[2]]
> #> x + 1
> ```
>
> In failing case the argument is dispatched to as.list.default instead of
> as.list.function.
>
> (1) Shouldn't it be dispatched to as.list.function ?
>
> (2) Shouldn't all generics when applied on an object of type closure fall
> back to the `fun.function` method  before falling back to the `fun.default`
> method ?
>
> Best regards,
>
> Antoine



Re: [Rd] as.list fails on functions with S3 classes

2021-04-28 Thread brodie gaslam via R-devel


> On Wednesday, April 28, 2021, 5:16:20 PM EDT, Gabriel Becker 
>  wrote:
>
> Hi Antoine,
>
> I would say this is the correct behavior. S3 dispatch is solely (so far as
> I know?) concerned with the "actual classes" on the object. This is because
> S3 classes act as labels that inform dispatch what, and in what order,
> methods should be applied. You took the function class (ie label) off of
> your object, which means that in the S3 sense, that object is no longer a
> function and dispatching to function methods for it would be incorrect.
> This is independent of whether the object is still callable "as a function".
>
> The analogous case for non-closures to what you are describing would be for
> S3 to check mode(x) after striking out with class(x) to find relevant
> methods. I don't think that would be appropriate.

I would think of the general case to be to check `class(unclass(x))` on
strike-out.  This would then include things such as "matrix", etc.
Dispatching on the implicit class as fallback seems like a natural thing
to do in a language that dispatches on implicit class when there is none.
After all, once you've struck out of your explicit classes, you have
none left!
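
To make the implicit class concrete, here is a short sketch. The helper 
as_list_fallback() is hypothetical and only emulates, at user level, the 
fallback described above; it is not something R does during dispatch.

```r
add1 <- function(x) x + 1
class(add1) <- "foo"

class(add1)           # the explicit class: "foo"
class(unclass(add1))  # the implicit class once the attribute is removed

m <- structure(matrix(1:4, 2), class = "bar")
class(unclass(m))     # implicit class of the underlying matrix

# Hypothetical helper: try normal dispatch, then retry on the unclassed object.
as_list_fallback <- function(x) {
  tryCatch(as.list(x), error = function(e) as.list(unclass(x)))
}

str(as_list_fallback(add1))
```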

This does happen naturally in some places (e.g. interacting with a
data.frame as a list), and is quite delightful (usually).  I won't get
into an argument of what the documentation states or whether any changes
should be made, but to me the fact that dispatch doesn't end with the
implicit class feels like a logical wrinkle.  Yes, I can twist my brain to
see how it can be made to make sense, but I don't like it.

A fun past conversation on this very topic:

https://stat.ethz.ch/pipermail/r-devel/2019-March/077457.html

Best,

B.

> Also, as an aside, if you want your class to override methods that exist
> for function you would want to set the class to c("foo", "function"), not
> c("function", "foo"), as you had it in your example.
>
> Best,
> ~G



Re: [Rd] as.list fails on functions with S3 classes

2021-04-28 Thread Gabriel Becker
On Wed, Apr 28, 2021 at 6:04 PM brodie gaslam 
wrote:

>
> > On Wednesday, April 28, 2021, 5:16:20 PM EDT, Gabriel Becker <
> gabembec...@gmail.com> wrote:
> >
>
> > The analogous case for non-closures to what you are describing would be
> for
> > S3 to check mode(x) after striking out with class(x) to find relevant
> > methods. I don't think that would be appropriate.
>
> I would think of the general case to be to check `class(unclass(x))` on
> strike-out.


To me the general case is writing a robust default method that covers
whatever class(unclass(x)) would be. When you give an object a new
S3 class, you have the option of extending (c("newclass", "oldclass")) and
"not extending" (just "newclass"), and it certainly doesn't seem to me that
these two should behave the same. Perhaps others disagree.


>   This would then include things such as "matrix", etc.
> Dispatching on the implicit class as fallback seems like a natural thing
> to do in a language that dispatches on implicit class when there is none.
> After all, once you've struck out of your explicit classes, you have
> none left!
>
> This does happen naturally in some places (e.g. interacting with a
> data.frame as a list), and is quite delightful (usually).


So I don't know of any places where this happens *in the S3 dispatch sense*.
There are certainly places where the default method supports lists, and
data.frame doesn't have a method, so it hits the default method, which
handles lists. Am I missing somewhere where the dispatch gives a data.frame
to a list method (in S3 space)?
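
A toy generic can be used to check this. Everything named describe below 
is hypothetical and defined only for this sketch.

```r
# Hypothetical generic with a list method and a default method only.
describe <- function(x, ...) UseMethod("describe")
describe.list <- function(x, ...) "list method"
describe.default <- function(x, ...) "default method"

# A plain list has no class attribute, so its implicit class is used.
describe(list(a = 1))

# A data.frame is a list underneath but carries the class "data.frame";
# dispatch tries describe.data.frame and then describe.default.
describe(data.frame(a = 1))

# Explicitly extending "list" is what makes the list method reachable.
describe(structure(list(a = 1), class = c("newclass", "list")))
```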


> I won't get
> into an argument of what the documentation states or whether any changes
> should be made, but to me the fact that dispatch doesn't end with the
> implicit class feels like a logical wrinkle.  Yes, I can twist my brain to
> see how it can be made to make sense, but I don't like it.
>

I suppose it depends on how you view S3 dispatch. I view it purely as
labeling. S3 dispatch has literally nothing to do with the content of the
object. What you're describing would make that not the case. (Or if I'm
wrong about what is happening, then I'm incorrect about that too).

Best,
~G


>
> A fun past conversation on this very topic:
>
> https://stat.ethz.ch/pipermail/r-devel/2019-March/077457.html
>
> Best,
>
> B.
>
> > Also, as an aside, if you want your class to override methods that exist
> > for function you would want to set the class to c("foo", "function"), not
> > c("function", "foo"), as you had it in your example.
> >
> > Best,
> > ~G
