Re: [Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

2019-02-07 Thread Daniel Possenriede
There seems to be something odd with "∞" on Windows (and not only with
read.table).
In native encoding (CP1252 in my case), "∞" gets converted to "8":

x <-  "∞"
Encoding(x)
#> [1] "unknown"
print(x)
#> [1] "8"
charToRaw(x)
#> [1] 38

"∞" is indeed "8"

identical(x, "8")
#> [1] TRUE

Everything seems fine if  "∞" is UTF-8 encoded.

y <- "\u221E"
Encoding(y)
#> [1] "UTF-8"
print(y)
#> [1]  "∞"
charToRaw(y)
#> [1] e2 88 9e

Unless the string is converted back to native encoding.

format(y)
#> [1] "8"

This ought to be "<U+221E>", equivalent to

format("∝")
#> [1] "<U+221D>"
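
To make the difference between the two strings concrete, here is a minimal sketch that compares the code point and the stored bytes. It only assumes base R; the variable names are made up for illustration. Because the character is entered as a \u escape, the result should not depend on the console or the locale.

```r
# Infinity entered as an escape: the source text is pure ASCII and the
# resulting string is marked as UTF-8 regardless of the native encoding.
y <- "\u221E"

# Code point, formatted the way R prints unrepresentable characters.
code_point <- sprintf("U+%04X", utf8ToInt(y))

# Bytes actually stored for the UTF-8 string and for the digit "8".
utf8_bytes  <- charToRaw(y)
digit_bytes <- charToRaw("8")

# If the input was silently replaced on the way in, the stored bytes would be
# those of the digit; for the escaped string they are not.
same_bytes <- identical(utf8_bytes, digit_bytes)

print(code_point)
print(utf8_bytes)
print(same_bytes)
```

If a pasted "∞" compares identical to "8" while the escaped form does not, the substitution happened before the parser saw the input.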

Session Info:

si <- sessionInfo()
si$running
#> [1] "Windows 10 x64 (build 17134)"
si$R.version$version.string
#> [1] "R version 3.5.2 (2018-12-20)"
si$locale
#> [1]
"LC_COLLATE=German_Germany.1252;LC_CTYPE=German_Germany.1252;LC_MONETARY=German_Germany.1252;LC_NUMERIC=C;LC_TIME=German_Germany.1252"



On Thu, 7 Feb 2019 at 14:33, David Byrne <david.byrne...@gmail.com> wrote:

> I can confirm that it doesn't happen on Ubuntu 18.04.1, so Peter is
> most likely correct; it looks like it's Windows-specific.
>
> On Thu, 7 Feb 2019 at 12:55, peter dalgaard  wrote:
> >
> > This doesn't seem to be happening on MacOS, neither in Terminal nor
> RStudio, (R 3.5.1, R-devel, R-patched). So probably Windows specific.
> >
> > -pd
> >
> > > On 7 Feb 2019, at 11:17 , David Byrne 
> wrote:
> > >
> > > Bug
> > > Using read.table(file, encoding="UTF-8") to import a UTF-8 encoded
> > > file containing the infinity symbol (' ∞ ') results in the infinity
> > > symbol imported as the number 8. Other Unicode characters seem
> > > unaffected, example, Zhe: ж
> > >
> > > Expected Behavior:
> > > The imported data.frame should represent the infinity symbol as the
> > > expected 'Inf' so that normal mathematical operations can be processed
> > >
> > > Stack Overflow Post:
> > > I created a question on Stack Overflow where one other member was able
> > > to reproduce the same issues I was having. This question can be found
> > > at:
> > >
> https://stackoverflow.com/questions/54522196/r-read-table-with-utf-8-encoded-file-reads-infinity-symbol-as-8-int
> > >
> > > Method to Reproduce - 1:
> > > A simple method to reproduce this issues is to use R-Studio: In the
> > > console, type the following:
> > >> read.table(text=" ∞", encoding="UTF-8")
> > >
> > > The result should be a data.frame with a single value of '8'
> > >
> > > Repeating the same with ж Results in correct expected behavior
> > >
> > > Method to Reproduce - 2:
> > > Create a .csv file containing the infinity and Zhe characters (I have
> > > attached the file for convenience, hopefully it is not rejected by your
> > > email service). Launch an interactive session using
> > >
> > >> r --vanilla
> > >
> > > Enter the following statement taking care to replace the
> > >  with the appropriate one:
> > >
> > >> read.table("/unicode_chars.csv", sep=",",
> encoding="UTF-8")
> > >
> > >
> > > This should result in a two element data.frame; the first being the
> > > incorrect value of 8 with an additional  and the second the
> > > correct value of Zhe.
> > >
> > > Note the additional  prefixed to the front of the '8'. This
> > > appears to be a hidden character for the purposes of letting editors
> > > know the encoding. The following link has some explanation however, it
> > > states this is caused by excel. The file I created was done so using
> > > notepad and not Excel.
> > >
> > >
> https://medium.freecodecamp.org/a-quick-tale-about-feff-the-invisible-character-cd25cd4630e7
> > >
> > > System Details:
> > > OS:
> > >> Windows 10.0.17134 Build 17134
> > >
> > >
> > > R Version:
> > >> platform   x86_64-w64-mingw32
> > >> arch   x86_64
> > >> os mingw32
> > >> system x86_64, mingw32
> > >> status
> > >> major  3
> > >> minor  4.1
> > >> year   2017
> > >> month  06
> > >> day30
> > >> svn rev72865
> > >> language   R
> > >> version.string R version 3.4.1 (2017-06-30)
> > >> nickname   Single Candle
> > > __
> > > R-devel@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-devel
> >
> > --
> > Peter Dalgaard, Professor,
> > Center for Statistics, Copenhagen Business School
> > Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> > Phone: (+45)38153501
> > Office: A 4.23
> > Email: pd@cbs.dk  Priv: pda...@gmail.com
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>
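
One possible workaround for the read.table() case in the quoted report, as a sketch only: read the raw lines, mark them as UTF-8 without converting them, replace the infinity sign with "Inf" while the text is still UTF-8, and only then parse. The file contents and variable names below are made up for illustration, and the example is restricted to numeric columns because read.table(text = ...) may still translate other non-ASCII text to the native encoding.

```r
# Build a small UTF-8 file with a leading BOM, similar to what Notepad writes.
# Path and contents are hypothetical.
path <- tempfile(fileext = ".csv")
writeLines(c("\ufeff\u221E,1.5", "2,-\u221E"), path, useBytes = TRUE)

# encoding = "UTF-8" only marks the strings; no conversion to native happens.
lines <- readLines(path, encoding = "UTF-8")

# Drop the byte order mark from the first line, then spell out infinity in a
# form that R's number parser understands.
lines[1] <- sub("\ufeff", "", lines[1], fixed = TRUE)
lines <- gsub("\u221E", "Inf", lines, fixed = TRUE)

# The remaining text is plain ASCII, so the native encoding no longer matters.
df <- read.table(text = lines, sep = ",")
print(df)
```

The replacement has to happen before anything converts the text to the native encoding; afterwards the original character can no longer be told apart from a genuine "8".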


Re: [Rd] Bug Report: read.table with UTF-8 encoded file imports infinity symbol as Integer 8

2019-02-08 Thread Daniel Possenriede
Tomas,

> In my scenario, the conversion is invoked by RGui before returning the
> input to the main R loop, even before the input gets to the parser. In
> principle, we could change this particular conversion in RGui to avoid the
> substitution.

Not sure whether I am missing something here, but I used RStudio for my
examples (I should have said so) and David mentioned RStudio as well, so it
does not seem to be a problem with RGui only.

Another example for the "best fit" behaviour seems to be "Σ"
("\u03A3", greek capital letter sigma, not "\u2211", n-ary summation):

print("Σ")
#> [1] "S"

Again with cp1252 on Windows 10, R 3.5.2, RStudio 1.2.1256 preview.
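
A rough way to find out in advance which characters are at risk is a round-trip test: convert to the target encoding and back, and see whether the original comes out again. This is a sketch only; survives_round_trip() is a made-up helper, and what iconv() returns for an unrepresentable character (NA, a placeholder, or a best-fit substitute) depends on the platform's iconv implementation. The comparison is written so that all three outcomes count as "did not survive".

```r
# TRUE for elements that come back unchanged from UTF-8 -> `to` -> UTF-8.
survives_round_trip <- function(x, to = "CP1252") {
  x <- enc2utf8(x)
  there <- iconv(x, from = "UTF-8", to = to)
  back  <- iconv(there, from = to, to = "UTF-8")
  # NA (conversion failed) and any substituted character both yield FALSE.
  !is.na(back) & back == x
}

candidates <- c(infinity = "\u221E", sigma = "\u03A3", e_acute = "\u00E9")
result <- survives_round_trip(candidates)
print(result)
```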

> even though we could rewrite in principle all calls to Windows API to use
> Unicode and have all strings in UTF-8 in R, we would still have problems
> when interfacing with packages that assume strings are in current native
> encoding (without checking), so this problem won't be easy to fix.

Since I regularly encounter the reverse problem, i.e. packages that assume
strings are in UTF-8 encoding without checking (which isn't very
surprising, assuming that most package developers develop on Unix/macOS
systems), I'd say, "rip off the band-aid sooner rather than later". Obviously
I don't know how many bugs would surface in packages if R for Windows'
native encoding were to switch to UTF-8, but these bugs would only be
transitory, I suppose. Whereas there is a steady inflow of
assume-UTF-8-encoding-bugs in new packages and functions with the current
situation.
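
For package code, the check that tends to be missing can be quite small. The following is a sketch of one defensive pattern, not an established convention: as_utf8_checked() is a hypothetical helper that converts via the marked encoding and then refuses anything that is still not valid UTF-8.

```r
# Convert input to UTF-8 (honouring any encoding mark) and fail loudly if the
# result is not valid UTF-8, instead of silently assuming it is.
as_utf8_checked <- function(x) {
  stopifnot(is.character(x))
  out <- enc2utf8(x)
  bad <- !is.na(out) & !validUTF8(out)
  if (any(bad)) {
    stop("invalid UTF-8 after conversion at position(s): ",
         paste(which(bad), collapse = ", "))
  }
  out
}

# A latin1-marked e-acute (single byte e9), built from raw bytes so that the
# example does not depend on the encoding of this source text.
latin1_input <- rawToChar(as.raw(0xe9))
Encoding(latin1_input) <- "latin1"

utf8_output <- as_utf8_checked(latin1_input)
print(Encoding(utf8_output))
print(charToRaw(utf8_output))
```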

Best,
Daniel


On Fri, 8 Feb 2019 at 13:07, Tomas Kalibera <tomas.kalib...@gmail.com> wrote:

> I can reproduce this behavior on my Windows 10 system in RGui (cp1252):
> when I paste the Unicode infinity symbol into the console, it is treated
> as number 8. This is caused by Windows "best fit" default behavior in
> conversion of unicode characters to characters in the current native
> encoding: at some point in the past, 8 has been chosen as a good fit for
> infinity in Windows. In my scenario, the conversion is invoked by RGui
> before returning the input to the main R loop, even before the input
> gets to the parser. In principle, we could change this particular
> conversion in RGui to avoid the substitution. RGui uses "\u" escapes
> to pass characters that cannot be represented, this is why e.g. the
> Cyrillic Zhe \u0436 worked, so we could tell Windows not to do the
> substitution and pass "\u221e" for Infinity, and then the string after
> being processed by the parser will be represented in UTF-8 inside R and
> could be e.g. printed by the RGui console. That is something that could
> be considered, but it will not solve the main problem and it may
> actually cause trouble to users who are used to such substitutions
> (especially when the substitutions are more intuitive, but, that may be
> a matter of opinion).
>
> The main problem is that in normal use, sooner or later R will get to
> the point when it will need to do the conversion to native encoding, and
> in some context where "\u" escapes will not be possible. One cannot
> reliably work with strings in R that cannot be represented in the
> current native encoding (except when one knows precisely how to avoid
> the conversion in some specific task, but that may be brittle; so the
> best-fit substitution might in principle help here). This problem does
> not exist on Unix/macOS systems where the current native encoding is
> UTF-8 these days, so today it only exists on Windows where UTF-8 cannot
> be the current native encoding. As has been discussed before, even
> though we could rewrite in principle all calls to Windows API to use
> Unicode and have all strings in UTF-8 in R, we would still have problems
> when interfacing with packages that assume strings are in current native
> encoding (without checking), so this problem won't be easy to fix.
>
> Best,
> Tomas

[Rd] special latin1 do not print as glyphs in current devel on windows

2017-07-31 Thread Daniel Possenriede
Sorry, if I am spamming/not using the right list, but I think I might be
onto a regression in current devel.

Namely, special (non-ASCII) characters with latin1 encoding do not get
printed as glyphs with R 3.5.0 devel but were with R 3.4.1.

This output is from

# R version 3.4.1 (2017-06-30) -- "Single Candle"
# Platform: x86_64-w64-mingw32/x64 (64-bit)

> x <- c("€", "–", "‰") # Euro, en-dash, per mille
> # v3.4.1 prints latin1 characters fine
> print(x)
[1] "€" "–" "‰"

And this (and all following) output is from

# R Under development (unstable) (2017-07-30 r73000) -- "Unsuffered Consequences"
# Platform: x86_64-w64-mingw32/x64 (64-bit)

> x <- c("€", "–", "‰") # Euro, en-dash, per mille
> # printed as escapes with 3.5.0 devel
> print(x)
[1] "\u0080" "\u0096" "\u0089"

The possible regression ends here, all following output is the same with
v.3.4.1 and 3.5.0 devel.

Possibly a second, but IMHO related issue is that encoding to UTF-8 does
not help and that information is lost when encoding back to latin1.

First, chars are printed as escapes as well, when converted to UTF-8, which
is unexpected, considering that escapes can be printed as glyphs (see
below).

> Encoding(x)
[1] "latin1" "latin1" "latin1"
> x_utf8 <- enc2utf8(x)
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> print(x_utf8)
[1] "\u0080" "\u0096" "\u0089"

Converting back to native is lossy (which, to me, is also unexpected).

# When converting x_utf8 back to native encoding, chars are not marked as
# latin-1 ...
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown" "unknown"
> print(x_nat)
[1] "<U+0080>" "<U+0096>" "<U+0089>"

Other unicode chars print fine as glyphs when entered as escapes (cf.
enc2utf8(x) above)

> z <- c("\u215B", "\u2105", "\u03B7") # 1/8, c/o, eta
> Encoding(z)
[1] "UTF-8" "UTF-8" "UTF-8"
> print(z)
[1] "⅛" "℅" "η"

But changing encoding is also not such a good idea here.

> z_nat <- enc2native(z)
> Encoding(z_nat)
[1] "unknown" "unknown" "unknown"
> z_utf8 <- enc2utf8(z_nat)
> Encoding(z_utf8)
[1] "unknown" "unknown" "unknown"
> print(z_utf8)
[1] "<U+215B>" "<U+2105>" "<U+03B7>"


Re: [Rd] special latin1 do not print as glyphs in current devel on windows

2017-08-01 Thread Daniel Possenriede
Upon further inspection, I think there are at least two problems.
First, the issue with printing latin1/cp1252 characters in the "80" to "9F"
code range.

x <- c("€", "–", "‰")
Encoding(x)
print(x)

I assume that these are Unicode escapes!? (Given that Encoding(x) shows
"latin1" I'd rather expect latin1/cp1252 escapes here, but these would be
e.g. "\x80", right? My locale is LC_COLLATE=German_Germany.1252 btw.)
Now I don't know why print tries to convert to Unicode, but if these indeed
are Unicode escapes, then there is something wrong with the conversion from
cp1252 to Unicode.
In general, most cp1252 char codes translate to the Unicode code point with
the same number: CP1252 "00" -> Unicode "0000", "01" -> "0001", "02" -> "0002",
etc.; see http://www.cp1252.com/.
The exception is the cp1252 "80" to "9F" code range. E.g. the Euro sign is
"80" in cp1252 but "20AC" in Unicode, endash "96" in cp1252, "2013" in
Unicode.
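
The difference between the two interpretations of byte 80 can be checked directly with iconv(), which uses the `from` argument rather than any encoding mark. A small sketch, with made-up variable names, assuming the platform's iconv knows the names "CP1252" and "ISO-8859-1":

```r
# The single byte 0x80, built from raw so no source encoding is involved.
byte_80 <- rawToChar(as.raw(0x80))

# Interpreted as CP1252 it is the Euro sign; interpreted as ISO-8859-1
# (latin1) it is a C1 control character.
as_cp1252 <- iconv(byte_80, from = "CP1252",     to = "UTF-8")
as_latin1 <- iconv(byte_80, from = "ISO-8859-1", to = "UTF-8")

cp_cp1252 <- sprintf("U+%04X", utf8ToInt(as_cp1252))
cp_latin1 <- sprintf("U+%04X", utf8ToInt(as_latin1))
print(c(cp1252 = cp_cp1252, latin1 = cp_latin1))
```

The "\u0080" escape in the print() output is what the latin1 interpretation yields, which is consistent with the conversion treating the byte as latin1 rather than cp1252.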
The same error seems to happen with

enc2utf8(x)

Now with iconv() the result is as expected.

iconv(x, to = "UTF-8")


The second problem IMO is that encoding markers get lost with the enc2*
functions

x_utf8 <- enc2utf8(x)
Encoding(x_utf8)
x_nat <- enc2native(x_utf8)
Encoding(x_nat)

Again, this is not the case with iconv()

x_iutf8 <- iconv(x, to = "UTF-8")
Encoding(x_iutf8)
x_inat <- iconv(x_iutf8, from = "UTF-8")
Encoding(x_inat)


Re: [Rd] special latin1 do not print as glyphs in current devel on windows

2017-08-01 Thread Daniel Possenriede
Sorry, I should have included my console output, obviously. So here we go:

Wrong Unicode escapes when using print() in v3.5.0 devel:

# R Under development (unstable) (2017-07-30 r73000) -- "Unsuffered Consequences"
# Platform: x86_64-w64-mingw32/x64 (64-bit)

> x <- c("€", "–", "‰")
> Encoding(x)
[1] "latin1" "latin1" "latin1"
> print(x)
[1] "\u0080" "\u0096" "\u0089"

Same output with enc2utf8()

> enc2utf8(x)
[1] "\u0080" "\u0096" "\u0089"

With iconv() the result is as expected.

> iconv(x, to = "UTF-8")
[1] "€" "–" "‰"

The second problem IMO is that encoding markers get lost with the enc2*
functions

> x_utf8 <- enc2utf8(x)
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown" "unknown"

This is not the case with iconv()

> x_iutf8 <- iconv(x, to = "UTF-8")
> Encoding(x_iutf8)
[1] "UTF-8" "UTF-8" "UTF-8"
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
[1] "latin1" "latin1" "latin1"
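
To compare what the different conversion routes do on a given machine, it helps to look at the mark and the stored bytes side by side. A sketch, where describe_encoding() is a made-up helper; the rows for enc2native() and iconv() will differ between platforms and locales, so no expected output is shown.

```r
# One row per string: the encoding mark and the stored bytes in hex.
describe_encoding <- function(x) {
  data.frame(
    mark  = Encoding(x),
    bytes = vapply(x, function(s) {
      if (is.na(s)) NA_character_ else paste(charToRaw(s), collapse = " ")
    }, character(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

euro <- "\u20AC"
steps <- list(
  original     = euro,
  enc2native   = enc2native(euro),
  iconv_native = iconv(euro, from = "UTF-8", to = "")
)
print(do.call(rbind, lapply(steps, describe_encoding)))
```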


Re: [Rd] special latin1 do not print as glyphs in current devel on windows

2017-08-01 Thread Daniel Possenriede
Thank you! My apologies again for not including the console output in my
message before. I sent another e-mail with the output in the meantime, so
it should be a bit clearer now, what I am seeing. In case I missed
something, please let me know.

Yes, I am using latin1 and cp1252 interchangeably here, mostly because
Encoding() is reporting the encoding as "latin1". You presumed correctly
that my current/default locale's encoding is CP1252. (I also mentioned that
my locale is LC_COLLATE=German_Germany.1252 before).


> As you are changing encodings, you do not want to preserve encoding!

I am not interested in preserving encodings. What I am worried about is
that the encoding is not marked anymore, i.e. that Encoding() returns
"unknown".
In cp1252 encoding on Windows (note that I am using the cp1252 escape
"\x80" and not the Unicode "\u20AC")

> x_utf8 <- enc2utf8(c("€", "\x80"))
> Encoding(x_utf8)
[1] "UTF-8" "UTF-8"
> x_nat <- enc2native(x_utf8)
> Encoding(x_nat)
[1] "unknown" "unknown"

See also Kirill's message to this list: "ASCII strings are marked as ASCII
internally, but this information doesn't seem to be available, e.g.,
Encoding() returns "unknown" for such strings "
http://r.789695.n4.nabble.com/source-parse-and-foreign-UTF-8-characters-tp4733523.html

>
> Again, this is not the case with iconv()
>>
>> x_iutf8 <- iconv(x, to = "UTF-8")
>> Encoding(x_iutf8)
>> x_inat <- iconv(x_iutf8, from = "UTF-8")
>> Encoding(x_inat)
>>
>
> iconv is converting from/to the current locale's encoding, presumably
> CP1252, not from the marked encoding (as the help page states explicitly.)
>

I am aware that iconv is not using the marked encoding, but that you either
have to set it explicitly or it uses the current locale's default encoding.
As I said I am worried about the fact that the encoding markers get lost
with the enc2* functions or rather they are not set correctly. I am just
using the iconv example to show that iconv is able to set the encoding
markers correctly. So it seems generally possible.

> x_iutf8 <- iconv(c("€", "\x80"), to = "UTF-8")
> Encoding(x_iutf8)
[1] "UTF-8" "UTF-8"
> x_iutf8
[1] "€" "€"
> x_inat <- iconv(x_iutf8, from = "UTF-8")
> Encoding(x_inat)
[1] "latin1" "latin1"
> x_inat
[1] "\u0080" "\u0080"


Re: [Rd] special latin1 do not print as glyphs in current devel on windows

2017-09-14 Thread Daniel Possenriede
This is a follow-up on my initial posts regarding character encodings on 
Windows (https://stat.ethz.ch/pipermail/r-devel/2017-August/074728.html) 
and Patrick Perry's reply 
(https://stat.ethz.ch/pipermail/r-devel/2017-August/074830.html) in 
particular (thank you for the links and the bug report!). My initial 
posts were quite chaotic (and partly wrong), so I am trying to clear 
things up a bit.


Actually, the title of my original message "special latin1 [characters] 
do not print as glyphs in current devel on windows" is already wrong, 
because the problem exists with characters with CP1252 encoding in the 
80-9F (hex) range. Like Brian Ripley rightfully pointed out, latin1 != 
CP1252. The characters in the 80-9F code point range are not even part 
of ISO/IEC 8859-1 a.k.a. latin1, see for example 
https://en.wikipedia.org/wiki/Windows-1252. R treats them as if they 
were, however, and that is exactly the problem, IMHO.


Let me show you what I mean. (All output from R 3.5 r73238, see 
sessionInfo at the end)


> Sys.getlocale("LC_CTYPE")
[1] "German_Germany.1252"
> x <- c("€", "ž", "š", "ü")
> sapply(x, charToRaw)
\u0080 \u009e \u009a      ü
    80     9e     9a     fc

"€", "ž", "š" serve as examples in the 80-9F range of CP1252. I also 
show the "ü" just as an example of a non-ASCII character outside that 
range (and because Patrick Perry used it in his bug report which might 
be a (slightly) different problem, but I will get to that later.)


> print(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

"€", "ž", and "š" are printed as (incorrect) unicode escapes. "€" for 
example should be \u20ac, not \u0080.
(In R 3.4.1, print(x) shows the glyphs and not the Unicode escapes.
Apparently, as of v3.5, print() calls enc2utf8() or its C-level
equivalent, perhaps translateCharUTF8?)


> print("\u20ac")
[1] "€"

The characters in x are marked as "latin1".

> Encoding(x)
[1] "latin1" "latin1" "latin1" "latin1"

Looking at the CP1252 table (e.g. link above), we see that this is 
incorrect for "€", "ž", and "š", which simply do not exist in latin1.


As per the documentation, "enc2utf8 convert[s] elements of character 
vectors to [...] UTF-8 [...], taking any marked encoding into account." 
Since the marked encoding is wrong, so is the output of enc2utf8().


> enc2utf8(x)
[1] "\u0080" "\u009e" "\u009a" "ü"

Now, when we set the encoding to "unknown" everything works fine.

> x_un <- x
> Encoding(x_un) <- "unknown"
> print(x_un)
[1] "€" "ž" "š" "ü"
> (x_un2utf8 <- enc2utf8(x_un))
[1] "€" "ž" "š" "ü"

Long story short: The characters in the 80 to 9F range should not be 
marked as "latin1" on CP1252 locales, IMHO.
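
Until that is settled, the affected strings can at least be detected. The following sketch flags elements that are marked "latin1" but contain bytes in the 80-9F range, which have no printable meaning in ISO 8859-1. Both helper names are made up for illustration.

```r
# TRUE for elements containing at least one byte in 0x80..0x9F.
has_c1_bytes <- function(x) {
  vapply(x, function(s) {
    b <- as.integer(charToRaw(s))
    any(b >= 128L & b <= 159L)   # 0x80 to 0x9F
  }, logical(1), USE.NAMES = FALSE)
}

# Only meaningful together with the mark: in UTF-8 strings the same byte
# values occur as ordinary continuation bytes.
suspicious_latin1 <- function(x) {
  Encoding(x) == "latin1" & has_c1_bytes(x)
}

# Bytes 80 (Euro sign in CP1252) and fc (u-umlaut), both marked as latin1.
x <- c(rawToChar(as.raw(0x80)), rawToChar(as.raw(0xfc)))
Encoding(x) <- "latin1"
flags <- suspicious_latin1(x)
print(flags)
```

Flagged elements could then be converted with an explicit iconv(x, from = "CP1252", to = "UTF-8") instead of relying on the mark.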


As a side-note: the output of localeToCharset() is also problematic, 
since ISO8859-1 != CP1252.


> localeToCharset()
[1] "ISO8859-1"

Finally on to Patrick Perry's bug report 
(https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17329): 'On 
Windows, enc2utf8("ü") yields "|".'


Unfortunately, I cannot reproduce this with the CP1252 locale, as can be 
seen above. Probably, because the bug applies to the C locale (sorry if 
this is somewhere apparent in the bug report and I missed it).


> Sys.setlocale("LC_CTYPE", "C")
[1] "C"
> enc2utf8("ü")
[1] "|"
> charToRaw("ü")
[1] fc
> Encoding("ü")
[1] "unknown"

This does not seem to be related to the marked encoding of the string, 
so it seems to me that this is a different problem than the one above.


Any advice on how to proceed further would be highly appreciated.

Thanks!
Daniel

> sessionInfo()
R Under development (unstable) (2017-09-11 r73238)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252  LC_CTYPE=C
[3] LC_MONETARY=German_Germany.1252 LC_NUMERIC=C
[5] LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods base

loaded via a namespace (and not attached):
[1] compiler_3.5.0


[Rd] Versioning Rtools ARP entries

2023-03-13 Thread Daniel Possenriede
Hi,

If I am not mistaken, all Rtools 4.2 (and 4.3) revisions have the same
ARP [1] entries, i.e. all report version 4.2.0.1 (or 4.3.0.1). This
makes it difficult to determine the installed version (is it possible
to determine the installed revision?) and impossible for tools like
winget [2] to update Rtools to the latest revision, AFAICT.

Would it be possible to track the version in the installer [3] for
future Rtools releases again, like it used to be in Rtools 4.0 [4]?
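
For reference, this is roughly how the reported ARP version can be inspected from within R on Windows. It is a sketch under several assumptions: that Rtools registers itself machine-wide under the usual Uninstall key, that the entry's DisplayName contains "Rtools", and that utils::readRegistry() with maxdepth = 2 returns one list per subkey. rtools_arp_entries() is a made-up helper, and the registry part only runs on Windows.

```r
# Extract DisplayVersion for every uninstall entry whose DisplayName mentions
# Rtools. `uninstall` is a named list of subkeys, each a list of values.
rtools_arp_entries <- function(uninstall) {
  is_rtools <- vapply(uninstall, function(entry) {
    name <- if (is.list(entry)) entry[["DisplayName"]] else NULL
    is.character(name) && length(name) == 1L &&
      grepl("Rtools", name, fixed = TRUE)
  }, logical(1))
  vapply(uninstall[is_rtools], function(entry) {
    version <- entry[["DisplayVersion"]]
    if (is.null(version)) NA_character_ else as.character(version)[1]
  }, character(1))
}

if (.Platform$OS.type == "windows") {
  # Assumed location of the ARP entries for a machine-wide install.
  key <- "SOFTWARE\\Microsoft\\Windows\\CurrentVersion\\Uninstall"
  print(rtools_arp_entries(utils::readRegistry(key, hive = "HLM", maxdepth = 2)))
}
```

If every revision reports the same DisplayVersion, this lookup cannot tell the revisions apart either, which is the problem described above.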

Thanks!

Daniel

[1] https://github.com/microsoft/winget-pkgs/blob/master/FAQ.md#what-is-an-arp-entry
[2] https://github.com/microsoft/winget-cli
[3] https://svn.r-project.org/R-dev-web/trunk/WindowsBuilds/winutf8/ucrt3/rtools/rtools64.iss
[4] https://github.com/r-windows/rtools-installer/commit/7f23f0d0442d72922014ec4082c8bdd437364cef
