[Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

2017-04-30 Thread Jack Kelley
"R version 3.4.0 (2017-04-21)"  on "x86_64-w64-mingw32" platform

I am using CSVs and other text tables, and text in general (including
regular expressions), on Windows 10.
For me, that means dealing with Windows-1252 and UTF-8 encoding, with UTF-16
and UTF-32 as helpful curiosities.

Something as simple as iconv ("\n", to = "UTF-16") causes an error, due to
an embedded nul.

Then there is write.csv (or write.table) with its fileEncoding parameter:
not working correctly for UTF-16 and UTF-32.

Of course, developers are aware of this, for example …

[Rd] iconv to UTF-16 encoding produces error due to embedded nulls
(write.table with fileEncoding param)
https://stat.ethz.ch/pipermail/r-devel/2016-February/072323.html

iconv to UTF-16 encoding produces error due to embedded nulls (write.table
with fileEncoding param)
http://r.789695.n4.nabble.com/iconv-to-UTF-16-encoding-produces-error-due-to-embedded-nulls-write-table-with-fileEncoding-param-td4717481.html




Focussing on write.csv and UTF-16LE and UTF-16BE, it seems that a nul
character is omitted in each CR+LF pair.
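
One way to spot the problem in a file dump is to look for a 0d byte
immediately followed by 0a. The following is only a minimal sketch, not part
of the original report: the helper name find_bare_crlf is hypothetical, and
the sample bytes are the first 27 bytes of the UTF-16LE dump shown in the
test output of this post. A hit is only suspicious, not proof, because in
UTF-16LE the aligned byte pair 0d 0a is also the valid code unit U+0A0D.

```r
# Hypothetical helper: positions where byte 0d is immediately followed by 0a.
find_bare_crlf <- function(bytes) {
  n <- length(bytes)
  if (n < 2L) return(integer(0))
  which(bytes[-n] == as.raw(0x0d) & bytes[-1L] == as.raw(0x0a))
}

# Header line of R_LE.csv as reported in this thread ("want","got" + EOL).
header <- as.raw(c(
  0x22, 0x00, 0x77, 0x00, 0x61, 0x00, 0x6e, 0x00, 0x74, 0x00, 0x22, 0x00,
  0x2c, 0x00, 0x22, 0x00, 0x67, 0x00, 0x6f, 0x00, 0x74, 0x00, 0x22, 0x00,
  0x0d, 0x0a, 0x00
))

# A correctly encoded UTF-16LE CR+LF would be 0d 00 0a 00 (no adjacency).
good <- as.raw(c(0x22, 0x00, 0x0d, 0x00, 0x0a, 0x00))

print(find_bare_crlf(header))  # position of the 0d that lacks its nul byte
print(find_bare_crlf(good))    # integer(0)
```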

TEST SCRIPT


remove (list = objects())

print (sessionInfo())
cat ("-\n\n")

LE <- data.frame (
  want = c ("0d,00", "0a,00"),
  got  = c ("0d   ", "0a,00")
)

BE <- data.frame (
  want = c ("00,0d", "00,0a"),
  got  = c ("00,0d", "   0a")
)

write.csv (LE, "R_LE.csv", fileEncoding = "UTF-16LE", row.names = FALSE)
write.csv (BE, "R_BE.csv", fileEncoding = "UTF-16BE", row.names = FALSE)

print (readBin ("R_LE.csv", "raw", 1000))
print (LE)
cat ("\n")

print (readBin ("R_BE.csv", "raw", 1000))
print (BE)
cat ("\n")

try (iconv ("\n", to = "UTF-8"))

try (iconv ("\n", to = "UTF-16LE"))
try (iconv ("\n", to = "UTF-16BE"))
try (iconv ("\n", to = "UTF-16"))

try (iconv ("\n", to = "UTF-32LE"))
try (iconv ("\n", to = "UTF-32BE"))
try (iconv ("\n", to = "UTF-32"))



TEST SCRIPT OUTPUT

> source ("bug_encoding.R")
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 14393)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
[5] LC_TIME=English_Australia.1252

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.4.0
-

 [1] 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00 0d
[26] 0a 00 22 00 30 00 64 00 2c 00 30 00 30 00 22 00 2c 00 22 00 30 00 64 00 20
[51] 00 20 00 20 00 22 00 0d 0a 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 2c
[76] 00 22 00 30 00 61 00 2c 00 30 00 30 00 22 00 0d 0a 00
   want   got
1 0d,00 0d
2 0a,00 0a,00

 [1] 00 22 00 77 00 61 00 6e 00 74 00 22 00 2c 00 22 00 67 00 6f 00 74 00 22 00
[26] 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 64 00 22 00 2c 00 22 00 30 00 30 00
[51] 2c 00 30 00 64 00 22 00 0d 0a 00 22 00 30 00 30 00 2c 00 30 00 61 00 22 00
[76] 2c 00 22 00 20 00 20 00 20 00 30 00 61 00 22 00 0d 0a
   want   got
1 00,0d 00,0d
2 00,0a    0a

Error in iconv("\n", to = "UTF-16LE") : embedded nul in string: '\n\0'
Error in iconv("\n", to = "UTF-16BE") : embedded nul in string: '\0\n'
Error in iconv("\n", to = "UTF-16") : embedded nul in string: 'þÿ\0\n'
Error in iconv("\n", to = "UTF-32LE") :
  embedded nul in string: '\n\0\0\0'
Error in iconv("\n", to = "UTF-32BE") :
  embedded nul in string: '\0\0\0\n'
Error in iconv("\n", to = "UTF-32") :
  embedded nul in string: '\0\0þÿ\0\0\0\n'
>


Cheers -- Jack Kelley

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

2017-05-01 Thread Jack Kelley
Thanks for looking into this.

A few notes regarding all the UTF encodings on Windows 10 ...

The default eol for write.csv (via write.table) is "\n" and always gives
as.raw (c (0x0d, 0x0a)), that is, CR+LF as adjacent bytes. This is fine for
UTF-8 but wrong for UTF-16 and UTF-32.
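
For comparison, a minimal sketch (not from the original script) of what a
correctly encoded CR+LF looks like in each encoding. It relies only on
iconv() with toRaw = TRUE; the helper name eol_bytes is hypothetical.

```r
# Hypothetical helper: bytes of a correctly encoded CR+LF in a given encoding.
eol_bytes <- function(encoding) {
  iconv("\r\n", from = "UTF-8", to = encoding, toRaw = TRUE)[[1]]
}

# Print the byte sequence per encoding, e.g.
#   UTF-8:    0d 0a
#   UTF-16LE: 0d 00 0a 00
#   UTF-32BE: 00 00 00 0d 00 00 00 0a
for (enc in c("UTF-8", "UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE")) {
  cat(sprintf("%-9s", enc), as.character(eol_bytes(enc)), "\n")
}
```

Only in UTF-8 are 0d and 0a adjacent; every other encoding pads each of the
two characters to 2 or 4 bytes.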

EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are
missing in the final CR+LF):

df <- data.frame (x = 1:2, y = 3:4)

$`UTF-32LE`$default.eol$raw
 [1] 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00 00 00 22
[26] 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a 00 00 00
[51] 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a 00 00 00

$`UTF-32BE`$default.eol$raw
 [1] 00 00 00 22 00 00 00 78 00 00 00 22 00 00 00 2c 00 00 00 22 00 00 00 79 00
[26] 00 00 22 00 00 00 0d 0a 00 00 00 31 00 00 00 2c 00 00 00 33 00 00 00 0d 0a
[51] 00 00 00 32 00 00 00 2c 00 00 00 34 00 00 00 0d 0a

(Nevertheless, Microsoft Excel 2013 tolerates these CSVs!)

One trick/solution is to use eol = "\r" (that is, CR only).
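
If CR+LF line endings are needed, another possible route is to bypass
fileEncoding altogether: render the CSV text in memory, convert it to raw
bytes with iconv (toRaw = TRUE), and write those bytes in binary mode. This
is only a sketch under assumptions: the function name write_csv_utf16le is
hypothetical, no byte order mark is written, and it has been reasoned through
for small ASCII data frames only.

```r
# Hypothetical helper: write a data frame as UTF-16LE CSV via raw bytes.
write_csv_utf16le <- function(df, path, eol = "\r\n") {
  # Render the CSV lines in memory (write.csv prints to the console by default).
  lines <- utils::capture.output(utils::write.csv(df, row.names = FALSE))
  text <- enc2utf8(paste0(paste(lines, collapse = eol), eol))
  # Raw vectors can hold the nul bytes that character vectors cannot.
  bytes <- iconv(text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]
  # Binary write: no text-mode translation of line endings.
  writeBin(bytes, path)
  invisible(bytes)
}

df <- data.frame(x = 1:2, y = 3:4)
path <- tempfile(fileext = ".csv")
bytes <- write_csv_utf16le(df, path)
print(readBin(path, "raw", 1000))
```

Because the bytes go through writeBin, each CR+LF stays as 0d 00 0a 00
instead of losing a nul byte.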

Regards -- Jack Kelley




remove (list = objects())
print (sessionInfo())
cat ("##\n\n")

ENCODING <- c (
  "UTF-8",
  "UTF-16LE", "UTF-16BE", "UTF-16",
  "UTF-32LE", "UTF-32BE", "UTF-32"
)

df <- data.frame (x = 1:2, y = 3:4)

csv <- structure (lapply (ENCODING, function (encoding) {
  csv <- sprintf ("df_%s.csv", encoding)
  write.csv (df, csv, fileEncoding = encoding, row.names = FALSE)
  list (default.eol = list (
    csv = csv, raw = readBin (csv, "raw", 1000))
  )
}), .Names = ENCODING)

EOL <- c (LF = "\n", CR = "\r", "CR+LF" = "\r\n")

CSV <- structure (lapply (ENCODING, function (encoding) {
  structure (
    lapply (names (EOL), function (EOL.name) {
      csv <- sprintf ("df_%s_eol=%s.csv", encoding, EOL.name)
      write.csv (
        df, csv, fileEncoding = encoding, row.names = FALSE,
        eol = EOL [EOL.name]
      )
      list (csv = csv, raw = readBin (csv, "raw", 1000))
    }), .Names = names (EOL))
}), .Names = ENCODING)

print (csv)
print (CSV)

--------


-----Original Message-----
From: Duncan Murdoch [mailto:murdoch.dun...@gmail.com] 
Sent: Tuesday, 2 May 2017 04:22
To: Jack Kelley; r-devel@r-project.org
Subject: Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and
UTF-32 ?

On 30/04/2017 12:23 PM, Duncan Murdoch wrote:
> No, I don't think anyone is working on this.
>
> There's a fairly simple workaround for the UTF-16 and UTF-32 iconv
> issues:  don't attempt to produce character vectors, produce raw vectors
> instead. (The "toRaw" argument to iconv() asks for this.) Raw vectors
> can contain embedded nulls.  Character vectors can't, because
> internally, R is using 8 bit C strings, and the nulls are string
> terminators.
>
> I don't know how difficult it would be to fix the write.table problems.

I've now taken a look, and it appears as if it's not too hard.  I'll see 
if I can work out a patch that I trust.

Duncan Murdoch

>
> Duncan Murdoch
>
> On 29/04/2017 7:53 PM, Jack Kelley wrote:
>> "R version 3.4.0 (2017-04-21)"  on "x86_64-w64-mingw32" platform
>> ... [rest omitted]
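
The toRaw workaround mentioned in the quoted reply can be sketched roughly as
follows. This is an illustration rather than anything posted in the thread:
the variable names are hypothetical, and the BOM-carrying "UTF-16" and
"UTF-32" are left out because their byte order can differ between platforms.

```r
# Ask iconv() for raw vectors instead of character vectors;
# raw vectors may contain nul bytes, so no "embedded nul" error occurs.
encodings <- c("UTF-16LE", "UTF-16BE", "UTF-32LE", "UTF-32BE")
as_raw <- lapply(encodings, function(enc) {
  iconv("\n", from = "UTF-8", to = enc, toRaw = TRUE)[[1]]
})
names(as_raw) <- encodings
print(as_raw)

# Converting back: iconv() also accepts a list of raw vectors as input.
back <- vapply(encodings, function(enc) {
  iconv(list(as_raw[[enc]]), from = enc, to = "UTF-8")
}, character(1))
stopifnot(all(back == "\n"))
```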



Re: [Rd] Any progress on write.csv fileEncoding for UTF-16 and UTF-32 ?

2017-05-01 Thread Jack Kelley
Correction to my previous post: Not just the final CR+LF...

Change
EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are
missing in the final CR+LF):
to
EXAMPLE: Using UTF-32 for exaggeration (note also that 3 nul bytes are
missing in *each* CR+LF):

-- Jack Kelley
