Re: [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

Duncan Murdoch Wed, 24 Feb 2016 08:25:54 -0800

On 24/02/2016 9:55 AM, Mikko Korpela wrote:

On 24.02.2016 15:47, Duncan Murdoch wrote:

On 23/02/2016 7:06 AM, Mikko Korpela wrote:

On 23.02.2016 11:37, Martin Maechler wrote:

nospam@altfeld-im de <[email protected]>
      on Mon, 22 Feb 2016 18:45:59 +0100 writes:


      > Dear R developers
      > I think I have found a bug that can be reproduced with two lines of code
      > and I am very thankful to get your first assessment or feed-back on my
      > report.

      > If this is the wrong mailing list or I did something wrong
      > (e. g. semi "anonymous" email address to protect my privacy and defend
      > unwanted spam) please let me know since I am new here.

      > Thank you very much :-)

      > J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcomed here!),

You are right, this is a bug, at least in the documentation, but
probably "all real", indeed,

but read on.

      > On Tue, 2016-02-16 at 18:25 +0100, [email protected] wrote:
      >>
      >>
      >> If I execute the code from the "?write.table" examples section
      >>
      >> x <- data.frame(a = I("a \" quote"), b = pi)
      >> # (omitted code)
      >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
      >>
      >> the resulting CSV file has a size of 6 bytes which is too short
      >> (truncated):
      >>
      >> """,3

reproducibly, yes.
If you look at what write.csv does
and then simplify, you can get a similar wrong result by

    write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

""" 3

and if you debug write.table() you see that its building blocks here are

      file <- file(........, encoding = fileEncoding)

a writeLines(*, file=file) for the column headers,

and then "deeper down" C code which I did not investigate.


I took a look at connections.c. There is a call to strlen() that gets
confused by null characters. I think the obvious fix is to avoid the
call to strlen() as the size is already known:

Index: src/main/connections.c
===================================================================
--- src/main/connections.c    (revision 70213)
+++ src/main/connections.c    (working copy)
@@ -369,7 +369,7 @@
           /* is this safe? */
           warning(_("invalid char string in output conversion"));
           *ob = '\0';
-        con->write(outbuf, 1, strlen(outbuf), con);
+        con->write(outbuf, 1, ob - outbuf, con);
       } while(again && inb > 0);  /* it seems some iconv signal -1 on
                          zero-length input */
       } else
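
To illustrate why the strlen() call truncates the output, here is a minimal standalone sketch in C. It is not code from connections.c: the helper names (ascii_to_utf16le, len_by_strlen, len_by_pointer) are hypothetical, and the hand-written ASCII-only conversion merely stands in for iconv. For ASCII text every second byte of UTF-16LE is zero, so strlen() stops after the first byte, while the pointer difference gives the full converted length.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the iconv step: encode ASCII text as
   UTF-16LE into `out`, write a single terminating zero byte (like
   `*ob = '\0'` in the patch context), and return the pointer just past
   the last converted byte, which plays the role of `ob`. */
static char *ascii_to_utf16le(const char *in, char *out, size_t outsize)
{
    char *ob = out;
    assert(2 * strlen(in) + 1 <= outsize);  /* room for data + terminator */
    for (; *in; in++) {
        *ob++ = *in;   /* low byte carries the ASCII code */
        *ob++ = '\0';  /* high byte is zero for ASCII */
    }
    *ob = '\0';
    return ob;
}

/* Length as computed before the patch: stops at the first zero byte. */
static size_t len_by_strlen(const char *outbuf)
{
    return strlen(outbuf);
}

/* Length as computed by the patch: distance from start to `ob`. */
static size_t len_by_pointer(const char *outbuf, const char *ob)
{
    return (size_t)(ob - outbuf);
}
```

For the two-character input "C\n" the conversion produces 4 bytes, of which strlen() sees only 1. If each of the five lines in the writeLines() example is converted in its own call (an assumption, but one consistent with the output shown in this thread), that gives the 5-byte file before the patch and 11 characters x 2 bytes = 22 bytes after it.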


But just looking a bit at such a file() object with writeLines()
seems slightly revealing, as e.g., 'eol' does not seem to
"work" for this encoding:

      > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
      > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
      > close(ff)
      > file.show(fn)
      CBA|>
      > file.size(fn)
      [1] 5
      >


With the patch applied:

      > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
      [1] "C"  "B"  "A"  "|"  ">a"
      > file.size(fn)
      [1] 22


That may be okay on Unix, but it's not enough on Windows.  There the \n
that writeLines adds at the end of each line isn't translated to
UTF-16LE properly, so things get messed up.  (I think the \n is
translated, but the \r that Windows wants is not, so you get a mix of 8
bit and 16 bit characters.)


That's unfortunate. I tested my tiny patch on Linux. I don't know what
kind of additional changes would be needed to make this work on Windows.


It looks like a big change is needed for a perfect solution:

- Windows does the translation of \n to \r\n. In the R code, Windows is
  never told that the output is UTF-16LE, so it does an 8 bit translation.

- Telling Windows that output is UTF-16LE looks hard: we'd need to convert
  the string to wide chars in R, then write it in wide chars. This seems
  like a lot of work for a rare case.

- It might be easier to do a hack: if the user asks for "UTF-16LE", then
  treat it internally as a text file but tell Windows it's a binary file.
  This means no \n to \r\n translation will be done by Windows. If the
  desired output file needs Windows line endings, the user would have to
  specify sep="\r\n" in writeLines.
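
The 8 bit translation problem in the first point can be sketched with a small simulation in C. This is an illustration only, not Windows or R code: text_mode_translate is a hypothetical function that mimics, under the assumption that text-mode output inserts a 0x0D byte before every 0x0A byte, what happens when already-converted UTF-16LE bytes pass through such a translation.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated byte-wise text-mode translation (an assumption made for
   illustration): every 0x0A byte is written as 0x0D 0x0A, with no
   knowledge that the data consists of 16-bit code units.
   Returns the number of bytes written to `out`. */
static size_t text_mode_translate(const unsigned char *in, size_t n,
                                  unsigned char *out, size_t outsize)
{
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (in[i] == 0x0A) {
            assert(k < outsize);
            out[k++] = 0x0D;  /* a single 8-bit \r, not the 16-bit 0D 00 */
        }
        assert(k < outsize);
        out[k++] = in[i];
    }
    return k;
}
```

Feeding it "A\n" in UTF-16LE (bytes 41 00 0A 00) yields 41 00 0D 0A 00: five bytes, so every following 16-bit unit is misaligned by one byte. A correct UTF-16LE CRLF would be 0D 00 0A 00, which is what the user would get by writing in binary mode and passing sep="\r\n" explicitly.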


Duncan Murdoch

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
