On 24.02.2016 15:47, Duncan Murdoch wrote:
> On 23/02/2016 7:06 AM, Mikko Korpela wrote:
>> On 23.02.2016 11:37, Martin Maechler wrote:
>>>>>>>> nospam@altfeld-im de <nos...@altfeld-im.de>
>>>>>>>> on Mon, 22 Feb 2016 18:45:59 +0100 writes:
>>>
>>> > Dear R developers
>>> > I think I have found a bug that can be reproduced with two lines of code
>>> > and I am very thankful to get your first assessment or feed-back on my
>>> > report.
>>>
>>> > If this is the wrong mailing list or I did something wrong
>>> > (e. g. semi "anonymous" email address to protect my privacy and defend
>>> > unwanted spam) please let me know since I am new here.
>>>
>>> > Thank you very much :-)
>>>
>>> > J. Altfeld
>>>
>>> Dear J.,
>>> (yes, a bit less anonymity would be very welcomed here!),
>>>
>>> You are right, this is a bug, at least in the documentation, but
>>> probably "all real", indeed,
>>>
>>> but read on.
>>>
>>> > On Tue, 2016-02-16 at 18:25 +0100, nos...@altfeld-im.de wrote:
>>> >>
>>> >>
>>> >> If I execute the code from the "?write.table" examples section
>>> >>
>>> >> x <- data.frame(a = I("a \" quote"), b = pi)
>>> >> # (omitted code)
>>> >> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>>> >>
>>> >> the resulting CSV file has a size of 6 bytes which is too short
>>> >> (truncated):
>>> >>
>>> >> """,3
>>>
>>> reproducibly, yes.
>>> If you look at what write.csv does
>>> and then simplify, you can get a similar wrong result by
>>>
>>> write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")
>>>
>>> which results in a file with one line
>>>
>>> """ 3
>>>
>>> and if you debug write.table() you see that its building blocks
>>> here are
>>> file <- file(........, encoding = fileEncoding)
>>>
>>> a writeLines(*, file=file) for the column headers,
>>>
>>> and then "deeper down" C code which I did not investigate.
>>
>> I took a look at connections.c. There is a call to strlen() that gets
>> confused by null characters. I think the obvious fix is to avoid the
>> call to strlen() as the size is already known:
>>
>> Index: src/main/connections.c
>> ===================================================================
>> --- src/main/connections.c (revision 70213)
>> +++ src/main/connections.c (working copy)
>> @@ -369,7 +369,7 @@
>>  /* is this safe? */
>>  warning(_("invalid char string in output conversion"));
>>  *ob = '\0';
>> - con->write(outbuf, 1, strlen(outbuf), con);
>> + con->write(outbuf, 1, ob - outbuf, con);
>>  } while(again && inb > 0); /* it seems some iconv signal -1 on zero-length input */
>>  } else
>>
>>
>>>
>>> But just looking a bit at such a file() object with writeLines()
>>> seems slightly revealing, as e.g., 'eol' does not seem to
>>> "work" for this encoding:
>>>
>>> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
>>> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
>>> > close(ff)
>>> > file.show(fn)
>>> CBA|>
>>> > file.size(fn)
>>> [1] 5
>>> >
>>
>> With the patch applied:
>>
>> > readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
>> [1] "C" "B" "A" "|" ">a"
>> > file.size(fn)
>> [1] 22
>
> That may be okay on Unix, but it's not enough on Windows. There the \n
> that writeLines adds at the end of each line isn't translated to
> UTF-16LE properly, so things get messed up. (I think the \n is
> translated, but the \r that Windows wants is not, so you get a mix of 8
> bit and 16 bit characters.)

That's unfortunate. I tested my tiny patch on Linux. I don't know what
kind of additional changes would be needed to make this work on Windows.

--
Mikko Korpela
Aalto University School of Science
Department of Computer Science

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel