Re: [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

2016-02-23 Thread Martin Maechler
> nospam@altfeld-im de 
> on Mon, 22 Feb 2016 18:45:59 +0100 writes:

> Dear R developers
> I think I have found a bug that can be reproduced with two lines of code,
> and I would be very thankful for your first assessment or feedback on my
> report.

> If this is the wrong mailing list or I did something wrong
> (e.g. using a semi-"anonymous" email address to protect my privacy and
> defend against unwanted spam), please let me know, since I am new here.

> Thank you very much :-)

> J. Altfeld

Dear J.,
(yes, a bit less anonymity would be very welcome here!),

You are right, this is a bug, at least in the documentation, but
probably a real bug in the code as well; read on.

> On Tue, 2016-02-16 at 18:25 +0100, nos...@altfeld-im.de wrote:
>> 
>> 
>> If I execute the code from the "?write.table" examples section
>> 
>> x <- data.frame(a = I("a \" quote"), b = pi)
>> # (omitted code)
>> write.csv(x, file = "foo.csv", fileEncoding = "UTF-16LE")
>> 
>> the resulting CSV file has a size of 6 bytes which is too short
>> (truncated):
>> 
>> """,3

Reproducibly, yes.
If you look at what write.csv() does and then simplify, you get a
similarly wrong result from

  write.table(x, file = "foo.tab", fileEncoding = "UTF-16LE")

which results in a file with one line

""" 3

and if you debug write.table() you see that its building blocks
here are

  file <- file(, encoding = fileEncoding)

a  writeLines(*, file=file)  for the column headers,

and then "deeper down" C code which I did not investigate.

But just looking a bit at such a file() object with writeLines()
seems slightly revealing, as e.g., 'eol' does not seem to
"work" for this encoding:

> fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
> writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
> close(ff)
> file.show(fn)
CBA|>
> file.size(fn)
[1] 5
> 

>> The problem seems to be the iconv function:
>> 
>> iconv("foo", to="UTF-16")
>> 
>> produces
>> 
>> Error in iconv("foo", to = "UTF-16"):
>> embedded nul in string: '\xff\xfef\0o\0o\0'

but this works

> iconv("foo", to="UTF-16", toRaw=TRUE)
[[1]]
[1] ff fe 66 00 6f 00 6f 00

(indeed showing the embedded '\0's)
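
As an editorial aside, here is a minimal C sketch of why that byte sequence cannot live in an ordinary C string. It is not R's code: `ascii_to_utf16le_bom` and `c_string_length` are hypothetical helpers, the encoder handles ASCII input only, and the `ff fe` byte-order mark is simply copied from the output shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: encode an ASCII string as UTF-16LE, preceded by
   the byte-order mark ff fe. Returns the number of bytes written.
   `out` must have room for 2 + 2 * strlen(ascii) bytes. */
size_t ascii_to_utf16le_bom(const char *ascii, unsigned char *out, size_t cap)
{
    size_t n = strlen(ascii);
    size_t k = 0;

    assert(cap >= 2 + 2 * n);
    out[k++] = 0xff;
    out[k++] = 0xfe;
    for (size_t i = 0; i < n; i++) {
        assert((unsigned char)ascii[i] < 0x80); /* ASCII only in this sketch */
        out[k++] = (unsigned char)ascii[i];     /* low byte */
        out[k++] = 0x00;                        /* high byte is zero for ASCII */
    }
    return k;
}

/* Length that a C-string routine would report for the buffer:
   the number of bytes before the first zero byte. */
size_t c_string_length(const unsigned char *buf, size_t len)
{
    const unsigned char *z = memchr(buf, 0, len);
    return z ? (size_t)(z - buf) : len;
}
```

For "foo" the encoder yields the eight bytes ff fe 66 00 6f 00 6f 00, while `c_string_length` reports 3 for the same buffer: anything that treats the result as a NUL-terminated string sees only the BOM and the first letter.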

>> In 2010 a (partial) patch for this problem was submitted:
>> http://tolstoy.newcastle.edu.au/R/e10/devel/10/06/0648.html

That patch only addressed iconv() not accepting a 'raw'
(instead of character) argument x.

... and it is > 5.5 years old, written for an iconv() version that was
less featureful than today's.
The current iconv(x) already allows x to be a list of raw entries.


>> Are there chances to fix this problem since it prevents writing Windows
>> UTF-16LE text files?

>> 
>> PS: This problem can be reproduced on Windows and Linux.

indeed also on "R devel of today".

I agree it should be fixed... but as I said not by the patch you
mentioned.

Tested patches to fix this are welcome, indeed.

Martin Maechler



>> ---
>> 
>> > sessionInfo()
>> R version 3.2.3 (2015-12-10)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 14.04.3 LTS
>> 
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> loaded via a namespace (and not attached):
>> [1] tools_3.2.3
>> >
>> 
>> __
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel



Re: [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

2016-02-23 Thread Mikko Korpela
On 23.02.2016 11:37, Martin Maechler wrote:
> and if you debug  write.table() you see that its building blocks
> here are
>   file <- file(, encoding = fileEncoding)
> 
> a  writeLines(*, file=file)  for the column headers,
> 
> and then "deeper down" C code which I did not investigate.

I took a look at connections.c. There is a call to strlen() that gets
confused by null characters: UTF-16 output contains zero bytes, so
strlen() stops at the first one and the write is cut short. I think the
obvious fix is to avoid the call to strlen(), as the size is already known:

Index: src/main/connections.c
===================================================================
--- src/main/connections.c  (revision 70213)
+++ src/main/connections.c  (working copy)
@@ -369,7 +369,7 @@
             /* is this safe? */
             warning(_("invalid char string in output conversion"));
         *ob = '\0';
-        con->write(outbuf, 1, strlen(outbuf), con);
+        con->write(outbuf, 1, ob - outbuf, con);
         } while(again && inb > 0);  /* it seems some iconv signal -1 on
                                        zero-length input */
     } else
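
Editorial aside: the difference between the two length computations in the patch can be sketched in isolation. This is a hedged illustration, not the code in connections.c: `convert_ascii_to_utf16le` is a hypothetical stand-in for the Riconv() step that handles ASCII only, and `BUFSIZE` is an arbitrary value chosen for the sketch. Only the names `outbuf` and `ob` are taken from the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUFSIZE 64  /* arbitrary size for this sketch */

/* Hypothetical stand-in for the conversion step: expands ASCII input to
   UTF-16LE and advances *ob past the bytes it produced. */
void convert_ascii_to_utf16le(const char *in, char **ob, size_t *onb)
{
    for (; *in != '\0'; in++) {
        assert(*onb >= 2);
        *(*ob)++ = *in;   /* low byte */
        *(*ob)++ = '\0';  /* high byte: an embedded NUL */
        *onb -= 2;
    }
}

/* Byte count as computed before the patch: strlen() stops at the first NUL. */
size_t length_before_patch(const char *in)
{
    char outbuf[BUFSIZE + 1];
    char *ob = outbuf;
    size_t onb = BUFSIZE;

    convert_ascii_to_utf16le(in, &ob, &onb);
    *ob = '\0';
    return strlen(outbuf);
}

/* Byte count as computed after the patch: how far the output pointer moved. */
size_t length_after_patch(const char *in)
{
    char outbuf[BUFSIZE + 1];
    char *ob = outbuf;
    size_t onb = BUFSIZE;

    convert_ascii_to_utf16le(in, &ob, &onb);
    *ob = '\0';
    return (size_t)(ob - outbuf);
}
```

For the input "C\n" the first function returns 1 and the second returns 4. The pointer difference counts every converted byte, zero bytes included, which is the property the patch relies on.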


> 
> But just looking a bit at such a file() object with writeLines()
> seems slightly revealing, as e.g., 'eol' does not seem to
> "work" for this encoding:
> 
> > fn <- tempfile("ffoo"); ff <- file(fn, open="w", encoding = "UTF-16LE")
> > writeLines(LETTERS[3:1], ff); writeLines("|", ff); writeLines(">a", ff)
> > close(ff)
> > file.show(fn)
> CBA|>
> > file.size(fn)
> [1] 5
> > 

With the patch applied:

> readLines(fn, encoding="UTF-16LE", skipNul=TRUE)
[1] "C"  "B"  "A"  "|"  ">a"
> file.size(fn)
[1] 22
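
Editorial aside: both file sizes in this thread (5 bytes before the patch, 22 after) can be reproduced with simple arithmetic. The sketch below assumes that each line, together with its trailing newline, is converted to UTF-16LE and handed to the connection in a single write; that assumption and the helper names `bytes_written` and `total_file_size` are illustrative and not taken from R's sources.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bytes that reach the file for one ASCII line plus "\n" in UTF-16LE.
   use_strlen != 0 mimics the old strlen(outbuf) length,
   use_strlen == 0 mimics the patched `ob - outbuf` length. */
size_t bytes_written(const char *line, int use_strlen)
{
    char outbuf[256];
    char *ob = outbuf;
    size_t n = strlen(line);

    assert(2 * (n + 1) < sizeof outbuf);
    for (size_t i = 0; i <= n; i++) {
        *ob++ = (i < n) ? line[i] : '\n';  /* low byte */
        *ob++ = '\0';                      /* high byte */
    }
    *ob = '\0';
    return use_strlen ? strlen(outbuf) : (size_t)(ob - outbuf);
}

/* Sum over all lines, one write per line. */
size_t total_file_size(const char *const *lines, size_t count, int use_strlen)
{
    size_t total = 0;

    for (size_t i = 0; i < count; i++)
        total += bytes_written(lines[i], use_strlen);
    return total;
}
```

With the five lines "C", "B", "A", "|" and ">a", the strlen() variant keeps one byte per write, 5 bytes in total (the "CBA|>" seen earlier). The pointer-difference variant writes 4 + 4 + 4 + 4 + 6 = 22 bytes, that is 11 characters at two bytes each.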

- Mikko Korpela



Re: [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

2016-02-23 Thread nos...@altfeld-im.de
Excellent analysis, thank you both for the quick reply!

Is there anything I can do to get the bug fixed in the next version of R
(e. g. filing a bug report at https://bugs.r-project.org/bugzilla3/)?




Re: [Rd] iconv to UTF-16 encoding produces error due to embedded nulls (write.table with fileEncoding param)

2016-02-23 Thread Duncan Murdoch

On 23/02/2016 4:53 PM, nos...@altfeld-im.de wrote:
> Excellent analysis, thank you both for the quick reply!
>
> Is there anything I can do to get the bug fixed in the next version of R
> (e. g. filing a bug report at https://bugs.r-project.org/bugzilla3/)?

Wait a few days, and file a bug report if nothing has happened.

Duncan Murdoch



