On 13-11-09 6:58 PM, Ben Bolker wrote:
Duncan Murdoch <murdoch.duncan <at> gmail.com> writes:


On 13-11-09 12:07 PM, Sverre Stausland wrote:
As recently discussed on Stack Overflow, R for Mac OS and Ubuntu (so
probably all Unix systems) can correctly write files with UTF-8
encoding, but R for Windows cannot:

That's not an accurate description of the problem.  Some functions in R
convert values to the native encoding, but not all do.

http://stackoverflow.com/questions/19877676/write-utf-8-files-from-r

I strongly suggest that R for Windows should support this feature in
upcoming versions.

It's not trivial to do.  When R was written, and perhaps still on some
obscure platforms, there wasn't any way to do that--Windows didn't
support UTF-8 then, just Microsoft's version of UCS-2 and a variety of
other more limited encodings.  Unix platforms didn't support UCS-2.  So
internally R keeps many things in the native encoding.

If you decide to rewrite R from scratch now, I'd suggest that you handle
things differently.  If you'd rather not rewrite it yourself, then I
don't know how you will convince someone else to take on that job.

You might find it easier to convince Microsoft to add a UTF-8 locale, so
then the native encoding would be UTF-8, and the problem would go away.

Duncan Murdoch

   Would it be fairer / more productive to say/ask:

* it would be nice if write.table could write files in UTF-8 encoding

I agree. A couple of months ago I investigated the fact that read.table could not read UTF-8 files if the characters could not be converted to the native encoding. (E.g. reading Russian characters in an English locale seemed to be impossible.) readLines() can read them, but read.table converted them to the native encoding and that killed them.

This is probably fixable, but it requires low level changes to a very commonly used function, so it's likely to break something somewhere.

I haven't looked closely at write.table, but I suspect the problem there is with format(). Connections know their encoding, but format() converts to the native encoding.

* is there any documentation already available about which R functions
_do_ handle UTF-8 output on Windows, and how they do it?

I don't think so. In general, functions that convert to the native encoding break UTF-8 on Windows, because the native encoding is often Latin1 or some other encoding that doesn't cover all the characters that UTF-8 can represent. If you look through the source you can work out which ones those are, but it's not easy.
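
The loss itself can be reproduced on any platform by doing the conversion explicitly. This is only a sketch: the sample strings are made up, and iconv() stands in for the conversion that R does internally when a function converts to a Latin-1 native encoding.

```r
# Hypothetical sample: a Cyrillic word, written with \u escapes so that this
# snippet itself is plain ASCII. Such strings are marked as UTF-8 by R.
x <- "\u043f\u0440\u0438\u0432\u0435\u0442"
Encoding(x)

# Latin-1 has no Cyrillic letters, so this conversion cannot preserve x.
# Depending on the platform's iconv, z is NA or a string with substitutions.
z <- iconv(x, from = "UTF-8", to = "latin1")
z

# A string that Latin-1 does cover survives the round trip unchanged.
w <- "caf\u00e9"
w2 <- iconv(iconv(w, from = "UTF-8", to = "latin1"),
            from = "latin1", to = "UTF-8")
identical(w, w2)
```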

* could they be used as models for adapting write.table to write files
in UTF-8 encoding on Windows?

   i.e., instead of "convert R to output UTF-8 universally on Windows",
"figure out how to make write.table output UTF-8 on Windows, or
suggest a workaround" ?

I imagine if I (or someone else) attempt to get read.table working in this situation then I'd try to get write.table working too.
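
In the meantime, one workaround for plain text is to skip the re-encoding step: make sure the strings are UTF-8 and have writeLines() write their bytes unchanged with useBytes = TRUE. The following is a minimal sketch, not a fix for write.table itself; the sample data and the temporary file are illustrative assumptions.

```r
# Hypothetical sample data; the second element is Cyrillic, given as \u escapes.
x <- c("plain ascii", "\u043f\u0440\u0438\u0432\u0435\u0442")
x <- enc2utf8(x)                      # ensure the strings are stored as UTF-8

path <- tempfile(fileext = ".txt")
con <- file(path, open = "wb")        # binary mode: no newline translation
writeLines(x, con, useBytes = TRUE)   # write the stored bytes, no re-encoding
close(con)

# Read the file back and mark the strings as UTF-8. The encoding argument
# only marks the strings; it does not convert them.
y <- readLines(path, encoding = "UTF-8")
identical(charToRaw(y[2]), charToRaw(x[2]))
```

Because useBytes = TRUE writes whatever bytes the strings hold, the enc2utf8() call matters: strings held in another encoding would be written in that encoding.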

Duncan Murdoch

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
