Re: [Rd] String encoding problem

Hadley Wickham Thu, 07 Jul 2016 09:19:04 -0700

>>> I'm not sure what should happen here, but that's not a legal string in a
>>> UTF-8 locale, so it's not too surprising that things go wonky.
>>
>> Here's a bit more context on how I got that sequence of bytes:
>>
>> x <- "こんにちは"
>> y <- iconv(x, to = "Shift-JIS")
>> Encoding(y)
>> y
>>
>> I did this to create an example to demonstrate how to handle encoding
>> problems, and it's a bit frustrating that I have to manually mangle the
>> string in order to be able to re-use it in another session.  Maybe
>> strings with unknown encoding shouldn't use Unicode escapes?
>>
>
> The real issue is that the only supported string encodings in R are native
> (= current locale), latin1, and UTF-8. So unless you're running in a
> Shift-JIS locale, that encoding is not supported in your R, and the result
> of the iconv() above is not a valid R string, just a sequence of bytes that
> R doesn't know how to deal with. It tries to interpret it in your locale
> (UTF-8) just as a guess, but that doesn't quite work. To illustrate, doing
> this in the C locale yields a different result:
>
>> x
> [1] "<U+3053><U+3093><U+306B><U+3061><U+306F>"
>> y <- iconv(x, from="UTF-8", to = "Shift-JIS")
>> y
> [1] "\202\261\202\361\202\311\202\277\202\315"
>
> If you want a result that does not depend on your locale and is none of the 
> supported encodings, you have to declare it as bytes (back in UTF-8):
>
>> Encoding(y)="bytes"
>> y
> [1] "\\x82\\xb1\\x82\\xf1\\x82\\xc9\\x82\\xbf\\x82\\xcd"
>> iconv(y, from="Shift-JIS", to="utf-8")
> [1] "こんにちは"
>
> But that has its own perils such as the fact that you cannot dput() 
> byte-encoded strings.


Right - I'm aware of that.  But to me, it doesn't seem correct to
print a string that is not a valid R string. Why is an unknown
encoding printed like UTF-8?
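
For reference, one possible way to carry the Shift-JIS example between
sessions without depending on how the string prints is to keep it as a raw
vector. This is only a sketch: the name `sjis` is made up for illustration,
the byte values are copied from the escapes quoted above, and it assumes the
platform's iconv knows the "Shift-JIS" name and accepts a list of raw
vectors as input, as documented in ?iconv.

```r
# Shift-JIS bytes for the example string, copied from the escapes quoted
# above (\202\261... in octal is 0x82 0xb1... in hex).
# `sjis` is a hypothetical name used only for this sketch.
sjis <- as.raw(c(0x82, 0xb1, 0x82, 0xf1, 0x82, 0xc9, 0x82, 0xbf, 0x82, 0xcd))

# A raw vector can be dput() and pasted into another session, which a
# string marked as "bytes" cannot.
dput(sjis)

# iconv() accepts a list of raw vectors, so the bytes never have to pass
# through a character string in an unsupported encoding.
x <- iconv(list(sjis), from = "Shift-JIS", to = "UTF-8")

# Inspect the resulting UTF-8 bytes rather than relying on printing.
charToRaw(x)
```

Because the bytes stay in a raw vector, nothing here depends on how an
unsupported encoding prints in the current locale.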

Hadley

-- 
http://hadley.nz

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
