On Oct 28, 2008, at 6:26, Fán Lóng wrote:

Hi guys,


Hey guy :)


I've got a question about the API mkchar(). I have run into some difficulty passing a UTF-8 string to mkchar() in R-2.7.0.


There is no mkchar() in R. Did you perhaps mean mkChar()?


I was intending to pass a utf-8 string str_jan (some Japanese
characters such as ふ, whose utf-8 code is E381B5

There is no such "UTF-8" code. I'm not sure if you meant Unicode, but that would be \u3075 (Hiragana hu) for that character. The UTF-8 encoding of that character is a three-byte sequence 0xe3 0x81 0xb5 if that's what you meant.
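
To make the two notions concrete, here is a tiny C sketch (the variable names are mine, purely for illustration):

  unsigned int hu_codepoint = 0x3075;     /* the Unicode code point U+3075 */
  const char hu_utf8[] = "\xe3\x81\xb5";  /* its UTF-8 encoding: bytes 0xe3 0x81 0xb5 plus a terminating NUL */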


) to the R API SEXP
mkChar(const char *name); we only need to create the SEXP using the
string that we passed.



Unfortunately, I found that when passing the variable str_jan, R will
automatically convert str_jan according to the current locale
setting,

That is not true - it will be kept as-is regardless of the encoding. Note that mkChar(x) is equivalent to mkCharCE(x, CE_NATIVE); no conversion takes place when the string is created, but you have told R that it is in the native encoding. If that is not true (which in your case it probably isn't), all bets are off since you're lying to R ;).
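
In other words (a minimal sketch, with x standing for any NUL-terminated C string you pass in):

  /* equivalent: the bytes of x are stored untouched and merely
     *declared* to be in the native encoding */
  SEXP s1 = mkChar(x);
  SEXP s2 = mkCharCE(x, CE_NATIVE);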


so only in the English locale does the function work correctly; under other locales, such as Japanese or Chinese, the string will be converted incorrectly.

That is clearly nonsense, since the encoding has nothing to do with the locale language itself (Japanese, Chinese, ...). We are talking about the encoding (note that both English and Japanese locales can use UTF-8 encoding, but don't have to). I think you'll need to get the concepts right here - for each string you must declare the encoding in order to be able to reproduce the Unicode sequence that the string represents. At this point it has nothing to do with the language.
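
If it helps, a small sketch in plain C (not R API; the locale names are just examples and depend on what your OS has installed):

  #include <locale.h>

  int main(void) {
      /* same UTF-8 *encoding*, different *language* part: */
      setlocale(LC_CTYPE, "en_US.UTF-8");  /* English,  UTF-8     */
      setlocale(LC_CTYPE, "ja_JP.UTF-8");  /* Japanese, UTF-8     */
      /* same language, different encoding: */
      setlocale(LC_CTYPE, "ja_JP.SJIS");   /* Japanese, Shift-JIS */
      return 0;
  }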


As a matter of fact, those UTF-8 bytes already form a Unicode string and don't need to be converted at all.

I also tried to use SEXP Rf_mkCharCE(const char *, cetype_t), passing CE_UTF8 as the cetype_t argument, but the result is worse. It returned the result as UCS code, a kind of Unicode under the Windows platform.


Well, that's exactly what you want, isn't it? The string is correctly flagged as UTF-8, so R is finally able to find out what exactly is represented by that string. However, your locale apparently doesn't support such characters, so it cannot be displayed. If you use a locale that supports it, it works just fine; for example, if you use a locale with SJIS encoding, R will still know how to convert it from UTF-8 to SJIS *for display*. The actual string is not touched.

Here is a small piece of code that shows you the difference between native-encoding and UTF-8 strings:

#include <R.h>
#include <Rinternals.h>

SEXP me() {
  /* UTF-8 byte sequence for U+3075 (Hiragana "hu"), NUL-terminated */
  const char c[] = { 0xe3, 0x81, 0xb5, 0 };
  SEXP a = allocVector(STRSXP, 2);
  PROTECT(a);
  /* same bytes, different declared encoding: */
  SET_STRING_ELT(a, 0, mkCharCE(c, CE_NATIVE)); /* claim native encoding */
  SET_STRING_ELT(a, 1, mkCharCE(c, CE_UTF8));   /* claim UTF-8 encoding  */
  UNPROTECT(1);
  return a;
}

In a UTF-8 locale it doesn't matter:

ginaz:sandbox$ LANG=ja_JP.UTF-8 R
> .Call("me")
[1] "ふ" "ふ"

But in any other, let's say SJIS, it does:

ginaz:sandbox$ LANG=ja_JP.SJIS R
> .Call("me")
[1] "縺オ" "ふ"

Note that the first string is wrong, because we supplied UTF-8 bytes but declared them to be in the current (SJIS) encoding. The second one is correct since we told R that it's UTF-8 encoded.

Finally, if the character cannot be displayed in the given encoding:

ginaz:sandbox$ LANG=en_US.US-ASCII R
> .Call("me")
[1] "\343\201\265" "<U+3075>"

The first one is wrong again, since it's not flagged as UTF-8, but the second one is exactly as expected - Unicode U+3075, which is the Hiragana "hu". It doesn't exist in US-ASCII, so the Unicode designation is all you can display.


All I want to get is just a SEXP object containing the original UTF-8 string, no matter what locale is currently set. What should I normally do?


mkCharCE(x, CE_UTF8);
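
Spelled out as a full entry point (a minimal sketch; the function name wrap_utf8 and the hard-coded bytes are just placeholders for your str_jan):

  #include <R.h>
  #include <Rinternals.h>

  SEXP wrap_utf8(void) {
      const char *str_jan = "\xe3\x81\xb5";  /* your UTF-8 bytes (here U+3075) */
      SEXP s = PROTECT(allocVector(STRSXP, 1));
      /* flag the element as UTF-8, so the current locale is irrelevant */
      SET_STRING_ELT(s, 0, mkCharCE(str_jan, CE_UTF8));
      UNPROTECT(1);
      return s;
  }

Call it via .Call("wrap_utf8") from R, just like the example above.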

Cheers,
Simon

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
