On May 1, 2013, at 5:33 PM, Simon Urbanek wrote:

>
> On May 1, 2013, at 10:06 AM, Hadley Wickham wrote:
>
>> Hi all,
>>
>> In what encoding does format.POSIXct return its output? It doesn't
>> seem to be utf-8:
>>
>> Sys.setlocale("LC_ALL", "Japanese_Japan.932")
>>
>> times <- c("1970-01-01 01:00:00 UTC", "1970-02-02 22:00:00 UTC")
>> ampm <- format(as.POSIXct(times), format = "%p")
>> x <- gsub(">", "*", paste(ampm, collapse = "+>"))
>>
>> y <- "午前+*午後"
>> identical(x, y)
>> # [1] TRUE
>>
>> # But, confusingly, ...
>>
>> charToRaw(x)
>> # [1] e5 8d 88 e5 89 8d 2b 2a e5 8d 88 e5 be 8c
>>
>> charToRaw(y)
>> # [1] 8c df 91 4f 2b 2a 8c df 8c e3
>>
>
> That's not confusing at all:
>
>> Encoding(x)
> [1] "UTF-8"
>> Encoding(y)
> [1] "unknown"
>
> The first string is in UTF-8 the second is in the local locale (here 932).
>
>
>> # So there's at least a small bug with identical
>>
>
> Nope: ?identical
> "Character strings are regarded as identical if they are in different marked
> encodings but would agree when translated to UTF-8."
>
>
>> # And this causes a problem when you attempt to do
>> # stuff with the string
>>
>> gsub("+", "*", x, fixed = T)
>> # Error in gsub("+", "*", x, fixed = T) :
>> # invalid multibyte string at '<8c>'
>> gsub("+", "*", y, fixed = T)
>> # [1] "午前**午後"
>>
>
> This is where the problem lies - and it has nothing to do with format:
>
>> z=enc2utf8("午前+*午後")
>> gsub("+", "*", z, fixed = T)
> Error in gsub("+", "*", z, fixed = T) :
> invalid multibyte string at '<8c>'
>
> The cause is that fgrep_one() gives higher precedence to mbcslocale than
> use_UTF8 so the grep is actually done in the MBCS locale and not UTF-8.
> Consequently, you'll see this only in multi-byte locales other than UTF-8, so
> on let's say OS X you can reproduce it with
>
>> x="午前+*午後"
>> gsub("+", "*", x, fixed = T)
> Error in gsub("+", "*", x, fixed = T) :
> invalid multibyte string at '<8c>'
>

This should have been

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
> x="午前+*午後"
> Encoding(x)
[1] "UTF-8"
> Sys.setlocale("LC_ALL", "ja_JP.SJIS")
[1] "ja_JP.SJIS/ja_JP.SJIS/ja_JP.SJIS/C/ja_JP.SJIS/en_US.UTF-8"
> gsub("+", "*", x, fixed = T)
Error in gsub("+", "*", x, fixed = T) :
invalid multibyte string at '<8c>'

Cheers,
S

> Inverting the precedence would fix this issue, but I'm not sure if it would
> have unwanted side-effects on MBCS locales ...
>
> Cheers,
> Simon
>
>
>>
>> My session info is
>>
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-w64-mingw32/x64 (64-bit)
>>
>> locale:
>> [1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
>> [3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
>> [5] LC_TIME=Japanese_Japan.932
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> loaded via a namespace (and not attached):
>> [1] tools_3.0.0
>>
>> Any ideas? Thanks!
>>
>> Hadley
>>
>> --
>> Chief Scientist, RStudio
>> http://had.co.nz/
>>
>> ______________________________________________
>> R-devel@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-devel
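
For anyone who wants to check the encoding explanation in this thread without switching locales, the two strings can be rebuilt from the raw bytes that charToRaw() printed in the original report. This is only a minimal sketch: the variable names are made up for illustration, and it assumes the iconv implementation behind R recognizes the encoding name "CP932", which is not something the thread itself confirms.

```r
# Bytes reported by charToRaw(x) in the thread: the text encoded as UTF-8.
x_raw <- as.raw(c(0xe5, 0x8d, 0x88, 0xe5, 0x89, 0x8d, 0x2b, 0x2a,
                  0xe5, 0x8d, 0x88, 0xe5, 0xbe, 0x8c))
# Bytes reported by charToRaw(y): the same text in the native code page 932.
y_raw <- as.raw(c(0x8c, 0xdf, 0x91, 0x4f, 0x2b, 0x2a, 0x8c, 0xdf, 0x8c, 0xe3))

# Rebuild x and mark it as UTF-8, matching what Encoding(x) reports in the thread.
x <- rawToChar(x_raw)
Encoding(x) <- "UTF-8"

# Assumption: this iconv build accepts the encoding name "CP932".
# Translate x to code page 932, and y's bytes to UTF-8, keeping raw output.
x_as_cp932 <- iconv(x, from = "UTF-8", to = "CP932", toRaw = TRUE)[[1]]
y_as_utf8  <- iconv(list(y_raw), from = "CP932", to = "UTF-8", toRaw = TRUE)[[1]]

# Compare the translated bytes in both directions.
same_in_cp932 <- identical(x_as_cp932, y_raw)
same_in_utf8  <- identical(y_as_utf8, x_raw)
print(c(same_in_cp932, same_in_utf8))
```

If both comparisons come out TRUE, that is consistent with the quoted ?identical text: the two strings differ byte for byte but agree once they are translated to a common encoding. It does not exercise the fgrep_one() precedence issue itself, which still needs a non-UTF-8 multi-byte locale to reproduce.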