On Nov 23, 2011, at 6:48 PM, Hadley Wickham wrote:

> Hi all,
>
> I'd like to discuss an infelicity/possible bug with gsub. Take the
> following function:
>
> f <- function(x) {
>   gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
> }
>
> As you might expect, in utf-8 locales it is idempotent:
>
> Sys.setlocale("LC_ALL", "UTF-8")
> f("x y")
> # [1] "x y"
>
> But in the C locale it is not:
>
> Sys.setlocale("LC_ALL", "C")
> f("x y")
> # [1] "x\302\240y"
>
> This seems weird to me. (And caused a bug in a package because I
> didn't realise some Windows users have a non-utf8 locale)
>
> I'm not sure what the correct resolution is. Should the encoding of the
> output of gsub be utf-8 if either the input or replacement is utf-8?

It is, but only if the input is UTF-8; that is what causes the asymmetry. Part of the problem is that you cannot declare a 7-bit string as UTF-8 (even though it is valid UTF-8), so you can't work around it by forcing the encoding.

> In non-utf-8 locales should the encoding of "\u{A0}" be bytes?
>

No, because the whole point of the encoding is to define the content. "\ua0" defines one Unicode character, whereas "\302\240" defines two bytes with unknown meaning. The meaning of UTF-8 encoded strings is still valid in non-UTF-8 locales, which is the reason why you can work with UTF-8 strings in R irrespective of the locale (a very useful thing).

I would suggest handling the special case of 7-bit input and UTF-8 replacement such that it results in UTF-8 output (as opposed to the bytes output which happens now). The relevant code is somewhat convoluted (and more so in R-devel), so I'm not volunteering to do it, though.

Just to make things clearer, the current result (in the C locale):

> gsub(" ","\ua0", "foo bar")
[1] "foo\302\240bar"

Possibly desired result:

> gsub(" ","\ua0", "foo bar")
[1] "foo<U+00A0>bar"

Cheers,
Simon

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
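
P.S. To illustrate the point about encoding marks, here is a small sketch using only base R (`Encoding`, `charToRaw`). The variable names (`nbsp`, `x`, `res`) are made up for the example, and the last step is only a possible manual repair under the assumption that the bytes really are UTF-8; it is not a description of what gsub does internally.

```r
# A string written with a \u escape is marked as UTF-8 by the parser.
nbsp <- "\u00a0"
Encoding(nbsp)          # "UTF-8"
charToRaw(nbsp)         # c2 a0 (the two bytes shown as \302\240 in octal)

# A pure 7-bit (ASCII) string carries no encoding mark ...
x <- "foo bar"
Encoding(x)             # "unknown"

# ... and an attempt to force one is silently ignored for ASCII content.
Encoding(x) <- "UTF-8"
Encoding(x)             # "unknown"

# Assumption: a result that came back as unmarked bytes is known to be UTF-8.
# Because it now contains non-ASCII bytes, the mark can be set by hand.
res <- "foo\302\240bar"
Encoding(res) <- "UTF-8"
Encoding(res)           # "UTF-8"
```

Re-marking by hand is only safe when you know where the bytes came from, which is why handling this case inside gsub seems preferable.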