Hi all,

I'd like to discuss an infelicity (possibly a bug) in gsub. Take the following function:

    f <- function(x) {
      gsub("\u{A0}", " ", gsub(" ", "\u{A0}", x))
    }

As you might expect, in UTF-8 locales it is idempotent:

    Sys.setlocale("LC_ALL", "UTF-8")
    f("x y")
    # [1] "x y"

But in the C locale it is not:

    Sys.setlocale("LC_ALL", "C")
    f("x y")
    # [1] "x\302\240y"

This seems weird to me. (It also caused a bug in a package, because I didn't realise that some Windows users have a non-UTF-8 locale.)

I'm not sure what the correct resolution is. Should the output of gsub be encoded as UTF-8 if either the input or the replacement is UTF-8? In non-UTF-8 locales, should the encoding of "\u{A0}" be bytes?

Hadley

-- 
Assistant Professor / Dobelman Family Junior Chair
Department of Statistics / Rice University
http://had.co.nz/

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
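
P.S. For anyone trying to reproduce this, the following is a minimal sketch for inspecting the two intermediate strings that f() produces. describe_string() is a hypothetical helper introduced here purely for illustration (it is not part of base R); it reports only the declared encoding, via Encoding(), and the raw bytes, via charToRaw(). No expected output is shown for the non-ASCII strings, because that may depend on the locale and the R version.

```r
# Hypothetical helper (not in base R): report how a string is marked
# and which bytes it actually contains.
describe_string <- function(s) {
  list(
    marked = Encoding(s),                                        # "unknown", "UTF-8", "latin1" or "bytes"
    bytes  = paste(as.character(charToRaw(s)), collapse = " ")   # hex bytes, space separated
  )
}

nbsp  <- "\u{A0}"                 # no-break space, written as a Unicode escape
step1 <- gsub(" ", nbsp, "x y")   # the inner gsub() call from f()
step2 <- gsub(nbsp, " ", step1)   # the outer gsub() call from f()

# Compare the declared encoding and the bytes at each stage.
str(describe_string(nbsp))
str(describe_string(step1))
str(describe_string(step2))
```

Running this once after Sys.setlocale("LC_ALL", "UTF-8") and once after Sys.setlocale("LC_ALL", "C") should show at which step the declared encoding and the bytes stop agreeing.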