On Monday, 09 September 2013 at 13:59 +0100, Prof Brian Ripley wrote:
> On 09/09/2013 09:49, Milan Bouchet-Valat wrote:
> > Hi!
> >
> > I experience an error with an invalid UTF-8 character passed to
> > gsub(..., perl=TRUE); the interesting point is that with perl=FALSE
> > (the default) no error happens. (The character itself was read from
> > an invalid HTML file.) Illustration of the error:
> >
> > gsub("a", "", "\U3e3965", perl=FALSE)
> > # [1] "\U3e3965"
> > gsub("a", "", "\U3e3965", perl=TRUE)
> > # Error in gsub("a", "", "\U3e3965", perl = TRUE) :
> > #   input string 1 is invalid UTF-8
> >
> > The error message in the second command seems to come from
> > src/main/grep.c:1640 (in do_gsub):
> > if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);
> >
> > utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
> > described in src/extra/pcre/pcre_valid_utf8.c.
> >
> > Even more problematic/interesting is the fact that iconv() does not
> > consider the above character as invalid, as it does not replace it
> > when using the sub argument:
> > iconv("a\U3e3965", sub="")
> > # [1] "a\U003e3965"
> >
> > On the contrary, an invalid sequence such as \xff is substituted:
> > iconv("a\xff", sub="")
> > # [1] "a"
> >
> > This makes it difficult to sanitize the string before passing it to
> > gsub(perl=TRUE). So I'm wondering whether something could be done,
> > and where. Should iconv() and PCRE be made to agree on the
> > definition of an invalid UTF-8 sequence?
> iconv() is using a system service: read its help page. So you know
> where to report this ....
Yes, but why is "\U003e3965" considered valid by gsub(perl=TRUE) on
Windows 7, where it is printed as a character, but not on Linux? Do you
think this is a separate bug on Windows?
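In the meantime, the best I can come up with is to catch the error and
fall back to the default regex engine, roughly like this (just a rough
sketch, safe_gsub is only an illustrative name; the fallback is coarse
because gsub(perl=TRUE) fails for the whole vector as soon as one
element is rejected):

# Try the PCRE engine first; if it rejects the input as invalid UTF-8,
# redo the substitution with the default engine, which tolerates it.
safe_gsub <- function(pattern, replacement, x, ...) {
    tryCatch(gsub(pattern, replacement, x, perl = TRUE, ...),
             error = function(e)
                 gsub(pattern, replacement, x, perl = FALSE, ...))
}

safe_gsub("a", "", "\U3e3965")
# [1] "\U3e3965"   (on Linux, same as with perl=FALSE above)

But that silently loses the PCRE features one asked perl=TRUE for, so a
real fix in iconv()/PCRE agreement would still be preferable.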
Thanks for your help!