On Monday 09 September 2013 at 13:41 -0400, Simon Urbanek wrote:
> On Sep 9, 2013, at 12:46 PM, Milan Bouchet-Valat wrote:
>
> > On Monday 09 September 2013 at 13:59 +0100, Prof Brian Ripley wrote:
> >> On 09/09/2013 09:49, Milan Bouchet-Valat wrote:
> >>> Hi!
> >>>
> >>> I experience an error with an invalid UTF-8 character passed to
> >>> gsub(..., perl=TRUE); the interesting point is that with perl=FALSE (the
> >>> default) no error happens. (The character itself was read from an
> >>> invalid HTML file.) Illustration of the error:
> >>>
> >>> gsub("a", "", "\U3e3965", perl=FALSE)
> >>> # [1] "\U3e3965"
> >>> gsub("a", "", "\U3e3965", perl=TRUE)
> >>> # Error in gsub("a", "", "\U3e3965", perl = TRUE) :
> >>> #   input string 1 is invalid UTF-8
> >>>
> >>>
> >>> The error message in the second command seems to come from
> >>> src/main/grep.c:1640 (in do_gsub):
> >>> if (!utf8Valid(s)) error(_("input string %d is invalid UTF-8"), i+1);
> >>>
> >>> utf8Valid() relies on valid_utf8() from PCRE, whose behavior is
> >>> described in src/extra/pcre/pcre_valid_utf8.c.
> >>>
> >>>
> >>>
> >>> Even more problematic/interesting is the fact that iconv() does not
> >>> consider the above character as invalid, as it does not replace it when
> >>> using the sub argument.
> >>> iconv("a\U3e3965", sub="")
> >>> # [1] "a\U003e3965"
> >>>
> >>> On the contrary, an invalid sequence such as \xff is substituted:
> >>> iconv("a\xff", sub="")
> >>> # [1] "a"
> >>>
> >>> This makes it difficult to sanitize the string before passing it to
> >>> gsub(perl=TRUE). Thus, I'm wondering whether something could be done,
> >>> and where. Should iconv() and PCRE be made to agree on the definition of
> >>> an invalid UTF-8 sequence?
> >>
> >> iconv() is using a system service: read its help page. So you know
> >> where to report this ....
> > Yeah, but why is "\U003e3965" considered valid by gsub(perl=TRUE) and
> > printed as a character on Windows 7, and not on Linux? Do you think this
> > is a separate bug on Windows?
> >
>
> As per RFC 3629, UTF-8 does not support characters beyond U+10FFFF, so
> your U+3E3965 is not encodable in UTF-8 (it is encodable in the older
> scheme).
>
> The trick is that Windows doesn't have a UTF-8 locale and only
> supports 16-bit, so it truncates the content to U+3965:
>
> > charToRaw("\U3e3965")
> [1] e3 a5 a5
>
> That's why it seemingly works there.

Good catch!
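For what it's worth, the numbers in the explanation above can be checked with a few lines of base R. This is only a sketch: old_utf8_5byte() and beyond_rfc3629() are hypothetical helper names (not part of R or PCRE), and the five-byte form assumes the pre-RFC 3629 layout of one 111110xx lead byte followed by four 10xxxxxx continuation bytes.

```r
# Keeping only the low 16 bits of U+3E3965 gives U+3965.
cp <- 0x3E3965
low16 <- cp %% 0x10000
bytes_3965 <- charToRaw(intToUtf8(low16))
print(bytes_3965)                      # e3 a5 a5, as in the quoted output

# Hypothetical helper: encode a code point in the old five-byte scheme
# (assumed layout: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx).
# Only meaningful for code points from 0x200000 to 0x3FFFFFF.
old_utf8_5byte <- function(cp) {
  shifts <- c(24, 18, 12, 6, 0)
  six <- (cp %/% 2^shifts) %% 64       # successive 6-bit groups
  as.raw(c(0xF8 + six[1], 0x80 + six[-1]))
}
print(old_utf8_5byte(cp))              # f8 8f a3 a5 a5

# Hypothetical helper: TRUE for each string whose raw bytes can only
# encode a value above U+10FFFF, the RFC 3629 limit.
beyond_rfc3629 <- function(x) {
  vapply(x, function(s) {
    b <- as.integer(charToRaw(s))
    n <- length(b)
    # Lead bytes 0xF5..0xFD start sequences that are always out of range.
    if (any(b >= 0xF5 & b <= 0xFD)) return(TRUE)
    # Lead byte 0xF4 followed by 0x90..0xBF also exceeds U+10FFFF.
    i <- which(b == 0xF4)
    i <- i[i < n]
    any(b[i + 1L] >= 0x90 & b[i + 1L] <= 0xBF)
  }, logical(1), USE.NAMES = FALSE)
}

# Build the problematic string from bytes rather than a "\U" escape.
bad <- rawToChar(c(charToRaw("a"), old_utf8_5byte(cp)))
print(beyond_rfc3629(c("abc", bad)))   # FALSE TRUE
```

A byte-level check like this does not depend on iconv(), whose validation comes from the system implementation, so it could be used to flag such strings before calling gsub(perl=TRUE).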
So this is just a matter of iconv() being too tolerant, while PCRE is
right. As Brian Ripley said, I know where to report the bug.


Regards

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel