Hi! I get an error when a string containing an invalid UTF-8 character is passed to gsub(..., perl=TRUE); interestingly, with perl=FALSE (the default) no error occurs. (The character itself was read from an invalid HTML file.) Illustration of the error:

    gsub("a", "", "\U3e3965", perl=FALSE)
    # [1] "\U3e3965"
    gsub("a", "", "\U3e3965", perl=TRUE)
    # Error in gsub("a", "", "\U3e3965", perl = TRUE) :
    #   input string 1 is invalid UTF-8

The error message from the second command seems to come from src/main/grep.c:1640 (in do_gsub):

    if (!utf8Valid(s))
        error(_("input string %d is invalid UTF-8"), i+1);

utf8Valid() relies on valid_utf8() from PCRE, whose behavior is described in src/extra/pcre/pcre_valid_utf8.c.

Even more problematic is that iconv() does not treat the above character as invalid: it leaves it in place when the sub argument is used.

    iconv("a\U3e3965", sub="")
    # [1] "a\U003e3965"

By contrast, an invalid sequence such as \xff is substituted:

    iconv("a\xff", sub="")
    # [1] "a"

This makes it difficult to sanitize the string before passing it to gsub(perl=TRUE).

Thus, I'm wondering whether something could be done, and where. Should iconv() and PCRE be made to agree on the definition of an invalid UTF-8 sequence?

Regards