The help page for `?gsub` says (in the context of performance considerations):

"... just one UTF-8 string will force all the matching to be done in Unicode"


However, this Stack Overflow thread (https://stackoverflow.com/q/76749529) suggests that this is not true for `perl = TRUE`. Specifically:

> strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος", "Jean-François Dupuis")
> Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
> regex <- "\\B\\w+| +"
> gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

> gsub(regex, "", strings, perl = TRUE)
[1] "85" "JS" "ΓιάννηςΠαπαδόπουλος" "J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer when the regex option /u ("match with full Unicode") is specified, but the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this looks like a flag may have been missed in the `perl = TRUE` case.
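
One way to probe this is sketched below. It assumes that R's PCRE build honours PCRE's `(*UCP)` pattern prefix, which asks PCRE to use Unicode properties for `\w`, `\B` and similar escapes. The variable names are illustrative only, and the non-ASCII character is written as a `\u` escape so that the string is marked as UTF-8 regardless of the session locale.

```r
# Two of the strings from the example above; "\u00e7" is c-cedilla.
strings <- c("John Smith", "Jean-Fran\u00e7ois Dupuis")
regex <- "\\B\\w+| +"

# Plain perl = TRUE call, as in the transcript above.
ascii_only <- gsub(regex, "", strings, perl = TRUE)

# Same pattern with the (*UCP) prefix (assumption: supported by this PCRE build).
with_ucp <- gsub(paste0("(*UCP)", regex), "", strings, perl = TRUE)

print(ascii_only)
print(with_ucp)
```

If the two results differ and the second one agrees with the default (non-perl) engine, that would be consistent with `\w` being treated as ASCII-only under `perl = TRUE` unless Unicode properties are requested explicitly.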

Duncan Murdoch

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
