[Rd] Bug in perl=TRUE regexp matching?

Duncan Murdoch Sun, 23 Jul 2023 13:29:32 -0700

The help page for `?gsub` says (in the context of performance considerations):

"... just one UTF-8 string will force all the matching to be done in Unicode"

However, this thread on SO: https://stackoverflow.com/q/76749529 gives some indication that this is not true for `perl = TRUE`. Specifically:

> strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος", "Jean-François Dupuis")
> Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
> regex <- "\\B\\w+| +"
> gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

> gsub(regex, "", strings, perl = TRUE)
[1] "85"                  "JS"                  "ΓιάννηςΠαπαδόπουλος" "J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer when the regex option /u ("match with full Unicode") is specified, but the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this looks like a flag may have been missed in the `perl = TRUE` case.
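
For anyone who wants to test the "missed flag" idea, here is a minimal sketch rather than a definitive diagnosis. It assumes that the PCRE start-of-pattern verb `(*UCP)`, described in `?regex`, is honoured by the PCRE build R is linked against. The names `regex_ucp`, `res_tre`, `res_pcre` and `res_ucp` are only illustrative, and the strings are written with \u escapes purely so that the last two are marked as UTF-8 in any locale.

```r
# Same strings as in the transcript, written with \u escapes so that the
# last two are marked as UTF-8 regardless of the session locale.
strings <- c("89 562", "John Smith",
             "\u0393\u03b9\u03ac\u03bd\u03bd\u03b7\u03c2 \u03a0\u03b1\u03c0\u03b1\u03b4\u03cc\u03c0\u03bf\u03c5\u03bb\u03bf\u03c2",
             "Jean-Fran\u00e7ois Dupuis")
regex <- "\\B\\w+| +"

# Default (TRE) engine, for reference.
res_tre <- gsub(regex, "", strings)

# PCRE with no extra option, as in the report.
res_pcre <- gsub(regex, "", strings, perl = TRUE)

# Assumption: prefixing the pattern with (*UCP) asks PCRE to classify
# characters for \w and \B by Unicode properties instead of ASCII only.
regex_ucp <- paste0("(*UCP)", regex)
res_ucp <- gsub(regex_ucp, "", strings, perl = TRUE)

print(res_tre)
print(res_pcre)
print(res_ucp)
```

If the third result agrees with the first, the difference would come down to how `\w` classifies non-ASCII characters under PCRE, rather than to whether the matching itself is done in Unicode.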


Duncan Murdoch

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
