On 7/23/23 4:29 PM, Duncan Murdoch wrote:
The help page for `?gsub` says (in the context of performance considerations):


"... just one UTF-8 string will force all the matching to be done in Unicode"

It's been a little while since I looked at the code but IIRC this just means that strings are converted to UTF-8 before matching. The problem here seems to be more about the interpretation of the "\\w+" token by PCRE. I think this makes it a little clearer what's going on:

    gsub("\\w", "a", "Γ", perl=TRUE)
    [1] "Γ"

So no match. The PCRE docs https://www.pcre.org/original/doc/html/pcrepattern.html (this might be the old docs, but it works for our purposes here) mention that we can turn on Unicode property matching with the "(*UCP)" token:

     gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
     [1] "a"

So there are two layers at play here. The first is whether R converts strings to UTF-8, which I think is what the documentation is about. The other is whether the PCRE engine is configured to recognize Unicode properties, which, at least in both of our configurations and for this specific case, it appears not to be.
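To make the two layers concrete, here is a rough sketch using the strings and regex from the Stack Overflow example quoted further down. It assumes a UTF-8 session and a PCRE build with Unicode property support; `without_ucp` and `with_ucp` are just names I picked for illustration. I have not checked this across platforms, so treat it as a way to poke at your own configuration rather than a statement of what R does everywhere.

```r
# Strings and regex from the Stack Overflow example quoted in this thread.
strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος", "Jean-François Dupuis")
regex <- "\\B\\w+| +"

# Layer 1: what R knows about the encoding of each element.
print(Encoding(strings))

# Layer 2: whether this PCRE build supports Unicode properties at all.
# This reports build-time support only, not whether UCP is active for a pattern.
print(pcre_config())

# Without (*UCP), PCRE does not consult Unicode properties for \w and \B.
without_ucp <- gsub(regex, "", strings, perl = TRUE)

# Prefixing the pattern with (*UCP) asks PCRE to use Unicode properties.
with_ucp <- gsub(paste0("(*UCP)", regex), "", strings, perl = TRUE)

print(without_ucp)
print(with_ucp)
```

If the two printed vectors differ for the Greek and French names, that difference comes from the second layer (Unicode properties in PCRE), not from the UTF-8 conversion.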

Best,

B.




However, this thread on SO:  https://stackoverflow.com/q/76749529 gives some indication that this is not true for `perl = TRUE`.  Specifically:

> strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος", "Jean-François Dupuis")
 > Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
 > regex <- "\\B\\w+| +"
 > gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

 > gsub(regex, "", strings, perl = TRUE)
[1] "85"                  "JS"                  "ΓιάννηςΠαπαδόπουλος" "J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer when the regex option /u ("match with full Unicode") is specified, but the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this looks like a flag may have been missed in the `perl = TRUE` case.

Duncan Murdoch

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
