Re: [Rd] Bug in perl=TRUE regexp matching?

Duncan Murdoch Mon, 24 Jul 2023 01:10:58 -0700

On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:



On 7/23/23 4:29 PM, Duncan Murdoch wrote:

The help page for `?gsub` says (in the context of performance
considerations):


"... just one UTF-8 string will force all the matching to be done in
Unicode"


It's been a little while since I looked at the code but IIRC this just
means that strings are converted to UTF-8 before matching.  The problem
here seems to be more about the interpretation of the "\\w+" token by
PCRE.  I think this makes it a little clearer what's going on:

      gsub("\\w", "a", "Γ", perl=TRUE)
      [1] "Γ"

So no match.  The PCRE docs
https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
the old docs, but it works for our purposes here) mention we can turn on
unicode property matching with the "(*UCP)" token:

       gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
       [1] "a"

So there are two layers at play here.  The first one is whether R
converts strings to UTF-8, which I think is what the documentation is
about.  The other is whether the PCRE engine is configured to recognize
Unicode properties, which, at least in both of our configurations, it
appears not to be in this specific case.
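
A minimal sketch of the two layers described above, offered as an illustration rather than as a statement of how R's internals work. It assumes a PCRE build with Unicode property support; the variable names (x, plain, ucp) are made up for this example. The "\u0393" escape is used so the string is marked UTF-8 whatever the session locale is.

```r
# Layer 1: encoding marks. "\u0393" is GREEK CAPITAL LETTER GAMMA; the \u
# escape makes R mark the string as UTF-8 regardless of the locale.
x <- "\u0393"
Encoding(x)

# Layer 2: how PCRE classifies characters. Without (*UCP), \w is reported
# in this thread not to match the non-ASCII letter; with (*UCP) it does.
plain <- gsub("\\w", "a", x, perl = TRUE)
ucp   <- gsub("(*UCP)\\w", "a", x, perl = TRUE)

identical(plain, x)   # TRUE means \w did not match without (*UCP)
identical(ucp, "a")   # TRUE means \w matched once (*UCP) was given
```

If both comparisons are TRUE, the string was already UTF-8 in both calls, so the difference comes from the character-class setting and not from the encoding conversion.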

From the surrounding context, I think the docs are talking about more
than just conversion to UTF-8.  The full paragraph reads like this:

"If you are working in a single-byte locale (though not common since R
4.2) and have marked UTF-8 strings that are representable in that
locale, convert them first as just one UTF-8 string will force all the
matching to be done in Unicode, which attracts a penalty of around 3×
for the default POSIX 1003.2 mode."

i.e. it says the presence of UTF-8 strings slows things down by a factor
of 3, so it's faster to convert everything to the local encoding.  If it
was just conversion, I don't think that would be true.

But maybe "for the default POSIX 1003.2 mode" applies to the whole
paragraph, not just to the penalty, so this is intentional.


Duncan Murdoch


Best,

B.



However, this thread on SO:  https://stackoverflow.com/q/76749529 gives
some indication that this is not true for `perl = TRUE`.  Specifically:

  > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
"Jean-François Dupuis")
  > Encoding(strings)
[1] "unknown" "unknown" "UTF-8"   "UTF-8"
  > regex <- "\\B\\w+| +"
  > gsub(regex, "", strings)
[1] "85"   "JS"   "ΓΠ"   "J-FD"

  > gsub(regex, "", strings, perl = TRUE)
[1] "85"                  "JS"                  "ΓιάννηςΠαπαδόπουλος"
"J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer
when the regex option /u ("match with full Unicode") is specified, but
the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this
looks like a flag may have been missed in the `perl = TRUE` case.
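
One way to probe the "missed flag" idea is to request Unicode character classes explicitly by prefixing the pattern with "(*UCP)". The following is only a sketch: it assumes a UTF-8 session and a PCRE build with Unicode property support, and the names regex_ucp, res_tre and res_ucp are invented for the illustration.

```r
strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
             "Jean-François Dupuis")

# Same pattern as above, with (*UCP) prepended so that PCRE treats \w and
# \B as Unicode-aware.  (*UCP) must come at the very start of the pattern.
regex     <- "\\B\\w+| +"
regex_ucp <- paste0("(*UCP)", regex)

res_tre <- gsub(regex, "", strings)                   # default engine
res_ucp <- gsub(regex_ucp, "", strings, perl = TRUE)  # PCRE with (*UCP)

# Side-by-side view of the two results for each input string.
print(data.frame(input = strings, default = res_tre, pcre_ucp = res_ucp))
```

If the two result columns agree, that supports the reading that the difference seen with plain `perl = TRUE` comes from the character-class setting.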

Duncan Murdoch

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

