Re: [Rd] Bug in perl=TRUE regexp matching?

2023-07-31 Thread Tomas Kalibera



On 7/25/23 03:13, Brodie Gaslam via R-devel wrote:



On 7/24/23 4:10 AM, Duncan Murdoch wrote:

On 23/07/2023 9:01 p.m., Brodie Gaslam wrote:



On 7/23/23 4:29 PM, Duncan Murdoch wrote:

The help page for `?gsub` says (in the context of performance
considerations):


"... just one UTF-8 string will force all the matching to be done in
Unicode"


It's been a little while since I looked at the code but IIRC this just
means that strings are converted to UTF-8 before matching. The problem
here seems to be more about the interpretation of the "\\w+" token by
PCRE.  I think this makes it a little clearer what's going on:

  gsub("\\w", "a", "Γ", perl=TRUE)
  [1] "Γ"

So no match.  The PCRE docs
https://www.pcre.org/original/doc/html/pcrepattern.html (this might be
the old docs, but it works for our purposes here) mention we can turn on
Unicode property matching with the "(*UCP)" token:

   gsub("(*UCP)\\w", "a", "Γ", perl=TRUE)
   [1] "a"

So there are two layers at play here.  The first one is whether R
converts strings to UTF-8, which I think is what the documentation is
about.  The other is whether the PCRE engine is configured to recognize
Unicode properties, which, at least in both of our configurations and
for this specific case, it appears not to be.


From the surrounding context, I think the docs are talking about more
than just conversion to UTF-8.  The full paragraph reads like this:


"If you are working in a single-byte locale (though not common since
R 4.2) and have marked UTF-8 strings that are representable in that
locale, convert them first as just one UTF-8 string will force all
the matching to be done in Unicode, which attracts a penalty of around
3× for the default POSIX 1003.2 mode."

i.e. it says the presence of UTF-8 strings slows things down by a 
factor of 3, so it's faster to convert everything to the local 
encoding.  If it was just conversion, I don't think that would be true.
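
The "convert them first" advice in the quoted paragraph can be sketched as follows. This is only an illustration: the sample string and the choice of "latin1" as the target are assumptions, standing in for whatever encoding the single-byte locale actually uses.

```r
# Assumed sample data: a string marked as UTF-8 via a \u escape.
x <- "Fran\u00e7ois"
Encoding(x)            # "UTF-8"
nchar(x, "bytes")      # 9: the c-cedilla takes two bytes in UTF-8

# Convert to a single-byte encoding that can represent the string.
# "latin1" is an assumed stand-in for the locale's own encoding.
y <- iconv(x, from = "UTF-8", to = "latin1")
Encoding(y)            # "latin1"
nchar(y, "bytes")      # 8: one byte per character
```

Whether the conversion pays off depends on the locale and the matching mode, as discussed in this thread.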


But maybe "for the default POSIX 1003.2 mode" applies to the whole 
paragraph, not just to the penalty, so this is intentional.


Agreed, I don't think this whole issue is just about the conversion. 
What I'm trying to highlight is the distinction between what R does 
(converts input to Unicode - UTF-8 for PCRE[1], wchar_t for 
POSIX/TRE[2]), and what the regular expression engines then do (match 
that Unicode per their own semantics).  This for the case of any UTF-8 
in the input.


PCRE is behaving as documented[3]:

> By default, characters whose code points are greater than 127 never 
match \d, \s, or \w, and always match \D, \S, and \W, although this 
may be different for characters in the range 128-255 when 
locale-specific matching is happening. These escape sequences retain 
their original meanings from before Unicode support was available, 
mainly for efficiency reasons. If the PCRE2_UCP option is set, the 
behaviour is changed so that Unicode properties are used to determine 
character types, as follows...


So this doesn't seem like a bug to me.
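
The documented default can be checked directly from R. The sketch below reuses the single-character example from earlier in the thread and adds an explicit property class, `\p{L}`, as an alternative that does not depend on the UCP setting. The `\u` escape is used only so the string is marked as UTF-8 regardless of how the script is saved.

```r
x <- "\u0393"  # Greek capital gamma, marked as UTF-8

# Default: \w only covers ASCII word characters, so nothing is replaced.
plain <- gsub("\\w", "a", x, perl = TRUE)

# (*UCP) at the start of the pattern switches \w to Unicode properties.
ucp <- gsub("(*UCP)\\w", "a", x, perl = TRUE)

# An explicit Unicode property class matches letters without (*UCP).
prop <- gsub("\\p{L}", "a", x, perl = TRUE)

print(c(plain = plain, ucp = ucp, prop = prop))
```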

Does that mean that the following is incorrect?

> one UTF-8 string will force all the matching to be done in Unicode

It depends on how you want to interpret "done in".  Less ambiguous 
could be:


> one UTF-8 string will force all strings to be converted to Unicode 
prior to matching.


I've added a note to ?regexp about enabling Unicode properties in 
patterns using (*UCP). I understand that it may be surprising to users 
that these are not fully enabled by default (PCRE2_UCP not set), but 
this is the default behavior of PCRE2, most likely chosen for 
performance reasons (see [3]), and ?regexp refers to the PCRE 
documentation.
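
As a sketch of how (*UCP) applies to the pattern from the original report: the strings below are the ones quoted in this thread, written with `\u` escapes so that they are marked as UTF-8 and use precomposed characters. That spelling is an assumption about the original input.

```r
strings <- c(
  "89 562",
  "John Smith",
  "\u0393\u03b9\u03ac\u03bd\u03bd\u03b7\u03c2 \u03a0\u03b1\u03c0\u03b1\u03b4\u03cc\u03c0\u03bf\u03c5\u03bb\u03bf\u03c2",
  "Jean-Fran\u00e7ois Dupuis"
)

# (*UCP) must come first in the pattern; it makes \w and \B follow
# Unicode properties instead of the ASCII-only defaults.
regex_ucp <- "(*UCP)\\B\\w+| +"

initials <- gsub(regex_ucp, "", strings, perl = TRUE)
print(initials)
```

With Unicode properties enabled, the Greek and accented letters count as word characters, so only the leading letter of each word survives.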


Re ?gsub, I think it is ok, the matching is in Unicode/UTF-8. Whether 
the Unicode property support is available or how to fully enable it is 
another matter, not discussed in this part of the documentation.


Best
Tomas



Best,

B

[1]: 
https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1385
[2]: 
https://github.com/r-devel/r-svn/blob/a8a3c4d6902525e4222e0bbf5b512f36e2ceac3d/src/main/grep.c#L1378

[3]: https://pcre.org/current/doc/html/pcre2pattern.html



Duncan Murdoch


Best,

B.





However, this thread on SO: https://stackoverflow.com/q/76749529 gives
some indication that this is not true for `perl = TRUE`. Specifically:

  > strings <- c("89 562", "John Smith", "Γιάννης Παπαδόπουλος",
                 "Jean-François Dupuis")
  > Encoding(strings)
  [1] "unknown" "unknown" "UTF-8"   "UTF-8"
  > regex <- "\\B\\w+| +"
  > gsub(regex, "", strings)
  [1] "85"   "JS"   "ΓΠ"   "J-FD"

  > gsub(regex, "", strings, perl = TRUE)
  [1] "85"  "JS"  "ΓιάννηςΠαπαδόπουλος"  "J-FçoD"

and the website https://regex101.com/r/QDFrOE/1 gives the first answer
when the regex option /u ("match with full Unicode") is specified, but
the second answer when it is not.

Now I'm not at all sure that that website is authoritative, but this
looks like a flag may have been missed in the `per

[Rd] random network disconnects

2023-07-31 Thread Patrick Burns
I'm experiencing a weird issue, and wondering if anyone has seen this, 
and better yet has a solution.


At work we are getting lots of errors such as 'permission denied' or 
'network not found' when reading and writing between our machines and 
a file server.  The failures are random, so simply retrying usually 
succeeds; the following function works around the problem for 'cat' 
commands:


catSafer <-
function (..., ReTries = 20, ThrowError = TRUE)
{
    for (catsi in seq_len(ReTries)) {
        res <- try(cat(...))
        if (!inherits(res, "try-error"))
            break
    }
    if (inherits(res, "try-error")) {
        if (ThrowError) {
            stop("file connection failed")
        }
        else {
            warning("file connection failed")
        }
    }
}
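
The same retry idea can be extended to calls other than 'cat'. The following is a minimal sketch, not part of the original post: the name `retry_io`, its arguments, and the pause between attempts are all assumptions made for illustration.

```r
# Hypothetical helper: call fun(...) until it succeeds or the
# attempts run out, pausing briefly between attempts.
retry_io <- function(fun, ..., retries = 20, wait = 0.5) {
  for (i in seq_len(retries)) {
    res <- try(fun(...), silent = TRUE)
    if (!inherits(res, "try-error"))
      return(invisible(res))
    Sys.sleep(wait)  # assumed back-off; tune or drop as needed
  }
  stop("file connection failed after ", retries, " attempts")
}

# Usage: write to a temporary file, then read it back.
tmp <- tempfile()
retry_io(cat, "hello\n", file = tmp)
readLines(tmp)
```

Because `retries` and `wait` come after `...`, they have to be given by their full names, which keeps them from clashing with arguments meant for the wrapped function.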

People have done network traces and such, but so far nothing has been seen.

Thanks,
Pat

--
Patrick Burns
pbu...@pburns.seanet.com
http://www.portfolioprobe.com/blog
http://www.burns-stat.com
(home of:
 'Impatient R'
 'The R Inferno'
 'Tao Te Programming')

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel