Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

Martin Maechler Thu, 01 Jun 2023 01:11:48 -0700

>>>>> Ben Bolker 
>>>>>     on Tue, 30 May 2023 11:45:20 -0400 writes:


    > Inspired by this old Stack Overflow question

    > https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

    > I was wondering why this is TRUE:

    > Sys.setlocale("LC_ALL", "et_EE")
    > grepl("[A-Z]", "T")

    > TRE's documentation at 
    > <https://laurikari.net/tre/documentation/regex-syntax/> says that a 
    > range "is shorthand for the full range of characters between those two 
    > [endpoints] (inclusive) in the collating sequence".

    > Yet, T is *not* between A and Z in the Estonian collating sequence:

    > sort(LETTERS)
    > [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
    > [20] "Z" "T" "U" "V" "W" "X" "Y"

    > I realize that this may be a question about TRE rather than about R 
    > *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so 
    > the question also applies to PCRE), but I'm wondering if anyone has any 
    > insights ...  (and yes, I know that the correct answer is "use [:alpha:] 
    > and don't worry about it")

    > (In contrast, the ICU engine underlying stringi/stringr says "[t]he 
    > characters to include are determined by Unicode code point ordering" - see

    > https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

    > for links)

Your last (parenthetical) sentence may point to the solution of the riddle:
Nowadays, typically in R

> capabilities()[["ICU"]]
[1] TRUE

but of course one would now have to study whether, and why, ICU seems
to take precedence over the locale's own "sort"ing.
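
As a starting point for that study, here is a minimal sketch (not a
diagnosis) that puts the three views side by side in whatever locale is
currently active: Unicode code point order, the locale's collation
order as seen by sort(), and what the two regex engines report. The
variable names are only illustrative; nothing here changes the locale,
so to reproduce Ben's case one would first call
Sys.setlocale("LC_ALL", "et_EE"), assuming that locale is installed.

```r
# Is ICU available for collation in this build of R?
has_icu <- capabilities("ICU")

# Code point view: is "T" between "A" and "Z" by Unicode code point?
cp <- utf8ToInt("T")
in_codepoint_range <- cp >= utf8ToInt("A") && cp <= utf8ToInt("Z")

# Collation view: is "T" between "A" and "Z" in the current locale's sort order?
sorted <- sort(LETTERS)
pos <- match(c("A", "T", "Z"), sorted)
in_collation_range <- pos[2] >= pos[1] && pos[2] <= pos[3]

# For comparison, method = "radix" sorts character vectors as in the C locale
c_sorted <- sort(LETTERS, method = "radix")

# Regex view: what do TRE (default) and PCRE report for the range?
tre_match  <- grepl("[A-Z]", "T")
pcre_match <- grepl("[A-Z]", "T", perl = TRUE)

# The locale-aware character class, for reference
alpha_match <- grepl("[[:alpha:]]", "T")

print(c(ICU = has_icu,
        codepoint = in_codepoint_range,
        collation = in_collation_range,
        TRE = tre_match,
        PCRE = pcre_match,
        alpha = alpha_match))
```

If in_collation_range is FALSE while tre_match and pcre_match are TRUE,
as Ben's output suggests for et_EE, then the range is evidently not
being expanded via the same collating sequence that sort() uses.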


Best regards,
Martin

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
