[Rd] why does [A-Z] include 'T' in an Estonian locale?

Ben Bolker Tue, 30 May 2023 08:45:44 -0700

  Inspired by this old Stack Overflow question

https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions


I was wondering why this is TRUE:

Sys.setlocale("LC_ALL", "et_EE")
grepl("[A-Z]", "T")

TRE's documentation at <https://laurikari.net/tre/documentation/regex-syntax/> says that a range "is shorthand for the full range of characters between those two [endpoints] (inclusive) in the collating sequence".


Yet, T is *not* between A and Z in the Estonian collating sequence:

sort(LETTERS)

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "Z" "T" "U" "V" "W" "X" "Y"
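
The collation claim can also be checked by comparing the strings directly, since R's comparison operators follow LC_COLLATE just as sort() does. The sketch below is not part of the original post: `with_collate` is a hypothetical helper that switches LC_COLLATE temporarily and restores it afterwards, returning NA if the requested locale is not installed. It is exercised only against the "C" locale, because "et_EE" may not be available on a given system.

```r
# Hypothetical helper (an assumption, not from the original post):
# evaluate `expr` under a given LC_COLLATE, then restore the old setting.
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  # Sys.setlocale() returns "" (with a warning) when the locale cannot be set
  ok <- suppressWarnings(Sys.setlocale("LC_COLLATE", locale))
  if (!nzchar(ok)) return(NA)
  # `expr` is a promise, so it is only evaluated here, under the new collation
  force(expr)
}

# Is "T" between "A" and "Z" in the collating sequence?
in_range_C <- with_collate("C", "A" <= "T" && "T" <= "Z")
in_range_C  # TRUE in the "C" locale

# Same question for Estonian; NA if "et_EE" is not installed on this system
in_range_et <- with_collate("et_EE", "A" <= "T" && "T" <= "Z")
```

Given the sort(LETTERS) output above, `in_range_et` would be expected to be FALSE wherever the Estonian collation is in effect, which is what makes the grepl() result surprising.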

I realize that this may be a question about TRE rather than about R *per se* (FWIW the grepl() result is also TRUE with `perl = TRUE`, so the question also applies to PCRE), but I'm wondering if anyone has any insights ... (and yes, I know that the correct answer is "use [:alpha:] and don't worry about it")

(In contrast, the ICU engine underlying stringi/stringr says "[t]he characters to include are determined by Unicode code point ordering" - see


https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

for links)
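
For reference, the "use [:alpha:]" advice can be sketched in base R as follows. This is only an illustration with made-up inputs (`x` and `ascii_upper` are names introduced here, not from the post). It relies on POSIX character classes and on an explicit enumeration of the letters, neither of which involves a range between two endpoints.

```r
x <- c("T", "t", "7")  # illustrative inputs only

# POSIX classes are defined by character properties, not by a range
is_upper <- grepl("[[:upper:]]", x)
is_alpha <- grepl("[[:alpha:]]", x)

# Enumerating the 26 ASCII capitals explicitly also avoids any range
ascii_upper <- paste0("[", paste(LETTERS, collapse = ""), "]")
is_ascii_upper <- grepl(ascii_upper, x)

is_upper        # TRUE FALSE FALSE
is_alpha        # TRUE  TRUE FALSE
is_ascii_upper  # TRUE FALSE FALSE
```

Note that the two approaches are not equivalent in general: outside the "C" locale, [[:upper:]] may also match non-ASCII capitals, whereas the enumerated class matches only the 26 letters listed.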

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
