Re: [Rd] why does [A-Z] include 'T' in an Estonian locale?

Ben Bolker Fri, 16 Jun 2023 05:57:25 -0700

  Yes.

FWIW I submitted a request for a documentation fix to TRE (to document that it actually uses Unicode order, not collation order, to define ranges, just like most (but not all) other regex engines ...)


https://github.com/laurikari/tre/issues/88

On 2023-06-16 5:16 a.m., peter dalgaard wrote:

Just for amusement: Similar messups occur with Danish and its three extra 
letters:

Sys.setlocale("LC_ALL", "da_DK")

[1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"

sort(c(LETTERS,"Æ","Ø","Å"))

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"

grepl("[A-Å]", "Ø")

[1] FALSE

grepl("[A-Å]", "Æ")

[1] FALSE

grepl("[A-Æ]", "Å")

[1] TRUE

grepl("[A-Æ]", "Ø")

[1] FALSE

grepl("[A-Ø]", "Å")

[1] TRUE

grepl("[A-Ø]", "Æ")

[1] TRUE

So for character ranges, the order is Å,Æ,Ø (which is how they'd collate in 
Swedish, except that Swedish uses diacriticals rather than Æ and Ø).
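
A minimal sketch (not part of the original message) of why the ranges behave this way if the engine compares Unicode code points, as suggested elsewhere in this thread. The helper `in_range()` is hypothetical: it only mimics a code-point comparison and is not how TRE is implemented internally.

```r
# Hypothetical helper (not part of R or TRE): TRUE when the code point of `ch`
# lies between the code points of `lo` and `hi`, inclusive.
in_range <- function(ch, lo, hi) {
  cp <- utf8ToInt(ch)
  utf8ToInt(lo) <= cp && cp <= utf8ToInt(hi)
}

# Unicode escapes keep the script independent of the file encoding.
AA <- "\u00c5"  # Å, U+00C5
AE <- "\u00c6"  # Æ, U+00C6
OE <- "\u00d8"  # Ø, U+00D8

code_points <- c(AA = utf8ToInt(AA), AE = utf8ToInt(AE), OE = utf8ToInt(OE))
print(code_points)  # 197 198 216

# The same six checks as the grepl() calls above, in the same order.
results <- c(
  in_range(OE, "A", AA),  # [A-Å] vs Ø
  in_range(AE, "A", AA),  # [A-Å] vs Æ
  in_range(AA, "A", AE),  # [A-Æ] vs Å
  in_range(OE, "A", AE),  # [A-Æ] vs Ø
  in_range(AA, "A", OE),  # [A-Ø] vs Å
  in_range(AE, "A", OE)   # [A-Ø] vs Æ
)
print(results)  # FALSE FALSE TRUE FALSE TRUE TRUE
```

The six values agree with the grepl() results shown above, which is consistent with the code-point order Å < Æ < Ø rather than the Danish collation order Æ, Ø, Å.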

Sys.setlocale("LC_ALL", "sv_SE")

[1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"

sort(c(LETTERS,"Æ","Ø","Å"))

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"

sort(c(LETTERS,"Ä","Ö","Å"))

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"

On 30 May 2023, at 17:45 , Ben Bolker <[email protected]> wrote:

  Inspired by this old Stack Overflow question

https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions

I was wondering why this is TRUE:

Sys.setlocale("LC_ALL", "et_EE")
grepl("[A-Z]", "T")

TRE's documentation at <https://laurikari.net/tre/documentation/regex-syntax/> says that 
a range "is shorthand for the full range of characters between those two [endpoints] 
(inclusive) in the collating sequence".

Yet, T is *not* between A and Z in the Estonian collating sequence:

sort(LETTERS)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "Z" "T" "U" "V" "W" "X" "Y"
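
For contrast with the locale-aware sort() output above, here is a small sketch of the same letters ordered by code point. It assumes only base R; the variable names are made up for illustration. `method = "radix"` is used because, for character vectors, it sorts as in the C locale whatever the session locale is.

```r
# Code points of A, T and Z.
cp <- utf8ToInt("ATZ")
print(cp)  # 65 84 90

# By code point, T does lie between A and Z.
t_between <- cp[1] <= cp[2] && cp[2] <= cp[3]
print(t_between)  # TRUE

# Radix sorting ignores the session's collation rules for character vectors,
# so the result is plain A..Z order even under et_EE.
c_sorted <- sort(LETTERS, method = "radix")
same <- identical(c_sorted, LETTERS)
print(same)  # TRUE
```

If a regex engine defines ranges this way, `[A-Z]` would include T regardless of where the locale's collation places Z.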

  I realize that this may be a question about TRE rather than about R *per se* (FWIW the 
grepl() result is also TRUE with `perl = TRUE`, so the question also applies to PCRE), 
but I'm wondering if anyone has any insights ...  (and yes, I know that the correct 
answer is "use [:alpha:] and don't worry about it")
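
As a hedged illustration of the "use [:alpha:]" advice, the sketch below compares a POSIX class with an explicitly enumerated set, which sidesteps range semantics entirely. The names `upper_ascii` and `x` are made up for this example; how `[[:alpha:]]` treats non-ASCII letters may depend on locale and platform, so only ASCII inputs are shown.

```r
# Build "[ABC...XYZ]" by listing every letter, so no range is involved.
upper_ascii <- paste0("[", paste(LETTERS, collapse = ""), "]")

x <- c("T", "t", "3")

# Explicit set: matches only the 26 listed upper-case ASCII letters.
explicit <- grepl(upper_ascii, x)
print(explicit)  # TRUE FALSE FALSE

# POSIX class: matches any alphabetic character, upper or lower case.
posix <- grepl("[[:alpha:]]", x)
print(posix)  # TRUE TRUE FALSE
```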

(In contrast, the ICU engine underlying stringi/stringr says "[t]he characters to 
include are determined by Unicode code point ordering" - see

https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163

for links)

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics

> E-mail is sent at my convenience; I don't expect replies outside of working hours.

