Just for amusement: similar mess-ups occur with Danish and its three extra letters:

> Sys.setlocale("LC_ALL", "da_DK")
[1] "da_DK/da_DK/da_DK/C/da_DK/en_US.UTF-8"
> sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Æ" "Ø" "Å"
> grepl("[A-Å]", "Ø")
[1] FALSE
> grepl("[A-Å]", "Æ")
[1] FALSE
> grepl("[A-Æ]", "Å")
[1] TRUE
> grepl("[A-Æ]", "Ø")
[1] FALSE
> grepl("[A-Ø]", "Å")
[1] TRUE
> grepl("[A-Ø]", "Æ")
[1] TRUE

So for character ranges, the order is Å,Æ,Ø (which is how they'd collate in Swedish, except that Swedish uses diacriticals rather than Æ and Ø).

> Sys.setlocale("LC_ALL", "sv_SE")
[1] "sv_SE/sv_SE/sv_SE/C/sv_SE/en_US.UTF-8"
> sort(c(LETTERS,"Æ","Ø","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Æ" "Ø"
> sort(c(LETTERS,"Ä","Ö","Å"))
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
[20] "T" "U" "V" "W" "X" "Y" "Z" "Å" "Ä" "Ö"


> On 30 May 2023, at 17:45 , Ben Bolker <bbol...@gmail.com> wrote:
>
>  Inspired by this old Stack Overflow question
>
> https://stackoverflow.com/questions/19765610/when-does-locale-affect-rs-regular-expressions
>
> I was wondering why this is TRUE:
>
> Sys.setlocale("LC_ALL", "et_EE")
> grepl("[A-Z]", "T")
>
> TRE's documentation at
> <https://laurikari.net/tre/documentation/regex-syntax/> says that a range "is
> shorthand for the full range of characters between those two [endpoints]
> (inclusive) in the collating sequence".
>
> Yet, T is *not* between A and Z in the Estonian collating sequence:
>
> sort(LETTERS)
>  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
> [20] "Z" "T" "U" "V" "W" "X" "Y"
>
>  I realize that this may be a question about TRE rather than about R *per se*
> (FWIW the grepl() result is also TRUE with `perl = TRUE`, so the question
> also applies to PCRE), but I'm wondering if anyone has any insights ... (and
> yes, I know that the correct answer is "use [:alpha:] and don't worry about
> it")
>
>  (In contrast, the ICU engine underlying stringi/stringr says "[t]he
> characters to include are determined by Unicode code point ordering" - see
>
> https://stackoverflow.com/questions/76365426/does-stringrs-regex-engine-translate-a-z-into-abcdefghijklmnopqrstuvwyz/76366163#76366163
>
> for links)
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd....@cbs.dk  Priv: pda...@gmail.com