Re: built-in regex matches wrong character
On 9/5/18 2:50 PM, mamatb@mamatb-laptop wrote:
> Bash Version: 4.4
> Patch Level: 0
> Release Status: release
>
> Description:
> It seems like bash built-in regex matches some symbols that shouldn't.

There are a couple of things to consider here.

1. Bash doesn't have a "built-in" regexp engine. It uses whatever
   POSIX-compatible regexp API the C library provides.

2. POSIX range expressions are explicitly non-portable and
   locale-dependent. The characters in a range depend on the locale's
   collation sequence.

Look back at this list for discussions of how upper and lower case
letters get into a range like a-z.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
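[Editorial note] Since a range's contents follow the locale's collation sequence, one way to pin a single match to plain ASCII ordering is to run it under the C locale. The following is only a minimal sketch: the helper name `c_locale_match` is hypothetical, and it assumes bash, since `[[ ... =~ ... ]]` is a bash feature.

```bash
#!/usr/bin/env bash
# Hypothetical helper: test string $1 against regex $2 in the C locale.
# The parentheses make the function body a subshell, so the LC_ALL
# assignment never leaks into the calling shell.
c_locale_match() (
    LC_ALL=C
    [[ $1 =~ $2 ]]
)

if c_locale_match B '^[a-d]$'; then
    echo "B matched [a-d]"
else
    echo "B did not match [a-d]"
fi
```

Running the same `[[ B =~ ^[a-d]$ ]]` test without the helper may or may not match, depending on the current locale and the C library's regexp implementation.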
Re: built-in regex matches wrong character
On 9/5/18 4:39 PM, Eric Blake wrote:
> Or, you can use bash's 'shopt -s globasciiranges' which is
> supposed to enable Rational Range Interpretation, where even in non-C
> locales, a character range bounded by two ASCII characters takes on the C
> locale definition of only the ASCII characters in that range, rather than
> the locale's definition of whatever other characters might also be
> equivalent (actually, while I know that shopt affects globbing, I don't
> know if it also affects regex matching - but if it doesn't, that's probably
> a bug that should be fixed).

Since bash uses the C library's regexp engine, and most C libraries don't
implement RRI, much less expose it as a flags option available via
regcomp(), there's no reason to expect that globasciiranges would have
any effect on regular expression matching.

Chet
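[Editorial note] The split described above, where globasciiranges reaches bash's own pattern matcher but not the C library's regexp engine, can be observed by testing one character against the same range both ways. This is a hedged sketch; `compare_range` is a hypothetical helper name, and the regex result for an uppercase letter is deliberately left unstated because it depends on the locale and C library in use.

```bash
#!/usr/bin/env bash
shopt -s globasciiranges   # applies to bash's pattern matching only

# Hypothetical helper: report how one character fares against the same
# range used as a glob pattern and as a regular expression.
compare_range() {
    local ch=$1 glob=no regex=no
    [[ $ch == [a-d] ]] && glob=yes
    [[ $ch =~ ^[a-d]$ ]] && regex=yes
    printf 'glob=%s regex=%s\n' "$glob" "$regex"
}

compare_range b
compare_range B   # the regex half depends on the current locale
```

If the two halves disagree for `B`, that is the behavior reported in this thread rather than a fault in the shopt setting.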
Re: built-in regex matches wrong character
On 09/06/2018 09:17 AM, Chet Ramey wrote:
> On 9/5/18 4:39 PM, Eric Blake wrote:
>> Or, you can use bash's 'shopt -s globasciiranges' which is
>> supposed to enable Rational Range Interpretation, where even in non-C
>> locales, a character range bounded by two ASCII characters takes on the C
>> locale definition of only the ASCII characters in that range, rather than
>> the locale's definition of whatever other characters might also be
>> equivalent (actually, while I know that shopt affects globbing, I don't
>> know if it also affects regex matching - but if it doesn't, that's probably
>> a bug that should be fixed).
>
> Since bash uses the C library's regexp engine, and most C libraries don't
> implement RRI, much less expose it as a flags option available via
> regcomp(), there's no reason to expect that globasciiranges would have
> any effect on regular expression matching.

But bash could be taught to convert any regex that contains a range with
both endpoints ASCII into a different bracket expression before handing
things over to regcomp(). That is, if the user is matching against
[a-d], bash hands [abcd] to regcomp() instead. You don't need a flag in
regcomp() to get RRI, just merely some pre-processing (and often memory
allocation, as the expansion of a range into a non-range tends to
require more characters).

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
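[Editorial note] The pre-processing idea above (turn [a-d] into [abcd] before the pattern reaches regcomp()) can be approximated at the script level today. What follows is a sketch under stated assumptions, not bash's implementation: `expand_ascii_range` is a hypothetical helper, it handles one ascending range with printable ASCII endpoints, and it does not reposition characters such as `]`, `^`, or `-` that need special placement inside a bracket expression.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every character from $1 through $2,
# provided both endpoints are ASCII and the range is ascending.
expand_ascii_range() {
    local lo hi code ch out=""
    # A leading quote makes printf print the character's numeric code.
    lo=$(printf '%d' "'$1")
    hi=$(printf '%d' "'$2")
    (( lo <= 127 && hi <= 127 && lo <= hi )) || return 1
    for (( code = lo; code <= hi; code++ )); do
        # Build an octal escape such as \141 and let printf decode it.
        printf -v ch "\\$(printf '%03o' "$code")"
        out+=$ch
    done
    printf '%s\n' "$out"
}

# Build the regex from the explicit list instead of the range.
re="^[$(expand_ascii_range a d)]\$"
[[ b =~ $re ]] && echo "b matches $re"
```

Because the bracket expression now lists its members explicitly, the match no longer relies on how the locale orders the characters between the endpoints.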
Re: built-in regex matches wrong character
On 9/5/18 6:48 PM, Miguel Amat wrote:
> Thanks for your response Eric, please find my attached screenshot
> testing both solutions. Seems like setting LC_ALL=C in the environment
> works fine while 'shopt -s globasciiranges' does not (also I could be
> testing this the wrong way, first time using shopt).

globasciiranges isn't going to change things here, as explained in my
previous message.

Chet
Re: built-in regex matches wrong character
On 9/6/18 10:23 AM, Eric Blake wrote:
> But bash could be taught to convert any regex that contains a range with
> both endpoints ASCII into a different bracket expression before handing
> things over to regcomp(). That is, if the user is matching against [a-d],
> bash hands [abcd] to regcomp() instead. You don't need a flag in regcomp()
> to get RRI, just merely some pre-processing (and often memory allocation,
> as the expansion of a range into a non-range tends to require more
> characters).

Someone would have to write that code.

--
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: built-in regex matches wrong character
Eric Blake wrote:
>But bash could be taught to convert any regex that contains a range with
>both endpoints ASCII into a different bracket expression before handing
>things over to regcomp(). That is, if the user is matching against
>[a-d], bash hands [abcd] to regcomp() instead. You don't need a flag in
>regcomp() to get RRI, just merely some pre-processing (and often memory
>allocation, as the expansion of a range into a non-range tends to
>require more characters).

This is easy and inexpensive for ASCII only. Full RRI does the same
thing for wide character sets as well, though, and there the possibility
for using very large amounts of memory makes the rewrite-the-range idea
less palatable.

--
Aharon (Arnold) Robbins    arnold AT skeeve DOT com
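[Editorial note] To get a rough feel for the memory concern above, one can count how many bytes an explicit member list would occupy for a given range. This is a back-of-the-envelope sketch only: `utf8_expanded_size` is a hypothetical helper, it assumes the list is written in UTF-8, it takes numeric code points as arguments, it ignores the surrogate gap, and it says nothing about what a particular regcomp() allocates internally.

```bash
#!/usr/bin/env bash
# Hypothetical helper: bytes needed to list every code point from $1
# through $2 explicitly, assuming UTF-8 (1 to 4 bytes per code point).
utf8_expanded_size() {
    local lo=$1 hi=$2 total=0 start=0 len a b
    # Highest code point that fits in 1, 2, 3 and 4 UTF-8 bytes.
    local -a upper=(127 2047 65535 1114111)
    for len in 1 2 3 4; do
        b=${upper[len-1]}
        a=$(( lo > start ? lo : start ))   # clip to this length class
        (( hi < b )) && b=$hi
        (( a <= b )) && (( total += (b - a + 1) * len ))
        start=$(( ${upper[len-1]} + 1 ))
    done
    printf '%d\n' "$total"
}

utf8_expanded_size 97 122      # a-z: 26 one-byte members
utf8_expanded_size 0 1114111   # every code point, for comparison
```

An ASCII-only range can never expand past 128 bytes, which is why the rewrite is cheap there and much less attractive once the endpoints can be arbitrary wide characters.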
Re: built-in regex matches wrong character
On 09/06/2018 12:39 PM, Aharon Robbins wrote:
> Eric Blake wrote:
>> But bash could be taught to convert any regex that contains a range with
>> both endpoints ASCII into a different bracket expression before handing
>> things over to regcomp(). That is, if the user is matching against
>> [a-d], bash hands [abcd] to regcomp() instead. You don't need a flag in
>> regcomp() to get RRI, just merely some pre-processing (and often memory
>> allocation, as the expansion of a range into a non-range tends to
>> require more characters).
>
> This is easy and inexpensive for ASCII only. Full RRI does the same
> thing for wide character sets as well, though, and there the possibility
> for using very large amounts of memory makes the rewrite-the-range idea
> less palatable.

Indeed. But the bash option is named 'globasciiranges', and I find far
more use in having ranges with both endpoints in single-byte ASCII
behaving sanely than I do for ranges with one or more ends resulting in
a multibyte character (by the time my regex involves multibyte
characters, I am already admitting that I am in locale-dependent
territory, and RRI may no longer be the best action anyway).

That is, RRI makes the most sense when dealing with ASCII characters
(< 128) in the first place, and that's a reasonable stopgap for
immediate implementation, even if we don't get full RRI across all of
Unicode (assuming that such might later become available via a new
regcomp() flag).

--
Eric Blake, Principal Software Engineer