Regex: A case where the longest match isn't being found

2023-10-26 Thread Dan Bornstein
Configuration Information [Automatically generated, do not change]:
Machine: aarch64
OS: linux-gnu
Compiler: gcc
Compilation CFLAGS: -O2 -ftree-vectorize -flto=auto -ffat-lto-objects -fexcepti\
ons -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY\
_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-\
cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -ma\
rch=armv8.2-a+crypto -mtune=neoverse-n1 -mbranch-protection=standard -fasynchro\
nous-unwind-tables -fstack-clash-protection
uname output: Linux i-062640626b26bd9ed.us-west-2.compute.internal 6.1.25-37.47\
.amzn2023.aarch64 #1 SMP Mon Apr 24 23:19:51 UTC 2023 aarch64 aarch64 aarch64 G\
NU/Linux
Machine Type: aarch64-amazon-linux-gnu

Bash Version: 5.2
Patch Level: 15
Release Status: release

Description:

I found a case where the regex evaluator doesn't seem to be finding the longest 
possible match for a given expression. The expression works as expected on an 
older version of Bash (3.2.57(1)-release (arm64-apple-darwin22)).

Here's the regex: ^(\$\'([^\']|\\\')*\')(.*)$

(FWIW, this is meant to capture a string that looks like an ANSI-C-style 
quoted string, i.e. $'...', plus a "rest" for further processing.)

Repeat-By:

For example, run this:

[[ $'$\'foo\\\' x\' bar' =~ ^(\$\'([^\']|\\\')*\')(.*)$ ]] && echo "${BASH_REMATCH[1]}"

On v5.2, this prints: $'foo\'
On v3.2.57, this prints: $'foo\' x'


Re: Regex: A case where the longest match isn't being found

2023-10-26 Thread Dan Bornstein
Thanks to the folks who replied.

Indeed, I misunderstood the "longest match" rule to apply to captures and not 
just the whole string. (That is, I thought an earlier capture would get "first 
dibs" on any matching text.) And, as was pointed out by Greg W, the exact 
behavior depends more on the regex library that Bash got linked with than 
anything that Bash inherently does.
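
One way to sidestep the ambiguity, sketched here on the assumption that a 
backslash inside the literal always escapes the character right after it: 
leave the backslash out of the bracket expression, so the two alternatives 
can never match at the same position. Then there is only one way to split 
the input, and the result should no longer depend on how the regex library 
assigns captures. The function name split_ansi_literal and the literal/rest 
variables are made up for illustration; they are not from the original 
report.

```shell
#!/usr/bin/env bash

# Hypothetical helper: split a leading $'...' literal off the front of a
# string. On success, sets the globals `literal` and `rest` and returns 0.
split_ansi_literal() {
    # Build the regex from named pieces to keep shell quoting out of the way.
    local q="'" bs='\' dollar='$'

    # Effective regex:  ^(\$'([^\']|\\.)*')(.*)$
    #   [^\']  one character that is neither a backslash nor a single quote
    #          (a backslash is literal inside a POSIX bracket expression)
    #   \\.    a backslash followed by any one character
    # The alternatives are mutually exclusive, so the split is unambiguous.
    local re="^(${bs}${dollar}${q}([^${bs}${q}]|${bs}${bs}.)*${q})(.*)${dollar}"

    [[ $1 =~ $re ]] || return 1
    literal=${BASH_REMATCH[1]}
    rest=${BASH_REMATCH[3]}
}

# Same input as in the report; prints: $'foo\' x'
split_ansi_literal $'$\'foo\\\' x\' bar' && printf '%s\n' "$literal"
```

Keeping the regex in a variable and expanding it unquoted on the right of 
=~ also avoids having the shell reinterpret the backslashes in the pattern.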

Thanks again, and sorry for the noise.

-dan
