2022年11月1日(火) 0:59 Chet Ramey <chet.ra...@case.edu>: > If someone wanted to do this, I would take a look at incorporating the > results, as long as it didn't add dependencies on, say, pcre or gnulib > regex.
Instead of translating the pattern to a regular expression and compiling it by regcomp (<regex.h>), I have experimentally implemented a new extglob engine based on a naive DFA and was comparing the behavior and the performance with the current implementations of devel. [Note: In this report hereafter, ``the current implementation/engine'' always means the implementation of strmatch (lib/glob/{strmatch,smatch,sm_loop.c}) in the current devel 31f4d468.] However, I noticed two strange behaviors of the current engine. Before adjusting the behavior of the new engine and submitting it for review, I would like to confirm the (expected) behavior of the current engine in the current devel. These two behaviors finally turned out to be both related to the matching of bracket expression by the function `BRACKMATCH' (lib/glob/sm_loop.c). ---------------------------------------------------------------------- 1. pattern [[=B=]][c] matches with c $ bash-devel --norc $ [[ Bc == [[=B=]][c] ]]; echo $? 0 <-- OK. This is expected. $ [[ c == [[=B=]][c] ]]; echo $? 0 <-- This is unexpected. See the attached [r0037.brackmatch1.equivalence-class.patch] for the fix. The problem is caused because [[=B=]][c] is treated as a single bracket expression [ « [=B=]][c » ] when the equivalence class [=B=] does not match. This is because `]' after `[[=B=]' is treated as if it is the first `]' in the bracket expression (such as `]' after `[' in the pattern « []aaa] »). In the patch, when the next character after a failing equivalence class [=B=] is `]', the bracket expression has been changed to fail just the same as the case for a failing character class [:alpha:] (lib/glob/sm_loop.c:530; line number is that in the current devel 31f4d468). ---------------------------------------------------------------------- 2. bracket expression sometimes match or unmatch the slash with FNM_PATHNAME. FNM_PATHNAME is only used in two places in the Bash codebase. 1) One is for the glob matching for the filenames in the directory (lib/glob/glob.c). However, this actually does not seem to have an actual effect because FNM_PATHNAME only causes differences in the matching when the target string contains a slash but the filenames do not contain any slashes. 2) The other is the filtering of the pathname-expansion results with GLOBIGNORE (pathexp.c). So the only way to test the behavior of Bash's strmatch with FNM_PATHNAME (without writing a C program to directly use the function `strmatch') is to use GLOBIGNORE. To demonstrate the behavior of the current implementation, I attach a script [fnmpath.sh], which utilizes GLOBIGNORE for the Bash implementation. It compares the result with fnmatch(3). The result in my environment (x86_64-redhat-linux-gnu [Fedora Linux 36 (Server Edition)]) is this: $ bash-devel fnmpath.sh #1: pat=ab/cd/efg yes/yes #2: pat=ab[/]cd/efg yes/no #3: pat=ab[/a]cd/efg yes/no #4: pat=ab[a/]cd/efg no/no #5: pat=ab[!a]cd/efg yes/no #6: pat=ab[.-0]cd/efg yes/no #7: pat=*/*/efg yes/yes #8: pat=*[/]*/efg no/no #9: pat=*[/a]*/efg no/no #10: pat=*[a/]*/efg no/no #11: pat=*[!a]*/efg no/no #12: pat=*[.-0]*/efg no/no #13: pat=ab@(/)cd/efg yes/yes #14: pat=*@(/)cd/efg no/no #15: pat=*/cd/efg yes/yes This tests whether each pattern matches the string "ab/cd/efg". Two results by Bash's strmatch and fnmatch(3) are connected with `/'. "yes" means the pattern matches the string "ab/cd/efg" and "no" means it does not match. Some observations are * In fnmatch(3), a bracket expression never matches a / with FNM_PATHNAME. * In Bash's strmatch, a bracket expression sometimes matches `/' and sometimes does not. In the codebase, `/' is excluded only when it explicitly appears after another character in the bracket expression (lib/glob/sm_loop.c:574) even though the comment mentions [/]. This is the reason that only [a/] fails with Bash's implementation in cases #2..#6 in the above result. * What is happening with Bash's implementation in cases #7..#12 is related the assumption of the backtracking trick for `*': The trick for `*' backtracking explained in the code comment lib/glob/sm_loop.c:320---"This is a clever idea from glibc"---relies on the behavior that the bracket expression never matches a slash that `*' cannot match. [Note: The exact requirements for this trick is actually slightly weaker: each bracket expression needs to match a fixed number of `/', 0 or 1, when FNM_PATHNAME is specified; it should never match a slash if it can match other characters, and it should never match other characters if it can match a slash.] Otherwise, backtracking for a different number of slashes would unexpectedly fail. It is hard to modify the current implementation so that it does not use the "clever idea (lib/glob/sm_loop.c:320)" which requires the assumption on the bracket expressions; it would be another re-implementation of the engine. In addition, the time complexity of the current implementation is linear with respect to the string length O(len) as far as extglob is unused, but, if we allow bracket expressions to consistently match `/', the time complexity would become O(len^n) at the worst where n is the number of `*' as far as the backtracking algorithm is used. For this practical reason in addition to the compatibility with fnmatch(3), I think we should just follow fnmatch(3) to reject `/' for any bracket expressions with FNM_PATHNAME. * There is a similar inconsistency caused by the same trick with the extglob as observed in cases #13..#15. For these cases, even fnmatch(3) behaves in a somewhat unpredictable way, so I would not try to fix this behavior in this report. The attached patch [r0037.brackmatch2.slash.patch] fixes this. I move the check for the slash outside the loop of the bracket expression. In particular, I moved the check outside the function BRACKMATCH because it is more consistent with the other similar checks for `?' (lib/glob/sm_loop.c:108) and `*' (lib/glob/sm_loop.c:179). ---------------------------------------------------------------------- By the way, the third patch [r0037.brackmatch3.unused-assign.patch] is just a cosmetic fix that removes the assignments of unused values to the variable `cend'. The values are unused because they will be overwritten by a later line (lib/glob/sm_loop.557) without being referenced. If you would like to keep the current assignments because it is harmless, it also works for me, but in that case I think we should also assign the value to `cend' in lib/glob/sm_loop.c:546 as `cstart = cend = ...' for consistency with the other lines lib/glob/sm_loop.c:442 and lib/glob/sm_loop.c:554. -- Koichi
From d9ea2c38776f1624406ec3f707b886771b431543 Mon Sep 17 00:00:00 2001 From: Koichi Murase <myoga.mur...@gmail.com> Date: Wed, 9 Nov 2022 18:41:09 +0900 Subject: [PATCH 1/4] fix(glob/sm_loop): fix [=c=] in BRACKMATCH --- lib/glob/sm_loop.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c index 592a78db..feb06497 100644 --- a/lib/glob/sm_loop.c +++ b/lib/glob/sm_loop.c @@ -463,6 +463,8 @@ BRACKMATCH (p, test, flags) c = *p++; if (c == L('\0')) return ((test == L('[')) ? savep : (CHAR *)0); /*]*/ + else if (c == L(']')) + break; c = FOLD (c); continue; } -- 2.37.2
#!/usr/bin/env bash shopt -s extglob LC_COLLATE=C gcc -O2 -xc -o ./fnmatch - <<EOF #include <fnmatch.h> #include <stdlib.h> #include <stdio.h> int main(int argc, char **argv) { if (2 >= argc) { fprintf(stderr, "usage: fnmatch string pattern\n"); exit(2); } int flags = FNM_PATHNAME | FNM_PERIOD | FNM_EXTMATCH; if (fnmatch(argv[2], argv[1], flags) == 0) return 0; return 1; } EOF mkdir -p ab/cd/efg check_count=1 function check { local GLOBIGNORE=$1 # bash impl local -a f=(*/*/efg*) if [[ $f == '*/*/efg*' ]]; then local strmatch=$'\e[32myes\e[m' else local strmatch=$'\e[31mno\e[m' fi # Linux fnmatch if ./fnmatch ab/cd/efg "$1"; then local fnmatch=$'\e[32myes\e[m' else local fnmatch=$'\e[31mno\e[m' fi printf '#%d: pat=%-16s %s/%s\n' "$((check_count++))" "$1" "$strmatch" "$fnmatch" } check 'ab/cd/efg' check 'ab[/]cd/efg' check 'ab[/a]cd/efg' check 'ab[a/]cd/efg' check 'ab[!a]cd/efg' check 'ab[.-0]cd/efg' check '*/*/efg' check '*[/]*/efg' check '*[/a]*/efg' check '*[a/]*/efg' check '*[!a]*/efg' check '*[.-0]*/efg' # check '*/*/efg' # check '*[b]/*/efg' # check '*[ab]/*/efg' # check '*[ba]/*/efg' # check '*[!a]/*/efg' # check '*[a-c]/*/efg' check 'ab@(/)cd/efg' check '*@(/)cd/efg' check '*/cd/efg'
From a40957b761c74462f078f970a064d116992516d1 Mon Sep 17 00:00:00 2001 From: Koichi Murase <myoga.mur...@gmail.com> Date: Mon, 14 Nov 2022 16:55:04 +0900 Subject: [PATCH 2/4] fix(glob/sm_loop): make bracket expression unmatching "/" under FNM_PATHNAME --- lib/glob/sm_loop.c | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c index feb06497..ddce80f5 100644 --- a/lib/glob/sm_loop.c +++ b/lib/glob/sm_loop.c @@ -336,6 +336,12 @@ fprintf(stderr, "gmatch: pattern = %s; pe = %s\n", pattern, pe); if (sc == L('\0') || n == se) return FNM_NOMATCH; + /* A bracket expressions can never match `/' under FNM_PATHNAME. + This also applies to the case when `/' is explicitly specified + as [/], etc. */ + if ((flags & FNM_PATHNAME) && sc == L('/')) + return FNM_NOMATCH; + /* A character class cannot match a `.' if it is the first character of the string or if it is the first character following a slash and we are matching a pathname. */ @@ -573,10 +579,6 @@ BRACKMATCH (p, test, flags) if (c == L('\0')) return ((test == L('[')) ? savep : (CHAR *)0); - if ((flags & FNM_PATHNAME) && c == L('/')) - /* [/] can never match when matching a pathname. */ - return (CHAR *)0; - /* This introduces a range, unless the `-' is the last character of the class. Find the end of the range and move past it. */ -- 2.37.2
From 263d4bd71787ee237c14230c9d019e9bab749935 Mon Sep 17 00:00:00 2001 From: Koichi Murase <myoga.mur...@gmail.com> Date: Mon, 14 Nov 2022 17:01:25 +0900 Subject: [PATCH 3/4] fix(glob/sm_loop): remove unused assignments --- lib/glob/sm_loop.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c index ddce80f5..970cb559 100644 --- a/lib/glob/sm_loop.c +++ b/lib/glob/sm_loop.c @@ -443,9 +443,7 @@ BRACKMATCH (p, test, flags) c = *p++; for (;;) { - /* Initialize cstart and cend in case `-' is the last - character of the pattern. */ - cstart = cend = c; + cstart = c; forcecoll = 0; /* POSIX.2 equivalence class: [=c=]. See POSIX.2 2.8.3.2. Find @@ -559,9 +557,11 @@ BRACKMATCH (p, test, flags) { if (*p == '\0') return (CHAR *)0; - cstart = cend = *p++; + cstart = *p++; } + /* Initialize cstart and cend in case `-' is the last + character of the pattern. */ cstart = cend = FOLD (cstart); isrange = 0; -- 2.37.2