2022年11月1日(火) 0:59 Chet Ramey <chet.ra...@case.edu>:
> If someone wanted to do this, I would take a look at incorporating the
> results, as long as it didn't add dependencies on, say, pcre or gnulib
> regex.

Instead of translating the pattern to a regular expression and
compiling it by regcomp (<regex.h>), I have experimentally implemented
a new extglob engine based on a naive DFA and was comparing the
behavior and the performance with the current implementations of
devel.  [Note: In this report hereafter, ``the current
implementation/engine'' always means the implementation of strmatch
(lib/glob/{strmatch,smatch,sm_loop.c}) in the current devel 31f4d468.]

However, I noticed two strange behaviors of the current engine.
Before adjusting the behavior of the new engine and submitting it for
review, I would like to confirm the (expected) behavior of the current
engine in the current devel.

These two behaviors finally turned out to be both related to the
matching of bracket expression by the function `BRACKMATCH'
(lib/glob/sm_loop.c).

----------------------------------------------------------------------
1. pattern [[=B=]][c] matches with c

  $ bash-devel --norc
  $ [[ Bc == [[=B=]][c] ]]; echo $?
  0      <-- OK. This is expected.
  $ [[ c == [[=B=]][c] ]]; echo $?
  0      <-- This is unexpected.

  See the attached [r0037.brackmatch1.equivalence-class.patch] for the
  fix.  The problem is caused because [[=B=]][c] is treated as a
  single bracket expression [ « [=B=]][c » ] when the equivalence
  class [=B=] does not match.  This is because `]' after `[[=B=]' is
  treated as if it is the first `]' in the bracket expression (such as
  `]' after `[' in the pattern « []aaa] »).  In the patch, when the
  next character after a failing equivalence class [=B=] is `]', the
  bracket expression has been changed to fail just the same as the
  case for a failing character class [:alpha:]
  (lib/glob/sm_loop.c:530; line number is that in the current devel
  31f4d468).

----------------------------------------------------------------------
2. bracket expression sometimes match or unmatch the slash with
  FNM_PATHNAME.

  FNM_PATHNAME is only used in two places in the Bash codebase.  1)
  One is for the glob matching for the filenames in the directory
  (lib/glob/glob.c).  However, this actually does not seem to have an
  actual effect because FNM_PATHNAME only causes differences in the
  matching when the target string contains a slash but the filenames
  do not contain any slashes.  2) The other is the filtering of the
  pathname-expansion results with GLOBIGNORE (pathexp.c).  So the only
  way to test the behavior of Bash's strmatch with FNM_PATHNAME
  (without writing a C program to directly use the function
  `strmatch') is to use GLOBIGNORE.

  To demonstrate the behavior of the current implementation, I attach
  a script [fnmpath.sh], which utilizes GLOBIGNORE for the Bash
  implementation.  It compares the result with fnmatch(3).  The result
  in my environment (x86_64-redhat-linux-gnu [Fedora Linux 36 (Server
  Edition)]) is this:

  $ bash-devel fnmpath.sh
  #1: pat=ab/cd/efg        yes/yes
  #2: pat=ab[/]cd/efg      yes/no
  #3: pat=ab[/a]cd/efg     yes/no
  #4: pat=ab[a/]cd/efg     no/no
  #5: pat=ab[!a]cd/efg     yes/no
  #6: pat=ab[.-0]cd/efg    yes/no
  #7: pat=*/*/efg          yes/yes
  #8: pat=*[/]*/efg        no/no
  #9: pat=*[/a]*/efg       no/no
  #10: pat=*[a/]*/efg       no/no
  #11: pat=*[!a]*/efg       no/no
  #12: pat=*[.-0]*/efg      no/no
  #13: pat=ab@(/)cd/efg     yes/yes
  #14: pat=*@(/)cd/efg      no/no
  #15: pat=*/cd/efg         yes/yes

  This tests whether each pattern matches the string "ab/cd/efg".  Two
  results by Bash's strmatch and fnmatch(3) are connected with
  `/'. "yes" means the pattern matches the string "ab/cd/efg" and "no"
  means it does not match.  Some observations are

  * In fnmatch(3), a bracket expression never matches a / with
    FNM_PATHNAME.

  * In Bash's strmatch, a bracket expression sometimes matches `/' and
    sometimes does not.  In the codebase, `/' is excluded only when it
    explicitly appears after another character in the bracket
    expression (lib/glob/sm_loop.c:574) even though the comment
    mentions [/].  This is the reason that only [a/] fails with Bash's
    implementation in cases #2..#6 in the above result.

  * What is happening with Bash's implementation in cases #7..#12 is
    related the assumption of the backtracking trick for `*': The
    trick for `*' backtracking explained in the code comment
    lib/glob/sm_loop.c:320---"This is a clever idea from
    glibc"---relies on the behavior that the bracket expression never
    matches a slash that `*' cannot match.  [Note: The exact
    requirements for this trick is actually slightly weaker: each
    bracket expression needs to match a fixed number of `/', 0 or 1,
    when FNM_PATHNAME is specified; it should never match a slash if
    it can match other characters, and it should never match other
    characters if it can match a slash.] Otherwise, backtracking for a
    different number of slashes would unexpectedly fail.

    It is hard to modify the current implementation so that it does
    not use the "clever idea (lib/glob/sm_loop.c:320)" which requires
    the assumption on the bracket expressions; it would be another
    re-implementation of the engine.  In addition, the time complexity
    of the current implementation is linear with respect to the string
    length O(len) as far as extglob is unused, but, if we allow
    bracket expressions to consistently match `/', the time complexity
    would become O(len^n) at the worst where n is the number of `*' as
    far as the backtracking algorithm is used.

    For this practical reason in addition to the compatibility with
    fnmatch(3), I think we should just follow fnmatch(3) to reject `/'
    for any bracket expressions with FNM_PATHNAME.

  * There is a similar inconsistency caused by the same trick with the
    extglob as observed in cases #13..#15.  For these cases, even
    fnmatch(3) behaves in a somewhat unpredictable way, so I would not
    try to fix this behavior in this report.

  The attached patch [r0037.brackmatch2.slash.patch] fixes this.  I
  move the check for the slash outside the loop of the bracket
  expression.  In particular, I moved the check outside the function
  BRACKMATCH because it is more consistent with the other similar
  checks for `?' (lib/glob/sm_loop.c:108) and `*'
  (lib/glob/sm_loop.c:179).

----------------------------------------------------------------------

By the way, the third patch [r0037.brackmatch3.unused-assign.patch] is
just a cosmetic fix that removes the assignments of unused values to
the variable `cend'.  The values are unused because they will be
overwritten by a later line (lib/glob/sm_loop.557) without being
referenced.

If you would like to keep the current assignments because it is
harmless, it also works for me, but in that case I think we should
also assign the value to `cend' in lib/glob/sm_loop.c:546 as `cstart =
cend = ...' for consistency with the other lines
lib/glob/sm_loop.c:442 and lib/glob/sm_loop.c:554.

--
Koichi
From d9ea2c38776f1624406ec3f707b886771b431543 Mon Sep 17 00:00:00 2001
From: Koichi Murase <myoga.mur...@gmail.com>
Date: Wed, 9 Nov 2022 18:41:09 +0900
Subject: [PATCH 1/4] fix(glob/sm_loop): fix [=c=] in BRACKMATCH

---
 lib/glob/sm_loop.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c
index 592a78db..feb06497 100644
--- a/lib/glob/sm_loop.c
+++ b/lib/glob/sm_loop.c
@@ -463,6 +463,8 @@ BRACKMATCH (p, test, flags)
              c = *p++;
              if (c == L('\0'))
                return ((test == L('[')) ? savep : (CHAR *)0); /*]*/
+             else if (c == L(']'))
+               break;
              c = FOLD (c);
              continue;
            }
-- 
2.37.2

#!/usr/bin/env bash

shopt -s extglob
LC_COLLATE=C

gcc -O2 -xc -o ./fnmatch - <<EOF
#include <fnmatch.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char **argv) {
  if (2 >= argc) {
    fprintf(stderr, "usage: fnmatch string pattern\n");
    exit(2);
  }

  int flags = FNM_PATHNAME | FNM_PERIOD | FNM_EXTMATCH;
  if (fnmatch(argv[2], argv[1], flags) == 0)
    return 0;
  return 1;
}
EOF

mkdir -p ab/cd/efg
check_count=1
function check {
  local GLOBIGNORE=$1

  # bash impl
  local -a f=(*/*/efg*)
  if [[ $f == '*/*/efg*' ]]; then
    local strmatch=$'\e[32myes\e[m'
  else
    local strmatch=$'\e[31mno\e[m'
  fi

  # Linux fnmatch
  if ./fnmatch ab/cd/efg "$1"; then
    local fnmatch=$'\e[32myes\e[m'
  else
    local fnmatch=$'\e[31mno\e[m'
  fi

  printf '#%d: pat=%-16s %s/%s\n' "$((check_count++))" "$1" "$strmatch" "$fnmatch"
}

check 'ab/cd/efg'
check 'ab[/]cd/efg'
check 'ab[/a]cd/efg'
check 'ab[a/]cd/efg'
check 'ab[!a]cd/efg'
check 'ab[.-0]cd/efg'
check '*/*/efg'
check '*[/]*/efg'
check '*[/a]*/efg'
check '*[a/]*/efg'
check '*[!a]*/efg'
check '*[.-0]*/efg'

# check '*/*/efg'
# check '*[b]/*/efg'
# check '*[ab]/*/efg'
# check '*[ba]/*/efg'
# check '*[!a]/*/efg'
# check '*[a-c]/*/efg'

check 'ab@(/)cd/efg'
check '*@(/)cd/efg'
check '*/cd/efg'
From a40957b761c74462f078f970a064d116992516d1 Mon Sep 17 00:00:00 2001
From: Koichi Murase <myoga.mur...@gmail.com>
Date: Mon, 14 Nov 2022 16:55:04 +0900
Subject: [PATCH 2/4] fix(glob/sm_loop): make bracket expression unmatching "/"
 under FNM_PATHNAME

---
 lib/glob/sm_loop.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c
index feb06497..ddce80f5 100644
--- a/lib/glob/sm_loop.c
+++ b/lib/glob/sm_loop.c
@@ -336,6 +336,12 @@ fprintf(stderr, "gmatch: pattern = %s; pe = %s\n", 
pattern, pe);
            if (sc == L('\0') || n == se)
              return FNM_NOMATCH;
 
+           /* A bracket expressions can never match `/' under FNM_PATHNAME.
+              This also applies to the case when `/' is explicitly specified
+              as [/], etc.  */
+           if ((flags & FNM_PATHNAME) && sc == L('/'))
+             return FNM_NOMATCH;
+
            /* A character class cannot match a `.' if it is the first
               character of the string or if it is the first character
               following a slash and we are matching a pathname. */
@@ -573,10 +579,6 @@ BRACKMATCH (p, test, flags)
       if (c == L('\0'))
        return ((test == L('[')) ? savep : (CHAR *)0);
 
-      if ((flags & FNM_PATHNAME) && c == L('/'))
-       /* [/] can never match when matching a pathname.  */
-       return (CHAR *)0;
-
       /* This introduces a range, unless the `-' is the last
         character of the class.  Find the end of the range
         and move past it. */
-- 
2.37.2

From 263d4bd71787ee237c14230c9d019e9bab749935 Mon Sep 17 00:00:00 2001
From: Koichi Murase <myoga.mur...@gmail.com>
Date: Mon, 14 Nov 2022 17:01:25 +0900
Subject: [PATCH 3/4] fix(glob/sm_loop): remove unused assignments

---
 lib/glob/sm_loop.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/lib/glob/sm_loop.c b/lib/glob/sm_loop.c
index ddce80f5..970cb559 100644
--- a/lib/glob/sm_loop.c
+++ b/lib/glob/sm_loop.c
@@ -443,9 +443,7 @@ BRACKMATCH (p, test, flags)
   c = *p++;
   for (;;)
     {
-      /* Initialize cstart and cend in case `-' is the last
-        character of the pattern. */
-      cstart = cend = c;
+      cstart = c;
       forcecoll = 0;
 
       /* POSIX.2 equivalence class:  [=c=].  See POSIX.2 2.8.3.2.  Find
@@ -559,9 +557,11 @@ BRACKMATCH (p, test, flags)
        {
          if (*p == '\0')
            return (CHAR *)0;
-         cstart = cend = *p++;
+         cstart = *p++;
        }
 
+      /* Initialize cstart and cend in case `-' is the last
+        character of the pattern. */
       cstart = cend = FOLD (cstart);
       isrange = 0;
 
-- 
2.37.2

Reply via email to