Anthony Ramine <[email protected]> writes:
> ase folding is not done correctly when matching against the [:upper:]
> character class and uppercased character ranges (e.g. A-Z).
> Specifically, an uppercase letter fails to match against any of them
> when case folding is requested because plain characters in the pattern
> and the whole string are preemptively lowercased to handle the base case
> fast.
>
> That optimization is kept and ISLOWER() is used in the [:upper:] case
> when case folding is requested, while matching against a character range
> is retried with toupper() if the character was lowercase, as the bounds
> of the range itself cannot be modified (in a case-insensitive context,
> [A-_] is not equivalent to [a-_]).
>
> Signed-off-by: Anthony Ramine <[email protected]>
> Reviewed-by: Duy Nguyen <[email protected]>
> ---
Thanks.
> t/t3070-wildmatch.sh | 55
> ++++++++++++++++++++++++++++++++++++++++++++++------
> wildmatch.c | 7 +++++++
> 2 files changed, 56 insertions(+), 6 deletions(-)
>
> I added Duy as reviewer and fixed a typo in the commit message reported by
> Eric Sunshine.
>
> diff --git a/t/t3070-wildmatch.sh b/t/t3070-wildmatch.sh
> index 4c37057..38446a0 100755
> --- a/t/t3070-wildmatch.sh
> +++ b/t/t3070-wildmatch.sh
> @@ -6,20 +6,20 @@ test_description='wildmatch tests'
>
> match() {
> if [ $1 = 1 ]; then
> - test_expect_success "wildmatch: match '$3' '$4'" "
> + test_expect_success "wildmatch: match '$3' '$4'" "
> test-wildmatch wildmatch '$3' '$4'
> "
> else
> - test_expect_success "wildmatch: no match '$3' '$4'" "
> + test_expect_success "wildmatch: no match '$3' '$4'" "
> ! test-wildmatch wildmatch '$3' '$4'
> "
> fi
> if [ $2 = 1 ]; then
> - test_expect_success "fnmatch: match '$3' '$4'" "
> + test_expect_success "fnmatch: match '$3' '$4'" "
> test-wildmatch fnmatch '$3' '$4'
> "
> elif [ $2 = 0 ]; then
> - test_expect_success "fnmatch: no match '$3' '$4'" "
> + test_expect_success "fnmatch: no match '$3' '$4'" "
> ! test-wildmatch fnmatch '$3' '$4'
> "
> # else
Is the above about aligning $3/$4 across checks of different types
(i.e. purely cosmetic)? I am not complaining; just making sure if
there is nothing deeper going on.
It is outside the scope of this change, but the shell style of this
script (most notably use of [] instead of test) needs to be fixed
someday, preferrably soon, including the commented out else clause
at the end of the hunk.
> @@ -235,4 +247,35 @@ pathmatch 1 abcXdefXghi '*X*i'
> pathmatch 1 ab/cXd/efXg/hi '*/*X*/*/*i'
> pathmatch 1 ab/cXd/efXg/hi '*Xg*i'
>
> +# Case-sensitivy features
> +match 0 x 'a' '[A-Z]'
> +match 1 x 'A' '[A-Z]'
> +match 0 x 'A' '[a-z]'
> +match 1 x 'a' '[a-z]'
> +match 0 x 'a' '[[:upper:]]'
> +match 1 x 'A' '[[:upper:]]'
> +match 0 x 'A' '[[:lower:]]'
> +match 1 x 'a' '[[:lower:]]'
> +match 0 x 'A' '[B-Za]'
> +match 1 x 'a' '[B-Za]'
> +match 0 x 'A' '[B-a]'
> +match 1 x 'a' '[B-a]'
> +match 0 x 'z' '[Z-y]'
> +match 1 x 'Z' '[Z-y]'
> +
> +imatch 1 'a' '[A-Z]'
Do we want "# Case-insensitivity features" commment here as well?
> +imatch 1 'A' '[A-Z]'
> +imatch 1 'A' '[a-z]'
> +imatch 1 'a' '[a-z]'
> +imatch 1 'a' '[[:upper:]]'
> +imatch 1 'A' '[[:upper:]]'
> +imatch 1 'A' '[[:lower:]]'
> +imatch 1 'a' '[[:lower:]]'
> +imatch 1 'A' '[B-Za]'
> +imatch 1 'a' '[B-Za]'
> +imatch 1 'A' '[B-a]'
> +imatch 1 'a' '[B-a]'
> +imatch 1 'z' '[Z-y]'
> +imatch 1 'Z' '[Z-y]'
> +
> test_done
> diff --git a/wildmatch.c b/wildmatch.c
> index 7192bdc..f91ba99 100644
> --- a/wildmatch.c
> +++ b/wildmatch.c
> @@ -196,6 +196,11 @@ static int dowild(const uchar *p, const uchar *text,
> unsigned int flags)
> }
> if (t_ch <= p_ch && t_ch >= prev_ch)
> matched = 1;
> + else if ((flags & WM_CASEFOLD) &&
> ISLOWER(t_ch)) {
> + uchar t_ch_upper =
> toupper(t_ch);
> + if (t_ch_upper <= p_ch &&
> t_ch_upper >= prev_ch)
> + matched = 1;
> + }
> p_ch = 0; /* This makes "prev_ch" get
> set to 0. */
Hmm, this looks somewhat strange.
* At the beginning of the outermost "per characters in the text"
loop, we seem to downcase t_ch when WM_CASEFOLD is set.
* Also at the same place, we also seem to downcase p_ch under the
same condition.
which makes me wonder why the fix is not like this:
+ if (flags & WM_CASEFOLD) {
+ if (ISUPPER(p_ch))
+ p_ch = tolower(p_ch);
+ if (prev_ch && ISUPPER(prev_ch))
+ prev_ch = tolower(prev_ch);
+ }
if (t_ch <= p_ch && t_ch >= prev_ch)
matched = 1;
p_ch = 0; /* This sets "prev_ch" to 0 */
Ahh, OK, the "seemingly strange" construct is about handling a range
like "[Z-y]"; we do not want to upcase or downcase the p_ch/prev_ch
make the range "[z-y]" (empty) which would exclude bytes like "^",
"_" or even "Z".
And it is also OK to downcase p_ch in a single-letter case, not the
characters in a range, at the beginning of the outermost loop,
because we always compare for equality against t_ch (which is
downcased) in that case.
> @@ -245,6 +250,8 @@ static int dowild(const uchar *p, const uchar *text,
> unsigned int flags)
> } else if (CC_EQ(s,i, "upper")) {
> if (ISUPPER(t_ch))
> matched = 1;
> + else if ((flags & WM_CASEFOLD)
> && ISLOWER(t_ch))
> + matched = 1;
This also looks somewhat strange but correct in that t_ch is already
downcased so we do not need a corresponding change for CC_EQ("lower")
codepath.
Interesting. Will apply.
Thanks.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html