On Fri, 2025-09-12 at 21:53 -0400, Tom Lane wrote:
> 
> Ah, got it.  But this logic definitely deserves more comments.
> What do you think of something like
> 
>             if (pchar == ']' && charclass_start > 2)
>             {
>                 /* found the real end of a bracket pair */
>                 charclass_depth--;
>                 /* past start of outer brackets, so keep charclass_start > 2 
> */
>             }
>             else if (pchar == '[')
>             {
>                 /* start of a nested bracket pair */
>                 charclass_depth++;
>                 /* leading ^ or ] in this context is not special */
>                 charclass_start = 3;
>             }
>             else if (pchar == '^')
>             {
>                 /* okay to increment charclass_start even if already > 1 */
>                 charclass_start++;
>             }
>             else
>             {
>                 /* otherwise (including case of leading ']') */
>                 charclass_start = 3;    /* definitely past the start */
>             }
> 
> > Perhaps s/charclass_depth/bracket_depth/ would be a good idea.
> 
> Wouldn't object to that.  Maybe we can think of a better name for
> "charclass_start" too --- that sounds like a boolean condition.
> The best I'm coming up with right now is "charclass_count", but
> that's not that helpful.

I came up with the attached patch set.

0001 is the actual bug fix: in addition to my previous patch, I fixed two
more cases in which a closing bracket might not be recognized as ending
the character class (one is from your patch above).  I think that these
could not lead to bad query results, but I am not certain.

0002 is the cosmetic improvement: I renamed "charclass_depth" to "bracket_depth"
and "bracket_depth" to "bracket_pos", rewrote the "else if" cascade as you
suggested above and put some love into additional comments.

I used two separate patches for clarity and ease of review, but both
should get backpatched.

Yours,
Laurenz Albe
From dd37acb30b637c306f2ae7f041c164e892e92a22 Mon Sep 17 00:00:00 2001
From: Laurenz Albe <[email protected]>
Date: Sat, 13 Sep 2025 09:59:47 +0200
Subject: [PATCH v2 1/2] Amend recent fix for SIMILAR TO regex conversion

Commit e3ffc3e91d fixed the translation of character classes in
SIMILAR TO regular expressions.  Unfortunately the fix broke a corner
case: if there is an escape character right after the opening bracket,
(for example in "[\q]") a closing bracket right after the escape
sequence would not be seen as closing the character class.

There were two more oversights: a backslash or a nested opening bracket
right at the beginning of a character class should remove the special
meaning from any following caret or closing bracket.

Author: Laurenz Albe <[email protected]>
Reported-By: Dominique Devienne <[email protected]>
Reported-By: Stephan Springl <[email protected]>
Discussion: https://postgr.es/m/41a37137-f8bb-8fc5-2948-31b528f166dc%40bfw-online.de
Discussion: https://postgr.es/m/CAFCRh-8NwJd0jq6P%3DR3qhHyqU7hw0BTor3W0SvUcii24et%2BzAw%40mail.gmail.com
Backpatch-through: 13
---
 src/backend/utils/adt/regexp.c        | 14 ++++++++++++++
 src/test/regress/expected/strings.out |  9 +++++++++
 src/test/regress/sql/strings.sql      |  3 +++
 3 files changed, 26 insertions(+)

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 6e2864cbbda..b62d67a5a98 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -948,6 +948,12 @@ similar_escape_internal(text *pat_text, text *esc_text)
 				 */
 				*r++ = '\\';
 				*r++ = pchar;
+				/*
+				 * If we encounter an escaped character in a character class,
+				 * we are no longer at the beginning.
+				 */
+				if (charclass_depth > 0)
+					charclass_start = 3;
 			}
 			afterescape = false;
 		}
@@ -959,7 +965,11 @@ similar_escape_internal(text *pat_text, text *esc_text)
 		else if (charclass_depth > 0)
 		{
 			if (pchar == '\\')
+			{
 				*r++ = '\\';
+				/* we are no longer at the beginning of a character class */
+				charclass_start = 3;
+			}
 			*r++ = pchar;
 
 			/*
@@ -971,7 +981,11 @@ similar_escape_internal(text *pat_text, text *esc_text)
 			if (pchar == ']' && charclass_start > 2)
 				charclass_depth--;
 			else if (pchar == '[')
+			{
 				charclass_depth++;
+				/* we are no longer at the beginning of a character class */
+				charclass_start = 3;
+			}
 
 			/*
 			 * If there is a caret right after the opening bracket, it negates
diff --git a/src/test/regress/expected/strings.out b/src/test/regress/expected/strings.out
index ba302da51e7..2d6cb02ad60 100644
--- a/src/test/regress/expected/strings.out
+++ b/src/test/regress/expected/strings.out
@@ -693,6 +693,15 @@ EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[^^]^';
    Filter: (f1 ~ '^(?:[^^]\^)$'::text)
 (2 rows)
 
+-- Closing square bracket after an escape sequence at the beginning of
+-- a character closes the character class
+EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[|a]%' ESCAPE '|';
+              QUERY PLAN               
+---------------------------------------
+ Seq Scan on text_tbl
+   Filter: (f1 ~ '^(?:[\a].*)$'::text)
+(2 rows)
+
 -- Test backslash escapes in regexp_replace's replacement string
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
  regexp_replace 
diff --git a/src/test/regress/sql/strings.sql b/src/test/regress/sql/strings.sql
index b94004cc08c..5ed421d6205 100644
--- a/src/test/regress/sql/strings.sql
+++ b/src/test/regress/sql/strings.sql
@@ -218,6 +218,9 @@ EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[]%][^]%][^%]%';
 -- Closing square bracket effective after two carets at the beginning
 -- of character class.
 EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[^^]^';
+-- Closing square bracket after an escape sequence at the beginning of
+-- a character closes the character class
+EXPLAIN (COSTS OFF) SELECT * FROM TEXT_TBL WHERE f1 SIMILAR TO '[|a]%' ESCAPE '|';
 
 -- Test backslash escapes in regexp_replace's replacement string
 SELECT regexp_replace('1112223333', E'(\\d{3})(\\d{3})(\\d{4})', E'(\\1) \\2-\\3');
-- 
2.51.0

From 3c18418ccc3a8fe2e0276d229ff51d3f29e766a4 Mon Sep 17 00:00:00 2001
From: Laurenz Albe <[email protected]>
Date: Sat, 13 Sep 2025 10:51:35 +0200
Subject: [PATCH v2 2/2] Cosmetic improvements for SIMILAR TO regex conversion

The bug introduced in e3ffc3e91d indicates that the code in that area
should be more readable.  This patch renames the variables
"charclass_depth" and "charclass_start" to something more meaningful,
adds and improves comments and rewrites an "if" cascade to be more
consistent.

This patch is purely cosmetic and should not change the logic.

Author: Laurenz Albe <[email protected]>
Discussion: https://postgr.es/m/flat/2673230.1757722323%40sss.pgh.pa.us#831fd545c0faeebbfb08714eb4cc307d
Backpatch-through: 13
---
 src/backend/utils/adt/regexp.c | 76 ++++++++++++++++++++--------------
 1 file changed, 45 insertions(+), 31 deletions(-)

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index b62d67a5a98..dcdb2e53f67 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -774,10 +774,8 @@ similar_escape_internal(text *pat_text, text *esc_text)
 				elen;
 	bool		afterescape = false;
 	int			nquotes = 0;
-	int			charclass_depth = 0;	/* Nesting level of character classes,
-										 * encompassed by square brackets */
-	int			charclass_start = 0;	/* State of the character class start,
-										 * for carets */
+	int			bracket_depth = 0;	/* square bracket nesting level */
+	int			charclass_pos = 0;	/* position inside a character class */
 
 	p = VARDATA_ANY(pat_text);
 	plen = VARSIZE_ANY_EXHDR(pat_text);
@@ -907,7 +905,7 @@ similar_escape_internal(text *pat_text, text *esc_text)
 		/* fast path */
 		if (afterescape)
 		{
-			if (pchar == '"' && charclass_depth < 1)	/* escape-double-quote? */
+			if (pchar == '"' && bracket_depth < 1)	/* escape-double-quote? */
 			{
 				/* emit appropriate part separator, per notes above */
 				if (nquotes == 0)
@@ -948,12 +946,13 @@ similar_escape_internal(text *pat_text, text *esc_text)
 				 */
 				*r++ = '\\';
 				*r++ = pchar;
+
 				/*
 				 * If we encounter an escaped character in a character class,
 				 * we are no longer at the beginning.
 				 */
-				if (charclass_depth > 0)
-					charclass_start = 3;
+				if (bracket_depth > 0)
+					charclass_pos = 3;
 			}
 			afterescape = false;
 		}
@@ -962,49 +961,64 @@ similar_escape_internal(text *pat_text, text *esc_text)
 			/* SQL escape character; do not send to output */
 			afterescape = true;
 		}
-		else if (charclass_depth > 0)
+		else if (bracket_depth > 0)
 		{
 			if (pchar == '\\')
 			{
+				/* if backslash is no SIMILAR TO escape character, double it */
 				*r++ = '\\';
 				/* we are no longer at the beginning of a character class */
-				charclass_start = 3;
+				charclass_pos = 3;
 			}
 			*r++ = pchar;
 
-			/*
-			 * Ignore a closing bracket at the start of a character class.
-			 * Such a bracket is taken literally rather than closing the
-			 * class.  "charclass_start" is 1 right at the beginning of a
-			 * class and 2 after an initial caret.
+			/*----------
+			 * "charclass_pos" describes where in a character class we are:
+			 *   0: not in a character class
+			 *   1: right after the opening '[' (a following '^' will negate
+			 *      the class, and ']' is a normal character)
+			 *   2: right after a '^' after the opening '[' (']' is a normal
+			 *      character)
+			 *   3 or more: further inside the character class
 			 */
-			if (pchar == ']' && charclass_start > 2)
-				charclass_depth--;
+			if (pchar == ']' && charclass_pos > 2)
+			{
+				/* found the real end of a bracket pair */
+				bracket_depth--;
+				/* don't reset charclass_pos, this may be an inner bracket */
+			}
 			else if (pchar == '[')
 			{
-				charclass_depth++;
+				/* start of a nested bracket pair */
+				bracket_depth++;
 				/* we are no longer at the beginning of a character class */
-				charclass_start = 3;
+				charclass_pos = 3;
+			}
+			else if (pchar == '^')
+			{
+				/*
+				 * A caret right after the opening bracket negates the
+				 * character class.  In that case, the following will
+				 * increment charclass_pos from 1 to 2, so that a following
+				 * ']' is still a normal character and does not close the
+				 * character class. If we are further inside a character
+				 * class, charclass_pos might get incremented past 3, which is
+				 * fine.
+				 */
+				charclass_pos++;
 			}
-
-			/*
-			 * If there is a caret right after the opening bracket, it negates
-			 * the character class, but a following closing bracket should
-			 * still be treated as a normal character.  That holds only for
-			 * the first caret, so only the values 1 and 2 mean that closing
-			 * brackets should be taken literally.
-			 */
-			if (pchar == '^')
-				charclass_start++;
 			else
-				charclass_start = 3;	/* definitely past the start */
+			{
+				/* a leading ']' or anything else take us past the beginning */
+				charclass_pos = 3;
+			}
 		}
 		else if (pchar == '[')
 		{
 			/* start of a character class */
 			*r++ = pchar;
-			charclass_depth++;
-			charclass_start = 1;
+			bracket_depth = 1;
+			charclass_pos = 1;
 		}
 		else if (pchar == '%')
 		{
-- 
2.51.0

Reply via email to