On Wed, Mar 25, 2026 at 6:09 PM Masahiko Sawada <[email protected]> wrote:
>
> On Wed, Mar 25, 2026 at 5:35 PM Tom Lane <[email protected]> wrote:
> >
> > Tomas Vondra <[email protected]> writes:
> > > On 3/26/26 00:40, Tom Lane wrote:
> > >> I believe what's happening there is that in cs_CZ locale,
> > >> "V" doesn't follow simple ASCII sort ordering.
> >
> > > With cs_CZ all letters sort *before* numbers, while in en_US it's the
> > > other way around. V is not special in any way.
> >
> > Ah, sorry, I should have researched a bit instead of relying on
> > fading memory.  The quirk I was thinking of is that in cs_CZ,
> > "ch" sorts after "h":
> >
> > u8=# select 'h' < 'ch'::text collate "en_US";
> >  ?column?
> > ----------
> >  f
> > (1 row)
> >
> > u8=# select 'h' < 'ch'::text collate "cs_CZ";
> >  ?column?
> > ----------
> >  t
> > (1 row)
> >
> > Regular hex encoding isn't bitten by that because it doesn't
> > use 'h' in the text form ... but this base32hex thingie does.
> >
> > However, your point is also correct:
> >
> > u8=# select '0' < 'C'::text ;
> >  ?column?
> > ----------
> >  t
> > (1 row)
> >
> > u8=# select '0' < 'C'::text collate "cs_CZ";
> >  ?column?
> > ----------
> >  f
> > (1 row)
> >
> > and that breaks "text ordering matches numeric ordering"
> > for both traditional hex and base32hex.  So maybe this
> > is not as big a deal as I first thought.  We need a fix
> > for the new test though.  Probably adding COLLATE "C"
> > would be enough.
>
> Thank you for the report and the analysis.
>
> I've reproduced the issue with "cs_CZ" collation and adding COLLATE
> "C" to the query resolves it. It seems also a good idea to add a note
> in the documentation too as users might face the same issue. For
> example,
>
> To maintain the lexicographical sort order of the encoded data, ensure
> that the text is sorted using the C collation (e.g., using COLLATE
> "C"). Natural language collations may sort characters differently and
> break the ordering.
>

Attached the patch doing the above idea.

Regards,

-- 
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
From a64f3f64a9f04c1f5da9a51fe760c40480585fd4 Mon Sep 17 00:00:00 2001
From: Masahiko Sawada <[email protected]>
Date: Wed, 25 Mar 2026 18:07:11 -0700
Subject: [PATCH] Fix UUID sortability tests in base32hex encoding.

The recently added test for base32hex encoding of UUIDs failed on
buildfarm member hippopotamus using natural language locales (such as
cs_CZ). This happened because those collations may sort characters
differently, which breaks the strict byte-wise lexicographical
ordering expected by base32hex encoding.

This commit fixes the regression tests by explicitly using the C
collation. Additionally, add a note to the documentation to warm users
that they must use the C collation if they want to maintain the
lexicographical sort order of the encoded data.

Per buildfarm member hippopotamus.

Analyzed-by: Tom Lane <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 doc/src/sgml/func/func-binarystring.sgml | 9 +++++++++
 src/test/regress/expected/uuid.out       | 7 +++++--
 src/test/regress/sql/uuid.sql            | 7 +++++--
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/doc/src/sgml/func/func-binarystring.sgml b/doc/src/sgml/func/func-binarystring.sgml
index 0aaf9bc68f1..2ad2cdbea82 100644
--- a/doc/src/sgml/func/func-binarystring.sgml
+++ b/doc/src/sgml/func/func-binarystring.sgml
@@ -790,6 +790,15 @@
        produces a 26-character string compared to the standard 36-character
        UUID representation.
       </para>
+
+      <note>
+       <para>
+        To maintain the lexicographical sort order of the encoded data,
+        ensure that the text is sorted using the C collation
+        (e.g., using <literal>COLLATE "C"</literal>). Natural language
+        collations may sort characters differently and break the ordering.
+       </para>
+      </note>
      </listitem>
     </varlistentry>
 
diff --git a/src/test/regress/expected/uuid.out b/src/test/regress/expected/uuid.out
index 142c529e693..9c5dda9e9ab 100644
--- a/src/test/regress/expected/uuid.out
+++ b/src/test/regress/expected/uuid.out
@@ -236,8 +236,11 @@ SELECT array_agg(id ORDER BY guid_field) FROM guid3;
  {1,2,3,4,5,6,7,8,9,10,11,12}
 (1 row)
 
--- make sure base32hex encoding works with UUIDs and preserves ordering
-SELECT array_agg(id ORDER BY guid_encoded) FROM guid3;
+-- Test base32hex encoding of UUIDs and its lexicographical sorting property.
+-- COLLATE "C" is required to prevent buildfarm failures in non-C locales
+-- where natural language collations (such as cs_CZ) would break strict
+-- byte-wise ordering.
+SELECT array_agg(id ORDER BY guid_encoded COLLATE "C") FROM guid3;
           array_agg           
 ------------------------------
  {1,2,3,4,5,6,7,8,9,10,11,12}
diff --git a/src/test/regress/sql/uuid.sql b/src/test/regress/sql/uuid.sql
index f2ff00f5ddd..8cc2ad40614 100644
--- a/src/test/regress/sql/uuid.sql
+++ b/src/test/regress/sql/uuid.sql
@@ -122,8 +122,11 @@ INSERT INTO guid3 (guid_field) SELECT uuidv7() FROM generate_series(1, 10);
 INSERT INTO guid3 (guid_field) VALUES ('ffffffff-ffff-ffff-ffff-ffffffffffff'::uuid);
 SELECT array_agg(id ORDER BY guid_field) FROM guid3;
 
--- make sure base32hex encoding works with UUIDs and preserves ordering
-SELECT array_agg(id ORDER BY guid_encoded) FROM guid3;
+-- Test base32hex encoding of UUIDs and its lexicographical sorting property.
+-- COLLATE "C" is required to prevent buildfarm failures in non-C locales
+-- where natural language collations (such as cs_CZ) would break strict
+-- byte-wise ordering.
+SELECT array_agg(id ORDER BY guid_encoded COLLATE "C") FROM guid3;
 
 -- Check the timestamp offsets for v7.
 --
-- 
2.53.0

Reply via email to