Re: Use correct collation in pg_trgm

David Geier Fri, 27 Mar 2026 03:57:02 -0700

On 26.03.2026 19:26, Tom Lane wrote:
> Jeff Davis <[email protected]> writes:
>> On Thu, 2026-03-26 at 09:50 +0100, David Geier wrote:
>>> I agree. That is inconsistent. But if anything, shouldn't we change
>>> tsvector/tsquery to as well adhere to the inferred collation?
> 
>> I am not sure either way.
>> It's easy to specify a COLLATE clause to affect the interpretation of
>> the input. But once you parse the inputs into a stored value, you can't
>> later reinterpret those values by specifying a COLLATE clause. The
>> parsing already happened and the original input string was lost.
>> You can end up with a table full of values, some of which were parsed
>> with one set of semantics, and others parsed with a different set of
>> semantics. That may make sense or it may just cause confusion. It's
>> tough for me to say.
> 
> The rule that text search goes by is that it's okay to be a bit
> fuzzy about this because people are usually looking for approximate
> matches, so that even if you have sets of lexemes that were extracted
> under slightly different parsing rules you can probably still find


tsquery allows to do starts-with queries equivalent to LIKE 'foo%' via
to_tsquery('foo:*'). These two would then also behave differently.

Can you see any good reason that speaks against using the inferred
collation in tsquery / tsvector?

> what you want.  While that argument still works for pg_trgm's original
> "similarity" functions, it falls flat for the LIKE/ILIKE/regex index
> support functionality: people will be justifiably unhappy if the index
> doesn't find the exact same matches that a seqscan-and-filter would.

Agreed. That was also one of the motivations to change it.

> I've not experimented, but I rather imagine that things are already
> buggy as heck, in that optimizing a LIKE or regex expression that's
> got collation A applied to it into an indexscan on a pg_trgm index
> made with collation B will not work if different trigrams get
> extracted.  I think we have to insist that the index collation match
> the query.  Once we've done that, the concern about making a change
> like this seems less: you will not get wrong answers, rather the
> planner will refuse to use an incompatible index.

I thought that happens already. In the following example no index scan
is used, even though sequential scan is disabled. FWICS,
IndexCollMatchesExprColl() takes care of that.

CREATE EXTENSION pg_trgm;
CREATE TABLE test(col TEXT COLLATE "tr-x-icu");
CREATE INDEX ON test USING GIN(col gin_trgm_ops);
SET enable_seqscan = FALSE;

EXPLAIN SELECT * FROM test WHERE col LIKE '%test%' COLLATE "C";
                               QUERY PLAN
-------------------------------------------------------------------------
 Seq Scan on test  (cost=10000000000.00..10000000001.01 rows=1 width=32)
   Filter: (col = 'test'::text COLLATE "C")

If you have other cases in mind, pointers are appreciated.

--
David Geier

Re: Use correct collation in pg_trgm

Reply via email to