trgm and index-usage when using similarity function

2025-11-05 Thread Andreas Joseph Krogh


Using PG-18.0, I'm having problems with trgm and index-usage when using 
similarity-function.



Given the following schema:

CREATE TABLE person
(
entity_id BIGSERIAL PRIMARY KEY,
firstname text,
lastname text,
fullname_search text,
birthdate date
);

INSERT INTO person
SELECT pers.entity_id, pers.firstname, pers.lastname, pers.fullname_search, 
pers.birthdate
FROM onp_crm_person pers;

CREATE INDEX test_idx_onp_crm_person_duplicates
ON person USING gin (fullname_search gin_trgm_ops)
;


This query uses index and is fast, but I have to set 
pg_trgm.similarity_threshold manually first:

SET pg_trgm.similarity_threshold = 0.9;

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
 WHERE d.fullname_search % p.fullname_search
AND d.entity_id <> p.entity_id
LIMIT 1
)
;

┌──┐
│ QUERY PLAN │

├──┤
│ Nested Loop Semi Join (cost=2.19..319906.05 rows=256 width=45) (actual 
time=29.527..2626.465 rows=7497.00 loops=1) │
│ Buffers: shared hit=972919 │
│ -> Seq Scan on person p (cost=0.00..488.81 rows=25581 width=45) (actual 
time=0.024..1.127 rows=25581.00 loops=1) │
│ Buffers: shared hit=233 │
│ -> Bitmap Heap Scan on person d (cost=2.19..10.03 rows=256 width=25) (actual 
time=0.102..0.102 rows=0.29 loops=25581) │
│ Recheck Cond: (fullname_search % p.fullname_search) │
│ Rows Removed by Index Recheck: 0 │
│ Filter: (entity_id <> p.entity_id) │
│ Rows Removed by Filter: 1 │
│ Heap Blocks: exact=34252 │
│ Buffers: shared hit=972686 │
│ -> Bitmap Index Scan on test_idx_onp_crm_person_duplicates (cost=0.00..2.13 
rows=256 width=0) (actual time=0.097..0.097 rows=1.99 loops=25581) │
│ Index Cond: (fullname_search % p.fullname_search) │
│ Index Searches: 25581 │
│ Buffers: shared hit=933853 │
│ Planning: │
│ Buffers: shared hit=2 │
│ Planning Time: 3.620 ms │
│ JIT: │
│ Functions: 7 │
│ Options: Inlining false, Optimization false, Expressions true, Deforming 
true │
│ Timing: Generation 0.772 ms (Deform 0.230 ms), Inlining 0.000 ms, 
Optimization 5.522 ms, Emission 23.219 ms, Total 29.513 ms │
│ Execution Time: 2628.643 ms │

└──┘
(23 rows)


But using similarity-function:

EXPLAIN (ANALYZE, BUFFERS)
SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate
FROM person p
WHERE EXISTS (
SELECT *
FROM person d
 WHERE similarity(d.fullname_search, p.fullname_search) >= 0.9
AND d.entity_id <> p.entity_id
LIMIT 1
)
;
… this has been running for 30 minutes now and I've cancelled it.



Any suggestions on how to fix this query so I can:

 * 
specify the similarity-factor as part of the query

 * 
Make it fast

Thanks.






--
Andreas Joseph Krogh

Problem getting query to use index inside a function

2025-11-05 Thread Dirschel, Steve
I'm not sure if I should send this to pgsql-performance or pqsql-general so 
hopefully I'm sending to the correct one.

Table definition:

shared=> \d request
 Table "public.request"
  Column  |Type | Collation | Nullable |
Default
--+-+---+--+
objectid | character(36)   |   | not null |
data | jsonb   |   | not null | 
'{}'::jsonb
clientid | character(3)|   | not null |
active   | integer |   | not null |
productid| integer |   | not null |
checkoutbyuid| character(100)  |   |  |
checkoutdatetime | timestamp without time zone |   |  |
metadata | jsonb   |   | not null | 
'{}'::jsonb
search_vector| tsvector|   |  |
requeststate | text|   | not null | 
'Active'::text
Indexes:
"requestkey" PRIMARY KEY, btree (objectid, productid)


Here is the function I'm having difficulties with:

CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[], 
param_productid integer DEFAULT 1)
RETURNS TABLE(objectid text, n text, v text, vt integer)
LANGUAGE sql
AS $function$
SELECT   objectid::text
 , i->>'n'::text
 , i->>'v'::text
 , (i->>'vt') :: INT as vt
FROM   request r
 , jsonb_array_elements(data -> 'i') i
WHERE objectid = ANY($1)
 AND productid=$2

$function$
;

Query:

shared=> explain (analyze, buffers)
shared-> SELECT objectid::text, n::text, v::text, vt::int FROM 
steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'],
 1);
   QUERY PLAN

Function Scan on steve1  (cost=0.25..10.25 rows=1000 width=100) (actual 
time=42.694..42.942 rows=3368 loops=1)
   Buffers: shared hit=16612  <---
Planning Time: 0.034 ms
Execution Time: 43.279 ms
(4 rows)

The query is doing 16612 logical reads which implies it full scanning the 
table.  I can also run the query providing the input values where you can 
clearly see it's full scanning the table:

shared=> explain (analyze, buffers)
shared-> SELECT objectid::text
shared->, i->>'n'::text
shared->, i->>'v'::text
shared->, (i->>'vt') :: INT as vt
shared-> FROM   request r
shared->, jsonb_array_elements(data -> 'i') i
shared-> WHERE  objectid = ANY( 
ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'])
shared->AND productid=1;
   QUERY 
PLAN

Nested Loop  (cost=0.01..19561.37 rows=57000 width=100) (actual 
time=0.351..41.354 rows=3368 loops=1)
   Buffers: shared hit=16586  <-
   ->  Seq Scan on request r  (cost=0.00..17566.36 rows=570 width=67) (actual 
time=0.010..38.341 rows=2 loops=1)
 Filter: ((productid = 1) AND ((objectid)::text = ANY 
('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}'::text[])))
 Rows Removed by Filter: 57077
 Buffers: shared hit=16569
   ->  Function Scan on jsonb_array_elements i  (cost=0.01..1.00 rows=100 
width=32) (actual time=0.291..0.406 rows=1684 loops=2)
 Buffers: shared hit=17
Planning Time: 0.093 ms
Execution Time: 41.607 ms
(10 rows)


I can get the query to use the PK index if I change the query from ARRAY [] to 
{}:

shared=> explain (analyze, buffers)
shared-> SELECT objectid::text
shared->, i->>'n'::text
shared->, i->>'v'::text
shared->, (i->>'vt') :: INT as vt
shared-> FROM   request r
shared->, jsonb_array_elements(data -> 'i') i
shared-> WHERE  objectid = 
ANY('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}')
shared->AND productid=1;
  QUERY PLAN
--
Nested Loop  (cost=0.42..23.87 rows=200 width=100) (actual time=0.346..3.007 
rows=3368 loops=1)
   Buffers: shared hit=27  <
   ->  Index Scan using requestkey on request r  (cost=0.41..16.87 rows=2 
width=67) (actual time=0.036..0.049 rows=2 loops=1)
 Index Cond: ((objectid = ANY 
('{5ab8e0c

Re: [EXT] Re: Problem getting query to use index inside a function

2025-11-05 Thread Tom Lane
"Dirschel, Steve"  writes:
>> I think you would have better luck if the planner were "inlining"
>> this function, which we can see it's not since you get a Function Scan on 
>> steve1 rather than the contained query.
>> I think the only thing stopping that from happening is that the function is 
>> (by default) VOLATILE.  Try marking it STABLE so that it can share the 
>> calling query's snapshot.

> Thanks for the reply, but that did not seem to help.

I tried to replicate this as follows:

--- CUT ---
create table request(objectid text, productid int, data jsonb);
create index on request(objectid, productid);

CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[], 
param_productid integer DEFAULT 1) RETURNS TABLE(objectid text, n 
text, v text, vt integer) LANGUAGE sql AS $function$
SELECT   objectid::text
 , i->>'n'::text
 , i->>'v'::text
 , (i->>'vt') :: INT as vt
FROM   request r
 , jsonb_array_elements(data -> 'i') i WHERE objectid = 
ANY($1)
 AND productid=$2

$function$
stable ;

explain
SELECT objectid::text, n::text, v::text, vt::int FROM 
steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'],
 1);
--- CUT ---

and I got:

 QUERY PLAN 


 Nested Loop  (cost=0.15..11.42 rows=100 width=100)
   ->  Index Scan using request_objectid_productid_idx on request r  
(cost=0.15..8.17 rows=1 width=64)
 Index Cond: ((objectid = ANY 
('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}'::text[]))
 AND (productid = 1))
   ->  Function Scan on jsonb_array_elements i  (cost=0.01..1.00 rows=100 
width=32)
(4 rows)

which is what I expected from successful inlining of the function.
So there are some moving parts in your situation that you've not
told us about.

regards, tom lane




RE: [EXT] Re: Problem getting query to use index inside a function

2025-11-05 Thread Dirschel, Steve


>> Thanks for the reply, but that did not seem to help.

> I tried to replicate this as follows:

> --- CUT ---
> create table request(objectid text, productid int, data jsonb); create index 
> on request(objectid, productid);

> CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[], 
> param_productid integer DEFAULT 1) RETURNS TABLE(objectid text, n text, v 
> text, vt integer) LANGUAGE sql AS $function$
> SELECT   objectid::text
>  , i->>'n'::text
>  , i->>'v'::text
>  , (i->>'vt') :: INT as vt
> FROM   request r
>  , jsonb_array_elements(data -> 'i') i WHERE objectid =
> ANY($1)
>  AND productid=$2

> $function$
> stable ;

> explain
> SELECT objectid::text, n::text, v::text, vt::int FROM 
> steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'],
>  1);
> --- CUT ---

> and I got:

>  QUERY 
> PLAN 
> 
>  Nested Loop  (cost=0.15..11.42 rows=100 width=100)
>->  Index Scan using request_objectid_productid_idx on request r  
> (cost=0.15..8.17 rows=1 width=64)
>  Index Cond: ((objectid = ANY 
> ('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}'::text[]))
>  AND (productid = 1))
>->  Function Scan on jsonb_array_elements i  (cost=0.01..1.00 rows=100 
> width=32)
> (4 rows)

> which is what I expected from successful inlining of the function.
> So there are some moving parts in your situation that you've not told us 
> about.

>   regards, tom lane

Hi Tom,

I ran your code and got similar results so I agree there is more for me to dig 
into.  I see STABLE has a limit of not allowing DML in the function.  I am 
testing against a very simple function here but we have many other functions 
with the same problem but those also contain DML so even if I got the STABLE to 
work in this one test case it does not appear I could use in all of my 
functions with this problem.  Are there other options here besides the STABLE 
option that would work for functions that also contain DML?

Thanks
Steve




Re: Problem getting query to use index inside a function

2025-11-05 Thread Tom Lane
"Dirschel, Steve"  writes:
> Here is the function I'm having difficulties with:

> CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[], 
> param_productid integer DEFAULT 1)
> RETURNS TABLE(objectid text, n text, v text, vt integer)
> LANGUAGE sql
> AS $function$
> SELECT   objectid::text
>  , i->>'n'::text
>  , i->>'v'::text
>  , (i->>'vt') :: INT as vt
> FROM   request r
>  , jsonb_array_elements(data -> 'i') i
> WHERE objectid = ANY($1)
>  AND productid=$2

> $function$
> ;

> Query:

> shared=> explain (analyze, buffers)
> shared-> SELECT objectid::text, n::text, v::text, vt::int FROM 
> steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'],
>  1);
>QUERY PLAN
> 
> Function Scan on steve1  (cost=0.25..10.25 rows=1000 width=100) (actual 
> time=42.694..42.942 rows=3368 loops=1)

I think you would have better luck if the planner were "inlining"
this function, which we can see it's not since you get a Function Scan
on steve1 rather than the contained query.

I think the only thing stopping that from happening is that the
function is (by default) VOLATILE.  Try marking it STABLE so that
it can share the calling query's snapshot.

(v18 should handle such cases better than previous versions, BTW.
But you'd still be better off marking the function STABLE.)

regards, tom lane




RE: [EXT] Re: Problem getting query to use index inside a function

2025-11-05 Thread Dirschel, Steve
> > Here is the function I'm having difficulties with:

> > CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[], 
> > param_productid integer DEFAULT 1) RETURNS TABLE(objectid text, n 
> > text, v text, vt integer) LANGUAGE sql AS $function$
> > SELECT   objectid::text
> >  , i->>'n'::text
> >  , i->>'v'::text
> >  , (i->>'vt') :: INT as vt
> > FROM   request r
> >  , jsonb_array_elements(data -> 'i') i WHERE objectid = 
> > ANY($1)
> >  AND productid=$2

> > $function$
> > ;

> > Query:

> > shared=> explain (analyze, buffers)
> > shared-> SELECT objectid::text, n::text, v::text, vt::int FROM 
> > shared-> steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251
> > shared-> -6723-48a8-8385-55133fab704a'], 1);
> >QUERY PLAN
> > --
> > --
> > Function Scan on steve1  (cost=0.25..10.25 rows=1000 width=100) 
> > (actual time=42.694..42.942 rows=3368 loops=1)

> I think you would have better luck if the planner were "inlining"
> this function, which we can see it's not since you get a Function Scan on 
> steve1 rather than the contained query.

> I think the only thing stopping that from happening is that the function is 
> (by default) VOLATILE.  Try marking it STABLE so that it can share the 
> calling query's snapshot.

> (v18 should handle such cases better than previous versions, BTW.
> But you'd still be better off marking the function STABLE.)

>   regards, tom lane


Thanks for the reply, but that did not seem to help.  I tried using both the {} 
and ARRAY[] syntax but both are still full scanning based on the shared 
buffers.  Any other ideas?

shared=> ALTER FUNCTION public.steve1 (param_requestid text[], param_productid 
integer) STABLE;
ALTER FUNCTION

shared=> explain (analyze, buffers)
shared->SELECT objectid::text, n::text, v::text, vt::int FROM 
steve1('{83e3326a-62fe-45bc-81e5-1e9fb9a84d31}', 'PJJ', 1, 1);
 QUERY PLAN
-
 Function Scan on steve1  (cost=0.25..10.25 rows=1000 width=100) (actual 
time=64.465..64.465 rows=0 loops=1)
   Buffers: shared hit=16572
 Planning Time: 0.033 ms
 Execution Time: 64.485 ms
(4 rows)


shared=>
shared=> explain (analyze, buffers)
shared-> SELECT objectid::text, n::text, v::text, vt::int FROM 
steve1(ARRAY['83e3326a-62fe-45bc-81e5-1e9fb9a84d31'], 'PJJ', 1, 1);
 QUERY PLAN
-
 Function Scan on steve1  (cost=0.25..10.25 rows=1000 width=100) (actual 
time=63.749..63.749 rows=0 loops=1)
   Buffers: shared hit=16569
 Planning Time: 0.043 ms
 Execution Time: 63.766 ms
(4 rows)