trgm and index-usage when using similarity function
Using PG-18.0, I'm having problems with trgm and index-usage when using similarity-function. Given the following schema: CREATE TABLE person ( entity_id BIGSERIAL PRIMARY KEY, firstname text, lastname text, fullname_search text, birthdate date ); INSERT INTO person SELECT pers.entity_id, pers.firstname, pers.lastname, pers.fullname_search, pers.birthdate FROM onp_crm_person pers; CREATE INDEX test_idx_onp_crm_person_duplicates ON person USING gin (fullname_search gin_trgm_ops) ; This query uses index and is fast, but I have to set pg_trgm.similarity_threshold manually first: SET pg_trgm.similarity_threshold = 0.9; EXPLAIN (ANALYZE, BUFFERS) SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate FROM person p WHERE EXISTS ( SELECT * FROM person d WHERE d.fullname_search % p.fullname_search AND d.entity_id <> p.entity_id LIMIT 1 ) ; ┌──┐ │ QUERY PLAN │ ├──┤ │ Nested Loop Semi Join (cost=2.19..319906.05 rows=256 width=45) (actual time=29.527..2626.465 rows=7497.00 loops=1) │ │ Buffers: shared hit=972919 │ │ -> Seq Scan on person p (cost=0.00..488.81 rows=25581 width=45) (actual time=0.024..1.127 rows=25581.00 loops=1) │ │ Buffers: shared hit=233 │ │ -> Bitmap Heap Scan on person d (cost=2.19..10.03 rows=256 width=25) (actual time=0.102..0.102 rows=0.29 loops=25581) │ │ Recheck Cond: (fullname_search % p.fullname_search) │ │ Rows Removed by Index Recheck: 0 │ │ Filter: (entity_id <> p.entity_id) │ │ Rows Removed by Filter: 1 │ │ Heap Blocks: exact=34252 │ │ Buffers: shared hit=972686 │ │ -> Bitmap Index Scan on test_idx_onp_crm_person_duplicates (cost=0.00..2.13 rows=256 width=0) (actual time=0.097..0.097 rows=1.99 loops=25581) │ │ Index Cond: (fullname_search % p.fullname_search) │ │ Index Searches: 25581 │ │ Buffers: shared hit=933853 │ │ Planning: │ │ Buffers: shared hit=2 │ │ Planning Time: 3.620 ms │ │ JIT: │ │ Functions: 7 │ │ Options: Inlining false, Optimization false, Expressions true, Deforming true │ │ Timing: Generation 0.772 ms (Deform 0.230 ms), Inlining 0.000 ms, Optimization 5.522 ms, Emission 23.219 ms, Total 29.513 ms │ │ Execution Time: 2628.643 ms │ └──┘ (23 rows) But using similarity-function: EXPLAIN (ANALYZE, BUFFERS) SELECT p.entity_id, p.firstname, p.lastname, p.fullname_search, p.birthdate FROM person p WHERE EXISTS ( SELECT * FROM person d WHERE similarity(d.fullname_search, p.fullname_search) >= 0.9 AND d.entity_id <> p.entity_id LIMIT 1 ) ; … this has been running for 30 minutes now and I've cancelled it. Any suggestions on how to fix this query so I can: * specify the similarity-factor as part of the query * Make it fast Thanks. -- Andreas Joseph Krogh
Problem getting query to use index inside a function
I'm not sure if I should send this to pgsql-performance or pqsql-general so
hopefully I'm sending to the correct one.
Table definition:
shared=> \d request
Table "public.request"
Column |Type | Collation | Nullable |
Default
--+-+---+--+
objectid | character(36) | | not null |
data | jsonb | | not null |
'{}'::jsonb
clientid | character(3)| | not null |
active | integer | | not null |
productid| integer | | not null |
checkoutbyuid| character(100) | | |
checkoutdatetime | timestamp without time zone | | |
metadata | jsonb | | not null |
'{}'::jsonb
search_vector| tsvector| | |
requeststate | text| | not null |
'Active'::text
Indexes:
"requestkey" PRIMARY KEY, btree (objectid, productid)
Here is the function I'm having difficulties with:
CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[],
param_productid integer DEFAULT 1)
RETURNS TABLE(objectid text, n text, v text, vt integer)
LANGUAGE sql
AS $function$
SELECT objectid::text
, i->>'n'::text
, i->>'v'::text
, (i->>'vt') :: INT as vt
FROM request r
, jsonb_array_elements(data -> 'i') i
WHERE objectid = ANY($1)
AND productid=$2
$function$
;
Query:
shared=> explain (analyze, buffers)
shared-> SELECT objectid::text, n::text, v::text, vt::int FROM
steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'],
1);
QUERY PLAN
Function Scan on steve1 (cost=0.25..10.25 rows=1000 width=100) (actual
time=42.694..42.942 rows=3368 loops=1)
Buffers: shared hit=16612 <---
Planning Time: 0.034 ms
Execution Time: 43.279 ms
(4 rows)
The query is doing 16612 logical reads which implies it full scanning the
table. I can also run the query providing the input values where you can
clearly see it's full scanning the table:
shared=> explain (analyze, buffers)
shared-> SELECT objectid::text
shared->, i->>'n'::text
shared->, i->>'v'::text
shared->, (i->>'vt') :: INT as vt
shared-> FROM request r
shared->, jsonb_array_elements(data -> 'i') i
shared-> WHERE objectid = ANY(
ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'])
shared->AND productid=1;
QUERY
PLAN
Nested Loop (cost=0.01..19561.37 rows=57000 width=100) (actual
time=0.351..41.354 rows=3368 loops=1)
Buffers: shared hit=16586 <-
-> Seq Scan on request r (cost=0.00..17566.36 rows=570 width=67) (actual
time=0.010..38.341 rows=2 loops=1)
Filter: ((productid = 1) AND ((objectid)::text = ANY
('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}'::text[])))
Rows Removed by Filter: 57077
Buffers: shared hit=16569
-> Function Scan on jsonb_array_elements i (cost=0.01..1.00 rows=100
width=32) (actual time=0.291..0.406 rows=1684 loops=2)
Buffers: shared hit=17
Planning Time: 0.093 ms
Execution Time: 41.607 ms
(10 rows)
I can get the query to use the PK index if I change the query from ARRAY [] to
{}:
shared=> explain (analyze, buffers)
shared-> SELECT objectid::text
shared->, i->>'n'::text
shared->, i->>'v'::text
shared->, (i->>'vt') :: INT as vt
shared-> FROM request r
shared->, jsonb_array_elements(data -> 'i') i
shared-> WHERE objectid =
ANY('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}')
shared->AND productid=1;
QUERY PLAN
--
Nested Loop (cost=0.42..23.87 rows=200 width=100) (actual time=0.346..3.007
rows=3368 loops=1)
Buffers: shared hit=27 <
-> Index Scan using requestkey on request r (cost=0.41..16.87 rows=2
width=67) (actual time=0.036..0.049 rows=2 loops=1)
Index Cond: ((objectid = ANY
('{5ab8e0c
Re: [EXT] Re: Problem getting query to use index inside a function
"Dirschel, Steve" writes:
>> I think you would have better luck if the planner were "inlining"
>> this function, which we can see it's not since you get a Function Scan on
>> steve1 rather than the contained query.
>> I think the only thing stopping that from happening is that the function is
>> (by default) VOLATILE. Try marking it STABLE so that it can share the
>> calling query's snapshot.
> Thanks for the reply, but that did not seem to help.
I tried to replicate this as follows:
--- CUT ---
create table request(objectid text, productid int, data jsonb);
create index on request(objectid, productid);
CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[],
param_productid integer DEFAULT 1) RETURNS TABLE(objectid text, n
text, v text, vt integer) LANGUAGE sql AS $function$
SELECT objectid::text
, i->>'n'::text
, i->>'v'::text
, (i->>'vt') :: INT as vt
FROM request r
, jsonb_array_elements(data -> 'i') i WHERE objectid =
ANY($1)
AND productid=$2
$function$
stable ;
explain
SELECT objectid::text, n::text, v::text, vt::int FROM
steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'],
1);
--- CUT ---
and I got:
QUERY PLAN
Nested Loop (cost=0.15..11.42 rows=100 width=100)
-> Index Scan using request_objectid_productid_idx on request r
(cost=0.15..8.17 rows=1 width=64)
Index Cond: ((objectid = ANY
('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}'::text[]))
AND (productid = 1))
-> Function Scan on jsonb_array_elements i (cost=0.01..1.00 rows=100
width=32)
(4 rows)
which is what I expected from successful inlining of the function.
So there are some moving parts in your situation that you've not
told us about.
regards, tom lane
RE: [EXT] Re: Problem getting query to use index inside a function
>> Thanks for the reply, but that did not seem to help.
> I tried to replicate this as follows:
> --- CUT ---
> create table request(objectid text, productid int, data jsonb); create index
> on request(objectid, productid);
> CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[],
> param_productid integer DEFAULT 1) RETURNS TABLE(objectid text, n text, v
> text, vt integer) LANGUAGE sql AS $function$
> SELECT objectid::text
> , i->>'n'::text
> , i->>'v'::text
> , (i->>'vt') :: INT as vt
> FROM request r
> , jsonb_array_elements(data -> 'i') i WHERE objectid =
> ANY($1)
> AND productid=$2
> $function$
> stable ;
> explain
> SELECT objectid::text, n::text, v::text, vt::int FROM
> steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'],
> 1);
> --- CUT ---
> and I got:
> QUERY
> PLAN
>
> Nested Loop (cost=0.15..11.42 rows=100 width=100)
>-> Index Scan using request_objectid_productid_idx on request r
> (cost=0.15..8.17 rows=1 width=64)
> Index Cond: ((objectid = ANY
> ('{5ab8e0ca-abb5-48c9-a95e-2bb2a375d903,adcbe251-6723-48a8-8385-55133fab704a}'::text[]))
> AND (productid = 1))
>-> Function Scan on jsonb_array_elements i (cost=0.01..1.00 rows=100
> width=32)
> (4 rows)
> which is what I expected from successful inlining of the function.
> So there are some moving parts in your situation that you've not told us
> about.
> regards, tom lane
Hi Tom,
I ran your code and got similar results so I agree there is more for me to dig
into. I see STABLE has a limit of not allowing DML in the function. I am
testing against a very simple function here but we have many other functions
with the same problem but those also contain DML so even if I got the STABLE to
work in this one test case it does not appear I could use in all of my
functions with this problem. Are there other options here besides the STABLE
option that would work for functions that also contain DML?
Thanks
Steve
Re: Problem getting query to use index inside a function
"Dirschel, Steve" writes: > Here is the function I'm having difficulties with: > CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[], > param_productid integer DEFAULT 1) > RETURNS TABLE(objectid text, n text, v text, vt integer) > LANGUAGE sql > AS $function$ > SELECT objectid::text > , i->>'n'::text > , i->>'v'::text > , (i->>'vt') :: INT as vt > FROM request r > , jsonb_array_elements(data -> 'i') i > WHERE objectid = ANY($1) > AND productid=$2 > $function$ > ; > Query: > shared=> explain (analyze, buffers) > shared-> SELECT objectid::text, n::text, v::text, vt::int FROM > steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251-6723-48a8-8385-55133fab704a'], > 1); >QUERY PLAN > > Function Scan on steve1 (cost=0.25..10.25 rows=1000 width=100) (actual > time=42.694..42.942 rows=3368 loops=1) I think you would have better luck if the planner were "inlining" this function, which we can see it's not since you get a Function Scan on steve1 rather than the contained query. I think the only thing stopping that from happening is that the function is (by default) VOLATILE. Try marking it STABLE so that it can share the calling query's snapshot. (v18 should handle such cases better than previous versions, BTW. But you'd still be better off marking the function STABLE.) regards, tom lane
RE: [EXT] Re: Problem getting query to use index inside a function
> > Here is the function I'm having difficulties with:
> > CREATE OR REPLACE FUNCTION public.steve1(param_requestid text[],
> > param_productid integer DEFAULT 1) RETURNS TABLE(objectid text, n
> > text, v text, vt integer) LANGUAGE sql AS $function$
> > SELECT objectid::text
> > , i->>'n'::text
> > , i->>'v'::text
> > , (i->>'vt') :: INT as vt
> > FROM request r
> > , jsonb_array_elements(data -> 'i') i WHERE objectid =
> > ANY($1)
> > AND productid=$2
> > $function$
> > ;
> > Query:
> > shared=> explain (analyze, buffers)
> > shared-> SELECT objectid::text, n::text, v::text, vt::int FROM
> > shared-> steve1(ARRAY['5ab8e0ca-abb5-48c9-a95e-2bb2a375d903','adcbe251
> > shared-> -6723-48a8-8385-55133fab704a'], 1);
> >QUERY PLAN
> > --
> > --
> > Function Scan on steve1 (cost=0.25..10.25 rows=1000 width=100)
> > (actual time=42.694..42.942 rows=3368 loops=1)
> I think you would have better luck if the planner were "inlining"
> this function, which we can see it's not since you get a Function Scan on
> steve1 rather than the contained query.
> I think the only thing stopping that from happening is that the function is
> (by default) VOLATILE. Try marking it STABLE so that it can share the
> calling query's snapshot.
> (v18 should handle such cases better than previous versions, BTW.
> But you'd still be better off marking the function STABLE.)
> regards, tom lane
Thanks for the reply, but that did not seem to help. I tried using both the {}
and ARRAY[] syntax but both are still full scanning based on the shared
buffers. Any other ideas?
shared=> ALTER FUNCTION public.steve1 (param_requestid text[], param_productid
integer) STABLE;
ALTER FUNCTION
shared=> explain (analyze, buffers)
shared->SELECT objectid::text, n::text, v::text, vt::int FROM
steve1('{83e3326a-62fe-45bc-81e5-1e9fb9a84d31}', 'PJJ', 1, 1);
QUERY PLAN
-
Function Scan on steve1 (cost=0.25..10.25 rows=1000 width=100) (actual
time=64.465..64.465 rows=0 loops=1)
Buffers: shared hit=16572
Planning Time: 0.033 ms
Execution Time: 64.485 ms
(4 rows)
shared=>
shared=> explain (analyze, buffers)
shared-> SELECT objectid::text, n::text, v::text, vt::int FROM
steve1(ARRAY['83e3326a-62fe-45bc-81e5-1e9fb9a84d31'], 'PJJ', 1, 1);
QUERY PLAN
-
Function Scan on steve1 (cost=0.25..10.25 rows=1000 width=100) (actual
time=63.749..63.749 rows=0 loops=1)
Buffers: shared hit=16569
Planning Time: 0.043 ms
Execution Time: 63.766 ms
(4 rows)
