from:"Shanti\-Dominique"

Efficient rows filter for array inclusion with gin index

2024-02-28 Thread Shanti-Dominique


Hello,

I have a problem writing performant queries that requires to test value 
inclusion within an array.


First: the schema consists of an "items" table with a "ref_id" primary 
key (uuid, primary key). The items are hierarchical and can have a 
"parent_ref_id" (uuid) that references their parent. In order to avoid 
recursive queries,there is a secondary table "item_paths" populated via 
triggers that have two columns "ref_id" (uuid that references a row in 
"items") and "item_path" (uuid[] which contains the path of items from 
the root to the item included, a gin index).


On order to test if an item i2 isa descendant of another item i1, I see 
two ways to query the database:


1)
    SELECT  *
    FROM    items i1
    JOIN item_paths p1 ON i1.ref_id = p1.ref_id
    JOIN items i2 ON i2.ref_id = ANY (p1.item_path)
    WHERE   ...

2)
    SELECT  *
    FROM    items i1
    JOIN item_paths p1 ON i1.ref_id = p1.ref_id
    JOIN items i2 ON ARRAY[i2.ref_id] <@ p1.item_path
    WHERE   ...

The is that neither of these two solutions seems good for the general case:

1) does not make use of the gin index as the "= ANY(...)" construct is 
not part of the supported operators. What it seems to be doing is that 
it runs a sequential scan against p1 while it uses the index to find the 
item i2.


2) uses the operator <@ which is supported by the gin index, the test 
for inclusion is fast and the query does not run a sequential scan over 
the whole "item_paths" table. However, because of the ARRAY[i2.ref_id] 
construct, it performs a sequential scan on i2.


I use this kind of construct in many places in my code and I'd welcome a 
solution that would use the index in all cases. Is it possible?


Thank you.

Here below a SQL file that demonstrate the problem using EXPLAIN:

-- Database Schema

SET enable_seqscan  = off;

CREATE TABLE items (
    ref_id uuid DEFAULT public.gen_random_uuid() NOT NULL,
    name character varying,
    parent_ref_id uuid
);

CREATE TABLE item_paths (
    ref_id uuid NOT NULL,
    item_path uuid[]
);

CREATE INDEX items_ref_id_idx ON items USING btree (ref_id);
CREATE INDEX items_name_idx ON items USING btree (name);
CREATE INDEX items_parent_ref_id_idx ON items USING btree (parent_ref_id);
CREATE INDEX item_paths_ref_id_idx ON item_paths USING btree (ref_id);
CREATE INDEX item_paths_item_path_idx ON item_paths USING gin (item_path);



-- Fast: select the small number of items i1, join with their paths and 
from the

-- path values take all the values from i2 using the index on i2

EXPLAIN
SELECT  i2.*
FROM    items i1
    JOIN item_paths p1 ON i1.ref_id = p1.ref_id
    JOIN items i2 ON i2.ref_id = ANY (p1.item_path)
WHERE   i1.name = 'a';

-- Nested Loop  (cost=5.52..102.86 rows=903 width=64)
--   ->  Nested Loop  (cost=5.37..46.14 rows=21 width=32)
-- ->  Bitmap Heap Scan on items i1 (cost=4.18..12.64 rows=4 
width=16)

--   Recheck Cond: ((name)::text = 'a'::text)
--   ->  Bitmap Index Scan on items_name_idx 
(cost=0.00..4.18 rows=4 width=0)

-- Index Cond: ((name)::text = 'a'::text)
-- ->  Bitmap Heap Scan on item_paths p1 (cost=1.19..8.32 rows=5 
width=48)

--   Recheck Cond: (ref_id = i1.ref_id)
--   ->  Bitmap Index Scan on item_paths_ref_id_idx  
(cost=0.00..1.19 rows=5 width=0)

-- Index Cond: (ref_id = i1.ref_id)
--   ->  Index Scan using items_ref_id_idx on items i2 (cost=0.15..2.27 
rows=43 width=64)

-- Index Cond: (ref_id = ANY (p1.item_path))



-- Slow: select the small number of items then performs a sequantial 
scan on the

-- item_paths to find the paths that have this item ref_id in their path.

EXPLAIN
SELECT  item_paths.*
FROM    items
    JOIN item_paths ON items.ref_id = ANY (item_paths.item_path)
WHERE   items.name = 'a';

-- Nested Loop  (cost=104.18..1000140.35 rows=209 width=48)
--   Join Filter: (items.ref_id = ANY (item_paths.item_path))
--   ->  Seq Scan on item_paths (cost=100.00..120.70 
rows=1070 width=48)

--   ->  Materialize  (cost=4.18..12.66 rows=4 width=16)
-- ->  Bitmap Heap Scan on items  (cost=4.18..12.64 rows=4 width=16)
--   Recheck Cond: ((name)::text = 'a'::text)
--   ->  Bitmap Index Scan on items_name_idx 
(cost=0.00..4.18 rows=4 width=0)

-- Index Cond: ((name)::text = 'a'::text)



-- Slow: Find by index i1 and p1, then perform a sequantial scan on i2 
(probably

-- caused by the construct ARRAY[i1.ref_id]) to find the matching item

EXPLAIN
SELECT  i2.*
FROM    items i1
    JOIN item_paths p1 ON i1.ref_id = p1.ref_id
    JOIN items i2 ON ARRAY[i2.ref_id] <@ p1

Re: Efficient rows filter for array inclusion with gin index

2024-02-28 Thread Shanti-Dominique


Replying to myself after more investigation.

On 28/02/2024 12:05, Shanti-Dominique wrote:




2)
    SELECT  *
    FROM    items i1
    JOIN item_paths p1 ON i1.ref_id = p1.ref_id
    JOIN items i2 ON ARRAY[i2.ref_id] <@ p1.item_path
    WHERE   ...

2) uses the operator <@ which is supported by the gin index, the test 
for inclusion is fast and the query does not run a sequential scan 
over the whole "item_paths" table. However, because of the 
ARRAY[i2.ref_id] construct, it performs a sequential scan on i2.
I was under the assumption that the ARRAY[] construct prevented 
postgresql from efficiently using the index on the other side of the 
operator, but I think I was mistaken. On a database full of data, I 
tried getting around this but did not see any improvement of performance.


First I tried to add an index on the single element array:

CREATE FUNCTION uuidarr(ref_id uuid) RETURNS uuid[]
  LANGUAGE SQL
  IMMUTABLE
  RETURNS NULL ON NULL INPUT
  RETURN ARRAY[ref_id];

CREATE INDEX items_ref_id_arr2_idx ON items USING gin (uuidarr(ref_id));

EXPLAIN
SELECT  i2.*
FROM    items i1
    JOIN item_paths p1 ON i1.ref_id = p1.ref_id
    JOIN items i2 ON uuidarr(i2.ref_id) <@ p1.item_path
WHERE   i1.name = 'a';


The performance was even worse. Then I tried with a generated column:

CREATE TABLE items (
    ref_id uuid DEFAULT public.gen_random_uuid() NOT NULL,
    ref_id_array uuid[] GENERATED ALWAYS AS (uuidarr(ref_id)) STORED,
    name character varying,
    parent_ref_id uuid
);

CREATE INDEX items_ref_id_array_idx ON items USING gin (ref_id_array);

EXPLAIN
SELECT  i2.*
FROM    items i1
    JOIN item_paths p1 ON i1.ref_id = p1.ref_id
    JOIN items i2 ON i2.ref_id_array <@ p1.item_path
WHERE   i1.name = 'a';

The performance was very similar to the query with ARRAY[...]

It seems there is no good solution for the general case, apart from 
changing the structure of my dataset and removing the use of arrays 
entirely.


I think I'll update my codebase and use <@ where it makes sense and = 
ANY in other places, but it'll be difficult to know for sure without 
running the query which one will be better.

Efficient rows filter for array inclusion with gin index

Re: Efficient rows filter for array inclusion with gin index

2 matches

Site Navigation

Mail list logo

Footer information