Re: Multicolumn index scan efficiency

2025-11-09 Thread Peter Geoghegan
On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn  wrote:
> I am wondering about 2 things:
> 1) Does anyone know which specific change / version made it fast?
> 2) What was the proper way to do a range index scan like WHERE (a,b,c) 
> between (x1,y1,z1) and (x2,y2,z2) before the improvement.
> Note that my tests can mostly be rewritten as equality at least for some 
> columns (and this is what we'll do), but sometimes we do need a range scan 
> like above, so understanding it would be important. Also I am curious :).

This improvement you're seeing here is down to work in commit
bd3f59fd. The short version is that the way we used to decide when a
condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly
conservative. If there were many "a" values equal to x2, we'd have to
scan the index until we got to the next distinct/non-equal "a" value
-- without realizing that we're already past the point where there
cannot possibly be any more matches.

See the discussion on this thread which complained about the problem,
particularly my response to the complaint:

https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69

--
Peter Geoghegan




Multicolumn index scan efficiency

2025-11-09 Thread Vitalii Tymchyshyn
Hi.

Since Friday I have been trying to diagnose the slowness of scanning the
multicolumn index in Postgres. I figured out that multicolumns conditions
like (a,b,c) > (x,y,z) with a,b and c being part of primary index work slow
in postgresql 9.6 (original version in prod, I know it's old). I thought
that it would be fast in 15 (one I had handy), but it was still slow. It
was finally fast in 18. Note that all 3 are Google CloudSQL, but I believe
those are pretty vanilla in terms of index scan internals.

I am wondering about 2 things:
1) Does anyone know which specific change / version made it fast?
2) What was the proper way to do a range index scan like WHERE (a,b,c)
between (x1,y1,z1) and (x2,y2,z2) before the improvement.
Note that my tests can mostly be rewritten as equality at least for some
columns (and this is what we'll do), but sometimes we do need a range scan
like above, so understanding it would be important. Also I am curious :).

The full test will be towards the end, but the query I started from is
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default',
'test_application_pipeline_with__simulated_long_name_075000', '')
  AND (namespace, application) <= ('default',
'test_application_pipeline_with__simulated_long_name_075000')
  AND (latest = true OR latest is NULL);

Plans:
--- 9.6
 Index Scan using application_specs_pkey on application_specs
 (cost=0.56..6.33 rows=1 width=198) (actual rows=1 loops=1)
   Index Cond: ((ROW(namespace, application, version) >
ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text,
''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text)))
   Filter: (latest OR (latest IS NULL))
   Rows Removed by Filter: 29
   Buffers: shared hit=35442
 Planning time: 0.153 ms
 Execution time: 5049.123 ms
(7 rows)
--- 15
 Index Scan using application_specs_pkey on application_specs
 (cost=0.56..6.33 rows=1 width=206) (actual rows=1 loops=1)
   Index Cond: ((ROW(namespace, application, version) >
ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text,
''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text)))
   Filter: (latest OR (latest IS NULL))
   Rows Removed by Filter: 29
   Buffers: shared hit=45839
 Planning Time: 0.171 ms
 Execution Time: 8190.116 ms
(7 rows)

--- 18
 Index Scan using application_specs_pkey on application_specs
 (cost=0.56..6.33 rows=1 width=206) (actual rows=1.00 loops=1)
   Index Cond: ((ROW(namespace, application, version) >
ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text,
''::text)) AND (ROW(namespace, application) <= ROW('default'::text,
'test_application_pipeline_with__simulated_long_name_075000'::text)))
   Filter: (latest OR (latest IS NULL))
   Rows Removed by Filter: 29
   Index Searches: 1
   Buffers: shared hit=35
 Planning:
   Buffers: shared hit=29
 Planning Time: 0.243 ms
 Execution Time: 0.155 ms
(10 rows)

So, it's index scan in all cases, but for some reason index scan is very
inefficient in the old versions and it has to scan a huge portion of index
to find a few rows.

--- Full test

-- 0. DISABLE SEQSCAN to ensure index is used
SET ENABLE_SEQSCAN=false;

-- 1. SETUP: Clean slate
DROP TABLE IF EXISTS application_specs;

-- 2. SCHEMA
CREATE TABLE application_specs (
namespace text NOT NULL,
application text NOT NULL,
version text NOT NULL,
latest boolean,
data text,
PRIMARY KEY (namespace, application, version) WITH (fillfactor = 90)
);

-- 3. GENERATE DATA: ~4.5 Million Rows Total
-- 150,000 Applications * ~30 versions each
INSERT INTO application_specs (namespace, application, version, latest,
data)
SELECT
'default',
'test_application_pipeline_with__simulated_long_name_' || lpad(a::text,
6, '0'),
md5(random()::text || clock_timestamp()::text)::uuid::text,
(v = 30), -- Make the 30th version the 'latest' one
repeat('x', 100)
FROM generate_series(1, 15) a
CROSS JOIN generate_series(1, 30) v;

-- 4. ANALYZE to ensure planner has accurate stats
VACUUM (ANALYZE, FREEZE) application_specs;

-- =
-- THE TEST
-- Target a middle application: '...075000'
-- =

---
-- TEST A: Original Row-wise Query (Should be slow)
---
EXPLAIN (ANALYZE, BUFFERS, TIMING OFF)
SELECT * FROM application_specs
WHERE (namespace, application, version) > ('default',
'test_application_pipeline_with__simulated_long_name_075000', '')
  AND (namespace, application) <= ('default',
'test_application_pipeline_with__simulated_long_name_075000')
  AND (latest = true OR latest is NULL);

--

Re: Multicolumn index scan efficiency

2025-11-09 Thread Vitalii Tymchyshyn
Thank you so much for both clarifying and fixing it!
In our case (FYI, this is from http://github.com/cdapio/cdap) a lot of
users have just a single namespace, so it effectively means scanning till
the end of the index.
We'll fix
https://github.com/cdapio/cdap/blob/develop/cdap-data-fabric/src/main/java/io/cdap/cdap/spi/data/sql/PostgreSqlStructuredTable.java
to detect equality scan prefixes and make corresponding SQL. That would fix
it for all postgres versions.

Best regards, Vitalii Tymchyshyn

нд, 9 лист. 2025 р. о 20:21 Peter Geoghegan  пише:

> On Sun, Nov 9, 2025 at 9:44 PM Vitalii Tymchyshyn  wrote:
> > I am wondering about 2 things:
> > 1) Does anyone know which specific change / version made it fast?
> > 2) What was the proper way to do a range index scan like WHERE (a,b,c)
> between (x1,y1,z1) and (x2,y2,z2) before the improvement.
> > Note that my tests can mostly be rewritten as equality at least for some
> columns (and this is what we'll do), but sometimes we do need a range scan
> like above, so understanding it would be important. Also I am curious :).
>
> This improvement you're seeing here is down to work in commit
> bd3f59fd. The short version is that the way we used to decide when a
> condition like "WHERE (a,b,c) <= (x2,y2,z2)" was needlessly
> conservative. If there were many "a" values equal to x2, we'd have to
> scan the index until we got to the next distinct/non-equal "a" value
> -- without realizing that we're already past the point where there
> cannot possibly be any more matches.
>
> See the discussion on this thread which complained about the problem,
> particularly my response to the complaint:
>
>
> https://www.postgresql.org/message-id/flat/CAH2-WzmLREy6r68A6SEHXnstg01kNs1HiQtOvSO5cTvWuaducw%40mail.gmail.com#62e393ac8bbf06f0f73598ba2ceeab69
>
> --
> Peter Geoghegan
>