Unnecessary buffer usage with multicolumn index, row comparison, and equality constraint
Hi everyone, first time here. Please kindly let me know if this is not the right place to ask.

I've noticed that a simple query can read a lot of buffer blocks in a meaningless way when:

1. there is an index scan on a multicolumn index,
2. there is a row constructor comparison in the Index Cond, and
3. there is also an equality constraint on the leftmost column of the multicolumn index.

## How to reproduce

I initially noticed it on AWS Aurora RDS, but it can be reproduced in a Docker container as well.

```bash
docker run --name test-postgres -e POSTGRES_PASSWORD=mysecretpassword -d -p 5432:5432 postgres:16.3
```

Create a table with a multicolumn index, and populate it with 12 million rows of random integers.

```sql
CREATE TABLE t(a int, b int);
CREATE INDEX my_idx ON t USING BTREE (a, b);
INSERT INTO t(a, b)
SELECT (random() * 123456)::int AS a,
       (random() * 123456)::int AS b
FROM generate_series(1, 12345678);
ANALYZE t;
```

A simple query that uses the multicolumn index:

```
postgres=# explain (analyze, buffers) select * from t where row(a, b) > row(123450, 123450) and a = 0 order by a, b;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Index Only Scan using my_idx on t  (cost=0.43..8.46 rows=1 width=8) (actual time=284.312..284.314 rows=0 loops=1)
   Index Cond: ((ROW(a, b) > ROW(123450, 123450)) AND (a = 0))
   Heap Fetches: 0
   Buffers: shared hit=3777 read=37304 written=11713
 Planning:
   Buffers: shared hit=22 read=4
 Planning Time: 0.270 ms
 Execution Time: 284.341 ms
(8 rows)
```

## Expected output

The number of buffer blocks used is high. I expect it to be no more than when there is only one of the two constraints.

```
postgres=# explain (analyze, buffers) select * from t where row(a, b) > row(123450, 123450) order by a, b;
                                                     QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
 Index Only Scan using my_idx on t  (cost=0.43..23.67 rows=642 width=8) (actual time=0.030..0.158 rows=542 loops=1)
   Index Cond: (ROW(a, b) > ROW(123450, 123450))
   Heap Fetches: 0
   Buffers: shared hit=254 read=3
 Planning:
   Buffers: shared read=4
 Planning Time: 0.232 ms
 Execution Time: 0.206 ms
(8 rows)

postgres=# explain (analyze, buffers) select * from t where a = 0 order by a, b;
                                                    QUERY PLAN
------------------------------------------------------------------------------------------------------------------
 Index Only Scan using my_idx on t  (cost=0.43..6.20 rows=101 width=8) (actual time=0.099..0.113 rows=57 loops=1)
   Index Cond: (a = 0)
   Heap Fetches: 0
   Buffers: shared hit=27 read=2
 Planning Time: 0.081 ms
 Execution Time: 0.131 ms
(6 rows)
```

## Postgres version

16.3

## Platform information

I can reproduce it on the latest postgres Docker image, which is based on Debian Linux. I originally found the issue on AWS Aurora.

The following are my own observations and thoughts. Please disregard them if they are a distraction.

For the general form

```sql
select * from t where (a, b) > (x, y) and a = z order by a, b;
```

1. The number of buffer blocks read is proportional to the gap between x and z. Strictly, it is max(0, min(x, max(a)) - max(z, min(a))). For example, in the plans below, min(a) = 0, so z = -3 and z = 0 clamp to the same effective lower bound, and both runs read the same 41080 buffers.
```
postgres=# explain (analyze, buffers) select * from t where row(a, b) > row(123450, 123450) and a = -3 order by a, b;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Index Only Scan using my_idx on t  (cost=0.43..4.45 rows=1 width=8) (actual time=243.173..243.175 rows=0 loops=1)
   Index Cond: ((ROW(a, b) > ROW(123450, 123450)) AND (a = '-3'::integer))
   Heap Fetches: 0
   Buffers: shared hit=1 read=41080
 Planning:
   Buffers: shared hit=2 read=2
 Planning Time: 0.174 ms
 Execution Time: 243.199 ms
(8 rows)

postgres=# explain (analyze, buffers) select * from t where row(a, b) > row(123450, 123450) and a = 0 order by a, b;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Index Only Scan using my_idx on t  (cost=0.43..4.45 rows=1 width=8) (actual time=230.425..230.426 rows=0 loops=1)
   Index Cond: ((ROW(a, b) > ROW(123450, 123450)) AND (a = 0))
   Heap Fetches: 0
   Buffers: shared hit=1 read=41080
 Planning:
   Buffers: shared read=4
 Planning Time: 0.296 ms
 Execution Time: 230.460 ms
(8 rows)

postgres=# explain (analyze, buffers) select * from t where row(a, b) > row(123450, 123450) and a = 3 order by a, b;
                                                    QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 Index Only Scan using
```
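Not part of the report above, but for anyone hitting this today, the predicate can be simplified by hand before it ever reaches the index. By definition, `(a, b) > (x, y)` means `a > x OR (a = x AND b > y)`, so combining it with `a = z` collapses to a plain condition on the index prefix. A minimal sketch of the rewrite, under that standard row-comparison semantics:

```sql
-- Hand-simplified equivalents of: (a, b) > (x, y) AND a = z
--   z < x : contradiction, no row can match
--   z = x : a = z AND b > y
--   z > x : a = z
-- For the slow query above (x = y = 123450, z = 0, hence z < x),
-- the result is provably empty and needs no index scan at all:
select * from t where false;

-- For the boundary case z = x, the fast equivalent is:
select * from t where a = 123450 and b > 123450 order by a, b;
```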
Re: Wasteful nested loop join when there is `limit` in the query
Thank you for your help, Tom. You are right. I added an index on employee.name (by making the column unique), and postgres can then visit the employee table in pre-sorted order and exit early without joining more rows.

Just sharing the tweak I made to the example, in case anyone else is interested in a quick test. I also populated 1 million employee rows, so the example is no longer a toy demo.

```sql
drop table if exists department;
drop table if exists employee;

create table department(
    id int primary key,
    name text);

create table employee(
    id int primary key,
    name text unique,
    department_id int);

INSERT INTO department (id, name)
SELECT i + 1, 'department' || (i + 1)
FROM generate_series(0, 9) AS i;

INSERT INTO employee (id, name, department_id)
SELECT i + 1, 'name' || (i + 1), i % 10 + 1
FROM generate_series(0, 999999) AS i;

analyze department;
analyze employee;

explain analyze
select *
from employee
left outer join department on employee.department_id = department.id
order by employee.name
limit 10;
```

And here is the plan:

```
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..1.36 rows=10 width=34) (actual time=0.017..0.030 rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=0.57..78630.06 rows=1000000 width=34) (actual time=0.016..0.028 rows=10 loops=1)
         ->  Index Scan using employee_name_key on employee  (cost=0.42..54855.68 rows=1000000 width=18) (actual time=0.008..0.015 rows=10 loops=1)
         ->  Memoize  (cost=0.15..0.16 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=10)
               Cache Key: employee.department_id
               Cache Mode: logical
               Hits: 6  Misses: 4  Evictions: 0  Overflows: 0  Memory Usage: 1kB
               ->  Index Scan using department_pkey on department  (cost=0.14..0.15 rows=1 width=16) (actual time=0.001..0.001 rows=1 loops=4)
                     Index Cond: (id = employee.department_id)
 Planning Time: 0.189 ms
 Execution Time: 0.045 ms
(11 rows)
```

Personally, I still wish that someday postgres could push a `limit` node down together with a `sort` node when certain conditions are met, so that there's no need to add an index :D

Thank you again for your help!

On Mon, 17 Feb 2025 at 18:01, Tom Lane wrote:
> WU Yan <4wu...@gmail.com> writes:
> > Hello everyone, I am still learning postgres planner and performance
> > optimization, so please kindly point out if I missed something obvious.
>
> An index on employee.name would likely help here.  Even if we had
> an optimization for pushing LIMIT down through a join (which you
> are right, we don't) it could not push the LIMIT through a sort step.
> So you need presorted output from the scan of "employee".  I think
> this example would behave better with that.  You may also need to
> test with non-toy amounts of data to get the plan you think is
> better: an example with only half a dozen rows is going to be
> swamped by startup costs.
>
> regards, tom lane
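A side note on the tweak, not from the thread itself: if the unique constraint is only there to get the index, a plain btree index on name should serve the same purpose. A minimal sketch, assuming no uniqueness is actually required:

```sql
-- Hypothetical alternative to "name text unique": a plain (non-unique)
-- btree index still returns rows in name order, so the planner can stop
-- the scan as soon as the LIMIT is satisfied.
CREATE INDEX employee_name_idx ON employee (name);
```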
Wasteful nested loop join when there is `limit` in the query
Hello everyone, I am still learning the postgres planner and performance optimization, so please kindly point out if I missed something obvious.

I've noticed that postgres joins all rows in two tables even though there's a `limit` in the query. This means lots of joined rows are discarded eventually, and performance could be better if the planner applied the `limit` before the `join`.

Here's my example. I am running postgres 17.3 in a container:

```sh
docker run --rm -e POSTGRES_PASSWORD=mysecretpassword -d -p 5432:5432 postgres:17.3
```

```sql
create table department(
    id int primary key,
    name text);

create table employee(
    id int primary key,
    name text,
    department_id int);

insert into department values
    (1, 'sales'), (2, 'support'), (3, 'research'), (4, 'management'), (5, 'hr');

insert into employee values
    (101, 'alice', 1), (102, 'bob', 2), (103, 'ccc', 3), (104, 'ddd', 4), (105, 'eee', 999);

analyze department;
analyze employee;

set enable_mergejoin = off;
set enable_hashjoin = off;

explain analyze
select *
from employee
left outer join department on employee.department_id = department.id
order by employee.name
limit 2;
```

In this simple setup, I want to see the top 2 employees by name and check their department info. Here's the query plan:

```
                                                       QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2.49..2.49 rows=2 width=23) (actual time=0.037..0.038 rows=2 loops=1)
   ->  Sort  (cost=2.49..2.50 rows=5 width=23) (actual time=0.036..0.037 rows=2 loops=1)
         Sort Key: employee.name
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Nested Loop Left Join  (cost=0.00..2.44 rows=5 width=23) (actual time=0.014..0.021 rows=5 loops=1)
               Join Filter: (employee.department_id = department.id)
               Rows Removed by Join Filter: 11
               ->  Seq Scan on employee  (cost=0.00..1.05 rows=5 width=12) (actual time=0.006..0.007 rows=5 loops=1)
               ->  Materialize  (cost=0.00..1.07 rows=5 width=11) (actual time=0.001..0.001 rows=3 loops=5)
                     ->  Seq Scan on department  (cost=0.00..1.05 rows=5 width=11) (actual time=0.001..0.002 rows=5 loops=1)
 Planning Time: 0.248 ms
 Execution Time: 0.062 ms
(12 rows)
```

Note `loops=5` in the plan. My understanding is that all 5 rows in table employee are joined, although we only need 2.

Joining before limiting is necessary when we do not know whether the join will create or eliminate rows. However, in this setup it's a left outer join on a primary key, so we know one employee row will produce exactly one output row. A similar assumption can be made for `a inner join b on a.x = b.x` when a.x is a foreign key referencing b.x and b.x has a unique constraint.

At the moment, my workaround is to rewrite the query with a CTE, as shown below. I need to write `order by ... limit ...` twice, and `loops=2` is observed.
```
postgres=# explain analyze
with t as (
    select * from employee order by employee.name limit 2
)
select *
from t
left outer join department on t.department_id = department.id
order by t.name
limit 2;
                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.10..2.32 rows=2 width=23) (actual time=0.045..0.049 rows=2 loops=1)
   ->  Nested Loop Left Join  (cost=1.10..2.32 rows=2 width=23) (actual time=0.044..0.047 rows=2 loops=1)
         Join Filter: (employee.department_id = department.id)
         Rows Removed by Join Filter: 1
         ->  Limit  (cost=1.10..1.10 rows=2 width=12) (actual time=0.032..0.033 rows=2 loops=1)
               ->  Sort  (cost=1.10..1.11 rows=5 width=12) (actual time=0.032..0.033 rows=2 loops=1)
                     Sort Key: employee.name
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Seq Scan on employee  (cost=0.00..1.05 rows=5 width=12) (actual time=0.015..0.016 rows=5 loops=1)
         ->  Materialize  (cost=0.00..1.07 rows=5 width=11) (actual time=0.004..0.004 rows=2 loops=2)
               ->  Seq Scan on department  (cost=0.00..1.05 rows=5 width=11) (actual time=0.004..0.005 rows=2 loops=1)
 Planning Time: 0.196 ms
 Execution Time: 0.103 ms
(13 rows)
```

I wonder whether there's a better way to influence the planner to push the `limit` down to the `employee` table first, or whether this optimization strategy is simply not in the planner yet. The demo is trivial, but in some real cases it's possible to select the top 100 from a million rows. Sorting a million rows is slow but necessary and acceptable; a wasteful nested loop join on millions of rows, however, can make performance much slower. Even worse, the wasteful join can happen multiple times when more than two tables are joined.

Thank you
Best regards
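One more detail worth knowing about the workaround above: since PostgreSQL 12, a CTE that is referenced only once (and has no side effects) is inlined by the planner, so a plain derived table should produce the same plan shape with slightly less repetition. A minimal sketch of the same workaround in that form (the alias top_emp is illustrative):

```sql
-- Same workaround as the CTE version, written as a derived table.
-- The inner order by / limit bounds the employee scan to 2 rows before
-- the join; the outer order by keeps the final ordering explicit.
select *
from (select * from employee order by name limit 2) as top_emp
left outer join department on top_emp.department_id = department.id
order by top_emp.name;
```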