Re: Window partial fetch optimization

2022-05-04 Thread Jeff Janes
On Tue, May 3, 2022 at 2:11 PM Levi Aul  wrote:

> I have a “temporal table” — a table where there are multiple “versions” of
> entities, with each version having a distinct timestamp:
> CREATE TABLE contract_balance_updates (
> block_id bigint NOT NULL,
> block_signed_at timestamp(0) without time zone NOT NULL,
> contract_address bytea NOT NULL,
> holder_address bytea NOT NULL,
> start_block_height bigint NOT NULL,
> balance numeric NOT NULL
> ) PARTITION BY RANGE (block_signed_at);
>
> -- one for each partition (applied by pg_partman from a template)
> CREATE UNIQUE INDEX contract_balance_updates_pkey
> ON contract_balance_updates(
> holder_address bytea_ops,
> contract_address bytea_ops,
> start_block_height int8_ops DESC
> );
>

How does pg_partman deal with the fact that a unique index on a partitioned
table must contain the partitioning key?

It should be noted that your 3 queries don't return the same thing.  The
last one returns columns holder_address, contract_address, and balance,
while the first returns all columns in the table.  If you were to make the
first query return just the three columns holder_address, contract_address,
and balance and build a suitable index, then you could get it to use an
index-only scan.  This should be similar to (but probably faster than) your
3rd query, without all the kerfuffle of extra scans and dummy syntax.  The
index needed would be:

(holder_address bytea_ops, contract_address bytea_ops, start_block_height,
balance);

Note that in theory it could do a better job of using the index you already
have.  It could compute the row_number using only the data available in the
index, then go fetch the table tuple for just the rows which pass the
row_number filter.  But it just isn't smart enough to do that. (By
separating the WHERE clause from the select list into different queries,
that is essentially what your third query is tricking it into doing)

Cheers,

Jeff


Why is there a Sort after an Index Only Scan?

2022-05-04 Thread André Hänsel
Quick(?) question... why is there a Sort node after an Index Only Scan?
Shouldn't the index already spit out sorted tuples?

CREATE INDEX ON orders_test(shipping_date, order_id);

EXPLAIN ANALYZE SELECT
FROM orders_test
WHERE TRUE
AND shipping_date >= '2022-05-01'
AND shipping_date <= '2022-05-01'
ORDER BY order_id
LIMIT 50;

Limit  (cost=8.46..8.46 rows=1 width=4) (actual time=0.031..0.032 rows=0
loops=1)
  ->  Sort  (cost=8.46..8.46 rows=1 width=4) (actual time=0.025..0.025
rows=0 loops=1)
Sort Key: order_id
Sort Method: quicksort  Memory: 25kB
->  Index Only Scan using orders_test_shipping_date_order_id_idx on
orders_test  (cost=0.43..8.45 rows=1 width=4) (actual time=0.017..0.018
rows=0 loops=1)
  Index Cond: ((shipping_date >= '2022-05-01'::date) AND
(shipping_date <= '2022-05-01'::date))
  Heap Fetches: 0

Fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=7a3bc2421b5de5a2a377bd39b78d1c
d5

I am actually asking because when I skew the distribution a little and
repeat the query I get a rather unfortunate plan:

INSERT INTO orders_test SELECT generate_series(201, 210),
'2022-05-01';
ANALYZE orders_test;

Limit  (cost=0.43..37.05 rows=50 width=4) (actual time=1186.565..1186.593
rows=50 loops=1)
  ->  Index Scan using orders_test_pkey on orders_test  (cost=0.43..74336.43
rows=101502 width=4) (actual time=1186.562..1186.584 rows=50 loops=1)
Filter: ((shipping_date >= '2022-05-01'::date) AND (shipping_date <=
'2022-05-01'::date))
Rows Removed by Filter: 200

Postgres here uses the primary key to get the sort order, so I'm wondering
if there is anything about my index that precludes its use for ORDER BY.






Re: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread Jeff Janes
On Wed, May 4, 2022 at 7:15 PM André Hänsel  wrote:

> Quick(?) question... why is there a Sort node after an Index Only Scan?
> Shouldn't the index already spit out sorted tuples?
>
> CREATE INDEX ON orders_test(shipping_date, order_id);
>
> EXPLAIN ANALYZE SELECT
> FROM orders_test
> WHERE TRUE
> AND shipping_date >= '2022-05-01'
> AND shipping_date <= '2022-05-01'
> ORDER BY order_id
> LIMIT 50;
>

They are sorted by order_id only within sets of the same shipping_date,
which is not good enough.  (It would be good enough if it were smart enough
to know that there is only one possible shipping date to satisfy your weird
range condition.)

Cheers,

Jeff


Re: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread David Rowley
On Thu, 5 May 2022 at 11:15, André Hänsel  wrote:
>
> Quick(?) question... why is there a Sort node after an Index Only Scan?
> Shouldn't the index already spit out sorted tuples?
>
> CREATE INDEX ON orders_test(shipping_date, order_id);
>
> EXPLAIN ANALYZE SELECT
> FROM orders_test
> WHERE TRUE
> AND shipping_date >= '2022-05-01'
> AND shipping_date <= '2022-05-01'
> ORDER BY order_id
> LIMIT 50;

Unfortunately, the query planner is not quite smart enough to realise
that your shipping_date clauses can only match a single value.
There's quite a bit more we could do with the planner's
EquivalanceClasses. There is a patch around to help improve things in
this area but it requires some more infrastructure to make it more
practical to do from a performance standpoint in the planner.

You'll get the plan you want if you requite the query and replace your
date range with shipping_date = '2022-05-01'.  Your use of WHERE TRUE
indicates to me that you might be building this query in an
application already, so maybe you can just tweak that application to
test if the start and end dates are the same and use equality when
they are.

David

[1] https://commitfest.postgresql.org/38/3524/




RE: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread André Hänsel
> They are sorted by order_id only within sets of the same shipping_date, which 
> is not good enough.  

Ah yes, that totally makes sense for the general case.

> so maybe you can just tweak that application to test if the start and end 
> dates are the same and use equality when they are.

I definitely can.

But now I have a followup question, which probably should have been a separate 
question all along. I have modified the example a bit to have a more natural 
date distribution and I got rid of the weird shipping_date condition and 
actually made it different dates, so the index order is out of the picture. I 
also added some statistics so Postgres knows about the relationship between the 
columns.

https://dbfiddle.uk/?rdbms=postgres_14&fiddle=54c7774432e896e3c0e89d8084c4b194

After inserting more rows, Postgres still chooses a scan on the primary key 
instead of using the index.

Limit  (cost=0.43..296.63 rows=50 width=4) (actual time=1052.692..1052.737 
rows=50 loops=1)
  ->  Index Scan using orders_test_pkey on orders_test  (cost=0.43..71149.43 
rows=12010 width=4) (actual time=1052.690..1052.728 rows=50 loops=1)
Filter: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= 
'2022-05-01'::date))
Rows Removed by Filter: 1998734

By setting the CPU costs to 0 (last block in the fiddle) I can force the use of 
the previous plan and as I already suspected it is much better:

Limit  (cost=101.00..101.00 rows=50 width=4) (actual time=4.835..4.843 rows=50 
loops=1)
  ->  Sort  (cost=101.00..101.00 rows=12010 width=4) (actual time=4.833..4.837 
rows=50 loops=1)
Sort Key: order_id
Sort Method: top-N heapsort  Memory: 27kB
->  Index Scan using orders_test_shipping_date_idx on orders_test  
(cost=0.00..101.00 rows=12010 width=4) (actual time=0.026..3.339 rows=11266 
loops=1)
  Index Cond: ((shipping_date >= '2022-04-30'::date) AND 
(shipping_date <= '2022-05-01'::date))

Is it overestimating the cost of the sorting?






Re: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread Tom Lane
=?UTF-8?Q?Andr=C3=A9_H=C3=A4nsel?=  writes:
> Limit  (cost=0.43..296.63 rows=50 width=4) (actual time=1052.692..1052.737 
> rows=50 loops=1)
>   ->  Index Scan using orders_test_pkey on orders_test  (cost=0.43..71149.43 
> rows=12010 width=4) (actual time=1052.690..1052.728 rows=50 loops=1)
> Filter: ((shipping_date >= '2022-04-30'::date) AND (shipping_date <= 
> '2022-05-01'::date))
> Rows Removed by Filter: 1998734

> Is it overestimating the cost of the sorting?

No, but it's guessing it will hit 50 rows that satisfy the filter
before it's gone very far in this index.  If the shipping date and
pkey are correlated in the wrong direction, that could be a very
overoptimistic guess.  I don't think we have adequate stats yet
to detect this sort of problem.

regards, tom lane