Re: Merge David and Goliath tables efficiently

2023-06-19 Thread Alvaro Herrera
I came here to talk about partitionwise join, but then noticed you have
already thought of that:

On 2023-Jun-18, nicolas paris wrote:

> Note that both plans come from the same hash-partitioned table with
> 100 parts, with a unique index on list_id + hash_key. For strategy
> 2.1, I turned on enable_partitionwise_join, since the david table has the
> same partitioning scheme as goliath, including the unique index. In both
> cases the query is:

Hmm, I suppose the reason partitionwise join isn't having any effect is
that the presence of WHEN NOT MATCHED clauses forces an outer join, which
probably disarms partitionwise joining, since each join pair would
need to match for NULLs, so there would be two matching partitions at
the other end.  A quick test for this hypothesis might be to try the
MERGE without the WHEN NOT MATCHED clauses and see if partitionwise join
works better.
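A minimal sketch of such a test (the column in the UPDATE is a placeholder;
the rest follows the quoted query):

  EXPLAIN
  MERGE INTO goliath ca
  USING david AS t
  ON t.list_id = ca.list_id
  WHEN MATCHED THEN
  UPDATE SET some_col = t.some_col;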

Maybe Tom L's new outer-join infrastructure in 16 makes it possible to
improve on this, not sure.

-- 
Álvaro Herrera         Breisgau, Deutschland  —  https://www.EnterpriseDB.com/
"Los dioses no protegen a los insensatos.  Éstos reciben protección de
otros insensatos mejor dotados" (Luis Wu, Mundo Anillo)




Re: Merge David and Goliath tables efficiently

2023-06-19 Thread Tomas Vondra



On 6/18/23 22:57, nicolas paris wrote:
>> ...
>> Well, I kinda doubt you have 17023331531230 rows (not even physically
>> possible with 2TB disk), so that's immediately suspicious.
> 
> Below is the full plan for the strategy 2.1 (Indeed the previous email
> plan was truncated and wrong, sorry for that). 
> 

None of the plans has estimates anywhere close to 17023331531230, so
where did that come from?


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Merge David and Goliath tables efficiently

2023-06-19 Thread Tomas Vondra


On 6/19/23 09:46, Alvaro Herrera wrote:
> I came here to talk about partitionwise join, but then noticed you have
> already thought of that:
> 
> On 2023-Jun-18, nicolas paris wrote:
> 
>> Note that both plans come from the same hash-partitioned table with
>> 100 parts, with a unique index on list_id + hash_key. For strategy
>> 2.1, I turned on enable_partitionwise_join, since the david table has the
>> same partitioning scheme as goliath, including the unique index. In both
>> cases the query is:
> 
> Hmm, I suppose the reason partitionwise join isn't having any effect is
> that the presence of WHEN NOT MATCHED clauses force an outer join, which
> probably disarms partitionwise joining, since each join pair would
> require to match for nulls, so there would be two matching partitions at
> the other end.  A quick test for this hypothesis might be to try the
> MERGE without the WHEN NOT MATCHED clauses and see if partitionwise join
> works better.
> 
> Maybe Tom L's new outer-join infrastructure in 16 allows to improve on
> this, not sure.
> 

Not sure why that would disarm partitionwise join - attached is a simple
reproducer, generating two tables, loading 1000 and 1 rows into
them, and then doing EXPLAIN on a simple MERGE.
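(The attached repro.sql is not reproduced here; a rough sketch of that kind
of setup, with fewer partitions and purely illustrative row counts, could
look like this:)

  CREATE TABLE goliath (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);
  CREATE TABLE david   (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);

  DO $$
  BEGIN
    FOR i IN 0..9 LOOP
      EXECUTE format('CREATE TABLE goliath_%s PARTITION OF goliath
                      FOR VALUES WITH (MODULUS 10, REMAINDER %s)', i, i);
      EXECUTE format('CREATE TABLE david_%s PARTITION OF david
                      FOR VALUES WITH (MODULUS 10, REMAINDER %s)', i, i);
    END LOOP;
  END $$;

  INSERT INTO goliath SELECT i, 'x' FROM generate_series(1, 100000) i;
  INSERT INTO david   SELECT i, 'y' FROM generate_series(1, 1000) i;

  SET enable_partitionwise_join = on;

  EXPLAIN
  MERGE INTO goliath ca
  USING david AS t ON t.id = ca.id
  WHEN MATCHED THEN UPDATE SET val = t.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (t.id, t.val);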

IMHO the thing that breaks it is the ORDER BY in the merge, which likely
acts as an optimization fence and prevents all sorts of smart things
including the partitionwise join. I'd bet that if Nicolas replaces

  MERGE INTO "goliath" ca
  USING (SELECT * FROM "david" ORDER BY "list_id") AS t
  ..

with

  MERGE INTO "goliath" ca
  USING "david" AS t
  ...

it'll start working much better.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

repro.sql
Description: application/sql


Re: Merge David and Goliath tables efficiently

2023-06-19 Thread nicolas paris
On Mon, 2023-06-19 at 13:34 +0200, Tomas Vondra wrote:
> 
> 
> On 6/18/23 22:57, nicolas paris wrote:
> > > ...
> > > Well, I kinda doubt you have 17023331531230 rows (not even
> > > physically
> > > possible with 2TB disk), so that's immediately suspicious.
> > 
> > Below is the full plan for the strategy 2.1 (Indeed the previous
> > email
> > plan was truncated and wrong, sorry for that). 
> > 
> 
> None of the plans has estimates anywhere close to 17023331531230, so
> where did that come from?
> 

Well, this was an old plan from when there was an issue: the david table
did not have the same partitioning scheme as goliath; it was partitioned
by another column.




Re: Merge David and Goliath tables efficiently

2023-06-19 Thread nicolas paris
> IMHO the thing that breaks it is the ORDER BY in the merge, which
> likely
> acts as an optimization fence and prevents all sorts of smart things
> including the partitionwise join. I'd bet that if Nicolas replaces
> 
>   MERGE INTO "goliath" ca
>   USING (SELECT * FROM "david" ORDER BY "list_id") AS t
>   .

Sorry if it was not clear, but there is no ORDER BY in the 2.1
strategy, so that cannot be the reason the optimization is not triggered.

For information, I do enable the partitionwise join feature with a JDBC
call just before the merge:
set enable_partitionwise_join=true




Re: Merge David and Goliath tables efficiently

2023-06-19 Thread Tomas Vondra
On 6/19/23 14:20, nicolas paris wrote:
>> IMHO the thing that breaks it is the ORDER BY in the merge, which
>> likely
>> acts as an optimization fence and prevents all sorts of smart things
>> including the partitionwise join. I'd bet that if Nicolas replaces
>>
>>   MERGE INTO "goliath" ca
>>   USING (SELECT * FROM "david" ORDER BY "list_id") AS t
>>   .
> 
> Sorry if it was not clear, but there is no ORDER BY in the 2.1
> strategy, so that cannot be the reason the optimization is not triggered.
> 
> For information, I do enable the partitionwise join feature with a JDBC
> call just before the merge:
> set enable_partitionwise_join=true
> 

But you wrote that in both cases the query is:

MERGE INTO "goliath" ca
USING (SELECT * FROM "david" ORDER BY "list_id") AS t
ON t."list_id" = ca."list_id"
WHEN MATCHED THEN
UPDATE SET ...
WHEN NOT MATCHED THEN
INSERT (...)
VALUES (...)

With all due respect, I'm getting a bit tired of having to speculate
about what exactly you're doing etc. based on bits of information.

I'm willing to continue to investigate, but only if you prepare a
reproducer, i.e. a SQL script that demonstrates the issue - I don't
think preparing that should be difficult, something like the SQL script
I shared earlier today should do the trick.

I suggest you do that directly, not through JDBC. Perhaps the JDBC
layer does something funny (like connection pooling and resetting
settings).


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company




Re: Merge David and Goliath tables efficiently

2023-06-19 Thread nicolas paris
> But you wrote that in both cases the query is:

That was indeed yet another typo; I hope to do better in the future.


> I'm willing to continue to investigate, but only if you prepare a
> reproducer, 

Thanks for your starter script. Please find attached two scripts which
now illustrate two problems.

repro1.sql is a slight evolution of yours. When I play with the david size
(as described in the comments) you will see the plan go from a nested loop
to a sequential scan. Also note that the partitionwise join is likely
working. This illustrates my initial problem: the sequential scan is not
going to work well on my workload (david is too large). How can I suggest
that Postgres use a nested loop here? (I suspect playing with costs should
help.)


repro2.sql: here I changed the table layout (similar to my setup) to
reproduce the partitionwise join that does not trigger. I added a
partition column, and a unique index to be able to mimic a primary key.
Now partitionwise join (in my local Docker vanilla Postgres 15.3) does not
work. Incidentally, if I do small batches, the merge works fast; that's
another, unrelated problem.


> I suggest you do that directly, not through JDBC. Perhaps the JDBC
> connection pool does something funny (like connection pooling and
> resetting settings).

I can tell the JDBC settings were applied; the reason is more likely my
current table setup.


repro1.sql
Description: application/sql


repro2.sql
Description: application/sql


Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT

2023-06-19 Thread Benoit Tigeot

No it is not. But do you think there is an impact here?



On 18/06/2023 at 23:23, [email protected] wrote:


Hi,

Do you really need to do “select *”?

In other words, is it necessary to have all columns in the result?

Michel SALAIS

From: benoit
Sent: Monday, June 12, 2023 23:35
To: Chris Hoover
Cc: [email protected]
Subject: RE: Forced to use UNION ALL when having multiple ANY
operators and ORDER BY LIMIT


This new index is used, but the read is still 230 MB.

https://explain.dalibo.com/plan/b0f28a9e8a136afd



From: Chris Hoover
Sent: Monday, June 12, 2023 22:55
To: benoit
Cc: [email protected]
Subject: Re: Forced to use UNION ALL when having multiple ANY
operators and ORDER BY LIMIT


I normally create my indexes to match the WHERE clause of the query.
While technically it should not matter, I find that a lot of the time it does.


I would create an index on (status, sender_reference, sent_at) and see
if that improves your query performance.
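Something like this (the index name is made up; DESC on sent_at is an
assumption to match the ORDER BY in the query below):

CREATE INDEX docs_status_sender_sent_at_idx
ON docs (status, sender_reference, sent_at DESC);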





SELECT *
FROM docs
WHERE status IN ('draft', 'sent')
  AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
ORDER BY sent_at DESC

Thanks,

Chris Hoover

Senior DBA

AWeber.com

Cell: (803) 528-2269

Email: [email protected]



On Jun 12, 2023, at 4:17 PM, benoit  wrote:

Hello

I have a database with a few 60 GB tables. Table rows are requested
with multiple ANY or IN operators. I am not able to find an easy
way to make the DB use indexes efficiently. I often hit the index,
but see a spike of 200 MB of IO or disk reads.

I am using version 13, but will soon be on 14.

I wrote a reproduction script on version 14 with plans included.
https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d

I also have plans on a snapshot of the DB with real data.

- The current query that I am trying to improve:
https://explain.dalibo.com/plan/8b8f6e0he9feb551

  - I added the DB schema + indexes in the query view. As you can see, I
have many indexes for testing purposes, to try what the planner can do.

- The optimized query when I have only one ANY and migrate to
UNION ALL for each parameter of the ANY operator:
https://explain.dalibo.com/plan/427gg053d07328ga . The query is as fast
as I would like, but it means generating one branch per parameter and
merging them to get a fast result (a sketch of this rewrite follows the list).

- The new issue I have when I add another ANY operator to the
previously optimized query: big IO/reads.
https://explain.dalibo.com/plan/e7ha9g637b4eh946
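Sketch of the per-value UNION ALL rewrite mentioned above (column names come
from the query quoted earlier; the LIMIT value is only illustrative):

(SELECT * FROM docs
  WHERE status IN ('draft', 'sent') AND sender_reference = 'Custom/1175'
  ORDER BY sent_at DESC LIMIT 20)
UNION ALL
(SELECT * FROM docs
  WHERE status IN ('draft', 'sent') AND sender_reference = 'Client/362'
  ORDER BY sent_at DESC LIMIT 20)
UNION ALL
(SELECT * FROM docs
  WHERE status IN ('draft', 'sent') AND sender_reference = 'Custom/280'
  ORDER BY sent_at DESC LIMIT 20)
ORDER BY sent_at DESC
LIMIT 20;

The idea is that each branch can use an ordered index scan and stop after a
handful of rows, instead of fetching everything matching the IN lists and
sorting it.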

It seems quite impractical to generate a query branch for every
parameter and then merge them. I sometimes have 3-4 ANY operators
with up to 15 elements in each array.

Am I misusing my indexes?

Is there a limitation when using ANY or IN operators with an
ORDER BY ... LIMIT behind them?

Thanks a lot



Index on (fixed size) bytea value

2023-06-19 Thread Les
Dear fellow list members,

I'm in the process of implementing a file storage system that is based on
PostgreSQL and streaming replication. There will possibly be many similar
files stored. I would like to implement block-level deduplication: each
file consists of a series of blocks, and each unique block is stored only
once (e.g. one block may be reused by multiple files). It will be part of a
bigger piece of software, i.e. the same database will be used for other purposes too.

Here is the basic idea for storing individual blocks:


create table block(
  id uuid not null primary key,
  block bytea not null,
  hs256 bytea not null
);
create unique index uidx_block_hs256 on block(hs256);

-- digest() requires the pgcrypto extension
create or replace function trg_biu_block() returns trigger language plpgsql
as
$function$
begin
  new.hs256 = digest(new.block, 'sha256');
  return new;  -- a BEFORE trigger must return the (possibly modified) row
end;
$function$;

create trigger trg_biu_block before insert or update on block for each row
execute procedure trg_biu_block();
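A deduplicating insert against this schema could then look like the
following sketch (:block_data stands for the bound block contents):

insert into block (id, block)
values (gen_random_uuid(), :block_data)
on conflict (hs256) do nothing;

-- the id of the new or already existing block can then be looked up by hash
select id from block where hs256 = digest(:block_data, 'sha256');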

This is just for storing the blocks. I'm going to put this "block" table
into a separate tablespace. File operations will be at least 95% read and
at most 5% write. (Streaming replication will hopefully allow almost
horizontal scaling for read operations.) Most of the files will be several
megabytes in size (images), and some of them will be 100MB or more
(videos). Total capacity is in the 10TB range. Storage will be SSD
(possibly fiber connected, or local RAID, we are not sure yet).

I do not want to use PostgreSQL large objects, because it does not have
block level deduplication.

Here are some things that I need help with:

1. What should be the maximum size of a block? I was trying to find out the
optimal value. Default BLCKSZ is 8192 bytes. AFAIK PostgreSQL does not
allow a row to occupy multiple blocks. I don't know enough to calculate the
optimal "block size" (the max. number of bytes stored in a single row in
the block.block field), but I suspect that it should be 4K or something
similar. I think that it should be as large as possible, without hitting
the toast. My thinking is this: most files will be at least 1MB in size, so
most "block" rows will reach the maximum tuple size. It would be practical
to make one row in the "block" table occupy almost one PostgreSQL block.
2. I'm not sure if it would be beneficial to increase BLCKSZ. I will be
able to test the speed of my (not yet finished) implementation with
different BLCKSZ values, but that alone won't help me make the decision,
because AFAIK BLCKSZ must have a fixed value for the PostgreSQL instance,
so it will affect all other tables in the database. It would be hard to
tell how changing BLCKSZ would affect the system as a whole.
3. In the above example, I used SHA-256 (pgcrypto), because apparently it
is very well optimized for 64 bit machines, and it has practically zero
chance of a collision. I think that sha512 would be an overkill. But I'm
not sure that this is the best choice. Maybe somebody with more experience
can suggest a better hash function.
4. The hs256 value will always be a non-null, fixed 32-byte binary value, but
probably the query planner will not know anything about that. I was also
thinking about bit(256), but I don't see an easy way to convert the bytea
digest into bit(256); a simple type cast won't work here. Maybe using bytea
here is perfectly fine, and creating an index on the hs256 bytea field is
as effective as possible.

I'm not looking for a definitive answer, just trying to get some hints from
more experienced users before I fill up the drives with terabytes of data.

Thank you,

Laszlo


Re: Index on (fixed size) bytea value

2023-06-19 Thread David G. Johnston
On Mon, Jun 19, 2023 at 1:05 PM Les  wrote:

> AFAIK PostgreSQL does not allow a row to occupy multiple blocks.
>

Your plan is going to heavily involve out-of-band storage.  Please read up
on it here:

https://www.postgresql.org/docs/current/storage-toast.html

> I'm not looking for a definitive answer, just trying to get some hints from
> more experienced users before I fill up the drives with terabytes of data.
>
>
Store a hash (and other metadata, like the hashing algorithm) as well as
the path to some system better designed for object storage and retrieval
instead.
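
As a rough sketch of that idea (all names here are made up for illustration):

CREATE TABLE file_object (
    id         uuid PRIMARY KEY,
    hash_algo  text   NOT NULL DEFAULT 'sha256',
    hash       bytea  NOT NULL,
    size_bytes bigint NOT NULL,
    store_key  text   NOT NULL,   -- path/key in the external object store
    UNIQUE (hash_algo, hash)
);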

David J.


Re: Merge David and Goliath tables efficiently

2023-06-19 Thread Tomas Vondra



On 6/19/23 17:45, nicolas paris wrote:
>> But you wrote that in both cases the query is:
> 
> That was indeed yet another typo; I hope to do better in the future.
> 
> 
>> I'm willing to continue to investigate, but only if you prepare a
>> reproducer, 
> 
> Thanks for your starter script. Please find attached two scripts which
> now illustrate two problems.
> 
> repro1.sql is a slight evolution of yours. When I play with the david size
> (as described in the comments) you will see the plan go from a nested loop
> to a sequential scan. Also note that the partitionwise join is likely
> working. This illustrates my initial problem: the sequential scan is not
> going to work well on my workload (david is too large). How can I suggest
> that Postgres use a nested loop here? (I suspect playing with costs should
> help.)
> 

In general, this behavior is expected. The overall idea is that nested
loops are efficient for small row counts, but the cost rises quickly
exactly because they do a lot of random I/O (due to the index scan). At
some point it's cheaper to switch to a plan that does sequential I/O,
which is exactly what's happening here - we switch from a plan doing a
lot of random I/O on goliath

  QUERY PLAN
  --
   Merge on goliath  (cost=0.29..7888.69 rows=0 width=0)
 Merge on goliath_0 goliath_1
 ...
 Merge on goliath_99 goliath_100
 ->  Append  (cost=0.29..7888.69 rows=9998 width=47)
   ->  Nested Loop Left Join  (cost=0.29..93.89 rows=120 ...
 ->  Seq Scan on david_0 david_1  (cost=0.00..2.20 ...
 ->  Index Scan using goliath_0_pkey on goliath_0 ...
   Index Cond: (id = david_1.id)
   ->  Nested Loop Left Join  (cost=0.29..77.10 rows=98 ...
 ->  Seq Scan on david_1 david_2  (cost=0.00..1.98 ...
 ->  Index Scan using goliath_1_pkey on goliath_1 ...
   Index Cond: (id = david_2.id)
   ...
   ->  Nested Loop Left Join  (cost=0.29..74.58 rows=95 ...
 ->  Seq Scan on david_99 david_100  (cost=0.00..1.95
 ->  Index Scan using goliath_99_pkey on goliath_99 ...
   Index Cond: (id = david_100.id)
  (502 rows)

to a plan that does a lot of sequential I/O

  QUERY PLAN
  --
   Merge on goliath  (cost=293.44..264556.47 rows=0 width=0)
 Merge on goliath_0 goliath_1
 ...
 Merge on goliath_99 goliath_100
 ->  Append  (cost=293.44..264556.47 rows=951746 width=47)
   ->  Hash Right Join  (cost=293.44..2597.05 rows=9486 ...
 Hash Cond: (goliath_1.id = david_1.id)
 ->  Seq Scan on goliath_0 goliath_1  (cost=0.00..
 ->  Hash  (cost=174.86..174.86 rows=9486 width=37)
   ->  Seq Scan on david_0 david_1  (cost=0.00..
   ->  Hash Right Join  (cost=295.62..2613.90 rows=9583 width=
 Hash Cond: (goliath_2.id = david_2.id)
 ->  Seq Scan on goliath_1 goliath_2  (cost=0.00..1845
 ->  Hash  (cost=175.83..175.83 rows=9583 width=37)
   ->  Seq Scan on david_1 david_2  (cost=0.00..
   ...
   ->  Hash Right Join  (cost=288.33..2593.16 rows=9348 width
 Hash Cond: (goliath_100.id = david_100.id)
 ->  Seq Scan on goliath_99 goliath_100  (cost=0.00..
 ->  Hash  (cost=171.48..171.48 rows=9348 width=37)
   ->  Seq Scan on david_99 david_100  (cost=0.00..
  (602 rows)

That's expected, because the cost, if I force the Nested Loop with the
higher row count in "david", looks like this:

  QUERY PLAN
  --
   Merge on goliath  (cost=0.29..331253.00 rows=0 width=0)
   ...

Of course, the question is at which point the switch should happen. You
can try setting enable_hashjoin=off, which should push the optimizer to
use the first plan. If it does, you'll know the cost difference between
the two plans.

If you run it and it's actually faster than the "default" plan with
hash joins/seqscans, you can try lowering random_page_cost, which is
likely the main input. The default is 4, in general it should be higher
than seq_page_cost=1.0 (because random I/O is more expensive).
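
For example (sketch only; table and column names follow the earlier
sketches, and the values are just starting points):

  BEGIN;
  SET LOCAL enable_hashjoin = off;   -- force the nested-loop plan
  EXPLAIN (ANALYZE, BUFFERS)
  MERGE INTO goliath ca
  USING david AS t ON t.id = ca.id
  WHEN MATCHED THEN UPDATE SET val = t.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (t.id, t.val);
  ROLLBACK;   -- EXPLAIN ANALYZE actually executes the MERGE

  -- if the forced plan wins, lower random_page_cost (e.g. for SSDs)
  -- instead of disabling hash joins permanently
  SET random_page_cost = 1.1;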

In any case, there's a point where the nested loops get terrible. I mean,
imagine "david" has a large number of rows and "goliath" has 100k pages
(800MB). It's just cheaper to seqscan 100k pages than to randomly scan
the same 100k pages. You can tune where the plan flips, of course.

> 
> repro2.sql: here I changed the table layout (similar to my setup) to
> reproduce the partitionwise join that does not trigger. I added a
> partition column, and a unique index to be able to mimic a primary key.

Re: Index on (fixed size) bytea value

2023-06-19 Thread Les
David G. Johnston wrote (on Mon, Jun 19, 2023, 22:30):

> On Mon, Jun 19, 2023 at 1:05 PM Les  wrote:
>
>> AFAIK PostgreSQL does not allow a row to occupy multiple blocks.
>>
>
> Your plan is going to heavily involve out-of-band storage.  Please read up
> on it here:
>
> https://www.postgresql.org/docs/current/storage-toast.html
>
I'm aware of TOAST and how it works; I was referring to it ("I think that
it should be as large as possible, without hitting the toast."). I have
designed a separate "block" table specifically to avoid storing binary data
in TOAST, so my plan is not going to involve out-of-band storage.

Just to make this very clear: a record in the block table would store a
block, not the whole file. My question is about finding the optimal block
size (without hitting TOAST) and the optimal hash algorithm for block
de-duplication.

Unless I totally misunderstood how the TOAST works. (?)

Laszlo


Re: Index on (fixed size) bytea value

2023-06-19 Thread Laurenz Albe
On Tue, 2023-06-20 at 08:13 +0200, Les wrote:
> I'm aware of TOAST and how it works; I was referring to it ("I think that
> it should be as large as possible, without hitting the toast."). I have
> designed a separate "block" table specifically to avoid storing binary data
> in TOAST, so my plan is not going to involve out-of-band storage.
> 
> Just to make this very clear: a record in the block table would store a
> block, not the whole file. My question is about finding the optimal block
> size (without hitting TOAST) and the optimal hash algorithm for block
> de-duplication.

Then you would ALTER the column and SET STORAGE MAIN, so that it does not ever 
use TOAST.
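
A sketch against the "block" table from the original message:

ALTER TABLE block ALTER COLUMN block SET STORAGE MAIN;

-- optionally check how big the stored rows actually end up
SELECT pg_column_size(b.*)     AS row_bytes,
       pg_column_size(b.block) AS block_bytes
FROM block b LIMIT 5;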

The size limit for a row would then be 8kB minus page header minus row header, 
which
should be somewhere in the vicinity of 8140 bytes.

If you want your block size to be a power of two, the limit would be 4kB, which 
would waste
almost half your storage space.

Yours,
Laurenz Albe