Re: Merge David and Goliath tables efficiently
I came here to talk about partitionwise join, but then noticed you have
already thought of that:

On 2023-Jun-18, nicolas paris wrote:

> Note that both plans come from the same hash-partitioned table with
> 100 parts, with a unique index on list_id + hash_key. For strategy
> 2.1, I turned on enable_partitionwise_join, since the david table has
> the same partitioning scheme as goliath, including the unique index.
> In both cases the query is:

Hmm, I suppose the reason partitionwise join isn't having any effect is
that the presence of WHEN NOT MATCHED clauses forces an outer join, which
probably disarms partitionwise joining, since each join pair would have
to match for nulls, so there would be two matching partitions at the
other end. A quick test for this hypothesis might be to try the MERGE
without the WHEN NOT MATCHED clauses and see if partitionwise join works
better.

Maybe Tom L's new outer-join infrastructure in 16 allows improving on
this, not sure.

-- 
Álvaro Herrera         Breisgau, Deutschland — https://www.EnterpriseDB.com/
"The gods do not protect fools. Fools are protected by more capable
fools."  (Luis Wu, Ringworld)
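[For concreteness, the test suggested above might look roughly like the
following sketch. Only list_id comes from the thread; the payload column
is an assumption.]

  SET enable_partitionwise_join = on;

  EXPLAIN
  MERGE INTO goliath ca
  USING david AS t
    ON t.list_id = ca.list_id
  WHEN MATCHED THEN
    UPDATE SET payload = t.payload;   -- no WHEN NOT MATCHED clause, so the
                                      -- join can remain an inner join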
Re: Merge David and Goliath tables efficiently
On 6/18/23 22:57, nicolas paris wrote:
>> ...
>> Well, I kinda doubt you have 17023331531230 rows (not even physically
>> possible with 2TB disk), so that's immediately suspicious.
>
> Below is the full plan for the strategy 2.1 (indeed the previous email's
> plan was truncated and wrong, sorry for that).
>

None of the plans has estimates anywhere close to 17023331531230, so
where did that come from?

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Merge David and Goliath tables efficiently
On 6/19/23 09:46, Alvaro Herrera wrote:
> I came here to talk about partitionwise join, but then noticed you have
> already thought of that:
>
> On 2023-Jun-18, nicolas paris wrote:
>
>> Note that both plans come from the same hash-partitioned table with
>> 100 parts, with a unique index on list_id + hash_key. For strategy
>> 2.1, I turned on enable_partitionwise_join, since the david table has
>> the same partitioning scheme as goliath, including the unique index.
>> In both cases the query is:
>
> Hmm, I suppose the reason partitionwise join isn't having any effect is
> that the presence of WHEN NOT MATCHED clauses forces an outer join, which
> probably disarms partitionwise joining, since each join pair would have
> to match for nulls, so there would be two matching partitions at the
> other end. A quick test for this hypothesis might be to try the MERGE
> without the WHEN NOT MATCHED clauses and see if partitionwise join works
> better.
>
> Maybe Tom L's new outer-join infrastructure in 16 allows improving on
> this, not sure.
>

Not sure why that would disarm partitionwise join - attached is a simple
reproducer, generating two tables, loading 1000 and 1M rows into them,
and then doing explain on a simple merge.

IMHO the thing that breaks it is the ORDER BY in the merge, which likely
acts as an optimization fence and prevents all sorts of smart things
including the partitionwise join. I'd bet that if Nicolas replaces

  MERGE INTO "goliath" ca
  USING (SELECT * FROM "david" ORDER BY "list_id") AS t
  ...

with

  MERGE INTO "goliath" ca
  USING "david" AS t
  ...

it'll start working much better.

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment: repro.sql (application/sql)
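[The attached repro.sql is not reproduced in this digest; a reproducer
along the lines described (two identically hash-partitioned tables, one
small and one large, and an EXPLAIN of a simple MERGE) might look roughly
like the sketch below. Column names, types and row counts are
assumptions.]

  -- Two tables hash-partitioned the same way; goliath is the big one.
  CREATE TABLE goliath (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);
  CREATE TABLE david   (id bigint PRIMARY KEY, val text) PARTITION BY HASH (id);

  DO $$
  BEGIN
    FOR i IN 0..99 LOOP
      EXECUTE format('CREATE TABLE goliath_%s PARTITION OF goliath
                      FOR VALUES WITH (MODULUS 100, REMAINDER %s)', i, i);
      EXECUTE format('CREATE TABLE david_%s PARTITION OF david
                      FOR VALUES WITH (MODULUS 100, REMAINDER %s)', i, i);
    END LOOP;
  END $$;

  INSERT INTO goliath SELECT i, md5(i::text) FROM generate_series(1, 1000000) i;
  INSERT INTO david   SELECT i, md5(i::text) FROM generate_series(1, 1000)    i;
  ANALYZE goliath, david;

  SET enable_partitionwise_join = on;

  EXPLAIN
  MERGE INTO goliath g
  USING david d ON d.id = g.id
  WHEN MATCHED THEN UPDATE SET val = d.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (d.id, d.val);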
Re: Merge David and Goliath tables efficiently
On Mon, 2023-06-19 at 13:34 +0200, Tomas Vondra wrote:
>
> On 6/18/23 22:57, nicolas paris wrote:
> >> ...
> >> Well, I kinda doubt you have 17023331531230 rows (not even physically
> >> possible with 2TB disk), so that's immediately suspicious.
> >
> > Below is the full plan for the strategy 2.1 (indeed the previous
> > email's plan was truncated and wrong, sorry for that).
> >
>
> None of the plans has estimates anywhere close to 17023331531230, so
> where did that come from?
>

Well, this was an old plan from when there was an issue: the david table
did not have the same partitioning scheme as goliath. It was partitioned
by another column.
Re: Merge David and Goliath tables efficiently
> IMHO the thing that breaks it is the ORDER BY in the merge, which
> likely acts as an optimization fence and prevents all sorts of smart
> things including the partitionwise join. I'd bet that if Nicolas
> replaces
>
>   MERGE INTO "goliath" ca
>   USING (SELECT * FROM "david" ORDER BY "list_id") AS t
>   ...

Sorry if it was not clear, but there is no ORDER BY in the 2.1 strategy,
so that cannot be the reason the optimization is not triggered.

For information, I do enable the partitionwise join feature with a JDBC
call just before the merge:

  set enable_partitionwise_join=true
Re: Merge David and Goliath tables efficiently
On 6/19/23 14:20, nicolas paris wrote:
>> IMHO the thing that breaks it is the ORDER BY in the merge, which
>> likely acts as an optimization fence and prevents all sorts of smart
>> things including the partitionwise join. I'd bet that if Nicolas
>> replaces
>>
>>   MERGE INTO "goliath" ca
>>   USING (SELECT * FROM "david" ORDER BY "list_id") AS t
>>   ...
>
> Sorry if it was not clear, but there is no ORDER BY in the 2.1 strategy,
> so that cannot be the reason the optimization is not triggered.
>
> For information, I do enable the partitionwise join feature with a JDBC
> call just before the merge:
>
>   set enable_partitionwise_join=true
>

But you wrote that in both cases the query is:

  MERGE INTO "goliath" ca
  USING (SELECT * FROM "david" ORDER BY "list_id") AS t
  ON t."list_id" = ca."list_id"
  WHEN MATCHED THEN
    UPDATE SET ...
  WHEN NOT MATCHED THEN
    INSERT (...) VALUES (...)

With all due respect, I'm getting a bit tired of having to speculate
about what exactly you're doing etc. based on bits of information. I'm
willing to continue to investigate, but only if you prepare a
reproducer, i.e. a SQL script that demonstrates the issue - I don't
think preparing that should be difficult; something like the SQL script
I shared earlier today should do the trick.

I suggest you do that directly, not through JDBC. Perhaps the JDBC
connection pool does something funny (like connection pooling and
resetting settings).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: Merge David and Goliath tables efficiently
> But you wrote that in both cases the query is:

That was indeed yet another typo; I hope to do better in the future.

> I'm willing to continue to investigate, but only if you prepare a
> reproducer,

Thanks for your starter script. Please find attached 2 scripts which
illustrate two problems.

repro1.sql is a slight evolution of yours. When I play with the david
size (as described in the comments) you will see the plan go from a
nested loop to a sequential scan. Also note that the partitionwise join
is likely working. This illustrates my initial problem: the sequential
scan is not going to work fine on my workload (david is too large). How
can I suggest to postgres to use a nested loop here? (I suspect playing
with costs should help.)

repro2.sql changes the table layout (similar to my setup) to reproduce
the partitionwise join not triggering. I added a partition column, and a
unique index to be able to mimic a primary key. Now partitionwise join
(in my local docker vanilla postgres 15.3) does not work. Incidentally,
if I do small batches, the merge is fast. That's another, unrelated
problem.

> I suggest you do that directly, not through JDBC. Perhaps the JDBC
> connection pool does something funny (like connection pooling and
> resetting settings).

I can tell JDBC was working; the reason is more likely in my current
table setup.

Attachment: repro1.sql (application/sql)
Attachment: repro2.sql (application/sql)
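[The attachments are not reproduced in this digest; a layout along the
lines described for repro2.sql might look roughly like the sketch below.
Names and types are assumptions; only the unique index on
list_id + hash_key comes from the thread.]

  -- Hash-partitioned tables with an extra column and a unique index on
  -- (list_id, hash_key) standing in for a primary key.
  CREATE TABLE goliath (
      list_id  bigint NOT NULL,
      hash_key text   NOT NULL,
      val      text
  ) PARTITION BY HASH (list_id);

  CREATE TABLE david (LIKE goliath) PARTITION BY HASH (list_id);

  CREATE UNIQUE INDEX goliath_uq ON goliath (list_id, hash_key);
  CREATE UNIQUE INDEX david_uq   ON david   (list_id, hash_key);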
Re: Forced to use UNION ALL when having multiple ANY operators and ORDER BY LIMIT
No, it is not. But do you think there is an impact here?

On 18/06/2023 at 23:23, [email protected] wrote:

> Hi,
>
> Do you really need to do "select *"? In other words, is it necessary to
> have all columns in the result?
>
> Michel SALAIS
>
> From: benoit
> Sent: Monday, June 12, 2023 23:35
> To: Chris Hoover
> Cc: [email protected]
> Subject: RE: Forced to use UNION ALL when having multiple ANY operators
> and ORDER BY LIMIT
>
> This new index is used, but the read is still 230MB.
> https://explain.dalibo.com/plan/b0f28a9e8a136afd
>
> From: Chris Hoover
> Sent: Monday, June 12, 2023 22:55
> To: benoit
> Cc: [email protected]
> Subject: Re: Forced to use UNION ALL when having multiple ANY operators
> and ORDER BY LIMIT
>
> I normally create my indexes to match the where clause of the query.
> While technically it should not matter, I find that a lot of the time
> it does.
>
> I would create an index on (status, sender_reference, sent_at) and see
> if that improves your query performance.
>
>   SELECT *
>   FROM docs
>   WHERE status IN ('draft', 'sent')
>     AND sender_reference IN ('Custom/1175', 'Client/362', 'Custom/280')
>   ORDER BY sent_at DESC
>
> Thanks,
>
> Chris Hoover
> Senior DBA
> AWeber.com
> Cell: (803) 528-2269
> Email: [email protected]
>
> On Jun 12, 2023, at 4:17 PM, benoit wrote:
>
>> Hello
>>
>> I have a database with a few 60GB tables. Table rows are requested
>> with multiple ANY or IN operators. I am not able to find an easy way
>> to make the DB use indexes well. I often hit the index, but see a
>> spike of 200MB of IO or disk read.
>>
>> I am using version 13, but soon 14.
>>
>> I wrote a reproduction script on version 14 with plans included:
>> https://gist.github.com/benoittgt/ab72dc4cfedea2a0c6a5ee809d16e04d
>>
>> I also have plans on a snapshot of the DB with real data.
>>
>> - The current query that I try to improve:
>>   https://explain.dalibo.com/plan/8b8f6e0he9feb551
>>   I added the DB schema + indexes in the query view. As you can see,
>>   I have many indexes for testing purposes, to try what the planner
>>   can do.
>> - The optimized query when I have only one ANY and migrate to UNION
>>   ALL for each parameter of the ANY operator:
>>   https://explain.dalibo.com/plan/427gg053d07328ga
>>   The query is as fast as I would like, but it means generating a
>>   branch per parameter and merging the results.
>> - The new issue I have when I add a new ANY operator to the previous
>>   optimized query - big IO/read:
>>   https://explain.dalibo.com/plan/e7ha9g637b4eh946
>>
>> It seems quite undoable to generate a query per parameter and then
>> merge the results. I sometimes have 3-4 ANY operators with up to 15
>> elements in an array.
>>
>> Am I misusing my indexes? Is there a limitation when using ANY or IN
>> operators with an ordered LIMIT behind?
>>
>> Thanks a lot
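[For readers following the thread, the per-value UNION ALL rewrite being
discussed, applied to one of the IN lists above, looks roughly like the
sketch below. The LIMIT value and the outer wrapper are assumptions.]

  -- One branch per sender_reference value; each branch carries its own
  -- ORDER BY + LIMIT so it can stop early, and the outer query merges the
  -- partial results and applies the final LIMIT.
  SELECT *
  FROM (
      (SELECT * FROM docs
        WHERE status IN ('draft', 'sent')
          AND sender_reference = 'Custom/1175'
        ORDER BY sent_at DESC LIMIT 20)
      UNION ALL
      (SELECT * FROM docs
        WHERE status IN ('draft', 'sent')
          AND sender_reference = 'Client/362'
        ORDER BY sent_at DESC LIMIT 20)
      UNION ALL
      (SELECT * FROM docs
        WHERE status IN ('draft', 'sent')
          AND sender_reference = 'Custom/280'
        ORDER BY sent_at DESC LIMIT 20)
  ) AS per_sender
  ORDER BY sent_at DESC
  LIMIT 20;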
Index on (fixed size) bytea value
Dear fellow list members,

I'm in the process of implementing a file storage system that is based
on PostgreSQL and streaming replication. There will possibly be many
similar files stored. I would like to implement block-level
deduplication: each file consists of a series of blocks, and each unique
block is stored only once (e.g. one block may be reused by multiple
files). It will be part of a bigger software, e.g. the same database
will be used for other purposes too.

Here is the basic idea for storing individual blocks:

  create table block(
     id uuid not null primary key,
     block bytea not null,
     hs256 bytea not null
  );
  create unique index uidx_block_hs256 on block(hs256);

  create or replace function trg_biu_block() returns trigger
  language plpgsql as
  $function$
  begin
     new.hs256 = digest(new.block, 'sha256');
     return new;
  end;
  $function$;

  create trigger trg_biu_block before insert or update on block
  for each row execute procedure trg_biu_block();

This is just for storing the blocks. I'm going to put this "block" table
into a separate tablespace. File operations will be at least 95% read
and at most 5% write. (Streaming replication will hopefully allow almost
horizontal scaling for read operations.) Most of the files will be
several megabytes in size (images), and some of them will be 100MB or
more (videos). Total capacity is in the 10TB range. Storage will be SSD
(possibly fiber connected, or local RAID, we are not sure yet). I do not
want to use PostgreSQL large objects, because they do not offer
block-level deduplication.

Here are some things that I need help with:

1. What should be the maximum size of a block? I was trying to find the
optimal value. Default BLCKSZ is 8192 bytes. AFAIK PostgreSQL does not
allow a row to occupy multiple blocks. I don't know enough to calculate
the optimal "block size" (the max. number of bytes stored in a single
row in the block.block field), but I suspect that it should be 4K or
something similar. I think that it should be as large as possible,
without hitting the TOAST. My thinking is this: most files will be at
least 1MB in size, so most "block" rows will reach the maximum tuple
size. It would be practical to make one row in the "block" table occupy
almost one PostgreSQL block.

2. I'm not sure if it would be beneficial to increase BLCKSZ. I will be
able to test the speed of my (not yet finished) implementation with
different BLCKSZ values, but that alone won't help me make the decision,
because AFAIK BLCKSZ must have a fixed value for the PostgreSQL
instance, so it will affect all other tables in the database. It would
be hard to tell how changing BLCKSZ would affect the system as a whole.

3. In the above example, I used SHA-256 (pgcrypto), because apparently
it is very well optimized for 64-bit machines, and it has practically
zero chance of a collision. I think that SHA-512 would be overkill. But
I'm not sure that this is the best choice. Maybe somebody with more
experience can suggest a better hash function.

4. The hs256 value will always be a non-null, fixed 32-byte binary
value, but probably the query planner will not know anything about that.
I was also thinking about bit(256), but I don't see an easy way to
convert the bytea digest into bit(256); a simple type cast won't work
here. Maybe using bytea here is perfectly fine, and creating an index on
the hs256 bytea field is as effective as possible.

I'm not looking for a definitive answer, just trying to get some hints
from more experienced users before I fill up the drives with terabytes
of data.
Thank you, Laszlo
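[For context, a file-to-block mapping on top of the proposed block table
might look roughly like the sketch below. The file and file_block tables
and all their names are assumptions, not part of the original design.]

  -- Hypothetical companion tables: a file is an ordered list of references
  -- to de-duplicated rows in the block table defined above.
  CREATE TABLE file (
      id   uuid NOT NULL PRIMARY KEY,
      name text NOT NULL
  );

  CREATE TABLE file_block (
      file_id  uuid   NOT NULL REFERENCES file (id),
      block_no bigint NOT NULL,              -- position of the block in the file
      block_id uuid   NOT NULL REFERENCES block (id),
      PRIMARY KEY (file_id, block_no)
  );

  -- De-duplication lookup: check for an existing block by its SHA-256 digest
  -- (digest() is from pgcrypto, as in the trigger above) before inserting a
  -- new one. '\x0102'::bytea stands in for the raw block contents.
  SELECT id FROM block WHERE hs256 = digest('\x0102'::bytea, 'sha256');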
Re: Index on (fixed size) bytea value
On Mon, Jun 19, 2023 at 1:05 PM Les wrote:

> AFAIK PostgreSQL does not allow a row to occupy multiple blocks.

Your plan is going to heavily involve out-of-band storage. Please read
up on it here:

https://www.postgresql.org/docs/current/storage-toast.html

> I'm not looking for a definitive answer, just trying to get some hints
> from more experienced users before I fill up the drives with terabytes
> of data.

Store a hash (and other metadata, like the hashing algorithm) as well as
the path to some system better designed for object storage and retrieval
instead.

David J.
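[A sketch of the metadata-only approach being suggested might look like
the table below. Table and column names are assumptions; the external
object store itself is left unspecified.]

  -- The database keeps the digest, algorithm and a pointer into an external
  -- object store; the bytes themselves live outside PostgreSQL.
  CREATE TABLE stored_object (
      id           uuid   NOT NULL PRIMARY KEY,
      digest       bytea  NOT NULL,
      digest_algo  text   NOT NULL DEFAULT 'sha256',
      size_bytes   bigint NOT NULL,
      storage_path text   NOT NULL,      -- key/path in the external object store
      UNIQUE (digest_algo, digest)
  );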
Re: Merge David and Goliath tables efficiently
On 6/19/23 17:45, nicolas paris wrote:
>> But you wrote that in both cases the query is:
>
> That was indeed yet another typo; I hope to do better in the future.
>
>> I'm willing to continue to investigate, but only if you prepare a
>> reproducer,
>
> Thanks for your starter script. Please find attached 2 scripts which
> illustrate two problems.
>
> repro1.sql is a slight evolution of yours. When I play with the david
> size (as described in the comments) you will see the plan go from a
> nested loop to a sequential scan. Also note that the partitionwise join
> is likely working. This illustrates my initial problem: the sequential
> scan is not going to work fine on my workload (david is too large). How
> can I suggest to postgres to use a nested loop here? (I suspect playing
> with costs should help.)
>

In general, this behavior is expected. The overall idea is that nested
loops are efficient for small row counts, but their cost rises quickly,
exactly because they do a lot of random I/O (due to the index scan). At
some point it's cheaper to switch to a plan that does sequential I/O.
Which is exactly what's happening here - we switch from a plan doing a
lot of random I/O on goliath

                               QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=0.29..7888.69 rows=0 width=0)
     Merge on goliath_0 goliath_1
     ...
     Merge on goliath_99 goliath_100
     ->  Append  (cost=0.29..7888.69 rows=9998 width=47)
           ->  Nested Loop Left Join  (cost=0.29..93.89 rows=120 ...
                 ->  Seq Scan on david_0 david_1  (cost=0.00..2.20 ...
                 ->  Index Scan using goliath_0_pkey on goliath_0 ...
                       Index Cond: (id = david_1.id)
           ->  Nested Loop Left Join  (cost=0.29..77.10 rows=98 ...
                 ->  Seq Scan on david_1 david_2  (cost=0.00..1.98 ...
                 ->  Index Scan using goliath_1_pkey on goliath_1 ...
                       Index Cond: (id = david_2.id)
           ...
           ->  Nested Loop Left Join  (cost=0.29..74.58 rows=95 ...
                 ->  Seq Scan on david_99 david_100  (cost=0.00..1.95 ...
                 ->  Index Scan using goliath_99_pkey on goliath_99 ...
                       Index Cond: (id = david_100.id)
  (502 rows)

to a plan that does a lot of sequential I/O

                               QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=293.44..264556.47 rows=0 width=0)
     Merge on goliath_0 goliath_1
     ...
     Merge on goliath_99 goliath_100
     ->  Append  (cost=293.44..264556.47 rows=951746 width=47)
           ->  Hash Right Join  (cost=293.44..2597.05 rows=9486 ...
                 Hash Cond: (goliath_1.id = david_1.id)
                 ->  Seq Scan on goliath_0 goliath_1  (cost=0.00..
                 ->  Hash  (cost=174.86..174.86 rows=9486 width=37)
                       ->  Seq Scan on david_0 david_1  (cost=0.00..
           ->  Hash Right Join  (cost=295.62..2613.90 rows=9583 width=
                 Hash Cond: (goliath_2.id = david_2.id)
                 ->  Seq Scan on goliath_1 goliath_2  (cost=0.00..1845
                 ->  Hash  (cost=175.83..175.83 rows=9583 width=37)
                       ->  Seq Scan on david_1 david_2  (cost=0.00..
           ...
           ->  Hash Right Join  (cost=288.33..2593.16 rows=9348 width
                 Hash Cond: (goliath_100.id = david_100.id)
                 ->  Seq Scan on goliath_99 goliath_100  (cost=0.00..
                 ->  Hash  (cost=171.48..171.48 rows=9348 width=37)
                       ->  Seq Scan on david_99 david_100  (cost=0.00..
  (602 rows)

That's expected, because the cost if I force the Nested Loop with the
higher row count in "david" looks like this:

                               QUERY PLAN
  ----------------------------------------------------------------------
   Merge on goliath  (cost=0.29..331253.00 rows=0 width=0)
   ...

Of course, the question is at which point the switch should happen. You
can try setting enable_hashjoin=off, which should push the optimizer to
use the first plan. If it does, you'll know the cost difference between
the two plans. If you run it and it's actually faster than the "default"
plan with a hashjoin/seqscans, you can try lowering random_page_cost,
which is likely the main input.
The default is 4; in general it should stay higher than seq_page_cost=1.0
(because random I/O is more expensive).

In any case, there's a point where the nested loops get terrible. I mean,
imagine "david" has 100k rows and "goliath" has 100k pages (800MB). It's
just cheaper to seqscan the 100k pages than to randomly scan the same
100k pages. You can tune where the plan flips, ofc.

> repro2.sql changes the table layout (similar to my setup) to reproduce
> the partitionwise join not triggering. I added a partition column, and
> a unique index to be able to mimic a primary key.
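[As a concrete starting point, the enable_hashjoin / random_page_cost
experiment described above could look roughly like this sketch. The MERGE
mirrors the plans quoted earlier; the val column and the cost value are
assumptions, not recommendations.]

  BEGIN;
  SET LOCAL enable_hashjoin = off;          -- force the nested-loop plan
  EXPLAIN (ANALYZE, BUFFERS)
  MERGE INTO goliath ca
  USING david AS t ON t.id = ca.id
  WHEN MATCHED THEN UPDATE SET val = t.val
  WHEN NOT MATCHED THEN INSERT (id, val) VALUES (t.id, t.val);
  ROLLBACK;                                 -- EXPLAIN ANALYZE executes the MERGE

  -- If the nested-loop plan really is faster on this hardware, lower
  -- random_page_cost (default 4), keeping it above seq_page_cost = 1.0:
  SET random_page_cost = 1.5;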
Re: Index on (fixed size) bytea value
On Mon, Jun 19, 2023 at 22:30, David G. Johnston wrote:
> On Mon, Jun 19, 2023 at 1:05 PM Les wrote:
>
>> AFAIK PostgreSQL does not allow a row to occupy multiple blocks.
>>
>
> Your plan is going to heavily involve out-of-band storage. Please read up
> on it here:
>
> https://www.postgresql.org/docs/current/storage-toast.html
>
I'm aware of the TOAST, and how it works. I was referring to it ("I think
that it should be as large as possible, without hitting the toast. ") I
have designed a separate "block" table specifically to avoid storing binary
data in the TOAST. So my plan is not going to involve out-of-band storage.
Just to make this very clear: a record in the block table would store a
block, not the whole file. My question is about finding the optimal block
size (without hitting the toast), and finding the optimal hash algorithm for
block de-duplication.
Unless I totally misunderstood how the TOAST works. (?)
Laszlo
Re: Index on (fixed size) bytea value
On Tue, 2023-06-20 at 08:13 +0200, Les wrote:
> I'm aware of the TOAST, and how it works. I was referring to it ("I think
> that it should
> be as large as possible, without hitting the toast. ") I have designed a
> separate "block"
> table specifically to avoid storing binary data in the TOAST. So my plan is
> not going to
> involve out-of-band storage.
>
> Just to make this very clear: a record in the block table would store a
> block, not the
> whole file. My question is to finding the optimal block size (without hitting
> the toast),
> and finding the optimal hash algorithm for block de-duplication.
Then you would ALTER the column and SET STORAGE MAIN, so that it does not ever
use TOAST.
The size limit for a row would then be 8kB minus page header minus row header,
which
should be somewhere in the vicinity of 8140 bytes.
If you want your block size to be a power of two, the limit would be 4kB, which
would waste
almost half your storage space.
Yours,
Laurenz Albe
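[Applied to the block table from earlier in the thread, the ALTER Laurenz
describes would be the one-liner below.]

  -- Keep the bytea column in the heap page instead of moving it to TOAST.
  ALTER TABLE block ALTER COLUMN block SET STORAGE MAIN;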
