Re: When to use PARTITION BY HASH?

2020-06-07 Thread MichaelDBA
The article referenced below assumes a worst case scenario for 
bulk-loading with hash partitioned tables.  It assumes that the values 
being inserted are in strict ascending or descending order with no gaps 
(like a sequence number incrementing by 1), thereby ensuring every 
partition is hit in order before repeating the process.  If the values 
being inserted are not strictly sequential with no gaps, then the 
performance is much better.  Obviously, what part of the tables and 
indexes are in memory has a lot to do with it as well.


Regards,
Michael Vitale

Imre Samu wrote on 6/5/2020 7:48 AM:

> "Bulk loads ...",

As I see - There is an interesting bulkload benchmark:

"How Bulkload performance is affected by table partitioning in 
PostgreSQL" by Beena Emerson (Enterprisedb, December 4, 2019 )
/SUMMARY: This article covers how benchmark tests can be used to 
demonstrate the effect of table partitioning on performance. Tests 
using range- and hash-partitioned tables are compared and the reasons 
for their different results are explained:

                 1. Range partitions
           2. Hash partitions
                 3. Combination graphs
               4. Explaining the behavior
                 5. Conclusion/
/
/
/"For the hash-partitioned table, the first value is inserted in the 
first partition, the second number in the second partition and so on 
till all the partitions are reached before it loops back to the first 
partition again until all the data is exhausted. Thus it exhibits the 
worst-case scenario where the partition is repeatedly switched for 
every value inserted. As a result, the number of times the partition 
is switched in a range-partitioned table is equal to the number of 
partitions, while in a hash-partitioned table, the number of times the 
partition has switched is equal to the amount of data being inserted. 
This causes the massive difference in timing for the two partition 
types."/


https://www.enterprisedb.com/postgres-tutorials/how-bulkload-performance-affected-table-partitioning-postgresql

Regards,
 Imre



Re: When to use PARTITION BY HASH?

2020-06-07 Thread David Rowley
On Sun, 7 Jun 2020 at 23:41, MichaelDBA  wrote:
> The article referenced below assumes a worst case scenario for bulk-loading 
> with hash partitioned tables.  It assumes that the values being inserted are 
> in strict ascending or descending order with no gaps (like a sequence number 
> incrementing by 1), thereby ensuring every partition is hit in order before 
> repeating the process.  If the values being inserted are not strictly 
> sequential with no gaps, then the performance is much better.  Obviously, 
> what part of the tables and indexes are in memory has a lot to do with it as 
> well.

In PostgreSQL 12, COPY was modified to support bulk-inserts for
partitioned tables. This did speed up many scenarios.  Internally, how
this works is that we maintain a series of multi insert buffers, one
per partition. We generally only flush those buffers to the table when
the buffer for the partition fills.  However, there is a sort of
sanity limit [1] on the number of multi insert buffers we maintain at
once and currently, that is 32.  Technically we could increase that
limit, but there would still need to be a limit.  Unfortunately, for
this particular case, since we're most likely touching between 199-799
other partitions before hitting the first one again, that will mean
that we really don't get any multi-inserts, which is likely the reason
why the performance is worse for hash partitioning.

With PG12 and for this particular case, you're likely to see COPY
performance drop quite drastically when going from 32 to 33
partitions.  The code was more designed for hitting partitions more
randomly rather than in this sort-of round-robin way that we're likely
to get from hash partitioning on a serial column.

David

[1] 
https://github.com/postgres/postgres/blob/master/src/backend/commands/copy.c#L2569