Thousands of partitions performance questions

2019-04-28 Thread Shai Cantor
   - *General*
      - Our system gathers data from CI builds regarding a customer's code base
   - *Data includes*
      - type
      - methods/lines/branches, a.k.a. code elements
      - files
   - *Queries*
      - The data is queried for a specific build only
      - Queries are aggregations on the code elements and files, with some
        filtering and sorting
   - *Volume*
      - I expect to have about 1500 builds a day, 45000 builds a month
      - A build can have around 30 code elements and around 3 files
   - *Retention*
      - Thought about keeping 90 days of builds as retention
   - *Plan*
      - 2 tables
         - code elements
         - files
      - create a partition for each build
      - each day, delete partitions older than 90 days
      - create 2 schemas for each client
         - a schema which holds the parent tables
         - a "_partitions" schema that will hold the partitions
      - upon a new build, create a partition for the code elements table and
        for the files table in the "_partitions" schema
   - *Questions*
      - Will the db hold 135000 (45000 * 3 months) partitions, under the
        assumption that *I query only 1 partition*?
      - Should I model it differently in terms of schema, partitions, etc.?
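
The plan above could be sketched roughly like this (schema, table, and column names are all made up for illustration; the post only specifies the two-table, partition-per-build, two-schemas-per-client shape):

```sql
-- Hypothetical sketch of the proposed layout for one client.
CREATE SCHEMA client_acme;             -- holds the parent tables
CREATE SCHEMA client_acme_partitions;  -- holds the per-build partitions

-- Parent table for code elements, partitioned by build id.
CREATE TABLE client_acme.code_elements (
    build_id     bigint NOT NULL,
    element_type text,
    name         text,
    file_id      bigint
) PARTITION BY LIST (build_id);

-- Upon a new build, create its partition in the "_partitions" schema.
CREATE TABLE client_acme_partitions.code_elements_b12345
    PARTITION OF client_acme.code_elements
    FOR VALUES IN (12345);

-- Daily retention job: detach and drop partitions older than 90 days.
ALTER TABLE client_acme.code_elements
    DETACH PARTITION client_acme_partitions.code_elements_b12345;
DROP TABLE client_acme_partitions.code_elements_b12345;
```

The files table would follow the same pattern.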


Re: Thousands of partitions performance questions

2019-04-29 Thread Shai Cantor
Thanks a lot for your feedback.
We're going to use PG 11 on AWS RDS.
I will do some workload simulations.
Some notes I haven't shared, or that were not clear enough, in the previous post.

   1. *Data is inserted using the COPY command only*, and in an offline
   manner. Meaning, no user action creates or updates the data; an offline
   job runs the load. The number of inserts can reach up to 1500 a day.
   2. *Queries are only on a single partition*
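
The offline load in point 1 might look something like this (the partition name, column list, and file path are assumptions, not from the post; since the job knows the build id, it can target the partition directly):

```sql
-- Hypothetical bulk load for one build, straight into its partition.
COPY client_acme_partitions.code_elements_b12345
    (build_id, element_type, name, file_id)
FROM '/var/loads/build_12345/code_elements.csv'
WITH (FORMAT csv);
```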



   - Can that ease the performance concerns?
   - Will querying the partition tables directly help?
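
The difference behind the second question could be illustrated like this (all names hypothetical, continuing the sketch from the first post):

```sql
-- Via the parent: the planner must prune ~135000 partitions down to one.
SELECT element_type, count(*)
FROM client_acme.code_elements
WHERE build_id = 12345
GROUP BY element_type;

-- Directly against the partition: no pruning work at all, since the
-- application already knows which build (and thus which table) it wants.
SELECT element_type, count(*)
FROM client_acme_partitions.code_elements_b12345
GROUP BY element_type;
```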


On Mon, Apr 29, 2019 at 8:46 AM David Rowley wrote:

> On Mon, 29 Apr 2019 at 17:13, Shai Cantor  wrote:
> > Will the db hold 135000 (45000 * 3 months) partitions under the
> assumption that I query only 1 partition?
> > Should I model it differently in terms of schema, partitions etc.?
>
> Which PG version?
>
> Before PG11, the query planner pruned unneeded partitions by looking at
> each partition and determining whether the partition constraint could not
> match the base quals of the query.  PG11 improved this by adding a
> smarter and faster algorithm to get rid of non-matching partitions;
> however, this really only speeds things up a little, as it only
> allows the planner to skip generating paths for these partitions.
> Certain metadata is still loaded, which is not really slow per
> partition, but it is slow if you have thousands of partitions.
>
> PG11 is also pretty bad at planning UPDATE/DELETEs to partitioned
> tables with a large number of partitions. You'll at best suffer from
> high planning times to plan these queries and at worst suffer out of
> memory errors with that many partitions.
>
> The yet-to-be-released PG12 improves both of these deficiencies,
> provided you can eliminate the majority of partitions during query
> planning. PG12 also improves the performance of INSERT into a
> partitioned table: whereas PG11 obtained a lock on every
> partition, PG12 only grabs a lock on a partition the first time
> the INSERT command inserts a row into it. If you're just INSERTing 1
> row per command into a partitioned table with many partitions, then
> this makes a pretty big difference.
>
> Depending on the types of query you're running, it's likely not a good
> idea to go above 100 or so partitions with PG11. You might get away
> with more if you're running a more data-warehouse-type load, i.e. fewer
> but longer-running queries, but for a more OLTP-type workload, with
> more queries and lower latencies, you may struggle to cope with a
> dozen.
>
> I'd recommend you do workload simulations with whatever number you
> choose and ensure performance is to the level you require before
> getting in too deep with your design.
>
> If your go-live date is near the end of the year or beyond, then it
> might be a good idea to start testing with PG12 right away. The
> release date for that will likely be around the middle of October.
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>