Join order optimization
Hi, I'm from the Hibernate team (Java ORM). A user recently reported that a
change in our SQL rendering affected his query plans in a bad way. In short,
we decided to model certain constructs in our ORM with "nested joins", i.e.
using parentheses to express the join order. This is what we want to model
semantically, though there are cases where we could detect that we don't
need the explicit join ordering to get the same semantics. I expected that
the PostgreSQL optimizer could do the same reasoning and apply further join
re-ordering to produce an optimal plan, but to my surprise it doesn't seem
capable of that.

The query we generate right now has the following structure:

```
from tbl1 t1
join (
    tbl2 t2
    left join tbl3 t3 on t3.pkAndFk = t2.pk
    left join tbl4 t4 on t4.pkAndFk = t2.pk
    ...
    left join tbl9 t9 on t9.pkAndFk = t2.pk
) on t2.fk = t1.pk
where t1.indexedColumn = ...
```

whereas the query we generated before, which is semantically equivalent, is
the following:

```
from tbl1 t1
join tbl2 t2 on t2.fk = t1.pk
left join tbl3 t3 on t3.pkAndFk = t2.pk
left join tbl4 t4 on t4.pkAndFk = t2.pk
...
left join tbl9 t9 on t9.pkAndFk = t2.pk
where t1.indexedColumn = ...
```

You can find the full queries in the attachments section of the issue report
from the user: https://hibernate.atlassian.net/browse/HHH-16595
Query_Hibernate5.txt shows the old style query without parentheses and
Query_Hibernate6.txt shows the new style. You will also find the query plans
for the two queries attached as CSV files.

It almost seems like the PostgreSQL optimizer treats the parentheses for
join ordering as an optimization fence!? The user reported that the behavior
is reproducible in PostgreSQL versions 11 and 15. He promised to provide a
full reproducer, which I am still waiting for; I'll share it with you as
soon as it is provided, if needed.

I think we can detect that the parentheses are unnecessary in this
particular case, but ideally PostgreSQL would be able to detect this as well
and plan the optimal join order. Any ideas what is going on here? Is this a
bug or a missed optimization in the query optimizer?

I'm also a bit worried about what PostgreSQL will produce for queries that
really do need the parentheses for join ordering, e.g.

```
from tbl1 t1
left join (
    tbl2 t2
    join tbl3 t3 on t3.pkAndFk = t2.pk
    join tbl4 t4 on t4.pkAndFk = t2.pk
    ...
    join tbl9 t9 on t9.pkAndFk = t2.pk
) on t2.fk = t1.pk
where t1.indexedColumn = ...
```

Thanks for any help.

Christian
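P.S. One thing I plan to try once I have the reproducer (only a guess on my
part, not something I have verified against the user's setup) is whether the
planner's collapse limits are involved, since the parenthesized block joins
eight tables and the default limit is 8:

```
-- Just a guess: check whether join_collapse_limit is what keeps the
-- parenthesized join from being flattened (the default is 8).
SHOW join_collapse_limit;
SET join_collapse_limit = 16;  -- anything larger than the number of joined tables
EXPLAIN (ANALYZE, BUFFERS)
SELECT ... ;  -- the full new-style (parenthesized) query goes here
```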
Re: Index bloat and REINDEX/VACUUM optimization for partial index
> At any moment, there are *around 1000-1500 tasks in pending statuses*
> (Init + InProgress) out of around 500 million tasks.
>
> Now, we have a task monitoring query that will look for all pending tasks
> that have not received any update in the last n minutes.
>
> ```
> SELECT [columns list]
> FROM tasks
> WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days' AND
> updated < NOW() - interval '30 minutes'
> ```
>
> Since we are only interested in the pending tasks, I created a partial
> index
> `*"tasks_pending_status_created_type_idx" btree (status, created,
> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>
> This worked great initially, however this started to get bloated very very
> quickly because, every task starts in pending state, gets multiple updates
> (and many of them are not HOT updates, working on optimizing fill factor
> now), and eventually gets deleted from the index (as status changes to
> success).
>
From my experience, I suspect that the problem lies in the "around 500
million tasks" part.
Autovacuum does clean up old dead index entries, but how many dead index
entries will accumulate on a 500M-row table before autovacuum kicks in?
With the default autovacuum_vacuum_scale_factor of 0.2 (20% of the table
size), autovacuum only triggers once roughly 0.2 * 500M = 100M rows have
become dead, so in the worst case the index collects on the order of 100M
outdated/dead entries before they are all cleaned up at once, and of course
that leads to huge index bloat and awful performance.
Even if you scale autovacuum_vacuum_scale_factor down to an unreasonably
low value like 0.01, the index still bloats to around 5M dead entries
before an autovacuum run, and constantly vacuuming a huge 500M-row table
will put a heavy load on the database server.
Unfortunately there is no easy way out of this situation on the database
side. In general I recommend against implementing a fast-paced, queue-like
workload inside a huge and constantly growing table; it never works well,
because you cannot keep the efficient partial indexes for the queue in a
clean, non-bloated state.
In my opinion the best solution is to keep the list of entries to process
(the "around 1000-1500 tasks in pending statuses") duplicated in a separate
tiny table (maintained via triggers, or implemented at the application
level); in that case autovacuum will be able to clean dead entries from
that table's indexes quickly.
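A rough sketch of the trigger-based variant (untested, and the table/column
names and status codes are only taken from your description) could look
like this:

```
-- Hypothetical side table holding only the pending tasks.
CREATE TABLE tasks_pending (
    task_id bigint PRIMARY KEY,
    created timestamptz NOT NULL,
    updated timestamptz NOT NULL
);

CREATE OR REPLACE FUNCTION tasks_pending_sync() RETURNS trigger AS $$
BEGIN
    IF NEW.status NOT IN (3, 4, 5) THEN
        -- task is still pending: add it or refresh its timestamp
        INSERT INTO tasks_pending (task_id, created, updated)
        VALUES (NEW.id, NEW.created, NEW.updated)
        ON CONFLICT (task_id) DO UPDATE SET updated = EXCLUDED.updated;
    ELSE
        -- task reached a final status: drop it from the tiny table
        DELETE FROM tasks_pending WHERE task_id = NEW.id;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tasks_pending_sync
AFTER INSERT OR UPDATE ON tasks
FOR EACH ROW EXECUTE FUNCTION tasks_pending_sync();
```

The monitoring query then reads from tasks_pending (joining back to tasks
if extra columns are needed), and autovacuum can keep that tiny table clean
with negligible effort.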
Kind Regards,
Maxim
--
Maxim Boguk
Senior Postgresql DBA
Phone UA: +380 99 143
Phone AU: +61 45 218 5678
Re: Queries containing ORDER BY and LIMIT started to work slowly
On Wed, Aug 30, 2023 at 1:31 PM Rondat Flyag wrote:

> Hi and thank you for the response.
>
> I tried VACUUM ANALYZE for three tables, but without success. I also tried
> to set enable_seqscan=off and the query took even more time. If I set
> enable_sort=off then the query takes a lot of time and I cancel it.

Maybe you could restore (to a temp server, not the production) a physical
backup taken from before the change happened, and get an old plan that way.
I'm guessing that somehow an index got dropped around the same time you took
the dump.

That might be a lot of work, and maybe it would just be easier to optimize
the current query while ignoring the past. But you seem to be interested in
a root-cause analysis, and I don't see any other way to do one of those.

What I would expect to be the winning plan would be something sort-free like:

Limit
  merge join
    index scan yielding books in asin order (already being done)
    nested loop
      index scan yielding asins in value order
      index scan probing asins_statistics driven by asins_statistics.asin_id = asins.id

Or possibly a 2nd nested loop rather than the merge join just below the
limit, but with the rest the same.

In addition to the "books" index already evident in your current plan, you
would also need an index leading with asins_statistics.asin_id, and one
leading with asins.value. But if all those indexes exist, it is hard to see
why setting enable_seqscan=off wouldn't have forced them to be used.

Cheers,

Jeff
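P.S. For concreteness, the two extra indexes mentioned above would be
something along these lines (index names are made up, and the column names
are assumed from the plan sketch):

```
CREATE INDEX asins_statistics_asin_id_idx ON asins_statistics (asin_id);
CREATE INDEX asins_value_idx ON asins (value);
```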
Re: Index bloat and REINDEX/VACUUM optimization for partial index
Thanks Maxim, that's something we are considering now: keep the in-progress
tasks in one table and periodically move the old and completed tasks to an
archive table.
We could use a view that unions them for most queries.
I'm not sure if that's the best alternative though, and we want to know if
there are any gotchas to worry about.
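The kind of view we have in mind would be roughly the following (the table
names are placeholders, and both tables would need the same column list):

```
CREATE VIEW all_tasks AS
SELECT * FROM tasks_active
UNION ALL
SELECT * FROM tasks_archive;
```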
On Thu, Aug 31, 2023, 8:06 AM Maxim Boguk wrote:
>
> At any moment, there are *around 1000-1500 tasks in pending statuses*
>> (Init + InProgress) out of around 500 million tasks.
>>
>> Now, we have a task monitoring query that will look for all pending tasks
>> that have not received any update in the last n minutes.
>>
>> ```
>> SELECT [columns list]
>> FROM tasks
>> WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days'
>> AND updated < NOW() - interval '30 minutes'
>> ```
>>
>> Since we are only interested in the pending tasks, I created a partial
>> index
>> `*"tasks_pending_status_created_type_idx" btree (status, created,
>> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>>
>> This worked great initially, however this started to get bloated very
>> very quickly because, every task starts in pending state, gets multiple
>> updates (and many of them are not HOT updates, working on optimizing fill
>> factor now), and eventually gets deleted from the index (as status changes
>> to success).
>>
>
> From my experience I suspect that there is a problem with "of around 500
> million tasks."
> Autovacuum indeed cleans old dead index entries, but how many such dead
> index entries will be collected on the 500M table before autovacuum kicks
> in?
>
> With the default value of autovacuum_vacuum_scale_factor (The default is
> 0.2 (20% of table size).) index will collect like 100M outdated/dead index
> entries before autovacuum kicks in and cleans them all (in a worst case),
> and of course it will lead to huge index bloat and awful performance.
>
> Even if you scale down autovacuum_vacuum_scale_factor to some
> unreasonable low value like 0.01, the index still bloats to the 5M dead
> entries before autovacuum run, and constant vacuuming of a huge 500M table
> will put a huge load on the database server.
>
> Unfortunately there is no easy way out of this situation from database
> side, in general I recommend not trying to implement a fast pacing queue
> like load inside of a huge and constantly growing table, it never works
> well because you cannot keep up partial efficient indexes for the queue in
> a clean/non-bloated state.
>
> In my opinion the best solution is to keep list of entries to process
> ("*around
> 1000-1500 tasks in pending statuses")* duplicated in the separate tiny
> table (via triggers or implement it on the application level), in that case
> autovacuum will be able quickly clean dead entries from the index.
>
> Kind Regards,
> Maxim
>
>
> --
> Maxim Boguk
> Senior Postgresql DBA
>
> Phone UA: +380 99 143
> Phone AU: +61 45 218 5678
>
>
Re: Index bloat and REINDEX/VACUUM optimization for partial index
On Wed, Aug 30, 2023 at 8:43 PM jayaprabhakar k wrote:

> On Tue, Aug 29, 2023, 12:43 PM Jeff Janes wrote:
>
>> On Mon, Aug 28, 2023 at 8:33 PM jayaprabhakar k wrote:
>>
>>> Since we are only interested in the pending tasks, I created a partial
>>> index
>>> `*"tasks_pending_status_created_type_idx" btree (status, created,
>>> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>>
>> This looks like a poorly designed index. Since the status condition
>> exactly matches the index where clause, there is no residual point in
>> having "status" be the first column in the index, it can only get in the
>> way (for this particular query). Move it to the end, or remove it
>> altogether.
>
> Interesting. I don't understand why it will get in the way. Unfortunately
> we have a few other cases where status is used in a filter. That said, I
> will consider how to get this to work.
>
> Would removing status from the index columns improve the HOT update %? For
> example, changing status from 1->2 doesn't change anything on the index
> (assuming other criteria for HOT updates are met), but I am not sure how
> the implementation works.

No, changes to the status column will not qualify as HOT updates, even if
status is only in the WHERE clause and not the index body. I don't know if
there is a fundamental reason that those can't be done as HOT, or if it is
just an optimization that no one has implemented.

>> Within the tuples which pass the status check, which inequality is more
>> selective, the "created" one or the "updated" one?
>
> Obviously updated time is more selective (after status), and the created
> time is included only to exclude some bugs in our system that had left some
> old tasks stuck in progress (and for sorting). We do try to clean up
> occasionally, but not each time.

If "created" were the leading column in the index, then the scan could jump
directly to the part of the index which meets the `created > ...` condition,
without having to scroll through all the other entries and throw them out
one by one. But it sounds like there are so few of them that being able to
skip them wouldn't be worth very much.

> However we cannot add an index on the `updated` column because that
> timestamp gets updated over 10x on average for each task. If any index
> uses a column, then updates to that column will not be HOT, and every
> index needs to be updated. That will clearly add bloat to every index.
> Did I miss something?

Why does it get updated so much? It seems like status should go from 1 to 2,
then from 2 to 3, 4, or 5, and then be done. So only 2 updates, not 10.
Maybe the feature which needs this frequent update could be done in some
other way which is less disruptive.

But anyway, PostgreSQL has features to prevent the index bloat from becoming
too severe of a problem, and you should figure out why they are not working
for you. The most common ones I know of are 1) long open snapshots
preventing clean up, 2) all index scans being bitmap index scans, which
don't do micro-vacuuming/index hinting the way ordinary btree index scans
do, and 3) running the queries on a hot standby, where index hint bits must
be ignored. If you could identify and solve this issue, then you wouldn't
need to twist yourself into knots avoiding non-HOT updates.

Cheers,

Jeff
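P.S. To make "move it to the end, or remove it altogether" concrete, the
reworked partial index could look something like this (hypothetical name,
same predicate as before):

```
-- status remains usable for this query through the WHERE clause;
-- it no longer needs to be the leading key column.
CREATE INDEX tasks_pending_created_type_idx
    ON tasks (created, task_type)
    WHERE status <> ALL (ARRAY[3, 4, 5]);
```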
Re: Index bloat and REINDEX/VACUUM optimization for partial index
On Thu, Aug 31, 2023 at 11:06 AM Maxim Boguk wrote:
> With the default value of autovacuum_vacuum_scale_factor (The default is
> 0.2 (20% of table size).) index will collect like 100M outdated/dead index
> entries before autovacuum kicks in and cleans them all (in a worst case),
> and of course it will lead to huge index bloat and awful performance.
>
Index bloat doesn't automatically lead to awful performance. There must be
some additional factor at play.
> Even if you scale down autovacuum_vacuum_scale_factor to some
> unreasonable low value like 0.01, the index still bloats to the 5M dead
> entries before autovacuum run, and constant vacuuming of a huge 500M table
> will put a huge load on the database server.
>
For this type of situation, I would generally set
autovacuum_vacuum_scale_factor to 0, and use autovacuum_vacuum_threshold to
drive the vacuuming instead. But I'd make those changes just on the queue
table(s), not system wide. Thanks to the visibility map, vacuuming the table
itself does not have to generate much load, because the stable part of the
table can simply be skipped. The problem is that each index still needs to
be read entirely for each vacuum cycle, which would not be much of a
problem for the partial indexes, but certainly could be for the full
indexes. There are some very recent improvements in this area, but I don't
think they can be applied selectively to specific indexes.
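As a sketch, the per-table override would be something like this (the
threshold value is only illustrative):

```
-- Drive autovacuum of the queue table by an absolute dead-row count
-- rather than by a fraction of the 500M-row table.
ALTER TABLE tasks SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold = 100000
);
```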
>
> Unfortunately there is no easy way out of this situation from database
> side, in general I recommend not trying to implement a fast pacing queue
> like load inside of a huge and constantly growing table, it never works
> well because you cannot keep up partial efficient indexes for the queue in
> a clean/non-bloated state.
>
> In my opinion the best solution is to keep list of entries to process
> ("*around
> 1000-1500 tasks in pending statuses")* duplicated in the separate tiny
> table (via triggers or implement it on the application level), in that case
> autovacuum will be able quickly clean dead entries from the index.
>
You should be able to use declarative partitioning to separate the "final"
tuples from the "active" tuples, to get the same benefit but with less work.
Cheers,
Jeff
