GEQO plans much slower than standard join plans

2025-10-27 Thread Carlo Sganzerla
Hello experts,

Recently, we've been encountering poor query plans on our database workload.
We've successfully diagnosed the cause as reaching the default limit of 8
for join_collapse_limit.

We've already dealt with the situation successfully. However, I've been
discussing with some colleagues the possibility of raising the
join_collapse_limit parameter value (and from_collapse_limit with it) in
our production environment. I was trying to determine what would be a
sensible new value when I stumbled upon this thread on OLTP Star Joins:
https://www.postgresql.org/message-id/1ea167aa-457d-422a-8422-b025bb660ef3%40vondra.me
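
For concreteness, the kind of change we've been considering looks roughly
like this - the value 12 below is just a placeholder, and picking the right
number is exactly the open question:

  -- Per session, to test individual queries:
  SET join_collapse_limit = 12;
  SET from_collapse_limit = 12;

  -- Or cluster-wide:
  ALTER SYSTEM SET join_collapse_limit = 12;
  ALTER SYSTEM SET from_collapse_limit = 12;
  SELECT pg_reload_conf();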

Our data model is hierarchical, so we have chains of child tables rather than
the dimension tables mentioned above. I created a script similar to the one
found on the OLTP Star Join thread to test the effects of different values of
join_collapse_limit locally, on a hierarchical model closer to ours. I ran
pgbench with 30 threads and 30 clients for 5 seconds. The following results
were obtained on my machine (AMD Ryzen 7 5700U with 8 cores and 32 GB RAM,
PostgreSQL 17.5 (Debian 17.5-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by
gcc (Debian 12.2.0-14) 12.2.0, 64-bit):
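
For reference (the attached scripts are the authoritative version - the names
below are simplified), the "tree join" schema chains each level to its parent,
and the benchmark query joins the whole chain by key:

  -- Rough sketch of treejoin-create.sql (the 14-level variant just adds
  -- more levels): each table references the level above it.
  CREATE TABLE t1 (id bigint PRIMARY KEY, val int);
  CREATE TABLE t2 (id bigint PRIMARY KEY, t1_id bigint REFERENCES t1, val int);
  -- ... and so on up to t10 (or t14).

  -- The benchmark script then runs a query shaped like:
  --   SELECT * FROM t10
  --     JOIN t9 ON t9.id = t10.t9_id
  --     ...
  --     JOIN t1 ON t1.id = t2.t1_id
  --    WHERE t10.id = :id;

  -- and was driven with pgbench, e.g.:
  --   pgbench -n -c 30 -j 30 -T 5 -f treejoin-10-dims.sql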

OLTP starjoin (10 dimensions)
- join_collapse_limit = 1: 6.4k TPS
- join_collapse_limit = 8: 2.8k TPS
- join_collapse_limit = 11: 400 TPS

Tree join (10 levels)
- join_collapse_limit = 1:  7.4k TPS
- join_collapse_limit = 8: 4.7k TPS
- join_collapse_limit = 11:  3.9k TPS

Tree join (14 levels)
- join_collapse_limit = 1:  5.1k TPS
- join_collapse_limit = 8:  3.4k TPS
- join_collapse_limit = 11:  3k TPS
- join_collapse_limit = 14, geqo = off:  2.2k TPS
- join_collapse_limit = 14, geqo = on: 200 TPS

Note: geqo_threshold was unchanged from the default (12)

I assume that the reason why the hierarchical "tree join" is much faster is
due to the dependencies among tables, so the standard join search has a
much narrower range of possible query paths compared to the OLTP Star Join
case. What surprised me, however, is that when GEQO is turned on, the TPS
falls dramatically. Given that the documentation states that GEQO "...
reduces planning time for complex queries (those joining many relations),
at the cost of producing plans that are sometimes inferior to those found
by the normal exhaustive-search algorithm", it made me wonder what could be
the cause of this much slower planning. I'm not really familiar with
genetic algorithms, so perhaps I might be missing something, but is this
kind of planning performance hit normal when GEQO is on? I was hoping
someone could help us on this topic.

Regards,
Carlo


Attachments:
- treejoin-10-dims.sql
- treejoin-14-dims.sql
- treejoin-create.sql


Re: GEQO plans much slower than standard join plans

2025-10-27 Thread Tomas Vondra
On 10/27/25 19:17, Carlo Sganzerla wrote:
> 
> I assume that the reason why the hierarchical "tree join" is much faster
> is due to the dependencies among tables, so the standard join search has
> a much narrower range of possible query paths compared to the OLTP Star
> Join case. What surprised me, however, is that when GEQO is turned on,
> the TPS falls dramatically. Given that the documentation states that
> GEQO "... reduces planning time for complex queries (those joining many
> relations), at the cost of producing plans that are sometimes inferior
> to those found by the normal exhaustive-search algorithm", it made me
> wonder what could be the cause of this much slower planning. I'm not
> really familiar with genetic algorithms, so perhaps I might be missing
> something, but is this kind of planning performance hit normal when GEQO
> is on? I was hoping someone could help us on this topic.

I'm not particularly familiar with the GEQO internals, so I can't point
at specific issues. But I've heard from a couple experienced developers
that they consider GEQO ineffective / not the right approach.

However, I don't think you need detailed knowledge of the GEQO internals
to explain the observed behavior.

The regular (non-GEQO) planner explores more or less all possible join
orders - we don't construct all of them thanks to dynamic programming, but
we only skip orders that we evaluate as not interesting. That's not 100%
true, though, due to join_collapse_limit (which splits the problem into
smaller subproblems, and limits which part of the search space we really
search).
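
A toy example illustrates the effect (the tables a, b, c are made up, they
just need to be joinable):

  CREATE TEMP TABLE a (id int PRIMARY KEY, val int);
  CREATE TEMP TABLE b (id int PRIMARY KEY, a_id int, val int);
  CREATE TEMP TABLE c (id int PRIMARY KEY, b_id int, val int);

  -- With join_collapse_limit = 1 the planner keeps the join order exactly
  -- as written in the query; with a higher value it may collapse the JOINs
  -- and consider other orders (a correspondingly larger search space).
  SET join_collapse_limit = 1;
  EXPLAIN SELECT * FROM a JOIN b ON b.a_id = a.id JOIN c ON c.b_id = b.id;

  SET join_collapse_limit = 8;
  EXPLAIN SELECT * FROM a JOIN b ON b.a_id = a.id JOIN c ON c.b_id = b.id;
  RESET join_collapse_limit;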

The idea of GEQO is that it reduces the search space even more, and it
explores only a very small fraction of join orders. The idea was to do
that in a smart way to still produce "quality" join orders, but the
experience is it's not reliable.

It shouldn't be difficult to demonstrate this using EXPLAIN. If you look
at the plans with/without geqo, I'd bet the geqo=on plans will have a much
higher cost (which would show that geqo fails to find many of the "good"
plans). Of course, the execution might still be fast.
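
For example, with the same toy tables as above (on your 14-way join the
default geqo_threshold = 12 already triggers GEQO, so lowering it is only
needed for a small test query):

  SET geqo = off;
  EXPLAIN SELECT * FROM a JOIN b ON b.a_id = a.id JOIN c ON c.b_id = b.id;

  SET geqo = on;
  SET geqo_threshold = 2;  -- force GEQO even for this small join
  EXPLAIN SELECT * FROM a JOIN b ON b.a_id = a.id JOIN c ON c.b_id = b.id;
  RESET geqo;
  RESET geqo_threshold;

  -- Compare the top-level "cost=..." estimates of the two plans.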

Another question is whether the difference is in planning or execution.
I'd expect geqo=on to make planning faster and execution slower, but maybe
that's not true for your test. It shouldn't be difficult to verify using
pg_stat_statements (which tracks both planning and execution time).
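
For example (pg_stat_statements has to be in shared_preload_libraries, and
pg_stat_statements.track_planning = on - it's off by default):

  CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

  -- Average planning vs. execution time per statement:
  SELECT query,
         calls,
         round(mean_plan_time::numeric, 2) AS mean_plan_ms,
         round(mean_exec_time::numeric, 2) AS mean_exec_ms
    FROM pg_stat_statements
   ORDER BY mean_plan_time DESC
   LIMIT 10;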


regards

-- 
Tomas Vondra