GEQO and join_collapse_limit correlation

2018-01-05 Thread Juan José Santamaría Flecha
Hi,

We recently had an issue in production. We have queries that are
procedurally generated by an Object/Relational Mapping framework. Some of
these queries are huge, involving over 120 tables.

With the following parameters the planner seemed to be getting very bad
plans for some of these queries (times are from a single execution, but
they are in those orders of magnitude):


from_collapse_limit = 14
join_collapse_limit = 14
geqo_threshold = 14
geqo_effort= 5

(cost=14691360.79..81261293.30 rows=6 width=15934)

 Planning time: 3859.928 ms
 Execution time: 6883365.973 ms


If we raise the join_collapse_limit to a really high value the plans are
much better, but (of course) planning time gets worse:


from_collapse_limit = 150
join_collapse_limit = 150
geqo_threshold = 14
geqo_effort= 5

(cost=379719.44..562997.32 rows=7 width=15934)

 Planning time: 7112.416 ms
 Execution time: 7.741 ms


After some testing in order to lower the planning time we ended bringing
down the GEQO values, and we have the best results with:


from_collapse_limit = 150
join_collapse_limit = 150
geqo_threshold = 2
geqo_effort= 2

(cost=406427.86..589667.55 rows=6 width=15934)

 Planning time: 2721.099 ms
 Execution time: 22.728 ms


Issues with the join_collapse_limit have been discussed before [1], but
lowering the GEQO values seems counterintuitive based on the documentation
for this parameter [2]: "Setting this value [join_collapse_limit] to
geqo_threshold or more may trigger use of the GEQO planner, resulting in
non-optimal plans."

What we want to know is if this mechanisms are working as intended and we
can follow a similar approach in the future (lower GEQO values), or this is
just a fluke for a corner case.

I have been able to reproduce a similar behaviour, to a much smaller scale,
with the attached scripts in Postgres 10.

[1] https://www.postgresql.org/message-id/25845.1483809942%40sss.pgh.pa.us
[2] https://www.postgresql.org/docs/current/static/runtime-config-query.html


Regards,

Juan José Santamaría
CREATE TABLE master_tbl AS
SELECT s.a AS master_id FROM generate_series(1,1000) AS s(a);

ALTER TABLE master_tbl ADD PRIMARY KEY (master_id);

DO $$
DECLARE
i INT;
child_name CHAR(2);

BEGIN
FOR i IN 1..40 LOOP
 
child_name := TO_CHAR(i, 'fm00');
EXECUTE 'CREATE TABLE child_tbl' || child_name ||
' AS SELECT s.a AS child_id, s.a%10 AS master_id
FROM generate_series(1,1) AS s(a)';

EXECUTE 'ALTER TABLE child_tbl' || child_name ||
' ADD PRIMARY KEY (child_id)';
EXECUTE 'CREATE INDEX idx_child_tbl' || child_name ||
' ON child_tbl' || child_name || '(master_id)';
 
END LOOP;
END;
$$ LANGUAGE plpgsql;

VACUUM;
set session max_parallel_workers = 0;

set session join_collapse_limit = 8;
set session geqo_threshold = 12;
set session geqo_effort = 5;

SELECT master_tbl.*
FROM master_tbl
JOIN child_tbl01 ON master_tbl.master_id = child_tbl01.master_id
JOIN child_tbl02 ON child_tbl01.child_id = child_tbl02.master_id
JOIN child_tbl03 ON child_tbl02.child_id = child_tbl03.master_id
JOIN child_tbl04 ON child_tbl03.child_id = child_tbl04.master_id
JOIN child_tbl05 ON child_tbl04.child_id = child_tbl05.master_id
JOIN child_tbl06 ON child_tbl05.child_id = child_tbl06.master_id
JOIN child_tbl07 ON child_tbl06.child_id = child_tbl07.master_id
JOIN child_tbl08 ON child_tbl07.child_id = child_tbl08.master_id
JOIN child_tbl09 ON child_tbl08.child_id = child_tbl09.master_id
JOIN child_tbl10 ON child_tbl09.child_id = child_tbl10.master_id

JOIN child_tbl11 ON child_tbl10.child_id = child_tbl11.master_id
JOIN child_tbl12 ON child_tbl11.child_id = child_tbl12.master_id
JOIN child_tbl13 ON child_tbl12.child_id = child_tbl13.master_id
JOIN child_tbl14 ON child_tbl13.child_id = child_tbl14.master_id
JOIN child_tbl15 ON child_tbl14.child_id = child_tbl15.master_id
JOIN child_tbl16 ON child_tbl15.child_id = child_tbl16.master_id
JOIN child_tbl17 ON child_tbl16.child_id = child_tbl17.master_id
JOIN child_tbl18 ON child_tbl17.child_id = child_tbl18.master_id
JOIN child_tbl19 ON child_tbl18.child_id = child_tbl19.master_id
JOIN child_tbl20 ON child_tbl19.child_id = child_tbl20.master_id

JOIN child_tbl21 ON child_tbl20.child_id = child_tbl21.master_id
JOIN child_tbl22 ON child_tbl21.child_id = child_tbl22.master_id
JOIN child_tbl23 ON child_tbl22.child_id = child_tbl23.master_id
JOIN child_tbl24 ON child_tbl23.child_id = child_tbl24.master_id
JOIN child_tbl25 ON child_tbl24.child_id = child_tbl25.master_id
JOIN child_tbl26 ON child_tbl25.child_id = child_tbl26.master_id
JOIN child_tbl27 ON child_tbl26.child_id = child_tbl27.master_id
JOIN child_tbl28 ON child_tbl27.child_id = child_tbl28.master_id
JOIN child_tbl29 ON child_tbl28.child_id = child_tbl29.master_id
JOIN child_tbl30 ON child_tbl29.child_id = child_tbl30.master_id

JOIN child_tbl31 ON child_tbl20.child_id = child_tbl31.master_id
JOIN child_tbl32 ON child_tbl

Re: GEQO and join_collapse_limit correlation

2018-01-05 Thread Tom Lane
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?=  
writes:
> We recently had an issue in production. We have queries that are
> procedurally generated by an Object/Relational Mapping framework. Some of
> these queries are huge, involving over 120 tables.

Yeah, you're going to have problems with that :-(

> After some testing in order to lower the planning time we ended bringing
> down the GEQO values, and we have the best results with:

> from_collapse_limit = 150
> join_collapse_limit = 150
> geqo_threshold = 2
> geqo_effort= 2

Hmm.  The trouble with this approach is that you're relying on GEQO
to find a good plan, and that's only probabilistic --- especially so
when you're reducing geqo_effort, meaning it doesn't try as many
possibilities as it otherwise might.  Basically, therefore, the
fear is that every so often you'll get a bad plan.

If the queries are fairly stylized, you might be able to get good 
results by exploiting rather than bypassing join_collapse_limit:
determine what a good join order is, and then write the FROM clause
as an explicit JOIN nest in that order, and then *reduce* not raise
join_collapse_limit to force the planner to follow the syntactic
join order.  In this way you'd get rid of most of the run-time
join order search effort.  Don't know how cooperative your ORM
would be with such an approach though.

regards, tom lane



Re: GEQO and join_collapse_limit correlation

2018-01-05 Thread Juan José Santamaría Flecha
Hi,

> After some testing in order to lower the planning time we ended bringing
> > down the GEQO values, and we have the best results with:
>
> > from_collapse_limit = 150
> > join_collapse_limit = 150
> > geqo_threshold = 2
> > geqo_effort= 2
>
> Hmm.  The trouble with this approach is that you're relying on GEQO
> to find a good plan, and that's only probabilistic --- especially so
> when you're reducing geqo_effort, meaning it doesn't try as many
> possibilities as it otherwise might.  Basically, therefore, the
> fear is that every so often you'll get a bad plan.
>

What we felt odd was having to find a balance between geqo_threshold and
join_collapse_limit, lowering one was only effective after raising the
other. The geqo_effort was only mofidied after we found this path, and some
more testing.

In an environment with geqo_threshold=1 and join_collapse_limit=1, would
the planner be GEQO exclusive (and syntactic)?

If the queries are fairly stylized, you might be able to get good
> results by exploiting rather than bypassing join_collapse_limit:
> determine what a good join order is, and then write the FROM clause
> as an explicit JOIN nest in that order, and then *reduce* not raise
> join_collapse_limit to force the planner to follow the syntactic
> join order.  In this way you'd get rid of most of the run-time
> join order search effort.  Don't know how cooperative your ORM
> would be with such an approach though.
>

The ORM seems to build the join path just the other way round of what would
be good for the planner. The thing we should take a good look at if it is
really needed looking at +120 tables for a query that gets a pretty trivial
result, but that is completely off topic.


> regards, tom lane
>

Thanks for your repply.

Regards,

Juan José Santamaría


Re: GEQO and join_collapse_limit correlation

2018-01-05 Thread Tom Lane
=?UTF-8?Q?Juan_Jos=C3=A9_Santamar=C3=ADa_Flecha?=  
writes:
> In an environment with geqo_threshold=1 and join_collapse_limit=1, would
> the planner be GEQO exclusive (and syntactic)?

GEQO's only function, basically, is to search for the join order to use.
If you're constraining the join order completely with
join_collapse_limit=1 then forcing the GEQO path to be taken would just
add pointless overhead.  (If it does anything at all ... I don't remember
the logic exactly but we might be bright enough not to bother with GEQO in
such a situation, regardless of geqo_threshold.)

regards, tom lane