On Thu, Mar 19, 2026 at 4:38 PM Robert Haas <[email protected]> wrote:
> schnauzer's got a failure that looks like this:
>
> +WARNING:  supplied plan advice was not enforced
> +DETAIL:  advice SEQ_SCAN(pg_trigger@exists_1) feedback is "matched, failed"
> +advice NO_GATHER(pg_trigger@exists_1) feedback is "matched, failed"
>
> And an older run on skink has a failure that looks like this:
>
> +WARNING:  supplied plan advice was not enforced
> +DETAIL:  advice SEQ_SCAN(pg_trigger@exists_6) feedback is "matched, failed"
> +advice NO_GATHER(pg_trigger@exists_6) feedback is "matched, failed"

I spent a bunch of time looking into this. I don't have a definitive
answer to the question of what to do about it yet, but I wanted to
write down some initial thoughts.

First, I think that I broke fix_alternative_subplan when I added
disabled_nodes. Before, when disabling things just affected the cost,
the logic here would have taken what was disabled into account in
picking between alternatives. Now it doesn't, because I didn't add a
disabled_nodes field to SubPlan. You can see it breaking with this
test case:

CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
CREATE INDEX ON t2(a);
INSERT INTO t1 SELECT generate_series(1, 1000);
INSERT INTO t2 SELECT generate_series(1, 100000);
ANALYZE;

EXPLAIN (VERBOSE, COSTS ON)
SELECT * FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a) OR t1.a < 0;

SET enable_seqscan = off;
SET enable_indexonlyscan = off;

EXPLAIN (VERBOSE, COSTS ON)
SELECT * FROM t1
WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a) OR t1.a < 0;

With the currently-committed code, you get a hashed SubPlan both
times, and it's just disabled in the second case. But there's a
perfectly good non-hashed variant that is not disabled: scan t2 using
a parameterized Index Scan. So that sucks. I have a small patch to fix
this, which I'll post later. I don't think we can or should do
anything about this in released branches, but we should fix it in
master regardless of what happens to pg_plan_advice or
test_plan_advice.
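To make the intended fix concrete, here is a minimal standalone sketch
of the comparison I have in mind, assuming SubPlan grows a
disabled_nodes field that gets compared before cost, the same way path
comparison treats disabled_nodes on master. The struct and function
here are illustrative stand-ins, not the actual planner code:

```c
/* Simplified stand-in for PostgreSQL's SubPlan; fields are illustrative. */
typedef struct SubPlan
{
    int     disabled_nodes;  /* hypothetical new field the fix would add */
    double  startup_cost;
    double  per_call_cost;
} SubPlan;

/*
 * Pick the better of two alternative subplans: fewer disabled nodes wins
 * outright; only on a tie do we fall back to comparing estimated total
 * cost for the expected number of calls.
 */
static const SubPlan *
choose_alternative(const SubPlan *a, const SubPlan *b, double num_calls)
{
    if (a->disabled_nodes != b->disabled_nodes)
        return (a->disabled_nodes < b->disabled_nodes) ? a : b;

    {
        double  cost_a = a->startup_cost + a->per_call_cost * num_calls;
        double  cost_b = b->startup_cost + b->per_call_cost * num_calls;

        return (cost_a <= cost_b) ? a : b;
    }
}
```

With enable_seqscan off, the hashed subplan's full table scan would
carry disabled nodes, so the non-hashed variant using the parameterized
index scan wins even though its per-call cost model differs.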

Second, in terms of actually fixing the problem, I think the issue
here is that the scope I chose for pg_plan_advice doesn't quite fit
the goal of making test_plan_advice the canonical way of testing this
code. In order to keep this project manageable in size, I decided
that, for the first version, pg_plan_advice wasn't going to try to
control anything above the level of scan/join planning, so for example
we don't care what kind of aggregation the planner chooses. That
should be OK for test_plan_advice, because test_plan_advice works by
checking if all the advice applied cleanly, and since we didn't emit
any advice about the aggregation method in the first place, there
can't be any problem with applying it later. In other words, if the
aggregation method chosen does flip between consecutive planning
cycles, it should not cause a test_plan_advice test failure. This same
general principle applies to a bunch of other cases too, like set
operation planning: if there's more than one way to do it, the planner
can change its mind and test_plan_advice should not care. However,
there's an important exception: if something changes above the level
of scan/join planning that affects what rels are involved in scan/join
planning, then a plan change will cause a test_plan_advice failure.
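As a toy model of that principle, the check test_plan_advice
effectively performs might look like the following; the Advice struct
and its flags are purely illustrative, not the extension's real data
structures:

```c
/*
 * Toy model of test_plan_advice's check: each piece of advice records
 * whether it matched anything in the final plan and, if so, whether it
 * was enforced. Names are illustrative, not the extension's actual API.
 */
typedef struct Advice
{
    const char *tag;        /* e.g. "SEQ_SCAN(pg_trigger@exists_1)" */
    int         matched;
    int         enforced;
} Advice;

static int
advice_all_enforced(const Advice *items, int n)
{
    for (int i = 0; i < n; i++)
    {
        /*
         * Advice that never matched any rel is ignored, so advice about
         * things we never emitted (like aggregation strategy) can't
         * fail.  Advice that matched but was not enforced is what
         * produces the "matched, failed" feedback in the buildfarm logs.
         */
        if (items[i].matched && !items[i].enforced)
            return 0;
    }
    return 1;
}
```

The important consequence is the one above: if a plan change renames or
removes a rel that the advice targets, the advice stops matching
cleanly and the warning fires.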

The failures above are of that type: the way the AlternativeSubPlan
machinery works is that the query gets copied and replanned, and each
plan has a separate plan name. So the final plan has either
pg_trigger@exists_5 or pg_trigger@exists_6 in it, but never both.
There's one other case that I think is similar, which is the
MinMaxAggPath stuff: if we choose the MinMaxAggPath, then the final
plan will have something like t1@minmax_1 where it otherwise would
have had just t1, or perhaps t1@minmax_1 instead of t1@something_else.
These cases have something in common, which is that they are the only
cases where we make a new PlannerInfo to try replanning part of the
query in a second way. That's the pattern that causes breakage here. I
would be remiss if I did not mention that Jakub Wartak was poking at
me about the MinMaxAggPath case a while back, but I dismissed it as
out of scope, which was accurate, but that's because I didn't think at
the time that it would destabilize test_plan_advice. Now, I think it
could, although I don't think we have seen any such failures in the
buildfarm yet. Perhaps a concurrent statistics change is more likely
to flip the hashed/non-hashed SubPlan decision than the choice of
whether to use MinMaxAggPath.

One approach that I considered here is to try to unify the "sibling"
relations. If the final plan is bound to contain either
pg_trigger@exists_5 or pg_trigger@exists_6, maybe the advice shouldn't
actually think there are two separate things. Maybe instead of
subqueries exists_1 ... exists_6 we should end up with subqueries
exists_1 ... exists_3, with each name used twice. That amounts to
deciding that the patch to give each subplan a unique name got it
wrong. While this idea has some appeal, ultimately I think it's a
loser, because addressing the problem this way wouldn't actually fix
the test_plan_advice instability we're currently seeing, or at least
not necessarily. For example, in the test case earlier in this email,
INDEX_SCAN(t2@whatever) can only be applied if the non-hashed subplan
is chosen, because we won't even consider a plain index scan as a
possibility for a full table scan. An index-only scan is considered,
but not a plain index scan. This means that even if we flattened the
rels in each pair of siblings together and called each pair by the
same name, we could still see failures to apply advice cleanly.
Moreover, that's assuming that optimizations like self-join
elimination, left join removal, and partition pruning always apply in
the same way to both copies, which doesn't sound like a safe
assumption at all.

A second approach is to try to control the initial conditions of the
two planning cycles better. Possibly just running the tests serially
instead of in parallel would get that done, but that seems too slow to
consider and, even if we did it anyway, I'm not all that sure that an
autoanalyze or autovacuum in the background couldn't mess us up. Or,
alternatively, if we could keep the second planning cycle from seeing
different statistics than the first, I think that would do it, but I
don't think there's a feasible way of doing that.

So I'm left with the idea that to get test_plan_advice to be fully
stable on these slower machines, it will probably be necessary to make
it control which AlternativeSubPlan is chosen and whether a
MinMaxAggPath is chosen or not. I have some ideas about how to
accomplish that in a reasonably elegant fashion without adding too
much new machinery, but I need to spend some more time validating
those ideas before committing to a precise course of action. More
soon.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

