On Thu, Mar 19, 2026 at 4:38 PM Robert Haas <[email protected]> wrote:
> schnauzer's got a failure that looks like this:
>
> +WARNING: supplied plan advice was not enforced
> +DETAIL: advice SEQ_SCAN(pg_trigger@exists_1) feedback is "matched, failed"
> +advice NO_GATHER(pg_trigger@exists_1) feedback is "matched, failed"
>
> And an older run on skink has a failure that looks like this:
>
> +WARNING: supplied plan advice was not enforced
> +DETAIL: advice SEQ_SCAN(pg_trigger@exists_6) feedback is "matched, failed"
> +advice NO_GATHER(pg_trigger@exists_6) feedback is "matched, failed"
I spent a bunch of time looking into this. I don't have a definitive answer to the question of what to do about it yet, but I wanted to write down some initial thoughts.

First, I think that I broke fix_alternative_subplan when I added disabled_nodes. Before, when disabling things just affected the cost, the logic here would have taken what was disabled into account in picking between alternatives. Now it doesn't, because I didn't add a disabled_nodes field to SubPlan. You can see it breaking with this test case:

CREATE TABLE t1 (a int);
CREATE TABLE t2 (a int);
CREATE INDEX ON t2(a);
INSERT INTO t1 SELECT generate_series(1, 1000);
INSERT INTO t2 SELECT generate_series(1, 100000);
ANALYZE;
EXPLAIN (VERBOSE, COSTS ON) SELECT * FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a) OR t1.a < 0;
SET enable_seqscan = off;
SET enable_indexonlyscan = off;
EXPLAIN (VERBOSE, COSTS ON) SELECT * FROM t1
    WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.a = t1.a) OR t1.a < 0;

With the currently-committed code, you get a hashed SubPlan both times, and it's just disabled in the second case. But there's a perfectly good non-hashed variant that is not disabled: scan t2 using a parameterized Index Scan. So that sucks. I have a small patch to fix this, which I'll post later. I don't think we can or should do anything about this in released branches, but we should fix it in master regardless of what happens to pg_plan_advice or test_plan_advice.

Second, in terms of actually fixing the problem, I think the issue here is that the scope I chose for pg_plan_advice doesn't quite fit the goal of making test_plan_advice the canonical way of testing this code. In order to keep this project manageable in size, I decided that, for the first version, pg_plan_advice wasn't going to try to control anything above the level of scan/join planning, so for example we don't care what kind of aggregation the planner chooses.
That should be OK for test_plan_advice, because test_plan_advice works by checking if all the advice applied cleanly, and since we didn't emit any advice about the aggregation method in the first place, there can't be any problem with applying it later. In other words, if the aggregation method chosen does flip between consecutive planning cycles, it should not cause a test_plan_advice test failure. This same general principle applies to a bunch of other cases too, like set operation planning: if there's more than one way to do it, the planner can change its mind and test_plan_advice should not care.

However, there's an important exception: if something changes above the level of scan/join planning that affects what rels are involved in scan/join planning, then a plan change will cause a test_plan_advice failure. The failures above are of that type: the way the AlternativeSubPlan machinery works is that the query gets copied and replanned, and each plan has a separate plan name. So the final plan has either pg_trigger@exists_5 or pg_trigger@exists_6 in it, but never both.

There's one other case that I think is similar, which is the MinMaxAggPath stuff: if we choose the MinMaxAggPath, then the final plan will have something like t1@minmax_1 where it otherwise would have had just t1, or perhaps t1@minmax_1 instead of t1@something_else. These cases have something in common, which is that they are the only cases where we make a new PlannerInfo to try replanning part of the query in a second way. That's the pattern that causes breakage here.

I would be remiss if I did not mention that Jakub Wartak was poking at me about the MinMaxAggPath case a while back, but I dismissed it as out of scope, which was accurate, but that's because I didn't think at the time that it would destabilize test_plan_advice. Now, I think it could, although I don't think we have seen any such failures in the buildfarm yet.
Perhaps a concurrent statistics change is more likely to flip the hashed/non-hashed SubPlan decision than the choice of whether to use MinMaxAggPath.

One approach that I considered here is to try to unify the "sibling" relations. If the final plan is bound to contain either pg_trigger@exists_5 or pg_trigger@exists_6, maybe the advice shouldn't actually think there are two separate things. Maybe instead of subqueries exists_1 ... exists_6 we should end up with subqueries exists_1 ... exists_3, with each name used twice. That amounts to deciding that the patch to give each subplan a unique name got it wrong.

While this idea has some appeal, ultimately I think it's a loser, because addressing the problem this way wouldn't actually fix the test_plan_advice instability we're currently seeing, or at least not necessarily. For example, in the test case earlier in this email, INDEX_SCAN(t2@whatever) can only be applied if the non-hashed subplan is chosen, because we won't consider a plain index scan to even be a possibility for a full table scan. An index-only scan is considered, but not a plain index scan. This means that even if we flattened the rels in each pair of siblings together and called each pair by the same name, we could still see failures to apply advice cleanly. Moreover, that's assuming that optimizations like self-join elimination, left join removal, and partition pruning always apply in the same way to both copies, which doesn't sound like a safe assumption at all.

A second approach is to try to control the initial conditions of the two planning cycles better. Possibly just running the tests serially instead of in parallel would get that done, but that seems too slow to consider and, even if we did it anyway, I'm not all that sure that an autoanalyze or autovacuum in the background couldn't mess us up.
Or, alternatively, if we could keep the second planning cycle from seeing different statistics than the first, I think that would do it, but I don't think there's a feasible way of doing that.

So I'm left with the idea that to get test_plan_advice to be fully stable on these slower machines, it will probably be necessary to make it control which AlternativeSubPlan is chosen and whether a MinMaxAggPath is chosen or not. I have some ideas about how to accomplish that in a reasonably elegant fashion without adding too much new machinery, but I need to spend some more time validating those ideas before committing to a precise course of action. More soon.

--
Robert Haas
EDB: http://www.enterprisedb.com
