On Thu, Mar 19, 2026 at 1:17 PM Robert Haas <[email protected]> wrote:
> > I would dig into why grison and schnauzer are failing this test,
> > except that I don't agree that we should be running it in the
> > first place.
>
> I'll go have a look.

grison failed like this:

 CREATE INDEX ON vaccluster(wrap_do_analyze(i));
+ERROR:  indexes on virtual generated columns are not supported

This is a surprising diff, because that command either is or is not
trying to create an index on a virtual generated column, and whichever
it is ought to be the same on every buildfarm machine, with or without
test_plan_advice. The problem might be that the first
TupleDescAttr(...) == ATTRIBUTE_GENERATED_VIRTUAL test in DefineIndex
can be reached with attno == 0, which seems like it will then access
memory that is not part of the TupleDesc. If that memory happens to
contain a 'v' in the right spot, this will happen. (Is it not viable
to have TupleDescAttr() assert that the second argument is >= 0?)

skink has a failure that looks like this:

+WARNING:  supplied plan advice was not enforced
+DETAIL:  advice NESTED_LOOP_MEMOIZE(nt) feedback is "matched, failed"

I think this is caused by a minor bug in the pgs_mask infrastructure.
get_memoize_path() exits quickly when outer_path->parent->rows < 2, on
the theory that the resulting path must lose on cost. But that
presumes that we could do a plain nested loop instead, i.e. that
PGS_NESTLOOP_PLAIN is set. And it might not be. Before the pgs_mask
stuff, that case couldn't happen: enable_nestloop=off disabled all
nested loops, and enable_memoize=off disabled only the memoized
version, but there wasn't any way to disable only the non-memoized
version (which, of course, was part of the point of the whole thing).
I think the fix is as attached.

Unfortunately, the other failures look like they are pointing to a
rather more serious problem. schnauzer's got a failure that looks like
this:

+WARNING:  supplied plan advice was not enforced
+DETAIL:  advice SEQ_SCAN(pg_trigger@exists_1) feedback is "matched, failed"
+advice NO_GATHER(pg_trigger@exists_1) feedback is "matched, failed"

And an older run on skink has a failure that looks like this:

+WARNING:  supplied plan advice was not enforced
+DETAIL:  advice SEQ_SCAN(pg_trigger@exists_6) feedback is "matched, failed"
+advice NO_GATHER(pg_trigger@exists_6) feedback is "matched, failed"

What these failures have in common is that both of them involve
selecting from information_schema.views. What "matched, failed" means
is that we saw the advice target during planning and tried to
influence the plan, but then observed that the final plan doesn't
respect the supplied advice. The query for information_schema.views
involves three subplans, so the fact that exists_6 is mentioned is a
strong hint that the AlternativeSubPlan machinery is in play. The query
is planned once, and one of the two alternative subplans is chosen,
generating advice for that plan. Upon replanning, the other alternative
is chosen, so the subplan for which we have plan advice doesn't appear
in the final plan at all, leading to this warning.

Now, you might
wonder how that's possible, considering that we're planning the same
query twice in a row with advice that isn't supposed to change
anything. My guess is that it's possible because these machines are
slow and other tests are running concurrently. If those other sessions
execute DDL, they can send sinval messages, which can cause the second
planning cycle to see different statistics than the first one. That
then means the plan can change in any way except for what the advice
system already knows how to control, and choice of AlternativeSubPlan
is not in that set of things. I believe I actually saw a failure
similar to this once or twice locally during development, but that was
back when the code had a lot of bugs, and I assumed it was caused by
some transient bug in whatever changes I was hacking on at the time, or
by some other bug that I fixed later, rather than being a real issue. I
think the reason it doesn't happen very often is that the statistics
have to change enough at just the right moment, and even on slower
buildfarm machines, most of the time, they don't.

It's not really clear to me exactly where to place the blame for this
category of failure. One view is that tests are being run in parallel
and I didn't think about that, and therefore this is a defect in the
test methodology that needs to be rectified somehow (hopefully not by
throwing it out). We might be able to fix that by running the test
suite serially rather than in parallel, although I expect that since
you (Tom) are already unhappy with the time this takes, that will
probably go over like a lead balloon. Another angle is to blame this
on the decision to assign different plan names to the different
subroots that we use for the alternative subplans. If we used
exists_1, exists_2, and exists_3 twice each instead of
exists_1..exists_6, this wouldn't happen, though that idea seems
questionable on theoretical grounds. A third possible take is that not
including choice-of-alternative-subplan in the initial scope was an
error.

As of this moment, I'm not really sure which way to jump. I need to
think further about what to do about this one. We can continue the
discussion about reducing the cost at the same time; again, I am
definitely not saying that it isn't legitimate to be concerned about
the CPU cycles expended running these tests, but those CPU cycles have
found three separate problems in two days, which is not nothing.

Separately, I am now 100% convinced that I need to go revise the
pg_collect_advice patch, because that adds yet another run of the core
regression tests, but for much less possibility of any real gain. I'll
go get rid of that and figure out what, if anything, to replace it
with.

-- 
Robert Haas
EDB: http://www.enterprisedb.com

Attachment: v1-0001-get_memoize_path-Don-t-exit-quickly-when-PGS_NEST.nocfbot
Description: Binary data
