Re: Need help identifying a periodic performance issue.

2021-11-18 Thread Robert Creager



> On Nov 17, 2021, at 10:42 PM, Justin Pryzby wrote:
> 
> On Thu, Nov 18, 2021 at 04:39:42PM +1300, Thomas Munro wrote:
>> On Thu, Nov 18, 2021 at 1:18 PM Robert Creager wrote:
>>> So, how do I go about capturing more information for the big brains (you 
>>> guys) to help figure this out?  I have all our resources at mine (and hence 
>>> your) disposal.
>> 
>> As a workaround, does it help if you issue DISCARD PLANS before your
>> COPY jobs, or alternatively start with a fresh connection?  I'm

I can certainly give that a try.
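
For reference, a minimal sketch of that workaround; the table and file
names here are hypothetical:

    DISCARD PLANS;  -- drop all cached plans in this session
    COPY jobs FROM '/data/batch.csv' WITH (FORMAT csv);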

> It also seems to work if one does SET plan_cache_mode=force_custom_plan;
> 
> Robert might try that, either in postgresql.conf, or SET in the client that's
> doing COPY.
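
Similarly, a sketch of wrapping the COPY per session (same hypothetical
names as above):

    SET plan_cache_mode = force_custom_plan;  -- always build a fresh plan
    COPY jobs FROM '/data/batch.csv' WITH (FORMAT csv);
    RESET plan_cache_mode;                    -- back to the default, auto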

Which would be better?  Discard plans or forcing custom plans?  Seems like
wrapping the copy might be better than the postgresql.conf change, as that
would affect all statements.  What kind of performance hit would we be taking
with that, do you estimate?  Microseconds per statement?  Yeah, hard to say,
depends on hardware and such.  Would there be any benefit overall to doing
that, forcing the replan?

Best,
Robert





Re: Need help identifying a periodic performance issue.

2021-11-18 Thread Thomas Munro
On Fri, Nov 19, 2021 at 6:03 AM Robert Creager wrote:
> Which would be better?  Discard plans or forcing custom plans?  Seems like
> wrapping the copy might be better than the postgresql.conf change, as that
> would affect all statements.  What kind of performance hit would we be
> taking with that, do you estimate?  Microseconds per statement?  Yeah, hard
> to say, depends on hardware and such.  Would there be any benefit overall
> to doing that, forcing the replan?

Just to understand what's going on, it'd be interesting to know if the
problem goes away if you *just* inject the DISCARD PLANS statement
before running your COPYs, but if that doesn't help it'd also be
interesting to know what happens if you ANALYZE each table after each
COPY.  Are you running any explicit ANALYZE commands?  How long do
your sessions/connections live for?
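
For reference, that experiment would look something like this, with a
hypothetical table and file name:

    COPY jobs FROM '/data/batch.csv' WITH (FORMAT csv);
    ANALYZE jobs;  -- refresh stats so later FK-check plans see the new rows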

I'm wondering if the thing that changed between 9.6 and 13 might be
the heuristics for when auto vacuum's background ANALYZE is triggered,
creating the unlucky timing required to get your system to this state
occasionally.
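
(For reference, autoanalyze fires roughly when the number of tuples
changed since the last analyze exceeds

    autovacuum_analyze_threshold
        + autovacuum_analyze_scale_factor * reltuples

which is 50 + 0.1 * reltuples with default settings, so a table that is
briefly empty or nearly empty can get analyzed at just the wrong moment.)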

For a while now I have been wondering how we could teach the
planner/stats system about "volatile" tables (as DB2 calls them), that
is, ones that are frequently empty, which often come up in job queue
workloads.  I've seen problems like this with user queries (I used to
work on big job queue systems across different relational database
vendors, which is why I finished up writing the SKIP LOCKED patch for
9.5), but this is the first time I've contemplated FK check queries
being negatively affected by this kind of stats problem.  I don't have
a good concrete idea, though (various dumb ideas: don't let auto
analyze run on an empty table if it's marked VOLATILE, or ignore
apparently empty stats on tables marked VOLATILE (and use what?),
...).
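
The job-queue dequeue pattern referenced above typically looks something
like this; the table and column names are hypothetical:

    SELECT id FROM jobs
     WHERE state = 'pending'
     ORDER BY id
     LIMIT 1
     FOR UPDATE SKIP LOCKED;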




Re: Need help identifying a periodic performance issue.

2021-11-18 Thread Tom Lane
Thomas Munro writes:
> I'm wondering if the thing that changed between 9.6 and 13 might be
> the heuristics for when auto vacuum's background ANALYZE is triggered,
> creating the unlucky timing required to get your system to this state
> occasionally.

> For a while now I have been wondering how we could teach the
> planner/stats system about "volatile" tables (as DB2 calls them), that
> is, ones that are frequently empty, which often come up in job queue
> workloads.  I've seen problems like this with user queries (I used to
> work on big job queue systems across different relational database
> vendors, which is why I finished up writing the SKIP LOCKED patch for
> 9.5), but this is the first time I've contemplated FK check queries
> being negatively affected by this kind of stats problem.  I don't have
> a good concrete idea, though (various dumb ideas: don't let auto
> analyze run on an empty table if it's marked VOLATILE, or ignore
> apparently empty stats on tables marked VOLATILE (and use what?),
> ...).

Hmm.  If this complaint were about v14 rather than v13, I'd be
wondering whether 3d351d916 was what made things worse.  But
in v13, if the table does go to empty (zero length) and ANALYZE
happens to see that state, we should end up back at the planner's
"minimum ten pages" heuristic, which likely would be enough to
prevent choice of a seqscan.  OTOH, if the analyzed state is
"empty but has a couple of pages", it looks like that could
provoke a seqscan.
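
One way to check which of those states ANALYZE left behind (table name
hypothetical):

    SELECT relpages, reltuples FROM pg_class WHERE relname = 'jobs';

relpages = 0 falls back to the minimum-ten-pages heuristic; a small
nonzero relpages with reltuples = 0 is the state that could provoke the
seqscan.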

This is all guesswork though, since we don't know quite what's
happening on Robert's system.  It might be worth setting
"log_autovacuum_min_duration = 0" (either globally, or as a
reloption on the relevant tables), and seeing if there seems
to be any correlation between autovacuum/autoanalyze activity
and the occurrences of poor plan choices.
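
For reference, either form works; the table name is hypothetical:

    ALTER SYSTEM SET log_autovacuum_min_duration = 0;        -- globally
    SELECT pg_reload_conf();

    ALTER TABLE jobs SET (log_autovacuum_min_duration = 0);  -- per table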

regards, tom lane




Re: Need help identifying a periodic performance issue.

2021-11-18 Thread Robert Creager


> On Nov 18, 2021, at 2:08 PM, Thomas Munro wrote:

> On Fri, Nov 19, 2021 at 6:03 AM Robert Creager wrote:
>> Which would be better?  Discard plans or forcing custom plans?  Seems like
>> wrapping the copy might be better than the postgresql.conf change, as that
>> would affect all statements.  What kind of performance hit would we be
>> taking with that, do you estimate?  Microseconds per statement?  Yeah, hard
>> to say, depends on hardware and such.  Would there be any benefit overall
>> to doing that, forcing the replan?

> Just to understand what's going on, it'd be interesting to know if the
> problem goes away if you *just* inject the DISCARD PLANS statement
> before running your COPYs, but if that doesn't help it'd also be

I’m doing that now: “SET plan_cache_mode=force_custom_plan” before the copy,
then “SET plan_cache_mode=auto” after the copy.

> interesting to know what happens if you ANALYZE each table after each
> COPY.  Are you running any explicit ANALYZE commands?  How long do
> your sessions/connections live for?

No explicit ANALYZE happening.  I’m not super familiar with this code base,
but a bit of looking confirmed what I thought: connections are served via a
connection pool and appear to live for the life of the app, which could be
months.

After this test (tomorrow likely) I can try the explicit ANALYZE after the copy 
completes.

Best,
Robert


Re: Need help identifying a periodic performance issue.

2021-11-18 Thread Robert Creager


> On Nov 18, 2021, at 2:42 PM, Tom Lane wrote:
> 
> This is all guesswork though, since we don't know quite what's
> happening on Robert's system.  It might be worth setting
> "log_autovacuum_min_duration = 0" (either globally, or as a
> reloption on the relevant tables), and seeing if there seems
> to be any correlation between autovacuum/autoanalyze activity
> and the occurrences of poor plan choices.

I’ve set log_autovacuum_min_duration = 0 globally, and am also using the
plan_cache_mode=force_custom_plan setting suggested by Justin.

Best,
Robert