Re: Need help identifying a periodic performance issue.
> On Nov 17, 2021, at 10:42 PM, Justin Pryzby wrote:
>
> On Thu, Nov 18, 2021 at 04:39:42PM +1300, Thomas Munro wrote:
>> On Thu, Nov 18, 2021 at 1:18 PM Robert Creager wrote:
>>> So, how do I go about capturing more information for the big brains (you
>>> guys) to help figure this out? I have all our resources at mine (and
>>> hence your) disposal.
>>
>> As a workaround, does it help if you issue DISCARD PLANS before your
>> COPY jobs, or alternatively start with a fresh connection?

I can certainly give that a try.

> It also seems to work if one does SET plan_cache_mode=force_custom_plan;
>
> Robert might try that, either in postgresql.conf, or SET in the client
> that's doing COPY.

Which would be better, DISCARD PLANS or forcing custom plans? Wrapping the COPY seems better than the postgresql.conf change, since that would affect all statements. What kind of performance hit do you estimate we'd be taking with that? Microseconds per statement? Yeah, hard to say, depends on hardware and such. Would there be any benefit overall to doing that? Forcing the replan?

Best,
Robert
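For reference, the two workarounds under discussion look roughly like this when placed around a bulk load; the table and file names here are hypothetical:

    -- Option 1: throw away all cached plans so the FK-check queries get replanned
    DISCARD PLANS;
    COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv);

    -- Option 2: keep the plan cache but force custom planning per execution
    SET plan_cache_mode = force_custom_plan;
    COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv);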
Re: Need help identifying a periodic performance issue.
On Fri, Nov 19, 2021 at 6:03 AM Robert Creager wrote:
> Which would be better? Discard plans or forcing custom plans? Seems like
> wrapping a copy might be better than the postgresql.conf change as that
> would affect all statements. What kind of performance hit would we be
> taking with that do you estimate? Microseconds per statement? Yeah, hard
> to say, depends on hardware and such. Would there be any benefit overall
> to doing that? Forcing the replan?

Just to understand what's going on, it'd be interesting to know if the problem goes away if you *just* inject the DISCARD PLANS statement before running your COPYs, but if that doesn't help, it'd also be interesting to know what happens if you ANALYZE each table after each COPY. Are you running any explicit ANALYZE commands? How long do your sessions/connections live for?

I'm wondering if the thing that changed between 9.6 and 13 might be the heuristics for when autovacuum's background ANALYZE is triggered, creating the unlucky timing required to get your system into this state occasionally.

For a while now I have been wondering how we could teach the planner/stats system about "volatile" tables (as DB2 calls them), that is, ones that are frequently empty, which often come up in job queue workloads. I've seen problems like this with user queries (I used to work on big job queue systems across different relational database vendors, which is why I finished up writing the SKIP LOCKED patch for 9.5), but this is the first time I've contemplated FK check queries being negatively affected by this kind of stats problem. I don't have a good concrete idea, though (various dumb ideas: don't let auto-analyze run on an empty table if it's marked VOLATILE, or ignore apparently empty stats on tables marked VOLATILE (and use what?), ...).
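For reference, the ANALYZE variant of the experiment Thomas suggests would look something like this (table and file names hypothetical):

    COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv);
    -- refresh optimizer statistics right away, rather than waiting for
    -- autovacuum's background ANALYZE to notice the newly loaded rows
    ANALYZE my_table;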
Re: Need help identifying a periodic performance issue.
Thomas Munro writes:
> I'm wondering if the thing that changed between 9.6 and 13 might be the
> heuristics for when autovacuum's background ANALYZE is triggered,
> creating the unlucky timing required to get your system to this state
> occasionally.
>
> For a while now I have been wondering how we could teach the
> planner/stats system about "volatile" tables (as DB2 calls them), that
> is, ones that are frequently empty, which often come up in job queue
> workloads. I've seen problems like this with user queries (I used to
> work on big job queue systems across different relational database
> vendors, which is why I finished up writing the SKIP LOCKED patch for
> 9.5), but this is the first time I've contemplated FK check queries
> being negatively affected by this kind of stats problem. I don't have a
> good concrete idea, though (various dumb ideas: don't let auto-analyze
> run on an empty table if it's marked VOLATILE, or ignore apparently
> empty stats on tables marked VOLATILE (and use what?), ...).

Hmm. If this complaint were about v14 rather than v13, I'd be wondering whether 3d351d916 was what made things worse. But in v13, if the table does go to empty (zero length) and ANALYZE happens to see that state, we should end up back at the planner's "minimum ten pages" heuristic, which likely would be enough to prevent choice of a seqscan. OTOH, if the analyzed state is "empty but has a couple of pages", it looks like that could provoke a seqscan.

This is all guesswork though, since we don't know quite what's happening on Robert's system. It might be worth setting "log_autovacuum_min_duration = 0" (either globally, or as a reloption on the relevant tables), and seeing if there seems to be any correlation between autovacuum/autoanalyze activity and the occurrences of poor plan choices.

regards, tom lane
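For reference, Tom's logging suggestion translates to something like the following; the table name is hypothetical, and the reloption form keeps the extra log volume confined to the tables of interest:

    -- per-table: log every autovacuum/autoanalyze run against this table
    ALTER TABLE my_table SET (log_autovacuum_min_duration = 0);

    -- or globally, followed by a configuration reload
    ALTER SYSTEM SET log_autovacuum_min_duration = 0;
    SELECT pg_reload_conf();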
Re: Need help identifying a periodic performance issue.
> On Nov 18, 2021, at 2:08 PM, Thomas Munro <[email protected]> wrote:
>
> On Fri, Nov 19, 2021 at 6:03 AM Robert Creager <[email protected]> wrote:
>> Which would be better? Discard plans or forcing custom plans? Seems like
>> wrapping a copy might be better than the postgresql.conf change as that
>> would affect all statements. What kind of performance hit would we be
>> taking with that do you estimate? Microseconds per statement? Yeah, hard
>> to say, depends on hardware and such. Would there be any benefit overall
>> to doing that? Forcing the replan?
>
> Just to understand what's going on, it'd be interesting to know if the
> problem goes away if you *just* inject the DISCARD PLANS statement
> before running your COPYs, but if that doesn't help it'd also be

I'm doing that now: "SET plan_cache_mode=force_custom_plan" before the copy, then back to auto after the copy.

> interesting to know what happens if you ANALYZE each table after each
> COPY. Are you running any explicit ANALYZE commands? How long do your
> sessions/connections live for?

No explicit ANALYZE happening. I'm not super familiar with this code base, but a bit of looking confirmed what I thought: connections are served via a connection pool and appear to live for the life of the app, which could be months. After this test (tomorrow, likely) I can try the explicit ANALYZE after the copy completes.

Best,
Robert
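Since these pooled sessions can live for months, one way to keep the override from leaking into unrelated statements is SET LOCAL, which reverts automatically at transaction end; a sketch, with a hypothetical table name:

    BEGIN;
    SET LOCAL plan_cache_mode = force_custom_plan;  -- scoped to this transaction
    COPY my_table FROM STDIN;
    COMMIT;  -- plan_cache_mode reverts to its previous value here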
Re: Need help identifying a periodic performance issue.
> On Nov 18, 2021, at 2:42 PM, Tom Lane wrote:
>
> This is all guesswork though, since we don't know quite what's happening
> on Robert's system. It might be worth setting
> "log_autovacuum_min_duration = 0" (either globally, or as a reloption on
> the relevant tables), and seeing if there seems to be any correlation
> between autovacuum/autoanalyze activity and the occurrences of poor plan
> choices.

I've changed the log duration globally, and am also using the plan_cache_mode=force_custom_plan suggested by Justin.

Best,
Robert
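To spot the correlation Tom describes, the autovacuum timestamps tracked in pg_stat_user_tables can also be compared against the times the bad plans show up; a sketch:

    -- when did autovacuum/autoanalyze last touch each table?
    SELECT relname, last_autovacuum, last_autoanalyze
    FROM pg_stat_user_tables
    ORDER BY last_autoanalyze DESC NULLS LAST;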
