Please help! Query jumps from 1s -> 4m
Hi,

Hoping someone can help with this performance issue that's been driving a few of us crazy :-) Any guidance greatly appreciated.

A description of what you are trying to achieve and what results you expect:
- I'd like to get an understanding of why the following query (presented in full, but there are specific parts that are confusing me) starts off taking ~1 second but 'switches' to taking over 4 minutes.
- We initially saw this behaviour for the exact same SQL with a different index that resulted in an index scan. To try and fix the issue we've created an additional index with additional included fields, so we now have Index Only Scans, but are still seeing the same problem.
- The execution plan is from auto_explain output when it took just over 4 minutes to execute. The time is shared ~equally between these two index-only scans.
- There are no checkpoints occurring concurrently with this (based on "checkpoint starting" and "checkpoint complete" in the logs).
- Bloat on the index is about 30%.

Segments of interest:

1.
-> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 table1alias1 (cost=0.56..17.25 rows=1 width=60) (actual time=110.539..123828.134 rows=67000 loops=1)
     Index Cond: (col20 = $2005)
     Filter: (((col3 = $2004) OR (col3 IS NULL)) AND ((col8)::text = ANY ((ARRAY[$1004, ..., $2003])::text[])))
     Rows Removed by Filter: 2662652
     Heap Fetches: 6940
     Buffers: shared hit=46619 read=42784 written=52

2.
-> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 table1alias2 (cost=0.56..17.23 rows=1 width=36) (actual time=142.855..122034.039 rows=67000 loops=1)
     Index Cond: (col20 = $1001)
     Filter: ((col8)::text = ANY ((ARRAY[$1, ..., $1000])::text[]))
     Rows Removed by Filter: 2662652
     Heap Fetches: 6891
     Buffers: shared hit=47062 read=42331 written=37

If I run the same queries now:

Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 table1alias1 (cost=0.56..2549.69 rows=69 width=36) (actual time=1.017..1221.375 rows=67000 loops=1)
     Heap Fetches: 24
     Buffers: shared hit=2849 read=2483

The buffers do look different - but still, reading 42k blocks doesn't seem like it would cause a delay of 4 minutes?

Actually, here's another example of segment 2 from the logs:

Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 table1alias2 (cost=0.56..17.23 rows=1 width=36) (actual time=36.559..120649.742 rows=65000 loops=1)
     Index Cond: (col20 = $1001)
     Filter: ((col8)::text = ANY ((ARRAY[$1, ..., $1000])::text[]))
     Rows Removed by Filter: 2664256
     Heap Fetches: 6306
     Buffers: shared hit=87712 read=1507

One note: I've replaced table/column names (sorry, a requirement).
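For what it's worth, the arithmetic behind that "reading 42k blocks" comment, assuming the default 8 kB block size:

-- 42784 blocks read from disk for the slow scan is only ~334 MB:
SELECT pg_size_pretty(42784 * 8192::bigint);
-- even at fairly modest disk throughput, that's seconds of I/O, not minutes.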
Full subquery execution plan (I've stripped out the view materialization from row 14 onwards but left the header in): https://explain.depesz.com/s/vsdH

Full SQL:

SELECT subquery.id
FROM (
    SELECT table1alias1.id, table1alias1.uniqueid, table1alias1.col16 AS order_by
    FROM table1 AS table1alias1
    LEFT OUTER JOIN (
        SELECT inlinealias1.id, inlinealias1.uniqueid, inlinealias1.col4, inlinealias1.col5, inlinealias1.col6, inlinealias1.col7
        FROM (
            SELECT table2alias.id, table2alias.uniqueid, table2alias.col3, table2alias.col4, table2alias.col5, table2alias.col6,
                   row_number() OVER (PARTITION BY table2alias.uniqueid ORDER BY table2alias.col13 DESC, table2alias.col3 DESC, table2alias.id DESC) AS rn
            FROM table2 AS table2alias
            JOIN (
                SELECT DISTINCT table1alias2.uniqueid
                FROM table1 AS table1alias2
                WHERE (table1alias2.col8 IN ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13, $14, $15, $16, $17, $18, $19, $20, $21, $22, $23, $24, $25, $26, $27, $28, $29, $30, $31, $32, $33, $34, $35, $36, $37, $38, $39, $40, $41, $42, $43, $44, $45, $46, $47, $48, $49, $50, $51, $52, $53, $54, $55, $56, $57, $58, $59, $60, $61, $62, $63, $64, $65, $66, $67, $68, $69, $70, $71, $72, $73, $74, $75, $76, $77, $78, $79, $80, $81, $82, $83, $84, $85, $86, $87, $88, $89, $90, $91, $92, $93, $94, $95, $96, $97, $98, $99, $100, $101, $102, $103, $104, $105, $106, $107, $108, $109, $110, $111, $112, $113, $114, $115, $116, $117, $118, $119, $120, $121, $122, $123, $124, $125, $126, $127, $128, $129, $130, $131, $132, $133, $134, $135, $136, $137, $138, $139, $140, $141, $142, $143, $144, $145, $146, $147, $148, $149, $150, $151, $152, $153, $154, $155, $156, $157, $158, $159, $160, $161, $162, $163, $164, $165, $166, $167, $168, $169, $170, $171, $172, $173, $174, $175, $176, $177, $178, $179, $180, $181, $182, $183, $184, $185, $186, $187, $188, $189, $190, $191, $192, $193, $194, $195, $196, $197, $198, $199, $2
Re: Please help! Query jumps from 1s -> 4m
The change is abrupt, on the 10th execution (but I hadn't spotted it was
always after the same number of executions until your suggestion - thanks for
pointing me in that direction). I don't see any custom configuration on our
end that changes the threshold for this from 5 to 10. Debugging the query
call, I also see that PgConnection has the prepareThreshold set to 5.

Additionally, the execution plans for the 10th and following queries look
fine; they have the same structure as if I run the query manually. It's not
that the query plan switches - it seems as though the same query plan is just
>200x slower than usual.

As for the heap fetches: as far as I can tell, on both occasions the fetches
are relatively low and shouldn't account for minutes of execution (even if
one is lower than the other). Looking through one day's logs I do find cases
with lower heap fetches too, for example the one below, which has 1977
fetches instead of the previous 6940 but took approximately the same time:

-> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 table1alias1 (cost=0.56..17.25 rows=1 width=60) (actual time=56.858..120893.874 rows=67000 loops=1)
     Index Cond: (col20 = $2005)
     Filter: (((col3 = $2004) OR (col3 IS NULL)) AND ((col8)::text = ANY ((ARRAY[$1004, ..., $2003])::text[])))
     Rows Removed by Filter: 2662793
     Heap Fetches: 1977
     Buffers: shared hit=84574 read=3522

Would you agree the statement threshold / heap fetches seem unlikely to be
causing this? Any other thoughts?

Thanks!

On Sun, 3 May 2020 at 16:38, Justin Pryzby wrote:

> On Sun, May 03, 2020 at 09:58:27AM +0100, James Thompson wrote:
> > Hi,
> >
> > Hoping someone can help with this performance issue that's been driving a
> > few of us crazy :-) Any guidance greatly appreciated.
> >
> > A description of what you are trying to achieve and what results you
> > expect:
> > - I'd like to get an understanding of why the following query (presented
> > in full, but there are specific parts that are confusing me) starts off
> > taking ~1 second but 'switches' to taking over 4 minutes.
>
> Does it "switch" abruptly or do you get progressively slower queries?
> If it's abrupt following the 5th execution, I guess you're hitting this:
>
> https://www.postgresql.org/message-id/[email protected]
> https://www.postgresql.org/message-id/a737b7a37273e048b164557adef4a58b50fb8...@ntex2010i.host.magwien.gv.at
>
> > - We initially saw this behaviour for the exact same SQL with a different
> > index that resulted in an index scan. To try and fix the issue we've
> > created an additional index with additional included fields, so we now
> > have Index Only Scans, but are still seeing the same problem.
>
> > Segments of interest:
> > 1.
> > -> Index Only Scan using table1_typea_include_uniqueid_col16_idx on
> > table1 table1alias1 (cost=0.56..17.25 rows=1 width=60) (actual
> > time=110.539..123828.134 rows=67000 loops=1)
> > Index Cond: (col20 = $2005)
> > Filter: (((col3 = $2004) OR (col3 IS NULL)) AND ((col8)::text = ANY
> > ((ARRAY[$1004, ..., $2003])::text[])))
> > Rows Removed by Filter: 2662652
> > Heap Fetches: 6940
> > Buffers: shared hit=46619 read=42784 written=52
>
> > If I run the same queries now:
> > Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
> > table1alias1 (cost=0.56..2549.69 rows=69 width=36)
> > (actual time=1.017..1221.375 rows=67000 loops=1)
> > Heap Fetches: 24
> > Buffers: shared hit=2849 read=2483
>
> It looks to me like you're getting good performance following a vacuum,
> when Heap Fetches is low. So you'd want to run vacuum more often, like:
> | ALTER TABLE table1 SET (autovacuum_vacuum_scale_factor=0.005).
>
> But maybe I've missed something - you showed the bad query plan, but not
> the good one, and I wonder if they may be subtly different, and that's
> maybe masked by the replaced identifiers.
>
> --
> Justin
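PS: for the vacuum angle, pg_stat_user_tables shows when the table was last
(auto)vacuumed and how many dead tuples it currently has - a quick check along
these lines (table name replaced as before):

SELECT relname, last_vacuum, last_autovacuum, autovacuum_count, n_dead_tup, n_live_tup
FROM pg_stat_user_tables
WHERE relname = 'table1';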
Re: Please help! Query jumps from 1s -> 4m
I've managed to replicate this now with prepared statements. Thanks for all
the guidance so far.
The slowness occurs when the prepared statement changes to a generic plan.
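The simplest way I can show it is with a hand-run prepared statement along
these lines (a cut-down sketch: two values instead of the real ~1000, and
identifiers replaced as before):

PREPARE q (bigint, text, text) AS
    SELECT DISTINCT uniqueid
    FROM table1
    WHERE col20 = $1 AND col8 IN ($2, $3);

-- the first five executions get custom plans:
EXPLAIN (ANALYZE, BUFFERS) EXECUTE q (12345, 'value-1', 'value-2');
-- from the sixth execution the plancache may decide the generic plan looks
-- cheaper and switch to it (with the real statement, that's when the
-- Filter-based plan below shows up).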
Initial plan:
-> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
table1alias2 (cost=0.56..2549.70 rows=70 width=36) (actual
time=1.901..45.256 rows=65000 loops=1)
Output: table1alias2.uniqueid
Index Cond: ((table1alias2.col20 = '12345'::bigint) AND (table1alias2.col8 =
ANY ('{c5986b02-3a02-4639-8147-f286972413ba,...
98ed24b1-76f5-4b0e-bb94-86cf13a4809c}'::text[])))
Heap Fetches: 10
Buffers: shared hit=5048
after 5 executions of the statement:
-> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
table1alias2 (cost=0.56..17.23 rows=1 width=36) (actual
time=125.344..126877.822 rows=65000 loops=1)
Output: table1alias2.uniqueid
Index Cond: (table1alias2.col20 = $1001)
Filter: ((table1alias2.col8)::text = ANY ((ARRAY[$1, ...,
$1000])::text[]))
Rows Removed by Filter: 2670023
Heap Fetches: 428
Buffers: shared hit=45933 read=42060 dirtied=4
The second plan looks worse to me, as it's applying a filter rather than
using an index condition. I don't understand why the predicate isn't part of
the index condition, nor why this makes the scan so much slower.
If I force a retrieval of all index rows for col20 = '12345' using an
ad-hoc query (below, which in my mind is what the 'bad' plan is doing),
that only takes 2s (2.7 million rows). Where's the difference?
EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT COUNT(DISTINCT id) FROM table1
WHERE datatype='TypeA' AND col20 = 12345;
-> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
(cost=0.56..2762.95 rows=55337 width=8) (actual time=3.976..1655.645
rows=2735023 loops=1)
Index Cond: (col20 = 12345)
Heap Fetches: 417
Buffers: shared hit=43843 read=44147 dirtied=8
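One thing I can put numbers on, assuming the default 8 kB block size: going by
the BUFFERS lines, the slow generic-plan scan and this ad-hoc scan touch
roughly the same volume of the index, so the raw page count alone doesn't seem
to explain the gap.

-- generic-plan scan: shared hit=45933 read=42060
SELECT pg_size_pretty((45933 + 42060) * 8192::bigint);  -- ~687 MB
-- ad-hoc scan above: shared hit=43843 read=44147
SELECT pg_size_pretty((43843 + 44147) * 8192::bigint);  -- ~687 MB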
> You could do: SET track_io_timing TO on;

I've not tried this yet, and haven't used it before - it sounds like there are
some risks associated with running it on a production server (clocks going
backwards?).
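If the overhead turns out to be acceptable, my understanding is that it can be
enabled for a single (superuser) session rather than server-wide, and that
pg_test_timing can be run on the box first to measure the timer overhead.
Something like:

SET track_io_timing = on;   -- session-local
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(DISTINCT id) FROM table1
WHERE datatype='TypeA' AND col20 = 12345;
-- the plan nodes then include "I/O Timings: read=..." lines.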
> Also, you should be able to check if that's the problem by doing either:
> plan_cache_mode = force_generic_plan;
> Or (I would think) DISCARD PLANS;
I think plan_cache_mode is only available in PG 12+, unfortunately? We're on
11 currently.
Just tested DISCARD PLANS locally; it didn't switch back from the generic
plan. Was that your expectation?
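For reference (mostly for future us, once we're on 12+), my understanding is
the session-level escape hatches would look like this:

-- PostgreSQL 12+ only:
SET plan_cache_mode = force_custom_plan;    -- always re-plan with the actual parameter values
-- or, to reproduce the problem on demand:
SET plan_cache_mode = force_generic_plan;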
Re: Please help! Query jumps from 1s -> 4m
Just to follow up on this...
Tried increasing stats targets last week + re-analyzing but the query was
just as bad.
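For concreteness, 'increasing stats targets' was along the lines of the
following (the target value and the choice of columns here are illustrative,
identifiers replaced as before):

ALTER TABLE table1 ALTER COLUMN col8 SET STATISTICS 1000;
ALTER TABLE table1 ALTER COLUMN col20 SET STATISTICS 1000;
ANALYZE table1;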
Ended up increasing the prepareThreshold to prevent server-side prepares for
now (and thus the later switch to a generic plan). This 'fixed' the issue and
had no noticeable negative effect for our workloads.
I still don't understand why the plan being off makes the query so much
slower in this case (the plans I shared in the last email don't look too
different; I don't understand how the filter can add 2 minutes of execution
time to an index-only scan). If anyone does have thoughts on what could be
happening I would be very interested to hear them, but the main performance
problem is effectively solved.
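One observation, in case anyone can confirm or refute it: the generic plan's
Filter has to evaluate a 1000-element = ANY(...) list against ~2.7 million
rows, which is on the order of a couple of billion string comparisons. A
synthetic sketch of just that cost, using made-up data rather than our table:

EXPLAIN ANALYZE
SELECT count(*)
FROM generate_series(1, 2700000) AS g(i)
WHERE g.i::text = ANY (ARRAY(SELECT md5(j::text) FROM generate_series(1, 1000) AS j));
-- no value matches, so every one of the 2.7M rows is compared against all
-- 1000 array elements before being filtered out.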
Thanks all for the valuable help getting to the bottom of what was
happening.
On Tue, 5 May 2020 at 22:42, Tom Lane wrote:
> James Thompson writes:
> > The slowness occurs when the prepared statement changes to a generic
> plan.
>
> > Initial plan:
> > -> Index Only Scan using table1_typea_include_uniqueid_col16_idx on
> table1
> > table1alias2 (cost=0.56..2549.70 rows=70 width=36) (actual
> > time=1.901..45.256 rows=65000 loops=1)
> > Output: table1alias2.uniqueid
> > Index Cond: ((table1alias2.col20 = '12345'::bigint) AND
> (table1alias2.
> > col8 = ANY ('{c5986b02-3a02-4639-8147-f286972413ba,...
> > 98ed24b1-76f5-4b0e-bb94-86cf13a4809c}'::text[])))
> > Heap Fetches: 10
> > Buffers: shared hit=5048
>
> > after 5 executions of the statement:
> > -> Index Only Scan using table1_typea_include_uniqueid_col16_idx on
> table1
> > table1alias2 (cost=0.56..17.23 rows=1 width=36) (actual
> > time=125.344..126877.822 rows=65000 loops=1)
> > Output: table1alias2.uniqueid
> > Index Cond: (table1alias2.col20 = $1001)
> > Filter: ((table1alias2.col8)::text = ANY ((ARRAY[$1, ...,
> > $1000])::text[]))
> > Rows Removed by Filter: 2670023
> > Heap Fetches: 428
> > Buffers: shared hit=45933 read=42060 dirtied=4
>
> Yeah, this is a dynamic we've seen before. The rowcount estimate, and
> hence the cost estimate, for the plan with explicit parameter values is
> way off; but the estimate for the generic plan is even more way off,
> causing the system to falsely decide that the latter is cheaper.
>
> I've speculated about refusing to believe generic cost estimates if they
> are
> more than epsilon less than the concrete cost estimate, but it's not quite
> clear how that should work or whether it'd have its own failure modes.
>
> The one thing that is totally clear is that these rowcount estimates are
> crappy. Can you improve them by increasing the stats target for that
> table? Maybe with less-garbage-y inputs, the system would make the right
> plan choice here.
>
> regards, tom lane
>
