Strange behavior of limit clause in complex query
Hi,
I'm using PostgreSQL 14.3 and I'm getting strange behavior in a complex
query generated by the Entity Framework.
The inner (complex) query has a quick execution time:
# SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
FROM "dbo"."ng_content" AS "Extent1"
INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id" =
"Extent2"."id_content"
WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
timestamp)
AND 2 = "Extent1"."id_status"
AND EXISTS (
SELECT 1 AS "C1"
FROM (
SELECT "Extent3"."TagId" FROM
"dbo"."ngx_tag_content" AS "Extent3"
WHERE "Extent1"."id" = "Extent3"."ContentId"
) AS "Project1"
WHERE EXISTS (
SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
"SingleRowTable1"
WHERE "Project1"."TagId" = 337139)
)
AND ("Extent2"."id_path" IN (27495,27554,27555) AND
NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS "SingleRowTable2"
WHERE TRUE = FALSE)
);
id| C3
--+-
13505155 | 2021-03-27 12:01:00
13505187 | 2021-03-27 12:03:00
13505295 | 2021-03-27 12:06:00
13505348 | 2021-03-27 12:09:00
13505552 | 2021-03-27 12:11:00
(5 rows)
*Time: 481.826 ms*
If I run the same query as a nested select I get similar results (Q1):
*SELECT "Project5".idFROM (*
SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
FROM "dbo"."ng_content" AS "Extent1"
INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id" =
"Extent2"."id_content"
WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
timestamp)
AND 2 = "Extent1"."id_status"
AND EXISTS (
SELECT 1 AS "C1"
FROM (
SELECT "Extent3"."TagId" FROM
"dbo"."ngx_tag_content" AS "Extent3"
WHERE "Extent1"."id" = "Extent3"."ContentId"
) AS "Project1"
WHERE EXISTS (
SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
"SingleRowTable1"
WHERE "Project1"."TagId" = 337139)
)
AND ("Extent2"."id_path" IN (27495,27554,27555) AND
NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS "SingleRowTable2"
WHERE TRUE = FALSE)
)
*) AS "Project5";*
id
--
13505155
13505187
13505295
13505348
13505552
(5 rows)
*Time: 486.174 ms*
But if I add an ORDER BY and a LIMIT something goes very wrong (Q2):
# SELECT "Project5".id
FROM (
SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
FROM "dbo"."ng_content" AS "Extent1"
INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id" =
"Extent2"."id_content"
WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
timestamp)
AND 2 = "Extent1"."id_status"
AND EXISTS (
SELECT 1 AS "C1"
FROM (
SELECT "Extent3"."TagId" FROM
"dbo"."ngx_tag_content" AS "Extent3"
WHERE "Extent1"."id" = "Extent3"."ContentId"
) AS "Project1"
WHERE EXISTS (
SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
"SingleRowTable1"
WHERE "Project1"."TagId" = 337139)
)
AND ("Extent2"."id_path" IN (27495,27554,27555) AND
NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS "SingleRowTable2"
WHERE TRUE = FALSE)
)
) AS "Project5" *ORDER BY "Project5"."C3" DESC LIMIT 6*;
id
--
13505552
13505348
13505295
13505187
13505155
(5 rows)
*Time: 389375.374 ms (06:29.375)*
An EXPLAIN (ANALYZE, BUFFERS) for Q1 returns this:
QUERY PLAN
---
Result (cost=661.07..826757.96 rows=27943 width=4) (actual
time=446.767..492.874 rows=5 loops=1)
One-Time Filter: (NOT $1)
Buffers: shared hit=344618 read=17702 written=349
InitPlan 2 (returns $1)
-> Result (cost=0.00..0.01 rows=1 width=0) (actual time=0.003..0.003
rows=0 loops=1)
One-Time Filter: false
-> Nested Loop (cost=661.07..826757.96 rows=27943 width=4) (actual
time=267.061..313.166 rows=5 loops=1)
Buffers: shared hit=34461
Adding non-selective key to jsonb query @> reduces performance?
Hi,
I have a table "tbl" with a couple of columns. One of them is "row"
jsonb. It has a GIN index as per below. The table isn't particularly
large, in lower tens of GB. Each "row" has maybe 5-20 keys, nothing
crazy.
Now, when I query it with @> operator I get very different performance
depending on the selection of keys I want to look for. The queries
below return the same result set (just a few rows). I think I have
narrowed the problem down to uniqueness of a given key. For example
this query is fast:
Q1
select count(*) from tbl where row @> '{"SELECTIVE_COL":
"SearchValue", "DATE": "20220606"}'::jsonb
It takes about 0.6ms execution time
However this one, is slow:
Q2
select count(*) from tbl where row @> '{"SELECTIVE_COL":
"SearchValue", "DATE": "20220606", "NON_SELECTIVE_COL": "Abc"}'::jsonb
It takes 17ms
Note that the only difference is adding one more - not very unique -
key. If in Q2 I replaced NON_SELECTIVE_COL with another selective
column, it's becoming fast again.
Here are the query plans:
Q1: https://explain.depesz.com/s/qxU8
Q2: https://explain.depesz.com/s/oIW3
Both look very similar, apart from a very different number of shared
buffers hit.
Index on "row":
"tbl_row_idx" gin ("row" jsonb_path_ops) WITH (fastupdate=off) WHERE
upper_inf(effective_range) AND NOT deleted
PG Version: 14.3, work_mem 512MB
What are my options? Why is the second query so much slower? I changed
Q2 to conjunction of conditions on single columns (row @> '..' and row
@> ...) and it was fast, even with the NON_SELECTIVE_COL included.
Sadly it will be difficult for me do to this in my code without using
dynamic SQL.
Many thanks,
-- Marcin
Re: Strange behavior of limit clause in complex query
Em qua., 8 de jun. de 2022 às 05:44, Paulo Silva
escreveu:
> Hi,
>
> I'm using PostgreSQL 14.3 and I'm getting strange behavior in a complex
> query generated by the Entity Framework.
>
> The inner (complex) query has a quick execution time:
>
> # SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
> FROM "dbo"."ng_content" AS "Extent1"
> INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id"
> = "Extent2"."id_content"
> WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
> AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
> timestamp)
> AND 2 = "Extent1"."id_status"
> AND EXISTS (
> SELECT 1 AS "C1"
> FROM (
> SELECT "Extent3"."TagId" FROM
> "dbo"."ngx_tag_content" AS "Extent3"
> WHERE "Extent1"."id" =
> "Extent3"."ContentId"
> ) AS "Project1"
> WHERE EXISTS (
> SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
> "SingleRowTable1"
> WHERE "Project1"."TagId" = 337139)
> )
> AND ("Extent2"."id_path" IN (27495,27554,27555)
> AND NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
> "SingleRowTable2" WHERE TRUE = FALSE)
> );
> id| C3
> --+-
> 13505155 | 2021-03-27 12:01:00
> 13505187 | 2021-03-27 12:03:00
> 13505295 | 2021-03-27 12:06:00
> 13505348 | 2021-03-27 12:09:00
> 13505552 | 2021-03-27 12:11:00
> (5 rows)
>
> *Time: 481.826 ms*
>
> If I run the same query as a nested select I get similar results (Q1):
>
>
> *SELECT "Project5".idFROM (*
> SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
> FROM "dbo"."ng_content" AS "Extent1"
> INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id"
> = "Extent2"."id_content"
> WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
> AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
> timestamp)
> AND 2 = "Extent1"."id_status"
> AND EXISTS (
> SELECT 1 AS "C1"
> FROM (
> SELECT "Extent3"."TagId" FROM
> "dbo"."ngx_tag_content" AS "Extent3"
> WHERE "Extent1"."id" =
> "Extent3"."ContentId"
> ) AS "Project1"
> WHERE EXISTS (
> SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
> "SingleRowTable1"
> WHERE "Project1"."TagId" = 337139)
> )
> AND ("Extent2"."id_path" IN (27495,27554,27555)
> AND NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
> "SingleRowTable2" WHERE TRUE = FALSE)
> )
> *) AS "Project5";*
> id
> --
> 13505155
> 13505187
> 13505295
> 13505348
> 13505552
> (5 rows)
>
> *Time: 486.174 ms*
>
> But if I add an ORDER BY and a LIMIT something goes very wrong (Q2):
>
> # SELECT "Project5".id
> FROM (
> SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
> FROM "dbo"."ng_content" AS "Extent1"
> INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id"
> = "Extent2"."id_content"
> WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
> AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
> timestamp)
> AND 2 = "Extent1"."id_status"
> AND EXISTS (
> SELECT 1 AS "C1"
> FROM (
> SELECT "Extent3"."TagId" FROM
> "dbo"."ngx_tag_content" AS "Extent3"
> WHERE "Extent1"."id" =
> "Extent3"."ContentId"
> ) AS "Project1"
> WHERE EXISTS (
> SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
> "SingleRowTable1"
> WHERE "Project1"."TagId" = 337139)
> )
> AND ("Extent2"."id_path" IN (27495,27554,27555)
> AND NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
> "SingleRowTable2" WHERE TRUE = FALSE)
> )
> ) AS "Project5" *ORDER BY "Project5"."C3" DESC LIMIT 6*;
>
I think that LIMIT is confusing the planner.
Forcing a path that in the end is not faster.
Can you try something similar to this?
WITH q AS (
SELECT "Project5".id
FROM (
SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
FROM "dbo"."ng_content" AS "Extent1"
INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id" =
"Extent2"."id_content"
WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
timestamp)
Re: Strange behavior of limit clause in complex query
Hi,
The problem is that the query is generated by the framework, I'm not sure
if I can change anything on it. Any other way to influence planner?
Regards
Ranier Vilela escreveu no dia quarta, 8/06/2022 à(s)
12:40:
> Em qua., 8 de jun. de 2022 às 05:44, Paulo Silva
> escreveu:
>
>> Hi,
>>
>> I'm using PostgreSQL 14.3 and I'm getting strange behavior in a complex
>> query generated by the Entity Framework.
>>
>> The inner (complex) query has a quick execution time:
>>
>> # SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
>> FROM "dbo"."ng_content" AS "Extent1"
>> INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id"
>> = "Extent2"."id_content"
>> WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
>> AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
>> timestamp)
>> AND 2 = "Extent1"."id_status"
>> AND EXISTS (
>> SELECT 1 AS "C1"
>> FROM (
>> SELECT "Extent3"."TagId" FROM
>> "dbo"."ngx_tag_content" AS "Extent3"
>> WHERE "Extent1"."id" =
>> "Extent3"."ContentId"
>> ) AS "Project1"
>> WHERE EXISTS (
>> SELECT 1 AS "C1" FROM (SELECT 1 AS "C")
>> AS "SingleRowTable1"
>> WHERE "Project1"."TagId" = 337139)
>> )
>> AND ("Extent2"."id_path" IN (27495,27554,27555)
>> AND NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
>> "SingleRowTable2" WHERE TRUE = FALSE)
>> );
>> id| C3
>> --+-
>> 13505155 | 2021-03-27 12:01:00
>> 13505187 | 2021-03-27 12:03:00
>> 13505295 | 2021-03-27 12:06:00
>> 13505348 | 2021-03-27 12:09:00
>> 13505552 | 2021-03-27 12:11:00
>> (5 rows)
>>
>> *Time: 481.826 ms*
>>
>> If I run the same query as a nested select I get similar results (Q1):
>>
>>
>> *SELECT "Project5".idFROM (*
>> SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
>> FROM "dbo"."ng_content" AS "Extent1"
>> INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id"
>> = "Extent2"."id_content"
>> WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
>> AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
>> timestamp)
>> AND 2 = "Extent1"."id_status"
>> AND EXISTS (
>> SELECT 1 AS "C1"
>> FROM (
>> SELECT "Extent3"."TagId" FROM
>> "dbo"."ngx_tag_content" AS "Extent3"
>> WHERE "Extent1"."id" =
>> "Extent3"."ContentId"
>> ) AS "Project1"
>> WHERE EXISTS (
>> SELECT 1 AS "C1" FROM (SELECT 1 AS "C")
>> AS "SingleRowTable1"
>> WHERE "Project1"."TagId" = 337139)
>> )
>> AND ("Extent2"."id_path" IN (27495,27554,27555)
>> AND NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
>> "SingleRowTable2" WHERE TRUE = FALSE)
>> )
>> *) AS "Project5";*
>> id
>> --
>> 13505155
>> 13505187
>> 13505295
>> 13505348
>> 13505552
>> (5 rows)
>>
>> *Time: 486.174 ms*
>>
>> But if I add an ORDER BY and a LIMIT something goes very wrong (Q2):
>>
>> # SELECT "Project5".id
>> FROM (
>> SELECT "Extent1"."id", CAST ("Extent1"."date" AS timestamp) AS "C3"
>> FROM "dbo"."ng_content" AS "Extent1"
>> INNER JOIN "dbo"."ng_path_content" AS "Extent2" ON "Extent1"."id"
>> = "Extent2"."id_content"
>> WHERE "Extent1"."date_from" <= CAST (LOCALTIMESTAMP AS timestamp)
>> AND "Extent1"."date_to" >= CAST (LOCALTIMESTAMP AS
>> timestamp)
>> AND 2 = "Extent1"."id_status"
>> AND EXISTS (
>> SELECT 1 AS "C1"
>> FROM (
>> SELECT "Extent3"."TagId" FROM
>> "dbo"."ngx_tag_content" AS "Extent3"
>> WHERE "Extent1"."id" =
>> "Extent3"."ContentId"
>> ) AS "Project1"
>> WHERE EXISTS (
>> SELECT 1 AS "C1" FROM (SELECT 1 AS "C")
>> AS "SingleRowTable1"
>> WHERE "Project1"."TagId" = 337139)
>> )
>> AND ("Extent2"."id_path" IN (27495,27554,27555)
>> AND NOT EXISTS (SELECT 1 AS "C1" FROM (SELECT 1 AS "C") AS
>> "SingleRowTable2" WHERE TRUE = FALSE)
>> )
>> ) AS "Project5" *ORDER BY "Project5"."C3" DESC LIMIT 6*;
>>
> I think that LIMIT is confusing the planner.
> Forcing a path that in the end is not faster.
>
> Can you try something similar to this?
>
> WITH q AS (
> SELECT "Project5".id
> FROM (
> SELECT "Extent1"."id",
Re: Adding non-selective key to jsonb query @> reduces performance?
Marcin Krupowicz writes:
> However this one, is slow:
> Q2
> select count(*) from tbl where row @> '{"SELECTIVE_COL":
> "SearchValue", "DATE": "20220606", "NON_SELECTIVE_COL": "Abc"}'::jsonb
> It takes 17ms
> Note that the only difference is adding one more - not very unique -
> key. If in Q2 I replaced NON_SELECTIVE_COL with another selective
> column, it's becoming fast again.
This doesn't surprise me a whole lot based on what I know of GIN.
It's going to store sets of TIDs associated with each key or value
mentioned in the data, and then a query will have to AND the sets
of TIDs for keys/values mentioned in the query. That will take
longer when some of those sets are big.
It might be worth experimenting with an index built using the
non-default jsonb_path_ops opclass [1]. I'm not sure if that'd be
faster for this scenario, but it seems worth trying.
regards, tom lane
[1] https://www.postgresql.org/docs/current/datatype-json.html#JSON-INDEXING
Re: Strange behavior of limit clause in complex query
On Wed, Jun 08, 2022 at 09:44:08AM +0100, Paulo Silva wrote: > But if I add an ORDER BY and a LIMIT something goes very wrong (Q2): A somewhat common problem. A common workaround is to change "ORDER BY a" to something like "ORDER BY a+0" (if your framework will allow it). > An EXPLAIN (ANALYZE, BUFFERS) for Q2 returns this: ... >-> Index Scan Backward using ix_ng_content_date on ng_content > "Extent1" (cost=0.43..40616715.85 rows=2231839 width=12) (actual > time=11027.808..183839.289 rows=5 loops=1) > Filter: ((2 = id_status) AND (date_from <= > LOCALTIMESTAMP) AND (date_to >= LOCALTIMESTAMP) AND (SubPlan 1)) > Rows Removed by Filter: 4685618 > Buffers: shared hit=15414533 read=564480 written=504 I'm not sure if it would help your original issue, but the rowcount estimate here is bad - overestimating 2231839 rows instead of 5. Could you try to determine which of those conditions (id_status, date_from, date_to, or SubPlan) causes the mis-estimate, or if the estimate is only wrong when they're combined ? -- Justin
