Potential performance issues related to group by and covering index
Hello,

We have two TPC-H queries which fetch the same tuples but show a significant difference in query execution time (4.3 times). We are sharing the pair of TPC-H queries that exhibit this performance difference:

First query:

SELECT "ps_comment", "ps_suppkey", "ps_supplycost", "ps_partkey", "ps_availqty"
FROM "partsupp"
WHERE "ps_partkey" + 16 < 1 OR "ps_partkey" = 2
GROUP BY "ps_partkey", "ps_suppkey", "ps_availqty", "ps_supplycost", "ps_comment"

Second query:

SELECT "ps_comment", "ps_suppkey", "ps_supplycost", "ps_partkey", "ps_availqty"
FROM "partsupp"
WHERE "ps_partkey" + 16 < 1 OR "ps_partkey" = 2
GROUP BY "ps_comment", "ps_suppkey", "ps_supplycost", "ps_partkey", "ps_availqty"

* Actual Behavior

We executed both queries on the TPC-H benchmark at scale factor 5: the first query takes over 1.7 seconds, while the second query takes only 0.4 seconds. We think the time difference results from the different plans selected. Specifically, the first (slow) query performs an index scan on table partsupp using the covering index (ps_partkey, ps_suppkey), while the second (fast) query performs a parallel sequential scan and groups on (ps_suppkey, ps_partkey).
* Query Execution Plan

First query:

QUERY PLAN
------------
 Group  (cost=0.43..342188.58 rows=399262 width=144) (actual time=0.058..1737.659 rows=4 loops=1)
   Group Key: ps_partkey, ps_suppkey
   Buffers: shared hit=123005 read=98055
   ->  Index Scan using partsupp_pkey on partsupp  (cost=0.43..335522.75 rows=1333167 width=144) (actual time=0.055..1737.651 rows=4 loops=1)
         Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))
         Rows Removed by Filter: 396
         Buffers: shared hit=123005 read=98055
 Planning Time: 0.926 ms
 Execution Time: 1737.754 ms
(9 rows)

Second query:

QUERY PLAN
------------
 Group  (cost=250110.68..350438.93 rows=399262 width=144) (actual time=400.353..400.361 rows=4 loops=1)
   Group Key: ps_suppkey, ps_partkey
   Buffers: shared hit=5481 read=24093
   ->  Gather Merge  (cost=250110.68..346446.31 rows=798524 width=144) (actual time=400.351..406.741 rows=4 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         Buffers: shared hit=15151 read=72144
         ->  Group  (cost=249110.66..253276.80 rows=399262 width=144) (actual time=395.882..395.883 rows=1 loops=3)
               Group Key: ps_suppkey, ps_partkey
               Buffers: shared hit=15151 read=72144
               ->  Sort  (cost=249110.66..250499.37 rows=555486 width=144) (actual time=395.880..395.881 rows=1 loops=3)
                     Sort Key: ps_suppkey, ps_partkey
                     Sort Method: quicksort  Memory: 25kB
                     Worker 0:  Sort Method: quicksort  Memory: 25kB
                     Worker 1:  Sort Method: quicksort  Memory: 25kB
                     Buffers: shared hit=15151 read=72144
                     ->  Parallel Seq Scan on partsupp  (cost=0.00..116363.88 rows=555486 width=144) (actual time=395.518..395.615 rows=1 loops=3)
                           Filter: (((ps_partkey + 16) < 1) OR (ps_partkey = 2))
                           Rows Removed by Filter: 132
                           Buffers: shared hit=15065 read=72136
 Planning Time: 0.360 ms
 Execution Time: 406.880 ms
(22 rows)

* Expected Behavior

Since these two queries are semantically equivalent, we were hoping that PostgreSQL would evaluate them in roughly the same amount of time.
It looks to me that the different order of the GROUP BY columns triggers different plans: when the GROUP BY columns (ps_partkey, ps_suppkey) match the covering index, an index scan on the associated columns is triggered; however, when the GROUP BY columns are in a different order than the covering index (ps_suppkey, ps_partkey), the index scan is not triggered. Given that a user might not pay close attention to this subtle difference, I was wondering if it is worth making these two queries have the same, predictable performance on PostgreSQL.

* Test Environment

Ubuntu 20.04 machine: "Linux panda 5.4.0-40-generic #44-Ubuntu SMP Tue Jun 23 00:01:04 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux"
PostgreSQL v12.3
Database: TPC-H benchmark (scale factor 5)

The description of table partsupp is as follows:

tpch5=# \d partsupp;
Performance issues related to left join and order by
Hello,

We have two TPC-H queries which fetch the same tuples but show a significant difference in query execution time (22.0 times). We are sharing the pair of TPC-H queries that exhibit this performance difference:

First query:

SELECT "orders3"."o_comment", "orders3"."o_orderstatus", "orders3"."o_orderkey",
       "t17"."ps_partkey", "t17"."ps_supplycost", "t17"."ps_comment",
       "orders3"."o_clerk", "orders3"."o_totalprice", "t17"."ps_availqty", "t17"."ps_suppkey"
FROM (SELECT * FROM "partsupp"
      WHERE "ps_comment" LIKE ', even theodolites. regular, final theodolites eat after the carefully pending foxes. furiously regular deposits sleep slyly. carefully bold realms above the ironic dependencies haggle careful') AS "t17"
LEFT JOIN "orders" AS "orders3" ON true
ORDER BY "t17"."ps_supplycost"
FETCH NEXT 14 ROWS ONLY

Second query:

SELECT "orders3"."o_comment", "orders3"."o_orderstatus", "orders3"."o_orderkey",
       "t17"."ps_partkey", "t17"."ps_supplycost", "t17"."ps_comment",
       "orders3"."o_clerk", "orders3"."o_totalprice", "t17"."ps_availqty", "t17"."ps_suppkey"
FROM (SELECT * FROM "partsupp"
      WHERE "ps_comment" LIKE ', even theodolites. regular, final theodolites eat after the carefully pending foxes. furiously regular deposits sleep slyly. carefully bold realms above the ironic dependencies haggle careful'
      ORDER BY "ps_supplycost"
      FETCH NEXT 14 ROWS ONLY) AS "t17"
LEFT JOIN "orders" AS "orders3" ON true
ORDER BY "t17"."ps_supplycost"
FETCH NEXT 14 ROWS ONLY

* Actual Behavior

We executed both queries on the TPC-H benchmark at scale factor 5: the first query takes over 8 seconds, while the second query takes only 0.3 seconds. We think the time difference results from the different plans selected. Specifically, the first (slow) query performs a left join using the entire partsupp table, while the second (fast) query performs a left join using only 14 rows from partsupp.
* Query Execution Plan

First query:

QUERY PLAN
------------
 Limit  (cost=464628.69..464630.32 rows=14 width=223) (actual time=8082.764..8082.767 rows=14 loops=1)
   ->  Gather Merge  (cost=464628.69..1193917.91 rows=6250614 width=223) (actual time=8082.762..8087.639 rows=14 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Sort  (cost=463628.66..471441.93 rows=3125307 width=223) (actual time=2933.506..2933.506 rows=5 loops=3)
               Sort Key: partsupp.ps_supplycost
               Sort Method: quicksort  Memory: 25kB
               Worker 0:  Sort Method: top-N heapsort  Memory: 32kB
               Worker 1:  Sort Method: quicksort  Memory: 25kB
               ->  Nested Loop Left Join  (cost=0.00..388506.36 rows=3125307 width=223) (actual time=360.602..1643.471 rows=250 loops=3)
                     ->  Parallel Seq Scan on partsupp  (cost=0.00..108031.62 rows=1 width=144) (actual time=360.577..360.599 rows=0 loops=3)
                           Filter: ((ps_comment)::text ~~ ', even theodolites. regular, final theodolites eat after the carefully pending foxes. furiously regular deposits sleep slyly. carefully bold realms above the ironic dependencies haggle careful'::text)
                           Rows Removed by Filter: 133
                     ->  Seq Scan on orders orders3  (cost=0.00..205467.37 rows=7500737 width=79) (actual time=0.064..1544.990 rows=750 loops=1)
 Planning Time: 0.278 ms
 Execution Time: 8087.714 ms
(16 rows)

Second query:

QUERY PLAN
------------
 Limit  (cost=109031.74..109032.26 rows=14 width=223) (actual time=363.883..363.890 rows=14 loops=1)
   ->  Nested Loop Left Join  (cost=109031.74..389506.49 rows=7500737 width=223) (actual ti
Re: Potential performance issues related to group by and covering index
On Tue, 2 Mar 2021 at 9:53, Liu, Xinyu wrote:
> [report and EXPLAIN output quoted above; snipped]

In this case there is a brutal overestimation, probably due to the planner-unfriendly predicate (ps_partkey + 16) < 1 OR ps_partkey = 2.

You can try to rewrite this predicate as ps_partkey < -15 OR ps_partkey = 2

Regards

Pavel
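A sketch of the first query with Pavel's rewritten predicate (equivalent for an integer ps_partkey, but in a form the planner can estimate from column statistics):

```sql
SELECT "ps_comment", "ps_suppkey", "ps_supplycost", "ps_partkey", "ps_availqty"
FROM "partsupp"
-- was: "ps_partkey" + 16 < 1 OR "ps_partkey" = 2; the arithmetic on the
-- column defeats the planner's selectivity estimation
WHERE "ps_partkey" < -15 OR "ps_partkey" = 2
GROUP BY "ps_partkey", "ps_suppkey", "ps_availqty", "ps_supplycost", "ps_comment";
```

With the bare-column form the planner can use the histogram on ps_partkey and see that very few rows qualify, which should steer it away from the full covering-index scan.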
Re: Potential performance issues related to group by and covering index
On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that different order of group by clauses triggers different
> plans: when the group by clauses (ps_partkey, ps_suppkey) is the same as the
> covering index, it will trigger an index scan on associated columns;
> however, when the group by clauses have different order than the covering
> index (ps_suppkey, ps_partkey), the index scan will not be triggered.
> Given that the user might not pay close attention to this subtle difference,
> I was wondering if it is worth making these two queries have the same and
> predictable performance on Postgresql.
Unfortunately, it would take a pretty major overhaul of the query
planner to do that efficiently.
For now, we have a few smarts involved in trying to make the GROUP BY
processing more efficient:
1) We remove columns from the GROUP BY if they're functionally
dependent on the primary key, provided the primary key is present
too (you're seeing this in your example query).
2) We also change the order of the GROUP BY columns if it's a subset
of the ORDER BY columns. This is quite good as we'd do grouping by
{b,a} if someone wrote GROUP BY a,b ORDER BY b,a; which would save
having to re-sort the data for the ORDER BY after doing the GROUP BY.
That's especially useful for queries with a LIMIT clause.
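A sketch of point 2, using a hypothetical table t with columns a and b (names are illustrative, not from the thread):

```sql
-- The planner may internally group by {b, a} here so the grouped output is
-- already sorted for the ORDER BY, letting the LIMIT stop early instead of
-- re-sorting after the aggregation.
SELECT a, b, count(*)
FROM t
GROUP BY a, b      -- may be reordered to {b, a} to match the ORDER BY
ORDER BY b, a
LIMIT 10;
```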
If we want to do anything much smarter than that like trying every
combination of the GROUP BY clause, then plan times are likely going
to explode. The join order search is done based on the chosen query
pathkeys, which in many queries is the pathkeys for the GROUP BY
clause (see standard_qp_callback()). This means throughout the join
search, the planner will try to form paths that provide pre-sorted input
that allows the group by to be implemented efficiently with pre-sorted
data. You might see Merge Joins rather than Hash Joins, for example.
If we want to try every combination of the GROUP BY columns then it
means repeating that join search once per combination. The join search
is often, *by far*, the most expensive part of planning a query.
While it would be nice if the planner did a better job on selecting
the best order for group by columns, unless we can come up with some
heuristics that allow us to just try a single combination that is
likely good, then I don't think anyone would thank us for slowing down
the planner by a factor of the number of possible combinations of the
group by columns.
David
Re: Performance issues related to left join and order by
On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that
> PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that there is a missing optimization rule related to pushing
> the sort operator (i.e., order and limit) through the left join.
> Given the significant query execution time difference, I was wondering if it
> is worth adding such a rule to make the system evaluate the first query more
> efficiently.
> It would also be helpful if you could comment on if there is a standard
> practice to evaluate the tradeoff associated with adding such a rule in
> Postgresql.

We currently don't attempt to push down LIMIT clauses into subqueries. Before we did that, we'd need to get much better at figuring out how joins duplicate rows, so that we could be sure we're not limiting the subquery to fewer records than the outer query needs to reach its limit.

If you want some advice, you're likely to get more people on your side, and possibly support for making improvements to the query planner, if you provide examples that look remotely like real-world queries. In the other emails that I've read from you on this list [1], your example queries all appear completely bogus. I suspect the queries are generated by some fuzz-testing tool, and I very much imagine that you don't really need help with these at all. With respect, it seems to me that there's about zero chance that you genuinely need the results of this query more quickly and have come here for help with that.

Because PostgreSQL does not proactively cache query plans, ad-hoc queries are always parsed, planned, then executed. This means it's often not practical to spend excessive amounts of time planning a query that gets executed just once. Adding new optimisations to the query planner means they either have to be very cheap to detect, or they must pay off in many cases.
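A sketch of the manual LIMIT pushdown being discussed, i.e. the rewrite the faster query in this thread already applies by hand:

```sql
-- Pushing the outer LIMIT into the subquery is safe here only because
-- LEFT JOIN ... ON true never eliminates rows from "t17"; with a join
-- condition that could filter rows, limiting the subquery to 14 rows
-- could leave the outer query with fewer than 14 result rows.
SELECT "orders3"."o_orderkey", "t17"."ps_partkey", "t17"."ps_supplycost"
FROM (SELECT * FROM "partsupp"
      ORDER BY "ps_supplycost"
      FETCH NEXT 14 ROWS ONLY) AS "t17"
LEFT JOIN "orders" AS "orders3" ON true
ORDER BY "t17"."ps_supplycost"
FETCH NEXT 14 ROWS ONLY;
```

Proving that kind of safety condition in general is exactly the "hard part" of documenting when the pushdown is legal.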
If you happen to think there's a genuine case for having the query planner do a better job of LIMIT pushdowns into subqueries, then you're welcome to submit a patch to implement that. You'll also need to carefully document exactly which cases the LIMIT can be pushed down and when it cannot. That's the hard part; the actual pushing down of the clause is dead easy. If you're going to do that, then I'd suggest you come up with better examples than this one. I don't think many people will get on board with your newly proposed optimisations when the queries are obviously not real. It's hard to argue that an optimisation is useful when the motivating query is so obviously not a real one.

David

[1] https://www.postgresql.org/message-id/BN7PR07MB52024B973EAB075F4DF6C19ACD999%40BN7PR07MB5202.namprd07.prod.outlook.com
High availability management tool.
Hi,

I have to configure PostgreSQL for high availability. I want to ask which tool you recommend to manage replication and failover or switchover.

Thanks. Regards,
Pablo
Re: Potential performance issues related to group by and covering index
> If we want to do anything much smarter than that like trying every
> combination of the GROUP BY clause, then plan times are likely going
> to explode. The join order search is done based on the chosen query
> pathkeys, which in many queries is the pathkeys for the GROUP BY
> clause (see standard_qp_callback()). This means throughout the join
> search, planner will try and form paths that provide pre-sorted input
> that allows the group by to be implemented efficiently with pre-sorted
> data. You might see Merge Joins rather than Hash Joins, for example.

Are there guidelines or principles you could share about writing the GROUP BY clause such that it is more efficient?
Re: Potential performance issues related to group by and covering index
On Wed, 3 Mar 2021 at 10:04, Michael Lewis wrote:
> Are there guidelines or principles you could share about writing the group by
> clause such that it is more efficient?

If you have the option of writing them in the same order as an existing btree index that covers the entire GROUP BY clause (in versions < PG13), or at least a prefix of the GROUP BY clause (in versions >= PG13), then the planner has a chance to make use of that index to provide pre-sorted input to a Group Aggregate. Since PG13 has Incremental Sort, having an index that covers only a prefix of the GROUP BY clause may still help.

If no such index exists, then you might get better performance by putting the most distinct column first. That's because sorts don't need to compare the remaining columns once they receive two different values for one column. That gets more complex when the most distinct column is wider than the others, e.g. a text compare is more expensive than comparing two ints.

For Hash Aggregate, I don't think the order will matter much.

David
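The guideline above can be sketched as follows, assuming a hypothetical table t with a btree index on (a, b) (names are illustrative, not from the thread):

```sql
-- Assumed schema (hypothetical):
--   CREATE TABLE t (a int, b int, v numeric);
--   CREATE INDEX t_a_b_idx ON t (a, b);
-- Writing the GROUP BY in index-column order gives the planner the chance
-- to feed index-sorted rows into a Group Aggregate, avoiding a Sort.
SELECT a, b, sum(v)
FROM t
GROUP BY a, b;   -- matches (a, b) index order; GROUP BY b, a may require a sort
```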
tables meta data collection
Hi Everyone,
I was trying to collect table metadata including descriptions; the use case is that
I need to show all columns of the tables whether they have a description or not.
I tried the query below, but it only gives details for columns that have a
description and ignores the others.
Postgres 11 | db<>fiddle
create table test (id int);
create table test1 (id int primary key);
comment on column test.id is 'Test descr';

SELECT c.table_schema, c.table_name, c.column_name,
       case
         when c.domain_name is not null then c.domain_name
         when c.data_type = 'character varying' then 'character varying(' || c.character_maximum_length || ')'
         when c.data_type = 'character' then 'character(' || c.character_maximum_length || ')'
         when c.data_type = 'numeric' then 'numeric(' || c.numeric_precision || ',' || c.numeric_scale || ')'
         else c.data_type
       end as data_type,
       c.is_nullable,
       (select 'Y'
        from information_schema.table_constraints tco
        join information_schema.key_column_usage kcu
          on kcu.constraint_name = tco.constraint_name
         and kcu.constraint_schema = tco.constraint_schema
         and kcu.constraint_schema = c.table_schema
         and kcu.table_name = c.table_name
         and kcu.column_name = c.column_name
        where tco.constraint_type = 'PRIMARY KEY') as is_in_PK,
       (select distinct 'Y'
        from information_schema.table_constraints tco
        join information_schema.key_column_usage kcu
          on kcu.constraint_name = tco.constraint_name
         and kcu.constraint_schema = tco.constraint_schema
         and kcu.constraint_schema = c.table_schema
         and kcu.table_name = c.table_name
         and kcu.column_name = c.column_name
        where tco.constraint_type = 'FOREIGN KEY') as is_in_FK,
       pgd.description
FROM pg_catalog.pg_statio_all_tables as st
LEFT OUTER JOIN pg_catalog.pg_description pgd on (pgd.objoid = st.relid)
LEFT OUTER JOIN information_schema.columns c on (pgd.objsubid = c.ordinal_position
     and c.table_schema = st.schemaname and c.table_name = st.relname)
WHERE c.table_name = 'test'
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
The expected format is:

| table_schema | table_name | column_name | data_type | is_nullable | is_in_pk | is_in_fk | description |

Any suggestions?

Thanks,
Rj
Re: tables meta data collection
On Wed, Mar 03, 2021 at 01:20:51AM +, Nagaraj Raj wrote:
> I was trying to collect table metadata with a description; the use case is
> that I need to show all columns of the tables whether it has the description
> or not.
> I tried the below query, but it only gives column details that have a
> description and ignore others if not.
Looks like you should join information_schema.columns *before*
pg_catalog.pg_description, otherwise the columns view has nothing to join to
unless the column has a description.
...or you could use FULL OUTER JOIN on the columns view.
> [query and expected output quoted above; snipped]
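A sketch of the reordered join Justin describes: drive the query from information_schema.columns so every column appears, and LEFT JOIN pg_description so columns without a comment simply get a NULL description (trimmed to the core columns; the data_type/PK/FK expressions from the original query can be added back unchanged):

```sql
SELECT c.table_schema, c.table_name, c.column_name,
       c.data_type, c.is_nullable,
       pgd.description          -- NULL when no comment exists
FROM information_schema.columns c
JOIN pg_catalog.pg_statio_all_tables st
  ON st.schemaname = c.table_schema
 AND st.relname    = c.table_name
LEFT OUTER JOIN pg_catalog.pg_description pgd
  ON pgd.objoid    = st.relid
 AND pgd.objsubid  = c.ordinal_position   -- objsubid matches the column number
WHERE c.table_name = 'test'
ORDER BY c.table_schema, c.table_name, c.ordinal_position;
```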
--
Justin Pryzby
System Administrator
Telsasoft
+1-952-707-8581
