Re: Performance differential when 0 values present vs when 1 values present. Planner returns 52k rows when 0 expected.
> You've got the wrong column order (for this query anyway) in that
> index.  It'd work a lot better if dataview were the first column;

I might be misunderstanding you, but I assume that you are suggesting an
index on (dataview, valuetimestamp). We have that index, it is the
primary key. For some reason it isn't being selected.

I can understand that it has to go through the whole index, potentially
even the whole table, but I do not know why it takes so long. Even a
query that should take equally long (probably longer) is substantially
faster:

explain (analyze, buffers)
select valuetimestamp from datavalue
where valuetimestamp <> '1965-01-07 05:50:59';

It completes in less than 500ms using a sequential scan:

...
->  Seq Scan on datavalue_2022_04 datavalue_7  (cost=0.00..1450.39 rows=56339 width=8) (actual time=0.013..5.988 rows=56109 loops=1)
      Filter: (valuetimestamp <> '1965-01-07 05:50:59'::timestamp without time zone)
      Buffers: shared hit=742 read=4
...
Planning Time: 0.781 ms
Execution Time: 394.408 ms

while the original query takes over 1 second:

...
->  Index Scan Backward using datavalue_2022_04_valuetimestamp_dataview_idx on datavalue_2022_04 datavalue_7  (cost=0.29..4292.48 rows=56351 width=227) (actual time=0.166..17.340 rows=56109 loops=1)
      Buffers: shared hit=42013 read=278
...
Planning Time: 0.964 ms
Execution Time: 1291.509 ms

I do not understand how looking at every value in the index and
returning nothing can be slower than looking at every row in the table
and returning nothing. If it takes 500ms to return every value in the
table via a sequential scan, then it should take less via an index scan.

In case we never solve it, and someone else runs into similar problems:
we (hopefully temporarily) worked around it by reformulating the query
to use a lateral join:

EXPLAIN (analyze, buffers)
SELECT dv.*
FROM valueseries vs
LEFT JOIN LATERAL (
    SELECT * FROM datavalue dv
    WHERE dv.dataview = vs.id
    ORDER BY VALUETIMESTAMP
    FETCH FIRST 1 ROWS ONLY
) dv ON TRUE
WHERE vs.channel = 752433

This causes it to use the correct index:

->  Index Scan using datavalue_2022_01_pkey on datavalue_2022_01 dv_4  (cost=0.42..2951.17 rows=1032 width=228) (actual time=0.034..0.034 rows=0 loops=1)
      Index Cond: (dataview = vs.id)
      Buffers: shared read=3
...
Planning Time: 1.169 ms
Execution Time: 0.524 ms

Regards
Emil

On 2022-04-25 18:00, Tom Lane wrote:
> Emil Iggland writes:
>> The query that is giving us issues is the following, channel 752433
>> has NO values, 752431 has values. (Channel 752433 only has
>> valueseries 752434)
>> select * from datavalue
>> where dataview in (select id from valueseries where channel = %channel_idx%)
>> ORDER BY VALUETIMESTAMP DESC
>> FETCH FIRST ROW only;
>> Running explain analyze shows strange numbers, 52'000 rows are being
>> returned but there are no rows there.
>> For channel 752433
>> ->  Index Scan Backward using datavalue_2022_03_valuetimestamp_dataview_idx on datavalue_2022_03 datavalue_6  (cost=0.42..7166.19 rows=119673 width=226) (actual time=0.008..32.831 rows=119601 loops=1)
>
> You've got the wrong column order (for this query anyway) in that
> index.  It'd work a lot better if dataview were the first column;
> or at least, it wouldn't tempt the planner to try this unstably-
> performing plan.  It's trying to use the index ordering to satisfy
> the ORDER BY, which works great as long as it finds a dataview match
> in some reasonably recent index entry.  Otherwise, it's going to
> crawl the whole index to discover that there's no match.
>
>                         regards, tom lane
Re: Performance differential when 0 values present vs when 1 values present. Planner returns 52k rows when 0 expected.
On Wed, 27 Apr 2022 at 19:54, Emil Iggland wrote:
>> You've got the wrong column order (for this query anyway) in that
>> index.  It'd work a lot better if dataview were the first column;
> I might be misunderstanding you, but I assume that you are suggesting
> an index on (dataview, valuetimestamp).
> We have that index, it is the primary key. For some reason it isn't
> being selected.

I don't think that index can be used for your original query. It could
only be used if "channel" is unique in "valueseries" and you'd written
the query as:

select * from datavalue
where dataview = (select id from valueseries where channel = 752433)
ORDER BY VALUETIMESTAMP DESC
FETCH FIRST ROW only;

That would allow a backwards index scan using the (dataview,
valuetimestamp) index. Because you're using the IN clause to possibly
look for multiple "dataview" values matching the given "channel", the
index range scan does not have a single point to start at.

What you've done with the LATERAL query allows the index to be scanned
once for each "valueseries" row with a "channel" value matching your
WHERE clause. I guess "channel" must not be the primary key of
"valueseries" and that's why you use an IN().

The above query would return an error if multiple rows were returned by
the subquery.

David
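A minimal sketch of the shape David describes, under the assumption that
"channel" really is unique in "valueseries" (the constraint itself is not
from the thread and is only shown to make the scalar subquery legal):

-- assumed constraint, not from the thread: one valueseries per channel
ALTER TABLE valueseries ADD CONSTRAINT valueseries_channel_key UNIQUE (channel);

-- with a single dataview to start from, the (dataview, valuetimestamp)
-- primary key can be scanned backwards to satisfy the ORDER BY
SELECT *
FROM datavalue
WHERE dataview = (SELECT id FROM valueseries WHERE channel = 752433)
ORDER BY valuetimestamp DESC
FETCH FIRST ROW ONLY;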
Fwd: Array of integer indexed nested-loop semi join
Hello everyone,

1) Context

I'm working with large tables containing arrays of integers, indexed
with "gin__int_ops" GIN indexes offered by the "intarray" extension.
The goal I'm trying to achieve is to do a "nested loop semi join" using
the array inclusion operation (@>) as the join condition, but in an
indexed way. (Basically an INNER JOIN without the duplicate rows and
without needing to use columns from the joined table.)

2) Configuration

The queries are run on a PostgreSQL v14 server with 32GB RAM and 8
vCPUs on a 64-bit ARM Neoverse architecture (m6g.2xlarge AWS RDS
instance).
PostgreSQL's configuration uses the following key values:

- work_mem = 8GB (only set for this query)
- shared_buffers = 8GB
- effective_cache_size = 22GB
- max_worker_processes = 8
- max_parallel_workers_per_gather = 4
- jit = on

3) Tables

The "light_pages_attributes" table contains about 2 million rows, each
with an "attributes" column containing on average 20 integers.

CREATE TABLE light_pages_attributes
(
    id           INTEGER   NOT NULL,
    "attributes" INTEGER[] NOT NULL
);

CREATE INDEX light_pages_attributes_attributes
    ON light_pages_attributes
    USING gin (attributes gin__int_ops);

The "light_pages_views" table contains about 25 million rows, each with
a "page_ids" column containing on average 20 integers as well.

CREATE TABLE light_pages_views
(
    vector_id BIGINT    NOT NULL,
    page_ids  INTEGER[] NOT NULL
);

CREATE INDEX light_pages_views_page_ids
    ON light_pages_views
    USING gin (page_ids gin__int_ops);

4) Query

The query I'm trying to optimise is the following:

BEGIN;

SET LOCAL work_mem = '8GB';

CREATE TEMPORARY VIEW urls AS
(
    SELECT ARRAY[lpa.id] AS page_id
    FROM light_pages_attributes AS lpa
    WHERE lpa."attributes" @> ARRAY[189376]
);

EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING)
SELECT COUNT(*)
FROM light_pages_views AS lpv
WHERE EXISTS (
    SELECT 1
    FROM urls AS u
    WHERE lpv.page_ids @> u.page_id
);

COMMIT;

The last query does not finish after waiting for more than 15 minutes.
(The temporary view creation is very fast, and it is required because
writing the same query as a CTE greatly reduces performance (by more
than 5 min.), due to the optimisation barrier I'm guessing.)

This alternative query, which should be far slower due to the fact that
it generates duplicate rows through the INNER JOIN, is in fact much
faster, 1 min. and 39 s.:

EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS, TIMING)
SELECT COUNT(*)
FROM (
    SELECT 1
    FROM light_pages_views AS lpv
    INNER JOIN urls AS u
        ON lpv.page_ids @> u.page_id
    GROUP BY lpv.vector_id
) AS t;

Visual query plan: https://explain.dalibo.com/plan/bc3#plan
Raw query plan: https://explain.dalibo.com/plan/bc3#raw

Other strategies I've tried as well:

- lpv.page_ids @> ANY(SELECT u.page_id FROM urls AS u)
- FULL OUTER JOIN, not possible due to the condition not being
  merge-joinable

The end-goal would be to update all matching "light_pages_views" rows
by appending an integer to their array of integers. So possibly
millions of rows would be updated.

Thank you a lot in advance for your help!

Mickael
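For the end goal mentioned above (appending an integer to the arrays of
all matching "light_pages_views" rows), a minimal sketch could look like
the following; the appended value 999999 is only a placeholder, and the
EXISTS condition simply inlines the "urls" view from the message:

-- 999999 is a placeholder for the integer to append
UPDATE light_pages_views AS lpv
SET page_ids = lpv.page_ids || 999999
WHERE EXISTS (
    SELECT 1
    FROM light_pages_attributes AS lpa
    WHERE lpa."attributes" @> ARRAY[189376]
      AND lpv.page_ids @> ARRAY[lpa.id]
);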
Re: Array of integer indexed nested-loop semi join
On Wed, Apr 27, 2022 at 8:19 AM Mickael van der Beek <
[email protected]> wrote:

> The last query does not finish after waiting for more than 15 minutes.
> (The temporary view creation is very fast and required due to the same
> query in a CTE greatly reducing performance (by more than 5 min.) due
> to the optimisation barrier I'm guessing.)

How much over 15 minutes? 20 minutes doesn't seem that long to wait to
get a likely definitive answer. But at the least show us the EXPLAIN
without ANALYZE of it; that should take no milliseconds.

And what does it mean for something to take 5 minutes longer than
"never finishes"?

(Also, putting every or every other token on a separate line does not
make it easier to read.)

Cheers,

Jeff
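For reference, a sketch of the plain EXPLAIN Jeff is asking for, reusing
the query and the temporary "urls" view from the earlier message (no
execution, so it should return almost immediately):

EXPLAIN (VERBOSE, COSTS)
SELECT COUNT(*)
FROM light_pages_views AS lpv
WHERE EXISTS (
    SELECT 1
    FROM urls AS u
    WHERE lpv.page_ids @> u.page_id
);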
Re: Array of integer indexed nested-loop semi join
Hello Jeff,

I have waited a few hours without the query ever finishing, which is
the reason I said "never finishes". Especially because the INNER JOIN
version finishes within a few minutes while being combinatorial and
less efficient.
The query probably only does sequential scans.

You will find the query plan using EXPLAIN here:

- Visual query plan: https://explain.dalibo.com/plan#plan
- Raw query plan: https://explain.dalibo.com/plan#raw

Thanks for your help,

Mickael

On Wed, Apr 27, 2022 at 4:28 PM Jeff Janes wrote:
> On Wed, Apr 27, 2022 at 8:19 AM Mickael van der Beek <
> [email protected]> wrote:
>
>> The last query does not finish after waiting for more than 15 minutes.
>> (The temporary view creation is very fast and required due to the same
>> query in a CTE greatly reducing performance (by more than 5 min.) due
>> to the optimisation barrier I'm guessing.)
>
> How much over 15 minutes? 20 minutes doesn't seem that long to wait to
> get a likely definitive answer. But at the least show us the EXPLAIN
> without ANALYZE of it; that should take no milliseconds.
>
> And what does it mean for something to take 5 minutes longer than
> "never finishes"?
>
> (Also, putting every or every other token on a separate line does not
> make it easier to read.)
>
> Cheers,
>
> Jeff

--
Mickael van der Beek
Web developer & Security analyst
[email protected]
Unworkable plan above certain row count
I noticed an issue in a simple query with WHERE NOT IN (SELECT ...).
I am aware that anti-joins with NOT IN are currently not optimized and
should be rewritten as WHERE NOT EXISTS (SELECT ...), so if this is
irrelevant please just ignore it.

Here is a setup that works:

CREATE TABLE a (
    a_id serial NOT NULL,
    PRIMARY KEY (a_id)
);

CREATE TABLE b (
    b_id serial NOT NULL,
    a_id int    NOT NULL,
    PRIMARY KEY (b_id)
);

INSERT INTO a(a_id) SELECT generate_series(1, 20000);
INSERT INTO b(b_id, a_id) SELECT generate_series(1, 500000), floor(random() * 22000 + 1)::int;

ANALYZE a;
ANALYZE b;

EXPLAIN SELECT count(*) FROM b WHERE a_id NOT IN (SELECT a_id FROM a);

Finalize Aggregate  (cost=7596.23..7596.24 rows=1 width=8)
  ->  Gather  (cost=7596.12..7596.23 rows=1 width=8)
        Workers Planned: 1
        ->  Partial Aggregate  (cost=6596.12..6596.13 rows=1 width=8)
              ->  Parallel Seq Scan on b  (cost=339.00..6228.47 rows=147059 width=0)
                    Filter: (NOT (hashed SubPlan 1))
                    SubPlan 1
                      ->  Seq Scan on a  (cost=0.00..289.00 rows=20000 width=4)

Fiddle: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=497ab1d5eec6e02d4d1c0f6630b6f1f1

Now if you change

INSERT INTO a(a_id) SELECT generate_series(1, 20000);

to

INSERT INTO a(a_id) SELECT generate_series(1, 200000);

i.e. add a zero, the plan becomes this:

Finalize Aggregate  (cost=759860198.41..759860198.42 rows=1 width=8)
  ->  Gather  (cost=759860198.29..759860198.40 rows=1 width=8)
        Workers Planned: 1
        ->  Partial Aggregate  (cost=759859198.29..759859198.30 rows=1 width=8)
              ->  Parallel Seq Scan on b  (cost=0.00..759858830.65 rows=147059 width=0)
                    Filter: (NOT (SubPlan 1))
                    SubPlan 1
                      ->  Materialize  (cost=0.00..4667.00 rows=200000 width=4)
                            ->  Seq Scan on a  (cost=0.00..2885.00 rows=200000 width=4)

Fiddle: https://dbfiddle.uk/?rdbms=postgres_14&fiddle=bec018196195635cb6ec05ccae32137c
Re: Unworkable plan above certain row count
André Hänsel writes:
> Now if you change
> INSERT INTO a(a_id) SELECT generate_series(1, 20000);
> to
> INSERT INTO a(a_id) SELECT generate_series(1, 200000);
> i.e. add a zero, the plan becomes [ not a hashed subplan ]

Yeah, it won't hash the subplan if the estimated size of the hash table
exceeds work_mem. In this case, boosting work_mem would be a mighty
good idea.

                        regards, tom lane
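A minimal sketch of the two follow-ups suggested in this thread: raising
work_mem so the subplan can be hashed again, or the NOT EXISTS form
mentioned in the first message. The 64MB value is an assumption, not
taken from the thread; size it to the expected hash table.

-- assumption: 64MB is enough for a hash table over a's a_id values
SET work_mem = '64MB';
EXPLAIN SELECT count(*) FROM b WHERE a_id NOT IN (SELECT a_id FROM a);

-- alternative: NOT EXISTS is planned as an anti-join and does not
-- depend on work_mem to keep a sane plan shape
EXPLAIN SELECT count(*) FROM b
WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.a_id = b.a_id);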
