Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.

2022-04-27 Thread Emil Iggland

> You've got the wrong column order (for this query anyway) in that
> index.  It'd work a lot better if dataview were the first column;
I might be misunderstanding you, but I assume that you are suggesting an 
index on (dataview, valuetimestamp).
We have that index, it is the primary key. For some reason it isn't 
being selected.


I can understand that it has to go through the whole index, potentially 
even the whole table, but I do not understand why it takes so long.


Even a query that should take at least as long (probably longer) is 
substantially faster:


explain (analyze, buffers)
select valuetimestamp from datavalue
where valuetimestamp <> '1965-01-07 05:50:59';

Completes in less than 500ms using a sequential scan:

...
->  Seq Scan on datavalue_2022_04 datavalue_7  (cost=0.00..1450.39 rows=56339 width=8) (actual time=0.013..5.988 rows=56109 loops=1)
      Filter: (valuetimestamp <> '1965-01-07 05:50:59'::timestamp without time zone)

Buffers: shared hit=742 read=4
...
Planning Time: 0.781 ms
Execution Time: 394.408 ms


while the original query takes over 1 second.
...
->  Index Scan Backward using datavalue_2022_04_valuetimestamp_dataview_idx on datavalue_2022_04 datavalue_7  (cost=0.29..4292.48 rows=56351 width=227) (actual time=0.166..17.340 rows=56109 loops=1)

Buffers: shared hit=42013 read=278
...
Planning Time: 0.964 ms
Execution Time: 1291.509 ms

I do not understand how looking at every value in the index and 
returning none can be slower than looking at every row in the table and 
returning none. If it takes 500ms to return every value in the table via 
a sequential scan, then it should take less via an index scan.
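
(One way to compare the two strategies on the same query is with 
session-local planner toggles; a sketch using the standard GUCs, for 
diagnosis only:)

BEGIN;
-- steer the planner away from the index-scan plan for this transaction
SET LOCAL enable_indexscan = off;
SET LOCAL enable_indexonlyscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM datavalue
WHERE dataview IN (SELECT id FROM valueseries WHERE channel = 752433)
ORDER BY VALUETIMESTAMP DESC
FETCH FIRST ROW ONLY;
ROLLBACK;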



In case we never solve it, and someone else runs into similar problems, 
we (hopefully temporarily) worked around it by reformulating the query 
to use a lateral join:


EXPLAIN (analyze, buffers)
SELECT dv.* FROM valueseries vs
LEFT JOIN LATERAL (
SELECT * FROM datavalue dv WHERE dv.dataview = vs.id
ORDER BY VALUETIMESTAMP
FETCH FIRST 1 ROWS ONLY
) dv ON TRUE
where vs.channel = 752433

This causes it to use the correct index:
->  Index Scan using datavalue_2022_01_pkey on datavalue_2022_01 dv_4  (cost=0.42..2951.17 rows=1032 width=228) (actual time=0.034..0.034 rows=0 loops=1)

Index Cond: (dataview = vs.id)
Buffers: shared read=3
...
Planning Time: 1.169 ms
Execution Time: 0.524 ms


Regards
Emil


On 2022-04-25 18:00, Tom Lane wrote:

Emil Iggland  writes:

The query that is giving us issues is the following: channel 752433 has
NO values, 752431 has values.
(Channel 752433 only has valueseries 752434)



select * from datavalue
where dataview in ( select id from valueseries where channel =
%channel_idx%)
ORDER BY VALUETIMESTAMP DESC
FETCH FIRST ROW only;



Running explain analyze shows strange numbers: 52'000 rows are being
returned, but there are no rows there.



For channel 752433
->  Index Scan Backward using datavalue_2022_03_valuetimestamp_dataview_idx on datavalue_2022_03 datavalue_6  (cost=0.42..7166.19 rows=119673 width=226) (actual time=0.008..32.831 rows=119601 loops=1)


You've got the wrong column order (for this query anyway) in that
index.  It'd work a lot better if dataview were the first column;
or at least, it wouldn't tempt the planner to try this unstably-
performing plan.  It's trying to use the index ordering to satisfy
the ORDER BY, which works great as long as it finds a dataview
match in some reasonably recent index entry.  Otherwise, it's
going to crawl the whole index to discover that there's no match.
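
(Concretely, that suggestion amounts to something like the following
sketch; the index name is hypothetical:)

CREATE INDEX datavalue_dataview_valuetimestamp_idx
    ON datavalue (dataview, valuetimestamp);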

regards, tom lane





Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.

2022-04-27 Thread David Rowley
On Wed, 27 Apr 2022 at 19:54, Emil Iggland  wrote:
>
>  > You've got the wrong column order (for this query anyway) in that
>  > index.  It'd work a lot better if dataview were the first column;

> I might be misunderstanding you, but I assume that you are suggesting an
> index on (dataview, valuetimestamp).
> We have that index, it is the primary key. For some reason it isn't
> being selected.

I don't think that index can be used for your original query. It could
only be used if "channel" is unique in "valueseries" and you'd written
the query as:

select * from datavalue
where dataview = (select id from valueseries where channel = 752433)
ORDER BY VALUETIMESTAMP DESC
FETCH FIRST ROW only;

that would allow a backwards index scan using the (dataview,
valuetimestamp) index.  Because you're using the IN clause to possibly
look for multiple "dataview" values matching the given "channel", the
index range scan does not have a single point to start at.  What
you've done with the LATERAL query allows the index to be scanned once
for each "valueseries" row with a "channel" value matching your WHERE
clause.

I guess "channel" must not be the primary key to "valueseries" and
that's why you use an IN().

The above query would return an error if multiple rows were returned
by the subquery.
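
(If "channel" really is meant to be unique in "valueseries", declaring
it lets the "=" form above be used safely; a sketch, constraint name
hypothetical:)

ALTER TABLE valueseries
    ADD CONSTRAINT valueseries_channel_key UNIQUE (channel);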

David




Fwd: Array of integer indexed nested-loop semi join

2022-04-27 Thread Mickael van der Beek
Hello everyone,

*1) Context*

I'm working with large tables containing arrays of integers, indexed with "
*gin__int_ops*" GIN indexes offered by the "*intarray*" extension.
The goal I'm trying to achieve is to do a "nested loop semi join" using the
array inclusion operation (@>) as join condition but in an indexed way.
(Basically an INNER JOIN without the duplicate rows and without needing to
use columns from the joined table.)

*2) Configuration*

The queries are run on a PostgreSQL v14 server with 32GB RAM and 8 vCPUs on
a 64 bit ARM Neoverse architecture (m6g.2xlarge AWS RDS instance).
PostgreSQL's configuration uses the following key values:


   - work_mem = 8GB (only set for this query)
   - shared_buffers = 8GB
   - effective_cache_size = 22GB
   - max_worker_processes = 8
   - max_parallel_workers_per_gather = 4
   - jit = on

*3) Tables*

The "light_pages_attributes" contains about 2 million rows, each with an
"attributes" column containing on average 20 integers.

CREATE TABLE
>   light_pages_attributes
>   (
> id            INTEGER   NOT NULL,
> "attributes"  INTEGER[] NOT NULL
>   )
> ;
> CREATE INDEX
>   light_pages_attributes_attributes
> ON
>   light_pages_attributes
> USING
>   gin
>   (
> attributes gin__int_ops
>   )
> ;


The "light_pages_views" contains about 25 million rows, each with a
"page_ids" column containing on average 20 integers as well.

CREATE TABLE
>   light_pages_views
>   (
> vector_id BIGINT    NOT NULL,
> page_ids  INTEGER[] NOT NULL
>   )
> ;
> CREATE INDEX
>   light_pages_views_page_ids
> ON
>   light_pages_views
> USING
>   gin
>   (
> page_ids gin__int_ops
>   )
> ;


*4) Query*

The query I'm trying to optimise is the following:

BEGIN;



SET LOCAL work_mem = '8GB';



CREATE TEMPORARY VIEW
>   urls
>   AS
>   (
> SELECT ARRAY[lpa.id]
> AS page_id
>   FROM
> light_pages_attributes
>   AS lpa
>   WHERE
> lpa."attributes" @> ARRAY[189376]
>   );
> EXPLAIN (
>   ANALYZE,
>   VERBOSE,
>   COSTS,
>   BUFFERS,
>   TIMING
> )
> SELECT
>   COUNT(*)
> FROM
>   light_pages_views
> AS lpv
> WHERE
>   EXISTS (
> SELECT
>   1
> FROM
>   urls
> AS u
> WHERE
>   lpv.page_ids @> u.page_id
>   )
> ;



COMMIT;


The last query does not finish even after waiting for more than 15 minutes.
(The temporary view creation is very fast. It is required because writing
the same query as a CTE greatly reduced performance, by more than 5 min.,
due to the optimisation barrier, I'm guessing.)
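
(For completeness: since PostgreSQL 12 the CTE barrier can be controlled
explicitly, so a variant worth sketching, untested here, would be:)

WITH urls AS NOT MATERIALIZED (
    SELECT ARRAY[lpa.id] AS page_id
    FROM light_pages_attributes AS lpa
    WHERE lpa."attributes" @> ARRAY[189376]
)
SELECT COUNT(*)
FROM light_pages_views AS lpv
WHERE EXISTS (
    SELECT 1 FROM urls AS u WHERE lpv.page_ids @> u.page_id
);
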
This alternative query, which should be far slower because it generates
duplicate rows through the INNER JOIN, is in fact much faster (1 min.
and 39 s.):

EXPLAIN (
>   ANALYZE,
>   VERBOSE,
>   COSTS,
>   BUFFERS,
>   TIMING
> )
> SELECT
>   COUNT(*)
> FROM
>   (
> SELECT
>   1
> FROM
>   light_pages_views
> AS lpv
> INNER JOIN
>   urls
> AS u
> ON lpv.page_ids @> u.page_id
> GROUP BY
>   lpv.vector_id
>   )
> AS t
> ;


Visual query plan: https://explain.dalibo.com/plan/bc3#plan
Raw query plan: https://explain.dalibo.com/plan/bc3#raw

Other strategies I've tried as well:

   - lpv.page_ids @> ANY(SELECT u.page_id FROM urls AS u)
   - FULL OUTER JOIN, not possible due to the condition not being
   merge-joinable
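
Another formulation might be worth trying. Since every page_id in urls
is a one-element array (see the view definition above), the EXISTS test
is equivalent to an overlap (&&) test against the aggregated id list,
and gin__int_ops indexes && as well; a sketch, not benchmarked here:

SELECT COUNT(*)
FROM light_pages_views AS lpv
WHERE lpv.page_ids && (
    -- computed once as an InitPlan; may get slow if the id list is huge
    SELECT array_agg(lpa.id)
    FROM light_pages_attributes AS lpa
    WHERE lpa."attributes" @> ARRAY[189376]
);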

The end goal would be to update all matching "light_pages_views" rows by
appending an integer to their array of integers.
So possibly millions of rows to be updated.
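
Such an update could reuse the same overlap condition; a sketch, where
the appended id 12345 is only a placeholder:

UPDATE light_pages_views AS lpv
SET page_ids = lpv.page_ids || 12345  -- placeholder for the id to append
WHERE lpv.page_ids && (
    SELECT array_agg(lpa.id)
    FROM light_pages_attributes AS lpa
    WHERE lpa."attributes" @> ARRAY[189376]
);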

Thank you a lot in advance for your help!

Mickael


Re: Array of integer indexed nested-loop semi join

2022-04-27 Thread Jeff Janes
On Wed, Apr 27, 2022 at 8:19 AM Mickael van der Beek
<[email protected]> wrote:

>
> The last query does not finish after waiting for more than 15 minutes.
> (The temporary view creation is very fast and required due to the same
> query in a CTE greatly reducing performance (by more than 5 min.) due to
> the optimisation barrier I'm guessing.)
>

How much over 15 minutes?  20 minutes doesn't seem that long to wait to get
a likely definitive answer.  But at the least show us the EXPLAIN without
ANALYZE of it; that should take only milliseconds.

And what does it mean for something to take 5 minutes longer than "never
finishes"?

(Also, putting every or every other token on a separate line does not make
it easier to read)

Cheers,

Jeff



Re: Array of integer indexed nested-loop semi join

2022-04-27 Thread Mickael van der Beek
Hello Jeff,

I have waited a few hours without the query ever finishing, which is the
reason I said "never finishes".
Especially because the INNER JOIN version finishes within a few minutes
despite being combinatorial and less efficient.
The query probably only does sequential scans.

You will find the query plan using EXPLAIN here:
- Visual query plan: https://explain.dalibo.com/plan#plan
- Raw query plan: https://explain.dalibo.com/plan#raw

Thanks for your help,

Mickael


-- 
Mickael van der Beek
Web developer & Security analyst

[email protected]


Unworkable plan above certain row count

2022-04-27 Thread André Hänsel
I noticed an issue in a simple query with WHERE NOT IN (SELECT ...). I am
aware that anti-joins with NOT IN are currently not optimized and should be
rewritten as WHERE NOT EXISTS (SELECT ...), so if this is irrelevant please
just ignore it.
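
For reference, the NOT EXISTS form of the query below would be something
like this (equivalent here because a_id is NOT NULL in both tables):

SELECT count(*) FROM b WHERE NOT EXISTS (SELECT 1 FROM a WHERE a.a_id = b.a_id);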

Here is a setup that works:

CREATE TABLE a
(
a_id   serial NOT NULL,
PRIMARY KEY (a_id)
);
CREATE TABLE b
(
b_id   serial NOT NULL,
a_id   int    NOT NULL,
PRIMARY KEY (b_id)
);

INSERT INTO a(a_id) SELECT generate_series(1, 20000);
INSERT INTO b(b_id, a_id) SELECT generate_series(1, 500000), floor(random()
* 22000 + 1)::int;

ANALYZE a;
ANALYZE b;

EXPLAIN SELECT count(*) FROM b WHERE a_id NOT IN (SELECT a_id FROM a);

Finalize Aggregate  (cost=7596.23..7596.24 rows=1 width=8)
  ->  Gather  (cost=7596.12..7596.23 rows=1 width=8)
        Workers Planned: 1
        ->  Partial Aggregate  (cost=6596.12..6596.13 rows=1 width=8)
              ->  Parallel Seq Scan on b  (cost=339.00..6228.47 rows=147059 width=0)
                    Filter: (NOT (hashed SubPlan 1))
                    SubPlan 1
                      ->  Seq Scan on a  (cost=0.00..289.00 rows=20000 width=4)

Fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=497ab1d5eec6e02d4d1c0f6630b6f1f1

Now if you change
INSERT INTO a(a_id) SELECT generate_series(1, 20000);
to
INSERT INTO a(a_id) SELECT generate_series(1, 200000);
i.e. add a zero, the plan becomes this:

Finalize Aggregate  (cost=759860198.41..759860198.42 rows=1 width=8)
  ->  Gather  (cost=759860198.29..759860198.40 rows=1 width=8)
        Workers Planned: 1
        ->  Partial Aggregate  (cost=759859198.29..759859198.30 rows=1 width=8)
              ->  Parallel Seq Scan on b  (cost=0.00..759858830.65 rows=147059 width=0)
                    Filter: (NOT (SubPlan 1))
                    SubPlan 1
                      ->  Materialize  (cost=0.00..4667.00 rows=200000 width=4)
                            ->  Seq Scan on a  (cost=0.00..2885.00 rows=200000 width=4)

Fiddle:
https://dbfiddle.uk/?rdbms=postgres_14&fiddle=bec018196195635cb6ec05ccae32137c






Re: Unworkable plan above certain row count

2022-04-27 Thread Tom Lane
André Hänsel  writes:
> Now if you change
> INSERT INTO a(a_id) SELECT generate_series(1, 20000);
> to
> INSERT INTO a(a_id) SELECT generate_series(1, 200000);
> i.e. add a zero, the plan becomes [ not a hashed subplan ]

Yeah, it won't hash the subplan if the estimated size of the hash
table exceeds work_mem.  In this case, boosting work_mem would be
a mighty good idea.
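
For instance (the value is illustrative; size it to the expected hash
table):

SET work_mem = '64MB';
-- with enough work_mem the planner should go back to the hashed SubPlan
EXPLAIN SELECT count(*) FROM b WHERE a_id NOT IN (SELECT a_id FROM a);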

regards, tom lane