Re: How to solve my slow disk i/o throughput during index scan

2024-07-04 Thread Andrei Lepikhov
On 4/7/2024 20:25, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE 
TGV / DM RMP YIELD MANAGEMENT) wrote:

> *My question is: what can I change to get a better index reading speed?*
>
> What I already tried:
>
>   * Setting random_page_cost to a prohibitive value (1000) to force a
> bitmap heap scan, because those can be made in parallel. This has
> not worked; the optimizer is still doing an index scan on my fact table.
>   * Changing effective_io_concurrency, max_parallel_workers_per_gather,
> and work_mem to much higher values.
I'm not sure this case is only about index scan speed. Look at the slow
index clause:

fth."ID_TRAIN" = ANY ('{17855,13945,536795,18838,18837,13574 ...

and many more values. An IndexScan has to perform a separate scan for
each of these values, and for each value walk the pages to check the
other conditions.
We have already discussed some optimisations for this case in a couple
of pgsql-hackers threads, but I'm not sure we have a quick solution
right now.
If you want to use a BitmapScan (which might be reasonable to try
here), you need to split the huge ANY (...) clause into a sequence of ORs.
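
For instance, a minimal self-contained sketch of that rewrite (the
table, index, and values here are hypothetical; only the shape of the
WHERE clause matters):

-- hypothetical stand-in for the real fact table
CREATE TABLE fact_train (id_train int, payload text);
CREATE INDEX ON fact_train (id_train);

-- a huge ANY (...) array tends to be planned as a plain IndexScan:
SELECT count(*) FROM fact_train
WHERE id_train = ANY ('{17855,13945,536795}');

-- the same predicate as a sequence of ORs can instead be planned as a
-- BitmapOr over the index, feeding a (parallelisable) bitmap heap scan:
SELECT count(*) FROM fact_train
WHERE id_train = 17855
   OR id_train = 13945
   OR id_train = 536795;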
Also, maybe a parallel append could help here, if you change the
corresponding startup and tuple costs to force such a plan.


--
regards, Andrei Lepikhov





RE: How to solve my slow disk i/o throughput during index scan

2024-07-04 Thread FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT)
Hello,

Thank you, splitting the query into “OR” clauses definitely enables
bitmap heap scans, and thus parallelized reads from disk 😃! I did not,
though, understand your second point: what is parallel append, and how
do I enable it?

Simon F.




From: Andrei Lepikhov
Sent: Thursday, July 4, 2024 16:37
To: FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT); [email protected]; Peter Geoghegan
Subject: Re: How to solve my slow disk i/o throughput during index scan


On 4/7/2024 20:25, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE

TGV / DM RMP YIELD MANAGEMENT) wrote:

> *My question is : what can I change to get a better index reading speed ?*

>

> What I already tried :

>

>   * Setting random_page_cost to prohibitive value (1000) to force a

> bitmap heap scan, because those can be made in parallel. This has

> not worked, the optimizer is still doing an index scan on my fact table.

>   * Change effective_io_concurrency, max_parallel_workers_per_gather,

> work_mem to much higher values.

I'm not sure the case is only about speed of index scan. Just see into

slow Index clause:

fth.\"ID_TRAIN\" = ANY ('{17855,13945,536795,18838,18837,13574 ...

and many more values.

IndexScan need to make scan for each of these values and for each value

go through the pages to check other conditions.

We already discuss some optimisations related to this case in couple of

pgsql-hackers threads. But I'm not sure we have quick solution right now.

If you want to use BitmapScan (that might be reasonable to try here) -

you need to split huge ANY (...) clause into sequence of ORs.

Also, may be parallel append could help here? if can change

corresponding startup and tuple costs to force such a plan.



--

regards, Andrei Lepikhov




Re: Hash Right join and seq scan

2024-07-04 Thread James Pang
   We have a daily vacuumdb and analyze job. Generally speaking the query
is done in seconds, but sometimes it suddenly runs for more than tens of
minutes with the same bind variable values and uses huge temp space; at
those times, EXPLAIN shows "Hash Anti Join, Hash Right Join" with seq
scans of the two tables.

Tomas Vondra wrote on Thursday, July 4, 2024 at 1:40 AM:

> Hi James,
>
> I think it'd be much easier to help you with investigating this issue if
> you shared the actual queries, and the full EXPLAIN ANALYZE output both
> with and without disabled hashjoin. Or even better, share a script that
> reproduces the issue (creates tables, loads data, runs the queries).
>
> BTW you suggested each partition has ~250k rows, but the explain plan
> snippet you shared does not seem to be consistent with that - it only
> shows 2500-5000 rows per partition. If you run ANALYZE on the table,
> does that change the plan?
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
>
--SQL text
PREPARE slowsql10(bigint, bigint, timestamp, timestamp) AS
SELECT m.partitionkeyid, COUNT(w.ctinfo) AS __count
  FROM testaa.wmdata m
  LEFT JOIN testaa.wmvtee w
    ON m.partitionkeyid = w.partitionkeyid
 WHERE m.sstype = 'lltype'
   AND m.llid = $1
   AND m.hhid = $2
   AND m.stime BETWEEN $3 AND $4
   AND NOT EXISTS (SELECT 1
                     FROM testaa.wmpct p
                    WHERE p.partitionkeyid = m.partitionkeyid
                      AND LOWER(p.ctinfo) = LOWER(w.ctinfo))
 GROUP BY m.partitionkeyid;

Generally speaking, it runs in less than 2 seconds, but sometimes it
suddenly switches to a bad plan.


schema:
   All three tables are hash-partitioned, with partitionkeyid as the
leading index key as well. Generally the plan is good: index scans plus
partition pruning on partitionkeyid. But sometimes the optimizer
suddenly switches to a "Hash Right Join" with seq scans on the large
table "testaa.wmpct", and that makes a very bad plan. Judging from the
costs, the optimizer appears to compute the bitmap index scan cost
across all partitions, which yields a huge total, even though only a
small part of the partitions is actually needed and the others are
never executed. The "rows" estimate for the left outer join is
similarly off, which makes a huge difference too.

Is there any way to make the optimizer consider partition pruning for
this SQL?

  
-- the bad-plan EXPLAIN output; in ANALYZE mode it hangs and consumes
huge temp space, so I had to cancel it.
test=> explain execute slowsql10(1234567, 678901234, '2024-06-15 03:00:00', '2024-07-02 03:00:00');

                                   QUERY PLAN
--------------------------------------------------------------------------------
 GroupAggregate  (cost=125657713.20..125659395.82 rows=8 width=62)
   Group Key: m.partitionkeyid
   ->  Sort  (cost=125657713.20..125658274.05 rows=224338 width=78)
         Sort Key: m.partitionkeyid
         ->  Hash Anti Join  (cost=107542871.02..125637774.81 rows=224338 width=78)
               Hash Cond: (((m.partitionkeyid)::text = (p.partitionkeyid)::text) AND (lower((w.ctinfo)::text) = lower((p.ctinfo)::text)))
               ->  Hash Right Join  (cost=22.46..338601.11 rows=299118 width=78)
                     Hash Cond: ((w.partitionkeyid)::text = (m.partitionkeyid)::text)
                     ->  Append  (cost=0.00..307545.17 rows=7477945 width=78)
                           ->  Seq Scan on wmvtee_p0 w_1  (cost=0.00..8263.66 rows=228666 width=78)
                           ->  Seq Scan on wmvtee_p1 w_2  (cost=0.00..9770.55 rows=270255 width=78)
                           ->  Seq Scan on wmvtee_p2 w_3  (cost=0.00..7751.92 rows=214392 width=78)
                           ->  Seq Scan on wmvtee_p3 w_4  (cost=0.00..7760.89 rows=214589 width=78)
                           ->  Seq Scan on wmvtee_p4 w_5  (cost=0.00..8733.78 rows=241678 width=78)
                           ->  Seq Scan on wmvtee_p5 w_6  (cost=0.00..7108.46 rows=196946 width=78)
                           ->  Seq Scan on wmvtee_p6 w_7  (cost=0.00..8047.84 rows=222784 width=78)
                           ->  Seq Scan on wmvtee_p7 w_8  (cost=0.00..6575.69 rows=181869 width=78)
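
One way to still capture the comparison Tomas asked for (a sketch; the
timeout is an arbitrary safeguard, not a recommendation) is to disable
hash joins for the session and re-run the same prepared statement, so
the two plans can be compared side by side:

SET enable_hashjoin = off;
-- guard against the run hanging again
SET statement_timeout = '5min';
EXPLAIN (ANALYZE, BUFFERS)
EXECUTE slowsql10(1234567, 678901234, '2024-06-15 03:00:00', '2024-07-02 03:00:00');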
 

Re: How to solve my slow disk i/o throughput during index scan

2024-07-04 Thread Andrei Lepikhov
On 7/4/24 22:23, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE 
TGV / DM RMP YIELD MANAGEMENT) wrote:

Hello,

Thank you, splitting the query into “OR” clauses definitely enables
bitmap heap scans, and thus parallelized reads from disk 😃! I did not,
though, understand your second point: what is parallel append, and how
do I enable it?

Just for example:

DROP TABLE IF EXISTS t CASCADE;
CREATE TABLE t (id int not null, payload text) PARTITION BY RANGE (id);
CREATE TABLE p1 PARTITION OF t FOR VALUES FROM (0) TO (1000);
CREATE TABLE p2 PARTITION OF t FOR VALUES FROM (1000) TO (2000);
CREATE TABLE p3 PARTITION OF t FOR VALUES FROM (2000) TO (3000);
CREATE TABLE p4 PARTITION OF t FOR VALUES FROM (3000) TO (4000);
INSERT INTO t SELECT x % 4000, repeat('a',128) || x FROM 
generate_series(1,1E5) AS x;

ANALYZE t;

SET enable_parallel_append = on;
SET parallel_setup_cost = 0.1;
SET parallel_tuple_cost = 0.1;
SET max_parallel_workers_per_gather = 8;
SET min_parallel_table_scan_size = 0;
SET min_parallel_index_scan_size = 0;

EXPLAIN (COSTS OFF)
SELECT t.id, t.payload FROM t WHERE t.id % 2 = 0
GROUP BY t.id, t.payload;

 Group
   Group Key: t.id, t.payload
   ->  Gather Merge
 Workers Planned: 6
 ->  Sort
   Sort Key: t.id, t.payload
   ->  Parallel Append
 ->  Parallel Seq Scan on p1 t_1
   Filter: ((id % 2) = 0)
 ->  Parallel Seq Scan on p2 t_2
   Filter: ((id % 2) = 0)
 ->  Parallel Seq Scan on p3 t_3
   Filter: ((id % 2) = 0)
 ->  Parallel Seq Scan on p4 t_4
   Filter: ((id % 2) = 0)

Here the table is scanned in parallel. It also works with IndexScan.

--
regards, Andrei Lepikhov





Re: Hash Right join and seq scan

2024-07-04 Thread David Rowley
On Fri, 5 Jul 2024 at 12:50, James Pang  wrote:
> We have a daily vacuumdb and analyze job. Generally speaking the query
> is done in seconds, but sometimes it suddenly runs for more than tens of
> minutes with the same bind variable values and uses huge temp space; at
> those times, EXPLAIN shows "Hash Anti Join, Hash Right Join" with seq
> scans of the two tables.

There was talk about adding costing for run-time partition pruning
factors but nothing was ever agreed, so nothing was done.  It's just
not that obvious to me how we'd do that.  If the Append had 10
partitions as subnodes, with an equality join condition, you could
assume we'll only match to 1 of those 10, but we've no idea at plan
time which one that'll be and the partitions might drastically vary in
size.  The best I think we could do is take the total cost of those 10
and divide by 10 to get the average cost.  It's much harder for range
conditions as those could match anything from 0 to all partitions. The
best suggestion I saw for that was to multiply the costs by
DEFAULT_INEQ_SEL.
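
For illustration: if those 10 partition scans cost 1000 units in total,
that rule would charge the equality join 1000 / 10 = 100 units, even
though the single partition that survives run-time pruning might really
cost anywhere from nearly 0 to far more than 100, depending on how
skewed the partition sizes are.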

I think for now, you might want to lower the random_page_cost or
increase effective_cache_size to encourage the nested loop -> index
scan plan. A good range for effective_cache_size is anywhere between 50%
and 75% of your server's RAM.  However, that might not be ideal if your
server is under memory pressure from other running processes. It also
depends on how large shared_buffers is as a percentage of total RAM.
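
A sketch of that tuning (the numbers below are illustrative assumptions,
not recommendations; size effective_cache_size against your actual RAM):

-- favour the nested loop -> index scan plan over hash join + seq scan
SET random_page_cost = 1.1;         -- default is 4.0; lower favours index access
SET effective_cache_size = '48GB';  -- e.g. ~75% of a 64GB server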

David