Re: How to solve my slow disk i/o throughput during index scan
On 4/7/2024 20:25, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE
TGV / DM RMP YIELD MANAGEMENT) wrote:
*My question is: what can I change to get better index reading speed?*
What I have already tried:
* Setting random_page_cost to a prohibitive value (1000) to force a
bitmap heap scan, because those can be made in parallel. This has
not worked; the optimizer is still doing an index scan on my fact table.
* Changing effective_io_concurrency, max_parallel_workers_per_gather,
and work_mem to much higher values.
I'm not sure this case is only about the speed of the index scan. Just
look at the clause behind the slow index scan:
fth."ID_TRAIN" = ANY ('{17855,13945,536795,18838,18837,13574 ...
and many more values.
An IndexScan has to probe the index once for each of these values, and
for each value walk the heap pages to check the other conditions.
We have already discussed optimisations for this case in a couple of
pgsql-hackers threads, but I'm not sure we have a quick solution right now.
If you want to use a BitmapScan (which might be reasonable to try here),
you need to split the huge ANY (...) clause into a sequence of ORs.
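A minimal sketch of the rewrite (the SELECT list is hypothetical; only the
fth."ID_TRAIN" column and the values come from your plan):

-- One large array condition, planned as a single IndexScan:
SELECT count(*) FROM fth
WHERE fth."ID_TRAIN" = ANY ('{17855,13945,536795}');

-- The same values spelled out as OR branches, which the planner can
-- execute as a BitmapOr over per-value bitmap index scans:
SELECT count(*) FROM fth
WHERE fth."ID_TRAIN" = 17855
   OR fth."ID_TRAIN" = 13945
   OR fth."ID_TRAIN" = 536795;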
Also, maybe Parallel Append could help here, if you change the
corresponding startup and tuple costs to force such a plan.
--
regards, Andrei Lepikhov
RE: How to solve my slow disk i/o throughput during index scan
From: FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE TGV / DM RMP YIELD MANAGEMENT)
Date: 2024-07-04
Hello,
Thank you, splitting the query into ORs definitely enables bitmap heap
scans, and thus parallelized reads from disk 😃! I did not understand your
second point, though: what is Parallel Append, and how do I enable it?
Simon F.
Re: Hash Right join and seq scan
we have a daily vacuumdb and analyze job; generally speaking it is done in
seconds, but sometimes it suddenly runs for tens of minutes with the same
bind variable values, huge temp space gets used, and at that time EXPLAIN
shows a "Hash Anti Join, Hash Right Join" with seq scans of two tables.

On Thursday, 4 July 2024 at 01:40, Tomas Vondra wrote:
> Hi James,
>
> I think it'd be much easier to help you with investigating this issue if
> you shared the actual queries, and the full EXPLAIN ANALYZE output both
> with and without disabled hashjoin. Or even better, share a script that
> reproduces the issue (creates tables, loads data, runs the queries).
>
> BTW you suggested each partition has ~250k rows, but the explain plan
> snippet you shared does not seem to be consistent with that - it only
> shows 2500-5000 rows per partition. If you run ANALYZE on the table,
> does that change the plan?
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

--SQL text
PREPARE slowsql10(bigint, bigint, timestamp, timestamp) AS
SELECT m.partitionkeyid, COUNT(w.ctinfo) AS __count
FROM testaa.wmdata m
LEFT JOIN testaa.wmvtee w ON m.partitionkeyid = w.partitionkeyid
WHERE m.sstype = 'lltype'
  AND m.llid = $1
  AND m.hhid = $2
  AND m.stime BETWEEN $3 AND $4
  AND NOT EXISTS (SELECT 1 FROM testaa.wmpct P
                  WHERE P.partitionkeyid = m.partitionkeyid
                    AND LOWER(P.ctinfo) = LOWER(w.ctinfo))
GROUP BY m.partitionkeyid;

Generally speaking it runs in less than 2 seconds, but sometimes it
suddenly goes to a bad plan.

Schema: all three tables are hash-partitioned (the partitionkeyid column
is the top index key too). Generally speaking, with an index scan and
partition pruning on partitionkeyid it is a good plan, but sometimes the
optimizer suddenly goes to a "Hash Right Join" with a seq scan on the
large table testaa.wmpct, which makes a very bad plan. From the cost, it
looks like the optimizer calculates the bitmap index scan cost across all
partitions, which shows a huge cost, although only a small part of the
partitions is actually needed (the others are "never executed"). The
"rows" estimate for the left outer join is similarly off, which makes a
huge difference too. Is there a way to make the optimizer consider
partition pruning for this SQL?

--the bad plan explain output; in ANALYZE mode it hangs and consumes huge
temp space, so I cancelled it.
test=> explain execute slowsql10(1234567,678901234,'2024-06-15 03:00:00','2024-07-02 03:00:00');
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=125657713.20..125659395.82 rows=8 width=62)
   Group Key: m.partitionkeyid
   ->  Sort  (cost=125657713.20..125658274.05 rows=224338 width=78)
         Sort Key: m.partitionkeyid
         ->  Hash Anti Join  (cost=107542871.02..125637774.81 rows=224338 width=78)
               Hash Cond: (((m.partitionkeyid)::text = (p.partitionkeyid)::text) AND (lower((w.ctinfo)::text) = lower((p.ctinfo)::text)))
               ->  Hash Right Join  (cost=22.46..338601.11 rows=299118 width=78)
                     Hash Cond: ((w.partitionkeyid)::text = (m.partitionkeyid)::text)
                     ->  Append  (cost=0.00..307545.17 rows=7477945 width=78)
                           ->  Seq Scan on wmvtee_p0 w_1  (cost=0.00..8263.66 rows=228666 width=78)
                           ->  Seq Scan on wmvtee_p1 w_2  (cost=0.00..9770.55 rows=270255 width=78)
                           ->  Seq Scan on wmvtee_p2 w_3  (cost=0.00..7751.92 rows=214392 width=78)
                           ->  Seq Scan on wmvtee_p3 w_4  (cost=0.00..7760.89 rows=214589 width=78)
                           ->  Seq Scan on wmvtee_p4 w_5  (cost=0.00..8733.78 rows=241678 width=78)
                           ->  Seq Scan on wmvtee_p5 w_6  (cost=0.00..7108.46 rows=196946 width=78)
                           ->  Seq Scan on wmvtee_p6 w_7  (cost=0.00..8047.84 rows=222784 width=78)
                           ->  Seq Scan on wmvtee_p7 w_8  (cost=0.00..6575.69 rows=181869 width=78)
Re: How to solve my slow disk i/o throughput during index scan
On 7/4/24 22:23, FREYBURGER Simon (SNCF VOYAGEURS / DIRECTION GENERALE
TGV / DM RMP YIELD MANAGEMENT) wrote:
Hello,
Thank you, splitting the query into ORs definitely enables bitmap heap
scans, and thus parallelized reads from disk 😃! I did not understand your
second point, though: what is Parallel Append, and how do I enable it?
Just for example:
DROP TABLE IF EXISTS t CASCADE;
CREATE TABLE t (id int not null, payload text) PARTITION BY RANGE (id);
CREATE TABLE p1 PARTITION OF t FOR VALUES FROM (0) TO (1000);
CREATE TABLE p2 PARTITION OF t FOR VALUES FROM (1000) TO (2000);
CREATE TABLE p3 PARTITION OF t FOR VALUES FROM (2000) TO (3000);
CREATE TABLE p4 PARTITION OF t FOR VALUES FROM (3000) TO (4000);
INSERT INTO t SELECT x % 4000, repeat('a',128) || x FROM
generate_series(1,1E5) AS x;
ANALYZE t;
SET enable_parallel_append = on;
SET parallel_setup_cost = 0.1;
SET parallel_tuple_cost = 0.1;
SET max_parallel_workers_per_gather = 8;
SET min_parallel_table_scan_size = 0;
SET min_parallel_index_scan_size = 0;
EXPLAIN (COSTS OFF)
SELECT t.id, t.payload FROM t WHERE t.id % 2 = 0
GROUP BY t.id, t.payload;
 Group
   Group Key: t.id, t.payload
   ->  Gather Merge
         Workers Planned: 6
         ->  Sort
               Sort Key: t.id, t.payload
               ->  Parallel Append
                     ->  Parallel Seq Scan on p1 t_1
                           Filter: ((id % 2) = 0)
                     ->  Parallel Seq Scan on p2 t_2
                           Filter: ((id % 2) = 0)
                     ->  Parallel Seq Scan on p3 t_3
                           Filter: ((id % 2) = 0)
                     ->  Parallel Seq Scan on p4 t_4
                           Filter: ((id % 2) = 0)
Here the table is scanned in parallel. It also works with IndexScan.
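For instance (a hypothetical continuation of the example above; the index
name is mine), adding an index on id should let the same cost settings
produce Parallel Index Scan nodes under the Parallel Append:

CREATE INDEX t_id_idx ON t (id); -- cascades to an index on every partition
SET enable_seqscan = off;        -- only to steer this demo toward the index path
EXPLAIN (COSTS OFF)
SELECT t.id, t.payload FROM t WHERE t.id < 2000
GROUP BY t.id, t.payload;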
--
regards, Andrei Lepikhov
Re: Hash Right join and seq scan
On Fri, 5 Jul 2024 at 12:50, James Pang wrote:
> we have a daily vacuumdb and analyze job; generally speaking it is done in
> seconds, but sometimes it suddenly runs for tens of minutes with the same
> bind variable values, huge temp space gets used, and at that time EXPLAIN
> shows a "Hash Anti Join, Hash Right Join" with seq scans of two tables.

There was talk about adding costing for run-time partition pruning factors,
but nothing was ever agreed, so nothing was done. It's just not that
obvious to me how we'd do that. If the Append had 10 partitions as
subnodes with an equality join condition, you could assume we'll only
match 1 of those 10, but we've no idea at plan time which one that'll be,
and the partitions might vary drastically in size. The best I think we
could do is take the total cost of those 10 and divide by 10 to get the
average cost. It's much harder for range conditions, as those could match
anything from none to all of the partitions. The best suggestion I saw for
that was to multiply the costs by DEFAULT_INEQ_SEL.

I think for now, you might want to lower random_page_cost or increase
effective_cache_size to encourage the nested loop -> index scan plan. A
good range for effective_cache_size is anywhere between 50% and 75% of
your server's RAM. However, that might not be ideal if your server is
under memory pressure from other running processes. It also depends on how
large shared_buffers is as a percentage of total RAM.

David
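A minimal sketch of that experiment (the values are illustrative and assume
SSD-backed storage on a server with 64GB of RAM; validate with EXPLAIN
ANALYZE before persisting anything via ALTER SYSTEM):

-- Session-level settings, using the prepared statement from the thread:
SET random_page_cost = 1.1;        -- default is 4.0; lower favours index scans
SET effective_cache_size = '48GB'; -- roughly 75% of the assumed 64GB of RAM
EXPLAIN ANALYZE
EXECUTE slowsql10(1234567, 678901234, '2024-06-15 03:00:00', '2024-07-02 03:00:00');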
