Hi! Sorry to post to this mailing list, but I could not find many tips on
working around HashAggregate issues.
In a research project involving text repetition analysis (on top of public
documents), I have a virtual machine (CPU AMD Epyc 7502P, 128GB RAM, 12TB HDD,
2TB SSD) running PostgreSQL 12.12 (Ubuntu 12.12-0ubuntu0.20.04.1)
and some tables with many rows:
nsoamt=> ANALYSE VERBOSE SentenceSource;
INFO: analyzing "public.sentencesource"
INFO: "sentencesource": scanned 30000 of 9028500 pages, containing 3811990
live rows and 268323 dead rows; 30000 rows in sample, 1147218391 estimated
total rows
ANALYZE
nsoamt=> ANALYSE VERBOSE SentenceToolCheck;
INFO: analyzing "public.sentencetoolcheck"
INFO: "sentencetoolcheck": scanned 30000 of 33536425 pages, containing 498508
live rows and 25143 dead rows; 30000 rows in sample, 557272538 estimated total
rows
ANALYZE
nsoamt=> ANALYZE VERBOSE Document;
INFO: analyzing "public.document"
INFO: "document": scanned 30000 of 34570 pages, containing 1371662 live rows
and 30366 dead rows; 30000 rows in sample, 1580612 estimated total rows
ANALYZE
The estimates for the number of rows above are accurate.
I am running this query:
SELECT COUNT(*), COUNT(NULLIF(Stchk.haserrors,'f'))
FROM SentenceToolCheck Stchk
WHERE EXISTS (SELECT SSrc.sentence
FROM SentenceSource SSrc, Document Doc
WHERE SSrc.sentence = Stchk.id
AND Doc.id = SSrc.document
AND Doc.source ILIKE
'/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%');
and I have 2 (related?) problems:
1 - the query makes a postgres process use 76.7 GB of resident RAM.
With a work_mem setting of 2GB (and "simple" COUNT() results),
that was not expected.
(I risk the OOM killer taking down my postgres as soon as I run another
concurrent query.)
The memory settings are:
work_mem = 2GB
shared_buffers = 16GB
maintenance_work_mem = 1GB
2 - the query never finishes... (it has been running for over 3 x 24 hours by
now, and I have no idea how far from finishing it is).
The EXPLAIN plan is:
                                                    QUERY PLAN
----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=28630195.79..28630195.80 rows=1 width=16)
   ->  Nested Loop  (cost=26397220.49..28628236.23 rows=261275 width=1)
         ->  HashAggregate  (cost=26397219.92..26399832.67 rows=261275 width=8)
               Group Key: ssrc.sentence
               ->  Hash Join  (cost=73253.21..23635527.52 rows=1104676957 width=8)
                     Hash Cond: (ssrc.document = doc.id)
                     ->  Seq Scan on sentencesource ssrc  (cost=0.00..20540394.02 rows=1151189402 width=16)
                     ->  Hash  (cost=54310.40..54310.40 rows=1515425 width=4)
                           ->  Seq Scan on document doc  (cost=0.00..54310.40 rows=1515425 width=4)
                                 Filter: (source ~~* '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%'::text)
         ->  Index Scan using pk_sentencetoolcheck on sentencetoolcheck stchk  (cost=0.57..8.53 rows=1 width=9)
               Index Cond: (id = ssrc.sentence)
 JIT:
   Functions: 20
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(15 rows)
The rows=1515425 estimate on Seq Scan on document doc (cost=0.00..54310.40
rows=1515425 width=4) seems right.
The rows=1104676957 estimate on Hash Join (cost=73253.21..23635527.52
rows=1104676957 width=8) also seems right.
The rows=261275 estimate on HashAggregate (cost=26397219.92..26399832.67
rows=261275 width=8) seems VERY WRONG!
I was expecting something like rows=1.0E+09 instead.
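The group key is ssrc.sentence, so the real number of groups could be checked
directly with something like the query below (I have not run it, and it would
itself be a slow query):

SELECT COUNT(DISTINCT SSrc.sentence)
FROM SentenceSource SSrc, Document Doc
WHERE Doc.id = SSrc.document
  AND Doc.source ILIKE '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%';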
On a laptop (with just 80% of the rows, 32GB RAM, but all-SSD disks),
the query finishes in a few hours (roughly 2 hours).
The EXPLAIN plan is different on the laptop:
                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=216688374.89..216688374.90 rows=1 width=16)
   ->  Nested Loop  (cost=211388557.47..216686210.27 rows=288616 width=1)
         ->  Unique  (cost=211388556.90..215889838.75 rows=288616 width=8)
               ->  Sort  (cost=211388556.90..213639197.82 rows=900256370 width=8)
                     Sort Key: ssrc.sentence
                     ->  Hash Join  (cost=56351.51..28261726.31 rows=900256370 width=8)
                           Hash Cond: (ssrc.document = doc.id)
                           ->  Seq Scan on sentencesource ssrc  (cost=0.00..16453055.44 rows=948142144 width=16)
                           ->  Hash  (cost=38565.65..38565.65 rows=1084069 width=4)
                                 ->  Seq Scan on document doc  (cost=0.00..38565.65 rows=1084069 width=4)
                                       Filter: (source ~~* '/bigpostgres/misc/arxiv/arxiv/arxiv/pdf/%'::text)
         ->  Index Scan using pk_sentencetoolcheck on sentencetoolcheck stchk  (cost=0.57..2.76 rows=1 width=9)
               Index Cond: (id = ssrc.sentence)
 JIT:
   Functions: 18
   Options: Inlining true, Optimization true, Expressions true, Deforming true
(The Unique rows estimate is also very wrong, but at least the query
finishes.)
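Since both plans badly misestimate the number of groups on ssrc.sentence, one
thing I could still try (untested) is raising the statistics target on that
column, so ANALYZE samples more values for its n_distinct estimate:

ALTER TABLE SentenceSource ALTER COLUMN sentence SET STATISTICS 1000;
ANALYZE SentenceSource;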
I would guess that HashAggregate is behaving very badly (using too much RAM
beyond work_mem, and also badly estimating the number of rows and taking
forever...).
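If the HashAggregate really is the culprit, a workaround I could try (also
untested) is disabling hash aggregation for the session, so the planner falls
back to the Sort + Unique plan that the laptop uses:

SET enable_hashagg = off;
-- then re-run the COUNT(*) query above in the same session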
Any suggestions?
João Luís
Senior Developer
[email protected]
+351 210 337 700