Why is a hash join preferred when it does not fit in work_mem

2023-01-12 Thread Dimitrios Apostolou

Hello list,

I have a very simple NATURAL JOIN that does not fit in work_mem.  Why
does the query planner prefer a hash join that needs 361s, while with a
sort plus a merge join it takes only 13s?

The server is an old Mac Mini with hard disk drive and only 4GB RAM.
Postgres version info:

 PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang
version 12.0.0 (clang-1200.0.32.29), 64-bit

The low work_mem and the disabled memoization are set on purpose, in order
to simplify a complex query while still reproducing the problem I
experienced there. The query below is the simplest one I could get for which
the optimizer does not go for the faster merge join.

From my point of view a merge join is clearly faster, because the hash
table does not fit in memory and I expect a hash join to do a lot of
random I/O. But the query planner does not see that, and increasing
random_page_cost does not help either. In fact the opposite happens: the
estimated cost of the merge join rises even further above that of the hash
join as I increase the random page cost!
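
For reference, the non-default settings visible in the Settings lines of the
plans below were set roughly like this (a sketch of my session, not copied
verbatim):

    SET work_mem TO '1MB';
    SET hash_mem_multiplier = 1;
    SET enable_memoize TO off;
    SET max_parallel_workers_per_gather TO 1;
    SET effective_cache_size TO '500MB';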



# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM
tasks_mm_workitems NATURAL JOIN workitem_ids;

QUERY PLAN
--
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual
time=184498.464..360606.257 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n,
workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=15068 read=47434, temp read=56309 written=56309
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94
rows=3702994 width=8) (actual time=0.040..1376.084 rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
 Buffers: shared read=16459
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual
time=184361.874..184361.875 rows=1373737 loops=1)
 Output: workitem_ids.workitem_id, workitem_ids.workitem_n
 Buckets: 4096  Batches: 512  Memory Usage: 759kB
 Buffers: shared hit=15068 read=30975, temp written=43092
 ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19
rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)
   Output: workitem_ids.workitem_id, workitem_ids.workitem_n
   Buffers: shared hit=15068 read=30975
 Settings: effective_cache_size = '500MB', enable_memoize = 'off',
hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem
= '1MB'
 Planning:
   Buffers: shared hit=2 read=6
 Planning Time: 0.568 ms
 Execution Time: 361106.876 ms
(20 rows)


# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM
tasks_mm_workitems NATURAL JOIN workitem_ids;

QUERY PLAN
---
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual 
time=5062.513..10866.313 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, 
workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5343 read=66053, temp read=32621 written=32894
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids
(cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 
rows=1373737 loops=1)
 Output: workitem_ids.workitem_n, workitem_ids.workitem_id
 Buffers: shared hit=5310 read=49627
   ->  Materialize  (cost=609372.91..627887.88 rows=3702994 width=8) (actual 
time=5062.389..7392.640 rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
 Buffers: shared hit=33 read=16426, temp read=32621 written=32894
 ->  Sort  (cost=609372.91..618630.40 rows=3702994 width=8) (actual 
time=5062.378..6068.703 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
   Sort Key: tasks_mm_workitems.workitem_n
   Sort Method: external merge  Disk: 65256kB
   Buffers: shared hit=33 read=16426, temp read=32621 written=32894
   ->  Seq Scan on public.tasks_mm_workitems
(cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.045..1177.202 
rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, 
tasks_mm_workitems.task_n
 Buffers: shared hit=33 read=16426
 Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', 
enable_memoize = 'off', hash_mem_multiplier = '1', 
max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.677 ms
 Execution Time: 13364.545 ms
(23 rows)


Thank you in advance,
Dim

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-14 Thread Dimitrios Apostolou



On Fri, 13 Jan 2023, David Rowley wrote:


I'd expect reducing random_page_cost to make the Merge Join cheaper as
that's where the Index Scan is. I'm not quite sure where you think the
random I/O is coming from in a batched hash join.


Thanks for the feedback, indeed you are right! Decreasing random_page_cost
to values way below the default makes the planner prefer the merge join!
This seems strange to me.

Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
is how I understand things, based on posts I've read and on the classical
algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one
  hand you have a hash table lookup (amortized O(1)) and on the other
  hand, if the other table has M rows that do not fit in memory, you have
  sequential reads from the disk (given low fragmentation of the table or
  index files): for every row you read from the disk, you look up the key
  in the hash table.

  If the hash table does not fit in RAM then the cost becomes prohibitive.
  Every lookup is a random access possibly hitting the disk. The total
  cost should be random_page_cost * M.

+ The Merge Join involves mostly sequential accesses if the disk files are
  not fragmented. It reads sequentially and in parallel from both tables,
  merging the results where the key matches.

  It requires on-disk sorting (because tables don't fit in work_mem), but
  even this operation requires little disk seeking. A merge-sort algorithm
  might have a random access cost of logN * random_page_cost.

So I would expect an increased random_page_cost to benefit the Merge Join
algorithm. And since my setup involves spinning disks, it does make sense
to increase it.



It would be interesting to see the same plans with SET track_io_timing
= on; set.  It's possible that there's less *actual* I/O going on with
the Merge Join plan vs the Hash Join plan.  Since we do buffered I/O,
without track_io_timing, we don't know if the read buffers resulted in
an actual disk read or a read from the kernel buffers.



The database has been VACUUM ANALYZEd first and is otherwise idle.
Every query has been run twice, and I paste here only the 2nd run.
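
(The I/O Timings lines below require timing collection to be enabled first,
i.e. something like:

    SET track_io_timing TO on;

before running the queries.)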


Slow Hash Join:

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems 
NATURAL JOIN workitem_ids;

 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual 
time=145641.295..349682.387 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, 
workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=12121 read=50381, temp read=56309 written=56309
   I/O Timings: shared/local read=745.925, temp read=162199.307 write=172758.699
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 
width=8) (actual time=0.114..1401.896 rows=3702994 loops=1)
 Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
 Buffers: shared hit=65 read=16394
 I/O Timings: shared/local read=183.959
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual 
time=145344.555..145344.557 rows=1373737 loops=1)
 Output: workitem_ids.workitem_id, workitem_ids.workitem_n
 Buckets: 4096  Batches: 512  Memory Usage: 759kB
 Buffers: shared hit=12056 read=33987, temp written=43092
 I/O Timings: shared/local read=561.966, temp write=142221.740
 ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 
width=237) (actual time=0.033..1493.652 rows=1373737 loops=1)
   Output: workitem_ids.workitem_id, workitem_ids.workitem_n
   Buffers: shared hit=12056 read=33987
   I/O Timings: shared/local read=561.966
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', 
hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = 
'1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.693 ms
 Execution Time: 350290.496 ms
(24 rows)


Fast Merge Join:

# SET enable_hashjoin TO off;
SET

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems 
NATURAL JOIN workitem_ids;

 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual 
time=4602.623..9700.435 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, 
workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5310 read=66086, temp read=32621 written=32894
   I/O Timings: shared/local read=566.121, temp read=228.063 write=526.739
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  
(cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.034..1080.800 
rows=1373737 loops=1)
 Output: workitem_ids.workitem_n, workitem_ids.workitem_id
 Buffers: shared hit=5310 read=49627
 I/O Timings: shared/local read=448.952
   ->  Materialize  (cost=609372.9

Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-16 Thread Dimitrios Apostolou

On Sat, 14 Jan 2023, Tom Lane wrote:


Dimitrios Apostolou  writes:

Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
is how I understand things according to posts I've read, and classical
algorithms:



+ The Hash Join is fastest when one side fits in work_mem. Then on one
   hand you have a hash table lookup (amortized O(1)) and on the other
   hand, if the table has M rows that do not fit in memory, you have
   sequential reads from the disk (given low fragmentation of the table or
   index files):  For every line you read from the disk, you lookup the key
   in the hash table.



   If the hash table does not fit in RAM then the cost becomes prohibitive.
   Every lookup is a random access possibly hitting the disk. The total
   cost should be random_page_cost * M.


That would be true of a simple hash join, but Postgres uses batched
hash joins: we split up the hash key space into subsets, where hopefully
each subset includes few enough inner-side rows to fit into work_mem.
While this can go wrong given pathological distribution of the inner-side
keys, it does mean that the join can perform well even when the inner
side is much larger than work_mem.  So it's not the case that the planner
will simply disregard hash joins beyond work_mem.  It will apply a cost
penalty for the predicted batching overhead;


Thanks for this, I found a page [1] that describes the hash join and
now I understand a bit more.

[1] https://www.interdb.jp/pg/pgsql03.html

I'm not sure whether the key distribution is pathological in my case.
The join condition is:

  Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)

and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and PRIMARY
KEY. The table workitem_ids has about 1.4M rows, and the other table has 3.7M
rows. Neither of them fits in work_mem.

In my (simplified and pathological) case of work_mem == 1MB, the hash join
does 512 batches (Buckets: 4096  Batches: 512  Memory Usage: 759kB). I'm
not sure which hash-join strategy is followed, but based on that document
it should be the "hybrid hash join with skew". I don't quite follow the I/O
requirements of this algorithm yet. :-)
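
A back-of-the-envelope calculation on the temp traffic reported in the plan
above (my own rough math, so please correct me if it's off):

    temp written = 56309 pages * 8 kB/page ~= 440 MB
    temp read    = 56309 pages * 8 kB/page ~= 440 MB

so less than half a GB moves through the temp files in each direction, yet
the plan spends over 300 s on temp I/O, i.e. around 1-2 MB/s. That rate
looks much more like a seeking spinning disk than like sequential writes.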


but that can still come out
cheaper than merge join, because the sorting needed for merge is generally
also far from cheap.


I was under the impression that an on-disk merge sort is a relatively cheap
(log N) operation in terms of random I/O.




So I would expect an increased random_page_cost to benefit the Merge Join
algorithm. And since my setup involves spinning disks, it does make sense
to increase it.


What is probably really happening is that random_page_cost affects the
estimated cost of performing the sort using an index scan instead of
a bespoke sort step.  AFAIR, cost_sort doesn't consider random_page_cost
at all, and neither does cost_hashjoin.


On the last EXPLAIN I posted for the forced merge join, I see that it uses
an index scan on the "small" table. That makes sense, since the join happens
on the primary key of that table. On the large table it does not use an
index scan, because no index exists on that column; instead it sorts the
3.7M rows of the table (and FWIW that table only has two integer columns).
That is, if I understood correctly what you meant by "performing the sort
using an index scan".


The problem I see is that the estimated cost of the sort operation is
609,372.91..618,630.40. It's already way above the whole hash-join cost
(121,222.68..257,633.01). However the real timings are very different.
Actual time for Sort is 4,602.569..5,414.072 ms while for the whole hash
join it is 145,641.295..349,682.387 ms.

Am I missing some configuration knobs that would put some sense into the planner?


Thanks,
Dimitris






Re: Why is a hash join preferred when it does not fit in work_mem

2023-01-20 Thread Dimitrios Apostolou

Hello again, I am back with new experiments.

First of all, I have a concrete set of steps that replicate the
slowness of the hash join that I described. If you have a system with
spinning disks lying around, I would appreciate it if you could verify the
scenario. Can you also replicate it on different kinds of systems?


CREATE TABLE descriptions (description_id serial PRIMARY KEY, description text);
INSERT INTO  descriptions (description_id, description)
   SELECT s, repeat(encode(sha512(s::text::bytea), 'hex'), 4)
   FROM generate_series(0,1200300) AS s;
CREATE TABLE descriptions_in_books (description_id integer REFERENCES 
descriptions(description_id), book_id integer);
INSERT INTO  descriptions_in_books (description_id, book_id)
   SELECT s % 1200300, s
   FROM generate_series(0,5200300) AS s;

SET work_mem TO '1MB';
SET hash_mem_multiplier = 1.0;
SET track_io_timing TO on;

EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM descriptions NATURAL 
JOIN descriptions_in_books;
SET enable_hashjoin TO off;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM descriptions NATURAL 
JOIN descriptions_in_books;



The first JOIN query uses a hash join with Batches: 1024 and takes 622s!
For most of its duration I can see the disk writing at around 1-2 MB/s, so
I have to assume it's not writing sequentially.

The second, identical JOIN uses a merge join that completes in 14s. The I/O
happens at a much higher rate (maybe 10x), so I'm assuming it's mostly
sequential.

Another observation is that the hash join deteriorates as the length of
the TEXT column grows. In fact, if I fill it with only 32-character
strings, then the hash join is split into only 128 batches, and it completes
almost as fast as the merge join. Could it be that the cost estimation
underestimates the I/O pattern caused by splitting into batches?


Here are the measurements:


 Hash Join  (cost=192450.84..401456.02 rows=5200486 width=524) (actual 
time=344516.004..621725.562 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, 
descriptions_in_books.book_id
   Inner Unique: true
   Hash Cond: (descriptions_in_books.description_id = 
descriptions.description_id)
   Buffers: shared hit=15586 read=93161, temp read=97829 written=97829
   I/O Timings: shared/local read=1402.597, temp read=229252.170 
write=371508.313
   ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75015.86 
rows=5200486 width=8) (actual time=0.068..1819.629 rows=5200301 loops=1)
 Output: descriptions_in_books.book_id, 
descriptions_in_books.description_id
 Buffers: shared hit=32 read=22979
 I/O Timings: shared/local read=249.910
   ->  Hash  (cost=97739.04..97739.04 rows=1200304 width=520) (actual 
time=343268.470..343268.471 rows=1200301 loops=1)
 Output: descriptions.description_id, descriptions.description
 Buckets: 2048  Batches: 1024  Memory Usage: 686kB
 Buffers: shared hit=15554 read=70182, temp written=78538
 I/O Timings: shared/local read=1152.687, temp write=338883.205
 ->  Seq Scan on public.descriptions  (cost=0.00..97739.04 rows=1200304 
width=520) (actual time=0.028..2278.791 rows=1200301 loops=1)
   Output: descriptions.description_id, descriptions.description
   Buffers: shared hit=15554 read=70182
   I/O Timings: shared/local read=1152.687
 Settings: hash_mem_multiplier = '1', work_mem = '1MB'
 Planning Time: 0.303 ms
 Execution Time: 622495.279 ms
(22 rows)


SET enable_hashjoin TO off;


 Merge Join  (cost=868411.87..1079330.96 rows=5200301 width=524) (actual 
time=6091.932..13304.924 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, 
descriptions_in_books.book_id
   Merge Cond: (descriptions.description_id = 
descriptions_in_books.description_id)
   Buffers: shared hit=67 read=111962 written=1, temp read=45806 written=46189
   I/O Timings: shared/local read=1007.043 write=28.575, temp read=344.937 
write=794.483
   ->  Index Scan using descriptions_pkey on public.descriptions
(cost=0.43..116919.99 rows=1200304 width=520) (actual time=0.028..1596.387 
rows=1200301 loops=1)
 Output: descriptions.description_id, descriptions.description
 Buffers: shared hit=3 read=89015 written=1
 I/O Timings: shared/local read=834.732 write=28.575
   ->  Materialize  (cost=868408.84..894410.35 rows=5200301 width=8) (actual 
time=6091.892..9171.796 rows=5200301 loops=1)
 Output: descriptions_in_books.book_id, 
descriptions_in_books.description_id
 Buffers: shared hit=64 read=22947, temp read=45806 written=46189
 I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
 ->  Sort  (cost=868408.84..881409.60 rows=5200301 width=8) (actual 
time=6091.885..7392.828 rows=5200301 loops=1)
   Output: descriptions_in_books.book_id, 
descriptions_in_books.description_id
   Sort Key: description

SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou



Hello list,

I run a very simple SELECT on a huge table (350M rows, 10 columns of type
integer or timestamp). The server is an old Mac Mini with 4GB RAM and a
slow HDD. All caches are cold either via reboot, or by restarting postgres
and using the `purge` command to flush the OS cache.

The question is why this simple query is taking so long to complete.
Most of the time I see constant 30MB/s read I/O. The memory usage is low
and the system is not swapping or in other ways unhealthy.  Postgres is
not serving any other requests.

Is postgres reading all the table sequentially, just to return 10 rows in
no particular order?  I'm trying to understand what PostgreSQL is doing
and the data access patterns involved.  All feedback is appreciated.


# SELECT version();

 PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 
12.0.0 (clang-1200.0.32.29), 64-bit

# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 10;

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 
rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared read=2334526
   I/O Timings: shared/local read=691137.029
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared read=2334526
 I/O Timings: shared/local read=691137.029
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning:
   Buffers: shared hit=69 read=18
   I/O Timings: shared/local read=265.550
 Planning Time: 271.719 ms
 Execution Time: 746057.424 ms
(14 rows)


Repeating the query returns instantaneously, and EXPLAIN shows shared
hit=14 and no reads. Changing the LIMIT to 100 takes another 3min to
respond, with constant read I/O while it runs:


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 100;

 Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 
rows=100 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=14 read=772427
   I/O Timings: shared/local read=171484.072
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=0.029..184692.664 rows=100 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared hit=14 read=772427
 I/O Timings: shared/local read=171484.072
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning Time: 0.083 ms
 Execution Time: 184692.748 ms
(11 rows)


Re-running this is instantaneous again.
Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min


Bonus question: some time ago the server crashed and rebooted, and after a
long WAL recovery and some hours of idle time, I ran the same query.  It
took even longer to respond, and besides reading, there was also a lot of
writing involved.  What kind of writes was postgres doing?


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw 
LIMIT 10;

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual 
time=1284114.810..1284114.819 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
   I/O Timings: shared/local read=104.264 write=107258.150
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
 I/O Timings: shared/local read=104.264 write=107258.150
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning Time: 0.545 ms
 Execution Time: 1284114.972 ms
(11 rows)


Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd


Thanks in advance,
Dimitris





Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

I managed to double-paste a different subset of my settings. FWIW all of
them were in effect for all the experiments. Sorry for the confusion.


On Tue, 31 Jan 2023, Dimitrios Apostolou wrote:


Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min


[...]


Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd





Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

On Tue, 31 Jan 2023, David G. Johnston wrote:


On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou  wrote:
 
      ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 
rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
            Output: run_n, test_name_n, workitem_n, started_on, 
duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
            Buffers: shared read=2334526
            I/O Timings: shared/local read=691137.029


The system has to return 10 live rows to you.  If it needs to search through 
that many buffers to find 10 live rows you most likely have a large bloating 
problem going on.  Seems like it is time to vacuum full.


I looked up how to measure bloat, so I ran the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;
 current_database | schemaname |   tablename   | tbloat | wastedbytes |            iname             | ibloat | wastedibytes
------------------+------------+---------------+--------+-------------+------------------------------+--------+--------------
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey           |    1.0 |            0
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n |    0.3 |            0
(2 rows)

Is this bloat even affecting queries that do not use the index?

It seems I have to add VACUUM FULL to the nightly maintenance. I had run some
schema restructuring (several ADD COLUMN followed by UPDATE ... SET on all
rows) some days ago, and I was not aware this had degraded the table.
Thanks for the useful info!


Dimitris


Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-01-31 Thread Dimitrios Apostolou

Thanks for the insight on the internals. Regarding your questions:

On Tue, 31 Jan 2023, Tom Lane wrote:


Do you get the same 10 rows when you repeat the command?


Yes. Just tested with both cold and hot caches. The first 10 rows are
exactly the same, whether they return slowly or immediately.


If turning synchronize_seqscans off changes the behavior, that'd
be a good clue that this is the right theory.


Turning it off makes the query slow no matter how many times I re-run it.
The system is doing lots of read I/O with both hot and cold caches. Here
is the EXPLAIN output from the hot cache run (that previously had only 14
hits and no reads):

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 
rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=64 read=2334462
   I/O Timings: shared/local read=567846.559
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 
width=42) (actual time=620510.800..620510.804 rows=10 loops=1)
 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, 
test_result_n, test_executable_n, test_function_n, test_datatag_n
 Buffers: shared hit=64 read=2334462
 I/O Timings: shared/local read=567846.559
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', 
work_mem = '64MB'
 Planning Time: 0.099 ms
 Execution Time: 620510.855 ms


After reading the docs, I'm surprised this setting affects my case given
that I have no parallelism in my setup.



As for a real fix,
it might be time for a VACUUM FULL or CLUSTER on that table.


Regarding CLUSTER, would it help with a seqscan on a bloated table?
Furthermore, given that the table is expected to grow every day by a few
million rows, do you suggest running CLUSTER every night? Will postgres
remember that the rows up to N are clustered, even after appending more
rows?


Dimitris




Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches

2023-02-01 Thread Dimitrios Apostolou

On Tue, 31 Jan 2023, David G. Johnston wrote:


It feels like there is room for improvement here using table statistics and the 
visibility map to significantly reduce the number of pages retrieved that turn 
out to be all dead. 


Yes, I too wish postgres was more clever for a simple unordered query. I
should also mention that this table has two btree indices, one primary key
index and one on another column. Would it be feasible to "descend" an
index in order to immediately find some live data?


A limit without an order is not a query to optimize for of course...


A good reason is that the web interface I'm using to browse my tables (and
I guess many others too) does a simple limited, unordered SELECT when you
click on a table. Timing out was what made me look deeper. I couldn't
accept that no matter how slow my system is, it has trouble returning a
few rows.

It is also the first query somebody will type in psql to interactively
get a clue about the contents of a table.


Regards,
Dimitris


VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)

2023-02-01 Thread Dimitrios Apostolou

Thanks everyone for the help.

I have now run a plain VACUUM but it didn't change anything; the simple
SELECT is still slow.

My understanding from reading the docs is that it should reclaim all unused
space, just not return it to the OS. Which is fine by me. Any idea why it
failed to reclaim the space in my case?

I'm now running VACUUM FULL as everyone suggested. I had first tried plain
VACUUM because I was curious whether it would work and because it doesn't lock
the table for an operation that takes hours on my setup.


Thanks,
Dimitris





Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou

I have a huge (10G rows) table, test_runs, that I want to PARTITION BY
RANGE on an indexed integer column; I refer to it as "tbl_legacy" below.

I have created the new partitioned table "tbl" and 1000 partitions in it,
so that it can take my existing data and have the partitions ready for the
next year (so most of the new partitions will be empty for now).

On the topic of bulk-transferring the data from tbl_legacy to tbl:

+ First I chose to DELETE ... INSERT everything in one statement.

  I noticed that the postgres process was growing without bounds, up to
  the point that it consumed almost all of swap space and I had to kill
  the command. Any idea why that happens?

+ Then I broke the command above in many small chunks

  WITH rows AS (
      DELETE FROM tbl_legacy AS t
      WHERE (partition_key_column < $1)
      RETURNING t.*
  ) INSERT INTO tbl SELECT * FROM rows;
  COMMIT;

  I increase the parameter $1 and keep going in a loop.  At first this
  goes OK, after one day though I notice that it has slowed down
  significantly. My investigation shows that the simple query

SELECT min(partition_key_column) from tbl_legacy;

  takes 38s, despite having an index on the column! A VACUUM fixes that,
  so I guess the index has a ton of dead tuples. I guess autovacuum does
  not do its job because the table is constantly busy.

  Unfortunately VACUUM takes long (several hours) on this huge table, so I
  can't add it to the loop after the DELETE command.

  Is there a better way to avoid the bloat in the loop above?  Why can't
  the DELETE command update the index by pointing the minimum element
  beyond the dead tuples?

+ Finally I resorted to just copying the data:

INSERT INTO tbl SELECT * FROM tbl_legacy;

  This took half a day or so but worked without blowing up the memory
  usage like the 1st query.  Why is that?

  By the way, I don't consider the last solution ideal, since the data is
  duplicated between the two tables and it's harder to implement
  workarounds to do it without taking downtime. But it worked for me for
  this time. Any other ideas for partitioning an existing huge table?


Thanks in advance,
Dimitris




Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-06 Thread Dimitrios Apostolou

On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:


+ First I chose the method to DELETE...INSERT everything.


Just to clarify, the query looked more like

  WITH rows AS ( DELETE FROM tbl_legacy RETURNING * )
INSERT INTO tbl
  SELECT * FROM rows;


  I noticed that the postgres process was growing without bounds, up to
  the point that it consumed almost all of swap space and I had to kill
  the command. Any idea why that happens?


Also note that my settings for work_mem, temp_buffers, shared_buffers etc
are all well below the RAM size and postgres has never shown unbounded
growth again. Postgres version is 15.2.


Dimitris





Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-10 Thread Dimitrios Apostolou

Thanks! I have completed the transfer by taking down the database and
working exclusively on it, but I still wonder how one could do it in
production without exclusive locks. The loop with small DELETE...INSERT
transactions I posted on the parent post bloats the table fast.

The link you posted contains very useful info, I was not expecting that
the constraints could blow up the memory like that. Comments from me:

Disabling and then re-enabling the foreign key constraints is easily done
with ALTER TABLE.
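
A rough sketch of what I mean (the constraint, column and table names here
are made up for illustration):

    ALTER TABLE tbl DROP CONSTRAINT tbl_some_id_fkey;
    -- ... bulk-load the data ...
    ALTER TABLE tbl
        ADD CONSTRAINT tbl_some_id_fkey
        FOREIGN KEY (some_id) REFERENCES other_tbl (some_id) NOT VALID;
    ALTER TABLE tbl VALIDATE CONSTRAINT tbl_some_id_fkey;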

Unfortunately it doesn't seem to be the same case for indices. One has to
create the table without indices, and then create the indices separately.
With such a process there is a risk of ending up with non-identical
table schemas.

By the way, with COPY one must use an intermediate file, right? There is
no way to COPY from table to table directly?


Thanks,
Dimitris

On Thu, 6 Jul 2023, Lorusso Domenico wrote:


Hello,
this is a standard problem during bulk copies.

Here are some suggestions; for example, disabling the indexes.

The main issue is related to indexes, lock escalation and log writing.
In other DBMSs you would turn logging off for the table, but postgresql does
not seem to have this feature.

Anyway, taking an explicit exclusive lock on the table should prevent lock
escalation.

So: disable the indexes on the target table,
lock both tables exclusively,
insert the data,
truncate the old table.

If this doesn't work you can consider using the copy command.







On Thu, 6 Jul 2023 at 18:12, Dimitrios Apostolou wrote:
  On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:

  > + First I chose the method to DELETE...INSERT everything.

  Just to clarify, the query looked more like

     WITH rows AS ( DELETE FROM tbl_legacy RETURNING * )
       INSERT INTO tbl
         SELECT * FROM rows;

  >   I noticed that the postgres process was growing without bounds, up to
  >   the point that it consumed almost all of swap space and I had to kill
  >   the command. Any idea why that happens?

  Also note that my settings for work_mem, temp_buffers, shared_buffers etc
  are all well below the RAM size and postgres has never shown unbound
  growth again. Postgres version is 15.2.


  Dimitris





--
Domenico L.

to amaze for half an hour, a history book is enough;
I tried to learn the Treccani by heart... [F.d.A.]




Re: Moving data from huge table slow, min() query on indexed column taking 38s

2023-07-21 Thread Dimitrios Apostolou

On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:

+ Then I broke the command above in many small chunks

  WITH rows AS (
      DELETE FROM tbl_legacy AS t
      WHERE (partition_key_column < $1)
      RETURNING t.*
  ) INSERT INTO tbl SELECT * FROM rows;
  COMMIT;

  I increase the parameter $1 and keep going in a loop.  At first this
  goes OK, after one day though I notice that it has slowed down
  significantly. My investigation shows that the simple query

   SELECT min(partition_key_column) from tbl_legacy;

  takes 38s, despite having an index on the column! A VACUUM fixes that,
  so I guess the index has a ton of dead tuples. I guess autovacuum does
  not do its job because the table is constantly busy.

  Unfortunately VACUUM takes long (several hours) on this huge table, so I
  can't add it to the loop after the DELETE command.

  Is there a better way to avoid the bloat in the loop above?  Why can't
  the DELETE command update the index by pointing the minimum element
  beyond the dead tuples?


Any comments on this one? It annoys me that a simple loop deteriorated so
much and kept filling the table with bloat. What is it that VACUUM does that
DELETE can't do, to keep the index fresh?




Re: Partitionwise JOIN scanning all partitions, even unneeded ones

2023-08-30 Thread Dimitrios Apostolou

Thank you for the clear explanation, and I hope the missing optimisation
gets implemented sooner rather than later. Maybe the query planner should
take the missing optimisation into account and avoid *hash* partitionwise joins.

Indeed I verified that disabling hash join fixed the situation, with both
queries taking almost the same time. Great!

In the meantime, I'm considering disabling hash joins globally, as I've
had issues with them before (excessive I/O causing much slower execution
than a merge join, see [1] for your answer to my question back then :).
Do you think that would save me from other problems I'm not aware of,
given the huge size of the table I'm querying?

[1] 
https://www.postgresql.org/message-id/caaphdvppvydonkeqlybsbjwq8kq8m7ywdka44rtea2mnao3...@mail.gmail.com
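
(Concretely, by "globally" I mean setting it cluster-wide, e.g. something
like this rather than per-session SET commands:

    ALTER SYSTEM SET enable_hashjoin = off;
    SELECT pg_reload_conf();
)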


Regards,
Dimitris




BRIN index maintenance on table without primary key

2023-10-27 Thread Dimitrios Apostolou

Hello list.

Key characteristics of my case:
+ HUGE table with 20G rows, ca 2TB
+ May be partitioned (have both versions on two test clusters ATM)
+ Plenty of inserts, no updates
+ No primary key - we had one IDENTITY bigint column until recently, but
  it proved useless and inefficient (index too big for mem) so we deleted it.
+ All queries are filtering on a not-unique not-null integer column.
  + On this column we have a BRIN INDEX since insert order is
/mostly/ incremental.

So the question is: how to maintain the physical order of the tuples?

Even though the insertions populate the index column /mostly/
incrementally, there are outliers and it's certain that over time the
BRIN index will deteriorate.

I'm monitoring the "correlation" and I want to run a command to "defrag"
it when it drops below 0.9.
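
(The "correlation" I'm referring to is the one in the pg_stats view, queried
roughly like this, with placeholder table/column names:

    SELECT tablename, attname, correlation
    FROM pg_stats
    WHERE tablename = 'my_table' AND attname = 'my_brin_column';
)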

+ Can't run CLUSTER:
ERROR:  cannot cluster on index "tst_brin_idx" because access method does 
not support clustering

+ Can't run pg_repack, from [1]:
> Target table must have a PRIMARY KEY, or at least a UNIQUE total index on 
a NOT NULL column

[1] https://reorg.github.io/pg_repack/


Any ideas?

Thank you in advance,
Dimitris





Re: BRIN index maintenance on table without primary key

2023-10-30 Thread Dimitrios Apostolou

On Fri, 27 Oct 2023, Dimitrios Apostolou wrote:


So the question is: how to maintain the physical order of the tuples?


Answering myself: there doesn't seem to be any way to run pg_repack on
a table without a UNIQUE key.

To run CLUSTER, the only way I see is to:

1. Create a btree index on the same column that has the BRIN index
2. CLUSTER
3. Drop the index

This should take very long on my huge table, and keeps the table
exclusively locked. The disk space needed would also be huge, if the table
isn't partitioned.
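
Spelled out, that workaround would look roughly like this (index and column
names are placeholders):

    CREATE INDEX tmp_order_idx ON my_table (my_brin_column);
    CLUSTER my_table USING tmp_order_idx;
    DROP INDEX tmp_order_idx;
    ANALYZE my_table;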

I wonder why CLUSTER can't run based on a BRIN-indexed column. Is it
theoretically impossible, or is it just not implemented yet?

My understanding so far is that CLUSTER only rewrites the table *according
to the index order* and does not touch the index itself. For a BRIN index
though it would need to rewrite the table *ignoring the index* and then
rewrite the index too, in order to keep the ranges fully optimized. So the
logic is very different, and maybe closer to what VACUUM FULL does.


Thanks,
Dimitris




Inefficient query plan for SELECT ... EXCEPT ...

2023-10-31 Thread Dimitrios Apostolou

Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query,
where the left side is a very short table (sometimes even zero-length, but
in rare cases as long as 200K rows), and the right side is a table with 10M
rows in a UNIQUE NOT NULL column:


\d test_datatags

Table "public.test_datatags"
 Column |  Type   | Collation | Nullable | Default
+-+---+--+--
 test_datatag_n | integer |   | not null | generated by default as 
identity
 test_datatag   | text|   | not null |
Indexes:
"test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
"test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)


What follows is a simplified and pathological case that takes 1.7s even though
the left side is empty:



BEGIN;
CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
ANALYZE VERBOSE tmp_table ;

INFO:  analyzing "pg_temp_9.tmp_table"
INFO:  "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead 
rows; 0 rows in sample, 0 estimated total rows

EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)

SELECT DISTINCT  d1
FROM tmp_table
WHERE d1 IS NOT NULL
EXCEPT
SELECT test_datatag FROM test_datatags;

QUERY PLAN
--
 HashSetOp Except  (cost=0.00..299854.08 rows=1 width=36) (actual 
time=1726.470..1726.472 rows=0 loops=1)
   Output: "*SELECT* 1".d1, (0)
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Append  (cost=0.00..278054.53 rows=8719821 width=36) (actual 
time=3.754..1287.901 rows=8702840 loops=1)
 Buffers: shared hit=1259 read=58800
 I/O Timings: shared/local read=77.713
 ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=36) 
(actual time=0.003..0.003 rows=0 loops=1)
   Output: "*SELECT* 1".d1, 0
   ->  HashAggregate  (cost=0.00..0.01 rows=1 width=32) (actual 
time=0.002..0.003 rows=0 loops=1)
 Output: tmp_table.d1
 Group Key: tmp_table.d1
 Batches: 1  Memory Usage: 24kB
 ->  Seq Scan on pg_temp.tmp_table  (cost=0.00..0.00 rows=1 
width=32) (actual time=0.001..0.001 rows=0 loops=1)
   Output: tmp_table.d1
   Filter: (tmp_table.d1 IS NOT NULL)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..234455.40 rows=8719820 
width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
   Output: "*SELECT* 2".test_datatag, 1
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Seq Scan on public.test_datatags  (cost=0.00..147257.20 
rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
 Output: test_datatags.test_datatag
 Buffers: shared hit=1259 read=58800
 I/O Timings: shared/local read=77.713
 Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 
'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1',
 maintenance_io_concurrency = '0', max_parallel_workers = '12', 
max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.055 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true,
Deforming true
   Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms,
Emission 3.542 ms, Total 4.037 ms
 Execution Time: 1726.835 ms
(33 rows)


I'm wondering why the planner doesn't see that the left table is very small and 
follow a different path.
From an abstract computer science POV, I would

1. sort the left table  (the right one is already indexed)
2. "merge" the two tables, by walking them in-order in parallel and excluding 
the matches
3. stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known.
Or am I misunderstanding the implications of the SELECT-EXCEPT query?

In the meantime I have replaced the query with a LEFT OUTER JOIN, which
performs much better and which I believe is equivalent. I find it less
readable than the query in question though. Plus, I have a bunch of
SELECT ... EXCEPT queries (with smaller right-side tables) in my application
that I would hate to change to the uglier equivalent. Under what conditions
would the above query plan perform well?
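
For reference, the LEFT OUTER JOIN replacement looks roughly like this (a
sketch from memory, not copy-pasted from the application):

    SELECT DISTINCT t.d1
    FROM tmp_table AS t
    LEFT OUTER JOIN test_datatags AS d ON d.test_datatag = t.d1
    WHERE t.d1 IS NOT NULL
      AND d.test_datatag IS NULL;

The IS NULL check on the NOT NULL right-hand column makes it an anti-join,
so the planner can drive it from the tiny left table.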

Thanks in advance,
Dimitris





Re: Inefficient query plan for SELECT ... EXCEPT ...

2023-11-01 Thread Dimitrios Apostolou

Thank you all for the answers, they covered everything I needed.



Is this worth a bug report? I can file one if the issue is not known.


No. It's just a missing optimisation. We know about it.


It's good that I sent an email first, then. FWIW my usual way in other projects
would be to check the bug tracker, and just "follow" the relevant issue if
it's something minor like a missing optimisation. I didn't find a way to
search for "known issues" in the PostgreSQL project.


Dimitris




Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou

Hello list,

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or
not. But I get different query plans: there is an extra HashAggregate node
in the case of SELECT DISTINCT. Any idea why?

Reading the docs, I understand that postgres does by default
EXCEPT DISTINCT, but I don't see the same for the 1st clause,
SELECT [DISTINCT].


Steps to reproduce:

CREATE TABLE t1(i INTEGER NOT NULL);
CREATE TABLE t2(i INTEGER NOT NULL);

INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
INSERT INTO t2 VALUES (3);

SELECT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)


EXPLAIN SELECT * FROM t1 EXCEPT SELECT * FROM t2;
   QUERY PLAN
-
 HashSetOp Except  (cost=0.00..160.25 rows=200 width=8)
   ->  Append  (cost=0.00..147.50 rows=5100 width=8)
 ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

EXPLAIN SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
   QUERY PLAN
-
 HashSetOp Except  (cost=41.88..127.50 rows=200 width=8)
   ->  Append  (cost=41.88..120.62 rows=2750 width=8)
 ->  Subquery Scan on "*SELECT* 1"  (cost=41.88..45.88 rows=200 width=8)
   ->  HashAggregate  (cost=41.88..43.88 rows=200 width=4)
 Group Key: t1.i
 ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
 ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
   ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(8 rows)



Regards,
Dimitris





Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT

2023-11-15 Thread Dimitrios Apostolou

On Wed, 15 Nov 2023, Erik Wienhold wrote:


On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote:

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or
not. But I get different query plans, I get an extra HashAggregate node
in the case of SELECT DISTINCT. Any idea why?


As Tom Lane recently wrote[1] EXCEPT is not optimized and will operate
on the subqueries which are planned independently.

[1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us


Heh, as he wrote to me even. :-) I just wanted to make sure that this is
indeed a missing optimisation of the planner, and that the queries are
effectively the same. Thank you for clarifying.

As mentioned, the docs don't make it clear if the SELECT DISTINCT part is
implied or not, only the EXCEPT DISTINCT part is clearly on by default.


Dimitris





Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou

Hello list,

yesterday I was doing:

ALTER TABLE x ADD UNIQUE ... , ADD PRIMARY KEY ...;

The table is almost a billion rows long but lies in its own TABLESPACE
that has plenty of space.  But apparently the ALTER TABLE command is
writing a lot to the default tablespace (not to temp_tablespaces, which
already points to a different partition).

That quickly filled up the 50GB free space in my root partition:


20:18:04.222 UTC [94144] PANIC:  could not write to file 
"pg_wal/xlogtemp.94144": No space left on device
[...]
20:19:11.578 UTC [94140] LOG:  WAL writer process (PID 94144) was terminated by 
signal 6: Aborted
20:19:11.578 UTC [94140] LOG:  terminating any other active server processes


After postgresql crashed and restarted, the disk space in the root
partition was still not freed! I believe this is because of "orphaned
files" as discussed in mailing list thread [1].

[1] 
https://www.postgresql.org/message-id/CAN-RpxDBA7HbTsJPq4t4VznmRFJkssP2SNEMuG%3DoNJ%2B%3DsxLQew%40mail.gmail.com

I ended up doing some risky actions to remediate the problem: Find the
filenames that have no identically named "oid" in pg_class, and delete
(move to backup) the biggest ones while the database is stopped.
Fortunately the database started up fine after that!

So what is the moral of the story? How to guard against this?

Why did the database write so much to the default tablespace, even when
both the table and the temp tablespace are elsewhere?  Also, should one
always keep the default tablespace away from the WAL partition? (I guess
that would have helped to avoid the crash, but the ALTER TABLE command would
still have run out of space, so I'm not sure whether the orphan files would
have been avoided.)

Needless to say, I would have hoped the database cleaned up after itself
even after an uncontrolled crash, or that it provided some tools for the
job. (I tried VACUUM FULL on the table, but the orphaned files did not go
away).

My postgresql version is 16.2 installed on Ubuntu.

Thank you,
Dimitris





Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou

Hi Sergey,

On Wed, 28 Feb 2024, Sergey Fukanchik wrote:


Hi Dimitrios,
Do you have wal archiving enabled?
$PGDATA/pg_wal/ is a bit different from tablespaces (including
"default" one). It stores transaction journal.


I don't think I have WAL archiving enabled. Here are the relevant WAL
options in my config:

wal_compression = zstd
max_wal_size = 8GB
min_wal_size = 1GB
track_wal_io_timing = on
wal_writer_flush_after = 8MB

The issue happened because the WAL directory is by default on the same
filesystem as the default tablespace (the root partition on Ubuntu). So
when the default tablespace filled up because of my ALTER TABLE operation,
there was no space for WAL either.


Thanks,
Dimitris





Re: Orphan files filling root partition after crash

2024-02-28 Thread Dimitrios Apostolou

Thanks for the feedback Laurenz,

On Wed, 28 Feb 2024, Laurenz Albe wrote:


On Wed, 2024-02-28 at 12:16 +0100, Dimitrios Apostolou wrote:


I ended up doing some risky actions to remediate the problem: Find the
filenames that have no identically named "oid" in pg_class, and delete
(move to backup) the biggest ones while the database is stopped.
Fortunately the database started up fine after that!


Lucky you.  It should have been "relfilenode" rather than "oid",
and some catalog tables don't have their files listed in the catalog,
because they are needed *before* the database can access tables.


I actually double-checked that the filenames don't appear anywhere in
SELECT * FROM pg_class
and that the files were multi-GB in size, including all the 1GB pieces. But
luck was definitely a part of the equation; I didn't know that the files
might be accessed before tables (at db startup?) or that "relfilenode" would
be more appropriate. Why is that, and where can I read more? I see that many
(but not all) rows in pg_class have oid = relfilenode, but many rows have
relfilenode = 0, which is meaningless as a filename.
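
(For the record, a less error-prone way to map relations to on-disk file
names seems to be the pg_relation_filenode() function, which also resolves
the relfilenode = 0 case of the mapped catalogs. An untested sketch:

    SELECT c.oid, c.relname, pg_relation_filenode(c.oid::regclass) AS filenode
    FROM pg_class AS c
    WHERE pg_relation_filenode(c.oid::regclass) IS NOT NULL;
)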




So what is the moral of the story? How to guard against this?


Monitor disk usage ...


It happened *fast*. And it was quite a big surprise coming
from "just" a disk-full situation.

A couple of suggestions; wouldn't it make sense:

+ for the index to be written by default to the table's tablespace?

+ for postgres to refuse to write non-WAL files, if they are on
  the same device as the WAL and less than max_wal_size bytes are free?



The root of the problem is that you created the index in the default
tablespace.  You should have

  ALTER TABLE x ADD UNIQUE ... USING INDEX TABLESPACE bigtblspc;



Thank you, I was reading the docs but didn't realize this
syntax is valid. I thought it was only for CREATE/ALTER INDEX.
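
So next time the whole statement would be something like this (column names
are placeholders):

    ALTER TABLE x
        ADD UNIQUE (col_a) USING INDEX TABLESPACE bigtblspc,
        ADD PRIMARY KEY (col_b) USING INDEX TABLESPACE bigtblspc;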



Needless to say, I would have hoped the database cleaned-up after itself
even after an uncontrolled crash, or that it provided some tools for the
job. (I tried VACUUM FULL on the table, but the orphaned files did not go
away).


That is not so simple... Also, it would slow down crash recovery.

But I agree that it would be nice to have a tool that reports or
cleans up orphaned files.

Yours,
Laurenz Albe






SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions.
Around half of the partitions are full and the rest are empty, created in
advance ready to receive future incoming data. Postgres is 16.2.
Here are the relevant parts of the schema:


\d test_runs_raw

                     Partitioned table "public.test_runs_raw"
      Column       |   Type   | Collation | Nullable |              Default
-------------------+----------+-----------+----------+-----------------------------------
 run_n             | bigint   |           | not null | generated by default as identity
 test_case_n       | smallint |           | not null |
 workitem_n        | integer  |           | not null |
 test_resulttype_n | smallint |           |          |
Partition key: RANGE (workitem_n)
Indexes:
    "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in the ranges
(0, 20k), (20k, 40k) and so on (k = kilo), up to 20000k.


PROBLEM

I noticed that the following query is very very slow (too long to wait for
it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that in 998 out of the 1000 partition scans involved,
the planner does not use the index; instead it chooses a sequential scan.
Here is the output from EXPLAIN:

 Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
   ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
 ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)   
Sort Key: test_runs_raw.workitem_n DESC
   ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=853890488.22..853890490.22 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..813118117.30 
rows=16308948365 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037793.12 
rows=64121612 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918875.90 
rows=61612190 width=4)
[ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs
instantaneously:

 Limit  (cost=363.84..367.30 rows=10 width=4)
   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey 
on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 
width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey 
on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 
width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey 
on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 
width=4)
 ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey 
on test_runs_raw__part_max19940k test_runs_raw_997  (cost=0.12..2.34 rows=1 
width=4)
[ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the
partitions are scanned sequentially, *in proper order*,
so that the plan execution stops after reading the first
10 rows in the first non-empty partition.

Why can't the same be done with DISTINCT?
Please note that the workitem_n values are well spread over the range
(0, 13M) and the table has been filled gradually over one year, so I'm
assuming the autovacuum workers have run long enough to build sane statistics
(not sure how to verify that).
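
(One way to at least check when the partitions were last analyzed would
presumably be something like this, assuming they all share the
test_runs_raw prefix:

    SELECT relname, last_analyze, last_autoanalyze, last_autovacuum
    FROM pg_stat_user_tables
    WHERE relname LIKE 'test_runs_raw%'
    ORDER BY relname;
)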


REMARKS

1. I tried reproducing the problem on an artificial table with few
   partitions and few values, but I couldn't. Both queries execute fast,
   and the planner is always choosing a non-parallel index-only scan.

2. While testing changes to various settings, I noticed that setting
   max_parallel_workers_per_gather to 0 (from the original value of 4)
   fixes the issue! On the original huge table, disabling parallelism
   actually makes the query infinitely faster and it returns within 1s! Is
   this a bug in the planner?


Thank you,
Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

On Fri, 10 May 2024, Dimitrios Apostolou wrote:


I noticed that the following query is very very slow (too long to wait for it
to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;


Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN
SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

 Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
   ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
 ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)  
 Sort Key: test_runs_raw.workitem_n
   ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=724517732.37..724517734.37 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..704131546.90 
rows=8154474186 width=4)
 ->  Parallel Index Only Scan using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
 ->  Parallel Index Only Scan using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..2717185.06 
rows=32060806 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2610814.95 
rows=30806095 width=4)


It also takes ages to return, so I have to interrupt it.

I believe it should exit early, as soon as it finds 10 distinct values
(which should be rather easy even with parallel seqscans, given the
pattern followed when inserting the data).


Thanks,
Dimitris




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou




On Fri, 10 May 2024, Dimitrios Apostolou wrote:


On Fri, 10 May 2024, Dimitrios Apostolou wrote:

Update: even the simplest SELECT DISTINCT query shows similar behaviour:


Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:

 Limit  (cost=363.84..1134528847.47 rows=10 width=4)
   ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
 ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
   ->  Index Only Scan using test_runs_raw__part_max20k_pkey on 
test_runs_raw__part_max20k test_runs_raw_1  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan using test_runs_raw__part_max40k_pkey on 
test_runs_raw__part_max40k test_runs_raw_2  (cost=0.12..2.34 rows=1 width=4)
[...]
   ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on 
test_runs_raw__part_max1780k test_runs_raw_89  (cost=0.57..53587294.65 
rows=106088160 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on 
test_runs_raw__part_max1800k test_runs_raw_90  (cost=0.57..98943539.74 
rows=96214080 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on 
test_runs_raw__part_max1820k test_runs_raw_91  (cost=0.57..97495653.34 
rows=193248960 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on 
test_runs_raw__part_max1840k test_runs_raw_92  (cost=0.57..110205205.07 
rows=218440928 width=4)
   ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on 
test_runs_raw__part_max1860k test_runs_raw_93  (cost=0.57..50164056.28 
rows=99431760 width=4)
[...]


The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).

This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-10 Thread Dimitrios Apostolou

On Fri, 10 May 2024, Tom Lane wrote:


Dimitrios Apostolou  writes:

Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:



  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
  ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
...



The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


I'd say the blame lies with that (probably-default) estimate of
just 200 distinct rows.  That means the planner expects to have
to read about 5% (10/200) of the tables to get the result, and
that's making fast-start plans look bad.


Indeed that's an awful estimate; the table has more than 1M unique
values in that column. Looking into pg_stat_user_tables, I can't see the
partitions having been vacuum'd or analyzed at all. I think they should
have been auto-analyzed, since they get a ton of INSERTs
(no deletes/updates though) and I have the default autovacuum settings.
Could it be that autovacuum starts, but never
finishes? I can't find anything in the logs.

In any case, even after the planner decides to execute the terrible plan
with the parallel seqscans, why doesn't it finish right when it finds 10
distinct values?



Possibly an explicit ANALYZE on the partitioned table would help.


Thanks, I'll save the ANALYZE as the last step; I feel it's a good
opportunity to figure out more details about how postgres works. Plus I
expect ANALYZE to last a couple of days, so I should first find quiet time
for that. :-)

Dimitris




Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Sat, 11 May 2024, David Rowley wrote:


On Sat, 11 May 2024 at 13:33, Tom Lane  wrote:

I do kind of wonder why it's producing both a hashagg and a Unique
step --- seems like it should do one or the other.


It still needs to make the duplicate groups from parallel workers unique.


Range partitioning of the table guarantees that, since the ranges are not
overlapping.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Sat, 11 May 2024, David Rowley wrote:


On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou  wrote:

Indeed that's an awful estimate, the table has more than 1M of unique
values in that column. Looking into pg_stat_user_tables, I can't see the
partitions having been vacuum'd or analyzed at all. I think they should
have been auto-analyzed, since they get a ton of INSERTs
(no deletes/updates though) and I have the default autovacuum settings.
Could it be that autovacuum starts, but never
finishes? I can't find something in the logs.


It's not the partitions getting analyzed you need to worry about for
an ndistinct estimate on the partitioned table. It's auto-analyze or
ANALYZE on the partitioned table itself that you should care about.

If you look at [1], it says "Tuples changed in partitions and
inheritance children do not trigger analyze on the parent table."


Thanks




In any case, even after the planner decides to execute the terrible plan
with the parallel seqscans, why doesn't it finish right when it finds 10
distinct values?


It will. It's just that Sorting requires fetching everything from its subnode.


Isn't it plain wrong to have a sort step in the plan then? The different
partitions contain different value ranges with no overlap, and the last
query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
clause on an indexed column.

Even with bad estimates, even with seq scan instead of index scan, the
plan should be such that it concludes all parallel work as soon as it
finds the 10 distinct values. And this is actually achieved if I disable
parallel plans. Could it be a bug in the parallel plan generation?


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Mon, 13 May 2024, Dimitrios Apostolou wrote:


On Sat, 11 May 2024, David Rowley wrote:


 On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou  wrote:

 Indeed that's an awful estimate, the table has more than 1M of unique
 values in that column. Looking into pg_stat_user_tables, I can't see the
 partitions having been vacuum'd or analyzed at all. I think they should
 have been auto-analyzed, since they get a ton of INSERTs
 (no deletes/updates though) and I have the default autovacuum settings.
 Could it be that autovacuum starts, but never
 finishes? I can't find something in the logs.


 It's not the partitions getting analyzed you need to worry about for
 an ndistinct estimate on the partitioned table. It's auto-analyze or
 ANALYZE on the partitioned table itself that you should care about.

 If you look at [1], it says "Tuples changed in partitions and
 inheritance children do not trigger analyze on the parent table."


Thanks


Do I read that correctly, that I have to set up cron jobs to manually
analyze partitioned tables?


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou

On Tue, 14 May 2024, David Rowley wrote:


On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou  wrote:


On Sat, 11 May 2024, David Rowley wrote:

It will. It's just that Sorting requires fetching everything from its subnode.


Isn't it plain wrong to have a sort step in the plan than? The different
partitions contain different value ranges with no overlap, and the last
query I posted doesn't even contain an ORDER BY clause, just a DISTINCT
clause on an indexed column.


The query does contain an ORDER BY, so if the index is not chosen to
provide pre-sorted input, then something has to put the results in the
correct order before the LIMIT is applied.


The last query I tried was:

SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;

See my message at

[1] 
https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Will re-check things and report back later today with the further debugging
info you asked for.


Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-13 Thread Dimitrios Apostolou




On Tue, 14 May 2024, David Rowley wrote:


That assumes the Append won't ever use > 1 worker per subnode, but
that's not the case for your plan as the subnodes are "Parallel".
That means all the workers could be working on the same subnode which
could result in one group being split between 2 or more workers.


Didn't think of that, makes sense!


Parallel Append can also run in a way that the Append child nodes will
only get 1 worker each.


How can I tell which case it is, from the EXPLAIN output (for example
the output at [1]) ?

[1] 
https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net

Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou




On Fri, 10 May 2024, Tom Lane wrote:


Dimitrios Apostolou  writes:

Further digging into this simple query, if I force the non-parallel plan
by setting max_parallel_workers_per_gather TO 0, I see that the query
planner comes up with a cost much higher:



  Limit  (cost=363.84..1134528847.47 rows=10 width=4)
->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
  ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
...



The total cost on the 1st line (cost=363.84..1134528847.47) has a much
higher upper limit than the total cost when
max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
This explains the planner's choice. But I wonder why the cost estimation
is so far away from reality.


I'd say the blame lies with that (probably-default) estimate of
just 200 distinct rows.  That means the planner expects to have
to read about 5% (10/200) of the tables to get the result, and
that's making fast-start plans look bad.

Possibly an explicit ANALYZE on the partitioned table would help.


It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)

And it made a difference indeed, the serial plan is chosen now:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC 
LIMIT 10;
 Limit  (cost=364.29..1835512.29 rows=10 width=4)
   ->  Unique  (cost=364.29..22701882164.56 rows=123706 width=4)
 ->  Append  (cost=364.29..22538472401.60 rows=65363905182 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k 
test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k 
test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k 
test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
[...]
   ->  Index Only Scan Backward using 
test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k 
test_runs_raw_623  (cost=0.57..12329614.53 rows=121368496 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k 
test_runs_raw_622  (cost=0.57..5180832.16 rows=184927264 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k 
test_runs_raw_621  (cost=0.57..4544964.21 rows=82018824 width=4)
[...]

Overall I think there are two issues that postgres could handle better
here:

1. Avoid the need for a manual ANALYZE on partitioned tables

2. Create a different parallel plan, one that can exit early, when the
   LIMIT is proportionally low. I feel the partitions could be
   parallel-scanned in-order, so that the whole thing stops when one
   partition has been read.

Thank you!
Dimitris





Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou

On Tue, 14 May 2024, Dimitrios Apostolou wrote:


It took long but it finished:

ANALYZE
Time: 19177398.025 ms (05:19:37.398)


I see now that default_statistics_target is globally set to 500, so this
is probably the reason it took so long. I guess with the default of 100,
it would take approximately one hour. This is much better to have in a
cron job. :-)

Dimitris






Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou

On Tue, 14 May 2024, David Rowley wrote:


If you were to put the n_distinct_inherited estimate back to 200 and
disable sort, you should see the costs are higher for the index plan.
If that's not the case then there might be a bug.  It seems more
likely that due to the n_distinct estimate being so low that the
planner thought that a large enough fraction of the rows needed to be
read and that made the non-index plan appear cheaper.

I'd be interested in seeing what the costs are for the index plan. I
think the following will give you that (untested):

alter table test_runs_raw alter column workitem_n set
(n_distinct_inherited=200);
analyze test_runs_raw;


I had to stop this step because it was taking too long going through all
partitions again. But it seems it had the desired effect.


set enable_sort=0;
explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC 
LIMIT 10;


It chooses the non-parallel index plan:

 Limit  (cost=365.17..1135517462.36 rows=10 width=4)
   ->  Unique  (cost=365.17..22710342308.83 rows=200 width=4)
 ->  Append  (cost=365.17..22546660655.46 rows=65472661350 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k 
test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
   ->  Index Only Scan Backward using 
test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k 
test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
[... only index scans follow]

LIMIT 100 goes for the parallel seqscan plan, which even contains a sort!
But it seems to me that the extra upper-level HashAggregate step raises
the cost by an order of magnitude, from 800M to 10G, in comparison to
doing (Unique->Sort) - see the plan in the next paragraph.

 Limit  (cost=10857220388.76..10857220389.01 rows=100 width=4)
   ->  Sort  (cost=10857220388.76..10857220389.26 rows=200 width=4)
 Sort Key: test_runs_raw.workitem_n DESC
 ->  HashAggregate  (cost=857220379.12..857220381.12 rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Gather  (cost=857220295.12..857220377.12 rows=800 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=857219295.12..857219297.12 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..816295259.21 
rows=16369614363 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 
rows=64122388 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 
rows=61611136 width=4)
[... only seqscans follow]



If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 
10:

SET SESSION enable_sort TO TRUE;
EXPLAIN  SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC 
LIMIT 10;

 Limit  (cost=857166256.39..857166256.59 rows=10 width=4)
   ->  Unique  (cost=857166256.39..857166260.39 rows=200 width=4)
 ->  Sort  (cost=857166256.39..857166258.39 rows=800 width=4)
   Sort Key: test_runs_raw.workitem_n DESC
   ->  Gather  (cost=857166135.82..857166217.82 rows=800 width=4)
 Workers Planned: 4
 ->  HashAggregate  (cost=857165135.82..857165137.82 
rows=200 width=4)
   Group Key: test_runs_raw.workitem_n
   ->  Parallel Append  (cost=0.00..816243567.24 
rows=16368627432 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k 
test_runs_raw_480  (cost=0.57..1597356.30 rows=33623360 width=4)
 ->  Parallel Index Only Scan Backward using 
test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k 
test_runs_raw_507  (cost=0.57..1210806.37 rows=25794030 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037800.88 
rows=64122388 width=4)
 ->  Parallel Seq Scan on 
test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918865.36 
rows=61611136 width=4)
[... only seqscans follow]

So in order of lower to higher cost, we have the following alternatives:

1. non-parallel index scan  (800M)
2. parallel seqscan with sort  (1.3G)
3. parallel seqscan without

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

2024-05-14 Thread Dimitrios Apostolou

I have forgotten to mention that I have enable_partitionwise_aggregate=on
in the global settings since the beginning. According to the docs:


Enables or disables the query planner's use of partitionwise grouping or
aggregation, which allows grouping or aggregation on partitioned tables
to be performed separately for each partition.


Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every
partition, and then it would be up to independent plans to decide whether
each partition follows a parallel or a serial plan.

Not sure if this plan was checked but rejected because of cost.


Dimitris





pg_stat_io clarifications: background worker, writes and reads

2024-05-14 Thread Dimitrios Apostolou

Hello list,

what is the "background worker" in the pg_stat_io statistics view? I'm
reading the documentation but can't figure this one out knowing that it is
not autovacuum or bgwriter. And I'm not aware of any extension I might
have with registered background worker.

Additionally, how can it be evictions > writes? I would expect every
eviction to cause a write.

Finally about "hits", I understand they are reads found in shared_buffers,
so they never registered into the "reads" counter. So is "reads" in
pg_stat_io the equivalent to misses, i.e. the opposite of "hits", the read
attempts not found in the shared_buffers, that needed to be fetched from
the disk (or OS buffercache)?

backend_type    | background worker
object          | relation
context         | normal
reads           | 5139575
read_time       | 2196288.011
writes          | 63277
write_time      | 1766.94
writebacks      | 0
writeback_time  | 0
extends         | 0
extend_time     | 0
op_bytes        | 8192
hits            | 876913705
evictions       | 5139653
reuses          |
fsyncs          | 0
fsync_time      | 0
stats_reset     | 2024-04-08 08:50:02.971192+00
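
If my reading is right, a rough way to sanity-check it would be something like
the following sketch (miss_ratio is just my own name for reads / (hits + reads),
not a column of the view):

SELECT backend_type, object, context, hits, reads,
       round(reads::numeric / NULLIF(hits + reads, 0), 4) AS miss_ratio
FROM pg_stat_io
WHERE reads IS NOT NULL
ORDER BY reads DESC;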


Thank you in advance,
Dimitris





Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-15 Thread Dimitrios Apostolou

So what is this particular "background worker" I'm seeing, given that I
have no replication or extensions?

Searching the logs I found entries like the following:

LOG:  background worker "parallel worker" (PID 93384) exited with exit code 1

This got logged when I killed a simple SELECT query that took too long
doing parallel seqscans. Could it be that the entry in pg_stat_io named
"background worker" also includes the parallel workers from a SELECT
query?
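
I guess one way to test this myself would be to snapshot the counters, run a
parallel SELECT, and diff them. A sketch, with the filter values simply
mirroring the row I posted in my first message:

SELECT backend_type, reads, hits, evictions
FROM pg_stat_io
WHERE backend_type = 'background worker'
  AND object = 'relation'
  AND context = 'normal';

-- run the parallel SELECT, then repeat the query above and compare;
-- if the counters increased, the parallel workers are indeed accounted
-- under "background worker".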

Thanks,
Dimitris

On Wed, 15 May 2024, Muhammad Imtiaz wrote:


Hi,

In PostgreSQL, the pg_stat_io view provides detailed statistics on I/O 
operations. Background processes perform maintenance tasks and other background 
operations essential to the functioning of the PostgreSQL database.
They include processes such as:

1. Autovacuum Workers
2. WAL Writer
3. Background Writer
4. Logical Replication Workers
5. Custom Background Workers

In the pg_stat_io view, statistics related to I/O operations performed by these 
background workers are recorded. 

Regards,
Imtiaz


On Wed, 15 May 2024, 01:26 Dimitrios Apostolou,  wrote:
  Hello list,

  what is the "background worker" in the pg_stat_io statistics view? I'm
  reading the documentation but can't figure this one out knowing that it is
  not autovacuum or bgwriter. And I'm not aware of any extension I might
  have with registered background worker.

  Additionally, how can it be evictions > writes? I would expect every
  eviction to cause a write.

  Finally about "hits", I understand they are reads found in shared_buffers,
  so they never registered into the "reads" counter. So is "reads" in
  pg_stat_io the equivalent to misses, i.e. the opposite of "hits", the read
  attempts not found in the shared_buffers, that needed to be fetched from
  the disk (or OS buffercache)?

      backend_type    |    object     | context |  reads  |  read_time  | 
writes | write_time | writebacks | writeback_time | extends | extend_time | 
op_bytes |   hits    | evictions | reuses | fsyncs | fsync_time |          
stats_reset
  
---+---+-+-+-+++++-+-+--+---+---++++---
    background worker | relation      | normal  | 5139575 | 2196288.011 |  
63277 |    1766.94 |          0 |              0 |       0 |           0 |     
8192 | 876913705 |   5139653 |        |      0 |          0 | 2024-04-08 
08:50:02.971192+00


  Thank you in advance,
  Dimitris







Re: pg_stat_io clarifications: background worker, writes and reads

2024-05-21 Thread Dimitrios Apostolou

Hi Kashif, just to clarify my question, are you saying that the I/O from
parallel workers is recorded into the "background worker" entry of
pg_stat_io?

Thanks,
Dimitris


On Wed, 15 May 2024, Kashif Zeeshan wrote:


Hi,
parallel workers are used for parallel execution of queries, and you can find
more details at the link below.

https://www.postgresql.org/docs/current/how-parallel-query-works.html

It's controlled by the following parameters.

max_worker_processes = 6
max_parallel_workers_per_gather = 3
max_parallel_workers = 6
The limit of concurrent parallel workers for the whole cluster is 
max_parallel_workers, which must be ≤ max_worker_processes. The limit of 
parallel workers per query is max_parallel_workers_per_gather.

Thanks
Kashif Zeeshan
Bitnine Global

On Wed, May 15, 2024 at 5:59 PM Dimitrios Apostolou  wrote:
  So what is this particular "background worker" I'm seeing, given that I
  have no replication or extensions?

  Searching the logs I found entries like the following:

  LOG:  background worker "parallel worker" (PID 93384) exited with exit 
code 1

  This got logged when I killed a simple SELECT query that took too long
  doing parallel seqscans. Could it be that the entry in pg_stat_io named
  "background worker" also includes the parallel workers from a SELECT
  query?

  Thanks,
  Dimitris


Query on partitioned table needs memory n_partitions * work_mem

2024-07-10 Thread Dimitrios Apostolou

Hello list,

I have a table with 1000 partitions on PostgreSQL 16.
I notice that a fairly complicated query of the form:

SELECT ... GROUP BY ... LIMIT ...

causes the postgres backend process to grow extremely fast, and the
kernel OOM killer to kill it rather soon.
It seems it tries to allocate at least 1000 * work_mem.

If I reduce the amount of work_mem, I can control the outcome and avoid
the crash, but this is suboptimal.
I have parallel plans disabled (max_parallel_workers_per_gather=0).

To add a bit more info on the execution plan, I believe the relevant part
is the 1000 HashAggregate nodes under Append:

->  Append
  ->  HashAggregate
->  Seq Scan
  -> ... 1000 more hashagg+seqscans


Is this allocation pattern (work_mem * n_partitions) expected under any
scenario? I can't find it documented.  AFAIU the backend should allocate
up to (depth_of_execution_plan * work_mem) (putting aside the
hash_mem_multiplier and the parallel workers).

NOTE: after having written the above message, it occurred to me that I have
enable_partitionwise_aggregate=on. Turning it off fixes the issue and
makes the query faster too! Expected behaviour or bug?

Thank you in advance,
Dimitris

P.S. In the meantime I'm trying to reduce the query and the table schema,
 in order to submit a precise bug report with repro instructions.





Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou

Thank you for the feedback.

So I've managed to reduce the query to a rather simple one:

SELECT
workitem_n, test_executable_n,
bool_or(test_resulttype_n IN (2,3))
FROM
test_runs_raw
GROUP BY
workitem_n, test_executable_n
LIMIT 10;


The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). All the
columns are various integer types. There is an index on workitem_n.



On Thu, 11 Jul 2024, David Rowley wrote:


On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou  wrote:

I have a table with 1000 partitions on PostgreSQL 16.
I notice that a fairly complicated query of the form:

SELECT ... GROUP BY ... LIMIT ...

causes the postgres backend process to grow insanely very fast, and the
kernel OOM killer to kill it rather soon.
It seems it tries to allocate at least 1000 * work_mem.



->  Append
   ->  HashAggregate
 ->  Seq Scan
   -> ... 1000 more hashagg+seqscans


Is this allocation pattern (workmem * n_partitions) expected under any
scenario? I can't find it documented.  AFAIU the backend should allocate
up to (depth_of_execution_plan * work_mem) (putting aside the
hash_mem_multiplier and the parallel workers).


Not depth of execution plan. It relates to the number of nodes in the
plan which allocate work_mem or work_mem * hash_mem_multiplier.

There is some documentation in [1]:

"Note that a complex query might perform several sort and hash
operations at the same time"


The latest query is not complex at all and I don't see it doing 1000s of
operations at the same time. By "number of nodes" would you add up all
HashAggregate nodes under an Append node? Here is part of the EXPLAIN
ANALYZE output:

 Limit  (cost=0.01..28.00 rows=10 width=7) (actual time=43120.466..43292.246 
rows=10 loops=1)
   Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, 
(bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[]
   Buffers: shared hit=96 read=883975
   I/O Timings: shared read=16284.731
   ->  Append  (cost=0.01..3416299633.71 rows=1220556171 width=7) (actual 
time=42968.794..43139.855 rows=10 loops=1)
 Buffers: shared hit=96 read=883975
 I/O Timings: shared read=16284.731
 ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual 
time=10.662..10.663 rows=0 loops=1)
   Output: test_runs_raw.workitem_n, 
test_runs_raw.test_executable_n, bool_or((test_runs_raw.test_resulttype_n = ANY 
('{2,3}'::integer[])))
   Group Key: test_runs_raw.workitem_n, 
test_runs_raw.test_executable_n
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on public.test_runs_raw__part_max20k test_runs_raw  
(cost=0.00..0.00 rows=1 width=8) (actual time=9.960..9.961 rows=0 loops=1)
 Output: test_runs_raw.workitem_n, 
test_runs_raw.test_executable_n, test_runs_raw.test_resulttype_n
 ->  HashAggregate  (cost=0.01..0.02 rows=1 width=7) (actual 
time=1.913..1.914 rows=0 loops=1)
   Output: test_runs_raw_1.workitem_n, 
test_runs_raw_1.test_executable_n, bool_or((test_runs_raw_1.test_resulttype_n = 
ANY ('{2,3}'::integer[])))
   Group Key: test_runs_raw_1.workitem_n, 
test_runs_raw_1.test_executable_n
   Batches: 1  Memory Usage: 24kB
   ->  Seq Scan on public.test_runs_raw__part_max40k 
test_runs_raw_1  (cost=0.00..0.00 rows=1 width=8) (actual time=1.031..1.031 rows=0 
loops=1)
 Output: test_runs_raw_1.workitem_n, 
test_runs_raw_1.test_executable_n, test_runs_raw_1.test_resulttype_n
[  1000s of similar HashAggregate nodes ... ]
Settings: temp_buffers = '32MB', work_mem = '32MB', effective_io_concurrency = 
'300', max_parallel_workers_per_gather = '0', enable_hashjoin = 'off', 
enable_partitionwise_join = 'on', enable_partitionwise_aggregate = 'on', 
random_page_cost = '1.1', effective_cache_size = '6GB', from_collapse_limit = 
'24', join_collapse_limit = '24'
 Planning:
   Buffers: shared hit=377
 Planning Time: 1503.800 ms
 Execution Time: 56515.185 ms
(5382 rows)

Memory usage on each HashAggregate is logged as 24KB (many HashAggregates
are missing that info though); I guess the EXPLAIN output is missing some
important part of the allocations here, since I'm seeing MBs of allocations
per node.

I can't help but see this as a bug. I see many issues:

* postgres is not reading from partitions in parallel, but one after the
  other. It shouldn't need all this memory simultaneously.

* The memory is unnecessarily allocated early on, before any partitions
  are actually aggregated. I know this because I/O is slow on this device
  and the table sizes are huge, it's simply not possible that postgres
  went through all partitions and blew up the memory. That would take
  hours, but the OOM happens seconds after I start the query.

Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-11 Thread Dimitrios Apostolou

On Thu, 11 Jul 2024, Tom Lane wrote:


Dimitrios Apostolou  writes:

The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n).


So don't do that.  Adding partitions is not cost-free.



I understand that, they also add an administrative cost that I'd rather
avoid. But I ended up adding all these partitions because of performance
issues on a multi-billion rows table. There is probably some message from
me on this list a couple of years ago.

At the moment I have a work-around. I'm thankful that everyone is willing
to provide workarounds to all potential issues/bugs I have presented, but
unfortunately workarounds are not fixes, one will hit the same wall again
at some point.

My current concern is **reporting my findings responsibly**. I want to
provide as much data as is needed to pinpoint the issue, so that the
developers know exactly what's going on. Having the right data is half the fix.

A way to track the issue would be nice. I might revisit it and even try to
submit a patch. I wonder how the postgres development community is
tracking all these issues; I've even started forgetting the ones I have
found, and I'm sure I have previously reported (on this list) a couple of
should-be-easy issues that would be ideal for beginners.


Regards,
Dimitris





Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-19 Thread Dimitrios Apostolou

On Thu, 18 Jul 2024, David Rowley wrote:


On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou  wrote:


* The memory is unnecessarily allocated early on, before any partitions
   are actually aggregated. I know this because I/O is slow on this device
   and the table sizes are huge, it's simply not possible that postgres
   went through all partitions and blew up the memory. That would take
   hours, but the OOM happens seconds after I start the query.


That's interesting. Certainly, there is some memory allocated during
executor startup, but that amount should be fairly small.  Are you
able to provide a self-contained test case that shows the memory
blowing up before execution begins?


I'm trying hard to create a self-contained way to reproduce the issue.
It's not easy; the behaviour is a bit unstable. So far I see high memory
usage (8-10GB) but I'm not able to trigger an OOM with that much (it actually OOM'ed
once, but it was so uncontrollable that I didn't manage to measure and
reproduce again later).



-- I create a table with 2k partitions.

CREATE TABLE   partitioned_table1(
run_n   bigint   GENERATED ALWAYS AS IDENTITY,
workitem_n  integer  NOT NULL,
label_n smallint,
result  smallint NOT NULL,
PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

DO $$
for i in range(0, 2000):
stmt = f'''
CREATE TABLE part_max{i+1}M
PARTITION OF partitioned_table1
FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000})
'''
plpy.execute(stmt)
$$ LANGUAGE plpython3u;


-- I insert random data. First I insert to all partitions, 1M rows each:

DO $$
for i in range(0, 2000):
stmt = f'''
INSERT INTO partitioned_table1(workitem_n, label_n, result)
SELECT
j-j%4,
CAST(random()*1000 AS INTEGER),
CAST(random()*3AS INTEGER)
FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1,
1) as j
'''
plpy.info(stmt)
plpy.execute(stmt)
plpy.commit()
$$ LANGUAGE plpython3u;


-- Disable parallel execution and group aggregate:

SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;

SET SESSION work_mem TO '8MB';


-- Now the following query should do a HashAggregate:

SELECT
 workitem_n, label_n, bool_or(result IN (2,3))
FROM
 partitioned_table1
GROUP BY
 workitem_n, label_n
LIMIT 10;


-- How much was the RSS of the backend while the previous query was
-- running? Not that high. But if we insert some million rows to the
-- 1st partition, then it will be much higher.

DO $$
for i in range(0,2000):
stmt = f'''
INSERT INTO partitioned_table1(workitem_n, label_n, result)
SELECT
j%100,
CAST(random()*2 AS INTEGER),
CAST(random()*4AS INTEGER)
FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) 
as j
'''
plpy.info(stmt)
plpy.execute(stmt)
plpy.commit()
$$ LANGUAGE plpython3u;


-- Now that same previous query consumes between 8GB and 10GB RSS. The
-- more data I insert (to all partitions?), the more memory the query
-- takes.


Overall:

* I don't see the RSS memory usage (8GB) growing proportionally as I
  expected. If I increase work_mem from 4MB to 8MB then I see double RSS
  memory usage (from ~4GB to ~8GB). But then if I increase it further the
  difference is minuscule and no OOM happens.

* Instead I notice RSS memory usage growing slowly while I insert more and more 
data
  to the table (especially into the 1st partition I think).

* Finally I don't see the memory being freed by the backend after the
  SELECT finishes. The system is relieved only when I disconnect psql and
  the backend dies. Not sure if that's by design or not.




Having wasted a long time on this, the minimum I can do is submit a
documentation patch. At enable_partitionwise_aggregate, something like
"WARNING: it can increase the memory usage by at least
n_partitions * work_mem". How do I move forward with such a patch? Pointers
would be appreciated. :-)


I think mentioning something about this in enable_partitionwise_join
and enable_partitionwise_aggregate is probably wise. I'll propose a
patch on pgsql-hackers.


David and Ashutosh, thank you both for your interest in improving the
documentation. Unfortunately I'm no longer sure what exactly
is going on here; I don't understand how the memory is growing. One thing
I can verify is that it's definitely caused by partitioning: I have
another similar huge table but unpartitioned, and no such issues show up.
Maybe someone with knowledge of the HashAggregate algorithm and
partitioning can throw some ideas in.
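
If anyone wants to look at where the memory actually goes, I guess something
like the following could help. It's only a sketch: pg_log_backend_memory_contexts()
exists since PostgreSQL 14, needs superuser (or an explicit GRANT), and 12345
is a placeholder for the PID of the backend running the SELECT; it dumps that
backend's memory contexts into the server log:

SELECT pid, state, query FROM pg_stat_activity WHERE state = 'active';
SELECT pg_log_backend_memory_contexts(12345);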


Regards,
Dimitris





Re: Query on partitioned table needs memory n_partitions * work_mem

2024-07-19 Thread Dimitrios Apostolou

Ok I reproduced the OOM, with only 200 partitions as opposed to 2K that I
mentioned before. The keys to reproduce it I believe are:

1. Write millions of rows to *all* partitions
2. Run ANALYSE so that the planner knows about the sizes


Here are the correct steps now. Let me know if you manage to reproduce.
Kind reminder to set vm.overcommit_memory to 2, before you lose your
system like I did. :-)


CREATE TABLE   partitioned_table1(
run_n   bigint   GENERATED ALWAYS AS IDENTITY,
workitem_n  integer  NOT NULL,
label_n smallint,
result  smallint NOT NULL,
PRIMARY KEY(workitem_n, run_n)
) PARTITION BY RANGE(workitem_n);

-- Create 200 partitions

DO $$
for i in range(0, 200):
start = i * 10 * 1000 * 1000
end   = (i+1) * 10 * 1000 * 1000
stmt = f'''
CREATE TABLE part{i}
PARTITION OF partitioned_table1
FOR VALUES FROM ({start}) TO ({end})
'''
plpy.execute(stmt)
$$ LANGUAGE plpython3u;

-- Insert 10M rows per partition (2 billion rows in total). This will take
-- a while but I don't know of a way to speed it up.

DO $$
for i in range(0, 2000):
stmt = f'''
INSERT INTO partitioned_table1(workitem_n, label_n, result)
SELECT
j,
CAST(random()*1000 AS INTEGER),
CAST(random()*4AS INTEGER)
FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) 
as j
'''
plpy.info(stmt)
plpy.execute(stmt)
plpy.commit()
$$ LANGUAGE plpython3u;

-- Important to analyse! and set the right settings...

ANALYSE partitioned_table1;
SET SESSION max_parallel_workers_per_gather TO 0;
SET SESSION enable_incremental_sort TO off;
SET SESSION enable_partitionwise_aggregate  TO on;
SET SESSION enable_partitionwise_join   TO on;

-- And then reproduce the issue:

\timing on
SET SESSION work_mem TO '4MB';

SELECT
  workitem_n, label_n, bool_or(result IN (2,3))
FROM
  partitioned_table1
GROUP BY
  workitem_n, label_n
LIMIT 10;


Repeat that last query while doubling the work_mem every time. You'll see
the process growing to immense dimensions. In my case, my 16GB of RAM were
exhausted when work_mem was set to '128MB'.

Remarkably, the fastest execution is the one with the smallest work_mem,
where HashAggregate splits the work into many "partitions". As we grow the
work_mem, it becomes slower and slower (of course while not reaching the
limits of RAM yet; if it goes to swap then all measurements are off).

Let me know if you manage to reproduce the issue!

Dimitris





array_agg() does not stop aggregating according to HAVING clause

2024-08-17 Thread Dimitrios Apostolou

Hello list,

I have a query that goes through *billions* of rows and, for the datatags
that appear infrequently (HAVING count(test_datatag_n) < 10), it
selects all the IDs of the matching entries (array_agg(run_n)). Here is the full
query:


INSERT INTO infrequent_datatags_in_this_chunk
  SELECT datatag, datatags.datatag_n, array_agg(run_n)
FROM runs_raw
JOIN datatags USING(datatag_n)
WHERE workitem_n >= 295
  AND workitem_n <  714218
  AND datatag IS NOT NULL
GROUP BY datatags.datatag_n
HAVING  count(datatag_n) < 10
  AND   count(datatag_n) > 0  -- Not really needed because of the JOIN above
;

The runs_raw table has run_n as the primary key id, and an index on
workitem_n. The datatags table is a key value store with datatag_n as
primary key.

The problem is that this is extremely slow (5 hours), most likely because
it creates tens of gigabytes of temporary files as I see in the logs. I
suspect that it is writing to disk the array_agg(run_n) of all entries and
not only those HAVING count(datatag_n)<10. (I might be wrong though, as
this is only an assumption based on the amount of data written; I don't
know of any way to examine the temporary files written). While this query
is going through billions of rows, the ones with infrequent datatags are
maybe 10M.

How do I tell postgres to stop aggregating when count>=10?

Thank you in advance,
Dimitris




Re: array_agg() does not stop aggregating according to HAVING clause

2024-08-17 Thread Dimitrios Apostolou

On Sat, 17 Aug 2024, Tom Lane wrote:


Well, yes: the two aggregates (array_agg and count) are computed
concurrently in a single Aggregate plan node scanning the output
of the JOIN.  There's no way to apply the HAVING filter until
after the aggregation is finished.

I think this approach is basically forced by the SQL standard's
semantics for grouping/aggregation.


FWIW I also tried:

HAVING  array_length(array_agg(run_n), 1) < 10;

but I saw the same amount of temp files, at least in the short duration of
my test run.

Thank you, I will split this into two passes like you suggested. It's just
that I'm doing another 3 passes over this table for different things I
calculate (different GROUP BY, different WHERE clauses) and I was hoping
to minimize the time spent. But avoiding the array_agg() over everything
is my top priority ATM so I'll definitely try.
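
For the record, the two-pass version I'm planning to try looks roughly like
this. It's a sketch, assuming that datatag_n is NULL in runs_raw exactly when
the joined datatag would be NULL:

CREATE TEMPORARY TABLE infrequent_datatags AS
  SELECT datatag_n
    FROM runs_raw
   WHERE workitem_n >= 295
     AND workitem_n <  714218
     AND datatag_n IS NOT NULL
   GROUP BY datatag_n
   HAVING count(*) < 10;

INSERT INTO infrequent_datatags_in_this_chunk
  SELECT datatag, datatags.datatag_n, array_agg(run_n)
    FROM runs_raw
    JOIN infrequent_datatags USING (datatag_n)
    JOIN datatags USING (datatag_n)
   WHERE workitem_n >= 295
     AND workitem_n <  714218
   GROUP BY datatags.datatag_n;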

Regards,
Dimitris




Experience and feedback on pg_restore --data-only

2025-03-23 Thread Dimitrios Apostolou

Rationale:

When restoring a backup in an emergency situation, it's fine to run
pg_restore as superuser and get an exact replica of the dumped db.
AFAICT pg_restore (without --data-only) is optimised for that case.

But pg_dump/restore can be used as a generic data-copying utility, and in
those cases it often makes sense to get rid of the churn and create a
clean database by running the SQL schema definition from version control,
and then copy the data for only the tables created.

For this case, I choose to run pg_restore --data-only, and run it as the
user who owns the database (dbowner), not as a superuser, in order to
avoid changes being introduced under the radar.

Things that made my life hard:

* plenty of permission denials for both ALTER ... OWNER and SET SESSION
  AUTHORIZATION (depending on command line switches).  Both of these
  require superuser privilege, but in my case this is not really needed.
  Dbowner has CREATEROLE and is the one who creates all the roles (WITH
  SET TRUE), and their private schemata in the specific database.  Things
  would work if pg_restore did "SET ROLE" instead of "SET SESSION
  AUTHORIZATION" to switch user. Is this a straightforward change, or are
  there issues I don't see?

* After each failed attempt, I need to issue a TRUNCATE table1,table2,...
  before I try again.  I wrote my own function for that (a sketch follows
  this list). It would help if pg_restore would optionally truncate before
  COPY.  I believe it would require superuser privilege, which it could
  achieve using the --superuser=username option used today for disabling
  the triggers.
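
The function I use is roughly like the following sketch, assuming all restored
tables live in a single schema and that truncating them all in one statement is
acceptable (doing them in one statement also sidesteps foreign-key ordering):

CREATE OR REPLACE FUNCTION truncate_schema_tables(p_schema text)
RETURNS void LANGUAGE plpgsql AS $$
DECLARE
    tables text;
BEGIN
    SELECT string_agg(format('%I.%I', schemaname, tablename), ', ')
      INTO tables
      FROM pg_tables
     WHERE schemaname = p_schema;

    IF tables IS NOT NULL THEN
        EXECUTE 'TRUNCATE TABLE ' || tables;
    END IF;
END;
$$;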

Performance issues: (important as my db size is >5TB)

* WAL writes: I didn't manage to avoid writing to the WAL, despite having
  set wal_level=minimal. I even wrote my own function to ALTER all
  tables to UNLOGGED, but failed with "could not change table T to
  unlogged because it references logged table".  I'm out of ideas on this
  one.

* Indices: Could pg_restore have a switch to DROP indices before each
  COPY, and re-CREATE them after, exactly as they were?  This would speed
  up the process quite a bit (a sketch of the manual approach follows this
  list).
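
What I can do by hand today is roughly the following sketch. It only covers
plain indexes in the public schema; indexes backing constraints (primary keys,
unique constraints) would need ALTER TABLE ... DROP/ADD CONSTRAINT instead:

CREATE TABLE saved_indexdefs AS
  SELECT schemaname, tablename, indexname, indexdef
    FROM pg_indexes
   WHERE schemaname = 'public';

-- DROP INDEX each saved index, run pg_restore, then execute every indexdef
-- statement stored in saved_indexdefs to re-create them.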


Any feedback for improving my process? Should I put these ideas somewhere
as ideas for improvement on pg_restore?

Thank you in advance,
Dimitris





Re: Experience and feedback on pg_restore --data-only

2025-03-21 Thread Dimitrios Apostolou

On Thu, 20 Mar 2025, Dimitrios Apostolou wrote:


Rationale:

When restoring a backup in an emergency situation, it's fine to run
pg_restore as superuser and get an exact replica of the dumped db.
AFAICT pg_restore (without --data-only) is optimised for such case.

But pg_dump/restore can be used as a generic data-copying utility, and in
those cases it makes often sense to get rid of the churn and create a clean
database by running the SQL schema definition from version control, and then
copy the data for only the tables created.

For this case, I choose to run pg_restore --data-only, and run it as the user
who owns the database (dbowner), not as a superuser, in order to avoid
changes being introduced under the radar.


Another important reason I use --data-only: tablespaces. The new host has
different storage configuration and tablespaces and the schema has been
updated to reflect those. Therefore I must create the database using the
updated schema definitions on the new host.



Things that made my life hard:

* plenty of permission denials for both ALTER OWNER or SET SESSION
  AUTHORIZATION (depending on command line switches).  Both of these
  require superuser privilege, but in my case this is not really needed.
  Dbowner has CREATEROLE and is the one who creates all the roles (WITH
  SET TRUE), and their private schemata in the specific database.  Things
  would work if pg_restore did "SET ROLE" instead of "SET SESSION
  AUTHORIZATION" to switch user. Is this a straightforward change or there
  are issues I don't see?

* After each failed attempt, I need to issue a TRUNCATE table1,table2,...
  before I try again.  I wrote my own function for that. It would help if
  pg_restore would optionally truncate before COPY.  I believe it would
  require superuser privilege for it, that could achieve using the
  --superuser=username option used today for disabling the triggers.

Performance issues: (important as my db size is >5TB)

* WAL writes: I didn't manage to avoid writing to the WAL, despite having
  setting wal_level=minimal. I even wrote my own function to ALTER all
  tables to UNLOGGED, but failed with "could not change table T to
  unlogged because it references logged table".  I'm out of ideas on this
  one.

* Indices: Could pg_restore have a switch to DROP indices before each
  COPY, and re-CREATE them after, exactly as they were?  This would speed
  up the process quite a bit.


Any feedback for improving my process? Should I put these ideas somewhere as
ideas for improvement on pg_restore?

Thank you in advance,
Dimitris







Re: Experience and feedback on pg_restore --data-only

2025-03-24 Thread Dimitrios Apostolou

On Mon, 24 Mar 2025, Adrian Klaver wrote:


On 3/24/25 07:24, Dimitrios Apostolou wrote:

 On Sun, 23 Mar 2025, Laurenz Albe wrote:


 On Thu, 2025-03-20 at 23:48 +0100, Dimitrios Apostolou wrote:

 Performance issues: (important as my db size is >5TB)

 * WAL writes: I didn't manage to avoid writing to the WAL, despite
 having
    setting wal_level=minimal. I even wrote my own function to ALTER all
    tables to UNLOGGED, but failed with "could not change table T to
    unlogged because it references logged table".  I'm out of ideas on
 this
    one.


 You'd have to create an load the table in the same transaction, that is,
 you'd have to run pg_restore with --single-transaction.


 That would restore the schema from the dump, while I want to create the
 schema from the SQL code in version control.



I am not following, from your original post:

"
... create a
clean database by running the SQL schema definition from version control, and
then copy the data for only the tables created.

For this case, I choose to run pg_restore --data-only, and run it as the user
who owns the database (dbowner), not as a superuser, in order to avoid
changes being introduced under the radar.
"

You are running the process in two steps, where the first does not involve
pg_restore. Not sure why doing the pg_restore --data-only portion in single
transaction is not possible?


Laurenz informed me that I could avoid writing to the WAL if I "create and
load the table in a single transaction".
I haven't tried, but here is what I would do to try --single-transaction:

Transaction 1: manually issuing all of CREATE TABLE etc.

Transaction 2: pg_restore --single-transaction --data-only

The COPY command in transaction 2 would still need to write to WAL, since
it's separate from the CREATE TABLE.

Am I wrong somewhere?


 Something that might work, would be for pg_restore to issue a TRUNCATE
 before the COPY. I believe this would require superuser privelege though,
 that I would prefer to avoid. Currently I issue TRUNCATE for all tables
 manually before running pg_restore, but of course this is in a different
 transaction so it doesn't help.

 By the way do you see potential problems with using --single-transaction
 to restore billion-rows tables?


COPY is all or none (version 17+ caveat: see ON_ERROR at
https://www.postgresql.org/docs/current/sql-copy.html), so if the
data dump fails in --single-transaction everything rolls back.


So if I restore all tables, then an error about a "table not found" would
not roll back already copied tables, since it's not part of a COPY?


Thank you for the feedback,
Dimitris



Re: Experience and feedback on pg_restore --data-only

2025-03-24 Thread Dimitrios Apostolou

On Sun, 23 Mar 2025, Adrian Klaver wrote:


On 3/20/25 15:48, Dimitrios Apostolou wrote:


 * plenty of permission denials for both ALTER OWNER or SET SESSION
    AUTHORIZATION (depending on command line switches).  Both of these
    require superuser privilege, but in my case this is not really needed.
    Dbowner has CREATEROLE and is the one who creates all the roles (WITH
    SET TRUE), and their private schemata in the specific database.  Things
    would work if pg_restore did "SET ROLE" instead of "SET SESSION
    AUTHORIZATION" to switch user. Is this a straightforward change or
there are issues I don't see?


If this is --data-only what are the ALTER OWNER and SET SESSION AUTHORIZATION
for?


You are probably right; early in my trials I was running pg_restore
without --data-only as a non-superuser, so it might be that the error
message comes from there. Unfortunately I haven't noted the exact command
for this error message.

The point still stands though. The dbowner user is administrator for this
database, and has CREATEd the users with the right to SET ROLE as any of
them. Those other users own tables in their private schemas. But
pg_restore does SET SESSION AUTHORIZATION which requires superuser priv
instead of SET ROLE. I wonder what the reasons are for that.

Maybe pg_restore could either:

- do SET ROLE instead of SET SESSION AUTHORIZATION
- temporarily use the --superuser powers just for issuing the ALTER ROLE.

Regards,
Dimitris


Re: Experience and feedback on pg_restore --data-only

2025-03-24 Thread Dimitrios Apostolou

Hi Ron,

I read your reply in the mailing list archives as I'm not subscribed to
the list, and I'm copy-pasting a response here. Please include me as a
recipient in further replies.


Why are you regularly having emergencies requiring the restoration of
multi-TB tables to databases with lots of cruft?

Fixing that would go a long way towards eliminating your problems with
pg_restore.


I don't have emergencies yet. I'm testing the process of restoring the
database dump, and it takes more than 24 hours currently. A successful
test is vital to approve the process.

But the primary usage of pg_restore that I have is not to save me from
emergencies but to populate the dev database with recent data.


Regards,
Dimitris





Re: Experience and feedback on pg_restore --data-only

2025-03-24 Thread Dimitrios Apostolou

On Sun, 23 Mar 2025, Laurenz Albe wrote:


On Thu, 2025-03-20 at 23:48 +0100, Dimitrios Apostolou wrote:

Performance issues: (important as my db size is >5TB)

* WAL writes: I didn't manage to avoid writing to the WAL, despite having
   setting wal_level=minimal. I even wrote my own function to ALTER all
   tables to UNLOGGED, but failed with "could not change table T to
   unlogged because it references logged table".  I'm out of ideas on this
   one.


You'd have to create an load the table in the same transaction, that is,
you'd have to run pg_restore with --single-transaction.


That would restore the schema from the dump, while I want to create the
schema from the SQL code in version control.

Something that might work would be for pg_restore to issue a TRUNCATE
before the COPY. I believe this would require superuser privilege though,
which I would prefer to avoid. Currently I issue TRUNCATE for all tables
manually before running pg_restore, but of course this is in a different
transaction so it doesn't help.

By the way, do you see potential problems with using --single-transaction
to restore billion-row tables?


Thank you,
Dimitris


Re: Experience and feedback on pg_restore --data-only

2025-04-05 Thread Dimitrios Apostolou

Hi Laurenz,

On Sun, 23 Mar 2025, Laurenz Albe wrote:


On Thu, 2025-03-20 at 23:48 +0100, Dimitrios Apostolou wrote:

Performance issues: (important as my db size is >5TB)

* WAL writes: I didn't manage to avoid writing to the WAL, despite having
   setting wal_level=minimal. I even wrote my own function to ALTER all
   tables to UNLOGGED, but failed with "could not change table T to
   unlogged because it references logged table".  I'm out of ideas on this
   one.


You'd have to create an load the table in the same transaction, that is,
you'd have to run pg_restore with --single-transaction.


Do you know why --single-transaction can't be combined with -j?
I'm thinking it's because every process would need to have its own
transaction, which apparently results in more than one transaction.

Do you think this could be alleviated somehow with subtransactions?

I'm trying to think of a way to enjoy the performance benefits of
--single-transaction, while also parallelizing the pg_restore. For this
use case I don't care about the content appearing all in one transaction.
Would it be possible then to have dependency resolution in such a way that
we split the tasks into one transaction per worker process?

Just thinking out loud here, I'd be interested in feedback.

Thanks,
Dimitris


Performance issues during pg_restore -j with big partitioned table

2025-04-04 Thread Dimitrios Apostolou

Hello list.

My database includes one table with 1000 partitions, all of them rather
sizeable. I run:

  pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error 
--no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump

Right now after 24h of restore, I notice weird behaviour, so I have
several questions about it:

+ 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting".
  I see that they are waiting to issue a TRUNCATE for one of the
  partitions and then COPY data to it.  Checking the log I see that
  several partitions have already finished being copied, but many more
  are left to start.

  Why is a TRUNCATE needed at the start of a partition's COPY phase? I
  didn't issue a --clean on the command line (I don't need it as my
  database is newly created), and I don't see a mention of related
  TRUNCATE in the pg_restore manual.

+ 1 postgres backend process is doing:

  ALTER TABLE the_master_partitioned_table
ADD CONSTRAINT ...
  FOREIGN KEY (columnX) REFERENCES another_table(columnX)

  According to my logs this started right after COPY DATA for
  another_table was finished. And apparently it has a lock on
  the_master_partitioned_table that all other TRUNCATE have to wait for.

  Is this a bug in the dependency resolution? Wouldn't it make sense for
  this to wait until all 1000 partitions have finished their COPY DATA
  phase?

+ Investigating why the above ALTER TABLE takes so long, I notice that it
  is issuing a lot of writes to the WAL.  Digging deeper shows a lot of
  time spent in SetHintBits().  Is there a way to avoid that in a clean
  pg_restore?


Thanks in advance,
Dimitris




pg_restore causing ENOSPACE on the WAL partition

2025-04-10 Thread Dimitrios Apostolou

Hello list,

last night I got ENOSPACE on the WAL partition while running a huge
pg_restore on an empty and idle database.
The checkpoint that started 7 minutes earlier never finished:

  04:31:09 LOG:  checkpoint starting: wal
  ...
  04:38:04 PANIC:  could not write to file "pg_wal/xlogtemp.546204": No space left on device
  CONTEXT:  COPY table_partition_123, line 120872534

This is most likely because most of the data is written to a very slow
"archival" tablespace. There are no I/O errors in the system logs; I know
things go slowly on that device (an iSCSI drive over a shared spinning-disks
pool and a busy network).

This happened even though I'm keeping the WAL in its own partition (128GB
partition size), with a lot of free space above max_wal_size (64GB).
Somehow it managed to grow above max_wal_size and fill 100% of the
partition.

I'm running latest PostgreSQL 17 and the settings have been temporarily
tweaked for fastest pg_restore:

max_wal_size=64GB
max_replication_slots = 0
max_logical_replication_workers = 0
max_wal_senders = 0
wal_level = minimal
autovacuum = off

Several things seem to have gone wrong here. Questions:

+ The WAL partition is much faster than the archival tablespace. Am I in
  constant danger of overrunning max_wal_size? How to make 100% sure this
  never happens again?

+ After recovery, with the database idling, I notice that WAL space usage
  is constant at 64GB. Why doesn't it free up space down to min_wal_size (1GB)?

+ I just created a 13GB zstd-compressed tarball of those 64GB WAL
  files. This indicates that the files are compressible despite the
  wal_compression=zstd setting. Could it be that Postgres ignores the setting
  and does not compress the WAL? How to check? (See the sketch after this list.)

+ I'm using parallel pg_restore --data-only, can't avoid that for now.
  Even though all the tables are empty (I truncated everything before
  starting pg_restore), I can't find a way to avoid going through the WAL.
  Ideas?
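
Regarding the wal_compression question above, a minimal check from SQL would
be something like this (only a sketch; it verifies the setting, not how much
of the WAL actually gets compressed, since as far as I understand
wal_compression only compresses full-page images):

  -- confirm what the running server actually uses
  SHOW wal_compression;
  SELECT name, setting, source
    FROM pg_settings
   WHERE name = 'wal_compression';

If I recall correctly, pg_waldump --stats on one of those segments would also
show how much of the volume is full-page images versus ordinary records.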


Thanks in advance,
Dimitris





Re: pg_restore causing ENOSPACE on the WAL partition

2025-04-10 Thread Dimitrios Apostolou

Forgot an important setting I have during pg_restore:

On Thu, 10 Apr 2025, Dimitrios Apostolou wrote:


max_wal_size=64GB
max_replication_slots = 0
max_logical_replication_workers = 0
max_wal_senders = 0
wal_level = minimal
autovacuum = off


checkpoint_completion_target = 0

so the checkpoint should finish ASAP and any delays are purely due to
I/O limitations. And for the sake of completeness:

checkpoint_timeout = 600s


Regards,
Dimitris




Re: pg_restore causing ENOSPACE on the WAL partition

2025-04-11 Thread Dimitrios Apostolou

Happened again even with max_wal_size=32GB on the 128GB WAL partition. At
this point I'm quite sure what happens:

+ a checkpoint is going on for several minutes (because of I/O contention
  on the archival drive)

+ meanwhile data keeps coming in through several processes of pg_restore

+ data is coming in much faster because the WAL partition is high perf

+ The checkpoint finishes before the partition is full and frees up 32GB
  (max_wal_size) on the WAL drive. Let's say the WAL partition is now at
  80/128GB full.

+ Immediately another checkpoint starts. This one has to checkpoint 80GB
  of data.

+ This doesn't make it in time and the WAL partition fills up.

+ Painful recovery follows; the database server won't even succeed in
  doing recovery when starting up...


I'm surprised that there is no mechanism for the backends to block while
the WAL is overflowing.
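
While testing this further, a rough way to watch the WAL partition fill up
from inside the database would be something like (just a sketch):

  -- pg_ls_waldir() lists the files under pg_wal with their sizes in bytes
  SELECT count(*)                  AS wal_segments,
         pg_size_pretty(sum(size)) AS wal_size
    FROM pg_ls_waldir();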

Dimitris



On Thu, 10 Apr 2025, Dimitrios Apostolou wrote:


Hello list,

last night I got ENOSPACE on the WAL partition while running a huge
pg_restore on an empty and idle database.
The checkpoint that started 7 minutes earlier never finished:

  04:31:09 LOG:  checkpoint starting: wal
  ...
  04:38:04 PANIC:  could not write to file "pg_wal/xlogtemp.546204": No space left on device
  CONTEXT:  COPY table_partition_123, line 120872534

This is most likely because most of the data is written to a very slow
"archival" tablespace. There are no I/O errors in the system logs; I know
things go slowly on that device (an iSCSI drive over a shared spinning-disks
pool and a busy network).

This happened even though I'm keeping the WAL in its own partition (128GB
partition size), with a lot of free space above max_wal_size (64GB). Somehow
it managed to grow above max_wal_size and fill 100% of the partition.

I'm running latest PostgreSQL 17 and the settings have been temporarily
tweaked for fastest pg_restore:

max_wal_size=64GB
max_replication_slots = 0
max_logical_replication_workers = 0
max_wal_senders = 0
wal_level = minimal
autovacuum = off

Several things seem to have gone wrong here. Questions:

+ The WAL partition is much faster than the archival tablespace. Am I in
  constant danger of overrunning max_wal_size? How to make 100% sure this
  never happens again?

+ After recovery, with the database idling, I notice that WAL space usage
  is constant at 64GB. Why doesn't it free up space down to min_wal_size
  (1GB)?

+ I just created a 13GB zstd-compressed tarball of those 64GB WAL
  files. This indicates that the files are compressible despite using
  wal_compression=zstd setting. Could it be that postgres ignores the flag
  and does not compress the WAL? How to check?

+ I'm using parallel pg_restore --data-only, can't avoid that for now.
  Even though all the tables are empty (I truncated everything before
  starting pg_restore), I can't find a way to avoid going through the WAL.
  Ideas?


Thanks in advance,
Dimitris








Re: TRUNCATE ONLY with foreign keys and triggers disabled

2025-04-14 Thread Dimitrios Apostolou

On Mon, 14 Apr 2025, Adrian Klaver wrote:


On 4/14/25 08:20, Dimitrios Apostolou wrote:

 On Mon, 14 Apr 2025, Adrian Klaver wrote:


 On 4/14/25 08:07, Laurenz Albe wrote:

  On Mon, 2025-04-14 at 17:05 +0200, Dimitrios Apostolou wrote:

  I meant the *referencing* table has just been
  populated. I'm trying to delete the *referenced* table and I get the
  error.


  That would break the foreign key constraint, right?
  PostgreSQL cannot allow that.


 I believe the OP is disabling all triggers including system ones, if I
 follow correctly, and possibly running afoul of:

 https://www.postgresql.org/docs/current/sql-altertable.html

 " Disabling or enabling internally generated constraint triggers requires
 superuser privileges; it should be done with caution since of course the
 integrity of the constraint cannot be guaranteed if the triggers are not
 executed."


 Exactly that. I learned this from pg_restore --disable-triggers, as a way
 to speed up insertion.

 Since triggers are disabled, I assumed that PostgreSQL shouldn't care
 about referential integrity in TRUNCATE.


I'm going to say that depends on the order of execution. Without seeing the code
changes you have done, there is really no way to determine what the exact
issue is.


Sent patch here:

https://www.postgresql.org/message-id/flat/4589087c-ec6f-4407-1f82-6cb2e681ac0a%40gmx.net

The reason I first posted here (pgsql-general) is that I wondered whether the
error is just an implementation shortcoming without deeper reasons, in which
case it would be feasible to try to fix it myself.


Dimitris



Re: TRUNCATE ONLY with foreign keys and triggers disabled

2025-04-14 Thread Dimitrios Apostolou

On Mon, 14 Apr 2025, Tom Lane wrote:


Dimitrios Apostolou  writes:

While doing TRUNCATE ONLY I get:
   ERROR: cannot truncate a table referenced in a foreign key constraint
But in my case the table to be truncated is already empty, and the
TRIGGERS are disabled in all tables.


IIRC, it will let you do it if you truncate both the referenced and
referencing tables in the same command.  The state of the triggers
is not material to this, since TRUNCATE doesn't fire them anyway.


Thanks Tom, however that is not possible in my case, the referenced table
is possibly just populated and potentially very big (partitioned table
with many sub-partitions).
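
(For reference, the form Tom describes would be something like the following,
with hypothetical table names: truncating the referenced table and every
referencing table in one command satisfies the foreign-key check.)

  TRUNCATE TABLE referenced_table, referencing_table;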

Context: I've modified pg_restore to accept --data-only --clean and to
issue a TRUNCATE ONLY before each table's COPY, in the same
transaction (thus avoiding WAL too). I will send an RFC patch to
pgsql-hackers once I verify it works.


Dimitris




Re: TRUNCATE ONLY with foreign keys and triggers disabled

2025-04-14 Thread Dimitrios Apostolou

On Mon, 14 Apr 2025, Dimitrios Apostolou wrote:


On Mon, 14 Apr 2025, Tom Lane wrote:


 Dimitrios Apostolou  writes:

 While doing TRUNCATE ONLY I get:
ERROR: cannot truncate a table referenced in a foreign key constraint
 But in my case the table to be truncated is already empty, and the
 TRIGGERS are disabled in all tables.


 IIRC, it will let you do it if you truncate both the referenced and
 referencing tables in the same command.  The state of the triggers
 is not material to this, since TRUNCATE doesn't fire them anyway.


Thanks Tom, however that is not possible in my case, the referenced table is 
possibly just populated and potentially very big (partitioned table with many 
sub-partitions).


Terminology correction: I meant the *referencing* table has just been 
populated. I'm trying to delete the *referenced* table and I get the 
error.



Dimitris





Re: TRUNCATE ONLY with foreign keys and triggers disabled

2025-04-14 Thread Dimitrios Apostolou

On Mon, 14 Apr 2025, Adrian Klaver wrote:


On 4/14/25 08:07, Laurenz Albe wrote:

 On Mon, 2025-04-14 at 17:05 +0200, Dimitrios Apostolou wrote:

 I meant the *referencing* table has just been
 populated. I'm trying to delete the *referenced* table and I get the
 error.


 That would break the foreign key constraint, right?
 PostgreSQL cannot allow that.


I believe the OP is disabling all triggers including system ones, if I follow
correctly, and possibly running afoul of:

https://www.postgresql.org/docs/current/sql-altertable.html

" Disabling or enabling internally generated constraint triggers requires
superuser privileges; it should be done with caution since of course the
integrity of the constraint cannot be guaranteed if the triggers are not
executed."


Exactly that. I learned this from pg_restore --disable-triggers, as a way
to speed up insertion.

Since triggers are disabled, I assumed that PostgreSQL shouldn't care
about referential integrity in TRUNCATE.


Dimitris





TRUNCATE ONLY with foreign keys and triggers disabled

2025-04-14 Thread Dimitrios Apostolou

Hello list,

While doing TRUNCATE ONLY I get:

  ERROR: cannot truncate a table referenced in a foreign key constraint

But in my case the table to be truncated is already empty, and the 
TRIGGERS are disabled in all tables.


I wonder, is there a deeper reason for this error, or is it just an 
implementation shortcoming? In other words, how easy would it be to write 
a patch that allows TRUNCATE even with foreign keys when triggers are 
disabled or the table is empty?
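
A minimal reproduction sketch of what I'm seeing (hypothetical table names;
disabling the internal constraint triggers needs superuser):

  CREATE TABLE parent (id integer PRIMARY KEY);
  CREATE TABLE child  (parent_id integer REFERENCES parent (id));

  ALTER TABLE child DISABLE TRIGGER ALL;  -- includes the internal FK triggers

  TRUNCATE ONLY parent;
  -- ERROR:  cannot truncate a table referenced in a foreign key constraint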


Thanks in advance,
Dimitris





Re: Performance issues during pg_restore -j with big partitioned table

2025-04-04 Thread Dimitrios Apostolou

On Fri, 4 Apr 2025, Adrian Klaver wrote:


On 4/4/25 06:13, Dimitrios Apostolou wrote:

 On Wed, 2 Apr 2025, Dimitrios Apostolou wrote:




 Bug? This happened on a postgres compiled from last week's master branch.


Are you talking about the dev version?


In this thread, yes. My mistake, I didn't mention it from the start. I
compiled it in order to see if the problems I mentioned in my previous
thread were fixed.


Dimitris





Re: Performance issues during pg_restore -j with big partitioned table

2025-04-04 Thread Dimitrios Apostolou

On Wed, 2 Apr 2025, Dimitrios Apostolou wrote:


Hello list.

My database includes one table with 1000 partitions, all of them rather


I was not clear here: my database dump has all that, and the database is
brand new and empty.


sizeable. I run:

 pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error
--no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump

Right now after 24h of restore, I notice weird behaviour, so I have several
questions about it:

+ 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting".
  I see that they are waiting to issue a TRUNCATE for one of the
  partitions and then COPY data to it.  Checking the log I see that
  several partitions have already finished being copied, but many more
  are left to start.

  Why is a TRUNCATE needed at the start of a partition's COPY phase? I
  didn't issue a --clean on the command line (I don't need it as my
  database is newly created), and I don't see a mention of related
  TRUNCATE in the pg_restore manual.

+ 1 postgres backend process is doing:

  ALTER TABLE the_master_partitioned_table
ADD CONSTRAINT ...
  FOREIGN KEY (columnX) REFERENCES another_table(columnX)

  According to my logs this started right after COPY DATA for
  another_table was finished. And apparently it has a lock on
  the_master_partitioned_table that all other TRUNCATE have to wait for.

  Is this a bug in the dependency resolution? Wouldn't it make sense for
  this to wait until all 1000 partitions have finished their COPY DATA
  phase?


Trying again, pg_restore exited with error after almost 24h:

pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 8904; 2606 16529 CONSTRAINT test_runs_raw 
test_runs_raw_partitioned_pkey
pg_restore: error: could not execute query: ERROR:  deadlock detected
DETAIL:  Process 465409 waits for AccessExclusiveLock on relation 44437 of 
database 44090; blocked by process 465408.
Process 465408 waits for AccessShareLock on relation 44383 of database 44090; 
blocked by process 465409.
HINT:  See server log for query details.

From the logs I see that:

+ Process 465409 waits for AccessExclusiveLock on relation 44437 of database 
44090; blocked by process 465408.
  --> 44437 is test_runs_raw__part_max10120k (a single partition)
+ Process 465408 waits for AccessShareLock on relation 44383 of database 
44090; blocked by process 465409.
  --> 44383 is test_runs_raw  (the master partitioned table)

Process 465409:
  ALTER TABLE ONLY public.test_runs_raw
ADD CONSTRAINT test_runs_raw_partitioned_pkey PRIMARY KEY (workitem_n, 
run_n);

Process 465408:
  COPY public.test_runs_raw__part_max10120k(...) FROM stdin;


Bug? This happened on a postgres compiled from last week's master branch.
The dump I'm trying to restore is from postgres 17.4.
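
For reference, mapping the OIDs from the deadlock message to relation names
can be done with a simple catalog lookup (a sketch):

  SELECT oid, relname
    FROM pg_class
   WHERE oid IN (44437, 44383);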


Thanks
Dimitris





Re: Performance issues during pg_restore -j with big partitioned table

2025-04-02 Thread Dimitrios Apostolou

On Wed, 2 Apr 2025, Adrian Klaver wrote:




On 4/2/25 10:32 AM, Dimitrios Apostolou wrote:

 Hello list.

 My database includes one table with 1000 partitions, all of them rather
 sizeable. I run:

   pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error
 --no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump

 Right now after 24h of restore, I notice weird behaviour, so I have
 several questions about it:

 + 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting".
    I see that they are waiting to issue a TRUNCATE for one of the
    partitions and then COPY data to it.  Checking the log I see that
     several partitions have already finished being copied, but many more
    are left to start.

    Why is a TRUNCATE needed at the start of a partition's COPY phase? I
    didn't issue a --clean on the command line (I don't need it as my
    database is newly created), and I don't see a mention of related
    TRUNCATE in the pg_restore manual.


--clean will drop the object entirely, not TRUNCATE it.

I'm guessing that this is being done by you per:

https://www.postgresql.org/message-id/53760c70-4a87-a453-9e02-57abc9cb2e54%40gmx.net

"After each failed attempt, I need to issue a TRUNCATE table1,table2,...
before I try again. "


Thanks Adrian. I'm now testing the restore without --data-only. All I'm doing
prior to the above pg_restore command is "createdb -T template0 newdb".
It's possible, though, that I'm missing something here; the whole thing is
way more complicated than I expected...


Dimitris


Re: Performance issues during pg_restore -j with big partitioned table

2025-04-02 Thread Dimitrios Apostolou

On Wed, 2 Apr 2025, Adrian Klaver wrote:



On 4/2/25 10:39 AM, Adrian Klaver wrote:





 --clean will drop the object entirely not TRUNCATE.

 I'm guessing that this is being done by you per:

 
https://www.postgresql.org/message-id/53760c70-4a87-a453-9e02-57abc9cb2e54%40gmx.net

 "After each failed attempt, I need to issue a TRUNCATE table1,table2,...
 before I try again. "


Oops, forgot to engage brain.

From pg_backup_archiver.c:

* In parallel restore, if we created the table earlier in
* this run (so that we know it is empty) and we are not
* restoring a load-via-partition-root data item then we
* wrap the COPY in a transaction and precede it with a
* TRUNCATE.  If wal_level is set to minimal this prevents
* WAL-logging the COPY.  This obtains a speedup similar
* to that from using single_txn mode in non-parallel
* restores.


This makes sense. It creates the table earlier, and then truncates it just
before copying data into it. I wonder if this really circumvents the WAL,
since I don't have --single-transaction (incompatible with -j).
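
In other words, each worker seems to emit something along these lines per
table (a sketch based on the comment above; names are placeholders). If I
read it right, the WAL skip depends on wal_level = minimal plus the table
being created or truncated in the same transaction as the COPY, not on
--single-transaction itself:

  BEGIN;
  -- the table was created earlier in this run, so it is known to be empty
  TRUNCATE TABLE public.table1;
  -- with wal_level = minimal this COPY can then skip WAL-logging
  COPY public.table1 FROM stdin;
  -- (data stream follows, terminated by \.)
  COMMIT;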


Dimitris




Re: Performance issues during pg_restore -j with big partitioned table

2025-06-17 Thread Dimitrios Apostolou
For the record, I haven't seen this deadlock again. I guess it was a bug
on the master branch that got fixed, because I've tested later
versions a few times since.


Dimitris



On Fri, 4 Apr 2025, Dimitrios Apostolou wrote:


On Wed, 2 Apr 2025, Dimitrios Apostolou wrote:


 Hello list.

 My database includes one table with 1000 partitions, all of them rather


I was not clear here: my database dump has all that, and the database is 
brand new and empty.



 sizeable. I run:

 pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error
 --no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump

 Right now after 24h of restore, I notice weird behaviour, so I have
 several questions about it:

 + 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting".
   I see that they are waiting to issue a TRUNCATE for one of the
   partitions and then COPY data to it.  Checking the log I see that
   several partitions have already finished being copied, but many more
   are left to start.

   Why is a TRUNCATE needed at the start of a partition's COPY phase? I
   didn't issue a --clean on the command line (I don't need it as my
   database is newly created), and I don't see a mention of related
   TRUNCATE in the pg_restore manual.

 + 1 postgres backend process is doing:

   ALTER TABLE the_master_partitioned_table
 ADD CONSTRAINT ...
   FOREIGN KEY (columnX) REFERENCES another_table(columnX)

   According to my logs this started right after COPY DATA for
   another_table was finished. And apparently it has a lock on
   the_master_partitioned_table that all other TRUNCATE have to wait for.

   Is this a bug in the dependency resolution? Wouldn't it make sense for
   this to wait until all 1000 partitions have finished their COPY DATA
   phase?


Trying again, pg_restore exited with error after almost 24h:

pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 8904; 2606 16529 CONSTRAINT test_runs_raw 
test_runs_raw_partitioned_pkey

pg_restore: error: could not execute query: ERROR:  deadlock detected
DETAIL:  Process 465409 waits for AccessExclusiveLock on relation 44437 of 
database 44090; blocked by process 465408.
Process 465408 waits for AccessShareLock on relation 44383 of database 44090; 
blocked by process 465409.

HINT:  See server log for query details.

From the logs I see that:

+ Process 465409 waits for AccessExclusiveLock on relation 44437 of database 
44090; blocked by process 465408.

 -->  44437 is test_runs_raw__part_max10120k (a single partition)
+ Process 465408 waits for AccessShareLock on relation 44383 of database 
44090; blocked by process 465409.

 -->  44383 is test_runs_raw  (the master partitioned table)

Process 465409:
  ALTER TABLE ONLY public.test_runs_raw
ADD CONSTRAINT test_runs_raw_partitioned_pkey PRIMARY KEY (workitem_n,
run_n);

Process 465408:
  COPY public.test_runs_raw__part_max10120k(...) FROM stdin;


Bug? This happened on a postgres compiled from last week's master branch. The 
dump I'm trying to restore is from postgres 17.4.



Thanks
Dimitris







Re: having temp_tablespaces on less reliable storage

2025-07-11 Thread Dimitrios Apostolou



On Thu, 10 Jul 2025, Dimitrios Apostolou wrote:


Hello list,

I have a database split across many tablespaces, with temp_tablespaces
pointing to a separate, less reliable device (a single local NVMe drive). How
big is the risk that the cluster becomes unrecoverable after a crash?


If the drive goes down and the database can't read/write to temp_tablespaces, 
what will happen?


If I then configure temp_tablespaces to point to a working location, would
that be enough to start the cluster? Or could other bad things happen?


Can't find any related documentation, but I expect loss of "temp" space is of 
minor importance.



David G. Johnston wrote:


You might want to try finding some old discussions about why putting temp
tablespace on a RAM-drive is not a supported configuration.


Thank you, I found the following:

[1] https://www.postgresql.org/docs/current/manage-ag-tablespaces.html
[2] 
https://www.postgresql.org/message-id/flat/ZR0P278MB0028A89FAA3E31E7F1514EF6D2F60%40ZR0P278MB0028.CHEP278.PROD.OUTLOOK.COM
[3] 
https://www.dbi-services.com/blog/can-i-put-my-temporary-tablespaces-on-a-ram-disk-with-postgresql/

At [1] is the standard documentation warning about tablespaces in general: 
"if you lose a tablespace (file deletion, disk failure, etc.), the 
database cluster might become unreadable or unable to start".


I believe this could be improved, especially with regard to
temp_tablespaces.


At [2] is a thread started by Daniel Westermann (CC'd) with lots of 
uncertainty in the air. Tom Lane (CC'd) mentions that as long as files are 
temporary (not supposed to be there after restart), it should be fine, but 
there might be additional issues with the directory disappearing after a 
restart.


At [3] is a blog post from Daniel, who started the previous thread. He removes
the directories, restarts the cluster, and things go OK.



I'm leaning towards doing it, i.e. creating a tablespace on the super-fast
local SSD and using it exclusively for temp_tablespaces. The queries my
database faces crunch TBs of data for many hours and write tons of
temporary data, and the local NVMe storage is a huge improvement over
the enterprise-storage volumes the VM is provided with (I believe they are
iSCSI-based underneath and bound by network latency).
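
Concretely, what I'm considering is something like this (the path and the
tablespace name are placeholders):

  -- create a tablespace on the local NVMe drive and use it only for
  -- temporary files
  CREATE TABLESPACE fast_temp LOCATION '/mnt/nvme/pg_temp';
  ALTER SYSTEM SET temp_tablespaces = 'fast_temp';
  SELECT pg_reload_conf();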


What if the NVMe drive fails?

The good scenario is that I will create a new tablespace at a new location 
and change temp_tablespaces to point there, and everything should be fine. 
Possibly without even a cluster restart.


The very bad scenario is that the cluster will crash and will need a
restart, but that will go sideways and will eventually require a restore
from backup or other hacks.


How likely would that be?

Thanks,
Dimitris





having temp_tablespaces on less reliable storage

2025-07-10 Thread Dimitrios Apostolou

Hello list,

I have a database split across many tablespaces, with temp_tablespaces
pointing to a separate, less reliable device (a single local NVMe drive).
How big is the risk that the cluster becomes unrecoverable after a crash?


If the drive goes down and the database can't read/write to 
temp_tablespaces, what will happen?


If I then configure temp_tablespaces to point to a working location, would
that be enough to start the cluster? Or could other bad things happen?


Can't find any related documentation, but I expect loss of "temp" space is 
of minor importance.


Thanks in advance,
Dimitris