Why is a hash join preferred when it does not fit in work_mem
Hello list,

I have a very simple NATURAL JOIN that does not fit in work_mem. Why does the query planner prefer a hash join that needs 361s, while with a sort operation and a merge join it takes only 13s?

The server is an old Mac Mini with a hard disk drive and only 4GB RAM. Postgres version info:

PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

The low work_mem and the disabled memoization are set on purpose, in order to simplify a complex query while reproducing the same problem that I experienced there. This is the simplest query I could come up with where the optimizer does not go for the faster merge join. From my point of view the merge join is clearly faster, because the hash table does not fit in memory and I expect a hash join to do a lot of random I/O. But the query planner does not see that, and increasing random_page_cost does not help either. In fact the opposite happens: the cost gap between the merge join and the hash join grows as I increase the random page cost!

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=184498.464..360606.257 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=15068 read=47434, temp read=56309 written=56309
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.040..1376.084 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared read=16459
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=184361.874..184361.875 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=15068 read=30975, temp written=43092
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=15068 read=30975
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=2 read=6
 Planning Time: 0.568 ms
 Execution Time: 361106.876 ms
(20 rows)

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
                                   QUERY PLAN
--------------------------------------------------------------------------------
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=5062.513..10866.313 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5343 read=66053, temp read=32621 written=32894
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
   ->  Materialize  (cost=609372.91..627887.88 rows=3702994 width=8) (actual time=5062.389..7392.640 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=33 read=16426, temp read=32621 written=32894
         ->  Sort  (cost=609372.91..618630.40 rows=3702994 width=8) (actual time=5062.378..6068.703 rows=3702994 loops=1)
               Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
               Sort Key: tasks_mm_workitems.workitem_n
               Sort Method: external merge  Disk: 65256kB
               Buffers: shared hit=33 read=16426, temp read=32621 written=32894
               ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.045..1177.202 rows=3702994 loops=1)
                     Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
                     Buffers: shared hit=33 read=16426
 Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.677 ms
 Execution Time: 13364.545 ms
(23 rows)

Thank you in advance,
Dim
Re: Why is a hash join preferred when it does not fit in work_mem
On Fri, 13 Jan 2023, David Rowley wrote:

> I'd expect reducing random_page_cost to make the Merge Join cheaper as
> that's where the Index Scan is. I'm not quite sure where you think the
> random I/O is coming from in a batched hash join.

Thanks for the feedback, indeed you are right! Decreasing random_page_cost to values way below the default makes the planner prefer the merge join! This seems strange to me.

Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here is how I understand things according to posts I've read, and classical algorithms:

+ The Hash Join is fastest when one side fits in work_mem. Then on one hand you have a hash table lookup (amortized O(1)) and on the other hand, if the table has M rows that do not fit in memory, you have sequential reads from the disk (given low fragmentation of the table or index files): for every row you read from the disk, you look up the key in the hash table. If the hash table does not fit in RAM then the cost becomes prohibitive: every lookup is a random access possibly hitting the disk. The total cost should be random_page_cost * M.

+ The Merge Join involves mostly sequential accesses if the disk files are not fragmented. It reads sequentially and in parallel from both tables, merging the results where the key matches. It requires on-disk sorting (because the tables don't fit in work_mem), but even this operation requires little disk seeking. A merge-sort algorithm might have a random access cost of logN * random_page_cost.

So I would expect an increased random_page_cost to benefit the Merge Join algorithm. And since my setup involves spinning disks, it does make sense to increase it.

> It would be interesting to see the same plans with SET track_io_timing =
> on; set. It's possible that there's less *actual* I/O going on with the
> Merge Join plan vs the Hash Join plan. Since we do buffered I/O, without
> track_io_timing, we don't know if the read buffers resulted in an actual
> disk read or a read from the kernel buffers.

The database has been VACUUM ANALYZEd first and is otherwise idle. Every query has been run twice, and I paste here only the 2nd run.
Slow Hash Join:

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
 Hash Join  (cost=121222.68..257633.01 rows=3702994 width=241) (actual time=145641.295..349682.387 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Inner Unique: true
   Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
   Buffers: shared hit=12121 read=50381, temp read=56309 written=56309
   I/O Timings: shared/local read=745.925, temp read=162199.307 write=172758.699
   ->  Seq Scan on public.tasks_mm_workitems  (cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.114..1401.896 rows=3702994 loops=1)
         Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
         Buffers: shared hit=65 read=16394
         I/O Timings: shared/local read=183.959
   ->  Hash  (cost=59780.19..59780.19 rows=1373719 width=237) (actual time=145344.555..145344.557 rows=1373737 loops=1)
         Output: workitem_ids.workitem_id, workitem_ids.workitem_n
         Buckets: 4096  Batches: 512  Memory Usage: 759kB
         Buffers: shared hit=12056 read=33987, temp written=43092
         I/O Timings: shared/local read=561.966, temp write=142221.740
         ->  Seq Scan on public.workitem_ids  (cost=0.00..59780.19 rows=1373719 width=237) (actual time=0.033..1493.652 rows=1373737 loops=1)
               Output: workitem_ids.workitem_id, workitem_ids.workitem_n
               Buffers: shared hit=12056 read=33987
               I/O Timings: shared/local read=561.966
 Settings: effective_cache_size = '500MB', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
 Planning:
   Buffers: shared hit=8
 Planning Time: 0.693 ms
 Execution Time: 350290.496 ms
(24 rows)

Fast Merge Join:

# SET enable_hashjoin TO off;
SET
# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM tasks_mm_workitems NATURAL JOIN workitem_ids;
 Merge Join  (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=4602.623..9700.435 rows=3702994 loops=1)
   Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
   Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
   Buffers: shared hit=5310 read=66086, temp read=32621 written=32894
   I/O Timings: shared/local read=566.121, temp read=228.063 write=526.739
   ->  Index Scan using workitem_ids_pkey on public.workitem_ids  (cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.034..1080.800 rows=1373737 loops=1)
         Output: workitem_ids.workitem_n, workitem_ids.workitem_id
         Buffers: shared hit=5310 read=49627
         I/O Timings: shared/local read=448.952
   ->  Materialize  (cost=609372.9
Re: Why is a hash join preferred when it does not fit in work_mem
On Sat, 14 Jan 2023, Tom Lane wrote:

> Dimitrios Apostolou writes:
> > Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but
> > here is how I understand things according to posts I've read, and
> > classical algorithms:
> >
> > + The Hash Join is fastest when one side fits in work_mem. Then on one
> > hand you have a hash table lookup (amortized O(1)) and on the other
> > hand, if the table has M rows that do not fit in memory, you have
> > sequential reads from the disk (given low fragmentation of the table or
> > index files): For every line you read from the disk, you look up the
> > key in the hash table. If the hash table does not fit in RAM then the
> > cost becomes prohibitive. Every lookup is a random access possibly
> > hitting the disk. The total cost should be random_page_cost * M.
>
> That would be true of a simple hash join, but Postgres uses batched hash
> joins: we split up the hash key space into subsets, where hopefully each
> subset includes few enough inner-side rows to fit into work_mem. While
> this can go wrong given pathological distribution of the inner-side
> keys, it does mean that the join can perform well even when the inner
> side is much larger than work_mem. So it's not the case that the planner
> will simply disregard hash joins beyond work_mem. It will apply a cost
> penalty for the predicted batching overhead;

Thanks for this, I found a page [1] that describes the hash join and now I understand a bit more.

[1] https://www.interdb.jp/pg/pgsql03.html

I'm not sure whether the key distribution is pathological in my case. The join condition is:

  Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)

and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and the PRIMARY KEY. The table workitem_ids has 1.7M rows, and the other table has 3.7M rows. Neither of them fits in work_mem. In my (simplified and pathological) case of work_mem == 1MB, the hash join does 512 batches (Buckets: 4096  Batches: 512  Memory Usage: 759kB). I'm not sure which hash join strategy is followed, but based on that document, it should be the "hybrid hash join with skew". I don't quite follow the I/O requirements of this algorithm, yet. :-)

> but that can still come out cheaper than merge join, because the sorting
> needed for merge is generally also far from cheap.

> > I was under the impression that on-disk merge-sort is a relatively
> > cheap (logN) operation, regarding random I/O. So I would expect an
> > increased random_page_cost to benefit the Merge Join algorithm. And
> > since my setup involves spinning disks, it does make sense to increase
> > it.
>
> What is probably really happening is that random_page_cost affects the
> estimated cost of performing the sort using an index scan instead of a
> bespoke sort step. AFAIR, cost_sort doesn't consider random_page_cost at
> all, and neither does cost_hashjoin.

On the last EXPLAIN I posted for the forced merge join, I see that it uses an index scan on the "small" table. That makes sense since the join happens on the primary key of that table. On the large table it does not use an index scan, because an index doesn't exist for that column; it sorts the 3.7M rows of the table (and FWIW that table only has two integer columns). That is, if I understood correctly what you meant by "performing the sort using an index scan".

The problem I see is that the estimated cost of the sort operation is 609372.91..618630.40, which is already way above the cost of the whole hash join (121222.68..257633.01). The real timings, however, are very different: the actual time for the Sort is 4602.569..5414.072 ms, while for the whole hash join it is 145641.295..349682.387 ms.

Am I missing some configuration knobs to put some sense into the planner?

Thanks,
Dimitris
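For completeness, all the knobs touched in this thread can be toggled per session for experimentation. A rough sketch, with purely illustrative values (not recommendations):

  -- session-local experiments; RESET ALL or reconnect to undo
  SET track_io_timing = on;          -- make EXPLAIN (ANALYZE, BUFFERS) show I/O Timings
  SET work_mem = '512MB';            -- a larger hash area means fewer batches
  SET hash_mem_multiplier = 2.0;     -- or grant extra memory to hash tables only
  SET random_page_cost = 1.1;        -- affects index-scan costing, not cost_sort/cost_hashjoin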
Re: Why is a hash join preferred when it does not fit in work_mem
Hello again, I am back with new experiments.

First of all, I have a concrete set of steps that replicate the slowness of the hash join that I described. If you have a system with spinning disks lying around, I would appreciate it if you could verify the scenario. Can you also replicate it on different kinds of systems?

CREATE TABLE descriptions (description_id serial PRIMARY KEY, description text);

INSERT INTO descriptions (description_id, description)
    SELECT s, repeat(encode(sha512(s::text::bytea), 'hex'), 4)
    FROM generate_series(0,1200300) AS s;

CREATE TABLE descriptions_in_books (description_id integer REFERENCES descriptions(description_id), book_id integer);

INSERT INTO descriptions_in_books (description_id, book_id)
    SELECT s % 1200300, s
    FROM generate_series(0,5200300) AS s;

SET work_mem TO '1MB';
SET hash_mem_multiplier = 1.0;
SET track_io_timing TO on;

EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;
SET enable_hashjoin TO off;
EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM descriptions NATURAL JOIN descriptions_in_books;

The first JOIN query uses a hash join with Batches: 1024 and takes 622s! For the longest part, I can see the disk writing most of the time at around 1-2 MB/s, so I have to assume it's not writing sequentially. The second, identical JOIN uses a merge join that completes in 14s. The I/O happens at a much higher rate (maybe 10x), so I'm assuming it's mostly sequential.

Another observation is that the hash join deteriorates as the length of the TEXT column grows. In fact, if I fill it with only 32-character strings, then the hash join is split into only 128 batches, and it completes almost as fast as the merge join.

Could it be that the cost estimation is underestimating the I/O pattern related to splitting in batches?
Here are the measurements:

 Hash Join  (cost=192450.84..401456.02 rows=5200486 width=524) (actual time=344516.004..621725.562 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Inner Unique: true
   Hash Cond: (descriptions_in_books.description_id = descriptions.description_id)
   Buffers: shared hit=15586 read=93161, temp read=97829 written=97829
   I/O Timings: shared/local read=1402.597, temp read=229252.170 write=371508.313
   ->  Seq Scan on public.descriptions_in_books  (cost=0.00..75015.86 rows=5200486 width=8) (actual time=0.068..1819.629 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=32 read=22979
         I/O Timings: shared/local read=249.910
   ->  Hash  (cost=97739.04..97739.04 rows=1200304 width=520) (actual time=343268.470..343268.471 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buckets: 2048  Batches: 1024  Memory Usage: 686kB
         Buffers: shared hit=15554 read=70182, temp written=78538
         I/O Timings: shared/local read=1152.687, temp write=338883.205
         ->  Seq Scan on public.descriptions  (cost=0.00..97739.04 rows=1200304 width=520) (actual time=0.028..2278.791 rows=1200301 loops=1)
               Output: descriptions.description_id, descriptions.description
               Buffers: shared hit=15554 read=70182
               I/O Timings: shared/local read=1152.687
 Settings: hash_mem_multiplier = '1', work_mem = '1MB'
 Planning Time: 0.303 ms
 Execution Time: 622495.279 ms
(22 rows)

SET enable_hashjoin TO off;

 Merge Join  (cost=868411.87..1079330.96 rows=5200301 width=524) (actual time=6091.932..13304.924 rows=5200301 loops=1)
   Output: descriptions.description_id, descriptions.description, descriptions_in_books.book_id
   Merge Cond: (descriptions.description_id = descriptions_in_books.description_id)
   Buffers: shared hit=67 read=111962 written=1, temp read=45806 written=46189
   I/O Timings: shared/local read=1007.043 write=28.575, temp read=344.937 write=794.483
   ->  Index Scan using descriptions_pkey on public.descriptions  (cost=0.43..116919.99 rows=1200304 width=520) (actual time=0.028..1596.387 rows=1200301 loops=1)
         Output: descriptions.description_id, descriptions.description
         Buffers: shared hit=3 read=89015 written=1
         I/O Timings: shared/local read=834.732 write=28.575
   ->  Materialize  (cost=868408.84..894410.35 rows=5200301 width=8) (actual time=6091.892..9171.796 rows=5200301 loops=1)
         Output: descriptions_in_books.book_id, descriptions_in_books.description_id
         Buffers: shared hit=64 read=22947, temp read=45806 written=46189
         I/O Timings: shared/local read=172.311, temp read=344.937 write=794.483
         ->  Sort  (cost=868408.84..881409.60 rows=5200301 width=8) (actual time=6091.885..7392.828 rows=5200301 loops=1)
               Output: descriptions_in_books.book_id, descriptions_in_books.description_id
               Sort Key: description
SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
Hello list,

I run a very simple SELECT on a huge table (350M rows, 10 columns of type integer or timestamp). The server is an old Mac Mini with 4GB RAM and a slow HDD. All caches are cold, either via reboot, or by restarting postgres and using the `purge` command to flush the OS cache. The question is why this simple query takes so long to complete. Most of the time I see constant 30MB/s read I/O. The memory usage is low and the system is not swapping or in other ways unhealthy. Postgres is not serving any other requests.

Is postgres reading the whole table sequentially, just to return 10 rows in no particular order? I'm trying to understand what PostgreSQL is doing and the data access patterns involved. All feedback is appreciated.

# SELECT version();
PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS) SELECT * FROM test_runs_raw LIMIT 10;
 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared read=2334526
   I/O Timings: shared/local read=691137.029
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared read=2334526
         I/O Timings: shared/local read=691137.029
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning:
   Buffers: shared hit=69 read=18
   I/O Timings: shared/local read=265.550
 Planning Time: 271.719 ms
 Execution Time: 746057.424 ms
(14 rows)

Repeating the query returns instantaneously, and EXPLAIN shows shared hit=14 and no reads.

Changing the LIMIT to 100 takes another 3min to respond, with constant read I/O while it runs:

# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS) SELECT * FROM test_runs_raw LIMIT 100;
 Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 rows=100 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=14 read=772427
   I/O Timings: shared/local read=171484.072
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=0.029..184692.664 rows=100 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=14 read=772427
         I/O Timings: shared/local read=171484.072
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.083 ms
 Execution Time: 184692.748 ms
(11 rows)

Rerunning this is instantaneous again.

Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min

Bonus question: some time ago the server crashed and rebooted, and after a long WAL recovery and some hours of idle time, I ran the same query. It took even longer to respond, and besides reading there was also a lot of writing involved. What kind of writes was postgres doing?
# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS) SELECT * FROM test_runs_raw LIMIT 10;
 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=1284114.810..1284114.819 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
   I/O Timings: shared/local read=104.264 write=107258.150
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
         I/O Timings: shared/local read=104.264 write=107258.150
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.545 ms
 Execution Time: 1284114.972 ms
(11 rows)

Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd

Thanks in advance,
Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
I managed to double-paste two different subsets of my settings. FWIW all of them are true for all the experiments. Sorry for the confusion.

On Tue, 31 Jan 2023, Dimitrios Apostolou wrote:

> Other postgres settings:
>
> shared_buffers = 1024MB
> effective_io_concurrency = 0
> wal_compression = zstd
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> checkpoint_timeout = 15min

[...]

> Relevant settings:
>
> synchronous_commit = off
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> track_io_timing = on
> track_wal_io_timing = on
> wal_compression = zstd
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
On Tue, 31 Jan 2023, David G. Johnston wrote:

> On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou wrote:
> >   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
> >         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
> >         Buffers: shared read=2334526
> >         I/O Timings: shared/local read=691137.029
>
> The system has to return 10 live rows to you. If it needs to search
> through that many buffers to find 10 live rows you most likely have a
> large bloating problem going on. Seems like it is time to vacuum full.

I looked up how to measure bloat, so I ran the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;

 current_database | schemaname |   tablename   | tbloat | wastedbytes |             iname             | ibloat | wastedibytes
------------------+------------+---------------+--------+-------------+-------------------------------+--------+--------------
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey            |    1.0 |            0
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n  |    0.3 |            0
(2 rows)

Is this bloat even affecting queries that do not use the index? It seems I have to add VACUUM FULL to the nightly maintenance. I had run some schema restructuring (several ADD COLUMN followed by UPDATE ... SET on all rows) some days ago, and I was not aware this degraded the table.

Thanks for the useful info!
Dimitris
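As a quicker cross-check than the bloat query, the cumulative statistics views also expose dead-tuple counts directly. A minimal example using only standard views (nothing specific to this schema):

  SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
  ORDER BY n_dead_tup DESC
  LIMIT 5;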
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
Thanks for the insight on the internals. Regarding your questions:

On Tue, 31 Jan 2023, Tom Lane wrote:

> Do you get the same 10 rows when you repeat the command?

Yes. Just tested with both cold and hot caches. The first 10 rows are exactly the same, whether they return slowly or immediately.

> If turning synchronize_seqscans off changes the behavior, that'd be a
> good clue that this is the right theory.

Turning it off makes the query slow no matter how many times I re-run it. The system is doing lots of read I/O with both hot and cold caches. Here is the EXPLAIN output from the hot cache run (that previously had only 14 hits and no reads):

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=64 read=2334462
   I/O Timings: shared/local read=567846.559
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=620510.800..620510.804 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=64 read=2334462
         I/O Timings: shared/local read=567846.559
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.099 ms
 Execution Time: 620510.855 ms

After reading the docs, I'm surprised this setting affects my case, given that I have no parallelism in my setup.

> As for a real fix, it might be time for a VACUUM FULL or CLUSTER on that
> table.

Regarding CLUSTER, would it help with a seqscan on a bloated table? Furthermore, given that the table is expected to grow every day by a few million rows, do you suggest running CLUSTER every night? Will postgres remember that the rows up to N are clustered, even after appending more rows?

Dimitris
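For reference, and as far as I understand from the docs, clustering this table on its primary key would look roughly like this (index name taken from the bloat output above):

  CLUSTER public.test_runs_raw USING test_runs_raw_pkey;
  -- PostgreSQL records the clustering index (pg_index.indisclustered),
  -- so a later plain "CLUSTER public.test_runs_raw;" reuses it;
  -- rows appended afterwards are not kept in that order, though.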
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
On Tue, 31 Jan 2023, David G. Johnston wrote:

> It feels like there is room for improvement here using table statistics
> and the visibility map to significantly reduce the number of pages
> retrieved that turn out to be all dead.

Yes, I too wish postgres was more clever with a simple unordered query. I should also mention that this table has two btree indices, one primary key index and one on another column. Would it be feasible to "descend" an index in order to immediately find some live data?

> A limit without an order is not a query to optimize for, of course...

A good reason is that the web interface I'm using to browse my tables (and I guess many others too) does a simple limited+unordered SELECT when you click on a table. Timing out was what made me look deeper; I couldn't accept that, no matter how slow my system is, it has trouble returning a few rows. It is also the first query somebody will type in psql to interactively get a clue about the contents of a table.

Regards,
Dimitris
VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)
Thanks everyone for the help.

I have now run a plain VACUUM but it didn't change anything; the simple SELECT is still slow. My understanding from reading the docs is that it should reclaim all unused space, just not return it to the OS, which is fine by me. Any idea why it failed to reclaim the space in my case?

I'm now running VACUUM FULL as everyone suggested. I had tried plain VACUUM first because I was curious whether it would work, and because it doesn't lock the table with an operation that takes hours on my setup.

Thanks,
Dimitris
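For reference, the two commands being compared here:

  VACUUM test_runs_raw;        -- what I ran first: marks dead space reusable by future inserts,
                               -- returns space to the OS only from the very end of the table
  VACUUM FULL test_runs_raw;   -- what is running now: rewrites the table into new files,
                               -- shrinking it on disk, but holds an ACCESS EXCLUSIVE lock throughout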
Moving data from huge table slow, min() query on indexed column taking 38s
I have a huge (10G rows) table named test_runs, below called "tbl_legacy", that I want to PARTITION BY RANGE on an indexed integer column. I have created the new partitioned table "tbl" and 1000 partitions in it, so that it can take my existing data and have partitions ready for the next year (so most of the new partitions will be empty for now). On the topic of bulk-transferring the data from tbl_legacy to tbl:

+ First I chose the method of DELETE...INSERTing everything. I noticed that the postgres process was growing without bounds, up to the point that it consumed almost all of the swap space and I had to kill the command. Any idea why that happens?

+ Then I broke the command above into many small chunks:

    WITH rows AS (
        DELETE FROM tbl_legacy AS t
            WHERE (partition_key_column < $1)
            RETURNING t.*
    )  INSERT INTO tbl SELECT * FROM rows;
    COMMIT;

  I increase the parameter $1 and keep going in a loop. At first this goes OK, but after one day I notice that it has slowed down significantly. My investigation shows that the simple query

    SELECT min(partition_key_column) FROM tbl_legacy;

  takes 38s, despite the index on the column! A VACUUM fixes that, so I guess the index has a ton of dead tuples, and I guess autovacuum does not do its job because the table is constantly busy. Unfortunately VACUUM takes long (several hours) on this huge table, so I can't add it to the loop after the DELETE command. Is there a better way to avoid the bloat in the loop above? Why can't the DELETE command update the index by pointing the minimum element beyond the dead tuples?

+ Finally I resorted to just copying the data:

    INSERT INTO tbl SELECT * FROM tbl_legacy;

  This took half a day or so, but worked without blowing up the memory usage like the 1st query. Why is that?

By the way, I don't consider the last solution ideal, since the data is duplicated between the two tables and it's harder to implement workarounds to do it without taking downtime. But it worked for me this time.

Any other ideas for partitioning an existing huge table?

Thanks in advance,
Dimitris
Re: Moving data from huge table slow, min() query on indexed column taking 38s
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:

> + First I chose the method to DELETE...INSERT everything.

Just to clarify, the query looked more like:

    WITH rows AS (
        DELETE FROM tbl_legacy RETURNING *
    )  INSERT INTO tbl SELECT * FROM rows;

> I noticed that the postgres process was growing without bounds, up to
> the point that it consumed almost all of swap space and I had to kill
> the command. Any idea why that happens?

Also note that my settings for work_mem, temp_buffers, shared_buffers etc. are all well below the RAM size, and postgres has never shown unbounded growth again. Postgres version is 15.2.

Dimitris
Re: Moving data from huge table slow, min() query on indexed column taking 38s
Thanks! I have completed the transfer by taking down the database and working exclusively on it, but I still wonder how one could do it in production without exclusive locks. The loop with small DELETE...INSERT transactions I posted in the parent post bloats the table fast.

The link you posted contains very useful info; I was not expecting that the constraints could blow up the memory like that. Comments from me:

Disabling and then re-enabling the foreign key constraints is easily done with ALTER TABLE. Unfortunately the same doesn't seem to be possible for indices: one has to create the table without indices, and then create the indices separately. With such a process there is a risk of ending up with non-identical table schemas.

By the way, with COPY one must use an intermediate file, right? There is no way to COPY from table to table directly? (A sketch of what I mean follows at the end of this message.)

Thanks,
Dimitris

On Thu, 6 Jul 2023, Lorusso Domenico wrote:

> Hello, this is a standard problem during bulk copy. Here some
> suggestions; for example disable indexes.
> The main issue is related to index, lock escalation and log writing. In
> other dbms you should set log off on the table, but postgresql does not
> seem to have this feature. Anyway, using an explicit lock table
> exclusive should prevent lock escalation.
> So:
> disable indexes in target table
> lock exclusive both table
> insert data
> truncate old table
>
> If this doesn't work you can consider using the copy command.
>
> Il giorno gio 6 lug 2023 alle ore 18:12 Dimitrios Apostolou ha scritto:
> > On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:
> > > + First I chose the method to DELETE...INSERT everything.
> >
> > Just to clarify, the query looked more like
> >
> >     WITH rows AS ( DELETE FROM tbl_legacy RETURNING * )
> >       INSERT INTO tbl SELECT * FROM rows;
> >
> > > I noticed that the postgres process was growing without bounds, up
> > > to the point that it consumed almost all of swap space and I had to
> > > kill the command. Any idea why that happens?
> >
> > Also note that my settings for work_mem, temp_buffers, shared_buffers
> > etc are all well below the RAM size and postgres has never shown
> > unbound growth again. Postgres version is 15.2.
> >
> > Dimitris
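Regarding the COPY question above: as far as I can tell there is no single COPY statement that goes from table to table, but the intermediate file can be avoided by piping two \copy commands through psql. A rough, untested sketch ("mydb" is a placeholder):

  psql -d mydb -c "\copy (SELECT * FROM tbl_legacy) TO STDOUT" \
    | psql -d mydb -c "\copy tbl FROM STDIN"

Within a single database, a plain INSERT INTO tbl SELECT * FROM tbl_legacy; does the same job, which is what I ended up using.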
Re: Moving data from huge table slow, min() query on indexed column taking 38s
On Thu, 6 Jul 2023, Dimitrios Apostolou wrote:

> + Then I broke the command above in many small chunks:
>
>     WITH rows AS (
>         DELETE FROM tbl_legacy AS t
>             WHERE (partition_key_column < $1)
>             RETURNING t.*
>     )  INSERT INTO tbl SELECT * FROM rows;
>     COMMIT;
>
> I increase the parameter $1 and keep going in a loop. At first this goes
> OK, after one day though I notice that it has slowed down significantly.
> My investigation shows that the simple query
>
>     SELECT min(partition_key_column) FROM tbl_legacy;
>
> takes 38s, despite having an index on the column! A VACUUM fixes that,
> so I guess the index has a ton of dead tuples. I guess autovacuum does
> not do its job because the table is constantly busy. Unfortunately
> VACUUM takes long (several hours) on this huge table, so I can't add it
> to the loop after the DELETE command.
>
> Is there a better way to avoid the bloat in the loop above? Why can't
> the DELETE command update the index by pointing the minimum element
> beyond the dead tuples?

Any comments on this one? It annoys me that a simple loop deteriorated so much and kept filling the table with bloat. What is it that VACUUM does that DELETE can't do, to keep the index fresh?
Re: Partitionwise JOIN scanning all partitions, even unneeded ones
Thank you for the clear explanation, and I hope the missing optimisation gets implemented sooner rather than later. Maybe the query planner should take the missing optimisation into account and ban *hash* partitionwise joins. Indeed I verified that disabling hash joins fixed the situation, with both queries taking almost the same time. Great!

In the meantime, I'm considering disabling hash joins globally, as I've had issues with them before (excessive I/O causing much slower execution than a merge join; see [1] for your answer to my question back then :). Do you think that would save me from other problems I'm not aware of, given the huge size of the table I'm querying?

[1] https://www.postgresql.org/message-id/caaphdvppvydonkeqlybsbjwq8kq8m7ywdka44rtea2mnao3...@mail.gmail.com

Regards,
Dimitris
BRIN index maintenance on table without primary key
Hello list.

Key characteristics of my case:

+ HUGE table with 20G rows, ca 2TB
+ May be partitioned (I have both versions on two test clusters ATM)
+ Plenty of inserts, no updates
+ No primary key - we had one IDENTITY bigint column until recently, but it proved useless and inefficient (index too big for memory) so we deleted it.
+ All queries are filtering on a not-unique, not-null integer column.
+ On this column we have a BRIN INDEX, since insert order is /mostly/ incremental.

So the question is: how to maintain the physical order of the tuples? Even though the insertions populate the index column /mostly/ incrementally, there are outliers, and it's certain that over time the BRIN index will deteriorate. I'm monitoring the "correlation" (a query for that is sketched at the end of this message) and I want to run a command to "defrag" it when it drops below 0.9.

+ Can't run CLUSTER:

    ERROR:  cannot cluster on index "tst_brin_idx" because access method does not support clustering

+ Can't run pg_repack, from [1]:

    > Target table must have a PRIMARY KEY, or at least a UNIQUE total index on a NOT NULL column

[1] https://reorg.github.io/pg_repack/

Any ideas?

Thank you in advance,
Dimitris
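For reference, the correlation I monitor comes from the planner statistics; roughly like this, with placeholder table and column names:

  SELECT tablename, attname, correlation
  FROM pg_stats
  WHERE tablename = 'my_huge_table' AND attname = 'my_brin_column';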
Re: BRIN index maintenance on table without primary key
On Fri, 27 Oct 2023, Dimitrios Apostolou wrote:

> So the question is: how to maintain the physical order of the tuples?

Answering to myself, there doesn't seem to be any way to run pg_repack on a table without a UNIQUE key. To run CLUSTER, the only way I see is to (sketched below):

1. Create a btree index on the same column that has the BRIN index
2. CLUSTER
3. Drop the index

This should take very long on my huge table, and it keeps the table exclusively locked. The disk space needed would also be huge, if the table isn't partitioned.

I wonder why CLUSTER can't run based on a BRIN-indexed column. Is it theoretically impossible, or is it just not implemented yet? My understanding so far is that CLUSTER only rewrites the table *according to the index order* and does not touch the index itself. For a BRIN index though it would need to rewrite the table *ignoring the index* and then rewrite the index too, in order to keep the ranges fully optimized. So the logic is very different, and maybe closer to what VACUUM FULL does.

Thanks,
Dimitris
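A sketch of that btree detour, with placeholder table/column/index names:

  -- 1. build a throwaway btree on the BRIN-indexed column
  --    (CONCURRENTLY so that inserts are not blocked while it builds)
  CREATE INDEX CONCURRENTLY tmp_btree_idx ON my_huge_table (my_brin_column);
  -- 2. rewrite the table in index order;
  --    takes an ACCESS EXCLUSIVE lock and roughly double the disk space
  CLUSTER my_huge_table USING tmp_btree_idx;
  -- 3. the btree is only needed for the CLUSTER itself
  DROP INDEX tmp_btree_idx;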
Inefficient query plan for SELECT ... EXCEPT ...
Hello list,

I'm getting an inefficient query plan for a SELECT ... EXCEPT ... query, where the left side is a very short table (even zero-length sometimes, though rarely it can be as long as 200K rows), and the right side is a table with 10M UNIQUE NOT NULL rows:

\d test_datatags
                              Table "public.test_datatags"
     Column     |  Type   | Collation | Nullable |             Default
----------------+---------+-----------+----------+-----------------------------------
 test_datatag_n | integer |           | not null | generated by default as identity
 test_datatag   | text    |           | not null |
Indexes:
    "test_datatags_pkey" PRIMARY KEY, btree (test_datatag_n)
    "test_datatags_test_datatag_key" UNIQUE CONSTRAINT, btree (test_datatag)

What follows is a simplified and pathological case, which takes 1.7s even though the left side is empty:

BEGIN;
CREATE TEMPORARY TABLE tmp_table(d1 TEXT) ON COMMIT DROP;
ANALYZE VERBOSE tmp_table;
INFO:  analyzing "pg_temp_9.tmp_table"
INFO:  "tmp_table": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows

EXPLAIN (ANALYSE,VERBOSE,BUFFERS,SETTINGS)
    SELECT DISTINCT d1 FROM tmp_table WHERE d1 IS NOT NULL
    EXCEPT
    SELECT test_datatag FROM test_datatags;
                              QUERY PLAN
--------------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..299854.08 rows=1 width=36) (actual time=1726.470..1726.472 rows=0 loops=1)
   Output: "*SELECT* 1".d1, (0)
   Buffers: shared hit=1259 read=58800
   I/O Timings: shared/local read=77.713
   ->  Append  (cost=0.00..278054.53 rows=8719821 width=36) (actual time=3.754..1287.901 rows=8702840 loops=1)
         Buffers: shared hit=1259 read=58800
         I/O Timings: shared/local read=77.713
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..0.02 rows=1 width=36) (actual time=0.003..0.003 rows=0 loops=1)
               Output: "*SELECT* 1".d1, 0
               ->  HashAggregate  (cost=0.00..0.01 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=1)
                     Output: tmp_table.d1
                     Group Key: tmp_table.d1
                     Batches: 1  Memory Usage: 24kB
                     ->  Seq Scan on pg_temp.tmp_table  (cost=0.00..0.00 rows=1 width=32) (actual time=0.001..0.001 rows=0 loops=1)
                           Output: tmp_table.d1
                           Filter: (tmp_table.d1 IS NOT NULL)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..234455.40 rows=8719820 width=26) (actual time=3.751..943.850 rows=8702840 loops=1)
               Output: "*SELECT* 2".test_datatag, 1
               Buffers: shared hit=1259 read=58800
               I/O Timings: shared/local read=77.713
               ->  Seq Scan on public.test_datatags  (cost=0.00..147257.20 rows=8719820 width=22) (actual time=3.747..515.420 rows=8702840 loops=1)
                     Output: test_datatags.test_datatag
                     Buffers: shared hit=1259 read=58800
                     I/O Timings: shared/local read=77.713
 Settings: effective_io_concurrency = '0', enable_partitionwise_aggregate = 'on', enable_partitionwise_join = 'on', hash_mem_multiplier = '1', maintenance_io_concurrency = '0', max_parallel_workers = '12', max_parallel_workers_per_gather = '0', temp_buffers = '64MB'
 Planning:
   Buffers: shared hit=2
 Planning Time: 0.055 ms
 JIT:
   Functions: 15
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.317 ms, Inlining 0.000 ms, Optimization 0.179 ms, Emission 3.542 ms, Total 4.037 ms
 Execution Time: 1726.835 ms
(33 rows)

I'm wondering why the planner doesn't see that the left table is very small and follow a different path. From an abstract computer science POV, I would:

1. sort the left table (the right one is already indexed)
2. "merge" the two tables, by walking them in-order in parallel and excluding the matches
3. stop when the left table is exhausted, which would happen very early.

Is this worth a bug report? I can file one if the issue is not known. Or am I misunderstanding the implications of the SELECT-EXCEPT query?
In the meantime I have replaced the query with a LEFT OUTER JOIN, which performs much better and which I believe is equivalent (a sketch follows below). I find it less readable than the query in question though. Plus, I have a bunch of SELECT-EXCEPT queries (with smaller right-side tables) in my application that I would hate to change to the uglier equivalent.

Under what conditions would the above query plan perform well?

Thanks in advance,
Dimitris
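For reference, a sketch of such a LEFT OUTER JOIN rewrite, using the names from the example above (an illustration, not necessarily my exact production query):

  SELECT DISTINCT t.d1
  FROM tmp_table AS t
  LEFT OUTER JOIN test_datatags AS d ON d.test_datatag = t.d1
  WHERE t.d1 IS NOT NULL
    AND d.test_datatag IS NULL;   -- keep only values with no match on the right side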
Re: Inefficient query plan for SELECT ... EXCEPT ...
Thank you all for the answers, they covered the question well.

> > Is this worth a bug report? I can file one if the issue is not known.
>
> No. It's just a missing optimisation. We know about it.

It's good I shot an email first then. FWIW my usual way in other projects would be to check the bugtracker and just "follow" the relevant issue if it's minor, like a missing optimisation. I didn't find a way to search for "known issues" in the PostgreSQL project.

Dimitris
Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT
Hello list,

SELECT [DISTINCT] ... EXCEPT ...

In this query I get the same results regardless of including DISTINCT or not. But I get different query plans: I get an extra HashAggregate node in the case of SELECT DISTINCT. Any idea why? Reading the docs, I understand that postgres does EXCEPT DISTINCT by default, but I don't see the same for the 1st clause, SELECT [DISTINCT].

Steps to reproduce:

CREATE TABLE t1(i INTEGER NOT NULL);
CREATE TABLE t2(i INTEGER NOT NULL);
INSERT INTO t1 VALUES (1),(2),(1),(2),(3),(3),(3);
INSERT INTO t2 VALUES (3);

SELECT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
 i
---
 2
 1
(2 rows)

EXPLAIN SELECT * FROM t1 EXCEPT SELECT * FROM t2;
                               QUERY PLAN
-------------------------------------------------------------------------
 HashSetOp Except  (cost=0.00..160.25 rows=200 width=8)
   ->  Append  (cost=0.00..147.50 rows=5100 width=8)
         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(6 rows)

EXPLAIN SELECT DISTINCT * FROM t1 EXCEPT SELECT * FROM t2;
                               QUERY PLAN
-------------------------------------------------------------------------
 HashSetOp Except  (cost=41.88..127.50 rows=200 width=8)
   ->  Append  (cost=41.88..120.62 rows=2750 width=8)
         ->  Subquery Scan on "*SELECT* 1"  (cost=41.88..45.88 rows=200 width=8)
               ->  HashAggregate  (cost=41.88..43.88 rows=200 width=4)
                     Group Key: t1.i
                     ->  Seq Scan on t1  (cost=0.00..35.50 rows=2550 width=4)
         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..61.00 rows=2550 width=8)
               ->  Seq Scan on t2  (cost=0.00..35.50 rows=2550 width=4)
(8 rows)

Regards,
Dimitris
Re: Is "DISTINCT" making a diffence in: SELECT [DISTINCT] ... EXCEPT
On Wed, 15 Nov 2023, Erik Wienhold wrote:

> On 2023-11-15 10:57 +0100, Dimitrios Apostolou wrote:
> > SELECT [DISTINCT] ... EXCEPT ...
> > In this query I get the same results regardless of including DISTINCT
> > or not. But I get different query plans, I get an extra HashAggregate
> > node in the case of SELECT DISTINCT. Any idea why?
>
> As Tom Lane recently wrote[1], EXCEPT is not optimized and will operate
> on the subqueries, which are planned independently.
>
> [1] https://www.postgresql.org/message-id/2664450.1698799...@sss.pgh.pa.us

Heh, as he wrote to me even. :-) I just wanted to make sure that this is indeed a missing optimisation of the planner, and that the queries are effectively the same. Thank you for clarifying.

As mentioned, the docs don't make it clear if the SELECT DISTINCT part is implied or not; only the EXCEPT DISTINCT part is clearly on by default.
Orphan files filling root partition after crash
Hello list,

yesterday I was doing:

ALTER TABLE x ADD UNIQUE ... , ADD PRIMARY KEY ...;

The table is almost a billion rows long, but lies in its own TABLESPACE that has plenty of space. Apparently, though, the ALTER TABLE command writes a lot to the default tablespace (not to the temp tablespace; that is already moved to a different partition). That quickly filled up the 50GB of free space in my root partition:

20:18:04.222 UTC [94144] PANIC:  could not write to file "pg_wal/xlogtemp.94144": No space left on device
[...]
20:19:11.578 UTC [94140] LOG:  WAL writer process (PID 94144) was terminated by signal 6: Aborted
20:19:11.578 UTC [94140] LOG:  terminating any other active server processes

After postgresql crashed and restarted, the disk space in the root partition was still not freed! I believe this is because of "orphaned files", as discussed in mailing list thread [1].

[1] https://www.postgresql.org/message-id/CAN-RpxDBA7HbTsJPq4t4VznmRFJkssP2SNEMuG%3DoNJ%2B%3DsxLQew%40mail.gmail.com

I ended up doing some risky actions to remediate the problem: find the filenames that have no identically named "oid" in pg_class, and delete (move to a backup location) the biggest ones while the database is stopped. Fortunately the database started up fine after that!

So what is the moral of the story? How to guard against this? Why did the database write so much to the default tablespace, even when both the table and the temp tablespace are elsewhere? Also, should one always keep the default tablespace away from the WAL partition? (I guess it would have helped to avoid the crash, but the ALTER TABLE command would have still run out of space, so I'm not sure if the orphan files would have been avoided.)

Needless to say, I would have hoped the database cleaned up after itself even after an uncontrolled crash, or that it provided some tools for the job. (I tried VACUUM FULL on the table, but the orphaned files did not go away.)

My postgresql version is 16.2, installed on Ubuntu.

Thank you,
Dimitris
Re: Orphan files filling root partition after crash
Hi Sergey,

On Wed, 28 Feb 2024, Sergey Fukanchik wrote:

> Hi Dimitrios,
> Do you have wal archiving enabled? $PGDATA/pg_wal/ is a bit different
> from tablespaces (including the "default" one). It stores the
> transaction journal.

I don't think I have WAL archiving enabled. Here are the relevant WAL options in my config:

wal_compression = zstd
max_wal_size = 8GB
min_wal_size = 1GB
track_wal_io_timing = on
wal_writer_flush_after = 8MB

The issue happens because the WAL directory is by default in the same filesystem as the default tablespace (the root partition on Ubuntu). So when the default tablespace filled up because of my ALTER TABLE operation, there was no space left for WAL either.

Thanks,
Dimitris
Re: Orphan files filling root partition after crash
Thanks for the feedback Laurenz,

On Wed, 28 Feb 2024, Laurenz Albe wrote:

> On Wed, 2024-02-28 at 12:16 +0100, Dimitrios Apostolou wrote:
> > I ended up doing some risky actions to remediate the problem: Find the
> > filenames that have no identically named "oid" in pg_class, and delete
> > (move to backup) the biggest ones while the database is stopped.
> > Fortunately the database started up fine after that!
>
> Lucky you. It should have been "relfilenode" rather than "oid", and some
> catalog tables don't have their files listed in the catalog, because
> they are needed *before* the database can access tables.

I actually double-checked that the filenames don't appear anywhere in SELECT * FROM pg_class, and that the files were multi-GB in size, including all the 1GB pieces. But luck was definitely a part of the equation: I didn't know that the files might be accessed before tables (at db startup?) or that "relfilenode" would be more appropriate. Why is that, and where can I read more? I see that many (but not all) rows in pg_class have oid = relfilenode, but for many rows relfilenode = 0, which is meaningless as a filename. (A read-only query illustrating the mapping is sketched at the end of this message.)

> > So what is the moral of the story? How to guard against this?
>
> Monitor disk usage ...

It happened *fast*. And it was quite a big surprise coming from "just" a disk-full situation. A couple of suggestions; wouldn't it make sense:

+ for the index to be written by default to the table's tablespace?
+ for postgres to refuse to write non-WAL files, if they are on the same device as the WAL and less than max_wal_size bytes are free?

> The root of the problem is that you created the index in the default
> tablespace. You should have
>
>     ALTER TABLE x ADD UNIQUE ... USING INDEX TABLESPACE bigtblspc;

Thank you, I was reading the docs but didn't realize this syntax is valid; I thought it was only for CREATE/ALTER INDEX.

> > Needless to say, I would have hoped the database cleaned-up after
> > itself even after an uncontrolled crash, or that it provided some
> > tools for the job. (I tried VACUUM FULL on the table, but the orphaned
> > files did not go away.)
>
> That is not so simple... Also, it would slow down crash recovery.
> But I agree that it would be nice to have a tool that reports or cleans
> up orphaned files.
>
> Yours,
> Laurenz Albe
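For anyone following along, a rough, read-only way to look at the oid / relfilenode / on-disk-file relationship (standard catalog functions, safe to run on a live cluster):

  SELECT oid::regclass AS relation,
         relfilenode,
         pg_relation_filepath(oid::regclass) AS file_path
  FROM pg_class
  WHERE relkind IN ('r', 'i', 'm', 't')
  ORDER BY pg_relation_size(oid::regclass) DESC
  LIMIT 20;
  -- Mapped catalogs show relfilenode = 0, yet pg_relation_filepath() still
  -- returns their real file, which is why relfilenode (not oid) is what
  -- should be compared against file names on disk.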
SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions. Around half of the partitions are full and the rest are empty, created in advance ready to receive future incoming data. Postgres is 16.2. Here are the relevant parts of the schema:

\d test_runs_raw
                 Partitioned table "public.test_runs_raw"
      Column       |   Type   | Collation | Nullable |             Default
-------------------+----------+-----------+----------+----------------------------------
 run_n             | bigint   |           | not null | generated by default as identity
 test_case_n       | smallint |           | not null |
 workitem_n        | integer  |           | not null |
 test_resulttype_n | smallint |           |          |
Partition key: RANGE (workitem_n)
Indexes:
    "test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n, run_n)

Each partition is made to keep entries with workitem_n in ranges (0,20k), (20k,40k) and so on (k = kilo), up to 20000k.

PROBLEM

I noticed that the following query is very very slow (too long to wait for it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable is that in 998 out of the 1000 table scans it involves, the planner does not use the index. Instead it chooses a sequential scan. Here is the output from EXPLAIN:

 Limit  (cost=853891608.79..853891608.99 rows=10 width=4)
   ->  Unique  (cost=853891608.79..853891612.79 rows=200 width=4)
         ->  Sort  (cost=853891608.79..853891610.79 rows=800 width=4)
               Sort Key: test_runs_raw.workitem_n DESC
               ->  Gather  (cost=853891488.22..853891570.22 rows=800 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=853890488.22..853890490.22 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..813118117.30 rows=16308948365 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1597355.10 rows=33623320 width=4)
                                 ->  Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1210795.63 rows=25793672 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..3037793.12 rows=64121612 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2918875.90 rows=61612190 width=4)
                                 [ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs instantaneously:

 Limit  (cost=363.84..367.30 rows=10 width=4)
   ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998  (cost=0.12..2.34 rows=1 width=4)
         ->  Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k test_runs_raw_997  (cost=0.12..2.34 rows=1 width=4)
         [ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the partitions are scanned sequentially, *in the proper order*, so that the plan execution stops after reading the first 10 rows from the first non-empty partition. Why can't the same be done with DISTINCT?
Please note that the workitem_n values are well spread over the range (0,13M), and the table has been gradually filled within one year, so I'm assuming the autovacuum worker has worked long enough to build sane statistics (not sure how to verify that).

REMARKS

1. I tried reproducing the problem on an artificial table with few partitions and few values, but I couldn't. Both queries execute fast, and the planner always chooses a non-parallel index-only scan.

2. While testing changes to various settings, I just noticed that setting max_parallel_workers_per_gather to 0 (from the original value of 4) fixes the issue! On the original huge table, disabling parallelism actually makes the query infinitely faster, and it returns within 1s!

Is this a bug in the planner?

Thank you,
Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Dimitrios Apostolou wrote:

> I noticed that the following query is very very slow (too long to wait
> for it to finish):
>
> SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

Update: even the simplest SELECT DISTINCT query shows similar behaviour:

EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10;
 Limit  (cost=724518979.52..724518979.92 rows=10 width=4)
   ->  Unique  (cost=724518979.52..724518987.52 rows=200 width=4)
         ->  Sort  (cost=724518979.52..724518983.52 rows=1600 width=4)
               Sort Key: test_runs_raw.workitem_n
               ->  Gather  (cost=724518732.37..724518894.37 rows=1600 width=4)
                     Workers Planned: 4
                     ->  HashAggregate  (cost=724517732.37..724517734.37 rows=200 width=4)
                           Group Key: test_runs_raw.workitem_n
                           ->  Parallel Append  (cost=0.00..704131546.90 rows=8154474186 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480  (cost=0.57..1429238.50 rows=16811660 width=4)
                                 ->  Parallel Index Only Scan using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507  (cost=0.57..1081827.27 rows=12896836 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475  (cost=0.00..2717185.06 rows=32060806 width=4)
                                 ->  Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559  (cost=0.00..2610814.95 rows=30806095 width=4)

It also takes ages to return, so I have to interrupt it. I believe it should exit early, as soon as it finds 10 distinct values (which should be rather easy even with parallel seqscans, given the pattern followed when inserting the data).

Thanks,
Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Dimitrios Apostolou wrote:

> On Fri, 10 May 2024, Dimitrios Apostolou wrote:
> > Update: even the simplest SELECT DISTINCT query shows similar behaviour:

Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a cost much higher:

 Limit  (cost=363.84..1134528847.47 rows=10 width=4)
   ->  Unique  (cost=363.84..22690570036.41 rows=200 width=4)
         ->  Append  (cost=363.84..22527480551.58 rows=65235793929 width=4)
               ->  Index Only Scan using test_runs_raw__part_max20k_pkey on test_runs_raw__part_max20k test_runs_raw_1  (cost=0.12..2.34 rows=1 width=4)
               ->  Index Only Scan using test_runs_raw__part_max40k_pkey on test_runs_raw__part_max40k test_runs_raw_2  (cost=0.12..2.34 rows=1 width=4)
               [...]
               ->  Index Only Scan using test_runs_raw__part_max1780k_pkey on test_runs_raw__part_max1780k test_runs_raw_89  (cost=0.57..53587294.65 rows=106088160 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1800k_pkey on test_runs_raw__part_max1800k test_runs_raw_90  (cost=0.57..98943539.74 rows=96214080 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1820k_pkey on test_runs_raw__part_max1820k test_runs_raw_91  (cost=0.57..97495653.34 rows=193248960 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1840k_pkey on test_runs_raw__part_max1840k test_runs_raw_92  (cost=0.57..110205205.07 rows=218440928 width=4)
               ->  Index Only Scan using test_runs_raw__part_max1860k_pkey on test_runs_raw__part_max1860k test_runs_raw_93  (cost=0.57..50164056.28 rows=99431760 width=4)
               [...]

The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality.

Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Tom Lane wrote: Dimitrios Apostolou writes: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a cost much higher: Limit (cost=363.84..1134528847.47 rows=10 width=4) -> Unique (cost=363.84..22690570036.41 rows=200 width=4) -> Append (cost=363.84..22527480551.58 rows=65235793929 width=4) ... The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality. I'd say the blame lies with that (probably-default) estimate of just 200 distinct rows. That means the planner expects to have to read about 5% (10/200) of the tables to get the result, and that's making fast-start plans look bad. Indeed that's an awful estimate, the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuum'd or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs. In any case, even after the planner decides to execute the terrible plan with the parallel seqscans, why doesn't it finish right when it finds 10 distinct values? Possibly an explicit ANALYZE on the partitioned table would help. Thanks, I'll save the ANALYZE as the last step; I feel it's a good opportunity to figure out more details about how postgres works. Plus I expect ANALYZE to last a couple of days, so I should first find quiet time for that. :-) Dimitris
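One way to check whether the partitioned parent itself (as opposed to its leaf partitions) has ever been analyzed is to look at the timestamps in pg_stat_user_tables. A minimal sketch, assuming the table names from the plans above:

SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'test_runs_raw'
   OR relname LIKE 'test_runs_raw__part_%'
ORDER BY relname;

A NULL last_analyze and last_autoanalyze on the parent row would match the missing ndistinct estimate described above.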
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:33, Tom Lane wrote: I do kind of wonder why it's producing both a hashagg and a Unique step --- seems like it should do one or the other. It still needs to make the duplicate groups from parallel workers unique. Range partitioning of the table guarantees that, since the ranges are not overlapping. Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote: Indeed that's an awful estimate, the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuum'd or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs. It's not the partitions getting analyzed you need to worry about for an ndistinct estimate on the partitioned table. It's auto-analyze or ANALYZE on the partitioned table itself that you should care about. If you look at [1], it says "Tuples changed in partitions and inheritance children do not trigger analyze on the parent table." Thanks In any case, even after the planner decides to execute the terrible plan with the parallel seqscans, why doesn't it finish right when it finds 10 distinct values? It will. It's just that Sorting requires fetching everything from its subnode. Isn't it plain wrong to have a sort step in the plan then? The different partitions contain different value ranges with no overlap, and the last query I posted doesn't even contain an ORDER BY clause, just a DISTINCT clause on an indexed column. Even with bad estimates, even with seq scan instead of index scan, the plan should be such that it concludes all parallel work as soon as it finds the 10 distinct values. And this is actually achieved if I disable parallel plans. Could it be a bug in the parallel plan generation? Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Mon, 13 May 2024, Dimitrios Apostolou wrote: On Sat, 11 May 2024, David Rowley wrote: On Sat, 11 May 2024 at 13:11, Dimitrios Apostolou wrote: Indeed that's an awful estimate, the table has more than 1M unique values in that column. Looking into pg_stat_user_tables, I can't see the partitions having been vacuum'd or analyzed at all. I think they should have been auto-analyzed, since they get a ton of INSERTs (no deletes/updates though) and I have the default autovacuum settings. Could it be that autovacuum starts, but never finishes? I can't find anything in the logs. It's not the partitions getting analyzed you need to worry about for an ndistinct estimate on the partitioned table. It's auto-analyze or ANALYZE on the partitioned table itself that you should care about. If you look at [1], it says "Tuples changed in partitions and inheritance children do not trigger analyze on the parent table." Thanks Do I read that correctly, that I have to set up cron jobs to manually analyze partitioned tables? Dimitris
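If that reading is right, a scheduled job is the usual stopgap. A minimal sketch, assuming cron and a placeholder database name; pg_cron or any other scheduler would work just as well:

# crontab entry: analyze the partitioned parent every Sunday at 03:00
0 3 * * 0  psql -d mydb -c 'ANALYZE public.test_runs_raw;'

Only the parent needs this manual treatment for its inherited statistics; the leaf partitions are still covered by autovacuum's auto-analyze.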
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote: On Tue, 14 May 2024 at 00:41, Dimitrios Apostolou wrote: On Sat, 11 May 2024, David Rowley wrote: It will. It's just that Sorting requires fetching everything from its subnode. Isn't it plain wrong to have a sort step in the plan than? The different partitions contain different value ranges with no overlap, and the last query I posted doesn't even contain an ORDER BY clause, just a DISTINCT clause on an indexed column. The query does contain an ORDER BY, so if the index is not chosen to provide pre-sorted input, then something has to put the results in the correct order before the LIMIT is applied. The last query I tried was: SELECT DISTINCT workitem_n FROM test_runs_raw LIMIT 10; See my message at [1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net Will re-check things and report back with further debugging info you asked for later today. Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote: That assumes the Append won't ever use > 1 worker per subnode, but that's not the case for your plan as the subnodes are "Parallel". That means all the workers could be working on the same subnode which could result in one group being split between 2 or more workers. Didn't think of that, makes sense! Parallel Append can also run in a way that the Append child nodes will only get 1 worker each. How can I tell which case it is, from the EXPLAIN output (for example the output at [1]) ? [1] https://www.postgresql.org/message-id/69077f15-4125-2d63-733f-21ce6eac4f01%40gmx.net Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Fri, 10 May 2024, Tom Lane wrote: Dimitrios Apostolou writes: Further digging into this simple query, if I force the non-parallel plan by setting max_parallel_workers_per_gather TO 0, I see that the query planner comes up with a cost much higher: Limit (cost=363.84..1134528847.47 rows=10 width=4) -> Unique (cost=363.84..22690570036.41 rows=200 width=4) -> Append (cost=363.84..22527480551.58 rows=65235793929 width=4) ... The total cost on the 1st line (cost=363.84..1134528847.47) has a much higher upper limit than the total cost when max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99). This explains the planner's choice. But I wonder why the cost estimation is so far away from reality. I'd say the blame lies with that (probably-default) estimate of just 200 distinct rows. That means the planner expects to have to read about 5% (10/200) of the tables to get the result, and that's making fast-start plans look bad. Possibly an explicit ANALYZE on the partitioned table would help. It took long but it finished: ANALYZE Time: 19177398.025 ms (05:19:37.398) And it made a difference indeed, the serial plan is chosen now: EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10; Limit (cost=364.29..1835512.29 rows=10 width=4) -> Unique (cost=364.29..22701882164.56 rows=123706 width=4) -> Append (cost=364.29..22538472401.60 rows=65363905182 width=4) -> Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000 (cost=0.12..2.34 rows=1 width=4) -> Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999 (cost=0.12..2.34 rows=1 width=4) -> Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998 (cost=0.12..2.34 rows=1 width=4) [...] -> Index Only Scan Backward using test_runs_raw__part_max12460k_pkey on test_runs_raw__part_max12460k test_runs_raw_623 (cost=0.57..12329614.53 rows=121368496 width=4) -> Index Only Scan Backward using test_runs_raw__part_max12440k_pkey on test_runs_raw__part_max12440k test_runs_raw_622 (cost=0.57..5180832.16 rows=184927264 width=4) -> Index Only Scan Backward using test_runs_raw__part_max12420k_pkey on test_runs_raw__part_max12420k test_runs_raw_621 (cost=0.57..4544964.21 rows=82018824 width=4) [...] Overall I think there are two issues that postgres could handle better here: 1. Avoid the need for manual ANALYZE on partitioned tables 2. Create a different parallel plan, one that can exit early, when the LIMIT is proportionally low. I feel the partitions could be parallel-scanned in-order, so that the whole thing stops when one partition has been read. Thank you! Dimitris
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, Dimitrios Apostolou wrote: It took long but it finished: ANALYZE Time: 19177398.025 ms (05:19:37.398) I see now that default_statistics_target is globally set to 500, so this is probably the reason it took so long. I guess with the default of 100, it would take approximately one hour. This is much better to have in a cron job. :-) Dimitris
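If even that is too heavy for a periodic job, another lever is to lower the statistics target for just the column that matters here and analyze only that column. A rough sketch, not tested on this schema:

ALTER TABLE test_runs_raw ALTER COLUMN workitem_n SET STATISTICS 100;
ANALYZE test_runs_raw (workitem_n);

That keeps the expensive target-500 sampling of all the other columns out of the scheduled ANALYZE.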
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
On Tue, 14 May 2024, David Rowley wrote: If you were to put the n_distinct_inherited estimate back to 200 and disable sort, you should see the costs are higher for the index plan. If that's not the case then there might be a bug. It seems more likely that due to the n_distinct estimate being so low that the planner thought that a large enough fraction of the rows needed to be read and that made the non-index plan appear cheaper. I'd be interested in seeing what the costs are for the index plan. I think the following will give you that (untested): alter table test_runs_raw alter column workitem_n set (n_distinct_inherited=200); analyze test_runs_raw; I had to stop this step because it was taking too long going through all partitions again. But it seems it had the desired effect. set enable_sort=0; explain SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10; It chooses the non-parallel index plan: Limit (cost=365.17..1135517462.36 rows=10 width=4) -> Unique (cost=365.17..22710342308.83 rows=200 width=4) -> Append (cost=365.17..22546660655.46 rows=65472661350 width=4) -> Index Only Scan Backward using test_runs_raw__part_max2k_pkey on test_runs_raw__part_max2k test_runs_raw_1000 (cost=0.12..2.34 rows=1 width=4) -> Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999 (cost=0.12..2.34 rows=1 width=4) [... only index scans follow] LIMIT 100 goes for the parallel seqscan plan, that even contains a sort! But it seams to me that the extra upper level HashAggregate step raises the cost by an order of magnitude, from 800M to 10G, in comparison to doing (Unique->Sort) - see plan in the next paragraph. Limit (cost=10857220388.76..10857220389.01 rows=100 width=4) -> Sort (cost=10857220388.76..10857220389.26 rows=200 width=4) Sort Key: test_runs_raw.workitem_n DESC -> HashAggregate (cost=857220379.12..857220381.12 rows=200 width=4) Group Key: test_runs_raw.workitem_n -> Gather (cost=857220295.12..857220377.12 rows=800 width=4) Workers Planned: 4 -> HashAggregate (cost=857219295.12..857219297.12 rows=200 width=4) Group Key: test_runs_raw.workitem_n -> Parallel Append (cost=0.00..816295259.21 rows=16369614363 width=4) -> Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480 (cost=0.57..1597356.30 rows=33623360 width=4) -> Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507 (cost=0.57..1210806.37 rows=25794030 width=4) -> Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475 (cost=0.00..3037800.88 rows=64122388 width=4) -> Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559 (cost=0.00..2918865.36 rows=61611136 width=4) [... 
only seqscans follow] If I re-enable sort, then it goes for the parallel seqscan plan even with LIMIT 10: SET SESSION enable_sort TO TRUE; EXPLAIN SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10; Limit (cost=857166256.39..857166256.59 rows=10 width=4) -> Unique (cost=857166256.39..857166260.39 rows=200 width=4) -> Sort (cost=857166256.39..857166258.39 rows=800 width=4) Sort Key: test_runs_raw.workitem_n DESC -> Gather (cost=857166135.82..857166217.82 rows=800 width=4) Workers Planned: 4 -> HashAggregate (cost=857165135.82..857165137.82 rows=200 width=4) Group Key: test_runs_raw.workitem_n -> Parallel Append (cost=0.00..816243567.24 rows=16368627432 width=4) -> Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480 (cost=0.57..1597356.30 rows=33623360 width=4) -> Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507 (cost=0.57..1210806.37 rows=25794030 width=4) -> Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475 (cost=0.00..3037800.88 rows=64122388 width=4) -> Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559 (cost=0.00..2918865.36 rows=61611136 width=4) [... only seqscans follow] So in order of higher to lower cost, we have the following alternatives: 1. non-parallel index scan (800M) 2. parallel seqscan with sort (1.3G) 3. parallel seqscan without
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
I have forgotten to mention that I have enable_partitionwise_aggregate=on in the global settings since the beginning. According to the docs: Enables or disables the query planner's use of partitionwise grouping or aggregation, which allows grouping or aggregation on partitioned tables to be performed separately for each partition. Reading that, I'd expect to see a separate DISTINCT->LIMIT 10 on every partition, and then it would be up to independent plans to decide whether each partition follows a parallel or a serial plan. Not sure if this plan was checked but rejected because of cost. Dimitris
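One way to see whether such a per-partition plan is generated at all is to compare the plan shape with the setting toggled; a sketch (COSTS OFF only keeps the output short):

SET enable_partitionwise_aggregate = on;
EXPLAIN (COSTS OFF) SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;
SET enable_partitionwise_aggregate = off;
EXPLAIN (COSTS OFF) SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

If the first plan still shows a single aggregation above the Append, the partitionwise variant was either not generated for this query shape or was rejected on cost.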
pg_stat_io clarifications: background worker, writes and reads
Hello list, what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with registered background worker. Additionally, how can it be evictions > writes? I would expect every eviction to cause a write. Finally about "hits", I understand they are reads found in shared_buffers, so they never registered into the "reads" counter. So is "reads" in pg_stat_io the equivalent to misses, i.e. the opposite of "hits", the read attempts not found in the shared_buffers, that needed to be fetched from the disk (or OS buffercache)? backend_type|object | context | reads | read_time | writes | write_time | writebacks | writeback_time | extends | extend_time | op_bytes | hits| evictions | reuses | fsyncs | fsync_time | stats_reset ---+---+-+-+-+++++-+-+--+---+---++++--- background worker | relation | normal | 5139575 | 2196288.011 | 63277 | 1766.94 | 0 | 0 | 0 | 0 | 8192 | 876913705 | 5139653 || 0 | 0 | 2024-04-08 08:50:02.971192+00 Thank you in advance, Dimitris
Re: pg_stat_io clarifications: background worker, writes and reads
So what is this particular "background worker" I'm seeing, given that I have no replication or extensions? Searching the logs I found entries like the following: LOG: background worker "parallel worker" (PID 93384) exited with exit code 1 This got logged when I killed a simple SELECT query that took too long doing parallel seqscans. Could it be that the entry in pg_stat_io named "background worker" also includes the parallel workers from a SELECT query? Thanks, Dimitris On Wed, 15 May 2024, Muhammad Imtiaz wrote: Hi, In PostgreSQL, the pg_stat_io view provides detailed statistics on I/O operations. Background process perform maintenance tasks and other background operations essential to the functioning of the PostgreSQL database. They include processes such as: 1. Autovacuum Workers 2. WAL Writer 3. Background Writer 4. Logical Replication Workers 5. Custom Background Workers In the pg_stat_io view, statistics related to I/O operations performed by these background workers are recorded. Regards, Imtiaz On Wed, 15 May 2024, 01:26 Dimitrios Apostolou, wrote: Hello list, what is the "background worker" in the pg_stat_io statistics view? I'm reading the documentation but can't figure this one out knowing that it is not autovacuum or bgwriter. And I'm not aware of any extension I might have with registered background worker. Additionally, how can it be evictions > writes? I would expect every eviction to cause a write. Finally about "hits", I understand they are reads found in shared_buffers, so they never registered into the "reads" counter. So is "reads" in pg_stat_io the equivalent to misses, i.e. the opposite of "hits", the read attempts not found in the shared_buffers, that needed to be fetched from the disk (or OS buffercache)? backend_type | object | context | reads | read_time | writes | write_time | writebacks | writeback_time | extends | extend_time | op_bytes | hits | evictions | reuses | fsyncs | fsync_time | stats_reset ---+---+-+-+-+++++-+-+--+---+---++++--- background worker | relation | normal | 5139575 | 2196288.011 | 63277 | 1766.94 | 0 | 0 | 0 | 0 | 8192 | 876913705 | 5139653 | | 0 | 0 | 2024-04-08 08:50:02.971192+00 Thank you in advance, Dimitris
Re: pg_stat_io clarifications: background worker, writes and reads
Hi Kashif, just to clarify my question, are you saying that the I/O from parallel workers is recorded into the "background worker" entry of pg_stat_io? Thanks, Dimitris On Wed, 15 May 2024, Kashif Zeeshan wrote: Hi, parallel workers are used for parallel execution of queries and you can find more information at the link below. https://www.postgresql.org/docs/current/how-parallel-query-works.html It's controlled by the following parameters. max_worker_processes = 6 max_parallel_workers_per_gather = 3 max_parallel_workers = 6 The limit of concurrent parallel workers for the whole cluster is max_parallel_workers, which must be ≤ max_worker_processes. The limit of parallel workers per query is max_parallel_workers_per_gather. Thanks Kashif Zeeshan Bitnine Global On Wed, May 15, 2024 at 5:59 PM Dimitrios Apostolou wrote: So what is this particular "background worker" I'm seeing, given that I have no replication or extensions? Searching the logs I found entries like the following: LOG: background worker "parallel worker" (PID 93384) exited with exit code 1 This got logged when I killed a simple SELECT query that took too long doing parallel seqscans. Could it be that the entry in pg_stat_io named "background worker" also includes the parallel workers from a SELECT query? Thanks, Dimitris
Query on partitioned table needs memory n_partitions * work_mem
Hello list, I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form: SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to allocate at least 1000 * work_mem. If I reduce the amount of work_mem, I can control the outcome and avoid the crash, but this is suboptimal. I have parallel plans disabled (max_parallel_workers_per_gather=0). To add a bit more info on the execution plan, I believe the relevant part is the 1000 HashAggregate nodes under Append: -> Append -> HashAggregate -> Seq Scan -> ... 1000 more hashagg+seqscans Is this allocation pattern (work_mem * n_partitions) expected under any scenario? I can't find it documented. AFAIU the backend should allocate up to (depth_of_execution_plan * work_mem) (putting aside the hash_mem_multiplier and the parallel workers). NOTE: after having written the above message, it occurred to me that I have enable_partitionwise_aggregate=on. Turning it off fixes the issue and makes the query faster too! Expected behaviour or bug? Thank you in advance, Dimitris P.S. In the meantime I'm trying to reduce the query and the table schema, in order to submit a precise bug report with repro instructions.
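A minimal sketch of the workaround and of the back-of-the-envelope bound, assuming session-level changes are acceptable; the numbers are only illustrative:

SET enable_partitionwise_aggregate = off;  -- one HashAggregate in total instead of one per partition
-- With it on, a rough worst case is n_partitions * work_mem * hash_mem_multiplier,
-- e.g. 1000 partitions * 32MB * 2 = 64GB.
SET work_mem = '4MB';                      -- the per-node limit still applies to every hash node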
Re: Query on partitioned table needs memory n_partitions * work_mem
Thank you for the feedback. So I've managed to reduce the query to a rather simple one: SELECT workitem_n, test_executable_n, bool_or(test_resulttype_n IN (2,3)) FROM test_runs_raw GROUP BY workitem_n, test_executable_n LIMIT 10; The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). All the columns are various integer types. There is an index on workitem_n. On Thu, 11 Jul 2024, David Rowley wrote: On Thu, 11 Jul 2024 at 13:19, Dimitrios Apostolou wrote: I have a table with 1000 partitions on PostgreSQL 16. I notice that a fairly complicated query of the form: SELECT ... GROUP BY ... LIMIT ... causes the postgres backend process to grow insanely very fast, and the kernel OOM killer to kill it rather soon. It seems it tries to allocate at least 1000 * work_mem. -> Append -> HashAggregate -> Seq Scan -> ... 1000 more hashagg+seqscans Is this allocation pattern (workmem * n_partitions) expected under any scenario? I can't find it documented. AFAIU the backend should allocate up to (depth_of_execution_plan * work_mem) (putting aside the hash_mem_multiplier and the parallel workers). Not depth of execution plan. It relates to the number of nodes in the plan which allocate work_mem or work_mem * hash_mem_multiplier. There is some documentation in [1]: "Note that a complex query might perform several sort and hash operations at the same time" The latest query is not complex at all and I don't see it doing 1000s of operations at the same time. By "number of nodes" would you add up all HashAggregate nodes under an Append node? Here is part of the EXPLAIN ANALYZE output: Limit (cost=0.01..28.00 rows=10 width=7) (actual time=43120.466..43292.246 rows=10 loops=1) Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, (bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[] Buffers: shared hit=96 read=883975 I/O Timings: shared read=16284.731 -> Append (cost=0.01..3416299633.71 rows=1220556171 width=7) (actual time=42968.794..43139.855 rows=10 loops=1) Buffers: shared hit=96 read=883975 I/O Timings: shared read=16284.731 -> HashAggregate (cost=0.01..0.02 rows=1 width=7) (actual time=10.662..10.663 rows=0 loops=1) Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, bool_or((test_runs_raw.test_resulttype_n = ANY ('{2,3}'::integer[]))) Group Key: test_runs_raw.workitem_n, test_runs_raw.test_executable_n Batches: 1 Memory Usage: 24kB -> Seq Scan on public.test_runs_raw__part_max20k test_runs_raw (cost=0.00..0.00 rows=1 width=8) (actual time=9.960..9.961 rows=0 loops=1) Output: test_runs_raw.workitem_n, test_runs_raw.test_executable_n, test_runs_raw.test_resulttype_n -> HashAggregate (cost=0.01..0.02 rows=1 width=7) (actual time=1.913..1.914 rows=0 loops=1) Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, bool_or((test_runs_raw_1.test_resulttype_n = ANY ('{2,3}'::integer[]))) Group Key: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n Batches: 1 Memory Usage: 24kB -> Seq Scan on public.test_runs_raw__part_max40k test_runs_raw_1 (cost=0.00..0.00 rows=1 width=8) (actual time=1.031..1.031 rows=0 loops=1) Output: test_runs_raw_1.workitem_n, test_runs_raw_1.test_executable_n, test_runs_raw_1.test_resulttype_n [ 1000s of similar HashAggregate nodes ... 
] Settings: temp_buffers = '32MB', work_mem = '32MB', effective_io_concurrency = '300', max_parallel_workers_per_gather = '0', enable_hashjoin = 'off', enable_partitionwise_join = 'on', enable_partitionwise_aggregate = 'on', random_page_cost = '1.1', effective_cache_size = '6GB', from_collapse_limit = '24', join_collapse_limit = '24' Planning: Buffers: shared hit=377 Planning Time: 1503.800 ms Execution Time: 56515.185 ms (5382 rows) Memory usage on each HashAggregate is logged as 24KB (many HashAggregates are missing that info though), I guess the EXPLAIN output is missing some important part of the allocations here since I'm seeing MBs of allocations per node. I can't help but see this as a bug. I see many issues: * postgres is not reading from partitions in parallel, but one after the other. It shouldn't need all this memory simultaneously. * The memory is unnecessarily allocated early on, before any partitions are actually aggregated. I know this because I/O is slow on this device and the table sizes are huge, it's simply not possible that postgres went through all partitions and blew up the memory. That would take hours, but the OOM happens seconds after I start the query.
Re: Query on partitioned table needs memory n_partitions * work_mem
On Thu, 11 Jul 2024, Tom Lane wrote: Dimitrios Apostolou writes: The TABLE test_runs_raw has 1000 partitions on RANGE(workitem_n). So don't do that. Adding partitions is not cost-free. I understand that, they also add an administrative cost that I'd rather avoid. But I ended up adding all these partitions because of performance issues on a multi-billion-row table. There is probably some message from me on this list a couple of years ago. At the moment I have a work-around. I'm thankful that everyone is willing to provide workarounds to all potential issues/bugs I have presented, but unfortunately workarounds are not fixes; one will hit the same wall again at some point. My current concern is **reporting my findings responsibly**. I want to provide as much data as needed to pinpoint the issue, so that the developers know exactly what's going on. Having the right data is half the fix. A way to track the issue would be nice. I might revisit it and even try to submit a patch. I wonder how the postgres development community is tracking all these issues; I've even started forgetting the ones I have found, and I'm sure I have previously reported (on this list) a couple of should-be-easy issues that would be ideal for beginners. Regards, Dimitris
Re: Query on partitioned table needs memory n_partitions * work_mem
On Thu, 18 Jul 2024, David Rowley wrote: On Fri, 12 Jul 2024 at 02:08, Dimitrios Apostolou wrote: * The memory is unnecessarily allocated early on, before any partitions are actually aggregated. I know this because I/O is slow on this device and the table sizes are huge, it's simply not possible that postgres went through all partitions and blew up the memory. That would take hours, but the OOM happens seconds after I start the query. That's interesting. Certainly, there is some memory allocated during executor startup, but that amount should be fairly small. Are you able to provide a self-contained test case that shows the memory blowing up before execution begins? I'm trying hard to create a self-contained way to reproduce the issue. It's not easy, the behaviour is a bit unstable. So far I see high memory usage (8-10GB) but I'm not able to OOM with that much (it actually OOM'ed once, but it was so uncontrollable that I didn't manage to measure and reproduce again later). -- I create a table with 2k partitions. CREATE TABLE partitioned_table1( run_n bigint GENERATED ALWAYS AS IDENTITY, workitem_n integer NOT NULL, label_n smallint, result smallint NOT NULL, PRIMARY KEY(workitem_n, run_n) ) PARTITION BY RANGE(workitem_n); DO $$ for i in range(0, 2000): stmt = f''' CREATE TABLE part_max{i+1}M PARTITION OF partitioned_table1 FOR VALUES FROM ({i*1000*1000}) TO ({(i+1) * 1000*1000}) ''' plpy.execute(stmt) $$ LANGUAGE plpython3u; -- I insert random data. First I insert to all partitions, 1M rows each: DO $$ for i in range(0, 2000): stmt = f''' INSERT INTO partitioned_table1(workitem_n, label_n, result) SELECT j-j%4, CAST(random()*1000 AS INTEGER), CAST(random()*3 AS INTEGER) FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j ''' plpy.info(stmt) plpy.execute(stmt) plpy.commit() $$ LANGUAGE plpython3u; -- Disable parallel execution and group aggregate: SET SESSION max_parallel_workers_per_gather TO 0; SET SESSION enable_incremental_sort TO off; SET SESSION work_mem TO '8MB'; -- Now the following query should do a HashAggregate: SELECT workitem_n, label_n, bool_or(result IN (2,3)) FROM partitioned_table1 GROUP BY workitem_n, label_n LIMIT 10; -- How much was the RSS of the backend while the previous query was -- running? Not that high. But if we insert some millions of rows into the -- 1st partition, then it will be much higher. DO $$ for i in range(0,2000): stmt = f''' INSERT INTO partitioned_table1(workitem_n, label_n, result) SELECT j%100, CAST(random()*2 AS INTEGER), CAST(random()*4 AS INTEGER) FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j ''' plpy.info(stmt) plpy.execute(stmt) plpy.commit() $$ LANGUAGE plpython3u; -- Now that same previous query consumes between 8GB and 10GB RSS. The -- more data I insert (to all partitions?), the more memory the query -- takes. Overall: * I don't see the RSS memory usage (8GB) growing proportionally as I expected. If I increase work_mem from 4MB to 8MB then I see double RSS memory usage (from ~4GB to ~8GB). But then if I increase it further the difference is minuscule and no OOM happens. * Instead I notice RSS memory usage growing slowly while I insert more and more data to the table (especially into the 1st partition I think). * Finally I don't see the memory being freed by the backend after the SELECT finishes. The system is relieved only when I disconnect psql and the backend dies. Not sure if that's by design or not. Having wasted a long time on this, the minimum I can do is submit a documentation patch.
Under enable_partitionwise_aggregate, something like "WARNING it can increase the memory usage by at least n_partitions * work_mem". How do I proceed with such a patch? Pointers would be appreciated. :-) I think mentioning something about this in enable_partitionwise_join and enable_partitionwise_aggregate is probably wise. I'll propose a patch on pgsql-hackers. David and Ashutosh, thank you both for your interest in improving the documentation. Unfortunately I'm no longer positive about what exactly is going on here; I don't understand how the memory is growing. One thing I can verify is that it's definitely caused by partitioning: I have another similar huge table but unpartitioned, and no such issues show up. Maybe someone with knowledge of the HashAggregate algorithm and partitioning can throw some ideas in. Regards, Dimitris
Re: Query on partitioned table needs memory n_partitions * work_mem
Ok I reproduced the OOM, with only 200 partitions as opposed to the 2K I mentioned before. The keys to reproduce it I believe are: 1. Write millions of rows to *all* partitions 2. Run ANALYSE so that the planner knows about the sizes Here are the correct steps now. Let me know if you manage to reproduce. Kind reminder to set vm.overcommit_memory to 2, before you lose your system as happened to me. :-) CREATE TABLE partitioned_table1( run_n bigint GENERATED ALWAYS AS IDENTITY, workitem_n integer NOT NULL, label_n smallint, result smallint NOT NULL, PRIMARY KEY(workitem_n, run_n) ) PARTITION BY RANGE(workitem_n); -- Create 200 partitions DO $$ for i in range(0, 200): start = i * 10 * 1000 * 1000 end = (i+1) * 10 * 1000 * 1000 stmt = f''' CREATE TABLE part{i} PARTITION OF partitioned_table1 FOR VALUES FROM ({start}) TO ({end}) ''' plpy.execute(stmt) $$ LANGUAGE plpython3u; -- Insert 10M rows per partition (2000 batches of 1M rows). This will take a while but I don't know -- of a way to speed it up. DO $$ for i in range(0, 2000): stmt = f''' INSERT INTO partitioned_table1(workitem_n, label_n, result) SELECT j, CAST(random()*1000 AS INTEGER), CAST(random()*4 AS INTEGER) FROM generate_series({i}*1000*1000, ({i}+1)*1000*1000 - 1, 1) as j ''' plpy.info(stmt) plpy.execute(stmt) plpy.commit() $$ LANGUAGE plpython3u; -- Important to analyse! and set the right settings... ANALYSE partitioned_table1; SET SESSION max_parallel_workers_per_gather TO 0; SET SESSION enable_incremental_sort TO off; SET SESSION enable_partitionwise_aggregate TO on; SET SESSION enable_partitionwise_join TO on; -- And then reproduce the issue: \timing on SET SESSION work_mem TO '4MB'; SELECT workitem_n, label_n, bool_or(result IN (2,3)) FROM partitioned_table1 GROUP BY workitem_n, label_n LIMIT 10; Repeat that last query while doubling the work_mem every time. You'll see the process growing to immense dimensions. In my case, my 16GB of RAM were exhausted when work_mem was set to '128MB'. Remarkably, the fastest execution is the one with the smallest work_mem, where HashAggregate splits the work into many "partitions". As we grow the work_mem, it becomes slower and slower (of course while not reaching the limits of RAM yet; if it goes to swap then all measurements are off). Let me know if you manage to reproduce the issue! Dimitris
array_agg() does not stop aggregating according to HAVING clause
Hello list, I have a query that goes through *billions* of rows and for the columns that have an infrequent "datatag" (HAVING count(test_datatag_n)<10) it selects all the IDs of the entries (array_agg(run_n)). Here is the full query: INSERT INTO infrequent_datatags_in_this_chunk SELECT datatag, datatags.datatag_n, array_agg(run_n) FROM runs_raw JOIN datatags USING(datatag_n) WHERE workitem_n >= 295 AND workitem_n < 714218 AND datatag IS NOT NULL GROUP BY datatags.datatag_n HAVING count(datatag_n) < 10 AND count(datatag_n) > 0 -- Not really needed because of the JOIN above ; The runs_raw table has run_n as the primary key id, and an index on workitem_n. The datatags table is a key value store with datatag_n as primary key. The problem is that this is extremely slow (5 hours), most likely because it creates tens of gigabytes of temporary files as I see in the logs. I suspect that it is writing to disk the array_agg(run_n) of all entries and not only those HAVING count(datatag_n)<10. (I might be wrong though, as this is only an assumption based on the amount of data written; I don't know of any way to examine the temporary files written). While this query is going through billions of rows, the ones with infrequent datatags are maybe 10M. How do I tell postgres to stop aggregating when count>=10? Thank you in advance, Dimitris
Re: array_agg() does not stop aggregating according to HAVING clause
On Sat, 17 Aug 2024, Tom Lane wrote: Well, yes: the two aggregates (array_agg and count) are computed concurrently in a single Aggregate plan node scanning the output of the JOIN. There's no way to apply the HAVING filter until after the aggregation is finished. I think this approach is basically forced by the SQL standard's semantics for grouping/aggregation. FWIW I also tried: HAVING array_length(array_agg(run_n), 1) < 10; but I saw the same amount of temp files, at least in the short duration of my test run. Thank you, I will split this into two passes like you suggested. It's just that I'm doing another 3 passes over this table for different things I calculate (different GROUP BY, different WHERE clauses) and I was hoping to minimize the time spent. But avoiding the array_agg() over everything is my top priority ATM so I'll definitely try. Regards, Dimitris
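For reference, a minimal sketch of the two-pass idea, using the tables and filters from the first message of the thread; the temporary table name is made up here, and the first pass filters on datatag_n rather than on datatag, which is a slight change of semantics if the two can disagree:

CREATE TEMP TABLE infrequent_datatag_ns AS
SELECT datatag_n
FROM runs_raw
WHERE workitem_n >= 295 AND workitem_n < 714218
  AND datatag_n IS NOT NULL
GROUP BY datatag_n
HAVING count(*) < 10;

INSERT INTO infrequent_datatags_in_this_chunk
SELECT datatag, datatags.datatag_n, array_agg(run_n)
FROM runs_raw
     JOIN datatags USING (datatag_n)
     JOIN infrequent_datatag_ns USING (datatag_n)
WHERE workitem_n >= 295 AND workitem_n < 714218
GROUP BY datatag, datatags.datatag_n;

The first pass only counts, so nothing is array-accumulated for the frequent datatags; only the second pass builds arrays, and it touches a small fraction of the rows.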
Experience and feedback on pg_restore --data-only
Rationale: When restoring a backup in an emergency situation, it's fine to run pg_restore as superuser and get an exact replica of the dumped db. AFAICT pg_restore (without --data-only) is optimised for that case. But pg_dump/restore can be used as a generic data-copying utility, and in those cases it often makes sense to get rid of the churn and create a clean database by running the SQL schema definition from version control, and then copy the data for only the tables created. For this case, I choose to run pg_restore --data-only, and run it as the user who owns the database (dbowner), not as a superuser, in order to avoid changes being introduced under the radar. Things that made my life hard: * plenty of permission denials for both ALTER OWNER or SET SESSION AUTHORIZATION (depending on command line switches). Both of these require superuser privilege, but in my case this is not really needed. Dbowner has CREATEROLE and is the one who creates all the roles (WITH SET TRUE), and their private schemata in the specific database. Things would work if pg_restore did "SET ROLE" instead of "SET SESSION AUTHORIZATION" to switch user. Is this a straightforward change or are there issues I don't see? * After each failed attempt, I need to issue a TRUNCATE table1,table2,... before I try again. I wrote my own function for that. It would help if pg_restore would optionally truncate before COPY. I believe it would require superuser privilege for it, which could be achieved using the --superuser=username option used today for disabling the triggers. Performance issues: (important as my db size is >5TB) * WAL writes: I didn't manage to avoid writing to the WAL, despite having set wal_level=minimal. I even wrote my own function to ALTER all tables to UNLOGGED, but failed with "could not change table T to unlogged because it references logged table". I'm out of ideas on this one. * Indices: Could pg_restore have a switch to DROP indices before each COPY, and re-CREATE them after, exactly as they were? This would speed up the process quite a bit. Any feedback for improving my process? Should I put these ideas somewhere as ideas for improvement on pg_restore? Thank you in advance, Dimitris
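For the repeated truncation between attempts, a helper along these lines can generate a single TRUNCATE statement for all tables in a schema; a sketch, assuming everything lives in the public schema:

SELECT format('TRUNCATE TABLE %s;',
              string_agg(format('%I.%I', schemaname, tablename), ', '))
FROM pg_tables
WHERE schemaname = 'public';

In psql the generated statement can be executed directly with \gexec, and listing all the tables in one TRUNCATE also sidesteps most foreign-key ordering complaints.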
Re: Experience and feedback on pg_restore --data-only
On Thu, 20 Mar 2025, Dimitrios Apostolou wrote: Rationale: When restoring a backup in an emergency situation, it's fine to run pg_restore as superuser and get an exact replica of the dumped db. AFAICT pg_restore (without --data-only) is optimised for such case. But pg_dump/restore can be used as a generic data-copying utility, and in those cases it makes often sense to get rid of the churn and create a clean database by running the SQL schema definition from version control, and then copy the data for only the tables created. For this case, I choose to run pg_restore --data-only, and run it as the user who owns the database (dbowner), not as a superuser, in order to avoid changes being introduced under the radar. Another important reason I use --data-only: tablespaces. The new host has different storage configuration and tablespaces and the schema has been updated to reflect those. Therefore I must create the database using the updated schema definitions on the new host. Things that made my life hard: * plenty of permission denials for both ALTER OWNER or SET SESSION AUTHORIZATION (depending on command line switches). Both of these require superuser privilege, but in my case this is not really needed. Dbowner has CREATEROLE and is the one who creates all the roles (WITH SET TRUE), and their private schemata in the specific database. Things would work if pg_restore did "SET ROLE" instead of "SET SESSION AUTHORIZATION" to switch user. Is this a straightforward change or there are issues I don't see? * After each failed attempt, I need to issue a TRUNCATE table1,table2,... before I try again. I wrote my own function for that. It would help if pg_restore would optionally truncate before COPY. I believe it would require superuser privilege for it, that could achieve using the --superuser=username option used today for disabling the triggers. Performance issues: (important as my db size is >5TB) * WAL writes: I didn't manage to avoid writing to the WAL, despite having setting wal_level=minimal. I even wrote my own function to ALTER all tables to UNLOGGED, but failed with "could not change table T to unlogged because it references logged table". I'm out of ideas on this one. * Indices: Could pg_restore have a switch to DROP indices before each COPY, and re-CREATE them after, exactly as they were? This would speed up the process quite a bit. Any feedback for improving my process? Should I put these ideas somewhere as ideas for improvement on pg_restore? Thank you in advance, Dimitris
Re: Experience and feedback on pg_restore --data-only
On Mon, 24 Mar 2025, Adrian Klaver wrote: On 3/24/25 07:24, Dimitrios Apostolou wrote: On Sun, 23 Mar 2025, Laurenz Albe wrote: On Thu, 2025-03-20 at 23:48 +0100, Dimitrios Apostolou wrote: Performance issues: (important as my db size is >5TB) * WAL writes: I didn't manage to avoid writing to the WAL, despite having set wal_level=minimal. I even wrote my own function to ALTER all tables to UNLOGGED, but failed with "could not change table T to unlogged because it references logged table". I'm out of ideas on this one. You'd have to create and load the table in the same transaction, that is, you'd have to run pg_restore with --single-transaction. That would restore the schema from the dump, while I want to create the schema from the SQL code in version control. I am not following, from your original post: " ... create a clean database by running the SQL schema definition from version control, and then copy the data for only the tables created. For this case, I choose to run pg_restore --data-only, and run it as the user who owns the database (dbowner), not as a superuser, in order to avoid changes being introduced under the radar. " You are running the process in two steps, where the first does not involve pg_restore. Not sure why doing the pg_restore --data-only portion in single transaction is not possible? Laurenz informed me that I could avoid writing to the WAL if I "create and load the table in a single transaction". I haven't tried, but here is what I would do to try --single-transaction: Transaction 1: manually issuing all of CREATE TABLE etc. Transaction 2: pg_restore --single-transaction --data-only The COPY command in transaction 2 would still need to write to WAL, since it's separate from the CREATE TABLE. Am I wrong somewhere? Something that might work would be for pg_restore to issue a TRUNCATE before the COPY. I believe this would require superuser privilege though, which I would prefer to avoid. Currently I issue TRUNCATE for all tables manually before running pg_restore, but of course this is in a different transaction so it doesn't help. By the way do you see potential problems with using --single-transaction to restore billion-row tables? COPY is all or none (version 17+ caveat: see https://www.postgresql.org/docs/current/sql-copy.html ON_ERROR), so if the data dump fails in --single-transaction everything rolls back. So if I restore all tables, then an error about a "table not found" would not roll back already copied tables, since it's not part of a COPY? Thank you for the feedback, Dimitris
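As far as I understand the optimization, what matters is that the TRUNCATE (or CREATE TABLE) and the data load happen inside the same transaction while wal_level=minimal; roughly, per table (a sketch, with placeholder table and file names):

BEGIN;
TRUNCATE TABLE table1;                    -- same transaction as the load
COPY table1 FROM '/path/to/table1.copy';  -- the bulk load can then skip WAL
COMMIT;

A TRUNCATE issued in an earlier, separate transaction does not qualify, which matches the behaviour described above.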
Re: Experience and feedback on pg_restore --data-only
On Sun, 23 Mar 2025, Adrian Klaver wrote: On 3/20/25 15:48, Dimitrios Apostolou wrote: * plenty of permission denials for both ALTER OWNER or SET SESSION AUTHORIZATION (depending on command line switches). Both of these require superuser privilege, but in my case this is not really needed. Dbowner has CREATEROLE and is the one who creates all the roles (WITH SET TRUE), and their private schemata in the specific database. Things would work if pg_restore did "SET ROLE" instead of "SET SESSION AUTHORIZATION" to switch user. Is this a straightforward change or there are issues I don't see? If this is --data-only what are the ALTER OWNER and SET SESSION AUTHORIZATION for? You are probably right, early in my trials I was running pg_restore without --data-only as a non-superuser so it might be that the error message comes from there. Haven't noted the exact command for this error message unfortunately. The point still stands though. The dbowner user is administrator for this database, and has CREATEd the users with the right to SET ROLE as any of them. Those other users own tables in their private schemas. But pg_restore does SET SESSION AUTHORIZATION which requires superuser priv instead of SET ROLE. I wonder what the reasons are for that. Maybe pg_restore could either: - do SET ROLE instead of SET SESSION AUTHORIZATION - temporarily use the --superuser powers just for issuing the ALTER ROLE. Regards, Dimitris
Re: Experience and feedback on pg_restore --data-only
Hi Ron, I read your reply in the mailing list archives as I'm not subscribed to the list, and I'm copy-pasting a response here. Please include me as a recipient in further replies. Why are you regularly having emergencies requiring the restoration of multi-TB tables to databases with lots of cruft? Fixing that would go a long way towards eliminating your problems with pg_restore. I don't have emergencies yet. I'm testing the process of restoring the database dump, and it takes more than 24 hours currently. A successful test is vital to approve the process. But the primary usage of pg_restore that I have is not to save me from emergencies but to populate the dev database with recent data. Regards, Dimitris
Re: Experience and feedback on pg_restore --data-only
On Sun, 23 Mar 2025, Laurenz Albe wrote: On Thu, 2025-03-20 at 23:48 +0100, Dimitrios Apostolou wrote: Performance issues: (important as my db size is >5TB) * WAL writes: I didn't manage to avoid writing to the WAL, despite having set wal_level=minimal. I even wrote my own function to ALTER all tables to UNLOGGED, but failed with "could not change table T to unlogged because it references logged table". I'm out of ideas on this one. You'd have to create and load the table in the same transaction, that is, you'd have to run pg_restore with --single-transaction. That would restore the schema from the dump, while I want to create the schema from the SQL code in version control. Something that might work would be for pg_restore to issue a TRUNCATE before the COPY. I believe this would require superuser privilege though, which I would prefer to avoid. Currently I issue TRUNCATE for all tables manually before running pg_restore, but of course this is in a different transaction so it doesn't help. By the way do you see potential problems with using --single-transaction to restore billion-row tables? Thank you, Dimitris
Re: Experience and feedback on pg_restore --data-only
Hi Laurenz, On Sun, 23 Mar 2025, Laurenz Albe wrote: On Thu, 2025-03-20 at 23:48 +0100, Dimitrios Apostolou wrote: Performance issues: (important as my db size is >5TB) * WAL writes: I didn't manage to avoid writing to the WAL, despite having set wal_level=minimal. I even wrote my own function to ALTER all tables to UNLOGGED, but failed with "could not change table T to unlogged because it references logged table". I'm out of ideas on this one. You'd have to create and load the table in the same transaction, that is, you'd have to run pg_restore with --single-transaction. Do you know why --single-transaction can't be combined with -j ? I'm thinking it's because every process would need to have its own transaction, which apparently results in more than one transaction. Do you think this could be alleviated somehow with subtransactions? I'm trying to think of a way to enjoy the performance benefits of --single-transaction, while also parallelizing the pg_restore. For this use case I don't care about the content appearing all in one transaction. Would it be possible then to have dependency resolution in such a way that we split the tasks into one transaction per worker process? Just thinking out loud here, I'd be interested in feedback. Thanks, Dimitris
Performance issues during pg_restore -j with big partitioned table
Hello list. My database includes one table with 1000 partitions, all of them rather sizeable. I run: pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error --no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump Right now after 24h of restore, I notice weird behaviour, so I have several questions about it: + 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting". I see that they are waiting to issue a TRUNCATE for one of the partitions and then COPY data to it. Checking the log I see that several partitions have already finished copying, but many more are left to start. Why is a TRUNCATE needed at the start of a partition's COPY phase? I didn't issue a --clean on the command line (I don't need it as my database is newly created), and I don't see a mention of a related TRUNCATE in the pg_restore manual. + 1 postgres backend process is doing: ALTER TABLE the_master_partitioned_table ADD CONSTRAINT ... FOREIGN KEY (columnX) REFERENCES another_table(columnX) According to my logs this started right after COPY DATA for another_table was finished. And apparently it has a lock on the_master_partitioned_table that all the other TRUNCATEs have to wait for. Is this a bug in the dependency resolution? Wouldn't it make sense for this to wait until all 1000 partitions have finished their COPY DATA phase? + Investigating why the above ALTER TABLE takes so long, I notice that it is issuing a lot of writes to the WAL. Digging deeper shows a lot of time spent in SetHintBits(). Is there a way to avoid that in a clean pg_restore? Thanks in advance, Dimitris
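To confirm which backend the queued TRUNCATEs are actually waiting on, something like this against pg_stat_activity can help; a sketch:

SELECT pid, pg_blocking_pids(pid) AS blocked_by,
       wait_event_type, wait_event, state, left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';

If all the blocked_by arrays point at the backend running the ALTER TABLE ... ADD CONSTRAINT, that supports the dependency-ordering suspicion above.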
pg_restore causing ENOSPACE on the WAL partition
Hello list, last night I got ENOSPACE on the WAL partition while running a huge pg_restore on an empty and idle database. The checkpoint that started 7 minutes earlier never finished: 04:31:09 LOG: checkpoint starting: wal ... 04:38:04 PANIC: could not write to file "pg_wal/xlogtemp.546204": No space left on device CONTEXT: COPY table_partition_123, line 120872534 This is most likely because most of the data is written to a very slow "archival" tablespace. No I/O errors in the system logs, I know things go slow on that device (iSCSI drive over a shared spinning-disks pool and busy network). This happened even though I'm keeping the WAL in its own partition (128GB partition size), with a lot of free space above max_wal_size (64GB). Somehow it managed to grow above max_wal_size and fill 100% of the partition. I'm running latest PostgreSQL 17 and the settings have been temporarily tweaked for fastest pg_restore: max_wal_size=64GB max_replication_slots = 0 max_logical_replication_workers = 0 max_wal_senders = 0 wal_level = minimal autovacuum = off Several things seem to have gone wrong here. Questions: + The WAL partition is much faster than the archival tablespace. Am I in constant danger of overrunning max_wal_size? How to make 100% sure this never happens again? + After recovery, with the database idling, I notice that WAL space usage is constant at 64GB. Why doesn't it free up space down to min_wal_size (1GB)? + I just created a 13GB zstd-compressed tarball of those 64GB WAL files. This indicates that the files are compressible despite the wal_compression=zstd setting. Could it be that postgres ignores the flag and does not compress the WAL? How to check? + I'm using parallel pg_restore --data-only, can't avoid that for now. Even though all the tables are empty (I truncated everything before starting pg_restore), I can't find a way to avoid going through the WAL. Ideas? Thanks in advance, Dimitris
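On the compression question, two quick checks (a sketch; the WAL segment name below is a placeholder). Note that wal_compression only applies to full-page images, so a WAL stream dominated by COPY record payloads can still compress well externally even when the setting is active:

-- in psql, confirm the setting actually took effect:
SHOW wal_compression;

# from the shell, summarize what the WAL really contains:
pg_waldump --stats=record pg_wal/000000010000001200000034

The per-record-type statistics show how much of the volume is full-page images versus ordinary record data, i.e. how much material wal_compression had to work on.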
Re: pg_restore causing ENOSPACE on the WAL partition
Forgot an important setting I have during pg_restore: On Thu, 10 Apr 2025, Dimitrios Apostolou wrote: max_wal_size=64GB max_replication_slots = 0 max_logical_replication_workers = 0 max_wal_senders = 0 wal_level = minimal autovacuum = off checkpoint_completion_target = 0 so the checkpoint should finish ASAP and any delays are purely because of I/O limitations. And for the sake of completeness: checkpoint_timeout = 600s Regards, Dimitris
Re: pg_restore causing ENOSPACE on the WAL partition
Happened again even with max_wal_size=32GB on the 128GB WAL partition. At this point I'm quite sure what happens: + a checkpoint is going on for several minutes (because of I/O contention on the archival drive) + meanwhile data keeps coming in through several processes of pg_restore + data is coming in much faster because the WAL partition is high-performance + The checkpoint completes before the partition is full, and on completion it frees up 32GB (max_wal_size) from the WAL drive. Let's say the WAL partition is now at 80/128GB full. + Immediately another checkpoint starts. This one has to checkpoint 80GB of data. + This doesn't make it in time and the WAL partition fills up. + Painful recovery follows; the database server won't even succeed in doing recovery when starting up... I'm surprised that there is no mechanism for the backends to block while the WAL is overflowing. Dimitris On Thu, 10 Apr 2025, Dimitrios Apostolou wrote: Hello list, last night I got ENOSPACE on the WAL partition while running a huge pg_restore on an empty and idle database. The checkpoint that started 7 minutes earlier never finished: 04:31:09 LOG: checkpoint starting: wal ... 04:38:04 PANIC: could not write to file "pg_wal/xlogtemp.546204": No space left on device : CONTEXT: COPY table_partition_123, line 120872534 This is most likely because most of the data is written to a very slow "archival" tablespace. No I/O errors on the system logs, I know things go slow on that device (iSCSI drive over a shared spinning-disks pool and busy network). This happened even though I'm keeping the WAL in its own partition (128GB partition size), with a lot of free space above max_wal_size (64GB). Somehow it managed to grow above max_wal_size and fill 100% of the partition. I'm running latest PostgreSQL 17 and the settings have been temporarily tweaked for fastest pg_restore: max_wal_size=64GB max_replication_slots = 0 max_logical_replication_workers = 0 max_wal_senders = 0 wal_level = minimal autovacuum = off Several things seem to have gone wrong here. Questions: + The WAL partition is much faster than the archival tablespace. Am I in constant danger of overruning max_wal_size? How to make 100% sure this never happens again? + After recovery, with the database idling, I notice that WAL space usage is constant at 64GB. Why doesn't it free up space down to min_wal_size (1GB)? + I just created a 13GB zstd-compressed tarball of those 64GB WAL files. This indicates that the files are compressible despite using wal_compression=zstd setting. Could it be that postgres ignores the flag and does not compress the WAL? How to check? + I'm using parallel pg_restore --data-only, can't avoid that for now. Even though all the tables are empty (I truncated everything before starting pg_restore), I can't find a way to avoid going through the WAL. Ideas? Thanks in advance, Dimitris
Re: TRUNCATE ONLY with foreign keys and triggers disabled
On Mon, 14 Apr 2025, Adrian Klaver wrote: On 4/14/25 08:20, Dimitrios Apostolou wrote: On Mon, 14 Apr 2025, Adrian Klaver wrote: On 4/14/25 08:07, Laurenz Albe wrote: On Mon, 2025-04-14 at 17:05 +0200, Dimitrios Apostolou wrote: I meant the *referencing* table has just been populated. I'm trying to delete the *referenced* table and I get the error. That would break the foreign key constraint, right? PostgreSQL cannot allow that. I believe the OP is disabling all triggers including system ones if I follow correctly and possibly running afoul of: https://www.postgresql.org/docs/current/sql-altertable.html " Disabling or enabling internally generated constraint triggers requires superuser privileges; it should be done with caution since of course the integrity of the constraint cannot be guaranteed if the triggers are not executed." Exactly that. I learned this from pg_restore --disable-triggers, as a way to speed up insertion. Since triggers are disabled, I assumed that postgresql shouldn't care about referential integrity in TRUNCATE. I'm going to say that depends on order of execution. Without seeing the code changes you have done there is really no way to determine what the exact issue is. Sent patch here: https://www.postgresql.org/message-id/flat/4589087c-ec6f-4407-1f82-6cb2e681ac0a%40gmx.net The reason I first sent here (pgsql-general) is that I wondered if the error is just an implementation shortcoming without deeper reasons, and thus whether it would be feasible to fix it myself. Dimitris
Re: TRUNCATE ONLY with foreign keys and triggers disabled
On Mon, 14 Apr 2025, Tom Lane wrote:

Dimitrios Apostolou writes:

While doing TRUNCATE ONLY I get:
ERROR: cannot truncate a table referenced in a foreign key constraint
But in my case the table to be truncated is already empty, and the triggers are disabled in all tables.

IIRC, it will let you do it if you truncate both the referenced and referencing tables in the same command. The state of the triggers is not material to this, since TRUNCATE doesn't fire them anyway.

Thanks Tom, however that is not possible in my case: the referenced table has possibly just been populated and is potentially very big (a partitioned table with many sub-partitions).

Context is that I've modified pg_restore to accept --data-only --clean, so that it issues a TRUNCATE ONLY before each table's COPY, in the same transaction (thus avoiding WAL too). I will send an RFC patch to pgsql-hackers when I verify it works.

Dimitris
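For readers following along, the workaround Tom describes looks like this (made-up table names), and it only helps if you can afford to empty the referencing table as well:

    TRUNCATE ONLY parent;      -- ERROR: cannot truncate a table referenced in a foreign key constraint
    TRUNCATE parent, child;    -- works: both sides of the FK truncated in one command
    TRUNCATE parent CASCADE;   -- works: referencing tables are pulled in automatically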
Re: TRUNCATE ONLY with foreign keys and triggers disabled
On Mon, 14 Apr 2025, Dimitrios Apostolou wrote:

On Mon, 14 Apr 2025, Tom Lane wrote:

Dimitrios Apostolou writes:

While doing TRUNCATE ONLY I get:
ERROR: cannot truncate a table referenced in a foreign key constraint
But in my case the table to be truncated is already empty, and the triggers are disabled in all tables.

IIRC, it will let you do it if you truncate both the referenced and referencing tables in the same command. The state of the triggers is not material to this, since TRUNCATE doesn't fire them anyway.

Thanks Tom, however that is not possible in my case: the referenced table has possibly just been populated and is potentially very big (a partitioned table with many sub-partitions).

Terminology correction: I meant the *referencing* table has just been populated. I'm trying to delete the *referenced* table and I get the error.

Dimitris
Re: TRUNCATE ONLY with foreign keys and triggers disabled
On Mon, 14 Apr 2025, Adrian Klaver wrote:

On 4/14/25 08:07, Laurenz Albe wrote:

On Mon, 2025-04-14 at 17:05 +0200, Dimitrios Apostolou wrote:

I meant the *referencing* table has just been populated. I'm trying to delete the *referenced* table and I get the error.

That would break the foreign key constraint, right? PostgreSQL cannot allow that.

I believe the OP is disabling all triggers, including system ones, if I follow correctly, and is possibly running afoul of:

https://www.postgresql.org/docs/current/sql-altertable.html

"Disabling or enabling internally generated constraint triggers requires superuser privileges; it should be done with caution since of course the integrity of the constraint cannot be guaranteed if the triggers are not executed."

Exactly that. I learned this from pg_restore --disable-triggers, as a way to speed up insertion. Since triggers are disabled, I assumed that PostgreSQL shouldn't care about referential integrity in TRUNCATE.

Dimitris
TRUNCATE ONLY with foreign keys and triggers disabled
Hello list,

While doing TRUNCATE ONLY I get:

ERROR: cannot truncate a table referenced in a foreign key constraint

But in my case the table to be truncated is already empty, and the triggers are disabled in all tables.

I wonder, is there a deeper reason for this error, or is it just an implementation shortcoming? In other words, how easy would it be to write a patch that allows TRUNCATE even with foreign keys, when triggers are disabled or the table is empty?

Thanks in advance,
Dimitris
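The behaviour is easy to reproduce even with empty tables and all triggers disabled, because the check is done by TRUNCATE itself rather than by any trigger, as Tom notes above. A sketch with made-up names (disabling the internal triggers needs superuser):

    CREATE TABLE parent_t (id int PRIMARY KEY);
    CREATE TABLE child_t  (id int REFERENCES parent_t(id));

    ALTER TABLE parent_t DISABLE TRIGGER ALL;
    ALTER TABLE child_t  DISABLE TRIGGER ALL;

    TRUNCATE ONLY parent_t;
    -- ERROR:  cannot truncate a table referenced in a foreign key constraint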
Re: Performance issues during pg_restore -j with big partitioned table
On Fri, 4 Apr 2025, Adrian Klaver wrote:

On 4/4/25 06:13, Dimitrios Apostolou wrote:

On Wed, 2 Apr 2025, Dimitrios Apostolou wrote:

Bug? This happened on a postgres compiled from last week's master branch.

Are you talking about the dev version?

In this thread, yes. My mistake, I didn't mention it from the start. I compiled it in order to see if the problems I mentioned in my previous thread were fixed.

Dimitris
Re: Performance issues during pg_restore -j with big partitioned table
On Wed, 2 Apr 2025, Dimitrios Apostolou wrote:

Hello list. My database includes one table with 1000 partitions, all of them rather

I was not clear here: my database dump has all that, and the database is brand new and empty.

sizeable. I run:

pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error --no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump

Right now, after 24h of restore, I notice weird behaviour, so I have several questions about it:

+ 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting". I see that they are waiting to issue a TRUNCATE for one of the partitions and then COPY data to it. Checking the log I see that several partitions have already finished copying, but many more are left to start. Why is a TRUNCATE needed at the start of a partition's COPY phase? I didn't pass --clean on the command line (I don't need it as my database is newly created), and I don't see a related TRUNCATE mentioned in the pg_restore manual.

+ 1 postgres backend process is doing:

ALTER TABLE the_master_partitioned_table ADD CONSTRAINT ... FOREIGN KEY (columnX) REFERENCES another_table(columnX)

According to my logs this started right after the COPY of another_table's data finished. And apparently it holds a lock on the_master_partitioned_table that all the other TRUNCATEs have to wait for. Is this a bug in the dependency resolution? Wouldn't it make sense for this to wait until all 1000 partitions have finished their COPY DATA phase?

Trying again, pg_restore exited with an error after almost 24h:

pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 8904; 2606 16529 CONSTRAINT test_runs_raw test_runs_raw_partitioned_pkey
pg_restore: error: could not execute query: ERROR: deadlock detected
DETAIL: Process 465409 waits for AccessExclusiveLock on relation 44437 of database 44090; blocked by process 465408.
Process 465408 waits for AccessShareLock on relation 44383 of database 44090; blocked by process 465409.
HINT: See server log for query details.

From the logs I see that:

+ Process 465409 waits for AccessExclusiveLock on relation 44437 of database 44090; blocked by process 465408.
  --> 44437 is test_runs_raw__part_max10120k (a single partition)
+ Process 465408 waits for AccessShareLock on relation 44383 of database 44090; blocked by process 465409.
  --> 44383 is test_runs_raw (the master partitioned table)

Process 465409: ALTER TABLE ONLY public.test_runs_raw ADD CONSTRAINT test_runs_raw_partitioned_pkey PRIMARY KEY (workitem_n, run_n);
Process 465408: COPY public.test_runs_raw__part_max10120k(...) FROM stdin;

Bug? This happened on a postgres compiled from last week's master branch. The dump I'm trying to restore is from postgres 17.4.

Thanks,
Dimitris
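One way to map the OIDs from the deadlock DETAIL back to table names, and to see who blocks whom while the processes are still waiting, is sketched below; it has to be run while connected to the affected database (the PIDs and OIDs here are the ones from the log above):

    SELECT oid, relname FROM pg_class WHERE oid IN (44437, 44383);

    -- while the lock queue is still in place:
    SELECT pid, locktype, relation::regclass, mode, granted
      FROM pg_locks WHERE NOT granted;
    SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
      FROM pg_stat_activity WHERE pid IN (465408, 465409);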
Re: Performance issues during pg_restore -j with big partitioned table
On Wed, 2 Apr 2025, Adrian Klaver wrote:

On 4/2/25 10:32 AM, Dimitrios Apostolou wrote:

Hello list. My database includes one table with 1000 partitions, all of them rather sizeable. I run:

pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error --no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump

Right now, after 24h of restore, I notice weird behaviour, so I have several questions about it:

+ 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting". I see that they are waiting to issue a TRUNCATE for one of the partitions and then COPY data to it. Checking the log I see that several partitions have already finished copying, but many more are left to start. Why is a TRUNCATE needed at the start of a partition's COPY phase? I didn't pass --clean on the command line (I don't need it as my database is newly created), and I don't see a related TRUNCATE mentioned in the pg_restore manual.

--clean will drop the object entirely, not TRUNCATE it. I'm guessing that this is being done by you per:

https://www.postgresql.org/message-id/53760c70-4a87-a453-9e02-57abc9cb2e54%40gmx.net

"After each failed attempt, I need to issue a TRUNCATE table1,table2,... before I try again."

Thanks Adrian. I'm now testing a restore without --data-only. All I'm doing prior to the above pg_restore command is "createdb -T template0 newdb". It's possible, though, that I'm missing something here; the whole thing is way more complicated than I expected...

Dimitris
Re: Performance issues during pg_restore -j with big partitioned table
On Wed, 2 Apr 2025, Adrian Klaver wrote:

On 4/2/25 10:39 AM, Adrian Klaver wrote:

--clean will drop the object entirely, not TRUNCATE it. I'm guessing that this is being done by you per:

https://www.postgresql.org/message-id/53760c70-4a87-a453-9e02-57abc9cb2e54%40gmx.net

"After each failed attempt, I need to issue a TRUNCATE table1,table2,... before I try again."

Oops, forgot to engage brain. From pg_backup_archiver.c:

 * In parallel restore, if we created the table earlier in
 * this run (so that we know it is empty) and we are not
 * restoring a load-via-partition-root data item then we
 * wrap the COPY in a transaction and precede it with a
 * TRUNCATE. If wal_level is set to minimal this prevents
 * WAL-logging the COPY. This obtains a speedup similar
 * to that from using single_txn mode in non-parallel
 * restores.

This makes sense: it creates the table earlier in the run, and then truncates it just before copying data into it. I wonder if this really circumvents the WAL, since I don't have --single-transaction (incompatible with -j).

Dimitris
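As I understand it, the WAL skip does not depend on --single-transaction at all: with wal_level = minimal, a COPY into a table that was created or truncated in the same transaction can skip WAL entirely (the relation is synced to disk at commit instead). The per-table pattern pg_restore emits is therefore roughly the following sketch, with the table name borrowed from the thread:

    BEGIN;
    TRUNCATE TABLE public.test_runs_raw__part_max10120k;   -- table was created earlier in this run, so it is empty
    COPY public.test_runs_raw__part_max10120k FROM stdin;  -- may skip WAL under wal_level = minimal
    ...
    \.
    COMMIT;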
Re: Performance issues during pg_restore -j with big partitioned table
For the record, I haven't seen this deadlock again. I guess it was a bug on the master branch that got fixed, since I've been testing later versions a few times.

Dimitris

On Fri, 4 Apr 2025, Dimitrios Apostolou wrote:

On Wed, 2 Apr 2025, Dimitrios Apostolou wrote:

Hello list. My database includes one table with 1000 partitions, all of them rather

I was not clear here: my database dump has all that, and the database is brand new and empty.

sizeable. I run:

pg_restore -j12 --no-tablespaces --disable-triggers --exit-on-error --no-owner --no-privileges -n public -d newdb custom_format_dump.pgdump

Right now, after 24h of restore, I notice weird behaviour, so I have several questions about it:

+ 11 postgres backend processes are sleeping as "TRUNCATE TABLE waiting". I see that they are waiting to issue a TRUNCATE for one of the partitions and then COPY data to it. Checking the log I see that several partitions have already finished copying, but many more are left to start. Why is a TRUNCATE needed at the start of a partition's COPY phase? I didn't pass --clean on the command line (I don't need it as my database is newly created), and I don't see a related TRUNCATE mentioned in the pg_restore manual.

+ 1 postgres backend process is doing:

ALTER TABLE the_master_partitioned_table ADD CONSTRAINT ... FOREIGN KEY (columnX) REFERENCES another_table(columnX)

According to my logs this started right after the COPY of another_table's data finished. And apparently it holds a lock on the_master_partitioned_table that all the other TRUNCATEs have to wait for. Is this a bug in the dependency resolution? Wouldn't it make sense for this to wait until all 1000 partitions have finished their COPY DATA phase?

Trying again, pg_restore exited with an error after almost 24h:

pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 8904; 2606 16529 CONSTRAINT test_runs_raw test_runs_raw_partitioned_pkey
pg_restore: error: could not execute query: ERROR: deadlock detected
DETAIL: Process 465409 waits for AccessExclusiveLock on relation 44437 of database 44090; blocked by process 465408.
Process 465408 waits for AccessShareLock on relation 44383 of database 44090; blocked by process 465409.
HINT: See server log for query details.

From the logs I see that:

+ Process 465409 waits for AccessExclusiveLock on relation 44437 of database 44090; blocked by process 465408.
  --> 44437 is test_runs_raw__part_max10120k (a single partition)
+ Process 465408 waits for AccessShareLock on relation 44383 of database 44090; blocked by process 465409.
  --> 44383 is test_runs_raw (the master partitioned table)

Process 465409: ALTER TABLE ONLY public.test_runs_raw ADD CONSTRAINT test_runs_raw_partitioned_pkey PRIMARY KEY (workitem_n, run_n);
Process 465408: COPY public.test_runs_raw__part_max10120k(...) FROM stdin;

Bug? This happened on a postgres compiled from last week's master branch. The dump I'm trying to restore is from postgres 17.4.

Thanks,
Dimitris
Re: having temp_tablespaces on less reliable storage
On Thu, 10 Jul 2025, Dimitrios Apostolou wrote:

Hello list,

I have a database split across many tablespaces, with temp_tablespaces pointing to a separate, less reliable device (a single local NVMe drive). How big is the risk that the cluster becomes unrecoverable after a crash?

If the drive goes down and the database can't read or write temp_tablespaces, what will happen? If I then configure temp_tablespaces to point to a working location, would that be enough to start the cluster, or could other bad things happen?

I can't find any related documentation, but I expect that losing "temp" space is of minor importance.

David G. Johnston wrote:

You might want to try finding some old discussions about why putting temp tablespace on a RAM-drive is not a supported configuration.

Thank you, I found the following:

[1] https://www.postgresql.org/docs/current/manage-ag-tablespaces.html
[2] https://www.postgresql.org/message-id/flat/ZR0P278MB0028A89FAA3E31E7F1514EF6D2F60%40ZR0P278MB0028.CHEP278.PROD.OUTLOOK.COM
[3] https://www.dbi-services.com/blog/can-i-put-my-temporary-tablespaces-on-a-ram-disk-with-postgresql/

At [1] is the standard documentation warning about tablespaces in general: "if you lose a tablespace (file deletion, disk failure, etc.), the database cluster might become unreadable or unable to start". I believe this could be improved, especially with regard to temp_tablespaces.

At [2] is a thread started by Daniel Westermann (CC'd) with lots of uncertainty in the air. Tom Lane (CC'd) mentions that as long as the files are temporary (not supposed to be there after a restart) it should be fine, but there might be additional issues if the directory disappears after a restart.

At [3] is a blog post by Daniel, who started the previous thread. He removes the directories, restarts the cluster, and things go OK.

I'm leaning towards doing it, i.e. creating a tablespace on the super-fast local SSD and using it exclusively for temp_tablespaces. The queries my database faces crunch TBs of data for many hours and write tons of temporary data, and the local NVMe storage is a huge improvement over the enterprise-storage volumes the VM is provided with (I believe they are iSCSI-based underneath, bound to network latency).

What if the NVMe drive fails? The good scenario is that I create a new tablespace at a new location and change temp_tablespaces to point there, and everything is fine, possibly without even a cluster restart. The very bad scenario is that the cluster crashes and needs a restart, but that goes sideways and eventually requires a restore from backup or other hacks. How likely is that?

Thanks,
Dimitris
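The setup described here would look roughly like this (made-up tablespace names and paths); since temp_tablespaces can be changed with a plain configuration reload, repointing it after a failure of the fast drive should not need a restart, although whether the cluster itself survives such a failure is exactly the open question of this thread:

    CREATE TABLESPACE fast_temp LOCATION '/mnt/nvme/pg_temp';
    ALTER SYSTEM SET temp_tablespaces = 'fast_temp';
    SELECT pg_reload_conf();

    -- if the NVMe drive dies, point temporary files somewhere that still works:
    CREATE TABLESPACE fallback_temp LOCATION '/var/lib/pgsql/tmp_tblspc';
    ALTER SYSTEM SET temp_tablespaces = 'fallback_temp';
    SELECT pg_reload_conf();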
having temp_tablespaces on less reliable storage
Hello list,

I have a database split across many tablespaces, with temp_tablespaces pointing to a separate, less reliable device (a single local NVMe drive). How big is the risk that the cluster becomes unrecoverable after a crash?

If the drive goes down and the database can't read or write temp_tablespaces, what will happen? If I then configure temp_tablespaces to point to a working location, would that be enough to start the cluster, or could other bad things happen?

I can't find any related documentation, but I expect that losing "temp" space is of minor importance.

Thanks in advance,
Dimitris