Re: Client-server communication for FETCH
On Wed, 2025-12-03 at 14:45 -0500, Tim Fors wrote:
> Do you have any further thoughts re: addressing this requirement
> (client-side cursor support)? Is this a requirement that the PGSQL
> community would consider implementing in libpq? Without it, it seems
> like a huge impediment to adoption of PGSQL for COBOL applications,
> and as you probably know, COBOL applications are still very
> pervasive. (C programs using ECPG would of course hit the same issue,
> so it's not just a COBOL-specific problem.)

Sounds to me like this ought to be implemented in ECPG and/or its COBOL equivalent, not in libpq? That's where the "Embedded SQL" is implemented after all, and also the equivalent of where Oracle implements its PREFETCH, I think? And I suppose nobody would be opposed to such improvements, if someone contributes them or pays for their development...

--
Jan Claeys
(please don't CC me when replying to the list)
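For illustration, here is a plain-SQL sketch of the batched fetching that an ECPG-level (or COBOL-precompiler-level) prefetch option would automate; the table and cursor names are invented for the example:

    BEGIN;
    DECLARE big_cur CURSOR FOR SELECT id, payload FROM big_table;
    -- a single round trip returns up to 100 rows instead of one
    FETCH FORWARD 100 FROM big_cur;
    FETCH FORWARD 100 FROM big_cur;
    CLOSE big_cur;
    COMMIT;

A prefetch setting in the precompiler would presumably issue batched fetches like these behind the scenes while the application keeps fetching one row at a time.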
Seeking guidance on extremely slow pg_restore despite strong I/O performance
Hello,

I would greatly appreciate your insight into an issue where pg_restore runs significantly slower than expected, even though the underlying storage shows very high random write throughput. I am trying to understand which PostgreSQL mechanisms or system layers I should investigate next in order to pinpoint the bottleneck and improve restore performance.

The central question is: What should I examine further to understand why checkpoint processing becomes the dominant bottleneck during restore, despite fsync=off, synchronous_commit=off, and excellent random write latency?

Below is a detailed description of the environment, the behavior observed, the steps I have already taken, and the research performed so far.

During pg_restore, execution time remains extremely long: around 2+ hours using a custom-format dump and over 4 hours using directory format. The machine consistently demonstrates high random write performance (median latency ~5 ms, ~45k random write IOPS), yet PostgreSQL logs show very long checkpoints in which the write phase dominates (hundreds to thousands of seconds). Checkpoints appear to stall the entire restore process.

I have tested multiple combinations of dump formats (custom and directory) and parallel jobs (j = 1, 12, 18). The restore duration barely changes. This strongly suggests that the bottleneck is not client-side parallelism but internal server behavior, specifically the checkpoint write phase.

Example log excerpts show checkpoint write times consistently in the range of 600–1100 seconds, with large numbers of buffers written (from hundreds of thousands to over 1.6 million). Sync times remain negligible because fsync is disabled, reinforcing the suspicion that PostgreSQL's internal buffer flushing and write throttling mechanisms are the source of the slowdown, not WAL or filesystem sync.

Given that:

* storage is fast,
* fsync and synchronous commits are disabled,
* full_page_writes is off,
* wal_level is minimal,
* autovacuum is off,
* the restore is the only workload,

I am trying to determine what further PostgreSQL internals or Linux I/O mechanisms may explain why these checkpoints are taking orders of magnitude longer than the device's raw write characteristics would suggest.

I am particularly looking for guidance on:

* whether backend or checkpointer write throttling may still be limiting write concurrency even during bulk restore (see the query sketch at the end of this message),
* whether XFS on Hyper-V VHDX + LVM + a battery-backed SSD array could introduce serialization that is invisible to raw I/O tests,
* whether certain parameters (e.g., effective_io_concurrency, maintenance_io_concurrency, wal_writer settings, combine limits, io_uring behavior) could unintentionally reduce write throughput,
* whether parallel pg_restore is inherently constrained by global buffer flushing behavior,
* any other PostgreSQL mechanisms that could cause prolonged checkpoint write durations even with crash safety disabled.

Below are the configuration values and environment details referenced above.
Machine:
Hyper-V VM
24 vCPU
80 GB RAM
Ubuntu 24.04.3 (kernel 6.8.0-88)
PostgreSQL 18.1

Database size:
~700 GB across two tablespaces on separate disks (freshly restored)

Storage layout:
Each disk is its own VHDX
LVM on battery-backed SSD array
XFS for PGDATA
Barriers disabled

Random write performance (steady state):
Median latency: 5.1 ms
IOPS: ~45.6k

Restore tests:
pg_restore custom format: ~2h+
pg_restore directory format: ~4h+
Parallelism tested with j = 1, 12, 18, 24

Representative checkpoint log entries:
(write phases ranging 76–1079 seconds, buffer writes up to 1.6M)

postgresql.conf (relevant parts):
shared_buffers = 20GB
work_mem = 150MB
maintenance_work_mem = 8GB
effective_io_concurrency = 1
maintenance_io_concurrency = 1
io_max_combine_limit = 512kB
io_combine_limit = 1024kB
io_method = io_uring

fsync = off
synchronous_commit = off
wal_sync_method = fdatasync
full_page_writes = off
wal_compression = lz4

checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
max_wal_size = 80GB
min_wal_size = 10GB
effective_cache_size = 65GB
autovacuum = off
max_locks_per_transaction = 256

If anyone has encountered similar behavior or can recommend specific PostgreSQL subsystems, kernel settings, or I/O patterns worth investigating, I would be very grateful for advice. My main goal is to understand why checkpoint writes are so slow relative to the hardware's demonstrated capabilities, and how to safely accelerate the restore workflow.

Thank you in advance for any guidance.
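For reference, the checkpointer activity and the write-pacing settings asked about above can be inspected directly; a short psql sketch (pg_stat_checkpointer is available in PostgreSQL 17 and later, so it applies to this 18.1 instance):

    -- Cumulative checkpointer statistics (write/sync time, buffers written)
    SELECT * FROM pg_stat_checkpointer;

    -- Settings that pace or throttle how dirty buffers are pushed to the kernel
    SHOW checkpoint_completion_target;
    SHOW checkpoint_flush_after;
    SHOW bgwriter_flush_after;
    SHOW backend_flush_after;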
Re: Seeking guidance on extremely slow pg_restore despite strong I/O performance
If the dump was taken with pg_dump -Fd and pg_restore -j has no effect on restore time, that's a good clue.

You can start by testing deferred checkpoints: set max_wal_size = 1TB and checkpoint_timeout = 10h, and see how this affects the pg_restore (it should then be limited by WAL write throughput). Perhaps also increase wal_buffers to 128MB. The idea is to identify, by elimination, the write choke point before starting to tune for it.

Unrelated to your problem, but you should set the *_io_concurrency parameters to 200 since you use SSDs. pg_restore also rebuilds indices, so make sure the settings relevant to index building are set appropriately (see max_parallel_* and *_io_concurrency).

Kiriakos Georgiou

> On Dec 5, 2025, at 5:30 AM, MentionTheElephant - MentionTheElephant at
> gmail.com wrote:
>
> Hello,
>
> I would greatly appreciate your insight into an issue where pg_restore
> runs significantly slower than expected, even though the underlying
> storage shows very high random write throughput. I am trying to
> understand which PostgreSQL mechanisms or system layers I should
> investigate next in order to pinpoint the bottleneck and improve
> restore performance.
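For concreteness, a minimal psql sketch of the checkpoint-deferral test suggested in the reply above, expressed as ALTER SYSTEM commands (one possible way to apply the settings; editing postgresql.conf directly works just as well, and the 200 values follow the SSD suggestion):

    ALTER SYSTEM SET max_wal_size = '1TB';
    ALTER SYSTEM SET checkpoint_timeout = '10h';
    ALTER SYSTEM SET wal_buffers = '128MB';            -- only takes effect after a server restart
    ALTER SYSTEM SET effective_io_concurrency = 200;
    ALTER SYSTEM SET maintenance_io_concurrency = 200;
    SELECT pg_reload_conf();                           -- applies the settings that do not require a restart

After the restore test, the values can be reverted with ALTER SYSTEM RESET for each setting.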
