VACUUM FULL, power failure results in unrecoverable space
Hello,

I encountered an issue while attempting to reclaim space from a heavily bloated table. Initial analysis using https://github.com/ioguix/pgsql-bloat-estimation/blob/master/table/table_bloat.sql indicated approximately 600GB of bloat in the table. I initiated a VACUUM FULL operation to reclaim this space, but during execution the server experienced a power failure.

After server recovery:

1. The database came back online successfully.
2. The space used by the in-progress table rebuild was not freed.
3. The newly wasted space (bloat + partial VACUUM FULL) is not shown by the above query; it only reports the initial bloat.

Is there a way to reclaim this space without resorting to a full pg_dump and pg_restore cycle?

Thanks.

Best,
Pierre
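A hedged sketch of one way to investigate this situation: compare the relfilenodes PostgreSQL still references with the files actually present in the database directory, since an interrupted VACUUM FULL can leave behind files for the transient relation it was building. The database name and paths below are placeholders, and any deletion decision would need careful verification first.

```bash
# Sketch only (verify before deleting anything): compare the relation files
# PostgreSQL still references with the files actually present on disk.
# "mydb" and $PGDATA are placeholders for the affected database and data dir.
DBOID=$(psql -Atc "SELECT oid FROM pg_database WHERE datname = 'mydb'")

# Relfilenodes the catalog knows about, one per line.
psql -d mydb -Atc "SELECT pg_relation_filenode(oid) FROM pg_class
                   WHERE pg_relation_filenode(oid) IS NOT NULL" \
  | sort > /tmp/known

# Base filenodes present in the database directory (strip fork/segment suffixes).
ls "$PGDATA/base/$DBOID" \
  | sed -E 's/_(fsm|vm|init)$//; s/\.[0-9]+$//' \
  | grep -E '^[0-9]+$' | sort -u > /tmp/ondisk

# Filenodes on disk that the catalog no longer references: candidates for
# the space left behind by the interrupted VACUUM FULL.
comm -13 /tmp/known /tmp/ondisk
```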
Re: Preallocation changes in Postgresql 16
Hello,

It seems that I am running into this issue as well. Is it likely that this would ever be a config option?

Best,
Pierre Barre

On Fri, May 3, 2024, at 05:11, Riku Iki wrote:
> I did the testing and confirmed that this was the issue.
>
> I run following query:
>
> create table t as select '1234567890' from generate_series(1, 10);
>
> I commented if (numblocks > 8) codeblock, and see the following results from
> "compsize /dbdir/" command.
>
> Before my changes:
>
> Processed 1381 files, 90007 regular extents (90010 refs), 15 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL       97%          41G          42G          42G
> none       100%          41G          41G          41G
> zstd        14%         157M         1.0G         1.0G
> prealloc   100%          16M          16M          16M
>
> After the changes:
>
> Processed 1381 files, 347328 regular extents (347331 refs), 15 inline.
> Type       Perc     Disk Usage   Uncompressed Referenced
> TOTAL        3%         1.4G          42G          42G
> none       100%          80K          80K          80K
> zstd         3%         1.4G          42G          42G
>
> It is clearly visible that files created with fallocate are not compressed,
> and disk usage is much larger.
> I am wondering if there is a way to have some feature request to have this
> parameter user configurable..
>
> On Fri, Apr 26, 2024 at 4:15 PM Riku Iki wrote:
>> Thank you, I have such a system. I think my task would be to compile PG from
>> sources (need to learn this), and see how it works with and without that
>> code block.
>>
>> On Thu, Apr 25, 2024 at 2:25 PM Thomas Munro wrote:
>>> On Fri, Apr 26, 2024 at 4:37 AM Riku Iki wrote:
>>> > I am wondering if there were preallocation related changes in PG16, and
>>> > if it is possible to disable preallocation in PostgreSQL 16?
>>>
>>> I have no opinion on the btrfs details, but I was wondering if someone
>>> might show up with a system that doesn't like that change. Here is a
>>> magic 8, tuned on "some filesystems":
>>>
>>> /*
>>>  * If available and useful, use posix_fallocate() (via
>>>  * FileFallocate()) to extend the relation. That's often more
>>>  * efficient than using write(), as it commonly won't cause the kernel
>>>  * to allocate page cache space for the extended pages.
>>>  *
>>>  * However, we don't use FileFallocate() for small extensions, as it
>>>  * defeats delayed allocation on some filesystems. Not clear where
>>>  * that decision should be made though? For now just use a cutoff of
>>>  * 8, anything between 4 and 8 worked OK in some local testing.
>>>  */
>>> if (numblocks > 8)
>>>
>>> I wonder if it wants to be a GUC.
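For anyone who wants to check the same effect on their own btrfs-backed cluster, here is a small hedged sketch that inspects the compression of a single table's relation files with the same compsize tool used above; the database name, table name, and $PGDATA path are placeholders.

```bash
# Sketch: check btrfs compression for one table's underlying files,
# using the same compsize tool as in the thread. "example" and "t" are
# placeholders for the database and table being inspected.
REL=$(psql -d example -Atc "SELECT pg_relation_filepath('t')")
# pg_relation_filepath() returns a path relative to the data directory,
# e.g. base/16384/16385; segment and fork files share that prefix.
compsize "$PGDATA/$REL"*
```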
Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Hi Laurenz,

> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.

Indeed, but that doesn't have to happen all at once. I think we need to gain confidence in the whole system gradually: start with throwaway workloads (e.g., persistent volumes in CI), then move to data we can afford to lose, then backups, and finally production data.

>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...

I've had my fair share of major annoyances with NFS too! I think the bad experiences mostly come down to a combination of bad hardware, bad NFS server implementations, and the kernel treating NFS much like a "local" filesystem in terms of failure behavior, so when it misbehaves, everything goes down with it. But the protocols themselves are not inherently bad; they are actually quite elegant, and NFSv3 gives you just about everything you need to get (very close to) POSIX compliance.

The NFS server implementation in ZeroFS passes all 8,662 tests in https://github.com/Barre/pjdfstest_nfs:

https://github.com/Barre/ZeroFS/actions/runs/16367571315/job/46248240251#step:11:9376

For database workloads specifically, users will probably prefer running something like ZFS on top of the NBD server rather than using NFS directly.

Best,
Pierre

On Fri, Jul 18, 2025, at 06:40, Laurenz Albe wrote:
> On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:
>> Looking forward to your feedback and questions!
>
> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.
>
> Performance is great, but it is not everything. If things
> perform surprisingly well, people become suspicious.
>
>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...
>
> Yours,
> Laurenz Albe
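As a hedged illustration of how such a POSIX-compliance run can be reproduced: pjdfstest is built once and then driven with prove from inside the filesystem under test. The mount target, export path, and NFS options below are assumptions rather than documented ZeroFS defaults, and upstream pjdfstest is used here instead of the NFS-oriented fork linked above.

```bash
# Sketch: run pjdfstest against an NFSv3 export mounted locally.
# Server address, export path, and mount options are assumptions.
sudo mkdir -p /mnt/zerofs
sudo mount -t nfs -o vers=3,tcp 127.0.0.1:/ /mnt/zerofs

# Build pjdfstest once, then run it from within the filesystem being tested.
git clone https://github.com/pjd/pjdfstest ~/pjdfstest
cd ~/pjdfstest && autoreconf -ifs && ./configure && make pjdfstest
cd /mnt/zerofs && sudo prove -rv ~/pjdfstest/tests
```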
PostgreSQL on S3-backed Block Storage with Near-Local Performance
Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run on S3 storage while maintaining performance comparable to local NVMe. The approach uses block-level access rather than trying to map filesystem operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities (L2ARC), we can achieve microsecond latencies despite the underlying storage being in S3.

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 500/500
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 500/500
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data comes from S3.

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1

                          PostgreSQL Client
                                 |
                                 | SQL queries
                                 |
                          +--------------+
                          |   PG Proxy   |
                          |  (HAProxy/   |
                          |  PgBouncer)  |
                          +--------------+
                             /        \
                            /          \
                  Synchronous          Synchronous
                  Replication          Replication
                          /              \
          +----------------+      +----------------+
          |  PostgreSQL 1  |      |  PostgreSQL 2  |
          |   (Primary)    |◄----►|   (Standby)    |
          +----------------+      +----------------+
                  |                        |
                  | POSIX filesystem ops   |
                  |                        |
          +----------------+      +----------------+
          |   ZFS Pool 1   |      |   ZFS Pool 2   |
          | (3-way mirror) |      | (3-way mirror) |
          +----------------+      +----------------+
            /      |      \          /      |      \
     NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
          |        |        |         |        |        |
     +--------++--------++--------++--------++--------++--------+
     |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
     +--------++--------++--------++--------++--------++--------+
          |        |        |         |        |        |
     S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
      (us-east)  (eu-west)  (ap-south) (us-west) (eu-north) (ap-east)

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
               \                          /
                \                        /
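To make the stack concrete, here is a hedged sketch of how the layers might be assembled by hand, simplified to a single NBD device instead of the 3-way mirrors shown in the diagram. The zerofs command line is purely hypothetical (its real flags are not given in this thread); the nbd-client, zpool/zfs, and initdb/pg_ctl invocations are standard tools, with device names, the port, and the cache partition chosen as assumptions.

```bash
# Illustrative only: PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3.
# The zerofs command line below is a placeholder, not the project's real CLI.
zerofs --s3-bucket my-bucket --nbd-listen 127.0.0.1:10809 &   # hypothetical

# Attach the NBD export as a local block device (standard nbd-client usage).
sudo modprobe nbd
sudo nbd-client 127.0.0.1 10809 /dev/nbd0

# Build a ZFS pool on the NBD device and add a local NVMe partition as L2ARC.
sudo zpool create -o ashift=12 -O compression=lz4 tank /dev/nbd0
sudo zpool add tank cache /dev/nvme0n1p4
sudo zfs create -o mountpoint=/var/lib/postgresql tank/pgdata

# Initialize and start PostgreSQL on the ZFS dataset.
sudo -u postgres initdb -D /var/lib/postgresql/data
sudo -u postgres pg_ctl -D /var/lib/postgresql/data -l logfile start
```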
Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
> The interesting thing is, a few searches about the performance return mostly
> negative impressions about their object storage in comparison to the original
> S3.

I think they had a rough start, but it's quite good now from what I've experienced. It's also dirt-cheap, and they don't bill for operations, so if you run ZeroFS on top of it you only pay for raw storage at €4.99 a month. Combine that with their dirt-cheap dedicated servers (https://www.hetzner.com/dedicated-rootserver/matrix-ax/) and you can have a multi-terabyte Postgres database for less than €50 a month.

I'm dreaming of running https://www.merklemap.com/ on such a setup, but it's too early yet :)

> Finding out what kind of performance your benchmarks would yield on a pure
> AWS setting would be interesting. I am not asking you to do that, but you may
> get even better performance in that case :)

Yes, I need to try that!

Best,
Pierre

On Fri, Jul 18, 2025, at 14:55, Seref Arikan wrote:
> Thanks, I learned something else: I didn't know Hetzner offered S3 compatible
> storage.
>
> The interesting thing is, a few searches about the performance return mostly
> negative impressions about their object storage in comparison to the original
> S3.
>
> Finding out what kind of performance your benchmarks would yield on a pure
> AWS setting would be interesting. I am not asking you to do that, but you may
> get even better performance in that case :)
>
> Cheers,
> Seref
>
>
> On Fri, Jul 18, 2025 at 11:58 AM Pierre Barre wrote:
>> Hi Seref,
>>
>> For the benchmarks, I used Hetzner's cloud service with the following setup:
>>
>> - A Hetzner s3 bucket in the FSN1 region
>> - A virtual machine of type ccx63 48 vCPU 192 GB memory
>> - 3 ZeroFS nbd devices (same s3 bucket)
>> - A ZFS striped pool with the 3 devices
>> - 200GB zfs L2ARC
>> - Postgres configured accordingly memory-wise as well as with
>> synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>
>> Best,
>> Pierre
>>
>> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>> Sorry, this was meant to go to the whole group:
>>>
>>> Very interesting! Great work. Can you clarify how exactly you're running
>>> postgres in your tests? A specific AWS service? What's the test
>>> infrastructure that sits above the file system?
>>>
>>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre wrote:
>>>> Hi everyone,
>>>>
>>>> I wanted to share a project I've been working on that enables PostgreSQL
>>>> to run on S3 storage while maintaining performance comparable to local
>>>> NVMe. The approach uses block-level access rather than trying to map
>>>> filesystem operations to S3 objects.
>>>>
>>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>>
>>>> # The Architecture
>>>>
>>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage
>>>> as raw block devices. PostgreSQL runs unmodified on ZFS pools built on
>>>> these block devices:
>>>>
>>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>>
>>>> By providing block-level access and leveraging ZFS's caching capabilities
>>>> (L2ARC), we can achieve microsecond latencies despite the underlying
>>>> storage being in S3.
>>>>
>>>> ## Performance Results
>>>>
>>>> Here are pgbench results from PostgreSQL running on this setup:
>>>>
>>>> ### Read/Write Workload
>>>>
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
>>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> starting vacuum...end.
>>>> transaction type: <builtin: TPC-B (sort of)>
>>>> scaling factor: 50
>>>> query mode: simple
>>>> number of clients: 50
>>>> number of threads: 15
>>>> maximum number of tries: 1
>>>> number of transactions per client: 10
>>>> number of transactions actually processed: 500/500
>>>> number of failed transactions: 0 (0.000%)
>>>> latency average = 0.943 ms
>>>> initial connection time = 48.043 ms
>>>> tps = 53041.006947 (without initial connection time)
>>>> ```
>>>>
>>>> ### Read-Only Workload
>>>>
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
>>>> pgben
Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner s3 bucket in the FSN1 region
- A virtual machine of type ccx63 48 vCPU 192 GB memory
- 3 ZeroFS nbd devices (same s3 bucket)
- A ZFS striped pool with the 3 devices
- 200GB zfs L2ARC
- Postgres configured accordingly memory-wise as well as with synchronous_commit = off, wal_init_zero = off and wal_recycle = off.

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> Sorry, this was meant to go to the whole group:
>
> Very interesting! Great work. Can you clarify how exactly you're running
> postgres in your tests? A specific AWS service? What's the test
> infrastructure that sits above the file system?
>
> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre wrote:
>> Hi everyone,
>>
>> I wanted to share a project I've been working on that enables PostgreSQL to
>> run on S3 storage while maintaining performance comparable to local NVMe.
>> The approach uses block-level access rather than trying to map filesystem
>> operations to S3 objects.
>>
>> ZeroFS: https://github.com/Barre/ZeroFS
>>
>> # The Architecture
>>
>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as
>> raw block devices. PostgreSQL runs unmodified on ZFS pools built on these
>> block devices:
>>
>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>
>> By providing block-level access and leveraging ZFS's caching capabilities
>> (L2ARC), we can achieve microsecond latencies despite the underlying storage
>> being in S3.
>>
>> ## Performance Results
>>
>> Here are pgbench results from PostgreSQL running on this setup:
>>
>> ### Read/Write Workload
>>
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: TPC-B (sort of)>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 10
>> number of transactions actually processed: 500/500
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.943 ms
>> initial connection time = 48.043 ms
>> tps = 53041.006947 (without initial connection time)
>> ```
>>
>> ### Read-Only Workload
>>
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: <builtin: select only>
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 10
>> number of transactions actually processed: 500/500
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.121 ms
>> initial connection time = 53.358 ms
>> tps = 413436.248089 (without initial connection time)
>> ```
>>
>> These numbers are with 50 concurrent clients and the actual data stored in
>> S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold
>> data comes from S3.
>>
>> ## How It Works
>>
>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use
>> like any other block device
>> 2. Multiple cache layers hide S3 latency:
>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>    b. ZeroFS memory cache for metadata and hot data
>>    c. Optional local disk cache
>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>
>> ## Geo-Distributed PostgreSQL
>>
>> Since each region can run its own ZeroFS instance, you can create
>> geographically distributed PostgreSQL setups.
>>
>> Example architectures:
>>
>> Architecture 1
>>
>>                PostgreSQL Client
>>                       |
>>                       | SQL queries
>>                       |
>>                +--------------+
>>                |   PG Proxy   |
>>                |  (HAProxy/   |
>>                |  PgBouncer)  |
>>
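For reference, a hedged sketch of applying the non-default settings mentioned in the benchmark description via ALTER SYSTEM; the database name is a placeholder, and the memory-related tuning mentioned in the message is omitted because no concrete values are given.

```bash
# Sketch: the non-default settings from the benchmark description.
# wal_init_zero and wal_recycle are real GUCs (PostgreSQL 12+); turning them
# off avoids zero-filling and recycling WAL segments, which copy-on-write
# filesystems like ZFS make unnecessary. synchronous_commit = off trades
# durability for latency.
psql -d example -c "ALTER SYSTEM SET synchronous_commit = off;"
psql -d example -c "ALTER SYSTEM SET wal_init_zero = off;"
psql -d example -c "ALTER SYSTEM SET wal_recycle = off;"
psql -d example -c "SELECT pg_reload_conf();"
```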
Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance
Now, I'm trying to understand how the CAP theorem applies here. Traditional PostgreSQL replication has clear CAP trade-offs - you choose between consistency and availability during partitions. But when PostgreSQL instances share storage rather than replicate:

- Consistency seems maintained (same data)
- Availability seems maintained (a client can always promote an accessible node)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes maintaining independent state) without explicitly stating them. How should we think about the CAP theorem when distributed nodes share storage rather than coordinate state? Are the trade-offs simply moved to a different layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
         |                       |
         ↓    (partition here)   ↓
 PostgreSQL Primary      PostgreSQL Standby
         |                       |
         └───────────┬───────────┘
                     ↓
              Shared ZFS Pool
                     |
         6 Global ZeroFS instances

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
>
> For the benchmarks, I used Hetzner's cloud service with the following setup:
>
> - A Hetzner s3 bucket in the FSN1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200GB zfs L2ARC
> - Postgres configured accordingly memory-wise as well as with
> synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>
> Best,
> Pierre
>
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>>
>> Very interesting! Great work. Can you clarify how exactly you're running
>> postgres in your tests? A specific AWS service? What's the test
>> infrastructure that sits above the file system?
>>
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre wrote:
>>> Hi everyone,
>>>
>>> I wanted to share a project I've been working on that enables PostgreSQL to
>>> run on S3 storage while maintaining performance comparable to local NVMe.
>>> The approach uses block-level access rather than trying to map filesystem
>>> operations to S3 objects.
>>>
>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>
>>> # The Architecture
>>>
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage
>>> as raw block devices. PostgreSQL runs unmodified on ZFS pools built on
>>> these block devices:
>>>
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>
>>> By providing block-level access and leveraging ZFS's caching capabilities
>>> (L2ARC), we can achieve microsecond latencies despite the underlying
>>> storage being in S3.
>>>
>>> ## Performance Results
>>>
>>> Here are pgbench results from PostgreSQL running on this setup:
>>>
>>> ### Read/Write Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: TPC-B (sort of)>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 10
>>> number of transactions actually processed: 500/500
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>>
>>> ### Read-Only Workload
>>>
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: <builtin: select only>
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 10
>>> number of transactions actually processed: 500/500
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>>
>>> These numbers are