VACUUM FULL, power failure results in unrecoverable space

2024-12-02 Thread Pierre Barre
Hello,

I encountered an issue while attempting to reclaim space from a heavily bloated 
table:

Initial analysis using 
https://github.com/ioguix/pgsql-bloat-estimation/blob/master/table/table_bloat.sql
 indicated approximately 600GB of bloat in the table.

I initiated a VACUUM FULL operation to reclaim this space, but during 
execution, the server experienced a power failure.


After server recovery:

1. The database came back online successfully
2. The space used by the in-progress table reconstruction was not freed
3. The newly wasted space (the original bloat plus the partial VACUUM FULL 
copy) is not reported by the query above; it still shows only the initial bloat.


Is there a way to reclaim this space without resorting to a full pg_dump and 
pg_restore cycle? 
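
For reference, here is a sketch of the kind of catalog query that should surface 
relation files with no matching pg_class entry (the likely leftovers of the 
interrupted VACUUM FULL). It assumes superuser (or pg_read_server_files), and I 
would of course verify every reported file before touching anything on disk:

```
-- List files in the current database's directory whose filenode does not
-- belong to any relation known to the catalogs.
SELECT f.file
FROM pg_ls_dir('base/' || (SELECT oid
                           FROM pg_database
                           WHERE datname = current_database())::text) AS f(file)
WHERE f.file ~ '^[0-9]+'   -- only numeric relation files (skips PG_VERSION etc.)
  AND regexp_replace(f.file, '[._].*$', '')::oid NOT IN (
        -- strip _fsm/_vm and .N segment suffixes, then compare filenodes
        SELECT pg_relation_filenode(c.oid)
        FROM pg_class c
        WHERE pg_relation_filenode(c.oid) IS NOT NULL);
```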

Thanks.

Best,
Pierre

Re: Preallocation changes in PostgreSQL 16

2024-12-26 Thread Pierre Barre
Hello,

It seems that I am running into this issue as well. 
Is it likely that this would ever be a config option?

Best,
Pierre Barre

On Fri, May 3, 2024, at 05:11, Riku Iki wrote:
> I did the testing and confirmed that this was the issue.
> 
> I ran the following query:
> 
>  create table t as select '1234567890' from generate_series(1, 10);
> 
> I commented out the if (numblocks > 8) code block and saw the following 
> results from the "compsize /dbdir/" command.
> 
> 
> Before my changes:
> 
> Processed 1381 files, 90007 regular extents (90010 refs), 15 inline.
> Type       Perc    Disk Usage   Uncompressed   Referenced
> TOTAL       97%        41G           42G           42G
> none       100%        41G           41G           41G
> zstd        14%       157M          1.0G          1.0G
> prealloc   100%        16M           16M           16M
> 
> 
> 
> After the changes:
> 
> Processed 1381 files, 347328 regular extents (347331 refs), 15 inline.
> Type       Perc    Disk Usage   Uncompressed   Referenced
> TOTAL        3%       1.4G           42G           42G
> none       100%        80K           80K           80K
> zstd         3%       1.4G           42G           42G
> 
> It is clearly visible that files created with fallocate are not compressed, 
> and disk usage is much larger.
> I am wondering if there is a way to file a feature request to make this 
> parameter user-configurable.
> 
> On Fri, Apr 26, 2024 at 4:15 PM Riku Iki  wrote:
>> Thank you, I have such a system. I think my task would be to compile PG from 
>> source (I need to learn this) and see how it works with and without that code 
>> block.
>> 
>> On Thu, Apr 25, 2024 at 2:25 PM Thomas Munro  wrote:
>>> On Fri, Apr 26, 2024 at 4:37 AM Riku Iki  wrote:
>>> > I am wondering if there were preallocation related changes in PG16, and 
>>> > if it is possible to disable preallocation in PostgreSQL 16?
>>> 
>>> I have no opinion on the btrfs details, but I was wondering if someone
>>> might show up with a system that doesn't like that change.  Here is a
>>> magic 8, tuned on "some filesystems":
>>> 
>>> /*
>>>  * If available and useful, use posix_fallocate() (via
>>>  * FileFallocate()) to extend the relation. That's often more
>>>  * efficient than using write(), as it commonly won't cause the 
>>> kernel
>>>  * to allocate page cache space for the extended pages.
>>>  *
>>>  * However, we don't use FileFallocate() for small extensions, as it
>>>  * defeats delayed allocation on some filesystems. Not clear where
>>>  * that decision should be made though? For now just use a cutoff of
>>>  * 8, anything between 4 and 8 worked OK in some local testing.
>>>  */
>>> if (numblocks > 8)
>>> 
>>> I wonder if it wants to be a GUC.


Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

2025-07-18 Thread Pierre Barre
Hi Laurenz,

> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.

Indeed, but that doesn't have to be "sudden." I think we need to gain 
confidence in the whole system gradually, starting with throwaway workloads 
(e.g., persistent volumes in CI), then moving to data we can afford to lose, 
then backups, and finally production data.

>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...

I've had my fair share of major annoyances with NFS too!

I think bad experiences with NFS mostly come down to a combination of bad 
hardware, bad NFS server implementations, and the fact that the kernel treats 
an NFS mount much like a "local" filesystem in terms of failure behavior.

So when something misbehaves, everything goes down with it.

But the protocols themselves are not inherently bad—they are actually quite 
elegant. NFSv3 is just what you need to reach (very close to) POSIX compliance. 
The NFS server implementation in ZeroFS passes all 8,662 tests in 
https://github.com/Barre/pjdfstest_nfs.

https://github.com/Barre/ZeroFS/actions/runs/16367571315/job/46248240251#step:11:9376

For database workloads specifically, users will probably prefer running 
something like ZFS on top of the NBD server rather than using NFS directly.

Best,
Pierre

On Fri, Jul 18, 2025, at 06:40, Laurenz Albe wrote:
> On Fri, 2025-07-18 at 00:57 +0200, Pierre Barre wrote:
>> Looking forward to your feedback and questions!
>
> I think the biggest hurdle you will have to overcome is to
> convince notoriously paranoid DBAs that this tall stack
> provides reliable service, honors fsync() etc.
>
> Performance is great, but it is not everything.  If things
> perform surprisingly well, people become suspicious.
>
>> P.S. The full project includes a custom NFS filesystem too.
>
> "NFS" is a key word that does not inspire confidence in
> PostgreSQL circles...
>
> Yours,
> Laurenz Albe




PostgreSQL on S3-backed Block Storage with Near-Local Performance

2025-07-17 Thread Pierre Barre
Hi everyone,

I wanted to share a project I've been working on that enables PostgreSQL to run 
on S3 storage while maintaining performance comparable to local NVMe. The 
approach uses block-level access rather than trying to map filesystem 
operations to S3 objects.

ZeroFS: https://github.com/Barre/ZeroFS

# The Architecture

ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as 
raw block devices. PostgreSQL runs unmodified on ZFS pools built on these block 
devices:

PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3

By providing block-level access and leveraging ZFS's caching capabilities 
(L2ARC), we can achieve microsecond latencies despite the underlying storage 
being in S3.

## Performance Results

Here are pgbench results from PostgreSQL running on this setup:

### Read/Write Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 500/500
number of failed transactions: 0 (0.000%)
latency average = 0.943 ms
initial connection time = 48.043 ms
tps = 53041.006947 (without initial connection time)
```

### Read-Only Workload

```
postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: simple
number of clients: 50
number of threads: 15
maximum number of tries: 1
number of transactions per client: 10
number of transactions actually processed: 500/500
number of failed transactions: 0 (0.000%)
latency average = 0.121 ms
initial connection time = 53.358 ms
tps = 413436.248089 (without initial connection time)
```

These numbers are with 50 concurrent clients and the actual data stored in S3. 
Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold data 
comes from S3.

## How It Works

1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use 
like any other block device
2. Multiple cache layers hide S3 latency:
   a. ZFS ARC/L2ARC for frequently accessed blocks
   b. ZeroFS memory cache for metadata and hot data
   c. Optional local disk cache
3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree

## Geo-Distributed PostgreSQL

Since each region can run its own ZeroFS instance, you can create 
geographically distributed PostgreSQL setups.

Example architectures:

Architecture 1


                    PostgreSQL Client
                           |
                           | SQL queries
                           |
                    +--------------+
                    |   PG Proxy   |
                    |  (HAProxy/   |
                    |  PgBouncer)  |
                    +--------------+
                       /        \
                      /          \
            Synchronous          Synchronous
            Replication          Replication
                    /              \
                   /                \
       +---------------+    +---------------+
       | PostgreSQL 1  |    | PostgreSQL 2  |
       |   (Primary)   |◄--►|   (Standby)   |
       +---------------+    +---------------+
               |                        |
               |  POSIX filesystem ops  |
               |                        |
       +---------------+    +---------------+
       |   ZFS Pool 1  |    |   ZFS Pool 2  |
       | (3-way mirror)|    | (3-way mirror)|
       +---------------+    +---------------+
          /    |    \          /    |    \
         /     |     \        /     |     \
   NBD:10809 NBD:10810 NBD:10811   NBD:10812 NBD:10813 NBD:10814
       |         |         |           |         |         |
  +--------++--------++--------+  +--------++--------++--------+
  |ZeroFS 1||ZeroFS 2||ZeroFS 3|  |ZeroFS 4||ZeroFS 5||ZeroFS 6|
  +--------++--------++--------+  +--------++--------++--------+
       |         |         |           |         |         |
  S3-Region1 S3-Region2 S3-Region3  S3-Region4 S3-Region5 S3-Region6
  (us-east)  (eu-west)  (ap-south)  (us-west)  (eu-north) (ap-east)
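
For what it's worth, the "Synchronous Replication" legs in Architecture 1 are 
just ordinary PostgreSQL streaming replication between the two nodes. A minimal 
sketch of the primary-side settings (the standby's application_name "pg2" is 
purely illustrative, taken from whatever is set in its primary_conninfo):

```
-- Make commits on PostgreSQL 1 wait for confirmation from the standby.
ALTER SYSTEM SET synchronous_standby_names = 'FIRST 1 (pg2)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
```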

Architecture 2:

PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
                \                     /
                 \                   /

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

2025-07-18 Thread Pierre Barre
> The interesting thing is, a few searches about the performance return mostly 
> negative impressions about their object storage in comparison to the original 
> S3. 

I think they had a rough start, but it's quite good now from what I've 
experienced. It's also dirt-cheap, and they don't bill for operations. So if 
you run ZeroFS on top of it, you only pay for raw storage at €4.99 a month.

Combine that with their dirt-cheap dedicated servers 
(https://www.hetzner.com/dedicated-rootserver/matrix-ax/) and you can have a 
multi-terabyte Postgres database for under €50 a month.

I'm dreaming of running https://www.merklemap.com/ on such a setup, but it's 
too early yet :)

> Finding out what kind of performance your benchmarks would yield on a pure 
> AWS setting would be interesting. I am not asking you to do that, but you may 
> get even better performance in that case :) 

Yes, I need to try that!

Best,
Pierre

On Fri, Jul 18, 2025, at 14:55, Seref Arikan wrote:
> Thanks, I learned something else: I didn't know Hetzner offered S3 compatible 
> storage. 
> 
> The interesting thing is, a few searches about the performance return mostly 
> negative impressions about their object storage in comparison to the original 
> S3. 
> 
> Finding out what kind of performance your benchmarks would yield on a pure 
> AWS setting would be interesting. I am not asking you to do that, but you may 
> get even better performance in that case :) 
> 
> Cheers,
> Seref
> 
> 
> On Fri, Jul 18, 2025 at 11:58 AM Pierre Barre  wrote:
>> __
>> Hi Seref,
>> 
>> For the benchmarks, I used Hetzner's cloud service with the following setup:
>> 
>> - A Hetzner s3 bucket in the FSN1 region
>> - A virtual machine of type ccx63 48 vCPU 192 GB memory
>> - 3 ZeroFS nbd devices (same s3 bucket)
>> - A ZFS striped pool with the 3 devices
>> - 200GB zfs L2ARC
>> - Postgres configured accordingly memory-wise as well as with 
>> synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>> 
>> Best,
>> Pierre
>> 
>> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>> Sorry, this was meant to go to the whole group:
>>> 
>>> Very interesting!. Great work. Can you clarify how exactly you're running 
>>> postgres in your tests? A specific AWS service? What's the test 
>>> infrastructure that sits above the file system?
>>> 
>>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre  wrote:
>>>> Hi everyone,
>>>> 
>>>> I wanted to share a project I've been working on that enables PostgreSQL 
>>>> to run on S3 storage while maintaining performance comparable to local 
>>>> NVMe. The approach uses block-level access rather than trying to map 
>>>> filesystem operations to S3 objects.
>>>> 
>>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>> 
>>>> # The Architecture
>>>> 
>>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage 
>>>> as raw block devices. PostgreSQL runs unmodified on ZFS pools built on 
>>>> these block devices:
>>>> 
>>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>> 
>>>> By providing block-level access and leveraging ZFS's caching capabilities 
>>>> (L2ARC), we can achieve microsecond latencies despite the underlying 
>>>> storage being in S3.
>>>> 
>>>> ## Performance Results
>>>> 
>>>> Here are pgbench results from PostgreSQL running on this setup:
>>>> 
>>>> ### Read/Write Workload
>>>> 
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
>>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> starting vacuum...end.
>>>> transaction type: 
>>>> scaling factor: 50
>>>> query mode: simple
>>>> number of clients: 50
>>>> number of threads: 15
>>>> maximum number of tries: 1
>>>> number of transactions per client: 10
>>>> number of transactions actually processed: 500/500
>>>> number of failed transactions: 0 (0.000%)
>>>> latency average = 0.943 ms
>>>> initial connection time = 48.043 ms
>>>> tps = 53041.006947 (without initial connection time)
>>>> ```
>>>> 
>>>> ### Read-Only Workload
>>>> 
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
>>>> pgben

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

2025-07-18 Thread Pierre Barre
Hi Seref,

For the benchmarks, I used Hetzner's cloud service with the following setup:

- A Hetzner S3 bucket in the FSN1 region
- A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
- 3 ZeroFS NBD devices (same S3 bucket)
- A ZFS striped pool across the 3 devices
- 200 GB of ZFS L2ARC
- Postgres configured accordingly memory-wise, as well as with 
synchronous_commit = off, wal_init_zero = off, and wal_recycle = off (see the 
sketch below).
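
In configuration terms, those last non-default settings amount to something 
like this (a sketch; apply them however you normally manage postgresql.conf):

```
-- synchronous_commit = off trades durability of the most recent commits
-- for latency; wal_init_zero/wal_recycle are off because they bring no
-- benefit on a copy-on-write filesystem like ZFS.
ALTER SYSTEM SET synchronous_commit = off;
ALTER SYSTEM SET wal_init_zero = off;
ALTER SYSTEM SET wal_recycle = off;
SELECT pg_reload_conf();
```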

Best,
Pierre

On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
> Sorry, this was meant to go to the whole group:
> 
> Very interesting!. Great work. Can you clarify how exactly you're running 
> postgres in your tests? A specific AWS service? What's the test 
> infrastructure that sits above the file system?
> 
> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre  wrote:
>> Hi everyone,
>> 
>> I wanted to share a project I've been working on that enables PostgreSQL to 
>> run on S3 storage while maintaining performance comparable to local NVMe. 
>> The approach uses block-level access rather than trying to map filesystem 
>> operations to S3 objects.
>> 
>> ZeroFS: https://github.com/Barre/ZeroFS
>> 
>> # The Architecture
>> 
>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage as 
>> raw block devices. PostgreSQL runs unmodified on ZFS pools built on these 
>> block devices:
>> 
>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>> 
>> By providing block-level access and leveraging ZFS's caching capabilities 
>> (L2ARC), we can achieve microsecond latencies despite the underlying storage 
>> being in S3.
>> 
>> ## Performance Results
>> 
>> Here are pgbench results from PostgreSQL running on this setup:
>> 
>> ### Read/Write Workload
>> 
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: 
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 10
>> number of transactions actually processed: 500/500
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.943 ms
>> initial connection time = 48.043 ms
>> tps = 53041.006947 (without initial connection time)
>> ```
>> 
>> ### Read-Only Workload
>> 
>> ```
>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>> starting vacuum...end.
>> transaction type: 
>> scaling factor: 50
>> query mode: simple
>> number of clients: 50
>> number of threads: 15
>> maximum number of tries: 1
>> number of transactions per client: 10
>> number of transactions actually processed: 500/500
>> number of failed transactions: 0 (0.000%)
>> latency average = 0.121 ms
>> initial connection time = 53.358 ms
>> tps = 413436.248089 (without initial connection time)
>> ```
>> 
>> These numbers are with 50 concurrent clients and the actual data stored in 
>> S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while cold 
>> data comes from S3.
>> 
>> ## How It Works
>> 
>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can use 
>> like any other block device
>> 2. Multiple cache layers hide S3 latency:
>>a. ZFS ARC/L2ARC for frequently accessed blocks
>>    b. ZeroFS memory cache for metadata and hot data
>>c. Optional local disk cache
>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>> 
>> ## Geo-Distributed PostgreSQL
>> 
>> Since each region can run its own ZeroFS instance, you can create 
>> geographically distributed PostgreSQL setups.
>> 
>> Example architectures:
>> 
>> Architecture 1
>> 
>> 
>>  PostgreSQL Client
>>|
>>| SQL queries
>>|
>> +--+
>> |  PG Proxy|
>> | (HAProxy/|
>> |  PgBouncer)  |
>>  

Re: PostgreSQL on S3-backed Block Storage with Near-Local Performance

2025-07-18 Thread Pierre Barre
Now I'm trying to understand how the CAP theorem applies here. Traditional 
PostgreSQL replication has clear CAP trade-offs: you choose between 
consistency and availability during partitions.

But when PostgreSQL instances share storage rather than replicate:
- Consistency seems maintained (same data)
- Availability seems maintained (a client can always promote an accessible node; see the sketch below)
- Partitions between PostgreSQL nodes don't prevent the system from functioning

It seems that CAP assumes specific implementation details (like nodes 
maintaining independent state) without explicitly stating them.

How should we think about CAP theorem when distributed nodes share storage 
rather than coordinate state? Are the trade-offs simply moved to a different 
layer, or does shared storage fundamentally change the analysis?

Client with awareness of both PostgreSQL nodes
        |                          |
        ↓ (partition here)         ↓
PostgreSQL Primary        PostgreSQL Standby
        |                          |
        └────────────┬─────────────┘
                     ↓
              Shared ZFS Pool
                     |
          6 Global ZeroFS instances
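
(To make the "promote an accessible node" part concrete: the client-driven 
failover I have in mind is just the standard promotion call, sketched below; 
nothing ZeroFS-specific.)

```
-- Run against the standby once the client decides the primary is unreachable.
-- Returns true when promotion has completed (PostgreSQL 12+).
SELECT pg_promote(wait => true, wait_seconds => 60);
```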

Best,
Pierre

On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
> Hi Seref,
> 
> For the benchmarks, I used Hetzner's cloud service with the following setup:
> 
> - A Hetzner s3 bucket in the FSN1 region
> - A virtual machine of type ccx63 48 vCPU 192 GB memory
> - 3 ZeroFS nbd devices (same s3 bucket)
> - A ZFS striped pool with the 3 devices
> - 200GB zfs L2ARC
> - Postgres configured accordingly memory-wise as well as with 
> synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
> 
> Best,
> Pierre
> 
> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>> Sorry, this was meant to go to the whole group:
>> 
>> Very interesting!. Great work. Can you clarify how exactly you're running 
>> postgres in your tests? A specific AWS service? What's the test 
>> infrastructure that sits above the file system?
>> 
>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre  wrote:
>>> Hi everyone,
>>> 
>>> I wanted to share a project I've been working on that enables PostgreSQL to 
>>> run on S3 storage while maintaining performance comparable to local NVMe. 
>>> The approach uses block-level access rather than trying to map filesystem 
>>> operations to S3 objects.
>>> 
>>> ZeroFS: https://github.com/Barre/ZeroFS
>>> 
>>> # The Architecture
>>> 
>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage 
>>> as raw block devices. PostgreSQL runs unmodified on ZFS pools built on 
>>> these block devices:
>>> 
>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>> 
>>> By providing block-level access and leveraging ZFS's caching capabilities 
>>> (L2ARC), we can achieve microsecond latencies despite the underlying 
>>> storage being in S3.
>>> 
>>> ## Performance Results
>>> 
>>> Here are pgbench results from PostgreSQL running on this setup:
>>> 
>>> ### Read/Write Workload
>>> 
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: 
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 10
>>> number of transactions actually processed: 500/500
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.943 ms
>>> initial connection time = 48.043 ms
>>> tps = 53041.006947 (without initial connection time)
>>> ```
>>> 
>>> ### Read-Only Workload
>>> 
>>> ```
>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 10 -S example
>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>> starting vacuum...end.
>>> transaction type: 
>>> scaling factor: 50
>>> query mode: simple
>>> number of clients: 50
>>> number of threads: 15
>>> maximum number of tries: 1
>>> number of transactions per client: 10
>>> number of transactions actually processed: 500/500
>>> number of failed transactions: 0 (0.000%)
>>> latency average = 0.121 ms
>>> initial connection time = 53.358 ms
>>> tps = 413436.248089 (without initial connection time)
>>> ```
>>> 
>>> These numbers are