Also, Neon [0] and Aurora [1] pricing is so high that it seems to make most use-cases impractical (well, if you want a managed offering...). Neon's top public tier doesn't even match what a single modern dedicated server (or virtual machine) can provide. I would have thought decoupling compute and storage would make the offerings cheaper, if anything.
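As a quick sanity check, the back-of-the-envelope arithmetic below can be written down as a small Python sketch (illustrative only: it assumes Neon storage pricing scales linearly past the published 500 GB Business tier, takes the Aurora and EC2+S3 monthly totals from the AWS calculator estimates cited as [3] and [4], uses a rough EUR to USD conversion for the Hetzner ccx63, and ignores R2 request costs):

```
# Rough cost comparison for a 30 TB PostgreSQL database, over 12 months.
# Assumptions: linear extrapolation of Neon's 500 GB Business tier,
# AWS monthly totals taken from the calculator estimates [3] and [4],
# and an approximate EUR->USD conversion for the Hetzner ccx63.

needed_gb = 30_000  # 30 TB

# Neon: Business plan is 500 GB -> $700/month, extrapolated linearly
neon_monthly = 700 * (needed_gb / 500)

# Aurora (db.r5.24xlarge), monthly total from the AWS calculator [3]
aurora_monthly = 21_887.28

# EC2 r5.24xlarge + S3 for 30 TB, monthly total from the AWS calculator [4]
ec2_s3_monthly = 5_555.04

# Hetzner ccx63 + Cloudflare R2 (30 TB at $0.015/GB-month)
hetzner_monthly = 338            # ~EUR 287.99 converted to USD
r2_monthly = needed_gb * 0.015   # R2 request costs ignored here
self_hosted_monthly = hetzner_monthly + r2_monthly

for name, monthly in [
    ("Neon (linear estimate)", neon_monthly),
    ("Aurora db.r5.24xlarge", aurora_monthly),
    ("EC2 r5.24xlarge + S3", ec2_s3_monthly),
    ("Hetzner ccx63 + R2", self_hosted_monthly),
]:
    print(f"{name:26s} ${monthly:>10,.2f}/month  ${monthly * 12:>12,.2f}/year")
```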
Taking my own Merklemap [2] use-case, where I run a 30TB database, with Neon pricing (and I don't doubt that the non-public pricing would be even more expensive than that):

Neon storage scaling:
- Business plan: 500 GB -> $700/month
- You need: 30,000 GB (30 TB)
- Scaling factor: 60x
- Linear estimate: $700 × 60 = $42,000/month
- Total 12 months cost: $504,000

Aurora calculation [3]:
- Instance type: db.r5.24xlarge
- Monthly cost: $21,887.28
- Total 12 months cost: $262,647.36

Now, calculating the same 30TB with the same instance type and S3 storage [4]:
- Instance type: r5.24xlarge
- Monthly cost: $5,555.04
- Total 12 months cost: $66,660.48

But more interestingly, you don't need to use AWS at all anymore, because you can just move your setup anywhere at this point, as you get a similar level of reliability - and simplicity - but with very cheap services.

Hetzner ccx63 + Cloudflare R2:
- Hetzner ccx63: €287.99/month ≈ $338/month
- R2 storage (30TB): 30,000 GB × $0.015 = $450/month
- R2 operations: would need to be measured to be priced properly, but will probably be negligible
- Total monthly: ~$788
- Total 12 months cost: ~$9,456/year

Best,
Pierre

[0] https://neon.com/pricing
[1] https://aws.amazon.com/rds/aurora/pricing/
[2] https://www.merklemap.com/
[3] https://calculator.aws/#/estimate?id=3f0ce6a91eed9a666d54bb8852ea00b042c3cd6e
[4] https://calculator.aws/#/estimate?id=1a77d8da3489bafc8681c6fd738a3186fb749ea3

On Sat, Jul 26, 2025, at 09:51, Pierre Barre wrote:
> Ah, by "shared storage" I mean that each node can acquire exclusivity, not
> that they can both R/W to it at the same time.
>
> > Some pretty well-known cases of storage / compute separation (Aurora, Neon)
> > also share the storage between instances,
>
> That model is cool, but I think it's more of a solution for outliers, as I was
> suggesting, not something that most would or should want.
>
> Best,
> Pierre
>
> On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
>> Sorry, I was referring to this:
>>
>> > But when PostgreSQL instances share storage rather than replicate:
>> > - Consistency seems maintained (same data)
>> > - Availability seems maintained (client can always promote an accessible
>> >   node)
>> > - Partitions between PostgreSQL nodes don't prevent the system from
>> >   functioning
>>
>> Some pretty well-known cases of storage / compute separation (Aurora, Neon)
>> also share the storage between instances, that's why I'm a bit confused by
>> your reply. I thought you were thinking about this approach too, that's why
>> I mentioned what kind of challenges one may have on that path.
>>
>> On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pie...@barre.sh> wrote:
>>> What you describe doesn’t look like something very useful for the vast
>>> majority of projects that need a database. Why would you even want that if
>>> you can avoid it?
>>>
>>> If your “single node” can handle tens / hundreds of thousands of requests
>>> per second, still have very durable and highly available storage, as well
>>> as fast recovery mechanisms, what’s the point?
>>>
>>> I am not trying to cater to extreme outliers that may want very weird
>>> things like this, that’s just not the use-case I want to address, because
>>> I believe they are few and far between.
>>>
>>> Best,
>>> Pierre
>>>
>>> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
>>>> A shared storage would require a lot of extra work. That's essentially
>>>> what AWS Aurora does.
>>>> You will have to have functionality to sync in-memory states between
>>>> nodes, because all the instances will have cached data that can easily
>>>> become stale on any write operation.
>>>> That alone is not that simple. You will have to modify some locking
>>>> logic, and most likely make a lot of other changes in a lot of places;
>>>> Postgres was just not built with the assumption that the storage can be
>>>> shared.
>>>>
>>>> -Vladimir
>>>>
>>>> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
>>>>> Now, I'm trying to understand how CAP theorem applies here. Traditional
>>>>> PostgreSQL replication has clear CAP trade-offs - you choose between
>>>>> consistency and availability during partitions.
>>>>>
>>>>> But when PostgreSQL instances share storage rather than replicate:
>>>>> - Consistency seems maintained (same data)
>>>>> - Availability seems maintained (client can always promote an accessible
>>>>>   node)
>>>>> - Partitions between PostgreSQL nodes don't prevent the system from
>>>>>   functioning
>>>>>
>>>>> It seems that CAP assumes specific implementation details (like nodes
>>>>> maintaining independent state) without explicitly stating them.
>>>>>
>>>>> How should we think about CAP theorem when distributed nodes share
>>>>> storage rather than coordinate state? Are the trade-offs simply moved to
>>>>> a different layer, or does shared storage fundamentally change the
>>>>> analysis?
>>>>>
>>>>>   Client with awareness of both PostgreSQL nodes
>>>>>          |                      |
>>>>>          ↓ (partition here)     ↓
>>>>>   PostgreSQL Primary      PostgreSQL Standby
>>>>>          |                      |
>>>>>          └───────────┬──────────┘
>>>>>                      ↓
>>>>>               Shared ZFS Pool
>>>>>                      |
>>>>>          6 Global ZeroFS instances
>>>>>
>>>>> Best,
>>>>> Pierre
>>>>>
>>>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>>>>> > Hi Seref,
>>>>> >
>>>>> > For the benchmarks, I used Hetzner's cloud service with the following
>>>>> > setup:
>>>>> >
>>>>> > - A Hetzner s3 bucket in the FSN1 region
>>>>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>>>>> > - 3 ZeroFS nbd devices (same s3 bucket)
>>>>> > - A ZFS striped pool with the 3 devices
>>>>> > - 200GB zfs L2ARC
>>>>> > - Postgres configured accordingly memory-wise, as well as with
>>>>> >   synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>>>> >
>>>>> > Best,
>>>>> > Pierre
>>>>> >
>>>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>>>> >> Sorry, this was meant to go to the whole group:
>>>>> >>
>>>>> >> Very interesting! Great work. Can you clarify how exactly you're
>>>>> >> running postgres in your tests? A specific AWS service? What's the
>>>>> >> test infrastructure that sits above the file system?
>>>>> >>
>>>>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>>>>> >>> Hi everyone,
>>>>> >>>
>>>>> >>> I wanted to share a project I've been working on that enables
>>>>> >>> PostgreSQL to run on S3 storage while maintaining performance
>>>>> >>> comparable to local NVMe. The approach uses block-level access rather
>>>>> >>> than trying to map filesystem operations to S3 objects.
>>>>> >>>
>>>>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>>>>> >>>
>>>>> >>> # The Architecture
>>>>> >>>
>>>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3
>>>>> >>> storage as raw block devices.
>>>>> >>> PostgreSQL runs unmodified on ZFS pools
>>>>> >>> built on these block devices:
>>>>> >>>
>>>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>>> >>>
>>>>> >>> By providing block-level access and leveraging ZFS's caching
>>>>> >>> capabilities (L2ARC), we can achieve microsecond latencies despite
>>>>> >>> the underlying storage being in S3.
>>>>> >>>
>>>>> >>> ## Performance Results
>>>>> >>>
>>>>> >>> Here are pgbench results from PostgreSQL running on this setup:
>>>>> >>>
>>>>> >>> ### Read/Write Workload
>>>>> >>>
>>>>> >>> ```
>>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>>> >>> starting vacuum...end.
>>>>> >>> transaction type: <builtin: TPC-B (sort of)>
>>>>> >>> scaling factor: 50
>>>>> >>> query mode: simple
>>>>> >>> number of clients: 50
>>>>> >>> number of threads: 15
>>>>> >>> maximum number of tries: 1
>>>>> >>> number of transactions per client: 100000
>>>>> >>> number of transactions actually processed: 5000000/5000000
>>>>> >>> number of failed transactions: 0 (0.000%)
>>>>> >>> latency average = 0.943 ms
>>>>> >>> initial connection time = 48.043 ms
>>>>> >>> tps = 53041.006947 (without initial connection time)
>>>>> >>> ```
>>>>> >>>
>>>>> >>> ### Read-Only Workload
>>>>> >>>
>>>>> >>> ```
>>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>>> >>> starting vacuum...end.
>>>>> >>> transaction type: <builtin: select only>
>>>>> >>> scaling factor: 50
>>>>> >>> query mode: simple
>>>>> >>> number of clients: 50
>>>>> >>> number of threads: 15
>>>>> >>> maximum number of tries: 1
>>>>> >>> number of transactions per client: 100000
>>>>> >>> number of transactions actually processed: 5000000/5000000
>>>>> >>> number of failed transactions: 0 (0.000%)
>>>>> >>> latency average = 0.121 ms
>>>>> >>> initial connection time = 53.358 ms
>>>>> >>> tps = 413436.248089 (without initial connection time)
>>>>> >>> ```
>>>>> >>>
>>>>> >>> These numbers are with 50 concurrent clients and the actual data
>>>>> >>> stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory
>>>>> >>> caches, while cold data comes from S3.
>>>>> >>>
>>>>> >>> ## How It Works
>>>>> >>>
>>>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS
>>>>> >>>    can use like any other block device
>>>>> >>> 2. Multiple cache layers hide S3 latency:
>>>>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>>> >>>    b. ZeroFS memory cache for metadata and hot data
>>>>> >>>    c. Optional local disk cache
>>>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS'
>>>>> >>>    LSM-tree
>>>>> >>>
>>>>> >>> ## Geo-Distributed PostgreSQL
>>>>> >>>
>>>>> >>> Since each region can run its own ZeroFS instance, you can create
>>>>> >>> geographically distributed PostgreSQL setups.
>>>>> >>>
>>>>> >>> Example architectures:
>>>>> >>>
>>>>> >>> Architecture 1:
>>>>> >>>
>>>>> >>>                    PostgreSQL Client
>>>>> >>>                           |
>>>>> >>>                           | SQL queries
>>>>> >>>                           |
>>>>> >>>                    +--------------+
>>>>> >>>                    |   PG Proxy   |
>>>>> >>>                    |  (HAProxy/   |
>>>>> >>>                    |  PgBouncer)  |
>>>>> >>>                    +--------------+
>>>>> >>>                      /          \
>>>>> >>>                     /            \
>>>>> >>>          Synchronous              Synchronous
>>>>> >>>          Replication              Replication
>>>>> >>>                   /                  \
>>>>> >>>                  /                    \
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>     | PostgreSQL 1  |          | PostgreSQL 2  |
>>>>> >>>     |   (Primary)   |◄--------►|   (Standby)   |
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>             |                          |
>>>>> >>>             |   POSIX filesystem ops   |
>>>>> >>>             |                          |
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>     |  ZFS Pool 1   |          |  ZFS Pool 2   |
>>>>> >>>     | (3-way mirror)|          | (3-way mirror)|
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>        /    |    \                /    |    \
>>>>> >>>       /     |     \              /     |     \
>>>>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>>>>> >>>     |         |         |         |         |         |
>>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>>> >>>     |         |         |         |         |         |
>>>>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>>> >>>  (us-east)  (eu-west) (ap-south)  (us-west) (eu-north)  (ap-east)
>>>>> >>>
>>>>> >>> Architecture 2:
>>>>> >>>
>>>>> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>>> >>>               \                          /
>>>>> >>>                \                        /
>>>>> >>>                  Same ZFS Pool (NBD)
>>>>> >>>                          |
>>>>> >>>                   6 Global ZeroFS
>>>>> >>>                          |
>>>>> >>>                     S3 Regions
>>>>> >>>
>>>>> >>> The main advantages I see are:
>>>>> >>> 1. Dramatic cost reduction for large datasets
>>>>> >>> 2. Simplified geo-distribution
>>>>> >>> 3. Infinite storage capacity
>>>>> >>> 4. Built-in encryption and compression
>>>>> >>>
>>>>> >>> Looking forward to your feedback and questions!
>>>>> >>>
>>>>> >>> Best,
>>>>> >>> Pierre
>>>>> >>>
>>>>> >>> P.S. The full project includes a custom NFS filesystem too.