Also, Neon [0] and Aurora [1] pricing is so high that it seems to make most use-cases impractical (well, if you want a managed offering...). Neon's top public tier doesn't even match what a single modern dedicated server (or virtual machine) can provide. I would have thought decoupling compute and storage would make the offerings cheaper, if anything.
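As a quick sanity check, the back-of-the-envelope arithmetic below can be written down as a small Python sketch (illustrative only: it assumes Neon storage pricing scales linearly past the published 500 GB Business tier, takes the Aurora and EC2+S3 monthly totals from the AWS calculator estimates cited as [3] and [4], uses a rough EUR to USD conversion for the Hetzner ccx63, and ignores R2 request costs):

```
# Rough cost comparison for a 30 TB PostgreSQL database, over 12 months.
# Assumptions: linear extrapolation of Neon's 500 GB Business tier,
# AWS monthly totals taken from the calculator estimates [3] and [4],
# and an approximate EUR->USD conversion for the Hetzner ccx63.

needed_gb = 30_000  # 30 TB

# Neon: Business plan is 500 GB -> $700/month, extrapolated linearly
neon_monthly = 700 * (needed_gb / 500)

# Aurora (db.r5.24xlarge), monthly total from the AWS calculator [3]
aurora_monthly = 21_887.28

# EC2 r5.24xlarge + S3 for 30 TB, monthly total from the AWS calculator [4]
ec2_s3_monthly = 5_555.04

# Hetzner ccx63 + Cloudflare R2 (30 TB at $0.015/GB-month)
hetzner_monthly = 338            # ~EUR 287.99 converted to USD
r2_monthly = needed_gb * 0.015   # R2 request costs ignored here
self_hosted_monthly = hetzner_monthly + r2_monthly

for name, monthly in [
    ("Neon (linear estimate)", neon_monthly),
    ("Aurora db.r5.24xlarge", aurora_monthly),
    ("EC2 r5.24xlarge + S3", ec2_s3_monthly),
    ("Hetzner ccx63 + R2", self_hosted_monthly),
]:
    print(f"{name:26s} ${monthly:>10,.2f}/month  ${monthly * 12:>12,.2f}/year")
```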
Taking my own Merklemap [2] use-case, where I run a 30TB database, with Neon pricing (and I don't doubt that the non-public pricing would be even more expensive than that):

Neon storage scaling:
- Business plan: 500 GB -> $700/month
- You need: 30,000 GB (30 TB)
- Scaling factor: 60x
- Linear estimate: $700 × 60 = $42,000/month
- Total 12 months cost: $504,000

Aurora calculation [3]:
- Instance type: db.r5.24xlarge
- Monthly cost: $21,887.28
- Total 12 months cost: $262,647.36

Now, calculating the same 30TB with the same instance type and S3 storage [4]:
- Instance type: r5.24xlarge
- Monthly cost: $5,555.04
- Total 12 months cost: $66,660.48

But more interestingly, you don't need to use AWS at all anymore, because you can just move your setup anywhere at this point, as you get a similar level of reliability - and simplicity - but with very cheap services.

Hetzner ccx63 + Cloudflare R2:
- Hetzner ccx63: €287.99/month ≈ $338/month
- R2 storage (30TB): 30,000 GB × $0.015 = $450/month
- R2 operations: would need to be measured to be priced properly, but will probably be negligible
- Total monthly: ~$788
- Total 12 months cost: ~$9,456/year

Best,
Pierre

[0] https://neon.com/pricing
[1] https://aws.amazon.com/rds/aurora/pricing/
[2] https://www.merklemap.com/
[3] https://calculator.aws/#/estimate?id=3f0ce6a91eed9a666d54bb8852ea00b042c3cd6e
[4] https://calculator.aws/#/estimate?id=1a77d8da3489bafc8681c6fd738a3186fb749ea3

On Sat, Jul 26, 2025, at 09:51, Pierre Barre wrote:
> Ah, by "shared storage" I mean that each node can acquire exclusivity, not
> that they can both R/W to it at the same time.
>
> > Some pretty well-known cases of storage / compute separation (Aurora, Neon)
> > also share the storage between instances,
>
> That model is cool, but I think it's more of a solution for outliers, as I was
> suggesting, not something that most would or should want.
>
> Best,
> Pierre
>
> On Sat, Jul 26, 2025, at 09:42, Vladimir Churyukin wrote:
>> Sorry, I was referring to this:
>>
>> > But when PostgreSQL instances share storage rather than replicate:
>> > - Consistency seems maintained (same data)
>> > - Availability seems maintained (client can always promote an accessible
>> >   node)
>> > - Partitions between PostgreSQL nodes don't prevent the system from
>> >   functioning
>>
>> Some pretty well-known cases of storage / compute separation (Aurora, Neon)
>> also share the storage between instances, that's why I'm a bit confused by
>> your reply. I thought you were thinking about this approach too, that's why
>> I mentioned what kind of challenges one may have on that path.
>>
>> On Sat, Jul 26, 2025 at 12:36 AM Pierre Barre <pie...@barre.sh> wrote:
>>> What you describe doesn’t look like something very useful for the vast
>>> majority of projects that need a database. Why would you even want that if
>>> you can avoid it?
>>>
>>> If your “single node” can handle tens / hundreds of thousands of requests
>>> per second, still have very durable and highly available storage, as well
>>> as fast recovery mechanisms, what’s the point?
>>>
>>> I am not trying to cater to extreme outliers that may want very weird
>>> things like this, that’s just not the use-case I want to address, because
>>> I believe they are few and far between.
>>>
>>> Best,
>>> Pierre
>>>
>>> On Sat, Jul 26, 2025, at 08:57, Vladimir Churyukin wrote:
>>>> A shared storage would require a lot of extra work. That's essentially
>>>> what AWS Aurora does.
>>>> You will have to have functionality to sync in-memory states between
>>>> nodes, because all the instances will have cached data that can easily
>>>> become stale on any write operation.
>>>> That alone is not that simple. You will have to modify some locking
>>>> logic, and most likely make a lot of other changes in a lot of places;
>>>> Postgres was just not built with the assumption that the storage can be
>>>> shared.
>>>>
>>>> -Vladimir
>>>>
>>>> On Fri, Jul 18, 2025 at 5:31 AM Pierre Barre <pie...@barre.sh> wrote:
>>>>> Now, I'm trying to understand how CAP theorem applies here. Traditional
>>>>> PostgreSQL replication has clear CAP trade-offs - you choose between
>>>>> consistency and availability during partitions.
>>>>>
>>>>> But when PostgreSQL instances share storage rather than replicate:
>>>>> - Consistency seems maintained (same data)
>>>>> - Availability seems maintained (client can always promote an accessible
>>>>>   node)
>>>>> - Partitions between PostgreSQL nodes don't prevent the system from
>>>>>   functioning
>>>>>
>>>>> It seems that CAP assumes specific implementation details (like nodes
>>>>> maintaining independent state) without explicitly stating them.
>>>>>
>>>>> How should we think about CAP theorem when distributed nodes share
>>>>> storage rather than coordinate state? Are the trade-offs simply moved to
>>>>> a different layer, or does shared storage fundamentally change the
>>>>> analysis?
>>>>>
>>>>>   Client with awareness of both PostgreSQL nodes
>>>>>          |                      |
>>>>>          ↓ (partition here)     ↓
>>>>>   PostgreSQL Primary      PostgreSQL Standby
>>>>>          |                      |
>>>>>          └───────────┬──────────┘
>>>>>                      ↓
>>>>>               Shared ZFS Pool
>>>>>                      |
>>>>>          6 Global ZeroFS instances
>>>>>
>>>>> Best,
>>>>> Pierre
>>>>>
>>>>> On Fri, Jul 18, 2025, at 12:57, Pierre Barre wrote:
>>>>> > Hi Seref,
>>>>> >
>>>>> > For the benchmarks, I used Hetzner's cloud service with the following
>>>>> > setup:
>>>>> >
>>>>> > - A Hetzner s3 bucket in the FSN1 region
>>>>> > - A virtual machine of type ccx63 (48 vCPU, 192 GB memory)
>>>>> > - 3 ZeroFS nbd devices (same s3 bucket)
>>>>> > - A ZFS striped pool with the 3 devices
>>>>> > - 200GB zfs L2ARC
>>>>> > - Postgres configured accordingly memory-wise, as well as with
>>>>> >   synchronous_commit = off, wal_init_zero = off and wal_recycle = off.
>>>>> >
>>>>> > Best,
>>>>> > Pierre
>>>>> >
>>>>> > On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>>>> >> Sorry, this was meant to go to the whole group:
>>>>> >>
>>>>> >> Very interesting! Great work. Can you clarify how exactly you're
>>>>> >> running postgres in your tests? A specific AWS service? What's the
>>>>> >> test infrastructure that sits above the file system?
>>>>> >>
>>>>> >> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>>>>> >>> Hi everyone,
>>>>> >>>
>>>>> >>> I wanted to share a project I've been working on that enables
>>>>> >>> PostgreSQL to run on S3 storage while maintaining performance
>>>>> >>> comparable to local NVMe. The approach uses block-level access rather
>>>>> >>> than trying to map filesystem operations to S3 objects.
>>>>> >>>
>>>>> >>> ZeroFS: https://github.com/Barre/ZeroFS
>>>>> >>>
>>>>> >>> # The Architecture
>>>>> >>>
>>>>> >>> ZeroFS provides NBD (Network Block Device) servers that expose S3
>>>>> >>> storage as raw block devices.
>>>>> >>> PostgreSQL runs unmodified on ZFS pools
>>>>> >>> built on these block devices:
>>>>> >>>
>>>>> >>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>>> >>>
>>>>> >>> By providing block-level access and leveraging ZFS's caching
>>>>> >>> capabilities (L2ARC), we can achieve microsecond latencies despite
>>>>> >>> the underlying storage being in S3.
>>>>> >>>
>>>>> >>> ## Performance Results
>>>>> >>>
>>>>> >>> Here are pgbench results from PostgreSQL running on this setup:
>>>>> >>>
>>>>> >>> ### Read/Write Workload
>>>>> >>>
>>>>> >>> ```
>>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>>> >>> starting vacuum...end.
>>>>> >>> transaction type: <builtin: TPC-B (sort of)>
>>>>> >>> scaling factor: 50
>>>>> >>> query mode: simple
>>>>> >>> number of clients: 50
>>>>> >>> number of threads: 15
>>>>> >>> maximum number of tries: 1
>>>>> >>> number of transactions per client: 100000
>>>>> >>> number of transactions actually processed: 5000000/5000000
>>>>> >>> number of failed transactions: 0 (0.000%)
>>>>> >>> latency average = 0.943 ms
>>>>> >>> initial connection time = 48.043 ms
>>>>> >>> tps = 53041.006947 (without initial connection time)
>>>>> >>> ```
>>>>> >>>
>>>>> >>> ### Read-Only Workload
>>>>> >>>
>>>>> >>> ```
>>>>> >>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>>>> >>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>>> >>> starting vacuum...end.
>>>>> >>> transaction type: <builtin: select only>
>>>>> >>> scaling factor: 50
>>>>> >>> query mode: simple
>>>>> >>> number of clients: 50
>>>>> >>> number of threads: 15
>>>>> >>> maximum number of tries: 1
>>>>> >>> number of transactions per client: 100000
>>>>> >>> number of transactions actually processed: 5000000/5000000
>>>>> >>> number of failed transactions: 0 (0.000%)
>>>>> >>> latency average = 0.121 ms
>>>>> >>> initial connection time = 53.358 ms
>>>>> >>> tps = 413436.248089 (without initial connection time)
>>>>> >>> ```
>>>>> >>>
>>>>> >>> These numbers are with 50 concurrent clients and the actual data
>>>>> >>> stored in S3. Hot data is served from ZFS L2ARC and ZeroFS's memory
>>>>> >>> caches, while cold data comes from S3.
>>>>> >>>
>>>>> >>> ## How It Works
>>>>> >>>
>>>>> >>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS
>>>>> >>>    can use like any other block device
>>>>> >>> 2. Multiple cache layers hide S3 latency:
>>>>> >>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>>> >>>    b. ZeroFS memory cache for metadata and hot data
>>>>> >>>    c. Optional local disk cache
>>>>> >>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>>> >>> 4. Files are split into 128KB chunks for insertion into ZeroFS'
>>>>> >>>    LSM-tree
>>>>> >>>
>>>>> >>> ## Geo-Distributed PostgreSQL
>>>>> >>>
>>>>> >>> Since each region can run its own ZeroFS instance, you can create
>>>>> >>> geographically distributed PostgreSQL setups.
>>>>> >>>
>>>>> >>> Example architectures:
>>>>> >>>
>>>>> >>> Architecture 1:
>>>>> >>>
>>>>> >>>                    PostgreSQL Client
>>>>> >>>                           |
>>>>> >>>                           | SQL queries
>>>>> >>>                           |
>>>>> >>>                    +--------------+
>>>>> >>>                    |   PG Proxy   |
>>>>> >>>                    |  (HAProxy/   |
>>>>> >>>                    |  PgBouncer)  |
>>>>> >>>                    +--------------+
>>>>> >>>                      /          \
>>>>> >>>                     /            \
>>>>> >>>          Synchronous              Synchronous
>>>>> >>>          Replication              Replication
>>>>> >>>                   /                  \
>>>>> >>>                  /                    \
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>     | PostgreSQL 1  |          | PostgreSQL 2  |
>>>>> >>>     |   (Primary)   |◄--------►|   (Standby)   |
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>             |                          |
>>>>> >>>             |   POSIX filesystem ops   |
>>>>> >>>             |                          |
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>     |  ZFS Pool 1   |          |  ZFS Pool 2   |
>>>>> >>>     | (3-way mirror)|          | (3-way mirror)|
>>>>> >>>     +---------------+          +---------------+
>>>>> >>>        /    |    \                /    |    \
>>>>> >>>       /     |     \              /     |     \
>>>>> >>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>>>>> >>>     |         |         |         |         |         |
>>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>>> >>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>>> >>> +--------++--------++--------++--------++--------++--------+
>>>>> >>>     |         |         |         |         |         |
>>>>> >>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>>> >>>  (us-east)  (eu-west) (ap-south)  (us-west) (eu-north)  (ap-east)
>>>>> >>>
>>>>> >>> Architecture 2:
>>>>> >>>
>>>>> >>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>>> >>>               \                          /
>>>>> >>>                \                        /
>>>>> >>>                  Same ZFS Pool (NBD)
>>>>> >>>                          |
>>>>> >>>                   6 Global ZeroFS
>>>>> >>>                          |
>>>>> >>>                     S3 Regions
>>>>> >>>
>>>>> >>> The main advantages I see are:
>>>>> >>> 1. Dramatic cost reduction for large datasets
>>>>> >>> 2. Simplified geo-distribution
>>>>> >>> 3. Infinite storage capacity
>>>>> >>> 4. Built-in encryption and compression
>>>>> >>>
>>>>> >>> Looking forward to your feedback and questions!
>>>>> >>>
>>>>> >>> Best,
>>>>> >>> Pierre
>>>>> >>>
>>>>> >>> P.S. The full project includes a custom NFS filesystem too.