>> Hi! i'm trying to estimate the size requirement for a wal+db volume for 26
>> TB HDDs

What is your workload?  Consider that this bottlenecks 10x the data behind 
the same interface that was limiting for 2-3 TB HDDs years ago.


>> while trying to minimize the administrative hassle

SSDs with no WAL+DB offload.

>> for configuration
>> (meaning that i would prefer a combined wal+db volume, not separating db from wal)

I need to revisit the docs, almost nobody has cause to have a dedicated WAL 
device.


>> (RGW usage, 3 node cluster)

Danger, Will Robinson!

This all but limits you to having only 3 mons.  It also means that when one 
node is down, that’s fully 33% of your IOPS.  Are you planning to use a 
replicated buckets.data pool?  RGW deployments often use EC to maximize usable 
space at the expense of performance; here to use EC safely you’d want to use a 
brand-new MSR rule, with lessened space efficiency.


> I have a similar setup - servers with 2xNVMe and some HDDs.
> I would recommend aiming for more smaller servers instead of three large ones,
> though.

Absolutely.  Ultra-dense nodes are prone to bottlenecks:
* Backplane / expander throughput
* HBA throughput / congestion
* NIC saturation

When you lose one or bring it back, you get a thundering herd of 
recovery/backfill that will impact your clients and take weeks to complete, 
during which you have an increased risk of data being unavailable or lost.

>> so, for a jbod that can go up to 44 drives (not all drives present, at most
>> up to 12)

Why have a chassis like that and only populate it 27% full?  Is this 
hand-me-down hardware?

Also note that with only three nodes you’ll want each to have the same 
aggregate OSD capacity (CRUSH weight).

I recommend at least five nodes so you can safely have five mons.  And with EC 
there are advantages to having at least k + m + 1 nodes.  4+2 is a reasonable 
EC profile if one is new to the tradeoffs of EC, which would mean at least 
seven nodes.
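As a sketch of the 4+2 case, assuming a hypothetical profile name of `rgw42` and a host failure domain (run against a live cluster; names and pool are placeholders, adjust to taste):

```shell
# Hypothetical profile and pool names -- substitute your own.
# With k=4, m=2 you want at least k + m + 1 = 7 hosts.
ceph osd erasure-code-profile set rgw42 k=4 m=2 crush-failure-domain=host
ceph osd pool create default.rgw.buckets.data erasure rgw42
```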



>> i have 2x 6.4TB NVME AFAIU for metadata (db) device, i need raid

Ceph does that for you.  

>> as losing metadata, means losing the associated OSD .. did i get this right?
> 
> Yes. But I think the cluster setup _should_ be planned for losing an OSD
> or even an entire server. So I would not bother using RAID-1 for metadata.

Agreed. RAID on top of RAID is rarely a great strategy.  


>> So, what would be the best practice to map db+wal to hdd OSD?
>> should i do a mdraid from the 2 nvme and that to split in 12 partitions?

If you’re dead-set on using this gear, map six OSDs to each unmirrored NVMe 
SSD.  You will burn their endurance at half the rate that way, and if one 
fails, it won’t take out the entire node.
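A sketch of that layout, assuming one VG per SSD so a failed device only takes out its own six OSDs (device, VG, and LV names are hypothetical):

```shell
# One VG per NVMe SSD -- no mirroring, per the advice above.
vgcreate nvme0_vg /dev/nvme0n1
vgcreate nvme1_vg /dev/nvme1n1

# Six DB LVs per SSD; ~1 TB each out of 6.4 TB leaves some headroom.
for i in 1 2 3 4 5 6; do
  lvcreate -n db_$i -L 1000G nvme0_vg
  lvcreate -n db_$i -L 1000G nvme1_vg
done

# Pair each HDD OSD with a DB LV, alternating between the two SSDs.
ceph-volume lvm prepare --bluestore --block.db nvme0_vg/db_1 --data /dev/sda
ceph-volume lvm prepare --bluestore --block.db nvme1_vg/db_1 --data /dev/sdb
```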

> 
> What I did is to put the system to both NVMes on RAID-1 partitions:
> 
> /boot/efi - 128 MB or something like that
> / - I used 200 GB, but it is probably overkill

Congrats on eschewing the antiquated strategy of partitioning the boot volume 
to death.

> swap - I used 32 GB

If you need any swap, what you really need is more physmem.  Don’t provision 
any swap at all.  This isn’t 1985.

> 
> Then I created a partition covering the rest of the free space on each NVMe
> and used both of them as physical volumes for a single LVM volume group:

Sharing the boot volume with data is not an ideal strategy.  I have a customer 
who got themselves into an outage doing that.  You have a zillion SAS/SATA 
slots empty, put a pair of SSDs into each system for boot/OS, mirror them with 
MD, and don’t use them for data.  
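A minimal sketch of that boot mirror, assuming the two boot SSDs show up as /dev/sdx and /dev/sdy (hypothetical names):

```shell
# Hypothetical device names -- substitute your actual boot SSDs.
# Mirror the pair with MD and keep the array strictly for boot/OS.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdx1 /dev/sdy1
mkfs.xfs /dev/md0   # root filesystem on the mirror; no data here
```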


> 
> vgcreate nvme_vg /dev/nvme0n1p4 /dev/nvme1n1p4
> 
> Now I can create LVM-based mirrored logical volumes for local applications,
> should I ever need them

Nononono.  See above.


>  and non-mirrored LVs for Ceph metadata.
> Something like this:
> 
> for i in `seq -w 01 06`; do lvcreate -n ceph_$i -L 100G nvme_vg 
> /dev/nvme0n1p4; done
> for i in `seq -w 11 16`; do lvcreate -n ceph_$i -L 100G nvme_vg 
> /dev/nvme1n1p4; done
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_01 --data 
> /dev/sda
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_11 --data 
> /dev/sdb
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_02 --data 
> /dev/sdc
> ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_12 --data 
> /dev/sdd
> ...
> ceph-volume lvm activate --all
> 
> The alternative would be to have a mirrored /boot only, and put everything
> else on a NVMe-based LVM VG, using mirrored LVs for root and swap. This
> way root FS would be easily resizable.

Put everything else on different media, use the entire boot volume for /.

> 
> I don't have experience with this (I am not even sure whether Anaconda can
> install AlmaLinux onto mirrored LVs),

I don’t know about Alma but I’ve done this lots with EL.

> so I went for more traditional md-raid
> instead of LVM mirror for root, swap, and /boot/efi.

My sense is that LVM mirroring is mostly for temporary use while migrating 
devices, though it may actually use MD under the hood.  I tend to create an MD 
metadevice and create LVs on top of that.

> 
>> should i split the nvme ssd in 12 namespaces and make individual mdraid for
>> each and that map to osds?
> 
> I don't think it would make a measurable difference to use NVMe namespaces.

I’ve yet to see a reason to use namespaces over traditional partitions.  There 
may be, but I’ve yet to discover it.

> 
>> For a 6.4 TB NVMe SSD

You don’t need high-endurance mixed-use SSDs.  1DWPD read-intensive are fine.  
They’re the same hardware, with less overprovisioning and less markup.  You 
can change one into the other with software; the manufacturers do this all the 
time at the factory depending on what they need to ship.

>> (divided to 12) and 26 TB OSD, the db/wal would be ~533
>> GB, so around 2.05% .. how terrible is this number for RGW usage
>> (i get that the recommended is 4%)

That 4% figure is fairly arbitrary, but for RGW usage you usually want more 
than for, say, RBD.  With RocksDB compression enabled in the latest releases, 
some advocate as little as 2.5%.  This is another reason not to mirror your 
offload devices:  larger partitions help with the higher RocksDB levels and 
with compaction, avoiding spillover.
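For reference, a rough sketch of the sizing arithmetic from this thread (6.4 TB SSDs, 26 TB HDD OSDs, 12 OSDs per node):

```shell
# Rough WAL+DB sizing arithmetic -- numbers from this thread.
ssd_gb=6400        # one 6.4 TB NVMe SSD
hdd_gb=26000       # one 26 TB HDD OSD

# Mirrored across both SSDs: usable space split 12 ways.
mirrored_gb=$((ssd_gb / 12))
# Unmirrored, six OSDs per SSD: partitions twice as large.
unmirrored_gb=$((ssd_gb / 6))

awk -v m="$mirrored_gb" -v u="$unmirrored_gb" -v h="$hdd_gb" 'BEGIN {
  printf "mirrored:   %d GB = %.2f%% of OSD\n", m, 100*m/h
  printf "unmirrored: %d GB = %.2f%% of OSD\n", u, 100*u/h
}'
```

The unmirrored layout doubles the per-OSD offload from ~2% to ~4%, which is the point about spillover above.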

>> At what size the data is in danger if db is too small and when it become
>> safe but only with performance degradation?
> 
> I think block.db can safely overflow to the main data area (with a performance
> degradation, of course).

Yes, with recent releases there are no magic thresholds.  Back before column 
family sharding, there were discrete amounts of space that *could* be used with 
the rest ignored.  Like with a 55 GB partition, only ~33 GB would actually be 
used.  

> 
>    We all agree on the necessity of compromise. We just can't agree on
>    when it's necessary to compromise.                     --Larry Wall

We demand rigidly defined areas of doubt and uncertainty.


_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
