Hi, Adrian,

Adrian Sevcenco wrote:
> Hi! I'm trying to estimate the size requirement for a wal+db volume for a 26
> TB HDD while trying to minimize the administrative hassle of configuration
> (meaning that I would prefer a single wal+db volume rather than separate db
> and wal volumes) (RGW usage, 3-node cluster)

I have a similar setup - servers with 2x NVMe and some HDDs.
I would recommend aiming for more, smaller servers instead of three large
ones, though.

> So, for a JBOD that can go up to 44 drives (not all drives present, at most
> up to 12) I have 2x 6.4 TB NVMe.
> AFAIU, for the metadata (db) device I need RAID, as losing metadata means
> losing the associated OSD .. did I get this right?

Yes. But I think the cluster setup _should_ be planned to survive losing an
OSD, or even an entire server. So I would not bother using RAID-1 for metadata.

> So, what would be the best practice to map db+wal to an HDD OSD?
> Should I create an mdraid from the 2 NVMes and split that into 12 partitions?

What I did was to put the system on both NVMes, on RAID-1 partitions:

/boot/efi - 128 MB or something like that
/         - I used 200 GB, but that is probably overkill
swap      - I used 32 GB
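
For reference, a minimal sketch of how such mirrors could be assembled with
mdadm; the partition numbers (p1..p3) and md device names are assumptions for
illustration, not exact commands from my setup:

```shell
# Illustrative only: mirror the three system partitions across both NVMes.
# Metadata 1.0 keeps the superblock at the end of the partition, so the
# firmware can still read the ESP as a plain FAT filesystem.
mdadm --create /dev/md0 --level=1 --raid-devices=2 --metadata=1.0 \
    /dev/nvme0n1p1 /dev/nvme1n1p1    # /boot/efi
mdadm --create /dev/md1 --level=1 --raid-devices=2 \
    /dev/nvme0n1p2 /dev/nvme1n1p2    # /
mdadm --create /dev/md2 --level=1 --raid-devices=2 \
    /dev/nvme0n1p3 /dev/nvme1n1p3    # swap
```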

Then I created a partition covering the rest of the free space on each NVMe
and used both of them as physical volumes for a single LVM volume group:

vgcreate nvme_vg /dev/nvme0n1p4 /dev/nvme1n1p4

Now I can create LVM-based mirrored logical volumes for local applications,
should I ever need them, and non-mirrored LVs for Ceph metadata.
Something like this:

for i in `seq -w 01 06`; do
    lvcreate -n ceph_$i -L 100G nvme_vg /dev/nvme0n1p4
done
for i in `seq -w 11 16`; do
    lvcreate -n ceph_$i -L 100G nvme_vg /dev/nvme1n1p4
done
ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_01 \
    --data /dev/sda
ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_11 \
    --data /dev/sdb
ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_02 \
    --data /dev/sdc
ceph-volume lvm prepare --bluestore --block.db /dev/nvme_vg/ceph_12 \
    --data /dev/sdd
...
ceph-volume lvm activate --all
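
Should I ever need one of those mirrored LVs for a local application, LVM can
mirror inside the same VG; a hypothetical example (the LV name and size are
made up):

```shell
# Hypothetical: a RAID-1 (mirrored) LV, with one copy on each NVMe PV.
lvcreate --type raid1 -m 1 -L 50G -n local_apps nvme_vg
```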

The alternative would be to have a mirrored /boot only, and put everything
else into an NVMe-based LVM VG, using mirrored LVs for root and swap. This
way the root FS would be easily resizable.

I don't have experience with this (I am not even sure whether Anaconda can
install AlmaLinux onto mirrored LVs), so I went for more traditional md-raid
instead of LVM mirror for root, swap, and /boot/efi.

> Should I split the NVMe SSD into 12 namespaces, make an individual mdraid
> for each, and map those to the OSDs?

I don't think it would make a measurable difference to use NVMe namespaces.

> For a 6.4 TB NVMe (divided into 12) and a 26 TB OSD, the db/wal would be
> ~533 GB, so around 2.05% .. how terrible is this number for RGW usage?
> (I get that the recommended size is 4%.)
> 
> At what size is the data in danger if the db is too small, and when does it
> become safe, but only with performance degradation?

I think block.db can safely overflow to the main data area (with a performance
degradation, of course).
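
For the record, your 2% figure checks out; a quick sketch of the arithmetic
in plain shell, using the numbers from your message:

```shell
# 6.4 TB NVMe split 12 ways, against a 26 TB HDD OSD:
db_per_osd=$(( 6400 / 12 ))                             # GB per OSD
echo "$db_per_osd GB"                                   # prints "533 GB"
awk 'BEGIN { printf "%.2f%%\n", 533 / 26000 * 100 }'    # prints "2.05%"
```

And when block.db does fill up, recent Ceph releases raise a BLUEFS_SPILLOVER
health warning rather than failing, which matches the "performance
degradation, not data loss" expectation above.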

Hope this helps, but as always, your mileage may vary :-)

-Yenya

-- 
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| https://www.fi.muni.cz/~kas/                        GPG: 4096R/A45477D5 |
    We all agree on the necessity of compromise. We just can't agree on
    when it's necessary to compromise.                     --Larry Wall
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]