> On May 27, 2025, at 12:19 PM, Ryan Rempel <[email protected]> wrote:
>
> I'm expanding a small Ceph cluster from 4 nodes to 5 nodes. The new node is a
> bit more sophisticated than the others, since it has some SSD storage that
> I'd like to use for DB+WAL (which I haven't done before, it has just been
> rotational disks).
>
> I'm using cephadm for orchestration, and normally add osds via "ceph orch
> daemon add osd". I prefer to add the osds in this "manual" way (rather than
> "ceph orch apply" with a spec) mainly because my infrastructure is not
> uniform (for better or worse, I'm working with hardware that becomes
> available in different ways over time, as I gradually upgrade things and add
> things).
>
> Looking at this page:
>
> https://docs.ceph.com/en/squid/cephadm/services/osd/
>
> ... it isn't entirely clear to me whether it's possible to specify a separate
> DB device when using the "ceph orch daemon add osd" procedure. There is a
> description of how to do it with a service spec, but how you would specify
> the DB device for "ceph orch daemon add osd" does not appear to be described.
>
> So, my first question is whether it's possible to specify a separate DB via
> "ceph orch daemon add osd"?
I believe it is, though I don't have the syntax to hand.
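If memory serves, the daemon add form accepts drive-group style key=value pairs after the host, so something along these lines (untested, and the host/device names here are placeholders, so check the daemon add section of that docs page first):

ceph orch daemon add osd newhost:data_devices=/dev/sdb,/dev/sdc,db_devices=/dev/sdq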
> If not, I'll need to explore the service spec approach. I suppose I can use
> the "unmanaged: true" option in the spec (to keep it as "manual" as possible).
I do suggest leaving OSD services unmanaged except when you're actively using them, so that, e.g., when you zap an OSD for replacement, the old / bad drive doesn't automatically get an OSD redeployed onto it.
An OSD service won't mess with existing OSDs, so you don't have to worry about applying a spec that differs from what's already deployed. Use the --dry-run flag
before applying a new spec to ensure that the effect is what you want.
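For example, with the spec saved in a file (osd_spec.yaml is just a placeholder name):

ceph orch apply -i osd_spec.yaml --dry-run

That reports which OSDs would be created on which devices without actually deploying anything.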
>
> The remaining puzzle is how to use the SSD as DB+WAL for more than one OSD.
> At this point, the SSD is the raw device — I haven't done anything with it
> manually in LVM or whatever. In the service spec description above, I see
> that there is a "db_slots" key. So, I suppose that I could specify the
> "whole" SSD and provide for the number of slots?
That’s the idea.
> However, I don't necessarily want every slot to be the same size (because of
> my unfortunately heterogeneous hardware).
You can have multiple OSD specs. In the docs, scroll down to the advanced OSD spec section for examples. You can constrain each spec to particular hosts and devices, so there's a lot of flexibility. The slots on a given OSD host will all be the same size, though, right?
Here’s an example of a cluster being retrofitted. First a couple of new nodes
were added with SATA/SAS SSDs for HDD WAL+DB offload, then additional HDDs and
offload SSDs were added to the existing nodes along with a couple of NVMe SSDs
to be used only for CephFS metadata, not for offload.
# This describes the prior strategy. Note the use of the `size` and
# `rotational` attributes to prevent the smaller SSDs from being used
# as OSDs. This host_pattern started out as '*' and was whittled down
# as each host was migrated.
---
service_type: osd
service_id: cost_capacity
service_name: osd.cost_capacity
placement:
  host_pattern: host24
spec:
  data_devices:
    rotational: 1
    size: '18T:'
  filter_logic: AND
  objectstore: bluestore
---
# This spec matches only the 1.9TB NVMe SSDs to be
# used as OSDs for the CephFS metadata pool
# Here again we use `rotational` and `size` to constrain application
# since the WAL+DB offload SSDs are 2TB
service_type: osd
service_id: dashboard-admin-1705602677615
service_name: osd.dashboard-admin-1705602677615
placement:
  host_pattern: '*'
spec:
  data_devices:
    rotational: 0
    size: '490G:1200G'
  filter_logic: AND
  objectstore: bluestore
---
# Here hybrid OSDs are deployed on specific
# device names, which isn’t ideal in general because
# the names may change
service_type: osd
service_id: osd.hybrid
service_name: osd.osd.hybrid
unmanaged: true
placement:
  hosts:
    - host1701
spec:
  block_db_size: 384075772723
  data_devices:
    paths:
      - /dev/sdc
      - /dev/sdd
      - /dev/sde
      - /dev/sdf
      - /dev/sdg
  db_devices:
    paths:
      - /dev/sdac
  db_slots: 5
  filter_logic: AND
  objectstore: bluestore
---
> So, I also see that there is a "block_db_size" and "block_wal_size". But it's
> unclear how this relates to "db_slots" — which one would determine how the
> SSD is sliced up?
In general, ignore the WAL settings and the WAL will default to riding along with the DB on the same device. In theory, block_db_size and db_slots are either/or ways of carving up the device; perhaps you'd combine them if an SSD served both for offload and for other purposes, though I wouldn't recommend that.
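A minimal sketch of what I mean (the service_id, host_pattern, and slot count are made up, so adjust to taste and --dry-run it first): no WAL or size keys at all, just db_slots carving up whatever non-rotational device matches:

service_type: osd
service_id: hdd_with_db_offload
placement:
  host_pattern: newhost
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
  db_slots: 4
  filter_logic: AND
  objectstore: bluestore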
> I'd actually be happy to pre-slice the SSD (e.g. with LVM) and then directly
> specify which SSD slice is the DB+WAL for which OSD, if that's a feasible
> approach.
That works. Here's a one-off script I've used to do this. I know, no error checking. The args are the IDs of five existing HDD OSDs followed by the offload device name (without the /dev/ prefix); 20% of the device is used for each OSD's DB.
#!/bin/bash
# Args: five existing OSD IDs followed by the offload device name (e.g. sdq).
# One-off script -- no error checking.
ceph osd set noscrub
ceph osd set nodeep-scrub
sleep 15
# One VG on the offload device, one 20% LV per OSD for its DB
VG=$(uname -n)-$6-db
vgcreate $VG /dev/$6
for i in $1 $2 $3 $4 $5 ; do lvcreate -l 20%VG -n ceph-osd$i-db $VG ; done
vgdisplay $VG
# Keep the stopped OSDs from being marked out while we work on them
for i in $1 $2 $3 $4 $5 ; do ceph osd add-noout $i ; done
CFSID=$(ceph fsid)
for i in $1 $2 $3 $4 $5
do
    systemctl stop ceph-$CFSID@osd.$i
    date
    # Attach the new LV as a separate DB device, then migrate the existing
    # DB out of the data device onto it
    echo ceph-volume lvm new-db --osd-id $i --osd-fsid $(ceph osd find $i | jq -r .osd_fsid) --target $VG/ceph-osd$i-db \; exit | cephadm shell --name osd.$i
    date
    echo ceph-volume lvm migrate --osd-id $i --osd-fsid $(ceph osd find $i | jq -r .osd_fsid) --target $VG/ceph-osd$i-db --from data \; exit | cephadm shell --name osd.$i
    date
    systemctl start ceph-$CFSID@osd.$i
done
# Afterwards: ceph osd rm-noout each OSD and unset noscrub / nodeep-scrub
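For example, to hang osd.21 through osd.25 off /dev/sdq (the IDs, device name, and script name are all made up here):

./add-db-offload.sh 21 22 23 24 25 sdq

Run it on the OSD host itself, since it relies on uname -n, systemctl, and cephadm shell there.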
> Though, I'd still be interested in knowing whether I need to set something
> for "block_db_size" and "block_wal_size", or whether it's enough to just
> actually make a certain size of LVM volume available for DB+WAL.
If you pre-create, then you don’t need the size params.
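E.g. something like this, with made-up VG/LV names and sizes, carving the SSD into unequal DB slices:

vgcreate ceph-db /dev/sdq
lvcreate -L 120G -n db-osd-a ceph-db
lvcreate -L 60G -n db-osd-b ceph-db
lvcreate -L 60G -n db-osd-c ceph-db

Each LV can then be handed to ceph-volume as vg/lv (e.g. --block.db ceph-db/db-osd-a, or --target with the new-db approach above). I'm not certain ceph orch daemon add will take a pre-made LV directly, so test that with a throwaway OSD first.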
>
> Normally I'd just experiment, but that might be disruptive to the working
> cluster. I guess I could at least turn off rebalancing while I try things out?
Yes, turn off rebalancing so that you can validate the results, in case you
have to zap and start over. And use --dry-run a lot, and leave the OSD
service(s) unmanaged except when you're actively using them.
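The flags I have in mind:

ceph osd set norebalance
ceph osd set nobackfill
# ... experiment, validate ...
ceph osd unset nobackfill
ceph osd unset norebalance

(plus noout if you'll be stopping existing OSDs along the way).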
>
> The other documentation I'm now reading is the documentation for ceph-volume,
> which appears to be related:
>
> https://docs.ceph.com/en/squid/ceph-volume/lvm/batch/
>
> It mentions, for instance, things like db_slots and block_db_size. The
> implication is that db_slots is an alternative to block_db_size — that you
> wouldn't specify both, for instance.
In most cases I would agree.
>
> I'm also reading the ceph-volume docs for "prepare". I suppose if I find that
> more suitable, it might be possible to "prepare" an OSD with ceph-volume and
> then "adopt" it with cephadm?
There might be snags with that approach. Adoption I think is intended for
legacy OSDs.
>
> Well, just writing the email has given me a bit more clarity about things to
> try, but I'd certainly be happy for any guidance.
>
>
> Ryan Rempel
>
> Director of Information Technology
>
> Canadian Mennonite University
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]