> On Jun 20, 2025, at 8:20 PM, Niklas Hambüchen <[email protected]> wrote:
>
> I have 2 clusters; both have HDDs and SSDs. Reporting only the HDDs which
> have their own pools:
>
> "rep-cluster": hdd-pool 3-replication, 86 OSDs (16 TiB each), 1024 PGs, 78
> %RAW USED, 100 M objects
> "ec-cluster": hdd-pool erasure k=4 m=2, 58 OSDs (16 TiB each), 256 PGs, 60
> %RAW USED, 450 M objects
>
> Both are Ceph 18.2.1, Bluestore, and have the autoscaler enabled.
> As you can see, I have many small objects.
>
> My PGs-copies-per-OSD seem far off from the recommendation of 100 PGs per OSD
> (`mon_target_pg_per_osd`):
>
> rep-cluster: 35 PGs/OSD (= 1024*3/86)
> ec-cluster: 26 PGs/OSD (= 256*6/58)
The nomenclature here can be tricky. When I’ve encountered documentation of what
we at least used to call the PG ratio, I’ve tried to describe this target as the
number of *PG replicas* per OSD, because often enough folks don’t multiply by
the replication size / EC K+M when doing the math, which I see you’ve correctly
done. When there are multiple device classes and/or pools, especially with
varying data protection strategies, it can get a bit complicated.
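As a sanity check, the PGS column of `ceph osd df` already counts PG replicas
per OSD, so you can compare your arithmetic against what each OSD reports. A
quick sketch, assuming a stock awk and that PGS is the second-to-last column,
as in the `ceph osd df` excerpt below:

# Average PG replicas per OSD for the hdd device class;
# PGS is the second-to-last field of each OSD row.
ceph osd df | awk '/^ *[0-9]+ +hdd/ { sum += $(NF-1); n++ }
    END { if (n) printf "avg PG replicas per hdd OSD: %.1f\n", sum/n }'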
Please share `ceph osd df` for each cluster, trimmed to include only the column
header and a handful of representative OSDs for each device class, plus the last
two lines with the stddev. Please also share `ceph df` and `ceph balancer status`.
Check the STDDEV figure at the bottom of `ceph osd df`, though if your SSD OSDs
are significantly smaller than the HDDs, that can confound the reporting. I
have an RFE in to report the standard deviation per device class in addition to
the figure for the cluster as a whole.
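Until then, a rough per-device-class figure can be computed by hand from the
%USE column; a sketch, with the same caveat about column positions:

# Population stddev of %USE across hdd OSDs
# (%USE is the third-to-last field of each OSD row).
ceph osd df | awk '/^ *[0-9]+ +hdd/ { u = $(NF-2); s += u; ss += u*u; n++ }
    END { if (n) { m = s/n; printf "hdd %%USE stddev: %.2f\n", sqrt(ss/n - m*m) } }'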
Also check the VAR column for OSDs within a device class:

# ceph osd df | head
ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP   META    AVAIL    %USE   VAR   PGS  STATUS
217  hdd    18.53969  1.00000   19 TiB  9.8 TiB  9.5 TiB  5 KiB  66 GiB  8.7 TiB  52.92  0.89  115  up
219  hdd    18.53969  1.00000   19 TiB  8.5 TiB  8.2 TiB  1 KiB  71 GiB   10 TiB  46.11  0.77  104  up
221  hdd    18.53969  1.00000   19 TiB   11 TiB   10 TiB  2 KiB  76 GiB  7.9 TiB  57.65  0.97  121  up
The VAR(iance) is each OSD’s utilization relative to the average. Ideally, at
least within a given device class, this value will not be much more or less
than 1.00.
In this example the cluster had recently been doubled in size; with the grace
of upmap-remapped and the balancer, data is slowly but surely being rebalanced,
which is why the variances are spread more widely than usual.
> Reporting only the HDDs which have their own pools
When one has OSDs of varying sizes and/or device classes, the balancer and PG
autoscaler can be confounded to varying degrees.
Since you have multiple device classes, I imagine you have CRUSH rules that
constrain pools to one or the other, e.g. in `ceph osd crush rule dump` output:
"rule_id": 6,
"rule_name": "ssd_crush",
"type": 1,
"steps": [
{
"op": "take",
"item": -33,
"item_name": "default~ssd"
Are there any CRUSH rules, especially rule #0, the default replicated rule, that
do not specify a device class in this way? If so, are there any pools that
select such a rule? If so, changing the default or other rules to specify a
device class, or changing the pools that use them to a device-class-specific
rule, can help, along the lines of the sketch below.
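For example, something like the following; the rule and pool names here are
placeholders, and note that changing a pool’s rule will move data:

# Create a replicated rule constrained to the hdd device class
# (arguments: rule name, CRUSH root, failure domain, device class).
ceph osd crush rule create-replicated replicated_hdd default host hdd

# Point any pool still using a classless rule at it.
ceph osd pool set <pool> crush_rule replicated_hdd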
>
> So I'm at least 3x-4x off.
> Why?
> Should the autoscaler not have increased the PGs here?
The autoscaler is a fantastic idea from a usability perspective. It is, though,
imperfect and benefits from kaizen. My understanding is that the autoscaler
won’t bump a pool’s pg_num until the ideal value differs from the current one
by (by default) a factor of 3. I suspect that this enforces a manner of
hysteresis, so that small fluctuations in pool usage or OSD count don’t result
in annoying flapping back and forth.
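You can see the autoscaler’s math per pool, current pg_num versus its computed
target, with the first command below; I believe the factor-of-3 threshold is
tunable as well, though verify the second command against your release’s docs:

# Per-pool view: current PG_NUM, the autoscaler’s NEW PG_NUM, and the mode.
ceph osd pool autoscale-status

# Reportedly adjusts the hysteresis factor (verify on your release).
ceph osd pool set threshold 2.0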
>
> I believe that because of this I suffer some drawbacks:
>
> * On ec-cluster, a PG contains ~2 TiB and ~2 M objects, causing rebalances to
> happen in coarse, slow steps.
That’s one big reason why the current PG ratio target of 100 is suboptimal. The
guidance used to be 200; it was retconned to 100 a handful of years ago, because
reasons, at a time when the largest OSDs were on the order of 8 TB.
Today one can buy a 122TB SSD, and SKUs double that size are on the horizon.
For today I suggest:

ceph config set global mon_target_pg_per_osd 250
ceph config set global mon_max_pg_per_osd 1000

The first sets the target back to a sane value; I have a PR pending to change
this default. This gives the autoscaler more room to do its thing.
The second is a guardrail; it does not itself change any calculations, but
allows headroom so that clusters with varying OSD sizes and/or failure domains
of varying weights avoid irksome PG activation failures in certain scenarios.
Also, when the cluster contains OSDs of significantly varying weights,
regardless of device class, the balancer can be helped along by setting:

ceph config set mgr mgr/balancer/upmap_max_deviation 1
I suspect that the above steps will get you closer to where you want to be.
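To your question about forcing things: if the autoscaler still sits on its
hands after the above, you can nudge a pool directly; a sketch, with the pool
name as a placeholder:

# Raise the pool’s PG count yourself (pick a power of two);
# the split proceeds gradually and will move data.
ceph osd pool set <pool> pg_num 1024

# Or set a floor that the autoscaler will not shrink below.
ceph osd pool set <pool> pg_num_min 512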
>
> Should I take some steps to force the autoscaler to increase PGs, and if yes,
> which approach would be best here?
>
> Thanks for your tips!
> Niklas
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]