> On Jun 20, 2025, at 8:20 PM, Niklas Hambüchen <[email protected]> wrote:
>
> I have 2 clusters; both have HDDs and SSDs. Reporting only the HDDs which
> have their own pools:
>
> "rep-cluster": hdd-pool 3-replication, 86 OSDs (16 TiB each), 1024 PGs, 78
> %RAW USED, 100 M objects
> "ec-cluster": hdd-pool erasure k=4 m=2, 58 OSDs (16 TiB each), 256 PGs, 60
> %RAW USED, 450 M objects
>
> Both are Ceph 18.2.1, Bluestore, and have the autoscaler enabled.
> As you can see, I have many small objects.
>
> My PGs-copies-per-OSD seem far off from the recommendation of 100 PGs per OSD
> (`mon_target_pg_per_osd`):
>
> rep-cluster: 35 PGs/OSD (= 1024*3/86)
> ec-cluster: 26 PGs/OSD (= 256*6/58)
The nomenclature here can be tricky. When I’ve encountered documentation of what
we at least used to call the PG ratio, I’ve tried to describe this target as the
number of *PG replicas* per OSD, because often enough folks don’t multiply by
the replication size / EC K+M when doing the math, which I see you’ve correctly
done. When there are multiple device classes and/or pools, especially with
varying data protection strategies, it can get a bit complicated.
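As a sanity check, the PGS column of `ceph osd df` already counts PG replicas
per OSD, so you can compare your arithmetic against what each OSD reports. A
quick sketch, assuming a stock awk and that PGS is the second-to-last column,
as in the `ceph osd df` excerpt below:

# Average PG replicas per OSD for the hdd device class;
# PGS is the second-to-last field of each OSD row.
ceph osd df | awk '/^ *[0-9]+ +hdd/ { sum += $(NF-1); n++ }
    END { if (n) printf "avg PG replicas per hdd OSD: %.1f\n", sum/n }'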
Please share `ceph osd df` for each cluster, trimmed to include only the column
header and a handful of representative OSDs for each device class, plus the last
two lines with the stddev. Please also share `ceph df` and `ceph balancer status`.
Check the STDDEV figure at the bottom of `ceph osd df`, though if your SSD OSDs
are significantly smaller than the HDDs, that can confound the reporting. I
have an RFE in to report the standard deviation per device class in addition to
the figure for the cluster as a whole.
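Until then, a rough per-device-class figure can be computed by hand from the
%USE column; a sketch, with the same caveat about column positions:

# Population stddev of %USE across hdd OSDs
# (%USE is the third-to-last field of each OSD row).
ceph osd df | awk '/^ *[0-9]+ +hdd/ { u = $(NF-2); s += u; ss += u*u; n++ }
    END { if (n) { m = s/n; printf "hdd %%USE stddev: %.2f\n", sqrt(ss/n - m*m) } }'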
Also check the VAR column for OSDs within a device class:

# ceph osd df | head
ID   CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP   META    AVAIL    %USE   VAR   PGS  STATUS
217  hdd    18.53969  1.00000   19 TiB  9.8 TiB  9.5 TiB  5 KiB  66 GiB  8.7 TiB  52.92  0.89  115  up
219  hdd    18.53969  1.00000   19 TiB  8.5 TiB  8.2 TiB  1 KiB  71 GiB   10 TiB  46.11  0.77  104  up
221  hdd    18.53969  1.00000   19 TiB   11 TiB   10 TiB  2 KiB  76 GiB  7.9 TiB  57.65  0.97  121  up
The VAR(iance) is each OSD’s utilization relative to the average. Ideally, at
least within a given device class, this value will not be much more or less
than 1.00.
In this example the cluster had recently been doubled in size; with the grace
of upmap-remapped and the balancer, data is slowly but surely being rebalanced,
which is why the variances are spread more widely than usual.
> Reporting only the HDDs which have their own pools
When one has OSDs of varying sizes and/or device classes, the balancer and PG
autoscaler can be confounded to varying degrees.
Since you have multiple device classes, I imagine you have CRUSH rules that
constrain pools to one or the other, e.g. in `ceph osd crush rule dump` output:
"rule_id": 6,
"rule_name": "ssd_crush",
"type": 1,
"steps": [
{
"op": "take",
"item": -33,
"item_name": "default~ssd"
Are there any CRUSH rules, especially rule #0, the default replicated rule, that
do not specify a device class in this way? If so, are there any pools that
select such a rule? If so, changing the default or other rules to specify a
device class, or changing the pools that use them to a device-class-specific
rule, can help, along the lines of the sketch below.
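For example, something like the following; the rule and pool names here are
placeholders, and note that changing a pool’s rule will move data:

# Create a replicated rule constrained to the hdd device class
# (arguments: rule name, CRUSH root, failure domain, device class).
ceph osd crush rule create-replicated replicated_hdd default host hdd

# Point any pool still using a classless rule at it.
ceph osd pool set <pool> crush_rule replicated_hdd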
>
> So I'm at least 3x-4x off.
> Why?
> Should the autoscaler not have increased the PGs here?
The autoscaler is a fantastic idea from a usability perspective. It is, though,
imperfect and benefits from kaizen. My understanding is that the autoscaler
won’t bump a pool’s pg_num until the ideal value differs from the current one
by (by default) a factor of 3. I suspect that this enforces a manner of
hysteresis, so that small fluctuations in pool usage or OSD count don’t result
in annoying flapping back and forth.
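You can see the autoscaler’s math per pool, current pg_num versus its computed
target, with the first command below; I believe the factor-of-3 threshold is
tunable as well, though verify the second command against your release’s docs:

# Per-pool view: current PG_NUM, the autoscaler’s NEW PG_NUM, and the mode.
ceph osd pool autoscale-status

# Reportedly adjusts the hysteresis factor (verify on your release).
ceph osd pool set threshold 2.0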
>
> I believe that because of this I suffer some drawbacks:
>
> * On ec-cluster, a PG contains ~2 TiB and ~2 M objects, causing rebalances to
> happen in coarse, slow steps.
That’s one big reason why the current PG ratio target of 100 is suboptimal. The
guidance used to be 200; it was retconned to 100 a handful of years ago, because
reasons, at a time when the largest OSDs were on the order of 8 TB.
Today one can buy a 122TB SSD, and SKUs double that size are on the horizon.
For today I suggest:

ceph config set global mon_target_pg_per_osd 250
ceph config set global mon_max_pg_per_osd 1000

The first sets the target back to a sane value; I have a PR pending to change
this default. This gives the autoscaler more room to do its thing.
The second is a guardrail; it does not itself change any calculations, but
allows headroom so that clusters with varying OSD sizes and/or failure domains
of varying weights avoid irksome PG activation failures in certain scenarios.
Also, when the cluster contains OSDs of significantly varying weights,
regardless of device class, the balancer can be helped along by setting:

ceph config set mgr mgr/balancer/upmap_max_deviation 1
I suspect that the above steps will get you closer to where you want to be.
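To your question about forcing things: if the autoscaler still sits on its
hands after the above, you can nudge a pool directly; a sketch, with the pool
name as a placeholder:

# Raise the pool’s PG count yourself (pick a power of two);
# the split proceeds gradually and will move data.
ceph osd pool set <pool> pg_num 1024

# Or set a floor that the autoscaler will not shrink below.
ceph osd pool set <pool> pg_num_min 512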
>
> Should I take some steps to force the autoscaler to increase PGs, and if yes,
> which approach would be best here?
>
> Thanks for your tips!
> Niklas
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]