> On Aug 14, 2025, at 3:19 AM, Vishnu Bhaskar <[email protected]>
> wrote:
>
> Hi Anthony
>
> CEPH OSD DF TREE :: ===========================
> ID CLASS WEIGHT REWEIGHT SIZE RAW USE DATA OMAP META AVAIL %USE VAR PGS STATUS TYPE NAME
> -51 12.22272 - 12 TiB 4.8 TiB 4.8 TiB 112 MiB 20 GiB 7.4 TiB 39.21 3.24 - root cache_root
> -27 0.87329 - 894 GiB 306 GiB 305 GiB 8.1 MiB 1.3 GiB 588 GiB 34.25 2.83 - host cache_node1
> 0 ssd 0.87329 1.00000 894 GiB 306 GiB 305 GiB 8.1 MiB 1.3 GiB 588 GiB 34.25 2.83 4 up osd.0
These are nominal 960 GB SSDs?
Under the PGS column the numbers are indeed doubleplus ungood. These should be
rather higher.
> -45 0.87299 - 894 GiB 382 GiB 381 GiB 6.3 MiB 1.5 GiB 512 GiB 42.76 3.53 - host cache_node10
> 45 ssd 0.87299 1.00000 894 GiB 382 GiB 381 GiB 6.3 MiB 1.5 GiB 512 GiB 42.76 3.53 3 up osd.45
> -47 0.87299 - 894 GiB 458 GiB 456 GiB 4.1 MiB 1.7 GiB 436 GiB 51.21 4.23 - host cache_node11
> 50 ssd 0.87299 1.00000 894 GiB 458 GiB 456 GiB 4.1 MiB 1.7 GiB 436 GiB 51.21 4.23 6 up osd.50
> -49 0.87299 - 894 GiB 535 GiB 533 GiB 8.0 MiB 1.7 GiB 359 GiB 59.84 4.94 - host cache_node12
> 55 ssd 0.87299 1.00000 894 GiB 535 GiB 533 GiB 8.0 MiB 1.7 GiB 359 GiB 59.84 4.94 7 up osd.55
The larger OSDs also have many fewer PGs than they should.
> 40 ssd 0.87299 1.00000 894 GiB 612 GiB 610 GiB 9.7 MiB 1.8 GiB 282 GiB 68.46 5.65 8 up osd.40
> -1 195.60869 - 196 TiB 20 TiB 20 TiB 810 MiB 77 GiB 175 TiB 10.42 0.86 - root default
> -3 13.97198 - 14 TiB 1.3 TiB 1.3 TiB 31 MiB 5.1 GiB 13 TiB 9.07 0.75 - host node1
> 1 ssd 3.49300 1.00000 3.5 TiB 309 GiB 308 GiB 7.8 MiB 1.2 GiB 3.2 TiB 8.64 0.71 36 up osd.1
> 2 ssd 3.49300 1.00000 3.5 TiB 331 GiB 330 GiB 5.2 MiB 1.4 GiB 3.2 TiB 9.26 0.76 36 up osd.2
> 3 ssd 3.49300 1.00000 3.5 TiB 292 GiB 291 GiB 8.2 MiB 1.2 GiB 3.2 TiB 8.16 0.67 34 up osd.3
> 4 ssd 3.49300 1.00000 3.5 TiB 365 GiB 364 GiB 9.9 MiB 1.4 GiB 3.1 TiB 10.21 0.84 38 up osd.4
You only have 4 OSDs per node? What kind of nodes are these? Are they
converged with compute?
Since *all* of your OSDs appear to be SSDs, why do you have a cache tier in the
first place?
>
> MIN/MAX VAR: 0.67/5.65 STDDEV: 14.49
The standard deviation here is perturbed by the wide variance in OSD sizes. I
have an RFE in to break down these figures by device class so that they are
more useful for heterogeneous clusters.
For now I suggest:
ceph config set global mon_max_pg_per_osd 500
ceph config set global mon_target_pg_per_osd 250
ceph config set mgr mgr/balancer/upmap_max_deviation 1
Then check that you don't have any of these set at a lower scope; most things can
honestly just be set at "global".
ceph config dump | grep pg_per_osd
If you have existing entries at narrower scopes such as "mon" or "osd", I'd suggest
using "ceph config rm" to clear those so the global-scope values above are the only
ones in force.
I expect that your warning will clear after the dust settles.
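You can watch it settle with:
ceph -s
ceph health detail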
>
>
> CEPH OSD POOL LS DETAIL :: ====================
> pool 2 'device_health_metrics' replicated size 2 min_size 1 crush_rule 0
> object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 44903
> flags hashpspool stripe_width 0 pg_num_min 1 application mgr_devicehealth
> pool 3 'volumes' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 1024 pgp_num 1024 autoscale_mode off last_change 63235 lfor
> 353/353/62134
I suggest enabling the autoscaler for this pool after making the above settings.
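Something like:
ceph osd pool set volumes pg_autoscale_mode on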
> flags hashpspool,selfmanaged_snaps tiers 4 read_tier 4 write_tier 4
> stripe_width 0 application rbd
> pool 4 'volumes_cache' replicated size 2 min_size 1 crush_rule 1 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change 63235 lfor
> 353/353/353 flags hashpspool,incomplete_clones,selfmanaged_snaps tier_of 3
> cache_mode writeback target_bytes 3298534883328 hit_set
> bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 14400s x4
> decay_rate 0 search_last_n 0 stripe_width 0 application rbd
Unless I'm missing something, I would look up procedures for removing the cache
tier entirely. I don't think it's doing anything for you. Actually I suspect
it's slowing you down.
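Roughly, the documented sequence for removing a writeback tier looks like the below; double-check
the cache-tiering docs for your release, and make sure the flush-evict completes before
removing the overlay:
ceph osd tier cache-mode volumes_cache proxy
rados -p volumes_cache cache-flush-evict-all
ceph osd tier remove-overlay volumes
ceph osd tier remove volumes volumes_cache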
> pool 5 'images' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 51925 lfor
> 0/0/368 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> pool 6 'internal' replicated size 2 min_size 1 crush_rule 0 object_hash
> rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 49949 flags
> hashpspool,selfmanaged_snaps stripe_width 0 application rbd
I notice that all of these pools have size=2, min_size=1. This is dangerous; I
strongly suggest setting all of these pools to size=3, min_size=2.
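For example, for the volumes pool (and likewise for each of the others), keeping an eye
on raw capacity for the third replica:
ceph osd pool set volumes size 3
ceph osd pool set volumes min_size 2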
Each of the above steps will result in waves of peering and backfill. This is
normal. Do one item at a time and let the cluster converge before proceeding.