> On 13 Sept 2025, at 14:17, Anthony D'Atri <[email protected]> wrote:
> 
> 
> 
> (..)
>>> 
>>> 
>>> I can't say that we upgraded a lot of clusters from N to P, but those
>>> that we upgraded didn't show any of these symptoms you describe. But
>>> we always did the Filestore to Bluestore conversion before the actual
>>> upgrade. In SUSE Enterprise Storage (which we also supported at that
>>> time) this was pointed out as a requirement. I just checked the ceph
>>> docs, I can't find such a statement (yet).
> 
> I *think* Filestore OSDs still work, but I've been peppering the docs with 
> admonitions to convert for several releases now.  I would expect them to have 
> worked with Pacific.  Did you update from the last Nautilus to the last 
> Pacific?

Yes. And yes, it was supposed to work. In fact it does work on our test cluster, 
but that cluster doesn't have the same history as the prod one; it's only there 
to test some CRUSH rules, … before applying them to the prod cluster.

>>>> In terms of hardware, we have three monitors (cephmon) and 30
>>>> storage servers (cephstore) spread across three datacenters. These
>>>> servers are connected to the network via an aggregate (LACP) of two
>>>> 10 Gbps fibre connections, through which two VLANs pass, one for the
>>>> CEPH frontend network and one for the CEPH backend network. In doing
>>>> so, we have always given ourselves the option of separating the
>>>> frontend and backend into dedicated aggregates if the bandwidth
>>>> becomes insufficient.
> 
> Nice planning.  For scratch clusters today I usually suggest a single 25-100 
> GE bonded public network.  The dynamics when you started were different.

Yes. But from a network/port availability point of view, I think the next step 
will probably be to split the front and back networks.

> 
> 
>>>> 
>>>>    100 hdd 10.90999 TB
>>>>     48 hdd 11.00000 TB
>>>>     48 hdd 14.54999 TB
>>>>     24 hdd 15.00000 TB
>>>>      9 hdd 5.45999 TB
>>>>    108 hdd 9.09999 TB
> 
> Have you tried disabling their volatile write cache?

Yes, they are disabled.
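For others following along, checking and disabling the volatile write cache looks roughly like this. Device names here are hypothetical, and the commands are printed as a dry run rather than executed:

```shell
# Dry-run sketch: build the commands that would disable the volatile
# write cache on each drive (device names are hypothetical).
# SATA drives use hdparm -W 0; the SAS equivalent is sdparm --set WCE=0 --save.
drives="sda sdb"
for d in $drives; do
  echo "hdparm -W 0 /dev/$d"
done
```

Note the setting is not always persistent across power cycles with hdparm, which is why some deploy it via a udev rule or rc script.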

> 
>>>> 
>>>>     84 ssd 0.89400 TB
>>>>    198 ssd 0.89424 TB
>>>>     18 ssd 0.93599 TB
>>>>     32 ssd 1.45999 TB
>>>>     16 ssd 1.50000 TB
>>>>     48 ssd 1.75000 TB
>>>>     24 ssd 1.79999 TB
> 
> Just for others reading, very small SSDs can end up using a surprising 
> fraction of their capacity for DB/WAL/other overhead, resulting in less 
> usable capacity than one expects.

DB is always offloaded to a RAID1 NVMe storage.
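For reference, creating a BlueStore OSD with its RocksDB offloaded like that looks roughly as follows. The data and DB device paths are hypothetical (the md device standing in for the RAID1 NVMe), and the command is printed as a dry run:

```shell
# Dry-run sketch: BlueStore OSD with its DB on a separate device.
# Paths are hypothetical; /dev/md0p1 stands in for a RAID1 NVMe partition.
data=/dev/sdb
db=/dev/md0p1
echo "ceph-volume lvm create --bluestore --data $data --block.db $db"
```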

(..)

> 
>>>> 
>>>> * First problem:
>>>> We were forced to switch from FileStore to BlueStore in an emergency
>>>> and unscheduled manner because after upgrading the CEPH packages on
>>>> the first storage server, the FileStore OSDs would no longer start.
> 
> Did you capture logs from your init system and representative OSDs?  They 
> would help understand what happened.

Unfortunately not. We were not expecting the problems when we started the 
upgrade… :-(

(..)

> 
>>>> Here we see that our SSD-type OSDs fill up at a rate of ~ 2% every 3
>>>> hours (the phenomenon is also observed on HDD-type OSDs, but as we
>>>> have a large capacity, it is less critical).
>>>> Manual (re)weight changes only provided a temporary solution and,
>>>> despite all our attempts (OSD restart, etc.), we reached the
>>>> critical full_ratio threshold, which is 0.97 for us.
> 
> Does your CRUSH map set optimal tunables? Or an older profile? 
> 
> # ceph osd crush show-tunables
> {
>     "choose_local_tries": 0,
>     "choose_local_fallback_tries": 0,
>     "choose_total_tries": 50,
>     "chooseleaf_descend_once": 1,
>     "chooseleaf_vary_r": 1,
>     "chooseleaf_stable": 1,
>     "straw_calc_version": 1,
>     "allowed_bucket_algs": 54,
>     "profile": "jewel",
>     "optimal_tunables": 1,
>     "legacy_tunables": 0,
>     "minimum_required_version": "jewel",
>     "require_feature_tunables": 1,
>     "require_feature_tunables2": 1,
>     "has_v2_rules": 1,
>     "require_feature_tunables3": 1,
>     "has_v3_rules": 0,
>     "has_v4_buckets": 1,
>     "require_feature_tunables5": 1,
>     "has_v5_rules": 0
> }
> 

Seems to be the same as yours:
{
    "choose_local_tries": 0,
    "choose_local_fallback_tries": 0,
    "choose_total_tries": 50,
    "chooseleaf_descend_once": 1,
    "chooseleaf_vary_r": 1,
    "chooseleaf_stable": 1,
    "straw_calc_version": 1,
    "allowed_bucket_algs": 54,
    "profile": "jewel",
    "optimal_tunables": 1,
    "legacy_tunables": 0,
    "minimum_required_version": "jewel",
    "require_feature_tunables": 1,
    "require_feature_tunables2": 1,
    "has_v2_rules": 0,
    "require_feature_tunables3": 1,
    "has_v3_rules": 0,
    "has_v4_buckets": 1,
    "require_feature_tunables5": 1,
    "has_v5_rules": 0
}




> Older tunables can result in unequal data distribution.  Similarly, are all 
> of your CRUSH buckets straw2?
> 
> # ceph osd crush dump | fgrep alg\" | sort | uniq -c
>      42             "alg": "straw2",

Yup

    306             "alg": "straw2",


> 
> If not
> 
> ceph osd crush set-all-straw-buckets-to-straw2 
> 
> That should help with uniformity, though note that it will cause data to 
> move, and if you're using legacy OSD reweighs those values would need to be 
> readjusted.
> 
> What does
> 
>       ceph balancer status
> 
> show?  If you have legacy reweighs set to < 1.00 and pg-upmap balancing at 
> the same time, you'll end up with outliers.  When using pg-upmap balancing, 
> one really has to reset all the legacy reweights.  If the cluster is fairly 
> full that may need to be done incrementally to minimize making outliers worse.

{
    "active": true,
    "last_optimize_duration": "0:00:01.032868",
    "last_optimize_started": "Sun Sep 14 11:58:51 2025",
    "mode": "upmap",
    "no_optimization_needed": true,
    "optimize_result": "Unable to find further optimization, or pool(s) pg_num 
is decreasing, or distribution is already perfect",
    "plans": []
}



>>>> I'll leave you to imagine the effect on the virtual machines and the
>>>> services provided to our users.
>>>> We also had very strong growth in the size of the MONitor databases
>>>> (~3 GB -> 100 GB) (compaction did not really help).
> 
> Compaction can't happen until backfill/recovery is complete.  At one point 
> there was a bug when it also required that the numbers of total, up, and in 
> OSDs were equal, i.e. all OSDs were up and in.

Good to know
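For anyone else hitting the same mon store growth: once recovery is complete, compaction can also be triggered manually per monitor. The mon name below is hypothetical and the command is printed rather than run:

```shell
# Dry-run sketch: manually trigger RocksDB compaction on one monitor
# (mon name is hypothetical; run once recovery/backfill has finished).
mon=cephmon1
echo "ceph tell mon.$mon compact"
```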


> 
>>>> 
>>>> 
>>>> After this second total recovery of the CEPH cluster and the restart
>>>> of the virtualisation environment, we still have the third DC (10
>>>> cephstore) to update from CEPH 14 to 16, and our ‘SSD’ OSDs are
>>>> filling up again until the automatic activation of
>>>> scrubs/deep-scrubs at 7 p.m.
>>>> Since then, progress has stopped, the use of the various OSDs is
>>>> stable and more or less evenly distributed (via active upmap
>>>> balancer).
> 
> Check your legacy reweights:
> 
> # ceph osd tree | head
> ID   CLASS  WEIGHT      TYPE NAME                   STATUS  REWEIGHT  PRI-AFF
> -37                  0  root staging
>  -1         5577.10254  root default
> -34          465.31158      host cephab92
> 217    hdd    18.53969          osd.217                 up   1.00000  1.00000
> 
> If you have any reweights that aren't 1.0000, that could be a factor.  When 
> using the upmap balancer, they all really need to be 1.0000.

Makes sense.
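A quick way to spot any non-1.0 legacy reweights in that `ceph osd tree` output is to filter on the REWEIGHT column. The sketch below feeds a sample line (values hypothetical) instead of a live cluster:

```shell
# Sketch: flag OSDs whose legacy reweight is not 1.00000.
# Normally: ceph osd tree | awk '...'; here we feed one sample line
# (hypothetical values) so the filter can be shown standalone.
sample='217    hdd    18.53969          osd.217                 up   0.85000  1.00000'
echo "$sample" | awk '$4 ~ /^osd\./ && $6 != "1.00000" {print $4, "reweight is", $6}'
```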

>>>> 
>>>> *** Questions / Assumptions / Opinions ***
>>>> 
>>>> Have you ever encountered a similar phenomenon? We agree that having
>>>> different versions of OSDs coexisting is not a good solution
> 
> Filestore vs BlueStore is below RADOS, so it's not so bad.  BlueStore OSDs 
> are much less prone to memory ballooning but there's no special risk in 
> running both that I've ever seen.

OK

> 
> 
>>>> Our current hypothesis, following the restoration of stability and
>>>> the fact that we have never had this problem with OSDs in FileStore,
>>>> is that there is some kind of ‘housekeeping’ of BlueStore OSDs via
>>>> scrubs. Does that make sense? Any clues ? ideas ?
> 
> Did you see any messages about legacy / per-pool stats? At a certain point, I 
> don't recall when, a nifty new feature was added that required that BlueStore 
> OSDs get a one-time repair, which could be done at startup, but which could 
> take a while especially on spinners.
> 

No messages, but the time for the first start can be really long… I guess that's 
what you describe, the one-time repair/check?


Thx for your answers.

Olivier


_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
