> On 13 Sept 2025, at 14:17, Anthony D'Atri <[email protected]> wrote:
>
> (..)
>>>
>>> I can't say that we upgraded a lot of clusters from N to P, but those
>>> that we upgraded didn't show any of these symptoms you describe. But
>>> we always did the Filestore to Bluestore conversion before the actual
>>> upgrade. In SUSE Enterprise Storage (which we also supported at that
>>> time) this was pointed out as a requirement. I just checked the ceph
>>> docs, I can't find such a statement (yet).
>
> I *think* Filestore OSDs still work, but I've been peppering the docs with
> admonitions to convert for several releases now. I would expect them to have
> worked with Pacific. Did you update from the last Nautilus to the last
> Pacific?
Yes. And yes, it was supposed to work. In fact it does work on our test cluster,
but that cluster does not have the same history as the prod one and exists only
to test some crush rules, etc., before we apply changes to the prod cluster.
>>>> In terms of hardware, we have three monitors (cephmon) and 30
>>>> storage servers (cephstore) spread across three datacenters. These
>>>> servers are connected to the network via an aggregate (LACP) of two
>>>> 10 Gbps fibre connections, through which two VLANs pass, one for the
>>>> CEPH frontend network and one for the CEPH backend network. In doing
>>>> so, we have always given ourselves the option of separating the
>>>> frontend and backend into dedicated aggregates if the bandwidth
>>>> becomes insufficient.
>
> Nice planning. For scratch clusters today I usually suggest a single 25-100
> GE bonded public network. The dynamics when you started were different.
Yes. But from a network port availability point of view, I think the next
step will probably be splitting the front and the back networks.
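I.e., keeping the bonds but giving each network its own aggregate, roughly like
this (a sketch only; the subnets below are placeholders, not ours):

[global]
    public_network  = 192.0.2.0/24       # frontend VLAN / bond
    cluster_network = 198.51.100.0/24    # backend VLAN / bond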
>
>
>>>>
>>>> 100 hdd 10.90999 TB
>>>> 48 hdd 11.00000 TB
>>>> 48 hdd 14.54999 TB
>>>> 24 hdd 15.00000 TB
>>>> 9 hdd 5.45999 TB
>>>> 108 hdd 9.09999 TB
>
> Have you tried disabling their volatile write cache?
Yes, they are disabled.
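For anyone following along, checking/disabling it looks roughly like this
(device names are placeholders):

# smartctl -g wcache /dev/sdX            # show current write cache state
# hdparm -W 0 /dev/sdX                   # SATA: disable volatile write cache
# sdparm --set WCE=0 --save /dev/sdX     # SAS: disable and persist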
>
>>>>
>>>> 84 ssd 0.89400 TB
>>>> 198 ssd 0.89424 TB
>>>> 18 ssd 0.93599 TB
>>>> 32 ssd 1.45999 TB
>>>> 16 ssd 1.50000 TB
>>>> 48 ssd 1.75000 TB
>>>> 24 ssd 1.79999 TB
>
> Just for others reading, very small SSDs can end up using a surprising
> fraction of their capacity for DB/WAL/other overhead, resulting in less
> usable capacity than one expects.
The DB is always offloaded to RAID1 NVMe storage.
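For context, such an OSD is typically created along these lines (the DB device
path here is just an example, not our actual layout):

# ceph-volume lvm create --data /dev/sdX --block.db <nvme-raid1-lv-or-partition>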
(..)
>
>>>>
>>>> * First problem:
>>>> We were forced to switch from FileStore to BlueStore in an emergency
>>>> and unscheduled manner because after upgrading the CEPH packages on
>>>> the first storage server, the FileStore OSDs would no longer start.
>
> Did you capture logs from your init system and representative OSDs? They
> would help understand what happened.
Unfortunately not. We were not expecting any problems when we started the
upgrade… :-(
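For next time, the relevant bits could be pulled with something like this
(the unit name depends on how the OSDs were deployed):

# journalctl -u ceph-osd@<id> --since "2 hours ago"
# less /var/log/ceph/ceph-osd.<id>.log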
(..)
>
>>>> Here we see that our SSD-type OSDs fill up at a rate of ~ 2% every 3
>>>> hours (the phenomenon is also observed on HDD-type OSDs, but as we
>>>> have a large capacity, it is less critical).
>>>> Manual (re)weight changes only provided a temporary solution and,
>>>> despite all our attempts (OSD restart, etc.), we reached the
>>>> critical full_ratio threshold, which is 0.97 for us.
>
> Does your CRUSH map set optimal tunables? Or an older profile?
>
> # ceph osd crush show-tunables
> {
> "choose_local_tries": 0,
> "choose_local_fallback_tries": 0,
> "choose_total_tries": 50,
> "chooseleaf_descend_once": 1,
> "chooseleaf_vary_r": 1,
> "chooseleaf_stable": 1,
> "straw_calc_version": 1,
> "allowed_bucket_algs": 54,
> "profile": "jewel",
> "optimal_tunables": 1,
> "legacy_tunables": 0,
> "minimum_required_version": "jewel",
> "require_feature_tunables": 1,
> "require_feature_tunables2": 1,
> "has_v2_rules": 1,
> "require_feature_tunables3": 1,
> "has_v3_rules": 0,
> "has_v4_buckets": 1,
> "require_feature_tunables5": 1,
> "has_v5_rules": 0
> }
>
Seems the same as yours:
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 1,
"straw_calc_version": 1,
"allowed_bucket_algs": 54,
"profile": "jewel",
"optimal_tunables": 1,
"legacy_tunables": 0,
"minimum_required_version": "jewel",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 1,
"require_feature_tunables5": 1,
"has_v5_rules": 0
}
> Older tunables can result in unequal data distribution. Similarly, are all
> of your CRUSH buckets straw2?
>
> # ceph osd crush dump | fgrep alg\" | sort | uniq -c
> 42 "alg": "straw2",
Yup
306 "alg": "straw2",
>
> If not
>
> ceph osd crush set-all-straw-buckets-to-straw2
>
> That should help with uniformity, though note that it will cause data to
> move, and if you're using legacy OSD reweighs those values would need to be
> readjusted.
>
> What does
>
> ceph balancer status
>
> show? If you have legacy reweighs set to < 1.00 and pg-upmap balancing at
> the same time, you'll end up with outliers. When using pg-upmap balancing,
> one really has to reset all the legacy reweights. If the cluster is fairly
> full that may need to be done incrementally to minimize making outliers worse.
{
"active": true,
"last_optimize_duration": "0:00:01.032868",
"last_optimize_started": "Sun Sep 14 11:58:51 2025",
"mode": "upmap",
"no_optimization_needed": true,
"optimize_result": "Unable to find further optimization, or pool(s) pg_num
is decreasing, or distribution is already perfect",
"plans": []
}
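For completeness, the current distribution score can also be checked with:

# ceph balancer eval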
>>>> I'll leave you to imagine the effect on the virtual machines and the
>>>> services provided to our users.
>>>> We also had very strong growth in the size of the MONitor databases
>>>> (~3 GB -> 100 GB) (compaction did not really help).
>
> Compaction can't happen until backfill/recovery is complete. At one point
> there was a bug when it also required that the numbers of total, up, and in
> OSDs were equal, i.e. all OSDs were up and in.
Good to know
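For the record, once recovery is complete the mon stores can usually be
compacted with something like:

# ceph tell mon.<id> compact

or by setting mon_compact_on_start = true and restarting each mon in turn.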
>
>>>>
>>>>
>>>> After this second total recovery of the CEPH cluster and the restart
>>>> of the virtualisation environment, we still have the third DC (10
>>>> cephstore) to update from CEPH 14 to 16, and our ‘SSD’ OSDs are
>>>> filling up again until the automatic activation of
>>>> scrubs/deep-scrubs at 7 p.m.
>>>> Since then, progress has stopped, the use of the various OSDs is
>>>> stable and more or less evenly distributed (via active upmap
>>>> balancer).
>
> Check your legacy reweights:
>
> # ceph osd tree | head
> ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
> -37 0 root staging
> -1 5577.10254 root default
> -34 465.31158 host cephab92
> 217 hdd 18.53969 osd.217 up 1.00000 1.00000
>
> If you have any reweights that aren't 1.0000, that could be a factor. When
> using the upmap balancer, they all really need to be 1.0000.
Makes sense.
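Something along these lines should spot any leftovers (the awk fields assume
the default ceph osd tree column layout shown above):

# ceph osd tree | awk '$1 ~ /^[0-9]+$/ && $(NF-1) != "1.00000"'
# ceph osd reweight <id> 1.0    # then reset any stragglers, a few at a time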
>>>>
>>>> *** Questions / Assumptions / Opinions ***
>>>>
>>>> Have you ever encountered a similar phenomenon? We agree that having
>>>> different versions of OSDs coexisting is not a good solution
>
> Filestore vs BlueStore is below RADOS, so it's not so bad. BlueStore OSDs
> are much less prone to memory ballooning but there's no special risk in
> running both that I've ever seen.
OK
>
>
>>>> Our current hypothesis, following the restoration of stability and
>>>> the fact that we have never had this problem with OSDs in FileStore,
>>>> is that there is some kind of ‘housekeeping’ of BlueStore OSDs via
>>>> scrubs. Does that make sense? Any clues ? ideas ?
>
> Did you see any messages about legacy / per-pool stats? At a certain point, I
> don't recall when, a nifty new feature was added that required that BlueStore
> OSDs get a one-time repair, which could be done at startup, but which could
> take a while especially on spinners.
>
No messages, but the time for the first start can be really long… I guess that
is what you describe, the one-time repair / check?
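If so, I assume it is the per-pool stats conversion; something like the
following (untested on our side) would let the quick-fix run at mount time,
or it can be done offline per OSD:

# ceph config set osd bluestore_fsck_quick_fix_on_mount true
# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-<id>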
Thx for your answers.
Olivier
