Hi Kasper, 

----- On May 16, 2025, at 12:56, Kasper Rasmussen <[email protected]> wrote: 

> Thanks Frédéric

> I found there is a difference in bluestore_min_alloc_size among the OSDs,
> depending on the version they were created on.

> However, as I understand there is no way to change it, other than destroying
> the OSDs and bringing them back in:

> " This BlueStore attribute takes effect only at OSD creation; if the attribute
> is changed later, a specific OSD’s behavior will not change unless and until
> the OSD is destroyed and redeployed with the appropriate option value(s).
> Upgrading to a later Ceph release will not change the value used by OSDs that
> were deployed under older releases or with other settings. "
> Ref.:
> https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size

True. You'll have to redeploy these OSDs. 

> I'm not sure that's an option, unless there is a huge gain in doing that 
> change.

> The impact of having this discrepancy between the OSDs, as I understand it, is
> a potential "unusually high ratio of raw to stored data" on the 64K OSDs.

> In my case the ratio of raw to stored data is approx. 3:1.
> I'd guess that is what to expect when all pools are set up with three replicas.

> Feel free to correct me if I'm wrong or have misunderstood the docs.

Well, technically you'll be losing raw space on the DB device, data device, or 
both devices when not using 4K. How much depends on your workloads. It's not 
that problematic with RBD workloads with minimal metadata, 4M objects (> 64K), 
and replicated data placement schemes, but would become an issue with S3/CephFS 
workloads with small objects and/or erasure coding data placement schemes. 
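
To put rough numbers on it, here's a quick back-of-the-envelope sketch (my own 
illustration, not Ceph code): each chunk of an object is rounded up to the 
allocation unit, so small objects on erasure-coded pools suffer the most. The 
object sizes and EC profile below are made-up examples:

```shell
# alloc <object_bytes> <min_alloc_size> <k> <m>
# Raw bytes one object consumes: the object is split into k data chunks (plus
# m coding chunks of the same size under EC; k=1 m=0 models one replica copy),
# and each chunk is rounded up to the allocation unit.
alloc() {
  local obj=$1 unit=$2 k=$3 m=$4
  local chunk=$(( (obj + k - 1) / k ))
  local per_chunk=$(( (chunk + unit - 1) / unit * unit ))
  echo $(( per_chunk * (k + m) ))
}

alloc $((4*1024*1024)) 65536 1 0   # 4M RBD object, 64K unit: no waste
alloc $((16*1024)) 4096 4 2        # 16K object, EC 4+2, 4K unit:  24576 bytes
alloc $((16*1024)) 65536 4 2       # same object on a 64K-unit OSD: 393216 bytes
```

So the same 16K object can consume 16 times more raw space on a 64K OSD than on 
a 4K one, which is why small-object S3/CephFS and EC pools are hit hardest. 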

Also, OSDs may behave differently in terms of performance, with some of them 
filling up quicker than others. 

I would advise that you redeploy these OSDs. 
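
If you go that route on a cephadm-managed cluster, the per-OSD flow I'd sketch 
is below. This is a cautious outline, not a recipe: the function only prints 
the commands (osd id 110 is an example) so you can review them first.

```shell
# Print (not run) the replacement steps for one OSD. 'ceph orch osd rm' with
# --replace --zap drains the OSD, zaps its device, and lets the orchestrator
# re-create it, picking up the current 4K min_alloc_size default on Reef.
redeploy_plan() {
  local id=$1
  echo "ceph orch osd rm ${id} --replace --zap"
  echo "# wait for draining/zapping to complete, then verify:"
  echo "ceph osd metadata osd.${id} | jq -r .bluestore_min_alloc_size   # expect 4096"
}
redeploy_plan 110
```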
Regards, 
Frédéric. 

> Anyway. Thanks again for helping me out on this one.

> From: Frédéric Nass <[email protected]>
> Sent: Friday, May 16, 2025 10:04
> To: Kasper Rasmussen <[email protected]>
> Cc: Enrico Bocchi <[email protected]>; ceph-users <[email protected]>
> Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade
> Hi Kasper,

> Great! I haven't seen any issues after using ceph-bluestore-tool
> bluefs-bdev-migrate to move rocksdb metadata back to fast device.

> Since some of your OSDs seem to have been created prior to Pacific, you might
> want to check their bluestore_min_alloc_size. They should all use a
> bluestore_min_alloc_size of 4k.

> In Mimic and earlier releases, the default values were 64KB for rotational
> media (HDD) and 16KB for non-rotational media (SSD). The Octopus release
> changed the default value for non-rotational media (SSD) to 4KB, and the
> Pacific release changed the default value for rotational media (HDD) to 4KB.

> You can use the below command to check the bluestore_min_alloc_size of every
> OSD:

> for osd in $(ceph osd ls) ; do echo -n "osd.$osd bluestore_min_alloc_size = " ;
> ceph osd metadata osd.${osd} | jq -r .bluestore_min_alloc_size ; done

> Regards,
> Frédéric.

> [1]
> https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size

> ----- On May 16, 2025, at 8:49, Kasper Rasmussen <[email protected]>
> wrote:

>> FIXED -

>> So here is what I have tried.

>> 1. Stopped osd.110

>> 2. Enabled sharding with command: ceph-bluestore-tool --path ./osd.110
>> --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard

>> 3. Started osd.110

>> Result: osd.110 still had spilled-over data, but now only 64KiB (down from
>> over 2 GiB).

>> Tried compacting + restart osd.110

>> Result: No changes..

>> Tried stopping osd.110, compacting while offline, then starting osd.110.

>> Result: No changes..

>> Finally -

>> 1. Stopped osd.110

>> 2. ceph-bluestore-tool bluefs-bdev-migrate --path "./osd.110" --devs-source
>> "./osd.110/block" --dev-target "./osd.110/block.db"

>> 3. Started osd.110

>> Result: SUCCESS - BLUEFS_SPILLOVER warning gone on osd.110

>> To be honest I'm not 100% sure if there are any caveats when migrating the
>> data with the ceph-bluestore-tool bluefs-bdev-migrate command, so any input
>> on that will be much appreciated.

>> From: Frédéric Nass <[email protected]>
>> Sent: Thursday, May 15, 2025 17:30
>> To: Kasper Rasmussen <[email protected]>
>> Cc: Enrico Bocchi <[email protected]>; ceph-users <[email protected]>
>> Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade

>> ----- On May 15, 2025, at 14:47, Kasper Rasmussen
>> <[email protected]> wrote:

>>> Hi Both

>>> Let me add some findings.

>>> The cluster started on an older-than-pacific version - I don't know which
>>> version - and has at some point been migrated to Bluestore.

>>> When running ceph osd metadata <osd.id>, 50% or so of the OSDs have no data
>>> in the "ceph_version_when_created" field; the rest have
>>> ceph_version_when_created: ceph version 16.2.....

>>> So I probably have approx. 50% created-pre-pacific OSDs

>> Please note that only OSDs created after Pacific v16.2.11 will have the
>> "ceph_version_when_created" and "created_at" metadata populated. So
>> technically, you could have OSDs created in Pacific, say v16.2.9 using 
>> sharded
>> RocksDBs.

>> Also, I wrongly assumed from the config show output that your OSD RocksDBs
>> were sharded/resharded, but that was just configuration that could have been
>> set after the OSDs' creation.
>> The only way to make sure the RocksDBs are using column families is the way
>> you did, with the ceph-bluestore-tool show-sharding command. I guess you'll
>> have to check them all and reshard them where needed.

>> Please be extremely cautious with the case (m p O L P) in the resharding
>> command, as using wrong letters can wreck your OSDs.

>> Let us know how it goes.

>> Frédéric.

>>> On a few of the created-pre-pacific OSDs I've executed -
>>> ceph-bluestore-tool --path ./osd.XX show-sharding
>>> With result:
>>> failed to retrieve sharding def

>>> On a few of the others I get results like:
>>> m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P

>>> Also, until now it seems like all the OSDs with warnings -
>>> BLUEFS_SPILLOVER
>>> BLUESTORE_SLOW_OP_ALERT (none of these has I/O errors)

>>> - seem to be created-pre-Pacific OSDs.
>>> Although I have warnings on a very small percentage of the total number of
>>> OSDs, it might still be a clue.

>>> I will return to the issue tomorrow

>>> From: Frédéric Nass <[email protected]>
>>> Sent: Thursday, May 15, 2025 13:43
>>> To: Kasper Rasmussen <[email protected]>; Enrico Bocchi
>>> <[email protected]>
>>> Cc: ceph-users <[email protected]>
>>> Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade

>>> Hi Kasper, Hi Enrico,

>>> You're right, Enrico! I misread the numbers. DB is now 11GB in size, not 
>>> even
>>> close to 88GB. What probably happened at some point is that RocksDB had to
>>> allocate more space than 88GB, probably during compaction, and overspilled
>>> 1.9GB to slow device.

>>> If compacting this OSD twice in a row doesn't help with getting the 1.9GB
>>> back to the fast device, then the below commands should:

>>> 1/ ceph orch daemon stop osd.${osd}
>>> 2/ cephadm shell --fsid $(ceph fsid) --name osd.${osd} -- 
>>> ceph-bluestore-tool
>>> bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source
>>> /var/lib/ceph/osd/ceph-${osd}/block --dev-target
>>> /var/lib/ceph/osd/ceph-${osd}/block.db
>>> 3/ ceph orch daemon start osd.${osd}
>>> 4/ ceph tell osd.${osd} compact

>>> Step 4/ should update bluefs stats figures.

>>> Regarding RocksDB, resharding should not be required as the new layout with
>>> column families is already in use.

>>> Kasper, can you check the value of bluestore_volume_selection_policy for the
>>> overspilled OSDs? It should default to 'use_some_extra', which allows
>>> RocksDB to allocate space between 30GB and 88GB and not overspill to the
>>> slow device after allocating ~30GB.

>>> Also, don't forget to enable RocksDB compression.

>>> Regards,
>>> Frédéric.

>>> ----- On May 15, 2025, at 13:04, Kasper Rasmussen
>>> <[email protected]> wrote:

>>>> Thanks to you both

>>>> I was just about to address the used vs total byte thing.

>>>> I will look into your pointers Enrico, and return with comments/findings.

>>>> From: Enrico Bocchi <[email protected]>
>>>> Sent: Thursday, May 15, 2025 13:00
>>>> To: Kasper Rasmussen <[email protected]>; Frédéric Nass
>>>> <[email protected]>
>>>> Cc: ceph-users <[email protected]>
>>>> Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade
>>>> Hi Kasper,

>>>> As Frédéric pointed out, you should consider resharding the RocksDB
>>>> database to use column families (if the OSD was created pre-Pacific):
>>>> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#rocksdb-sharding
>>>> There's additional documentation available with some preliminary steps,
>>>> including making sure your RocksDB does not already use column families.

>>>> RocksDB options have changed in recent reef releases, and seem to be
>>>> quite different w.r.t. Pacific/Quincy. You may want to check if any of
>>>> the configuration options that have been modified are relevant for your
>>>> setup.
>>>> Here is an excellent deep-dive blog post by the unequaled Mark Nelson:
>>>> https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/

>>>> For the used vs total bytes I have to disagree with Frédéric (sorry):
>>>> 11631853568 / 88906653696 gives 13% utilization. So the OSD should not
>>>> overspill to slow storage.
>>>> I have seen this in the past and rolling back to previous
>>>> bluestore_rocksdb options helped. However, I have not resharded RocksDB
>>>> to column families yet. Would you please keep us posted if you reshard
>>>> to cf and this fixes overspill?

>>>> Cheers,
>>>> Enrico

>>>> On 5/15/25 12:35, Frédéric Nass wrote:
>>>>> Hi Kasper,

>>>>> Thanks for sharing.

>>>>> I don't see anything wrong with this specific OSD when it comes to
>>>>> bluestore_rocksdb_*. Its RocksDB database is using column families and this
>>>>> OSD was resharded properly (if not created or recreated in Pacific). What
>>>>> the perf dump shows is that the db_used_bytes is above the db_total_bytes.
>>>>> If this cluster makes heavy use of metadata (RGW workloads for example),
>>>>> then 90GB of DB device for 10TB drives is less than 1%, which is not
>>>>> enough. The general recommendation for RGW workloads is to use a DB device
>>>>> of at least 4% of the size of the data device [1].

>>>>> Now, your best move is probably to enable RocksDB compression (ceph config
>>>>> set osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'),
>>>>> restart and compact these OSDs to update bluefs stats, and consider giving
>>>>> those OSDs larger RocksDB partitions in the future.

>>>>> Regards,
>>>>> Frédéric.

>>>>> [1]
>>>>> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

>>>>> ----- On May 15, 2025, at 7:44, Kasper Rasmussen
>>>>> <[email protected]> wrote:

>>>> >> perf dump:
>>>> >> "bluefs": {
>>>> >> "db_total_bytes": 88906653696,
>>>> >> "db_used_bytes": 11631853568,
>>>> >> "wal_total_bytes": 0,
>>>> >> "wal_used_bytes": 0,
>>>> >> "slow_total_bytes": 9796816207872,
>>>> >> "slow_used_bytes": 1881341952,
>>>> >> "num_files": 229,
>>>> >> "log_bytes": 11927552,
>>>> >> "log_compactions": 78,
>>>> >> "log_write_count": 281792,
>>>> >> "logged_bytes": 1154220032,
>>>> >> "files_written_wal": 179,
>>>> >> "files_written_sst": 311,
>>>> >> "write_count_wal": 280405,
>>>> >> "write_count_sst": 29432,
>>>> >> "bytes_written_wal": 4015595520,
>>>> >> "bytes_written_sst": 15728308224,
>>>> >> "bytes_written_slow": 2691231744,
>>>> >> "max_bytes_wal": 0,
>>>> >> "max_bytes_db": 13012828160,
>>>> >> "max_bytes_slow": 3146252288,
>>>> >> "alloc_unit_slow": 65536,
>>>> >> "alloc_unit_db": 1048576,
>>>> >> "alloc_unit_wal": 0,
>>>> >> "read_random_count": 1871590,
>>>> >> "read_random_bytes": 18959576586,
>>>> >> "read_random_disk_count": 563421,
>>>> >> "read_random_disk_bytes": 17110012647,
>>>> >> "read_random_disk_bytes_wal": 0,
>>>> >> "read_random_disk_bytes_db": 11373755941,
>>>> >> "read_random_disk_bytes_slow": 5736256706,
>>>> >> "read_random_buffer_count": 1313456,
>>>> >> "read_random_buffer_bytes": 1849563939,
>>>> >> "read_count": 275731,
>>>> >> "read_bytes": 4825912551,
>>>> >> "read_disk_count": 225997,
>>>> >> "read_disk_bytes": 4016943104,
>>>> >> "read_disk_bytes_wal": 0,
>>>> >> "read_disk_bytes_db": 3909947392,
>>>> >> "read_disk_bytes_slow": 106999808,
>>>> >> "read_prefetch_count": 274534,
>>>> >> "read_prefetch_bytes": 4785141168,
>>>> >> "write_count": 591760,
>>>> >> "write_disk_count": 591838,
>>>> >> "write_bytes": 21062987776,
>>>> >> "compact_lat": {
>>>> >> "avgcount": 78,
>>>> >> "sum": 0.572247346,
>>>> >> "avgtime": 0.007336504
>>>> >> },
>>>> >> "compact_lock_lat": {
>>>> >> "avgcount": 78,
>>>> >> "sum": 0.182746199,
>>>> >> "avgtime": 0.002342899
>>>> >> },
>>>> >> "alloc_slow_fallback": 0,
>>>> >> "alloc_slow_size_fallback": 0,
>>>> >> "read_zeros_candidate": 0,
>>>> >> "read_zeros_errors": 0,
>>>> >> "wal_alloc_lat": {
>>>> >> "avgcount": 0,
>>>> >> "sum": 0.000000000,
>>>> >> "avgtime": 0.000000000
>>>> >> },
>>>> >> "db_alloc_lat": {
>>>> >> "avgcount": 969,
>>>> >> "sum": 0.006368060,
>>>> >> "avgtime": 0.000006571
>>>> >> },
>>>> >> "slow_alloc_lat": {
>>>> >> "avgcount": 39,
>>>> >> "sum": 0.004502210,
>>>> >> "avgtime": 0.000115441
>>>> >> },
>>>> >> "alloc_wal_max_lat": 0.000000000,
>>>> >> "alloc_db_max_lat": 0.000113831,
>>>> >> "alloc_slow_max_lat": 0.000301347
>>>> >> },


>>>> >> config show:
>>>> >> "bluestore_rocksdb_cf": "true",
>>>> >> "bluestore_rocksdb_cfs": "m(3) p(3,0-12) 
>>>> >> O(3,0-13)=block_cache={type=binned_lru}
>>>> >> L=min_write_buffer_number_to_merge=32 
>>>> >> P=min_write_buffer_number_to_merge=32",
>>>> >> "bluestore_rocksdb_options":
>>>> >> "compression=kLZ4Compression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0",
>>>> >> "bluestore_rocksdb_options_annex": "",


>>>> >> Don't know if it is of any help, but I've compared the config with an
>>>> >> OSD not reporting any issues, and there is no difference.


>>>> >> ________________________________
>>>> >> From: Enrico Bocchi <[email protected]>
>>>> >> Sent: Wednesday, May 14, 2025 22:47
>>>> >> To: Kasper Rasmussen <[email protected]>; ceph-users
>>>> >> <[email protected]>
>>>> >> Subject: Re: BLUEFS_SPILLOVER after Reef upgrade

>>>> >> Hi Kasper,

>>>> >> Would you mind sharing the output of `perf dump` and `config show` from
>>>> >> the daemon socket of one of the OSDs reporting bluefs spillover? I am
>>>> >> interested in the bluefs part of the former and in the bluestore_rocksdb
>>>> >> options of the latter.

>>>> >> The warning about slow ops in bluestore is a different story. There have
>>>> >> been several messages on this mailing list recently with suggestions on
>>>> >> how to tune the alert threshold. From my experience, they very likely
>>>> >> relate to some problem with the underlying storage device, so I'd
>>>> >> recommend investigating the root cause rather than simply silencing the
>>>> >> warning.

>>>> >> Cheers,
>>>> >> Enrico


>>>> >> ________________________________
>>>> >> From: Kasper Rasmussen <[email protected]>
>>>> >> Sent: Wednesday, May 14, 2025 8:22:46 PM
>>>> >> To: ceph-users <[email protected]>
>>>> >> Subject: [ceph-users] BLUEFS_SPILLOVER after Reef upgrade

>>>> >> I've just upgraded our ceph cluster from pacific 16.2.15 -> Reef 18.2.7

>>>> >> After that I see the warnings:

>>>> >> [WRN] BLUEFS_SPILLOVER: 5 OSD(s) experiencing BlueFS spillover
>>>> >> osd.110 spilled over 4.5 GiB metadata from 'db' device (8.0 GiB used of 83 GiB) to slow device
>>>> >> osd.455 spilled over 1.1 GiB metadata from 'db' device (11 GiB used of 83 GiB) to slow device
>>>> >> osd.533 spilled over 426 MiB metadata from 'db' device (10 GiB used of 83 GiB) to slow device
>>>> >> osd.560 spilled over 389 MiB metadata from 'db' device (9.8 GiB used of 83 GiB) to slow device
>>>> >> osd.597 spilled over 8.6 GiB metadata from 'db' device (7.7 GiB used of 83 GiB) to slow device
>>>> >> [WRN] BLUESTORE_SLOW_OP_ALERT: 4 OSD(s) experiencing slow operations in BlueStore
>>>> >> osd.410 observed slow operation indications in BlueStore
>>>> >> osd.443 observed slow operation indications in BlueStore
>>>> >> osd.508 observed slow operation indications in BlueStore
>>>> >> osd.593 observed slow operation indications in BlueStore

>>>> >> I've tried to run ceph tell osd.XXX compact with no result.

>>>> >> Bluefs stats:

>>>> >> ceph tell osd.110 bluefs stats
>>>> >> 1 : device size 0x14b33fe000 : using 0x202c00000(8.0 GiB)
>>>> >> 2 : device size 0x8e8ffc00000 : using 0x5d31d150000(5.8 TiB)
>>>> >> RocksDBBlueFSVolumeSelector >>Settings<< extra=0 B, l0_size=1 GiB, l_base=1 GiB, l_multi=8 B
>>>> >> DEV/LEV     WAL     DB        SLOW      *      *      REAL      FILES
>>>> >> LOG         0 B     16 MiB    0 B       0 B    0 B    15 MiB    1
>>>> >> WAL         0 B     18 MiB    0 B       0 B    0 B    6.3 MiB   1
>>>> >> DB          0 B     8.0 GiB   0 B       0 B    0 B    8.0 GiB   140
>>>> >> SLOW        0 B     0 B       4.5 GiB   0 B    0 B    4.5 GiB   78
>>>> >> TOTAL       0 B     8.0 GiB   4.5 GiB   0 B    0 B    0 B       220
>>>> >> MAXIMUMS:
>>>> >> LOG         0 B     25 MiB    0 B       0 B    0 B    21 MiB
>>>> >> WAL         0 B     118 MiB   0 B       0 B    0 B    93 MiB
>>>> >> DB          0 B     8.2 GiB   0 B       0 B    0 B    8.2 GiB
>>>> >> SLOW        0 B     0 B       14 GiB    0 B    0 B    14 GiB
>>>> >> TOTAL       0 B     8.2 GiB   14 GiB    0 B    0 B    0 B
>>>> >> >>SIZE<<    0 B     79 GiB    8.5 TiB

>>>> >> Help with what to do next will be much appreciated.


>>>> >> _______________________________________________
>>>> >> ceph-users mailing list -- [email protected]
>>>> >> To unsubscribe send an email to [email protected]


>>>> --
>>>> Enrico Bocchi
>>>> CERN European Laboratory for Particle Physics
>>>> IT - Storage & Data Management - General Storage Services
>>>> Mailbox: G20500 - Office: 31-2-010
>>>> 1211 Genève 23
>>>> Switzerland
