Hi Kasper, 

Great! I haven't seen any issues after using ceph-bluestore-tool 
bluefs-bdev-migrate to move RocksDB metadata back to the fast device. 

Since some of your OSDs seem to have been created prior to Pacific, you might 
want to check their bluestore_min_alloc_size. They should all use a 
bluestore_min_alloc_size of 4KB. Note that this value is set at OSD creation; 
changing it requires redeploying the OSD. 

In Mimic and earlier releases, the default values were 64KB for rotational 
media (HDD) and 16KB for non-rotational media (SSD). The Octopus release 
changed the default value for non-rotational media (SSD) to 4KB, and the 
Pacific release changed the default value for rotational media (HDD) to 4KB [1]. 

You can use the command below to check the bluestore_min_alloc_size of every 
OSD (note that 'ceph osd metadata' takes the numeric OSD id): 

for osd in $(ceph osd ls) ; do echo -n "osd.$osd bluestore_min_alloc_size = " ; 
ceph osd metadata ${osd} | jq -r .bluestore_min_alloc_size ; done 
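
The value is reported in bytes, so the output should look something like this 
(65536 would indicate a pre-Pacific 64KB allocation size): 

osd.110 bluestore_min_alloc_size = 65536 
osd.111 bluestore_min_alloc_size = 4096 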

Regards, 
Frédéric. 

[1] 
https://docs.ceph.com/en/reef/rados/configuration/bluestore-config-ref/#minimum-allocation-size
 

----- On 16 May 25, at 8:49, Kasper Rasmussen <[email protected]> 
wrote: 

> FIXED -

> So here is what I have tried.

> 1. Stopped osd.110
> 2. Enabled sharding with command: ceph-bluestore-tool --path ./osd.110
>    --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
> 3. Started osd.110
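> (ceph-bluestore-tool --path ./osd.110 show-sharding, run while the OSD is
> still down, can be used to confirm the new sharding definition took effect.)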

> Result: osd.110 still had spilled-over data, but now only 64KiB (down from
> over 2 GiB)

> Tried compacting + restarting osd.110

> Result: No changes.

> Tried stopping osd.110, compacting while offline, then starting osd.110

> Result: No changes.

> Finally -

> 1. Stopped osd.110
> 2. ceph-bluestore-tool bluefs-bdev-migrate --path "./osd.110" --devs-source
>    "./osd.110/block" --dev-target "./osd.110/block.db"
> 3. Started osd.110

> Result: SUCCESS - BLUEFS_SPILLOVER warning gone on osd.110

> To be honest I'm not 100% sure if there are any caveats when migrating the data
> with the ceph-bluestore-tool bluefs-bdev-migrate command, so any input on that
> will be much appreciated.

> From: Frédéric Nass <[email protected]>
> Sent: Thursday, May 15, 2025 17:30
> To: Kasper Rasmussen <[email protected]>
> Cc: Enrico Bocchi <[email protected]>; ceph-users <[email protected]>
> Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade

> ----- On 15 May 25, at 14:47, Kasper Rasmussen <[email protected]>
> wrote:

>> Hi Both

>> Let me add some findings.

>> The cluster started on an older-than-Pacific version - I don't know which
>> version - and has at some point been migrated to BlueStore.

>> When running ceph osd metadata <osd.id>, 50% or so of the OSDs have no data in
>> the "ceph_version_when_created" field; the rest have ceph_version_when_created:
>> ceph version 16.2.....

>> So I probably have approx. 50% pre-Pacific OSDs

> Please note that only OSDs created with Pacific v16.2.11 or later will have the
> "ceph_version_when_created" and "created_at" metadata populated. So
> technically, you could have OSDs created in Pacific, say v16.2.9, using sharded
> RocksDBs.

> Also, I wrongly assumed from the config show output that your OSD RocksDBs were
> sharded/resharded, but that was just configuration that could have been set
> after the OSDs were created.
> The only way to make sure the RocksDBs are using column families is the way you
> did it, with the ceph-bluestore-tool show-sharding command. I guess you'll have
> to check them all and reshard them where needed.
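
> If the OSDs are containerized, something along these lines should work (a
> sketch assuming a cephadm-managed cluster, as in the migrate commands quoted
> below; ceph-bluestore-tool needs the OSD to be down):

> ceph orch daemon stop osd.${osd}
> cephadm shell --fsid $(ceph fsid) --name osd.${osd} -- ceph-bluestore-tool \
>   --path /var/lib/ceph/osd/ceph-${osd} show-sharding
> ceph orch daemon start osd.${osd}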

> Please be extremely cautious with the letter case (m p O L P) in the resharding
> command, as using the wrong letters can wreck your OSDs.

> Let us know how it goes.

> Frédéric.

>> On a few of the pre-Pacific OSDs I've executed
>> ceph-bluestore-tool --path ./osd.XX show-sharding
>> with the result:
>> failed to retrieve sharding def

>> On a few of the others I get results like:
>> m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P

>> Also, until now it seems like all the OSDs with warnings -
>> BLUEFS_SPILLOVER
>> BLUESTORE_SLOW_OP_ALERT (none of these have I/O errors)

>> are pre-Pacific OSDs.
>> Although I have warnings on a very small percentage of the total number of OSDs,
>> it might still be a clue.

>> I will return to the issue tomorrow

>> From: Frédéric Nass <[email protected]>
>> Sent: Thursday, May 15, 2025 13:43
>> To: Kasper Rasmussen <[email protected]>; Enrico Bocchi
>> <[email protected]>
>> Cc: ceph-users <[email protected]>
>> Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade

>> Hi Kasper, Hi Enrico,

>> You're right, Enrico! I misread the numbers. The DB is now 11GB in size, not
>> even close to 88GB. What probably happened at some point is that RocksDB had to
>> allocate more space than the 88GB available, probably during compaction, and
>> overspilled 1.9GB to the slow device.

>> If compacting this OSD twice in a row doesn't help with getting the 1.9GB back
>> to the fast device, then the commands below should:

>> 1/ ceph orch daemon stop osd.${osd}
>> 2/ cephadm shell --fsid $(ceph fsid) --name osd.${osd} -- ceph-bluestore-tool
>> bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} --devs-source
>> /var/lib/ceph/osd/ceph-${osd}/block --dev-target
>> /var/lib/ceph/osd/ceph-${osd}/block.db
>> 3/ ceph orch daemon start osd.${osd}
>> 4/ ceph tell osd.${osd} compact

>> Step 4/ should update bluefs stats figures.
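>> You can check with 'ceph tell osd.${osd} bluefs stats' afterwards: the SLOW row
>> should be back to 0 B once the metadata has been migrated off the slow device.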

>> Regarding RocksDB, resharding should not be required, as the new layout with
>> column families is already being used.

>> Kasper, can you check the value of bluestore_volume_selection_policy for the
>> overspilled OSDs? It should default to 'use_some_extra', which allows RocksDB to
>> allocate space between 30GB and 88GB and not overspill to the slow device after
>> allocating ~30GB.
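
>> For example:

>> ceph tell osd.${osd} config get bluestore_volume_selection_policy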

>> Also, don't forget to enable RocksDB compression.
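
>> Something like this should do it (the same annex option as in my previous
>> message, quoted below; existing data only gets compressed as compaction
>> rewrites it):

>> ceph config set osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'
>> ceph orch daemon restart osd.${osd}
>> ceph tell osd.${osd} compact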

>> Regards,
>> Frédéric.

>> ----- On 15 May 25, at 13:04, Kasper Rasmussen
>> <[email protected]> wrote:

>>> Thanks to you both

>>> I was just about to address the used vs total byte thing.

>>> I will look into your pointers Enrico, and return with comments/findings.

>>> From: Enrico Bocchi <[email protected]>
>>> Sent: Thursday, May 15, 2025 13:00
>>> To: Kasper Rasmussen <[email protected]>; Frédéric Nass
>>> <[email protected]>
>>> Cc: ceph-users <[email protected]>
>>> Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade
>>> Hi Kasper,

>>> As Frédéric pointed out, you should consider resharding the RocksDB
>>> database to use column families (if the OSD was created pre-Pacific):
>>> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#rocksdb-sharding
>>> There's additional documentation available with some preliminary steps,
>>> including making sure your RocksDB does not already use column families.

>>> RocksDB options have changed in recent Reef releases, and seem to be
>>> quite different compared to Pacific/Quincy. You may want to check if any of
>>> the configuration options that have been modified are relevant for your
>>> setup.
>>> Here is an excellent deep-dive blog post by the unequaled Mark Nelson:
>>> https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/
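>>> A quick way to spot options deviating from the current built-in defaults is
>>> 'ceph tell osd.<id> config diff'.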

>>> For the used vs total bytes I have to disagree with Frédéric (sorry):
>>> 11631853568 / 88906653696 gives 13% utilization, so the OSD should not
>>> overspill to slow storage.
>>> I have seen this in the past, and rolling back to previous
>>> bluestore_rocksdb options helped. However, I have not resharded RocksDB
>>> to column families yet. Would you please keep us posted if you reshard
>>> to column families and this fixes the overspill?

>>> Cheers,
>>> Enrico

>>> On 5/15/25 12:35, Frédéric Nass wrote:
>>>> Hi Kasper,

>>>> Thanks for sharing.

>>>> I don't see anything wrong with this specific OSD when it comes to
>>>> bluestore_rocksdb_*. Its RocksDB database is using column families and this
>>>> OSD was resharded properly (if not created or recreated in Pacific). What the
>>>> perf dump shows is that the db_used_bytes is above the db_total_bytes. If this
>>>> cluster makes heavy use of metadata (RGW workloads for example) then 90GB of
>>>> DB device for 10TB drives is less than 1%, which is not enough. The general
>>>> recommendation for RGW workloads is to use a DB device of at least 4% of the
>>>> size of the data device [1].
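>>>> (For these 10TB drives, 4% would mean roughly 400GB of DB device per OSD,
>>>> versus the ~90GB currently provisioned.)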

>>>> Now, your best move is probably to enable RocksDB compression (ceph config set
>>>> osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'), restart and
>>>> compact these OSDs to update bluefs stats, and consider giving those OSDs
>>>> larger RocksDB partitions in the future.

>>>> Regards,
>>>> Frédéric.

>>>> [1]
>>>> https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

>>>> ----- On 15 May 25, at 7:44, Kasper Rasmussen [email protected]
>>>> wrote:

>>> >> perf dump:
>>> >> "bluefs": {
>>> >> "db_total_bytes": 88906653696,
>>> >> "db_used_bytes": 11631853568,
>>> >> "wal_total_bytes": 0,
>>> >> "wal_used_bytes": 0,
>>> >> "slow_total_bytes": 9796816207872,
>>> >> "slow_used_bytes": 1881341952,
>>> >> "num_files": 229,
>>> >> "log_bytes": 11927552,
>>> >> "log_compactions": 78,
>>> >> "log_write_count": 281792,
>>> >> "logged_bytes": 1154220032,
>>> >> "files_written_wal": 179,
>>> >> "files_written_sst": 311,
>>> >> "write_count_wal": 280405,
>>> >> "write_count_sst": 29432,
>>> >> "bytes_written_wal": 4015595520,
>>> >> "bytes_written_sst": 15728308224,
>>> >> "bytes_written_slow": 2691231744,
>>> >> "max_bytes_wal": 0,
>>> >> "max_bytes_db": 13012828160,
>>> >> "max_bytes_slow": 3146252288,
>>> >> "alloc_unit_slow": 65536,
>>> >> "alloc_unit_db": 1048576,
>>> >> "alloc_unit_wal": 0,
>>> >> "read_random_count": 1871590,
>>> >> "read_random_bytes": 18959576586,
>>> >> "read_random_disk_count": 563421,
>>> >> "read_random_disk_bytes": 17110012647,
>>> >> "read_random_disk_bytes_wal": 0,
>>> >> "read_random_disk_bytes_db": 11373755941,
>>> >> "read_random_disk_bytes_slow": 5736256706,
>>> >> "read_random_buffer_count": 1313456,
>>> >> "read_random_buffer_bytes": 1849563939,
>>> >> "read_count": 275731,
>>> >> "read_bytes": 4825912551,
>>> >> "read_disk_count": 225997,
>>> >> "read_disk_bytes": 4016943104,
>>> >> "read_disk_bytes_wal": 0,
>>> >> "read_disk_bytes_db": 3909947392,
>>> >> "read_disk_bytes_slow": 106999808,
>>> >> "read_prefetch_count": 274534,
>>> >> "read_prefetch_bytes": 4785141168,
>>> >> "write_count": 591760,
>>> >> "write_disk_count": 591838,
>>> >> "write_bytes": 21062987776,
>>> >> "compact_lat": {
>>> >> "avgcount": 78,
>>> >> "sum": 0.572247346,
>>> >> "avgtime": 0.007336504
>>> >> },
>>> >> "compact_lock_lat": {
>>> >> "avgcount": 78,
>>> >> "sum": 0.182746199,
>>> >> "avgtime": 0.002342899
>>> >> },
>>> >> "alloc_slow_fallback": 0,
>>> >> "alloc_slow_size_fallback": 0,
>>> >> "read_zeros_candidate": 0,
>>> >> "read_zeros_errors": 0,
>>> >> "wal_alloc_lat": {
>>> >> "avgcount": 0,
>>> >> "sum": 0.000000000,
>>> >> "avgtime": 0.000000000
>>> >> },
>>> >> "db_alloc_lat": {
>>> >> "avgcount": 969,
>>> >> "sum": 0.006368060,
>>> >> "avgtime": 0.000006571
>>> >> },
>>> >> "slow_alloc_lat": {
>>> >> "avgcount": 39,
>>> >> "sum": 0.004502210,
>>> >> "avgtime": 0.000115441
>>> >> },
>>> >> "alloc_wal_max_lat": 0.000000000,
>>> >> "alloc_db_max_lat": 0.000113831,
>>> >> "alloc_slow_max_lat": 0.000301347
>>> >> },


>>> >> config show:
>>> >> "bluestore_rocksdb_cf": "true",
>>> >> "bluestore_rocksdb_cfs": "m(3) p(3,0-12) 
>>> >> O(3,0-13)=block_cache={type=binned_lru}
>>> >> L=min_write_buffer_number_to_merge=32 
>>> >> P=min_write_buffer_number_to_merge=32",
>>> >> "bluestore_rocksdb_options":
>>> >> "compression=kLZ4Compression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0",
>>> >> "bluestore_rocksdb_options_annex": "",


>>> >> Don't know if it is of any help, but I've compared with the config from an
>>> >> OSD not reporting any issues, and there is no difference.


>>> >> ________________________________
>>> >> From: Enrico Bocchi <[email protected]>
>>> >> Sent: Wednesday, May 14, 2025 22:47
>>> >> To: Kasper Rasmussen <[email protected]>; ceph-users
>>> >> <[email protected]>
>>> >> Subject: Re: BLUEFS_SPILLOVER after Reef upgrade

>>> >> Hi Kasper,

>>> >> Would you mind sharing the output of `perf dump` and `config show` from the
>>> >> daemon socket of one of the OSDs reporting BlueFS spillover? I am interested
>>> >> in the bluefs part of the former and in the bluestore_rocksdb options of the
>>> >> latter.

>>> >> The warning about slow ops in bluestore is a different story. There have 
>>> >> been
>>> >> several messages on this mailing list recently with suggestions on how 
>>> >> to tune
>>> >> the alert threshold. From my experience, they very likely relate to some
>>> >> problem with the underlying storage device, so I'd recommend 
>>> >> investigating the
>>> >> root cause rather than simply silencing the warning.
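
>>> >> A quick first pass could be SMART data and the kernel log on the affected
>>> >> hosts, for example:

>>> >> smartctl -a /dev/sdX   # sdX being the OSD's data device
>>> >> dmesg -T | grep -iE 'error|reset|timeout'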

>>> >> Cheers,
>>> >> Enrico


>>> >> ________________________________
>>> >> From: Kasper Rasmussen <[email protected]>
>>> >> Sent: Wednesday, May 14, 2025 8:22:46 PM
>>> >> To: ceph-users <[email protected]>
>>> >> Subject: [ceph-users] BLUEFS_SPILLOVER after Reef upgrade

>>> >> I've just upgraded our Ceph cluster from Pacific 16.2.15 to Reef 18.2.7.

>>> >> After that I see the warnings:

>>> >> [WRN] BLUEFS_SPILLOVER: 5 OSD(s) experiencing BlueFS spillover
>>> >>     osd.110 spilled over 4.5 GiB metadata from 'db' device (8.0 GiB used of 83 GiB) to slow device
>>> >>     osd.455 spilled over 1.1 GiB metadata from 'db' device (11 GiB used of 83 GiB) to slow device
>>> >>     osd.533 spilled over 426 MiB metadata from 'db' device (10 GiB used of 83 GiB) to slow device
>>> >>     osd.560 spilled over 389 MiB metadata from 'db' device (9.8 GiB used of 83 GiB) to slow device
>>> >>     osd.597 spilled over 8.6 GiB metadata from 'db' device (7.7 GiB used of 83 GiB) to slow device
>>> >> [WRN] BLUESTORE_SLOW_OP_ALERT: 4 OSD(s) experiencing slow operations in BlueStore
>>> >>     osd.410 observed slow operation indications in BlueStore
>>> >>     osd.443 observed slow operation indications in BlueStore
>>> >>     osd.508 observed slow operation indications in BlueStore
>>> >>     osd.593 observed slow operation indications in BlueStore

>>> >> I've tried to run ceph tell osd.XXX compact with no result.

>>> >> Bluefs stats:

>>> >> ceph tell osd.110 bluefs stats
>>> >> 1 : device size 0x14b33fe000 : using 0x202c00000(8.0 GiB)
>>> >> 2 : device size 0x8e8ffc00000 : using 0x5d31d150000(5.8 TiB)
>>> >> RocksDBBlueFSVolumeSelector
>>> >> >>Settings<< extra=0 B, l0_size=1 GiB, l_base=1 GiB, l_multi=8 B
>>> >> DEV/LEV  WAL      DB       SLOW     *        *        REAL     FILES
>>> >> LOG      0 B      16 MiB   0 B      0 B      0 B      15 MiB   1
>>> >> WAL      0 B      18 MiB   0 B      0 B      0 B      6.3 MiB  1
>>> >> DB       0 B      8.0 GiB  0 B      0 B      0 B      8.0 GiB  140
>>> >> SLOW     0 B      0 B      4.5 GiB  0 B      0 B      4.5 GiB  78
>>> >> TOTAL    0 B      8.0 GiB  4.5 GiB  0 B      0 B      0 B      220
>>> >> MAXIMUMS:
>>> >> LOG      0 B      25 MiB   0 B      0 B      0 B      21 MiB
>>> >> WAL      0 B      118 MiB  0 B      0 B      0 B      93 MiB
>>> >> DB       0 B      8.2 GiB  0 B      0 B      0 B      8.2 GiB
>>> >> SLOW     0 B      0 B      14 GiB   0 B      0 B      14 GiB
>>> >> TOTAL    0 B      8.2 GiB  14 GiB   0 B      0 B      0 B
>>> >> >>SIZE<< 0 B      79 GiB   8.5 TiB
>>> >> Help with what to do next will be much appreciated.



>>> --
>>> Enrico Bocchi
>>> CERN European Laboratory for Particle Physics
>>> IT - Storage & Data Management - General Storage Services
>>> Mailbox: G20500 - Office: 31-2-010
>>> 1211 Genève 23
>>> Switzerland
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
